Why BCELoss instead of directly minimizing -output?

I’m trying to learn deep learning and am currently going through the part 1 lectures (well, I’ve watched all of them, but I’m still trying to work out some things).

My question is: suppose you have a binary classifier that outputs 0 for cats and 1 for dogs, and you want to train it. Why is the loss given as binary cross entropy, comparing the output to 1 for dogs and to 0 for cats, rather than just using the classifier’s output directly (even before the sigmoid)? I want the neural net to output as low a number as possible for cats and as high a number as possible for dogs, so why not use a loss function of mean(cats) - mean(dogs)?
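To make the question concrete, here’s a minimal sketch of the two losses in plain Python (the names `bce_loss` and `proposed_loss` and the toy numbers are mine, just for illustration):

```python
import math

def bce_loss(logits, labels):
    """Binary cross entropy on sigmoid(logit), averaged over examples."""
    losses = []
    for z, y in zip(logits, labels):
        p = 1 / (1 + math.exp(-z))  # sigmoid squashes the raw output into (0, 1)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

def proposed_loss(logits, labels):
    """The mean(cats) - mean(dogs) idea, on raw pre-sigmoid outputs."""
    cats = [z for z, y in zip(logits, labels) if y == 0]
    dogs = [z for z, y in zip(logits, labels) if y == 1]
    return sum(cats) / len(cats) - sum(dogs) / len(dogs)

logits = [-2.0, 3.0, 0.5, -0.5]  # raw classifier outputs (pre-sigmoid)
labels = [0, 1, 1, 0]            # 0 = cat, 1 = dog

print(bce_loss(logits, labels))       # always >= 0
print(proposed_loss(logits, labels))  # here: -3.0
```

One thing I notice when I play with this: `bce_loss` is bounded below by 0, but `proposed_loss` isn’t; multiplying every logit by 10 leaves the predicted classes unchanged yet makes `proposed_loss` ten times smaller. Is that (part of) the reason?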

Thank you.