I don't have a full or definitive answer on this, but I have asked the same question before. In Lesson 10 (Part 2, 2019) Jeremy discussed some of this (see Lesson 10 Discussion & Wiki (2019)). The general suggestion is not to use a 'background' class and not to use softmax, but instead to use the predicted probabilities with thresholds, so that 'background only' images are classified simply by the absence of any probable detection.
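To make the thresholding idea concrete, here is a minimal sketch in plain Python. It assumes the model outputs one independent logit per animal class and that a sigmoid is used in place of softmax; the function names, class names, and the 0.5 threshold are all made up for illustration and are not taken from the lesson. An empty result is read as 'background only'.

```python
import math

def sigmoid(x):
    # Squash a raw logit into an independent probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, class_names, threshold=0.5):
    # Hypothetical helper: keep every class whose probability clears the threshold.
    # Unlike softmax, the probabilities are not forced to sum to 1,
    # so it is possible for no class at all to be selected.
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(class_names, probs) if p >= threshold]

# Assumed example classes; there is deliberately no 'background' entry.
classes = ["deer", "fox", "boar"]

with_animal = predict_labels([2.0, -3.0, -1.5], classes)
background_only = predict_labels([-2.5, -3.0, -1.5], classes)

print(with_animal)       # one confident class
print(background_only)   # empty list, treated as 'background only'
```

The threshold would probably need tuning on a validation set, since it trades missed animals against false detections on empty images.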
As for the training practice, I don't know, but it might be useful to experiment with curriculum learning approaches here: for example, train only on animal images first, then add more and more pure background images to the training set in later epochs. I have not tried this myself yet. If you do try it, please share your results!
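One way such a schedule could look, purely as an untested sketch: the helper names, the two warm-up epochs, and the linear ramp are my own assumptions, not an established recipe. The idea is to rebuild the list of training items each epoch and hand it to whatever data loader you use.

```python
import random

def background_fraction(epoch, n_epochs, warmup_epochs=2, max_fraction=1.0):
    # Hypothetical schedule: no background images during warm-up,
    # then a linear ramp that reaches max_fraction in the final epoch.
    if epoch < warmup_epochs:
        return 0.0
    ramp = max(1, n_epochs - warmup_epochs)
    progress = (epoch - warmup_epochs + 1) / ramp
    return min(max_fraction, progress * max_fraction)

def epoch_items(animal_items, background_items, epoch, n_epochs, seed=0, **schedule):
    # Always use every animal image; add a growing random subset of backgrounds.
    frac = background_fraction(epoch, n_epochs, **schedule)
    rng = random.Random(seed + epoch)  # reproducible, but different per epoch
    k = round(frac * len(background_items))
    items = list(animal_items) + rng.sample(list(background_items), k)
    rng.shuffle(items)
    return items

# Placeholder file names, for illustration only.
animals = [f"animal_{i}.jpg" for i in range(10)]
backgrounds = [f"empty_{i}.jpg" for i in range(8)]

for epoch in range(6):
    items = epoch_items(animals, backgrounds, epoch, n_epochs=6)
    print(epoch, background_fraction(epoch, 6), len(items))
```

Capping `max_fraction` below 1.0 might be worth trying if the empty images heavily outnumber the animal images.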