Help with model training strategy

I am in the process of creating an image classifier to identify mushrooms. I have about 3.5 million images that are labeled down to the species level. I’m currently playing around with a few different ideas for improving my results.

The first thing I tried, which is similar to what this guy did:

He uses the taxonomy to create a specialized loss function, under the assumption that since breed is a subset of species, the classifier shouldn’t be penalized as much if it gets the species right. I’m using the genus → species hierarchy for my loss function. It seems to work OK, but I feel I can do better.
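To make the idea concrete, here is a minimal sketch of one way such a loss could look. It is not the linked author's implementation: the `SPECIES_TO_GENUS` mapping, the `genus_weight` parameter, and the function names are all made up for illustration, and it uses plain Python instead of a deep learning framework so it runs on its own. The genus probability is taken as the sum of the species probabilities inside that genus, so a wrong species in the right genus is penalized less than a wrong species in the wrong genus.

```python
import math

# Hypothetical mapping: species index -> genus index (illustrative only).
SPECIES_TO_GENUS = [0, 0, 1, 1, 1]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_loss(logits, target_species, genus_weight=1.0):
    """Species cross-entropy plus a weighted genus cross-entropy.

    The genus term is computed from the summed probability of all
    species that share the target's genus.
    """
    probs = softmax(logits)
    target_genus = SPECIES_TO_GENUS[target_species]
    genus_prob = sum(
        p for p, g in zip(probs, SPECIES_TO_GENUS) if g == target_genus
    )
    species_loss = -math.log(probs[target_species])
    genus_loss = -math.log(genus_prob)
    return species_loss + genus_weight * genus_loss


# Target is species 0 (genus 0). Both predictions pick the wrong species,
# but the first stays inside the correct genus.
same_genus = hierarchical_loss([0.0, 5.0, 0.0, 0.0, 0.0], target_species=0)
other_genus = hierarchical_loss([0.0, 0.0, 5.0, 0.0, 0.0], target_species=0)
print(same_genus < other_genus)
```

In a real training loop the same computation would be done on batched tensors, and `genus_weight` would be something to tune.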

I was then thinking about training a model to recognize the “family”, then re-training it on “genus”, then re-training it again on “species”. But I’m not sure of the best way to do this. Should I just train a model, swap out the learner.dls with the genus dataloaders, then do it again for the species dataloaders? Is there a better way to do this?

Also, if you have any other ideas to try, let me know. This is my first real deep learning project and I would love any input.

Edit:
As a quick follow-up, I’m also experimenting with changing the loss functions, i.e. training a model using a loss function based on the “family”, then training again with a “genus” one, and finally with one focused on “species”. This way I’m always using the same dataloaders and am not mixing up the images I am training on over time.
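A rough sketch of what I mean by staged loss functions, in plain Python so it runs standalone. The `TAXONOMY` table, `level_loss`, and `STAGES` are hypothetical names for illustration, and the approach of summing species probabilities into a family or genus probability is an assumption about how the coarser losses could be built. The model always outputs species logits, so the dataloaders and the output layer never change; only the level the loss looks at does.

```python
import math

# Hypothetical taxonomy: for each species index, the index of its group
# at every level (illustrative only).
TAXONOMY = {
    "species": [0, 1, 2, 3, 4, 5],
    "genus":   [0, 0, 1, 1, 2, 2],
    "family":  [0, 0, 0, 0, 1, 1],
}

# Coarse-to-fine order in which the loss would be swapped during training.
STAGES = ["family", "genus", "species"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def level_loss(logits, target_species, level):
    """Cross-entropy at the given taxonomic level.

    Species probabilities are summed within each group of that level,
    and the loss is the negative log of the target group's probability.
    """
    probs = softmax(logits)
    groups = TAXONOMY[level]
    target_group = groups[target_species]
    group_prob = sum(p for p, g in zip(probs, groups) if g == target_group)
    return -math.log(group_prob)


# Same prediction scored at each stage: the model picks species 1,
# the label is species 0 (same genus, same family).
logits = [0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
losses = {level: level_loss(logits, 0, level) for level in STAGES}
for level in STAGES:
    print(level, round(losses[level], 4))
```

With this setup the "species" stage reduces to ordinary cross-entropy, and the earlier stages only ask the model to put its probability mass in the right family or genus.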

I’m also new here, not so sure my suggestion makes sense but…

What if the target classes encoded this relationship in a way that the error is different between choosing the wrong species vs genus?
Maybe not a useful one, but imagine if the species was encoded as a multiple of 1000 (arbitrary magnitude here) and the breed as a smaller unit added on top of it.

For example:
1010 - would mean species 1000 breed 10 (within that species)
1099 - would mean species 1000 breed 99 (within that species)
2001 - would mean species 2000 breed 1 (within that species)

The distance for getting the species wrong vs. getting only the breed wrong would differ by orders of magnitude.
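A small sketch of the arithmetic, using the three example codes above. The `encode` and `absolute_error` helpers are hypothetical names, and this assumes the label is treated as a single number and compared with a regression-style distance such as absolute error.

```python
def encode(species_id, breed_id, scale=1000):
    """Pack species and breed into one number: species * scale + breed."""
    return species_id * scale + breed_id


def absolute_error(pred, target):
    """Plain absolute difference between two encoded labels."""
    return abs(pred - target)


target = encode(1, 10)         # 1010
wrong_breed = encode(1, 99)    # 1099, same species, different breed
wrong_species = encode(2, 1)   # 2001, different species

print(absolute_error(wrong_breed, target))    # 89
print(absolute_error(wrong_species, target))  # 991
```

One caveat with this sketch: the numbers also imply an ordering, so species 3000 counts as further from species 1000 than species 2000 does, even if the species IDs were assigned arbitrarily.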

Does that make any sense? This would mean that many existing loss functions would apply out of the box.