Okay, so here’s an update. Just by switching to a better type of spectrogram, I was able to reach 80.5% accuracy across the cross-validation folds. That is with no augmentation at all.
According to the latest publication on the dataset’s website, the state-of-the-art mean accuracy is 79%. Note that this figure relies on extensive audio-specific augmentation; without augmentation, their top accuracy was 74%.
It’s pretty cool that fastai can produce these kinds of results out of the box, even on images far removed from the kind found in ImageNet!
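
In case it helps anyone comparing numbers, here is a minimal sketch of how a cross-validation accuracy like the one above can be computed: score each held-out fold separately, then take the unweighted mean. The function names `fold_accuracy` and `mean_cv_accuracy` are hypothetical, and the values in the demo are toy placeholders, not my actual per-fold results.

```python
from typing import Sequence


def fold_accuracy(predictions: Sequence, labels: Sequence) -> float:
    """Fraction of predictions that match the labels for one held-out fold."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot score an empty fold")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def mean_cv_accuracy(fold_accuracies: Sequence[float]) -> float:
    """Unweighted mean of the per-fold accuracies."""
    if len(fold_accuracies) == 0:
        raise ValueError("need at least one fold")
    return sum(fold_accuracies) / len(fold_accuracies)


if __name__ == "__main__":
    # Toy placeholder data only (NOT real results from the experiment).
    toy_preds = ["a", "b", "a", "c"]
    toy_labels = ["a", "b", "b", "c"]
    print(f"one fold: {fold_accuracy(toy_preds, toy_labels):.3f}")

    toy_fold_accuracies = [0.5, 0.75, 1.0]
    print(f"mean over folds: {mean_cv_accuracy(toy_fold_accuracies):.3f}")
```

The mean here is unweighted, so each fold counts equally regardless of size. If the folds differ a lot in size, pooling all predictions before scoring can give a slightly different number, so it is worth checking which convention a paper uses before comparing.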