I had an idea for the week 1 homework: classifying audio files. I got my hands on four 15-minute audio files of four different people delivering the same speech.
I split and converted the audio into 10-second mel-spectrogram images (the dataset is around 385 images across the four classes).
Here is one spectrogram as an example:
Then I fed the images to a ResNet34 model (with a 25% validation set), and this is the result of fit_one_cycle(4):
I see that the training loss is higher than the validation loss, so what do you think I did wrong?