I am getting 100% accuracy, is there something wrong?

For the week 1 homework, I had the idea of classifying audio files. I got my hands on four 15-minute audio files, each of a different person saying the same speech.

I split the audio files into 10-second segments and converted each segment to a mel-spectrogram image (the dataset has around 385 images in four classes).
Here is one spectrogram as an example:
Figure_1
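
For anyone who wants to reproduce the splitting step, here is a minimal sketch of cutting a recording into non-overlapping 10-second segments. It is not my exact script: the function name `split_into_chunks` and the 100 Hz sample rate are placeholders chosen to keep the example small, and the mel-spectrogram conversion itself (for example with something like `librosa.feature.melspectrogram`) is left out.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=10):
    """Split a sequence of audio samples into non-overlapping fixed-length chunks.

    A trailing remainder shorter than chunk_seconds is dropped.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


# Placeholder signal: 15 minutes of silence at an assumed (tiny) sample rate.
sample_rate = 100
recording = [0.0] * (15 * 60 * sample_rate)

chunks = split_into_chunks(recording, sample_rate)
print(len(chunks))  # 90 segments of 10 seconds each
```

Each chunk would then be turned into one spectrogram image and saved under its speaker's label.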

Then I fed the images to a ResNet34 model (with a 25% validation set), and this is the result of fit_one_cycle(4):

|epoch|train_loss|valid_loss|accuracy|time|
|---|---|---|---|---|
|0|1.487757|1.364996|0.421875|00:02|
|1|0.714470|0.022866|0.989583|00:02|
|2|0.448643|0.000532|1.000000|00:02|
|3|0.316020|0.000151|1.000000|00:02|
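
To make the 25% validation set concrete, here is a rough sketch of how such a split can be produced. It assumes the split is a plain random shuffle over all images; the `random_split` helper, the seed, and the file names are made up for illustration and are not taken from my notebook.

```python
import random


def random_split(items, valid_pct=0.25, seed=42):
    """Shuffle the items and hold out valid_pct of them for validation."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical file names: four speakers, 96 spectrogram images each.
files = [f"speaker{s}_chunk{i:03d}.png" for s in range(4) for i in range(96)]

train, valid = random_split(files)
print(len(train), len(valid))  # 288 96
```

With a split like this, segments from every recording can end up in both the training and the validation set.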

I see that the training loss is higher than the validation loss. What do you think I did wrong?