My experience with audio files in deep learning is the complete opposite of what @colliewrangler recommends. I suggest you spend a great deal of time on the spectrogram conversion. Your files look broken.
Experiment with different clip lengths, vertical resolutions (number of frequency bins), horizontal resolutions (number of time frames), and log scaling of the spectrograms.
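
To make that concrete, here is a rough sketch of how you could sweep those settings. It uses plain NumPy rather than whatever audio library you are on, and the helper name `log_spectrogram`, the sample rate, the synthetic test tone, and the parameter values are all my own assumptions for illustration, not anything from your pipeline. Here `n_fft` controls the vertical resolution, `hop` controls the horizontal resolution, and `log_scale` toggles the log compression.

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=128, log_scale=True):
    """Magnitude spectrogram from a Hann-windowed STFT (hypothetical helper).

    n_fft -> vertical resolution: n_fft // 2 + 1 frequency bins
    hop   -> horizontal resolution: smaller hop gives more time frames
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft over each frame, then transpose to (frequency, time).
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    if log_scale:
        spec = np.log1p(spec)  # compress the dynamic range
    return spec

# Assumed stand-in for a real recording: 2 seconds of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr * 2) / sr
audio = np.sin(2 * np.pi * 440 * t)

# Sweep clip length and resolution, and inspect the resulting image sizes.
for clip_seconds in (1.0, 2.0):
    clip = audio[: int(clip_seconds * sr)]
    for n_fft, hop in ((256, 64), (1024, 256)):
        spec = log_spectrogram(clip, n_fft=n_fft, hop=hop)
        print(clip_seconds, n_fft, hop, spec.shape)
```

Plot a few of the resulting arrays for each setting before training. If you cannot see the structure of the sound in the image yourself, the network will probably struggle with it too.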