Your question is best answered with "it depends". It depends on how complex your problem is, on how good is good enough, on whether it's a classification or a segmentation task, and on how effectively you can use data augmentation. So there's no answer I can give you "in general". ImageNet consists of millions of samples. I work on a medical segmentation task with 76 samples. It really just depends.
For your problem I simply guessed 100 songs, because each could be split into something like 50 snippets of 10 s with some overlap between them. The resulting 5000 samples seem reasonable to me. Maybe you need a lot less because it's a segmentation-like problem. Maybe you need something closer to 50000.
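To make that guess concrete, here is a rough sketch of the arithmetic. The 255 s song length and the 5 s hop (so neighbouring 10 s snippets overlap by half) are assumptions I picked so the numbers come out at about 50 snippets per song; `snippet_starts` is just an illustrative helper name.

```python
def snippet_starts(duration_s, snippet_s=10.0, hop_s=5.0):
    """Start times (in seconds) of fixed-length snippets that fit fully inside a track.

    hop_s < snippet_s means neighbouring snippets overlap by snippet_s - hop_s.
    """
    if duration_s < snippet_s:
        return []  # track too short for even one snippet
    n = int((duration_s - snippet_s) // hop_s) + 1
    return [i * hop_s for i in range(n)]


# Assumed example: a 255 s song, 10 s snippets, 5 s hop (50% overlap).
starts = snippet_starts(255.0)
n_per_song = len(starts)    # 50
total = 100 * n_per_song    # 5000 snippets from 100 such songs

print(n_per_song, total)
```

Real songs vary in length, so the count per song will vary too. A smaller hop gives more (but more strongly correlated) snippets from the same audio.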
What worked well for me when I was playing with audio data was applying a short-time Fourier transform (STFT), which turns the audio into a time-frequency representation. This representation can then be used to train a CNN, and it can also be converted back to audio using the inverse STFT. I would be very surprised if this sort of problem is fully solved.
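A minimal NumPy-only sketch of that round trip, in case it helps. The window size, hop, and test tone are arbitrary choices of mine, and `stft` / `istft` here are small hand-written helpers rather than any library's API (SciPy and librosa ship ready-made versions). Feeding the log-magnitude to the network is also just one common option, not necessarily what you need.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Complex spectrogram, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann window
    x_p = np.pad(x, (n_fft, n_fft))              # zero-pad so the edges are fully covered
    n_frames = 1 + (len(x_p) - n_fft) // hop
    frames = np.stack([x_p[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, length, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add; returns `length` samples."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)                # undo the summed window weighting
    return out[n_fft:n_fft + length]             # drop the padding added in stft


# Assumed test signal: 1 s of a 440 Hz tone at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)

spec = stft(audio)                  # complex values: magnitude and phase
log_mag = np.log1p(np.abs(spec))    # one possible real-valued CNN input
audio_rec = istft(spec, len(audio))

print(spec.shape, np.max(np.abs(audio - audio_rec)))
```

Note that the inverse needs the complex values. If the network only outputs a magnitude, you have to bring the phase from somewhere, for example by reusing the phase of the input.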
Are you looking for a solution, or for an experiment to play around with?