Do any lessons cover how to remove vocals from a song with deep learning?


Does part 2 cover how to separate vocals from the rest of a song, or related techniques?

(Pietz) #2


  1. Get 100+ samples of songs, both with and without vocals
  2. Chop them into pieces of about 10 s each
  3. Apply STFT
  4. Train a CNN on this data
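
As a rough illustration of step 2, here is a minimal sketch of chopping a waveform into fixed-length pieces. The function name `chop`, the 8 kHz sample rate, and the random stand-in signal are all assumptions made for the example; a real pipeline would load the mixed and vocal-free tracks from files and chop both the same way so the pieces stay aligned.

```python
import numpy as np

def chop(signal, sample_rate, piece_seconds=10.0):
    """Split a 1-D waveform into non-overlapping pieces of equal length.

    Trailing samples that do not fill a whole piece are dropped.
    """
    piece_len = int(piece_seconds * sample_rate)
    n_pieces = len(signal) // piece_len
    return signal[: n_pieces * piece_len].reshape(n_pieces, piece_len)

# Stand-in for a real song: 35 s of noise at a hypothetical 8 kHz sample rate.
sample_rate = 8000
song = np.random.default_rng(0).standard_normal(35 * sample_rate)

pieces = chop(song, sample_rate)
print(pieces.shape)  # (3, 80000): three full 10 s pieces, last 5 s dropped
```

Dropping the incomplete tail keeps every training example the same shape, which is convenient for a CNN; padding the tail with zeros would be another reasonable choice.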

I could go into more detail, but until you’ve accomplished step 1, there’s really no point :slight_smile:


Thanks for your suggestions. I downloaded MIR-1K and DSD100.

I searched for “deep learning separate song and vocal”. Only a few papers (I only found Deep Karaoke) try to solve this problem. Why? Possible explanations:

  1. The problem is considered solved because the results are already very good (I hope so)
  2. Only a few researchers are interested in it
  3. It does not have much commercial value

By the way, roughly how many samples do I need if I want to reach great results?

(Pietz) #4

hey tham,

your question is best answered with “it depends”. it depends how complex your problem is, it depends how good is good enough, it depends whether it’s a classification or a segmentation task, and it depends how effectively you can use data augmentation. so there’s no answer to give you “in general”. imagenet consists of millions of samples. i work on a medical segmentation problem with 76 samples. it just really depends.

for your problem i simply guessed 100 songs, because each could be split into something like 50 snippets of 10 s with some overlap between them. 5000 samples seems reasonable to me. maybe you need a lot less because it’s a segmentation-like problem. maybe you need something closer to 50000.
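
The 5000 figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the song length (4 min 15 s) and the 5 s hop between snippet starts are assumptions chosen to reproduce the "about 50 snippets per song" guess, not values given in the post.

```python
def count_snippets(song_seconds, snippet_seconds=10.0, hop_seconds=5.0):
    """Number of full-length snippets when a window slides by hop_seconds."""
    if song_seconds < snippet_seconds:
        return 0
    return int((song_seconds - snippet_seconds) // hop_seconds) + 1

per_song = count_snippets(song_seconds=255)  # hypothetical 4 min 15 s track
total = 100 * per_song                       # 100 songs, as guessed above
print(per_song, total)  # 50 5000
```

A smaller hop gives more (but more strongly overlapping) snippets per song, so the snippet count can grow without adding any genuinely new audio.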

what worked well for me when i was playing with audio data was applying the STFT (short-time Fourier transform), which turns the audio into a time-frequency representation. this representation can then be used to train a CNN and can also be converted back to audio using the inverse STFT. i would be very surprised if this sort of problem is fully solved.
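
To make the STFT round trip concrete, here is a small sketch using `scipy.signal.stft` and `scipy.signal.istft`, assuming NumPy and SciPy are installed. The test tone, the 8 kHz sample rate, and the window length of 512 are arbitrary choices for illustration. Splitting the complex STFT into magnitude and phase is one common way to feed a CNN, not necessarily what was done in the experiments described above.

```python
import numpy as np
from scipy.signal import stft, istft

sample_rate = 8000                       # hypothetical sample rate
t = np.arange(2 * sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440.0 * t)        # 2 s test tone standing in for audio

# Forward transform: Z is complex with shape (frequency bins, time frames).
_, _, Z = stft(x, fs=sample_rate, nperseg=512)

# A CNN would typically see the magnitude; the phase is kept for inversion.
magnitude = np.abs(Z)
phase = np.angle(Z)

# Inverse transform from magnitude and phase back to a waveform.
_, x_rec = istft(magnitude * np.exp(1j * phase), fs=sample_rate, nperseg=512)
x_rec = x_rec[: len(x)]                  # drop the padding added by stft

print(magnitude.shape[0])                # 257 frequency bins (512 // 2 + 1)
print(np.allclose(x, x_rec, atol=1e-6))
```

If a network changes the magnitude (for example to suppress the vocals), reusing the original phase for the inverse transform is a simple starting point, although the result is then only an approximation.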

are you looking for a solution or are you looking for an experiment to play around with?


That dataset is ultra small. I haven’t studied part 2 yet; does this mean image segmentation is less data hungry?

I’m looking for an experiment to play around with. This may be the next project I try after my emotion classification project is done. Maybe I will change my mind after I have studied part 2 (I’m waiting for it to be released).

(Pietz) #6

Pre-releases are already online for Part 2: Pre-release part 2 videos

My comment was a little confusing. It’s 76 3D samples, so in reality i have 24 times that in 2D images (roughly 1,800) :slight_smile: yes, segmentation is what you refer to as “less data hungry”. the reason is that you have much more information to backpropagate through the network. a segmentation can be seen as a classification of every pixel, and as such you get more info from a single sample.
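
A quick count shows how much more supervision per-pixel labels provide. The 76 volumes and 24 slices per volume come from the post; the 128 x 128 slice size is a made-up value for illustration only.

```python
volumes = 76
slices_per_volume = 24
slices = volumes * slices_per_volume          # 2D images available

height, width = 128, 128                      # hypothetical slice resolution

labels_if_classification = slices * 1         # one label per image
labels_if_segmentation = slices * height * width  # one label per pixel

print(slices)                    # 1824
print(labels_if_classification)  # 1824
print(labels_if_segmentation)    # 29884416
```

Neighbouring pixels are strongly correlated, so this is not the same as having millions of independent examples, but each image still contributes far more gradient signal than a single class label would.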