Do any lessons cover how to remove vocals from a song with deep learning?

Does part 2 cover how to separate the vocals from the rest of a song, or any related techniques?


  1. Get 100+ sample songs, each in a version with vocals and a version without
  2. Chop them into pieces of something like 10 s each
  3. Apply an STFT (short-time Fourier transform)
  4. Train a CNN on this data
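Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the random array stands in for a real song, and the sample rate, frame length, hop size, and the helper names `chop` and `stft_magnitude` are all assumptions chosen for the example.

```python
import numpy as np

def chop(signal, sample_rate, piece_seconds=10.0):
    """Split a 1-D waveform into equal-length pieces, dropping the remainder."""
    piece_len = int(piece_seconds * sample_rate)
    n_pieces = len(signal) // piece_len
    return signal[: n_pieces * piece_len].reshape(n_pieces, piece_len)

def stft_magnitude(piece, frame_len=1024, hop=256):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(piece) - frame_len) // hop
    frames = np.stack(
        [piece[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft over each frame, then transpose to (freq_bins, time_frames)
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                                # assumed, low to keep the demo small
rng = np.random.default_rng(0)
mixture = rng.standard_normal(sample_rate * 35)   # stand-in for a 35 s song with vocals

pieces = chop(mixture, sample_rate)               # 10 s pieces
specs = np.stack([stft_magnitude(p) for p in pieces])
print(pieces.shape, specs.shape)
```

The same two functions would be applied to the vocal-free version of each song, which gives image-like input/target pairs a CNN can be trained on.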

I could go into more detail, but until you’ve accomplished step 1, there’s really no point :)


Thanks for your suggestions. I downloaded MIR-1K and DSD100.

I searched for “deep learning separate song and vocal”. Only a few papers (I only found Deep Karaoke) try to solve this problem. Why? Possible explanations:

  1. The problem is considered solved because the results are already very good (I hope so)
  2. Only a few researchers are interested in it
  3. It doesn’t have much commercial value

By the way, roughly how many samples do I need if I want to get great results?

hey tham,

your question is best answered with “it depends”. it depends how complex your problem is, it depends how good is good enough, it depends whether it’s a classification or a segmentation task, it depends how effectively you can use data augmentation. so there’s no answer to give you “in general”. ImageNet consists of millions of samples. i work on a medical segmentation task with 76 samples. it really just depends.

for your problem i simply guessed 100 songs, because each could be split into something like 50 snippets of 10 s with some overlap between them. 5000 samples seems reasonable to me. maybe you need a lot less because it’s a segmentation-like problem. maybe you need something closer to 50000.
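The rough arithmetic behind that guess can be checked with a few lines. The song length (240 s) and hop between snippets (5 s) below are assumptions picked for illustration; only the 100 songs and the 10 s snippet length come from the post.

```python
def snippets_per_song(song_seconds, window_seconds=10, hop_seconds=5):
    """Count overlapping windows that fit entirely inside one song."""
    if song_seconds < window_seconds:
        return 0
    return 1 + (song_seconds - window_seconds) // hop_seconds

n_songs = 100                        # from the post
per_song = snippets_per_song(240)    # assumed 4-minute song, 50% overlap
total = n_songs * per_song
print(per_song, total)  # 47 4700
```

A smaller hop gives more snippets per song, but heavily overlapping snippets are strongly correlated, so they add less new information than the raw count suggests.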

what worked well for me when i was playing with audio data was applying something called an STFT (short-time Fourier transform), which transforms the audio into a time-frequency representation. this representation can then be used to train a CNN, and it can also be converted back to audio using the inverse STFT. i would be very surprised if this sort of problem is fully solved.
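To make the “convert back to audio” part concrete, here is a minimal NumPy sketch of an STFT and its inverse by overlap-add. It is an assumption-laden illustration rather than the poster's setup: the frame length, hop size, test tone, and the function names `stft` and `istft` are chosen for this example, and a real project would more likely rely on an audio library.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Complex STFT with a Hann window; returns shape (time_frames, freq_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=512, hop=128):
    """Inverse STFT by windowed overlap-add, normalised by the summed window energy."""
    window = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * window
    length = frame_len + hop * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
        norm[i * hop : i * hop + frame_len] += window ** 2
    covered = norm > 1e-8            # skip edge samples where the window is ~0
    out[covered] /= norm[covered]
    return out

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440.0 * t)    # 1 s test tone standing in for real audio

spec = stft(x)
recon = istft(spec)
# Compare away from the edges, where every sample is covered by several frames.
max_err = np.max(np.abs(recon[512:7424] - x[512:7424]))
print(spec.shape, recon.shape, max_err)
```

In a separation experiment, the network's output spectrogram (or a mask applied to `spec`) would be passed to `istft` instead of the unmodified `spec`.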

are you looking for a solution or are you looking for an experiment to play around with?


That dataset is ultra small. I haven’t studied part 2 yet; does this mean image segmentation is less data hungry?

I’m looking for an experiment to play around with; this may be the next project I try once my emotion classification project is done. Maybe I’ll change my mind after I’ve studied part 2 (I’m waiting for it to be released).

Pre-releases are already online for Part 2: Pre-release part 2 videos

My comment was a little confusing. It’s 76 3D samples, so in reality i have 24 times that in 2D images :) yes, segmentation is what you refer to as “less data hungry”. the reason is that you have much more information to backpropagate through the network. a segmentation can be seen as a classification for every pixel, and as such you get more info from a single sample.
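As a back-of-the-envelope illustration of that point, the snippet below counts supervised labels per sample. The 76 volumes and the factor of 24 come from the post; the 256 x 256 slice size is purely an assumption for the example.

```python
volumes, slices_per_volume = 76, 24   # figures from the post
height, width = 256, 256              # hypothetical slice size

n_slices = volumes * slices_per_volume

labels_per_image_classification = 1              # one class label per image
labels_per_image_segmentation = height * width   # one class label per pixel

print(n_slices)                                   # 1824
print(n_slices * labels_per_image_segmentation)   # 119537664
```

Neighbouring pixels are highly correlated, so this overstates the amount of independent information, but it shows why each segmentation sample carries far more training signal than a single class label.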
