Deep Learning with Audio Thread

Thanks for the notebooks, they have really helped a ton and I have been using them constantly as a reference :slight_smile:

However, I’m still unsure about one thing: what does ref do in power_to_db and amplitude_to_db? I tried reading the docs but I’m not sure exactly what it does. I’m also not sure what the best value to put there is, because I have seen many leave it as the default (1.0), some use np.min, and some use np.max (this actually seems to be the most common). Additionally, I’ve seen many take the absolute value of the signal when using this function; what is the purpose of that?

Thanks again for your help!

Also, there’s a minor bug in the cells where you use Image.show(): the spectrograms are actually upside down. If you change it to:

Image.show(torch.from_numpy(sg2).flip(0).unsqueeze(0), figsize=(15, 5), cmap='magma')

Then it will flip the images the right way (and keep a consistent color map with librosa defaults and the rest of the notebook).

Decibels are a relative scale, so values must be relative to some reference value, ref in this case. After conversion, a value equal to ref will be 0dB. The default of 1.0 is based on the common case where values are floats in the range (-1, 1), which is commonly used to avoid different values depending on input file bit-depth (i.e. 8-bit files have a range of 0 to 255, or -128 to 127 if signed, while 16-bit files have a range of 0 to 65,535, or -32,768 to 32,767). This will give decibel values in a range of around (-70, 0), depending on signal resolution. That’s prior to processing, which can give higher values; values aren’t actually limited to that range.

np.max would use the actual maximum value of your signal as the reference, giving a similar range of (-70, 0), while np.min would give a range of around (0, 70). I don’t think it matters especially much which you use. In fact I don’t think any are especially well suited to neural nets, as none are near the mean of 0 and SD of 1 that is ideal. Subsequent normalisation (i.e. subtract mean, divide by SD) should work with any of them, though, and should produce much the same normalised values from them all. So it doesn’t really matter, and the primary purpose is likely just to produce values that align with common ranges used for audio (going back to common conventions on analogue equipment).
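
If it helps, here’s a quick sketch of how ref shifts the scale (this uses a generated test tone rather than anything from the notebooks, so the exact ranges will differ for real signals):

import numpy as np
import librosa

# power spectrogram of a 1-second 440 Hz test tone
S = np.abs(librosa.stft(librosa.tone(440, sr=22050, duration=1.0))) ** 2

db_default = librosa.power_to_db(S, ref=1.0)               # 0 dB corresponds to a power of 1.0
db_max = librosa.power_to_db(S, ref=np.max)                # reference = loudest bin in S
db_min = librosa.power_to_db(S, ref=np.min, top_db=None)   # reference = quietest bin in S

for name, db in [('ref=1.0', db_default), ('ref=np.max', db_max), ('ref=np.min', db_min)]:
    print(name, float(db.min()), float(db.max()))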

The outputs of a Fourier transform are complex numbers; taking the absolute value of a complex number gives a real value (its magnitude), so:

>>> import numpy as np
>>> c = complex(real=3, imag=4)
>>> c, abs(c), np.sqrt(c.real**2 + c.imag**2)
((3+4j), 5.0, 5.0)

This is the mag component given by librosa’s magphase. In torchaudio this is done with complex_norm which uses torch.norm (computing a vector norm):

>>> import torch
>>> c_t = torch.tensor([3., 4.])
>>> torch.norm(c_t, 2, -1)
tensor(5.)
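
And for completeness, the same magnitude via librosa’s magphase (a small check using a generated test signal, not your data):

import numpy as np
import librosa

y = librosa.tone(440, sr=22050, duration=1.0)   # stand-in test signal
D = librosa.stft(y)                             # complex STFT
mag, phase = librosa.magphase(D)                # D == mag * phase
assert np.allclose(mag, np.abs(D))              # mag is just the absolute value of D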
1 Like

I believe @MadeUpMasters has made a fix for this, which should be present in the next release. But thanks for the colour map suggestion, and feel free to make a PR yourself :slight_smile:

1 Like

Wow thank you for the great explanation! That really clears up any confusion I had, much appreciated :slight_smile:

I’m looking for some feedback on bringing the fastai audio application back into the main fastai library. If you have thoughts on this, I’d really like to get a conversation started over in the fastai dev section of the forums!

1 Like

Hey, you mentioned in the dev thread you’re interested in implementing Baidu’s DeepSpeech in fastai. I’d love to hear more about your work and why you’re interested in Speech Recognition. I’m about to start on something quite similar, building a phoneme based speech recognition system without a language model in order to help non-native English speakers improve their pronunciation.

3 Likes

Hey everyone, I built a Docker version of the fastai_audio library.
Thanks to @baz for helping me out with it.

I tried to use Alpine as the base image, but I was running into library install issues. Still, this is a good starting point.

2 Likes

Hey fellow learners :slight_smile:
I’m trying to make an audio editing learner for some project, using the GAN architecture from lesson 7.
I’ve already used your fantastic fastai_audio library to train a critic (90% identification in a minute of learning!).
But I’m having trouble creating a generator learner, with a unet_learner like in lesson 7 and a custom AudioAudioList instead of ImageImageList.

When running fit_one_cycle I get the following error:
RuntimeError: The size of tensor a (3078144) must match the size of tensor b (1280000) at non-singleton dimension 0
This error occurred during the loss function calculation (MSELossFlat).

I think that this error means that fastai doesn’t understand the shape of my LabelClass (AudioItemList), and therefore the generated model has the wrong output size.
Any ideas on how to fix that? Anyone else working on generators/GANs for audio?

Thanks, everyone!

P.S. is the telegram group still active?

1 Like

That’s really amazing news. We haven’t tried to do anything like that with the library, so it would be great to see what you’re doing and to help you with your problem. Could you share your notebook as a gist?

Yes, the Telegram group is still active. PM @MadeUpMasters with your Telegram deets.

2 Likes

Hey guys, Jeremy talked about doing work with audio; it’s in the git repo but there’s nothing about it in the course that I noticed. Did we skip it, or am I missing something?

Thanks

I think the plan was changed during the course and some things like audio will instead be provided as future lessons, as noted at the very bottom of the course page.

1 Like

thanks

I’m currently working with a music dataset and did a little grid search on spectrogram params and figured I’d share my learnings here in case it is helpful to anyone :slight_smile:

Dataset: 10k songs, 3 spectrograms per song (12 sec each), 10 genres (~1k songs each, +/- 50 songs per genre)

Training: ResNet 34, Incremental training on 112px, 224px, 448px

Spectrogram Constants:

  • Generated Image Size: 448px
  • Sample Rate: 44,100
  • # FFT: 8,192
  • Hop Length: 128
  • Color Map: Magma
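
In librosa terms those constants map onto something roughly like this (a simplified sketch rather than my actual pipeline; the tone is just a stand-in for a real clip, and n_mels / top_db are the values I varied in the search):

import numpy as np
import librosa

sr = 44100
y = librosa.tone(440, sr=sr, duration=12.0)   # stand-in for a 12-second clip

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=8192, hop_length=128,
                                     n_mels=256)             # n_mels: one of the searched values
mel_db = librosa.power_to_db(mel, ref=np.max, top_db=60)     # top_db: another searched value
# mel_db is then rendered as a 448px image with cmap='magma'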

Results: As I somewhat expected, the spectrograms with the higher frequency resolution achieve stronger performance. However, it was pretty surprising to see such strong performance from the 256 mel spectros with a Top dB of 50/60. It could be because I am trying to identify genres and something so broad benefits from noise reduction (both sound and data), but I’m not entirely sure.

Other Findings: I also tried using a ResNet 50 for a few of the best performing params, but saw no significant increase in performance (and a 75% increase in training time). Additionally, I tried the same with 896px spectrograms, but that also did not increase the performance (with 25% increase in spectro generation time, 100% increase in data transfer time, and 200% increase in training time… so definitely not worth it :grin:).

Overall: I’m very happy with the performance. If I use a “voting evaluation” for each of the songs (i.e. since there are three spectrograms per song, let the final prediction be what the majority of spectros decide), then I get 98.8% accuracy, which could be above human performance. I’m eventually going to do something a little more complicated than genre identification, but this definitely inspires enough confidence to continue.
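
The voting step itself is trivial, something like this toy sketch (not my actual code; it assumes predictions are ordered three per song):

from collections import Counter

preds = [2, 2, 5, 0, 0, 0, 7, 1, 7]   # toy class predictions: 3 songs x 3 spectrograms each

song_preds = [Counter(preds[i:i + 3]).most_common(1)[0][0]   # majority class per song
              for i in range(0, len(preds), 3)]
print(song_preds)   # [2, 0, 7]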

2 Likes

Hey, I was also working on an audio generator, and I also made an AudioAudioList, which just added ways to visualize the audio and the spectrogram. The problem I found was my lack of knowledge in audio, but I did come across a similar problem to yours. One of the things I had to do was make sure that the sampling rates and lengths of the audio files were all the same, since I was only using a slightly modified unet_learner and not an RNN structure.

Even then I had one pesky audio file in my data set that for some reason didn’t want to transform into the same shape; I ended up just taking it out of the data set and never figured out how to fix it. Check to see where this error occurs, and that should help you figure out how to fix it, or whether it’s worth fixing.
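
Something along these lines is what I mean by forcing everything into the same shape (a rough torchaudio sketch, not the exact code I used; the target rate and length are just illustrative):

import torch
import torchaudio

def load_fixed(path, target_sr=16000, target_len=16000 * 4):
    sig, sr = torchaudio.load(path)                        # (channels, samples)
    sig = sig.mean(dim=0, keepdim=True)                    # mix down to mono
    if sr != target_sr:                                    # resample to a common rate
        sig = torchaudio.transforms.Resample(sr, target_sr)(sig)
    if sig.shape[-1] < target_len:                         # zero-pad short clips
        sig = torch.nn.functional.pad(sig, (0, target_len - sig.shape[-1]))
    return sig[..., :target_len]                           # trim long clips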

1 Like

This is really great, thank you. I’m a bit surprised, as the standard number of mels I’ve seen people use is generally in the 64-128 range; I’ve never even tried beyond 256 except to see what happens. Did you try playing around with the duration of clip you fed it? From your image size and hop it seems like you are using ~1.3s clips of the song at a time (448 frames × 128 hop ÷ 44,100 samples/s ≈ 1.3 s). In my experience you might get better results with longer clips and lower resolution. This would also partially explain the big jump you’re getting by doing majority voting across 3 samples.

Have you ever done any work with Bayesian hyperparameter optimization? If not, that article is fantastic and easy to follow.

@baz and I briefly implemented a grid search for fastai but stopped pursuing it when we realized it was suboptimal compared to random/Bayesian search. It’s on my to-do list to use the libraries at the end of the article to build an easy-to-use Bayesian hyperparameter search for fastai in general that we could then apply to audio.
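
As a rough idea of what that could look like (a hypothetical sketch using optuna, not fastai code; train_and_score is a placeholder for generating the spectrograms, training a learner, and returning validation accuracy):

import optuna

def train_and_score(n_mels, hop_length, top_db):
    # placeholder: generate spectrograms with these params, train, return validation accuracy
    return 0.0

def objective(trial):
    n_mels = trial.suggest_categorical('n_mels', [64, 128, 256])
    hop_length = trial.suggest_int('hop_length', 128, 1024, log=True)
    top_db = trial.suggest_int('top_db', 50, 100, step=10)
    return train_and_score(n_mels, hop_length, top_db)

study = optuna.create_study(direction='maximize')   # TPE-based Bayesian-style search by default
study.optimize(objective, n_trials=20)
print(study.best_params)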

What type of audio are you interested in? Speech? Music? Have you made much progress since then?

1 Like

I’m mostly interested in speech separation. Originally I was interested in music and speech separation (dreaming of an automatic karaoke maker), but I quickly changed to plain speech separation.

I’m currently doing a thesis on that but haven’t gotten very far in terms of practical code; I’ve mostly been getting up to date with the literature, largely looking at Google’s Looking to Listen and the papers it references. But I started testing about a week ago, so if all goes well I might have something to share soon.

2 Likes

Hi everyone,

I used the unofficial fastai audio library (https://github.com/sevenfx/fastai_audio) to create a notebook similar to the “03. NSynth (Audio Classification by Computing Spectrograms On-the-fly)” example, which worked out really well. I exported my learner and I’d like to use it on a test set to see how well I really did. My test set consists of 4000 .wav files, and I’d like to use the mean F1 score for evaluation. Does anybody have experience with how to do that? I tried various different ways and nothing has worked out so far.

Thanks for your time.

Hi Claus, we maintain a more current version of fastai_audio here; sorry I hadn’t added it to the list of resources until just now. You can use it for inference, but it isn’t fully tested, so make sure everything seems functional before any type of deployment.

To do inference, you load your test set along with the config you used for training (see the Getting Started tutorial). Then load the learner you’ve exported and pass it, together with the test set, to the prediction function.

test = AudioList.from_folder(YOUR_PATH, config=YOUR_CONFIG)
learn = load_learner(YOUR_MODELS_PATH, YOUR_MODELS_FILE)
preds = audio_predict_all(learn, test)

If you’re not importing your learner, skip the 2nd line. Hope this helps!
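
For the mean F1 part of your question: once you have predicted and true class indices as arrays, sklearn can compute it (a generic sketch; y_true and y_pred here are placeholders, not the actual output format of audio_predict_all):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]   # ground-truth class indices for the test clips
y_pred = [0, 2, 2, 2, 1, 0]   # predicted class indices

print(f1_score(y_true, y_pred, average='macro'))   # mean F1 across classes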

2 Likes