Deep Learning with Audio Thread

Hi everyone,

I used the unofficial FastAi Audio library to create a notebook similar to the “03. NSynth (Audio Classification by Computing Spectrograms On-the-fly)” example, which worked out really well. I exported my learner and I’d like to use it on a test set to see how well I really did. My test set consists of 4000 .wav files, and I’d like to use the Mean F1-Score for evaluation. Does anybody have experience with how to do that? I’ve tried various approaches and nothing has worked so far.

Thanks for your time.

Hi Claus, we maintain a more current version of fastai_audio here, sorry I hadn’t added it to the list of resources until just now. You can use it for inference but it isn’t fully tested so make sure everything seems functional before any type of deployment.

To do inference, load your test set along with the config you used for training (see the Getting Started Tutorial). Then load the learner you exported and pass both to the predict function.

test = AudioList.from_folder(YOUR_PATH, config=YOUR_CONFIG)
learn = load_learner(YOUR_MODELS_PATH, YOUR_MODELS_FILE)
preds = audio_predict_all(learn, test)

If you’re not loading an exported learner (because it’s already in memory), skip the 2nd line. Hope this helps!
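For the Mean F1-Score part of the question, here is a minimal sketch. It assumes the predictions can be turned into an array of class probabilities with one row per file, and that you have the true class index for each test file; the exact return type of `audio_predict_all` isn't shown in this thread, so that conversion is an assumption. It also takes "Mean F1" to mean the unweighted average of the per-class F1 scores (macro F1). `macro_f1` is a hypothetical helper, not part of fastai_audio.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (macro-averaged F1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples is scored as 0 here.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy stand-in for model output: one row of class probabilities per file.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
y_pred = probs.argmax(axis=1)
y_true = np.array([0, 1, 1, 1])
score = macro_f1(y_true, y_pred, n_classes=2)
```

If scikit-learn is installed, `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` should give the same number when every class appears in the test set.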


I’m a bit surprised as the standard number of mels I’ve seen people use is generally in the 64-128 range.

Well I kinda tried this for two reasons. One, because there are many “layers” in music and I figured when trying to identify specific characteristics or styles, then the model would benefit from a high frequency resolution. And two, because the team who created MelNet used a very high-resolution spectro and saw great results (and while I’m doing nothing generative, I figured the foundation still stands).

Did you try playing around with the duration of clip you fed it? From your image size and hop it seems like you are using ~1.2s clips of the song at a time.

How are you calculating the 1.2 seconds? Unfortunately, I’m limited to 30 seconds (I’m using Spotify’s API and they restrict you to 30 sec previews). But I’m using an aspect ratio of 5:2 for the 30 sec spectro so that gives me 12 seconds of audio per slice. I would do more, but it adds computation expense (The majority of the time is spent generating the spectro, so I get more data faster if I generate one and crop it a few times).

Have you ever done any work with bayesian hyperparameter optimization? If not, that article is fantastic and easy to follow.

I have thought about it but never actually did it. Thanks for the link! Looks like a great guide – I’ll definitely try that out. But I’m still somewhat new to audio data so it was also partially a learning experience to see what worked so I can understand how the network interprets spectros (and if there is any intuition).

I think I’ll run a few more tests on some spectros with 64 mels though if it is best practice. I just assumed that performance would continue to decrease so I cut it off at 128 (and I was up to 125 hours of GPU time so it was beginning to become an expensive experiment haha).

But next I’m also going to see if the model can identify which decade a song is from, something I figured might be a little more challenging because of the diversity each class will have.

I was thinking 128 hop * 448 px image divided by sample rate of 44100. I may have messed up the math though. What are you using to generate your spectrograms? Can you talk more about your process?
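The arithmetic above can be checked in a couple of lines. This is only a sketch, assuming each pixel column of the image corresponds to one hop of the STFT and ignoring the extra window length at the edges; with those numbers it comes out at roughly 1.3 seconds.

```python
hop_length = 128      # samples between adjacent spectrogram columns
image_width = 448     # spectrogram columns (pixels along the time axis)
sample_rate = 44100   # samples per second

# Audio covered by the image: columns * hop, converted from samples to seconds.
duration_s = hop_length * image_width / sample_rate
print(f"{duration_s:.2f} s")
```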

I’m trying it out now with audio data and this bayesian optimization library. Here is an example notebook for using it in fastai, shared by @muellerzr. I am currently having issues getting it working with non-tabular data. I’ll report back if I make progress.

I wouldn’t say 64 mels is best practice, just that 64-128 tends to be the range where most people find good results. I wouldn’t hesitate to venture outside that if it’s improving your results.

This is a really interesting problem. Have you considered doing the target as a continuous variable (release date) instead of categorical (which decade)? Doing it by decade has the problem that a song from January 1980 is considered just as much “eighties” (100%) as a song from mid-1985, when in reality music from Jan 1980 is probably something closer to 50% 70s and 50% 80s. That’s just my intuition though.
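One way to make that intuition concrete is a soft label that splits a song between the two nearest decades. This is just a sketch of one possible encoding, assuming decade centres at 1955, 1965, and so on, with linear interpolation between them; `soft_decade_label` and its defaults are hypothetical. The simpler alternative is to use the release year itself as a float regression target.

```python
import numpy as np

def soft_decade_label(year, first_decade=1950, n_decades=7):
    """Spread a (fractional) release year over the two nearest decade centres.

    A year exactly on a centre (e.g. 1985.0) gets weight 1.0 for that decade;
    a year on a decade boundary (e.g. 1980.0) is split 50/50 between neighbours.
    """
    centres = first_decade + 5 + 10 * np.arange(n_decades)
    # Position measured in decades from the first centre, clipped to the range.
    pos = np.clip((year - centres[0]) / 10.0, 0, n_decades - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, n_decades - 1)
    frac = pos - lo
    label = np.zeros(n_decades)
    label[lo] += 1.0 - frac
    label[hi] += frac
    return label

jan_1980 = soft_decade_label(1980.0)   # split between the 1970s and 1980s
mid_1985 = soft_decade_label(1985.0)   # entirely 1980s
```

Training against labels like these needs a loss that accepts probability targets rather than a single class index.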

@MadeUpMasters let me know any issues you have and if I can help out at all :slight_smile: I’ve used it on vision as well, maybe we can debug if you can’t get it working.

Hey, thanks @muellerzr, I replied in your original thread on bayesian optimization so we can keep the convo going there, but I wanted to crosspost here in case any audio people see this and want to continue the conversation.

@TomB We recently merged in your code for the audio learner that accepts multiple channels, removing the unnecessary 3 channel expansion that we had before that. With these changes it will now be possible for us to accept multichannel audio, as well as future alternate inputs like mag/phase. Thank you for the helpful guidance.

Great, I haven’t had much time to work on audio for a while, so I didn’t get to submit it myself.

One issue I had with that code was that it just reuses the first input channel’s weights for all channels. This wasn’t an issue for me as I was using single-channel input, but it is perhaps not ideal for your library. It’s probably not terrible, since there’s no real reason to think one of the other channels would be better, but it does give less diversity of kernels. It’s also not a huge issue right now, as you only do 1 or 3 channels (in which case it won’t adapt the weights), but it wouldn’t be ideal once you add multichannel support.
I’ve added a commit (and couple of minor fixes after that) that makes it use all existing channels where appropriate and also allows providing a custom function to adapt the weights so you can easily experiment with this. It passes unit tests and worked in some basic tests so should hopefully be OK but I haven’t tested it extensively. Won’t be able to do further testing for a bit as my GPU is tied up so if I don’t get to it you may want to update when you add multichannel or if you wanted to look at adapting weights.
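To illustrate the idea (this is not the code from the commit), here is a rough sketch with a NumPy array standing in for the pretrained first-layer conv weights, shaped `(out_channels, in_channels, kh, kw)`. `adapt_in_channels` and `adapt_fn` are hypothetical names; the default behaviour cycles through all existing input channels instead of repeating only the first, and a custom function can replace it.

```python
import numpy as np

def adapt_in_channels(weight, n_in, adapt_fn=None):
    """Return first-layer conv weights resized to `n_in` input channels.

    weight: array of shape (out_channels, in_channels, kh, kw).
    adapt_fn: optional custom function (weight, n_in) -> new weight array.
    """
    if adapt_fn is not None:
        return adapt_fn(weight, n_in)
    old_in = weight.shape[1]
    if n_in == old_in:
        return weight.copy()
    # Cycle through all existing input channels rather than only the first.
    idx = np.arange(n_in) % old_in
    return weight[:, idx]

pretrained = np.random.default_rng(0).normal(size=(64, 3, 7, 7))
eight_ch = adapt_in_channels(pretrained, 8)   # uses channels 0,1,2,0,1,2,0,1
one_ch = adapt_in_channels(pretrained, 1)     # keeps only channel 0
```

Whether to also rescale the copied weights (so activations keep a similar magnitude with more input channels) is a separate design choice that this sketch leaves out.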


Just wanted to let you know we are currently working through multichannel audio. Not sure if you’ve done any work on this yet, but if not, the fastai_audio library is almost ready to handle your 8 channel audio!

Just a quick update - we’ve been busy preparing the fastai v2 lib so everything else has been put on hold. But the extra lessons will definitely happen… eventually! :slight_smile: You folks are way more advanced than me now on audio processing, so I’ll definitely want to pick your brains and take advantage of the great work you’re all doing.


Can you give us an idea of where the biggest changes are going to be with fastai v2? Looking forward to watching what you and the team have created and look forward to bringing fastai_audio back into core so it’s all under the same umbrella!

The v2 code is available and seems to be coming together now.
There are pretty major changes everywhere, with the biggest from the point of view of audio probably being the completely redesigned Data Blocks API. It doesn’t look like you would need an audio-specific data source (AudioItemList) anymore. Instead you just have the item class (the equivalent of AudioItem) and a method for opening files, and the user links these in their own custom DataBlock class.
There’s also a more advanced transforms system with a notion of transform pipelines, though it’s not too dissimilar to the current system. The main relevant change there is better support for GPU transforms which are a bit tricky in v1.


Thanks @TomB that’s a good summary. In particular check out nb 08 to see the lower level API pieces that data blocks are now built on. They are way easier to use and more powerful than what we had before.

Judges award for the Freesound Audio Tagging 2019 has now been given, and the winner used - open source code available


Fastai Audio has an active pull request that will handle multichannel audio. Would love to hear of some datasets if anybody has any good ones that are multichannel!

Huge shout out to @hiromi for all of the work on this and for helping get me up to speed on Github!


I’m working on a speech separator, but I currently have a couple of problems. The one I haven’t been able to fix (possibly due to my lack of knowledge in the area) is the type of input/output. Plain spectrograms (applying only an STFT) don’t seem to work well, and although I still have other optimizations to make and I’m still testing, I feel like this is the big problem I’m having, or at least one of them.

With that said, does someone know how I might be able to revert a MelSpectrogram? Or is there another type of spectrogram that better reflects how we hear audio than just applying a STFT but at the same time can be reverted with little noticeable changes?

I’ve seen some works like Tacotron use a synthesizer, which could work, but I don’t want to go there yet. One idea I had is letting the input be a MelSpectrogram and having the output be a mask for the STFT. I’ll probably test this and more, but it doesn’t sound like it will work well. I am using a UNet structure and I’m trying different types of masks.
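For reference, the masking half of that idea can be sketched with SciPy, under heavy assumptions: the mask here is hand-made rather than predicted by a network, the "mixture" is two sine tones standing in for two sources, and `fs` and `nperseg` are arbitrary values. It only shows that a mask applied to the complex STFT reuses the mixture phase and can be inverted directly with the matching ISTFT, which sidesteps inverting a mel spectrogram.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
nperseg = 512
t = np.arange(fs) / fs
# Toy "mixture": two tones standing in for two sources.
low = np.sin(2 * np.pi * 220 * t)
high = np.sin(2 * np.pi * 3000 * t)
mixture = low + high

freqs, _, Z = stft(mixture, fs=fs, nperseg=nperseg)

# In the real setup the network would predict this mask; here it is a
# hand-made binary mask that keeps only the bins below 1 kHz.
mask = (freqs < 1000.0)[:, None].astype(float)

# Apply the mask to the complex STFT (so the mixture phase is reused),
# then invert with the matching ISTFT and trim the padding.
_, estimate = istft(Z * mask, fs=fs, nperseg=nperseg)
estimate = estimate[: len(mixture)]
```

With real speech the two sources overlap in frequency, so the mask would need to vary per time-frequency bin rather than being a fixed cutoff like this one.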

Any ideas?

When you say speech separator, are there two different people talking and you want to put one in channel one and one in channel two?

Hi Robert, being a new member of this forum, I’m unable to send you my Telegram ID. I would love to join the Audio and AI Telegram group because right now I’m working on audio using AI; I’m trying to build a recognition system. Could you kindly send me the Telegram group link?

Thank you :slight_smile:

To me, reverting a spectrogram always seemed possible, but from what I’ve read from people more advanced in signal processing, it either isn’t or is high effort/low quality. Here’s a decent discussion of it.

2nd paragraph outlines a potential path. It’s way beyond my understanding so not sure if it’s helpful.

Another (more correct?) option would be to extract the full FFT (not just the magnitude, but also the phase) and model those. But to have a chance to be able to model the phase components, you would need to extract the FFT pitch-synchronously, and then, resample to fixed frame rate. In other words, you would need to find the GCI (glottal closure instants) in the original wav (for example using reaper), and center your FFT window around those. I suppose once it is modeled, you could resample your full FFT to be pitch synchronous, and then recover a decent enough raw wav by ifft and using OLA.

Yes, my first goal is just to separate two voices from a mono-channel audio, then optimize from there trying to get better results.