Deep Learning with Audio Thread


cheers mrfabulous1:smiley::smiley::smiley:

I’d love to work differentiable signal processing into fastai audio, but unfortunately it is built on TensorFlow and I doubt we’ll see a PyTorch implementation in time to use it. We may implement some related things, such as learned filterbanks (as opposed to the fixed frequency ranges of a linear or mel spectrogram).

I’ve sorted out the Colab notebook for people interested in a quick way to get started with v1 of the library. We will create one of these for fastai2_audio as well.


I was able to successfully convert my model to Caffe2 for deployment following this tutorial

I now realize I also need to convert my preprocessing steps. Currently I’m generating spectrograms on the fly as part of an Audio Databunch.

Has anyone had experience with stripping the dependencies so that spectrograms for inference can be generated in “pure” Python, but in the same way that they are generated in the databunch?
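For anyone attempting this, here is a minimal sketch of a log-power spectrogram that needs only NumPy. It is an assumption-laden stand-in, not the databunch's actual pipeline: it applies no mel filterbank and no normalization, so the values will not match what the library produces. The function name and the defaults (borrowed from the `SpectrogramConfig` values quoted in this thread) are illustrative only.

```python
import numpy as np

def log_spectrogram(signal, n_fft=1024, hop_length=512, top_db=80.0):
    """Log-power STFT spectrogram of a 1-D float signal, using NumPy only."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < n_fft:
        # Pad short clips with zeros so that at least one full frame exists.
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    window = np.hanning(n_fft)
    # Stack overlapping, windowed frames: shape (n_frames, n_fft).
    frames = np.stack([
        signal[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    db = 10.0 * np.log10(np.maximum(power, 1e-10))
    # Clip everything that is more than top_db below the loudest bin.
    db = np.maximum(db, db.max() - top_db)
    return db.T  # shape (n_fft // 2 + 1, n_frames)
```

To get closer to the training-time spectrograms, a mel filterbank and whatever scaling the databunch applies would still have to be added on top of this.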

Deep visual-semantic embedding models identify visual objects using both labeled image data and semantic information gleaned from unannotated text. Has anybody played around with audio-semantic embedding?

Hello, I tried to implement the 03_Environmental_Sound_Classification.ipynb notebook on Colab.
But I got these errors.
“from audio import *” returned “ModuleNotFoundError: No module named ‘audio’”. “pip install audio” returned another error: “ERROR: Command errored out with exit status 1: python egg_info Check the logs for full command output.”

Then with “sg_cfg= SpectrogramConfig(hop_length=512, n_mels=128, n_fft=1024, top_db=80, f_min=20.0, f_max=22050.0)”, I got the error “NameError: name ‘SpectrogramConfig’ is not defined”

Any leads to keep on testing the notebook? Thx

Since yesterday I have been unable to import audio

----> 2 from audio import *

/usr/local/lib/python3.6/dist-packages/torchvision/models/ in Inception3()
180 return x, aux
--> 182 @torch.jit.unused
183 def eager_outputs(self, x, aux):
184 # type: (Tensor, Optional[Tensor]) -> InceptionOutputs

AttributeError: module 'torch.jit' has no attribute 'unused'

Anyone else facing a similar issue?

This looks like a torch versioning issue. Have you tried resetting and reinstalling your Python environment?

Did you base your notebook off this one:

Yes, this is a torch version issue. I am on Colab and tried installing torch 1.4 etc., but I think torchaudio is tied to a specific version and only works with a specific version of torch.

(Sorry forgot which versions were compatible)

I followed this notebook to install audio. It’s working fine now!

Great, I’m glad it worked for you. Yes, currently torchaudio has a requirement of torch 1.4, which is a bit annoying, so make sure you’re installing torchaudio < 0.4.0.
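If it helps, here is a small sketch for checking whether an environment needs that pin. It only encodes the rule stated in this thread (torchaudio 0.4.0 wants torch 1.4, so an older torch should stay below torchaudio 0.4.0); the helper names are made up for illustration and nothing here is part of fastai audio.

```python
import re

def version_tuple(version):
    """Turn a version string like '1.3.1+cu100' into (1, 3, 1)."""
    # Drop local build suffixes such as '+cu100', then keep the first three numbers.
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])]
    return tuple((numbers + [0, 0, 0])[:3])

def needs_torchaudio_pin(torch_version):
    """Return True when the installed torch is older than 1.4.

    Assumption taken from this thread: in that case torchaudio has to be
    installed as 'torchaudio<0.4.0'.
    """
    return version_tuple(torch_version) < (1, 4, 0)
```

In a notebook this would be called as `needs_torchaudio_pin(torch.__version__)` after importing torch.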

Hello! Over the past few weeks I have been developing a bird sound classifier, using the new fastai v2 library!

You can look at my notebook here:

I wanted to incorporate some of the fastai v2 audio library into this, but I wasn’t sure how best to do it.

The dataset I’m using is the LifeCLEF 2018 Bird dataset, and I re-implemented the BirdCLEF baseline system in Jupyter notebooks, with some refactoring done along the way using the fastai v2 library.

The basic idea of what I did was:

Take the dataset, and use the baseline system’s methodology of extracting spectrograms to get a large number of spectrograms for each of the 1500 classes of bird species.

The interesting bit about extracting the spectrograms can be found here:

From there, I did the classic transfer learning technique of training my model against the spectrogram images, on a ResNet model pretrained on ImageNet. I got down to about a 27% error rate!

I just wanted to post this now as I begin to tie it up to see if anyone had any feedback or questions. I’m going to be presenting my work at Localhost, a talk in NYC on February 25th if anyone is around! I’ll be presenting fastai v2 and the audio library to a big audience, so hopefully it will get more people interested in the library :slight_smile:

I couldn’t work out how to use the audio library for this because of the weak labeling problem: finding where in the audio signal the bird sounds are. Much of my notebook re-implements the baseline system’s approach, which goes through all of the recordings, takes 1-second chunks, creates a spectrogram, and uses a signal-to-noise heuristic to determine whether that section has a bird call inside of it.
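To make the chunk-and-filter idea concrete, here is a rough sketch. Note that the score below is a simplified stand-in that I am assuming for illustration (loudest short-frame energy divided by median frame energy on the raw waveform); it is not the baseline system’s actual heuristic, which works on the spectrogram. The function names, frame size, and threshold are all hypothetical.

```python
import numpy as np

def split_into_chunks(signal, sample_rate, seconds=1.0):
    """Cut a 1-D signal into non-overlapping chunks, dropping the remainder."""
    size = int(sample_rate * seconds)
    n_chunks = len(signal) // size
    return [signal[i * size:(i + 1) * size] for i in range(n_chunks)]

def snr_score(chunk, frame=256):
    """Crude signal-to-noise proxy: loudest frame energy over median frame energy."""
    n_frames = len(chunk) // frame
    frames = np.asarray(chunk[:n_frames * frame], dtype=np.float64)
    frames = frames.reshape(n_frames, frame)
    energy = (frames ** 2).mean(axis=1)
    return float(energy.max() / (np.median(energy) + 1e-12))

def chunks_with_signal(signal, sample_rate, threshold=10.0):
    """Return (index, chunk) pairs whose score suggests a short loud event."""
    chunks = split_into_chunks(signal, sample_rate)
    return [(i, c) for i, c in enumerate(chunks) if snr_score(c) >= threshold]
```

Only the chunks that pass the threshold would then be turned into spectrograms for training.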

I’d love to help implement some kind of approach like that within the audio library so I could use it for the entirety of my pipeline. I know that other datasets have the same issue as well, so it’s going to be something to think about.

Two papers that I came across that deal with this are:


Anyway, I wanted to share my progress with others to see if you had any feedback, questions, or suggestions on moving forward. My next main goals are to keep training (always be training), run inference on the test data set, and then do some training on a smaller dataset that I’m interested in (birds from around my area) and do some inference testing on that.


Congratulations on your success! I’m glad that you found the library useful. Seems like you did a lot of extra pre-processing which is interesting and something that we might need to think about when extending the functionality of the library.

So as I understand it, you’re detecting whether a signal contains a bird call so that you can crop that particular area before you actually train on it later on and therefore reduce noise in the data set. Depending on how well that does its job, you could also be adding noise to your data set potentially?

Even though the notebook is in a fork, it looks like you’re not actually using the library in the notebook you’ve shared: you’ve generated the spectrograms on your own and then used the core fastai2 to train. Were there any problems you had specifically?

I’d see if you could incorporate the code from fastai2_audio, which could help boost your accuracy. SpecAugment in particular might allow you to train longer without overfitting, and pre-processing such as silence removal could be useful as well.
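For readers who have not seen it, here is a rough sketch of SpecAugment-style masking on a spectrogram array. This is a simplified illustration under my own assumptions (one frequency band and one time band, filled with the mean, no time warping) and not the fastai2_audio implementation; the function name and default widths are hypothetical.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=16, rng=None):
    """Return a copy of spec (freq bins x time frames) with one frequency band
    and one time band overwritten by the spectrogram's mean value."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=np.float64, copy=True)
    n_freq, n_time = out.shape
    fill = out.mean()

    # Frequency mask: a horizontal band of random height and position.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: a vertical band of random width and position.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out
```

Because a fresh mask is drawn on every call, the model sees a slightly different version of each spectrogram in every epoch, which is where the regularizing effect comes from.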

My training WAV files have a 16k sample rate, while my test file is 8k.

I implemented the following code

item = AudioItem(path='/content/test.wav')
= 8000
al = AudioList([item],path=item.path,
ai =, item.path)
y, pred, raw_pred = learn.predict(ai)

But I get the following error

ValueError                                Traceback (most recent call last)
<ipython-input-44-a3f81a0fa976> in <module>()
      2 = 8000
      3 al = AudioList([item],path=item.path,
----> 4 ai =, item.path)
      5 y, pred, raw_pred = learn.predict(ai)

2 frames
/content/fastai_audio/audio/ in _validate_consistencies(self, item)
    310                                 does not match config sample rate {self.config._sr}
    311                                 this means your dataset has multiple different sample rates,
--> 312                                 please choose one and set resample_to to that value''')
    313         if(self.config._nchannels is not None and self.config._nchannels != item.nchannels):
    314             raise ValueError(f'''Multiple channel sizes detected. Channel size {item.nchannels} of file 

ValueError: Multiple sample rates detected. Sample rate 8000 of file /content/test.wav 
                                does not match config sample rate 16000 
                                this means your dataset has multiple different sample rates, 
                                please choose one and set resample_to to that value

This might help you to do inference. I’m not sure that if you train a model on 16000 sample rate sound you will be able to infer on an 8000 sample rate sound.

config = AudioConfig()
config.resample_to = 16000
config.cache = False

def predict_from_file(wav_file, learner, verbose=True):
    item = AudioItem(path=wav_file)
    if verbose: display(item)
    al = AudioList([item], path=item.path, config=config)
    ai =, item.path)
    y, pred, raw_pred = learner.predict(ai)
    if verbose: print(y)
    if verbose: print(pred.item())
    if verbose: print(raw_pred)

If this doesn’t work you could try to upsample your signal from 8000 to 16000:

sig8 = torch.rand(1,8000)
sig16 = torch.nn.Upsample(scale_factor=2)(sig8[None,])
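With its default mode, `torch.nn.Upsample` simply repeats each sample. As an alternative sketch, assuming NumPy is acceptable for preprocessing, linear interpolation gives a slightly smoother result. The function name is made up for illustration, and there is no anti-aliasing filter here, so this is only reasonable for upsampling.

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (no anti-aliasing filter)."""
    signal = np.asarray(signal, dtype=np.float64)
    n_out = int(round(len(signal) * target_sr / orig_sr))
    # Positions of the output samples, measured in input-sample units.
    positions = np.arange(n_out) * (orig_sr / target_sr)
    # np.interp clamps positions past the last input sample to its value.
    return np.interp(positions, np.arange(len(signal)), signal)
```

Either way, an 8 kHz recording contains nothing above 4 kHz, so the upsampled spectrogram will be empty in its upper half and the model may still behave differently than on true 16 kHz audio.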

Nice work!
I worked on this dataset some time ago. Preprocessing the data consumed most of the time spent. I also used the “kahst method” to extract spectrograms. I chose the hop length and n_fft size to create more or less square spectrograms, and deleted the lower frequencies to get square (256*256) spectrograms. These can be easily processed by ResNet models.
The most problematic part (which still deserves more time) is that lots of spectrograms are created from each sound file. When splitting the data into a training set and a test set, you have to be careful not to use different parts of the same sound file for training and testing; otherwise you get misleadingly low error rates because the training and test images are very similar. (I didn’t find an easy solution yet.)
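One possible way to avoid that leakage is to split at the recording level instead of the spectrogram level. The sketch below assumes each spectrogram can be paired with the name of the sound file it came from; the function name, the pair format, and the defaults are hypothetical.

```python
import random
from collections import defaultdict

def split_by_source_file(items, valid_fraction=0.2, seed=42):
    """Split (spectrogram_id, source_file) pairs so that all spectrograms from
    one recording end up on the same side of the split."""
    by_file = defaultdict(list)
    for spec_id, source_file in items:
        by_file[source_file].append(spec_id)

    # Shuffle the recordings (not the spectrograms) reproducibly.
    files = sorted(by_file)
    random.Random(seed).shuffle(files)
    n_valid = max(1, int(len(files) * valid_fraction))
    valid_files, train_files = files[:n_valid], files[n_valid:]

    train = [s for f in train_files for s in by_file[f]]
    valid = [s for f in valid_files for s in by_file[f]]
    return train, valid
```

Since the fraction applies to recordings, the share of spectrograms in the validation set is only approximate when recordings differ a lot in length.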

I wondered why you chose to use 1-second fragments to make your spectrograms. Most birdsongs are longer than 1 second, so you might be throwing away valuable information.

Looking forward to seeing more of your work with this dataset.

Great work with fastai audio, guys :slight_smile: I’m currently going through 02_tutorial. A few notes:

Currently, trying to build will fail because the highest Python version in the requirements is 3.6.9. However, after manually importing, everything works fine in Python 3.7 :slight_smile:

Also, on the 250 speakers you don’t need the freeze/unfreeze section as we never split and freeze our model. It doesn’t affect anything but it can be a bit misleading. (Also since we have only one layer group you don’t need slice in the learning rates). Great job again!!! :smiley:


Hi all, I just wanted to ask, is anyone here trying out the DeepFake competition on Kaggle? It covers both audio and visual modifications (per description) but most kernels focus on the visuals (as they’re far more obvious).


Just starting on it, thinking about using the audio lib too :slight_smile: