Deep Learning with Audio Thread

I was able to get around this error by modifying the audio_predict function:

def audio_predict(learn, item:AudioItem):
    '''Applies preprocessing to an AudioItem before predicting its class'''
    al = AudioList([item], path=item.path, config=learn.data.x.config)
    ai = AudioList.open(al, item.path)
    return learn.predict(ai)

Removing split_none() and label_empty() seemed to work for me.

Hi Tony,
as far as I understood, this does not really solve the problem.
label_empty() is what kicks off all the preprocessing (in which the errors occur).
So taking it out would send the original audio, without preprocessing, to the inference function. That should lead to wrong results. Maybe I'm wrong here?
Thanks Wolfgang



Thank you for the amazing work. I’m running into a problem with the audio notebook at
It doesn’t seem to use the same version of torchaudio.transforms as the one in the master branch at
Do I need to build torchaudio from another branch?
Thank you.


@botkop Thanks for letting us know about this. Torchaudio has been updating and breaking stuff pretty rapidly, so there's a chance the notebook may be dated. Our Unofficial FastAI Audio V1 pins an earlier version, and while I'm not sure which version Jeremy built that notebook on, switching back to this build would probably help, since it was around a similar time that we were working on V1.

Also, the major changes to torchaudio are listed here, so if you want to help out you can update the code in the notebook to work with the latest version, and then PR it. I'm very familiar with the torchaudio changes and happy to help with that as well; feel free to post here or DM me if you try it and get stuck. Cheers.

Reposting an update from our fastai v2 Audio thread in case anyone here is interested in helping out. We’ve reached a point where we think we have a good working version, but before building on top of it further we feel we could use some feedback in case our implementation has major flaws. If anyone wants to play around with it, especially looking at the low-level implementation details and providing any feedback, @baz and I would appreciate it greatly. Thank you.

NBViewer Notebook Links:

  1. 70_audio_core
  2. 71_audio_augment
  3. 72_audio_tutorial

What we could really use feedback on before proceeding:

  1. The low-level implementation of AudioItem, AudioSpectrogram, AudioToSpec/AudioToMFCC and how we chose to wrap torchaudio and extract default + user-supplied values to be stored in spectrogram.
  2. How to best GPUify everything. We think using SignalCropping to get a fixed length is the only thing we need to do on the CPU, and all signal augments, conversion to spectrogram, and spectrogram augments can be done on GPU. @baz, could you please post your latest GPU nb and questions here to get feedback?
  3. Where we should be using RandTransform for our augments.

Known bugs:
-AudioToSpec used to tab-complete with all potential arguments, but stopped recently; we're trying to trace it.
-Spectrogram display with colorbar + axes doesn't work for multichannel audio or delta+accelerate (anything that is more than one image).
-show_batch is currently broken; we know how to fix it, but the fix breaks spectrogram display. There's a detailed note in the nb.

Quick showcase of some high-level features:

AudioItems display with audio player and waveplot:

Spectrograms store the settings used to generate them in order to show themselves better.

Spectrograms display with a decibel colorbar (if db_scaled), a time axis, and a frequency axis. Thanks TomB for suggesting this.

Create regular or mel spectrograms, to_db or non_db, easily from the same function.

Warnings for missing/extra arguments. If you pass a keyword argument that won’t be applied to the type of spectrogram you’re generating (in this case non-mel spectrogram), you’ll get a warning.
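The checking behind such a warning could be sketched roughly like this (warn_unused_kwargs is a hypothetical helper for illustration, not the actual fastai audio implementation): compare the supplied keyword arguments against the target function's signature and warn on anything it won't accept.

```python
import inspect
import warnings

def warn_unused_kwargs(func, kwargs):
    # Hypothetical helper: warn about keyword arguments that `func` does not
    # accept, instead of silently ignoring them
    accepted = set(inspect.signature(func).parameters)
    unused = [k for k in kwargs if k not in accepted]
    for k in unused:
        warnings.warn(f"{k} is not a valid argument for {func.__name__} and will be ignored")
    return unused
```

Filtering against `inspect.signature` keeps the check generic: the same helper works whether the target is a mel or non-mel spectrogram constructor.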

AudioConfig class with optimized settings users can apply to their audio subdomain, e.g. AudioConfig.Voice, which will set the defaults to be good values for voice applications.

Easy MFCC generation; the photo is a bad example, as it currently stretches to plot. The actual data is only 40px tall.

Features in NB71 audio_augment:

  • Preprocessing
    • Silence Removal: Trim Silence (remove silence at start and end) or remove all silence.
    • Efficient Resampling
  • Signal Transforms (all fast)
    • Signal Cropping/Padding
    • Signal Shifting
    • Easily add or generate different colors of noise
      e.g. real_noisy = AddNoise(noise_level=1, color=NoiseColor.Pink)(audio_orig)
    • Augment volume (louder or quieter)
    • Signal cutout (dropping whole sections of the signal) and signal dropping (dropping a % of the samples, sounds like a bad analog signal, code for this is adapted from ste and zcaceres, thank you!)
    • Downmixing from multichannel to Mono
  • Spectrogram Transforms
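For reference, generating different noise colors generally comes down to shaping white noise in the frequency domain. A generic numpy sketch of the idea (not the library's AddNoise/NoiseColor implementation) might look like:

```python
import numpy as np

def colored_noise(n_samples, exponent=1.0, seed=None):
    # Shape white noise so its power spectrum falls off as 1/f**exponent:
    # exponent 0 -> white, 1 -> pink, 2 -> brown
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                 # avoid dividing by zero at DC
    spectrum = rng.standard_normal(len(freqs)) + 1j * rng.standard_normal(len(freqs))
    spectrum /= freqs ** (exponent / 2)  # amplitude ~ 1/f^(exponent/2)
    noise = np.fft.irfft(spectrum, n_samples)
    return noise / np.abs(noise).max()  # normalize to [-1, 1]
```

Mixing scaled output of something like this into a clean signal is the essence of an additive-noise augment.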

Results from 72_audio_tutorial:
-99.8% accuracy on 10 speaker voice recognition dataset
-95.3% accuracy on 250 speaker voice recognition dataset


Hello everyone,
I am working on a project using the TIMIT dataset.
I have a function which loads all the audio files, applies an STFT to each, takes the absolute value, and appends the output to a list. I have 3 such lists, and running this function takes about 15 minutes.
Is there a way I can save these Python lists to Google Drive (I am using Google Colab) so that I don't have to repeat the preprocessing every time?

Can SpecAugment (frequency and time masking) be used with the current fastai version? Can I call this augment on its own somehow, within Lesson 1 for example?

Hi, I am using audio to perform ASR for the Hindi language.
While creating the databunch, my label is the text. However, I get a unicode error whenever I try to create the databunch.

hi_db = (AudioList.from_folder(data_p/"clips", processor=processors)
.label_from_func(lambda x: data_csv[data_csv['wav_path'] == str(x)]['text'])

'utf-8' codec can't decode byte 0xa4 in position 4: invalid start byte

This is what data_csv looks like:

|wav_path                                       |text                                          |audio                         |
|/content/drive/My Drive/Colab Notebooks/data/t…|प्रायः हम इस भंडार को लौकिक विषयों पर और पत्नी…|2019-11-07-09-59-50-616679.wav|


I’m not sure about the specifics of Google Drive, but the way I would handle it is to take the numpy arrays (this is how the absolute value of the STFT is represented internally), use numpy's save to write them to disk in a special “.processed” folder, and then keep that folder in your Google Drive.

If you specifically need to keep those lists of numpy arrays, I would mess around with “pickle”, Python’s library for saving objects to disk, and see if that works. If you post code I can probably help.
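A minimal save/load roundtrip with pickle could look like this (function names and paths here are just placeholders):

```python
import pickle

def save_features(feature_lists, path):
    # Serialize the preprocessed STFT magnitude lists once, so the ~15-minute
    # preprocessing step doesn't have to be repeated every session
    with open(path, 'wb') as f:
        pickle.dump(feature_lists, f)

def load_features(path):
    # Read the lists back exactly as they were saved
    with open(path, 'rb') as f:
        return pickle.load(f)
```

On Colab, mounting Drive first and pointing `path` somewhere under /content/drive/My Drive/ should keep the files across sessions.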

@baz shared this code in the fastai audio telegram but I wanted to make sure it was posted here so others could find it.

tfms = get_transforms()

def spec_augment(t, size=20):
    bsg =
    max_y = bsg.shape[-2]-size-1
    for i in range(bsg.shape[0]):
        s = bsg[i]
        m = s.flatten(-2).mean()
        r = torch.randint(0,max_y,(1,)).squeeze().int()
        s[:, r:r+size] = m
    return Image(bsg)

def spec_augment_freq(t, size=20):
    res = spec_augment(,-1), size)
    return Image(,-1))

tfms = [Transform(spec_augment)(), Transform(spec_augment_freq)()], []

This is more of a unicode or general fastai issue; I think you’d have better luck making a separate thread if you still haven’t found a workaround.
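One quick thing worth checking first: byte 0xa4 at that position suggests the CSV may not actually be saved as UTF-8 (Devanagari text is often exported in other encodings). A small hypothetical helper to narrow it down:

```python
def detect_encoding(raw_bytes, candidates=('utf-8', 'utf-16', 'cp1252')):
    # Try candidate encodings in order and return the first one that decodes
    # the bytes without error (a debugging aid, not a robust detector)
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

For example, `detect_encoding(open('your_labels.csv', 'rb').read())` ('your_labels.csv' being a placeholder path); if it returns something other than 'utf-8', pass that encoding explicitly when reading the CSV.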


Hey audio people, we are looking for new datasets to use within fastai v2 audio that meet the following requirements:

  • Small enough as to not be a huge burden for memory/training
  • Difficult enough that we can't get > 98% accuracy
  • Published benchmarks so we can compare results with SOTA

We would like to have several datasets that span various audio subdomains (voice recognition, ASR, diarization, scene recognition, music). Please let us know if you have any ideas. Thanks.


Did it work, @kodzaks?


Have not tried yet, but I will!

I can contribute some recordings for scene recognition: underwater snapping shrimp. This is a very common underwater sound in tropical waters.

Has anybody played around with VGGish? It’s a pretrained audio embedding model trained on Google AudioSet (a huge compilation of YouTube audio) that reduces each second of audio to a vector of 128 values, which can be used as features for training.

I tried it on ESC50 and got 62.25% accuracy (with no data augmentation); resnets get 67%, but we’ve gotten as high as 88.75% with densenets + mixup. I want to try more with data augmentation, and also to see if I can get mixup working on the embeddings.
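Mixup on embedding vectors can be sketched like this (the standard mixup formulation applied to precomputed features, not the fastai callback itself; mixup_pair is a hypothetical name):

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.4, rng=None):
    # Blend two (embedding, one-hot label) pairs with a Beta-distributed
    # weight, as in standard mixup; labels are assumed to be one-hot vectors
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Since VGGish already collapses each second of audio to 128 values, mixing in embedding space is cheap compared with mixing raw audio or spectrograms.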

Here’s a notebook if anyone is interested in trying out VGGish embeddings. Ignore the first few parts; they pull in data v2-style. Also ignore this branch of audio v2; it’s just a bunch of messy experiments with various audio stuff (ROCKET, VGGish, raw audio training).


Has the PyTorch audio library been updated recently? And is the fastai audio library no longer following the updated torchaudio?

The imports from torchaudio.transforms are not working, e.g. ‘SpectrogramToDB’, which I see is not present in torchaudio.transforms.

Should I be cloning some other fastai_audio library instead?


Since torchaudio is moving fast and breaking things, and we are not doing a ton of maintenance on V1 due to our focus on V2, we chose to freeze torchaudio at a previous version. SpectrogramToDB is a rename of the previously used AmplitudeToDB, which is called automatically when you set to_db in the config. I think everything you’re trying to do should be achievable using the old version of torchaudio. Let us know if there’s something you’re having trouble doing and we can find a way to do it.
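If you do want code that runs against either torchaudio version, one generic pattern is to resolve whichever class name the installed version exposes (resolve_attr below is a hypothetical helper, shown with a stdlib module so it stands alone):

```python
import importlib

def resolve_attr(module_name, names):
    # Return the first attribute among `names` found on the module, to cope
    # with classes that were renamed across library releases
    mod = importlib.import_module(module_name)
    for name in names:
        if hasattr(mod, name):
            return getattr(mod, name)
    raise AttributeError(f"none of {names} found in {module_name}")
```

Assuming only the class name changed between versions, something like `ToDB = resolve_attr('torchaudio.transforms', ['SpectrogramToDB', 'AmplitudeToDB'])` would pick up whichever rename is present.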

V2 is unfinished at this point; we are still going to build the high-level API and some nice usability features. This shouldn’t take very long, but I wouldn’t recommend using it for anything major until it’s released, as the code will continue to change somewhat rapidly until then.


Do we have a separate discussion thread for V2 audio?