Deep Learning with Audio Thread

I did

!sudo apt-get update
!sudo apt-get -y install sox libsox-dev libsox-fmt-all


Thanks a lot, that helped. I now get an assertion error and was not able to put the data into a test dataset without labels. Is there a way to do that?

~/steth_ai_fast_ai/audio/ in audio_predict_all(learn, al)
67 '''Applies preprocessing to an AudioList then predicts on all items'''
68 al = al.split_none().label_empty()
—> 69 audioItems = [, ai[0].path) for ai in al.train]
70 preds = [learn.predict(ai) for ai in progress_bar(audioItems)]
71 return [o for o in zip(*preds)]

~/steth_ai_fast_ai/audio/ in (.0)
67 '''Applies preprocessing to an AudioList then predicts on all items'''
68 al = al.split_none().label_empty()
—> 69 audioItems = [, ai[0].path) for ai in al.train]
70 preds = [learn.predict(ai) for ai in progress_bar(audioItems)]
71 return [o for o in zip(*preds)]

~/steth_ai_fast_ai/audio/ in open(self, fn)
295 if self.path is not None and not fn.exists() and str(self.path) not in str(fn): fn = self.path/item
296 if self.config.use_spectro:
–> 297 item=self.add_spectro(fn)
298 else:
299 func_to_add = self._get_pad_func() if self.config.max_to_pad or self.config.segment_size else None

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/ in _inner(*args, **kwargs)
471 def _inner(*args, **kwargs):
472 self.train = ft(*args, from_item_lists=True, **kwargs)
–> 473 assert isinstance(self.train, LabelList)
474 kwargs['label_cls'] = self.train.y.__class__
475 self.valid = fv(*args, from_item_lists=True, **kwargs)



It needs to be a LabelList, but not necessarily a labeled one. That is what .label_empty() does, but audio_predict_all should already take care of it. Can you share your full inference code? I’ll try to take a look at it in the next day or two.

I have the same issue as well… I’m working back through the fastai commits to see if something there broke it.
The exact same code was working a month or so ago.

Hey everyone,
I am new to audio. As a way of getting started, I tried running this sample Colab notebook; however, I get the following error.

I searched the forum but couldn’t find anything. I tried updating torchaudio, but that didn’t work either.

Hey, sorry about that. The library has been in flux, and that appears to be an older notebook. The fastai V2 version is under development, and the mostly stable version here (inference aside, currently) is the one that is compatible with fastai V1.

Since you’re new to audio, definitely check out the following notebooks:

  • 00_Getting_Started.ipynb
  • 01_Intro_to_Audio.ipynb
  • 02_Features.ipynb

I was able to get around this error by modifying the audio_predict function:

def audio_predict(learn, item:AudioItem):
    '''Applies preprocessing to an AudioItem before predicting its class'''
    al = AudioList([item], path=item.path,
    ai =, item.path)
    return learn.predict(ai)

Removing split_none() and label_empty() seemed to work for me.

Hi Tony,
as far as I understand, this does not really solve the problem.
label_empty() is what starts all the preprocessing (in which the errors occur),
so taking it out would send the original audio to the inference function without preprocessing. That should lead to wrong results. Maybe I’m wrong here?
Thanks, Wolfgang


Thanks for this list of resources, Robert. This is amazing, I think this company can help with the Deep Learning technology. Check out their blog, I think there should be more information on this topic.

Thank you for the amazing work. I’m running into a problem with the audio notebook at
It doesn’t seem to use the same version of torchaudio.transforms as the one in the master branch at
Do I need to build torchaudio from another branch?
Thank you.


@botkop Thanks for letting us know about this. Torchaudio has been updating and breaking things pretty rapidly, so there’s a chance the notebook is dated. Our Unofficial FastAI Audio V1 uses an earlier version of torchaudio. I’m not sure what version Jeremy built that notebook on, but switching back to that earlier build would probably help, since we were working on V1 around a similar time.

Also, the major changes to torchaudio are listed here, so if you want to help out, you can update the code in the notebook to work with the latest version and then PR it. I’m very familiar with the torchaudio changes and happy to help with that as well. Feel free to post here or DM me if you try it and get stuck. Cheers.

Reposting an update from our fastai v2 Audio thread in case anyone here is interested in helping out. We’ve reached a point where we think we have a good working version, but before building on top of it further we feel we could use some feedback in case our implementation has major flaws. If anyone wants to play around with it, especially looking at the low-level implementation details and providing any feedback, @baz and I would appreciate it greatly. Thank you.

NBViewer Notebook Links:

  1. 70_audio_core
  2. 71_audio_augment
  3. 72_audio_tutorial

What we could really use feedback on before proceeding:

  1. The low-level implementation of AudioItem, AudioSpectrogram, AudioToSpec/AudioToMFCC and how we chose to wrap torchaudio and extract default + user-supplied values to be stored in spectrogram.
  2. How to best GPUify everything. We think using SignalCropping to get a fixed length is the only thing we need to do on the CPU, and all signal augments, conversion to spectrogram, and spectrogram augments can be done on GPU. @baz, could you please post your latest GPU nb and questions here to get feedback?
  3. Where we should be using RandTransform for our augments.

Known bugs:
  • AudioToSpec used to tab-complete with all potential arguments but stopped recently; we’re trying to trace it.
  • Spectrogram display with colorbar + axes doesn’t work for multichannel audio or delta+accelerate (anything that is more than one image).
  • Show_batch is currently broken. We know how to fix it, but the fix breaks spectrogram display. There’s a detailed note in the nb.

Quick showcase of some high-level features:

AudioItems display with audio player and waveplot:

Spectrograms store the settings used to generate them in order to show themselves better

Spectrograms display with decibel colorbar (if db_scaled), time axis, and frequency axis. Thanks to TomB for suggesting this.

Create regular or mel spectrograms, to_db or non_db easily from same function.

Warnings for missing/extra arguments. If you pass a keyword argument that won’t be applied to the type of spectrogram you’re generating (in this case non-mel spectrogram), you’ll get a warning.

AudioConfig class with optimized settings users can apply to their audio subdomain, e.g. AudioConfig.Voice, which will set the defaults to be good values for voice applications.

Easy MFCC generation. The photo is a bad example, as it is currently stretched to fit the plot; the actual data is only 40px tall.

Features in NB71 audio_augment:

  • Preprocessing
    • Silence Removal: Trim Silence (remove silence at start and end) or remove all silence.
    • Efficient Resampling
  • Signal Transforms (all fast)
    • Signal Cropping/Padding
    • Signal Shifting
    • Easily add or generate different colors of noise
      e.g. real_noisy = AddNoise(noise_level=1, color=NoiseColor.Pink)(audio_orig)
    • Augment volume (louder or quieter)
    • Signal cutout (dropping whole sections of the signal) and signal dropping (dropping a % of the samples, sounds like a bad analog signal, code for this is adapted from ste and zcaceres, thank you!)
    • Downmixing from multichannel to Mono
  • Spectrogram Transforms
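
To make a few of the signal-level operations in this list concrete, here is a minimal NumPy sketch of downmixing, trimming leading/trailing silence, and cropping/padding to a fixed length. This is not the code from NB71: the function names, the (channels, samples) layout, and the plain amplitude threshold are all assumptions made for illustration.

```python
import numpy as np

def downmix_to_mono(sig):
    """Average all channels; `sig` is assumed to have shape (channels, samples)."""
    return sig.mean(axis=0, keepdims=True)

def trim_silence(sig, threshold=0.01):
    """Drop leading/trailing samples that are quieter than `threshold` in every channel.

    The amplitude threshold is an illustrative choice, not the library's criterion.
    """
    loud = np.flatnonzero((np.abs(sig) > threshold).any(axis=0))
    if loud.size == 0:
        return sig[:, :0]  # the whole signal is silent
    return sig[:, loud[0]:loud[-1] + 1]

def crop_or_pad(sig, length):
    """Keep the first `length` samples, or right-pad with zeros up to `length`."""
    n = sig.shape[-1]
    if n >= length:
        return sig[:, :length]
    return np.pad(sig, ((0, 0), (0, length - n)), mode="constant")

# Toy stereo signal: silence, a short burst, silence again.
stereo = np.zeros((2, 10))
stereo[0, 3:7] = 1.0
stereo[1, 3:7] = 0.5

mono = downmix_to_mono(stereo)    # shape (1, 10)
trimmed = trim_silence(mono)      # shape (1, 4), only the burst remains
fixed = crop_or_pad(trimmed, 6)   # shape (1, 6), zero-padded on the right
```

Cropping or padding to a fixed length is the step that makes batching possible, which is presumably why it is the one operation singled out above as needing to run before everything else.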

Results from 72_audio_tutorial:
  • 99.8% accuracy on 10 speaker voice recognition dataset
  • 95.3% accuracy on 250 speaker voice recognition dataset


Hello everyone,
I am working on a project using the TIMIT dataset.
I have a function that loads all the audio files, applies an STFT to each, takes the absolute value, and appends the output to a list. I have 3 such lists, and running this function takes about 15 minutes.
Is there a way I can save these Python lists to Google Drive (I am using Google Colab) so that I don’t have to spend the time preprocessing them every time?

Can SpecAugment (frequency and time masking) be used with the current fastai version? Can I apply this augmentation on its own, for example within Lesson 1?

Hi, I am using audio to perform ASR for the Hindi language.
While creating the databunch, my label is the text. However, I get a Unicode error whenever I try to create the databunch.

hi_db = (AudioList.from_folder(data_p/"clips", processor=processors)
.label_from_func(lambda x: data_csv[ == str(x)]['text'])

'utf-8' codec can't decode byte 0xa4 in position 4: invalid start byte

This is what data_csv looks like:

| wav_path | text | audio |
| /content/drive/My Drive/Colab Notebooks/data/t… | प्रायः हम इस भंडार को लौकिक विषयों पर और पत्नी… | 2019-11-07-09-59-50-616679.wav |


I’m not sure about the specifics of Google Drive, but the way I would handle it is to take the NumPy arrays (this is how the absolute value of the STFT is represented internally), save them to disk with numpy.save in a special “.processed” folder, and then keep that folder in your Google Drive.

If you specifically need to keep those lists of NumPy arrays, I would try pickle, Python’s library for saving objects to disk, and see if that works. If you post code I can probably help.
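
As a rough sketch of both suggestions, the snippet below saves a list of variable-length magnitude spectrograms into a single compressed .npz file and, alternatively, pickles the list as-is. The helper names, the file names, and the random stand-in data are hypothetical, and a temporary directory stands in for the target folder; on Colab you would presumably point it at a folder under your mounted Drive instead.

```python
import os
import pickle
import tempfile

import numpy as np

def save_arrays(arrays, path):
    """Save a list of arrays (shapes may differ) into one compressed .npz file."""
    # Zero-padded keys preserve the list order when sorted on load.
    np.savez_compressed(path, **{f"arr_{i:06d}": a for i, a in enumerate(arrays)})

def load_arrays(path):
    """Load the arrays back as a list in their original order."""
    with np.load(path) as data:
        return [data[key] for key in sorted(data.files)]

def save_pickle(obj, path):
    """Pickle any Python object (here: the list of arrays) to disk."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for the real preprocessing output: |STFT| arrays of different lengths.
rng = np.random.default_rng(0)
train_specs = [np.abs(rng.standard_normal((257, n))) for n in (50, 80, 65)]

out_dir = tempfile.mkdtemp()  # assumption: replace with a folder on your Drive
npz_path = os.path.join(out_dir, "train_specs.npz")
pkl_path = os.path.join(out_dir, "train_specs.pkl")

save_arrays(train_specs, npz_path)
restored = load_arrays(npz_path)

save_pickle(train_specs, pkl_path)
restored_pkl = load_pickle(pkl_path)
```

The .npz route keeps everything in plain NumPy format, while pickle is the less work if the lists contain anything other than arrays; only unpickle files you created yourself.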

@baz shared this code in the fastai audio telegram but I wanted to make sure it was posted here so others could find it.

tfms = get_transforms()

def spec_augment(t, size=20):
    bsg =
    max_y = bsg.shape[-2]-size-1
    for i in range(bsg.shape[0]):
        s = bsg[i]
        m = s.flatten(-2).mean()
        r = torch.randint(0,max_y,(1,)).squeeze().int()
        s[:, r:r+size] = m
    return Image(bsg)

def spec_augment_freq(t, size=20):
    res = spec_augment(,-1))
    return Image(,-1))

tfms = [Transform(spec_augment)(), Transform(spec_augment_freq)()], []
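
For readers who want to see the masking idea in isolation, here is a small framework-free sketch in NumPy rather than fastai/PyTorch. It follows the same approach as the snippet above (replace one random band with the mean value, and handle the other axis the same way), but it is not @baz's code: the function names, the `rng` parameter, and the single-band behaviour are assumptions for illustration.

```python
import numpy as np

def mask_band(spec, size=20, axis=-1, rng=None):
    """Return a copy of `spec` with one random band of width `size` along `axis`
    replaced by the mean of the whole spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=float, copy=True)  # never modify the input
    n = out.shape[axis]
    size = min(size, n)
    start = int(rng.integers(0, n - size + 1))    # band fits fully inside the axis
    index = [slice(None)] * out.ndim
    index[axis] = slice(start, start + size)
    out[tuple(index)] = out.mean()
    return out

def time_mask(spec, size=20, rng=None):
    """Mask a band of consecutive time steps (last axis)."""
    return mask_band(spec, size=size, axis=-1, rng=rng)

def freq_mask(spec, size=20, rng=None):
    """Mask a band of consecutive frequency bins (second-to-last axis)."""
    return mask_band(spec, size=size, axis=-2, rng=rng)

# Toy "spectrogram": 40 frequency bins x 100 time steps.
spec = np.arange(40 * 100, dtype=float).reshape(40, 100)
rng = np.random.default_rng(0)

t_masked = time_mask(spec, size=20, rng=rng)
f_masked = freq_mask(spec, size=8, rng=rng)
```

Filling with the mean rather than zero keeps the masked band close to the overall level of the spectrogram, which matches what the snippet above does with `m`.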

This is more of a Unicode or general fastai issue. I think you’d have better luck starting a separate thread if you still haven’t found a workaround.


Hey audio people, we are looking for new datasets to use within fastai v2 audio that meet the following requirements:

  • Small enough as to not be a huge burden for memory/training
  • Difficult enough that we can’t get > 98% accuracy
  • Published benchmarks so we can compare results with SOTA

We would like to have several datasets that span various audio subdomains (voice recognition, ASR, diarization, scene recognition, music). Please let us know if you have any ideas. Thanks.


Did it work, @kodzaks?
