Fastai v2 audio

(Harry Coultas Blum) #14

Yes if you could do that I can create a shared environment for us all to contribute and work from.

1 Like

(Jeremy Howard (Admin)) #15

I’m not sure I have a strong opinion on that either way. Happy for you folks to do whatever you prefer!


(Hiromi Suenaga) #16

So, I was thinking “how could I possibly utilize torchaudio's MelSpectrogram cleanly?” and came up with this still-work-in-progress notebook.

I didn’t want to subclass MelSpectrogram but I wanted the tab completion to still work. I then noticed the to argument in @delegates, and it works like a charm:
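Roughly, the pattern looks like this. The snippet below is a toy stand-in for fastcore's `@delegates` and torchaudio's `MelSpectrogram`, just to show the mechanism (the real versions live in fastcore/torchaudio and are more capable):

```python
import inspect

def delegates(to):
    "Toy sketch of fastcore's @delegates: copy `to`'s keyword defaults into the decorated function's signature so tab completion shows them."
    def _f(f):
        sig = inspect.signature(f)
        params = dict(sig.parameters)
        params.pop('kwargs')  # decorated fn must accept **kwargs
        extra = {k: v for k, v in inspect.signature(to).parameters.items()
                 if v.default is not inspect.Parameter.empty and k not in params}
        params.update(extra)
        f.__signature__ = sig.replace(parameters=list(params.values()))
        return f
    return _f

# Stand-in for torchaudio's MelSpectrogram constructor
def mel_spectrogram(n_fft=400, n_mels=128, hop_length=None):
    return (n_fft, n_mels, hop_length)

@delegates(mel_spectrogram)
def audio_to_spec(sample_rate=16000, **kwargs):
    return mel_spectrogram(**kwargs)

# audio_to_spec's signature now advertises n_fft, n_mels, hop_length,
# so tab completion works without subclassing MelSpectrogram
```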

Anyway, I just wanted to share my small victory :slight_smile:


(Harry Coultas Blum) #17

We’ve decided to work on a shared fork of fastai_dev and hopefully will eventually make a PR to the fastai_dev repo that you guys can review.


(Harry Coultas Blum) #18

I am hosting some small datasets to be used for development. I assume that these will eventually be hosted on the but for now you can access them here:

from import untar_data

data_path = untar_data("")

Let me know if there are any problems with downloading.


(Robert Bracco) #19

We’ve started to add testing and I ran into this issue when running the tests (for audio nbs 70/71 only) automatically from the command line using the command provided in the readme. The notebook runs fine on its own; any idea why it can’t import properly?

Are there any good resources besides the readme for testing? Maybe a code walkthrough that discusses it? Or is it as simple as just adding tests from 00_test throughout the nbs? Thanks!


(Jeremy Howard (Admin)) #20

I haven’t seen that before, sorry!

For testing I can’t think of anything to add other than to just add tests to your notebooks… But let me know if anything is unclear as you do it, and we can start adding any testing docs that are needed based on your feedback and questions.



Is it possible that it’s using an older version of torchaudio when running the tests? AmplitudeToDB was previously named SpectrogramToDB.


(Robert Bracco) #22

Very possible, I’ll check now. Thanks for the suggestion!


(Robert Bracco) #23

Our version of audio has come a long way; I’ll post notebooks at the end of the week. @baz and I are working on getting all the spectrogram transforms onto the GPU.

Most have been straightforward, but I’m having some trouble with SpecAugment and before I dump hours into trying to figure it out I want to make sure I’m on the right track. Here’s the version that works on individual spectrograms:

def MaskFreq(num_masks=1, size=20, start=None, val=None, **kwargs):
    def _inner(spectro:AudioSpectrogram)->AudioSpectrogram:
        '''Google SpecAugment frequency masking from'''
        nonlocal start
        sg = spectro.clone()
        # per-channel mean used as the fill value unless `val` is given
        channel_mean = sg.contiguous().view(sg.size(0), -1).mean(-1)
        mask_val = channel_mean if val is None else val
        c, y, x = sg.shape
        for _ in range(num_masks):
            mask = torch.ones(size, x) * mask_val
            if start is None: start = random.randint(0, y-size)
            if not 0 <= start <= y-size:
                raise ValueError(f"Start value '{start}' out of range for AudioSpectrogram of shape {sg.shape}")
            sg[:,start:start+size,:] = mask
            start = None  # re-randomize position for the next mask
        return AudioSpectrogram.create(sg, settings=spectro.settings)
    return _inner

What the transform looks like:

Am I correct in assuming that for a batch of 64, I will need to grab the channel mean of each of those 64 spectrograms, so then I have a 64 layer mask of different values (mask is batch_size by mask_height by sg_width) and then I need to insert that mask at 64 random start positions (otherwise the mask will be in the same position for every image in the batch)? This is bending my brain a little bit and just want to confirm I’m on the right track and there’s not a simpler way. Thanks!

1 Like

(Robert Bracco) #24

So I got the channel mean part working, but I’m still not sure how to make sure the mask is inserted at a different position for each image in the batch. Looking at how RandomResizedCropGPU works in 09_vision_augment, am I right in thinking I need a class to achieve this? I was originally thinking that if I had an array of random insertion points, that I could use that to index into the batch in the same way we did in the batchless version, but I keep running into TypeError: only integer tensors of a single element can be converted to an index. Is there no way to index into a 64x1x128x128 tensor so that each of the 64 images has the mask applied at a different start point in the 3rd dim? Thanks.


(Jeremy Howard (Admin)) #25

It’s difficult, but possible. As well as rrc, which you already noted, there’s also an example in fastai2.medical.imaging.


(Harry Coultas Blum) #26

I’ve managed to get SpecAugment for batches working with two different methods:

Method 1

Uses a for loop but is decorated as a torch script to speed it up

@torch.jit.script  # TorchScript-compiled to speed up the Python loop
def spec_aug_loop(batch:Tensor, size:int=20):
    bsg = batch.clone()
    max_y = bsg.shape[-2]-size-1
    for i in range(bsg.shape[0]):
        s = bsg[i, :]
        m = s.flatten(-2).mean()
        r = int(torch.randint(0, max_y, (1,)).item())  # random mask start row
        s[:, r:r+size, :] = m
    return bsg

2.83 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Method 2

Using masks to replace the values in random places

def spec_augment(batch:Tensor, size:int=20):
    bsg = batch.clone()
    bs, _, x, y = bsg.shape
    max_y = y-size-torch.tensor(1)
    m = torch.arange(y).repeat(x*bs).view(bs,-1)
    rs = torch.randint(0,max_y,(1,bs)).squeeze()[None].t()
    gpumask = ((m > rs)) & (m < (rs+size))
    gpumask = gpumask.view(bs,x,-1)[:,None,:]
    bsg[gpumask] = torch.tensor(0)
    return bsg

7.26 ms ± 81.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The clear winner is method 1 but I believe that there must be something I can do to speed up method 2
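To see the core of method 2 in isolation: the trick is comparing an arange of row indices against a column of per-image start positions, and letting broadcasting build one boolean mask per image. A stripped-down NumPy sketch with toy sizes (tensor broadcasting works the same way):

```python
import numpy as np

# Compare row indices (shape (y,)) against per-image start rows
# (shape (bs, 1)); broadcasting yields one boolean mask per image.
bs, y = 4, 10          # batch size, spectrogram height (toy values)
size = 3               # mask height
rows = np.arange(y)                      # shape (y,)
starts = np.array([[0], [2], [5], [7]])  # shape (bs, 1), one start per image
mask = (rows >= starts) & (rows < starts + size)  # broadcasts to (bs, y)
print(mask.astype(int))
```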


(Robert Bracco) #27

Update: We’ve reached a point where we think we have a good working version, but before building on top of it further we feel we could use some feedback in case our implementation has major flaws. We’re hoping @jeremy, @sgugger, @muellerzr, @arora_aman or any others who have been really active in the v2 chat and dev repo can take a quick look and make suggestions. We know everyone is extremely busy with getting v2 ready to ship, so if you don’t have time we understand, but feedback of any kind (especially critical) would be greatly appreciated. Thank you.

NBViewer Notebook Links:

  1. 70_audio_core
  2. 71_audio_augment
  3. 72_audio_tutorial

What we could really use feedback on before proceeding:

  1. The low-level implementation of AudioItem, AudioSpectrogram, AudioToSpec/AudioToMFCC and how we chose to wrap torchaudio and extract default + user-supplied values to be stored in spectrogram.
  2. How to best GPUify everything. We think using SignalCropping to get a fixed length is the only thing we need to do on the CPU, and all signal augments, conversion to spectrogram, and spectrogram augments can be done on GPU. @baz, could you please post your latest GPU nb and questions here to get feedback?
  3. Where we should be using RandTransform for our augments.

Known bugs:
-AudioToSpec used to tab-complete with all potential arguments, but stopped recently; we’re trying to trace it.
-Spectrogram display with colorbar + axes doesn’t work for multichannel audio, or delta+accelerate (anything that is more than one image).
-show_batch is currently broken; we know how to fix it, but the fix breaks spectrogram display. There’s a detailed note in the nb.

**Showcase of some high-level features:**

AudioItems display with audio player and waveplot:

Spectrograms store the settings used to generate them in order to show themselves better

Spectrograms display with decibel colorbar (if db_scaled), time axis, frequency axis. Thanks @TomB for suggesting this

Create regular or mel spectrograms, to_db or non_db, easily from the same function.

Warnings for missing/extra arguments. If you pass a keyword argument that won’t be applied to the type of spectrogram you’re generating (in this case non-mel spectrogram), you’ll get a warning.

AudioConfig class with optimized settings users can apply to their audio subdomain, e.g. AudioConfig.Voice, which will set the defaults to be good values for voice applications.

Easy MFCC generation, photo is a bad example as it currently stretches to plot, actual data is only 40px tall.
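To illustrate the AudioConfig idea, a hypothetical sketch of subdomain presets (field names and values below are illustrative stand-ins, not the module's actual API):

```python
from dataclasses import dataclass

@dataclass
class SpecDefaults:
    "Hypothetical bundle of spectrogram settings; names and values are illustrative only."
    sample_rate: int = 16000
    n_fft: int = 1024
    n_mels: int = 128
    hop_length: int = 256
    f_max: float = 8000.0

class AudioConfigSketch:
    # Voice energy sits mostly below ~8 kHz, so 16 kHz sampling suffices
    Voice = SpecDefaults(sample_rate=16000, f_max=8000.0)
    # Music benefits from full-bandwidth audio and a longer analysis window
    Music = SpecDefaults(sample_rate=44100, n_fft=2048, f_max=20000.0)
```

The point is that a beginner picks a preset by name and gets defaults tuned for that subdomain, rather than having to understand every spectrogram parameter up front.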

Features in NB71 audio_augment:

  • Preprocessing
    • Silence Removal: Trim Silence (remove silence at start and end) or remove all silence.
    • Efficient Resampling
  • Signal Transforms (all fast)
    • Signal Cropping/Padding
    • Signal Shifting
    • Easily add or generate different colors of noise
e.g. real_noisy = AddNoise(noise_level=1, color=NoiseColor.Pink)(audio_orig)
    • Augment volume (louder or quieter)
    • Signal cutout (dropping whole sections of the signal) and signal dropping (dropping a % of the samples, sounds like a bad analog signal, code for this is adapted from @ste and @zcaceres, thank you!)
    • Downmixing from multichannel to Mono
  • Spectrogram Transforms
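On generating colors of noise: the standard trick is to shape white noise's spectrum by 1/f^(exponent/2) in the frequency domain. A self-contained NumPy sketch of the idea (the function name and normalisation here are illustrative, not the module's API):

```python
import numpy as np

def colored_noise(n, exponent=1.0, seed=None):
    "Noise with a 1/f**exponent power spectrum: 0=white, 1=pink, 2=brown(ish)."
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]  # avoid dividing by zero at DC
    # random complex spectrum, shaped so power falls off as 1/f**exponent
    spec = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
    spec /= freqs ** (exponent / 2)
    sig = np.fft.irfft(spec, n=n)
    return sig / np.abs(sig).max()  # normalise to [-1, 1]
```

White noise (exponent=0) has a flat spectrum; pink (exponent=1) has equal power per octave, which is why it sounds more like real-world background noise.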

Results from 72_audio_tutorial:
-99.8% accuracy on 10 speaker voice recognition dataset
-95.3% accuracy on 250 speaker voice recognition dataset


Deep Learning with Audio Thread
(Harry Coultas Blum) #28

Transforming from signal to spectrogram is much faster on the GPU, as shown in the graph below. When I baked it into the v2 API, I got strange results, with CPU epoch times tripling. More details of this can be found here.

GPU vs CPU Audio To Spectrogram

I crop the signals before letting the GPU and CPU convert them to spectrograms.


Spec Augment


In the post above you can see the methods I have created for SpecAugment. The vectorised one is slower for some reason.

Shift / Roll

I haven’t tried these yet, but now that I’ve done SpecAugment I think they are probably vectorisable.


  • As @MadeUpMasters has said, where should the transforms go, and since they are random, should they extend the RandTransform class?
  • Generally, where would you see the GPU batch transforms being placed?
  • @MadeUpMasters has created pre-set audio configurations for different classification problems. How can we further improve the lives of beginners approaching audio classification?
  • Many of the pre-processing transforms (cropping, removing silence, resampling) are not randomised, so their output could be cached. They took a long time, which is why we decided to cache them in fastai_audio v1. That was quite messy to implement with the old API, but I believe it could be much cleaner with the new one. What are your thoughts on this?
  • We’ve mainly focused on classification, and therefore spectrogram generation, as that seems to be the SOTA approach, but there is huge potential for using RNNs for ASR etc. In your mind, what is the scope of the fastai audio module: how many audio problems should we tackle?
1 Like

(Harry Coultas Blum) #29

I’ve managed to improve method 2 to <700us which is faster than the for loop

def spec_augment(sgs:Tensor, size:int=20):
    bsg = sgs.clone()
    device = sgs.device
    bs, _, x, y = bsg.shape
    max_y = y-size-1
    rsize = torch.tensor(x*bs)
    m = torch.arange(y,device=device).repeat(rsize).view(bs, -1)
    rs = torch.randint(0, max_y,(bs,1), device=device)
    gpumask = ((m < rs) | (m > (rs+size))).view(bs, 1, x, -1)
    return bsg*gpumask

646 µs ± 4.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(Aman Arora) #30

Thanks @MadeUpMasters, would love to give it a try. I will get back to you if I find anything! :slight_smile:


(Vadim Shalts) #31

Thanks @MadeUpMasters, it looks awesome!

I am not active in the v2 chat or dev repo and can’t give any deep feedback about the technical part (except that it’s much better than the solutions I’d cobbled together myself, and I’ve started throwing mine away and replacing them with your code :slight_smile: ). I do have a few thoughts about other things that could be useful, not only for me, when working with audio:

  1. Support for other types of noise based on actual audio samples (from audio-noise or urban-noise datasets). An example dataset is something like -

Plus some logic/transformers to mix such noise in at different levels.

  2. Support for perceptual audio loss functions. Losses based on PESQ/POLQA give a huge improvement for tasks related to speech quality evaluation, and supporting them out of the box would be a big advantage for fastai.

A few interesting works in this area:

At the moment I don’t think I am ready to share any code or to do any useful PRs related to those problems (maybe in the future, who knows…). Therefore, for now, only ideas, but I hope they may be useful somehow. And thanks again, your work is awesome and I have already started adopting it for my experiments!


(Robert Bracco) #32

Thank you for looking over it for us, I really appreciate it.

This is coming. We’ve mostly messed around with voice in v2, but in the v1 library we did a lot of urban-sound work, including getting a new SOTA on the ESC-50 dataset, something that @KevinB and @hiromi have successfully replicated in v2. v2 will have preset defaults for the various audio problems (voice rec, ASR, scene classification, music, etc.) so that people with no audio experience start from great defaults.

This is quite cool (and advanced!), thanks for sharing. I wasn’t aware of perceptual audio loss, just the image feature loss from part 1 of the course. I read the papers, and these are implemented quite differently from feature loss in fastai. The approach is to train a model to approximate PESQ (since PESQ isn’t differentiable, but a model is), then use that trained critic as the loss for a second model that generates or denoises audio.

I don’t see any reason we couldn’t implement this (train the critic and make it available as a loss function), but since we haven’t ventured into any generative audio stuff yet, it will probably be some time before adding it.

Thanks again for the feedback and feel free to reach out with more as your experiments progress. Cheers.

1 Like

(Vadim Shalts) #33


Seems I found a small issue:

def from_raw_data(cls, data, sr):
    data = tensor(data)
    if data.dim() == 1:
        data = data.view(1,-1)
    return cls((data, sr, Path('raw')))

AudioItem.from_raw_data = classmethod(AudioItem.from_raw_data)

freq = 440

sr = 16000
item = AudioItem.from_raw_data(data = librosa.tone(frequency=freq, sr=sr, length=2000), sr = sr)
a2s = AudioToSpec.from_cfg(AudioConfig.Voice())



Looks like it shows the result upside-down, or at least the Hz axis on the left is not consistent with the picture.

With kind regards,