Fastai v2 audio

Transforming from Signal to Spectrogram is much faster, as shown in the graph below. When I baked it into the v2 API, I got strange results, with CPU epoch times tripling. More details can be found here.

GPU vs CPU Audio To Spectrogram

I crop the signals before letting the GPU and CPU convert them to spectrograms.
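For readers who want to reproduce this kind of benchmark themselves, a minimal sketch using plain `torch.stft` (not the library's actual `AudioToSpec` transform; shapes and parameters here are illustrative) might look like:

```python
import time
import torch

def to_spec(sig, n_fft=1024, hop=256):
    # magnitude spectrogram via STFT; runs on whatever device `sig` lives on
    window = torch.hann_window(n_fft, device=sig.device)
    return torch.stft(sig, n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()

bs, sr = 32, 16000
batch = torch.randn(bs, sr * 2)             # a batch of cropped 2-second signals

t0 = time.perf_counter()
spec_cpu = to_spec(batch)
cpu_ms = (time.perf_counter() - t0) * 1000  # CPU conversion time

if torch.cuda.is_available():
    gpu_batch = batch.cuda()
    torch.cuda.synchronize()                # make the GPU timing honest
    t0 = time.perf_counter()
    spec_gpu = to_spec(gpu_batch)
    torch.cuda.synchronize()
    gpu_ms = (time.perf_counter() - t0) * 1000
```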


Spec Augment


In the post above you can see the methods I have created for SpecAugment. The vectorised one is slower for some reason.

Shift / Roll

I haven’t tried these yet, but now that I’ve done SpecAugment I think they are probably vectorisable.


  • As @MadeUpMasters has asked: where should the transforms go, and since they are random, should they extend the RandTransform class?
  • Generally, where would you see the GPU batch transforms being placed?
  • @MadeUpMasters has created pre-set audio configurations for different classification problems. How can we further improve the lives of beginners approaching audio classification?
  • Many of the pre-processing transforms (cropping, removing silence, resampling) are not randomised, so their results could be preserved. They took a long time to run, which is why we decided to cache them in fastai_audio v1. The implementation was quite messy with the old API, but I believe it could be much cleaner with the new one. What are your thoughts on this?
  • We’ve mainly focused on classification, and therefore spectrogram generation, as that seems to be the SOTA approach, but there is huge potential for using RNNs for ASR etc. In your mind, what is the scope of the fastai audio module, and how many audio problems should we tackle?
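On the caching point, a minimal sketch of the idea (hypothetical helper names, not the library's API): deterministic preprocessing results could be keyed by a hash of the input and stored on disk, so each expensive step runs only once.

```python
import hashlib
from pathlib import Path
import torch

CACHE_DIR = Path(".audio_cache")      # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached(fn):
    # cache a deterministic tensor -> tensor preprocessing step on disk,
    # keyed by the function name and a hash of the input tensor's bytes
    def _inner(t: torch.Tensor) -> torch.Tensor:
        key = hashlib.sha1(t.numpy().tobytes()).hexdigest()
        path = CACHE_DIR / f"{fn.__name__}-{key}.pt"
        if path.exists():
            return torch.load(path)   # hit: skip the expensive computation
        out = fn(t)
        torch.save(out, path)
        return out
    return _inner

@cached
def downsample_stub(t):
    return t[::2]    # stand-in for a real resample/crop/silence-removal step
```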

I’ve managed to improve method 2 to <700µs, which is faster than the for loop:

def spec_augment(sgs:Tensor, size:int=20):
    bsg = sgs            # reconstructed: this line was truncated in the original post
    device = sgs.device
    bs, _, x, y = bsg.shape
    max_y = y-size-1
    m = torch.arange(y, device=device).repeat(bs*x).view(bs, -1)
    rs = torch.randint(0, max_y, (bs,1), device=device)
    gpumask = ((m < rs) | (m > (rs+size))).view(bs, 1, x, -1)
    return bsg*gpumask
646 µs ± 4.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
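To see the broadcasting trick in isolation, here is a toy re-creation of the same time-mask construction on a random batch (shapes are illustrative):

```python
import torch

bs, ch, x, y, size = 4, 1, 128, 128, 20
sgs = torch.randn(bs, ch, x, y)                    # a batch of fake spectrograms

m  = torch.arange(y).repeat(bs * x).view(bs, -1)   # column index for every cell
rs = torch.randint(0, y - size - 1, (bs, 1))       # random mask start per item
mask = ((m < rs) | (m > rs + size)).view(bs, 1, x, -1)  # False inside the band
out = sgs * mask                                   # zero the masked columns
```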

Thanks @MadeUpMasters, would love to give it a try. I will get back to you if I find anything! :slight_smile:


Thanks @MadeUpMasters, it looks awesome!

I am not active in the v2 chat or dev repo and can’t give any deep feedback on the technical part (except that it’s much better than my own reinvented wheels, which I’ve started throwing away and replacing with your code :slight_smile: ). I do have a few thoughts on other things that could be useful, not only to me, when working with audio:

  1. Support for other types of noise based on actual audio samples (from audio-noise or urban-noise datasets). An example of such a dataset is something like -

Plus some logic/transforms to mix such noise in at different levels.

  2. Support for perceptual audio loss functions. Losses based on PESQ/POLQA give a huge improvement for tasks related to speech quality evaluation, and supporting them out of the box could be a big advantage for fastai.

A few interesting works in this area:

At the moment I don’t think I am ready to share any code or make any useful PRs for these problems (maybe in the future, who knows…). So for now, only ideas, but I hope they may be useful somehow. And thanks again, your work is awesome and I have already started adapting it for my experiments!


Thank you for looking over it for us, I really appreciate it.

This is coming, we’ve mostly messed around with voice in v2, but in the v1 library we did a lot of urban sounds stuff including getting a new SOTA on the ESC-50 Dataset, something that @KevinB and @hiromi have successfully replicated in v2. v2 will have preset defaults for the various audio problems (voice rec, ASR, scene classification, music…etc) so that people with no audio experience can have great defaults.

This is quite cool (and advanced!), thanks for sharing. I wasn’t aware of perceptual audio loss, just the image feature loss from part 1 of the course. I read the papers, and these are implemented quite differently from feature loss in fastai: since PESQ isn’t differentiable and models are, the approach is to train a model to approximate PESQ, then use that model as the critic for a second model that is trying to generate or denoise audio.
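As a rough illustration of that two-model setup (entirely hypothetical names; the papers' actual architectures differ), the critic can be frozen and wrapped as a loss:

```python
import torch
import torch.nn as nn

class PerceptualCriticLoss(nn.Module):
    """Sketch of a PESQ-style perceptual loss: a pretrained, frozen critic
    scores the generated audio, and we push that score toward the maximum."""
    def __init__(self, critic: nn.Module, target_score: float = 4.5):
        super().__init__()
        self.critic = critic.eval()
        for p in self.critic.parameters():
            p.requires_grad_(False)       # only the generator gets gradients
        self.target = target_score

    def forward(self, generated: torch.Tensor) -> torch.Tensor:
        score = self.critic(generated)    # differentiable quality estimate
        return ((self.target - score) ** 2).mean()

# toy critic standing in for a real trained PESQ approximator
critic = nn.Sequential(nn.Linear(16000, 1))
loss_fn = PerceptualCriticLoss(critic)
```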

I don’t see any reason we couldn’t implement this (train the critic and make it available as a loss function), but since we haven’t ventured into any generative audio stuff yet, it will probably be some time before adding it.

Thanks again for the feedback and feel free to reach out with more as your experiments progress. Cheers.



Seems I found a small issue:

from pathlib import Path

def from_raw_data(cls, data, sr):
    data = tensor(data)
    if data.dim() == 1:
        data = data.view(1, -1)   # add a channel dimension for mono signals
    return cls((data, sr, Path('raw')))

AudioItem.from_raw_data = classmethod(from_raw_data)

freq = 440

sr = 16000
item = AudioItem.from_raw_data(data = librosa.tone(frequency=freq, sr=sr, length=2000), sr = sr)
a2s = AudioToSpec.from_cfg(AudioConfig.Voice())



It looks like the result is shown upside-down, or at least the Hz axis on the left is not consistent with the picture.
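For what it's worth, one common cause of this is purely in plotting: matplotlib's `imshow` draws row 0 at the top by default, so a spectrogram whose row 0 is the lowest frequency bin appears upside-down unless you pass `origin='lower'` or flip the frequency axis for display. A toy flip:

```python
import torch

sg = torch.arange(12.).view(3, 4)   # toy (freq_bins, time_frames) spectrogram
flipped = sg.flip(0)                # row 0 (lowest Hz) now drawn at the bottom
```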

With kind regards,


Could I please ask whether work on v2 audio is still taking place? It doesn’t seem to be happening in the fastai_audio repo. BTW, the tutorials in the repo are top notch; they were very helpful to me and are really well written! Thank you for putting them together.


Work on v2 is still taking place, the latest update was posted here: Fastai v2 audio

Really excited to hear you found the tutorials useful! I plan to go back at some point and add a lot more as I’ve learned more about signal processing, as well as experimented with a lot of new things like training on raw audio, audio embeddings…etc. What I have so far barely scratches the surface.


Thank you very much for the update! :slight_smile:

Would love to read this.

My plan is this:

  • continue to learn about fastai v2
  • study the materials you share above
  • use fastai audio in the freesound starter pack (I first want to train on spectrograms using the setup I already have, but the next step IMO would be fastai audio; I am using fastai v2)

I am not sure how things will go but maybe there will be a chance for me to do something useful for the project. It could be as little as test driving the functionality you create or maybe helping in some way with fastai v2 integration. Anyhow, just a thought at this point in time, look forward to getting up to speed with the v2 port :slight_smile:


Awesome, a great place for audio resources (and asking questions) is the Deep Learning with Audio Thread. Please let me know if you find anything helpful that isn’t listed there and I can get it added.


An update on v2 audio, it is mostly functional, we just need some pieces of the high-level API put on top, and to fix a few places we feel like we could be doing things better/faster/more v2-like but aren’t sure exactly how to get there. I’m going to make a series of posts about each issue, so if you have ideas/questions or want to participate in the discussion, even if you know nothing about audio, please jump in. Also any feedback on making the posts more useful/readable is appreciated.

Issue 1: Getting transforms to use TypeDispatch, RandTransform, and be GPU-friendly

We have a lot of transforms for AudioSignals and AudioSpectrograms that were originally written as simple independent functions. Many transforms for signals are the same as those for spectrograms, just with an extra dimension. For instance, we may want to cut out a section of the data: for AudioSignals this is “cutout”, for AudioSpectrograms it’s “time masking”, but they’re the same idea applied to 2D (channels x samples) and 3D (channels x height x width) tensors respectively. Given the functionality of v2, we would like to refactor many of these to use RandTransform and TypeDispatch, and to be easily transferable to the GPU.
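A minimal illustration of that shared-shape idea (a hypothetical helper, not the library's actual transform): ellipsis indexing lets one body handle the 2D signal, the 3D spectrogram, and batches of either.

```python
import torch

def mask_time(t: torch.Tensor, start: int, size: int) -> torch.Tensor:
    # zero `size` steps along the last (time) axis; `...` absorbs any
    # leading dims, so the same code covers items and batches of both types
    out = t.clone()
    out[..., start:start + size] = 0
    return out

sig = torch.ones(1, 16000)       # AudioSignal-like: channels x samples
sg  = torch.ones(1, 128, 250)    # AudioSpectrogram-like: channels x height x width
```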

I am having trouble figuring out exactly how to design a transform that works on signals and spectrograms, individual items and batches, implements RandTransform, and is fast. If I get one, I should be able to copy the pattern to the others. I tried looking in 09_data_augment but it wasn’t that clear to me how to make it work for my code.

Here is my best attempt so far. It shifts the data horizontally (roll adds wraparound). It works properly, but is much slower than my original transforms: 557µs for a single item and 2.35ms for a batch of 32 signals (CPU), compared to 54µs for a single item previously.

Questions @jeremy @baz:

  • Is it more or less correctly implemented, and how can I improve it with respect to…
    • TypeDispatch - Making it work for both signal/spectrogram simultaneously?
    • GPU - Making it work on both batches and individual items?
    • RandTransform - Am I using it properly here?
  • Any ideas on how to make it faster? Code of the original transform included at bottom.


class SignalShifter(RandTransform):
    def __init__(self, p=0.5, max_pct=0.2, max_time=None, direction=0, roll=False):
        if direction not in [-1, 0, 1]: raise ValueError("direction must be -1 (left), 0 (bidirectional) or 1 (right)")
        store_attr(self, "max_pct,max_time,direction,roll")
        super().__init__(p=p, as_item=True)

    def before_call(self, b, split_idx):
        super().before_call(b, split_idx)
        self.shift_factor = random.uniform(-1, 1)
        if self.direction != 0: self.shift_factor = self.direction*abs(self.shift_factor)

    def encodes(self, ai:AudioItem):
        if self.max_time is None: s = self.shift_factor*self.max_pct*ai.nsamples
        else:                     s = self.shift_factor*self.max_time*  # seconds -> samples; `` reconstructed from a truncated line
        ai.sig[:] = shift_signal(ai.sig, int(s), self.roll)
        return ai

    def encodes(self, sg:AudioSpectrogram):
        if self.max_time is None: s = self.shift_factor*self.max_pct*sg.width
        else:                     s = self.shift_factor*self.max_time*  # `` reconstructed from a truncated line
        return shift_signal(sg, int(s), self.roll)

def _shift(sig, s):
    samples = sig.shape[-1]
    if   s == 0: return torch.clone(sig)
    elif  s < 0: return[sig[...,-1*s:], torch.zeros_like(sig)[...,s:]], dim=-1)
    else       : return[torch.zeros_like(sig)[...,:s], sig[...,:samples-s]], dim=-1)

def shift_signal(t:torch.Tensor, shift, roll):
    #refactor 2nd half of this statement to just take and roll the final axis
    if roll: t[:] = torch.from_numpy(np.roll(t.numpy(), shift, axis=-1))
    else   : t[:] = _shift(t[:], shift)
    return t
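As an aside, `torch.roll` could replace the `np.roll` round-trip in the `roll` branch above, avoiding the tensor-to-numpy copies; a quick sketch (not benchmarked here):

```python
import torch

def shift_last(t: torch.Tensor, s: int) -> torch.Tensor:
    # roll along the final axis; works for an item (ch, n) or a batch (bs, ch, n)
    return torch.roll(t, s, dims=-1)

item  = torch.arange(6.).view(1, 6)
batch = item.repeat(4, 1, 1)       # same data as a batch of 4
```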

Here’s the original code that works only on a signal:

def _shift(sig, s):
    channels, samples = sig.shape[-2:]
    if   s == 0: return torch.clone(sig)
    elif  s < 0: return[sig[...,-1*s:], torch.zeros_like(sig)[...,s:]], dim=-1)
    else       : return[torch.zeros_like(sig)[...,:s], sig[...,:samples-s]], dim=-1)

def ShiftSignal(max_pct=0.2, max_time=None, roll=False):
    def _inner(ai: AudioItem)->AudioItem:
        s = int(random.uniform(-1, 1)*max_pct*ai.nsamples if max_time is None else random.uniform(-1, 1)*max_time*  # `` reconstructed from a truncated line
        sig = torch.from_numpy(np.roll(ai.sig.numpy(), s, axis=1)) if roll else _shift(ai.sig, s)
        return AudioItem((sig,, ai.path))  # `` reconstructed from a truncated line
    return _inner

Issue 2: Should transforms mutate in place, or return altered copies?

Am I correct in noticing that the RandTransforms inside 09_vision_augment are not cloning the image data but are mutating it in place? Is this just because in the Pipeline data is constantly being created and destroyed each batch? Any reason we shouldn’t follow this pattern for audio? @jeremy @baz

Example v2 vision transform for reference:


Yes, there is no need to retain a copy of the original batch; on the contrary, the goal is to save as much GPU memory as possible.
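A small demonstration of the difference (illustrative shapes): the out-of-place product allocates a second batch-sized tensor, while the in-place version reuses the existing storage.

```python
import torch

x = torch.randn(8, 1, 128, 128)
mask = torch.rand_like(x) > 0.1

out_of_place = x * mask      # allocates a new batch-sized tensor
ptr = x.data_ptr()
x.mul_(mask)                 # mutates x; no new allocation
```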


I don’t know if it’s correctly implemented - you’ll need to create tests to convince yourself of that.

RandTransform looks fine. Not sure why you’re passing as_item=True. Do you need that?

I don’t know exactly what you’re asking about TypeDispatch. What do you want to do? What have you tried? What happened when you tried? Please provide code for anything that didn’t work the way you hoped, as appropriate.

To make GPU transforms work on items or batches you need to use broadcasting carefully. There’s nothing fastai specific about that. Using ellipses, e.g. t[...,c,x,y], can help. Otherwise, just create separate versions for each with different names.

You should use a profiler to see where the time is being spent.
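For PyTorch code specifically, `torch.utils.benchmark` gives more reliable numbers than raw `%timeit`, since it handles warmup and (on GPU) synchronization; a minimal example of timing a masking expression:

```python
import torch
import torch.utils.benchmark as benchmark

t = torch.randn(32, 1, 128, 128)
timer = benchmark.Timer(
    stmt="t * (torch.rand_like(t) > 0.5)",   # the expression under test
    globals={"t": t, "torch": torch},
)
measurement = timer.timeit(100)              # returns mean seconds per run
```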

Thanks for the feedback, and I’m sorry, I should have made this post clearer. I haven’t had many eyes on the code, especially from people very comfortable with the v2 codebase and GPU transforms, so I was looking for a quick look over the code to see if anything jumps out as bad practice or as something that could be done better another way. I didn’t mean to ask ‘will this code work?’ (we do have tests in place), but rather ‘before we copy this pattern out to all the transforms we have, does everything look more or less okay?’. I’ll try to be more explicit in the future about what would be helpful, and I’ll do my best to make sure all the needed info is there without overwhelming the post.

Sorry, on rereading my question about TypeDispatch was extremely unclear. I was going to ask if combining separate signal and spectrogram transforms into one, when they do the same type of operation, was a good idea, but I’m sure it is and it’s a key reason you implemented it in Python in the first place, so I’ve got my answer :slight_smile:

Yes I think you could say def encodes(self, item:(AudioItem,AudioSpectrogram)) and then cover both with basically the same code. I just try refactorings and see whether what comes out is (to me) clearer, or less clear, than what I started with. And only keep it if I think it’s an improvement.

Nothing jumps out at me as obviously problematic with your code. But if there’s design decisions that are going to end up appearing in lots of different transforms, you might want to think about ways to factor them out anyway, so you only have to change things in one place later if needed.

Personally, my approach to development is very iterative. Once I’ve built 3 transforms, for instance, I’ll go back and look for ways to share code and simplify them as a group. And I’ll keep doing that as I add more. I’m not very good at looking at a single piece of code and knowing whether it’ll end up a good pattern in the long run.


I’m working on a new audio project (bird calls) and hoping to use v2 audio for it. However, is v2 audio still progressing? This thread appears to have gone quiet. Or would it be better to go PyTorch-native at this point?

Also, I followed the setup directions in Rbracco’s GitHub repo but see nothing about audio in the resulting folders… are there updated config directions?
(from Fastai v2 audio)

Thanks for any updates.


Hello! I’d love to get involved in this. Please let me know how/where I can help!

@MadeUpMasters can you add me to the Telegram chat for Audio ML and working on the library?

So happy to see this happening!


Hey Less, sorry, we are in a bit of a transition state, and that plus the holidays has made things appear quiet, but we are very much working on it.

V2 Updates are posted here

After fastai_dev became fastai2 (and a host of other repos), our fork moved to a new repo, fastai2_audio.

If you can tell me a bit more about your project (will it be for production? research? just you?), I’d be happy to give you an honest opinion about the best option for you. Each one has its drawbacks:

  • PyTorch only: you’ll have to do all your own preprocessing, spectrogram generation/cropping, and transforms.
  • fastai audio v2: it will be changing a lot over the next 3 months, so I wouldn’t use it for anything you need to count on right now.
  • fastai audio v1: it has nice features and documentation and is stable, but some things don’t work well (inference and model export) and we aren’t doing much to support them at the moment.

Hope this rundown helps, let me know if there’s any way I can help you get started.



+1 for the Telegram chat. My username on Telegram is @madhavajay. Would it be best to start with audio v1 to explore audio classification and get a feel for the data quality and problem approach, or does v2 provide superior results and easier tooling?
