Fastai v2 audio

Thank you for looking over it for us, I really appreciate it.

This is coming. We’ve mostly messed around with voice in v2, but in the v1 library we did a lot of urban-sounds work, including getting a new SOTA on the ESC-50 dataset, something that @KevinB and @hiromi have successfully replicated in v2. v2 will have presets for the various audio problems (voice recognition, ASR, scene classification, music, etc.) so that people with no audio experience can get great defaults.

This is quite cool (and advanced!), thanks for sharing. I wasn’t aware of audio perceptual loss, just the image feature loss from part 1 of the course. I read the papers, and these are implemented quite differently from feature loss in fastai. The approach is to train a model to approximate PESQ (since PESQ isn’t differentiable and models are), then use that approximation as the critic for a second model that is trying to generate or denoise audio.
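
To make that concrete, here’s a rough sketch of how I read the idea (my paraphrase, not code from either paper; PESQNet and perceptual_loss are made-up names):

import torch
import torch.nn as nn

class PESQNet(nn.Module):
    "Small net that learns to regress a PESQ-like score from (degraded, reference) waveform pairs."
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(2, 16, 9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
    def forward(self, degraded, reference):
        return self.body(torch.stack([degraded, reference], dim=1))

# Step 1: train PESQNet on (degraded, reference, true PESQ score) triples so it approximates PESQ.
# Step 2: freeze it and use its prediction as the loss for the generator/denoiser.
critic = PESQNet()
for p in critic.parameters(): p.requires_grad_(False)

def perceptual_loss(denoised, reference):
    return -critic(denoised, reference).mean()   # maximizing predicted PESQ = minimizing this loss

loss = perceptual_loss(torch.randn(4, 16000), torch.randn(4, 16000))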

I don’t see any reason we couldn’t implement this (train the critic and make it available as a loss function), but since we haven’t ventured into any generative audio stuff yet, it will probably be some time before we add it.

Thanks again for the feedback and feel free to reach out with more as your experiments progress. Cheers.

2 Likes

Hi!

Seems I found a small issue:

@patch_to(cls=AudioItem)
def from_raw_data(cls, data, sr):
    data = tensor(data)
    if data.dim() == 1:
        data = data.view(1,-1)
    return cls((data, sr, Path('raw')))

AudioItem.from_raw_data = classmethod(AudioItem.from_raw_data)  # re-wrap so `cls` is passed as the first argument

freq = 440

sr = 16000
item = AudioItem.from_raw_data(data = librosa.tone(frequency=freq, sr=sr, length=2000), sr = sr)
a2s = AudioToSpec.from_cfg(AudioConfig.Voice())

a2s(item).show()

Gives:

It looks like the results are shown upside-down, or at least the Hz axis on the left is not consistent with the picture.
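
As a quick sanity check (plain librosa/matplotlib, bypassing the library’s show code), plotting a mel spectrogram of the same tone with origin='lower' puts the 440 Hz line near the bottom, which is the orientation I’d expect:

import librosa
import matplotlib.pyplot as plt

sr = 16000
sig = librosa.tone(frequency=440, sr=sr, length=2*sr)
# Row 0 of the spectrogram is the lowest frequency bin, so origin='lower'
# draws low frequencies at the bottom of the image.
sg = librosa.power_to_db(librosa.feature.melspectrogram(y=sig, sr=sr))
plt.imshow(sg, origin='lower', aspect='auto')
plt.xlabel('time frames'); plt.ylabel('mel bins')
plt.show()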

With kind regards,
Vadim.

1 Like

Could I please ask if the work on v2 is still taking place? It doesn’t seem to be happening in the fastai_audio repo. BTW, the tutorials in the repo are top notch; they were very helpful to me and are really well written! Thank you for putting them together.

2 Likes

Work on v2 is still taking place, the latest update was posted here: Fastai v2 audio

Really excited to hear you found the tutorials useful! I plan to go back at some point and add a lot more, as I’ve learned more about signal processing and have experimented with a lot of new things like training on raw audio, audio embeddings, etc. What I have so far barely scratches the surface.

3 Likes

Thank you very much for the update! :slight_smile:

Would love to read this.

My plan is this:

  • continue to learn about fastai v2
  • study the materials you share above
  • use fastai audio in the freesound starter pack (I first want to train on spectrograms using the setup I already have, but the next step IMO would be fastai audio; I am using fastai v2)

I am not sure how things will go, but maybe there will be a chance for me to do something useful for the project. It could be as little as test-driving the functionality you create, or maybe helping in some way with fastai v2 integration. Anyhow, just a thought at this point in time; I look forward to getting up to speed with the v2 port :slight_smile:

1 Like

Awesome, a great place for audio resources (and asking questions) is the Deep Learning with Audio Thread. Please let me know if you find anything helpful that isn’t listed there and I can get it added.

1 Like

An update on v2 audio: it is mostly functional; we just need some pieces of the high-level API put on top, and to fix a few places where we feel we could be doing things better/faster/more v2-like but aren’t sure exactly how to get there. I’m going to make a series of posts about each issue, so if you have ideas/questions or want to participate in the discussion, even if you know nothing about audio, please jump in. Any feedback on making the posts more useful/readable is also appreciated.

Issue 1: Getting transforms to use TypeDispatch, RandTransform, and be GPU-friendly

We have a lot of transforms for AudioSignals and AudioSpectrograms that were originally written as simple independent functions. Many transforms for signals are the same as those for spectrograms, just with an extra dimension. For instance, we may want to cut out a section of the data: for AudioSignals this is “cutout”, for AudioSpectrograms it’s “time masking”, but they’re the same idea, just applied to 2D (channels x samples) and 3D (channels x height x width) tensors respectively. Given the functionality of v2, we would like to refactor many of these to use RandTransform and TypeDispatch, and to be easily transferable to the GPU.
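
As a toy illustration of why these collapse into one function (made-up names, not the library’s actual transforms), the same code covers both cases because it only touches the final axis:

import random
import torch

def mask_last_axis(t, max_width):
    "Zero out a random span along the final axis; works for 2D signals and 3D spectrograms alike."
    width = random.randint(1, max_width)
    start = random.randint(0, t.shape[-1] - width)
    t[..., start:start+width] = 0
    return t

sig = torch.randn(1, 16000)       # AudioSignal-like (channels x samples) -> "cutout"
sg  = torch.randn(1, 128, 256)    # AudioSpectrogram-like (channels x height x width) -> "time masking"
mask_last_axis(sig, 2000); mask_last_axis(sg, 30)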

I am having trouble figuring out exactly how to design a transform that works on signals and spectrograms, individual items and batches, implements RandTransform, and is fast. If I get one, I should be able to copy the pattern to the others. I tried looking in 09_data_augment but it wasn’t that clear to me how to make it work for my code.

Here is my best attempt so far. It’s for shifting the data horizontally (roll adds wraparound). It works properly, but it is much slower than my original transforms: 557µs for a single item and 2.35ms for a batch of 32 signals (CPU), compared to 54µs for a single item previously.

Questions @jeremy @baz:

  • Is it more or less correctly implemented, and how can I improve it with respect to…
    • TypeDispatch - Making it work for both signal/spectrogram simultaneously?
    • GPU - Making it work on both batches and individual items?
    • RandTransform - Am I using it properly here?
  • Any ideas on how to make it faster? Code of the original transform included at bottom.

Code:

class SignalShifter(RandTransform):
    def __init__(self, p=0.5, max_pct=0.2, max_time=None, direction=0, roll=False):
        if direction not in [-1, 0, 1]: raise ValueError("Direction must be -1(left) 0(bidirectional) or 1(right)")
        store_attr(self, "max_pct,max_time,direction,roll")
        super().__init__(p=p, as_item=True)

    def before_call(self, b, split_idx):
        super().before_call(b, split_idx)
        self.shift_factor = random.uniform(-1, 1)
        if self.direction != 0: self.shift_factor = self.direction*abs(self.shift_factor)
        
    def encodes(self, ai:AudioItem):        
        if self.max_time is None: s = self.shift_factor*self.max_pct*ai.nsamples
        else:                     s = self.shift_factor*self.max_time*ai.sr
        ai.sig[:] = shift_signal(ai.sig, int(s), self.roll)
        return ai
    
    def encodes(self, sg:AudioSpectrogram):
        if self.max_time is None: s = self.shift_factor*self.max_pct*sg.width
        else:                     s = self.shift_factor*self.max_time*sg.sr
        return shift_signal(sg, int(s), self.roll)

def _shift(sig, s):
    samples = sig.shape[-1]
    if   s == 0: return torch.clone(sig)
    elif  s < 0: return torch.cat([sig[...,-1*s:], torch.zeros_like(sig)[...,s:]], dim=-1)
    else       : return torch.cat([torch.zeros_like(sig)[...,:s], sig[...,:samples-s]], dim=-1)

def shift_signal(t:torch.Tensor, shift, roll):
    #refactor 2nd half of this statement to just take and roll the final axis
    if roll: t[:] = torch.from_numpy(np.roll(t.numpy(), shift, axis=-1))
    else   : t[:] = _shift(t[:], shift)
    return t

Here’s the original code that works only on a signal:

def _shift(sig, s):
    channels, samples = sig.shape[-2:]
    if   s == 0: return torch.clone(sig)
    elif  s < 0: return torch.cat([sig[...,-1*s:], torch.zeros_like(sig)[...,s:]], dim=-1)
    else       : return torch.cat([torch.zeros_like(sig)[...,:s], sig[...,:samples-s]], dim=-1)

#export
def ShiftSignal(max_pct=0.2, max_time=None, roll=False):
    def _inner(ai: AudioItem)->AudioItem:
        s = int(random.uniform(-1, 1)*max_pct*ai.nsamples if max_time is None else random.uniform(-1, 1)*max_time*ai.sr)
        sig = torch.from_numpy(np.roll(ai.sig.numpy(), s, axis=1)) if roll else _shift(ai.sig, s) 
        return AudioItem((sig, ai.sr, ai.path))
    return _inner

2 Likes

Issue 2: Should transforms mutate in place, or return altered copies?

Am I correct in noticing that the RandTransforms inside 09_vision_augment are not cloning the Image data but are mutating it in place? Is this just because, in the Pipeline, things are constantly being created and destroyed each batch? Is there any reason we shouldn’t follow this pattern for audio? @jeremy @baz

Example v2 vision transform for reference:
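
Roughly this shape (paraphrased from memory rather than the exact notebook cell; AddNoise is a made-up name, the point is just that encodes writes into x instead of cloning it first):

from fastai2.vision.all import *   # assuming the fastai2-era imports
import torch

class AddNoise(RandTransform):
    "Made-up example: perturbs the batch in place rather than returning a copy."
    def __init__(self, p=0.5, std=0.1):
        super().__init__(p=p)
        self.std = std
    def encodes(self, x:TensorImage):
        x.add_(torch.randn_like(x) * self.std)   # mutates x, no clone of the original batch
        return x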

1 Like

Yes, there is no need to retain a copy of the original batch. On the contrary, the goal is to save GPU memory as much as possible.

1 Like

I don’t know if it’s correctly implemented - you’ll need to create tests to convince yourself of that.

RandTransform looks fine. I’m not sure why you’re passing as_item=True. Do you need that?

I don’t know exactly what you’re asking about TypeDispatch. What do you want to do? What have you tried? What happened when you tried? Please provide code for anything that didn’t work the way you hoped, as appropriate.

To make GPU transforms work on items or batches you need to use broadcasting carefully. There’s nothing fastai specific about that. Using ellipses, e.g. t[...,c,x,y], can help. Otherwise, just create separate versions for each with different names.
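
For example (a toy sketch, plain PyTorch indexing, nothing fastai-specific):

import torch

def zero_first_frames(t, n):
    "Same code handles a (channels x time) item and a (batch x channels x time) batch."
    t[..., :n] = 0     # the ellipsis absorbs any leading batch/channel dims
    return t

item  = torch.randn(1, 16000)
batch = torch.randn(32, 1, 16000)
zero_first_frames(item, 100); zero_first_frames(batch, 100)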

You should use a profiler to see where the time is being spent.
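
For example, with the standard-library profiler (the lambda below is just a stand-in for whatever callable you’re timing):

import cProfile, pstats
import torch

x = torch.randn(32, 1, 16000)
shift = lambda t: torch.roll(t, 100, dims=-1)   # stand-in for the transform under test

pr = cProfile.Profile()
pr.enable()
for _ in range(500): shift(x)
pr.disable()
pstats.Stats(pr).sort_stats('cumtime').print_stats(10)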

Thanks for the feedback, and I’m sorry, I should have made this post clearer. I haven’t had many eyes on the code, especially from people extremely comfortable with the v2 codebase and GPU transforms, so I was looking more for a quick look over the code to see if anything jumps out as either a bad practice or something that could be done better another way. I didn’t mean to ask ‘will this code work?’ (we do have tests in place), but rather ‘before we copy this pattern out for all the transforms we have, does everything look more or less okay?’. I’ll try to be more explicit in the future about what would be helpful, and I’ll do my best to make sure all the needed info is there without overwhelming the post.

Sorry, on rereading, my question about TypeDispatch was extremely unclear. I was going to ask whether combining separate signal and spectrogram transforms into one, when they do the same type of operation, was a good idea, but I’m sure it is, and it’s a key reason you implemented type dispatch in Python in the first place, so I’ve got my answer :slight_smile:

Yes I think you could say def encodes(self, item:(AudioItem,AudioSpectrogram)) and then cover both with basically the same code. I just try refactorings and see whether what comes out is (to me) clearer, or less clear, than what I started with. And only keep it if I think it’s an improvement.

Nothing jumps out at me as obviously problematic with your code. But if there’s design decisions that are going to end up appearing in lots of different transforms, you might want to think about ways to factor them out anyway, so you only have to change things in one place later if needed.

Personally, my approach to development is very iterative. Once I’ve built 3 transforms, for instance, I’ll go back and look for ways to share code and simplify them as a group. And I’ll keep doing that as I add more. I’m not very good at looking at a single piece of code and knowing whether it’ll end up a good pattern in the long run.

2 Likes

I’m working on a new audio project (bird calls) and hoping to use v2 audio for it. However, is v2 audio still progressing? This thread appears to have gone quiet. Or would it be better to go PyTorch-native at this point?

Also, I followed the setup directions in Rbracco’s GitHub but see nothing about audio in the resulting folders… are there updated config directions?
(from Fastai v2 audio)

Thanks for any updates.
Less

2 Likes

Hello! I’d love to get involved in this. Please let me know how/where I can help!

@MadeUpMasters can you add me to the Telegram chat for Audio ML and working on the library?

So happy to see this happening!

1 Like

Hey Less, sorry, we are in a bit of a transition state, and that plus the holiday break has made it appear quiet, but we are very much working on it.

V2 Updates are posted here

After fastai_dev became fastai2 (and a host of other repos), our fork moved to a new repo, fastai2_audio.

If you can tell me a bit more about your project (will it be for production? research? just you?), I’d be happy to give you an honest opinion about the best option for you. Each one has its drawbacks:

  • PyTorch only: you’ll have to do all your own preprocessing, spectrogram generation/cropping, and transforms.
  • fastai audio v2: it will be changing a lot over the next 3 months, so I wouldn’t use it for anything you need to count on right now.
  • fastai audio v1: it has nice features and documentation and is stable, but some things don’t work well (inference and model export) and we aren’t doing a ton to support them at the moment.

Hope this rundown helps, let me know if there’s any way I can help you get started.

Best,
Rob

2 Likes

+1 for the Telegram chat. My username on Telegram is @madhavajay. Would it be best to start with audio v1 to explore audio classification and get a feel for the data quality and problem approach, or does v2 provide superior results and easier tooling?

1 Like

Hi Rob,
Awesome thanks a ton for the feedback and update. Glad to see that audio work is still underway!

Re: project - it’s a prototype right now, with plans to roll it out into commercial use next year if it goes well (for environmental monitoring, basically).

I did look at 1.0 and it looked good, but some of the stuff in the notebooks is broken (e.g. librosa has updated and changed), so I was assuming 2.0 would likely be the way to go.

Thanks for the link to the v2 audio repo - now I can see the v2 audio work there, so that’s a big help.

Their timeline is somewhat flexible as they are still gathering field data to build out the datasets we’ll need, so my preference, ideally at least, would be to work with 2.0 and help contribute to it as v2 and this project grow.

I’ll set up with the v2 that’s there tomorrow, though, and try to get up to speed on it as it is for now, so thanks again!

Best regards,
Less

2 Likes

Given those conditions I would recommend v1 for the time being.

Thanks, let us know how it goes. Things are quite messy at the moment (show_batch is broken, and there’s a bug where autocomplete on the AudioSpectrogram constructor doesn’t show all the available kwargs, which is annoying as there are a lot of them). These should be relatively easy fixes, but I won’t be back to working on this until Monday, Jan 6. We have a great group now and would love any feedback/contributions.

2 Likes

Hi Folks,

Not a code-level question, but one about the direction of the library.
Is the main goal of the library to classify discrete audio, e.g. single words or snippets (sound classification)? Or a more generalised ASR like Kaldi, where longer audio is processed?

Best regards,
JP

1 Like

Hey JP, we would like to do both, but discrete audio classification (acoustic scene recognition, voice recognition, command/word recognition) is the easier of the two and is already working. We aim to add support for full ASR using CTC loss but haven’t actually integrated it yet.
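
For reference, the loss itself already exists as nn.CTCLoss in plain PyTorch; here’s a minimal sketch of the call, with made-up shapes (not fastai_audio code):

import torch
import torch.nn as nn

# log_probs: (time, batch, classes) log-softmax output of an acoustic model,
# with class 0 reserved for the CTC blank token.
T, B, C, S = 50, 4, 28, 10
log_probs = torch.randn(T, B, C).log_softmax(-1)
targets = torch.randint(1, C, (B, S))                      # padded label indices
input_lengths  = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)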

As for Kaldi, torchaudio (the PyTorch audio library that we use in fastai audio) wraps Kaldi, so I think we can pretty easily access its functionality for things like audio alignment, but full grapheme/phoneme-level ASR is a bigger leap. Hope that answers your question.

3 Likes