Deep Learning with Audio Thread

Has anybody here worked with learned filterbanks? Instead of using a mel spectrogram with bins fixed by the mel scale, you start from the mel scale and let the model learn which ranges of frequencies make the best inputs to your normal CNN-based model, then extract those for training. The mel scale seems cool, but there’s no way it’s optimal for all applications (or even for the ones it was specifically designed for).

I’m reading Learning Filterbanks From Raw Speech For Phone Recognition and might try to implement it in a week or so once the other demonstration notebooks are out.

I would note this shouldn’t hurt performance too much (apart from the fact there doesn’t seem to be much attention to optimisation in librosa). If you use input_tensor.numpy() and torch.from_numpy() (or torch.as_tensor()), the returned ndarray/tensor shares the data of the input ndarray/tensor, so it should be a cheap op. Obviously in moving to v2 and GPU transforms these would have to be properly ordered and some things would need to be done in PyTorch. There are also some limits on operations on such shared tensors, such as no resizing, which I think is enforced by marking them as non-contiguous. A call to contiguous() is a no-op on an already contiguous tensor, so it could be added to all transforms that might receive such tensors from here (avoiding it where possible, as some ops will work fine without it).
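
For example, the sharing (and contiguous() being free on an already contiguous tensor) can be checked with something like:

import numpy as np
import torch

# torch.from_numpy() shares the underlying buffer with the ndarray, so no copy is made.
arr = np.zeros((128, 400), dtype=np.float32)
t = torch.from_numpy(arr)
arr[0, 0] = 1.0
assert t[0, 0] == 1.0        # same memory, so the change shows up in the tensor too

# contiguous() on an already contiguous tensor just returns the same tensor.
assert t.contiguous() is t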

I’ve mainly tested on UrbanSound, where clips are all <4s, so I’ve just used zero padding so far, with a parameter for the padding mode so it can be changed; I haven’t experimented with this much. It does seem like reflection (or circular) padding could be worth trying for various tasks.
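
For reference, the different modes with plain torch.nn.functional.pad look roughly like this (a sketch on raw waveforms, not the library’s actual transform):

import torch.nn.functional as F

# Sketch: pad a batch of waveforms (batch, channels, time) up to a target length.
# 'constant' is zero padding; 'reflect' and 'circular' are the alternatives mentioned
# above. Note 'reflect' needs the pad amount to be smaller than the existing length
# (and 'circular' no larger), so very short clips may need a fallback.
def pad_to_length(wav, target_len, mode='constant'):
    missing = target_len - wav.shape[-1]
    if missing <= 0:
        return wav[..., :target_len]
    return F.pad(wav, (0, missing), mode=mode)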

No, I haven’t tried that or seen it used. It seems odd to me. I wouldn’t have thought appending metadata to the actual data was sensible (outside of certain specific cases perhaps), but that could easily be a failure in my intuition. It’s not clear to me how a convolutional network would really make use of metadata just appended to the signal, and in some cases it seems like it would just encourage overfitting. But again, that could easily be my intuition failing.

I thought about trying something along those lines when I saw how mel was implemented (just a dot product of the FFT) and was happy to see my vague inkling backed up by use in the literature. So it’s definitely on my list, but I haven’t yet tried it.
That article looks like a rather complicated version of the idea, implementing it in terms of 1D convolutions on time-domain data. You also might have performance issues. If I’m not misreading it, it looks like they’re doing a 1D convolution with a kernel size of 400. PyTorch’s convolutions don’t perform that well with larger kernels, though 400 might be OK (I was mainly looking at the stupidly long multi-thousand-sample kernels used in polyphase resampling). I’ve actually got some code (still at the testing stage) for efficiently doing time-domain convolutions using FFTs, which should perform pretty well with that sort of kernel size, so it might be useful. I’ve only ever used that for non-learnable stuff so haven’t tried a backward pass, but I presume the underlying FFTs should be decently optimised for backward passes.
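
The basic trick is just pointwise multiplication in the frequency domain, roughly (a sketch using the newer torch.fft API, with no chunking/overlap-add):

import torch

# FFT-based 1D convolution: zero-pad to the full linear-convolution length so the
# circular convolution matches a linear one. Note nn.Conv1d actually computes
# cross-correlation, so flip the kernel if you want to match it exactly.
def fft_conv1d(signal, kernel):
    # signal: (batch, time), kernel: (kernel_size,)
    n = signal.shape[-1] + kernel.shape[-1] - 1
    out = torch.fft.irfft(torch.fft.rfft(signal, n=n) * torch.fft.rfft(kernel, n=n), n=n)
    return out  # 'full' output; slice for 'same'/'valid' as needed
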
The Filterbank learning for deep neural network based polyphonic sound event detection paper looks like it’s suggesting learning parameters for the simple dot-product-based method, which should be easier to implement and more performant.

One issue, aside from implementation, will be the effect on achievable batch sizes. It’s going to pretty dramatically increase the input size (as now the input is n_fft not n_mels, so 20x larger in your default config). But I’m not sure how much that will matter overall. The larger inputs and the large new first layer (or I think two in the implementation you linked) may not matter much compared to all the subsequent layers.
I did have a vague inkling about perhaps doing the mel learning on the CPU, perhaps also using gradient accumulation to further reduce the impact. But I don’t understand the backward pass calculation well enough to know how feasible this would be, either in general or in terms of implementing it in fastai’s training loop.
Certainly an interesting thing to look at.
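
For reference, gradient accumulation itself is simple in a plain PyTorch loop (fitting it into fastai’s callback system is the more awkward part); a rough sketch:

def train_with_accumulation(model, loader, loss_func, optimizer, accum=4):
    # Step the optimiser only every `accum` mini-batches, so the effective batch size
    # is accum * batch_size at the memory cost of a single batch.
    optimizer.zero_grad()
    for i, (xb, yb) in enumerate(loader):
        loss = loss_func(model(xb), yb) / accum  # scale so the summed gradients average out
        loss.backward()                          # grads accumulate in .grad across iterations
        if (i + 1) % accum == 0:
            optimizer.step()
            optimizer.zero_grad()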

Yeah, as soon as you get variation in file length you need to move either to chopping into smaller chunks and zero-padding the remainder, or to taking random sections of the whole thing at train time; the latter is the best option I’ve found so far (setting duration in our config does this). I’ll try repeat padding on Monday and report back on how it affects results.

Well, for some datasets we are stripping away a really important piece of info (the length of the clip). For example, in the recent Freesound 2019 challenge there was a decently strong correlation between the length of a clip and its class. It’s hard to have a 30s clip be a ‘sigh’. So if you use our method and grab a random 4s section of the spectro at train time, you are throwing away the clip length for all clips over 4s.

I think somehow telling the model “this comes from a 29s clip” would be really useful info to have, I just don’t have strong ML foundations and don’t know a functional way to “append” this info. My naive approach would just be to add one or more extra columns on the end of the spectrogram where the color info is a factor of clip length, i.e. on a 0-255 scale with data from 0-30s, maybe a 1s clip is an 8, a 2s clip is 16, a 10s clip is 80, and a 30s clip is 240. Then the model is able to notice that ‘sighs’ don’t ever have a light bar on the end, they have a dark bar, and use that info. But it’s possible my not-super-deep understanding of convolutions/pooling is causing me to misunderstand how this would work. Let me know how this sounds to you and I’ll test it either way. Also would love to hear from other experienced audio people on this.
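
To make that concrete, something like this is what I have in mind (purely illustrative, not tested):

import torch

# Illustrative only: append a couple of columns whose value encodes clip length,
# using the rough scaling above (1s -> 8, 10s -> 80, 30s -> 240 on a 0-255 scale).
def append_length_columns(spec, clip_len_s, n_cols=2):
    # spec: (freq_bins, time_steps)
    val = 8.0 * clip_len_s
    cols = torch.full((spec.shape[0], n_cols), val)
    return torch.cat([spec, cols], dim=1)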

Edit: Couldn’t you bucket the clip lengths to avoid overfitting? E.g. if you have 5000 images, append the clip length rounded to the nearest second; for 0-30s that’s roughly 160 per bucket on average. No idea if this would work, or what a reasonable ratio between dataset size, bucket count, and number of classes would be.

Sorry, again got ahead of myself by posting before reading the paper, and you’re totally right. It caught my eye because of phone recognition, the thing I’m most interested in learning.

About the performance, points are all well taken, batch size would be a major problem, especially as batch sizes already tend to be low for the image resolutions and sizes that get the best results. For phone recognition this would be fine as we are mostly looking at 25ms windows. For other models, I imagine not using it as a fully connected network, but learning the filters as one step, and then writing an equivalent function to MelSpectrogram that extracts the frequencies best suited to analyzing the data you have. Maybe a fully connected network is actually coming up with unique filterbanks for each example and just having one learned one wouldn’t be optimal, but I imagine it has to be better than using standard mel filterbanks.

I’ve tentatively confirmed this surprising result. I need to do more explicit testing to really be sure, but it’s had a strong positive effect for me on TensorFlow Speech. I haven’t tried 128x256 -> 256x256, but 128x128 -> 256x256 with bilinear interpolation (not using a pretrained model) gave a strong performance increase.

In relation to the library, setting size in .transforms() doesn’t work for us since we aren’t an ImageList, and it seems like fastai has many dependent functions, so it has been hard for us to copy it in and extend the functionality (@baz has been working on this). In the branch where I tested it I just wrote a transform that uses torch.nn.functional.interpolate to upsample to 256x256; it’s a one-liner and gives good results.
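
Something along these lines (the exact transform in the branch may differ slightly):

import torch.nn.functional as F

# Bilinear upsample of a (channels, freq, time) spectrogram to 256x256.
# interpolate expects a batch dimension, hence the unsqueeze/squeeze.
def resize_spectro(spec, size=(256, 256)):
    return F.interpolate(spec.unsqueeze(0), size=size, mode='bilinear',
                         align_corners=False).squeeze(0)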

Really nice find by @mnpinto, it goes a bit against my intuition but it had a really strong positive effect.

Nice to know you also got a good improvement with the upsampling! By the way, I have released my solution to the Freesound competition on GitHub: https://github.com/mnpinto/audiotagging2019

That seems fairly specific to the sort of hyper-optimisation you do for Kaggle, and to that exact task. It’s probably not going to help even for environmental sound classification when multiple classes can be present. I’d also have thought that even in such a context you might be better off integrating it in a different way, e.g. feeding that sort of metadata to a random forest or KNN along with the NN model’s predictions.
In terms of a model being able to use it, the information would seem to have to be passed fairly unchanged through all the convolutional layers, requiring kernels that pass a few inputs through largely unchanged, which may not be much use for anything else, thus reducing the model computation available for other learning. I would at least have thought you’d be better off passing it to the linear layers of the model rather than into the convolutional section (as sketched at the end of this post).
I’d still think you may then see a focus on not very robust methods of detection that don’t help overall learning. Such a simple, non-generalisable link for certain classes would eliminate much of the error from those classes, reducing the model’s ability to learn more generally from them. But of course that’s all guesswork, and neural nets are often hard to understand in such terms.
I think you’d at least want to investigate whether a model trained with such information could adapt well to less artificial settings, to see that it had learnt more general things rather than just tricks that aren’t broadly applicable: say, training on a subset of classes, then seeing how it dealt with the addition of new classes. Not that a failure here would mean it had no applicability, but it may be so specific that it isn’t really suited as a general method, and may be more likely to lead beginners astray than help.
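
To sketch the ‘pass it to the linear layers’ option (purely illustrative, nothing to do with the library’s actual code):

import torch
import torch.nn as nn

# Concatenate a clip-length scalar with the pooled conv features so only the
# linear head has to deal with it. Names here are made up for illustration.
class ConvPlusMetadata(nn.Module):
    def __init__(self, conv_body, n_features, n_classes):
        super().__init__()
        self.body = conv_body                 # any convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(n_features + 1, n_classes)

    def forward(self, spec, clip_len):
        # spec: (batch, channels, freq, time); clip_len: (batch, 1), e.g. length in seconds
        feats = self.pool(self.body(spec)).flatten(1)
        return self.head(torch.cat([feats, clip_len], dim=1))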

How exactly would you learn it as a separate step? Were you thinking in terms of using NN learning or using some non-NN method?
Not sure if you’ve looked at the implementation of MelSpectrogram, but it just generates a tensor of shape (n_mels, n_fft) to calculate each mel band as a weighted sum of the FFT bands, which is applied by matrix multiplication. So I think it’s just a Linear layer with n_fft input features and n_mels output features.
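
So a learnable version could be as simple as a Linear layer initialised from the standard mel filters, something like this (a sketch using librosa’s filters, not necessarily what the library actually uses):

import torch
import torch.nn as nn
import librosa

# Sketch: a mel filterbank as a learnable Linear layer, initialised from librosa's
# mel filters. Apply it to a power spectrogram laid out as (..., time, n_fft // 2 + 1).
def learnable_mel(sr=22050, n_fft=1024, n_mels=64):
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft // 2 + 1)
    layer = nn.Linear(fb.shape[1], n_mels, bias=False)
    with torch.no_grad():
        layer.weight.copy_(torch.from_numpy(fb))  # start from mel, then let it train
    return layer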

Thanks for taking the time to reply. Everything you said about trying to include it in a conv layer makes perfect sense; it was just a gap in my understanding about the best way to include the info. In the future I’ll play around with passing it in via a linear layer, or a random forest/KNN, and see how it goes and whether it generalizes. I think it should generalize fairly well if you apply it correctly (in cases where there is a real-life correlation between clip length and class). You’re right that it definitely shouldn’t work for multi-label. Thanks again for helping me think through it.

I’ve been using Jeremy’s notebook which builds a classifier from scratch, but I’ll check out your repo as well.

Which notebook is it? Is it audio-specific or just one of the notebooks from part 2? If it is audio, do you mind linking me? Cheers.

Here’s the notebook

It uses lots of functions from previous notebooks, so it won’t work on Google Colab.

Added some new features today that are under review.

  • Repeat padding option for spectrograms
  • Can pass a size argument to get_spectro_transforms() that will resize spectros using bilinear interpolation. (Same basic functionality as fastai vision, but making it completely compatible with fastai.vision would force us to break a lot, so we are going to wait for the API v2 release, when we will need to break things anyway)
  • Can now see the size of your cache by calling .cache_size() on your AudioConfig
  • Can safely clear the cache (with progress bar) by calling .clear_cache() on your AudioConfig

Tomorrow we hope to have inference cleaned up and working (sorry to those who have had issues with it), and I aim to release a notebook showing great results on ESC-50 with very little code.

Hey, I tried copying in audio_cnn_learner and adapt_model from your code to see if I could get it working. I used the config to determine the number of channels and then passed that to adapt_model, but I’m getting the error AttributeError: 'Conv2d' object has no attribute 'padding_mode'. Here’s the full stack trace. Any ideas?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-b0d5368ed7ab> in <module>
----> 1 learn = audio_cnn_learner(db)
      2 learn.lr_find()
      3 learn.recorder.plot()

~/rob/fastai_audio/audio/learner.py in audio_cnn_learner(data, base_arch, metrics, cut, pretrained, lin_ftrs, ps, custom_head, split_on, bn_final, init, concat_pool, padding_mode, **kwargs)
     48                         concat_pool=concat_pool, **kwargs)
     49     channels = _calc_channels(data.config)
---> 50     adapt_model(learn.model, channels, pretrained=pretrained, init=init, padding_mode=padding_mode)
     51     learn.unfreeze() # Model shouldn't be frozen, unlike vision
     52     return learn

~/rob/fastai_audio/audio/learner.py in adapt_model(model, n_channels, name, pretrained, init, padding_mode)
     38         update = partial(setattr, model, name)
     39     else: raise TypeError(f"Could not locate first convolution layer. If it is a named layer then pass it's name, otherwise use adapt_conv.")
---> 40     update(adapt_conv(conv1, n_channels, pretrained=pretrained, init=init, padding_mode=padding_mode))
     41 
     42 def audio_cnn_learner(data:AudioDataBunch, base_arch:Callable=models.resnet18, metrics=accuracy, cut:Union[int,Callable]=None, pretrained:bool=False, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5, custom_head:Optional[nn.Module]=None,

~/rob/fastai_audio/audio/learner.py in adapt_conv(conv, n_channels, pretrained, init, padding_mode)
     13     args = {n: getattr(conv, n) for n in ['kernel_size','stride','padding','dilation','groups']}
     14     bias = conv.bias is not None
---> 15     pm = ifnone(padding_mode, conv.padding_mode)
     16     new_conv = Conv2d(n_channels, conv.out_channels, bias=bias, padding_mode=pm, **args)
     17     if pretrained:

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __getattr__(self, name)
    533                 return modules[name]
    534         raise AttributeError("'{}' object has no attribute '{}'".format(
--> 535             type(self).__name__, name))
    536 
    537     def __setattr__(self, name, value):

AttributeError: 'Conv2d' object has no attribute 'padding_mode'

And here is the code I tried (I removed the .expand call so anything that doesn’t use delta stacking should be 1 channel)

def audio_cnn_learner(data:AudioDataBunch, base_arch:Callable=models.resnet18, metrics=accuracy, cut:Union[int,Callable]=None, pretrained:bool=False, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5, custom_head:Optional[nn.Module]=None,
                      split_on:Optional[SplitFuncOrIdxList]=None, bn_final:bool=False, init=nn.init.kaiming_normal_,
                      concat_pool:bool=True, padding_mode:str='zeros', **kwargs:Any)->Learner:
    '''Create a learner to apply a CNN model to audio spectrograms.'''
    learn = cnn_learner(data, base_arch, cut=cut, pretrained=pretrained, lin_ftrs=lin_ftrs, ps=ps,
                        custom_head=custom_head, split_on=split_on, bn_final=bn_final, init=init,
                        concat_pool=concat_pool, **kwargs)
    channels = _calc_channels(data.config)
    adapt_model(learn.model, channels, pretrained=pretrained, init=init, padding_mode=padding_mode)
    learn.unfreeze() # Model shouldn't be frozen, unlike vision
    return learn

def _calc_channels(cfg):
    channels = 3 if cfg.delta else 1
    return channels

Ah, OK, padding_mode was added in PyTorch 1.1, so I guess you haven’t updated in a bit.
You could replace the pm line (and the Conv2d call below it) in adapt_conv with:

if 'padding_mode' in Conv2d.__constants__:  # padding_mode was added in PyTorch 1.1
    args['padding_mode'] = ifnone(padding_mode, conv.padding_mode)
new_conv = Conv2d(n_channels, conv.out_channels, bias=bias, **args)

and then it should work on 1.x (I’ve still only tested on 1.1, so the 1.0 case is just based on reading the code). It looks like it will break on pre-1.0, as __constants__ was only added in 1.0, but that’s fine, there’s no intention to support 0.4.

It should actually be fine with expanded clips. It does no special handling for them, but treating them as 3-channel should work.

Thanks man, you’re a huge help. Would you like to PR it if we do use it?

You can just copy it over, or if you’d prefer a PR then let me know. Note that if you copy over the tests, they use pytest-mock for patching (it’s on pip and the default conda channel).

@TomB Thank you very much for this code, it’s super useful. We’d really like you to do the PR yourself because we’d love to have your help on the codebase.

It requires a fair bit of adjustment in other parts of our code to make sure that 1 channel displays right and all the dimensions work for the different settings. Definitely don’t worry about any of that, but if you want to PR just the changes to the learner and set channels = 3 in audio_cnn_learner, that would be awesome, and then I’ll go in and revert things to 1 channel later.

Is this the notebook for the promised upcoming fast.ai audio lecture(s)?

I think so.

How exciting! Thank you so much to all who contributed to it, it looks great!