Deep Learning with Audio Thread

Yes. Audio will be the first one. Not sure when though. It’ll be added after the MOOC is released.


Sorry, hopefully it was clear I wasn’t in any way attacking. It took me a fair bit of digging and adapting to get code that seems to work (with the basic ideas all taken from forum posts). I totally understand going with the easier option, which should by and large be fine, and focusing on other bits. I also think the expand method is still a nice option to have, in case people are doing something weird where having expanded three-channel tensors helps.

Yeah, currently I’ve only really tested on 1-channel inputs, so I need to test other variations. I also need to look at automatically picking up the correct number of channels in your code. In my fork I track the number of channels and other metadata so it uses this, but I’m not entirely happy with the effort required for that. I’ll look at those and then submit a PR.

I did also see some evidence that other architectures were better than resnet (vgg16_bn did a little better). Though that was before mixup, which I just started playing with and which helps quite a lot. I was having terrible overfitting without mixup on UrbanSound8K. SpecAugment helped a bit, but I was still hitting training losses of <0.01 within 10 or so cycles (even dialing SpecAugment up to 4 time and 4 freq masks and bumping dropout up a bit). Haven’t tried densenets yet so will try them, thanks for the tip. I did note that resnets are pretty memory efficient compared to some others. I was able to use batch sizes of 128 on resnets but switching to vgg16_bn I could only do 32 with ~3x longer epochs (though I need to investigate and make sure it’s not related to doing all transforms on GPU, encouragingly though I’ve never had an out-of-memory after the first epoch).
And yeah, so many different things to try. The next task is to set up some experiment tracking solution so I’m hopefully not just randomly trying things. I did see Facebook has Ax, which uses Bayesian optimisation to tune hyper-parameters, so I might have a play with that at some point. I don’t really understand it (‘try things and somehow quantify what you do and don’t know from those trials’ is about the limit of my knowledge of Bayesian stuff), but it doesn’t look too hard from the tutorials. Obviously the devil is in the details though, in particular actually understanding when you’re doing something stupid and just getting out garbage.
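
For anyone wanting to reproduce the mixup experiments without the audio-specific pipelines discussed in this thread, a minimal fastai v1 sketch looks something like the below. It assumes the spectrograms have already been rendered to image files in one folder per class (the path is just a placeholder), so it’s only an illustration of turning mixup on, not how either audio library wires things up.

```python
from fastai.vision import ImageDataBunch, cnn_learner, models, accuracy, get_transforms

# Placeholder path: spectrogram images laid out as <path>/<label>/*.png
data = ImageDataBunch.from_folder('data/urbansound8k_specs', valid_pct=0.2,
                                  ds_tfms=get_transforms(do_flip=False),
                                  size=224, bs=128)

learn = cnn_learner(data, models.resnet18, metrics=accuracy)
learn = learn.mixup()          # enable fastai's MixUp callback
learn.fit_one_cycle(10, 1e-3)
```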

Yes. The ultimate goal is to be able to do what Shazam does.

I have seen CNN + RNN approaches to solve video problems… Can we carry the same analogy to audio?

About the audio chat, please do send in your info and I would love to be a part of that group.


Hi Harry

I’m facing the same issue; I followed your instructions and I still have the problem.

Try starting from scratch:
conda env remove -n fastai-audio


I’d suspect a mismatch between the pytorch version torchaudio is built against and the installed version (though it’s not quite clear how that’d happen given you’re building torchaudio from source). You don’t list any command for pytorch, so it’s not clear what version of that you have, but you might want to try updating it.

Or, while it doesn’t seem to be official yet and may get broken, there is now a torchaudio conda package. So you might try conda install -c pytorch -c conda-forge torchaudio (ideally in a new environment). I haven’t tested it extensively but seems to pass basic tests.
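
As a quick sanity check after installing, something like this should confirm the conda build imports cleanly alongside the installed pytorch and can load audio (the file path is just a placeholder for any wav you have handy):

```python
import torch
import torchaudio

print(torch.__version__)

# Load a wav file (placeholder path) and check the returned tensor and
# sample rate look sensible.
waveform, sample_rate = torchaudio.load('example.wav')
print(waveform.shape, sample_rate)
```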

Hello, how can I generate audio using fastai?

Something like a WaveNet implementation (link).

Hi Everyone,
Using the Audio Classification notebook, has anyone tried to make predictions using the model?
Since it’s done from scratch, there isn’t a learn.predict function to make predictions with.

Hey, what are you working on? Our version has working prediction for single items; full get_preds-style functionality is coming soon.

We are getting really good results (near state of the art) on both open datasets and past Kaggle competitions using simple methods, very little code, and no ensembling. I’m making demonstration notebooks, and @baz and I are working out the last bugs.

I’m super excited about it. The library makes it really easy to train and view what’s going on with your Dataset and model, and I have lots of ideas for how to add on but need user feedback. Check out the getting started and features notebooks for now. I’m going to add nbviewer links and share them here before the week is up.


No idea on this yet; what’s your level and experience like? Audio generation seems to be cutting edge and different from so much of the stuff we are doing. If you’re experienced, I’d say read some of the latest papers from Google. They’ve done interesting work beyond WaveNet on generating audio and even have voice transfer now. It’s very cool, but I’m unfortunately not knowledgeable enough to help.

Hey sorry for the delay in replying here, I actually missed this part when reading the response the other day, but no I didn’t interpret it that way in the slightest. You’ve been super helpful with suggestions and helping to push us in the right direction and I really appreciate it. It would be great to have you as a contributor as well. But if you don’t have the time, please keep pointing stuff out that we could be doing better, the critical stuff is the most helpful.

Auto-accepting both 1-channel and 3-channel input, where the 3 channels are actually unique info (i.e. delta/accelerate), is on our (short) to-do list, and we would love a PR on this. I’m going to take a crack at it myself soon though, because it’s really important not to slow down training.
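
For reference, the usual trick for feeding a non-RGB channel count into a pretrained torchvision model is to swap out the first conv and reuse its weights; the sketch below is just that generic idea, not how either library actually implements it:

```python
import torch
import torch.nn as nn
from torchvision import models

def adapt_first_conv(model: nn.Module, n_channels: int) -> nn.Module:
    """Replace a torchvision resnet's first conv so it accepts n_channels inputs,
    reusing the pretrained RGB weights (averaged for 1 channel, tiled otherwise)."""
    old = model.conv1
    new = nn.Conv2d(n_channels, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding, bias=old.bias is not None)
    with torch.no_grad():
        w = old.weight                                   # [out, 3, kh, kw]
        if n_channels == 1:
            new.weight.copy_(w.mean(dim=1, keepdim=True))
        else:
            reps = -(-n_channels // w.shape[1])          # ceil division
            new.weight.copy_(w.repeat(1, reps, 1, 1)[:, :n_channels])
    model.conv1 = new
    return model

model = adapt_first_conv(models.resnet18(pretrained=True), n_channels=1)
```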

Yeah, in my experience SpecAugment does very little for non-speech data, whereas mixup is great.

[quote] I did note that resnets are pretty memory efficient compared to some others. I was able to use batch sizes of 128 on resnets but switching to vgg16_bn I could only do 32 with ~3x longer epochs (though I need to investigate and make sure it’s not related to doing all transforms on GPU, encouragingly though I’ve never had an out-of-memory after the first epoch).
[/quote]

Yeah, densenets are the same, they are incredibly slow. Resnets do well enough on most things, but they just massively underfit with mixup. In my experience my top results for scene classification are densenet + mixup, and for speech they’re resnet18 + SpecAugment, or resnet18 + MFCC/delta/accel. The MFCC seems a bit vulnerable to noise though. I read a bit on tuning them but haven’t gotten around to it.

I feel like I haven’t made it easy to see our notebooks, so here are nbviewer links to them:

  1. Getting Started Notebook
  2. Features Notebook
  3. Intro to Audio Notebook

More notebooks are coming soon to demonstrate the features of the library on various kaggle competitions and also the ESC-50 dataset.

Also the following features have been added (a few are in a PR pending review)

  • Efficient stats method to plot audio lengths in your dataset and show sample rates.
  • Function to give you a list of files that are outliers in terms of clip length (you specify how many std devs defines an outlier). Thanks @ouimet51 for the suggestion!
  • Preprocessors for trimming silence (from start and end), splitting on silence (split the clip into multiple clips at points of silence), and removing silence (from the start, end, and points of silence in the middle)
  • Warnings that prevent you from confusing seconds and milliseconds (all our config settings are ms)
  • SpecAugment now uses channel mean instead of global mean (not sure how much this matters)
  • Made changes to Intro to Audio notebook. Thanks @TomB for the corrections.
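
For anyone curious what the clip-length outlier idea looks like in practice, here’s a standalone sketch of the concept rather than the library’s actual method; it assumes soundfile is available and the clips are wav files:

```python
import numpy as np
import soundfile as sf              # assumption: clips are readable by soundfile
from pathlib import Path

def length_outliers(folder, n_std=2.0):
    """List files whose duration is more than n_std standard deviations
    away from the mean clip length in the folder."""
    files = sorted(Path(folder).glob('*.wav'))
    durations = np.array([sf.info(str(f)).duration for f in files])
    mean, std = durations.mean(), durations.std()
    return [(f, d) for f, d in zip(files, durations) if abs(d - mean) > n_std * std]
```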

Last post, then I’ll stop spamming the thread. :grin: Does anyone with audio experience know whether datasets with wide variation in clip length, like Freesound (300ms-30s), benefit from repeat-padding the short clips instead of zero-padding? Currently the best way of handling these datasets is to set the duration in the AudioConfig, which, at train time, grabs duration milliseconds of the cached spectrogram, so that we can have equal-sized images without throwing away data from the longer clips.

There are two problems with this.

  1. The sgs of the short clips are then mostly zeros that have been padded on.
  2. The long clips are full of data, but we’ve thrown away the length of the original clip which, for scene analysis, is definitely correlated with the label, and is thus useful info.

The possible solutions for this are:

  1. Don’t pad with zeros for the short clips but repeat them until they fill the window (e.g. 300ms clip, 4s window: just copy the 300ms sg 13.33 times; there’s a rough sketch of this at the end of this post).
  2. Append the time info to the spectrogram.

My questions are:

  1. Does this work? (I’m going to test it myself, but would like to hear others’ experience too since I have limited time.)
  2. How do we do this in practice? My thought is to basically round to the nearest second, half-second, or some unit of time that preserves the information but doesn’t tell the model the exact length, as that could lead to overfitting, right? So say a clip falls in the 5s bucket, do we just make the last column of the sg all fives? Do we need more than one column? What’s a good way to do this?

Thanks for reading.
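
For concreteness, the repeat padding in option 1 could look something like this (a minimal sketch on an already-computed spectrogram tensor, not the library’s implementation):

```python
import torch

def repeat_pad(spec: torch.Tensor, target_width: int) -> torch.Tensor:
    """Tile a spectrogram of shape [channels, n_mels, time] along the time axis
    until it covers target_width frames, then trim to exactly target_width
    (e.g. a 300ms clip tiled ~13.3 times to fill a 4s window)."""
    width = spec.shape[-1]
    if width >= target_width:
        return spec[..., :target_width]
    n_repeats = -(-target_width // width)   # ceil division
    return spec.repeat(1, 1, n_repeats)[..., :target_width]
```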

Hi everyone,

This is exactly what I was looking for. Can anybody add me to a slack channel currently in operation? It’d be really helpful to me.

I’m trying to work on noise suppression.

@kbandi we have a telegram group, PM me if you would like to join. Also this thread is quite active if you have any questions or want to share specifics of what you’re working on.


Yeah, unfortunately I created my own fork of the original repo before you made yours public (like you, I wanted to play around, so I didn’t set it up in a way that would let me submit back to the original). So I’ve been doing stuff in that rather than contributing to yours. It seems like the best option might be to focus on combining all the various ideas and rationalising things as part of a move to fastai v2 (with, I guess, a slight issue being whether there’s also a desire to support audio in v1).
I’ve made my repo public (https://github.com/thomasbrandon/fastai_audio-test*), but as noted it’s not really intended for public use, particularly so as not to split users between the two. It lacks many of the niceties and features of yours, but it does have support for a full GPU transform pipeline which I think performs fairly well. I need to do a bit of performance testing to see what’s bottlenecking and optimise a few things, but as you can see in the one example it does an epoch of UrbanSound8K in ~35secs with a resnet18 at reasonably high settings (n_fft=2048, n_mels=256). That’s also with only a single CPU worker, as there are some issues with GPU processing and multiple workers I need to look at (though I’m not sure how much that’ll matter as the CPU is only loading files).

I think the stuff in learner should work without much modification in your repo. The main change needed is that audio_cnn_learner reads the number of channels from the AudioDataBunch, which relies on stuff I was playing with that tracks metadata through transforms (this didn’t end up being especially useful given the effort, though I still need to implement using it for display to get correct axis labels). So you’d need to either add a parameter to audio_cnn_learner for the number of channels (as in the lower-level adapt_model) or pull this from the databunch (either read a batch or expose the config through that).
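
The ‘read a batch’ option is cheap to sketch; in fastai v1 something like this should work (the n_channels keyword on audio_cnn_learner is the hypothetical parameter mentioned above, not an existing signature):

```python
from fastai.vision import models

# `data` is whatever AudioDataBunch the pipeline produces; one_batch() is
# standard fastai v1 and returns an (x, y) pair on the CPU.
xb, yb = data.one_batch()
n_channels = xb.shape[1]     # batch assumed to be [bs, channels, n_mels, time]

# Hypothetical signature: pass the inferred channel count through to the learner.
learn = audio_cnn_learner(data, models.resnet18, n_channels=n_channels)
```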

Haven’t been working on audio much this week so hadn’t done that PR but will look to do so in the next day or so.

* Note that while the repo is called fastai_audio-test, the module is called fastai_audio, so don’t try to install it alongside the existing version.

Thanks for sharing. I looked over the code and we do seem to have relatively complementary implementations (albeit totally incompatible ones). We are going to need to refactor for API v2, and we are in need of a refactor anyway as the code is ballooning a bit, so maybe when API v2 is released we can look at combining the best of both. This also depends on what they release for the audio lesson.

My goal is to make it as easy for people to train audio models as it was for me to train the dog breed detector when I started the course, despite having zero knowledge of computer vision. This is in part why I’ve been more focused on developing features. I love the idea that someone can see that delta/accelerate were stacked on an MFCC in whatever paper, and replicate with 2 keywords.
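
For anyone unfamiliar with the terminology, ‘delta/accelerate stacked on an MFCC’ just means the first and second time derivatives of the MFCC stacked as extra channels. A plain librosa sketch of the idea (placeholder file path, not the library’s keyword interface) looks like:

```python
import numpy as np
import librosa

y, sr = librosa.load('example.wav', sr=None)           # placeholder path, native sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)      # [n_mfcc, time]
delta = librosa.feature.delta(mfcc)                      # first derivative ("delta")
accel = librosa.feature.delta(mfcc, order=2)             # second derivative ("accelerate")
features = np.stack([mfcc, delta, accel])                # [3, n_mfcc, time], a 3-channel input
```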

Also happy to announce we finally fixed the upside-down spectrograms. We also handle multi-channel audio now (naively, by downmixing to mono and warning the user if they don’t have downmix set to True). I am interested in removing that at some point though and handling more complex input.
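
(‘Naive downmix’ here just means averaging the channels; on a [channels, samples] waveform tensor that’s a one-liner, sketched below rather than quoted from the library:)

```python
import torch

def downmix_to_mono(waveform: torch.Tensor) -> torch.Tensor:
    """Naive downmix: average over the channel dimension of a [channels, samples] tensor."""
    return waveform.mean(dim=0, keepdim=True)
```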

Yeah, we’d diverged such that when you released your fork there wasn’t that much that could easily be ported over, hence my lack of contributions (apart from pointing out issues). Since then I’ve avoided duplicating stuff as much as possible.

Yeah, that seems like the best option. I’ve been looking in on the progress, but no data block stuff had been released last I checked (a few days ago). When that comes out I might start having an early play.

You might note that I just used librosa.display; it gives a reasonable matplotlib-compatible display, so it can integrate with the ItemList/DataBunch stuff, with nice support for axis labels and legends. Since then I think I’ve removed all other dependencies on librosa though, so it’s not necessarily worth it just for display (which is basically just wrapping a couple of matplotlib functions). I think you have a couple of transforms that use it though.

Hmm, that’s interesting, thanks for the tip. I’ll play around with it, as having axes would be nice for people, especially when they start playing with stuff like f_max and ref. We only use librosa where needed, because we have to go torch -> np -> torch whenever there’s a librosa function. I imagine at some point, if this library gets used, someone more experienced than me will come through and make torchified versions of the few functions we use librosa for (or torchaudio will get around to it and we can use their version). Honestly, I’m sure I could torchify those functions by just copying over from librosa, but it doesn’t seem like there’s all that much benefit.
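
(That torch -> numpy -> torch round trip is easy to see in a generic wrapper; torchify below is just an illustrative name, not something either library defines:)

```python
import torch

def torchify(librosa_fn):
    """Wrap a numpy-based librosa function so it accepts and returns torch tensors,
    i.e. the torch -> numpy -> torch round trip described above."""
    def wrapped(x: torch.Tensor, **kwargs):
        out = librosa_fn(x.cpu().numpy(), **kwargs)
        return torch.from_numpy(out)
    return wrapped

# e.g. db_spec = torchify(librosa.amplitude_to_db)(mel_spec)
```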

What method does your code use to ensure all inputs to the databunch are the same size? Do you zero-pad? Have you experimented with repeat padding instead of zero padding and compared results? I’m really interested in the effects of repeat padding, especially with mixup. It seems like having a sg that is mostly zeros would negatively affect mixup.

I’m also really interested in appending the original time data from long clips that we have to cut short. Have you done any work on this, or come across papers that specify their approach in detail? Most papers just refer really obliquely to these things (appending delta/accel, concatenating metadata). I’d like to test it and make it easier for people to try without having to reinvent the wheel every time.