Deep Learning with Audio Thread

Yeah, I considered that, could be a reasonable addition (and perhaps primary method for spectrograms, obviously not helpful when using a time domain network).
Think you’d need to do some performance testing to know if it was better, and by how much. It’d mean using less efficient FFT window lengths on longer chunks for several of the rates, i.e. if you were ‘resampling’ to 16000 via that method with nfft=1024, then a clip recorded at 22050 would need to be STFT’d with nfft=1411 (I think, quick calc). Whether the repeated FFTs of less efficient and longer windows would outweigh the advantage of just resampling in the first place, in particular for smaller hops where it will matter more, I don’t know. Given the FFT method is the most efficient way to resample on GPU, I’d suspect it would probably be a net performance gain (except maybe at very small hop lengths), but it is a somewhat complex performance tradeoff.
Then there’s the question of whether the performance gain is worth the extra complexity. It also doesn’t just affect the spectrogram creation, you’d have to adapt things like the mel filterbank for the differing numbers of FFT bins.

Edit: Oh, and when thinking it through I realised that this only really works with mel filtering. Otherwise you are just left with different FFT resolutions for each sample rate and need to basically resample the FFT frequency data which seems unlikely to end up being a worthwhile method. With mel filtering (or similar frequency re-binning) you can just adapt the mel filterbank to each nfft.
Edit2: Oops, that’s wrong, you don’t need mel filtering: given appropriate scaling of nfft, you just need to select the first x bins, where x is the bin count you’d get from the nfft at your standardised rate. So in the above example, the first 1024 of the 1411 bins from the 22050 clip should be identical to the ones from the 16000 clip. So it’s just the performance question of whether the larger number (due to overlap with hop) of less efficient FFT lengths is better than one longer (perhaps also inefficient) FFT to resample in the first place.
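To make Edit2 concrete, something along these lines (a rough, untested sketch; the function and parameter names are made up for illustration, and I’m counting one-sided bins here rather than the full nfft):

```python
import torch

def stft_at_common_resolution(signal, sr, target_sr=16000, base_nfft=1024, hop_ms=10):
    # Scale nfft with the sample rate so every clip gets the same frequency-bin
    # spacing: e.g. sr=22050, target_sr=16000, base_nfft=1024 -> nfft ~ 1411.
    nfft = int(round(base_nfft * sr / target_sr))
    hop = int(sr * hop_ms / 1000)  # keep the hop constant in milliseconds, not samples
    spec = torch.stft(signal, n_fft=nfft, hop_length=hop,
                      window=torch.hann_window(nfft), return_complex=True)
    # Bin spacing is sr/nfft ~= target_sr/base_nfft for every rate, so the first
    # base_nfft//2 + 1 one-sided bins should line up across sample rates.
    return spec[: base_nfft // 2 + 1]
```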

Nice write-up :slight_smile: I got good results using square images (7th place on the public LB), I will share my solution in more detail later but I used images of size 256x256. An interesting observation: the original images had size 128xN depending on the audio clip, so I could do the crop in two ways: 1) random crop 128x128 -> upscale to 256x256; 2) random crop 128x256 -> upscale to 256x256. I was expecting better results with the second approach as it captures a longer sequence, yet I got better results with the first. Why? I’m not sure, but it could just be that with approach 1) there are more distinct possible crops. Another possibility is that the upscaling may make the image smoother; could that help the convolutions? I had no time to test these ideas further.
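
For clarity, the two crop schemes were roughly the following (just a sketch with made-up names, not my exact pipeline):

```python
import random

import torch.nn.functional as F

def random_crop_and_upscale(spec, crop_w, out_size=256):
    # spec: tensor of shape (1, 128, N) -- a mel spectrogram "image"
    n = spec.shape[-1]
    start = random.randint(0, max(0, n - crop_w))
    crop = spec[..., start:start + crop_w]                     # (1, 128, crop_w)
    crop = F.interpolate(crop.unsqueeze(0), size=(out_size, out_size),
                         mode='bilinear', align_corners=False)
    return crop.squeeze(0)                                     # (1, 256, 256)

# Approach 1: random_crop_and_upscale(spec, crop_w=128)
# Approach 2: random_crop_and_upscale(spec, crop_w=256)
```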

3 Likes

Hey Robert,

Thanks for creating this amazing thread to connect various people who work in Deep Learning for audio. I was also amongst the winners of the Making Sense of Sounds Challenge 2018 (GTCTM_MAHSA, MSOS: https://cvssp.org/projects/making_sense_of_sounds/site/challenge/#results).

I would like to be a part of the Telegram group you mentioned. I work at an audio tech startup in India and would like to contribute and learn from you guys.

Best,
Mansoor

1 Like

Awesome Mansoor, congratulations on the competition. Unfortunately the Telegram group died off (it still exists, but there haven’t been posts in a month or so), but people post here pretty frequently with questions, ideas, etc. If you want to contribute, just share here the types of stuff you’re working on, and anything you learn in your work that you think others might benefit from. Also it’s a great place to ask questions; we have regular posters who are really helpful here.

Right now I’m working on old kaggle competitions for audio, trying to get state of the art results or close to it, and then transfer that knowledge to our fork of the fastai audio library to try to make it really easy for non-audio experts to train audio models. It’s been a great learning experience and we are going to release some new features soon. Anyways welcome to the thread

Best,
Rob

3 Likes

Congratulations on the high LB score and good luck in the final standings. Did you try non-square images at all? So you took 128xN images that used 128 mel bins and then upscaled them to 256x256? I’m surprised the upscaling would help, as you’re not adding more info, right? Just doing some type of interpolation between mel bins. The only thing I can think of is that the smoothing helped the convolutions like you said, but I’d be surprised, because if that worked I feel like I would’ve heard about it.

I look forward to reading your write up, thanks for reading mine! Cheers.

1 Like

Thanks, I did some experiments with non-square images earlier in the competition but I would need to try it on the final setup. Yes, I re-scaled 128x128 crops to 256x256, just that. Changing from 128 to 256 usually improves results in image classification tasks using these models (I forgot to mention, I used fastai xresnets), but it was quite surprising that converting 128x256 crops to 256x256 was not as good as 128x128 to 256x256. I also used max_zoom=1.5; I didn’t expect it to be a good idea but it improved the results. I’m not sure by how much, I will need to run some experiments after the competition is over and late submissions are available.

Meanwhile I need to finish my write up, and I will also share the code! Cheers.

2 Likes

Hey guys, we’ve gotten back to work on the fastai audio fork @baz and I are maintaining and have some cool new features that might be of interest.

First off, the old code altered the head of resnets to accept 1 channel input, but the more I play around the more it seems resnets are not optimal for audio, so we removed that and instead now use the tensor’s `expand(3, -1, -1)` to turn 1 channel inputs into 3 channels via shared memory. This also doesn’t affect the cache size for saved files. Now you can use any architecture that accepts images as input.
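
In practice that looks roughly like this (a sketch, not the exact library code):

```python
import torch

# A 1-channel spectrogram of shape (1, n_mels, n_frames)
spec = torch.randn(1, 128, 400)

# expand() creates a 3-channel view without copying memory, so cached files
# stay 1 channel while any image model expecting 3-channel input can be used.
spec_3ch = spec.expand(3, -1, -1)

assert spec_3ch.shape == (3, 128, 400)
# Note: the channels share storage, so avoid in-place ops on spec_3ch.
```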

Also I’ve added MFCC (mel-frequency cepstral coefficients) as an alternative to melspectrogram as an input; all you have to do to switch is add `mfcc=True` to your config. Right now the number of coefficients (n_mfcc) defaults to 20 and I haven’t added a param to the config for that yet. MFCC is mostly used in speech recognition.
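
For anyone unfamiliar with MFCCs, computing them standalone looks roughly like this (using librosa just as an illustration, not our library’s internals):

```python
import librosa

# Load any audio clip at its native sample rate
y, sr = librosa.load('clip.wav', sr=None)

# 20 coefficients per frame, mirroring the current default of n_mfcc=20
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)  # (20, n_frames)
```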

Another feature is that you now have the option to add the delta/accelerate (1st and 2nd derivatives of your image, a somewhat common practice in audio ML) as your 2nd and 3rd channel, instead of a copy of your original image. This will consume 3x the memory in both the cache and during training but can improve results and looks pretty cool.
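
As a rough illustration of what those extra channels contain (again using librosa just to demonstrate, not our implementation):

```python
import librosa
import numpy as np

y, sr = librosa.load('clip.wav', sr=None)
melspec = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

delta = librosa.feature.delta(melspec, order=1)       # 1st derivative over time
accelerate = librosa.feature.delta(melspec, order=2)  # 2nd derivative over time

# Stack as 3 channels instead of repeating the original spectrogram
stacked = np.stack([melspec, delta, accelerate])      # (3, n_mels, n_frames)
```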

@baz got show_batch() working so you can now hear the audio and see your spectrogram/MFCC/delta alongside it. Below are some examples.

Normal 1 channel spectrogram, expanded in memory to 3 channels but we just show the first

3 channel spectrogram, 1st is normal melspec, 2nd is delta, 3rd is accelerate

MFCC, expanded in memory to 3 channels but we just show the first

3 channel MFCC, 1st is the normal MFCC, 2nd is delta, 3rd is accelerate

4 Likes

Also we are totally open to new contributors (or old ones :grin:) so if you find yourself implementing something for audio AI and think it would be a cool feature for the library, or if you think the API sucks and want to try a refactor, go ahead.

Thanks for welcoming me to the group Robert!

Hello all, I’m currently working on a task where I need to implement an audio recognition algorithm, kind of like Google’s sound search algorithm
(https://ai.googleblog.com/2018/09/googles-next-generation-music.html)
which is different from Shazam’s audio fingerprinting algorithm
(https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf).

I need some help from you guys here. The way I’m approaching this problem is by implementing a VGGnet architecture and applying a triplet loss function to the 128-dimensional embeddings generated by my model. My dataset comprises a song and 100 different variations of the same song, mixed with different types of noise at different intensity levels. I take the spectrograms of these 10 second song samples and train the model on them, but I am getting bad results.
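
To be concrete, here is a minimal sketch of the kind of setup I mean (placeholder network and names, not my actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Placeholder CNN that maps a spectrogram to a 128-d embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        emb = self.fc(self.features(x).flatten(1))
        return F.normalize(emb, dim=1)   # L2-normalise so distances are comparable

triplet_loss = nn.TripletMarginLoss(margin=0.2)

# anchor: clean clip, positive: noisy variant of the same song,
# negative: a spectrogram from a different song
# loss = triplet_loss(net(anchor), net(positive), net(negative))
```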

I also want to make this work in real time, where I record a song and my model predicts what song it is, so I want to look at some different approaches to this problem. If any of you guys can suggest some ideas it would be great. Thanks.

Best,
Mansoor

Hey Mansoor, can you give more detail about what exactly you are trying to do? What is the model trying to learn exactly? Also what is your ultimate goal? To be able to do what shazam does?

One thing that may help is taking small pieces of the spectrogram instead of the whole 10s, or in your case for audio fingerprinting, finding a way to reduce the spectrogram information to something more compact. Audio is tricky and is more like video than images because it has a time component where things are changing. Treating it like one big image is a bit like training on video data by just concatenating all the frames of the video together (it’s a bit different, as audio is continuous, not discrete like video, but it’s been helpful to me to think of it that way, especially for problems like scene classification).

Also I’ve gotten some more interest in the audio chat, so I’ll PM you my info and then add you there and we can try to revive it.

5 Likes

FYI I ran into this same issue when trying to configure the colab notebook tonight; I was able to solve it by updating my torchaudio version. It looks like ilovescience didn’t put in that PR so I just put one in.

1 Like

Hey - I think there is a small typo in the notebook. It looks like you are cd-ing into the directory fastai-audio when it should read fastai_audio.

It’s a pretty small thing that most people can probably sort out, but it’s a bit annoying, so it’s probably good to update.

If you are getting the error

fatal: not a git repository (or any of the parent directories): .git
bash: install.sh: No such file or directory

It’s probably due to this typo

2 Likes

Thanks for the PR and pointing that out, I’m going to go over the whole repo this week and fix stuff like that. For example, a lot of the example notebooks have errors because we have been moving fast with API changes and haven’t been updating the notebooks every time, especially because I am going to overhaul the notebooks with new audio competitions and results.

1 Like

@MadeUpMasters sorry for the delay. I’ve been busy as of late and could not reply to your questions.
I apologize for that.

  1. What are the lengths of the audio clips you’re using? - I am using clips whose lengths vary between 3 and 6 seconds.
  2. Are they all the same length, or varied? - No, they are not all the same length.
  3. What library are you using to generate the spectrograms before plotting? - I am using librosa to generate the spectrograms.

The dataset I am using is the IEMOCAP dataset provided by the University of Southern California. It contains roughly 12 hours of audio-visual data.

The library used to process the audio is again librosa.

Spectrogram pictures -

These are 2 spectrogram pictures of resolution 690x271 for the anger emotion. I do not want to spam the thread with too many pictures, so I have not uploaded more. If need be, I will do so.

Thanks in advance for the help :smile:

2 Likes

Hey thanks, that’s exactly the info I needed to help.

This is almost certainly what is causing problems for you. When you save your images using matplotlib, it makes them all the same width even though they contain differing amounts of data (imagine resizing the width of images without resizing the height to maintain the correct aspect ratio).

The spectrograms librosa generates are just 2D numpy arrays, but fastai needs all of your images to be the same size and shape, so you will need to pad them, convert them to tensors, and then convert them to be 3 channels. Our library does all this for you, and then caches the spectrograms so that you don’t have to regenerate them every epoch.
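
If you want to do it manually in the meantime, the rough idea is something like this (a sketch, not our library’s exact code):

```python
import numpy as np
import torch

def to_fixed_size_tensor(spec, target_frames):
    # spec: 2D numpy array (n_mels, n_frames) straight from librosa
    n_mels, n_frames = spec.shape
    if n_frames < target_frames:
        spec = np.pad(spec, ((0, 0), (0, target_frames - n_frames)))  # pad the time axis
    else:
        spec = spec[:, :target_frames]                                 # or trim it
    t = torch.from_numpy(spec).float().unsqueeze(0)   # (1, n_mels, target_frames)
    return t.expand(3, -1, -1)                        # 3 channels via shared memory
```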

The library is still pretty beta and the notebooks might not be 100% current. You’re welcome to give it a shot and can post any problems you have here and we will help. If you’d rather wait, I’m going to be adding some final features then overhauling the notebooks this week and making sure everything works smoothly.

Any reason you saw not to extend the previous method of adapting the model to work with other models? I’ve got code for this that should work for pretty much any convolutional model. It provides an `audio_cnn_learner` method that should work with all the models that `fastai.vision.cnn_learner` does (the main issue there is that `cnn_learner` calls `__init__(pretrained)` on the model class, which often won’t work). There’s also `adapt_model`, which should work with most pre-created models, the issue being finding the initial conv layer (it handles nesting of sequentials and named layers in a custom Module, which seems to cover all the models in fastai.vision that `cnn_learner` doesn’t). Or there’s `adapt_conv`, which just modifies a Conv2d layer, leaving it to the user to replace it in the model.
Currently it just uses the convolutional kernels of the first channel in the model; I will look at other methods to see if they are better (I was thinking you could maybe run inputs through and try to select a set of kernels that maximise the diversity of activations across inputs).
Compared to expand, this method should work with arbitrary input channels (e.g. stereo or mag+phase), for which I think you’d need to copy data with expand, as it only works with singleton dimensions. It should also be more efficient in memory and processing by avoiding convolutions and separate gradients for the cloned channels (not sure how much this matters, and I’d suspect not too much in the standard, quite deep vision models). It also avoids potential issues with the expand method if subsequent processing is performed: with expanded data, if you modify in-place you will get weird results, as you apply the operation multiple times to the three views of the same underlying data.
Unless you saw some issue with this method (beyond the obvious trouble of implementing it) I can submit a PR.
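For reference, the core of the `adapt_conv` idea is roughly this (a simplified sketch, not the actual code I’d submit):

```python
import torch
import torch.nn as nn

def adapt_conv(conv: nn.Conv2d, in_channels: int = 1) -> nn.Conv2d:
    # Rebuild a pretrained Conv2d for a different number of input channels,
    # reusing the pretrained kernels of the first in_channels input channels.
    new_conv = nn.Conv2d(in_channels, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[:, :in_channels])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# e.g. model.conv1 = adapt_conv(model.conv1, in_channels=1)
```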

And on resnet, what led you to think it isn’t a good model for audio? Any better ones? Just starting some tests on this now.

Unrelatedly, looking at your code I notice a lot of your transforms clone the input. They don’t seem to be using in-place operations, so they won’t modify the inputs (not that I can see any issue with that, given you’re reloading your cached inputs every time). So this just seems to be needlessly copying data, or am I missing some issue?

Hey, honestly I’m just not at this level yet with my understanding of stuff. I think there’s lots of stuff in the library that could be refactored to perform better and be cleaner, and you seem really knowledgeable so it would be awesome if you want to contribute and PR. We recently added stacking the delta/accelerate in the additional two channels as well so it would need to be compatible with that.

Ehh, I probably spoke too soon on this, but with mixup I was having better results using densenets. There’s so much work to be done to determine what’s good given all the options, parameters, and the different types of problems we might be addressing (scene recognition, voice, speech, music, etc.).

No, you’re totally right, I was just following what was done in the library we forked, but for the most part we aren’t modifying anything in place. Feel free to fix and PR.

Added a new PR this morning. It adds a duration attribute to the config. This is a (mostly better) alternative to pad_to_max: instead of PadTrimming the audio (throwing away tons of data from the longer clips in order to have reasonably sized spectrograms), or segmenting it (where each round of training is done on the same base spectrograms), it computes/caches the whole spectrogram and then crops the x-axis at random to create spectrograms that are duration ms long.

If you have a 25s clip and want spectrograms that are 4000ms in duration, PadTrim will give you the first 4s of that and throw away the rest, segmentation will give you seven 4s spectrograms (before applying transforms), and the new feature will just grab 4s chunks of the spectrogram on the fly, leading to more variation.
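
The crop itself is conceptually simple, roughly (an illustrative sketch; names may differ from the actual PR):

```python
import random

def random_time_crop(spec, duration_ms, hop_ms):
    # spec: (channels, n_mels, n_frames) cached spectrogram
    crop_frames = int(duration_ms / hop_ms)
    n_frames = spec.shape[-1]
    if n_frames <= crop_frames:
        return spec                      # shorter clips would be padded instead
    start = random.randint(0, n_frames - crop_frames)
    return spec[..., start:start + crop_frames]
```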

I’m going to stop with features for now and refactor the notebooks, including making some for the audio kaggle competitions showing how to get good results on various problems with no domain expertise.

1 Like

At some point during course 2 v3, @Jeremy mentioned that he intended to produce supplemental lectures on audio, to be given after the course completed. Does anyone know if this is still the plan, and if so, when?

2 Likes