Deep Learning with Audio Thread

Interesting paper via the Import AI newsletter, Detecting Kissing Scenes in a Database of Hollywood Films.

The first component is a binary classifier that predicts a binary label (i.e. kissing or not) given features extracted from both the still frames and audio waves of a one-second segment. The second component aggregates the binary labels for contiguous non-overlapping segments into a set of kissing scenes.

Emphasis mine. Haven’t read the paper yet, but it’s interesting for 2 reasons: they’ve effectively done audio diarisation, and they used the VGGish architecture (which I’d never heard of) for the audio feature extraction. Probably worth a look!

2 Likes

Hey, sorry I’ve been away from the thread for a while, was busy getting crushed in the freesound comp. I joined super late but worked really intensely, had fun, and learned an incredible amount. Ultimately I think I was on the right track but data leakage did me in. Here’s the write-up: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/95202#latest-550317

Unfortunately, with the short time and the annoyance of it being a kernels-only competition, I thought it was best to use just the normal fastai vision library to avoid headaches in the kernel stage, so I didn’t get to test fastai audio. One thing I do know is that I still don’t know much, and that we really need someone with audio ML expertise to help us with design choices. For instance, we have been pushing for larger (224x224) images and focused on keeping them square, but most of the solutions coming out are using 128xN, and the best I’ve seen so far (11th place) used 64xN and said they tried 32xN with only a loss of .005 on the metric (lwlrap, scale 0-1)!!!

Squarification doesn’t seem to do anything when not using pretrained models, for instance 64x256 images train well. It may have something to do with imagenet and pretrained stuff. Also, many people used much simpler, shallower models than even a resnet18, with one person using a very simple 4-layer base model to get into the top 10%.

SpecAugment didn’t seem to do much at all outside of the speech domain (time and frequency masking increased local CV but most people reported no effect on the LB, and some reported that time-warping decreased performance).

Mixup seemed to be extremely important (I didn’t have time to implement it as I was too focused on trying semi-supervised learning for the first time and hoped SpecAugment would be enough). I’m still not totally sure how people implemented it, but some version of it seems essential. Adding the raw signals together is one option, but just using log math to add two spectrograms together seems pretty promising.
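To make that concrete, here’s a rough sketch of what “log math” mixing could look like. This is just my guess at an implementation (the function name, the Beta-distributed weight and the dB scaling are my assumptions, not anything taken from the winning solutions):

```python
import numpy as np

def mixup_log_specs(spec_a, spec_b, alpha=0.4):
    """Mix two log-power (dB) spectrograms by summing in the power domain.

    spec_a, spec_b: same-shape 2D arrays of dB values (hypothetical inputs).
    Returns the mixed dB spectrogram plus the weight lam for mixing the labels.
    """
    lam = np.random.beta(alpha, alpha)           # mixup weight, as in the mixup paper
    power_a = 10.0 ** (spec_a / 10.0)            # dB -> power
    power_b = 10.0 ** (spec_b / 10.0)
    mixed = lam * power_a + (1.0 - lam) * power_b
    return 10.0 * np.log10(mixed + 1e-10), lam   # power -> dB
```

The point of going back to the power domain for the sum is that the result then corresponds to actually mixing the underlying audio, rather than averaging dB values directly.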

For me, I plan to go back to the drawing board and work on the 3 audio kaggle comps (freesound2018, freesound2019, google tensorflow speech challenge) until I can produce SOTA or near SOTA results. Then hopefully I can come back to fastai audio and help make a version that is compatible with the new fastai 2.0 API. I’ll still probably work on it here and there in the meantime, but I’d like it to be able to not only be fast/convenient for people, but also to train at a world-class level.

5 Likes

Hey Naveen, it isn’t necessary; typically we turn most image transforms off. The way I try to think about which ones could be useful is by asking whether a transformed spectrogram could plausibly end up in my test set. A vertical flip of a spectrogram, for instance, would just change the frequencies present from their normal patterns into something random and most likely unidentifiable, so I’d skip it. A horizontal flip is akin to reversing the audio and wouldn’t be useful for speech, but maybe for something like scene recognition, or maybe not; if it’s plausible you just have to try it and see how your results are affected.

In short, rotations are likely never useful, vertical flips are likely never useful, and horizontal flips could be. I think @ThomM reported, though, that very small rotations actually gave a performance increase, but I could be mistaken.

1 Like

This is awesome, you’ve gone super deep on this! Really nice work. Have you looked into what FFT algorithm PyTorch uses? I’ve read that it’s only Cooley-Tukey that has problems with large primes and that alternatives (Bluestein’s (chirp-z), Rader’s) don’t.

You should consider getting in touch with the torchaudio team and possibly submitting a PR. Then we can integrate it into fastai audio from there. We’re happy to have it as a PR here as well, but I’d say go one level up the chain and make it more widely available.

Gone a bit deeper now. Just updated the notebook with some new stuff, some clean up and a bit of added explanation (not much still though).
I tried re-implementing polyphase with dot products rather than convolution; that way you can skip calculating the outputs that will just be dropped while decimating anyway. But it performs horribly, like 50x slower than the convolution approach, even while doing around 1/160th of the calculations (in the 48000->44100 case). As I thought might happen, it looks like the overhead of spinning up a CUDA kernel for every output sample swamps any advantage (by comparison the convolution approach uses ~300 kernels, about half of which are convolutions and the rest some small data shuffling).
Can’t really see a way to improve that without a custom kernel. Have been playing around with that a bit as something to learn CUDA on but don’t really expect to get better performance than the convolution. It’s non-trivial to see how to efficiently implement it. And of course the convolution in PyTorch (coming out of a nvidia library) is almost certainly highly optimised. Though not necessarily for this case, this is a much, much larger kernel than typical.
The basic issue is the massive size of the FIR filter, which I gather is due to the fact that you are filtering the upsampled signal: in the case of 48000->44100 you are upsampling by 160x and so filtering at 7.056MHz. This means you need a really long filter for decent performance. One way to improve performance would be to reduce the filter size; you can get what I think (based on minimal knowledge) is fairly equivalent quality with much smaller filters (in fact the 64K filter I was using from that blog post was just for comparison with resampy, which uses a custom 64K filter; the default is 16001 taps). This does help performance quite a lot, but it’s still slower than FFT.
There’s a bit in the notebook with some analysis of FIR performance, still learning about this though so not really sure how best to assess it.
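For anyone who wants a CPU-side reference for the polyphase approach, scipy has one built in. This is just an illustration (the test tone is made up), not the PyTorch/CUDA path discussed above:

```python
import numpy as np
from scipy import signal

# 1 second of a 1 kHz tone at 48 kHz, resampled to 44.1 kHz
sr_in, sr_out = 48000, 44100
t = np.arange(sr_in) / sr_in
x = np.sin(2 * np.pi * 1000 * t)

# 44100/48000 reduces to 147/160, so upsample by 147 and decimate by 160;
# resample_poly applies its FIR filter at the upsampled rate, as described above
y = signal.resample_poly(x, up=147, down=160)
print(len(x), len(y))  # 48000 -> 44100
```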

From the docs: "the cuFFT Library employs the Cooley-Tukey algorithm … This algorithm expresses the DFT matrix as a product of sparse building block matrices. The cuFFT Library implements the following building blocks: radix-2, radix-3, radix-5, and radix-7. Hence the performance of any transform size that can be factored as 2**a * 3**b * 5**c * 7**d (where a, b, c, and d are non-negative integers) is optimized in the cuFFT library. There are also radix-m building blocks for other primes, m, whose value is < 128. When the length cannot be decomposed as multiples of powers of primes from 2 to 127, Bluestein’s algorithm is used. Since the Bluestein implementation requires more computations per output point than the Cooley-Tukey implementation, the accuracy of the Cooley-Tukey algorithm is better."
I’ll look to do some tests to see how much that matters.

I did also note while looking around that according to this post simple FFT/IFFT resampling is “non-ideal in that it can introduce ringing and other nasty artifacts in the time domain”, and it suggests applying frequency domain filtering (others in the thread say to use polyphase instead). This frequency domain filtering takes advantage of the convolution theorem, which shows that convolution in the time domain, as in applying an FIR, is equivalent to multiplication in the frequency domain. So you FFT your standard FIR filter and then multiply that with the FFT’d signal. That should be easy and performant enough; you just need an appropriate FIR filter (scipy has tools for generating them based on requirements). Frequency domain filtering would also be a nice operation to support in its own right for cases where it’s desirable (thinking more of EEG/ECG/etc. than music/speech, where I’m not sure of any particular use case).
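Here’s a small numpy/scipy sketch of that frequency-domain filtering idea, just to make the steps explicit (the filter length and cutoff are made-up numbers; a real anti-aliasing filter would be designed for the specific rate conversion):

```python
import numpy as np
from scipy import signal

def fft_fir_filter(x, fir):
    """Apply an FIR filter by multiplying in the frequency domain.

    Zero-pads both the signal and the filter to the full linear-convolution
    length so the FFT's circular convolution matches a time-domain FIR.
    """
    n = len(x) + len(fir) - 1
    X = np.fft.rfft(x, n)
    H = np.fft.rfft(fir, n)
    y = np.fft.irfft(X * H, n)
    return y[:len(x)]  # trim back to the input length (ignores group delay)

# e.g. a low-pass FIR; cutoff is a fraction of Nyquist with firwin's defaults
fir = signal.firwin(numtaps=1025, cutoff=0.45)
x = np.random.randn(48000)
y = fft_fir_filter(x, fir)
```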

A further option might be to use an overlap-and-add (or the similar overlap-and-save) method for FFT-based conversion (the author of the above suggests this for doing frequency domain filtering). Using this method you divide the input into overlapping chunks, process them and then recombine. This might improve FFT performance a bit more by turning each file into a batch of smaller FFTs, which the docs say will improve performance. You could then also optimise those chunk sizes, which are fixed for a given rate conversion, which is easier than trying to optimise for randomly sized inputs.
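scipy actually ships an overlap-add convolution (scipy >= 1.4), which is handy as a CPU reference for what a chunked GPU version would need to reproduce; the signal and filter here are placeholders:

```python
import numpy as np
from scipy import signal

x = np.random.randn(10 * 48000)                  # a longish placeholder signal
fir = signal.firwin(numtaps=16001, cutoff=0.45)  # a large filter like those discussed

# oaconvolve splits the input into chunks, FFT-filters each one and recombines
# them, which is the overlap-and-add idea described above
y = signal.oaconvolve(x, fir, mode='same')
```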

Yeah, or I’d probably submit to torchaudio-contrib (the staging/playground for torchaudio). If you haven’t seen it, it’s worth checking out as it has implementations that better mirror the librosa API, which I think is preferable. Not a lot there at the moment but some nice little methods. Also, any other re-implementations of librosa stuff would likely be accepted there.

Nice writeup and summary on the winning solutions…

Mixup is already implemented in fastai… You just need to call the mixup method on the line where you create the learner… That’s it… I used it in the freesound comp and it is indeed very important for getting good results with such a limited dataset.
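For anyone following along, in fastai v1 it’s roughly just this (the dataset, architecture and epoch count here are arbitrary; any ImageDataBunch of spectrogram images would be loaded the same way):

```python
from fastai.vision import *

# any ImageDataBunch works; spectrogram folders would be loaded the same way
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=64)

# .mixup() attaches the MixUpCallback; that's the whole change
learn = cnn_learner(data, models.resnet18, metrics=accuracy).mixup()
learn.fit_one_cycle(1)
```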

1 Like

Oh wow, I had no idea it was that easy. I assumed we had to mess with the one-hot encoded labels ourselves, and with freesound’s multilabel data I didn’t want to try. I’m testing it now on the tensorflow speech challenge.

I’m assuming this works the same way it does with other images, by using transparency. I’d love to experiment with adding spectrograms directly and seeing how it compares. My intuition is that the output images would be pretty similar, but not the same. Thanks for the tip.

Nice writeup. Look forward to having a play with some of your models.
Couple of thoughts:

I’d thought about that too. It might be that the disparate information isn’t well handled by the network. Channels are collapsed in the first convolution layer, so it’s perhaps a little hard for the network to deal with disparate information across them. Perhaps you could introduce them after initially training on just one channel, adding in another channel with additional data once the network has got a bit of a grasp on the first one. It may be that early on the gradients pulling in somewhat different directions doesn’t help, but later the network could integrate the extra information (I’m guessing here).
Or even use a separate convolutional network for each channel, combining them either at the linear layer or after a few separate conv layers (or use the groups option in torch.nn.Conv2d to separate kernels in the same conv layer; see the sketch below).
Or for some of the simpler ones that just produce a few values (but have been found useful in classification with non-NN models) you could perhaps introduce them into the linear section of the network after the convolutions (and again possibly after some initial training).
Interesting ideas to play around with certainly.
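A tiny sketch of that groups idea from above, just to show the mechanics (channel and kernel counts are arbitrary):

```python
import torch
import torch.nn as nn

# 3-channel input, e.g. melspec + delta + accelerate stacked as channels
x = torch.randn(8, 3, 128, 256)

# groups=3 gives each input channel its own set of 16 kernels, so the
# channels are not mixed together in this layer
conv = nn.Conv2d(in_channels=3, out_channels=48, kernel_size=3, padding=1, groups=3)
out = conv(x)
print(out.shape)  # torch.Size([8, 48, 128, 256])
```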

I don’t think there’s any requirement to use square images even with pretrained. A convolutional stem will adapt to any input size (note that it applies the same kernel to every pixel so is not dependent on size). With a linear head at the end, you have to use the same size once you create your custom model with cnn_learner (or employ some sort of trick to ensure that the size is always the same at this point), but no need for it to be square.
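For example, in fastai v1 you can just pass a (height, width) tuple as the size in the data block. A hedged sketch (the folder layout, split and sizes here are made up):

```python
from fastai.vision import *

# assumes a folder of spectrogram images organised into class subfolders
path = Path('data/spectrograms')
data = (ImageList.from_folder(path)
        .split_by_rand_pct(0.2)
        .label_from_folder()
        .transform(get_transforms(do_flip=False), size=(128, 512))  # rectangular
        .databunch(bs=64)
        .normalize(imagenet_stats))

learn = cnn_learner(data, models.resnet18, metrics=accuracy)
```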
As Jeremy clarified:

Sounds like a great plan.
Still kinda down the resampling rabbit hole but will pull out of that now. I also want to move back to looking at that sort of stuff, finding the best architectures to use and so on. Can hopefully bounce ideas off each other and compare results.

The ESC-50 dataset also looks interesting as there are metrics for a variety of models available for it (and not necessarily having the hyper-optimisation of Kaggle solutions might be good). There’s also this fastai project with it:

Given your post on trying comet/neptune, any preference between them? That’s my next thing to look at. There’s also https://www.wandb.com/ which also has a good free tier and fastai integration.

@MadeUpMasters Thanks for the information. I tried the code in Lesson 1 of the Part 1 (2019) Deep Learning course to perform emotion recognition using spectrograms of audio samples in the IEMOCAP dataset. The results I obtained were in line with your observations: without transformations, the model achieved around 50% accuracy; with transformations, it achieved a lower accuracy of around 43%. I am mainly curious as to why I am getting such a low accuracy given that the model is very robust and can be trained on web-scraped photos to high levels of accuracy. I think the issue might be related to the fact that I am plotting the spectrograms using matplotlib and then saving the file as a .jpg, but I am really not sure. I will look into this.

Thanks very much for your help once again. :smile:

1 Like

Sounds like a cool problem. Audio is really tricky, it adds a whole new bag of hyperparameters you can play with. What are the lengths of the audio clips you’re using? And are they all the same length, or varied? What library are you using to generate the spectrogram before plotting?

If you can share a bit more about the dataset and its contents, what library you’re using to process the audio, and finally a few pics of your spectrograms, we can probably help you fix your results.

Yeah there’s so much room to experiment with stuff like this. I’m just now getting to the point where I’m starting to be able to rearrange the inner parts of the network, so I hadn’t even thought about keeping them separate during the conv layers and combining them later.

That does look like a really cool option, given the great job they’ve done on benchmarking and referencing papers. And yea it would be refreshing to not have to ensemble 15 classifiers to have a high score hahaha.

Actually I haven’t been using either lately. When I tried them I immediately preferred and really liked comet. I think I’ll start back with it tomorrow, thanks for the reminder.

Also, with regards to resampling and spectrograms. Has anyone looked at whether there is any difference between downsampling and just increasing the hop_length when generating spectrograms? It seems like if you’re just downsampling from one sample rate to another, say 44100 to 22050, the same thing could be accomplished much more cheaply by doubling the hop. I haven’t played with this at all and haven’t thought too deeply so I could be missing something.

1 Like

Yeah, I considered that, could be a reasonable addition (and perhaps primary method for spectrograms, obviously not helpful when using a time domain network).
Think you’d need to do some performance testing to know if it was better, and by how much. It’d mean using less efficient FFT window lengths on longer chunks for various rates, i.e. if you were ‘resampling’ to 16000 via that method with nfft=1024, a 22050 sample would need to be STFT’d with nfft=1411 (I think, quick calc). Whether the repeated FFTs of less efficient, longer windows outweigh the advantage of just resampling in the first place, in particular for smaller hops where it will matter more, I don’t know. Given the FFT method is the most efficient way to resample on GPU, I’d suspect it would probably be a net performance gain (except maybe at very small hop lengths), but it is a somewhat complex performance tradeoff.
Then there’s the question of whether the performance gain is worth the extra complexity. It also doesn’t just affect the spectrogram creation, you’d have to adapt things like the mel filterbank for the differing numbers of FFT bins.

Edit: Oh, and when thinking it through I realised that this only really works with mel filtering. Otherwise you are just left with different FFT resolutions for each sample rate and need to basically resample the FFT frequency data which seems unlikely to end up being a worthwhile method. With mel filtering (or similar frequency re-binning) you can just adapt the mel filterbank to each nfft.
Edit2: Oops, that’s wrong, you don’t need mel filtering: given appropriate scaling of nfft you just need to select the low-frequency bins up to the Nyquist of your standardised rate. So in the above example, the first 513 bins (nfft//2 + 1 for nfft=1024) of the STFT of the 22050 clip at nfft=1411 should essentially match the bins from the 16000 clip at nfft=1024. So it’s just the performance question of whether the larger number (due to overlap with hop) of less efficient FFT lengths beats doing one longer (perhaps also inefficient) FFT to resample in the first place.
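Roughly, the bin-selection idea looks like this (numbers are illustrative; the bins only line up approximately because of the rounding of nfft and hop):

```python
import numpy as np
import librosa

sr_std, sr_clip = 16000, 22050
n_fft_std, hop_std = 1024, 256
scale = sr_clip / sr_std                     # ~1.378
n_fft_clip = int(round(n_fft_std * scale))   # 1411
hop_clip = int(round(hop_std * scale))       # 353

y = np.random.randn(2 * sr_clip)             # stand-in for a 2 s clip at 22050 Hz

# STFT at the clip's native rate with scaled window/hop, then keep only the
# bins below the standardised Nyquist (8 kHz) instead of resampling first
S = np.abs(librosa.stft(y, n_fft=n_fft_clip, hop_length=hop_clip))
S_low = S[: n_fft_std // 2 + 1]              # 513 bins covering roughly 0-8 kHz
```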

Nice write-up :slight_smile: I got good results using square images (7th place on the public LB); I will share my solution in more detail later, but I used images of size 256x256. An interesting observation: the original images had size 128xN depending on the audio clip, so I could do the crop in two ways: 1) random crop 128x128 -> upscale to 256x256; 2) random crop 128x256 -> upscale to 256x256. I was expecting better results with the second approach as it would capture a longer sequence, yet I got better results with the first. Why? I’m not sure, but it could just be that with approach 1) I have more possible distinct samples. Another thought is that the upscaling may make the image smoother; could that help the convolutions? I had no time to test these ideas further.

3 Likes

Hey Robert,

Thanks for creating this amazing thread to connect various people who work in Deep Learning for audio. I was also amongst the winners of the Making Sense of Sounds Challenge 2018 (GTCTM_MAHSA, MSOS: https://cvssp.org/projects/making_sense_of_sounds/site/challenge/#results).

I would like to be a part of the Telegram group you mentioned. I work at an audio tech startup in India and would like to contribute and learn from you guys.

Best,
Mansoor

1 Like

Awesome Mansoor, congratulations on the competition. Unfortunately the telegram group died off (it still exists but there haven’t been posts in a month or so), but people post here pretty frequently with questions, ideas, etc. If you want to contribute, just share here the types of stuff you’re working on, and anything you learn in your work that you think others might benefit from. It’s also a great place to ask questions; we have regular posters who are really helpful here.

Right now I’m working on old kaggle competitions for audio, trying to get state-of-the-art results or close to it, and then transfer that knowledge to our fork of the fastai audio library to try to make it really easy for non-audio experts to train audio models. It’s been a great learning experience and we are going to release some new features soon. Anyways, welcome to the thread.

Best,
Rob

3 Likes

Congratulations on the high LB score and good luck in the final standings. Did you try non-square images at all? So you took 128xN images that used 128 mel bins and then upscaled them to 256x256? I’m surprised the upscaling would help as you’re not adding more info, right? Just doing some type of interpolation between each mel bin. The only thing I can think of is that the smoothing helped the convolutions like you said, but I’d be surprised, because if that worked I feel like I would’ve heard about it.

I look forward to reading your write up, thanks for reading mine! Cheers.

1 Like

Thanks, I did some experiments with non-square images earlier in the competition but I would need to try it on the final setup. Yes, I re-scaled 128x128 crops to 256x256, just that. Changing from 128 to 256 usually improves results in image classification tasks using these models (I forgot to mention, I used fastai xresnets), but it was quite surprising that converting 128x256 crops to 256x256 was not as good as 128x128 to 256x256. I also used max_zoom=1.5; I didn’t expect it to be a good idea but it improved the results. I’m not sure by how much, I will need to run some experiments after the competition is over and late submissions are available.

Meanwhile I need to finish my write up, and I will also share the code! Cheers.

2 Likes

Hey guys, we’ve gotten back to work on the fastai audio fork @baz and I are maintaining and have some cool new features that might be of interest.

First off, the old code altered the head of resnets to accept 1-channel input, but the more I play around the more it seems resnets are not optimal for audio, so we removed that and instead now use `Tensor.expand(3, -1, -1)` to change 1-channel inputs to 3 channels via shared memory. This also doesn’t affect the cache size for saved files. Now you can use any architecture that accepts images as input.
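In case it’s useful, the expand trick looks like this (shapes are just an example):

```python
import torch

spec = torch.randn(1, 128, 256)   # a 1-channel spectrogram (channels, height, width)
img = spec.expand(3, -1, -1)      # view it as 3 identical channels, no memory copy
print(img.shape)                  # torch.Size([3, 128, 256])
```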

Also I’ve added MFCC (mel-frequency cepstral coefficients) as an alternative to the melspectrogram as input; all you have to do to switch is add “mfcc=True” to your config. Right now the number of coefficients (n_mfcc) defaults to 20 and I haven’t added a param to the config for that yet. MFCC is mostly used in speech recognition.
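For reference, the equivalent feature outside the library with librosa looks roughly like this (the filename and sample rate are placeholders):

```python
import librosa

y, sr = librosa.load('speech_clip.wav', sr=16000)   # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # 20 coefficients, as above
print(mfcc.shape)                                   # (20, n_frames)
```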

Another feature is that you now have the option to add the delta/accelerate (1st and 2nd derivatives of your image, a somewhat common practice in audio ML) as your 2nd and 3rd channel, instead of a copy of your original image. This will consume 3x the memory in both the cache and during training but can improve results and looks pretty cool.
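If you want the same three channels outside the library, a rough librosa equivalent would be something like this (the file path is a placeholder, and I’m not claiming it matches our implementation exactly):

```python
import numpy as np
import librosa

y, sr = librosa.load('clip.wav', sr=None)            # placeholder file
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
delta = librosa.feature.delta(mel)                   # 1st derivative over time
accelerate = librosa.feature.delta(mel, order=2)     # 2nd derivative
stacked = np.stack([mel, delta, accelerate])         # shape (3, n_mels, n_frames)
```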

@baz got show_batch() working so you can now hear the audio and see your spectrogram/MFCC/delta alongside it. Below are some examples.

Normal 1 channel spectrogram, expanded in memory to 3 channels but we just show the first

3 channel spectrogram, 1st is normal melspec, 2nd is delta, 3rd is accelerate

MFCC, expanded in memory to 3 channels but we just show the first

3 channel MFCC, 1st is the normal MFCC, 2nd is delta, 3rd is accelerate

4 Likes

Also we are totally open to new contributors (or old ones :grin:) so if you find yourself implementing something for audio AI and think it would be a cool feature for the library, or if you think the API sucks and want to try a refactor, go ahead.