Deep Learning with Audio Thread

Yes you are looking at the right repo.
Try now :slight_smile:

1 Like

@baz the colaboratory notebook currently does not work. I think it was working like two days ago, but not anymore. What versions of torch, torchaudio, etc. does your package use?

I solved the issue. There is a problem with the torchaudio code. The buggy code was committed just yesterday. For the time being, it is best to install an older version of torchaudio.

I replaced the torchaudio installation in the bash file with the following line:
pip install git+https://github.com/pytorch/audio.git@d92de5b97fc6204db4b1e3ed20c03ac06f5d53f0

1 Like

I have a question. If one is doing emotion recognition using spectrograms of speech files, is it necessary to apply vertical translations and rotations to the spectrograms before giving them as input to the neural network?

1 Like

I’ve implemented scipy’s resample and resample_poly in PyTorch. For resample_poly I’ve used the filter from the link marii posted. It’s at https://nbviewer.jupyter.org/gist/thomasbrandon/63d609f37f8e73c56f5a4c76260aeb28

They seem to work but could do with more testing (especially around input shapes). For resample it’s a direct copy of the scipy one with just a few changes needed. For resample_poly it’s very different. There’s a bit of a mismatch in both (not massive but not np.allclose). For resample I’d say it’s likely differences in FFT (for some conversions it is np.allclose with default tolerances). For resample_poly there might be some mishandling there (likely around combining the various phases back). From the spectrogram it looks identical to scipy’s.
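For anyone who hasn’t opened the notebook, the core FFT trick behind resample looks roughly like this (a simplified sketch written against the current torch.fft API, skipping the Nyquist-bin handling scipy does, so it’s not the notebook code):

import torch

def fft_resample(x: torch.Tensor, orig_sr: int, new_sr: int) -> torch.Tensor:
    # Resample the last dimension of x from orig_sr to new_sr via the FFT,
    # the same basic idea as scipy.signal.resample.
    n = x.shape[-1]
    new_n = int(round(n * new_sr / orig_sr))
    X = torch.fft.rfft(x)                        # one-sided spectrum, n//2 + 1 bins
    new_bins = new_n // 2 + 1
    if new_bins <= X.shape[-1]:                  # downsampling: truncate the spectrum
        Y = X[..., :new_bins]
    else:                                        # upsampling: zero-pad the spectrum
        pad = torch.zeros(*X.shape[:-1], new_bins - X.shape[-1],
                          dtype=X.dtype, device=X.device)
        Y = torch.cat([X, pad], dim=-1)
    return torch.fft.irfft(Y, n=new_n) * (new_n / n)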

There are some performance numbers at the end. resample performs pretty well, resample_poly doesn’t.
The PyTorch resample isn’t really faster than scipy’s resample_poly (it runs faster, but that’s PyTorch on multiple cores, 6 here, vs. scipy on a single core). But it is faster than scipy’s resample, I’m pretty sure even if you exclude the cases where scipy blows up. The PyTorch FFT doesn’t seem to have too much issue with particular sizes (I’ll do some testing to see whether padding may still help). Here are the results of various operations on the entire 6.7GB UrbanSound8K dataset (ta_load is torchaudio load; all other operations also load with that; caches were cleared between tests; pt_resample is my implementation on CPU; most files are 4s, some shorter; everything resamples to 22050 except decimate2, which decimates by 2 and was included to see whether it’s worth using for optimisation where the rates allow):

Processing 8732 files with 5743M samples:
ta_load: 13.9s total; mean: 1.59ms; std: 1.364ms; max: 27.7ms
sp_resample: 1694.2s total; mean: 194.03ms; std: 2015.948ms; max: 91911.7ms
sp_resample_poly: 75.4s total; mean: 8.64ms; std: 4.568ms; max: 60.4ms
sp_decimate2: 75.4s total; mean: 8.63ms; std: 4.440ms; max: 46.1ms
pt_resample: 43.7s total; mean: 5.01ms; std: 2.703ms; max: 46.2ms

For resample on a synthetic GPU test, 1000 repeats of transferring a 4sec clip to GPU (RTX2070) and resampling with varying rates:

48000->44100: 0.225s; 225.270us/item; 4.693us/KSamp
48000->22050: 0.226s; 225.894us/item; 4.706us/KSamp
44100->22050: 0.229s; 228.923us/item; 5.191us/KSamp
22050->16000: 0.223s; 222.620us/item; 10.096us/KSamp

So pretty good performance I think. I haven’t looked at optimising either implementation; resample is just a direct copy of the scipy code. There’s not much I can see to optimise, apart from the fact that it uses a full FFT rather than PyTorch’s default one-sided one, so it might be possible to use just the one-sided transform and adjust values as needed.

For resample_poly the performance is not very good though (example from the blog marii linked, 2sec clip, 48000->44100, ~65K taps):

>>> %timeit resample_poly_torch(sig_t, P, Q, wfilt_t)
1.05 s ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit sp_resample_poly(sig, P, Q, window=wfilt)
31.8 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On GPU, 100 repeats, copying signal but not filter: 13.131s; 131.310ms/item.

I include some discussion at the end on why it’s slow compared to scipy. Though I am largely guessing there, will run some tests to confirm.

Did you get that to work on real samples? I implemented that (it’s in the notebook) and it worked on the tiny samples I used for testing, but when I tried to do the conv1d with the example from the blog you posted (96000 sample input, 48000->44100, the 65K tap filter there, not the 800K one scipy defaults to) PyTorch would segfault, I believe due to lack of memory. I also tried using stride to do the decimation while doing the conv1d (not in that notebook), and PyTorch still blew up. I don’t think convolving a 65K kernel with a 14,112,000 sample input (96000 * 147, the up factor for 48000->44100) is something they designed it for.
The method I used to get it to work is the FIR interpolator outlined here. In this method, instead of zero-stuffing with up-1 zeros and filtering with N taps, you filter the input directly with up filters of N/up taps each (each of those filters is a phase, which I think is where the polyphase name comes from, so plain zero-stuffing isn’t actually polyphase filtering). I outline it a little in the notebook, or that link is pretty good on it. I can’t see how you’d speed it up in PyTorch without implementing a custom kernel, or maybe with JIT, though I’m not sure about that as it might spin up too many kernels doing convolutions for every output sample separately.
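Very roughly, the idea looks something like this (a simplified sketch covering only the interpolation by an integer up factor; it ignores the decimation by down and the group-delay trimming a real implementation needs, so it isn’t the notebook code):

import torch
import torch.nn.functional as F

def polyphase_upsample(x: torch.Tensor, h: torch.Tensor, up: int) -> torch.Tensor:
    # x: (batch, time); h: (taps,), with taps assumed to be a multiple of up
    b, t = x.shape
    taps = h.shape[0]
    # phase p of the filter is h[p::up]; flip each because conv1d actually
    # computes cross-correlation, not convolution
    w = torch.stack([h[p::up].flip(0) for p in range(up)])   # (up, taps//up)
    w = w.unsqueeze(1)                                       # (up, 1, taps//up)
    pad = taps // up - 1
    y = F.conv1d(x.unsqueeze(1), w, padding=pad)             # (b, up, t + pad)
    y = y[..., :t]
    # interleave the up phase outputs: (b, up, t) -> (b, up * t); scale by up
    # to undo the amplitude loss of zero-stuffing (assumes unit passband gain)
    return y.transpose(1, 2).reshape(b, up * t) * up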

3 Likes

I am guessing you know a lot more about this than me now, so I will go with “yes.” I am not nearly as fast as you are at working on this stuff. I think you have completely eclipsed the knowledge I have on the subject. Actually… you are an audio expert, so makes sense. Thanks for the interesting read, I was basically stumbling in the dark the whole time.

2 Likes

Thanks, but no expert. I was similarly stumbling in at least a pretty deep gloom, if not dark. There was a whole lot of testing of very basic things, and many errors, to get that notebook together.

Please make a PR so that other people don’t have the same issue :slight_smile:

Interesting paper via the Import AI newsletter, Detecting Kissing Scenes in a Database of Hollywood Films.

The first component is a binary classifier that predicts a binary label (i.e. kissing or not) given features extracted from both the still frames and audio waves of a one-second segment. The second component aggregates the binary labels for contiguous non-overlapping segments into a set of kissing scenes.

Emphasis mine. I haven’t read the paper yet, but it’s interesting for 2 reasons: they’ve effectively done audio diarisation, and they used the VGGish architecture (which I’d never heard of) for the audio feature extraction. Probably worth a look!

2 Likes

Hey, sorry I’ve been away from the thread for a while, was busy getting crushed in the freesound comp. I joined super late but worked really intensely, had fun, and learned an incredible amount. Ultimately I think I was on the right track but data leakage did me in. Here’s the write-up: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/95202#latest-550317

Unfortunately, with the short time and the annoyance of it being a kernels-only competition, I thought it was best to use just the normal fastai vision library to avoid headaches in the kernel stage, so I didn’t get to test fastai audio. One thing I know though is that I still don’t know too much, and that we really need someone with audio ML expertise to help us with design choices. For instance, we have been pushing for larger (224x224) images, and focused on keeping them square, but most of the solutions coming out are using 128xN, and the best I’ve seen so far (11th place) used 64xN, and said they tried 32xN and only experienced a loss of .005 on the metric (lwlrap, scale 0-1)!!!

Squarification doesn’t seem to do anything when not using pretrained models, for instance 64x256 images train well. It may have something to do with imagenet and pretrained stuff. Also, many people used much simpler, shallower models than even a resnet18, with one person using a very simple 4-layer base model to get into the top 10%.

SpecAugment didn’t seem to do much at all outside of the speech domain (time and frequency masking increased local CV but most people reported no effect on LB, with time-warping reported by some to cause decreased performance).

Mixup seemed to be extremely important (I didn’t have time to implement it as I was too focused on trying to do semi-supervised learning for the first time and hoped SpecAugment would be enough). I’m still not totally sure how people implemented it, but a version of it seems essential. Adding the raw signals together is an option, but just using log math to add two spectrograms together seems pretty promising.
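Something like the following is what I mean by the log math (just a sketch, assuming dB-scaled spectrogram tensors; lam would come from the usual Beta distribution draw):

import torch

def mixup_log_spectrograms(spec_a: torch.Tensor, spec_b: torch.Tensor, lam: float) -> torch.Tensor:
    # convert back to power, take the convex combination, return to dB
    power = lam * torch.pow(10.0, spec_a / 10.0) + (1 - lam) * torch.pow(10.0, spec_b / 10.0)
    return 10.0 * torch.log10(power.clamp_min(1e-10))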

For me, I plan to go back to the drawing board and work on the 3 audio kaggle comps (freesound2018, freesound2019, google tensorflow speech challenge) until I can produce SOTA or near SOTA results. Then hopefully I can come back to fastai audio and help make a version that is compatible with the new fastai 2.0 API. I’ll still probably work on it here and there in the meantime, but I’d like it to be able to not only be fast/convenient for people, but also to train at a world-class level.

5 Likes

Hey Naveen, it isn’t necessary; typically we turn most image transforms off. The way I try to think about which ones could be useful is by asking whether a transformed spectrogram could plausibly end up in my test set. A vertical flip of a spectrogram, for instance, would just change the frequencies present from their normal patterns into something random and most likely unidentifiable, so I’d skip it. A horizontal flip is akin to reversing the audio, which wouldn’t be useful for speech, but might be for something like scene recognition (or might not; if it’s plausible you just have to try it and see how your results are affected).

In short, rotations are likely never useful, vertical flips are likely never useful, and horizontal flips could be. I think @ThomM reported, though, that very small rotations actually gave a performance increase, but I could be mistaken.
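For reference, turning most transforms off looks roughly like this in fastai v1 (just a sketch; the exact values are illustrative, not a recommendation):

from fastai.vision import get_transforms

# disable flips, rotations, lighting and warp, which all move or distort
# the frequency axis of a spectrogram
tfms = get_transforms(do_flip=False, flip_vert=False, max_rotate=0.,
                      max_zoom=1.0, max_lighting=0., max_warp=0.)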

1 Like

This is awesome, you’ve gone super deep on this! Really nice work. Have you looked into what FFT algorithm PyTorch uses? I’ve read that it’s only Cooley-Tukey that has problems with large primes and that alternatives (Bluestein’s (chirp-z), Rader’s) don’t.

You should consider getting in conversation with the torchaudio team and possibly submitting a PR. Then we can integrate it into fastai audio from there. We’re happy to have it as a PR as well, but I’d say go one level up the chain and make it more widely available.

Gone a bit deeper now. Just updated the notebook with some new stuff, some clean up and a bit of added explanation (not much still though).
I tried re-implementing polyphase with dot products rather than convolution, that way you can skip calculating the outputs that will just be dropped while decimating anyway. But it performs horribly, like 50x slower than the convolution approach, even while doing around 1/160th of the calculations (in the 48000->44100 case). As I thought might be the case, it looks like the overhead of spinning up a CUDA kernel for every output sample swamps any advantage (by comparison the convolution approach uses ~300 kernels, about half of which are convolutions and the rest some small data shuffling).
I can’t really see a way to improve that without a custom kernel. I have been playing around with that a bit as something to learn CUDA on, but I don’t really expect to get better performance than the convolution. It’s non-trivial to see how to implement it efficiently. And of course the convolution in PyTorch (coming out of an NVIDIA library) is almost certainly highly optimised, though not necessarily for this case; this is a much, much larger kernel than typical.
The basic issue is the massive size of the FIR filter, which I gather is due to the fact that you are filtering the upsampled signal: in the case of 48000->44100 you are upsampling by 147x and so filtering at 7.056MHz. This means you need a really long filter for decent performance. One way to improve performance would be to reduce the filter size; you can get what I think (based on minimal knowledge) is fairly equivalent quality with much smaller filters (in fact the 64K filter I was using from that blog post is there just for comparison with resampy, which uses a custom 64K filter; the default is 16001). This does help performance quite a lot, but it’s still slower than FFT.
There’s a bit in the notebook with some analysis of FIR performance, still learning about this though so not really sure how best to assess it.

From the docs:" the cuFFT Library employs the Cooley-Tukey algorithm … This algorithm expresses the DFT matrix as a product of sparse building block matrices. The cuFFT Library implements the following building blocks: radix-2, radix-3, radix-5, and radix-7. Hence the performance of any transform size that can be factored as 2**a * 3**b * 5**c * 7**d (where a, b, c, and d are non-negative integers) is optimized in the cuFFT library. There are also radix-m building blocks for other primes, m, whose value is < 128. When the length cannot be decomposed as multiples of powers of primes from 2 to 127, Bluestein’s algorithm is used. Since the Bluestein implementation requires more computations per output point than the Cooley-Tukey implementation, the accuracy of the Cooley-Tukey algorithm is better."
I’ll look to do some tests to see how much that matters.
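One easy thing to test is padding each transform up to the next length that factors into those radices, similar in spirit to scipy.fft.next_fast_len. A rough helper (illustrative only, not tuned for speed):

def next_fast_len(n: int, radices=(2, 3, 5, 7)) -> int:
    # smallest length >= n whose prime factors are all in radices, i.e. a
    # size cuFFT handles with its optimized building blocks
    def is_smooth(m: int) -> bool:
        for r in radices:
            while m % r == 0:
                m //= r
        return m == 1
    while not is_smooth(n):
        n += 1
    return n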

I did also note while looking around that, according to this post, simple FFT/IFFT resampling is “non-ideal in that it can introduce ringing and other nasty artifacts in the time domain”, and it suggests applying frequency domain filtering (others in the thread say to use polyphase instead). This frequency domain filtering takes advantage of the convolution theorem, which shows that convolution in the time domain, as in applying an FIR filter, is equivalent to multiplication in the frequency domain. So you FFT your standard FIR filter and then multiply that with the FFT’d signal, which should be easy and performant enough. You just need an appropriate FIR filter (scipy has functions for generating them based on requirements). Support for frequency domain filtering would also be a nice operation to have for cases where it’s desirable (I’m thinking more of EEG/ECG/etc. stuff than music/speech, where I’m not sure of any particular use case).
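In code the frequency-domain filtering is pretty short (a sketch, assuming 1-D real signals and the current torch.fft API; it returns the full linear convolution, so you’d still trim for group delay):

import torch

def fft_fir_filter(x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    # pad both to len(x) + len(h) - 1 so the circular convolution of the
    # spectra equals the linear convolution of x with h
    n = x.shape[-1] + h.shape[-1] - 1
    X = torch.fft.rfft(x, n=n)
    H = torch.fft.rfft(h, n=n)
    return torch.fft.irfft(X * H, n=n)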

A further option might be to use an overlap-and-add (or the similar overlap-and-save) method for FFT based conversion (the author of the above suggests this for doing frequency domain filtering). With this method you divide the input into overlapping chunks, process them, and then recombine. This might improve FFT performance a bit more by turning each file into a batch of smaller FFTs, which the docs say will improve performance. You could also then optimise those chunk sizes, which are fixed for a given rate conversion, which is easier than trying to optimise for randomly sized inputs.
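A bare-bones overlap-add version of the frequency-domain filtering above might look like this (illustrative sketch only: 1-D input, fixed block size, no attempt to pick an optimal FFT length):

import torch

def overlap_add_filter(x: torch.Tensor, h: torch.Tensor, block: int = 4096) -> torch.Tensor:
    # filter x with FIR h block by block via the FFT, summing the
    # overlapping tails of each filtered block back together
    m = h.shape[-1]
    n = block + m - 1                      # FFT size per block
    H = torch.fft.rfft(h, n=n)
    out = torch.zeros(x.shape[-1] + m - 1, dtype=x.dtype)
    for start in range(0, x.shape[-1], block):
        seg = x[start:start + block]
        y = torch.fft.irfft(torch.fft.rfft(seg, n=n) * H, n=n)
        stop = min(start + n, out.shape[-1])
        out[start:stop] += y[:stop - start]
    return out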

Yeah, or I’d probably submit to torchaudio-contrib (the staging/playground for torchaudio). If you haven’t seen it, it’s worth checking out as it has implementations that better mirror the librosa API, which I think is preferable. There’s not a lot there at the moment, but some nice little methods. Also, any other re-implementations of librosa stuff would likely be accepted there.

Nice writeup and summary on the winning solutions…

Mixup is already implemented in fastai… You only need to call the mixup method on the line where you create the learner… That’s it… I used it in the freesound comp and it is indeed very important for getting good results with such a limited dataset.
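Something like this (fastai v1; data here is just whatever DataBunch of spectrogram images you already have, so treat it as a placeholder):

from fastai.vision import *

# mixup is enabled by a single call when building the learner
learn = cnn_learner(data, models.resnet18, metrics=accuracy).mixup()
learn.fit_one_cycle(5)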

1 Like

Oh wow, I had no idea it was that easy. I assumed we had to mess with the one-hot encoded labels ourselves, and with freesound’s multi-label data I didn’t want to try. I’m testing it now on the tensorflow speech challenge.

I’m assuming this is working the same way it does with other images, by using transparency. I’d love to experiment with adding spectrograms together directly and seeing how it compares. My intuition is that the output images would be pretty similar, but not the same. Thanks for the tip.

Nice writeup. Look forward to having a play with some of your models.
Couple of thoughts:

I’d thought about that too. It might be that the disparate information isn’t well handled by the network. Channels are collapsed in the first convolution layer, so it’s perhaps a little hard for the network to deal with disparate information across them. Perhaps you could introduce them after initially training on just one channel, adding another channel with additional data once the network has got a bit of a grasp on the initial one. It may be that early on the gradients pulling in different directions doesn’t help, but later the network could integrate the extra information (I’m guessing here).
Or even use a separate convolutional network for each channel, combining them either at the linear layer or after a few separate conv layers (or use the groups option in torch.nn.conv2d to separate kernels in the same conv layer).
Or for some of the simpler ones that just produce a few values (but have been found useful in classification with non-NN models) you could perhaps introduce them into the linear section of the network after the convolutions (and again possibly after some initial training).
Interesting ideas to play around with certainly.

I don’t think there’s any requirement to use square images, even with pretrained models. A convolutional stem will adapt to any input size (note that it applies the same kernel at every position, so it isn’t dependent on size). With a linear head at the end, you do have to use the same size once you create your custom model with cnn_learner (or employ some sort of trick to ensure that the size is always the same at that point), but there’s no need for it to be square.
As Jeremy clarified:

Sounds like a great plan.
Still kinda down the resampling rabbit hole but will pull out of that now. I also want to move back to looking at that sort of stuff, finding the best architectures to use and so on. Can hopefully bounce ideas off each other and compare results.

The ESC-50 dataset also looks interesting as there are metrics for a variety of models available for it (and not necessarily having the hyper-optimisation of Kaggle solutions might be good). There’s also this fastai project with it:

Given your post on trying comet/neptune, any preference between them? That’s my next thing to look at. There’s also https://www.wandb.com/ which also has a good free tier and fastai integration.

@MadeUpMasters Thanks for the information. I tried the code in Lesson 1 of the Part 1 (2019) Deep Learning course to perform emotion recognition using spectrograms of audio samples in the IEMOCAP dataset. The results I obtained were in line with your observations - without transformations, the model achieved around 50% accuracy; with transformations, it achieved a lower accuracy of around 43%. I am mainly curious as to why I am getting such a low accuracy, given that the model is very robust and can be trained on web-scraped photos to reach high levels of accuracy. I think the issue might be related to the fact that I am plotting the spectrograms using matplotlib and then saving the files as .jpg, but I am really not sure. I will look into this.

Thanks very much for your help once again. :smile:

1 Like

Sounds like a cool problem. Audio is really tricky, it adds a whole new bag of hyperparameters you can play with. What are the lengths of the audio clips you’re using? And are they all the same length, or varied? What library are you using to generate the spectrogram before plotting?

If you can share a bit more about the dataset and its contents, what library you’re using to process the audio, and finally a few pics of your spectrograms, we can probably help you to fix your results.

Yeah there’s so much room to experiment with stuff like this. I’m just now getting to the point where I’m starting to be able to rearrange the inner parts of the network, so I hadn’t even thought about keeping them separate during the conv layers and combining them later.

That does look like a really cool option, given the great job they’ve done on benchmarking and referencing papers. And yea it would be refreshing to not have to ensemble 15 classifiers to have a high score hahaha.

Actually I haven’t been using either. I immediately preferred and really liked comet. I think I’ll start back with it tomorrow, thanks for the reminder.

Also, with regard to resampling and spectrograms: has anyone looked at whether there is any difference between downsampling and just increasing the hop_length when generating spectrograms? It seems like if you’re just downsampling from one sample rate to another, say 44100 to 22050, the same thing could be accomplished much more cheaply by doubling the hop. I haven’t played with this at all and haven’t thought too deeply, so I could be missing something.
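Roughly what I mean, in librosa terms (illustrative only; clip.wav is a placeholder and the parameters are just one possible comparison):

import librosa

y, sr = librosa.load("clip.wav", sr=44100)
y_ds = librosa.resample(y, orig_sr=44100, target_sr=22050)

# same number of frames two ways: resample then hop 256, or stay at 44100
# and double the hop; fmax keeps the mel range comparable
spec_ds  = librosa.feature.melspectrogram(y=y_ds, sr=22050, n_fft=1024,
                                          hop_length=256, fmax=11025)
spec_hop = librosa.feature.melspectrogram(y=y, sr=44100, n_fft=1024,
                                          hop_length=512, fmax=11025)

One difference to keep in mind is that with the same n_fft the window covers half the duration at 44100, so the time/frequency trade-off shifts; whether that matters for training is the thing to test.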

1 Like