Deep Learning with Audio Thread

Thank you, this was really helpful for me. I will make changes for the next update and include some stuff from the notebook you linked and credit you. You’re also welcome to make the changes and PR it if you prefer.

No problem. Again, thanks for your efforts, they spurred me to contribute.
I’ve put up a new version that rewrites bits I’d just copy-pasted from what was just going to be a forum post. I also added a bit on window length as I realised I hadn’t actually covered that, just windows in general. https://nbviewer.jupyter.org/gist/thomasbrandon/f0d11593b07dc5ccb2237aec6b4355a5

I’ll hopefully have a look at better integrating it into the existing stuff sometime soon. I think there might be a nice explanation that goes from the way the FFT shows frequencies across the whole sample (which you nicely show with one of your figures, perhaps adding a little there about how phase gives some information on where frequencies sit within the sample), to multiple non-overlapping FFTs that give frequency information over time but introduce a trade-off between temporal and frequency resolution because n_fft controls both, and finally to the STFT, which separates the two. I think you could also use the coloured-rectangle visualisation I used for windows to show how frames work.
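To make the n_fft trade-off concrete, something like this little torch.stft sketch (mine, fake data, not from either notebook) shows how a bigger n_fft buys frequency resolution at the cost of time resolution:

```python
import torch

x = torch.randn(4 * 22050)  # fake 4s clip at 22.05kHz

for n_fft in (256, 2048):
    spec = torch.stft(x, n_fft=n_fft, hop_length=n_fft // 2,
                      window=torch.hann_window(n_fft), return_complex=True)
    # Rows are n_fft//2 + 1 frequency bins, columns are time frames:
    # a larger n_fft gives finer frequency resolution but fewer, longer frames.
    print(n_fft, tuple(spec.shape))
```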
But if you’re updating it and want to just throw in some parts now that’s fine.

3 Likes

Thank you @MadeUpMasters for this excellent notebook. It’s a great resource to learn core concepts about audio. Thanks once again :bowing_man:t4:

1 Like

Out of interest, did you try comparing the performance of resampling while loading versus loading and then resampling in a subsequent step? As the data is more likely to still be in cache during loading you may gain a pretty significant performance boost, which could even eclipse the gains from a more efficient resampling algorithm. There’s example code at https://github.com/pytorch/audio/blob/master/torchaudio/sox_effects.py for resampling while loading with torchaudio. Or librosa can resample as part of load (via the sr argument), though as that’s Python code you’re likely to see less advantage, since there’s a much greater chance other code will run in between loading and resampling and evict the cache.
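For reference, the two librosa variants look something like this (the filename and 16kHz target are just placeholders; timing each with time.perf_counter or %timeit is what I had in mind):

```python
import librosa

# Resample while loading: librosa does the rate conversion as part of load.
y, sr = librosa.load("clip.wav", sr=16000)

# Load at the native rate, then resample in a separate step.
y_raw, sr_raw = librosa.load("clip.wav", sr=None)
y2 = librosa.resample(y_raw, orig_sr=sr_raw, target_sr=16000)
```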
I’d also note, in case you haven’t done much performance work (not that I have), that this code will be running alongside various other code that’s also reading from disk/memory, processing, and loading to the GPU, so synthetic benchmarks may not be a great indicator of eventual performance. For instance, if something loads a whole big chunk into cache to process, then when interleaved with other code that’s also processing lots of data you’ll just get a lot of cache misses and heavily reduced performance. A slower algorithm that streams data better may be faster overall.

Yeah, I agree with all of your points. My working assumption was that the gap in time between polyphase and FFT-based methods was so large that GPU batching was unlikely to make an FFT-based method faster, and if it did, then a polyphase algorithm that works in batches on the GPU (as @marii was working on) would be faster still.

I did try resampling while loading in both librosa and torch. In librosa it’s a non-starter, their load method is 30x slower than torchaudio. Sox’s resampling was my 2nd best option, mostly in the 30-50ms range (torchaudio load is ~1ms, both of these are for a 15s 44.1kHz wav file), but still at least 5x slower.

I’m not going to be working on this much over the next week as I want to focus on the freesound comp, but after that maybe we can work together on it a bit if you have time. Thanks

OK, yeah, initial profiling I did after posting also showed that librosa is really slow here. The sox results you gave seem a bit more promising. Looking at the code, sox seems to use polyphase resampling in some cases. Did you try any of the quality options? I’d imagine you don’t necessarily need very high quality here; even resampling that sounds bad enough to be unusable in normal contexts could be fine as long as it doesn’t hurt accuracy (but I could be wrong and would want to test this). One issue is that if your classes use different rates you may just end up training the network to recognise your resampling, and higher-quality resampling won’t necessarily reduce that. I think that might also be more of a general issue in audio than in other fastai areas, given the greater scope for differences among formats.
I’ve been playing around with the fastai_audio library, just trying out some things to later look at integrating with the existing stuff if it works nicely. One of the changes I made was to remove the dependency on torchaudio because there currently isn’t a conda package of it and there seem to be some issues with windows support (which while not officially supported by fastai does work). Will compare the torchaudio performance now though. I did see some recent stuff from the torchaudio people about getting a conda package together so that would solve that issue.

I’m not very familiar with the algorithms, but presumably you could fairly easily add frequency-domain resampling: the FFT → process → IFFT resampling methods, but without the final IFFT. Given how common frequency-domain networks are, that avoids converting back to the time domain only to do another FFT on it later. Of course it would preclude any time-domain-only transforms after resampling, so it can’t be the only method, but it could have uses.
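Roughly what I mean, as a sketch (this assumes a recent PyTorch with the torch.fft module, handles the downsampling case only, and ignores windowing/anti-aliasing details):

```python
import torch

def fft_downsample(x, orig_sr, target_sr):
    n = x.shape[-1]
    n_new = int(round(n * target_sr / orig_sr))
    spec = torch.fft.rfft(x)
    spec = spec[..., : n_new // 2 + 1]             # drop bins above the new Nyquist
    # For a frequency-domain pipeline you could stop here and feed `spec`
    # (or a framed STFT equivalent) straight into the magnitude/Mel steps.
    return torch.fft.irfft(spec, n=n_new) * (n_new / n)   # rescale amplitude

y = fft_downsample(torch.randn(44100), 44100, 16000)      # fake 1s clip
```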

One slight issue I have with using the GPU for transforms is that it will reduce available GPU memory and so achievable batch sizes. Probably a bigger issue is that it may make GPU memory usage less deterministic. Currently, as I understand it, if the first epoch succeeds then the whole training run will. Having training runs fail in the middle because you hit a particularly large shuffled batch seems troubling. This seems to be more of an issue for audio, where there may be more variation in item sizes (though I guess people using datasets of images from e.g. Google Images will face similar issues, given that looks to be the way Jeremy wants to move there as well).
There is also the issue that I think data transfers can be a fairly large part of the GPU time. It may be faster overall to do the processing on the CPU if that reduces the amount of data you need to transfer to the GPU. But that’s based on synthetic benchmarks, so it may well not represent a bottleneck in at least most uses (I’ll post some more on that separately).
Of course if transforms are implemented in PyTorch then the same code works for both CPU and GPU processing (and you get multi-core parallelisation on the CPU by default). So as long as all processing is tensor based you can allow users to customise exactly what processing happens on CPU/GPU without much complication to the code (of course with sensible defaults in get_transforms). You just have a transform that moves the tensor to the GPU that can be inserted at the appropriate place in the transforms list. Of course you’d then need to ensure all subsequent transforms are in PyTorch to avoid copying to/from GPU multiple times.
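Something like this is the shape I have in mind - just a sketch, not the fastai transform API, and it assumes a CUDA device is available:

```python
import torch

class ToDevice:
    "Pipeline step that moves a tensor to the GPU; everything after it runs there."
    def __init__(self, device="cuda"):
        self.device = torch.device(device)
    def __call__(self, x):
        return x.to(self.device, non_blocking=True)

def apply_tfms(x, tfms):
    for t in tfms:
        x = t(x)
    return x

tfms = [
    lambda x: x / x.abs().max(),     # a CPU-side transform
    ToDevice(),                      # the boundary: insert wherever it makes sense
    lambda x: torch.stft(x, n_fft=1024, hop_length=512,
                         window=torch.hann_window(1024, device="cuda"),
                         return_complex=True),    # runs on the GPU
]

spec = apply_tfms(torch.randn(16000), tfms)       # fake 1s clip at 16kHz
```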

Just about to look at benchmarking the neural net side of things without any of the file processing. Do you have any benchmarks on that? Especially for non-CNN based models which I have less experience with. What sort of load/process speed is needed to not bottleneck?

Mmm, I’d been looking to have a play with the freesound comp, but it doesn’t look like I’ll get to it before the end. The focus there also seems to be the noisy ground-truth stuff, while I’m still more at the basics stage. Good luck with it. Interested to see your results if you make them public.

1 Like

I think some of the code at the end of the DataAugmentation.ipynb notebook isn’t measuring timing properly, because PyTorch GPU calls are asynchronous: they just queue operations and return before the work is actually performed. Some of the results are correct because the data is moved back to the CPU, which blocks until any outstanding processing is complete.
I wrote some stuff to profile both GPU and CPU code. It uses CUDA streams and events (torch.cuda.Stream / torch.cuda.Event), which let you record timing information for events you create, and implements the same interface for CPU code so you can profile both. For simple testing you may also just be able to call torch.cuda.synchronize to ensure all outstanding work is completed, but I saw some mixed information on how reliable that is. The code isn’t extensively verified (and I’m not entirely sure how you would verify it), but it seems to give reasonable results.
Here’s a notebook with the Profiler and some tests of STFT in both librosa and PyTorch: https://gist.github.com/thomasbrandon/a1e126de770c7e04f8d71a7dc971cfb7
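For a quick sanity check without opening the notebook, the core of the event-based timing is roughly this (a sketch assuming a CUDA device, not the Profiler class itself):

```python
import time
import torch

x = torch.randn(64, 15 * 44100, device="cuda")        # fake batch of 15s, 44.1kHz clips
window = torch.hann_window(1024, device="cuda")

# Naive timing: misleading, the call returns once the kernels are queued.
t0 = time.perf_counter()
spec = torch.stft(x, n_fft=1024, hop_length=512, window=window, return_complex=True)
t_naive = time.perf_counter() - t0

# Event-based timing: the events are recorded on the stream, so they bracket
# the actual GPU work; synchronize before reading the elapsed time.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
spec = torch.stft(x, n_fft=1024, hop_length=512, window=window, return_complex=True)
end.record()
torch.cuda.synchronize()
t_events = start.elapsed_time(end) / 1000             # elapsed_time is in milliseconds

print(f"naive: {t_naive:.4f}s  events: {t_events:.4f}s")
```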

On the STFT results, it looks like PyTorch on CPU might be significantly faster than librosa for STFT. By default PyTorch uses all cores (6 cores here) but even when I limited it to one it was nearly 3x faster (though there were some reports suggesting that torch.set_num_threads may not be entirely reliable as some underlying libraries may still use multi-core acceleration). Given the batch size numbers it looks like PyTorch may be doing batch based parallelisation even on CPU. Though I don’t entirely trust those numbers. I’d want to do some less synthetic tests to be sure. Also there may be some cache effects that are dominating the performance, particularly on small batches.
I also suspect the GPU may perform a fair bit better in non-synthetic tests. I think limiting everything to a single stream for the timing may limit its ability to parallelise, since without the synchronisation I use for timing, operations from subsequent batches could overlap. This is especially true if you use the wait option on the profiler to record intermediate times; I don’t use that for any of the graphs, but there’s a test at the top where I record the copies and STFT in one run, and you can turn off wait there to see the effect.
You do see that for some operations transfer times seem to be a significant limit on performance. Though note that some of those tests are pretty unlikely cases and not intended to represent actual STFT performance in real cases. I did only copy back half the complex spectrogram to mirror a magphase separation, but you’d generally want to implement a Mel transform first which would dramatically reduce the size of data to be transferred. It does though suggest that you likely want to avoid copying data back and forth as would happen if you mix CPU and GPU transforms.
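As a rough illustration of that (sizes are just the 15s/44.1kHz example from earlier, and it assumes torchaudio plus a CUDA device), doing the Mel step on the GPU before the copy shrinks each frame from n_fft//2 + 1 bins down to n_mels:

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100, n_fft=1024, hop_length=512, n_mels=128
).to("cuda")

x = torch.randn(1, 15 * 44100, device="cuda")   # clip already on the GPU
spec = mel(x)                                   # (1, 128, n_frames), still on the GPU
spec_cpu = spec.cpu()                           # far less data than the raw complex STFT
```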

2 Likes

“Here is a Colab notebook to try it out!”

Would love to try it out, but I get a pop-up window saying:

Notebook loading error

There was an error loading this notebook. Ensure that the file is accessible and try again.

?.

BTW, presumably you all are aware of a lot of similar discussions going on over at https://github.com/keunwoochoi/torchaudio-contrib
Didn’t see it mentioned in this thread though.

1 Like

Not sure how useful this is for Python/PyTorch users, but I’ve been working on porting FAIR’s very fast MFSC and MFCC implementations in wav2letter from C++ to Swift (they use FFTW for the Fourier transform).

https://github.com/realdoug/AudioFeature

Figured I’d share here in case anyone is working in Swift and wants to use it, contribute or point out something else that is way better that I should be using instead :laughing:

4 Likes

Hey Scott,

Have you tried using a different browser or going incognito? Has anyone else had this issue? I can’t seem to recreate it.

Very cool Doug, thanks for sharing. My plan this week was to look into this in Swift, so I think you’ve saved me a ton of work. I have a way of using TensorFlow’s WAV loading, spectrogram and MFCC ops to get an S4TF tensor of MFCCs and/or spectrograms, but I also couldn’t find a good library for general audio manipulation in Swift: nothing for resampling, no way of generating mel spectrograms, and no slicing/concatenation either. My plan was to more or less mimic pydub’s API, probably based on the sox wrapper Jeremy’s already done; I don’t know how far down that road I’ll get, though.

2 Likes

baz, Ok, yes, sorry, I don’t know what was up there, but I’m now able to run the Colab notebook. Thanks. It looks really good!

One thing: the spectrograms all look upside-down (i.e., with low frequencies at the top instead of the bottom) compared to how I’m used to seeing them in acoustics and audio engineering contexts (e.g. MATLAB and many other applications).

(This is just a consequence of computer graphics generally being “upside down” compared to how graphs are typically done; it happens with any kind of image, most of the time.)

To correct this and make the display a bit more “standard”, might I suggest a “flip()” operation when displaying spectrograms via show_batch()?
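On the display side it’s basically a one-liner in matplotlib; here’s a sketch with fake data (not the actual show_batch code):

```python
import matplotlib.pyplot as plt
import numpy as np

spec = np.random.rand(128, 400)   # fake mel spectrogram, (freq bins, time frames)

# origin="lower" puts low frequencies at the bottom, the usual convention;
# equivalently you could np.flipud(spec) before imshow with the default origin.
plt.imshow(spec, origin="lower", aspect="auto")
plt.xlabel("frame")
plt.ylabel("mel bin")
plt.show()
```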

3 Likes

Cross-posting from the Share Your Work thread:

Hello!

I made an environmental sound classifier called WHISP that classifies sounds from 50 categories, trained on the ESC-50 dataset.

You can try it out here! https://whisp.onrender.com/

The code is up here on Github: https://github.com/aquietlife/whisp

I also wrote at length about WHISP, the ESC-50 dataset, training an environmental sound classifier, and some insights I had along the way while testing it in the field, on my blog: http://aquiet.life/

Please let me know what you think! I’d love to get connected to other people using ML/AI in the sound/audio field :slightly_smiling_face:

I’m happy this thread exists! Hoping to learn more about deep learning with audio from all of you :slight_smile:

7 Likes

Very cool, congrats on the work! Such an interesting problem, and bravo on pushing it all the way through to an application. It would be very interesting to see what difference it made to “port” this to fastai_audio; I’d guess the audio-specific augmentations could help more than the default fastai image transforms. It’s also interesting you’re explicitly adding a colour map to the spectros and saving them as (presumably) RGB images rather than the ‘true’ values of the spectrogram - I guess as long as the colour mapping remains constant between spectrograms (does it?) it ultimately wouldn’t make much relative difference between them, but it could be interesting to try as greyscale.

Come to think of it, @baz/@MadeUpMasters when you’re caching the spectros, how are you saving & reloading them? As pickled tensors, scaled images or something else?

I really like the problem space, I think generally audio as a sensor for robotics/IoT is under-utilised, working on things like this is super duper low hanging fruit in the best possible way. Nice one & thanks for sharing!

1 Like

I came to post that Apple featured audio classification in CoreML in their State of the Union talk at WWDC yesterday! Around the 1:25:00 mark on that video there’s an app for kids demoed which listens to your voice, recognises words, and plays appropriate sounds; then, you can make animal sounds and it guesses what sounds you’re making. I guess this is the manifestation of the turicreate sound classifier that I linked to a couple of months ago. More evidence that ML-for-sound is undervalued & low-hanging fruit :slight_smile:

2 Likes

Using a cmap will triple the size of the inputs and also the weights of the initial layer (I think it’s always only the first layer across all models, certainly resnet). So you should be able to use larger batch sizes with grayscale. Though of course you then need to fiddle with the model and, if you’re using pretrained weights, with the first layer’s weights (or just not use pretrained weights for that layer).
I’d be interested to see verification that it doesn’t alter learning ability, and I’m looking to run a couple of experiments on that, mainly around the weight copying, as I assume it won’t affect learning.
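For reference, the sort of weight copying I mean looks roughly like this (an illustration using torchvision’s resnet18, not the fastai implementation; summing the pretrained RGB kernels into one channel is just one option, averaging is another):

```python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
old = model.conv1                      # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
new = torch.nn.Conv2d(1, old.out_channels, kernel_size=old.kernel_size,
                      stride=old.stride, padding=old.padding, bias=False)
with torch.no_grad():
    # Collapse the pretrained RGB filters into a single-channel filter.
    new.weight.copy_(old.weight.sum(dim=1, keepdim=True))
model.conv1 = new

out = model(torch.randn(8, 1, 128, 400))   # a batch of grayscale spectrograms
```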

1 Like

We’re saving them as tensor .pt files. Was having formatting issues otherwise.
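i.e. roughly this (paths are placeholders); the nice part is the float values round-trip exactly, with no colour mapping or 8-bit scaling on the way to disk:

```python
import torch

spec = torch.randn(128, 400)          # stand-in for a computed mel spectrogram
torch.save(spec, "clip_0001.pt")      # cache the raw tensor
spec2 = torch.load("clip_0001.pt")    # reload later, values identical
```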

1 Like

Make an issue in the repo and I’ll get to it at some point :slight_smile:

1 Like

@baz Am I looking at the right repo? The one I see most recently referred to the most in this thread (and the one read by your Colab notebook) is https://github.com/mogwai/fastai_audio

Because there’s no “Issues” tab like most GitHub repos have.

1 Like

Cross-posted in the “Share Your Work” thread: I was out for April & May b/c I was “slammed” finishing up my own (3-year-long) audio project!

“SignalTrain: Profiling Audio Compressors with Deep Neural Networks.”
Links: http://www.signaltrain.ml , https://arxiv.org/abs/1905.11928

Moving forward (and apart from using it on other effects besides just compressors) I want to port it to be usable with fastai & add multichannel audio (e.g., at least 8 channels for my next app idea).

6 Likes