Deep Learning with Audio Thread

Nice work guys, a lot of effort in here :slight_smile: Keep in mind that the next version of fastai is expected to include some of these features (some taken from your implementations) directly in the main library, so we might not even need a separate module for long. Of course, who knows what will actually happen once Jeremy & Sylvain “push the button” :slight_smile:

I’m particularly interested in the resampling preprocessor - @marii did you get much further with your resampling investigations in the end?

Oh yeah - one other thing - if you rename the repository to fastai_audio or fastaiaudio or anything without a hyphen in the name, then it will be a valid python module name that can actually be imported… a decent quality of life improvement :slight_smile:

1 Like

Thanks a lot, I really appreciate the detailed feedback.

I noticed the spectrograms were coming out upside-down with Image but honestly wasn’t entirely sure why. I tried flipping them, but neither np nor torch supports negative strides yet. Do you have any suggestions for an easy fix to give them the same orientation?

This is coming once I get through all the parameters; it will sit at the top of the parameters section and be linked from the top of the document.

Yeah, it would be ideal not to have a separate module. We weren’t intending to make one, but while using yours we both just started playing around and making breaking changes, ended up having similar ideas, and combined them. It was a really great learning experience for me as I wouldn’t have been able to build it from scratch, but with your code and the fastai docs I was able to figure things out, so thank you.

2 Likes

I spent wayyy too much time testing resampling to get the time down. There’s still one more optimization that needs to be made, plus a param to let the user choose the resampling type, as well as testing to make sure there is no quality loss when training data is resampled using polyphase (non-FFT based) resampling. The CliffsNotes version of my findings is below. More will be in the audio guide.

There are 5 functions I considered for resampling; the benchmarks below are for a 15 second clip @ 44.1kHz:

  • librosa resample (incredibly slow, ~500ms)
  • resampy resample (slow, ~200ms)
  • sox resample using torchaudio (relatively fast but still too slow, ~50ms)
  • scipy.signal.resample (fast but occasionally ruins everything, ~30ms to 3 minutes)
  • scipy.signal.resample_poly (very fast, but occasionally slow, ~3ms to 150ms)

I spent a lot of time nailing down why scipy.signal.resample and scipy.signal.resample_poly were sometimes slow, and I got a satisfyingly concrete answer that allows me to predict pretty precisely how long it will take.

scipy.signal.resample has two bottlenecks, an FFT and an IFFT. If the clip is n0 samples long, the FFT is done on a signal of length n0 and the inverse FFT on one of length n1, where n1 = n0 * new_sample_rate / old_sample_rate. If the greatest prime factor of n0 is large, it will be slow. If the greatest prime factor of n1 is large, it will be about 10x as slow (this is how you get 4 minute resampling). This is because the underlying FFT algorithm, Cooley-Tukey, is optimized for highly composite lengths but handles lengths with large prime factors poorly.
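If you want to sanity-check a clip before handing it to scipy.signal.resample, something like this quick sketch (not part of the library, just to illustrate the prediction) spots the bad cases:

```python
def largest_prime_factor(n):
    # simple trial division; fine for sample counts
    p, largest = 2, 1
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    return n if n > 1 else largest

old_sr, new_sr = 44100, 16000
n0 = 15 * old_sr                    # 15s clip at 44.1kHz -> 661,500 samples
n1 = int(n0 * new_sr / old_sr)      # 240,000 samples after resampling

print(largest_prime_factor(n0), largest_prime_factor(n1))  # 7 5
# both are tiny, so scipy.signal.resample is fast here; a large prime factor
# for n0 (and especially for n1) is the warning sign for a huge slowdown
```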

There are a number of options for fixing this. One is padding to lengths with only small prime factors, but this is tricky because you have to find a length that works for both n0 and n1 without a massive size increase. Another is replacing the FFT calls in scipy.signal with an FFT algorithm that doesn’t have as much trouble with prime-length input, like Rader’s or Bluestein’s algorithm. But if polyphase resampling doesn’t affect our ML training, then you probably don’t have to go to the trouble, because it’s almost always faster.

Polyphase resampling doesn’t use an FFT, is almost always fast, and in my very unscientific testing so far is indistinguishable from FFT-based resampling. It is slow sometimes, but only in rare cases where the greatest common divisor of the two sample rates is very low (< 100). Given that most sample rates you would want come from a standard list with a GCD > 100 between them, this is unlikely to be a problem. But if someone wants to resample from 44100 to 15997 (GCD = 1), it is going to take ~10ms per second of audio (150ms per 15s clip).
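To make the GCD point concrete, here’s roughly how the up/down factors you’d pass to resample_poly fall out of the two rates (just a sketch):

```python
from math import gcd

def poly_factors(old_sr, new_sr):
    g = gcd(old_sr, new_sr)
    return new_sr // g, old_sr // g   # the up/down arguments for resample_poly

print(poly_factors(44100, 22050))   # (1, 2)         trivial, very fast
print(poly_factors(44100, 16000))   # (160, 441)     still cheap
print(poly_factors(44100, 15997))   # (15997, 44100) huge factors -> the slow case
```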

For now we just use polyphase resampling, but I want to give the user an FFT-based option as I can’t say for sure it won’t give better results. I’ve put it off because scipy’s resample is the fastest option except when it explodes, so I’d need to substitute another lib, like resampy, for inputs with large greatest prime factors.

2 Likes

@ThomM I am pretty stuck on the resampling actually, and as of right now I think it would require JIT/Swift, or something else, to accelerate it on the GPU. The algorithm I was looking into requires upsampling (inserting a lot of zeros between each data point), applying a conv1d, and then downsampling (picking every nth sample). I was not able to find a good way to implement this in PyTorch, even looking at its sparse matrix operations. Other options seemed to require custom build scripts and such. I’m not really sure of a way to add this without adding a Makefile to fastai, which seems to be what Jeremy wants to avoid (for good reason).

I have implemented it using standard PyTorch operations, but the extra calculations on all the zeros introduced by the upsampling bring performance down to slower than just doing it with the current libraries.
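For anyone curious, the naive version looks roughly like this (not my actual code, just a sketch of the zero-stuffing idea, with a stand-in filter) and you can see where all the wasted work on zeros comes from:

```python
import torch
import torch.nn.functional as F

def naive_poly_resample(x, up, down, kernel):
    # x: (batch, 1, n_samples), kernel: 1D low-pass filter taps
    # 1. upsample: insert (up - 1) zeros between every pair of samples
    up_x = torch.zeros(x.shape[0], 1, x.shape[-1] * up,
                       dtype=x.dtype, device=x.device)
    up_x[..., ::up] = x
    # 2. low-pass filter with conv1d (a real implementation would also scale by
    #    `up` and design the cutoff at the lower of the two Nyquist frequencies)
    filt = F.conv1d(up_x, kernel.view(1, 1, -1), padding=kernel.numel() // 2)
    # 3. downsample: keep every down-th sample
    return filt[..., ::down]

x = torch.randn(1, 1, 44100)                  # a 1s clip at 44.1kHz
kernel = torch.hann_window(127)               # stand-in filter, not a proper design
y = naive_poly_resample(x, 160, 441, kernel)  # 44100 -> 16000, y has 16000 samples
```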

@MadeUpMasters I was basically looking to implement scipy.signal.resample_poly on the GPU.

My main resource for understanding resampling is here:

Takeaway: scipy.signal.resample and scipy.signal.resample_poly have bad aliasing. resampy’s resample is better for reducing aliasing, and a custom filter lets you get less aliasing out of scipy.signal.resample_poly too.
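e.g. (just a sketch) resample_poly takes a window argument, so you can swap the default ('kaiser', 5.0) for something with more stopband attenuation:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

old_sr, new_sr = 44100, 16000
clip = np.random.randn(old_sr * 15).astype(np.float32)   # stand-in for a 15s clip

g = gcd(old_sr, new_sr)
up, down = new_sr // g, old_sr // g

# default window is ('kaiser', 5.0); a larger beta gives more stopband
# attenuation (less aliasing) at the cost of a wider transition band
resampled = resample_poly(clip, up, down, window=('kaiser', 14.0))
```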

@MadeUpMasters Also… librosa sometimes uses resampy under the hood (I checked the source), so I’m not sure which one you actually benchmarked?

2 Likes

Wow looks like we went down the same rabbit hole, but you went a lot deeper :slight_smile: . The linked notebook looks great. Is it just that post that helped you gain your level of understanding? Or the whole signalsprocessed blog?

Yeah, I think that’s actually how I discovered resampy. I didn’t do great benchmarking on librosa or resampy because I already had a wide range of results for scipy, and when I tried a few things and saw 200ms+, I realized neither of those was going to be fast enough. I was also really shocked by resampy’s “benchmarking” documentation. They give only one example, a sample whose length feeds a prime into scipy’s IFFT so it takes forever, and use that outlier to claim they are much faster (they probably are on average, but at ~200ms they’re slower in the majority of cases). I found it really misleading.

Nice work Robert, I’m not an expert by any means but I have some familiarity with audio and it seems mostly correct.
However I think you are a bit off on FFT length and hop length.

Hop_length is the size (in number of samples) of those chunks. If you set hop_length to 100, the STFT will divide your 52,480 sample long signal into 525 chunks, compute the FFT (fast fourier transform, just an algorithm for computing the FT of a discrete signal) of each one of those chunks.

Hop length isn’t the size of the chunks, it is the spacing between them. Each chunk is n_fft samples long, but chunks start hop_length samples apart, so each chunk overlaps the next by (n_fft - hop_length) samples.

output of each FFT will be a 1D tensor with n_fft # of values

It is actually a tensor of length (n_fft//2) + 1, so with an n_fft of 1024 there will be 513 values

Window length is different again (and I’m a bit less clear here, but I think this is correct, -ish at least). First the signal is split into n_fft-sized chunks spaced hop_length samples apart. Then the “window function” (function in the mathematical sense) is applied to each of those chunks. There seem to be tricks you can use with window lengths larger or smaller than your n_fft to accomplish various things which I don’t really understand. By default win_length = n_fft.
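To make that concrete, here’s a quick check with torch.stft (on a recent PyTorch; the numbers match the 52,480-sample, hop_length=100 example above):

```python
import torch

signal = torch.randn(52480)          # the 52,480-sample signal from the example
n_fft, hop_length = 1024, 100        # win_length defaults to n_fft

spec = torch.stft(signal, n_fft=n_fft, hop_length=hop_length,
                  window=torch.hann_window(n_fft), return_complex=True)
print(spec.shape)   # torch.Size([513, 525])
# 513 = n_fft//2 + 1 frequency bins; 525 frames, one every hop_length samples,
# each covering n_fft samples and so overlapping the next by n_fft - hop_length
```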

I put together a notebook illustrating this here. At first I just tested some things to verify for myself that I was correct (again, no expert, so feel free to correct me if you think I’m wrong, as I may well be), so I didn’t edit the existing one, but then I added text rather than commenting here. Feel free to integrate it into the existing one, or I’ll look at that at some point. I didn’t add any code to produce meaningful signals (just zeros), which you did nicely, so I couldn’t cover some of that side.

I think you are also a bit off when you say:

When we increase resolution in the frequency dimension (y-axis), we lose resolution in the time dimension, so there is an inherent tradeoff between the choices you make for n_fft, n_mels, and your hop_length.

This is true of the plain FFT, where the choice of n_fft trades off temporal resolution against frequency resolution because it determines both: you get n_fft//2 + 1 frequency bins with a frequency resolution of sample_rate/n_fft (e.g. 16000/1024 = 15.625 Hz per bin), but each FFT covers n_fft/sample_rate of signal (1024/16000 = 64 ms), which limits your temporal resolution. But this is why you use the STFT. This separates temporal resolution, determined by hop_length, from frequency resolution, set by n_fft.
There’s still a bit of a tradeoff: while you get an FFT every hop_length samples, each one still covers the next n_fft samples, not just those hop_length samples, so it isn’t the direct tradeoff of the plain FFT. Using a window function balances this out a bit, reducing the sort of temporal smearing a larger n_fft would otherwise give. So you are correct that there is still a tradeoff, but it’s not the simple frequency-resolution vs. time-resolution tradeoff of a standard FFT. That’s why, when you raised n_fft from 1024 to 8192, you still got the same 206 time values based on your hop_length.
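You can see that directly (assuming the hop_length of 256 that would give those 206 frames for a 52,480-sample signal):

```python
import torch

signal = torch.randn(52480)
for n_fft in (1024, 8192):
    spec = torch.stft(signal, n_fft=n_fft, hop_length=256,
                      window=torch.hann_window(n_fft), return_complex=True)
    print(n_fft, tuple(spec.shape))
# 1024 (513, 206)    same 206 frames either way: hop_length sets the frame count,
# 8192 (4097, 206)   n_fft only changes the number of frequency bins
```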

And as a very minor quibble, the “humans hear 20Hz-20kHz” figure is commonly quoted but rather inaccurate. That tends to be the range you’d try to design audio electronics to work across, but we don’t really hear the edges of it. The top of hearing is more like 15-17kHz for the very young (and that’s the real limit of perceptibility), 13-15kHz for middle age, then dropping as you get older. Speech tops out below 10kHz and remains intelligible even up to just 4kHz (hence the 8kHz sample rate you see on lower quality stuff). At the bottom end, anything below about 160Hz is not really heard but felt, and a cutoff around there is common even at music events with huge speakers (in part because these lower frequencies require a lot of power to reproduce and still often end up a muddy rumble). I mainly mention this because these edges of the range are what get cut off with various parameter choices, but you shouldn’t generally worry much about trying to preserve that full 20Hz-20kHz range. A 22050 sampling rate, and so an 11kHz cutoff, likely wouldn’t lose much useful information even for music.

5 Likes

The post gave me most of my knowledge, but the rest of the blog is useful too, especially for understanding the assumptions in the notebook. The other places I read were the resampy docs/paper (skimmed)/source and the resample_poly docs/source code.

In a lot of academic work I would ignore anything that is not directly related to the academic value of the approach. resampy is an academic work that was purposefully trying to decrease aliasing in audio; from my understanding they had less interest in performance. The benchmarking is probably something they spent less time on, so I would take it with a handful of salt. It really only had to be “usable”, though they did spend time optimizing it as much as they could.

Renamed the repo

Yes, I’m super interested to see what they do actually. I know that applying transforms on the GPU is going to make a big difference. I had some problems creating tensors in CUDA inside training loops so I didn’t get far.

I’m sure they will probably not use the caching if they do implement things on the GPU, but for most of the pre-transforms you only want to do them once at the beginning anyway.

The resampling, segment and remove silence (pre-transforms) can take quite a while so I’ve added a method to preview them, demonstrated in the Getting Started Notebook.

Thank you, this was really helpful for me. I will make changes for the next update and include some stuff from the notebook you linked and credit you. You’re also welcome to make the changes and PR it if you prefer.

No problem. Again, thanks for your efforts, they spurred me to contribute.
I’ve put up a new version that rewrites the bits I’d just copy-pasted from what was originally going to be a forum post. I also added a bit on window length as I realised I hadn’t actually covered that, just windows in general. https://nbviewer.jupyter.org/gist/thomasbrandon/f0d11593b07dc5ccb2237aec6b4355a5

I’ll hopefully have a look at trying to better integrate it into the existing stuff sometime soon. I think there might be a nice explanation going from the way the FFT shows frequencies across the whole sample, which you nicely show with one of your figures (perhaps adding a little bit there about how phase gives some information on where frequencies are within the sample), to multiple non-overlapping FFTs and how they give information on frequency over time but introduce a trade-off between temporal and frequency resolution as n_fft controls both, to the STFT which separates that out. I think you might also be able to use the coloured-rectangle stuff I used for windows to show how frames work.
But if you’re updating it and want to just throw in some parts now that’s fine.

3 Likes

Thank you @MadeUpMasters for this excellent notebook. It’s a great resource to learn core concepts about audio. Thanks once again :bowing_man:t4:

1 Like

Out of interest, did you try comparing the performance of resampling while loading versus loading and then resampling in a subsequent step? As the data is more likely to still be in cache while loading, you may gain a pretty significant performance boost, which could even eclipse the gains from a more efficient resampling algorithm. There’s example code at https://github.com/pytorch/audio/blob/master/torchaudio/sox_effects.py for resampling while loading with torchaudio. Or librosa can resample as part of load (the sr option), though as that’s Python code you’re likely to see less advantage, since there’s a much greater chance other code will execute between loading and resampling and overwrite the cache.
I’d also note, in case you haven’t done much performance work (not that I have), that given this code will be running alongside various other code that is also reading from disk/memory, processing and loading to the GPU, synthetic benchmarks may not be a great indicator of eventual performance. For instance, if something loads a whole big chunk into cache to process, then when interleaved with other code also processing lots of data you’ll just get a lot of cache misses and heavily reduced performance. A slower algorithm that streams data better may be faster overall.
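Purely for illustration, the comparison I mean is something like this (librosa shown because it’s the simplest to write; the torchaudio sox route in that link is the one I’d actually expect to be fast, and "clip.wav" is a hypothetical 44.1kHz file):

```python
import time
import librosa
from scipy.signal import resample_poly

path = "clip.wav"

t0 = time.perf_counter()
y1, sr1 = librosa.load(path, sr=16000)   # resample while loading
t1 = time.perf_counter()
y0, sr0 = librosa.load(path, sr=None)    # load at the native rate...
y2 = resample_poly(y0, 160, 441)         # ...then resample separately (44100 -> 16000)
t2 = time.perf_counter()

print(f"load+resample: {t1 - t0:.3f}s, load then resample: {t2 - t1:.3f}s")
```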

Yeah, I agree with all of your points. My working assumption was that the difference in time between polyphase and FFT-based methods was so great that GPU batching stuff was unlikely to make an FFT method faster, and if it did, then just writing a polyphase algorithm that works in batches on the GPU (as @marii was working on) would be faster still.

I did try resampling while loading in both librosa and torch. In librosa it’s a non-starter; their load method is 30x slower than torchaudio’s. Sox’s resampling was my 2nd-best option, mostly in the 30-50ms range (torchaudio load is ~1ms; both of these are for a 15s 44.1kHz wav file), but still at least 5x slower.

I’m not going to be working on this much over the next week as I want to focus on the freesound comp, but after that maybe we can work together a bit on it if you have time. Thanks.

OK, yeah, initial profiling I did after posting also showed that librosa is really slow here. The sox results you gave seem a bit more promising. Looking at the code, sox seems to be using polyphase resampling in some cases. Did you try any of the quality options? I would imagine that you don’t necessarily need very high quality here; things that sound bad enough to be unusable normally might well be fine if they don’t affect accuracy (though I could be wrong and would want to test this). One issue would be if your classes use different rates: then you may just end up training the network to recognise your resampling, and higher quality resampling won’t necessarily reduce that. I think that might also be more of a general issue in audio than in other fastai areas, given the greater scope for differences among formats.
I’ve been playing around with the fastai_audio library, just trying out some things to later look at integrating with the existing stuff if it works nicely. One of the changes I made was to remove the dependency on torchaudio, because there currently isn’t a conda package for it and there seem to be some issues with Windows support (which, while not officially supported by fastai, does work). I’ll compare the torchaudio performance now though. I did see some recent stuff from the torchaudio people about getting a conda package together, so that would solve that issue.

I’m not very familiar with the algorithms, but presumably you could fairly easily add frequency-domain resampling: the FFT -> process -> IFFT resampling methods, but without the IFFT. Given the pretty common use of frequency-domain networks, that avoids converting back to the time domain just to do another FFT on it later. Of course that would preclude any time-domain transforms after resampling, so it can’t be the only method, but it could have uses.
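Something like this rough numpy sketch for the downsampling case (ignoring the Nyquist-bin bookkeeping a careful implementation would need):

```python
import numpy as np

def resample_to_spectrum(x, old_sr, new_sr):
    # roughly what FFT-based resampling (e.g. scipy.signal.resample) does,
    # minus the IFFT: truncate the real spectrum to the new Nyquist and rescale
    n0 = len(x)
    n1 = int(n0 * new_sr / old_sr)
    spec = np.fft.rfft(x)
    return spec[: n1 // 2 + 1] * (n1 / n0)   # stay in the frequency domain

spec_16k = resample_to_spectrum(np.random.randn(44100), 44100, 16000)
print(spec_16k.shape)   # (8001,) -> n1//2 + 1 bins for a 16000-sample equivalent
```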

One slight issue I have with using the GPU for transforms is that it will reduce available GPU memory and so achievable batch sizes. Probably more of an issue is that it may make GPU memory usage less deterministic. Currently, as I understand it, if the first epoch succeeds then the whole training will. Having training runs fail in the middle because you hit a particularly large shuffled batch seems troubling. This seems to be more of an issue for audio, where there may be more variation in sizes between items (though I guess people using datasets of images from e.g. Google Images will face similar issues, given this looks to be the way Jeremy wants to move there as well).
There is also the issue that I think data transfers can be a fairly large part of the GPU time. It may be faster overall to do the processing on the CPU if that reduces the amount of data you need to transfer to the GPU. But that’s based on synthetic benchmarks, so it may well not represent a bottleneck in most uses (I’ll post some more on that separately).
Of course, if transforms are implemented in PyTorch then the same code works for both CPU and GPU processing (and you get multi-core parallelisation on the CPU by default). So as long as all processing is tensor-based, you can let users customise exactly what happens on the CPU vs. the GPU without much complication to the code (with sensible defaults in get_transforms, of course). You just have a transform that moves the tensor to the GPU, which can be inserted at the appropriate place in the transforms list; you’d then want all subsequent transforms to be in PyTorch to avoid copying to/from the GPU multiple times.
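Something like this (all names made up, just to illustrate the shape of the idea, not any actual fastai API):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def resample_tf(x):                      # placeholder CPU-side transform
    return x

def to_device(x):                        # the "move to GPU" transform
    return x.to(device)

def spectrogram_tf(x):                   # runs wherever the tensor lives
    return torch.stft(x, n_fft=1024, hop_length=256,
                      window=torch.hann_window(1024, device=x.device),
                      return_complex=True).abs()

tfms = [resample_tf, to_device, spectrogram_tf]   # everything after to_device runs on GPU

x = torch.randn(16000)
for tfm in tfms:
    x = tfm(x)
```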

Just about to look at benchmarking the neural net side of things without any of the file processing. Do you have any benchmarks on that? Especially for non-CNN based models which I have less experience with. What sort of load/process speed is needed to not bottleneck?

Mmm, I’d been looking to have a play with the freesound comp, but it doesn’t look like I’ll get to it before the end. The focus there also seems to be the noisy ground-truth stuff, while I’m still more at the basics stage. Good luck with it. Interested to see your results if you make them public.

1 Like

I think that some of the code at the end of the DataAugmentation.ipynb notebook isn’t properly measuring timing as PyTorch GPU calls are asynchronous. They just schedule operations to be completed and don’t actually perform the work. Some results are correct as data is moved back to the CPU which will block until any outstanding processing is complete.
I wrote some stuff to profile both GPU and CPU code. It uses the torch.cuda stream/event API, which lets you record timing information for events you create, and implements the same interface for CPU code so you can profile both. For simple testing you may also just be able to call torch.cuda.synchronize to ensure all outstanding work is completed, but I saw some mixed information on the reliability of that method. The code isn’t extensively verified (and I’m not entirely sure how you would verify it), but it seems to give reasonable results.
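For reference, the basic pattern that kind of GPU timing relies on (not the Profiler itself, just a minimal sketch timing a batched STFT) is:

```python
import torch

x = torch.randn(64, 16000, device="cuda")        # a batch of 1s clips at 16kHz
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
spec = torch.stft(x, n_fft=1024, hop_length=256,
                  window=torch.hann_window(1024, device=x.device),
                  return_complex=True)
end.record()
torch.cuda.synchronize()                 # block until the queued work has finished
print(start.elapsed_time(end), "ms")     # elapsed GPU time between the two events
```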
Here’s a notebook with the Profiler and some tests of STFT in both librosa and PyTorch: https://gist.github.com/thomasbrandon/a1e126de770c7e04f8d71a7dc971cfb7

On the STFT results, it looks like PyTorch on the CPU might be significantly faster than librosa for STFT. By default PyTorch uses all cores (6 cores here), but even when I limited it to one it was nearly 3x faster (though there were some reports suggesting that torch.set_num_threads may not be entirely reliable, as some underlying libraries may still use multi-core acceleration). Given the batch-size numbers, it looks like PyTorch may be doing batch-based parallelisation even on the CPU, though I don’t entirely trust those numbers and would want to do some less synthetic tests to be sure. There may also be some cache effects dominating the performance, particularly on small batches.
I also suspect that GPU performance may be a fair bit better in non-synthetic tests. I think limiting everything to a single stream for the timing may limit its ability to parallelise, since without the synchronisation I use for timing, operations from subsequent batches could overlap. This is especially true if you use the wait option on the profiler to record intermediate times; I don’t use this for any of the graphs, but there’s a test at the top where I record the copies and the STFT in one run where you can turn off wait to see this.
You do see that for some operations transfer times seem to be a significant limit on performance. Note though that some of those tests are pretty unlikely cases and not intended to represent actual STFT performance in real use. I did only copy back half the complex spectrogram to mirror a magphase separation, but you’d generally want to apply a Mel transform first, which would dramatically reduce the amount of data to transfer. It does suggest, though, that you likely want to avoid copying data back and forth as would happen if you mix CPU and GPU transforms.

2 Likes

“Here is a collab notebook to try it out!”

Would love to try it out, but I get a pop-up window saying:

Notebook loading error

There was an error loading this notebook. Ensure that the file is accessible and try again.

?.

BTW, presumably you all are aware of a lot of similar discussions going on over at https://github.com/keunwoochoi/torchaudio-contrib
Didn’t see it mentioned in this thread though.

1 Like

Not sure how useful this is for Python/PyTorch users, but I’ve been working on porting FAIR’s very fast Mfsc and Mfcc implementations in wav2letter from C++ to Swift (they use FFTW for the Fourier transform).

https://github.com/realdoug/AudioFeature

Figured I’d share here in case anyone is working in Swift and wants to use it, contribute or point out something else that is way better that I should be using instead :laughing:

4 Likes

Hey Scott,

Have you tried using a different browser or going into incognito? Has anyone else had this issue? I can’t seem to recreate it.

Very cool Doug, thanks for sharing. My plan this week was to look into this in Swift, so I think you’ve saved me a ton of work. I have a method of using TensorFlow’s WAV loading, spectrogram and MFCC ops to get a S4TF tensor of MFCCs and/or spectrograms, but I also couldn’t find a good library for general audio manipulation in Swift, particularly for handling resampling, generating melspectrograms, or slicing/concatenation. My plan was to more or less mimic pydub’s API, probably based on the Sox wrapper Jeremy’s already done; I don’t know how far down that road I’ll get, though.

2 Likes