Deep Learning with Audio Thread

@MadeUpMasters Found a method to do it in the librosa.effects module:

Essentially yes, it trims the silence from the beginning and end of a sample. It would be extremely useful as a way of pre-processing multiple segments out of very sparse data, for example a 1-minute whale recording with 5 calls in it but mostly noise and silence for the rest.

from librosa.effects import split

def chop_silence(signal, rate, threshold=70, pad_ms=200):
    # Trim leading/trailing silence, keeping pad_ms of context on either side
    actual = signal.clone().squeeze()
    padding = int(pad_ms / 1000 * rate)
    splits = split(actual.numpy(), top_db=threshold)
    start = max(splits[0, 0] - padding, 0)
    return actual[start:splits[-1, -1] + padding]
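
Usage would look something like this (the filename is just a placeholder):

import torchaudio

sig, sr = torchaudio.load('whale_recording.wav')  # placeholder path
trimmed = chop_silence(sig, sr, threshold=70, pad_ms=200)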


It does require numpy to work though. Maybe we could torchify.

When I’ve tried to do transformations on the GPU, I’ve had CUDA initialisation errors during training.

1 Like

I use this, as it comes with loads of cool extensions: https://github.com/jupyter-contrib/jupyter_nbextensions_configurator

One of them is a gist publisher which I just set up and it works :slight_smile:

FIXED

Needed to upgrade torch to 1.1.0 and re-run python setup.py install

Problem

So I’m not actually able to get torchaudio to work at all now :frowning: I’ve created a conda environment like so:

conda create -n fastai-audio
conda activate fastai-audio
sudo apt-get --assume-yes install ffmpeg sox libsox-dev libsox-fmt-all
pip install pydub librosa fire --user
conda install -c fastai fastai 
git clone https://github.com/pytorch/audio torchaudio 
cd torchaudio 
python setup.py install

Then trying to load a wav file, which has worked recently:

import torchaudio

## Need this because the source tar file doesn't extract to its own folder
path = '/home/h/.fastai/data/ST-AEDS-20180100_1-OS/m0005_us_m0005_00448.wav'
try:
    sig, sr = torchaudio.load(path)
except Exception as e:
    print(e)

Gives me:

Traceback (most recent call last):
  File "crash.py", line 1, in <module>
    import torchaudio
  File "/home/h/miniconda3/envs/audio/lib/python3.6/site-packages/torchaudio-0.2-py3.6-linux-x86_64.egg/torchaudio/__init__.py", line 5, in <module>
    import _torch_sox
ImportError: /home/h/miniconda3/envs/audio/lib/python3.6/site-packages/torchaudio-0.2-py3.6-linux-x86_64.egg/_torch_sox.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1019UndefinedTensorImpl10_singletonE

Tried to:

  • Re-install conda
  • clean conda cache conda clean --all

Has anyone been able to create a fresh environment recently?

1 Like

Congrats @zachcaceres and @JennyCai, nice work!

https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e

4 Likes

Great work! Will it be eventually a part of fast.ai.audio library?

Seems to think I’m one of the females from the 10 Speakers dataset :slight_smile:

4 Likes

@MadeUpMasters and I have created a fork of the incredible audio module that @ste @zachcaceres @ThomM started.

We’ve added:

  • A Tutorial Notebook on using the module
  • A Tutorial notebook on Audio
  • Spectrogram Caching (speeds up training on the CPU)
  • Resampling Pre Processor
  • Split by Silence (To remove large chunks of dead noise)
  • Split into segments (Windowing in the next iteration)
  • Refactored some of the classes
  • SpecAugment (no time warp yet) Transformation
  • Shift + Roll Transformation
  • Testing
  • A smaller version of the VoxCeleb2 dataset
  • Inference method

We’d love to hear your thoughts on it.

Here is a Colab notebook to try it out!

It was a difficult but great learning experience getting the different processes working throughout the data pipeline. Looking at the new v2 API, I think it will be much easier to do these things.

9 Likes

Great Job @baz!

With this change to cell [2] you can run the notebook without accessing your Google Drive :wink:

#Setup without google drive access
data_folder = 'data'
!git clone https://github.com/mogwai/fastai-audio.git
%cd fastai-audio
!git pull
!bash install.sh
2 Likes

Ok cool yeah that is better! Thanks for the tip :wink: Changed

1 Like

Added a lot to the Intro to Audio Guide. Would love feedback, both from people learning audio to tell me where I could be clearer, and from experts to tell me where I am wrong or imprecise. Cheers.

9 Likes

So cool Robert! So much effort, well done! I think your explanations are very clear, and the build-up of context is great. You should promote this more widely!

I’m looking forward to when you explain & compare the rest of the spectrogram params, especially window size vs. hop length (something that took me ages to figure out, and I’m not even certain that I have! I’m pretty sure “hop length” is the number of frequencies considered in a STFT, and “window size” is the amount of overlap between those “hop length”-wide buckets. I get confused because I would naturally swap those two terms around if I were naming them myself).

Certainly I can’t spot anything factually incorrect you’ve mentioned.

It could be worth pointing out why the visualisations from librosa.specshow and Image look different; i.e. the axis order is upside-down & the librosa function does some transformations to accommodate that. But also point out that it makes (or should make) no difference to training a space-invariant model, as the data is the same.

Also, I’m guessing you just haven’t got to it yet, but I’m sure most beginners would really benefit from a “OK so just tell me what parameters I need to care about and what values I should use” summary section. You allude to this, and correctly mention that the defaults aim to be sensible anyway, but I think as a bare beginner getting started it could be made more explicit.

The only thing I’d concretely suggest is pretty meta and that’s sharing your notebook using nbviewer e.g.

Which renders it much more nicely, lets you play the audio samples, etc. :slight_smile:

Thanks so much for your contribution to the community here, I think it’s incredibly useful, it would’ve saved me weeks of fumbling a couple of months ago :slight_smile:

2 Likes

Nice work guys, a lot of effort in here :slight_smile: Keep in mind that the next version of fastai is expected to include some of these features (some taken from your implementations) directly in the main library, so we might not even need a separate module for long. Of course, who knows what will actually happen once Jeremy & Sylvain “push the button” :slight_smile:

I’m particularly interested in the resampling preprocessor - @marii did you get much further with your resampling investigations in the end?

Oh yeah - one other thing - if you rename the repository to fastai_audio or fastaiaudio or anything without a hyphen in the name, then it will be a valid python module name that can actually be imported… a decent quality of life improvement :slight_smile:

1 Like

Thanks a lot, I really appreciate the detailed feedback.

I noticed the spectrograms were coming out upside-down with Image but honestly wasn’t entirely sure why. I tried flipping them, but neither np nor torch supports negative strides yet. Do you have any suggestions for an easy fix to give them the same perspective?
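
Edit: it looks like torch.flip might work here, since it returns a flipped copy rather than a view, and np.flipud works too if you force a copy. A rough sketch, with spec as a stand-in tensor:

import numpy as np
import torch

spec = torch.rand(128, 206)  # stand-in mel spectrogram, shape (n_mels, time)

# torch.flip copies the data, so no negative strides are involved
flipped = torch.flip(spec, dims=[0])

# or via numpy: np.flipud returns a negative-stride view, so force a copy
# before handing it back to torch
flipped_np = torch.from_numpy(np.flipud(spec.numpy()).copy())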

This is coming once I get through all the parameters; it will be at the top of the parameters section and linked at the top of the document.

Yeah it would be ideal to not have a separate module. We weren’t intending to create a separate one, but while using yours we both just started playing around and making breaking changes, and ended up having similar ideas, so we combined them. It was a really great learning experience for me, as I wouldn’t have been able to make it from scratch, but with your code and the fastai docs I was able to figure things out, so thank you.

2 Likes

I spent wayyy too much time on testing resampling to get the time down. There’s actually one more optimization that needs to be made, and a param to let the user choose the resampling type, as well as testing to make sure there is no quality loss when training data is resampled using polyphase (non-FFT-based) resampling. The cliff notes of my findings are below. More will be in the audio guide.

There are 5 functions I considered for resampling; the benchmarks below are for a 15-second clip @ 44.1kHz:

  • librosa resample (incredibly slow, ~500ms)
  • resampy resample (slow, ~200ms)
  • sox resample using torchaudio (relatively fast but still too slow, ~50ms)
  • scipy.signal.resample (fast, but occasionally ruinously slow: ~30ms to 3 minutes)
  • scipy.signal.resample_poly (very fast, but occasionally slow: ~3ms to 150ms)
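
If anyone wants to roughly reproduce these numbers, here is a minimal timing sketch (the clip is just random noise, and absolute times will vary by machine):

import time
import numpy as np
import librosa
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
clip = np.random.randn(sr_in * 15).astype(np.float32)  # 15 second test clip

start = time.time()
librosa.resample(clip, orig_sr=sr_in, target_sr=sr_out)
print('librosa.resample:', time.time() - start)

start = time.time()
resample_poly(clip, 160, 441)  # 16000/44100 reduced by their gcd of 100
print('resample_poly:', time.time() - start)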

I spent a lot of time nailing down why scipy.signal.resample and scipy.signal.resample_poly were sometimes slow, and I got a satisfyingly concrete answer that allows me to predict pretty precisely how long it will take.

scipy.signal.resample has two bottlenecks, an FFT and an IFFT. If the clip is n0 samples long, the FFT will be done on a sample of length n0 and the inverse FFT on a clip of length n1 samples where n1 = n0*new_sample_rate/old_sample_rate. If the greatest prime factor of n0 is large, it will be slow. If the greatest prime factor of n1 is large, it will be about 10x as slow (this is how you get 4 minute resampling). This is because the underlying FFT algorithm, Cooley-Tukey, is optimized for highly composite numbers, but handles other numbers poorly.
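
If you want to sanity-check a clip before handing it to scipy.signal.resample, something like this predicts the slow cases (largest_prime_factor is just a quick helper sketched here, not part of any library):

def largest_prime_factor(n):
    # simple trial division; fine for numbers the size of sample counts
    factor, largest = 2, 1
    while factor * factor <= n:
        if n % factor == 0:
            n //= factor
            largest = max(largest, factor)
        else:
            factor += 1
    return max(largest, n)

n0 = 15 * 44100                  # samples in a 15s clip at 44.1kHz
n1 = n0 * 16000 // 44100         # samples after resampling to 16kHz
print(largest_prime_factor(n0), largest_prime_factor(n1))  # 7 and 5: both small, so fast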

There are a number of options for fixing this. One is padding to lengths with small prime factors, but this is tricky because you have to find a number that satisfies it for both n0 and n1 without a massive size increase. Another is replacing the FFT calls in scipy.signal with an FFT algorithm that doesn’t have as much trouble with prime-length input, like Rader’s or Bluestein’s. But if polyphase resampling doesn’t affect our ML training, then you probably don’t have to go to the trouble, because it’s almost always faster.

Polyphase resampling doesn’t use an FFT, is almost always fast, and is, in my very unscientific testing so far, indistinguishable from FFT-based resampling. It is slow sometimes, but only in rare cases where the greatest common divisor of the sample rates is very low (< 100). Given that most sample rates you would want come from a standard list with a GCD > 100 between them, this is unlikely to be a problem. But if someone wants to resample from 44100 to 15997 (GCD = 1), it is going to take ~10ms per second of audio (150ms per 15s clip).
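
You can see how the sample-rate pair matters by looking at the reduced up/down ratio (just a quick check, not part of the module):

from math import gcd

for sr_in, sr_out in [(44100, 16000), (48000, 16000), (44100, 15997)]:
    g = gcd(sr_in, sr_out)
    # resample_poly works with the reduced ratio; a tiny gcd means huge
    # up/down factors and a correspondingly long polyphase filter
    print(f'{sr_in} -> {sr_out}: gcd={g}, up={sr_out // g}, down={sr_in // g}')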

For now we just use polyphase resampling, but I want to give the user an FFT based option as I can’t say for sure it won’t give better results, but I’ve put it off because scipy resample is fastest except when it explodes, so I need to substitute another lib, like resampy, for large greatest prime factors.

2 Likes

@ThomM I am pretty stuck on the re-sampling actually, and as of right now I think it would require jit/swift or something else to accelerate it on the GPU. The algorithm I was looking into required upsampling (adding a lot of zeros between each data point), applying a conv1d, and then downsampling (picking every nth sample). I was not able to find an efficient way to implement this using pytorch, including looking at their sparse matrix operations. Other options seemed to require custom build scripts and such. Not really sure of a way to add this without adding a makefile to fastai, which seems to be what Jeremy wants to avoid (for good reason).

I have implemented it using standard pytorch operations, but the extra calculations on all the zeros introduced by upsampling bring the performance down to being slower than just doing the operations with the current libraries.
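
For anyone curious, this is roughly what that zero-stuff / conv1d / decimate approach looks like in plain PyTorch (a sketch only: resample_naive is just a name used here, and designing the low-pass kernel is left out):

import torch
import torch.nn.functional as F

def resample_naive(signal, up, down, kernel):
    # signal: 1D tensor; kernel: 1D low-pass FIR filter coefficients
    # 1. upsample by inserting (up - 1) zeros between samples
    stuffed = signal.new_zeros(signal.numel() * up)
    stuffed[::up] = signal
    # 2. low-pass filter with a 1D convolution (this is where all the
    #    wasted multiply-by-zero work happens)
    filtered = F.conv1d(stuffed.view(1, 1, -1), kernel.view(1, 1, -1),
                        padding=kernel.numel() // 2).squeeze()
    # 3. downsample by keeping every down-th sample, with gain correction
    return filtered[::down] * up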

@MadeUpMasters I was basically looking to implement scipy.signal.resample_poly on the GPU.

My main resource for understanding resampling is here:

Takeaway: scipy.resample and scipy.resample_poly have bad aliasing. resampy’s resample is better for less aliasing. A custom filter allows you to get less aliasing with scipy.resample_poly.
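
For reference, the custom-filter idea looks roughly like this with resample_poly (the filter length and Kaiser beta are illustrative choices, not taken from the blog):

from math import gcd
import numpy as np
from scipy.signal import firwin, resample_poly

sr_in, sr_out = 44100, 16000
g = gcd(sr_in, sr_out)
up, down = sr_out // g, sr_in // g   # 160, 441

# a sharper low-pass FIR than the default ('kaiser', 5.0) window; the filter
# runs on the upsampled signal, so the cutoff is 1/max(up, down) of Nyquist
taps = firwin(10 * max(up, down) + 1, 1 / max(up, down), window=('kaiser', 9.0))

clip = np.random.randn(sr_in * 15).astype(np.float32)
resampled = resample_poly(clip, up, down, window=taps)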

@MadeUpMasters Also… librosa sometimes uses resampy (I checked the source), so I’m not sure which one you benchmarked?

2 Likes

Wow looks like we went down the same rabbit hole, but you went a lot deeper :slight_smile: . The linked notebook looks great. Is it just that post that helped you gain your level of understanding? Or the whole signalsprocessed blog?

Yeah, I think that’s actually how I discovered resampy. I didn’t do great benchmarking on librosa or resampy because I already had a wide range on scipy, and when I tried a few things and saw 200ms+, I realized neither of those was going to be fast enough. I was also really shocked by the “benchmarking” documentation of resampy. They only give one example, a sample that feeds a prime into scipy’s ifft so it takes forever, and use this outlier to say they are much faster (they probably are on average, but at 200ms they’re slower in the majority of cases). I found it really misleading.

Nice work Robert, I’m not an expert by any means but I have some familiarity with audio and it seems mostly correct.
However, I think you are a bit off on FFT length and hop length.

Hop_length is the size (in number of samples) of those chunks. If you set hop_length to 100, the STFT will divide your 52,480 sample long signal into 525 chunks, compute the FFT (fast fourier transform, just an algorithm for computing the FT of a discrete signal) of each one of those chunks.

Hop length isn’t the size of the chunks, it is the spacing between them. Each chunk is n_fft samples long, but chunks are spaced hop_length samples apart. So each chunk will have an (n_fft - hop_length) sample overlap with the next chunk.

output of each FFT will be a 1D tensor with n_fft # of values

It is actually a tensor of length (n_fft//2) + 1, so with an n_fft of 1024 there will be 513 values

Window length is different again (and I’m a bit less clear here, but think this is correct-ish at least). First the signal is split into n_fft sized chunks spaced hop_length samples apart. Then the “window function” (function in the mathematical sense) is applied to each of those chunks. There seem to be tricks you can use with window lengths larger/smaller than your n_fft to accomplish various things which I don’t really understand. By default win_length = n_fft.
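
A quick way to check these shapes (using librosa.stft and the 52,480-sample, hop_length=100 example quoted above):

import numpy as np
import librosa

y = np.zeros(52480, dtype=np.float32)  # silent signal, just to inspect shapes
spec = librosa.stft(y, n_fft=1024, hop_length=100, win_length=1024)
print(spec.shape)  # (513, 525): (n_fft//2 + 1) frequency bins, one frame per hop_length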

I put together a notebook illustrating this here. At first I just tested some things to verify for myself that I was correct (again, I’m no expert, so feel free to correct me if you think I’m wrong, as I may well be), so I didn’t edit the existing one, but then added text rather than commenting here. Feel free to integrate it into the existing one, or I’ll look at that at some point. I didn’t add any code to produce meaningful signals (just zeros), which you did nicely, so I couldn’t cover some of that side.

I think you are also a bit off when you say:

When we increase resolution in the frequency dimension (y-axis), we lose resolution in the time dimension, so there is an inherent tradeoff between the choices you make for n_fft, n_mels, and your hop_length.

This is true of the FFT, where the choice of n_fft trades off temporal resolution for frequency resolution, as it determines both. You have (n_fft//2)+1 frequency bins spaced sample_rate/n_fft apart (e.g. 16000/1024 = 15.625Hz), but your temporal resolution is limited to n_fft/sample_rate (e.g. 1024/16000 = 0.064, i.e. 64 milliseconds). But this is why you use the STFT. This separates temporal resolution, determined by hop_length, from frequency resolution, set by n_fft.
There’s still a bit of a tradeoff: while you get an FFT every hop_length samples, it is still giving you frequencies over the next n_fft samples, not just those hop_length samples, but it isn’t the direct tradeoff of the FFT. And using a window function will balance this out a bit, reducing the sort of temporal smearing a larger n_fft would give without a window function. So you are correct that there is still a tradeoff, but it’s not the simple frequency resolution vs. time resolution of a standard FFT. Thus you see that when you raised n_fft from 1024 to 8192 you still got the same 206 time values, based off your hop_length.
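
To make that last point concrete (assuming a hop_length of 256 on the same 52,480-sample clip, which reproduces the 206 frames):

import numpy as np
import librosa

y = np.zeros(52480, dtype=np.float32)
for n_fft in (1024, 8192):
    print(n_fft, librosa.stft(y, n_fft=n_fft, hop_length=256).shape)
# 1024 -> (513, 206) and 8192 -> (4097, 206): more frequency bins,
# same 206 time frames, because the frame count depends only on hop_length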

And as a very minor quibble, the humans hear 20Hz-20kHz is a commonly quoted but rather inaccurate number. That tends to be the sort of range you’d try and design audio electronics to work across but we don’t really hear the edges of that range. The top of hearing is more like 15-17kHz for the very young (and that’s the real limits of perceptibility), 13-15kHz for middle age, then dropping as you get older. And speech tops out below 10kHz and even just up to 4kHz remains intelligible (hence the 8kHz sample rate you see on lower quality stuff). At the bottom end anything below about 160Hz is not really heard but felt and a cutoff around here is common even at music events with huge speakers (in part due to these lower frequencies requiring a lot of power to reproduce and still often just being a muddy rumble). I mainly mention this because these outsides of the range are what are cutoff with various parameter choices but you shouldn’t generally worry much about trying to preserve that full 20Hz-20kHz range. A 22050 sampling rate, and so 11kHz cutoff, likely wouldn’t lose much useful information even for music.

5 Likes

The post gave me most of my knowledge, but the rest of the blog is useful, especially when understanding assumptions in the notebook. Other places I read were the resampy docs/paper (skimmed)/source and the resample_poly docs/source code.

In a lot of academic work, I would ignore anything that is not directly related to the academic value of the approach. Resampy is an academic work that was purposefully trying to decrease aliasing in audio; they had less interest in performance, from my understanding. The benchmarking in this case is probably something they spent less time on, and therefore I would take it with a handful of salt. It really only had to be “usable”, though they did spend time optimizing it as much as they could.

Renamed the repo

Yes I’m super interested to see what they do actually. I know that applying transforms on the GPU is going to make a big difference. I had some problems creating tensors in CUDA on training loops so didn’t get far.

I’m sure they will probably not use the caching if they do implement things on the GPU, but for most of the pre-transforms you only want to do them once at the beginning anyway.

The resampling, segmenting, and silence-removal pre-transforms can take quite a while, so I’ve added a method to preview them, demonstrated in the Getting Started Notebook.