Deep Learning with Audio Thread

Nice work Robert. I'm not an expert by any means, but I have some familiarity with audio and it mostly seems correct to me.
However, I think you're a bit off on FFT length and hop length.

> Hop_length is the size (in number of samples) of those chunks. If you set hop_length to 100, the STFT will divide your 52,480 sample long signal into 525 chunks, compute the FFT (fast fourier transform, just an algorithm for computing the FT of a discrete signal) of each one of those chunks.

Hop length isn't the size of the chunks, it is the spacing between them. Each chunk is n_fft samples long, but chunks start hop_length samples apart, so each chunk overlaps the next by (n_fft - hop_length) samples.
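To make that concrete, here's a rough sketch of the framing in plain Python, reusing the 52,480-sample and hop_length=100 numbers from the quote (and ignoring the centre padding that torch.stft and librosa apply by default, which is where a chunk count like 525 comes from):

```python
# Sketch of STFT framing: chunks are n_fft samples long but start
# hop_length samples apart, so consecutive chunks overlap by
# n_fft - hop_length samples. Centre padding is ignored here.
signal_length = 52480
n_fft = 1024
hop_length = 100

starts = list(range(0, signal_length - n_fft + 1, hop_length))
for start in starts[:3]:
    print(f"chunk covers samples {start}..{start + n_fft - 1}")
# chunk covers samples 0..1023
# chunk covers samples 100..1123
# chunk covers samples 200..1223   -> overlap of 1024 - 100 = 924 samples
```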

> output of each FFT will be a 1D tensor with n_fft # of values

It is actually a tensor of length (n_fft//2) + 1, so with an n_fft of 1024 there will be 513 values
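You can check this directly with torch.stft (assuming that's what's being used; librosa behaves the same way). With the same 52,480-sample placeholder signal and a hop_length of 100:

```python
import torch

# torch.stft returns n_fft // 2 + 1 frequency bins per frame -- the
# one-sided spectrum of a real signal -- not n_fft values.
signal = torch.zeros(52480)          # placeholder signal, as in the notebook
n_fft, hop_length = 1024, 100

spec = torch.stft(signal, n_fft=n_fft, hop_length=hop_length,
                  window=torch.hann_window(n_fft), return_complex=True)
print(spec.shape)  # torch.Size([513, 525]): 1024 // 2 + 1 = 513 bins x 525 frames
```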

Window length is different again (and I'm a bit less clear here, but I think this is correct, or at least close). First the signal is split into n_fft-sized chunks spaced hop_length samples apart. Then the "window function" (function in the mathematical sense) is applied to each of those chunks. There seem to be tricks you can use with window lengths larger or smaller than your n_fft to accomplish various things which I don't really understand. By default win_length = n_fft.
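A small sketch of what I mean, with a win_length shorter than n_fft (the 512 is just an arbitrary value for illustration); as far as I understand it, torch.stft zero-pads the window back up to n_fft before applying it:

```python
import torch

# The window function is applied element-wise to each chunk before its
# FFT. By default win_length == n_fft; here win_length is shorter, and
# (as I understand it) the window gets zero-padded back up to n_fft.
n_fft, win_length, hop_length = 1024, 512, 100
signal = torch.zeros(52480)

spec = torch.stft(signal, n_fft=n_fft, hop_length=hop_length,
                  win_length=win_length, window=torch.hann_window(win_length),
                  return_complex=True)
print(spec.shape)  # still torch.Size([513, 525]); win_length doesn't change the shape
```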

I put together a notebook illustrating this here. At first I just tested some things to verify for myself that I was correct (again, no expert, so feel free to correct me if you think I'm wrong, as I may well be). So rather than editing the existing notebook, I added explanatory text to mine instead of commenting here. Feel free to integrate it into the existing one, or I'll look at doing that at some point. I didn't add any code to produce meaningful signals (just zeros), which you did nicely, so I couldn't cover some of that side.

I think you are also a bit off when you say:

> When we increase resolution in the frequency dimension (y-axis), we lose resolution in the time dimension, so there is an inherent tradeoff between the choices you make for n_fft, n_mels, and your hop_length.

This is true of the plain FFT, where the choice of n_fft trades off temporal resolution against frequency resolution because it determines both: you get n_fft//2 frequency bins, each sample_rate/n_fft wide (e.g. 16000/1024 = 15.625Hz), but your temporal resolution is limited to n_fft/sample_rate, e.g. 1024/16000 = 0.064s, i.e. 64 milliseconds. But this is why you use the STFT. It separates temporal resolution, determined by hop_length, from frequency resolution, set by n_fft.
There's still a bit of a tradeoff: while you get an FFT every hop_length samples, each one still summarises frequencies over the next n_fft samples, not just those hop_length samples. But it isn't the direct tradeoff of the plain FFT, and using a window function balances this out a bit, reducing the sort of temporal smearing a larger n_fft would give without one. So you are correct that there is still a tradeoff, but it's not the simple frequency resolution vs. time resolution of a standard FFT. That's why, when you raised n_fft from 1024 to 8192, you still got the same 206 time values based off your hop_length.
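You can see that in the shapes. A hop_length of 256 is my guess at the notebook's value, since 52,480 // 256 + 1 = 206 matches your 206 time steps:

```python
import torch

# With the STFT, hop_length fixes the number of time frames while n_fft
# only changes the number of frequency bins. hop_length=256 is assumed
# here because it reproduces the 206 time steps mentioned above.
signal = torch.zeros(52480)
hop_length = 256

for n_fft in (1024, 8192):
    spec = torch.stft(signal, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    print(n_fft, tuple(spec.shape))
# 1024 (513, 206)
# 8192 (4097, 206)
```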

And as a very minor quibble, the "humans hear 20Hz-20kHz" figure is commonly quoted but rather inaccurate. That tends to be the sort of range you'd try to design audio electronics to work across, but we don't really hear the edges of it. The top of hearing is more like 15-17kHz for the very young (and that's the real limit of perceptibility), 13-15kHz in middle age, then dropping as you get older. Speech tops out below 10kHz and remains intelligible even limited to 4kHz (hence the 8kHz sample rate you see on lower-quality stuff). At the bottom end, anything below about 160Hz is not really heard but felt, and a cutoff around there is common even at music events with huge speakers (in part because those lower frequencies require a lot of power to reproduce and still often end up as a muddy rumble). I mainly mention this because these outer parts of the range are what get cut off by various parameter choices, but you shouldn't generally worry much about trying to preserve that full 20Hz-20kHz range. A 22,050Hz sample rate, and so an 11kHz cutoff, likely wouldn't lose much useful information even for music.
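If you do want to bake that into the features, the mel transforms let you restrict the frequency range directly. This is just an illustrative sketch using torchaudio's MelSpectrogram f_min/f_max arguments, with made-up parameter values rather than anything from the notebook:

```python
import torch
import torchaudio

# Illustrative only: restrict the mel filterbank to roughly the band
# that carries most useful information, rather than the full 20Hz-20kHz.
# All parameter values here are made up for the example.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256,
    n_mels=64, f_min=20.0, f_max=8000.0)

spec = mel(torch.zeros(52480))
print(spec.shape)  # torch.Size([64, 206])
```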
