Gone a bit deeper now. Just updated the notebook with some new material, some clean-up, and a bit of added explanation (still not much though).
I tried re-implementing polyphase with dot products rather than convolution, so that you can skip calculating the outputs that will just be dropped while decimating anyway. But it performs horribly, like 50x slower than the convolution approach, even while doing around 1/160th of the calculations (in the case of 48000->44100). As I thought might be the case, it looks like the overhead of spinning up a CUDA kernel for every output sample swamps any advantage (by comparison, the convolution approach uses ~300 kernels, about half of which are convolutions and the rest small data shuffling).
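For reference, this is roughly what I mean by the dot-product formulation (a minimal sketch; the function name and indexing details are mine, not what's in the notebook):

```python
import torch

def polyphase_resample_naive(x, h, up, down):
    # Each output sample is a single dot product between one polyphase
    # branch of the filter and a short window of the input, so samples
    # that decimation would drop are never computed at all.
    taps_per_phase = (h.numel() + up - 1) // up
    pad = taps_per_phase
    xp = torch.nn.functional.pad(x, (pad, 0))  # history for the first outputs
    n_out = (x.numel() * up) // down
    y = torch.empty(n_out, dtype=x.dtype, device=x.device)
    for n in range(n_out):          # one tiny kernel launch per output sample
        t = n * down                # index into the (virtual) upsampled signal
        phase = t % up              # which polyphase branch applies here
        offset = t // up            # rightmost input sample in the window
        branch = h[phase::up]
        L = branch.numel()
        window = xp[offset + pad - L + 1 : offset + pad + 1].flip(0)
        y[n] = torch.dot(window, branch)
    return y

# e.g. 48000 -> 44100:
# y = polyphase_resample_naive(x, h, up=147, down=160)
```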
Can’t really see a way to improve that without a custom kernel. I’ve been playing around with that a bit as something to learn CUDA on, but I don’t really expect to beat the convolution; it’s non-trivial to see how to implement it efficiently. And of course the convolution in PyTorch (coming out of an NVIDIA library) is almost certainly highly optimised, though not necessarily for this case: this is a much, much larger filter kernel than typical.
The basic issue is the massive size of the FIR filter, which I gather is down to the fact that you are filtering the upsampled signal: in the case of 48000->44100 you upsample by 147x (before decimating by 160), so you are filtering at 7.056 MHz. This means you need a really long filter for decent performance. One way to improve performance would be to reduce the filter size; you can get what I think (based on minimal knowledge) is fairly equivalent quality with much smaller filters (in fact the 64K filter I was using from that blog post was just for a comparison with resampy, which uses a custom 64K filter; the default there is 16001 taps). Reducing the size does help performance quite a lot, but it's still slower than FFT.
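To make that concrete, something like this is what I mean by designing a smaller filter (a sketch using scipy.signal.firwin; the cutoff and gain-scaling choices are my assumptions about what's sensible, not taken from the blog post):

```python
from math import gcd
from scipy.signal import firwin

def design_resample_filter(orig_sr, new_sr, numtaps=16001):
    # The filter runs at the intermediate (upsampled) rate, so its
    # cutoff is the Nyquist frequency of the lower of the two rates.
    # numtaps=16001 is just the default mentioned above.
    g = gcd(orig_sr, new_sr)
    up, down = new_sr // g, orig_sr // g   # 147 and 160 for 48000 -> 44100
    inter_sr = orig_sr * up                # 7.056 MHz in that case
    cutoff = min(orig_sr, new_sr) / 2
    h = firwin(numtaps, cutoff, fs=inter_sr)
    return h * up, up, down                # scale by `up` to restore unity
                                           # gain after zero-stuffing

h, up, down = design_resample_filter(48000, 44100)
```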
There’s a bit in the notebook with some analysis of FIR filter performance; I’m still learning about this though, so I'm not really sure how best to assess it.
From the docs: "the cuFFT Library employs the Cooley-Tukey algorithm … This algorithm expresses the DFT matrix as a product of sparse building block matrices. The cuFFT Library implements the following building blocks: radix-2, radix-3, radix-5, and radix-7. Hence the performance of any transform size that can be factored as 2^a * 3^b * 5^c * 7^d (where a, b, c, and d are non-negative integers) is optimized in the cuFFT library. There are also radix-m building blocks for other primes, m, whose value is < 128. When the length cannot be decomposed as multiples of powers of primes from 2 to 127, Bluestein’s algorithm is used. Since the Bluestein implementation requires more computations per output point than the Cooley-Tukey implementation, the accuracy of the Cooley-Tukey algorithm is better."
I’ll look to do some tests to see how much that matters.
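Something like this should show the effect (a rough sketch; 1048573 is, as far as I know, the largest prime below 2^20, and as a prime > 127 it should force the Bluestein path):

```python
import torch

def time_rfft(n, reps=100, device="cuda"):
    # Rough GPU timing of an rfft of length n (one warm-up run first so
    # cuFFT plan creation isn't counted).
    x = torch.randn(n, device=device)
    torch.fft.rfft(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(reps):
        torch.fft.rfft(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / reps  # milliseconds per transform

print(time_rfft(2 ** 20))   # 1048576 = 2^20, pure radix-2
print(time_rfft(1048573))   # prime > 127, so Bluestein's algorithm
```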
I did also note while looking around that, according to this post, simple FFT/IFFT resampling is “non-ideal in that it can introduce ringing and other nasty artifacts in the time domain”, and it suggests applying frequency-domain filtering (others in the thread say to use polyphase instead). This frequency-domain filtering takes advantage of the convolution theorem, which says that convolution in the time domain, as in applying an FIR filter, is equivalent to multiplication in the frequency domain. So you FFT your standard FIR filter and then multiply that with the FFT’d signal. That should be easy and performant enough; you just need an appropriate FIR filter (scipy has functions for generating them to a specification). Frequency-domain filtering would also be a nice operation to support in its own right for cases where it's desirable (thinking more of EEG/ECG/etc. than music/speech, where I'm not sure of any particular use case).
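A minimal sketch of that, assuming a low-pass designed with scipy.signal.firwin (the cutoff here is purely for illustration):

```python
import torch
from scipy.signal import firwin

def fft_filter(x, h):
    # Convolution theorem: multiply in the frequency domain instead of
    # convolving in time. Zero-pad both to the full linear-convolution
    # length so the circular convolution doesn't wrap around.
    n = x.numel() + len(h) - 1
    H = torch.fft.rfft(torch.as_tensor(h, dtype=x.dtype, device=x.device), n=n)
    X = torch.fft.rfft(x, n=n)
    return torch.fft.irfft(X * H, n=n)

# e.g. low-pass a 48 kHz signal at 20 kHz before FFT resampling
h = firwin(1025, 20000, fs=48000)
y = fft_filter(torch.randn(48000, device="cuda"), h)
```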
A further option might be an overlap-and-add (or the similar overlap-and-save) method for the FFT-based conversion (the author of the post above suggests this for doing frequency-domain filtering). With this method you divide the input into chunks, process each with a small FFT, and recombine the overlapping pieces. This might improve FFT performance a bit more by turning each file into a batch of smaller FFTs, which the docs say improves performance. You could also then optimise those fixed (for a given rate conversion) chunk sizes, which is easier than trying to optimise for arbitrarily sized inputs.
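For the overlap-add variant, a sketch of the chunked version (the chunk size here is arbitrary; in practice you'd pick it so chunk + len(h) - 1 lands on one of those "nice" cuFFT sizes):

```python
import torch

def overlap_add_filter(x, h, chunk=4096):
    # x and h are 1-D tensors on the same device. Filter fixed-size
    # chunks with small FFTs and sum the overlapping tails back
    # together; every chunk reuses the same filter spectrum H.
    n_fft = chunk + h.numel() - 1
    H = torch.fft.rfft(h, n=n_fft)
    y = torch.zeros(x.numel() + h.numel() - 1, dtype=x.dtype, device=x.device)
    for start in range(0, x.numel(), chunk):
        seg = x[start : start + chunk]         # rfft zero-pads a short tail
        conv = torch.fft.irfft(torch.fft.rfft(seg, n=n_fft) * H, n=n_fft)
        y[start : start + n_fft] += conv[: y.numel() - start]
    return y
```

The loop could also be replaced by stacking the chunks and doing one batched rfft, which is where the batching win the docs mention would actually come in.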
Yeah, or I’d probably submit to torchaudio-contrib (the staging/playground for torchaudio). If you haven’t seen it, it’s worth checking out as it has implementations that better mirror the librosa API, which I think is preferable. There's not a lot there at the moment, but some nice little methods. Also, any other re-implementations of librosa functionality would likely be accepted there.