[Invitation to open collaboration] Practice what you learn in the course and help animal researchers! 🐵

You can find it here @gautam_e https://github.com/earthspecies/open_collaboration_on_audio_classification/blob/master/vanilla_datablock_torchaudio_xresnet.ipynb

Thanks @dhoa!

I can confirm that librosa and torchaudio differ in their conversion of amplitude to dB. I loaded the same wave file with torchaudio and fed it to the mel spectrogram and amplitude-to-dB functions of both libraries, and they give different results (see below).
The outputs of the mel spectrograms alone, however, look similar (probably identical).

import librosa
import torch
import torchaudio

# target_rate, n_fft, hop_length and num_samples are defined earlier in the notebook
au2spec = torchaudio.transforms.MelSpectrogram(sample_rate=target_rate, n_fft=n_fft, hop_length=hop_length, n_mels=64)
ampli2db = torchaudio.transforms.AmplitudeToDB()

def get_x(path, target_rate=target_rate, num_samples=num_samples):
    x, rate = torchaudio.load_wav(path)
    if rate != target_rate:
        x = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate, resampling_method='sinc_interpolation')(x)
    x = x[0] / 32767  # first channel, scaled from the int16 range to roughly [-1, 1]
    x = x.numpy()
    x = librosa.util.fix_length(x, size=num_samples)  # pad or trim to a fixed length
    torch_x = torch.tensor(x)
    spec = au2spec(torch_x)
    spec = ampli2db(spec)
    spec = spec.data.squeeze(0).numpy()
    # rescale to the 0-255 range
    spec = spec - spec.min()
    spec = spec/spec.max()*255
    return spec


Screenshot 2020-04-11 at 00.35.23


def get_x(path, target_rate=target_rate, num_samples=num_samples):
    x, rate = torchaudio.load_wav(path)
    if rate != target_rate:
        x = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate, resampling_method='sinc_interpolation')(x)
    x = x[0] / 32767  # first channel, scaled from the int16 range to roughly [-1, 1]
    x = x.numpy()
    x = librosa.util.fix_length(x, size=num_samples)  # pad or trim to a fixed length
    spec = librosa.feature.melspectrogram(y=x, sr=target_rate, n_fft=n_fft, hop_length=140, n_mels=64)
    # note: melspectrogram returns a power spectrogram by default, while amplitude_to_db assumes amplitudes
    spec = librosa.amplitude_to_db(spec)
    # rescale to the 0-255 range
    spec = spec - spec.min()
    spec = spec/spec.max()*255
    return spec


Screenshot 2020-04-11 at 00.37.22

This may explain why torchaudio performs better: the image produced by torchaudio looks a lot richer in features than librosa’s.

Or have I made a mistake (it’s quite late here and my 4-month-old son leaves me quite sleep deprived :wink: )?
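
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: both mel spectrogram functions return power values by default, torchaudio's AmplitudeToDB treats its input as power by default (10·log10, no clipping), while librosa.amplitude_to_db treats its input as amplitude (20·log10) and by default clips everything more than top_db=80 dB below the peak. The sketch below re-implements both conversions in plain Python on made-up toy values; the helper names and the toy data are assumptions for illustration, not the libraries' own code.

```python
import math

def power_to_db(values, top_db=None, amin=1e-10):
    """10*log10 scaling for power values, with optional clipping below the peak."""
    db = [10.0 * math.log10(max(v, amin)) for v in values]
    if top_db is not None:
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db

def amplitude_to_db(values, top_db=None, amin=1e-5):
    """20*log10 scaling for amplitude values, with optional clipping below the peak."""
    db = [20.0 * math.log10(max(v, amin)) for v in values]
    if top_db is not None:
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db

def to_0_255(db):
    """Same min-max rescaling as in get_x."""
    lo, hi = min(db), max(db)
    return [(d - lo) / (hi - lo) * 255 for d in db]

# toy power values spanning 100 dB of dynamic range (made up for illustration)
power = [1.0, 1e-2, 1e-4, 1e-6, 1e-8, 1e-10]

# power treated as power, no clipping: every bin keeps a distinct grey level
torch_like = to_0_255(power_to_db(power))

# power treated as amplitude and clipped at 80 dB: the quietest bins collapse
librosa_like = to_0_255(amplitude_to_db(power, top_db=80.0))

print([round(v, 1) for v in torch_like])
print([round(v, 1) for v in librosa_like])
```

If this is indeed the cause, calling librosa.power_to_db(spec, top_db=None) instead of amplitude_to_db should bring the two images much closer; I have not verified this on the Coo dataset.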


Why does one take fixed Conv1d layers and not learn their parameters instead?

Hi, and thanks for being curious.

I think the fundamental reason is that it is meant to be an entirely different approach from gradient descent. You make a huge number of cheap features (fixed, random) and let a simple linear classifier learn which ones are predictive. The usual ML methods can learn a kernel and a bias for a particular conv1d, but they cannot learn across kernel lengths and dilations. Those parameters are non-differentiable. The reduction to ppv (proportion of positive values) and max is likewise non-differentiable. Therefore an optimizer can’t see the effect of varying bias and kernel weights.

You could think of ROCKET as a way to search a very large parameter space that gradient descent cannot traverse. In a sense, it searches across model architectures. Instead of pre-defining a model with fixed conv dilations and ppv thresholds, the method finds which ones work best.
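
To make the idea concrete, here is a minimal plain-Python sketch of the feature extraction, assuming the sampling scheme described in the ROCKET paper (kernel length from {7, 9, 11}, mean-centred normal weights, uniform bias, exponentially sampled dilation). The function names are made up for illustration, padding is left out for brevity, and this is not the timeseriesAI implementation.

```python
import math
import random

def random_kernel(series_len, rng):
    """Draw one fixed random kernel: (weights, bias, dilation)."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]  # mean-centre the weights
    bias = rng.uniform(-1.0, 1.0)
    # dilation is sampled on an exponential scale so the kernel still fits the series
    max_exp = math.log2((series_len - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0.0, max_exp))
    return weights, bias, dilation

def apply_kernel(series, kernel):
    """Dilated 1-D convolution (no padding) reduced to two features: ppv and max."""
    weights, bias, dilation = kernel
    span = (len(weights) - 1) * dilation
    outputs = [
        bias + sum(w * series[start + j * dilation] for j, w in enumerate(weights))
        for start in range(len(series) - span)
    ]
    ppv = sum(1 for o in outputs if o > 0) / len(outputs)  # proportion of positive values
    return ppv, max(outputs)

def rocket_features(series, kernels):
    """Concatenate (ppv, max) for every kernel into one flat feature vector."""
    feats = []
    for kernel in kernels:
        feats.extend(apply_kernel(series, kernel))
    return feats

rng = random.Random(0)
series = [math.sin(0.2 * t) for t in range(200)]  # toy input signal
kernels = [random_kernel(len(series), rng) for _ in range(100)]
features = rocket_features(series, kernels)
print(len(features))  # two features per kernel
```

Nothing in the kernels is trained. The only learned part is the linear classifier that is fitted on these feature vectors afterwards.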

Cheers, Malcolm


You did it right. I have played with some parameters in librosa, e.g. power, but cannot get the same result as torch (I’m also short on sleep with my 4-month-old daughter haha)


Great explanation @Pomo. I think that with enough conv1d kernels, ROCKET can create a feature space dense enough that we can think of it as a lookup dictionary. Imagine it as a kind of spectrogram, with small-dilation conv1d kernels covering high frequencies and large-dilation ones covering low frequencies.

P.S. I’ve just found that the repo you referred to for ROCKET in fastai is now compatible with version 2. It was updated yesterday, but I cannot find any information about ROCKET in it. Could you please take a look to see whether the author just removed something? Thanks

Hi, I created the timeseriesAI repo. I’ve started the process of porting the library to v2. I will add ROCKET very soon (later today if I find the time). In the meantime, you may find the v1 version here: timeseriesAI1.


Has anyone successfully tried to do inference on the Coo-dataset (or any audio set)?
I tried, using the “vanilla” notebook, but on:


I get the error:

cannot identify image file ‘AL53.wav’

as the system tries to open it as an image file:

     73     "Open and load a `PIL.Image` and convert to `mode`"
---> 74     im = Image.open(fn, **kwargs)
     75     im.load()

Anyone any idea on how to move forward from here?

I’ve just uploaded the ROCKET code to the timeseriesAI repo, as well as a notebook with details on how to use both the univariate version (without GPU) and the univariate or multivariate version with GPU support.
Please, let me know if you have any issues with it.


Great @oguiza ! Thanks so much for your update

Hi Everyone,

I have written an article on how to do data augmentation on audio files in Python with the help of the librosa library.

@radek, @Pomo, others too, have you seen examples of directly doing the deep learning stuff on the audio array data? That is, much like the computer vision case where one learns the filters, in this case perhaps 1d convs?

I’ve been doing some reading on this stuff (like wavelets etc) and can’t help feeling that audio data is destined to be processed by 1d Convs!

I found this one, but not much other stuff.

Interesting article that you found! It may be worth implementing.

One example is ROCKET itself. Its conv1d’s are not directly learned, but in effect they are, because they are selected out of a large random set of variants. I have also experimented with learnable conv1d’s feeding LSTM and layers of Linear, not for audio but for timeseries forecasting in general.

I don’t think the existing work on this topic is scarce these days. For example, Radek linked to this article

This article links to other articles, all with further references. There surely exists a body of work to use as a foundation.

A couple of my favorite articles are:
ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels

One idea to take further: you could take the most predictive conv1d’s from ROCKET, initialize a different model’s first conv1d layer with them, and let the conv1d’s learn further. It’s a type of transfer learning!

I hope this helps to inspire some ideas for directions.

I think you are on a great track here :slight_smile: Working with audio data directly seems to have a lot of advantages and being able to learn features does sound very appealing :slight_smile:

Still, in many situations going to spectrograms first and then training a CNN on them seems to be the solution of choice. But is it because people are not exploring working with raw audio enough? Or maybe we still haven’t developed good enough ways of working with audio in certain contexts? Or maybe creating spectrograms offers a shortcut, a representation that makes many tasks easier for our models? It seems that the bulk of research work over many years went into developing CNN architectures. Maybe that is what gives these models an edge?

I do not know what the answer is but it certainly makes sense to explore working with audio directly, probably for any project that one works on :slight_smile: At least that is my current understanding.


I will chime in to this discussion with a mix of facts and opinions.

Reasons spectrograms+resnet works well…

  • The mammalian cochlea physically extracts the frequency and intensity spectrum from sounds. IOW, our physical ear sends a spectrogram vs. time to the brain, not an audio waveform. It makes sense to preprocess audio into the same form. It’s biomimetic and life has usually already found an efficient and effective solution. Furthermore, all the characteristics of a sound that we mammals find important can be taken from the spectrogram. There may be more information, but we and monkeys can’t hear it.

  • ML image processing is already a very well-developed area. Great tools, techniques, and expertise are available. GPU calls have been tuned for efficiency. It’s familiar too. (Radek said this above.)

  • It works! I recognize all the great work people have done so far, and that this approach has so far given better accuracy than any direct-audio method.

Reasons spectrograms+resnet is not a great approach…

  • Especially the argument that sounds are inherently sequential. ML vision excels at extracting local textures and features. It does not know that the left-to-right evolution of the image even matters.

  • resnet probably uses many more parameters than are actually needed to classify sounds, therefore it’s inefficient. It’s like using a ten ton backhoe to hammer a nail. It works, but the tool does not match the job.

  • My favorite… all the various preprocessing methods to turn a timeseries into an image - spectrogram, Gram matrices, etc. - are just math. You could figure out what the math plus resnet is actually doing to the audio, and do it directly. That would give both insight into how better to process audio, and gigantically more efficiency.

  • Converting a sound to an image is a bottleneck that the sound’s information has to go through. What if this bottleneck inherently loses information that existed in the original sample? Then it would never be able to perform as well as working directly with the waveform. Maybe the cochlea sends more than just a spectrogram.

  • What about continuous sound classification and recognition, something we may want to handle eventually? How can a series of images do it?

So yes, I am a 1-D, sequential partisan. But before even that, whatever works best.

P.S. While writing this I noticed that ROCKETSound as written also does not understand that sound is sequential. max and ppv throw away all the sequential knowledge that the conv1d’s extract. I wonder what would happen if we used some of it?
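
A tiny sketch of that last point, using made-up random numbers in place of a real conv1d output: reversing the sequence in time changes nothing about ppv and max, so any classifier built only on those two features cannot tell the two orderings apart. The helper name is hypothetical.

```python
import random

def ppv_and_max(conv_output):
    """The two pooled features kept per kernel: proportion of positive values, and max."""
    ppv = sum(1 for v in conv_output if v > 0) / len(conv_output)
    return ppv, max(conv_output)

rng = random.Random(0)
conv_output = [rng.gauss(0.0, 1.0) for _ in range(50)]  # stand-in for one kernel's output
reversed_output = conv_output[::-1]                     # same values, played backwards

original_feats = ppv_and_max(conv_output)
reversed_feats = ppv_and_max(reversed_output)
print(original_feats == reversed_feats)
```

The same holds for any permutation of the outputs, not just a reversal. Some order information does survive indirectly through the kernels themselves, since each kernel looks at a short ordered window.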



Two days ago I had the idea to try a ResNet-style 1d convolution on sound. So I copied a PyTorch resnet implementation I found, adapted it to 1d instead of 2d, and played around. I also tried to improve my “fastai2 skills” and created an audio data block with transforms etc. I achieved ~99 % accuracy with this approach (best result after about 80 epochs / 12 seconds each was 99.3 %).

If you have any improvements (to the code or the resnet approach), please let me know. Most of the code was just trial and error :).


  • If you adjust the length (0.75 seconds right now) you have to adapt the input shape of the linear layers
  • I tried to keep the kernel_size small (like the 3x3 kernel in resnet) so I tried between 3 and 21 …
  • I also experimented with different strides
  • best results for now were kernel size 15, stride 4 with 6 resnet blocks
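
On the first point, a rough way to see how the clip length feeds into the linear layer's input size is the standard conv1d output-length formula. The sketch below assumes a simplified stack of six strided convolutions with kernel size 15, stride 4, padding 7 and 64 output channels at a 16 kHz sample rate; the channel count, padding and sample rate are assumptions for illustration and will differ from the actual notebook.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (same formula as in the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def flattened_features(num_samples, num_layers, channels, kernel_size, stride, padding):
    """Size of the flattened tensor that the first linear layer would receive."""
    length = num_samples
    for _ in range(num_layers):
        length = conv1d_out_len(length, kernel_size, stride, padding)
    return channels * length

sample_rate = 16000  # assumed sample rate, not taken from the notebook

# 0.75 s clips versus 1.0 s clips through the same hypothetical stack
in_075 = flattened_features(int(0.75 * sample_rate), num_layers=6, channels=64,
                            kernel_size=15, stride=4, padding=7)
in_100 = flattened_features(int(1.00 * sample_rate), num_layers=6, channels=64,
                            kernel_size=15, stride=4, padding=7)
print(in_075, in_100)
```

An alternative that avoids recomputing this by hand is to put an adaptive pooling layer such as nn.AdaptiveAvgPool1d(1) before the linear layers, so that their input size no longer depends on the clip length.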

Next thing I’d like to try is transfer learning and maybe a combination of CNN / LSTM (don’t know if that’s possible / useful for this task).


Resizing (-> downsampling) also works with the 1d resnet approach and audio files. I downsampled the files to sr//2, which led to faster epochs and 99.3 % after 60 epochs.

Does anyone know an audio dataset we could use for pretraining / transfer learning?

This is really great @florianl! Thanks for sharing, since this approach was precisely what I had in mind in my post above! I am (much too slowly!) getting familiar with the fastai2 Datasets and Transforms, so I have a couple of questions about your code. Let me start with the first one in this post.

You have inherited from the TensorBase class for the TensorAudio object. Why is that? That is, why not just use a PyTorch tensor, or why not inherit from Transform?

I’d appreciate any info here, also thanks (to you, or anyone else who can answer!) in advance for your time!


At first I started without the TensorAudio class, but that led to various problems. E.g. transforms are applied to all inputs / blocks. To specify which transform is applied to which input, you have to annotate the object’s class in the encodes method of the transform (encodes(self, o: TensorAudio)):

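class AudioAddNoise(RandTransform):
    "Randomly add noise with probability `p`"
    def __init__(self, p=0.5, device='cpu'):
        super().__init__(p=p)
        self.device = device  # store the device instead of relying on a global variable

    def encodes(self, o: TensorAudio):
        # noise amplitude: a random fraction (up to 0.1 %) of the signal's peak value
        noise_amp = 0.001*torch.rand(1).to(self.device)*torch.max(o).to(self.device)
        o = o + noise_amp * torch.empty(o.shape).normal_().to(self.device)
        return o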

Otherwise (encodes(self,o)) the transform will also be applied to your category tensors.

That’s why I created the TensorAudio class. I inherit from TensorBase because the default classes (TensorCategory, TensorImage / TensorImageBase) inherit from TensorBase too ;).
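
For anyone unfamiliar with this mechanism, here is a stripped-down plain-Python imitation of the idea. It is not fastai2's actual implementation (which uses its own type-dispatch machinery), and every class here is a hypothetical stand-in. It only shows why an annotated encodes leaves the label untouched while an unannotated one touches everything.

```python
import inspect

class TensorAudio(list):
    "Hypothetical stand-in for an audio tensor type"

class TensorCategory(int):
    "Hypothetical stand-in for a label type"

class TypedTransform:
    "Apply `encodes` only to the items that match its type annotation"
    def __call__(self, items):
        param = next(iter(inspect.signature(self.encodes).parameters.values()))
        wanted = param.annotation
        if wanted is inspect.Parameter.empty:
            wanted = object  # no annotation: every item matches
        return tuple(self.encodes(o) if isinstance(o, wanted) else o for o in items)

class Gain(TypedTransform):
    "Annotated `encodes`: only TensorAudio items are changed"
    def encodes(self, o: TensorAudio):
        return TensorAudio(v * 2 for v in o)

class MarkTouched(TypedTransform):
    "Unannotated `encodes`: every item in the tuple is passed through it"
    def encodes(self, o):
        return f"touched {type(o).__name__}"

sample = (TensorAudio([1, 2, 3]), TensorCategory(4))  # (input, label) pair
typed_result = Gain()(sample)           # audio is doubled, label is left alone
untyped_result = MarkTouched()(sample)  # both audio and label are transformed
print(typed_result, untyped_result)
```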


Hi Florian. Great work! And inspiring!

I am very much looking forward to exploring this direction you discovered. However, I still get the import error that I posted about earlier and never solved. Would you please post conda list for your sox, torchaudio, fastai2, and pytorch? Maybe it will shed some light on the problem.

For anyone who has any solutions, my error is:

ImportError Traceback (most recent call last)
3 #import librosa
4 #import librosa.display
----> 5 import torchaudio

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torchaudio/__init__.py in <module>
4 import torch
----> 5 import _torch_sox
7 from torchaudio import transforms, datasets, sox_effects, legacy

ImportError: /home/malcolm/anaconda3/envs/fastai2/lib/python3.7/site-packages/_torch_sox.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail36_typeMetaDataInstance_preallocated_7E