And DCASE 2020 now. http://dcase.community/challenge2020/index
I am trying to run inference in a Kaggle kernel and I am stuck; there seems to be some incompatibility which I couldn’t find. In Colab it works, so I installed all the same dependencies just to be sure, but to no avail.
Code
learn = load_learner("/kaggle/input/deepfake")
test = AudioList.from_folder(audio_path, config=config)
learn.predict(test[-1])
/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
652 else: x,y = self.item ,0
653 if self.tfms or self.tfmargs:
--> 654 x = x.apply_tfms(self.tfms, **self.tfmargs)
655 if hasattr(self, 'tfms_y') and self.tfm_y and self.item is None:
656 y = y.apply_tfms(self.tfms_y, **{**self.tfmargs_y, 'do_resolve':False})
TypeError: apply_tfms() got an unexpected keyword argument 'size'
@scart97, any idea what would be causing this discrepancy?
I have no idea, we need more info to understand what is happening there. @much_learner could you please run the following code on both colab and kaggle to check if the packages installed are the same?
import fastai.utils; fastai.utils.show_install(1)
Also, a full copy of the stack trace would be useful.
I manually installed exact packages on Kaggle. Here they are. Also the full trace.
I thought it could be something with PyTorch, since after I hacked around size it complained that a tensor doesn’t have a pixel property, or something like that.
torchaudio and torchvision are the same too (0.4 and 0.5)
@scart97 Any idea? Does the inference part work at all? As @shruti_01 pointed out, audio_predict and audio_predict_all are broken in master too.
I am happy to help if you could hint where to dig.
Hey, great work with the library!
I started using it for phonemic classification of separate Fourier time bins. When studying the lib I noticed that you shadow the LabelList class in order to be able to preprocess audio files together with their labels. The preprocessing here involves downmixing, silence removal, etc.
I was wondering whether that couldn’t be achieved using PreProcessors instead? I figured this would be more in compliance with the fastai workflow. Alternatively, the config file could be processed in the .transform method. Are there reasons why it was not possible to do it this way?
Best regards
Yes, I totally agree, it would be more in line with the fastai workflow, and we did try to implement it with a PreProcessor initially, but there was some limitation that meant we couldn’t save state, I believe. Bear in mind that the v1 API is likely to be deprecated in the next few months in favour of the new fastai v2 based API.
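For anyone curious what that approach would have looked like, here is a rough sketch of a fastai v1 PreProcessor for audio. The class name and config handling are hypothetical, it is not the library’s actual implementation, and it glosses over the state-saving issue mentioned above.

from fastai.data_block import PreProcessor

# Hypothetical sketch of an audio PreProcessor, for illustration only.
class AudioPreProcessor(PreProcessor):
    def __init__(self, ds=None, config=None):
        self.config = config  # e.g. target sample rate, silence threshold

    def process_one(self, item):
        # Here one would downmix to mono, remove silence, resample, etc.
        return item

    def process(self, ds):
        # Called once on the whole ItemList; the processed items replace ds.items.
        ds.items = [self.process_one(item) for item in ds.items]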
Hello all, thank you for the excellent work on this module.
I have had success doing basic classification. I’m hoping for some technical help tackling a new type of problem.
I would like to feed slices of spectrograms through an LSTM and predict a corresponding float value. So if I had a 480-wide spectrogram and 240 float values, for the first 2-wide slice of the spectrogram I would like to predict the first float value. Then use that predicted float and the next slice to predict the next float, and so on.
Do I need to divide my labels into chunks which correspond to the chunks in my audio sequences?
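For illustration (not an answer from the thread), here is a minimal sketch in plain PyTorch of one way to pair 2-frame spectrogram slices with one float target per slice; the autoregressive variant described above would additionally concatenate the previous prediction to each slice. All names and shapes are hypothetical.

import torch
import torch.nn as nn

n_mels, width, slice_w = 64, 480, 2
spec = torch.randn(n_mels, width)          # one (n_mels, time) spectrogram
targets = torch.randn(width // slice_w)    # 240 float values, one per slice

# Split the time axis into 240 non-overlapping slices of width 2,
# then flatten each slice into a feature vector for the LSTM.
slices = spec.unfold(1, slice_w, slice_w)              # (n_mels, 240, slice_w)
slices = slices.permute(1, 0, 2).reshape(240, -1)      # (240, n_mels * slice_w)

class SliceLSTM(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):                  # x: (batch, seq_len, in_dim)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)  # one float per time step

model = SliceLSTM(n_mels * slice_w)
preds = model(slices.unsqueeze(0))         # (1, 240)
loss = nn.functional.mse_loss(preds, targets.unsqueeze(0))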
I’m really interested in DL for audio too.
I’m trying to figure out why librosa and torchaudio give different spectrograms even though I use the same parameters for n_fft, hop_length, rate, and power.
I can’t seem to find the reason for this difference. Is it some sort of normalisation that’s different?
@MadeUpMasters, any hints?
Oh, and is the Telegram group alive?
I have been overwhelmed with a few projects and haven’t been working on audio the past 8 weeks, so sorry I haven’t been as active in the thread. Both the library and audio Telegram groups are still up, PM me and I’ll get you added. @kdorichev is the one most actively working on the library at the moment. He is doing a great job of organizing people to contribute.
About the spectrograms: are you using the exact same signal (read in by one library, either librosa or torchaudio, and then fed to the two distinct spectrogram functions)? They both have their own quirks with regards to how they read in audio and also how they normalize. I believe they mostly use the same underlying window functions and algorithms because (if it hasn’t changed in the past 2 months) torchaudio mostly delegates to librosa.
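For example, a quick way to compare how the two libraries load the same file (assuming a local file clip.wav; the exact defaults depend on the library versions):

import librosa
import torchaudio

# librosa.load resamples to 22050 Hz by default; sr=None keeps the native rate.
# It returns float32 samples scaled to [-1, 1].
y_lr, sr_lr = librosa.load("clip.wav", sr=None)

# torchaudio.load normalizes to [-1, 1] by default (torchaudio.load_wav does not),
# and returns a (channels, samples) tensor plus the native sample rate.
y_ta, sr_ta = torchaudio.load("clip.wav")

print(sr_lr, y_lr.min(), y_lr.max())
print(sr_ta, y_ta.min().item(), y_ta.max().item())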
Yes, I have used torchaudio to read in the same signal and then fed that to the melspectrogram functions followed by the amplitude-to-db functions of each library. The output looks very different!
torchaudio:
librosa:
If I just look at the outputs after the melspectrograms alone, they look very similar (identical, I think). So it has to be the amplitude-to-db conversion functions that are working differently.
Obviously, the torchaudio output performs way better for the deep learning model.
Can you include code used to generate both?
Sure @MadeUpMasters, see below. Note: the code defining the function get_x is the same in both cases up to the dashed line.
au2spec = torchaudio.transforms.MelSpectrogram(sample_rate=target_rate, n_fft=n_fft, hop_length=hop_length, n_mels=64)
ampli2db = torchaudio.transforms.AmplitudeToDB()

def get_x(path, target_rate=target_rate, num_samples=num_samples):
    x, rate = torchaudio.load_wav(path)
    if rate != target_rate:
        x = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate, resampling_method='sinc_interpolation')(x)
    x = x[0] / 32767                      # scale int16 range to [-1, 1]
    x = x.numpy()
    x = librosa.util.fix_length(x, num_samples)
    #------------------------------------------------------#
    torch_x = torch.tensor(x)
    spec = au2spec(torch_x)               # mel spectrogram (torchaudio)
    spec = ampli2db(spec)                 # convert to dB (torchaudio)
    spec = spec.data.squeeze(0).numpy()
    spec = spec - spec.min()              # rescale to [0, 255] for an image
    spec = spec / spec.max() * 255
    return spec

Image.fromarray(get_x(wav_files[0]).astype(np.uint8))
yields
and
def get_x(path, target_rate=target_rate, num_samples=num_samples):
    x, rate = torchaudio.load_wav(path)
    if rate != target_rate:
        x = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate, resampling_method='sinc_interpolation')(x)
    x = x[0] / 32767                      # scale int16 range to [-1, 1]
    x = x.numpy()
    x = librosa.util.fix_length(x, num_samples)
    #------------------------------------------------------#
    spec = librosa.feature.melspectrogram(x, sr=target_rate, n_fft=n_fft, hop_length=140, n_mels=64)
    spec = librosa.amplitude_to_db(spec)  # convert to dB (librosa)
    spec = spec - spec.min()              # rescale to [0, 255] for an image
    spec = spec / spec.max() * 255
    return spec

Image.fromarray(get_x(wav_files[0]).astype(np.uint8))
yields
Any insights are welcome.
I don’t see why the results should be different, since torchaudio claims to be consistent with librosa’s algorithms everywhere in its source code.
I find this post really amazing! I recently wanted to implement an audio detector for badminton shots (i.e. detecting smashes, serves, etc. in a badminton match), but I found my problem differs slightly from the general tutorials available in the sense that:
- unlike usual audio samples, which last for a few seconds or minutes, badminton shots usually last for less than 1 second. Could deep learning models work well on samples with such short duration?
- I would like to detect those badminton shots in a match around 1 hour long, so it’s more like an audio detection problem than a classification problem. To make a simple first step, is it reasonable to work around this by chopping the 1-hour match audio into a bunch of short clips and doing classification on each clip?
So I figured it out. There are two different magnitude scales for spectrograms, amplitude (sometimes called magnitude) and power, and these are controlled by the power parameter. The power spectrogram is just the elementwise square of the amplitude spectrogram: if you take the square of every value in your amplitude spectrogram, you then have a “power” spectrogram.
How to properly convert to dB depends on which scale you’re using. Librosa uses two distinct functions, librosa.power_to_db and librosa.amplitude_to_db (a convenience function that squares the spectrogram and then delegates to librosa.power_to_db). Torchaudio only has one, the poorly named amplitude_to_db, which takes a string argument stype that tells it whether the incoming spectrogram is a power or magnitude spectrogram. Its default value is “power”, so unfortunately torchaudio.amplitude_to_db != librosa.amplitude_to_db.
torchaudio.amplitude_to_db by default is actually the same as librosa.power_to_db. Change librosa.amplitude_to_db(spec) to librosa.power_to_db(spec) and I think you’ll have roughly the same spectrograms.
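A small sanity check of that claim, as a sketch; exact agreement may depend on the library versions and on the ref/top_db defaults:

import numpy as np
import torch
import librosa
import torchaudio

# A toy mel *power* spectrogram standing in for a real one.
power_spec = np.random.rand(64, 100).astype(np.float32) + 1e-6

# torchaudio's AmplitudeToDB defaults to stype="power" ...
ta_db = torchaudio.transforms.AmplitudeToDB(stype="power")(torch.from_numpy(power_spec))

# ... so the matching librosa call is power_to_db, not amplitude_to_db.
lr_db = librosa.power_to_db(power_spec, ref=1.0)

# These should agree to within floating point error for this toy input.
print(np.allclose(ta_db.numpy(), lr_db, atol=1e-4))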
Absolutely. I’m not sure just how short a badminton sound is, but I’m confident there will be a way to do it. Try the normal spectrogram way first, but if that doesn’t work you may need to use something called wavelets, which are more useful than the discrete Fourier transform for detecting abrupt changes in sound (I’ve never actually used wavelet transforms myself on a project; a spectrogram will likely work, so don’t go down that rabbit hole until you have to).
Yes, that is a great idea to start with and see if classifying badminton sounds is feasible in the first place. If it is, then you can worry about identifying which sounds are badminton sounds. If your long clips have no noise other than the shots (unlikely), you can use regular silence removal (built into fastai v2 audio). If they do have other noise (cheering, human grunts), you will need another model that determines whether something is a badminton sound, but for now just manually make clips and start there.
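As a starting point, a rough sketch of that chopping step (the file name, window length, and overlap below are placeholders, not values from the thread):

import torchaudio

# Load the long recording and cut it into overlapping fixed-length windows.
waveform, sr = torchaudio.load("match.wav")      # placeholder path
win_sec, hop_sec = 1.0, 0.5                      # 1 s windows with 50% overlap
win, hop = int(win_sec * sr), int(hop_sec * sr)

# unfold gives a (channels, n_windows, win) view over the last dimension.
windows = waveform.unfold(-1, win, hop)
for i in range(windows.shape[1]):
    clip = windows[:, i, :]                      # one short clip to classify
    # spectrogram + classifier would go here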
Thanks @MadeUpMasters for checking this out and letting us know. However, I did try this out earlier but didn’t mention it, because I doubt that this is the reason.
Yes, the outputs are closer to one another than before, but still too different from each other for there to be no other difference, in my opinion (I can plot them next time I start up my machine and run these experiments).
Moreover, the results of training the same model on these different datasets (torchaudio.amplitude_to_db vs librosa.amplitude_to_db or librosa.power_to_db) are also vastly different (an order of magnitude difference). That is, there’s no real difference whether one uses librosa.amplitude_to_db(spec) or librosa.power_to_db(spec). I think this should be expected too, since just squaring the values of pixels should be a task that’s easy enough for the model to learn, right?
FYI, I’m using resnet and xresnet, which are sufficiently capable, I would think.
BTW, it occurred to me that my previous conclusion that the melspectrogram functions are not responsible / yield the same result might be false. Perhaps I just don’t see the difference in the image because the melspectrograms have all the interesting stuff happening in a very small region of the image, and the amplitude_to_db functions only expose this difference but may not be the reason for it.
That’s really interesting, I’ve never seen such a large difference in training results between the two. Do you have the full code on github? Are you using fastai, fastai2, pytorch or something else? Are you using either of our audio libraries? What’s the dataset?
Also, is there a reason you chose to normalize manually? There are already several points where the data gets normalized (torchaudio normalizes on audio load by default, and amplitude_to_db has a degree of normalization via the ref and top_db parameters), although I still don’t fully understand the impact this has on training audio models. Finally, why did you convert to uint8 at the end, just to fit more on the GPU?
For those still working with v1: I’ve created a notebook in the v1 repo that demonstrates how to do SpecAugment on the GPU, which increases training speed. There is potential to do a few of the other random transforms this way too.
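If you just want the flavor of it, torchaudio also ships masking transforms that can run on GPU tensors; a minimal sketch of SpecAugment-style masking (this is an illustration, not the notebook itself):

import torch
import torchaudio

# SpecAugment-style masking: zero out a random frequency band and a random time band.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)

device = "cuda" if torch.cuda.is_available() else "cpu"
# A stack of (n_mels, time) spectrograms; the same random mask is applied across the stack.
specs = torch.rand(8, 64, 400, device=device)
augmented = time_mask(freq_mask(specs))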