Deep Learning with Audio Thread

We’ve added inference to the library (before, if you used learn.predict, you had to manually extract the spectrogram to pass in and handle the preprocessing yourself).

New functions are:

  • audio_predict(learn, item) - takes a learner and an AudioItem and preprocesses it using the same config as your learner, including valid/test transforms
  • audio_predict_all(learn, audiolist) - takes a learner and an AudioList and returns preds for all items in the AudioList, preprocessed based on the audiolist’s config (rough usage sketch below)
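
A rough usage sketch (the test folder and the exact from_folder/config arguments here are illustrative, not taken from the library docs):

test = AudioList.from_folder('data/test_clips', config=learn.data.x.config)  # hypothetical folder/kwargs
ys, preds, raw_preds = audio_predict_all(learn, test)   # results zipped by component across items

item = test[0]                                           # a single AudioItem
y, pred, raw_pred = audio_predict(learn, item)           # same preprocessing, one item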

Edit: Sylvain gave us the fix; crossposting the solution here and leaving the original post below. Export now works.

From Sylvain:

That’s because the ItemList you’re using doesn’t have a reconstruct method, so it tries to grab the one on the first element of the dataset. You should implement this method (just return x if you don’t have any postprocessing) to avoid the issue.
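
For anyone else hitting this, the fix ends up being roughly this shape (a minimal sketch; the real AudioList may want to wrap the data back into an AudioItem rather than returning it as-is):

class AudioList(ItemList):
    ...
    def reconstruct(self, t):
        # no post-processing to undo, so hand the data straight back
        return t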

Problem:
Calling audio_predict_all on a test set with an imported learner causes the following problem (full stack trace at bottom):

372         ds = self.data.single_ds
    373         pred = ds.y.analyze_pred(raw_pred, **kwargs)
--> 374         x = ds.x.reconstruct(grab_idx(x, 0))
    375         y = ds.y.reconstruct(pred, x) if has_arg(ds.y.reconstruct, 'x') else ds.y.reconstruct(pred)
    376         return (x, y, pred, raw_pred) if return_x else (y, pred, raw_pred)

single_ds calls single_dl, which pulls data from the validation set (this line is from the DataBunch init):
self.single_dl = _create_dl(DataLoader(valid_dl.dataset, batch_size=1, num_workers=0))

But since we have an empty validation set, when it tries to get 1 item from there it gets an IndexError. Any ideas?

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-84-ae0665b3b9f3> in <module>
----> 1 audio_predict_all(new_learn, test)

~/rob/fastai_audio/audio/learner.py in audio_predict_all(learn, al)
     17     al = al.split_none().label_empty()
     18     data = [AudioList.open(al, ai[0].path).spectro for ai in al.train]
---> 19     preds = [learn.predict(spectro) for spectro in progress_bar(data)]
     20     return [o for o in zip(*preds)]

~/rob/fastai_audio/audio/learner.py in <listcomp>(.0)
     17     al = al.split_none().label_empty()
     18     data = [AudioList.open(al, ai[0].path).spectro for ai in al.train]
---> 19     preds = [learn.predict(spectro) for spectro in progress_bar(data)]
     20     return [o for o in zip(*preds)]

/opt/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in predict(self, item, return_x, batch_first, with_dropout, **kwargs)
    372         ds = self.data.single_ds
    373         pred = ds.y.analyze_pred(raw_pred, **kwargs)
--> 374         x = ds.x.reconstruct(grab_idx(x, 0))
    375         y = ds.y.reconstruct(pred, x) if has_arg(ds.y.reconstruct, 'x') else ds.y.reconstruct(pred)
    376         return (x, y, pred, raw_pred) if return_x else (y, pred, raw_pred)

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in reconstruct(self, t, x)
     97     def reconstruct(self, t:Tensor, x:Tensor=None):
     98         "Reconstruct one of the underlying item for its data `t`."
---> 99         return self[0].reconstruct(t,x) if has_arg(self[0].reconstruct, 'x') else self[0].reconstruct(t)
    100 
    101     def new(self, items:Iterator, processor:PreProcessors=None, **kwargs)->'ItemList':

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in __getitem__(self, idxs)
    116         "returns a single item based if `idxs` is an integer or a new `ItemList` object if `idxs` is a range."
    117         idxs = try_int(idxs)
--> 118         if isinstance(idxs, Integral): return self.get(idxs)
    119         else: return self.new(self.items[idxs], inner_df=index_row(self.inner_df, idxs))
    120 

~/rob/fastai_audio/audio/data.py in get(self, i)
    312 
    313     def get(self, i):
--> 314         item = self.items[i]
    315         if isinstance(item, AudioItem): return item
    316         if isinstance(item, (str, PosixPath, Path)):

IndexError: index 0 is out of bounds for axis 0 with size 0

OK, I’ll get together a PR.
I can put in code to pull channels from the databunch if channels is None (it can at least initially default to 3). Assuming you’ll standardise on having a singleton channel dimension for mono, I can use sig.shape[-3], which I think is the easiest way to sort out the transforms.
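
Something like this is the shape of what I have in mind (names are just illustrative, not necessarily what will be in the PR):

def get_channels(sig, channels=None, default=3):
    # spectrogram tensors assumed shaped (..., channels, freq, time), so mono keeps a
    # singleton channel dimension and sig.shape[-3] == 1
    if channels is not None: return channels
    return sig.shape[-3] if sig.dim() >= 3 else default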


Heard an excellent talk by Jon Nordby on Audio Classification using Machine Learning at EuroPython today. Will post a link to the video here once it becomes available. But his github repo is a good source as is (and also contains the slides/info for the talk). Very interesting resource! Great introduction for people new to this topic, but also interesting for the more advanced in this thread. It covers training specialized CNNs from scratch as well as transfer learning from ImageNet-pretrained models like many of us have tried/used, using the common sound-to-spectrogram-image and then image-classification approach.

His Master’s Thesis on Environmental Sound Classification is also available there, code is in Keras though.


Video

Good afternoon everyone!

I am loving fastai but fear I may have bitten off more than I can chew with trying to process audio :frowning:

I have loaded up some bird audio and have used this tutorial to get the audio data in this kernel

For purposes of the competition, the author (daisukelab - full credit for all code) is loading the images from memory for the CNN; however, when I try to use the same process with my own code it doesn’t work. I have tried a few different options but keep running into issues.

Instead, I was wondering if anyone had a complete, beginner-friendly tutorial that will take me through creating a loop that goes through my files and creates and saves melspectrograms as image files with librosa?


Awesome video! @ste’s idea of combining various time resolution spectrograms in different channels appears to be a good one (Varying time resolution in multiple channels). It’s been on my list to play with in the library but I haven’t gotten to it yet. I know @ste implemented it and tested it on phonemes from TIMIT and it appeared to help. I’ll get to it soon, but I think implementing some specific audio architectures is higher on the list; we are using ones that are getting great results but are probably unnecessarily deep and complex.


Hey @AIClaire, I’d recommend using our library instead of manually preprocessing the spectrograms and saving the files; it does all of that for you. Here’s a beginner-friendly overview, you can just import your audio files using from_folder or from_csv in the same way you do with fastai images and it handles the rest. We are still developing it so the install can be a bit tricky, but I’m happy to help. Also, if you’d prefer to manually generate the spectrograms in librosa, let me know and I can find a guide.
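
In case it helps in the meantime, a bare-bones librosa loop looks something like this (folders and parameters are just examples, adjust them to your data):

import numpy as np
import librosa
import matplotlib.pyplot as plt
from pathlib import Path

audio_dir, out_dir = Path('audio'), Path('spectrograms')    # example folders
out_dir.mkdir(exist_ok=True)

for f in audio_dir.glob('*.wav'):
    y, sr = librosa.load(str(f), sr=None)                   # keep the file's native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)            # log scale, which CNNs generally prefer
    plt.imsave(out_dir/f'{f.stem}.png', mel_db, origin='lower', cmap='magma')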

We have an audio ML group on Telegram that is totally open and all skill levels, PM me if you’d like to join us.


Hey @MadeUpMasters, Thank you so much for this! It is exactly what I have been looking for! I will PM if I get stuck on the install (thank you for offering - very kind) and about the group. I would love to learn the manual version but I think it’s best for me right now to do one new skill at a time so I don’t end up a panicked ball on the floor :smiley: . Thanks again for your help.

Awesome, I hope it helps with your project, and we would love some feedback afterwards. We are really looking for ways to make it easier and more functional/intuitive. Cheers.

I’m still having some issues getting inference working efficiently. Crossposting this from a dev thread to see if anyone here can help.

That got things working, but I think I’m going about it the wrong way, and I would like things to be done in the fastai compatible way as much as possible.

Our audio preprocessing (resampling, silence removal…etc) is done when a LabelList is created. At inference time we have individual AudioItems, so to make sure the same preprocessing is followed, we’ve been using an audio_predict method that takes an AudioItem or AudioList and calls .split_none().label_empty() to cause the items to be preprocessed before passing the items to learn.predict(). This seems to work but I feel like I should be using reconstruct to initiate the preprocessing, but I’ve read the custom ItemList guide and I’m still not sure how to do it.

What do I need to do so our users can just call learn.predict() and get_preds() directly instead of our custom methods? Here is what our code looks like now, it feels really bad/inefficient. Thank you.

def audio_predict(learn, item:AudioItem):
    '''Applies preprocessing to an AudioItem before predicting its class'''
    al = AudioList([item], path=item.path, config=learn.data.x.config).split_none().label_empty()
    ai = AudioList.open(al, item.path)
    return learn.predict(ai)                                              

def audio_predict_all(learn, al:AudioList):
    '''Applies preprocessing to an AudioList then predicts on all items'''
    al = al.split_none().label_empty()
    audioItems = [AudioList.open(al, ai[0].path) for ai in al.train]
    preds = [learn.predict(ai) for ai in progress_bar(audioItems)]
    return [o for o in zip(*preds)]

After your fix I sorted out Learner.predict in my fork. You can see the changes I made at https://github.com/thomasbrandon/fastai_audio-test/commit/c98e14d0ed41473d4b431b44e1c5d8f516b1efa9.
Not sure how much that method will help you though. Think you might have issues, as IIRC you don’t do transforms through the standard mechanisms, which is what Learner.predict assumes (transforms get passed to the DataBunch.single_dl to apply, so I didn’t have to make any changes around that). I guess though you should be able to apply them in AudioDataBunch.one_item (which I override for opening items).


Thanks, I’m definitely not as familiar with the underlying way fastai handles everything as I should be and I’m starting to feel it as I work to give it full functionality. I’ll probably wait for APIv2 to drop to really get in the weeds and straighten everything out.

Discovered a really cool library called Colored Noise by data scientist Felix Patzelt that allows you to generate different colors of noise algorithmically. I’m using it in our googlespeech notebook to generate data for types of noise that are in the test set but not the training set. We also might be able to adapt it to generate spectrograms of various types of white/pink/brown noise quickly (instead of generating the signal and then computing a spectrogram, which is a small time bottleneck) and add it to the data as a transform.
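
Basic usage is roughly this (from memory, so double check against the repo):

import colorednoise as cn

n_samples = 16000                               # one second at 16kHz
white = cn.powerlaw_psd_gaussian(0, n_samples)  # exponent 0 -> white noise
pink = cn.powerlaw_psd_gaussian(1, n_samples)   # exponent 1 -> pink (1/f) noise
brown = cn.powerlaw_psd_gaussian(2, n_samples)  # exponent 2 -> brown/red noise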

Jeremy did remark in the part 2 v3 that even he struggles to remember how all the data blocks stuff fits together and avoids working on it.
One somewhat quick fix might be to refactor the transforms stuff out of AudioList.open so there’s a single apply_transforms(item, config); then you could call that from DataBunch.one_item rather than having to create an AudioList to call its open.
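
Roughly this shape (names and attributes are just illustrative, e.g. I’m assuming the transforms are reachable from the config):

def apply_transforms(item, config):
    # run the config's preprocessing/transforms over a single AudioItem
    for tfm in config.tfms:
        item = tfm(item)
    return item

class AudioDataBunch(DataBunch):
    def one_item(self, item, **kwargs):
        item = apply_transforms(item, self.x.config)   # preprocess before batching
        return super().one_item(item, **kwargs)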


Cool, adding noise was on the (ever-growing) TODO list.

Guess you hadn’t dug into the implementation yet. It’s generating the noise in the frequency domain, then doing an IFFT to get the time-domain noise. Converting to torch and applying it to STFTed signals (n_frames being the desired STFT length), I think it’s basically (completely untested):

import torch

# mean of 0 with shape determining the output shape in torch.normal (unlike np.random.normal)
m = torch.zeros(n_fft, n_frames)
# frequency-domain power spectrum of desired noise type, used as std of noise
# (would need to match/broadcast to m's shape)
P = get_noise_spectrum('pink')
noise_re = torch.normal(mean=m, std=P)
noise_im = torch.normal(mean=m, std=P)
noise_comp = torch.stack((noise_re, noise_im), 2)  # complex noise signal (real, imaginary)
noise = torch.istft(noise_comp)                    # inverse complex-to-real STFT

(except there’s no torch.istft yet, though it looks like it’s coming, and there’s code around)
One possible issue with doing this to spectrograms is that it generates random phases as well as magnitudes. Think applying it just to magnitudes, as you’re generally using, might not achieve the correct results. Also not sure how, having generated your complex noise, you’d correctly add it in the frequency domain (I’m sure Wikipedia will say but I likely won’t really get it) or what the issues are there. Maybe just adding it in before you do a magphase on your STFT is enough (where ‘adding’ may not be just signal + noise as they’re complex).

One idea I would like to look at, that may apply here, is adding sort of opportunistic augmentations: adding stuff to transforms you have to do anyway that also works to augment the data, as fastai does when resizing by a partly random amount then cropping rather than doing the exact resize needed. This avoids the extra work of doing the exact resize and then separately resizing off by a bit and cropping back to achieve much the same result.
I’m looking to try and apply this in resampling. Instead of resampling to an exact rate, use something near the desired rate, then pad/crop to get the proper size. This should have the effect of a slight time stretch/pitch shift of the properly resampled signal. I can also then use optimised FFT sizes for the resampling since it doesn’t have to be exact, so it should actually be faster than doing the exact resample (rough sketch below).
Similarly you might be able to add noise in at some other points in the pipeline, reducing overall computational cost compared to doing the other needed ops then adding noise separately.
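
Something like this is the shape of what I’m picturing for the resampling (untested sketch, using torchaudio’s Resample as a stand-in for whatever optimised resampler ends up being used):

import random
import torch.nn.functional as F
import torchaudio

def jittered_resample(sig, sr, target_sr, target_len, jitter=0.02):
    # resample to a rate *near* the target (within +/- jitter), then pad/crop back to the
    # expected length; the mismatch acts like a small time stretch/pitch shift for free
    new_sr = int(target_sr * (1 + random.uniform(-jitter, jitter)))
    sig = torchaudio.transforms.Resample(orig_freq=sr, new_freq=new_sr)(sig)
    if sig.shape[-1] >= target_len:
        start = random.randint(0, sig.shape[-1] - target_len)
        sig = sig[..., start:start + target_len]            # random crop back to size
    else:
        sig = F.pad(sig, (0, target_len - sig.shape[-1]))   # pad out to size
    return sig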

This sort of thing is probably less applicable with caching though as that precludes any augmentations before caching.

Hi all, first off, I’m so glad I found this thread, I spent most of yesterday reading through everything written and linked here and it has been incredibly helpful.

I’ve been playing around with generating the spectrograms myself and am having trouble figuring out the relationship between n_fft, n_mels, hop_length, and sr. I have a basic theoretical understanding of what all of them are on their own, but I’m just having a little trouble building a mental model with how they all interact with each other when a spectrogram is generated.

Are there any resources that anyone could point me to that might help me understand this better? And of course, if anyone here wants to take a stab at it, that would be very appreciated as well :slight_smile:

Thanks!


There’s an Intro To Audio notebook in the fastai_audio repo which includes some info on all of those. And as linked in that, there’s a notebook I made with some more detailed information on n_fft, hop_length and windowing in STFT.
Any feedback on what’s clear or not clear would be great.
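
The really short version of how they interact (back-of-the-envelope numbers, assuming a standard centered STFT):

sr = 16000                           # sample rate: audio samples per second
n_fft = 1024                         # STFT window length, in samples
hop_length = 256                     # samples between successive STFT frames
n_mels = 64                          # mel bands the linear frequency bins get pooled into

freq_bins = n_fft // 2 + 1           # 513 linear bins covering 0 Hz .. sr/2 (Nyquist)
freq_resolution = sr / n_fft         # ~15.6 Hz per linear bin
window_duration = n_fft / sr         # 64 ms of audio per frame
frames_per_second = sr / hop_length  # ~62.5 spectrogram columns per second of audio
# a mel spectrogram is then (n_mels, n_frames): the 513 linear bins get squashed into
# 64 mel bands, and a bigger n_fft buys frequency detail at the cost of time detail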


Thanks I’ll check them out!

Yeah, it’s gotten a little out of hand, the amount of stuff we are handling in that method. I especially don’t like the way duration is being handled there, as it grabs a chunk instead of handing you back the full sg, which makes life hard if you want to window over your test set for inference. I’m actually taking the next 2 weeks off, but I’m looking forward to coming back and considering a redesign with fresh eyes. My coding has massively improved over the past 6 months, but I’m unsure how to take it to the next level. I guess steady chipping away via the usual means (reading good code, thinking about why it is the way it is, and trying to emulate it) is the obvious answer, or dedicating time to a well-maintained open-source project, but I’d love advice on other approaches more advanced coders here have found helpful that I might not have considered.

What I was imagining was using some stochastic process to generate a spectrogram of the desired type of noise with a hop/n_fft…etc that matches our dataset, and then adding them together directly using the appropriate math. Not sure if that’s viable, but the patterns from looking inside white/pink noise spectrograms make me think there’s a fast way to algorithmically create a 2D torch array that is the same as creating white/pink noise and doing an STFT over it.

That’s a very cool idea, especially if you can get your fast resample working. I think it would be really cool to have all augmentations be directly on the spectrogram, so stuff that would typically be done on a signal, like pitch shifting, could be done in real time while being compatible with caching. That being said it seems a bit unlikely (otherwise people more knowledgeable than me about DSP would be going that route), plus there are a lot of indications that working on the raw waveform is going to be a rising area of research, so in the next version we’d like to support that and direct signal transforms.

Hi, I’m quite new to audio and was studying up to do a project in speech separation. I was reading the paper “Supervised Speech Separation Based on Deep Learning: An Overview” (DeLiang Wang and Jitong Chen) and all was good until part III.

It talked about different targets, but I couldn’t quite understand how the spectral magnitude mask worked. At first I thought it was pixel-wise division of sound magnitude over noise magnitude, but in the examples 2.e I see that there is a bigger difference, since some of the images use “channels” while others use fft-bins.

I looked online with no luck as to what the spectral magnitude mask is, or why some images use channels and others use fft-bins; I thought they all used fft-bins and I don’t know what channels are in this context. Could someone explain these terms?

PS. Since I usually ask in the wrong forum, if this isn’t the place for these kinds of questions could you tell me where I could ask?

Not at all an expert, and only skimmed the article, but pending a more knowledgeable reply I’ll take a crack.
Are they using channel in the context of filterbanks? It looks like they are talking both about approaches based on digital signal processing and more analog-signal-processing-inspired ones (even if they are actually implementing them digitally).
So in the digital approaches you’d talk about STFT conversion to the time-frequency domain and properties of bins in that domain, but in the analog-based approaches you’d talk about multi-channel filterbanks and properties of subbands of the processed signal (subband seems to be the term for the output of a single channel of a filterbank).
Wikipedia doesn’t specifically define channels in the context of filterbanks, but it does talk about M-channel filterbanks, which seems to mean applying M parallel filters to a signal to divide it into M subbands.

So assuming my guess there is correct:
A channel of a filterbank would be somewhat comparable to a bin of an STFT in the case of simple filterbanks. In general the filters could be arbitrarily complex and there’d be little relation, but in something like a Mel filterbank, where each filterbank channel is the combination of a few successive FFT bins, the link would be fairly simple.
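
As for the spectral magnitude mask itself, my reading (which could well be off) is that it’s just an elementwise ratio of clean to noisy magnitudes over the time-frequency units, something like:

def spectral_magnitude_mask(clean_mag, noisy_mag, eps=1e-8):
    # clean_mag: |S|, magnitude spectrogram of the clean speech
    # noisy_mag: |Y|, magnitude spectrogram of the noisy mixture
    # works on numpy arrays or torch tensors of shape (freq_bins_or_channels, time_frames);
    # the ratio isn't bounded by 1, so in practice it usually gets clipped to some maximum
    return clean_mag / (noisy_mag + eps)

with the predicted mask multiplied elementwise back onto the noisy magnitudes at inference to estimate the clean speech. But again, someone who actually works in speech separation should confirm.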