I’m sure it would be very helpful to have a large open diarisation dataset; I haven’t looked very closely at what’s already out there myself. Very strong +1 for having multiple languages if possible!
As for what kind of dataset would be “ideal”, honestly, I don’t really know yet. At minimum it would need labelled speakers and timestamps for when each participant is talking. You’d also have to decide how precise you want to be about speaker changes (latched speech) and how to treat overlapping speech. I think it would be very beneficial to have varying audio quality, too.
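For what it’s worth, the RTTM format used in the NIST diarisation evaluations already covers most of that: one line per speech segment, with a start time, duration, and speaker label. A minimal sketch of reading that kind of annotation into Python (the example line and filename are made up):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One labelled speech segment: who spoke, and when (in seconds)."""
    speaker: str
    start: float
    end: float

def read_rttm(path):
    """Parse an RTTM file into a sorted list of Turns.

    RTTM lines look like:
      SPEAKER ep01 1 10.52 3.40 <NA> <NA> spkA <NA> <NA>
    i.e. segment type, file id, channel, onset, duration, ..., speaker name.
    """
    turns = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            start, dur = float(fields[3]), float(fields[4])
            turns.append(Turn(speaker=fields[7], start=start, end=start + dur))
    return sorted(turns, key=lambda t: t.start)
```

Note that overlapping speech is just two Turns whose intervals intersect, so the format can record it without forcing you to decide how to handle it.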
My little project is intentionally starting very minimal - a single podcast with only 3 known speakers. It will have to evolve into a very different system to generalise to an unknown number of unknown speakers, especially in more challenging audio conditions. From the browsing I’ve done of the literature, it seems to be a genuinely hard problem in the wild - nobody appears to have solved it very well!
One idea I’ve had that might be worth pursuing is treating part of it as a regression problem - predicting the timestamps of speaker boundaries, then using those boundaries to decide what clips to take, rather than deciding up front “I’m going to use 3-second clips” or whatever. I’m not sure whether it really makes sense, though.
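To make that slightly more concrete, here’s roughly the post-processing I have in mind, assuming you already had a model emitting a per-frame “probability that a speaker change happens here” (the model itself is left out, and the threshold / min-gap values are arbitrary):

```python
import numpy as np

def boundaries_from_change_probs(probs, frame_dur, threshold=0.5, min_gap=1.0):
    """Turn per-frame speaker-change probabilities into boundary timestamps.

    probs:     1-D array, one change probability per frame
    frame_dur: seconds per frame
    min_gap:   ignore boundaries closer than this to the previous one (seconds)
    """
    boundaries = []
    for f in np.where(probs > threshold)[0]:
        t = round(float(f) * frame_dur, 3)
        if not boundaries or t - boundaries[-1] >= min_gap:
            boundaries.append(t)
    return boundaries

def clips_from_boundaries(total_dur, boundaries):
    """Cut the recording at the predicted boundaries, giving
    variable-length clips instead of fixed-length windows."""
    edges = [0.0] + boundaries + [total_dur]
    return list(zip(edges[:-1], edges[1:]))

# fake probabilities for a 10 s clip at 10 frames/sec
probs = np.zeros(100)
probs[[33, 71]] = 0.9
print(clips_from_boundaries(10.0, boundaries_from_change_probs(probs, 0.1)))
# -> [(0.0, 3.3), (3.3, 7.1), (7.1, 10.0)]
```

The upside would be that each clip contains a single speaker by construction, so the classification half of the problem gets easier; the downside is that errors in the boundary model propagate into everything downstream.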
I feel like the more promising avenue would be an image segmentation approach. Just as in lesson 3 where you can label every pixel of a photo as “pedestrian”, “building”, “road” etc., I feel like you could label every spectrogram column (or bunch thereof) as “speaker A”, “speaker B” etc. This feels like it should be doable, but I haven’t gone through that lesson in detail for a while. Generating the data for that shouldn’t be too hard. Processing the data, on the other hand, could be interesting; creating & computing on 1-2hr audio clips could be a bit much…? Then again, you seem to only need 64 mels, and you could use a fairly large timestep for your STFTs, to produce a 1x64x(not too many)px spectrogram… Only one way to find out, I guess.
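Actually, the back-of-the-envelope arithmetic looks fine. A quick sketch with torchaudio (the hop and n_fft values are just my guesses at what a “fairly large timestep” might look like, and the turns are invented):

```python
import torch
import torchaudio

sr, hop = 16_000, 1_600  # 1600-sample hop at 16 kHz -> one column per 100 ms
to_mels = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=2048, hop_length=hop, n_mels=64)

# stand-in for a 10-minute mono recording; a full 2 h episode at this hop
# would only be ~72,000 columns, which seems manageable
waveform = torch.randn(1, sr * 600)
spec = to_mels(waveform)
print(spec.shape)  # torch.Size([1, 64, 6001]) -> the 1x64x(T)px "image"

# frame-level segmentation targets: one speaker id per spectrogram column
turns = [(0.0, 5.2, 1), (5.2, 9.8, 2)]  # (start_s, end_s, speaker_id); 0 = silence
labels = torch.zeros(spec.shape[-1], dtype=torch.long)
for start, end, spk in turns:
    labels[int(start * sr / hop):int(end * sr / hop)] = spk
```

From there it looks a lot like the lesson 3 camvid setup: the spectrogram is the image and `labels` is the (1-D) mask.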
Honestly, I don’t think I’d personally be ready to try out any hypothetical dataset you’d create in the immediate term, so certainly don’t build it on my account; but I’m sure it would be a valuable asset to the community, as there aren’t many speech datasets publicly available anyway, let alone ones focused on diarisation (i.e. clips with multiple speakers, and timestamped labels).