Deep Learning with Audio Thread

To me, inverting a spectrogram always seemed possible, but from what I’ve read from people with more advanced signal-processing backgrounds, it isn’t, or is high effort/low quality. Here’s a decent discussion of it.

The 2nd paragraph outlines a potential path. It’s way beyond my understanding, so I’m not sure if it’s helpful.

Another (more correct?) option would be to extract the full FFT (not just the magnitude, but also the phase) and model those. But to have a chance of being able to model the phase components, you would need to extract the FFT pitch-synchronously and then resample to a fixed frame rate. In other words, you would need to find the GCI (glottal closure instants) in the original wav (for example using reaper), and center your FFT window around those. I suppose once it is modeled, you could resample your full FFT to be pitch-synchronous again, and then recover a decent enough raw wav by IFFT and using OLA (overlap-add).
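For the simpler fixed-frame-rate case (not the pitch-synchronous variant described above), here is a minimal sketch of keeping the full complex FFT and recovering the waveform; librosa.istft does the per-frame IFFT and overlap-add internally, and the filename, window, and hop values are just example assumptions:

import numpy as np
import librosa

# Hypothetical input file; window/hop sizes are arbitrary example values
y, sr = librosa.load("example.wav", sr=None)
D = librosa.stft(y, n_fft=1024, hop_length=256)   # complex-valued: magnitude AND phase
y_rec = librosa.istft(D, hop_length=256)          # per-frame IFFT + overlap-add

# Near-perfect round trip, because the phase was never thrown away
n = min(len(y), len(y_rec))
print(np.max(np.abs(y[:n] - y_rec[:n])))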

Yes, my first goal is just to separate two voices from a mono-channel recording, then optimize from there to try to get better results.

Yeah, it depends a lot on how you define the STFT and what transformations you apply afterwards.

Yes, I think this could help. They also link the following discussion, which gives an idea of something I could test with a full magnitude spectrogram.

If anybody is noticing poor results from your learner with fastai_audio, it may be due to a recent change we made (we’re working on a fix now). Before you pull all of your hair out trying to figure out what you’re doing wrong, set pretrained=True for the audio_learner and call learn.freeze() after you build the audio learner.
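A minimal sketch of that workaround (assuming the usual from audio import * setup and an already-built databunch, here called db; the epoch count is just an example):

from audio import *   # assumes the usual fastai_audio imports

# db = an AudioDataBunch you have already built
learn = audio_learner(db, pretrained=True)  # use the pretrained weights
learn.freeze()                              # train only the head for now
learn.fit_one_cycle(3)                      # example schedule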

1 Like

My GPU is tied up with other things at the moment, so I can’t test; this is just a brain dump which could be off the mark.

Certainly setting pretrained=True is sensible; in fact (as the original author of that code) I didn’t really intend the default to be False, if only because that is different to fastai’s cnn_learner. But I’m not sure about the utility of freezing the model. That means you only train the couple of linear layers appended to the convolutional stem. While that’s reasonable for many vision tasks, where your images are close enough to standard ImageNet categories not to need much tuning of the convolutional layers, it seems unlikely to be ideal for audio.
While doing proper testing of this is still on my todo list, initial testing didn’t show any advantage to freezing then fine-tuning. I did see some possible evidence of increased stability issues when training the whole model, and it may be that more care needs to be taken with learning rates, as you can kick the whole model into a bad state rather than just the last couple of layers. But in general I didn’t find this to be a major issue, and initial frozen training just seemed to add training time without any particular benefit.
Again, I need further testing to be sure, but there did seem to be some evidence that using differential learning rates helped when training an unfrozen model (they should make no difference otherwise, as only one layer group is being trained). Just passing a single-item slice seemed OK (i.e. lrn.fit_one_cycle(EPOCHS, slice(MAX_LR)), so the initial layer group gets a lower rate, MAX_LR/10 IIRC). It may also be more important that data is normalised, as that’s going to have more of an impact on the earlier layers.
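A quick sketch of that call (MAX_LR and the epoch count are placeholders you would pick yourself, e.g. from lr_find):

MAX_LR = 1e-3            # placeholder; choose from lr_find()
learn.unfreeze()
learn.fit_one_cycle(10, slice(MAX_LR))  # earlier layer groups get roughly MAX_LR/10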

To avoid a separate post directed at you, I’d also note that having added multichannel support you probably want to look at the changes I made in this commit. Currently that learner code just reuses the first channel of the original weights (with new_conv.weight.data[...] = conv.weight.data[:,0:1,:,:], which broadcasts to the new number of channels). This is likely non-optimal. That commit at least cycles the original channels and lets users provide a function to adapt the weights in a more intelligent way (though I’m not exactly sure how, beyond a basic intuition that you might want to maximise diversity of activation, i.e. choose those kernels that produce different activations for different inputs).

Thinking further, it may also be worth scaling the kernel weights based on the new number of channels. Otherwise you will end up changing the output distribution, as you now have multiple input channels using the same kernels and contributing to the same output channels (noting that you have a different set of kernels for each input channel->output channel pairing). I.e. going from 3 to 6 input channels, say, means you now have two sets of the same kernels contributing to each output channel, so the mean activation will roughly double (in aggregate; of course it will depend on the exact inputs and kernels). So you might want to multiply the kernel weights by orig_input_channels/new_input_channels to correct for this. But that’s just a guess.
This may also link to your OP, since the effect is likely to be exacerbated if you freeze the model: the weights then can’t update to better values for the new number of input channels.
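A rough sketch of that idea (not the commit’s exact code): cycle the pretrained kernels across the new input channels and rescale by orig/new so the expected output magnitude stays roughly the same. The function name and the resnet-stem usage are hypothetical:

import torch
import torch.nn as nn

def adapt_first_conv(conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Cycle pretrained kernels over the new input channels and rescale them."""
    orig_in = conv.in_channels
    new_conv = nn.Conv2d(new_in_channels, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        # weight shape: (out_channels, orig_in, kH, kW); cycle channels to cover the new count
        idx = [i % orig_in for i in range(new_in_channels)]
        new_w = conv.weight[:, idx, :, :] * (orig_in / new_in_channels)
        new_conv.weight.copy_(new_w)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Hypothetical usage on a torchvision resnet, adapting the stem to 6 channels:
# model.conv1 = adapt_first_conv(model.conv1, 6)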

1 Like

Yeah, there may be a default in between freeze and unfreeze that is better in general. I agree that testing needs to be done to try to determine some of those numbers. I initially didn’t think pretraining would help at all with spectrogram analysis, but it does make sense that at least the lower-level filters would, since spectrograms have edges and curves and such to detect.

I’ll look at the commit you referenced when I am able to look into this.

Hi! I do not fully understand what you are trying to accomplish but I think I can provide some help with the spectrogram.

In its purest form, a spectrogram is simply a matrix where every row (or column, depending on how you want to store it) is the Fourier transform of a small chunk of your original audio waveform. Given a spectrogram in this form you can do the inverse Fourier transform on each chunk to get back the waveform; librosa has a function for doing this on an entire spectrogram called istft.

That being said, you can only do the inverse transform if you have saved the phase information for each entry in the spectrogram matrix. In almost every application I have seen, the phase information is not kept: only the magnitude of each complex number is computed, and since many different complex numbers can have the same magnitude, that operation is not invertible. But you could perhaps try to add the phase as a second channel to your model input? In that case you should be able to reconstruct the waveform.

When computing the mel spectrogram from a spectrogram, you are essentially averaging neighbouring entries of the matrix, which is also not invertible.
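A minimal sketch of the “phase as a second channel” idea with librosa (the filename and the STFT parameters are just example assumptions):

import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=None)   # hypothetical input
D = librosa.stft(y, n_fft=1024, hop_length=256)

# Two "channels": magnitude and phase, stacked for a model input
mag, phase = np.abs(D), np.angle(D)
x = np.stack([mag, phase])                     # shape: (2, freq_bins, frames)

# Recombine the channels into a complex spectrogram and invert it
D_rec = x[0] * np.exp(1j * x[1])
y_rec = librosa.istft(D_rec, hop_length=256)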

1 Like

I found this blog post about inverting MelSpectrograms:


I tried it myself with my sound-enhancing project and got a slightly better result than with only STFT.
Sadly, slightly better results mean that instead of generating random beeps, I am generating an almost-human, very distorted voice.
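For reference, one way to approximate this kind of inversion (not necessarily what the linked post does) is librosa’s built-in mel inversion, which pseudo-inverts the mel filterbank and then estimates the missing phase with Griffin-Lim; expect exactly the kind of intelligible-but-distorted output described above (requires librosa >= 0.7, and the parameters here are just examples):

import librosa

y, sr = librosa.load("example.wav", sr=None)   # hypothetical input
M = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)

# Pseudo-inverse of the mel filterbank + Griffin-Lim phase estimation
y_approx = librosa.feature.inverse.mel_to_audio(M, sr=sr, n_fft=1024, hop_length=256)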

1 Like

This showed up in my podcast feed: https://anchor.fm/chaitimedatascience/episodes/Deep-Learning-Applied-to-Audio--Self-Studying-ML--Interview-with-fast-ai-fellow-Robert-Bracco-e4r6d9/a-ak53ld

I really enjoyed your discussion! I hope I will have time to try out fastai-audio very soon.

2 Likes

Thanks for the update, Jeremy, the new API looks really good.

@MadeUpMasters, @KevinB, @hiromi and I have decided to aim to have a v2-compatible branch ready by the end of September, including upgrading torchaudio. We’ll try to follow your development in the notebooks, but the environment might be slightly different. Where would be the best place to ask you questions regarding v2?

3 Likes

This looks good. Thank you.

Glad you enjoyed it, thanks for sharing it here, you beat me to it!

3 Likes

@KevinB Thanks for the update, that sounds great! My 8 channels are on hold until I can get my noise floor lower, but I definitely want to try it at some point.

Also, if it would help anyone: I’ve been working on a real-time visualization/debugging tool for use with PyTorch audio models. Currently it’s tied to my SignalTrain code, but I’ve been trying to write it in a manner that could be adapted to others’ models fairly easily:

Also, if anyone will be at the AES conference in October, or the ASA in December, let’s meet up! I’m co-chairing sessions on signal processing and ML-audio, but otherwise ‘just hanging out’.

1 Like

Multichannel is now implemented in fastai_audio by default. This shouldn’t cause any issues, but would love to get feedback from the community on the update!

1 Like

SpeechBrain, a new PyTorch-based Speech Toolkit was just announced. It’s looking to provide a single toolkit for many common speech-related tasks, such as speech recognition, speech separation, speech enhancement, speaker recognition, and language model training.

4 Likes

Really amazed to see the work by @MadeUpMasters @KevinB and others. Will it be possible to implement https://github.com/CorentinJ/Real-Time-Voice-Cloning using fastai_audio?

I wrote a brief blog post summarizing the research paper and the components needed to implement it here

1 Like

Do you know where I could find their current repo? It seems kind of weird that they’re open source but don’t have a link to their code, and you have to send an email to contribute.

1 Like

It seems like it will be a few months before it’s ready. In that post they say “A first alpha version will be available in the next months.”

2 Likes

I get an error when trying to use:

preds = audio_predict_all(learn, test)

test is configured with the same config file as learn; both seem to be loaded correctly.

~/steth_ai_fast_ai/fastai_audio/audio/learner.py in audio_predict_all(learn, al)
67 '''Applies preprocessing to an AudioList then predicts on all items'''
68 al = al.split_none().label_empty()
---> 69 audioItems = [AudioList.open(al, ai[0].path) for ai in al.train]
70 preds = [learn.predict(ai) for ai in progress_bar(audioItems)]
71 return [o for o in zip(*preds)]

~/steth_ai_fast_ai/fastai_audio/audio/learner.py in <listcomp>(.0)
67 '''Applies preprocessing to an AudioList then predicts on all items'''
68 al = al.split_none().label_empty()
---> 69 audioItems = [AudioList.open(al, ai[0].path) for ai in al.train]
70 preds = [learn.predict(ai) for ai in progress_bar(audioItems)]
71 return [o for o in zip(*preds)]

~/steth_ai_fast_ai/fastai_audio/audio/data.py in open(self, fn)
295 if self.path is not None and not fn.exists() and str(self.path) not in str(fn): fn = self.path/item
296 if self.config.use_spectro:
--> 297 item = self.add_spectro(fn)
298 else:
299 func_to_add = self._get_pad_func() if self.config.max_to_pad or self.config.segment_size else None

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
470 assert isinstance(fv, Callable)
471 def _inner(*args, **kwargs):
--> 472 self.train = ft(*args, from_item_lists=True, **kwargs)
473 assert isinstance(self.train, LabelList)
474 kwargs['label_cls'] = self.train.y.__class__

TypeError: add_spectro() got an unexpected keyword argument 'from_item_lists'

I couldn’t find an example of anyone using this; is there one?

1 Like

Sorry about that, our inference stuff is not thoroughly tested and still a bit buggy; we will have a more stable implementation in v2. I’m out of town this weekend and unable to test it, but if you’re comfortable messing around with the source code, try going into data.py and changing the signature of add_spectro so that it accepts **kwargs. The line should be:

def add_spectro(self, fn:PathOrStr, **kwargs):

Give that a shot and let me know if it works, or post the error if it leads directly to another bug.
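If you would rather not edit the installed source, a rough, untested sketch of the same fix applied as a runtime patch (assuming AudioList is importable from the audio package, as in the notebooks) would be:

from audio import AudioList   # assumed import path

_orig_add_spectro = AudioList.add_spectro

def _patched_add_spectro(self, fn, **kwargs):
    # Swallow the from_item_lists kwarg that fastai's data_block machinery passes through
    return _orig_add_spectro(self, fn)

AudioList.add_spectro = _patched_add_spectro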

Thanks!

1 Like