Deep Learning with Audio Thread

DM sent, but right now fastaudio only does classification, not ASR. @scart97 and I are off learning ASR in pytorch/lightning and hope to bring a simplified ASR pipeline to fastai. As I mentioned in the DM, for people looking to do ASR as easily as possible, Nvidia's NeMo library is probably the best place to start.

2 Likes

Can you give more details of the problem you're trying to solve? What are you trying to detect in audio? The most common feature extraction is Mel-scaled spectrograms. Preprocessing really depends on the task, but it is standard to downmix to mono (average the stereo channels so you have one channel) and to resample (change the sampling rate of the audio). Let us know what you're working on and we can point you in the right direction. There are also plenty of intro resources and tutorials in the first post of this thread.
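As a concrete starting point, here is a minimal sketch of that standard preprocessing with librosa (the file path and parameter values are only placeholders, not a recommendation for your specific problem):

import numpy as np
import librosa

# Load the clip, downmix stereo to mono and resample to 16 kHz in one call
y, sr = librosa.load("example.wav", sr=16000, mono=True)

# Mel-scaled spectrogram, converted to decibels so it behaves more like an image
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
S_db = librosa.power_to_db(S, ref=np.max)
print(S_db.shape)  # (n_mels, n_frames)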

Hi everyone,
I am working on a speech recogniser for Spanish. I will be using the Common Voice dataset https://commonvoice.mozilla.org/en/datasets and Google Cloud Platform, and I am not sure how I should set it up.
How do you usually proceed with storage? Do you store the compressed or the uncompressed files? Do you do the WAV conversion beforehand or in the VM before training? Is it even useful to pay for storage, given that the data is public, or should I simply make sure the disk of my VM is big enough?
I could also use some GPU setup recommendations. I have been recommended 1 x T4, but we will probably need to train from scratch as I could not find a pretrained model for Spanish. Will it do the job?
Thanks so much, any answer, even partial, will be greatly appreciated. :slight_smile:

Charles

1 Like

Hi
I've been doing speech recognition for 6 months now on a slightly modified version of the library, with quite good success (0.31 WER).
At first I was storing raw files, but converting them to spectrograms on every run is an unnecessary waste of computation and time, unless you modify your SpectrogramConfig at runtime (which I believe is not usually the case). According to Google's SpecAugment paper, augmentations performed on spectrograms (e.g. time warping) work better than those done on the raw audio.
I ended up using cached spectrograms on the cloud, for which some trivial code modifications were necessary.
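For reference, a minimal sketch of that idea, assuming the cached spectrograms are serialized PyTorch tensors and using torchaudio's masking transforms for the SpecAugment-style part (paths and parameter values here are made up):

import torch
import torchaudio.transforms as T

# Load a spectrogram that was cached to disk earlier instead of recomputing it
spec = torch.load("cache/clip_0001.pt")  # shape (n_mels, n_frames)

# SpecAugment-style frequency and time masking applied directly to the spectrogram
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),  # mask up to 15 mel bins
    T.TimeMasking(time_mask_param=35),       # mask up to 35 time steps
)
spec_aug = augment(spec.unsqueeze(0))  # the transforms expect a leading channel/batch dim
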
Google also allows you to keep images of complete instances, together with your dataset on the machine. I think that is the most accessible approach if you don’t mind the extra $$ for storage.

You should also keep in mind that you need 1000+ hours of training data for a good speech recognition system. State-of-the-art results on English are trained on 10,000+ hour datasets.

1 Like

Thanks @ppotrykus. Impressive results! Let me summarise to make sure I understood:

  1. I should uncompress, convert to WAV and then to spectrograms, and store only the spectrograms on the cloud.
  2. I can perform my augmentation directly on those spectrograms and store them on the cloud too. What do you mean by “cached spectrograms”?
  3. What do you mean by:

Google also allows you to keep images of complete instances, together with your dataset on the machine.

How should I use that ?

Thanks again !

1 Like

I started from a former version of https://github.com/fastaudio/fastaudio but now I see there have been some significant updates to the library.
What I mean by cached spectrograms is the result of calculating spectrograms from raw audio using a given config (sampling rate, max frequency, n_mels, n_fft, etc.) and storing them as serialized pytorch tensors. Instead of recalculating them each time you train a model, you can load them from a cache directory.
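A rough sketch of what that caching step can look like with torchaudio (the config values and paths are hypothetical, not the exact code I used):

import torch
import torchaudio

# Hypothetical config values standing in for the SpectrogramConfig mentioned above
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)

waveform, sr = torchaudio.load("clip_0001.wav")  # assumes the file is already at 16 kHz
torch.save(to_mel(waveform), "cache/clip_0001.pt")  # cache once, reload with torch.load at train time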

  3. The Google VM tooling allows you to save an instance image and use it to create a new instance with exactly the same setup and data. Roughly the same happens when you suspend or hibernate a running instance. It's much cheaper to hibernate your instance than to keep it alive when you're not training anything.
1 Like

Hey Charles, that’s awesome I am also working with CV Spanish on an ASR project.

You absolutely want to handle audio preprocessing in advance, as things like resampling and file format conversion are very slow. I would recommend just storing it on your VM. CV has multiple sample rates (I believe 48,000 Hz and 44,100 Hz) and you really only need 16,000 Hz, so I would downsample and store the audio. I have already done this for CV Spanish and can make the .tar available if you like. For me, I am using colab+drive, so I have the 16k mp3 files compressed and stored on drive, and I decompress in colab every morning (takes 90s).
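If you want to do that downsampling yourself, a minimal sketch along these lines should work (paths are placeholders, and it assumes librosa's backend can decode the Common Voice mp3s on your machine):

from pathlib import Path
import librosa
import soundfile as sf

src_dir = Path("cv-corpus-es/clips")      # wherever the Common Voice mp3s live
dst_dir = Path("cv-corpus-es/clips_16k")
dst_dir.mkdir(parents=True, exist_ok=True)

for mp3 in src_dir.glob("*.mp3"):
    # Decode, downmix to mono and resample to 16 kHz in one step, then write a wav
    y, _ = librosa.load(mp3, sr=16000, mono=True)
    sf.write(dst_dir / (mp3.stem + ".wav"), y, 16000)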

Training from scratch is challenging for ASR as it takes a ton of compute. The fastest-training models I've seen are trained on a cluster of 8 GPUs for at least 3-4 days, and generally longer. @scart97 and I have been working on ways to reduce this so it's feasible to train from scratch on 1 GPU, but we haven't gotten anything that beats using transfer learning. Luckily transfer learning can work across languages. Here's an article about transfer learning with Nvidia's NeMo library. I've been using it for English with good/fast results and I'm getting okay results with Spanish. For this you do need to convert MP3 to WAV and also generate NeMo manifest files, but I've already done it and I'm happy to share that as well; just let me know if you decide to try NeMo. Cheers.
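For reference, the manifest files are, as far as I understand, just JSON lines with an audio path, duration and transcript, so generating them is roughly this (the entries list is a placeholder for however you pair wavs with transcripts):

import json
import librosa

# Placeholder: list of (wav_path, transcript) pairs built from the Common Voice tsv
entries = [("clips_16k/common_voice_es_0001.wav", "hola mundo")]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for wav_path, text in entries:
        duration = librosa.get_duration(filename=wav_path)
        f.write(json.dumps({"audio_filepath": wav_path,
                            "duration": duration,
                            "text": text}, ensure_ascii=False) + "\n")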

1 Like

Awesome, I’d love to hear more about how you modified the library for ASR. What language? Do you mean 3.1% WER (near state of the art) or 31% WER? 0.31% WER would be about 7x better than current SOTA, and 31% WER is very high but would still be good for some languages.

I convert to spectrogram in real time. If you do the conversion to spectrogram on the GPU it's quite fast, and a lot of the spectrogram settings are important hyperparameters that sometimes need tweaking (see John Hartquist's spectrogram parameter search). Once you've decided which settings work best for your application, caching them seems fine, but I'm not sure how much of a speedup you actually get over doing the GPU version of the transform.
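As a rough illustration of the on-GPU version with torchaudio (the batch size and spectrogram settings here are arbitrary, not the ones from that hyperparameter search):

import torch
import torchaudio.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
to_mel = T.MelSpectrogram(sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80).to(device)

batch = torch.randn(8, 16000 * 5, device=device)  # dummy batch of 5-second clips
specs = to_mel(batch)                             # (batch, n_mels, n_frames), computed on the GPU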

This is true, and painful; it makes it really hard to do great work/research. I think there's a lot of potential in transfer learning. I'm also not convinced that there isn't a way to come up with faster, fully convolutional models that perform 1-2% short of SOTA but train 10x faster.

Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains

might be interesting to look at

Hi @MadeUpMasters,

Thanks for your answers, very useful. If I understand correctly, I should create a VM to do the downsampling and store only the results (16 kHz MP3 files on Google Cloud Storage, or maybe WAV files, as it seems most models use those). Is using a GPU even useful for this task? I think I will use audiomate https://github.com/ynop/audiomate to download and preprocess the data (it is even possible to create a subset to experiment faster, and there are more corpora), but if I have trouble I will keep your offer of a .tar file in mind :rofl:
At the moment I am considering the DeepSpeech model, and audiomate should help me conveniently get the data into that format. I will try this way and let you know how it goes. I did not know NeMo, so I will consider it next :slight_smile: I also had in mind that transfer learning between languages should work, since the sounds are the same; it is mostly the language model you use in your decoder that is affected by the change of language.

Thanks again !
Charles

Awesome, I’d love to hear more about how you modified the library for ASR. What language? Do you mean 3.1% WER (near state of the art) or 31% WER? 0.31% WER would be about 7x better than current SOTA, and 31% WER is very high but would still be good for some languages.

I managed to get to 31% WER on Polish using Mozilla Common Voice, Clarin-PL and a manually extracted dataset for training. I'm testing on manually collected speech samples from various speakers. On that test set Google scores 10% WER and Microsoft's ASR gets as low as 7.7%. I am using SymSpell as a naive language model, which improves WER from 39% to 31%.

John Hartquist Spectrogram param search.

I didn’t know about this technique but I will definitely give it a try.
So far my workflow relies on CPU spectrogram generation and that’s a bottleneck.

I haven’t tried using transfer learning. NeMo looks very promising. I’ll post the results here once I manage to get it running.

2 Likes

Sorry for the delayed response, how is your work going? Any update? Creating the VM to do the downsampling is fine, or it can be done locally, whatever you prefer. I’ve never used audiomate, please report back if it makes things easier, I usually just do everything manually in python or fastai.

You would expect the acoustic models to be similar across languages because most sounds are similar or the same, but it seems much harder to train other languages to as high a level of accuracy as is achieved in English, and it's not just differences in language models. There are a lot of potential reasons for this, none of them inherent to the languages themselves (if anything Spanish should be much easier, since it's phonetic and there is a near one-to-one correspondence between letters and sounds that English doesn't have).

In my experience, noisy labels affect CTC (the loss function used in models like DeepSpeech/QuartzNet) more than they affect other applications like vision. I would recommend listening to Common Voice Spanish and trying to filter out bad labels. In my experience with that set, the shorter labels are especially noisy, even if they have been “validated”.
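A crude first pass at that filtering, assuming the standard Common Voice release layout (a tab-separated validated.tsv with a sentence column):

import pandas as pd

df = pd.read_csv("cv-corpus-es/validated.tsv", sep="\t")

# Drop very short transcripts, which in my experience are the noisiest
df["n_words"] = df["sentence"].str.split().str.len()
df[df["n_words"] >= 3].to_csv("validated_filtered.tsv", sep="\t", index=False)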

Hi @MadeUpMasters, it is all going great at the moment. I downsampled everything to 16 kHz and stored multiple datasets (from audiomate) in a Google Storage bucket. At the moment I am training DeepSpeech on a V100. With only one GPU it is very slow, so I will need to use several.
Audiomate did make things easier, as not all datasets come in the same format. It lets you prepare the data and turn it into the Mozilla format so it is ready to be fed into DeepSpeech.
DeepSpeech also makes it very easy to add all sorts of augmentation.

1 Like

Hello Everyone,

I am exploring the DCASE acoustic scene classification dataset, and I am planning to transform all the audio files to spectrograms and build a model (CNN) that could read these spectrograms (images) and come up with predictions.

Each audio file is 10 seconds long. I am using the librosa library to convert to spectrograms, and I have a question about resizing a spectrogram.
E.g. take an audio file labeled as Airport:
import os
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the clip and compute a dB-scaled STFT spectrogram
y, sr = librosa.load(fpath)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

# Render the spectrogram as a 12x12-inch figure and save it as a PNG
plt.figure(figsize=(12, 12))
librosa.display.specshow(D)
fname = os.path.basename(fpath).split('.')[0] + "_12x12.png"
plt.savefig(path / fname)
By changing the figsize parameter we can resize the figure dimensions.

As we resize to smaller dimensions the image gets squeezed; does this affect the model prediction? And what is a better way to resize to a square shape?

1 Like

Hi, I'm getting a problem with loading my model once it has been exported with learn.export('model').

In a separate notebook I use learn = load_learner('model').

How do I run new audio files through the necessary transforms (Delta) when apparently my model only accepts audio tensors? They don't include the delta (rate of change) spectrograms.

I get the following error:
AssertionError: Expected an input of type in

  • <class 'pathlib.PosixPath'>
  • <class 'fastaudio.core.signal.AudioTensor'>
    but got <class 'fastaudio.core.spectrogram.AudioSpectrogram'>

By resizing the spectrograms to make them square you could be losing information in the time domain. Try changing the hop_length in librosa.feature.melspectrogram(y).

The section on hop length describes it very well.
Another way to make them more square would be to change the length of the sound clip with ResizeSignal(5000) (5 seconds).
The spectrograms being square doesn't matter, as the CNN will add some amount of padding. What does matter is that ALL your input spectrograms are the same length (size).
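To make the hop_length point concrete, here is a small sketch that picks hop_length so a 10-second clip comes out roughly square, instead of squashing the saved image afterwards (values are only illustrative):

import librosa

y, sr = librosa.load("airport_example.wav", sr=None)  # one 10-second DCASE clip (placeholder path)
target_frames = 128                                    # desired spectrogram width

hop_length = len(y) // target_frames                   # n_frames is roughly len(y) / hop_length
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=hop_length)
print(S.shape)  # approximately (128, 129): close to square with no image resizing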

Hi everyone,
I have been working with NeMo to get a speech recognizer working for Spanish. Now I am trying to get it to run in real time, to do something similar to Alexa or 'OK Google', but I am not sure how to proceed. The goal is to give orders to the app to execute specific actions, so we need to continuously listen and transcribe in order to recognize previously defined sentences. Does anyone have experience with this?
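To make what I mean concrete, the naive version I have in mind is something like this (transcribe is just a placeholder for whatever wraps the NeMo model, and I suspect a real system needs streaming or voice activity detection rather than blocking chunk recording):

import numpy as np
import sounddevice as sd

SR = 16000
CHUNK_SECONDS = 2

def transcribe(audio: np.ndarray) -> str:
    # Placeholder: run the chunk through the NeMo model here
    raise NotImplementedError

while True:
    # Record a fixed-length chunk from the default microphone
    chunk = sd.rec(int(CHUNK_SECONDS * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    text = transcribe(chunk.squeeze())
    if "enciende la luz" in text.lower():  # example trigger phrase
        print("command detected")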

Are you doing this as continuous speech recognition with QuartzNet or as a command classifier, e.g. MatchboxNet?

1 Like

Hello Beautiful peeps,

I’m trying to get into ASR and deep learning, but I’m getting stuck.
I have a speech dataset of multiple speakers, each speaker has 3 audio files of 45 seconds.
I have the corresponding outputs in a .seg file
I also have a list of 50 “phonemes/characters” for which I have to make a dictionary.

my questions are:
1- When computing the MFCCs, do I compute the feature matrix for each utterance individually, or one single matrix containing the features for all utterances?
2- What is the purpose of the phoneme dictionary?
3- Do you have any tutorial or PyTorch example of a dataloader that I can follow once I have the MFCCs on disk (roughly along the lines of the sketch below)?
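To make question 3 concrete, the kind of setup I have in mind is roughly this (paths, names and the phoneme format are placeholders; corrections very welcome):

import torch
from torch.utils.data import Dataset, DataLoader

class MFCCDataset(Dataset):
    """Loads precomputed MFCC tensors and integer-encoded phoneme targets from disk."""
    def __init__(self, items, phoneme2idx):
        # items: list of (mfcc_path, phoneme_string) pairs; both placeholders here
        self.items = items
        self.phoneme2idx = phoneme2idx

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        mfcc_path, phonemes = self.items[i]
        features = torch.load(mfcc_path)  # (n_mfcc, n_frames), one file per utterance
        targets = torch.tensor([self.phoneme2idx[p] for p in phonemes.split()])
        return features, targets

def collate(batch):
    # Pad variable-length utterances so they can be stacked into one batch
    feats, targets = zip(*batch)
    feat_lens = torch.tensor([f.shape[-1] for f in feats])
    max_len = int(feat_lens.max())
    padded = torch.stack([torch.nn.functional.pad(f, (0, max_len - f.shape[-1])) for f in feats])
    target_lens = torch.tensor([len(t) for t in targets])
    return padded, torch.cat(targets), feat_lens, target_lens  # the shapes CTC loss expects

# Usage: fill items/phoneme2idx from the .seg files and the 50-symbol list
loader = DataLoader(MFCCDataset(items=[], phoneme2idx={}), batch_size=8, shuffle=True, collate_fn=collate)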

Thanks a lot.

1 Like

Hi folks,
I recently added speech commands classification based on Nvidia NeMo models to the icevision library. You can check the colab here: https://colab.research.google.com/drive/1POCDMwNqe1Sq8eOwb5fEopRE5RobQJCJ?usp=sharing

It supports training with fastai using the NeMo classification models.
We have an active discord channel for the library and growing audio community: https://discord.gg/wFhC6nZQ
If anyone is interested in this topic and wants to contribute to the project (mainly adding nemo asr models support) you’re more than welcome!
I'm also planning to add support for various fastaudio features there.

2 Likes