Deep Learning with Audio Thread

I got the same empty set with from_csv and had to move to from_folder. How did you fix it?

It depends on where the script is running. Printing items in AudioList.__init__ might help.

Update to my last response: running some tests, I realized that it didn't take that long; it was only the first epoch that was slow. I'm currently using only the duration parameter. Thank you.

What is the size of your dataset? How long is each file?

What configuration are you running with? How many songs do you have?

I think it's quite common to break audio files into smaller chunks to be classified.
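If it helps, here's a minimal sketch of that kind of fixed-length chunking with librosa; the path, chunk length, and sample rate are placeholder values, not anything prescribed by the library:

```python
import librosa
import numpy as np

def chunk_audio(path, chunk_seconds=5.0, sr=16000):
    """Split an audio file into fixed-length chunks for classification."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    samples_per_chunk = int(chunk_seconds * sr)
    chunks = [y[start:start + samples_per_chunk]
              for start in range(0, len(y), samples_per_chunk)]
    # Zero-pad the final chunk so every chunk has the same length.
    if chunks and len(chunks[-1]) < samples_per_chunk:
        chunks[-1] = np.pad(chunks[-1], (0, samples_per_chunk - len(chunks[-1])))
    return chunks
```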

Have a look through the tutorial notebooks in the repo for more information and let us know if you have any questions.

The fastai v1 version of the library is now installable via:

pip install git+https://github.com/mogwai/fastai_audio@0.1

or as part of your requirements.txt file:

Pillow==6.0.1
git+https://github.com/mogwai/fastai_audio@0.1

Hi, I was able to do classification on audio (cat vs. dog) and was wondering how to take it to speech-to-text. Has anyone worked on this?

Thanks

Hi baz, nice post.
Cheers mrfabulous1 :smiley: :smiley: :smiley:

Hello! I am training a CNN model to extract vocals from audio, using the MUSDB18 dataset for training. The problem is that when I try to evaluate the results with museval, I get negative SDR, ISR, and SAR. I thought the problem was how I invert the signal with ISTFT (I used the librosa and SciPy libraries), but I couldn't figure it out. It's really strange because the results sound very good by ear. Does anyone have any idea? Thanks!
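For reference, the kind of round-trip check I've been using to rule out inconsistent STFT/ISTFT parameters looks like this (the path and FFT parameters are example values; they should match the training configuration):

```python
import numpy as np
import librosa

n_fft, hop_length = 2048, 512  # example values; match your training config

y, sr = librosa.load("mixture.wav", sr=None, mono=True)  # placeholder path
S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
y_rec = librosa.istft(S, hop_length=hop_length, length=len(y))

# If this error is not tiny, analysis and inversion parameters disagree.
print("max abs error:", np.max(np.abs(y - y_rec)))
```

It's also worth checking that the estimate and the reference passed to museval have the same length and sample rate, since the metrics are computed frame by frame.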

There is now a channel on the fastai discord server to discuss all things fastai_audio :slight_smile:

Hey all, sorry I haven’t been contributing to this thread much. I’ve still been working on audio, specifically speech recognition, and have learned a lot of great tricks for training models. I plan to put some work into this thread updating the wiki with more resources and sharing what I’ve learned. Here’s some of what I hope to add:

  • Sorting your first epoch by audio length (SortaGrad) and how to implement it (see the sketch after this list)
  • How to use a SortishSampler (trademark Jeremy Howard) to group audio clips by length for efficient GPU utilization, while not having the same items batched together every epoch
  • Training with CTC loss, how it plateaus, and why it sometimes fails to converge
  • Experiments on warming up speech models on short labels until they converge, then training them on full datasets
  • How to use CTCDecode for beam search: https://github.com/parlance/ctcdecode
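As a preview of the first two items, here's a rough PyTorch sketch of the idea: sort by duration for the first epoch (SortaGrad), then shuffle length-sorted buckets afterwards. This is only an illustration, not the fastai implementation, and the bucket size is an arbitrary example value:

```python
import random
from torch.utils.data import Sampler

class SortaGradSampler(Sampler):
    """Epoch 0: indices sorted by clip duration (SortaGrad).
    Later epochs: shuffle buckets of similar-length clips, so batches
    stay length-homogeneous without repeating the exact same order.
    `durations` is a list of clip lengths in seconds."""

    def __init__(self, durations, bucket_size=256):
        self.order = sorted(range(len(durations)), key=lambda i: durations[i])
        self.bucket_size = bucket_size
        self.epoch = 0

    def __iter__(self):
        if self.epoch == 0:
            order = self.order  # shortest clips first
        else:
            buckets = [self.order[i:i + self.bucket_size]
                       for i in range(0, len(self.order), self.bucket_size)]
            random.shuffle(buckets)
            order = [i for bucket in buckets for i in bucket]
        self.epoch += 1
        return iter(order)

    def __len__(self):
        return len(self.order)
```

You'd pass this as the sampler to a DataLoader, together with a collate function that pads each batch to its longest clip.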

Which wiki?

The original post in this thread: Deep Learning with Audio Thread

Hello audio peeps,

I’m a newbie in deep learning, let alone audio, but I have a small question.
I’m trying to make sense of using an MLP for speech feature extraction, and I’m a bit confused about the names of parameters.
Could you explain the difference between ‘control parameters’, ‘trainable parameters’, and ‘learning parameters’?

Thanks a lot.

Cheers,

Hey @ahammami0, can you share where you saw these terms? I’m not completely familiar with them, but I can take a look and hopefully explain afterwards. In deep learning we generally differentiate between ‘trainable parameters’ and untrainable parameters, the latter more commonly called ‘hyperparameters’.

Trainable parameters are just what they sound like: parameters that are updated through training in order to lower the loss and get closer to a good solution. Hyperparameters are the ones that are chosen in advance and aren’t learned by the model through backpropagation, such as the learning rate, the number of mel bins in your melspectrogram, etc.
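To make the distinction concrete, here's a tiny PyTorch sketch (the layer sizes are arbitrary example values):

```python
import torch.nn as nn

# Hyperparameter: chosen in advance, never updated by backpropagation.
hidden_size = 128

model = nn.Sequential(nn.Linear(40, hidden_size), nn.ReLU(),
                      nn.Linear(hidden_size, 10))

# Trainable parameters: the weights and biases that training updates.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable}")
```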

Anyway, welcome to audio DL, and please reach out if you have any questions.

Hey @MadeUpMasters, thanks a lot for the support. I actually cleared up my own confusion: ‘control parameters’ is the term used for the acoustic parameters of the speech signal during the speech feature extraction phase.
Much appreciated.

Hello everyone,

I’m a software engineer but a newbie in the ML/DL world. I have a non-English speech-to-text (STT) project that I’m currently tinkering with.

I’m trying to use the fastai DataBlock API to prepare a learner with an ImageBlock (spectrograms) and a TextBlock, using the model from deepspeech.pytorch. Unfortunately, I haven’t really been able to get things going yet.

Do you folks recommend that I try the fastai-audio library instead?

Dear audio peeps,
I’m diving into pathological speech processing and trying to wrap my head around feature extraction and audio pre-processing. I’d like to know what you believe are the best tools, methods, approaches, and practices applied in this domain.

I appreciate any input.

Thank you all in advance.
A

DM sent, but right now fastaudio only does classification, not ASR. @scart97 and I are off learning ASR stuff in pytorch/lightning and hope to bring a simplified ASR pipeline to fastai. As I mentioned in the DM, for people looking to do ASR as easily as possible, NVIDIA’s NeMo library is probably the best place to start.

Can you give more details of the problem you’re trying to solve? What are you trying to detect in the audio? The most common feature extraction is mel-scaled spectrograms. Preprocessing really depends on the task, but it is standard to downmix to mono (average the stereo channels so you have one channel) and to resample (change the sampling rate of the audio). Let us know what you’re working on and we can point you in the right direction. There are also plenty of intro resources and tutorials in the first post of this thread.
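As a starting point, that standard pipeline looks roughly like this with librosa; the path and every spectrogram parameter here are example values you'd tune for your task:

```python
import numpy as np
import librosa

# Load as mono at a fixed sample rate: downmixing and resampling in one step.
y, sr = librosa.load("clip.wav", sr=16000, mono=True)  # placeholder path

# Mel-scaled spectrogram, then log scaling, which networks usually prefer.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (n_mels, frames)
```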

Hi everyone,
I am working on a speech recogniser for Spanish. I will be using the Common Voice dataset https://commonvoice.mozilla.org/en/datasets and Google Cloud Platform, but I am not sure how I should set things up.
How do you usually handle storage? Do you store the files compressed or uncompressed? Do you do the WAV conversion beforehand, or on the VM just before training? Is it even worth paying for storage, given that the datasets are public? Shouldn’t I simply make sure the disk on my VM is big enough?
I could also use some GPU recommendations. I have been advised to use 1 x T4, but we will probably need to train from scratch, as I could not find a pretrained model for Spanish. Will it do the job?
Thanks so much; any answer, even a partial one, will be greatly appreciated. :slight_smile:

Charles
