Hello, I started my first lesson of DL for Coders this week. I'm working on a multilabel classification problem, but I've been having a really hard time opening my dataset with the `from_` methods. I'm using the fastai_audio library; the paths are correct, but it doesn't load any items. I'm probably missing something. Thank you.
@baz Thanks for the response. I changed my folder structure and it works correctly now, although it should have been working before as well. Anyway, I'd like to know how you deal with real songs, where each audio clip is longer than a few seconds. Preprocessing and training are taking a long time even with a small dataset. What do you recommend? I got better performance using max_to_pad, but I'm not sure it's the best option. Thank you.
It depends on where the script is running. Printing the items in `AudioList.__init__` might help.
Update to my last response: after running some tests I realized it didn't take that long; it was only the first epoch. I'm currently using only the duration parameter. Thank you.
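For reference, the effect of fixing every clip to the same length (what a max_to_pad or duration setting achieves) can be sketched in plain NumPy. The function name here is my own illustration, not part of fastai_audio:

```python
import numpy as np

def pad_or_truncate(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Force a 1-D audio signal to a fixed length by zero-padding or truncating."""
    if len(signal) >= target_len:
        return signal[:target_len]
    padded = np.zeros(target_len, dtype=signal.dtype)
    padded[:len(signal)] = signal
    return padded

# Example: clip everything to 4 seconds at 16 kHz
sr, seconds = 16000, 4
clip = np.random.randn(sr * 10)              # a 10-second clip
fixed = pad_or_truncate(clip, sr * seconds)  # truncated to 64000 samples
print(fixed.shape)
```

Truncating keeps training fast but discards the rest of the song; padding wastes compute on silence, which is why fixed lengths suit short clips better than full tracks.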
Hello! I am training a CNN model to extract vocals from audio. I used the MUSDB18 dataset for training. The problem is that when I try to evaluate the results with museval, I get negative SDR, ISR, and SAR scores. I thought the problem was how I invert the signal with the ISTFT (I used the librosa and SciPy libraries), but I couldn't figure it out. It's really strange because the results sound very good just by listening to them. Does anyone have any idea? Thanks!
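One sanity check worth running: verify that your STFT/ISTFT pair reconstructs a known signal exactly. Mismatched window, hop, or length parameters silently shift or scale the output, and SDR-style metrics are very sensitive to misalignment even when the audio sounds fine. A minimal round-trip test with SciPy (the parameter values here are just an example; use whatever your model pipeline uses):

```python
import numpy as np
from scipy.signal import stft, istft

# A one-second 440 Hz test tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

# Forward and inverse transforms must use identical parameters
f, frames, Z = stft(x, fs=sr, nperseg=1024)
_, x_rec = istft(Z, fs=sr, nperseg=1024)

# Trim to the original length and measure reconstruction error
x_rec = x_rec[:len(x)]
err = np.max(np.abs(x - x_rec))
print(err)  # near-zero if the transform pair is consistent
```

If the round trip is clean but museval still returns negative scores, check that the estimate and reference arrays are the same length, the same sample rate, and time-aligned before evaluation.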
Hey all, sorry I haven’t been contributing to this thread much. I’ve still been working on audio, specifically speech recognition, and have learned a lot of great tricks for training models. I plan to put some work into this thread updating the wiki with more resources and sharing what I’ve learned. Here’s some of what I hope to add:
- Sorting your first epoch by audio length (SortaGrad) and how to implement it
- How to use a SortishSampler (trademark Jeremy Howard) to group audio clips by length for efficient GPU utilization, without having the same items together in every batch
- Training with CTC loss, how it plateaus, and why it sometimes fails to converge
- Experiments on warming up speech models on short labels until they converge, then training them on full datasets.
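As a rough illustration of the sortish idea (my own stdlib sketch, not fastai's actual `SortishSampler`): shuffle first, then sort by length only within large chunks, so each batch holds similar-length clips while batch membership still varies from epoch to epoch.

```python
import random

def sortish_batches(lengths, batch_size, chunk_mult=50, seed=0):
    """Yield index batches where items in a batch have similar lengths,
    but batch membership changes with the seed (e.g. the epoch number)."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)                       # randomize first...
    chunk = batch_size * chunk_mult
    batches = []
    for start in range(0, len(idxs), chunk):
        # ...then sort by length only within each large chunk
        part = sorted(idxs[start:start + chunk], key=lambda i: lengths[i])
        batches += [part[j:j + batch_size] for j in range(0, len(part), batch_size)]
    rng.shuffle(batches)                    # avoid always feeding shortest first
    return batches

lengths = [random.randrange(1, 500) for _ in range(1000)]
batches = sortish_batches(lengths, batch_size=32)
print(len(batches))  # ceil(1000 / 32) = 32 batches
```

Grouping by length means far less padding per batch, which is where the GPU-utilization win comes from; the chunked shuffle is what keeps the batches from being identical every epoch.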
I'm a newbie in deep learning, let alone audio… but I have a small question.
I'm trying to make sense of using an MLP for speech feature extraction, and I'm a bit confused by the names of the parameters.
Could you explain the difference between ‘control parameters’, ‘trainable parameters’, and ‘learning parameters’?
Hey @ahammami0, can you share where you saw these terms? I'm not completely familiar with them, but I can take a look and hopefully explain afterwards. In deep learning we generally differentiate between 'trainable parameters' and 'untrainable parameters', the latter more commonly called 'hyperparameters'.
Trainable parameters are just what they sound like: parameters that are updated during training to lower the loss and move closer to a good solution. Hyperparameters are chosen in advance and aren't learned by the model through backpropagation, such as the learning rate, the number of mel bins in your mel spectrogram, etc.
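To make the distinction concrete, here is a tiny NumPy sketch (my own toy illustration): `w` is a trainable parameter updated by gradient descent, while the learning rate `lr` is a hyperparameter fixed in advance and never touched by training.

```python
import numpy as np

# Toy data for the problem y = 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0      # trainable parameter: updated by the training loop
lr = 0.01    # hyperparameter: chosen in advance, never updated by the model

for _ in range(500):
    pred = w * x
    grad = 2 * np.mean((pred - y) * x)  # d(MSE)/dw
    w -= lr * grad                      # gradient descent step

print(round(w, 3))  # converges towards 3.0
```

Changing `lr` changes how training behaves (too large and `w` diverges, too small and it crawls), but nothing in the loop ever updates `lr` itself; that is the practical difference between the two kinds of parameters.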
Anyway, welcome to audio DL, and please reach out if you have any questions.
Hey @MadeUpMasters, thanks a lot for the support. I actually cleared up my own confusion: 'control parameters' is the term used for the acoustic parameters of the speech signal during the speech feature extraction phase.
Much appreciated.
I'm a software engineer but a newbie in the ML/DL world. I have a non-English STT project that I'm currently tinkering with.
I'm trying to use the fastai DataBlock API to prepare a learner with an ImageBlock (spectrograms) and a TextBlock, using the model from deepspeech.pytorch. Unfortunately, I haven't really been able to get things going yet.
Do you folks recommend that I try the fastai-audio library instead?
Dear audio peeps,
I'm diving into pathological speech processing and trying to wrap my head around feature extraction and audio pre-processing. For this, I'd like to know what you believe are the best tools, methods, and approaches, and the best practices applied in this domain.