Fastai v2 audio

I’d start by looking at the NLP tokenizing process for this, as this is exactly what is done there :slight_smile: (TextBlock)

2 Likes

On large datasets the CPU processing is already pretty heavy, so I wouldn’t want to pack in more processing by default. FWIW, I do this kind of thing:

fnames = get_files(in_base_path)
for fn in fnames:
    !ffmpeg -i {fn} -ac 1 -ar 16000 {out_base_path/fn.name}

This will downmix to mono and resample to 16 kHz.
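Outside a notebook (where the `!ffmpeg` shell escape isn’t available), the same loop can be sketched with `subprocess`. The helper below just builds the command list; `get_files`, `in_base_path`, and `out_base_path` are the names used above:

```python
import subprocess
from pathlib import Path

def resample_cmd(fn, out_dir, channels=1, sr=16000):
    """Build the ffmpeg command used above: downmix (-ac) and resample (-ar)."""
    fn, out_dir = Path(fn), Path(out_dir)
    return ["ffmpeg", "-i", str(fn),
            "-ac", str(channels), "-ar", str(sr),
            str(out_dir / fn.name)]

# The notebook loop then becomes:
# for fn in get_files(in_base_path):
#     subprocess.run(resample_cmd(fn, out_base_path), check=True)
```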

1 Like

Also, just to share, if you have long files you want to segment into 10 second chunks then this works really well:

!ffmpeg -i "{fn}" -f segment -segment_time 10 "{new_fn}_%04d.wav"
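As a side note on what this produces: `-f segment -segment_time 10` emits roughly ceil(duration / 10) files (the last one may be shorter), named with the `%04d` pattern. A small sketch of that arithmetic, with illustrative helper names:

```python
import math

def n_segments(duration_s, segment_s=10):
    # ffmpeg's -f segment splits into ceil(duration / segment_time) chunks;
    # the final chunk holds whatever remains.
    return math.ceil(duration_s / segment_s)

def segment_name(stem, i):
    # Matches the "%04d" pattern in the command above (zero-padded index).
    return f"{stem}_{i:04d}.wav"
```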
4 Likes

I usually preprocess my files using ffmpeg; that’s why I didn’t notice the time added by those transforms. I’ll try to implement caching similar to what’s present in the text pipeline.

I think these datasets are not available anymore.

AudioBlock has been upgraded: there’s now a new method, similar to TextBlock, that applies some preprocessing transforms to the audio, caches the results to a different folder, and then loads the audio data from that new folder.

AudioBlock.from_folder(path, sample_rate=16000, force_mono=True, crop_signal_to=None)

It does not introduce more CPU processing during training; the caching is done during block creation.
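For anyone wondering how this slots into a pipeline: here’s a hedged usage sketch, assuming the fastai2 `DataBlock` API of the time. The `from_folder` signature is from above; the import paths, `path`, `get_audio_files`, and `parent_label` labelling are illustrative and depend on your install:

```python
# Sketch only -- import paths vary across fastai2_audio versions.
from fastai2.data.all import *
from fastai2_audio.core.all import *

path = Path("data/my_audio")  # hypothetical dataset root, one folder per class
dblock = DataBlock(
    blocks=(AudioBlock.from_folder(path, sample_rate=16000, force_mono=True,
                                   crop_signal_to=None),
            CategoryBlock),
    get_items=get_audio_files,
    get_y=parent_label,                 # class = parent folder name
    splitter=RandomSplitter(valid_pct=0.2),
)
dls = dblock.dataloaders(path, bs=32)
```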

2 Likes

If I’m not using any augmentation on the spectrogram is there a recommended way to cache that?

V1 supports spectrogram caching; V2 doesn’t yet. It seems likely that spectrogram generation will soon run on the GPU, which would make caching less desirable, as it won’t save as much time (and might even be slower).

Can you give a bit more detail on what you’re working on? Do you know how long it’s taking to generate spectrograms? If you want to see how much time you’d save, you could always load your dataset in fastai_audio V1 and check. Cheers.
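Until V2 grows spectrogram caching, a rough stand-in is a compute-once disk cache keyed on the file path. This is a generic sketch, not fastai_audio’s actual caching code; `compute` stands in for whatever spectrogram function you use:

```python
import hashlib
import pickle
from pathlib import Path

def cached(fn, compute, cache_dir="spec_cache"):
    """Compute `compute(fn)` once and cache the picklable result on disk.

    The cache key is a hash of the source path, so re-running is a cheap
    disk read. Delete the cache dir to invalidate after changing `compute`.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.md5(str(fn).encode()).hexdigest()
    cache_fn = cache_dir / (key + ".pkl")
    if cache_fn.exists():
        return pickle.loads(cache_fn.read_bytes())
    out = compute(fn)
    cache_fn.write_bytes(pickle.dumps(out))
    return out
```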

OK, that sounds really interesting - using what library? DALI?

I’ve made some assumptions here that I need to back up with data. What I know for sure is that I’m CPU bound. I’ll do some experiments…

Just using torchaudio, I don’t believe it will require any special libraries for GPU optimization/augmentation.

Hi everyone,

I’m interested in Voice Activity Detection. It’s like a segmentation problem where we classify each frame of the audio input as either ‘speech’ or ‘non-speech’. Can I do it with fastai1 or fastai2 audio?

2 Likes

Hi, I came across the Mozilla Common Voice Cantonese community. Some of us would like to use the collected Cantonese voice data to learn DeepSpeech or other voice-to-text technology.


What would be the equivalent resources in Fastai2? I am looking forward to taking the course – fastai2, July 2020. I would appreciate any advice or networking to the resources/communities. Thanks.

1 Like

This is an interesting question and I’m a bit unsure of the answer because I don’t know exactly what the outputs would look like. How detailed is the segmentation? When you say ‘each frame’, are you talking about some discrete chunk of the audio like a 25ms slice?

My instinct is that fastai audio wouldn’t be an out-of-the-box solution for this, and that you would have to add a lot of additional code/tweaks, but the other functionality like spectrogram generation/resampling/silence removal might make it worth it. On the other hand, torchaudio has advanced a lot since we started this, and it now has most of the functionality that we implemented, so using it with plain pytorch is an option as well.
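On the framing question above: a minimal sketch of how per-frame ‘speech’/‘non-speech’ targets could be derived from labelled speech spans, assuming the common 25 ms window / 10 ms hop convention (all names here are illustrative, not fastai audio API):

```python
def frame_labels(speech_spans, duration_s, frame_ms=25, hop_ms=10):
    """Turn (start, end) speech spans in seconds into per-frame 0/1 labels.

    A frame is labelled 1 ('speech') if its centre falls inside any span,
    using a sliding window of `frame_ms` advanced by `hop_ms`.
    """
    n_frames = 1 + int((duration_s * 1000 - frame_ms) // hop_ms)
    labels = []
    for i in range(n_frames):
        centre = (i * hop_ms + frame_ms / 2) / 1000  # frame centre in seconds
        labels.append(int(any(s <= centre < e for s, e in speech_spans)))
    return labels
```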

1 Like

Fastai2 audio only supports classification at the moment, and there is no speech-to-text functionality.

@scart97 and I are both working on speech-to-text in multiple languages (I’m attempting to use Common Voice for one of them), and we are in a fairly active audio Telegram chat. If you, or anyone here, are interested, PM me your Telegram info and I’ll get you added. I’m also happy to answer any questions here, or in the Deep Learning with Audio Thread.

5 Likes

Thanks for the quick response; it makes this topic feel so active and reliable.

So I get it, we haven’t gotten there with fastai yet. For everyone with the same interest: there’s a library, ‘pyannote-audio’, that handles Voice Activity Detection, Speaker Change Detection, etc., and is under very active development (sorry, I’m using a smartphone to reply). To me, its only downside is that it has many complex internal abstractions and methods, which makes it hard to customize or integrate with fastai.

Hi all, thanks to all who are working on this.

Are you considering supporting something like kapre? (code here, paper here). It seems to be an order of magnitude faster than the fastai2_audio tutorial – though that may be down to an error in my usage rather than something intrinsic to the code.

From what I could tell, GPU utilisation seems fairly low for the fastai2_audio tutorial, since much of the preprocessing is being done by the CPU. Performing the preprocessing steps on the GPU should significantly speed up training.

If none of you is actively working on moving preprocessing to the GPU but you have plans to incorporate it, I’d love to contribute.

Torchaudio would be the equivalent of kapre in the pytorch ecosystem. We already use it, and it supports GPU transforms, but the way it’s integrated is preventing the code from working on the GPU.

I was trying to fix this right after implementing the caching system some time ago, but there was a serious problem and I didn’t have the time/energy to deal with it. Also, I’m currently working with Robert on speech recognition, but we’re using Lightning instead of fastai2, because ASR needs some really weird tricks to work that are easier to implement there.

If you want to give it a try, just message me and I’ll guide you.

2 Likes

Hi there,

I was wondering if someone can tell me where the 250-speaker database is, as the link that is provided is broken:

<Error>
<Code>AccessDenied</Code>
<BucketName>public-datasets</BucketName>
<RequestId>tx0000000000000211b1556-005f262cdb-2afb529-fra1a</RequestId>
<HostId>2afb529-fra1a-fra1</HostId>
</Error>

Thanks!

1 Like

This was a handmade subset of the VoxCeleb dataset that we are no longer hosting because there were concerns about potential data leakage. If you can tell me what exactly you would like to do, I can suggest a dataset that would be good. Here are some good options:

2 Likes

The ChaLearn apparent personality dataset “First Impressions” is pretty good.

1 Like