Fastai v2 audio

>> 1st Fastai2 Audio Meetup <<
Sorry for the late notice, but in case you're interested, come join the meeting of the developers of Fastai2 Audio.
The meeting agenda and the link to Google Meet are in the calendar event:
https://calendar.google.com/event?action=TEMPLATE&tmeid=MzU1NG52bmFlc3JhaGVpbjZqaGZzOHVwa2ogM2djNDI3dW45cWRsMTgwOThhbmU4OHRoMzRAZw&tmsrc=3gc427un9qdl18098ane88th34%40group.calendar.google.com

I would love to contribute to this amazing repo.
I see fastai_audio has a number of great tutorial notebooks. Do you plan to migrate them to fastai2_audio?

1 Like

Yes we do plan to migrate them with a bit of restructuring and added features.

Hi folks, I've got a question re transforms. Can we use the standard vision batch_tfms augmentations on the audio spectrograms in a DataBlock like this? It runs, but I'm just trying to understand what's going on behind the scenes when we're using the AudioBlock instead of the ImageBlock:

dblock = DataBlock(blocks=(AudioBlock, MultiCategoryBlock),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   get_x=get_x,
                   get_y=get_y,
                   item_tfms=item_tfms,
                   batch_tfms=[Normalize(),
                               RandomResizedCrop(256, min_scale=0.08, ratio=(1, 1))]
                  )

Thanks

We haven't actually experimented that much with the standard image transforms. My intuition is that almost all of them would be a bad idea. Spectrograms and photos are very different. If you move a human face 20 pixels upward, it's still a human face. If you move a lawnmower sound up 20 mel bins, is it still a lawnmower? I honestly don't know: on one hand the pitch will change, but the pattern of sound in the time domain may be distinctive enough to still be identified, and maybe there are some recording conditions that cause a lawnmower to have a higher or lower pitch, in which case this would be a successful augmentation.

Other transforms, like skewing, seem like a very bad idea. RandomResizedCrop (I'm assuming this crops a random section and expands it to the size of the original?) also seems like a bad idea for spectrograms. I think both the X and Y axes need to be on a constant scale for your model to compare spectrograms. That being said, try it out and see what results you get, because with deep learning my intuition is almost always wrong, and some very silly stuff often turns out to be effective, including for audio. If you do try it out, please report back here whether it was better or worse. Thanks.

2 Likes

I totally agree that the flipping/shifting/warping distorts the spectrogram such that it loses meaning perceptually. However, I've been reading the following write-up from a Freesound 2019 entry and there are some really interesting ideas in there:

The one I was thinking about is training on sub-sections of the overall clip. I was thinking that maybe, using the standard transforms, I could crop out smaller square sections of longer, ~10s spectrograms (which would be either repeated clips using 'Repeat' mode on CropSignal, or complete sections of longer clips) and use those to train. I think this can be done with one of the following:

RandomResizedCropGPU(256, min_scale=1, ratio=(1, 1), mode='bilinear', valid_scale=1.0)

or RandomCrop + Resize

Although I'm not totally sure. Say we set n_mels to 128 and then Resize to 256px: would it be as if we are randomly cropping out 128px square sections, resizing them to 256x256px, and thereby possibly gaining the maximum amount of feature learning across the different batches? I've been trying it by training on the Freesound 2019 curated train set and it seems to work well enough, but this doesn't seem to translate to the test set, nor do Brightness or Contrast augmentations help. Perhaps it's not the best dataset to test this on, so I will try ESC-50 or something more standard later and update.
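
To make it concrete, this is roughly the pipeline I have in mind (reusing the get_x/get_y from my earlier post; the config arguments here are just guesses on my part):

# ~10s clips -> 128-mel spectrograms -> random square crops resized to 256x256 on the GPU
cfg = AudioConfig.BasicMelSpectrogram(n_mels=128)
item_tfms = [CropSignal(10000), AudioToSpec.from_cfg(cfg)]
batch_tfms = [RandomResizedCropGPU(256, min_scale=1, ratio=(1, 1))]

dblock = DataBlock(blocks=(AudioBlock, MultiCategoryBlock),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   get_x=get_x, get_y=get_y,
                   item_tfms=item_tfms,
                   batch_tfms=batch_tfms)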

2 Likes

Cropping shorter subsections definitely works and is a valid augmentation; I almost always use it. I'm not sure whether Repeat is better than the default pad mode we use (random zero-padding before and after).

We have tried cropping to 128x128 and resampling to 256x256 and it does show improved results. It still blows me away that that appears to work better than just taking 256 mel bins (higher frequency resolution). I have no idea why bilinear interpolation would work better than actually adding more information, but it appears that it does.

Definitely keep experimenting and report back what you find. It is very early days for this type of stuff.

1 Like

That's interesting re the 128 -> 256x256 resample. Is it that using the 128 mel bins -> 256 via bilinear interpolation is like a form of compression that clearly delineates the harmonic/temporal relationships, which essentially gives the network clearer edges to learn?

I have another query re v2: is there a quick method to see the same batch shown by dls.show_batch() after the batch transforms are applied? I'm a bit lost in the docs on this as I'm fairly new to it.
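
The closest I've got is grabbing a raw batch and plotting it by hand, but I'm hoping there's a built-in way:

import matplotlib.pyplot as plt

xb, yb = dls.one_batch()                         # a batch after item_tfms and batch_tfms
plt.imshow(xb[0][0].cpu(), origin='lower', aspect='auto')
plt.show()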

Did you end up recording this meeting? Thanks.

1 Like

BREAKING CHANGE ANNOUNCEMENT

I just merged a change into the fastai2_audio repository that alters the structure of the modules. Now, both core and augment submodules are split into multiple files, meaning that you can choose to use just a part of them (like import just the signal processing stuff and ignore spectrograms if you want).

That also means some imports need to be changed:

  • from fastai2_audio.core import * is now from fastai2_audio.core.all import *
  • from fastai2_audio.augment import * is now from fastai2_audio.augment.all import *

There is also a from fastai2_audio.all import * if you want to quickly import everything.
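
For example, to pull in only the signal processing side (I'm not certain of the exact file names after the split, so treat this as an assumption and double-check against the repo):

from fastai2_audio.core.signal import *   # just the signal/AudioTensor side (module name assumed)
# instead of fastai2_audio.core.all, which also pulls in the spectrogram code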

Hi folks, is there any way to export a trained audio learner currently? I'm getting the following error:

AttributeError: Can't pickle local object 'RemoveSilence.<locals>._inner'

This is a known problem (related issue) and I've started working on the fix.

Note that you'll need PyTorch v1.5.1, as pickling was also broken in v1.5.0.
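
For anyone wondering why export breaks: RemoveSilence builds and returns a function defined inside another function, and pickle refuses to serialize such local objects. A toy illustration of the underlying Python limitation (the names below are made up, not the actual library code):

import pickle

def make_transform():
    def _inner(x):        # a function defined inside another function is a "local object"
        return x
    return _inner

pickle.dumps(make_transform())
# AttributeError: Can't pickle local object 'make_transform.<locals>._inner'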

2 Likes

Didn't know about this problem with torch 1.5.0, but it's unrelated to the export problem here because I can reproduce the error with torch 1.4.

I fixed some of the problems with the transforms that were causing trouble, but now I've hit a wall that I couldn't get past after some days of debugging.

Saving only the model, the dataset, or the transforms works, but when torch tries to save the TfmdDL it breaks.

I made a notebook reproducing the problem here.

I'm currently trying to do some audio work and I'm getting the following error:

RuntimeError: stack expects each tensor to be equal size, but got [1, 128, 121] at entry 0 and [1, 128, 111] at entry 2

My batch_tfms are [RemoveSilence, Resample] (the first thing I tried to fix this issue) and my item_tfms are

cfg_voice = AudioConfig.BasicMelSpectrogram()
a2s = AudioToSpec.from_cfg(cfg_voice)
crop_1000ms = CropSignal(500)
tfms = [crop_1000ms, a2s]

Is there something more I need? :slight_smile: Thanks!

This error is happening because you are trying to batch spectrograms of different lengths. You are already cropping/padding the signals to a fixed size before transforming them, so the problem is probably that they have different sampling rates. Try adding the Resample transform to your item_tfms, and it needs to be the first thing to happen. So,

tfms = [Resample(8000), crop_1000ms, a2s]

Where 8000 is the new sampling rate. This choice depends on what is present in the audio files.
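
Putting it together with the code from your question, the item_tfms block would look something like this (8000 here is just an example rate):

cfg_voice = AudioConfig.BasicMelSpectrogram()
a2s = AudioToSpec.from_cfg(cfg_voice)
crop_1000ms = CropSignal(500)                 # crop/pad every clip to the same length (name kept from your post)
tfms = [Resample(8000), crop_1000ms, a2s]     # Resample first so every clip shares one rate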

1 Like

For pure voice audio, 8 kHz (8000) should be enough, but if you have other sources of sound besides voice you may want to use 16 kHz (16000) or even 44.1 kHz (44100). These rates are directly related to the highest frequency present in your audio and the Nyquist theorem.
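
As a quick sanity check on those numbers (the Nyquist theorem says a sampled signal can only represent frequencies up to half the sampling rate):

for sr in (8000, 16000, 44100):
    print(f"{sr} Hz sampling rate -> frequencies up to {sr // 2} Hz")
# 8000  -> 4000 Hz   (enough for speech)
# 16000 -> 8000 Hz
# 44100 -> 22050 Hz  (covers the full audible range)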

1 Like

That fixed the error right away! Thank you so much @scart97 :pray:

2 Likes

@rbracco What if we change the AudioBlock to always include Resample and optionally DownmixMono and CropSignal? That would help fix the majority of problems users have when loading data.
The new signature would be:

def AudioBlock(sample_rate=16000, force_mono=True, crop_signal_to=None, cls=AudioTensor)
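
Roughly, the factory could compose the transforms like this (just a sketch to illustrate the idea, not an actual implementation; the exact transform constructors may differ):

def AudioBlock(sample_rate=16000, force_mono=True, crop_signal_to=None, cls=AudioTensor):
    item_tfms = [Resample(sample_rate)]            # always resample to a common rate
    if force_mono:
        item_tfms.append(DownmixMono())
    if crop_signal_to is not None:
        item_tfms.append(CropSignal(crop_signal_to))
    return TransformBlock(type_tfms=cls.create, item_tfms=item_tfms)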
2 Likes

Hey, as mentioned in the PR, I think it's too much of a time bottleneck. I think possible alternatives are:

  • Getting resampling working on the GPU might be fast enough
  • Caching results using some new caching system
  • Giving the user a function that preprocesses (resample, remove silence, etc.) and outputs to a new folder, which then becomes the starting point for the ML pipeline (similar to caching; there may be a good way to do this within fastai2 as well). A rough sketch of this idea is below.
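
A rough sketch of that last option using plain torchaudio (the paths and the 16 kHz target are just examples):

from pathlib import Path
import torchaudio

def preprocess_folder(src_dir, dst_dir, target_sr=16000):
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for f in Path(src_dir).glob('*.wav'):
        sig, sr = torchaudio.load(f)
        if sr != target_sr:
            sig = torchaudio.transforms.Resample(sr, target_sr)(sig)
        torchaudio.save(str(dst_dir / f.name), sig, target_sr)

# preprocess_folder('raw_audio', 'audio_16k')  # then point the DataBlock at 'audio_16k'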