Fastai v2 audio

>> 1st Fastai2 Audio Meetup <<
Sorry for the late notice, but in case you're interested, come join the meeting of the developers of Fastai2 Audio.
The meeting agenda and the link to Google Meet are in the calendar event:
https://calendar.google.com/event?action=TEMPLATE&tmeid=MzU1NG52bmFlc3JhaGVpbjZqaGZzOHVwa2ogM2djNDI3dW45cWRsMTgwOThhbmU4OHRoMzRAZw&tmsrc=3gc427un9qdl18098ane88th34%40group.calendar.google.com

I would love to contribute to this amazing repo.
I see fastai_audio has a number of great tutorial notebooks. Do you plan to migrate them to fastai2_audio?

1 Like

Yes we do plan to migrate them with a bit of restructuring and added features.

Hi folks, I've got a question re transforms. Can we use the standard vision batch_tfms augmentations on the audio spectrograms in a DataBlock like this? It runs, but I'm just trying to understand what's going on behind the scenes when we're using the AudioBlock instead of the ImageBlock:

dblock = DataBlock(blocks=(AudioBlock, MultiCategoryBlock),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   get_x=get_x,
                   get_y=get_y,
                   item_tfms=item_tfms,
                   batch_tfms=[Normalize(),
                               RandomResizedCrop(256, min_scale=0.08, ratio=(1, 1))]
                  )

Thanks

We haven't actually experimented that much with the standard image transforms. My intuition is that almost all of them would be a bad idea. Spectrograms and photos are very different. If you move a human face 20 pixels upward, it's still a human face. If you move a lawnmower sound up 20 mel bins, is it still a lawnmower? I honestly don't know: on one hand the pitch will change, but the pattern of sound in the time domain may be distinctive enough to still be identified, and maybe there are some recording conditions that cause a lawnmower to have a higher or lower pitch, in which case this would be a successful augmentation.

Other transforms, like skewing, seem like a very bad idea. RandomResizedCrop (I'm assuming this crops a random section and expands it to the size of the original?) also seems like a bad idea for spectrograms. I think both the X and Y axes need to be on a constant scale for your model to compare spectrograms. That being said, try it out and see what results you get, because with deep learning my intuition is almost always wrong, and some very silly stuff often turns out to be effective, including for audio. If you do try it out, please report back here whether it was better or worse. Thanks.

2 Likes

I totally agree that the flipping/shifting/warping distorts the spectrogram such that it loses meaning perceptually. However, I've been reading the following write-up from a Freesound 2019 entry and there are some really interesting ideas in there:

The one I was thinking about is training on sub-sections of the overall clip. I was thinking that maybe, using the standard transforms, I could crop out smaller square sections of longer, ~10s spectrograms (which would be either repeated clips using 'Repeat' mode on CropSignal, or complete sections of longer clips) and use those to train. I think this can be done with one of the following:

RandomResizedCropGPU(256, min_scale=1, ratio=(1, 1), mode='bilinear', valid_scale=1.0)

or RandomCrop + Resize

Although I'm not totally sure. Say we set n_mels to 128 and then Resize to 256px: would it be as if we are randomly cropping out 128px square sections, resizing them to 256x256px, and thereby possibly gaining the maximum amount of feature learning across the different batches? I've been trying it by training on the Freesound 2019 curated train set and it seems to work well enough, but this doesn't seem to translate to the test set, nor do Brightness or Contrast augmentations help. Perhaps it's not the best dataset to test this on, so I will try ESC-50 or something more standard later and update.
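
To make it concrete, this is roughly the pipeline I have in mind (reusing the get_x/get_y from my earlier post; the config arguments here are just guesses on my part):

# ~10s clips -> 128-mel spectrograms -> random square crops resized to 256x256 on the GPU
cfg = AudioConfig.BasicMelSpectrogram(n_mels=128)
item_tfms = [CropSignal(10000), AudioToSpec.from_cfg(cfg)]
batch_tfms = [RandomResizedCropGPU(256, min_scale=1, ratio=(1, 1))]

dblock = DataBlock(blocks=(AudioBlock, MultiCategoryBlock),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   get_x=get_x, get_y=get_y,
                   item_tfms=item_tfms,
                   batch_tfms=batch_tfms)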

2 Likes

Cropping shorter subsections definitely works and is a valid augmentation; I almost always use it. I'm not sure whether Repeat is better than the default pad mode we use (random zero-padding before and after).

We have tried cropping to 128x128 and resampling to 256x256 and it does show improved results. It still blows me away that that appears to work better than just taking 256 mel bins (higher frequency resolution). I have no idea why bilinear interpolation would work better than actually adding more information, but it appears that it does.

Definitely keep experimenting and report back what you find. It is very early days for this type of stuff.

1 Like

That's interesting re the 128 -> 256x256 resample. Is it that using the 128 mel bins -> 256 via bilinear interpolation is like a form of compression that clearly delineates the harmonic/temporal relationships, which essentially gives the network clearer edges to learn?

I have another query re v2: is there a quick method to see the same batch shown by dls.show_batch() after the batch transforms are applied? I'm a bit lost in the docs on this as I'm fairly new to it.
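
The closest I've got is grabbing a raw batch and plotting it by hand, but I'm hoping there's a built-in way:

import matplotlib.pyplot as plt

xb, yb = dls.one_batch()                         # a batch after item_tfms and batch_tfms
plt.imshow(xb[0][0].cpu(), origin='lower', aspect='auto')
plt.show()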

Did you end up recording this meeting? Thanks.

1 Like

BREAKING CHANGE ANNOUNCEMENT

I just merged a change into the fastai2_audio repository that alters the structure of the modules. Now, both core and augment submodules are split into multiple files, meaning that you can choose to use just a part of them (like import just the signal processing stuff and ignore spectrograms if you want).

That also means some imports need to be changed:

  • from fastai2_audio.core import * is now from fastai2_audio.core.all import *
  • from fastai2_audio.augment import * is now from fastai2_audio.augment.all import *

There is also a from fastai2_audio.all import * if you want to quickly import everything.
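
For example, to pull in only the signal processing side (I'm not certain of the exact file names after the split, so treat this as an assumption and double-check against the repo):

from fastai2_audio.core.signal import *   # just the signal/AudioTensor side (module name assumed)
# instead of fastai2_audio.core.all, which also pulls in the spectrogram code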

Hi folks, is there any way to export a trained audio learner currently? I'm getting the following error:

AttributeError: Can't pickle local object 'RemoveSilence.<locals>._inner'

This is a known problem (related issue) and I've started working on the fix.

Note that you'll need PyTorch v1.5.1, as pickling was also broken in v1.5.0.
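
For anyone wondering why export breaks: RemoveSilence builds and returns a function defined inside another function, and pickle refuses to serialize such local objects. A toy illustration of the underlying Python limitation (the names below are made up, not the actual library code):

import pickle

def make_transform():
    def _inner(x):        # a function defined inside another function is a "local object"
        return x
    return _inner

pickle.dumps(make_transform())
# AttributeError: Can't pickle local object 'make_transform.<locals>._inner'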

2 Likes

Didn't know about this problem with torch 1.5.0, but it's unrelated to the export problem here because I can reproduce the error with torch 1.4.

I fixed some of the problems with the transforms that were causing trouble, but now I've hit a wall that I couldn't get past after some days of debugging.

Saving only the model, the dataset, or the transforms works, but when torch tries to save the TfmdDL it breaks.

I made a notebook reproducing the problem here.

I'm currently trying to do some audio work and I'm getting the following error:

RuntimeError: stack expects each tensor to be equal size, but got [1, 128, 121] at entry 0 and [1, 128, 111] at entry 2

My batch_tfms are [RemoveSilence, Resample] (the first thing I tried to fix this issue) and my item_tfms are

cfg_voice = AudioConfig.BasicMelSpectrogram()
a2s = AudioToSpec.from_cfg(cfg_voice)
crop_1000ms = CropSignal(500)
tfms = [crop_1000ms, a2s]

Is there something more I need? :slight_smile: Thanks!

This error is happening because you are trying to batch spectrograms of different lengths. You are already cropping/padding the signals to a fixed size before transforming them, so the problem is probably that they have different sampling rates. Try adding the Resample transform to your item_tfms, and it needs to be the first thing to happen. So,

tfms = [Resample(8000), crop_1000ms, a2s]

Where 8000 is the new sampling rate. This choice depends on what is present in the audio files.
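
Putting it together with the code from your question, the item_tfms block would look something like this (8000 here is just an example rate):

cfg_voice = AudioConfig.BasicMelSpectrogram()
a2s = AudioToSpec.from_cfg(cfg_voice)
crop_1000ms = CropSignal(500)                 # crop/pad every clip to the same length (name kept from your post)
tfms = [Resample(8000), crop_1000ms, a2s]     # Resample first so every clip shares one rate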

1 Like

For pure voice audio, 8 kHz (8000) should be enough, but if you have other sources of sound besides voice you may want to use 16 kHz (16000) or even 44.1 kHz (44100). These rates are directly related to the highest frequency present in your audio and the Nyquist theorem.
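
As a quick sanity check on those numbers (the Nyquist theorem says a sampled signal can only represent frequencies up to half the sampling rate):

for sr in (8000, 16000, 44100):
    print(f"{sr} Hz sampling rate -> frequencies up to {sr // 2} Hz")
# 8000  -> 4000 Hz   (enough for speech)
# 16000 -> 8000 Hz
# 44100 -> 22050 Hz  (covers the full audible range)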

1 Like

That fixed the error right away! Thank you so much @scart97 :pray:

2 Likes

@rbracco What if we change the AudioBlock to always include Resample and optionally DownmixMono and CropSignal? That would help fix the majority of problems users have when loading data.
The new signature would be:

def AudioBlock(sample_rate=16000, force_mono=True, crop_signal_to=None, cls=AudioTensor)
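
Roughly, the factory could compose the transforms like this (just a sketch to illustrate the idea, not an actual implementation; the exact transform constructors may differ):

def AudioBlock(sample_rate=16000, force_mono=True, crop_signal_to=None, cls=AudioTensor):
    item_tfms = [Resample(sample_rate)]            # always resample to a common rate
    if force_mono:
        item_tfms.append(DownmixMono())
    if crop_signal_to is not None:
        item_tfms.append(CropSignal(crop_signal_to))
    return TransformBlock(type_tfms=cls.create, item_tfms=item_tfms)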
2 Likes

Hey, as mentioned in the PR, I think it's too much of a time bottleneck. I think possible alternatives are:

  • Getting resampling working on the GPU might be fast enough
  • Caching results using some new caching system
  • Giving the user a function that preprocesses (resample, remove silence, etc.) and outputs to a new folder, which then becomes the starting point for the ML pipeline (similar to caching; there may be a good way to do this within fastai2 as well). A rough sketch of this idea is below.
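
A rough sketch of that last option using plain torchaudio (the paths and the 16 kHz target are just examples):

from pathlib import Path
import torchaudio

def preprocess_folder(src_dir, dst_dir, target_sr=16000):
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for f in Path(src_dir).glob('*.wav'):
        sig, sr = torchaudio.load(f)
        if sr != target_sr:
            sig = torchaudio.transforms.Resample(sr, target_sr)(sig)
        torchaudio.save(str(dst_dir / f.name), sig, target_sr)

# preprocess_folder('raw_audio', 'audio_16k')  # then point the DataBlock at 'audio_16k'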