>> 1st Fastai2 Audio Meetup <<
Sorry for the late notice, but in case you're interested, come join the meeting of the fastai2_audio developers.
The meeting agenda and the link to Google Meet are in the calendar event:
https://calendar.google.com/event?action=TEMPLATE&tmeid=MzU1NG52bmFlc3JhaGVpbjZqaGZzOHVwa2ogM2djNDI3dW45cWRsMTgwOThhbmU4OHRoMzRAZw&tmsrc=3gc427un9qdl18098ane88th34%40group.calendar.google.com
I would love to contribute to this amazing repo.
I see fastai_audio has a number of great tutorial notebooks. Do you plan to migrate them to fastai2_audio?
Yes, we do plan to migrate them, with a bit of restructuring and added features.
Hi folks, I've got a question re transforms. Can we use the standard vision batch_tfms augmentations on audio spectrograms in a DataBlock like this? It runs, but I'm just trying to understand what's going on behind the scenes when we're using the AudioBlock instead of ImageBlock:
dblock = DataBlock(blocks=(AudioBlock, MultiCategoryBlock),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   get_x=get_x,
                   get_y=get_y,
                   item_tfms=item_tfms,
                   batch_tfms=[Normalize(),
                               RandomResizedCrop(256, min_scale=0.08, ratio=(1, 1))])
Thanks
We haven't actually experimented that much with the standard image transforms. My intuition is that almost all of them would be a bad idea. Spectrograms and photos are very different. If you move a human face 20 pixels upward, it's still a human face. If you move a lawnmower sound up 20 mel bins, is it still a lawnmower? Honestly, I don't know: on one hand the pitch will change, but the pattern of sound in the time domain may be distinctive enough to still be identified, and maybe there are some recording conditions that cause a lawnmower to have a higher/lower pitch, so this would be a successful augmentation.
Other transforms, like skewing, seem like a very bad idea. RandomResizedCrop (I'm assuming this crops a random section and expands it to the size of the original?) also seems like a bad idea for spectrograms. I think both the X and Y axes need to be on a constant scale for spectrograms to be compared by your model. That being said, try it out and see what results you get, because with deep learning my intuition is almost always wrong, and some very silly stuff often turns out to be effective, including for audio. If you do try it out, please report back here whether it was better or worse. Thanks.
I totally agree in terms of the flipping/shifting/warping distorting the spectrogram such that it loses meaning perceptually. However, I've been reading the following write-up from a Freesound 2019 entry and there are some really interesting ideas in there:
The one that I was thinking about is training on sub-sections of the overall clip. I was thinking that maybe, using the standard transforms, I could crop out smaller square sections of longer (maybe ~10s) spectrograms (which would be either repeated clips using "Repeat" mode on CropSignal, or complete sections of longer clips) and use those to train. I think this can be done with one of the following:
RandomResizedCropGPU(256, min_scale=1, ratio=(1, 1), mode='bilinear', valid_scale=1.0)
or RandomCrop + Resize
Although I'm not totally sure. Say we set n_mels to 128 and then Resize to 256px: would it be as if we are randomly cropping out 128px square sections and resizing them to 256x256px, thereby possibly gaining the maximum amount of feature learning over the different batches? I've been trying it by training on the Freesound 2019 curated train set and it seems to work well enough, but this doesn't seem to translate to the test set, nor with Brightness or Contrast augmentations. Perhaps it's not the best dataset to test this on, so I will try ESC50 or something more standard later and update.
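For concreteness, here is roughly what I mean (a rough sketch, assuming fastai2_audio's AudioBlock/AudioToSpec and fastai2's RandomResizedCropGPU; get_x/get_y, the ~10s clip length and n_mels=128 are just placeholders for illustration):

cfg = AudioConfig.BasicMelSpectrogram(n_mels=128)
a2s = AudioToSpec.from_cfg(cfg)

dblock = DataBlock(blocks=(AudioBlock, MultiCategoryBlock),
                   get_x=get_x,
                   get_y=get_y,
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   item_tfms=[CropSignal(10000), a2s],  # ~10s clips; the "Repeat" pad mode mentioned above would go here
                   batch_tfms=[RandomResizedCropGPU(256, min_scale=1, ratio=(1, 1))])  # random square crops resized to 256x256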
Cropping shorter subsections definitely works and is a valid augmentation; I almost always use this. I'm not sure if repeat is better than the default pad mode we use (random zero-padding before and after).
We have tried the crop to 128x128 and resample to 256x256 and it does show improved results. It still blows me away that this appears to work better than just using 256 mel bins (higher frequency resolution). I have no idea why bilinear interpolation would work better than actually adding more information, but it appears that it does.
Definitely keep experimenting and report back what you find. It is very early days for this type of stuff.
That's interesting re the 128 -> 256x256 resample. Is it that upsampling the 128 mel bins to 256 via bilinear interpolation acts like a form of compression that clearly delineates the harmonic/temporal relationships, essentially giving the network clearer edges to learn?
I have another query re v2: is there a quick method to see the same batch shown by dls.show_batch() after the batch transforms are applied? I'm a bit lost in the docs with this as I'm fairly new to it.
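For reference, the closest I've got so far is pulling a batch manually (just a sketch, I'm not sure this is the intended way; it assumes one-channel spectrogram inputs), since the batch transforms have already run by the time one_batch() returns:

import matplotlib.pyplot as plt

xb, yb = dls.one_batch()              # batch_tfms are already applied here
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for img, ax in zip(xb[:4].cpu(), axes):
    ax.imshow(img.squeeze(0))         # drop the channel dim before plotting
    ax.axis('off')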
Did you end up recording this meeting? Thanks.
BREAKING CHANGE ANNOUNCEMENT
I just merged a change into the fastai2_audio repository that alters the structure of the modules. Now both the core and augment submodules are split into multiple files, meaning that you can choose to use just a part of them (e.g. import just the signal processing stuff and ignore spectrograms if you want).
That also means some imports need to be changed:
- from fastai2_audio.core import * is now from fastai2_audio.core.all import *
- from fastai2_audio.augment import * is now from fastai2_audio.augment.all import *
There is also a from fastai2_audio.all import * if you want to quickly import everything.
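Spelled out as code, the change looks like this:

# old:
# from fastai2_audio.core import *
# from fastai2_audio.augment import *

# new:
from fastai2_audio.core.all import *
from fastai2_audio.augment.all import *

# or grab everything in one line:
from fastai2_audio.all import *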
Hi folks, is there any way to export a trained audio learner currently? I'm getting the following error:
AttributeError: Can't pickle local object 'RemoveSilence.<locals>._inner'
This is a known problem (related issue) and I've started working on a fix.
Note that you'll need PyTorch v1.5.1, as pickling was also broken in v1.5.0.
I didn't know about this problem with torch 1.5.0, but it's unrelated to the export problem here because I can reproduce the error with torch 1.4.
I fixed some of the problems with the transforms that were causing trouble, but now I've hit a wall that I couldn't get past after several days of debugging.
Saving only the model, the dataset, or the transforms works, but when torch tries to save the TfmdDL it breaks.
I made a notebook demonstrating the problem here.
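For anyone following along, the root cause is the standard Python limitation that pickle can't serialize functions defined inside other functions, which is exactly the <locals>._inner pattern in the error above. A minimal illustration (just a sketch, not the actual fastai code):

import pickle

def RemoveSilence_like():
    def _inner(x):           # local function, invisible at module level
        return x
    return _inner

pickle.dumps(RemoveSilence_like())
# -> AttributeError: Can't pickle local object 'RemoveSilence_like.<locals>._inner'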
I'm currently trying to do some audio work and I'm getting the following error:
RuntimeError: stack expects each tensor to be equal size, but got [1, 128, 121] at entry 0 and [1, 128, 111] at entry 2
My batch_tfms are [RemoveSilence, Resample] (the first thing I tried to fix this issue) and my item_tfms are:
cfg_voice = AudioConfig.BasicMelSpectrogram()
a2s = AudioToSpec.from_cfg(cfg_voice)
crop_1000ms = CropSignal(500)
tfms = [crop_1000ms, a2s]
Is there something more I need? Thanks!
This error is happening because you are trying to batch spectrograms of different lengths. You are already cropping/padding the signals to a fixed duration before transforming them, so the files probably have different sampling rates, which causes this problem. Try adding only the Resample transform to your item_tfms, and it needs to be the first thing to happen. So:
tfms = [Resample(8000), crop_1000ms, a2s]
Where 8000 is the new sampling rate. This choice depends on what is present in the audio files.
For pure voice audio, 8 kHz (8000) should be enough, but if you have other sources of sound besides voice you may want to use 16 kHz (16000) or even 44.1 kHz (44100). Those rates are directly related to the highest frequency present in your audio and the Nyquist theorem.
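Putting it together, a minimal sketch of the fixed pipeline (assuming the usual DataBlock setup; get_audio_files and CategoryBlock are placeholders for your own getters and labels):

cfg_voice = AudioConfig.BasicMelSpectrogram()
a2s = AudioToSpec.from_cfg(cfg_voice)
item_tfms = [Resample(8000),   # resample first so every clip shares one rate
             CropSignal(500),  # then crop/pad to a fixed length
             a2s]              # finally convert to a spectrogram

dblock = DataBlock(blocks=(AudioBlock, CategoryBlock),
                   get_items=get_audio_files,
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   item_tfms=item_tfms)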
That fixed the error right away! Thank you so much @scart97
@rbracco What if we change the AudioBlock to always include Resample and optionally DownmixMono and CropSignal? That would help fix the majority of problems users have when loading data.
The new signature would be:
def AudioBlock(sample_rate=16000, force_mono=True, crop_signal_to=None, cls=AudioTensor)
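For discussion, a hedged sketch of how that signature could be wired up (assuming the existing Resample, DownmixMono and CropSignal transforms and fastai2's TransformBlock; this is not an implementation from the repo):

def AudioBlock(sample_rate=16000, force_mono=True, crop_signal_to=None, cls=AudioTensor):
    item_tfms = [Resample(sample_rate)]          # always resample to a common rate
    if force_mono:
        item_tfms.append(DownmixMono())          # optionally collapse to one channel
    if crop_signal_to is not None:
        item_tfms.append(CropSignal(crop_signal_to))  # optionally crop/pad to a fixed length
    return TransformBlock(type_tfms=cls.create, item_tfms=item_tfms)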
Hey, as mentioned in the PR, I think it's too much of a time bottleneck. Possible alternatives are:
- Getting resampling working on the GPU might be fast enough
- Caching results using some new caching system
- Giving the user a function that preprocesses (resample, remove silence, etc.) and outputs to a new folder, which then becomes the starting point for the ML pipeline (similar to caching; there may be a good way to do this within fastai2 as well). A rough sketch of that idea is below.
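Something along these lines is what I had in mind for the last option (a rough sketch using plain torchaudio; the wav-only layout, target rate and mono downmix are assumptions, not existing library behavior):

from pathlib import Path
import torchaudio

def preprocess_audio(src_dir, dst_dir, sample_rate=16000):
    "One-off preprocessing: resample and downmix every wav into a mirrored folder."
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for f in src_dir.rglob("*.wav"):
        sig, sr = torchaudio.load(f)
        if sig.shape[0] > 1:                                   # downmix to mono
            sig = sig.mean(dim=0, keepdim=True)
        if sr != sample_rate:                                  # resample to the target rate
            sig = torchaudio.transforms.Resample(sr, sample_rate)(sig)
        out = dst_dir / f.relative_to(src_dir)
        out.parent.mkdir(parents=True, exist_ok=True)
        torchaudio.save(str(out), sig, sample_rate)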