Fastai v2 audio

Hello everyone,

I am working on my master's, which revolves around speech enhancement (noisy audio in -> model -> clean voice out). I've been using fast.ai for a few years now for vision, but mainly modeling audio with torchaudio.

I would be really happy to contribute to the library so that everyone, myself included, could research audio problems at a faster pace.

To my knowledge (if I'm wrong, please share) there aren't as many pre-trained audio models as in vision. It would be cool to train something ResNet-ish (well, maybe VGG-ish for starters) that could be used for multiple problems and release it to everyone.

Maybe we could steal some ideas from the "train ImageNet in 18 minutes" folks and quickly train audio nets?

P.S. I've seen a lot of people posting datasets in this thread. Here are my two cents on the subject:

6 Likes

Hi all,

Found this library while trying to generate some realistic sounds.
Posting about it here since it could be helpful for many audio tasks, especially augmentations.

Very nice API and documentation as well!

2 Likes

I was trying Zach Mueller's 07_Audio notebook, then encountered this:
PS: Successfully installed colorednoise-1.1.1 fastai2-audio-0.0.1

=======================================

NameError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 at = AudioTensor.create(fnames[0])

NameError: name 'AudioTensor' is not defined

Thank you and welcome. We have a few Telegram groups, one for audio ML and the other for library development; please PM me if you would like to join either and I'll get you added. Library development has been slow these days as we are all working on other projects, and things like torchaudio are rapidly advancing. For instance, a lot of the low-level transforms we spent time implementing are now available via torchaudio directly, so we can probably ditch much of our low-level stuff, just wrap theirs, and focus on providing a good high-level API and good defaults for people who are new to audio.

I agree about having a pre-trained audio model, and many of us have talked about working on something like that. We have wondered whether there could be one global pre-trained audio model, or if it would need to be split into several subdomains like voice. Also, somewhat surprisingly, pretrained ImageNet weights do fairly well on audio classification problems.

And thank you for the dataset, I think I will add spoken digit as the base tutorial, and we can go from there.

Thanks for reporting, I’m upgrading datasets and the library today and hopefully everything will be running smoothly with a working tutorial by today or tomorrow.

2 Likes

I was experimenting with fastai audio and found a small bug. If you are using AudioTensor without a spectrogram transform, there is an error while training because tensor.new isn't passed sr in the training loop. The issue doesn't show up in Dataset.summary, DataLoaders.show_batch, or DataLoaders.one_batch.

Defaulting sr=None resolves this issue.

class AudioTensor(TensorBase):

    def __new__(cls, x, sr=None, **kwargs):
        return super().__new__(cls, x, sr=sr, **kwargs)
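
For context, a minimal illustration of what the default avoids (constructing an AudioTensor without an explicit sr, which is effectively what the training loop does when it re-creates tensors):

import torch

# with sr defaulting to None, re-wrapping a tensor without a sample rate no longer fails
at = AudioTensor(torch.randn(1, 16000))            # sr falls back to None
at_with_sr = AudioTensor(torch.randn(1, 16000), sr=16000)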

I’m aware of one model, trained on AudioSet, that is attempting to be an ImageNet for Audio: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. Here is a link to the paper and code on GitHub.

1 Like

In that same vein, here are two more potential models or starting points also trained on AudioSet.

VGGish for audio embeddings:


YAMNet for sound classification.

VGGish also has a nice colab example: https://colab.research.google.com/drive/1TbX92UL9sYWbdwdGE0rJ9owmezB-Rl1C

1 Like

There is now a channel on the fastai discord server to discuss fastai_audio :slight_smile:

2 Likes

Thank you for reporting and sharing the paper. The issue was fixed by @scart97 in the new repo: https://github.com/fastaudio/fastaudio

Can we do audio fingerprinting using fastai audio?

Hey Robert, can you expand on the potential data leakage concerns with the 250-speaker dataset?

I don't fully remember what the issues were, just that when I did some EDA the dataset didn't seem to be well composed for speaker recognition. I think the data leakage might have been only for 10 speakers? The 250-speaker set had some weird issues like multiple languages. I also didn't like that, since it was a custom dataset, there was no way to benchmark against it, so early on it was hard for us to tell whether our models were actually competitive.

Hi all, I’m trying to figure out how to achieve the following. If anyone has any pointers I’d be grateful:

[image from the paper referenced below]

This is from the paper @florianl posted a little while back on Discord: https://arxiv.org/pdf/1904.08990.pdf. I've tried implementing some of the networks in Lightning, with some interesting results, but I couldn't figure out how the aggregation of the predictions on the audio frames could be done within fastai. In PyTorch I first used unfold to split the signal into frames with a sliding window inside my dataset, then a custom collate_fn to group them for batching. After running predictions on all sub-frames, I grouped the predictions again (per input item) and took the mean of the softmax values over them before passing the result to my cross-entropy loss function (which I used instead of the log MSE in the paper). So the bits I'm trying to figure out in fastai are:

  • Should I just use the custom dataset / get_items and collate_fn I’ve written in PyTorch?
  • The callback I'd need to re-group the predictions from the sub-frames and average them per example (see the sketch below). This would also need to happen for both the training and validation stages.
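
For reference, here is a rough sketch of the PyTorch approach described above (frame and hop sizes are just illustrative):

import torch
import torch.nn.functional as F

def split_frames(sig, frame_len=16000, hop=8000):
    # sig: (n_samples,) -> (n_frames, frame_len), sliding window via unfold
    return sig.unfold(0, frame_len, hop)

def aggregate_predictions(frame_logits, frames_per_clip):
    # frame_logits: (n_clips * frames_per_clip, n_classes)
    # average the softmax values of the sub-frames belonging to each clip
    probs = F.softmax(frame_logits, dim=-1)
    probs = probs.view(-1, frames_per_clip, probs.size(-1))
    return probs.mean(dim=1)  # (n_clips, n_classes), fed to the loss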

Hi all! I'm wondering if someone can help me get started using fastai for my area of interest, sound and music. I'm having trouble understanding how to create a DataBlock (or maybe just Datasets and DataLoaders) from the Maestro dataset. It's a large number of classical music pieces in both audio and MIDI files that are tightly synchronized. Perfect for trying to do tasks in automated music analysis! Below I have some code to show you how I have been processing the data.

import librosa import prettymidi as pm # load the audio data sig, rate = libroasa.load(audiofile) # create the VQT — representing the distribution of sound energy accross frequency vqts = librosa.core.power_to_db( np.abs(librosa.vqt(sig, sr=rate, hop_length = rate * 0.01 fmin=27, n_bins=84, gamma=1.5, bins_per_octave=12).T)) #normalize the VQTs X = spec_mag_db - np.mean(spec_mag_db) X /= np.std(X) #load the midi data midi_data = pm.PrettyMIDI(midifile) #turn midi data into one-hot chromagrams y = midi_data.get_chroma(fs=20).T y = y.astype(np.bool).astype(np.uint8)

I have a feeling that doing this with a fastai datablock is pretty straightforward, but I’m having trouble figuring out exactly how.

DataBlock(blocks=None, dl_type=None, getters=None, n_inp=None, item_tfms=None, batch_tfms=None, get_items=None, splitter=None, get_y=None, get_x=None)

Questions: what kind of blocks? get_x and get_y would download the audio and MIDI files, correct? I'm not sure exactly how to write them so that they return the right thing. Transforms would do the VQT and normalizing for the audio, and turn the MIDI file into one-hot encoded chromagrams. The dataset is split into train and valid sets via the provided csv file, though for a starter subset I would probably split it differently: probably the first 0.8 of each piece as train and the last 0.2 as valid.

Hi @drlauren, sorry for the delay in responding to this. I dumped a ton of info in here, so please reply with details if you get stuck somewhere and I will help in a much more timely manner. If you're able to share a notebook or repo with your code, that would be even better. We also have some tutorials with basic DataBlock examples.

I reformatted your code below, you can share code easily here by surrounding it with three backticks ``` (on US keyboards this is above the tab key and left of the 1 key)

import numpy as np
import librosa
import pretty_midi as pm

# load the audio data
sig, rate = librosa.load(audiofile)

# create the VQT: the distribution of sound energy across frequency
vqts = librosa.core.power_to_db(
    np.abs(librosa.vqt(sig, sr=rate, hop_length=int(rate * 0.01),
                       fmin=27, n_bins=84, gamma=1.5, bins_per_octave=12).T))

# normalize the VQTs
X = vqts - np.mean(vqts)
X /= np.std(X)

# load the midi data
midi_data = pm.PrettyMIDI(midifile)

# turn midi data into one-hot chromagrams
y = midi_data.get_chroma(fs=20).T
y = y.astype(bool).astype(np.uint8)

Assuming your output is some type of classification of the audio (e.g. a label representing the genre), you would use a CategoryBlock. Since your input is audio, the blocks would be passed in as follows:

blocks = (AudioBlock, CategoryBlock)

No. If you are using a csv file, you would download the audio in advance and then do something like:

# note you can only have one sample rate for all your audio, so if you have varying
# sample rates you will need to resample all audios to one sample rate. Replace
# all references to "rate" below with your actual sample rate, e.g. 16000
def vqt_func(sig):
    return librosa.core.power_to_db(
        np.abs(librosa.vqt(sig, sr=rate, hop_length=int(rate * 0.01),
                           fmin=27, n_bins=84, gamma=1.5, bins_per_octave=12).T))

# ResizeSignal crops/pads all audio signals to the same length in milliseconds; it is
# necessary to have inputs of equal size in order to use the gpu. 5000 in the example
# = 5000ms = 5s but can be changed to whatever.
# Note: pass the function itself (vqt_func), not the result of calling it.
item_tfms = [ResizeSignal(5000), vqt_func]

blocks = DataBlock(blocks=(AudioBlock, CategoryBlock),
                  # this reads the column of the csv that has the name of the audio file; if it is just
                  # the filename itself, you need to add the argument `pref=str(audio_path.resolve())` where audio_path
                  # is a pathlib object representing where your audio is stored. This is really confusing
                  # as I write it so if you have any questions please ask.
                  get_x = ColReader('<name of column with your audio filenames>'),
                  get_y = ColReader('<name of column with your labels>'),
                  item_tfms = item_tfms,
                  splitter = RandomSplitter(valid_pct=0.2, seed=42)
                  )
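
Then, assuming your csv is loaded into a DataFrame (the filename below is just a placeholder), you would build and inspect the dataloaders from it:

import pandas as pd

df = pd.read_csv('maestro_labels.csv')   # hypothetical csv with filename and label columns
dls = blocks.dataloaders(df)
dls.show_batch()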
2 Likes

Hi.
I'm taking part in a Kaggle audio classification competition. I have 3.6GB of audio in about 2000 files. I built my notebook based on the tutorial. Everything works, but very slowly: one cycle of fine-tuning ResNet-34 took 8 minutes, and the GPU does almost nothing. It seems all transformations are performed on the CPU.
list of transformations:

item_tfms = [Resample(28000), AudioToSpec.from_cfg(cfg),
             CropTime(446, pad_mode=AudioPadType.Repeat), MaskFreq(), MaskTime()]
batch_tfms = [SGRoll()]

List of libraries: fastai-2.1.5 fastaudio-0.1.3 fastcore-1.3.4 librosa-0.8.0 soundfile-0.10.3.post1 torchaudio-0.7.2
How can I use the GPU for preprocessing?

Hello,

Thanks a lot for providing this audio domain abstraction!
I am new to deep learning and the audio domain.
Assuming a .wav file randomly collected from an online source: what are the steps (if any) necessary to convert it into a format usable with https://github.com/fastaudio/fastaudio/blob/master/docs/ESC50:%20Environmental%20Sound%20Classification.ipynb

Are the steps here complete?

Thanks,

After some more reading I've ended up with this:

audio = AudioTensor.create(filename)

downmixer = DownmixMono()
inp, audio = apply_transform(downmixer, audio)


resampled = Resample(44100)
inp, audio = apply_transform(resampled, audio)

inp, audio = apply_transform(ResizeSignal(5000), audio)

pred,pred_idx,probs = learn_inf.predict(audio)

OR

execute all at once using Pipeline()
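
i.e. something like this (untested sketch, same transforms and imports as above):

from fastcore.transform import Pipeline

# chain the preprocessing steps and apply them in one call
tfms = Pipeline([DownmixMono(), Resample(44100), ResizeSignal(5000)])
audio = tfms(AudioTensor.create(filename))
pred, pred_idx, probs = learn_inf.predict(audio)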

GPU processing is coming for a lot of the transforms in fastaudio soon.

For faster and more detailed responses, please create an issue on GitHub.
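
In the meantime, one rough workaround (a sketch using plain torchaudio rather than fastaudio's transforms; the sample rate matches the Resample(28000) above) is to compute the spectrogram as a batch transform on tensors that are already on the GPU:

import torch
import torchaudio

# MelSpectrogram is a regular nn.Module, so it can be moved to and run on the GPU
spec = torchaudio.transforms.MelSpectrogram(sample_rate=28000, n_mels=128).cuda()

def gpu_spectrogram(waveforms):
    # waveforms: a batch of resampled signals, shape (bs, channels, samples)
    return spec(waveforms.cuda())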

Has anyone managed to export a fastai model and use it with JUCE? (I don't mean "Do you think it's possible and how might you do it?"; I mean "Who has actually done it?")

This has been done with TensorFlow & PyTorch; just wondering if there are any examples from the Fast.ai community.

Ok…a different question:

I notice there's talk of leveraging the GPU for transforms (e.g. @baz's post), but doesn't torchaudio already have a bunch of those, already GPU-enabled? I'm looking at this pull request and comparing it to the torchaudio transforms page, and I'm seeing overlaps, like:

  • ChangeVolume <–> Vol
  • MaskFreq <–> FrequencyMasking
  • MaskTime <–> TimeMasking

…is this because fastaudio & torchaudio were being developed somewhat concurrently & independently?


PS- The “Google Colab Notebook” notebook on the website is failing, even after “Restart Runtime” on Colab:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-9857e6a57a16> in <module>
      1 from fastai.vision.all import *
----> 2 from fastaudio.core.all import *
      3 from fastaudio.augment.all import *
      4 from fastaudio.ci import skip_if_ci

7 frames
/usr/lib/python3.7/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    362 
    363         if handle is None:
--> 364             self._handle = _dlopen(self._name, mode)
    365         else:
    366             self._handle = handle

OSError: /usr/local/lib/python3.7/dist-packages/torchaudio/_torchaudio.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

:man_shrugging:

2 Likes