Deep Learning with Audio Thread

I can contribute some material for scene recognition: underwater snapping shrimp. It's a very common underwater sound in tropical waters.

Has anybody played around with VGGish? It's an audio embedding model trained on Google's AudioSet (a huge compilation of YouTube audio). The pretrained model reduces each second of audio to a vector of 128 values that can be used as features for training.
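Once you have the embeddings, using them as features can be as simple as the sketch below. It assumes the 128-d VGGish vectors have already been extracted and saved to disk; the folder layout and file format are made up for illustration.

import numpy as np
from pathlib import Path
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: embeddings/<label>/<clip>.npy, each array (n_seconds, 128)
emb_files = sorted(Path('embeddings').rglob('*.npy'))
X = np.stack([np.load(f).mean(axis=0) for f in emb_files])  # average-pool over time -> (128,)
y = np.array([f.parent.name for f in emb_files])            # folder name = class label

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy; use a proper held-out split in practice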

I tried it on ESC-50 and got 62.25% accuracy (with no data augmentation); ResNets get 67%, but we've gotten as high as 88.75% with DenseNets + mixup. I want to experiment more with data augmentation, and also see if I can get mixup working on the embeddings.
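Something like this is what I have in mind for mixup on the embeddings; it's just a sketch of generic mixup applied to feature vectors, not fastai's implementation.

import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.4):
    # x: (batch, 128) pooled embeddings, y: (batch, n_classes) one-hot labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix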

Here's a notebook if anyone is interested in trying out the VGGish embeddings. Ignore the first few parts; they pull in the data V2-style. Also ignore this branch of audio V2; it's just a bunch of messy experiments with various audio stuff (ROCKET, VGGish, raw audio training).


Has the PyTorch audio library been updated recently? And is the fastai audio library not following the updated torchaudio?

The imports from torchaudio.transforms are not working, e.g. 'SpectrogramToDB', which I see is not present in torchaudio.transforms.

Should I be cloning some other fastai_audio library instead?


Since torchaudio is moving fast and breaking things, and we are not doing much maintenance on V1 due to our focus on V2, we chose to pin torchaudio to a previous version. SpectrogramToDB is just another name for AmplitudeToDB (the transform was renamed between torchaudio versions), and it's applied automatically when you set to_db in the config. I think everything you're trying to do should be achievable with the old version of torchaudio. Let us know if there's something you're having trouble doing and we can find a way to do it.
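For reference, with a recent torchaudio the same transform looks roughly like this ('clip.wav' is just a placeholder file, and the parameter values are illustrative):

import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load('clip.wav')
mel = T.MelSpectrogram(sample_rate=sr, n_mels=128)(waveform)
mel_db = T.AmplitudeToDB(top_db=80)(mel)  # roughly what setting to_db in the config applies for you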

V2 is unfinished at this point; we are still building the high-level API and some nice usability features. This shouldn't take very long, but I wouldn't recommend using it for anything major until it's released, as the code will continue to change fairly rapidly until then.


Do we have a separate discussion thread for V2 audio?


Jumping in late! Where is a good starting point with fastai (some simple tutorial, maybe)?

There is a repo that contains code to work with audio and fastai here


In the last year I have been experimenting with sliding windows, various window sizes, resampling, normalizing, and the general preprocessing necessary for "one size fits all" monophonic audio classification, which for me is useful when converting audio snippets to transcripts, or for driving application events. This thread is awesome. I also recommend using noise reduction, since setting thresholds to look for peaks won't work if the audio has noise bursts (e.g. one person on a radio call transmits across an active channel to another person and it "pops"). I converted radio recordings with speech-to-text, and even Google was only coming back with 80% accuracy, so clearly there is much to be done with audio classification, and the integration with phonetic sciences mentioned above is clutch. Staying glued to these developments. Thank you so much.
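For anyone curious, the windowing I mean is roughly the sketch below; window/hop sizes and the file name are just placeholders, not recommendations.

import librosa
import numpy as np

y, sr = librosa.load('recording.wav', sr=16000, mono=True)  # resample + downmix to mono
y = y / (np.abs(y).max() + 1e-9)                            # peak-normalize
# A noise-reduction pass (e.g. the noisereduce package) could go here before
# windowing; its API differs across versions, so it's omitted from this sketch.

win = 2 * sr      # 2-second windows
hop = sr // 2     # 0.5-second hop (overlapping windows)
windows = [y[s:s + win] for s in range(0, max(len(y) - win, 1), hop)]
# Each window can now be turned into a spectrogram and classified independently.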


Hey! Can you invite me to this one? :slight_smile:

Hey, did anyone face a problem importing librosa on GCP?

I don't know why I'm unable to import it. I re-installed it using pip install librosa, but it doesn't help.


ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input> in <module>
      1 from fastai.text import *
----> 2 from audio import *

~/audio/__init__.py in <module>
      1 from .audio import *
----> 2 from .data import *
      3 from .learner import *
      4 from .transform import *

~/audio/data.py in <module>
      1 from .audio import *
----> 2 from .transform import *
      3 from pathlib import Path as PosixPath
      4 from IPython.core.debugger import set_trace
      5 import os

~/audio/transform.py in <module>
      9 import torch
     10 import torch.nn.functional as F
---> 11 import librosa
     12 import torchaudio
     13 from librosa.effects import split

ModuleNotFoundError: No module named 'librosa'
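One thing I'm going to check is whether pip installed librosa into a different Python than the one the notebook kernel is using; installing with the kernel's own interpreter avoids that (just a guess, not a confirmed fix):

import sys, subprocess

print(sys.executable)  # which Python the kernel is actually running
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'librosa'])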


Is this v1 or v2? We are trying to straighten out our environments in v2 right now. If it’s v1, I’ve had no reported problems and I use GCP as well.


v1 - it's working now. I randomly added 'sudo' before chmod. I don't know why that should work, though.
!git clone https://github.com/mogwai/fastai_audio > /dev/null 2>&1
!cd fastai_audio/ && sudo chmod +x install.sh > /dev/null 2>&1

Don't we have any way to save and load an AudioList DataBunch?
I tried load_data, and it throws this warning:

/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py:262: UserWarning: There seems to be something wrong with your dataset, for example, in the first batch can't access any element of self.train_ds. Tried: 504,23312,27618,28118,21732...

and an error when fetching the learner.
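For context, the pattern I'm trying is the standard fastai v1 save/load one, sketched from memory below; whether it round-trips an AudioList cleanly is exactly what I'm unsure about.

# Standard fastai v1 pattern (sketch; the AudioList part is the open question).
# The audio module needs to be imported before load_data so its custom
# classes can be unpickled.
from fastai.basic_data import load_data
from audio import *

# db = ...                                        # an AudioList-based DataBunch
# db.save('audio_db.pkl')                         # written under db.path
# data = load_data(db.path, 'audio_db.pkl', bs=64)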

This looks very interesting for DL with audio data:


@MicPie WOW, AWESOME!

cheers, mrfabulous1 :smiley::smiley::smiley:

I'd love to work differentiable signal processing into fastai audio, but unfortunately it's built on TensorFlow and I doubt we'll see a PyTorch implementation in time to use it. We may implement some related things, such as learned filterbanks (as opposed to the fixed frequency ranges of a linear or mel spectrogram).
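The rough shape I have in mind is something like a learnable filterbank matrix applied to a power spectrogram, initialized from the mel filterbank so training starts from a melspectrogram. This is just a sketch, not anything in the library yet.

import torch
import torch.nn as nn
import librosa

class LearnedFilterbank(nn.Module):
    def __init__(self, sr=16000, n_fft=1024, n_filters=64):
        super().__init__()
        # Initialize from librosa's mel filterbank: shape (n_filters, n_fft//2 + 1)
        init = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_filters)
        self.fb = nn.Parameter(torch.tensor(init, dtype=torch.float32))

    def forward(self, spec):
        # spec: (batch, n_fft//2 + 1, time) power spectrogram
        return torch.matmul(self.fb, spec)  # -> (batch, n_filters, time)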

I've sorted out the Colab notebook for people interested in a quick way to get started with v1 of the library. We will create one of these for fastai2_audio as well:

https://colab.research.google.com/drive/1HUVI1CZ-CThHUBO8l2lp6hySjrbs0SY-


I was able to successfully convert my fast.ai model to Caffe2 for deployment following this tutorial: https://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html

I now realize I also need to convert my preprocessing steps. Currently I'm generating spectrograms on the fly as part of an audio DataBunch.

Has anyone had experience with stripping out the fast.ai dependencies so you can generate the spectrograms for inference in "pure" Python (outside of fast.ai), the same way they are produced in the DataBunch?
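For reference, the plain-librosa version I'm imagining is roughly this; the parameter values are placeholders and would need to match whatever the databunch config actually uses.

import librosa
import numpy as np

def make_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=128):
    # Load and resample, compute a mel power spectrogram, then convert to dB.
    y, _ = librosa.load(path, sr=sr, mono=True)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                          hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(spec, top_db=80).astype(np.float32)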

A deep visual-semantic embedding model identifies visual objects using both labeled image data and semantic information gleaned from unannotated text. Has anybody played around with audio-semantic embeddings?
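I imagine the audio version would look something like DeViSE with an audio encoder, e.g. projecting the 128-d VGGish vectors from earlier into a word-vector space. Everything below is a purely hypothetical sketch.

import torch
import torch.nn as nn

proj = nn.Linear(128, 300)            # audio embedding -> word-vector space (e.g. GloVe/fastText dims)
loss_fn = nn.CosineEmbeddingLoss()

def step(audio_emb, label_wordvecs):
    # audio_emb: (batch, 128), label_wordvecs: (batch, 300) word vectors of the class names
    pred = proj(audio_emb)
    target = torch.ones(audio_emb.size(0))  # +1 = "these pairs should be similar"
    return loss_fn(pred, label_wordvecs, target)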