Deep Learning with Audio Thread

I noticed that the transform to spectrogram wasn’t expanding the channel dimension to 3, as the library does for the MNIST dataset. I’ve added that into my fork of the project. You were replacing the first Conv2d layer, which would lose all pretrained learning? I may be wrong about this.

After making these tweaks I checked how well this approach performed on the Free ST American English Corpus dataset (10 classes of male and female speakers), and I was able to get these results:


98.3% accuracy

Here is the Notebook; it is derived from your AWS_LSTM notebook.

3 Likes

Great job @baz!!

Actually we don’t completely “lose” the first layer: we’ve copied the “red” channel:

...
    # save the original first-layer weights and keep only the "red" channel
    original_weights = src_model[0][0].weight.clone()
    new_weights = original_weights[:, 0:1, :, :]

    # create a new first layer initialised with the copied weights
    new_layer = nn.Conv2d(nChannels, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    new_layer.weight = nn.Parameter(new_weights)
...
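For anyone who wants to try the same trick outside that code path, here’s a minimal, self-contained sketch of the idea against a plain torchvision ResNet-34; note that conv1 is torchvision’s name for the first layer, and the single-input-channel setup is my assumption, not part of the snippet above:

import torch.nn as nn
import torchvision

# Load a pretrained ResNet-34 and grab its first conv layer's weights.
model = torchvision.models.resnet34(pretrained=True)
original_weights = model.conv1.weight.clone()   # shape [64, 3, 7, 7]

# Keep only the "red" channel so the layer accepts 1-channel spectrograms.
new_weights = original_weights[:, 0:1, :, :]    # shape [64, 1, 7, 7]

# Build the replacement first layer and copy the weights in.
new_layer = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2),
                      padding=(3, 3), bias=False)
new_layer.weight = nn.Parameter(new_weights)
model.conv1 = new_layer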

I’m working on a “multi spectrogram example” that shows how to create multiple images for the same sample.

I’ll commit as I finish :wink:

1 Like

I think there is too much silence in these samples.

A simple little hack would be to slice up the audio based on silence.

I’ve created a notebook describing how to do this.

I’ll try to create a method that is more configurable.
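For anyone who wants the gist without opening the notebook, here’s a rough sketch of silence-based slicing using librosa; the top_db threshold and the output file naming are placeholder choices of mine, not necessarily what the notebook does:

import librosa
import soundfile as sf

def split_on_silence(path, top_db=40):
    """Slice an audio file into non-silent chunks and save each as its own wav."""
    y, sr = librosa.load(path, sr=None)                   # keep the native sample rate
    intervals = librosa.effects.split(y, top_db=top_db)   # (start, end) sample indices
    stem = path.rsplit(".", 1)[0]
    for i, (start, end) in enumerate(intervals):
        sf.write(f"{stem}_chunk{i}.wav", y[start:end], sr)

split_on_silence("sample.wav")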

Timely…
(You can always just play with the dataset if not competitive)

3 Likes

I’ve published two notebooks:

  • SINGLE SPEC.: one focused on tuning your transformations to be sure that they won’t mess up your original data
  • MULTI SPEC: an example of generating multiple spectrograms from augmented data :wink:

3 Likes

It seems the kernel is showing a 3d spectrogram, but it is really just a convenient visualization: x-y is time-frequency, and the z axis is amplitude, which carries the same information as colour in our 2d spectrograms… So no extra info is added by using 3d…

I can imagine a useful 3d spectrogram though: multiple channels of sound along the z axis, like stereo, or more interestingly, spectrograms from a microphone array concatenated along the z axis for sound localization… I have thought about this before…

1 Like

I’d suggest using the tools/run-after-git-clone stuff in the fastai repo to avoid this. See the fastai docs for details. Also try the ReviewNB service - it’s great.

Totally agree, the 3d spectrogram is very much a “human” vis and not a different representation of the data.

Check out @ste’s branch of the fastai-audio repo for an example of using not quite a different representation, but a kind of multi-scale representation, (ab)using the fact that we’re really only ever training on tensors, whether us puny humans can see them or not.

I was thinking of another way of aggregating - and then visualising - the information in an audio clip besides a spectrogram, but it really does capture most of what’s there.

It would be interesting to experiment with different kinds of spectrogram (“raw” vs. mel, power vs. amp vs. db) and different values for the params (number of FFTs, number of bins, bin size…). Honestly we’re just trying to find what looks “good” to our (puny) human eyes; there’s no guarantee that the prettiest image does the best job of helping a NN discriminate. For my next experiment I want to try making really “high def” spectros to train on.
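As a concrete starting point for that comparison, here’s a sketch of a few of those spectrogram variants using librosa; the parameter values are arbitrary placeholders, not recommendations:

import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)

# "Raw" (linear-frequency) magnitude spectrogram.
linear_spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Mel power spectrogram with the same FFT settings.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                          hop_length=256, n_mels=128)

# Amplitude vs. power vs. dB views of the same data.
linear_db = librosa.amplitude_to_db(linear_spec, ref=np.max)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)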

And also interesting to play with the effects of audio signal transforms vs spectrogram params on the accuracy of the network. For example, if you augment your data by downsampling, but keep your spectrogram #ffts & #num_mels constant, will the “image” presented to the network actually be substantially different?
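One quick way to sanity-check that question is to hold the spectrogram params constant and compare the outputs before and after resampling; a sketch, with the rates and params as examples only:

import librosa

y, sr = librosa.load("sample.wav", sr=44100)
y_lo = librosa.resample(y, orig_sr=sr, target_sr=8000)

# Same n_fft / hop_length / n_mels for both versions.
spec_hi = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
spec_lo = librosa.feature.melspectrogram(y=y_lo, sr=8000, n_fft=1024, hop_length=256, n_mels=64)

# The mel axis stays at 64 bins, but the time axis shrinks with the sample count
# and each bin now spans a different frequency range - so the "image" the network
# sees does change.
print(spec_hi.shape, spec_lo.shape)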

We haven’t tackled normalisation yet, either; and that could cancel out some assumed audio transforms, eg if you add white noise to every sample, and then normalise, you’re basically removing the noise you added…

There’s a lot to be learned here - and now a Kaggle comp to learn it on :wink:

5 Likes

Awesome, I’ve been meaning to ask how you handle this workflow internally. We’ve just been using liberal branching + splatting + cleanup.

1 Like

Hey Baz, if you check out the “doc_notebooks” branch on the main git repo you’ll find a few changes. I changed all the notebooks to use the public dataset, changed the exception handling, made a few other cleanups, optimised some of the transforms, fixed a pretty critical bug with the spectrogram transform step, etc.

We’ll merge this in with the cool stuff Zac and Stefano have been doing on Monday, but thought I’d let you know in case you’re playing with it over the weekend.

And the “baseline” demo workbook now gets 98.4% accuracy :slight_smile: maybe it will go even higher with your model layer modification!

We’re still actively working on this; you’ll be able to see that Stefano has been testing ideas “manually” in his notebooks when we merge them (or it might already be there in a branch!), and the DataAugmentation notebook in the doc_notebooks branch has a slightly improved comparison helper.

Ideally I think we’d want a pretty rich display widget that lets you hear the original & transformed audio, see the original & transformed waveforms, and see the final post-transform spectrogram that the network is actually seeing. It’s a little tricky: because of the way we’ve handled the transforms (using the wav-to-spectrogram as the final step), it’s hard to access the “transform - 1” state, i.e. the audio just before that last step. We’re debating whether it’s best to change the way the AudioItem handles being transformed (eg. adding a concept like “transform groups”) or change the way it __repr__s itself. I’m wondering whether it’s best to dig into Jupyter’s custom display() handling (particularly _repr_html_) to make a richer show.

We’ve even thought of a sub-project to make an ipython-widget-based tool to help you test and select transforms on the fly! I think it’s definitely something to focus on: there is so much to experiment with that we’ll get high leverage from making tools that ease experimentation. Feel free to help!
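To make that last idea a little more concrete, here’s a very rough sketch of the kind of display hook I mean; the AudioClip class and its fields are hypothetical, not the library’s actual AudioItem API:

import matplotlib.pyplot as plt
from IPython.display import Audio, display

class AudioClip:
    """Hypothetical wrapper: the raw signal plus the spectrogram the network sees."""
    def __init__(self, signal, sr, spectrogram):
        self.signal, self.sr, self.spectrogram = signal, sr, spectrogram

    def _repr_html_(self):
        # Jupyter calls this automatically when the object is the last
        # expression in a cell; embed a playable audio widget.
        return Audio(self.signal, rate=self.sr)._repr_html_()

    def show(self):
        # Waveform and final spectrogram side by side, plus the player.
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
        ax1.plot(self.signal)
        ax1.set_title("waveform")
        ax2.imshow(self.spectrogram, origin="lower", aspect="auto")
        ax2.set_title("spectrogram")
        display(Audio(self.signal, rate=self.sr))

# In a notebook: AudioClip(y, sr, spec) renders a player; .show() adds the plots.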

3 Likes

FWIW, I did this last night, and it wasn’t good. It took ages to train because I had to use a tiny batch size, and was overall less accurate than using the relatively lo-res ones. So, not recommended.

There are many other variables to consider as well; for example, the naive “pad to max” we use in the current demo notebook adds a LOT of zeros to the vast majority of samples, so a smarter uniformity strategy would probably be advantageous (something like “pad from the end to the average length”). I suspect the reason the higher-res spectros were worse is that the relative amount of zero bins was higher.
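Here’s a sketch of the kind of uniform-length step I mean, assuming a 1-D tensor of samples and a target length picked in advance (e.g. the dataset’s average length); the function name is mine, not the library’s:

import torch
import torch.nn.functional as F

def pad_or_trim(signal: torch.Tensor, target_len: int) -> torch.Tensor:
    """Truncate or zero-pad a 1-D signal so every sample has the same length."""
    if signal.shape[-1] >= target_len:
        return signal[..., :target_len]     # truncate long clips
    pad = target_len - signal.shape[-1]
    return F.pad(signal, (0, pad))          # zero-pad short clips at the end

sig = torch.randn(16000 * 3)                # pretend 3 s of 16kHz audio
print(pad_or_trim(sig, 16000 * 2).shape)    # torch.Size([32000])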

I’d also be interested to try progressive resizing - train the model on low-res spectros first, then generate higher and higher res ones to see if it made a difference. It would be interesting to do this at the audio level too (i.e. downsample to 8KHz first).
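A sketch of what that could look like at the spectrogram level, generating progressively finer spectrograms from the same clip (the sizes are placeholder guesses):

import librosa

y, sr = librosa.load("sample.wav", sr=None)

# Progressive "resizing": start training on coarse spectrograms, finish on fine ones.
for n_mels, n_fft in [(32, 512), (64, 1024), (128, 2048)]:
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                          hop_length=n_fft // 4, n_mels=n_mels)
    print(n_mels, spec.shape)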

1 Like

hey @kmo, of course! Please PR away.

Keep in mind though that we are trying to adhere as closely as possible to the recommended FastAI workflow. There are things we’re doing, such as writing the code in notebooks and certain aspects of code style, that may break with typical software engineering practices and PEP 8.

But this has been a conscious decision – because the hope is that this work can be harmoniously brought into the course somehow. Hopefully that doesn’t discourage you from PR’ing and collaborating :slight_smile:

Some ideas for possible audio augmentation from a new paper:

Adversarial data augmentation and Stratified data augmentation

1 Like

Creating a Microphone Recording

import pyaudio
import wave

FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "file.wav"

audio = pyaudio.PyAudio()

# start recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE, input=True,
                    frames_per_buffer=CHUNK)
print("recording...")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("finished recording")

# stop recording
stream.stop_stream()
stream.close()
audio.terminate()

# write the captured frames to a wav file
waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
waveFile.setnchannels(CHANNELS)
waveFile.setsampwidth(audio.get_sample_size(FORMAT))
waveFile.setframerate(RATE)
waveFile.writeframes(b''.join(frames))
waveFile.close()
4 Likes

How does your current audio work handle stereo wav files?

I was exploring https://www.kaggle.com/c/freesound-audio-tagging-2019 which seems like a good dataset to learn about multilabel audio classification.

Last weekend I found the older fastai-audio package someone was working on and tried to build a simple pipeline to classify this dataset.
I kept getting tensor size mismatches and suspect this is because the library didn’t handle stereo audio.

1 Like

Actually we’re not supporting stereo audio - I’ll take a look at it later…
Btw there are multiple ways to cope with a “pair” of sounds instead of one:

  1. mixing them in the “SoundSpace”:
  • take the average
  • sum them
  • concatenate them
  • ?..
  2. mixing them in the “SpectrogramSpace”:
  • generate two separate spectrograms (one per channel)
  • concatenate the spectrograms
  • ?..

What do you think is the best one?
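As a reference point, here’s a minimal sketch of two of those options - averaging the channels in the signal domain versus keeping one spectrogram per channel - using torchaudio; the parameter values are arbitrary:

import torchaudio

waveform, sr = torchaudio.load("stereo_sample.wav")     # shape [2, num_samples]
to_spec = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                               hop_length=256, n_mels=64)

# Option 1: mix in the "SoundSpace" - average the channels down to mono.
mono = waveform.mean(dim=0)                              # shape [num_samples]
mono_spec = to_spec(mono)                                # shape [64, frames]

# Option 2: mix in the "SpectrogramSpace" - one spectrogram per channel,
# stacked like the colour channels of an image.
stereo_spec = to_spec(waveform)                          # shape [2, 64, frames]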

Please also note the fastai style suggestions: https://docs.fast.ai/dev/style.html

1 Like

These are all awesome questions.
Any audio experts have suggestions?
I understand that we’re using spectrograms because resnet is already trained on images, and we want to leverage that pretraining.
I personally wonder if we could get better results by working with raw waveforms. I know there’s been a lot of work with stuff like WaveNet, and recently I saw a neural vocoder [0] that seems to deal with audio in a fundamentally different way.

Also, when first approaching this problem I read [1], which suggested in its abstract that feeding raw waveforms worked better than spectrograms, and that was from 2009 and seems to have some … pretty important authors.

[0] https://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/
[1] https://papers.nips.cc/paper/3674-unsupervised-feature-learning-for-audio-classification-using-convolutional-deep-belief-networks.pdf

Thread forming here discussing just that. There are a few good links to papers that model raw audio.

3 Likes

Re. Stereo, I’m currently working on modifying the paradigm to use a “preprocessor” concept, like the fastai.text library, in order to ensure your model can take arbitrary audio inputs and have them normalised before you start applying your transforms. Supporting stereo data would be part of that - in the first instance I assume I would just downmix to mono if it detects a stereo input. Everything after that should “just work” once the LoadAudioData preprocessor has had its way with the inputs. We could later enable this option once the rest of the processes can handle stereo.

As for @ste’s question about how to “deal” with stereo data, I think the simplest way would be to store them as a rank 2 tensor where the stereo channels are handled akin to RGB channels in images, both in the signal space and the spectrogram space. So your input signal would be of shape [2, <number of samples>] and your spectrogram would have 2 channels instead of 1 (or 3 like an RGB image).

This would take some tweaking to the display of spectrograms but that’s simple enough, just by cat’ing them together either in array-space or in plt.add space. Here’s what SOX does, for example.

I’d be very keen to explore working with raw waveforms, I think a lot of us want to try it out. For example you could “just” do convolutions with a kernel shape of [3,1] instead of [3,3] to start with… or have your first layer be [16000, 1] to “convolve” across a second at a time of 16KHz audio. I’d be keen to try this… as soon as the audio framework is “built enough” :wink:
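To make the raw-waveform idea concrete, here’s a rough sketch of a front end that convolves directly over samples; I’ve used Conv1d rather than a literal [k, 1] Conv2d kernel, and all the sizes are placeholder guesses:

import torch
import torch.nn as nn

# A tiny "stem" for raw audio: the first conv sees windows of samples
# directly instead of spectrogram pixels.
raw_audio_stem = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=400, stride=160),   # ~25 ms windows, ~10 ms hop at 16kHz
    nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=3, stride=2),
    nn.ReLU(),
)

batch = torch.randn(8, 1, 16000)        # 8 clips of one second of 16kHz audio
features = raw_audio_stem(batch)        # shape [8, 64, 48]
print(features.shape)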

The twitter thread @MadeUpMasters pointed out will no doubt bring up some great material!

4 Likes