Deep Learning with Audio Thread

I wanted to share a short guide I wrote on how to use fastai’s parallel function for speeding up preprocessing. It helped me generate spectrograms for audio classification nearly 3x faster! Feedback, especially of the critical variety, is much appreciated!
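For anyone curious, the core pattern is just a parallel map over the list of audio files. Here's a minimal sketch of that idea; it uses Python's `concurrent.futures` as a stand-in so it runs without fastai installed (fastai's `parallel` wraps a process pool and adds a progress bar), and `make_spectrogram` is a toy placeholder for the real librosa call:

```python
# Sketch of the idea behind a parallel preprocessing step: map a function
# over a list of files with a pool of workers. ThreadPoolExecutor stands
# in here so the example is dependency-free; the real version would call
# e.g. librosa.feature.melspectrogram on each file.
from concurrent.futures import ThreadPoolExecutor

def make_spectrogram(path):
    # placeholder for loading `path` and computing its spectrogram
    return f"spec_{path}"

def parallel_map(func, items, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(func, items))

audio_files = ["a.wav", "b.wav", "c.wav"]
specs = parallel_map(make_spectrogram, audio_files)
# specs == ["spec_a.wav", "spec_b.wav", "spec_c.wav"]
```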


Thanks for sharing! BTW if you add your twitter handle to Medium then it’ll automatically credit you when sharing.


Hello! :slight_smile: I’m interested in voice identification. I noticed that most people are using the spectrogram approach, but the examples of doing this use the raw audio samples and put them into 1D convolutions.

I know that this isn’t possible with the current fastai library, but hopefully when the new version is out we will be able to do this stuff.

In particular, I’m interested in recreating something that can identify the voices in an audio sample. This article uses a Siamese Network, which trains 2 identical encoders to compare two things (say 3s audio clips, or pixels from a face). It then takes the Euclidean distance of the two output vectors and trains the network to decrease the distance between the output encodings if they are the same and increase it if they aren’t.
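The training signal described above (shrink the distance for matching pairs, grow it for mismatched ones up to a margin) is usually called a contrastive loss. A minimal numpy sketch, with made-up 2-D vectors standing in for the two encoders' outputs:

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def contrastive_loss(a, b, same_speaker, margin=1.0):
    d = euclidean_distance(a, b)
    if same_speaker:
        return d ** 2                    # pull matching pairs together
    return max(0.0, margin - d) ** 2     # push mismatched pairs apart, up to the margin

emb_a = np.array([0.0, 0.0])             # toy encoder outputs
emb_b = np.array([3.0, 4.0])
contrastive_loss(emb_a, emb_b, same_speaker=True)   # 25.0 (distance 5, squared)
contrastive_loss(emb_a, emb_b, same_speaker=False)  # 0.0 (already beyond the margin)
```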

I want to be able to take a totally new voice, enrol a person, and then from further audio samples use this enrolment to identify the speaker, similar to the article above.

What might be the best way to approach this with the fastai library? Would I need to create a custom Module?


A lot to catch up on in this thread :slight_smile: I just jumped in to say that TuriCreate (Apple’s ML creation framework) just pushed a new version which includes a sound classifier, and their code is on GitHub under a BSD 3-clause license. I don’t know about license compatibility with fastai, but it’s got some interesting implementations. Roughly speaking, it looks like they’re slicing the input samples into overlapping windows and then using a CNN directly on the audio data. (It looks like they’re also using transfer learning.) It would be interesting to reimplement their approach in fastai and compare results, both between the libraries and against the to-image-then-convnet method. I think it would take a fair bit of time & effort to do justice to, though, including understanding the structure of the TuriCreate system to reimplement it comparably in fastai.
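The windowing step they describe is easy to sketch. This is a generic illustration, not TuriCreate's actual code, and the window/hop sizes are arbitrary:

```python
# Slice a raw sample buffer into fixed-length, overlapping frames.
import numpy as np

def overlapping_windows(samples, window=4, hop=2):
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        frames.append(samples[start:start + window])
    return np.stack(frames)

x = np.arange(10)                              # pretend these are audio samples
frames = overlapping_windows(x, window=4, hop=2)
frames.shape                                   # (4, 4): four half-overlapping windows
```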


Awesome! What dataset do they use for transfer learning? We’ve talked about trying to create a soundnet/speechnet/musicnet for this purpose. So far I haven’t found one but I also haven’t looked that hard.

They won’t be scaled down without data loss. As @marcmuc said, you’ll lose potentially useful info when scaling down so much.

The overload error is probably because those images are way too big. Instead of downsizing to (100,177), here’s what you should do – and this is customised to your problem.

height = 1400
width  = 1900
aspect_ratio = width / height  # 1900 / 1400 ≈ 1.36

def to_img_size(base_size):
    # keep the original aspect ratio at any base size
    return (base_size, int(base_size * aspect_ratio))

to_img_size(1400) # returns (1400, 1900)
to_img_size(400)  # returns (400, 542)
to_img_size(100)  # returns (100, 135)

Maintaining the aspect ratio of your data ensures that it isn’t squished or morphed in unexpected ways. As for the size of the images, you are really only limited by the size of your GPU memory. Play around with different sizes to see what works (if the image is too big, you will get an out-of-memory error).


This is great! I will definitely try it out, thank you very much.

So excited for the fastai audio module to get going. I’m doing the TensorFlow speech challenge; I got 98.7% accuracy on my validation set but only 67% accuracy on the competition test set. I wrote my own little EDA tool for audio so that I can listen to the audio instead of just seeing the mel spectrogram.

import random
import librosa
from IPython.display import Audio, display

def display_audio_prediction():
    # pick a random clip from the test set
    rand_file = test_files_list[random.randint(0, num_files - 1)]
    clip, sr = librosa.load(path_test_audio/rand_file, sr=None)
    # load the matching spectrogram and run it through the model
    img_filename = rand_file + ".png"
    image = open_image(path_test_spectrogram/img_filename)
    pred = learn.predict(image)
    print(f"Prediction: {pred[0]}")
    # list every class given more than 10% probability
    for idx, pct in enumerate(pred[2]):
        if pct.item() > 0.1:
            print(f"{data.classes[idx]}: {round(pct.item() * 100, 2)}%")
    display(Audio(clip, rate=sr))

Then each time I call it I get a new random file from my test set that I can listen to and see what my model thinks is the most likely class. It’s really helping me understand my model and where my training set is falling short. Looking forward to when this is built in!

Here are some leads I took away from it, hopefully it will help some other audio newbies to find problems in their models.

  • Pure silence always gets a wrong prediction (I have no pure silence in my training set, only white/pink/random noise)

  • The unknown class in my train and validation set are very different

    • Need wider variety of words that includes all phonemes (supplement with TIMIT corpus)
    • This is a great time to start trying data augmentation (controlled training set, wild test set)
    • Words starting with “r” are always predicted as the word “right” because my training set doesn’t have other r-words.
  • My model never guesses “no” (0/158,000)

  • My model almost never guesses “yes” (0.08%)

  • Unknown is 46.6% of test set but only 19% of training/validation


Wanted to follow up on this as I worked on it more today. My main improvement came from discovering that my “yes” and “no” spectrograms were somehow generated using a different, higher max frequency, leading to a black bar across the top of all my yes/no training and validation spectrograms and resulting in my model never predicting those two classes on the test set. Lesson: always sanity check your data/inputs.
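A cheap programmatic sanity check could have caught this, e.g. flagging spectrograms that contain all-black frequency rows. A rough numpy sketch (the energy threshold here is arbitrary):

```python
import numpy as np

def dead_rows(spec, eps=1e-6):
    # indices of frequency bins whose energy is ~zero across all time steps
    return np.where(spec.max(axis=1) < eps)[0]

good = np.abs(np.random.randn(4, 8)) + 0.1   # every bin has some energy
bad = good.copy()
bad[3] = 0.0                                  # simulate a black bar

len(dead_rows(good))   # 0
dead_rows(bad)         # array([3])
```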

I also used the TIMIT corpus to generate more “unknown” examples as well as silence examples. Upon further inspection it appears this is outside of the rules of the competition, but they allowed people to use the test set in training as long as it was unsupervised. I’m new to ML but this seems crazy to me for a competition. I will probably try again in the future staying within the rules, but my goal is to build stuff, not win comps, so the practice of generating my own data is very useful.

The new data to supplement the “unknown” class was compiled by grabbing 0.4s clips at random from the TIMIT speech corpus (so it was often a chunk of two words, or a word and a half) and then padding both sides with random values taken from silence clips. Because frequency is more about the relationship between successive values in the sound than their absolute magnitude, this sounds like white/pink noise. I could probably do a better job of mimicking the test set here.
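A sketch of that augmentation, with toy sizes (16 kHz, 0.4 s chunks) and Gaussian noise standing in for the values sampled from silence clips:

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_and_pad(utterance, chunk_len, target_len, noise_scale=0.01):
    # grab a random fixed-length chunk from a longer utterance...
    start = rng.integers(0, len(utterance) - chunk_len + 1)
    chunk = utterance[start:start + chunk_len]
    # ...then pad both sides with low-level noise to reach the target length
    pad_total = target_len - chunk_len
    left = rng.integers(0, pad_total + 1)
    noise = lambda n: rng.normal(0.0, noise_scale, n)
    return np.concatenate([noise(left), chunk, noise(pad_total - left)])

speech = rng.normal(0, 1, 16000)                        # fake 1 s utterance at 16 kHz
clip = crop_and_pad(speech, chunk_len=6400, target_len=16000)
clip.shape                                              # (16000,): 0.4 s of speech, noise-padded to 1 s
```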

I also supplemented my silence class by downloading some background-noise sounds (saws, etc.) and generating 1 second clips with no speech at various volume levels. These moved me from 67% to a bit over 72%.

Next I fixed the yes/no problem by generating the spectrograms correctly, which pushed me from 72% to over 86%. That puts me in the top 20%; the winner had 91.07%, a score I’d eventually like to surpass as I keep working on audio and speech recognition.


Hello there everyone, such a great thread I could not resist making my first post!
I’m trying to catch up with all the stuff and materials up here (and the fastai courses too); it will take some time I guess!
But meanwhile I have 2 questions bugging me.

I noticed that fastai_audio uses only a single image channel, is that right?
Why that choice, instead of filling the other 2 channels with same-sized spectrograms generated with different parameters, or with some different kind of function like intonation or auto-correlation? Can’t they be seen as extra layers of the same audio?

And then, some time ago I had to play with the cross-correlation function for an audio project, and noticed how remarkably immune to noise it was, so I thought that convolution would act in a similar way, making background noise a good way to augment data (other than simple volume gain/loss) without changing the data… but from what I’ve read so far it seems that’s not really the case; am I wrong here too?

Going back to reading


Hi all, very interested in this thread, thanks for making it!

I am working with some other in-person participants here in SF on some audio stuff (fast ai classes + data augmentation) that will hopefully be useful once we polish it a bit more.

In the mean time, here’s an article I wrote recently on Wav2Letter’s unique loss function that may be of some interest.


It would be interesting to reimplement their approach in fastai and compare results, both between the libraries and against the to-image-then-convnet method.

Do you know how well their approach performs so we have a baseline?

We’ve talked about trying to create a soundnet/speechnet/musicnet for this purpose.

Couldn’t our speechnet just be these CNNs we’re all making trained on TIMIT? Am I wrong to think that we’re basically saying: ‘this is what the building blocks of English speech look like’.

Has anyone tried training on TIMIT and then doing a separate label/word classification task with the pretrained model?

It would be interesting to reimplement their approach in fastai and compare results, both between the libraries and against the to-image-then-convnet method.

Do you know how well their approach performs so we have a baseline?

Nope :slight_smile: That would have to be part of the test. Honestly I don’t know if I’m interested enough to fully replicate it; but I’m working on a basic speaker identification model, and want to play with TuriCreate a bit more anyway, so it probably wouldn’t be too hard to do a simple replication with the same dataset… sometime in the next 6 weeks :slight_smile:

That’s a great writeup by the way!

@MadeUpMasters, I don’t know what dataset the TuriCreate team used to train the model they use for transfer learning. But I was looking around and found two promising data sources:

  1. VoxCeleb looks really cool. It’s millions of utterances from thousands of celebrities, compiled from YouTube, so they all have attendant video. Nothing stopping you from just using the audio, though. They have one particularly cool demo, where they link up the lip movement of a speaker with the audio to let you either mute or isolate a specific speaker even during overlapping speech - very impressive.
  2. Librivox is a source of free audiobooks. Basically just lots and lots of human speech :slight_smile: This blog post describes some of the gotchas if you’re trying to use it to identify specific speakers, e.g. not all books are read by the same person.

I’m working on something to do “speaker diarisation”, which is apparently industry jargon for “who’s speaking when”. My target goal is to eliminate the voice of one particular speaker in a podcast :slight_smile: We’ll see how we go…

Edit to add: I think podcasts in general could be a really fruitful data source and study area. They’re generally high sound quality, LOTS of audio available, and because of the variety of podcasts available you could find lots of data to reflect whatever domain you’re looking at: single speaker voice to text, multi-speaker identification, speaker diarisation, noisy environments, overlapping speech, etc etc.


If you haven’t seen this yet re: datasets and performance…


Some naive questions when thinking of approaching this problem:

  1. Do all the audio clips have to be of the same length?
  2. Does it make a difference if the spectrograms are greyscale vs. colour?
  3. Does transfer learning from ImageNet help when working with spectrograms?
  4. Do any of fastai’s default image transforms (augmentations) help when working with spectrograms?

I looked at some of these with my part 1 homework project (notebook, blog post). I just took a super-naive approach to classifying sounds from the BBC sound effects archive (blog post on how I got & prepped the sounds). I tried to classify “vehicle noises”, e.g. cars, trains, planes & boats. After a few tweaks the model was pretty accurate, <0.1 error rate. I did use transfer learning, and that was better than without. I got the best results from using some of the transforms, but not flip or rotate.

I did a lot of things “wrong” though - the sounds were all different lengths, but I just used the “SQUISH” transform to make all the spectrograms the same (square) size; there’s only a small sample size, ~200 of each class; I used transforms which “don’t make sense for audio” e.g. skew, perspective warp; I normalised my training & validation datasets according to the imagenet stats, not their own stats; etc etc.

Now obviously this isn’t a kaggle comp so I have nothing to compare it to, so I don’t actually know if <0.1 is “good”, but it feels pretty good to me… I don’t know if this is a lesson in “try the dumb thing first” or if it’s just self-deception, but it’s given me enough encouragement to keep trying things the dumb way first & improving from there. :slight_smile:


Hi all, @ste and I collaborated on a notebook that’s a first attempt at data augmentation relevant to audio. Check it out here.

If you download the notebook and the TIMIT dataset, you can easily listen to all the augmentations. There are so many more we could try; maybe someone else wants to write a couple?

@ste is working to export all the functions in the notebook so we can integrate them with some Audio classes written to fit into a typical fastai workflow. We’ve trained one model so far pretending to use ‘FastAI Audio’ and it went great!
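For anyone who wants to try writing one: a simple augmentation in the same spirit is mixing background noise into a clip at a chosen signal-to-noise ratio. This is a generic sketch, not code from the notebook:

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    # scale the noise so the mix has the requested signal-to-noise ratio
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    scaled = noise * np.sqrt(target_noise_power / noise_power)
    return signal + scaled

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100, 16000))   # fake 1 s clip
noise = rng.normal(0, 1, 16000)              # fake background noise
noisy = add_noise_at_snr(clean, noise, snr_db=10)
noisy.shape                                   # (16000,)
```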


Very cool @zachcaceres & @ste!

A couple more ideas to note “out loud” - one thing I want to try is the equivalent of progressive resizing, but with audio sampling rates. E.g. downsample your audio to 4 kHz, train the network, then freeze it, reload your data at 8 kHz & retrain the head, etc. for 16, 32, 44.1…

A decent, easier proxy would probably be to just do progressive resizing of the spectrogram images :slight_smile:
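The downsampling step for that experiment could be sketched like this. A real pipeline would use a proper resampler such as `librosa.resample` (which low-pass filters first to avoid aliasing), but plain linear interpolation keeps the example dependency-free:

```python
import numpy as np

def naive_resample(samples, sr_in, sr_out):
    # linearly interpolate the signal onto the new sample grid
    n_out = int(len(samples) * sr_out / sr_in)
    old_t = np.arange(len(samples)) / sr_in
    new_t = np.arange(n_out) / sr_out
    return np.interp(new_t, old_t, samples)

x = np.sin(np.linspace(0, 10, 16000))   # pretend: 1 s at 16 kHz
low = naive_resample(x, 16000, 4000)    # train at 4 kHz first...
len(low)                                 # 4000
```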

I also wonder about the shape and size of kernels. The information spectrograms contain is fundamentally different to what images of the real world contain, and phenomena are represented differently. So it might be interesting to experiment with different types of kernels to try to capture different phenomena to what’s effective in a pure vision context. I’m thinking very tall, thin kernels to capture harmonics, for example.
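To make the kernel idea concrete, here's a toy numpy cross-correlation with a tall, thin kernel; in a real model you'd just pass a non-square `kernel_size` to the conv layer:

```python
import numpy as np

def conv2d_valid(img, kernel):
    # plain "valid" 2D cross-correlation, no padding or stride
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

spec = np.random.rand(128, 64)       # frequency bins x time steps
tall_kernel = np.ones((9, 1)) / 9    # spans 9 frequency bins at a single time step
out = conv2d_valid(spec, tall_kernel)
out.shape                             # (120, 64)
```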

Edit to add some other random thoughts:

  • Are there any other kinds of audio visualisations besides a spectrogram that could be useful?
  • Is there any way we could use the RGB values in a spectrogram to mean something useful?

Awesome notebook Zach and @ste, added to the original post under specific. Also, I had no idea torchaudio could read NIST SPHERE (the TIMIT format) directly; I had to jump through hoops to convert to wav (the files are listed in the dataset as wav files, but are not formatted according to the standard). Good to know.

Have you guys done any work checking how long the augmentations take? @johnhartquist was working on this and told me that he found pitch shifting with librosa to be pretty slow for large datasets.


Hey, thanks for all the great ideas. There’s loads of experimentation to be done here. Is there a way we could organize it to work on collaboratively so we don’t all do the same thing?

I’ve also thought about progressive resizing/resampling but haven’t given it a shot yet.

As far as other audio visualizations go, 3D spectrograms are a thing, and one is briefly shown as a possibility here (kaggle kernel). Not sure if it adds anything, or what the generation time is like compared to a normal spectrogram. Would be cool to train on a 2D spec and a 3D spec, then ensemble the models and see if there’s any improvement.


Just came across this cool blog post on using wavenet to generate audio (both voice and music).
It has some really cool click-to-listen examples comparing standard text-to-speech mechanisms with their network!