Deep Learning with Audio Thread

I’ve found very little audio content on the forums, so I thought I’d start a thread for all things audio where we can post resources, find people working on similar projects, and help each other out. Maybe we could get a separate study group or slack/telegram chat going as well. Note: I am early in and have only studied the audio->image->CNN route. If anyone else has experience with using RNNs in audio, please help contribute some resources.

Introduction to Sound:
Note: These are all fairly advanced, we need more intro stuff.

  • Jack Schaedler - Compact Primer on Digital Signal Processing - This is a fantastic ~30 page tutorial built around interactive diagrams that introduces sound, signal processing, Fourier transforms and lots of other concepts you’ll see in the sound processing world.

  • 3Blue1Brown - A Visual Intro to Fourier Transforms

  • Awesome Collection of Notebook Tutorials for Audio - A big thanks to Google Engineer Steve Tjoa and other GitHub contributors for writing and open sourcing an absolute treasure trove of well maintained notebooks on audio processing techniques in Python. Signal analysis, Fourier transforms, STFT, MFCC, NFC, spectrograms. While the focus is music, this is likely a good place to start for any audio newbies.

  • Mel-Frequency Cepstral Coefficient Tutorial - This article on MFCC by James Lyons is a thorough, detailed, and at times difficult description of MFCCs which are an important part of Speech Processing. Luckily for you, they are implemented with one line in Librosa using librosa.feature.mfcc, so this isn’t 100% required reading.

  • Coursera course on audio signal processing. Don’t get sucked into this one unless you really need it. It appears to be a great resource for people who want a truly deep dive on audio, but it is absolutely not necessary for building working audio projects with

Speech Processing:

  • How to do Speech Recognition with Deep Learning - A great introduction from Adam Geitgey to the full ASR pipeline. Starts with what audio is and basic feature extractions all the way through audio RNNs and CTC (defined below).

  • DeepLearning for Speech Recognition Talk (90 min) - This talk from Adam Coates of Baidu is from late 2016, but it is the best resource I’ve found on how to build a complete, production-quality ASR system. It leaves out a few details, but the design choices he does talk about are extremely important and will help you avoid pitfalls as you build your system. 100% a worthy time investment for anyone working on an ASR project, especially at scale.

  • Tensorflow Speech Recognition Challenge - (Non-active) competition to recognize 1 of 30 one word voice commands with 65000 samples.

  • Davids1992 - Speech Representation Kernel - An excellent kernel that will show you how to do essential speech processing.

  • Mozilla Speech Datasets - Multiple open source, multilanguage datasets. Main one is 22GB, 582 validated hours. Also includes a link to github repo with TensorFlow implementation of Baidu’s DeepSpeech.

  • DARPA-TIMIT Acoustic Phonetic Speech Corpus - An extremely detailed and well-curated speech dataset with 6300 sentences from 630 speakers from 8 major dialect regions of the United States. Set includes accurate English and IPA (International Phonetic Alphabet) transcripts. Note: The files in this dataset appear to be .wav but are actually a special format you’ll need to convert. StackOverflow - Reading TIMIT wav files

  • Distill - Intro to Connectionist Temporal Classification - This is an excellent intro to CTC, a method for recognizing sequences without prior alignment. Speech is a good example: if you chop a clip of speech into very short frames and pass each one to a model to recognize the sound, you’ll get the labels back with overlap. “HELLO” may come back as HHHHHEEELLLLLLLLOOOOOO, and this method helps us collapse that sequence of sounds back to “HELLO”.

  • Zach Caceres Explains Wav2Letter - In 2016, Facebook Research published a speech to text model called Wav2Letter that takes an alternate approach to alignment that requires less computation. Zach breaks down the paper and explains its implications in a clear, concise way.
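
Going back to the CTC bullet above, here is a minimal sketch of the collapse step in plain Python. The `-` blank symbol and the name `ctc_collapse` are assumptions for illustration only; real CTC decoders work on per-frame probabilities, not strings. It also shows why CTC needs a blank: merging repeats alone cannot tell a long "L" from a double "L".

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol; real models reserve a token index for it


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Merge runs of repeated symbols, then drop the blank symbol."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(s for s in merged if s != blank)


# Without blanks, the double L is merged away.
print(ctc_collapse("HHHHHEEELLLLLLLLOOOOOO"))  # HELO
# A blank between the two L runs keeps them apart.
print(ctc_collapse("HHHEE-LLLL-LLLOOO"))       # HELLO
```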

General Audio Processing

  • The Ultimate Python Audio List - A true threadkiller of audio tools. This has links to Python packages for audio processing in every domain. Many thanks to its maintainer, Fabian-Robert Stöter.
  • Kaggle 2019 Freesound Audio Tagging - Active (through June 3, 2019) Kaggle competition for classifying audio clips with labels from 80+ classes. It is a multi-label problem, as some clips contain multiple classes of sound. Let’s win this one in the name of fastai!
  • Kaggle Freesound Audio Tagging - (non-active) competition to classify 41 categories of general sound like “applause”, “cough”, “meow”, “scissors”, “tearing”…etc.
  • Zafar Beginner Audio Kernel - Another awesome kernel to get you started with sound processing.
  • Bachir - FreeSound Competition using libraries - Bachir went back and did the freesound competition but using libraries. Very useful for audio beginners!

Music Processing:

  • Google Magenta Nsynth Dataset - A large-scale and high-quality dataset of annotated musical notes. Over 300,000 4 second clips of various instruments and notes.

  • OpenAI MuseNet - FastAI Alum @Mcleavey’s latest project, a deep neural network for improvising musical compositions across genres. Think GPT2 for music. Contains some really cool examples so even if music isn’t your domain, check out her work!

  • Isolating Vocals and Instruments From Music - @alekoretzky’s detailed writeup on training a CNN to separate the different elements of a musical composition to a high degree of accuracy. He shows some really fun examples with Daft Punk and Adele.

Data augmentation for audio:

  • Google SpecAugment - A recent (April 2019) paper by Google about doing data augmentation directly on spectrograms, which is much more efficient, and they got fantastic results. It is simple and easy to read, and I think it will be a very important paper in the field. We also have our own version of SpecAugment in PyTorch, implemented by @zachcaceres and @JennyCai. Hopefully it will be merged into fastai audio soon.
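
To give a feel for what masking a spectrogram looks like, here is a rough NumPy sketch of the frequency and time masking idea. It is not the paper's reference code and not the @zachcaceres / @JennyCai implementation; the function name, the parameter names (`max_freq_width`, `max_time_width`) and the choice to fill with the mean are all assumptions, and the paper's time warping step is left out.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of spec (n_mels x n_frames) with one horizontal band
    (frequencies) and one vertical band (time steps) set to the mean value."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = spec.mean()

    # Frequency mask: random height (possibly 0) at a random position.
    f = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill

    # Time mask: random width (possibly 0) at a random position.
    t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill
    return out


spec = np.random.default_rng(0).random((64, 128))  # stand-in for a mel spectrogram
augmented = mask_spectrogram(spec, rng=np.random.default_rng(1))
print(augmented.shape)  # (64, 128)
```

Because the masks are drawn fresh on every call, the same clip looks slightly different each epoch without touching the raw audio.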

Some other papers discussing audio data augmentation

Other Cool Stuff in Audio

Post here and share what you’re working on and what techniques you’ve found helpful!


Personally I’m most interested in speech analysis, so I’ve been toying with problems in that area. I’m a total beginner (on lesson 4 of version 1). One thing I’ve tried is syllable counting, with some minor success (81% accuracy on an unbalanced set of 3600 words with 1-6 syllables, distribution is 1 syllable: 1028 words, 2: 1327 words, 3: 790 words, 4: 361 words, 5: 95 words, 6: 12 words)
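
For context on that 81%, a quick sketch of the majority-class baseline using only the counts quoted above (they sum to 3613, close to the 3600 mentioned). The variable names are just for illustration.

```python
# Words per syllable count, as listed in the post.
counts = {1: 1028, 2: 1327, 3: 790, 4: 361, 5: 95, 6: 12}

total = sum(counts.values())
majority_class, majority_count = max(counts.items(), key=lambda kv: kv[1])
baseline = majority_count / total  # accuracy of always guessing the biggest class

print(total)              # 3613
print(majority_class)     # 2
print(f"{baseline:.1%}")  # 36.7%
```

So a model that always answers "2 syllables" would score about 37%, which is the number to compare 81% against on this unbalanced set.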

I used librosa to convert to melspectrograms (log scale) and don’t really know where to go from here.

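### Credit to John Harquist for this starter code
import librosa
import matplotlib.pyplot as plt
import numpy as np

def log_mel_spec_tfm(fname, src_path, dst_path):
    # src_path: path of the audio file to read; dst_path: pathlib.Path output folder
    # librosa.load resamples to 22050 Hz unless sr is given
    x, sample_rate = librosa.load(src_path)
    n_fft = 1024      # FFT window size in samples
    hop_length = 256  # samples between successive frames
    n_mels = 64       # number of mel bands (image height)
    fmin = 20         # lowest frequency (Hz) covered by the mel filters
    fmax = 12000      # highest frequency (Hz) covered by the mel filters
    mel_spec_power = librosa.feature.melspectrogram(y=x, sr=sample_rate, n_fft=n_fft,
                                                    hop_length=hop_length,
                                                    n_mels=n_mels, power=1.5,
                                                    fmin=fmin, fmax=fmax)
    # Log-scale relative to the loudest bin, so the maximum maps to 0 dB
    mel_spec_db = librosa.power_to_db(mel_spec_power, ref=np.max)
    dst_fname = dst_path / (fname[:-4] + '.jpg')  # swap the 3-letter extension for .jpg
    plt.imsave(dst_fname, mel_spec_db)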

Some open questions for me are:

  1. What do optimal parameters for a spectrogram of human speech look like?
  2. What techniques can we use to preprocess/produce the images to get the distinguishing features (english phonemes, syllable pauses…etc) to stand out more?
  3. Are there any really good resources or open-source projects we can learn from?

I worked on audio classification on the freesound audio with good results - link


Small world, I read your post while working this morning. I’ve added it to the main post as a resource, thanks.

Did you play around at all with the parameters you pass into librosa.feature.melspectrogram to try to achieve better results? Also you said that 16 bit PCM mono audio files were ideal for processing. What is the advantage of mono? I’ve googled but found almost nothing.

Have you done any other work in audio since then? Any advice? Cheers.

Another beginner question is what do we need to do to normalize sound data? If I am analyzing a bunch of melspectrograms of individual words, but those words vary in length (time on the x-axis), how is that best handled?

+1 for an audio group. I worked on the freesound competition with a few other guys from fastai last summer. It also sounds like will be an official thing sometime soon.

In terms of audio normalization, I haven’t found a good answer for what works best. The “ref=np.max” param in the librosa decibel conversion performs some normalization, although you could also use “ref=1.0” if you normalized the waveforms in the time domain. It really depends on how long your chunks are and your specific use case - are your smallest segments made up of phonemes, words, or sentences? This also is a factor when you are using batches - do you normalize each clip or the batch as a whole? Definitely an interesting problem…
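
To make the difference between the two `ref` choices concrete, here is a small NumPy sketch. `to_db` is a simplified stand-in written for this example (plain 10 * log10(S / ref), without the `top_db` clipping librosa applies), and the spectrograms are random fake data.

```python
import numpy as np


def to_db(power_spec, ref, amin=1e-10):
    """Simplified decibel conversion: 10 * log10(S / ref)."""
    return 10.0 * np.log10(np.maximum(power_spec, amin) / max(ref, amin))


rng = np.random.default_rng(0)
quiet = (rng.random((64, 100)) + 0.1) * 0.01  # fake power spectrogram of a quiet clip
loud = quiet * 100.0                          # the same clip at 100x the power

# Per-clip reference (like ref=np.max): both clips come out identical, peak at 0 dB.
print(np.allclose(to_db(quiet, quiet.max()), to_db(loud, loud.max())))  # True

# Fixed reference (like ref=1.0): the 20 dB level difference is preserved.
print(np.allclose(to_db(loud, 1.0) - to_db(quiet, 1.0), 20.0))          # True
```

Which behaviour you want depends on whether loudness itself carries information for your task.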


No, I didn’t do anything with audio after this, though I have a couple of ideas. With mono it’s just easier; otherwise you will have to be creative in combining the stereo audio sources (e.g. generate one image per source).
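
A tiny sketch of the two options for stereo input, assuming the audio is already a NumPy array shaped (channels, samples). The sine waves are fake data standing in for a real recording.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 16000, endpoint=False)
left = np.sin(2 * np.pi * 440.0 * t)   # fake left channel
right = np.sin(2 * np.pi * 660.0 * t)  # fake right channel
stereo = np.stack([left, right])       # shape (2, n_samples)

mono = stereo.mean(axis=0)    # option 1: average the channels into one signal
per_channel = list(stereo)    # option 2: keep each channel for its own image

print(mono.shape)        # (16000,)
print(len(per_channel))  # 2
```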


Thanks for the reply. Is there a platform (slack/telegram/something else) you think would be best for an audio group? I’m really just starting out with both AI and audio but I’m happy to mess around and document what seems to work best for various applications.

@MadeUpMasters I was about to start a discussion and luckily I found this thread.
This week, I have been trying Dog VS Cat classification by converting the audio to spectrograms and then running the simple Lesson 1 code over them.

I have no experience with audio; the only thing I’ve realised is that the ResNets overfit badly.
After spending a few hours inspecting the spectrograms, I couldn’t tell the difference between the classes.


I’m going to round up what I’ve learned so far this week (and hopefully a bit more) and put together a jupyter notebook on audio processing for total beginners. I’ll post here when it’s complete. If anyone wants to get a general audio chat going, just let me know and I’ll set it up on telegram/slack.


New update, it turns out there are a load of fantastic resources out there. I’ve updated the original post to be more of an audio wiki and as I learn and go I will try to fill in any gaps I see in the available material with my own notebooks/contributions.

One thing I think would be helpful is a deeper dive on librosa, Python’s most used music and audio analysis package. I’ll start working on a jupyter notebook that explores the various useful functions for audio processing, and tries to go deeper on the parameters they take and how we can best use them.


I don’t know a lot about the topic but I noticed the upcoming DCASE2019 workshop challenge with interesting 2018 commentary at


+1 for audio group. I am working with marine mammal data, and my data is extremely messy. It is real world recordings with noise, unrelated, irrelevant sounds and marine mammals (manatee) calls that I am trying to identify.

Our data is labeled, for example the file is labeled as containing a call (or calls) or containing “nothing”. The problem is that “nothing” basically means “anything but calls”, so these files are extremely diverse.

Another problem I still cannot resolve is that my spectrograms are rectangular 10 sec files. At this stage, the task is not to differentiate between call type 1 and call type 2, but to grab a raw file and determine if it contains a call (or calls) or not. As I was told, rectangular images are cropped randomly, somewhere in the middle to make them 224 by 224. It is not a problem for “nothing” files, but a big problem for “call” files because some calls could be cut out and it will confuse the model. So I cropped “call” files, making them 224 by 224 and making sure each file at least contains a call.

My results are all over the place. I run the same model with the same data and get different results: sometimes the model confuses 34 images (calls confused for “nothing”), sometimes 10, and I cannot understand why it happens. It also appears that the model struggles with unusual calls and very faint calls.

Jeremy mentioned in lecture 1 that someone used fast ai for auditory scene analysis to tease out some saw sounds in the forest indicating illegal logging, this is very similar to what I am trying to do.

Jeremy also mentioned that we will be working with rectangular images in part 2, but I am still very confused why images absolutely have to be squares.

In most cases I see people working with almost perfect, clear sound files, free of noise, but this is not the case with my data. Any suggestions, advice would be appreciated, I am honestly not sure what is the best way to approach it.


Definitely could be an interesting place to test out some skills, thanks for the links!

Hey, your work sounds really interesting. Where do your spectrograms come from? Are they given to you or do you generate them yourself? If given to you, do you know what preprocessing is done? Do you have access to the original audio?

I’ve seen some tutorials on librosa that might be of use to you with the noise. They involve separating foreground and background noise (in this case, vocalist and instruments). Librosa guide to vocal separation. It might not be of any use to you but maybe worth a look.

Are you okay with Telegram? If so, I’ll start a group there and we can add people as they discover the thread.


I generate all spectrograms using matplotlib, and also have all original wav files. No special preprocessing is done. The idea was to see if the model can successfully identify calls in raw, unfiltered data, but I will definitely check the librosa vocal separation script, thank you. Here is an example of a “good” file that the model seems to identify correctly as having calls in it (NFFT=1024).

I am also ok with Telegram.

P.S. Saving the spectrogram of 10 sec wav file as 224 by 224 messes up the signals (as expected), so it is definitely the issue in my case.

DM sent about Telegram group. Anyone else who wants to join please send me a DM with your Telegram info and I’ll add you.

Can you post the code from matplotlib? Is there a way to adjust the axes in the original so that it has a square shape and doesn’t get so distorted on resize? Also, what do the spectrograms that have some other noise (but no calls) look like?

I’m also interested in audio and have played with audio->image and fastai before. Also, a lot of things from the time series/sequential data study group thread might be interesting to you guys if you haven’t seen it.

After all, audio is a special form of time series.
I am also okay with telegram…


If you are creating the spectrograms yourself, there are a lot of parameters for you to play with to make the output conform to your needs, like limiting min and max frequencies, windows etc., to produce square outputs.
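
As a worked example of picking parameters for a square output, here is the arithmetic for the hop length, assuming a centered STFT (where the frame count is 1 + n_samples // hop_length) and a 22050 Hz sample rate. Both are assumptions; plug in your own values. With short clips the result can be off by a frame, so check the count before relying on it.

```python
def n_frames(n_samples: int, hop_length: int) -> int:
    """Frame count of a centered STFT: one frame per hop, plus one."""
    return 1 + n_samples // hop_length


sample_rate = 22050           # assumed; use your files' real rate
n_samples = 10 * sample_rate  # a 10 second clip
target = 224                  # desired width, to match a height of 224 bins

hop_length = n_samples // (target - 1)

print(hop_length)                       # 988
print(n_frames(n_samples, hop_length))  # 224
print(n_frames(n_samples, 512))         # 431
```

So a hop of 988 samples gives a 224-frame image for a 10 second clip, while a hop of 512 gives a 431-frame rectangle that would need resizing or cropping.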

From my experience it is also by far better to scale the images (squish them together) than to crop them. Also, most image augmentations should be switched off, as it is important “where in the image” things are found, since everything corresponds to frequencies on a fixed scale. It is therefore also very important to always use the exact same settings when producing the images and not to mix different ones.
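
A toy illustration of squishing versus cropping, using a fake spectrogram with a single "call" near the end of the clip. The helper names and the linear interpolation are assumptions for the sketch; an image library resize would do the same job.

```python
import numpy as np


def squish_time_axis(spec, width):
    """Linearly interpolate each frequency row down (or up) to `width` columns."""
    n_frames = spec.shape[1]
    old_x = np.linspace(0.0, 1.0, n_frames)
    new_x = np.linspace(0.0, 1.0, width)
    return np.stack([np.interp(new_x, old_x, row) for row in spec])


def center_crop(spec, width):
    """Keep only the middle `width` columns."""
    start = (spec.shape[1] - width) // 2
    return spec[:, start:start + width]


spec = np.zeros((224, 431))
spec[100:110, 400:420] = 1.0  # a fake "call" near the end of the clip

squished = squish_time_axis(spec, 224)
cropped = center_crop(spec, 224)

print(squished.shape, cropped.shape)  # (224, 224) (224, 224)
print(bool(squished.max() > 0))       # True: the call survives, compressed in time
print(bool(cropped.max() > 0))        # False: the call was cropped away
```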

Spectrograms might also be a prime example for trying out CoordConv layers, as positions are important in spectrograms and the translation invariance of regular convs might actually not be the best for this type of problem.
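
The core trick in CoordConv is to concatenate coordinate channels to the input before a normal convolution, so the filters can see where they are. Here is a NumPy sketch of just that step (no convolution, no framework); the function name and the scaling of coordinates to [-1, 1] are choices made for this example.

```python
import numpy as np


def add_coord_channels(batch):
    """batch: (N, C, H, W) array. Returns (N, C + 2, H, W) with an extra
    row-coordinate channel and column-coordinate channel, scaled to [-1, 1]."""
    n, _, h, w = batch.shape
    ys = np.linspace(-1.0, 1.0, h).reshape(1, 1, h, 1)  # frequency axis position
    xs = np.linspace(-1.0, 1.0, w).reshape(1, 1, 1, w)  # time axis position
    y_channel = np.broadcast_to(ys, (n, 1, h, w))
    x_channel = np.broadcast_to(xs, (n, 1, h, w))
    return np.concatenate([batch, y_channel, x_channel], axis=1)


specs = np.zeros((4, 1, 64, 128))  # 4 single-channel spectrograms
with_coords = add_coord_channels(specs)
print(with_coords.shape)  # (4, 3, 64, 128)
```

For spectrograms the row channel is the interesting one, since it tells every filter which frequency band it is looking at.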

CoordConv explained incl. videos and link to arxiv paper:

In case anyone reading is a beginner and now thinking: WTF is “Translation Invariance”?


Thanks Marc, that looks like an awesome resource. I agree about the need to play with the parameters and figure out what they all do. I’ve been spending a lot of time with librosa and hope to do a jupyter notebook showing the various options.

The audio group is officially up and running with 2 members. Shoot me a message with your Telegram info and I’ll get you added.