I’ve found very little audio content on the forums, so I thought I’d start a thread for all things audio where we can post resources, find people working on similar projects, and help each other out. Maybe we could get a separate study group or Slack/Telegram chat going as well. Note: I am early in fast.ai and have only studied the audio->image->CNN route. If anyone has experience using RNNs for audio, please contribute some resources.
Fast.ai specific
- FastAI Audio V2 - Currently undergoing unstable development, with plans to be stable/feature complete around March 2020. Contributors of all levels are welcome and encouraged! Feedback of all types is appreciated. Discussion is held in the V2 Audio Thread and in the developer Telegram chat. PM @MadeUpMasters to get added.
- Unofficial FastAI Audio Module - This library, originally forked from #2 below, is meant to make it as easy as possible to train audio models without domain expertise. Features include automatic spectrogram/MFCC/delta generation, caching, silence removal, mixup/SpecAugment, and much more. It includes several notebook tutorials, including an Intro to Audio for FastAI Students, a QuickStart Guide, and a Full Features Tutorial. The library was used to achieve a new SOTA on the ESC-50 Dataset with minimal hyperparameter optimization/fine-tuning. Currently maintained by @baz, @KevinB, and @MadeUpMasters, and welcoming contributors of all skill levels.
- Data Augmentation for Audio by Zach Caceres, Ste & Thom Mackey - A fastai module for audio. While not yet formally integrated into fastai, it is in a pretty usable form, and follows the structure of other fastai libraries. It generates spectrograms and performs a diverse array of audio data augmentations on the fly with reasonable speed, and is much more convenient than generating spectrograms via preprocessing.
- John Hartquist - Audio Classification using FastAI and On-the-Fly Frequency Transforms - John tried to build audio processing directly into the fast.ai libraries and gives an excellent report of how he did it and what work remains. See also: John Hartquist - Fastai Audio GitHub Repo
Introduction to Sound:
Note: These are all fairly advanced; we need more intro stuff.
- Jack Schaedler - Compact Primer on Digital Signal Processing - This is a fantastic ~30-page tutorial with interactive diagrams that introduces sound, signal processing, Fourier transforms, and lots of other concepts you’ll see in the sound processing world.
- Awesome Collection of Notebook Tutorials for Audio - A big thanks to Google engineer Steve Tjoa and other GitHub contributors for writing and open sourcing an absolute treasure trove of well-maintained notebooks on audio processing techniques in Python. Topics include signal analysis, Fourier transforms, STFT, MFCC, NFC, and spectrograms. While the focus is music, this is likely a good place to start for any audio newbies.
- Mel-Frequency Cepstral Coefficient Tutorial - This article on MFCCs by James Lyons is a thorough, detailed, and at times difficult description of MFCCs, which are an important part of speech processing. Luckily for you, they are implemented with one line in Librosa using librosa.feature.mfcc, so this isn’t 100% required reading.
- Coursera course on audio signal processing. Don’t get sucked into this one unless you really need it. It appears to be a great resource for people who want a truly deep dive on audio, but it is absolutely not necessary for building working audio projects with fast.ai.
Speech Processing:
- How to do Speech Recognition with Deep Learning - A great introduction from Adam Geitgey to the full ASR pipeline. It starts with what audio is and basic feature extraction, and goes all the way through audio RNNs and CTC (defined below).
- DeepLearning for Speech Recognition Talk (90 min) - This talk from Adam Coates of Baidu is from late 2016, but it is the best resource I’ve found on how to build a complete, production-quality ASR system. It leaves out a few details, but the design choices he does discuss are extremely important and will help you avoid pitfalls as you build your system. 100% a worthy time investment for anyone working on an ASR project, especially at scale.
- Tensorflow Speech Recognition Challenge - (Non-active) competition to recognize 1 of 30 one-word voice commands, with 65,000 samples.
- Davids1992 - Speech Representation Kernel - An excellent kernel that will show you how to do essential speech processing.
- Mozilla Speech Datasets - Multiple open source, multilanguage datasets. The main one is 22GB, with 582 validated hours. Also includes a link to a GitHub repo with a TensorFlow implementation of Baidu’s DeepSpeech.
- DARPA-TIMIT Acoustic Phonetic Speech Corpus - An extremely detailed and well-curated speech dataset with 6300 sentences from 630 speakers from 8 major dialect regions of the United States. The set includes accurate English and IPA (International Phonetic Alphabet) transcripts. Note: The files in this dataset appear to be .wav but are actually in a special format you’ll need to convert. See: StackOverflow - Reading TIMIT wav files
- Distill - Intro to Connectionist Temporal Classification - This is an excellent intro to CTC, a method for recognizing sequences without prior alignment. Speech is a good example: if you chop a clip of speech into very short frames and pass each one to a model to recognize the sound, the predictions come back with overlap. “HELLO” may come back as HHHHHEEELLLLLLLLOOOOOO, and CTC helps us collapse that sequence back to “HELLO”. It relies on a special blank token between repeated letters, so that the double L isn’t merged into one.
- Zach Caceres Explains Wav2Letter - In 2016, Facebook Research published a speech-to-text model called Wav2Letter that takes an alternate approach to alignment requiring less computation. Zach breaks down the paper and explains its implications in a clear, concise way.
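To make the CTC collapse step mentioned above a bit more concrete, here is a minimal sketch of the greedy "merge repeats, then drop blanks" rule. It only covers the final collapse, not CTC training or the loss. The function name `ctc_collapse` and the use of "-" as the blank symbol are my own assumptions for illustration, not taken from the Distill article.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank token


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Greedy CTC collapse: merge runs of repeated symbols, then drop blanks."""
    # groupby yields one key per run of identical consecutive symbols
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(symbol for symbol in merged if symbol != blank)


# Without blanks, the two L's merge into one and we lose a letter.
print(ctc_collapse("HHHHHEEELLLLLLLLOOOOOO"))  # HELO

# A blank between the two runs of L keeps the double letter.
print(ctc_collapse("HHH-EEE-LLLL-LLLL-OOO"))  # HELLO
```

The second call shows why the blank token matters: it is what lets a model emit genuinely repeated characters.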
General Audio Processing
- The Ultimate Python Audio List - A true threadkiller of audio tools. This has links to Python packages for audio processing in every domain. Many thanks to its maintainer, Fabian-Robert Stöter.
- Kaggle 2019 Freesound Audio Tagging - Active (through June 3, 2019) Kaggle competition for classifying audio clips with labels from 80+ classes. It is a multilabel problem, as some clips contain multiple classes of sound. Let’s win this one in the name of fastai!
- Kaggle Freesound Audio Tagging - (Non-active) competition to classify 41 categories of general sound like “applause”, “cough”, “meow”, “scissors”, “tearing”, etc.
- Zafar Beginner Audio Kernel - Another awesome kernel to get you started with sound processing.
- Bachir - FreeSound Competition using Fast.ai libraries - Bachir went back and redid the Freesound competition using fast.ai libraries. Very useful for fast.ai and audio beginners!
Music Processing:
- Google Magenta Nsynth Dataset - A large-scale and high-quality dataset of annotated musical notes. Over 300,000 4-second clips of various instruments and notes.
- OpenAI MuseNet - FastAI alum @Mcleavey’s latest project, a deep neural network for improvising musical compositions across genres. Think GPT-2 for music. It contains some really cool examples, so even if music isn’t your domain, check out her work!
- Isolating Vocals and Instruments From Music - @alekoretzky’s detailed writeup on training a CNN to separate the different elements of a musical composition to a high degree of accuracy. He shows some really fun examples with Daft Punk and Adele.
Data augmentation for audio:
- Google SpecAugment - A recent (April 2019) paper by Google about doing data augmentation directly on spectrograms, which is much more efficient, and they got fantastic results. It is simple and easy to read, and I think it will be a very important paper in the field. We also have our own version of SpecAugment in PyTorch, implemented by @zachcaceres and @JennyCai. Hopefully it will be merged into fastai audio soon.
Some other papers discussing audio data augmentation
- Audio Augmentation for Speech Recognition
- Data augmentation for low resource languages
- Vocal Tract Length Perturbation (VTLP) improves speech recognition
- Data Augmentation for Deep Neural Network Acoustic Modeling
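For anyone curious what masking a spectrogram directly looks like, here is a rough sketch of the frequency and time masking part of SpecAugment. It is a simplified illustration, not the @zachcaceres/@JennyCai PyTorch version or the paper’s exact recipe: it skips time warping, applies one mask of each kind, and uses plain Python lists instead of tensors. The names `spec_augment`, `max_freq_mask`, and `max_time_mask` are made up for this example, and it assumes a non-empty spectrogram.

```python
import random


def spec_augment(spec, max_freq_mask=8, max_time_mask=10, mask_value=0.0, rng=None):
    """Return a copy of `spec` with one frequency band and one time band masked.

    `spec` is a non-empty list of rows, laid out as [frequency bin][time frame].
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: blank out `f` consecutive frequency bins starting at `f0`.
    f = rng.randint(0, min(max_freq_mask, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [mask_value] * n_time

    # Time mask: blank out `t` consecutive time frames starting at `t0`.
    t = rng.randint(0, min(max_time_mask, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        row[t0:t0 + t] = [mask_value] * t

    return out


# Toy "spectrogram": 16 frequency bins x 20 time frames, all ones.
spec = [[1.0] * 20 for _ in range(16)]
augmented = spec_augment(spec, rng=random.Random(0))
for row in augmented:
    print("".join("#" if value else "." for value in row))
```

Because the masks are cheap to draw, you would normally sample fresh ones every time a batch is loaded rather than saving augmented copies to disk.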
Other Cool Stuff in Audio
- Nvidia Noise Suppression - Real-time noise suppression technology.
Post here and share what you’re working on and what techniques you’ve found helpful!