Fastai v2 audio

Well done on all the work here folks! Having just watched @muellerzr’s run-through, this lib looks super useful!

I am hoping to use it in Kaggle’s deepfake comp, as some of the videos also have fake audio. Just wondering if anyone has suggestions on the easiest way to extract audio from mp4 files? And is there a preferred format I should save them to?

1 Like

ffmpeg is a great tool for manipulating video and audio from the command line on Linux. The usage may look scary at first, but it’s very powerful. To extract the audio from a single video:

ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ac 1 -ar 16000 out.wav

Here, we have:

  • -i video.mp4 is the input file;
  • -vn means no video output;
  • -acodec pcm_s16le is the audio codec used;
  • -ac 1 is to use only one channel (mono audio);
  • -ar 16000 is the sampling rate;
  • out.wav (the last argument) is the output file.
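If you prefer to stay in Python, the same command can be built with the stdlib subprocess module. This is just a sketch of the flag list above; ffmpeg must be on your PATH to actually run it:

```python
import subprocess

def ffmpeg_cmd(video: str, out: str, sr: int = 16000) -> list:
    """Build the ffmpeg argv for extracting mono 16-bit PCM audio."""
    return ["ffmpeg", "-i", video,
            "-vn",                   # no video output
            "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
            "-ac", "1",              # mono
            "-ar", str(sr),          # sampling rate
            out]

cmd = ffmpeg_cmd("video.mp4", "out.wav")
print(" ".join(cmd))
# ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ac 1 -ar 16000 out.wav
# To actually run it: subprocess.run(cmd, check=True)
```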

If you search the internet you’ll find posts listing all of the different ways you can use ffmpeg, like this one. To process multiple files, it’s just a matter of using a bash loop:

for vid in *.mp4; do ffmpeg -i "$vid" -vn -acodec pcm_s16le -ac 1 -ar 16000 "${vid%.mp4}.wav"; done

About the format, .wav with this codec is a common choice for audio data. The only parameters you should change are the channel count (-ac 2 if you want stereo audio) and the sampling rate. For pure voice audio, 8 kHz (-ar 8000) should be enough, but if you have other sources of sound besides voice you may want to use 16 kHz (-ar 16000) or even 44.1 kHz (-ar 44100). Those rates are directly related to the highest frequency present in your audio via the Nyquist theorem.
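As a quick sanity check on the resulting files, the stdlib wave module can confirm the parameters the ffmpeg command should produce. This sketch writes a one-second test tone rather than real extracted audio, then reads the header back:

```python
import math, struct, wave

path = "out.wav"
sr = 16000  # sampling rate, matches -ar 16000
with wave.open(path, "wb") as w:
    w.setnchannels(1)   # mono, matches -ac 1
    w.setsampwidth(2)   # 16-bit samples, matches pcm_s16le
    w.setframerate(sr)
    # One second of a 440 Hz tone at half amplitude
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / sr)))
        for t in range(sr)
    )
    w.writeframes(frames)

with wave.open(path, "rb") as w:
    params = (w.getnchannels(), w.getframerate(), w.getsampwidth())
print(params)  # (1, 16000, 2)
```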


Amazing, appreciate it! It’s only voice, although maybe I’ll use 16 kHz because the goal is to identify fake/manipulated voice, so maybe some crazy artefacts show up beyond the expected 8 kHz… thanks again!

Hi all,

I thought I’d introduce myself after lurking for enough time! My background is in acoustic consultancy/engineering and I’m currently making a career change towards ML. I’m currently doing the Udacity ML Engineer Nanodegree and will (hopefully) be going to Georgia Tech to start the OMSCS ML specialization later in the year.

First of all, I absolutely love the work you all have done - machine listening is such a fascinating area, so I would love to contribute however I can. I also have my own personal project working on bird sound recognition for an area next to a national park in Colombia, near where I’m lucky enough to live (Bogotá), so will have a play around with V2 and feedback in due course. I used V1 late last year and it worked pretty well with mel-spectrograms on a dataset of xeno-canto recordings of 134 bird species ranging from excellent to pretty dodgy quality, so I’m excited to see how V2 can do.

I would like to use the library for my Udacity Capstone project, would you recommend I stick with V1 for now or go ahead with V2?



1 Like

Hi all, I’m having some trouble running my code on a Google TPU in a Colab notebook. I thought you might have more experience in this area, so I’m asking here.

I’m trying to run a PyTorch script that uses torchaudio on a Google TPU. To do this I’m using PyTorch XLA, following this notebook; more specifically, I’m using this code cell to load XLA:

!pip install torchaudio
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'

VERSION = "20200220"  #@param ["20200220","nightly", "xrt==1.15.0"]
!curl -o
!python --version $VERSION

import torch

import torchaudio

import torch_xla

However, this is incompatible with the version of torchaudio that I need: ERROR: torchaudio 0.4.0 has requirement torch==1.4.0, but you'll have torch 1.5.0a0+e95282a which is incompatible.

I couldn’t find anywhere how to install torch 1.4.0 using PyTorch XLA.
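For what it's worth, the pip error above can be checked mechanically: torchaudio 0.4.0 pins torch to exactly 1.4.0, while the XLA wheel provides a 1.5.0 pre-release build. A small illustrative check (not a fix):

```python
import re

def version_tuple(v: str):
    """Parse a version like '1.5.0a0+e95282a' into (1, 5, 0),
    ignoring pre-release and local-build tags."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    return tuple(int(x) for x in m.groups())

required = (1, 4, 0)                          # torchaudio 0.4.0 needs torch==1.4.0
installed = version_tuple("1.5.0a0+e95282a")  # version from the pip error above
print(installed, installed == required)       # (1, 5, 0) False
```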

I tried using the nightly version of torchaudio, but that gives the following error:

!pip install torchaudio_nightly -f

import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'

VERSION = "20200220"  #@param ["20200220","nightly", "xrt==1.15.0"]
!curl -o
!python --version $VERSION

import torch
import torchaudio

import torch_xla
ImportError                               Traceback (most recent call last)
<ipython-input-2-968e9d93c06f> in <module>()
     10 import torch
---> 11 import torchaudio
     13 import torch_xla

/usr/local/lib/python3.6/dist-packages/torchaudio/ in <module>()
      4 import torch
----> 5 import _torch_sox
      7 from .version import __version__, git_version

ImportError: /usr/local/lib/python3.6/dist-packages/ undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_29E


So how would I go about loading the stable 1.4.0 version of PyTorch with XLA, or is there any other workaround for this situation?

Thanks a lot for your help!

Hi everyone,

I just came across this audio extension for fastai and I was amazed. I’m trying to write a naive app to classify between 2 data sources. The model trains well, thanks to the notebook provided on GitHub.

I’m trying to load a single wav file and get predictions but I’m doing something wrong here.

I created this single-file batch to get predictions, and the types for learner.x and my single input are the same.

But I’m getting the below error

I don’t understand why the learner is looking for an AudioTensor, and when I simply pass the path to the test file it can’t process it. I’m sure I’m missing a key understanding of the Data Block API here, please help.

Notebook here

Following up: I was able to collect my sample as an AudioTensor, but the predict method still doesn’t work.

@muellerzr Maybe you can help; I picked the AudioTensor creation part from your video tutorial. My apologies, I usually don’t at-mention at all, but I’ve been fighting this for the last 6 hours and going crazy. And I just found a similar thread, and I’m not sure if it’s a fastai2 issue.

Not a problem, try upgrading fastcore? Or try doing the dev installs


pip install git+

(And repeat for fastcore)

Thanks for the quick response, I tried updating everything but still the exact same issue

!pip install git+
!pip install git+
!pip install packaging
!pip install git+
1 Like

Awesome, thank you for investigating :slight_smile: I’ll look into it tonight (as it seems like a fastai issue on a whole)

1 Like

@PranY I looked into it, I can predict normally (with regular fastai data) so I think something updated that needs to be adjusted in the audio library :slight_smile:

1 Like

Thanks again. In case you want a reproducible check for the audio library, the notebook has all the components. I’ll keep looking on the audio side now.

Update: I checked the last 8 commits to fastcore and the error is not related. Will look back further; I think Sylvain will know how to fix this.

Predict does not work because your loss does not have an activation and a decodes method. The first one should do the softmax, the second should do the argmax (look at CrossEntropyFlat() in fastai2 for inspiration)
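A minimal sketch of what that advice might look like. The class and variable names here are illustrative, not the fastai2 source, and NumPy is used just to show the mechanics (in fastai2 these methods operate on torch tensors):

```python
import numpy as np

class LossWithDecodes:
    """A loss object with the two extra methods Learner.predict expects."""

    def __call__(self, preds, targs):
        # Cross-entropy on softmaxed logits, averaged over the batch
        probs = self.activation(preds)
        return -np.log(probs[np.arange(len(targs)), targs]).mean()

    def activation(self, preds):
        # Softmax over the class axis (turns raw logits into probabilities)
        e = np.exp(preds - preds.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def decodes(self, preds):
        # Argmax: map probabilities to predicted class indices
        return preds.argmax(axis=-1)

loss = LossWithDecodes()
logits = np.array([[2.0, 0.5], [0.1, 1.5]])
print(loss.decodes(loss.activation(logits)))  # [0 1]
```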


Thanks! I should have noticed that when torch.nn.CrossEntropyLoss failed and it then worked with CrossEntropyLossFlat. Your reply was very helpful and saved a lot of time. I wrote a similar extension, FocalLossFlat, by adding activation and decodes methods, and that part works. Although I have some new attribute errors with the AudioSpectrogram, I think I should be able to fix them :slight_smile:

1 Like

Is there a way to process audio files of different durations?
The transforms API helps me get spectrograms for every file, but the dimensions are all over the place. Using CropSignal doesn’t help either.

Here is the pipeline snapshot

Here is the output, and for some reason it plots the graphs twice.
Update: There are double graphs because some .wav files contain 2 channels and AudioTensor reads them separately. An easy fix is to average the tensor across dim=0 and pass that method to the pipeline.
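The averaging fix, sketched with NumPy (dim=0 is the channel axis; keepdims preserves the leading channel dimension the rest of the pipeline expects):

```python
import numpy as np

stereo = np.random.randn(2, 16000)         # (channels, samples)
mono = stereo.mean(axis=0, keepdims=True)  # average channels -> (1, samples)
print(mono.shape)  # (1, 16000)
```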

Here is the output with CropSignal and repeat padding

1 Like

You’d want to pad the signals

But CropSignal(X) crops the signal and then pads with zeros up to X milliseconds, right?
Am I completely missing something? Apologies in advance, as I’m super new to audio deep learning.

Update: After reading the source code for AudioTensor and CropSignal, I found out that the shapes of my spectrograms are different because each file has a different sampling rate. CropSignal is doing its job perfectly.
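A quick back-of-the-envelope check of why that happens: for a fixed clip duration, the number of spectrogram frames scales with the sample rate, so cropping to the same duration still gives different widths. The hop length of 512 here is an assumption for illustration:

```python
hop = 512  # assumed STFT hop length

for sr in (16000, 44100):
    samples = sr * 5             # a 5-second clip
    frames = samples // hop + 1  # approximate spectrogram width in frames
    print(sr, frames)
```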

Update 2: torchaudio.load does not allow us to change the sampling rate, or at least it’s not one of the parameters of the load method. My hack for that is below (I’ll try to write a multi-threaded version later):



mkdir temp
TMPDIR=temp

for fn in $(find . -name "*.wav"); do
  TMPFILE=$TMPDIR/$(basename "$fn")
  sox "$fn" "$TMPFILE" rate 16000
  mv "$TMPFILE" "$fn"
done

I am in. Willing to help and learn. I have some experience with fastai and fastai2, both using and contributing.

My personal area of interest is Text-To-Speech (TTS) models and dataset preparation.
I would like to contribute to the Fastai v2 audio development as well.

Is there another platform used by the developers? Like chat in Telegram?
Here is my nick in Telegram: @kdorichev. Please add/invite.

1 Like

I’m sorry for the delayed response Pranjal, I haven’t been working on fastai audio the past 2 weeks as I have been focusing my efforts on covid19 tracking/risk assessment tools.

- Nice work with mergeSignal; we have a similar built-in transform, DownmixToMono, that takes n channels and averages down to 1.
- CropSignal does indeed crop/pad to the desired length.
- Your issue is most likely due to varying sample rates in your dataset; try using Resample in your pipeline before CropSignal.

About the last part: we have on our todo list a plan to warn users about multiple sample rates, as this is a tricky problem to detect in the current version. Sorry we didn’t get it out sooner, and please let me know if that was indeed the issue.

1 Like

That was precisely the issue, and I wish I had read about DownmixToMono and Resample earlier. I did see Resample in the augment file but got confused by the name; I thought it was about re-sampling tensors in a loaded batch. I should have read the code.

I already mentioned the bash script I used, but to make things more Pythonic, I wrote the methods below as part of my pipeline. Adding them here in case someone needs the same:

import subprocess
from pathlib import Path


def convert_file(infile: Path, bitrate: int = 16000) -> None:
    """Convert one file to the sample rate provided in the arguments.

    This function does not return anything but raises an exception if the sox call fails.

    :param infile: Input file path to convert.
    :param bitrate: The desired sample rate.
    :raises: OSError, SubprocessError
    """
    infile = Path(infile)
    tempfile = infile.parent / '.temp.wav'

    # Call sox to convert the file.
    subprocess.run(['sox', str(infile), str(tempfile), 'rate', str(bitrate)],
                   check=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE)

    # Remove the tempfile and replace the infile with the converted audio.
    subprocess.run(['mv', str(tempfile), str(infile)],
                   check=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE)


def convert_all(input_dir: str, bitrate: int = 16000) -> None:
    """Convert all .wav files inside the folder (recursively) to the desired sample rate.

    :param input_dir: Input directory with wav files to convert.
    :param bitrate: The desired sample rate.
    """
    for infile in Path(input_dir).glob('**/*.wav'):
        convert_file(infile, bitrate)

I’m happy to hear that you were busy helping mitigate the covid19 problem :slight_smile:. I’ll finish a few things on my end and then jump in to contribute to this library as best I can. I must tell you that I only started working with audio data 4 days ago, when I found out about this library, so I’m limited in my ability to understand the details of the signal processing methods.

1 Like

>> 1st Fastai2 Audio Meetup <<
Sorry for the late notice, but in case you’re interested, come join the meeting of the developers of Fastai2 Audio.
The meeting agenda and the link to Google Meet are in the Calendar event: