Deep Learning with Audio Thread

Just an idea: maybe try some augmentation with noise? When the signal is buried in noise it could be a problem.

2 Likes

Did you check out the notebook? There are white-noise and similar augmentation functions in there.

Hi all,

Our experiments with a nascent Fastai Audio continue. We reorganized our notebooks and exported relevant code cells. With these new notebooks and some install scripts, FastAI Audio is starting to feel more like a real module (thanks to @ThomM @ste @simonjhb and @amir!)

Improvements

  1. Data augmentation support using the datablock API. You can call get_audio_transforms and configure a suite of augmentations.

  2. AudioDataBunch now handles spectrogram and signal output for (hopefully) maximum learner reusability. This was probably the most interesting change: finding a convenient way to handle both signal and spectrogram audio formats. You can pass spectro=True to your get_audio_transforms call and get spectrograms with data augmentation. Examples in notebooks.

  3. We settled on an atomic unit AudioData, which contains a sig and sr (sample rate). This vastly simplified our I/O flow when working with the audio classes. It also means we don’t have to remember to pass the proper sample rate all over the place, because it’s embedded with the audio signal. (A rough usage sketch follows this list.)
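
A rough usage sketch of how these pieces fit together (the path is made up, and apart from spectro=True the get_audio_transforms arguments are assumptions; see the notebooks for the real API):

import torchaudio

# Illustrative sketch only, not the module's exact API.
sig, sr = torchaudio.load('data/speaker_clip.wav')   # hypothetical file
audio = AudioData(sig, sr=sr)                        # atomic unit: signal and sample rate travel together
tfms = get_audio_transforms(spectro=True)            # signal-level augmentation, spectrogram output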

Experiments

Our initial experiment was speaker classification (10 classes) with 3-6 second clips of people reading. We got 84% accuracy using a standard CNN with the first layer replaced to accept 1-channel input.

Two interesting findings trying to beat this benchmark:

  • We had poor results using too much data augmentation. We first fed the signal through every augmentation we had. The signal seems to have been reduced to noise by the end and we couldn’t get above 40% accuracy.
  • Padding matters a lot. We chose to use a fixed value to pad our signals to the same size (5 seconds at 16000 sample rate). This was arbitrary. But we also tried 1 and 3 second values. The model improved unambiguously as we approached the maximum length of our clips. This suggests, at least for this use case, that we should probably pad to the maximum length in our clip set, perhaps barring an extreme outlier.

With our mx_to_pad at 80000 and using a single data augmentation, white noise, we beat our benchmark by nearly 10%, ending up around 92-93% accuracy. Data augmentation works! :slight_smile:
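
For reference, a minimal sketch of the fixed-length padding described above (assuming a 1-D signal tensor; this is not the module's exact code):

import torch.nn.functional as F

def pad_to_max(sig, max_to_pad=80000):                   # 5 s at a 16000 sample rate
    sig = sig.reshape(-1)[:max_to_pad]                   # truncate anything longer
    return F.pad(sig, (0, max_to_pad - sig.shape[0]))    # zero-pad anything shorter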

Things that you could help with:

  • We still need to improve how the signal and spectrogram are represented (think __str__ and __repr__), as we’re still breaking things when data is printed at various phases.
  • We have no data augmentation on spectrograms, only for the signal. What does this even look like?
  • There are many more things we could do to augment the signal.
  • We have not explored other spectrogram formats, only melspectrograms.
  • We have not yet tried RNNs or other models on the signal.
  • Performance with a lot of transforms is not great, even though FastAI runs them in parallel. My guess is that too much is being done in Python (torchaudio delegates down to librosa for example). But we don’t want to write files to disk if possible, which is what other tools like sox and ffmpeg seem to require.
12 Likes

Amazing work. Have you spoken with Jeremy at all as he’s mentioned they are also working on a fastai audio module? Maybe you guys could share info/success/pitfalls.

Is there any benefit to spectrogram augmentation? You never see spectrograms from another angle, or slightly shifted, and it seems like all possible changes would be better made in the time domain. The only reason I can think of to augment spectrograms is to be faster than augmenting signal and then generating the spectrogram after.

I don’t know what your time benchmarks are like or whether it’d be worth it. Is there any way you could take something like pitch shifting, which works in the time domain, and rewrite it to work in the frequency domain so it can be applied directly to spectrograms?

2 Likes

Thanks! Yes, we’re in contact with Jeremy since we’re all doing FastAI on site in SF.

Definitely makes sense. I agree, but I figured why not stay open to the possibility that there’s some opportunity for the image?

Maybe so, I don’t know enough about signal processing to have a definitive answer. Intuitively we probably wouldn’t want to operate on the spectro in that way because of the lossy (binned) nature of how the frequency bands are constructed. I don’t know.

It may not matter. And if the performance is really bad we could operate on the signal if we can find a lower-level tool that doesn’t only work with file-in-file-out. Maybe there’s some way to use pybind11 to pass the signal arrays to a fast C++ library for audio processing.

3 Likes

Thanks for the interesting heads-up. I have wondered a lot about how to handle translation invariance for my project. I heard that VGG has less of a problem with this than ResNets. I am working on EEG spectrum analysis; it is just like audio waves but in a lower frequency range (2-80 Hz).

I’ve got a few insights on how things can be improved with spectrograms. EEG signals are extremely noisy and extremely faint. The signal is usually so buried in the background noise that it is hard to notice even by eye, but slight variations in color mean a lot to a CNN :slight_smile:

For instance, one of the most powerful methods to filter out signals is baseline correction. The simplest scenario is to decide that the first second of every recording in your dataset, in the frequency domain (i.e., the spectrum), is the baseline. You then subtract from the whole spectrum of each wave file the amplitudes of that file’s average baseline. This gives you a much cleaner signal. It is effectively an adaptive filter in the time domain: it removes the noise, even the noise that falls within the frequency range of the signal itself. It becomes a very difficult research problem when we are not allowed to use baseline correction (like filtering out the mother’s ECG and picking up only the fetal ECG, which is much fainter than the mother’s). Good systems can give you a very clean fetal ECG without the mother’s ECG.
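
If it helps, here is a minimal sketch of that idea applied to an audio clip with librosa (the 1-second baseline window and the default hop length of 512 are assumptions to adjust for your data):

import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=None)
S = librosa.feature.melspectrogram(y=y, sr=sr)              # shape (n_mels, n_frames)

n_baseline = int(1.0 * sr / 512)                            # ~1 s of frames at hop_length=512
baseline = S[:, :n_baseline].mean(axis=1, keepdims=True)    # average spectrum of the first second
S_clean = np.clip(S - baseline, 0, None)                    # subtract per-frequency baseline, floor at zero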

Think about active noise-cancelling headphones. They remove (or try their best to reduce) the amplitude of outside speech without affecting the audio playing in your headphones. Filtering out such noise cannot be done with a simple FIR filter (like an equalizer in your Hi-Fi stereo).

@kodzaks
You can get a lot from such adaptive filtering. And do not worry about resizing the whole 10-second rectangle into a square. It does not matter for the NN unless there are features that happen within no more than 100 ms. Increasing the resolution will help too, if you are concerned that the squish is making certain sounds disappear. For you, the squish is just like playing YouTube at 2x speed (well, ~5x in your case). But it should be fine: if your NN baby finds himself in a world where everybody speaks at 5x speed, he will learn the language at that speed.

I love audio and signal processing… It is something I have been excited about for 20-30 years…

Also an important tip for augmentation

I am okay with the Telegram group… But I feel sorry that our discussion and insights will be buried in closed social media… It would be much better if we kept our discussion only here. Google indexes this forum, so solutions will appear in other folks’ searches and we can help them that way. Other fastai folks will see your chit-chat and may chip in with an idea, and even Jeremy sometimes chimes in; you lose all of that in social media like Telegram or Slack.

I have worked with the MNE EEG-processing Python library, and it provides several ways of doing baseline correction in the time and frequency domains. I haven’t checked whether this is a thing in librosa, but even if it isn’t, it is easy to replicate the idea on your audio data. You could even use the MNE library for your audio, since it is quite flexible; just change the frequency min and max to what you want… But I highly encourage you to stick with audio libraries, because diving into MNE requires understanding many EEG-specific concepts that are completely irrelevant to you (like epochs, channel locations, event-related potentials, etc.).

Notice how this baseline period between -0.5 s and 0 s is almost devoid of any signal:

Source

Actually this means that any continuous background noise that appears both in the baseline and afterwards will disappear. Think of speaking while heavy machinery is running continuously in the background: capture its frequency spectrum in a 1-second period before you speak, subtract the amplitude of each frequency of this baseline from your speech segment, and voilà, your speech is free of that background noise.

11 Likes

Going to try out your notebooks today and take another crack at the tensorflow speech comp using data augmentation. I’d like to make a notebook for the competition that serves as an intro to audio for people who are interested. Would it be okay to share your work there (with credit of course)? I’ll do benchmarks and see how it performs in the wild, ideally isolating the effects of the various transforms to give you guys better feedback on what types of augmentation are the most promising. It should also give me an idea of what, if anything, I can contribute to your project from your “things you could help with” section. I’m still very new (picked python back up 8 months ago after 8 years off, my only ML experience is Andrew Ng’s intro course and fastai (started 9 weeks ago)).

Edit: Quick question, is the best way to do this to clone your repo and stick it inside my fastai/fastai folder? @ThomM @zachcaceres. Thanks

1 Like

Wow I never considered this, very cool suggestion. I look forward to playing around with it.

Yea Jeremy pointed this out to us and I agree. Right now the telegram chat is more chitchat than anything, and isn’t super active. If we post anything that could be of use, I try to repost it here, but I agree it’s not the most effective method. If it feels like stuff is slipping through the cracks and not making it here, we will just move the chat here, but I don’t want every random piece of chat to be indexed for eternity :smiley:

1 Like

Something possibly worth cross-posting from the Telegram thread: I altered my code above to include spectrograms, which makes it much more helpful in my opinion. It grabs one random file from my test set, makes a prediction, shows the class percentages and the spectrogram, and lets you play the audio file. Really helpful for figuring out what your model is getting wrong.

import random
import librosa
from IPython.display import Audio, display
from fastai.vision import *  # fastai v1: provides open_image, show_image, etc.

# Assumes test_files_list, num_files, path_test_audio, path_test_spectrogram,
# learn and data are already defined earlier in the notebook.
def display_audio_prediction():
    # pick a random test file and load its raw audio
    rand_file = test_files_list[random.randint(0, num_files - 1)]
    clip, sr = librosa.load(path_test_audio/rand_file, sr=None)
    # open the pre-generated spectrogram image and run the learner on it
    img_filename = rand_file + ".png"
    image = open_image(path_test_spectrogram/img_filename)
    pred = learn.predict(image)
    show_image(image)
    print(f"Prediction: {pred[0]}")
    # print every class predicted with more than 10% probability
    for idx, pct in enumerate(pred[2]):
        if pct.item() > 0.1:
            print(f"{data.classes[idx]}: {round(pct.item() * 100, 2)}%")
    # embed an audio player so you can listen to the clip
    display(Audio(clip, rate=sr))

2 Likes

Totally agree with both playing the sound and showing the spectrogram!
Tuning the spectrogram is as important as data augmentation, because it is actually what the model sees.

Our chain of transformations is divided into two groups:

  1. augment the audio signal (volume, noise…)
  2. transform the augmented audio signal into an image (i.e. a mel spectrogram)

So the ideal thing to show is both the result of (1) and (2).
The problem is that in our implementation all the transformations are applied together, since the result of (1) is not used by the model directly; it would only be useful to “hear” the impact of the audio signal augmentation.
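
To make the split concrete, here is a hedged sketch of the two stages (the function names and parameters are illustrative, not our actual implementation):

import numpy as np
import torch
import librosa

def add_white_noise(sig, noise_level=0.005):
    # stage 1: augment the raw signal
    return sig + noise_level * torch.randn_like(sig)

def to_mel_image(sig, sr=16000):
    # stage 2: turn the (augmented) signal into a mel spectrogram "image" for the model
    S = librosa.feature.melspectrogram(y=sig.numpy(), sr=sr)
    return librosa.power_to_db(S, ref=np.max)

Keeping the intermediate result of stage 1 around (or recomputing it on demand) is what would let you listen to the augmented signal before it becomes an image.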

Looks great @MadeUpMasters, I’ve been wanting something similar but as @ste says the implementation as we’ve done it makes it a little tricky. It’s definitely something to aim for though! As for getting started, yep, just clone the repo. In fact I’d suggest forking it first, that way it’ll be easier to submit a PR if you want to contribute. Technically it doesn’t need to be in the fastai folder, it’s standalone. Take it for a spin and let us know what you think.

I’ve been playing with my “speaker eliminator” project and got my model to 99.3% accuracy yesterday. Having proven that it’s possible to distinguish the speakers, now comes the hard part - working out how to apply this over a sequence.

In general I’m not sure of the ideal way to model the speaker diarisation problem. It seems most approaches look for “speaker change events”, which makes sense. There are a million considerations with every approach though (how long a clip do you depend on? What’s the smallest amount of time to classify? What do you do with overlapping speech? Non-speech sound? Do you preprocess the whole clip first or have it as part of the pipeline? Etc etc). I’m trying to work out what the simplest, dumbest solution is.

Yesterday I took 10sec clips of labelled data to train the model. I’m thinking today I might split the whole episode into 10sec clips with a 1sec overlap, and classify each clip. Clips that get high confidence of a single speaker can then be reassembled. This is a very inefficient approach but I think it’s the simplest to get started with.
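
A minimal sketch of that overlapping split, in case it is useful (assuming the signal is a 1-D array; the clip and overlap lengths are just the values mentioned above):

def overlapping_clips(sig, sr, clip_s=10, overlap_s=1):
    clip_len = clip_s * sr
    step = (clip_s - overlap_s) * sr
    # slide a clip_len window along the signal, advancing by step samples each time
    return [sig[start:start + clip_len]
            for start in range(0, max(len(sig) - clip_len, 0) + 1, step)]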

I’m still interested in the concept of treating a whole audio file’s spectrogram as an image and treating it as an image masking problem. If we can classify parts of a photo as building, road, cyclist etc., then why couldn’t we classify parts of a spectrogram as speaker A, speaker B etc…? Something to experiment with.

2 Likes

This work is super exciting. Where are you seeing their notebooks? I would love to try some out as well!

Linked in the original post under fastai specific

1 Like

I’m going to put my thoughts/suggestions for the fastai notebook here and I’ll come back and edit as I go.

Dataloading.ipynb

  • First off, for anyone else following the notebooks, dataloading.ipynb is the one you will want to start with; it helps you install any libraries/dependencies.
  • The cell with %% bash and ./install.sh didn’t work for me; I installed manually in the terminal. I also needed to run apt-get update first to get it to work properly.
  • untar_data hung for me, but it seems to have done its job (passed the sanity check). This could be on my end; I’m having a lot of issues with Gradient lately.
  • The data augmentation section doesn’t

AWD-LSTM.ipynb

  • When I call slices = get_audio_slices('/notebooks/storage/ST-AEDS-20180100_1-OS/f0004_us_f0004_00446.wav', slice_length), pandas can’t handle an iterator (which is what zip returns in Python 3); wrapping it in list() fixes it (sketch of the fix below the traceback).
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-ae88dcbddf7e> in <module>
----> 1 slices = get_audio_slices('/notebooks/storage/ST-AEDS-20180100_1-OS/f0004_us_f0004_00446.wav', slice_length)
      2 slices.head()

<ipython-input-20-ca416f61678e> in get_audio_slices(file_p, slice_len)
      2     sig,sr = torchaudio.load(file_p)
      3     spl = split_arr(sig, slice_len)
----> 4     return pd.DataFrame(zip(spl, [sr] * len(spl)), columns=['sig', 'sr'])

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    405                 mgr = self._init_dict({}, index, columns, dtype=dtype)
    406         elif isinstance(data, collections.Iterator):
--> 407             raise TypeError("data argument can't be an iterator")
    408         else:
    409             try:

TypeError: data argument can't be an iterator
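
A sketch of the fix, keeping the notebook's split_arr helper and imports as they are:

def get_audio_slices(file_p, slice_len):
    sig, sr = torchaudio.load(file_p)
    spl = split_arr(sig, slice_len)
    # zip() returns an iterator in Python 3; pandas needs a concrete sequence
    return pd.DataFrame(list(zip(spl, [sr] * len(spl))), columns=['sig', 'sr'])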

example_word_and_phoneme_classification.ipynb

  • In test_extractItems_PHN (change to test_extract_items_PHN in my opinion), I have a failing assert, assert 54==len(ret_df); the length I’m getting is 41.
  • A few lines down there is another failing assert (I’m getting 80 rows):
def test_createItemsDataFrame():
    tdf = createItemsDataFrame(path, all_files_df.head(10), 'WRD')
    print(f'Rows: {len(tdf)}')
    tdf.info()
    assert 86==len(tdf)
  • For this line %time all_words_df = createItemsDataFrame(path, all_files_df, 'WRD'), maybe add a comment with how long it is expected to take (took 2 min 42 seconds for me)
  • Typo here src['LENGHT'] = src.SIG.apply(lambda x: x.shape[0]) should be LENGTH
  • This notebook calls tfm_spectro (nb_DataAugmentation.py) with extra arguments (ws, n_fft, to_db_scale), which causes an error.
  • Why are the spectrograms so small? Have you experimented with other sizes?

AudioCommon.ipynb

class AudioData:
    '''Struct that holds basic information from an audio signal'''
    def __init__(self, sig, sr=16000): 
        self.sig = sig.reshape(-1) # We want single dimension data
        self.sr = sr
  • Maybe test that it’s 1D data and raise a ValueError if it isn’t. It’d be a really tough bug to track down if someone mistakenly fed in non-1D data and it was flattened without telling them (rough sketch after this list).
  • raise f'File not fund: {fileName}' - Typo, fund should be found
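
A rough sketch of that check (assuming sig is a torch tensor, as elsewhere in the notebooks):

class AudioData:
    '''Struct that holds basic information from an audio signal.'''
    def __init__(self, sig, sr=16000):
        # suggested guard: refuse genuinely multi-channel input instead of silently flattening it
        if sig.dim() > 1 and min(sig.shape) > 1:
            raise ValueError(f"Expected a 1-D (mono) signal, got shape {tuple(sig.shape)}")
        self.sig = sig.reshape(-1)
        self.sr = sr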

It looks awesome by the way, I’m really excited to plug this up with my tensorflow speech challenge and see how it goes. I’ll keep reporting back.

Edit: Was working on something else today as well as establishing a baseline model to compare your fastai audio against so I didn’t get too far, but I’ll be working on it tomorrow (Wednesday)

2 Likes

I’ve updated the DataAugmentation notebook, adding some tweaks and examples to the “spectrogram transformation” part.

Here is the code:

As soon as I finish the training I’m going to publish, in the same repo, a notebook with a complete example on the TIMIT dataset.

1 Like

I’ve committed a new notebook with a complete example that shows how to use fastai to classify words from an audio signal.

NB: the notebook includes a section dedicated to spectrogram fine-tuning to help you tune it.

With a “tuned” spectrogram I’ve been able to increase the performance of the model by 5-10%.

4 Likes

Awesome work @ste, @zachcaceres. Just ran through everything and it’s working well for me. In fact, the accuracy I’m getting is much higher than what you’ve reported in the notebooks!

  1. I’m getting some errors due to the way you are raising exceptions. I’d be happy to create a PR to change these.

  2. I really like how you’ve created and tested the classes in notebooks, but wouldn’t it be easier to keep the actual package code separate and use ?? to see the source? You wouldn’t need to run the buildFastAiAudio.sh script then either. It would also make creating a pull request to change the basic classes such as AudioData and AudioItem a lot easier; merging notebooks can be a nightmare in my experience, since the outputs are saved. PR

  3. Would it be possible to have the TIMIT dataset downloaded as is done in the AWD-LSTM notebook?

  4. Do you think it would be better to have a small set of wav files tracked in git, for getting people up and running faster with the notebooks?

  5. What do you intend to do with show_batch? Have spectrograms shown with the audio underneath?

  6. Currently you have to add the custom layer manually on top of the cnn learner. A custom learner might be good here?

1 Like

I noticed that the transform to spectrogram wasn’t expanding the channel dimension to 3, as is done by the library for the MNIST dataset. I’ve added that to my fork of the project. You were replacing the first Conv2d layer, which would lose all the pretrained learning? I may be wrong about this.
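
For anyone curious, a minimal sketch of that channel expansion (the shapes are illustrative):

import torch

spec = torch.randn(1, 128, 256)    # (channels, n_mels, time): a fake 1-channel spectrogram tensor
spec_3ch = spec.expand(3, -1, -1)  # replicate the single channel across 3, as a pretrained CNN expects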

After making these tweaks, I checked how well this approach performed on the Free ST American English Corpus dataset (10 classes of male and female speakers) and got these results:


98.3% accuracy

Here is the Notebook. It is derived from your AWD-LSTM notebook.

3 Likes

Great job @baz!!

Actually we don’t completely “lose” the first layer: we’ve copied the “red” channel’s weights:

...
    # save the original pretrained weights and keep only the "red" channel
    original_weights = src_model[0][0].weight.clone()
    new_weights = original_weights[:, 0:1, :, :]

    # create the new first layer (nChannels inputs) and load the copied weights
    new_layer = nn.Conv2d(nChannels, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    new_layer.weight = nn.Parameter(new_weights)
...

I’m working on a “multi spectrogram example” that shows how to create multiple images for the same sample.

I’ll commit as I finish :wink:

1 Like

I think there is too much silence in these samples.

A simple little hack would be to slice up the audio based on silence.

I’ve created a notebook describing how to do this.

I’ll try to create a method that is more configurable.
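
In the meantime, a minimal sketch of that hack using librosa.effects.split (the top_db threshold is illustrative and will need tuning per dataset):

import librosa

y, sr = librosa.load("clip.wav", sr=None)
intervals = librosa.effects.split(y, top_db=30)      # (start, end) sample indices of non-silent regions
chunks = [y[start:end] for start, end in intervals]  # keep the audio, drop the silent gaps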