Deep Learning with Audio Thread

@zachcaceres https://github.com/marii-moe/petfinder/blob/master/petfinder-fastai.ipynb

Petfinder GitHub repo showing how I loaded the data into a DataBunch.

1 Like

I’ve been working on audio classification, leveraging the excellent work from @zachcaceres
From this base, I have added the following:

  • Management of stereo files: they can either be converted to mono, or one spectrogram is created per channel and both channels are fed into the CNN (see the rough sketch after this list). The latter seems to increase model accuracy.
  • Adapting the fastai show_batch method to show the original sound, the processed sound, and all spectrograms. This is useful for seeing the impact of data augmentation methods (especially noise addition).
  • Creation of a new method, plot_audio_top_losses, inspired by plot_top_losses, to give an interpretation of what the model focuses on during predictions.
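
A rough sketch of the stereo handling idea (simplified, not the exact code in the repo; the function name and parameters here are just for illustration):

import torch
import torchaudio

def stereo_to_spectrograms(path, n_fft=1024, hop_length=256, to_mono=False):
    """Load an audio file and return either a 1-channel (mono) or
    2-channel (one spectrogram per stereo channel) tensor for a CNN."""
    signal, sr = torchaudio.load(path)               # shape: (channels, samples)
    if to_mono and signal.shape[0] > 1:
        signal = signal.mean(dim=0, keepdim=True)
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=n_fft, hop_length=hop_length)(signal)
    # spec shape: (channels, n_mels, time) -> fed directly as image channels
    return torch.log1p(spec)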

Finally, I’ve also compared performance on different datasets (AudioMNIST, SLR45, UrbanSound8k).
On UrbanSound8k, the performance is really close to the benchmark published at last year’s KDD conference (96.8% without cross-validation vs 97.5% with cross-validation).

I hope this will be useful.

You can check out this work here: https://github.com/cccwam/fastai_audio

6 Likes

It would be great if we could add cutout/occlusion-type augmentation; it could be super useful for things like auditory scene analysis or classification of multiple categories of sound present in a single sound file, etc.

@kodzaks Any chance you know http://www.ece.neu.edu/fac-ece/purnima/ ? I know her group has done some fun work around counting whales using deep learning.

2 Likes

Great! We’re creating a new folder with the conversion of all the main notebooks to Part 2/2019 - we’ll release it soon :wink:

1 Like

BTW: try running the AudioTransformManager notebook in the fastai audio root.

https://forums.fast.ai/t/tfmsmanager-tune-your-transforms/43741/4

1 Like

Hey, is the group still a thing? How can I participate? Thanks in advance.

1 Like

Hey Andrei, the telegram group is still a thing, but it’s not super active as we are defaulting to posting most things here to keep solutions searchable for all. You’re welcome to join us; there’s still some chat and people to ask questions of or share with when you don’t want to clog the thread. You can also just post here, as there are lots of active contributors. If you want to join the telegram group, just send me a PM here and I’ll add you. Cheers.

1 Like

Awesome! Please feel free to PR any new modules that you feel are well integrated into the overall approach.

We’ve pushed the conversion to Part2 2019 of AudioDataBlock :wink:

4 Likes

Sure, I’ll do it when I’ve got a moment.

1 Like

No, but it looks super interesting, thank you!

Just want to post this here for anyone working with TIMIT (I know the fastai-audio guys are using it a lot). From a phonetics perspective it’s pretty messy. In American English, there are 35-40 phonemes (base sounds). You can further subdivide them by distinguishing between stops/closures and other small differences, but often you wouldn’t want to. TIMIT uses 61 phonemes (including 3 varieties of silence), and distinguishes between very similar sounds that other linguists would group together (see Arpabet).

Unsurprisingly, in my work with phoneme recognition I found that my most common errors were usually between two sounds that could be considered the same. I made a dictionary that links the TIMIT symbols with the more common IPA ones. This reduced the classes from 61 to 44. I’m now training with the reduced label set and will report back if there’s a large increase in accuracy. For now, here is the dictionary in case anyone else finds it useful.

{'p': 'p',
 'ay': 'aɪ',
 'k': 'k',
 'uh': 'ʊ',
 's': 's',
 'zh': 'ʒ',
 'pau': 'silence',
 'ao': 'ɔ',
 'm': 'm',
 'er': 'ɝ',
 'oy': 'ɔɪ',
 'q': 'ʔ',
 'ey': 'eɪ',
 'eh': 'ɛ',
 'w': 'w',
 'pcl': 'p',
 'y': 'j',
 'tcl': 't',
 'ax-h': 'ə',
 'g': 'g',
 'f': 'f',
 'uw': 'u',
 'n': 'n',
 'r': 'ɹ',
 'hv': 'h',
 'axr': 'ɚ',
 'kcl': 'k',
 'jh': 'dʒ',
 'ow': 'oʊ',
 'iy': 'i',
 'hh': 'h',
 'ng': 'ŋ',
 'el': 'l',
 'dx': 'ɾ',
 'ah': 'ʌ',
 'b': 'b',
 'ux': 'u',
 'en': 'n',
 'dh': 'ð',
 'ih': 'ɪ',
 'eng': 'ŋ',
 'l': 'l',
 'epi': 'silence',
 'aa': 'ɑ',
 'th': 'θ',
 't': 't',
 'ix': 'ɪ',
 'nx': 'n',
 'h#': 'silence',
 'ae': 'æ',
 'd': 'd',
 'bcl': 'b',
 'ax': 'ə',
 'z': 'z',
 'dcl': 'd',
 'v': 'v',
 'sh': 'ʃ',
 'gcl': 'g',
 'ch': 'tʃ',
 'aw': 'aʊ',
 'em': 'm',
 'wh': 'w'}

and here is how it was generated.

vowel_maps = {
    'aa': 'ɑ', 'ae':'æ', 'ah':'ʌ', 'ao':'ɔ', 'aw':'aʊ', 'ax':'ə',
    'axr':'ɚ', 'ay':'aɪ', 'eh':'ɛ', 'er':'ɝ', 'ey':'eɪ', 'ih':'ɪ',
    'ix':'ɪ', 'iy':'i', 'ow':'oʊ', 'oy':'ɔɪ', 'uh':'ʊ', 'uw':'u', 'ux':'u',
}

# dx is the flap, like the tt in 'butter'; Arpabet says it translates to ɾ in IPA,
# but I'm not so sure
# nx is another one to be careful with; it translates to either ng or n, as in 'winner'
# wh is meant to be wh as in why/when/where, but most IPA transcriptions consider it a w
cons_maps = {
    'ch':'tʃ', 'dh':'ð', 'dx':'ɾ', 'el':'l', 'em':'m', 'en':'n', 'hh':'h',
    'jh':'dʒ', 'ng':'ŋ', 'nx':'n', 'q':'ʔ', 'r':'ɹ', 'sh':'ʃ', 'th':'θ',
    'wh':'w', 'y':'j', 'zh':'ʒ'
}

# these are maps that only TIMIT uses, not Arpabet
timit_specific_maps = {
    'ax-h':'ə', 'bcl':'b', 'dcl':'d', 'eng':'ŋ', 'gcl':'g', 'hv':'h', 'kcl':'k',
    'pcl':'p', 'tcl':'t', 'pau':'silence', 'epi':'silence', 'h#':'silence',
}

def get_timit_to_ipa_dict():
    # note that if using this code you will need to generate your own list of phonemes
    # as mine comes from a special directory I set up with folders for each phoneme
    timit_phonemes = [x.stem for x in path_phoneme.ls()]
    timit_to_ipa_dict = {k:k for k in timit_phonemes}
    for k,v in vowel_maps.items(): timit_to_ipa_dict[k] = v
    for k,v in cons_maps.items(): timit_to_ipa_dict[k] = v
    for k,v in timit_specific_maps.items(): timit_to_ipa_dict[k] = v
    return timit_to_ipa_dict
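
For example, applying the mapping when relabeling is just a dictionary lookup (the labels below are made up):

timit_to_ipa = get_timit_to_ipa_dict()

# collapse the 61 TIMIT symbols to the reduced IPA set,
# leaving anything unmapped untouched
labels = ['bcl', 'ix', 'dx', 's']                     # hypothetical TIMIT labels
reduced = [timit_to_ipa.get(l, l) for l in labels]    # -> ['b', 'ɪ', 'ɾ', 's']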
7 Likes

Also has anybody else working with TIMIT noticed that the word timestamp alignments sometimes overlap? For example

7470 11362 - she
11362 16000 - had
15420 17503 - your

How can ‘had’ end at sample 16000 when ‘your’ begins at 15420?

Also for some reason exp.nb_FastWidgets isn’t getting built for me. I’m trying to run through a few of your example notebooks but I can’t because it tries to import and fails. I rebuilt using install.sh and buildFastAiAudio.sh but it didn’t fix it. Any ideas?

Last post then I’ll stop spamming the thread :slight_smile:

What do you guys think the best approaches are for phoneme recognition with real world data? In TIMIT we just chop the phoneme out of a file, and the result is that we have all different lengths. I’ve seen @ste try to fix this by having 3 spectrograms of different resolutions and putting them together as a 3-channel image. But will this generalize to phoneme recognition on actual speech data where we don’t have data telling us where each phoneme starts and ends?

What is a better approach? Trying to design a model that can separate phonemes, and then passing the resulting splits to our recognition model? Or passing an overlapping/rolling window over the real data and then trying to classify which phoneme is occurring in the window? I’m going to go hunt for some papers on this topic now, but would love to hear other people’s thoughts. Thanks!

1 Like

I haven’t read or thought much about this :slight_smile: so take it with a grain of salt, but I guess it depends what the base problem is; I suspect that for speech recognition you’d often be better off taking a seq2seq approach rather than a discrete classifier approach. I know that CTC has been used successfully in this domain - see this Distill article, and this Stanford course material.
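
For what it’s worth, here is a toy sketch of the PyTorch CTC loss API, just to show the key idea that the target phoneme sequence can be shorter than the input frame sequence (shapes are arbitrary):

import torch
import torch.nn as nn

# toy shapes: T input frames, N batch items, C classes (44 phonemes + blank), S max target length
T, N, C, S = 50, 4, 45, 12
ctc_loss = nn.CTCLoss(blank=0)

log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # per-frame class log-probs
targets = torch.randint(1, C, (N, S), dtype=torch.long)                  # phoneme id sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()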

If you’re trying specifically to do phoneme identification/classification - say for doing automated transcription for phonemic analysis of a language from field recordings - I’m also not sure. The naive way that I’ve conceived of it for speaker diarisation (similar problem - “which segments of this signal have which class”) is the same idea you had. Train the classifier on the smallest possible effective segment size (say, 2ms), then chop the target inference sequence into overlapping segments of that size, classify each segment remembering their position, and take “argmax” of contiguous regions of the input signal. This feels inefficient and brute-force, but I feel like it could be the “first dumb thing” to try.
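
In code, that brute-force idea might look something like the sketch below; classify, segment_len and hop are all placeholders:

def sliding_window_classify(spec, classify, segment_len=20, hop=5):
    """Slide a fixed-size window over a (freq, time) spectrogram, classify
    each window, and return (start_frame, predicted_class) pairs.
    `classify` is assumed to be any model taking a (freq, segment_len) patch."""
    preds = []
    for start in range(0, spec.shape[1] - segment_len + 1, hop):
        window = spec[:, start:start + segment_len]
        preds.append((start, classify(window)))
    return preds

def merge_contiguous(preds):
    """Collapse runs of identical predictions into (start, class) segments -
    the 'argmax of contiguous regions' step."""
    merged = []
    for start, label in preds:
        if not merged or merged[-1][1] != label:
            merged.append((start, label))
    return merged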

I don’t know how it’s done in industry/research - there’s probably an accepted method.

Edit to add: In the tiny amount of speaker diarisation literature that I’ve read, the accepted method seems to be basically your second idea :slight_smile: First run a “speaker change detection” model over the sequence, then classify each segment that model discovers. They don’t seem to get very good results this way, though, and I suspect that it would be much harder to find an effective “phoneme change detection” model than for different voices.

I’ve also thought about trying unsupervised methods (using some kind of clustering) to come up with the segments to classify, but ultimately you still have the problem of trying to discretise a continuous signal; hence why seq2seq approaches feel like the right way to go. Sadly I don’t know anything about them yet :slight_smile:

2 Likes

Last night I found out about a method of processing audio using Non-negative Matrix Factorisation aka NMF. It’s usually applied for audio separation - i.e. trying to split a signal into different components, whether they’re instruments (guitars vs. drums), or speakers vs. background noise. I thought it might be helpful for the problem you’re working on, @kodzaks? Perhaps you could find an NMF transformation that was reliable for separating manatee sounds from the ambient noise?

It seems like it’s typically used as a form of “classification” directly, or perhaps as a pre-processing step before further classification. I don’t know how useful it could be as a data augmentation, but it’s possible it could be; I’d guess that e.g. different speakers’ voices would decompose differently, so if you applied the same transform to many speakers you’d get differing signals in the output. However, intuitively, I kind of feel like a deep CNN would be just as likely to pull out these aspects… but I don’t understand it well enough to be sure. If nothing else, it would output a high-dim tensor which would be nicely stackable. I might experiment with it!
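
If anyone wants to experiment, librosa has a convenient entry point; this is just a toy sketch (the file name and the choice of 8 components are arbitrary):

import numpy as np
import librosa

# 'recording.wav' is just a placeholder file name
y, sr = librosa.load('recording.wav', sr=None)
D = librosa.stft(y)
S = np.abs(D)                                          # magnitude spectrogram

# NMF factorisation: S ≈ components @ activations
# components: (freq_bins, n_components) spectral templates
# activations: (n_components, frames) how strongly each template is active over time
components, activations = librosa.decompose.decompose(S, n_components=8, sort=True)

# crude resynthesis of a single component, reusing the original phase
S_hat = np.outer(components[:, 0], activations[0])
y_hat = librosa.istft(S_hat * np.exp(1j * np.angle(D)))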

A few resources:

4 Likes

I was looking for a way to improve performance on phoneme and word recognition, so I borrowed the idea from @oguiza’s time series approach, trying to use multiple representations of the same original signal.
I’ve tried a couple of different multi-channel approaches:

  1. Multi-resolution: the first one I tried; very focused on the fact that the vast majority of things to classify are ‘small chunks’.
  • PROS: improves my accuracy.
  • CONS: pretty slow and “repetitive” (feeds the beginning of the signal multiple times at multiple resolutions).
  2. Sliding window: a better and more generic approach (see the sketch below).
  • PROS: slightly better performance than the previous one. Faster (you compute the spectrogram once for the whole signal and then sample it in multiple overlapping windows). Easy to adapt to multiple situations (i.e. the last example of my notebook on AudioTfmsManager - there is a parametric multi-spectrogram transform that spits out 8 channels).
  • CONS: you need to adapt the initial layer of your network according to the number of samples you pass.
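
To make the sliding window idea concrete, here is a simplified sketch (not the exact transform from the AudioTfmsManager notebook; the function name and defaults are made up):

import torch
import torchaudio

def multi_window_spectrogram(signal, sr, n_channels=8, win_frames=64):
    """Compute one spectrogram for the whole signal, then take n_channels
    overlapping windows of it and stack them as image channels.
    Assumes `signal` is a (1, n_samples) tensor long enough for win_frames."""
    spec = torchaudio.transforms.MelSpectrogram(sample_rate=sr)(signal)  # (1, n_mels, T)
    spec = spec.squeeze(0)                                               # (n_mels, T)
    T = spec.shape[-1]
    hop = max(1, (T - win_frames) // max(1, n_channels - 1))
    windows = [spec[:, i * hop:i * hop + win_frames] for i in range(n_channels)]
    return torch.stack(windows)          # (n_channels, n_mels, win_frames)

The first convolutional layer of the network then needs in_channels=n_channels instead of the usual 3.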

BTW: I’ve tried to use the sliding window approach on @ThomM’s classification notebook (“LoadingAndClassification.ipynb”) but didn’t see any improvements…

1 Like

Hey, thanks for the awesome reply and resources. I’m most interested in helping language students with their pronunciation. I taught in a rural area of Brazil for 3 months and spent a lot of time doing one on one pronunciation work with students, but there’s only one of me and millions of students, plus all the people learning Spanish, Mandarin, French…etc.

So I guess my problem would be checking the accuracy of pronunciation for short, pre-known prompts (ranging from one word to a complex sentence). It’s a much simpler problem, as I know what they are trying to say, and I just have to identify the parts where they are not producing the correct phoneme. There are some other complexities like handling false starts, but it seems doable.

I’m still not sure how to approach it. At every phone, the problem is now reduced from “which phone is this” to “is this the phoneme I’m expecting at this point”. But if you take that binary approach, do you then have to have ~40 models for ‘b’/‘not b’, ‘ə’/‘not ə’ and so on? How can we consolidate that into one model where we pass it the expected phoneme and it returns a boolean?
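
Just to make the question concrete, the kind of consolidated model I have in mind would look something like this (purely hypothetical, I haven’t trained anything like it): embed the expected phoneme, concatenate it with the audio features, and put a binary head on top.

import torch
import torch.nn as nn

class ExpectedPhonemeChecker(nn.Module):
    """Takes an audio feature vector plus the id of the phoneme we *expect*,
    and outputs the probability that the audio actually contains it."""
    def __init__(self, n_phonemes=44, audio_dim=512, emb_dim=32):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.head = nn.Sequential(
            nn.Linear(audio_dim + emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, audio_features, expected_phoneme):
        # audio_features: (batch, audio_dim), e.g. from a CNN over the spectrogram
        # expected_phoneme: (batch,) long tensor of phoneme ids
        x = torch.cat([audio_features, self.phoneme_emb(expected_phoneme)], dim=1)
        return torch.sigmoid(self.head(x)).squeeze(1)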

I’m just starting to dig into the research and industry standards for this type of stuff, I’ll report back what I find. So far for the problem of uninformed phoneme recognition on TIMIT, there was a paper that got published in 2018 with 71% accuracy using a CNN, and it referenced several papers as SOTA.

  • 82.3% Bidirectional LSTMs (Graves, Mohamed and Hinton, 2013)
  • 80.9% DNN with stochastic depth (Chen, 2016)
  • 82.7% RNN-CNN hybrid based on MFCC features (Zhang, 2016)
2 Likes