Deep Learning with Audio Thread

Awesome! Please feel free to PR any new modules that you feel are well integrated into the overall approach.

We’ve pushed the conversion of AudioDataBlock to Part 2 2019 :wink:

4 Likes

Sure I will do it when I’ve got a moment.

1 Like

No, but it looks super interesting, thank you!

Just want to post this here for anyone working with TIMIT (I know the fastai-audio guys are using it a lot). From a phonetics perspective it’s pretty messy. American English has roughly 35-40 phonemes (base sounds). You can subdivide them further by distinguishing between stops/closures and other small differences, but often you wouldn’t want to. TIMIT uses 61 phonemes (including 3 varieties of silence) and distinguishes between very similar sounds that other linguists would group together (see: Arpabet).

Unsurprisingly, in my work with phoneme recognition I found that my most common errors were usually between two sounds that could be considered the same. I made a dictionary that links the TIMIT symbols to the more common IPA ones, which reduces the classes from 61 to 44. I’m now training with the reduced label set and will report back if there’s a large increase in accuracy. For now, here is the dictionary in case anyone else finds it useful.

{'p': 'p',
 'ay': 'aɪ',
 'k': 'k',
 'uh': 'ʊ',
 's': 's',
 'zh': 'ʒ',
 'pau': 'silence',
 'ao': 'ɔ',
 'm': 'm',
 'er': 'ɝ',
 'oy': 'ɔɪ',
 'q': 'ʔ',
 'ey': 'eɪ',
 'eh': 'ɛ',
 'w': 'w',
 'pcl': 'p',
 'y': 'j',
 'tcl': 't',
 'ax-h': 'ə',
 'g': 'g',
 'f': 'f',
 'uw': 'u',
 'n': 'n',
 'r': 'ɹ',
 'hv': 'h',
 'axr': 'ɚ',
 'kcl': 'k',
 'jh': 'dʒ',
 'ow': 'oʊ',
 'iy': 'i',
 'hh': 'h',
 'ng': 'ŋ',
 'el': 'l',
 'dx': 'ɾ',
 'ah': 'ʌ',
 'b': 'b',
 'ux': 'u',
 'en': 'n',
 'dh': 'ð',
 'ih': 'ɪ',
 'eng': 'ŋ',
 'l': 'l',
 'epi': 'silence',
 'aa': 'ɑ',
 'th': 'θ',
 't': 't',
 'ix': 'ɪ',
 'nx': 'n',
 'h#': 'silence',
 'ae': 'æ',
 'd': 'd',
 'bcl': 'b',
 'ax': 'ə',
 'z': 'z',
 'dcl': 'd',
 'v': 'v',
 'sh': 'ʃ',
 'gcl': 'g',
 'ch': 'tʃ',
 'aw': 'aʊ',
 'em': 'm',
 'wh': 'w'}

and here is how it was generated.

vowel_maps = {
    'aa': 'ɑ', 'ae':'æ', 'ah':'ʌ', 'ao':'ɔ', 'aw':'aʊ', 'ax':'ə',
    'axr':'ɚ', 'ay':'aɪ', 'eh':'ɛ', 'er':'ɝ', 'ey':'eɪ', 'ih':'ɪ',
    'ix':'ɪ', 'iy':'i', 'ow':'oʊ', 'oy':'ɔɪ', 'uh':'ʊ', 'uw':'u', 'ux':'u',
}

# dx is the flap, like the tt in "butter"; Arpabet says it translates to ɾ in IPA,
# but I'm not so sure
# nx is another one to be careful with: it translates to either ng or the n in "winner"
# wh is meant to be the wh in why/when/where, but most IPA transcriptions treat it as a plain w
cons_maps = {
    'ch':'tʃ', 'dh':'ð', 'dx':'ɾ', 'el':'l', 'em':'m', 'en':'n', 'hh':'h',
    'jh':'dʒ', 'ng':'ŋ', 'nx':'n', 'q':'ʔ', 'r':'ɹ', 'sh':'ʃ', 'th':'θ',
    'wh':'w', 'y':'j', 'zh':'ʒ'
}

# these are mappings that only TIMIT uses, not Arpabet
timit_specific_maps = {
    'ax-h':'ə', 'bcl':'b', 'dcl':'d', 'eng':'ŋ', 'gcl':'g', 'hv':'h', 'kcl':'k',
    'pcl':'p', 'tcl':'t', 'pau':'silence', 'epi':'silence', 'h#':'silence',
}

def get_timit_to_ipa_dict():
    # note that if using this code you will need to generate your own list of phonemes
    # as mine comes from a special directory I set up with folders for each phoneme
    timit_phonemes = [x.stem for x in path_phoneme.ls()]
    timit_to_ipa_dict = {k:k for k in timit_phonemes}
    for k,v in vowel_maps.items(): timit_to_ipa_dict[k] = v
    for k,v in cons_maps.items(): timit_to_ipa_dict[k] = v
    for k,v in timit_specific_maps.items(): timit_to_ipa_dict[k] = v
    return timit_to_ipa_dict
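
For example (assuming the full dictionary above has been assigned to a name like timit_to_ipa), remapping a handful of TIMIT labels collapses the closures into their stops and merges the silence varieties:

# assuming `timit_to_ipa` holds the dictionary above (or the output of get_timit_to_ipa_dict())
labels = ['h#', 'bcl', 'b', 'ix', 'tcl', 't', 'pau']
reduced = [timit_to_ipa[l] for l in labels]
print(reduced)   # ['silence', 'b', 'b', 'ɪ', 't', 't', 'silence']
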
7 Likes

Also has anybody else working with TIMIT noticed that the word timestamp alignments sometimes overlap? For example

7470 11362 - she
11362 16000 - had
15420 17503 - your

How can ‘had’ end at sample 16000 when ‘your’ begins at sample 15420?

Also, for some reason exp.nb_FastWidgets isn’t getting built for me. I’m trying to run through a few of your example notebooks but I can’t, because they try to import it and fail. I rebuilt using install.sh and buildFastAiAudio.sh but it didn’t fix it. Any ideas?

Last post then I’ll stop spamming the thread :slight_smile:

What do you guys think the best approaches are for phoneme recognition with real world data? In TIMIT we just chop each phoneme out of a file, so the resulting clips all have different lengths. I’ve seen @ste try to address this by taking 3 spectrograms at different resolutions and putting them together as a 3-channel image. But will this generalize to phoneme recognition on actual speech data, where we don’t have annotations telling us where each phoneme starts and ends?

What is a better approach? Trying to design a model that can separate phonemes, and then passing the resulting splits to our recognition model? Or passing an overlapping/rolling window over the real data and then trying to classify which phoneme is occurring in the window? I’m going to go hunt for some papers on this topic now, but would love to hear other people’s thoughts. Thanks!

1 Like

I haven’t read or thought much about this :slight_smile: so take it with a grain of salt, but I guess it depends what the base problem is; I suspect that for speech recognition you’d often be better off taking a seq2seq approach rather than a discrete classifier approach. I know that CTC has been used successfully in this domain - see this Distill article, and this Stanford course material.

If you’re trying specifically to do phoneme identification/classification - say, for doing automated transcription for phonemic analysis of a language from field recordings - I’m also not sure. The naive way I’ve conceived of it for speaker diarisation (a similar problem - “which segments of this signal have which class”) is the same idea you had. Train the classifier on the smallest possible effective segment size (say, 2ms), then chop the target inference sequence into overlapping segments of that size, classify each segment while remembering its position, and take the “argmax” of contiguous regions of the input signal. This feels inefficient and brute-force, but I feel like it could be the “first dumb thing” to try.
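
Something like this, purely as a sketch - the window/hop sizes and the `classify` callable are placeholders, not anything we’ve actually built:

import numpy as np

def sliding_window_classify(signal, classify, sr=16000, win_ms=20, hop_ms=10):
    # `classify` is any callable mapping a window of raw samples to a label --
    # a stand-in for whatever model you've trained on fixed-size segments.
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    preds = []
    for start in range(0, len(signal) - win + 1, hop):
        preds.append((start, classify(signal[start:start + win])))
    return preds

def collapse_runs(preds):
    # Merge runs of identical predictions into (start_sample, label) segments --
    # the crude "argmax over contiguous regions" step.
    segments = []
    for start, label in preds:
        if not segments or segments[-1][1] != label:
            segments.append((start, label))
    return segments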

I don’t know how it’s done in industry/research - there’s probably an accepted method.

Edit to add: In the tiny amount of speaker diarisation literature that I’ve read, the accepted method seems to be basically your second idea :slight_smile: First run a “speaker change detection” model over the sequence, then classify each segment that model discovers. They don’t seem to get very good results this way, though, and I suspect that it would be much harder to find an effective “phoneme change detection” model than for different voices.

I’ve also thought about trying unsupervised methods (using some kind of clustering) to come up with the segments to classify, but ultimately you still have the problem of trying to discretise a continuous signal; hence why seq2seq approaches feel like the right way to go. Sadly I don’t know anything about them yet :slight_smile:

2 Likes

Last night I found out about a method of processing audio using Non-negative Matrix Factorisation aka NMF. It’s usually applied for audio separation - i.e. trying to split a signal into different components, whether they’re instruments (guitars vs. drums), or speakers vs. background noise. I thought it might be helpful for the problem you’re working on, @kodzaks? Perhaps you could find an NMF transformation that was reliable for separating manatee sounds from the ambient noise?

It seems like it’s typically used as a form of “classification” directly, or perhaps as a pre-processing step before further classification. I don’t know how useful it could be as a data augmentation, but it’s possible it could be; I’d guess that e.g. different speakers’ voices would decompose differently, so if you applied the same transform to many speakers you’d get differing signals in the output. However, intuitively, I kind of feel like a deep CNN would be just as likely to pull out these aspects… but I don’t understand it well enough to be sure. If nothing else, it would output a high-dim tensor which would be nicely stackable. I might experiment with it!
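
To make that concrete, here’s a minimal sketch of decomposing a magnitude spectrogram with NMF, using librosa and scikit-learn; the file path and n_components=4 are arbitrary placeholders:

import librosa
import numpy as np
from sklearn.decomposition import NMF

y, sr = librosa.load('clip.wav', sr=None)       # any mono recording (placeholder path)
S = np.abs(librosa.stft(y))                     # magnitude spectrogram, shape (freq, time)

nmf = NMF(n_components=4, init='nndsvd', max_iter=400)
W = nmf.fit_transform(S)                        # spectral templates, shape (freq, components)
H = nmf.components_                             # activations over time, shape (components, time)

# Rebuild the part of the spectrogram explained by one component (e.g. the first),
# reusing the original phase to get back to a waveform:
S0 = np.outer(W[:, 0], H[0])
y0 = librosa.istft(S0 * np.exp(1j * np.angle(librosa.stft(y))))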

A few resources:

4 Likes

I was looking for a way to improve performance on phoneme and word recognition, so I borrowed the idea from @oguiza’s time series approach, trying to use multiple representations of the same original signal.
I’ve tried a couple of different multi-channel approaches:

  1. Multi-resolution: the first one I tried, very focused on the fact that the vast majority of things to classify are ‘small chunks’.
  • PROS: improves my accuracy.
  • CONS: pretty slow and “repetitive” (you feed the beginning of the signal multiple times at multiple resolutions).
  2. Sliding window: a better and more generic approach, useful in most situations (rough sketch after this list).
  • PROS: slightly better performance than the previous one. Faster (you compute the spectrogram once for the whole signal and then sample it in multiple overlapping windows). Easy to adapt to multiple situations (i.e. the last example of my notebook on AudioTfmsManager - there is a parametric multi-spectrogram transform that spits out 8 channels).
  • CONS: you need to adapt the initial layer of your network according to the number of samples you pass.
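
Roughly, the idea looks something like this (a sketch only, not the actual transform from my notebook; the window/hop sizes and channel count are placeholders):

import torch
import torch.nn.functional as F

def spectrogram_windows(spec, win=64, hop=32, n_channels=8):
    # spec: (n_mels, time) spectrogram of the whole signal.
    # Returns (n_channels, n_mels, win): overlapping slices stacked as channels.
    needed = win + hop * (n_channels - 1)
    if spec.shape[1] < needed:                              # pad short clips on the right
        spec = F.pad(spec, (0, needed - spec.shape[1]))
    return torch.stack([spec[:, i * hop : i * hop + win] for i in range(n_channels)])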

BTW: I’ve tried to use the sliding window approach on @ThomM’s classification notebook (“LoadingAndClassification.ipynb”) but didn’t see any improvements…

1 Like

Hey, thanks for the awesome reply and resources. I’m most interested in helping language students with their pronunciation. I taught in a rural area of Brazil for 3 months and spent a lot of time doing one-on-one pronunciation work with students, but there’s only one of me and millions of students, plus all the people learning Spanish, Mandarin, French, etc.

So I guess my problem would be checking pronunciation accuracy for short, pre-known prompts (ranging from one word to a complex sentence). It’s a much simpler problem since I know what they are trying to say, and I just have to identify the parts where they aren’t producing the correct phoneme. There are some other complexities, like handling false starts, but it seems doable.

I’m still not sure how to approach it. At every phone, the problem is now reduced from “which phone is this?” to “is this the phoneme I’m expecting at this point?”. But if you take that binary approach, do you then have to have ~40 models for ‘b’/‘not b’, ‘ə’/‘not ə’ and so on? How can we consolidate that into one model where we pass it the expected phoneme and it returns a boolean?
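
One possible shape for that (purely a sketch of the idea, not something I’ve tried; all the layer sizes and names below are made up): condition a single classifier on an embedding of the expected phoneme and have it output one probability.

import torch
import torch.nn as nn

class PhonemeVerifier(nn.Module):
    def __init__(self, n_phonemes=44, audio_dim=128, emb_dim=32):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.head = nn.Sequential(
            nn.Linear(audio_dim + emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, audio_features, expected_phoneme):
        # audio_features: (batch, audio_dim), e.g. pooled CNN features of the audio segment
        # expected_phoneme: (batch,) integer ids of the phoneme the prompt expects here
        x = torch.cat([audio_features, self.phoneme_emb(expected_phoneme)], dim=1)
        return torch.sigmoid(self.head(x)).squeeze(1)       # P("correct pronunciation")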

I’m just starting to dig into the research and industry standards for this type of thing, and I’ll report back with what I find. So far, for the problem of uninformed phoneme recognition on TIMIT, there was a paper published in 2018 reporting 71% accuracy using a CNN, and it referenced several papers as SOTA:

  • 82.3% Bidirectional LSTMs (Graves, Mohamed and Hinton, 2013)
  • 80.9% DNN with stochastic depth (Chen, 2016)
  • 82.7% RNN-CNN hybrid based on MFCC features (Zhang, 2016)
2 Likes

Oh wow, thank you! It’s definitely above my head at the moment, but it looks very interesting. Is it a bit similar to PCA? Maybe it could be good for classifying types of calls?

Came across this paper: http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf - has anyone checked it out? It’s the only paper I’ve seen wholly focused on data augmentation for audio, and they’re good people: they explain the sox methods they use for each technique :smiley:

cc: @ThomM @ste

5 Likes

I hadn’t seen it, thanks. Key takeaway:

A relative improvement of 4.8% was observed on the total Hub5 ’00 evaluation set, when using speed perturbed training data.

Sounds like only “speed perturbation” was effective for them. But it is no longer the only paper you’ve seen wholly focused on data augmentation for speech, because it has quite a few references :slight_smile:
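
For anyone who wants to try reproducing it, a minimal sketch of that speed perturbation using the sox CLI (factors 0.9/1.0/1.1 as in the paper; paths and naming are placeholders) might look like this:

import subprocess
from pathlib import Path

def speed_perturb(wav_path, out_dir, factors=(0.9, 1.0, 1.1)):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for f in factors:
        out = out_dir / f"{Path(wav_path).stem}_sp{f}.wav"
        # sox's `speed` effect resamples the signal, changing duration and pitch together
        subprocess.run(['sox', str(wav_path), str(out), 'speed', str(f)], check=True)
        outputs.append(out)
    return outputs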

Also interesting that

Mel-frequency cepstral coefficients (MFCCs) ([13]), without cepstral truncation, were used as input to the neural network.

Just like the example notebook above… seems like that approach is definitely worth trying out.
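
A rough illustration of “without cepstral truncation” (using librosa; the path and n_mels=40 are placeholder values, not what the paper used): keep as many coefficients as mel bands instead of the usual 12-13.

import librosa

y, sr = librosa.load('clip.wav', sr=16000)      # placeholder path
n_mels = 40
mfcc_full = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mels, n_mels=n_mels)   # shape (40, time)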

As an aside, I think we ought to set up another example using different data from the American English speakers dataset, as it seems too easy to get >99% accuracy :slight_smile:

1 Like

Let me know if you can’t find another open source diarisation dataset. I’ve been thinking about making one from an open source language website called Tatoeba. They have speakers of many languages translating sentences and recording audio. Licenses vary, but most are Creative Commons and just ask for attribution. I worked with the data for an English teaching project and already have some code written for pulling it down and parsing it.

It would be cool to have multiple languages as well as multiple dialects of English. You could also have an arbitrary number of speaker classes for diarisation. I’d be happy to throw something together if you can tell me what kind of dataset would be ideal.

2 Likes

Has anyone here used Kaldi? It’s an open source speech recognition toolkit written in C++. I believe Daniel Povey (author of the paper you cited) is its creator.

1 Like

Yes, I’ve used it a couple of times. It’s a very cool program for generating alignments for unaligned audio. If I recall correctly it uses HMM/GMM models (which I don’t really understand) to do the alignments. I was impressed with how well it handles most audio, especially if you include a real (good) transcription to align with.

If you’re on OSX I definitely recommend the version with a desktop UI.

But you technically don’t need these alignments to do speech recognition :slight_smile:

1 Like

@jeremy Can we please wikify the original post to continue to add resources? I’m no longer able to edit, I think too much time has passed. Thank you.

2 Likes

Done.

1 Like