Speech To Text / STT


I am an alumnus of course 1, in the midst of course 2, working on Speech To Text, and I have a few challenges there :slight_smile:

I want to correctly understand messages from / with

  • speakers with (even strong) accents / local dialects
  • non native speakers
  • low audibility / mobile phones
  • various mobile phones (cheap / expensive)
  • text in a foreign language (German)

The system should avoid biases.

I should be in a position to create a Librispeech dataset for German, and add some dialect / accent data if necessary.

Also, I have problems with generalisation - message understanding seems to depend on the distance of the speaker from the microphone, the speaker type, accents, etc. WER (word error rate) seems to be good as long as the test data comes from the same dataset as the training data, but even slight changes in conditions seem to make the understanding much worse.
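As an aside, for anyone comparing WER across conditions: WER is just the word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal pure-Python sketch (the function name `wer` is mine, not from any library):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` gives 1/3 (one substitution out of three reference words). Computing this per condition (near/far microphone, accent group, phone type) rather than one aggregate number makes the generalisation gap visible.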

Can you give me a hint where to look? Currently, I am using a PyTorch implementation of Deep Speech 2 (https://github.com/SeanNaren/deepspeech.pytorch).

It looks to me like there has not been much progress published on STT since Deep Speech 2 (2015), but there is clearly progress in the field - e.g. Google's STT works great!

Are there recommended STT algorithms that can generalize better than others? Is there a conference I could / should attend?

Kind regards and a big thank you for the great fastai library, courses and forums!!!


End-to-end STT only generalizes/works well if you have a lot of data; the magic number is around 10,000 hours. Companies like Google, Baidu and Microsoft train end-to-end systems, and people like Andrew Ng (who worked on Baidu's Deep Speech project) have reinforced this need for a large corpus.

The approach for smaller teams is a hybrid HMM/DNN solution, usually built with a toolkit like Kaldi. This approach involves preprocessing the data, working with phonemes, and computing alignments before feeding everything to training.



Thank you very much for your input, this is very helpful. There seems to be a new hybrid solution out there: PyTorch-Kaldi from Bengio's group (https://github.com/mravanelli/pytorch-kaldi). We will look at that.

BTW - do the 10,000 hours have to be real spoken audio, or can it be real speech plus augmented speech?


You can count the augmented speech; the right ratio of real to augmented speech depends heavily on the data you have, so you have to test it out.
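To make "augmented speech" concrete, here is a minimal pure-Python sketch of two common waveform-level augmentations: random gain perturbation and additive noise at a random SNR. The function name and the dB ranges are my own choices for illustration; in a real pipeline you would do this with numpy/torchaudio and add more transforms (speed/tempo perturbation, room impulse responses, SpecAugment):

```python
import math
import random

def augment(waveform, rng=random):
    """Apply random gain (+/- 6 dB) and additive Gaussian noise (10-30 dB SNR).

    `waveform` is a list of float samples; returns a new augmented list.
    """
    # Random gain between -6 dB and +6 dB
    gain = 10.0 ** (rng.uniform(-6.0, 6.0) / 20.0)
    out = [s * gain for s in waveform]

    # Additive Gaussian noise at a random signal-to-noise ratio
    snr_db = rng.uniform(10.0, 30.0)
    signal_power = sum(s * s for s in out) / max(len(out), 1) + 1e-12
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [s + rng.gauss(0.0, noise_std) for s in out]
```

Applying such transforms on the fly during training (rather than precomputing a fixed augmented set) effectively multiplies the corpus, which is what makes counting augmented hours toward the total reasonable.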


Thanks, we will try!