I am an alumnus of course 1, currently midway through course 2, and working on Speech To Text (STT), where I am facing a few challenges.
I want to correctly recognize messages from / with:
- speakers with (even strong) accents / local dialects
- non native speakers
- low audibility / mobile phones
- various mobile phones (cheap / expensive)
- text in a foreign language (German)
The system should avoid biases, e.g. against particular accents or speaker groups.
I would like to be able to create a LibriSpeech-style dataset for German and add some dialect / accent data if necessary.
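For reference, LibriSpeech stores audio as `<speaker>/<chapter>/<speaker>-<chapter>-<utt>.flac` plus one `<speaker>-<chapter>.trans.txt` per chapter, where each line is `UTTERANCE_ID TRANSCRIPT` in upper case without punctuation. The sketch below writes German recordings into that layout. It assumes you already have `(speaker_id, chapter_id, audio_path, transcript)` tuples; the function names are hypothetical, audio is copied as-is (no resampling or FLAC conversion), and it does not produce the manifest files deepspeech.pytorch loads, so check that repo's data-preparation scripts for the exact format it expects.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path


def normalize_transcript(text: str) -> str:
    """Normalize a German transcript roughly in LibriSpeech style (upper case, no punctuation)."""
    # \w is Unicode-aware in Python 3, so ä, ö, ü and ß survive this step; apostrophes are kept.
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Caveat: str.upper() maps "ß" to "SS"; replace this with a custom rule if that matters to you.
    return text.upper()


def write_librispeech_style(utterances, root):
    """Write utterances into a LibriSpeech-like tree: root/<speaker>/<chapter>/...

    `utterances` is an iterable of (speaker_id, chapter_id, audio_path, transcript).
    Returns the list of utterance IDs that were written.
    """
    root = Path(root)
    counters = defaultdict(int)       # running utterance index per (speaker, chapter)
    transcripts = defaultdict(list)   # lines for each <speaker>-<chapter>.trans.txt
    written = []
    for speaker, chapter, audio_path, transcript in utterances:
        audio_path = Path(audio_path)
        key = (speaker, chapter)
        utt_id = f"{speaker}-{chapter}-{counters[key]:04d}"
        counters[key] += 1
        chapter_dir = root / str(speaker) / str(chapter)
        chapter_dir.mkdir(parents=True, exist_ok=True)
        # Copy the audio under its LibriSpeech-style name, keeping the original extension.
        shutil.copy(audio_path, chapter_dir / f"{utt_id}{audio_path.suffix}")
        transcripts[key].append(f"{utt_id} {normalize_transcript(transcript)}")
        written.append(utt_id)
    for (speaker, chapter), lines in transcripts.items():
        trans_file = root / str(speaker) / str(chapter) / f"{speaker}-{chapter}.trans.txt"
        trans_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return written
```

Using one speaker ID per real speaker (and, for dialect data, keeping a side table mapping speaker IDs to region/accent) makes it possible later to build speaker-disjoint splits and to evaluate per accent.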
Also, I have problems with generalisation: recognition quality seems to depend on the speaker's distance from the microphone, the speaker type, accents, etc. The WER (word error rate) is good as long as the test data comes from the same dataset as the training data, but even slight changes make recognition much worse.
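To pin down where generalisation breaks, it can help to report WER separately per condition instead of one number for the whole test set. WER is the word-level edit distance divided by the number of reference words, WER = (S + D + I) / N (substitutions, deletions, insertions). The sketch below assumes each test utterance carries a condition label (accent, microphone distance, phone model, ...) that you attach yourself; `word_errors` and `wer_by_condition` are hypothetical helper names, not part of deepspeech.pytorch.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over words: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            )
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """Corpus-level WER per condition.

    `samples` is an iterable of (condition, reference, hypothesis) strings.
    """
    errors, words = defaultdict(int), defaultdict(int)
    for condition, ref, hyp in samples:
        errors[condition] += word_errors(ref, hyp)
        words[condition] += len(ref.split())
    return {c: errors[c] / words[c] for c in errors if words[c] > 0}


results = wer_by_condition([
    ("clean", "das ist ein test", "das ist ein test"),
    ("phone", "das ist ein test", "das ist test"),
    ("phone", "hallo welt", "hallo wald"),
])
print(results)
```

Errors and word counts are summed per condition before dividing, so long utterances weigh more than short ones, which is the usual corpus-level convention rather than an average of per-utterance WERs.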
Can you give me a hint where to look? Currently, I am using a PyTorch implementation of Deep Speech 2 (https://github.com/SeanNaren/deepspeech.pytorch).
It looks to me as if not much progress on STT has been published since Deep Speech 2 (2015), but there has clearly been progress in the field; e.g. Google's STT works great!
Are there recommended STT algorithms that can generalize better than others? Is there a conference I could / should attend?
Kind regards and a big thank you for the great fastai library, courses and forums!!!