Speech-to-text and Speech Synthesis

@jeremy, is there any room for ‘requests’ for the remaining portion of the curriculum? Along with the wonderful world of NLP and sentiment analysis, I think ‘speech-to-text’ and ‘speech synthesis’ would bring the coverage of language “full circle”. I know LyreBird and WaveNet have both achieved amazing results with neural networks. Is this already planned, or could it be added, for the remaining lessons of Part 2?

I’m very interested to learn some of the tips/tricks of this area. If anybody’s done anything so far, I’m all ears.


Even if these don’t make it into the official curriculum, this would be interesting to discuss here. I have been pondering the idea of a speech-to-speech model that could take speech in one language and turn it into a machine-generated version of your own voice in the other language.


I’m afraid not. It takes me months of research to get to the point where I have useful things to show in a lesson, so the main pieces have to be done well ahead of time. I would love to spend time on audio in the future, however, and would be interested to hear if any students take a look at this area.


I started reading more about recent text-to-speech systems. From my limited exploration, I think the easiest to understand and implement is VoiceLoop from Facebook (https://github.com/facebookresearch/loop). The Git repo targets PyTorch 0.1, but I was able to run it on v0.4. Right now I am trying to understand how they did the preprocessing.

A few good things about VoiceLoop:

  • Simple architecture
  • Ability to generate voices for different speakers
  • Ability to learn from noisy data
  • Working code

Speech data is very complicated compared to images or text. A lot of preprocessing and feature engineering is required before neural nets can be applied.
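
To give a rough idea of what that preprocessing can look like, here is a minimal sketch of one common step: slicing a waveform into short overlapping frames and taking a log-magnitude spectrum of each. This is a generic illustration and not VoiceLoop's actual pipeline. The function name `log_spectrogram`, the 25 ms / 10 ms frame settings, and the synthetic 440 Hz tone are all assumptions chosen for the example.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10, eps=1e-10):
    """Hypothetical helper: frame a 1-D waveform and return log-magnitude spectra."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
    if len(signal) < frame_len:
        # Zero-pad very short clips so at least one frame exists
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)                   # taper to reduce spectral leakage
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # magnitude per frequency bin
    return np.log(spectrum + eps)                    # eps avoids log(0)

# Synthetic stand-in for real audio: one second of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

features = log_spectrogram(tone, sr)
print(features.shape)  # (98, 201): 98 frames, 201 frequency bins
```

The resulting 2-D array (time by frequency) is the sort of input a neural net can consume. Real systems usually go further, for example with mel scaling or vocoder features.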


Hi Saurabh,
thanks, Loop sounds very promising. I was not aware of it!

Kind regards