Architecture for speech-to-text of language with very limited vocabulary

Hi! I’m trying to find a good architecture to convert speech to text for a use case where only a subset of a natural language is used. The model would have to deal with around 40 unique characters, 15000 unique words and 6000 unique sentences.

I want the model to map audio to text with the given characters even if the resulting text is not present in the given vocabulary (which is probably different from the usual speech-to-text problem). I think the vocabulary is still needed here for correctly spelling words that match the vocabulary.
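For the spelling part, my current idea is to post-process the raw transcription by snapping each decoded word to the closest vocabulary entry if one is similar enough, and otherwise keeping the raw output. Here's a rough stdlib sketch of what I mean (the helper name, the tiny vocabulary, and the cutoff value are all just my own illustration):

```python
import difflib

# Toy stand-in for my 15000-word vocabulary
vocab = ["banana", "band", "canal"]

def snap_to_vocab(word, vocab, cutoff=0.8):
    """Return the closest vocabulary word if it's similar enough,
    otherwise keep the raw transcription unchanged."""
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(snap_to_vocab("bananna", vocab))    # near-miss: snapped to "banana"
print(snap_to_vocab("xylophone", vocab))  # no close match: kept as-is
```

That way the model itself stays unconstrained (it can output any character sequence), and the vocabulary only nudges near-misses toward correct spellings.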

I’m a software engineer without any proper background on ML and just got started with fastai last week, so apologies in advance if I have trouble communicating clearly :sweat_smile:

Would really appreciate it if the awesome folks here can help me guide my thoughts in the right direction. Thanks!

Here’s my journey so far in case anyone is interested:

  • I have a labelled dataset with recordings of all the unique words and sentences
  • I assumed training the model on words first would help it recognize sentences better later on, so I generated spectrograms of the unique words and fed them to a simple CNN model (one I copied from the convolutions tutorial notebook)
  • The results weren’t very promising (not exactly sure why — I had just started playing around with fastai and tried to follow the tutorials, so I might have missed something)
  • I figured I should probably let the model recognize unique characters first and then build on top of that, so I updated my DataBlock to use a MultiCategoryBlock of 40 characters instead of a CategoryBlock of 15000 words (Now that I think about it, could the huge number of categories be what made the learning so slow in the previous step?)
  • This worked pretty well and I ended up with an accuracy of 97% (verified on a separate test dataset). But, for example, for a recording saying “banana”, my model would output [‘b’, ‘n’, ‘a’], so I still needed to figure out the order and frequency of the characters
  • I thought I could reuse the weights of the trained character model when trying to figure out words from audio, and then reuse those weights again to figure out sentences from audio. Long story short, I realized that transferring the weights wouldn’t be possible if I update the DataBlock. I tried pre-training from my simple model too but couldn’t make it work (maybe I should extend nn.Module to create my own model?)
  • After some more research into RNNs and Seq2Seq models, I’m wondering if that’s the next thing I should try. I’d previously glossed over them, thinking I don’t really need the model to “remember” context; I just want it to transcribe whatever sound it is receiving
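From what I’ve read so far, CTC-style decoding seems to address exactly the order/frequency problem I hit with the character bag: the model emits one character distribution per spectrogram frame, and decoding collapses consecutive repeats and drops a special blank token. A toy sketch of the greedy decode step (pure Python; the function name and blank symbol are just my own illustration):

```python
BLANK = "_"  # special CTC blank symbol, not one of the 40 real characters

def ctc_greedy_decode(frames):
    """Collapse a per-frame character sequence into a transcription:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for ch in frames:
        # only emit a character when it differs from the previous frame
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)

# e.g. a model emitting one character per frame of a "banana" recording;
# the blank between the repeated 'a'/'n' runs is what preserves doubles
print(ctc_greedy_decode(list("bbaa_nn_aa_n_aa")))  # → "banana"
```

So a duplicated letter like the double ‘a’s in “banana” survives as long as the model emits a blank between the two runs, which is what the blank token is for.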
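If CTC turns out to be the right direction, PyTorch already ships a loss for it, so I might not need a full Seq2Seq model with attention: the CNN just has to output a (frames × characters) matrix of log-probabilities. A shape-only sketch with random tensors standing in for my model and data (all the sizes are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder shapes: T spectrogram frames, batch of N clips,
# C = 40 characters + 1 reserved CTC blank (index 0)
T, N, C = 50, 4, 41
log_probs = torch.randn(T, N, C).log_softmax(2)           # stand-in for model output
targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # target character indices
input_lengths = torch.full((N,), T, dtype=torch.long)     # frames per clip
target_lengths = torch.full((N,), 10, dtype=torch.long)   # characters per label

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

The appealing part is that CTC doesn’t need frame-level alignments, just the audio and its character sequence, which matches the labelled data I already have.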