I’m trying to build a speech-to-text model for transcribing medical prescriptions (symptoms, diagnosis, suggested drugs and tests) as narrated by a physician. I’m wondering how to get started.

Hi Amit
This is an area where I have no knowledge, but here is my idea.
You need a recording of a sentence containing the words, for example "Mary had a little lamb".
Take the recording and produce an image of the sound wave over time.
At the syllable level, "Mary had a little lamb" would probably look like Mar ri h’ad a lit tel la mb. So about 9 mini waves.
Then produce coarser waves, one per word: Mary had a little lamb. So 5 waves.
Then we treat each wave as an image and use a CNN to represent it. So it is similar to finding multiple objects in a picture.
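
A rough sketch of the "cut the recording into mini waves" step, assuming NumPy and a simple energy threshold. The function name `split_into_segments`, the frame length, and the threshold are my own illustrative choices, and the input is a synthetic signal of three tone bursts standing in for spoken words. Real speech would need something more robust than this.

```python
import numpy as np

def split_into_segments(signal, frame_len=400, threshold=0.01):
    """Return (start, end) sample indices of stretches louder than the threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)  # mean power of each frame
    active = energy > threshold

    segments = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # a loud stretch begins
        elif not is_active and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None
    if start is not None:  # signal ended while still loud
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

# Synthetic stand-in for a recording: three tone bursts separated by silence.
sr = 16000                                   # assumed sample rate in Hz
t = np.arange(sr // 5) / sr                  # 0.2 s per burst
burst = 0.5 * np.sin(2 * np.pi * 220 * t)
silence = np.zeros(sr // 10)                 # 0.1 s of silence
signal = np.concatenate([silence, burst, silence, burst, silence, burst, silence])

segments = split_into_segments(signal)
print(segments)
```

Each `(start, end)` pair could then be cut out of the signal and turned into its own image for the CNN.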

So "Mary had a little lamb" and "Mary had a sheep dog" should have common images for "Mary had a".

So we are building sound objects to represent sequences of words.

Let me stress I know nothing about this subject, but I think that would be the approach. I think that medical terminology might be less varied than ordinary speech.
Here is "Mary had a little lamb" sung by me.

Microsoft has a research team called the Garage and their speech-to-text is quite good, but I suspect they might have a large database. One interesting idea might be music downloads, because the words are recorded.

Regards Conwyn

That sounds cool, but I suspect different words can have the same pictorial representation, or that the waveforms of two different words would look similar (as a waveform only measures intensity over time).

Hi Amit
I wonder if there is merit in using a Fourier transform, which would give a frequency image. I think, as with any speech, homophones are a problem: "The red books which have been read were simply placed in a box labelled red read." Hence recording the sentence as a whole (as commonly uttered) would hopefully cover the ambiguity.
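
A minimal sketch of the Fourier-transform idea, assuming NumPy and a plain short-time Fourier transform. The `spectrogram` helper, the frame and hop sizes, and the two test tones are my own illustrative choices. The two tones have the same loudness over time, so their intensity envelopes match, but their frequency images peak in different places.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                               # assumed sample rate in Hz
t = np.arange(sr) / sr                   # one second of samples
low = np.sin(2 * np.pi * 500 * t)        # same amplitude, 500 Hz
high = np.sin(2 * np.pi * 2000 * t)      # same amplitude, 2000 Hz

spec_low = spectrogram(low)
spec_high = spectrogram(high)

# Convert the strongest frequency bin of each spectrogram back to Hz.
bin_hz = sr / 512
peak_low = spec_low.mean(axis=0).argmax() * bin_hz
peak_high = spec_high.mean(axis=0).argmax() * bin_hz
print(spec_low.shape, peak_low, peak_high)
```

The 2-D array returned by `spectrogram` can be treated as an image in the same way as the wave pictures, which is why it might suit a CNN better than the raw waveform.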
Regards Conwyn

Have a look at the Deep Learning with Audio thread.
There’s plenty there, and in the links it contains, that is good beginner reading for audio tasks.