I am looking into summarising meeting recordings and I am not sure of the best way to approach the problem.
We split the problem into three phases:
- Speaker diarisation, possibly preceded by noise reduction
- Speech-to-text, possibly followed by some correction
- Text summarisation
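To make the pipeline concrete, here is a rough sketch of how the diarisation and STT outputs could be stitched together before summarisation. All segment times, speaker labels, and texts below are invented; real tools (e.g. pyannote for diarisation, Whisper for STT) emit similar `(start, end)` structures.

```python
# Hypothetical merge of diarisation turns with STT segments: each
# transcribed segment is assigned the speaker whose turn contains
# the segment's midpoint.

def assign_speakers(diarization, stt_segments):
    """Attach a speaker label to each transcribed segment.

    diarization: list of (start, end, speaker) tuples.
    stt_segments: list of (start, end, text) tuples.
    """
    transcript = []
    for start, end, text in stt_segments:
        mid = (start + end) / 2
        speaker = next(
            (spk for s, e, spk in diarization if s <= mid < e),
            "UNKNOWN",
        )
        transcript.append((speaker, text))
    return transcript

# Made-up example data for two speakers.
diarization = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.0, "SPEAKER_01")]
stt = [(0.3, 3.9, "Shall we start?"), (4.5, 8.7, "Yes, let's review the agenda.")]

for spk, text in assign_speakers(diarization, stt):
    print(f"{spk}: {text}")
```

The midpoint heuristic is crude (segments straddling a speaker change get one label), but it keeps the two phases decoupled.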
I am a bit concerned that by the time we get to the summarisation step, the model will no longer be able to resolve the coreferences in the dialogue.
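One common mitigation for the coreference worry is to keep speaker labels on every turn and to chunk long transcripts with an overlap, so the summariser always sees some preceding context for pronouns like "he" or "that proposal". A minimal sketch (the turn contents are made up, and `max_turns`/`overlap` are arbitrary choices, not recommendations):

```python
# Split speaker-labelled turns into windows that share `overlap`
# trailing turns with the previous window, so each chunk carries
# some context from the one before it.

def chunk_turns(turns, max_turns=6, overlap=2):
    step = max_turns - overlap
    chunks = []
    for i in range(0, len(turns), step):
        chunks.append(turns[i:i + max_turns])
        if i + max_turns >= len(turns):
            break
    return chunks

# Invented transcript of ten alternating turns.
turns = [f"SPEAKER_{i % 2}: utterance {i}" for i in range(10)]
chunks = chunk_turns(turns)
```

Each chunk can then be summarised separately and the partial summaries merged in a final pass.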
Also, I am not sure what architecture I should use. I know that the output should be a sequence of words, and as input I imagine converting the audio into spectrograms.
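For the spectrogram input, here is a minimal magnitude-spectrogram sketch in plain NumPy (no librosa or torchaudio); the 440 Hz test tone and the frame parameters are purely illustrative:

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window, and take the magnitude
    of the real FFT of each frame. Returns (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000
t = np.arange(sr) / sr                  # one second of audio at 16 kHz
signal = np.sin(2 * np.pi * 440.0 * t)  # 440 Hz test tone
spec = spectrogram(signal)
print(spec.shape)                       # (frames, frequency bins)
```

Most end-to-end STT models go one step further and use log-mel spectrograms rather than raw magnitudes, but the framing/windowing/FFT structure is the same.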
Any ideas or experience you could share on this topic?