Speaker diarization and summarisation

Hi everyone,

I am looking into summarising meeting recordings and I am not sure of the best way to approach the problem.
We split the problem into three phases:

  1. Speaker diarization, possibly after some noise reduction
  2. Speech-to-text, possibly followed by a correction pass
  3. Text summarisation
    I am a bit concerned that by the time we get to summarising, the model will not be able to resolve the coreferences in the dialogue.
    Also, I am not sure which architecture I should use. I know the output should be a list of words, and as input I imagine converting the audio into spectrograms.
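To keep the coreferences resolvable, my current thinking is to carry the speaker labels from step 1 through step 2, so the summariser sees speaker-tagged dialogue turns rather than a flat transcript. A minimal sketch of that merge step (the segment and word tuple formats here are only my assumption about what a diarizer and a word-level ASR model would emit):

```python
def label_words_with_speakers(diar_segments, words):
    """Attach a speaker label to each transcribed word by time overlap.

    diar_segments: list of (speaker, start_s, end_s) from the diarizer.
    words: list of (word, start_s, end_s) from the speech-to-text step.
    Both formats are assumptions, not a specific library's output.
    """
    labelled = []
    for word, w_start, w_end in words:
        mid = (w_start + w_end) / 2  # midpoint of the word in seconds
        speaker = next(
            (spk for spk, s, e in diar_segments if s <= mid < e),
            "UNKNOWN",  # word falls outside every diarized segment
        )
        labelled.append((speaker, word))
    return labelled


def to_speaker_turns(labelled):
    """Collapse consecutive same-speaker words into dialogue turns,
    e.g. 'SPEAKER_00: hello there', which is what would be fed to
    the summariser so 'he'/'she'/'I' can be tied to a speaker."""
    turns = []
    for speaker, word in labelled:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word)
        else:
            turns.append((speaker, [word]))
    return [f"{spk}: {' '.join(ws)}" for spk, ws in turns]
```

Does this kind of speaker-tagged input actually help summarisation models with coreference in practice?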
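For the input side, here is roughly what I mean by converting the audio into spectrograms: a plain log-power STFT over windowed frames. This is only a sketch (no mel filterbank, and the frame/hop sizes are arbitrary values I picked for 16 kHz audio):

```python
import numpy as np

def log_power_spectrogram(signal, frame_len=400, hop=160):
    """Turn a 1-D audio signal into a 2-D time-frequency feature map.

    frame_len=400 and hop=160 correspond to 25 ms windows with a
    10 ms hop at 16 kHz -- assumed values, not from any paper.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(power + 1e-10))  # floor avoids log(0)
    return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)
```

The model would then map this 2-D array to the word sequence, but I am unsure which architecture is a sensible choice for that mapping.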

Any ideas or experience you could share on this topic?


