Hi,
I want to create a neural network that compares two short audio files (each containing 3–7 spoken words, potentially said by different people) to determine whether they are saying the same thing.
I was wondering if anyone has any idea how I would go about that? (I haven’t really found any promising leads online using similar search terms)
My first thought was to convert them into waveform images. However, for a basic image classifier to work, I would need to train the model on one of the audio files and then compare the other file against it to see whether the model classifies it as the same. That won’t work for me, though, because I’ll be dealing with potentially millions of different audio clips, so the approach needs to be generic rather than trained per clip.
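To illustrate what I mean by turning audio into an image: as I understand it, this is usually done with a spectrogram (a 2-D time-frequency representation) rather than a raw waveform plot. Here’s a rough NumPy-only sketch of that conversion step; the `spectrogram` helper and the synthetic test tone are just my own stand-ins for a real recording:

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real-valued input, so rfft keeps only non-negative frequencies.
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                              # assumed sample rate
t = np.arange(sr) / sr                  # 1 second of "audio"
audio = np.sin(2 * np.pi * 440 * t)     # 440 Hz test tone in place of speech
spec = spectrogram(audio)
print(spec.shape)                       # (time frames, frequency bins)
```

The resulting 2-D array is what I’d feed to an image-style model, but as I said above, I don’t see how a plain classifier over these images generalizes to millions of clips.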