Comparing speech samples


I want to create a neural network that compares two short audio files (each containing 3-7 spoken words, said by different people) to determine whether they are saying the same thing.

I was wondering if anyone has any idea how I would go about this? (I haven’t found any promising leads online using similar search terms.)

My first thought was to convert them into waveform images. However, for a basic image classifier to work, I would need one of the audio files in advance to train the model, and would then compare the other file against it to see whether the model classifies it as the same. That won’t work for me, though, since I’ll be dealing with potentially millions of different audio clips, so the approach needs to be generic.

This sounds similar to face recognition with Siamese networks. You could train a sound classifier so that it learns to extract features from sounds, then look at the distance between the feature vectors of the two speech clips.
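For illustration, here is a minimal sketch of the Siamese idea in NumPy. Everything here is hypothetical (the input dimension, the linear-plus-ReLU "encoder", the random weights); a real model would be a trained CNN over spectrograms. The point is only the structure: both clips pass through the *same* shared encoder, and you threshold the distance between the resulting embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed-size inputs, e.g. flattened log-mel spectrograms.
INPUT_DIM, EMBED_DIM = 128, 16

# One shared weight matrix plays the role of the twin encoder:
# both clips are mapped by the same parameters.
W = rng.standard_normal((INPUT_DIM, EMBED_DIM)) * 0.1

def encode(x):
    """Toy shared encoder: linear projection + ReLU."""
    return np.maximum(W.T @ x, 0.0)

def distance(x1, x2):
    """Euclidean distance between the two clip embeddings."""
    return float(np.linalg.norm(encode(x1) - encode(x2)))

clip_a = rng.standard_normal(INPUT_DIM)
clip_b = rng.standard_normal(INPUT_DIM)

# Identical inputs map to identical embeddings, so the distance is 0;
# in practice you would pick a threshold that separates "same phrase"
# pairs from "different phrase" pairs on a validation set.
print(distance(clip_a, clip_a))  # 0.0
print(distance(clip_a, clip_b) > 0.0)  # True
```

In a real setup you would train the encoder with a pairwise objective (e.g. contrastive or triplet loss) on labeled same/different pairs, so that same-phrase clips land close together regardless of speaker.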

Sounds like it should work. I was also thinking about concatenating the two waveforms into one and then putting them through a CNN that outputs two classes (similar and dissimilar). Would that work in theory as well?
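For comparison, the "concatenate and classify" idea can be sketched like this. Again, everything here is a stand-in: a logistic-regression head over the stacked pair takes the place of a real CNN, and the clip dimension is made up. The structural difference from the Siamese sketch is that the pair is treated as one joint input to a binary classifier rather than being encoded separately.

```python
import numpy as np

rng = np.random.default_rng(1)

CLIP_DIM = 64  # hypothetical fixed clip length after padding/resampling

def classify_pair(x1, x2, W, b):
    """Toy 'concatenate then classify' head: logistic regression over
    the stacked pair, standing in for a CNN with a 2-class output."""
    pair = np.concatenate([x1, x2])        # one joint input vector
    logit = pair @ W + b
    p_same = 1.0 / (1.0 + np.exp(-logit))  # P(similar)
    return float(p_same)

# Untrained random parameters, for shape-checking only.
W = rng.standard_normal(2 * CLIP_DIM) * 0.05
b = 0.0

clip_a = rng.standard_normal(CLIP_DIM)
clip_b = rng.standard_normal(CLIP_DIM)

p = classify_pair(clip_a, clip_b, W, b)
print(0.0 < p < 1.0)  # True: a probability of the pair being similar
```

One caveat with this formulation: unlike the Siamese distance, the concatenated input is order-sensitive unless you train with both orderings (or symmetrize the input), and the learned features are not reusable as standalone clip embeddings.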