Recommendations for Audio Clustering

Hi all,

I’m attempting to chop podcast audio data into smaller segments, grouped by speaker.

For example, given a chunk of audio the output would be:
[(Speaker 1, 0:00-0:45), (Speaker 2, 0:46-1:22), …]

I’m trying to do this in an unsupervised manner. My initial approach was chopping the audio into smaller windows and then running clustering algorithms like DBSCAN on them. My hope was that each window (approximately 20 ms to 1 s) would carry some characteristics that a clustering algorithm could pick up on, but that did not end up being the case.
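For concreteness, here’s roughly the kind of pipeline I was trying. This is just a sketch: the file name, the 1 s window length, the MFCC features, and the DBSCAN parameters are all placeholder choices for illustration.

import librosa
import numpy as np
from sklearn.cluster import DBSCAN

# Load the podcast audio (path is a placeholder)
y, sr = librosa.load('podcast.wav', sr=16000)

# Chop into fixed-length, non-overlapping windows (1 s here)
win = sr
windows = [y[i:i + win] for i in range(0, len(y) - win, win)]

# One feature vector per window: MFCCs averaged over the window
feats = np.array([librosa.feature.mfcc(y=w, sr=sr, n_mfcc=20).mean(axis=1)
                  for w in windows])

# Cluster the windows; eps/min_samples need tuning, and in my case the
# resulting clusters never lined up with speakers
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(feats)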

I found a few papers on unsupervised audio segmentation, but they all seemed to solve a different problem: separating the n different voices that are talking over each other into their own audio files (i.e. source separation).

Any pointers on what I should look into to tackle this problem? I’m new to audio processing so I feel like there may be something I’m overlooking.

Here are my findings if anyone is interested:
Some approaches I saw for “audio clustering” tried converting spectrograms into features, either with techniques like WNDCHRM or by taking the activations from one of the last layers of a CNN. I tried both of these methods on spectrograms of 50 ms to 1 s windows but was not able to find a relationship between the resulting feature vectors and the person currently speaking.
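For the CNN-activation variant, this is the general shape of what I tried (a sketch only: the pretrained torchvision ResNet, the 64 mel bands, and the roughly 1 s windows are arbitrary choices on my part, and the weights argument needs torchvision >= 0.13):

import librosa
import numpy as np
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Pretrained CNN used as a generic image feature extractor
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # keep the 512-d penultimate activations
model.eval()

def spectrogram_embedding(window, sr):
    # Log-mel spectrogram of one ~1 s window, scaled to [0, 1] and tiled to 3 channels
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=window, sr=sr, n_mels=64))
    S = (S - S.min()) / (S.max() - S.min() + 1e-8)
    x = torch.tensor(S, dtype=torch.float32)[None].repeat(3, 1, 1)[None]
    with torch.no_grad():
        return model(x).squeeze(0).numpy()   # one 512-d feature vector per window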


A hacky approach that worked well for me was to use the video data from the podcast to roughly estimate when a speaker was talking. The podcast data I’m working with has only a few different camera angles; most of the time the camera is focused on one person, and that person is typically the one currently speaking.

import cv2
import numpy as np

cap = cv2.VideoCapture('podcast.mp4')
# cap.read() returns (ok, frame); keep the frame and take its mean pixel value
means = [np.mean(cap.read()[1]) for _ in range(1000)]
means[:6]
>> [25.3, 23.4, 89.4, 82.1, 85.2, 83.4, ...]

Because the speakers are mostly still, the per-frame averages stay similar within a shot and jump when the camera angle changes. From here you can use something like KDE to cluster the frames by their averages.
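Here is a minimal sketch of that last step, with the per-frame means recomputed as above. Splitting the estimated density at its local minima is just one simple way to turn the KDE into discrete groups; the 500-point grid and the frame count are arbitrary.

import cv2
import numpy as np
from scipy.stats import gaussian_kde

# Per-frame mean pixel values, as in the snippet above
cap = cv2.VideoCapture('podcast.mp4')
means = np.array([np.mean(cap.read()[1]) for _ in range(1000)])

# Estimate the density of the means and split it at its local minima,
# so each mode (roughly one camera angle) becomes its own group
kde = gaussian_kde(means)
grid = np.linspace(means.min(), means.max(), 500)
density = kde(grid)
splits = grid[[i for i in range(1, len(grid) - 1)
               if density[i] < density[i - 1] and density[i] < density[i + 1]]]

# Label each frame by which mode its mean falls into
frame_labels = np.searchsorted(splits, means)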


I found that I lacked the terminology to find relevant papers (googling “speaker clustering” gives very unsatisfying results). The key term I was missing was “diarization”; I’ve also seen papers refer to this as “speaker recognition”. “Speaker diarisation is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity.” (Wikipedia). Kaldi has a diarization tool that looks promising.
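Whatever ends up producing per-window speaker labels (Kaldi’s diarization recipes, or window embeddings plus clustering as in the first snippet), the last step back to the format at the top of this post is the same: merge runs of consecutive windows with the same label into segments. A minimal sketch, assuming fixed 1 s windows (labels_to_segments is just a name I made up):

import numpy as np

def labels_to_segments(labels, window_s=1.0):
    """Merge consecutive identical per-window labels into (speaker, start_s, end_s)."""
    labels = np.asarray(labels)
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((int(labels[start]), start * window_s, i * window_s))
            start = i
    return segments

labels_to_segments([0, 0, 1, 1, 1, 0])
>> [(0, 0.0, 2.0), (1, 2.0, 5.0), (0, 5.0, 6.0)]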