Is audio segmentation possible with FastAI?

I’ve seen many examples of classifying individual audio files (e.g. 1 second clips), but am looking for an example of segmenting a longer audio file into distinct parts.

I have a couple hundred audio files and have labelled parts of them as different types of content (e.g. 0-10s is label A, 10-12s is label B, etc.). Given the nature of the content, I think an LSTM or GRU would be most helpful, since the labels are somewhat contextual: if you isolated a 1 second segment (without knowing what came before/after it), it'd be hard to classify.
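For what it's worth, the architecture you're describing is roughly a CRNN: per-frame features followed by a recurrent layer that labels every time step. Here's a minimal PyTorch sketch of the idea, assuming spectrogram input; all the names and sizes are illustrative, not from any particular paper or library.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a frame-level segmentation model: a simple
# per-frame frontend plus a bidirectional GRU, so each frame's prediction
# can use both left and right context. Sizes are made-up examples.
class SegmentationCRNN(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=3):
        super().__init__()
        # Per-frame feature extractor over the mel axis
        self.frontend = nn.Sequential(
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
        )
        # Bidirectional GRU gives each frame context in both directions
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # One class prediction per time step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, n_mels) -> logits: (batch, time, n_classes)
        h = self.frontend(x)
        h, _ = self.gru(h)
        return self.head(h)

model = SegmentationCRNN()
spec = torch.randn(2, 500, 64)   # 2 clips, 500 frames, 64 mel bins
logits = model(spec)
print(logits.shape)              # torch.Size([2, 500, 3])
```

Training it then looks like ordinary classification, except the cross-entropy loss is applied at every time step against your frame-level labels.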

Presumably I can split up each audio file into lots of tiny pieces (e.g. mel spectrogram windows), then feed each piece through a neural net to train it. I've read several papers that do this (often for speaker diarization), but can't find any approachable code examples.

If you have any suggestions for where to start I’d really appreciate it. Thanks!


Hey @micahjon, I recently built a long-form audio classifier using fastaudio: https://towardsdatascience.com/longform-audio-classification-in-fastaudio-76d81825d29b. To be honest it didn't work as well as I'd hoped, but I think it's a decent start.

In terms of approachable code I’m taking a look at: https://github.com/pyannote/pyannote-audio.

I’m not an expert in this field but I find it pretty interesting. Let me know what you think!
