How to do Speech Detection? (on/off, not STT)

I have a video (or audio) file and I need to align existing subtitles to the audio precisely. This isn’t a “crop” function that only adjusts the overall start and end; each individual subtitle’s timing needs to be accurate.

I’m building a Japanese learning tool for students, and I have a DVD video with subtitles available, but their timing is not accurate enough.

So far I’ve found a tutorial, but the Freesound dataset it uses seems a bit useless for my purpose, although it does have some speech audio in there as well.

Since I’m an absolute beginner, I’ll just try to follow the tutorial and post updates on my progress. If anyone has tips, I’d be happy to hear them. In any case, hopefully this topic will be of help to other people.
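
To make the "on/off" idea concrete, here is a minimal sketch of the simplest approach I can think of: split the audio into short frames and mark a frame as "on" when its RMS energy is above a threshold. This is an energy gate, not real speech detection, so it will also fire on music and loud noise. The function name `detect_speech`, the 30 ms frame size, and the 0.02 threshold are all my own assumptions for illustration, and it expects mono samples already decoded to floats in [-1.0, 1.0] (extracting the audio from the DVD is not shown).

```python
import math


def detect_speech(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return a list of (start_sec, end_sec) spans where frame RMS >= threshold.

    samples: sequence of mono floats in [-1.0, 1.0] (assumed already decoded).
    frame_ms and threshold are illustrative defaults, not tuned values.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    start = None  # sample index where the current "on" span began, if any

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold and start is None:
            start = i  # off -> on transition
        elif rms < threshold and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None  # on -> off transition

    # Close a span that is still "on" when the audio ends.
    if start is not None:
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments


if __name__ == "__main__":
    # Synthetic check: 0.5 s silence, 1 s of a 220 Hz tone, 0.5 s silence.
    rate = 16000
    silence = [0.0] * (rate // 2)
    tone = [0.5 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
    print(detect_speech(silence + tone + silence, rate))
```

The spans it returns could then be compared against the subtitle start and end times to see how far off each one is. Timing resolution is limited to the frame size, and a trained model would presumably be needed to tell speech apart from other sounds.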

To wrap up, here are a few more useful resources: