Video modeling, identifying speaker

I want to be able to extract an image of the face of each speaker in a video. The tool I’m building will, I think, at some point need a classifier that takes an image of a face as input and determines whether or not the face is speaking. For this, I’ll need one of two things:

1. Integrate an existing model into my code that classifies an image of a face as speaking or not speaking, or
2. Collect a dataset of images of speaking and non-speaking faces and build the classifier myself.

I’m wondering: where would I search for either a dataset of speaking/non-speaking faces, or a model that classifies a face as speaking or not speaking?

Yeah, there are a lot of resources on YouTube. You can use VoxCeleb, which provides face bounding boxes extracted from YouTube videos, and in those clips almost all of the faces are speaking, so they work well as positive examples. For the "not speaking" class, you can take a still picture and animate it into a video of a face that just moves around without speaking, and use those clips as negative examples. Then build a sequence model that takes, say, 60 frames of a video as input and train it on both classes.
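As a minimal sketch of the data preparation step above, here is one way to slice a video's decoded frames into fixed-length 60-frame clips for a sequence model. The function name and stride are my own choices for illustration, not from any particular library:

```python
def make_clips(frames, clip_len=60, stride=60):
    """Split a list of frames into consecutive clips of clip_len frames.

    A trailing clip shorter than clip_len is dropped, since a sequence
    model usually expects a fixed input length.
    """
    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clips.append(frames[start:start + clip_len])
    return clips


# Toy example: a 150-frame video yields two full 60-frame clips.
frames = list(range(150))   # stand-in for decoded video frames
clips = make_clips(frames)
print(len(clips))           # 2
print(len(clips[0]))        # 60
```

Each resulting clip would then be labeled speaking or not speaking according to its source video and fed to the sequence model.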
Or use MTCNN or MediaPipe to get face landmarks, then calculate the distance between the lip landmarks in each frame and use that as a simple speaking / not-speaking heuristic.
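A rough sketch of that heuristic, assuming you have already extracted a per-frame mouth-opening distance (e.g. the gap between top-lip and bottom-lip landmarks from MTCNN or MediaPipe Face Mesh): a speaking mouth opens and closes, so the normalized gap varies over time, while a still face gives a near-constant gap. The function name and the 0.01 threshold are hypothetical and would need tuning on real landmark data:

```python
import math
from statistics import pstdev


def is_speaking(openings, face_height, threshold=0.01):
    """Classify a frame sequence as speaking from lip-gap variation.

    openings: per-frame distance between the top- and bottom-lip
    landmarks. face_height: scalar used to normalize the gap so the
    measure is scale-invariant. Returns True when the standard
    deviation of the normalized gap exceeds the threshold.
    """
    normalized = [d / face_height for d in openings]
    return pstdev(normalized) > threshold


# Toy example: an oscillating mouth vs a closed, static mouth.
talking = [5 + 4 * math.sin(i / 3) for i in range(60)]   # gap varies
silent = [1.0] * 60                                      # gap constant
print(is_speaking(talking, face_height=100.0))   # True
print(is_speaking(silent, face_height=100.0))    # False
```

This avoids training entirely, at the cost of being sensitive to landmark jitter and head motion, which is why the learned sequence-model approach above may generalize better.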