Video modeling: identifying the speaker

I want to be able to extract an image of the face of each speaker in a video. I think the tool I’m building will at some point need a classifier that takes an image of a face as input and determines whether the face is speaking. For this, I’ll need to do one of two things:

1. Integrate into my code an existing model for classifying an image of a face as speaking or not speaking.
2. Collect a dataset of images of faces speaking and not speaking, and use it to build the classifier.

Where should I look for either a dataset of faces speaking and not speaking, or a model that classifies faces as speaking or not speaking?