How best to handle video training data

I’m currently working with videos: I’m building a classifier that decides whether a detected face in a video is speaking during a given time range. My current approach has a preprocessing step where, for every word in that time range, I take the 5 frames in the middle of the word and, for each face that appears in all 5 frames, concatenate that person’s face crops from the 5 frames into a single image. Examples are below.
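To make that step concrete, here is a minimal sketch of the preprocessing in Python with OpenCV and NumPy. The helper names and the assumption that I already have one face box per frame (from a separate detector/tracker) are placeholders for my own pipeline, not standard functions:

```python
import cv2
import numpy as np

def middle_frames(cap, word_start_sec, word_end_sec, n_frames=5):
    """Grab n_frames centred on the middle of the word's time range."""
    fps = cap.get(cv2.CAP_PROP_FPS)
    mid = (word_start_sec + word_end_sec) / 2.0
    first = int(mid * fps) - n_frames // 2
    frames = []
    for idx in range(first, first + n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    return frames

def concat_face_track(frames, boxes, size=(112, 112)):
    """Crop one person's face from each frame and tile the crops side by side.

    `boxes` is a list of (x, y, w, h) face boxes, one per frame, which I get
    from my own face detection/tracking step (assumed here, not shown).
    """
    crops = []
    for frame, (x, y, w, h) in zip(frames, boxes):
        crops.append(cv2.resize(frame[y:y + h, x:x + w], size))
    return np.hstack(crops)  # one wide image: 5 face crops in a row
```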

My question is: is this process of combining video frames and then training on an image of concatenated frames (or concatenated crops of frames) the best or standard way to detect something in video? Or are there models that take a block of video as a single training point, rather than image models, that I should be looking at?
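To clarify what I mean by a "block of video" input, here is a sketch of the alternative format I'm imagining: the same 5 face crops kept stacked as a short clip tensor instead of tiled into one wide image. The single PyTorch Conv3d layer is purely illustrative, a stand-in for whatever video model would consume such a tensor, not a specific model suggestion:

```python
import torch
import torch.nn as nn

# Stand-ins for the 5 resized face crops (channels-first RGB).
crops = [torch.randn(3, 112, 112) for _ in range(5)]

# Stack along a time axis instead of concatenating spatially:
# shape (batch=1, channels=3, time=5, height=112, width=112).
clip = torch.stack(crops, dim=1).unsqueeze(0)

# Illustrative only: a 3D convolution mixes information across time as well
# as space, which a tiled 2D image can't do directly.
layer = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = layer(clip)  # shape (1, 16, 5, 112, 112)
```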