I'm also interested in this. I've started building it myself.
I'm running a truncated CNN on each frame of the video and taking the final bottleneck activations, which gives a vector of ~25k numbers per frame. Then I run an LSTM over the sequence of frame vectors to compute a single vector for the snippet. I'm also working on stacking LSTMs. The final hidden state of the LSTM then feeds a softmax layer that predicts the class.
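For anyone following along, here is a rough sketch of the LSTM-plus-softmax half of that pipeline. It is only an illustration under some loud assumptions: it uses plain Python instead of a real framework, the CNN bottleneck features are replaced by random stand-in vectors, the sizes (8 features, 4 hidden units, 3 classes, 5 frames) are tiny made-up values rather than the ~25k mentioned above, and every name in it (`init_lstm`, `lstm_final_hidden`, `classify_snippet`) is hypothetical. The weights are random, so it shows only the forward pass, not training.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_lstm(input_size, hidden_size, rng):
    # One weight matrix per gate, applied to the concatenation [x_t, h_prev].
    # "i", "f", "o" are the input/forget/output gates; "g" is the candidate cell.
    return {
        gate: {
            "W": rand_matrix(hidden_size, input_size + hidden_size, rng),
            "b": [0.0] * hidden_size,
        }
        for gate in ("i", "f", "o", "g")
    }


def lstm_final_hidden(frames, params, hidden_size):
    """Run a single-layer LSTM over the frame vectors; return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in frames:
        z = list(x) + h
        pre = {
            gate: [a + b for a, b in zip(matvec(p["W"], z), p["b"])]
            for gate, p in params.items()
        }
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        # Standard LSTM update: forget part of the old cell, add gated candidate.
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_snippet(frames, lstm_params, W_out, b_out, hidden_size):
    """Final hidden state -> linear layer -> softmax over classes."""
    h = lstm_final_hidden(frames, lstm_params, hidden_size)
    logits = [a + b for a, b in zip(matvec(W_out, h), b_out)]
    return softmax(logits)


# Assumed toy sizes; real bottleneck features would be far larger.
FEATURE_SIZE, HIDDEN_SIZE, NUM_CLASSES, NUM_FRAMES = 8, 4, 3, 5

rng = random.Random(0)
# Stand-in for per-frame CNN bottleneck features (random numbers, not real data).
frames = [[rng.gauss(0, 1) for _ in range(FEATURE_SIZE)] for _ in range(NUM_FRAMES)]
lstm_params = init_lstm(FEATURE_SIZE, HIDDEN_SIZE, rng)
W_out = rand_matrix(NUM_CLASSES, HIDDEN_SIZE, rng)
b_out = [0.0] * NUM_CLASSES

probs = classify_snippet(frames, lstm_params, W_out, b_out, HIDDEN_SIZE)
print(probs)  # one probability per class
```

In practice you would probably use a framework's built-in LSTM layer and train everything with backprop. Stacking would mean having the first LSTM return its hidden state at every frame, not just the last one, and feeding that sequence into a second LSTM.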