Classification of video

I was wondering if there is any project/code I can use to classify video snippets.
What I want to do is annotate a juggling video by identifying the individual juggling tricks in it. Does anyone have any idea how to do this?
I’m interested in the theory side (which architecture) and the practical side as well (if there’s any code out there that I can use to build a prototype of this quickly).

I’m also interested in this. I’ve started building it myself.

I’m running a truncated CNN on the frames of the video to get a vector of ~25k numbers from the final bottleneck representing each frame. Then I run an LSTM over the sequence of frame vectors to compute a single vector for the snippet. I’m also working on stacking LSTMs. Finally, I feed the final hidden state of the LSTM into a softmax layer over the classes I’m predicting.
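
For anyone who wants to try this, here is a rough PyTorch sketch of the LSTM part only. It assumes the per-frame CNN features have already been extracted and stacked into a tensor. The class name `SnippetClassifier` and all the sizes are made up for illustration (the feature size is kept far below ~25k so the example stays small), so treat it as a starting point rather than the actual model described above.

```python
import torch
import torch.nn as nn


class SnippetClassifier(nn.Module):
    """Hypothetical sketch: LSTM over per-frame features, then a linear layer over classes."""

    def __init__(self, feature_dim, hidden_dim, num_classes, num_layers=1):
        super().__init__()
        # num_layers > 1 stacks LSTMs on top of each other
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim)
        _, (h_n, _) = self.lstm(frame_features)
        # h_n: (num_layers, batch, hidden_dim); take the top layer's final state
        return self.head(h_n[-1])  # raw logits, one row per snippet


# Illustrative sizes only (assumptions, not values from this thread)
model = SnippetClassifier(feature_dim=512, hidden_dim=128,
                          num_classes=10, num_layers=2)
features = torch.randn(4, 16, 512)    # 4 snippets, 16 frames each
logits = model(features)
probs = torch.softmax(logits, dim=1)  # class probabilities per snippet
```

When training with `nn.CrossEntropyLoss`, pass the raw `logits` rather than `probs`, since that loss applies the softmax internally.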

Does it work so far?

@markovbling: Any progress?

Yeah, it works! Initially I just ran a ResNet on each frame and then applied a sort of filter over the outputs to smooth the predictions. Adding the LSTM gave a decent improvement too. I’m writing a paper on it and will post when the draft is done…
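
The post does not say which filter was used to smooth the per-frame predictions. One simple option, shown here purely as an assumption, is a sliding-window majority vote over the predicted labels. The function name `smooth_labels` and the window size are hypothetical.

```python
from collections import Counter


def smooth_labels(labels, window=5):
    """Replace each frame label with the most common label in a centered window.

    Assumed stand-in for the unspecified smoothing filter; `window` should be odd.
    """
    half = window // 2
    smoothed = []
    for i, current in enumerate(labels):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        counts = Counter(labels[lo:hi])
        best = max(counts.values())
        # On ties, keep the frame's own label to avoid needless flips
        if counts[current] == best:
            smoothed.append(current)
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# A single stray "shower" frame inside a run of "cascade" gets voted out
frame_labels = ["cascade", "cascade", "shower", "cascade", "cascade"]
smoothed = smooth_labels(frame_labels, window=3)
```

A larger window removes more flicker but also blurs short tricks, so the window size would need tuning against the shortest trick you want to detect.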

I would be very interested in seeing a video with the results from this!

@markovbling did you end up writing a paper on this? I’m very interested!

I’m hoping to do something similar: my goal is to find all instances of a particular event in a YouTube video and generate a timestamp for each one. My instinct is to use a CNN for shot boundary detection. This video was helpful: https://www.youtube.com/watch?v=xRLeLQV8kL8
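
As a small illustration of the timestamp step, here is a sketch that assumes you already have one predicted label per frame (from whatever model you end up using) and a known frame rate. It collapses consecutive identical labels into segments with start and end times in seconds. The name `label_segments` and the example labels are made up.

```python
from itertools import groupby


def label_segments(labels, fps):
    """Collapse runs of identical per-frame labels into (label, start_s, end_s) tuples."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start / fps, (start + length) / fps))
        start += length
    return segments


# Hypothetical per-frame predictions at 2 frames per second
frame_labels = ["idle", "idle", "event", "event", "event", "idle"]
segments = label_segments(frame_labels, fps=2)

# Keep only the timestamps of the event of interest
event_times = [(s, e) for label, s, e in segments if label == "event"]
```

This only handles the bookkeeping. How good the timestamps are depends entirely on the per-frame predictions fed into it.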

@cmac ICYMI, the speaker in the video was one of the first to release a paper on SBD (Shot Boundary Detection). Since then, there have been a couple of notable papers in the field that report more robust results (with more complex models and datasets):

  1. Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks. Code is available here.
  2. Fast Video Shot Transition Localization with Deep Structured Models. I vaguely understand the model they introduced here… one could argue it’s a bit overkill. They also introduced a new, complex dataset that’s much more challenging than the speaker’s.
    They never released their own pretrained model, but they did release pretrained models of other architectures trained on their dataset, which are available on their GitHub repo.

I would personally like to work on deploying their pretrained models to split my own videos, but I haven’t gotten around to it yet. I have done some non-DL work on this though, available here.