Classifying actions in video clips
The Moments in Time dataset consists of labeled 3-second .mp4 videos involving people, animals, objects, or natural phenomena. The target task is to recognize the actions and events that these actors produce. Note that an action (i.e., a verb) is harder to identify than an object (a noun): an action like “opening”, for example, can be performed by doors, curtains, eyes, etc.
Three seconds are usually enough for a human to identify an action in a video clip.
See Moments in Time Dataset: one million videos for event understanding for details. The full dataset contains 1M 3-second videos across 339 classes; a “mini” version has 130K videos across 200 classes and fits in less than 10 GB.
To generate sample .jpg frames from the .mp4s I used
ffmpy, a Python wrapper for FFmpeg, and
concurrent.futures.ThreadPoolExecutor to process multiple videos in parallel.
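A minimal sketch of that extraction step. For brevity this calls the ffmpeg binary via subprocess rather than through ffmpy; the seek offset, output directory, and worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import subprocess

def frame_command(video, out_dir="frames", offset=1.5):
    """Build an ffmpeg command that grabs one .jpg near the middle
    of a 3-second clip (offset and paths are assumptions)."""
    out = Path(out_dir) / (Path(video).stem + ".jpg")
    return ["ffmpeg", "-ss", str(offset), "-i", str(video),
            "-frames:v", "1", "-q:v", "2", "-y", str(out)]

def extract_all(videos, workers=8):
    """Extract one frame per video in parallel; each thread simply
    blocks on its own ffmpeg subprocess."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(
            lambda v: subprocess.run(frame_command(v), check=True),
            videos))
```

Threads (rather than processes) are fine here because the heavy work happens inside the external ffmpeg process, not in Python.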
The literature offers methods to create better single-image “summaries” of such brief videos, e.g. Dynamic Image Networks. Further, the video’s audio track could be used as another signal to improve classification.
In this task, a single action often admits several “good” labels. The pairs at the top of my confusion matrix included “cooking”/“barbecuing” and “bicycling”/“spinning”: cases where people may disagree on the single best descriptive action. A better classification metric is therefore top-N accuracy, e.g. top-2 or top-4.
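Top-N accuracy can be computed directly from the model’s per-class scores. A minimal NumPy sketch (function name and array shapes are my own):

```python
import numpy as np

def top_n_accuracy(scores, labels, n=2):
    """Fraction of samples whose true label is among the n
    highest-scoring classes.

    scores: (num_samples, num_classes) array of class scores.
    labels: length-num_samples sequence of true class indices.
    """
    # Indices of the n best-scoring classes per sample.
    top_n = np.argsort(scores, axis=1)[:, -n:]
    hits = [label in row for row, label in zip(top_n, labels)]
    return float(np.mean(hits))
```

With n=1 this reduces to ordinary accuracy; raising n credits the model when, say, “barbecuing” is its second choice behind “cooking”.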