I am trying to build a CV model for detecting objects in videos. I have about 6 videos with the content I need to train my model. I’m trying to detect things like lanes, other vehicles, etc.
I’m curious about the format of the dataset I need to train my model with. I can either extract every frame of every video as an image and build a large repository of images to train on, or I can use the videos directly. Which way do you think is better?
I have seen arguments on both sides (it’s just images vs it’s a flow/video).
From the more recent papers I have glanced at, treating the input as video seems to be more helpful: the model can correlate objects from one frame to the next, which both reduces the work it has to do and improves accuracy.
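To make the two dataset layouts concrete, here is a minimal sketch that builds only sample indices, not pixels. The file names, the frame counts, and the `clip_len`/`stride` values are made up for illustration, and a real pipeline would still need a video decoder and labels on top of this.

```python
from typing import Dict, List, Tuple


def frame_samples(frame_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Image-style dataset: every frame is an independent sample."""
    return [(video, i) for video, n in frame_counts.items() for i in range(n)]


def clip_samples(
    frame_counts: Dict[str, int], clip_len: int = 4, stride: int = 2
) -> List[Tuple[str, List[int]]]:
    """Video-style dataset: each sample is a run of consecutive frames
    taken from a single video, so temporal order is preserved."""
    clips = []
    for video, n in frame_counts.items():
        # Only start a clip where clip_len frames are still available.
        for start in range(0, n - clip_len + 1, stride):
            clips.append((video, list(range(start, start + clip_len))))
    return clips


# Hypothetical videos and frame counts, purely for illustration.
frame_counts = {"drive_01.mp4": 10, "drive_02.mp4": 7}
images = frame_samples(frame_counts)
clips = clip_samples(frame_counts, clip_len=4, stride=2)
```

The image-style list loses the link between neighboring frames once it is shuffled, while each clip keeps a short window of consecutive frames together, which is what a video model needs in order to correlate objects across frames.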
So if you are trying to track objects that are likely to move rapidly, I think treating the input as video is likely to give you better results, because the model can match objects between frames and use information from prior frames for greater accuracy. This also helps with motion blur and object occlusion, since the model can fall back on what it saw of the object in neighboring frames.
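As a toy illustration of why prior frames help, the sketch below smooths one object's per-frame detection confidence with an exponential moving average. This is an assumption-laden stand-in, not what any particular paper does (video detectors usually learn how to combine information across frames rather than averaging scores), and the confidence values and `alpha` are invented.

```python
from typing import List


def smooth_confidences(per_frame: List[float], alpha: float = 0.5) -> List[float]:
    """Exponential moving average over the per-frame confidences of one
    tracked object; alpha is the weight given to the current frame."""
    smoothed: List[float] = []
    running = None
    for conf in per_frame:
        if running is None:
            running = conf  # first frame has no history to blend with
        else:
            running = alpha * conf + (1 - alpha) * running
        smoothed.append(running)
    return smoothed


# Invented scores: the third frame is blurred, so its raw score drops.
raw = [0.9, 0.9, 0.2, 0.9]
smoothed = smooth_confidences(raw, alpha=0.5)

threshold = 0.5
kept_raw = [c >= threshold for c in raw]
kept_smoothed = [c >= threshold for c in smoothed]
```

With a per-frame threshold of 0.5, the blurred frame would be dropped when each image is scored on its own, but it survives once the earlier frames are taken into account.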
Here are a couple of very recent, relevant papers that claim SOTA on some datasets: