I’m training a video classification model with 8 classes. Each video has 64 frames, and each frame is 600x600 pixels.
Since each video is quite large, I can only fit a batch size of 16 across 8 V100 GPUs (each GPU gets 2 videos at random). As a result, the BatchNorm statistics are computed over only the 2 videos on each GPU rather than over the whole batch of 16, which gives me poor results.
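One common fix for this, assuming you train with one process per GPU under DistributedDataParallel, is `torch.nn.SyncBatchNorm`, which synchronizes batch-norm statistics across all GPUs so they are computed over the full batch of 16. A minimal sketch (the tiny 3D conv model here is a hypothetical stand-in for the real video network):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the actual video model.
model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm3d(16),
    nn.ReLU(),
)

# Recursively replaces every BatchNorm*d layer with SyncBatchNorm, which
# all-reduces mean/variance across processes during training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm

# In the real training script (one process per GPU) you would then wrap it:
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[rank])
```

Note that SyncBatchNorm only takes effect inside an initialized distributed process group; outside of one it behaves like regular BatchNorm.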
Hi, I’m also interested in working with video. Would you mind telling me how you got started? I’ve done image classification, but I’m unsure how to apply that concept to video (e.g. detecting events/actions that span multiple frames, like kicking a ball).
Any material or resources you could point me to would be a great help.
From what I know and have read, most people working with video use 3D CNN architectures.
What helped me understand how to work with video is realizing that a video is just a sequence of images (frames) one after the other, i.e. an array of images. In image classification each sample is a single image; for video, each sample is instead an array of images.
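Concretely, in PyTorch this usually means a 5-D tensor: the image sample `(channels, height, width)` just gains a time dimension. A small sketch using the shapes from this thread (64 frames of 600x600 RGB):

```python
import torch

# A batch of 2 videos as a 5-D tensor: (batch, channels, frames, height, width).
batch = torch.randn(2, 3, 64, 600, 600)

# A single image sample would be (channels, height, width); a single video
# sample just adds a time dimension in between.
video_sample = batch[0]
print(video_sample.shape)  # torch.Size([3, 64, 600, 600])
```

3D convolutions then slide over the time dimension as well as the spatial ones, which is how they pick up motion across frames.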
If you are interested, there are already PyTorch implementations of I3D trained on the Kinetics dataset here: kinetics-i3d-Pytorch, and here: kinetics_i3d_pytorch. I’m using the second one.
There are also PyTorch implementations of 3D CNN versions of the ResNet, ResNeXt, DenseNet, and WideResNet architectures for video here: 3D-ResNets-PyTorch
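If you just want to see the overall shape of such a network before digging into those repos, here is a minimal, hypothetical 3D CNN classifier (not the code from the linked repositories) built from `nn.Conv3d`/`nn.BatchNorm3d` blocks:

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Toy 3-D CNN sketch: two conv blocks, global pooling, linear head."""

    def __init__(self, num_classes: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # convolves over T, H, W
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                              # halves T, H, W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                      # global average pool
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.classifier(self.features(x).flatten(1))

model = Tiny3DCNN(num_classes=8)
logits = model(torch.randn(2, 3, 8, 32, 32))  # small dummy clips
print(logits.shape)  # torch.Size([2, 8])
```

The real architectures above are much deeper, but the pattern (stacked 3D conv blocks, pooling over time and space, a linear classifier) is the same.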
Thank you so much for the information. I’m working on a project that needs to detect actions in real time from a camera, for example detecting when a person puts something in a cart. Do you think that can work with the models you suggested above? Most video work doesn’t seem very concerned with real-time constraints.
I believe it can work. The models I suggested above (I3D and S3D) use optical flow, which increases classification accuracy on the Kinetics dataset.
So if I had to detect whether a person put something in a cart, for example, I would add optical flow as well.