Video Classification


I’m training a video classification model with 8 classes. Each video contains 64 frames, and each frame is 600x600 pixels.
Since every video is quite large, I can only use a batch size of 16 on 8 V100 GPUs (each GPU gets 2 randomly chosen videos). As a result, the BatchNormalization layers compute their statistics over 2 videos per GPU rather than over the entire batch of 16, which gives me poor results.

Does anyone have an idea how to solve this?

Best Regards,


You can try to use Group Normalization instead of BatchNorm.
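As a rough illustration of this suggestion, here is a minimal sketch that swaps every BatchNorm3d layer in a model for GroupNorm. The helper name `replace_bn_with_gn`, the default of 8 groups, and the toy model are assumptions made for the example, not something from this thread. GroupNorm computes its statistics per sample, so it does not depend on how many videos land on each GPU.

```python
import torch
import torch.nn as nn


def replace_bn_with_gn(module, num_groups=8):
    """Recursively swap every BatchNorm3d layer for a GroupNorm layer (hypothetical helper)."""
    # Copy the children into a list so the module can be modified while looping.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.BatchNorm3d):
            channels = child.num_features
            # The group count must divide the channel count, so step down
            # to the largest divisor that does not exceed num_groups.
            groups = min(num_groups, channels)
            while channels % groups != 0:
                groups -= 1
            setattr(module, name, nn.GroupNorm(groups, channels))
        else:
            replace_bn_with_gn(child, num_groups)
    return module


# Toy 3D model standing in for a real video network.
model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm3d(16),
    nn.ReLU(),
)
model = replace_bn_with_gn(model)

# Two small clips shaped (batch, channels, frames, height, width).
clip = torch.randn(2, 3, 4, 32, 32)
out = model(clip)
```

The new GroupNorm layers start from their default initialization, so the network would normally need to be trained or fine-tuned after the swap.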

Thanks for your answer. I will try it, since it’s now available in PyTorch 0.4.

Do you know how I can convert the BatchNorm weights (which are “running_mean” and “running_var”) to GroupNorm weights (which are “weight” and “bias”)?


I don’t think that’s possible because BatchNorm and GroupNorm normalize along different axes.
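To make the difference in axes concrete, here is a small sketch with an arbitrary tensor shape and group count chosen for illustration. It computes the means each layer would normalize with and lists the buffers each layer stores. Both layers carry a learnable per-channel `weight` and `bias` of the same shape, but only BatchNorm keeps running statistics, and those are per channel across the batch, whereas GroupNorm computes statistics per sample and per group at every forward pass.

```python
import torch
import torch.nn as nn

# Two clips shaped (batch, channels, frames, height, width).
x = torch.randn(2, 16, 4, 8, 8)

bn = nn.BatchNorm3d(16)
gn = nn.GroupNorm(num_groups=4, num_channels=16)

# BatchNorm: one mean per channel, taken over the batch and all positions.
bn_mean = x.mean(dim=(0, 2, 3, 4))        # shape (16,)

# GroupNorm: one mean per sample and per group of 4 consecutive channels.
gn_mean = x.view(2, 4, -1).mean(dim=2)    # shape (2, 4)

# Buffers (non-learned state) stored by each layer.
bn_buffers = sorted(name for name, _ in bn.named_buffers())
gn_buffers = sorted(name for name, _ in gn.named_buffers())

print(bn_mean.shape, gn_mean.shape)
print(bn_buffers, gn_buffers)
```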

Hi, I’m also interested in working with video. Would you mind telling me how you got started? I’ve done classification for images, but I’m unsure how you’d apply that concept to video (e.g. detecting events/actions that occur over multiple frames, like kicking a ball).

Any material or resources you could point me to would be a great help.



From what I know and have read, most people working with video use 3D CNN architectures.
What helped me understand how to work with video is that a video is ultimately a sequence of images (frames), one after the other, so it can be treated as an array of images. In image classification each sample is a single image; with video, each sample is an array of images instead.
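A minimal sketch of that idea in PyTorch: a clip is a stack of frames, and 3D convolutions slide over time as well as height and width. The class `TinyVideoNet`, the frame size, and the clip length are made up for illustration and are far smaller than a real model such as I3D.

```python
import torch
import torch.nn as nn

# Eight random RGB "frames", each shaped (channels, height, width).
frames = [torch.randn(3, 32, 32) for _ in range(8)]

# Stack along a new time axis: (channels, frames, height, width).
clip = torch.stack(frames, dim=1)

# Add a batch axis: (batch, channels, frames, height, width).
batch = clip.unsqueeze(0)


class TinyVideoNet(nn.Module):
    """Toy 3D CNN classifier (hypothetical, for illustration only)."""

    def __init__(self, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1),  # convolves over time and space
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                    # pool to one value per channel
        )
        self.classifier = nn.Linear(8, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)  # (batch, 8)
        return self.classifier(x)        # (batch, num_classes)


model = TinyVideoNet()
logits = model(batch)
print(batch.shape, logits.shape)
```

The only change from an image classifier is the extra frame axis in the input and the use of `Conv3d` in place of `Conv2d`.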

This paper introduces the I3D architecture (S3D is a later variant built on it): Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

If you are interested, there are already PyTorch implementations of I3D trained on the Kinetics dataset here:
kinetics-i3d-Pytorch and here: kinetics_i3d_pytorch. I’m using the second one.

There are also PyTorch implementations of ResNet, ResNeXt, DenseNet and Wide ResNet architectures (as 3D CNN modules) for video here:

Best Regards,


Hey, thanks very much for this great response, it’s really helpful.

Hi @thadar,

Thank you so much for your information. I am working on a project that needs to detect actions in real time using a camera, for example detecting that a person puts something in a cart. Do you think it can work with the models you suggested above? I ask because I think video models might not be designed with real-time use in mind.

Thank you in advance,



I believe it can work. The models I suggested above (I3D and S3D) can use optical flow as an additional input, which increases classification accuracy on the Kinetics dataset.
So if I had to detect whether a person puts something in a cart, for example, I would add optical flow as well.
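One common way to add optical flow is a two-stream setup: one network sees the RGB frames, a second sees the flow fields, and their predictions are averaged. The sketch below assumes tiny stand-in networks (the hypothetical `make_stream` helper), two classes, and random tensors in place of real frames and real flow. In practice the flow would come from an optical-flow algorithm run on consecutive frames, which adds compute that matters for real-time use.

```python
import torch
import torch.nn as nn


def make_stream(in_channels, num_classes=2):
    """Tiny stand-in for a 3D CNN stream (hypothetical, not I3D itself)."""
    return nn.Sequential(
        nn.Conv3d(in_channels, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(8, num_classes),
    )


rgb_stream = make_stream(in_channels=3)   # RGB frames have 3 channels
flow_stream = make_stream(in_channels=2)  # flow has 2 channels: x and y motion

# Random stand-ins shaped (batch, channels, frames, height, width).
rgb_clip = torch.randn(1, 3, 8, 32, 32)
flow_clip = torch.randn(1, 2, 8, 32, 32)

with torch.no_grad():
    rgb_logits = rgb_stream(rgb_clip)
    flow_logits = flow_stream(flow_clip)
    # Late fusion: average the two streams' scores, then convert to probabilities.
    fused = (rgb_logits + flow_logits) / 2
    probs = fused.softmax(dim=1)

print(probs)
```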

Hope this helps,


Thank you very much @thadar