Video Classification

Hi,

I’m training a video classification model with 8 classes. Each video contains 64 frames, and each frame is 600x600 pixels.
Since every video is quite big, I can only fit a batch size of 16 across 8 V100 GPUs (each GPU gets 2 videos), so the BatchNorm statistics are computed over only 2 videos per GPU rather than the full batch of 16, which gives me poor results.

Does anyone have an idea how to solve this?

Best Regards,
Yana

You can try to use Group Normalization instead of BatchNorm.
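If it helps, here is a minimal sketch of doing the swap on an existing model; the helper name `bn_to_gn` and the default of 32 groups are my own choices (the group count must divide each layer's channel count):

```python
import torch.nn as nn

def bn_to_gn(module: nn.Module, num_groups: int = 32) -> nn.Module:
    """Recursively replace BatchNorm2d/3d layers with GroupNorm."""
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm2d, nn.BatchNorm3d)):
            # GroupNorm normalizes within each sample, so its
            # statistics don't depend on the per-GPU batch size.
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            bn_to_gn(child, num_groups)
    return module
```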

Thanks for your answer, I will try it since it’s available now in PyTorch 0.4.

Do you know how I can convert the BatchNorm weights (the “running_mean” and “running_var” buffers) to GroupNorm weights (“weight” and “bias”)?

Yana

I don’t think that’s possible because BatchNorm and GroupNorm normalize along different axes.
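To illustrate with a tiny sketch (made-up shapes): BatchNorm keeps one running statistic per channel, pooled over the whole batch, while GroupNorm computes statistics per sample within channel groups, so there is no direct mapping between the two.

```python
import torch

x = torch.randn(2, 8, 4, 4)  # (N, C, H, W)

# BatchNorm2d: one mean per channel, pooled over N, H, W
bn_mean = x.mean(dim=(0, 2, 3))       # shape (8,)

# GroupNorm with 2 groups: one mean per sample and group,
# pooled over that group's channels and H, W
groups = x.view(2, 2, 4, 4, 4)        # (N, G, C//G, H, W)
gn_mean = groups.mean(dim=(2, 3, 4))  # shape (2, 2)
```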

Hi, I’m also interested in working with video. Would you mind telling me how you got started? I’ve done classification for images, but I’m unsure how you’d apply that concept to video (e.g. detecting events/actions that occur over multiple frames, like kicking a ball).

Any material or resources you could point me to would be a great help.

Hey,

From what I know and have read, most people working with video use 3D CNN architectures.
What helped me understand how to work with video is realizing that a video is just a set of images (frames) one after the other; in the end it’s an array of images. In image classification each sample is a single image; with video, a single sample is an array of images instead.
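A tiny sketch of that in PyTorch (the shapes here are illustrative, not from any particular model): a batch of clips is a 5D tensor, and Conv3d slides its kernel over time as well as space.

```python
import torch
import torch.nn as nn

# A batch of video clips: (batch, channels, frames, height, width).
# One sample is an array of images rather than a single image.
clips = torch.randn(2, 3, 64, 112, 112)

# Conv3d convolves over time as well as space, so it can learn
# motion patterns that span several frames.
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out = conv(clips)  # (2, 16, 64, 112, 112)
```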

There is this paper on the I3D and S3D architectures: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

If you are interested, there are already PyTorch implementations of I3D trained on the Kinetics dataset here:
kinetics-i3d-Pytorch, and here: kinetics_i3d_pytorch. I’m using the second one.

There are also implementations of the ResNet, ResNeXt, DenseNet, and Wide ResNet architectures (3D CNN modules) in PyTorch for video here:
3D-ResNets-PyTorch

Best Regards,
Tal

Hey, thanks very much for this great response, it’s really helpful.

Hi @thadar,

Thank you so much for the information. I am working on a project that needs to detect actions in real time from a camera, for example detecting a person putting something in a cart. Do you think the models you suggested above can work for this? My concern is that video classification models might not be designed with real-time constraints in mind.

Thank you in advance,

Hey,

I believe it can work. The models I suggested above (I3D and S3D) use optical flow, which increases classification accuracy on the Kinetics dataset.
So if I had to detect whether a person put something in a cart, for example, I would add the optical flow as well.
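For reference, the Quo Vadis paper computes its flow stream with the TV-L1 algorithm; as a rough stand-in, here is a sketch using OpenCV’s Farnebäck dense flow (the parameter values below are common defaults, not tuned):

```python
import cv2

def dense_flow(prev_gray, next_gray):
    """Dense optical flow between two consecutive grayscale frames.
    Inputs are (H, W) uint8 arrays; the output is an (H, W, 2)
    float32 array of per-pixel (dx, dy) motion."""
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```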

Hope this helps,
Tal

Thank you very much @thadar

Hi, has anyone tried to use the fastai library to detect video events?

I’m currently trying to detect events in video at faster than realtime.

The video decoding seems to be a real bottleneck. I’m not very experienced in measuring where bottlenecks are (disk read/write speed, bus bandwidth, decoding time, inference time), but I assume there must be some tools and techniques to help with this. Does anyone have any advice on pipeline optimization, or some monitoring tools which could show me which processes are taking the longest and why? I noticed that if I start using swap memory instead of RAM then things slow right down.

I’m wondering whether to convert all the frames of the video to JPEG first, then pass these into a model. Storing the frames as raw tensors takes up too much space (about 1 GB per second of 1080p video), so perhaps using ffmpeg to resize the frames and save them in a compressed format like JPEG would be quick and not take up too much space; the JPEGs could then be read by the dataloader without the need to decode the video. The reason I think this might end up being faster is that the dataloader doesn’t have access to ffmpeg, so I could use CPU multithreading to speed up the decoding stage by running ffmpeg first and saving to JPEGs, then use fastai to read the image files.
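A sketch of that preprocessing step (the 224x224 output size and the filename pattern are placeholders I picked):

```python
import subprocess
from pathlib import Path

def extract_jpegs(video_path: str, out_dir: str, size: int = 224):
    """Decode a video once with ffmpeg, resizing each frame and
    saving it as a JPEG, so the dataloader only reads image files."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"scale={size}:{size}",
         f"{out_dir}/frame_%05d.jpg"],
        check=True)
```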

Another approach I’ve tried is to store a dictionary of torchcodec video decoder objects, then have the dataloader look up the correct video frames from these decoder objects. I thought this would be a good option since the decoder objects don’t load the whole video into memory - they can be queried for specific frames and they return tensors - but this is quite slow too, since they still have to decode the video up to that frame.
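A minimal sketch of that pattern, assuming torchcodec’s VideoDecoder supports integer frame indexing, and simplifying each item to the first clip_len frames of one video:

```python
import torch
from torch.utils.data import Dataset
from torchcodec.decoders import VideoDecoder

class LazyClipDataset(Dataset):
    """One open decoder per file; frames are decoded on demand."""
    def __init__(self, video_paths, clip_len=16):
        self.decoders = [VideoDecoder(p) for p in video_paths]
        self.clip_len = clip_len

    def __len__(self):
        return len(self.decoders)

    def __getitem__(self, idx):
        dec = self.decoders[idx]
        # Indexing returns a (C, H, W) uint8 tensor; the decoder
        # still has to decode up to each requested frame.
        frames = [dec[i] for i in range(self.clip_len)]
        return torch.stack(frames)  # (clip_len, C, H, W)
```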

Hardware acceleration for video decoding using NVDEC on a consumer GPU is an option I’m looking into too.

If anyone has some tips on fast video decoding and inference strategies I’d be interested to hear! In the meantime I’ll keep trying out these approaches.

For inference from video data, I thought about using a 3D CNN kernel, and was told that this might be surprisingly slow. I’ll check it out though.

Another approach I have found is this image sequence classification tutorial from fastai:

It looks promising.

I also thought about subtracting tensors from one another in sequence to detect any motion, then just passing the resulting difference tensor into an image classification model and training on that.
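A quick sketch of that frame-differencing idea (the function and shapes are illustrative):

```python
import torch

def frame_difference(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, C, H, W) float tensor of frames in temporal order.
    Returns (T-1, C, H, W) differences; static background cancels
    out, leaving mostly the moving regions."""
    return clip[1:] - clip[:-1]
```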

Thanks for any ideas and advice! Let me know if you’re working on a similar project and we can share some ideas.