Hi, has anyone tried to use the fastai library to detect video events?
I’m currently trying to detect events in video at faster than realtime.
The video decoding seems to be a real bottleneck. I’m not very experienced in measuring where bottlenecks are (disk read/write speed, bus bandwidth, decode time, inference time), but I assume there must be tools and techniques to help with this. Does anyone have advice on pipeline optimization, or monitoring tools that could show me which stages are taking the longest and why? I’ve noticed that once I start hitting swap instead of RAM, everything slows right down.
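The crudest thing I’ve come up with so far is timing each stage by hand while watching htop / iostat / nvidia-smi on the side. Something like this (the stage functions here are placeholders for my own decode/preprocess/infer steps):

```python
import time
import torch

def timed(label, fn, *args, **kwargs):
    """Run fn and print how long it took; crude, but enough to find the slow stage."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    # If the work happens on the GPU, synchronize so the timing is honest
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return out

# Placeholder stage functions -- substitute your own pipeline steps:
# frames = timed("decode", decode_clip, "video.mp4")
# batch  = timed("preprocess", preprocess, frames)
# preds  = timed("inference", model, batch)
```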
I’m wondering whether to convert all the frames of the video to jpeg first, then pass these into a model. Storing the frames as raw tensors takes up too much space (about 1GB per second of 1080p video), so perhaps using ffmpeg to resize and save in a compressed format like jpeg would be quick and wouldn’t take up too much space; the jpegs could then be read by the dataloader without any need to decode video. The reason I think this might end up being faster is that the dataloader doesn’t have access to ffmpeg, so by decoding with ffmpeg up front (where I can use CPU multithreading to speed up the decode stage) and saving to jpegs, fastai only ever has to read image files.
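Concretely, something like this is what I’m picturing (the resize height and jpeg quality are arbitrary choices, and `extract_jpegs` is just my own helper name):

```python
import subprocess
from pathlib import Path

def extract_jpegs(video_path, out_dir, height=360, quality=3):
    """Decode once with ffmpeg, resizing and saving every frame as a jpeg.
    -q:v 3 is near-visually-lossless; raise it for smaller files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", str(video_path),
        "-vf", f"scale=-2:{height}",   # keep aspect ratio, fixed height
        "-q:v", str(quality),
        str(out_dir / "frame_%06d.jpg"),
    ], check=True)

# extract_jpegs("clip.mp4", "frames/clip")
# The resulting jpegs can then go into fastai's usual image pipeline,
# e.g. a DataBlock with get_image_files, or ImageDataLoaders.from_folder.
```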
Another approach I’ve tried is to store a dictionary of torchcodec video decoder objects and have the dataloader look up the correct video frames from those decoders. I thought this would be a good option since the decoder objects don’t load the whole video into memory: they can be queried for specific frames and they return tensors. But this is quite slow too, since a random seek still has to decode from the nearest keyframe up to the requested frame.
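In code it looks roughly like this (assuming torchcodec’s indexing API; `video_paths` is whatever list of files you have):

```python
from torchcodec.decoders import VideoDecoder

# One decoder per file; nothing is decoded until frames are requested
decoders = {path: VideoDecoder(path) for path in video_paths}

def get_frame(path, idx):
    """Index into the decoder to get one frame as a (C, H, W) uint8 tensor."""
    return decoders[path][idx]
```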
Hardware acceleration for video decoding using NVDEC on a consumer GPU is another option I’m looking into.
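From what I’ve read, torchcodec can route decoding through NVDEC by passing a device argument (this needs an FFmpeg build with NVDEC support, so treat this as an unverified sketch):

```python
from torchcodec.decoders import VideoDecoder

# Decode on the GPU via NVDEC; frames come back as CUDA tensors,
# so they can go straight into the model without a host->device copy
decoder = VideoDecoder("clip.mp4", device="cuda")
frame = decoder[0]  # uint8 tensor already on cuda:0
```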
If anyone has tips on fast video decoding and inference strategies I’d be interested to hear them! In the meantime I’ll keep trying out these approaches.
For inference from video data, I thought about using a 3D CNN kernel, but was told that this might be surprisingly slow. I’ll check it out though.
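If I do try it, I’d probably start from torchvision’s pretrained 3D ResNet rather than writing my own kernel. Something like this (the clip here is a dummy tensor in the model’s expected (batch, channels, frames, height, width) layout):

```python
import torch
from torchvision.models.video import r3d_18

# Pretrained 3D ResNet-18 from Kinetics-400
model = r3d_18(weights="KINETICS400_V1").eval()

clip = torch.randn(1, 3, 16, 112, 112)  # 16-frame clip at the model's native 112x112
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # (1, 400) -- Kinetics-400 class scores
```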
Another approach I’ve found is this image sequence classification tutorial from fastai:
It looks promising.
I also thought about subtracting consecutive frame tensors from one another to detect motion, then just passing the resulting difference tensor into an image classification model and training on that.
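That is, something like this (a minimal sketch; the classifier call at the end is a placeholder):

```python
import torch

def motion_frames(frames):
    """frames: (T, C, H, W) float tensor of consecutive frames.
    Returns (T-1, C, H, W) absolute differences -- static background
    cancels out, so only moving regions keep any signal."""
    return (frames[1:] - frames[:-1]).abs()

# diffs = motion_frames(clip)       # each diff is just an image, so a
# preds = image_classifier(diffs)   # normal 2D classifier can consume it
```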
Thanks for any ideas and advice! Let me know if you’re working on a similar project and we can share some ideas.