For some time I have been thinking about using deep learning (DL) to segment sports video (e.g. tennis). The algorithm should split a long video into many short segments corresponding to individual rallies, i.e. separate the parts of the video where the “ball is in play” from the parts where it is not (the majority of the time). Ideally, it should use not only the video input but also the corresponding audio input.
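To make the desired output concrete, here is a minimal sketch of the final step only. It assumes that some model (not shown, and the actual open question) already produces a per-frame probability that the ball is in play. The function name `probs_to_rallies` and the parameters `threshold`, `min_len_s` and `max_gap_s` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def probs_to_rallies(
    probs: Sequence[float],
    fps: float,
    threshold: float = 0.5,   # assumed cut-off for "ball in play"
    min_len_s: float = 1.0,   # assumed minimum rally duration in seconds
    max_gap_s: float = 0.5,   # assumed longest gap that still counts as one rally
) -> List[Tuple[float, float]]:
    """Turn per-frame 'in play' probabilities into (start_s, end_s) segments."""
    active = [p >= threshold for p in probs]

    # Collect runs of consecutive active frames as [start, end) frame indices.
    runs: List[Tuple[int, int]] = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(active)))

    # Merge runs separated by a short gap (e.g. a few misclassified frames).
    max_gap = int(max_gap_s * fps)
    merged: List[Tuple[int, int]] = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Drop segments too short to be a rally and convert frames to seconds.
    min_len = int(min_len_s * fps)
    return [(s / fps, e / fps) for s, e in merged if e - s >= min_len]


if __name__ == "__main__":
    # Synthetic probabilities at 10 fps, purely for illustration.
    demo = [0.0] * 5 + [0.9] * 20 + [0.1] * 2 + [0.8] * 15 + [0.0] * 10
    print(probs_to_rallies(demo, fps=10))
```

The thresholding and gap merging here are a simple heuristic, not a recommendation. The per-frame probabilities could equally come from video, audio, or both.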
I would greatly appreciate any hints on how to approach this problem, or pointers to existing work on a similar case.
Thanks a lot, kind regards,