Efficient MP4 to Images for training

Hi folks,

I am working on a classification project with a large number (1000s) of short videos, totalling roughly 20 GB of data. I have been trying to find an efficient way to process the videos for use with fastai's dataset loader. My basic idea:

  1. Load each video with OpenCV
  2. Grab a few frames per second
  3. Either save as JPEG (StackOverflow1) or convert directly to tensors (StackOverflow2) and save to disk (rough sketch of my extraction loop below)
  4. Train classifier on these transformed images.
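Roughly, the extraction loop I have now looks like this (a rough sketch; the sampling rate, filenames, and output directory are just placeholders):

import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, sample_fps=2):
    # Step 1: open the video with OpenCV
    cap = cv2.VideoCapture(str(video_path))
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(round(native_fps / sample_fps)), 1)  # Step 2: keep a few frames per second

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # Step 3: save the sampled frame as JPEG
            cv2.imwrite(str(out_dir / f"{Path(video_path).stem}_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved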

For an 11 MB video, I got roughly 2 GB when trying to save as a numpy tensor (‘uint8’) or as JPEGs.

This seems like a really inefficient process to me. Any guidance on how I might better approach this project? Thanks!

2 Likes

How much did you get when saving as JPEGs? It should be smaller than saving as numpy arrays.

My suggestion would be to actually find a way to keep them compressed as videos, since that’s the best compression you can get. Without it, you’d lose the temporal encoding/compression.

You could write a DataLoader that keeps the file path to each video and its frame count somewhere (you can get that using ffprobe or OpenCV). Now, sampling from them depends on your application. If you're doing video classification, you will probably sample videos from the video list, and from each video extract a snippet, say 25 frames, making a tensor of [N, 25, 255, 255, 3]. Another alternative would be to take the total frame count across all videos and sample an integer that maps to a video + frame_id. However, I think the first option is better because it helps to select more diverse videos when some have more frames than others.

It's a bit expensive to make random accesses in a video (they're best read sequentially), but pre-fetching with more workers will help with throughput.
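A minimal sketch of the first option (the 25-frame snippet length, random start position, and last-frame padding are just example choices; you'd still add resizing/normalization on top):

import random
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class VideoSnippetDataset(Dataset):
    def __init__(self, video_paths, labels, snippet_len=25):
        self.video_paths = video_paths   # paths to the compressed videos
        self.labels = labels             # one label per video
        self.snippet_len = snippet_len
        # cache the frame count per video (could also use ffprobe)
        self.frame_counts = []
        for p in video_paths:
            cap = cv2.VideoCapture(str(p))
            self.frame_counts.append(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)))
            cap.release()

    def __len__(self):
        return len(self.video_paths)     # sample per video, not per frame

    def __getitem__(self, idx):
        n_frames = self.frame_counts[idx]
        start = random.randint(0, max(n_frames - self.snippet_len, 0))
        cap = cv2.VideoCapture(str(self.video_paths[idx]))
        cap.set(cv2.CAP_PROP_POS_FRAMES, start)
        frames = []
        for _ in range(self.snippet_len):
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        # pad by repeating the last frame if the video was too short
        while len(frames) < self.snippet_len:
            frames.append(frames[-1])
        snippet = torch.from_numpy(np.stack(frames))   # [snippet_len, H, W, 3], uint8
        return snippet, self.labels[idx]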

2 Likes

Thank you for the great response @konwnad. Coding up my own dataloader sounds intimidating, but I will check out the source code to see if I can grok it.

1 Like

I did some work with video recently and used ffmpeg to export the frames to JPEG. However, upon closer inspection, those images contained unwanted artifacts (even with minimal compression). Exporting as PNG solved this but it obviously also made the image files much larger.

So if you’re going the JPEG route, make sure the exported frames are good enough quality.
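For reference, the kind of ffmpeg invocation I mean, wrapped in Python (the sampling rate and quality settings are only examples; tune them for your data):

import subprocess
from pathlib import Path

def export_frames(video_path, out_dir, fps=2, as_png=False):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ext = "png" if as_png else "jpg"
    cmd = ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}"]
    if not as_png:
        cmd += ["-qscale:v", "2"]   # 2 is near the best JPEG quality; larger values compress harder
    cmd += [str(out_dir / f"frame_%05d.{ext}")]
    subprocess.run(cmd, check=True)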

(Personally, I’d try writing a data loader that decompresses the movies on-the-fly, as suggested earlier.)

3 Likes

I'm going through this same process and I can't believe nobody has faced these issues before. Maybe the rest of the world simply uses PyTorch dataloaders (which should support video)?
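For instance, torchvision can decode a clip (or a time window of it) straight into a tensor; something like this, though I haven't benchmarked it against a JPEG pipeline:

import torchvision

# decode the first two seconds of a clip to a uint8 tensor of shape [T, H, W, C]
frames, _, info = torchvision.io.read_video("clip.mp4", start_pts=0, end_pts=2.0, pts_unit="sec")
print(frames.shape, info.get("video_fps"))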

It sounds like Nvidia DALI is what you’re looking for.
Example:

import cv2
import torch
from fastai.vision import *                       # fastai v1: provides Path.ls() and DataBunch
from torch.utils.data import Dataset, DataLoader

class vidSet(Dataset):
    def __init__(self, videos_path):
        self.video_paths = videos_path.ls()

        # keep one VideoCapture open per video and index every (video, frame) pair
        self.caps = [cv2.VideoCapture(str(video_path)) for video_path in self.video_paths]
        self.images = [[capid, framenum]
                       for capid, cap in enumerate(self.caps)
                       for framenum in range(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)))]

        self.labels = [0 for _ in range(len(self.images))]  # placeholder -- whatever your needs are

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        capid, framenum = self.images[idx]
        cap = self.caps[capid]
        cap.set(cv2.CAP_PROP_POS_FRAMES, framenum)
        res, frame = cap.read()

        img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        label = self.labels[idx]

        img_tensor = torch.from_numpy(img).permute(2, 0, 1).float()  # /255, -mean, /std ... do your things with the image
        label_tensor = torch.as_tensor(label)

        return img_tensor, label_tensor

train_path = Path('train')
valid_path = Path('valid')

vidset_train = vidSet(train_path)
vidset_valid = vidSet(valid_path)

vidloader_train = DataLoader(vidset_train, batch_size=64, shuffle=True)
vidloader_valid = DataLoader(vidset_valid, batch_size=64, shuffle=False)

data = DataBunch(vidloader_train, vidloader_valid, device='cuda')

3 Likes

Can I modify this code to do a two-frame input, one-frame output network?

For the input:

Modify in __init__:
self.images = [[capid, framenum] for capid, cap in enumerate(self.caps) for framenum in range(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - 1)]

Modify in __getitem__:

 res, frame_0 = cap.read()      # frame at framenum (the read already in the original __getitem__)
 res_1, frame_1 = cap.read()    # the following frame

 img_0 = cv2.cvtColor(frame_0, cv2.COLOR_BGR2RGB)
 img_1 = cv2.cvtColor(frame_1, cv2.COLOR_BGR2RGB)

 img_0_tensor = torch.from_numpy(img_0).permute(2, 0, 1).float()  # /255, -mean, /std ... do your things with the image
 img_1_tensor = torch.from_numpy(img_1).permute(2, 0, 1).float()
 img_tensor = torch.cat((img_0_tensor, img_1_tensor), dim=0)      # dim=0 stacks along channels -> a 6-channel input (I cannot test this right now)

Labeling goes the same way…
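Putting it together, my modified __getitem__ would look roughly like this (untested sketch):

def __getitem__(self, idx):
    capid, framenum = self.images[idx]
    cap = self.caps[capid]
    cap.set(cv2.CAP_PROP_POS_FRAMES, framenum)
    res, frame_0 = cap.read()     # frame at framenum
    res_1, frame_1 = cap.read()   # the following frame

    img_0 = cv2.cvtColor(frame_0, cv2.COLOR_BGR2RGB)
    img_1 = cv2.cvtColor(frame_1, cv2.COLOR_BGR2RGB)

    img_0_tensor = torch.from_numpy(img_0).permute(2, 0, 1).float()
    img_1_tensor = torch.from_numpy(img_1).permute(2, 0, 1).float()
    img_tensor = torch.cat((img_0_tensor, img_1_tensor), dim=0)   # 6-channel input

    label_tensor = torch.as_tensor(self.labels[idx])               # label for the frame to predict
    return img_tensor, label_tensor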

Got it, thanks!