Image + sequence of coordinates as inputs

Hi! I’m planning to build a video tracker application similar to the “find the center of a face” problem from Lesson 3, but taken a bit further. The goal is to simulate eye movement by predicting, for each frame, the coordinates of the point the viewer’s eyes are focused on while watching a video. The correct coordinates depend not only on the frame itself but also on the focus point coordinates of the previous frames.

So the input to the learner consists of the frame (image) and the previous coordinates (tuple). The output would then be the coordinates (tuple) for the current frame. Any idea how to format the input to include these two types of data? The following couple of lines are the closest I’ve found (taken from Lesson 3), but they only handle images as X and coordinates as Y, while I need images AND coordinates in X at the same time:

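from fastai.vision import *  # fastai v1, as used in Lesson 3

# `path` and `get_ctr` are defined earlier in the Lesson 3 notebook
data = (PointsItemList.from_folder(path)                          # X: images found under `path`
        .split_by_valid_func(lambda o: o.parent.name=='13')       # validation set: images in folder '13'
        .label_from_func(get_ctr)                                 # Y: one point per image
        .transform(get_transforms(), tfm_y=True, size=(120,160))  # tfm_y=True transforms the points too
        .databunch().normalize(imagenet_stats)
       )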
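
To make the question more concrete, here is a rough sketch of the sample layout I have in mind, written in plain Python without fastai. The function name `make_samples`, the `history` parameter, and the file names and coordinates are all made up for illustration; none of this comes from the lesson or the fastai API.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# One sample: ((frame, previous focus points), current focus point)
Sample = Tuple[Tuple[str, List[Point]], Point]


def make_samples(frames: List[str], points: List[Point], history: int = 1) -> List[Sample]:
    """Pair each frame with the focus points of the `history` preceding frames."""
    if len(frames) != len(points):
        raise ValueError("need exactly one focus point per frame")
    if history < 1:
        raise ValueError("history must be at least 1")
    samples = []
    # The first `history` frames have no complete history, so they are skipped.
    for i in range(history, len(frames)):
        x = (frames[i], points[i - history:i])  # X: image AND previous coordinates
        y = points[i]                           # Y: coordinates for the current frame
        samples.append((x, y))
    return samples


# Hypothetical data: four consecutive frames with one focus point each.
frames = ["f0.jpg", "f1.jpg", "f2.jpg", "f3.jpg"]
points = [(10.0, 20.0), (12.0, 21.0), (15.0, 19.0), (15.0, 18.0)]

samples = make_samples(frames, points, history=2)
for x, y in samples:
    print(x, "->", y)
```

So each X would be a (frame, previous points) pair and each Y a single point. What I don't know is how to get the data block API above to accept an X like that.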

Thanks a lot for helping out a beginner!