Question about data pipeline?

Hi friends!

I’m dealing with a hand keypoints dataset and I’d like to know whether you guys think my dataset pipeline makes sense.

  • The data is in various folders, no validation set is specified.
  • Each folder has .jpg files and the corresponding .json.
  • I was thinking of using ImageDataBunch.from_df, which seems the most straightforward option in this case.
  • That means I need to create a csv with the image path in the first column and, since we’re doing a regression on vectors, the next n columns should contain the vectors we wish to predict.
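If it helps, here’s a minimal sketch of building that DataFrame before saving it to csv. It assumes each image has a sibling .json with the same stem, and that the annotation file stores a flat coordinate list under a "keypoints" key — the folder layout and JSON key are hypothetical, adjust to your format:

```python
import json
from pathlib import Path

import pandas as pd

def build_keypoints_df(root):
    """Collect (image path, flattened keypoints) rows from every folder under root.

    Assumes each foo.jpg has a sibling foo.json whose "keypoints" entry is a
    flat list of coordinates (hypothetical key -- adjust to your annotations).
    """
    rows = []
    for img in sorted(Path(root).rglob("*.jpg")):
        ann = img.with_suffix(".json")
        if not ann.exists():
            continue  # skip images without annotations
        coords = json.loads(ann.read_text())["keypoints"]
        rows.append([str(img)] + list(coords))
    n = (len(rows[0]) - 1) if rows else 0
    cols = ["image"] + [f"k{i}" for i in range(n)]
    return pd.DataFrame(rows, columns=cols)

# df = build_keypoints_df("data/hands")  # hypothetical root folder
# df.to_csv("keypoints.csv", index=False)
```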

Does that sound correct?
Thanks a lot!

You should use the data block API, it would make your life easier :slight_smile:

Hey, thanks for answering !

Yep, that’s what I wanna do. But it does need a csv or a pandas df, doesn’t it?

Nope, if it’s in various folders, you can just use ImageList.from_folder, then split randomly and use a label_from_func to open the corresponding json.
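Something like this, roughly — a sketch assuming the fastai v1 data block API and that each image’s .json sits next to it (the "keypoints" key and the flat [y, x, y, x, ...] layout are assumptions about your annotation format):

```python
import json
from pathlib import Path

def img_to_json(p):
    """Map .../shoot1/img0.jpg to its annotation file .../shoot1/img0.json."""
    return Path(p).with_suffix(".json")

def get_points(p):
    """Load one image's keypoints as a list of [y, x] pairs
    (assumes a flat [y, x, y, x, ...] list under a "keypoints" key)."""
    coords = json.loads(img_to_json(p).read_text())["keypoints"]
    return [coords[i:i + 2] for i in range(0, len(coords), 2)]

# With fastai v1, keypoints are a points regression, so PointsItemList
# (rather than a plain ImageList) gives you the right label type:
# data = (PointsItemList.from_folder("data/hands")
#         .split_by_rand_pct(0.2)
#         .label_from_func(lambda p: tensor(get_points(p)))
#         .transform(get_transforms(), tfm_y=True, size=224)
#         .databunch())
```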

I hope this isn’t too pedantic, but it may be worth creating train/valid folders rather than a random split if some of these images come from the same photoshoot (i.e. the same person in the same location). You don’t want your model to recognize the person or setting in the image, just the hand keypoints.

I’m assuming you’re using this. If not, maybe your dataset doesn’t have this problem.
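One way to get that grouped split without moving files around is to assign whole folders to train/valid — a sketch, assuming each folder corresponds to one photoshoot:

```python
import random
from pathlib import Path

def split_by_shoot(root, valid_frac=0.2, seed=42):
    """Assign whole folders (photoshoots) to train or valid, so images of the
    same person/setting never land on both sides of the split."""
    shoots = sorted(p for p in Path(root).iterdir() if p.is_dir())
    rng = random.Random(seed)
    rng.shuffle(shoots)
    n_valid = max(1, int(len(shoots) * valid_frac))
    valid = set(shoots[:n_valid])
    return lambda img: Path(img).parent in valid  # True -> validation item

# is_valid = split_by_shoot("data/hands")  # hypothetical root folder
# Then, with the fastai v1 data block API, replace the random split with:
# data = (PointsItemList.from_folder("data/hands")
#         .split_by_valid_func(is_valid)
#         ...)
```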

Exactly, that’s the one :wink:

Great, label_from_func is exactly what I was looking for, I suppose! (:

Thanks !