Question about data pipeline?

Mehdi63 · May 28, 2019, 4:41pm

Hi friends !

I’m dealing with a hand keypoints dataset and I’d like to know whether you think my dataset pipeline makes sense.

The data is in various folders, no validation set is specified.
Each folder has .jpg files and the corresponding .json.
I was thinking of using ImageDataBunch.from_df, since it seems to be the most straightforward option in this case.
That means I need to create a csv with the image path in the first column and, since we’re doing a regression on vectors, the values of the vector we wish to predict in the next n columns.
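The csv layout described above could be sketched roughly as follows, using only the standard library. The JSON schema (a "keypoints" key holding a list of [x, y] pairs), the function names `build_rows` and `write_csv`, and the column names are all assumptions made for illustration; they would need adapting to the real annotation files.

```python
import csv
import json
from pathlib import Path


def build_rows(root):
    """Return one row per annotated image: [relative path, x0, y0, x1, y1, ...]."""
    root = Path(root)
    rows = []
    for img_path in sorted(root.rglob("*.jpg")):
        json_path = img_path.with_suffix(".json")
        if not json_path.exists():
            continue  # skip images that have no annotation file
        with open(json_path) as f:
            # ASSUMPTION: hypothetical schema {"keypoints": [[x, y], ...]}
            points = json.load(f)["keypoints"]
        # Flatten [[x0, y0], [x1, y1], ...] into x0, y0, x1, y1, ...
        flat = [coord for point in points for coord in point[:2]]
        rows.append([img_path.relative_to(root).as_posix(), *flat])
    return rows


def write_csv(rows, out_path):
    """Write the rows with a header: 'name', then one column per coordinate."""
    n_coords = len(rows[0]) - 1 if rows else 0
    header = ["name"] + [f"c{i}" for i in range(n_coords)]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

This assumes every image has the same number of keypoints, so that all rows end up with the same n columns.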

Does that sound correct ?
Thanks a lot !

sgugger · May 28, 2019, 4:43pm

You should use the data block API; it would make your life easier.

Mehdi63 · May 28, 2019, 5:00pm

Hey, thanks for answering !

Yop, that’s what I wanna do. But it does need a csv or a pandas df for that, doesn’t it ?

sgugger · May 28, 2019, 5:03pm

Nope, if it’s in various folders, you can just use ImageList.from_folder, then split randomly and use a label_from_func to open the corresponding json.
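A minimal sketch of that suggestion, assuming fastai v1 and untested against the actual dataset. It uses PointsItemList (the ImageList subclass fastai v1 provides for keypoint targets) rather than plain ImageList. The JSON schema (a "keypoints" key with [x, y] pairs), the helper names `load_points` and `build_databunch`, and the default sizes are assumptions for illustration only.

```python
import json
from pathlib import Path


def load_points(img_path):
    """Read the keypoints stored next to an image and return them as [y, x] pairs."""
    json_path = Path(img_path).with_suffix(".json")
    with open(json_path) as f:
        # ASSUMPTION: hypothetical schema {"keypoints": [[x, y], ...]}
        points = json.load(f)["keypoints"]
    # fastai v1 image points are expressed as (row, column), i.e. (y, x)
    return [[float(p[1]), float(p[0])] for p in points]


def build_databunch(root, valid_pct=0.2, size=224, bs=32):
    """Sketch of the data block pipeline; requires fastai v1 and PyTorch."""
    # Imported lazily so load_points stays usable without fastai installed.
    import torch
    from fastai.vision import PointsItemList, get_transforms

    def get_y(img_path):
        # The labelling function returns a tensor of (y, x) points per image.
        return torch.tensor(load_points(img_path))

    return (PointsItemList.from_folder(root)       # collects images recursively
            .split_by_rand_pct(valid_pct)          # random train/valid split
            .label_from_func(get_y)                # open the corresponding json
            .transform(get_transforms(), tfm_y=True, size=size)
            .databunch(bs=bs))
```

Passing tfm_y=True is what makes the augmentations move the keypoints together with the image; whether the default transforms suit hand keypoints is something to check on real samples.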

leviritchie · May 28, 2019, 7:09pm

I hope this isn’t too pedantic, but it may be worth creating train/valid folders rather than a random split if some of these images come from the same photoshoot (i.e. the same person in the same location). You don’t want your model to recognize the person or setting in the image, just the hand keypoints.

I’m assuming you’re using this. If not, maybe your dataset doesn’t have this problem.
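The folder-level split suggested above could be sketched like this. It assumes, purely for illustration, that each parent folder corresponds to one photoshoot, so holding out whole folders keeps a given person or setting out of the training set. The function names `pick_valid_groups` and `is_valid` are hypothetical.

```python
import random
from pathlib import Path


def pick_valid_groups(img_paths, valid_frac=0.2, seed=42):
    """Choose whole folders (ASSUMED to be one photoshoot each) for validation."""
    groups = sorted({Path(p).parent.name for p in img_paths})
    if len(groups) < 2:
        raise ValueError("need at least two folders to hold one out")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(groups)
    n_valid = max(1, round(len(groups) * valid_frac))
    return set(groups[:n_valid])


def is_valid(img_path, valid_groups):
    """True if the image belongs to a folder held out for validation."""
    return Path(img_path).parent.name in valid_groups
```

With fastai v1, a predicate like this could be passed to split_by_valid_func (for example `lambda p: is_valid(p, valid_groups)`) in place of the random split, which avoids physically moving files into train/valid folders.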

Mehdi63 · May 29, 2019, 6:36am

Exactly, that’s the one

Great, label_from_func is what I was looking for, I suppose ! (:

Thanks !