Is it possible to split ImageDataBunch.from_df into Train/Valid/Test?

sgmiriuka · April 19, 2019, 2:29pm

Hi everybody,
I need some help. The way I set up my experiment, I have one big dataframe with labels and one single folder containing all the images.

Now, I run my analysis on the train df (with a validation percentage). Before that, I create a random subset df for testing. But I haven’t been able to figure out how this new test df can be used in my analysis.
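
One way to hold out a random test subset before building the training data is sketched below with pandas. The column names, the 10% fraction, and the seed are placeholders rather than the setup described in this post.

```python
import pandas as pd

# Hypothetical dataframe: one row per image, with its label.
df = pd.DataFrame({
    "name": [f"img_{i}.jpg" for i in range(20)],
    "label": ["a", "b"] * 10,
})

# Hold out a random 10% of the rows as the test set (fixed seed for repeatability).
test_df = df.sample(frac=0.1, random_state=42)

# Every remaining row stays available for training and validation.
train_df = df.drop(test_df.index)
```

Dropping by index keeps the two frames disjoint, so no test image can leak into training.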

I tried ‘learn.get_preds(is_test=True)’ as in this thread, but I get an error with is_test=True (got an unexpected keyword argument ‘is_test’). I’m using version 1.0.45.

Thanks in advance for any hints.

cc @sgugger

muellerzr · April 19, 2019, 3:13pm

You should be able to. I’ve been doing this for tabular problems relating to time series. You make three ImageLists, one with 70% of the data, one with 20%, and one with 10%. Use the largest for training, the second largest for validation, and the smallest as the test set. I’ll show you my code for doing it with tabular data; let me know if you need help adapting it to images. I ran this on each CSV file I loaded with pandas:

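import pandas as pd

class CombineData2:
  def __init__(self, df1, df2):
    # df1 and df2 each expose .train, .valid and .test dataframes.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    self.train = pd.concat([df1.train, df2.train])
    self.valid = pd.concat([df1.valid, df2.valid])
    self.test = pd.concat([df1.test, df2.test])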

This takes two objects that each hold train, valid, and test dataframes (instances of the PrepData class below) and merges their training, validation, and test sets. To produce those splits I used the following class:

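class PrepData:
  def __init__(self, dataframe, activity):
    self.dataframe = dataframe
    # Label every row with the activity for this file (modifies the dataframe in place)
    dataframe['Activity'] = activity
    # Cumulative end positions of the 70/20/10 split
    self.lenTrain = int(len(dataframe)/100*70)
    self.lenValid = self.lenTrain + int(len(dataframe)/100*20)
    self.lenTest = self.lenValid + int(len(dataframe)/100*10)  # not used below
    self.train = dataframe.iloc[:self.lenTrain]
    self.valid = dataframe.iloc[self.lenTrain:self.lenValid]
    # Slice to the end so rows lost to rounding still land in the test set
    self.test = dataframe.iloc[self.lenValid:]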
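
The split arithmetic in PrepData can be sketched without pandas as follows. The helper name `split_bounds` and the 105-row example are made up for illustration, and integer arithmetic is used here instead of the float expression in the class above.

```python
def split_bounds(n_rows, train_pct=70, valid_pct=20):
    """Return (train_end, valid_end) row positions for an ordered split."""
    train_end = n_rows * train_pct // 100
    valid_end = train_end + n_rows * valid_pct // 100
    return train_end, valid_end

rows = list(range(105))            # stand-in for the rows of a dataframe
train_end, valid_end = split_bounds(len(rows))

train = rows[:train_end]           # first ~70%
valid = rows[train_end:valid_end]  # next ~20%
test = rows[valid_end:]            # remainder, so rounding never drops a row

print(len(train), len(valid), len(test))  # 73 21 11
```

Because the test slice runs to the end of the data, the three parts always cover every row, even when the percentages do not divide the row count evenly.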

I passed in a dataframe and a string for the activity because my data was split across files rather than having the class listed in a column, but the idea is the same. The databunch was then generated as follows:

training = TabularList.from_df(initialClassificationData.train, path = '', cat_names = cat_vars, cont_names = var, procs=procs).split_none().label_from_df(cols=dep_var, label_cls = CategoryList)
valid = TabularList.from_df(initialClassificationData.valid, path='', cat_names = cat_vars, cont_names = var, procs=procs).split_none().label_from_df(cols=dep_var, label_cls = CategoryList)
test = TabularList.from_df(initialClassificationData.test, path='', cat_names = cat_vars, cont_names = var, procs=procs).split_none().label_from_df(cols=dep_var, label_cls = CategoryList)

training.valid = valid.train
training.test = test.train
initialClassificationDatabunch = training.databunch()

You should be able to repeat this for images. If you have issues, tell me and I can try to work something out after my classes today. The key here is split_none(), which leaves each list unsplit, so the three lists hold the first 70%, the next 20%, and the last 10% of the data.

sgmiriuka · April 20, 2019, 9:33am

Thanks a lot Zachary, looks like I should go for the API to get it. Now, based on this part,

training.valid = valid.train
training.test = test.train
initialClassificationDatabunch = training.databunch()

Do you end up with just one training.databunch for all three groups (Train/Valid/Test)?
Thanks again

muellerzr · April 20, 2019, 3:00pm

Yep! So just reassign it to whatever name you want if you need to. I had 3-4 different databunches I was dealing with so I named them all differently but you could keep it as training.databunch(). That databunch will have your train, validation, and test sets!

neumann · May 2, 2019, 10:04am

Thanks for your post, it saved me a lot of headache. Here is the approach I ended up using.

# explicitly define training and test split

df_train = df[df['train_test_split'] == 'train'][['filename_cropped','label']]
df_test = df[df['train_test_split'] == 'test'][['filename_cropped','label']]

training = ImageList.from_df(df_train, path=ROOT_PATH).split_none().label_from_df(cols='label', label_cls = CategoryList)
valid = ImageList.from_df(df_test, path=ROOT_PATH).split_none().label_from_df(cols='label', label_cls = CategoryList)
training.valid = valid.train
data = (training.transform(tfms,size=128).databunch().normalize(imagenet_stats))

Sayak · May 2, 2019, 12:01pm

Thank you. It was a massive relief. I was also looking for ways to do the three splits. Very good suggestion and workaround provided by @muellerzr.

muellerzr · May 2, 2019, 12:24pm

@sgugger do we have anything like an OrderedItemList in fastai that could do this? If not, I think it could be valuable for cases like this, where we split one list into 70/20/10 and care about the order the items show up in. Let me know if this would be worth putting into the library and I can get working on it! I would have it work with both image folders and pandas dataframes (and anything else people recommend that fits this sort of function).

Edit: or perhaps a new split() function?

sgugger · May 2, 2019, 1:15pm

There is no need for a new ItemList to do this. As far as I can tell, you can use the split_by_idx method to split between train and valid and then you can use add_test to add your test TabularList. If I take your example above:

il = TabularList.from_df(df.iloc[:lenValid], path = '', cat_names = cat_vars, cont_names = var, procs=procs)
sd = il.split_by_idx(list(range(lenTrain, lenValid)))
ll = sd.label_from_df(cols=dep_var, label_cls = CategoryList)
ll = ll.add_test(TabularList.from_df(df.iloc[lenValid:], path = '', cat_names = cat_vars, cont_names = var))
data = ll.databunch()

Note that it’s weird to put labeled data in a test set because the test set is unlabeled in fastai. You should make a second data object with that set as a validation set.
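
A rough sketch of that suggestion for the image case: build a second dataframe in which the labeled hold-out rows are flagged as validation, so a second data object can score them against their labels. The pandas part runs on its own. The fastai calls are left as comments and assume fastai v1 plus an already trained `learn`; `ROOT_PATH` and `tfms` are borrowed from the code earlier in the thread, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical frames: rows used for training and the labeled hold-out rows.
train_df = pd.DataFrame({"name": ["a.jpg", "b.jpg", "c.jpg"], "label": [0, 1, 0]})
test_df = pd.DataFrame({"name": ["d.jpg", "e.jpg"], "label": [1, 0]})

# Flag the hold-out rows so they become the validation set of the second data object.
eval_df = pd.concat(
    [train_df.assign(is_valid=False), test_df.assign(is_valid=True)],
    ignore_index=True,
)

# Not executed here; assumes fastai v1 and a trained `learn`:
# data_eval = (ImageList.from_df(eval_df, path=ROOT_PATH)
#              .split_from_df('is_valid')
#              .label_from_df(cols='label')
#              .transform(tfms, size=128)
#              .databunch()
#              .normalize(imagenet_stats))
# learn.validate(data_eval.valid_dl)  # loss and metrics on the labeled hold-out rows
```

Keeping the original training rows in the second object means any preprocessing fitted on the training set stays the same as in the first one.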

muellerzr · May 2, 2019, 1:20pm

I see! Thank you! And you are correct, I had it labeled so I knew the ground truth of my test set and could ‘test’ the model on unseen data but know how it did accuracy-wise on them.

neumann · May 2, 2019, 4:35pm

I’m not sure exactly what happened but with the above method the model wasn’t converging. I tried finding what the issue could be but it wasn’t working.

Anyway, thanks for the suggestions. I changed my code to use “split_from_df” and the training seems to be going OK so far.

training = ImageList.from_df(df, path=ROOT_PATH).split_from_df('is_valid').label_from_df(cols='label')
data = (training.transform(tfms, size=299).databunch(bs=bs).normalize(imagenet_stats))
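
As far as I know, `split_from_df` sends the rows where the named column is truthy to the validation set. A minimal sketch of deriving such an `is_valid` column from the `train_test_split` column used earlier in the thread; the sample rows are made up.

```python
import pandas as pd

# Made-up rows in the shape used earlier in the thread.
df = pd.DataFrame({
    "filename_cropped": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "label": ["cat", "dog", "cat", "dog"],
    "train_test_split": ["train", "train", "test", "train"],
})

# Rows marked 'test' become validation rows; all others stay in training.
df["is_valid"] = df["train_test_split"] == "test"

# ImageList.from_df reads file names from the first column by default (cols=0),
# so the file name column is kept first.
df = df[["filename_cropped", "label", "is_valid"]]
```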