Concatenation of two ImageLists

(Abhik) #1

Hi @sgugger, based on your advice, I used the following logic to concatenate two ImageLists:

data_wiki = (ImageList.from_df(df_age, path, cols=['full_path'], folder='../wiki-face-data/wiki_crop/wiki_crop/')
             .split_none()
             .label_from_df(label_cls=FloatList)
             .add(ImageList.from_folder(path_utk).split_none().label_from_func(extract_age, label_cls=FloatList))
             .split_by_rand_pct(0.2, seed=42)
             .transform(tfms, resize_method=ResizeMethod.CROP, padding_mode='border', size=224)
             .databunch(bs=64*2, num_workers=0)
             .normalize(imagenet_stats))

But I got the error below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-b6c65e6de23c> in <module>()
      7                                               symmetric_warp(magnitude=(-0.1,0.1)) ])
      8 
----> 9 data_wiki = (ImageList.from_df(df_age, path, cols=['full_path'], folder ='../wiki-face-data/wiki_crop/wiki_crop/').split_none().label_from_df(label_cls=FloatList)).add(ImageList.from_folder(path_utk).split_none().label_from_func(extract_age, label_cls=FloatList)).split_by_rand_pct(0.2, seed=42).transform(tfms, resize_method=ResizeMethod.CROP, padding_mode='border', size=224).databunch(bs=64*2,num_workers=0).normalize(imagenet_stats)

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    461         assert isinstance(fv, Callable)
    462         def _inner(*args, **kwargs):
--> 463             self.train = ft(*args, from_item_lists=True, **kwargs)
    464             assert isinstance(self.train, LabelList)
    465             kwargs['label_cls'] = self.train.y.__class__

TypeError: add() got an unexpected keyword argument 'from_item_lists'

How can I resolve this error?

Best Regards
Abhik

#2

You cannot add LabelLists together, only ItemLists, which won’t work in your case since you have labelled data.
You should do some preprocessing to gather all your data into a single dataframe first.
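A rough sketch of that preprocessing, in plain Python (the `extract_age` helper is a placeholder assuming UTK-style filenames that start with the age, e.g. `25_0_1_x.jpg`; adapt both helpers to your actual data):

```python
from pathlib import Path

def extract_age(fname):
    # placeholder: assumes UTKFace-style names where the age is the
    # first underscore-separated field, e.g. "25_0_1_x.jpg" -> 25.0
    return float(Path(fname).name.split("_")[0])

def build_rows(wiki_records, utk_files):
    """Merge both sources into one list of (path, age) rows.

    wiki_records: iterable of (relative_path, age) pairs, e.g. from df_age
    utk_files:    iterable of image paths whose label comes from the filename
    """
    rows = [(path, float(age)) for path, age in wiki_records]
    rows += [(str(f), extract_age(f)) for f in utk_files]
    return rows

# With all rows in one dataframe (columns: full_path, age), a single
# ImageList.from_df(...) chain then covers both datasets, so no .add()
# on labelled lists is needed.
```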

(Abhik) #3

Oh OK… thanks for the quick reply!

To move quickly, I will instead train on one dataset and validate on the other.

I am using the following to do this:

learn.validate(data_2.valid_dl)

This gives me a single number. Does this number represent the average loss on data_2?

Also, is it possible that, instead of randomly splitting one dataset into train and validation sets, I train on the entire data_1 with data_2 as the validation data? i.e. the model trains on data_1, calculates the loss on data_2, and I monitor this loss over a number of epochs.

(Zachary Mueller) #4

If you want to do that, you need to actually change learn.data’s validation dataloader 🙂

So do learn.data.valid_dl = data_2.valid_dl.
On the loss question: yes, that number is the overall loss averaged over the dataset. But right now learn.validate() is just using your normal validation set 🙂
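To be concrete about what the averaging means, it’s a batch-size-weighted mean of the per-batch losses (the last batch is usually smaller, so a plain mean of batch losses would be slightly off). A minimal sketch in plain Python, no fastai:

```python
def average_loss(batch_losses, batch_sizes):
    """Batch-size-weighted mean of per-batch losses.

    batch_losses: mean loss of each batch
    batch_sizes:  number of samples in each batch
    """
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# e.g. a full batch of 64 with loss 1.0 and a last batch of 32 with
# loss 2.0 averages to (64*1.0 + 32*2.0) / 96 = 4/3, not 1.5
```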

(Abhik) #5

Thanks, Zach, for the quick reply!

Do you mean like this:

learn = Learner(data_1, model, metrics = mean_absolute_error, model_dir = "/temp/model/", bn_wd=False, opt_func=opt_func,
               callback_fns=[ShowGraph]).mixup(stack_y=False, alpha=0.2)

learn.loss_func = L1LossFlat()

learn.data.valid_dl = data_2.valid_dl

and then the usual learn.fit_one_cycle(…)?

(Zachary Mueller) #6

That would let you train on that data, but if we wanted to evaluate on it we would run learn.validate(), yes. For the concatenation, I would follow sgugger’s advice and do some preprocessing to make it one dataframe, or something along those lines. I reserve the dataloader swap for just running a quick test-set accuracy.

(Abhik) #7

Cool, I will certainly follow @sgugger’s advice on the pre-processing, but that may take some time and some thinking 🙂

As of now, I will use your suggestion.

Basically, I will not split data_1 into train/validation when creating the ImageDataBunch, and will then use the above technique to evaluate the loss on data_2…

Considering that data_1 and data_2 are pretty similar data (with different images, obviously), do you think this technique is worth exploring?

(Zachary Mueller) #8

Are the classes the same? Could one think of data 2 as a continuation of data 1? If both of those are true, then as long as one dataset is your training set (which you split into train/valid), you could treat the other as a test set.

(Abhik) #9

Yes, the classes are the same, and the two can be considered one big dataset cut into two similar parts (data_1, data_2)…

(Zachary Mueller) #10

If the imbalance of data between the two isn’t too bad, I don’t see why you couldn’t do this; otherwise you could be missing out on valuable training data. E.g. data 1 has 1000 photos and data 2 has 2000 photos: not ideal. Data 1 has 1000 photos and data 2 has 200-300: somewhat ideal. A few options to consider.

(Abhik) #11

Nice. data_1 and data_2 have almost the same class distributions; data_1 has around 25k images and data_2 around 13k… so I think I will go ahead and give it a try 🙂

Thanks so much for engaging with me 🙂
