Concatenation of two ImageLists

Hi @sgugger, based on your advice, I used the following logic to concatenate two ImageLists:

data_wiki = ((ImageList.from_df(df_age, path, cols=['full_path'], folder='../wiki-face-data/wiki_crop/wiki_crop/')
              .split_none()
              .label_from_df(label_cls=FloatList))
             .add(ImageList.from_folder(path_utk).split_none().label_from_func(extract_age, label_cls=FloatList))
             .split_by_rand_pct(0.2, seed=42)
             .transform(tfms, resize_method=ResizeMethod.CROP, padding_mode='border', size=224)
             .databunch(bs=64*2, num_workers=0)
             .normalize(imagenet_stats))

But I got the error below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-b6c65e6de23c> in <module>()
      7                                               symmetric_warp(magnitude=(-0.1,0.1)) ])
      8 
----> 9 data_wiki = (ImageList.from_df(df_age, path, cols=['full_path'], folder ='../wiki-face-data/wiki_crop/wiki_crop/').split_none().label_from_df(label_cls=FloatList)).add(ImageList.from_folder(path_utk).split_none().label_from_func(extract_age, label_cls=FloatList)).split_by_rand_pct(0.2, seed=42).transform(tfms, resize_method=ResizeMethod.CROP, padding_mode='border', size=224).databunch(bs=64*2,num_workers=0).normalize(imagenet_stats)

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    461         assert isinstance(fv, Callable)
    462         def _inner(*args, **kwargs):
--> 463             self.train = ft(*args, from_item_lists=True, **kwargs)
    464             assert isinstance(self.train, LabelList)
    465             kwargs['label_cls'] = self.train.y.__class__

TypeError: add() got an unexpected keyword argument 'from_item_lists'

How can I resolve this error?

Best Regards
Abhik

You cannot add LabelLists together, only ItemLists, which won’t work in your case since you have labeled data.
You should do some preprocessing here to gather all your data in a single dataframe first.
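For example, a rough sketch of that preprocessing (the column names, paths and the extract_age call below are illustrative and will need adapting to your actual dataframes):

from fastai.vision import *
import pandas as pd

# One row per image: a path that ImageList.from_df can open, plus the age label
df_wiki = pd.DataFrame()
df_wiki['full_path'] = '../wiki-face-data/wiki_crop/wiki_crop/' + df_age['full_path']
df_wiki['age'] = df_age['age']                      # illustrative label column name

utk_files = get_image_files(path_utk, recurse=True)
df_utk = pd.DataFrame({'full_path': [str(f) for f in utk_files],
                       'age': [extract_age(f) for f in utk_files]})

df_all = pd.concat([df_wiki, df_utk], ignore_index=True)

data = (ImageList.from_df(df_all, path='.', cols='full_path')
        .split_by_rand_pct(0.2, seed=42)
        .label_from_df(cols='age', label_cls=FloatList)
        .transform(tfms, size=224)
        .databunch(bs=128)
        .normalize(imagenet_stats))

With everything in one dataframe you only ever create one ImageList, so there is nothing to add together.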

Oh OK… thanks for the quick reply…

In order to move quickly, I will instead train on one dataset and validate on the other.

I am using the following to do this:

learn.validate(data_2.valid_dl)

This gives me a single number. Does this number mean the average loss on data_2?

Also, is it possible that instead of randomly splitting one dataset into train and validation, I train on the entire data_1 with data_2 as the validation set? i.e. it trains on data_1, calculates the loss on data_2, and tracks this loss over the epochs.

If you want to do that, you need to actually change learn.data’s validation dataloader :slight_smile:

So do learn.data.valid_dl = data_2.valid_dl
On the second point, that is the overall loss for the dataset, but right now it is just being computed on your normal validation set :slight_smile:
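In code, something along these lines (a sketch; it assumes data_2 was built with its own validation split and that the learner was created from data_1):

learn.data.valid_dl = data_2.valid_dl        # validation now runs on data_2
learn.fit_one_cycle(5)                       # trains on data_1's train_dl, reports loss/metrics on data_2
print(learn.validate(learn.data.valid_dl))   # returns [valid_loss, metric_1, ...] averaged over data_2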

Thanks Zach for the quick reply…

Do you mean like this:

learn = Learner(data_1, model, metrics = mean_absolute_error, model_dir = "/temp/model/", bn_wd=False, opt_func=opt_func,
               callback_fns=[ShowGraph]).mixup(stack_y=False, alpha=0.2)

learn.loss_func = L1LossFlat()

learn.data.valid_dl = data_2.valid_dl

and then the usual learn.fit_one_cycle(…)?

That would let you train on that data, and if we wanted to evaluate on it we would run learn.validate(), yes. For the concatenation, I would follow sgugger’s advice and do some preprocessing to make it one dataframe, or something along those lines. I reserve this trick for just running a quick test-set accuracy check.

Cool, I will surely follow @sgugger’s advice for pre-processing, but that may take some time and some thinking behind it :slight_smile:

As of now, I will use your suggestion.

Basically, I will not split data_1 into train/validation while creating the ImageDataBunch, and then I will use the above technique to validate the loss on data_2…

Considering that data_1 and data_2 are pretty much similar data (with obviously different images), do you think this technique is worth exploring?

Are the classes the same? Could one think of data_2 as a continuation of data_1? As long as one is your train set that you split into train/valid, you could then consider the other as a test set, so long as the answers to those two questions are yes.

Yes, the classes are the same, and it can be considered one big dataset cut into two similar datasets (data_1, data_2)…

If the imbalance between the two isn’t too bad, I don’t see why you couldn’t do this. Unless the data is reasonably balanced, though, you could be missing out on valuable training data. E.g. data_1 has 1000 photos and data_2 has 2000 photos: not an ideal situation. Data_1 has 1000 photos and data_2 has 200-300: somewhat ideal. A few options to consider.


Nice, data_1 and data_2 have almost the same class distributions, and data_1 has around 25k images while data_2 has around 13k images… so I think I will go ahead and give it a try :slight_smile:

Thanks so much for engaging with me :slight_smile:

Hello,

The code is available at the kernel:
https://www.kaggle.com/riteshsinha/databunch-fast-ai-chest-x-ray-model

I am trying to concatenate images from two folders, where the structure is not straightforward.

print(os.listdir("../input")) gives:
['images_007', 'images_003', 'images_012', 'Data_Entry_2017.csv', 'images_004', 'train_val_list.txt', 'ARXIV_V5_CHESTXRAY.pdf', 'images_002', 'test_list.txt', 'FAQ_CHESTXRAY.pdf', 'images_005', 'README_CHESTXRAY.pdf', 'BBox_List_2017.csv', 'images_001', 'LOG_CHESTXRAY.pdf', 'images_008', 'images_011', 'images_009', 'images_006', 'images_010']

The images are under the images_XXX directories (images_001, images_007, etc.).
The call to create an ImageDataBunch does not succeed, I think because the files are nested further down in subdirectories.

FileNotFoundError: [Errno 2] No such file or directory: '../input/00000001_000.png'

So I thought I would create individual ImageLists and then concatenate them together.
The approach is in the Kaggle kernel linked above, and I wanted an opinion on whether this is the right way to do it.
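Roughly, the idea in the kernel looks like this (a sketch from memory; the exact folder names, split and CSV column names may differ from the notebook):

from fastai.vision import *
import pandas as pd

path = Path('../input')
df = pd.read_csv(path/'Data_Entry_2017.csv')

# Collect the images from every images_XXX/images subfolder into one ImageList
folders = sorted(path.glob('images_*'))
il = ImageList.from_folder(folders[0]/'images')
for f in folders[1:]:
    il = il.add(ImageList.from_folder(f/'images'))

# Map each filename to its '|'-separated findings (multi-label), then label the items
labels = dict(zip(df['Image Index'], df['Finding Labels']))
src = (il.split_by_rand_pct(0.1, seed=42)
         .label_from_func(lambda o: labels[o.name].split('|')))

data = (src.transform(get_transforms(), size=224)
           .databunch(bs=32)
           .normalize(imagenet_stats))

Since the add() happens on plain ImageLists before any labelling, it avoids the error from earlier in this thread.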

This is a multi-label problem, so if you could have a look at the labels and confirm them, that would be great!

The output is as below.

LabelLists;

Train: LabelList (15409 items)
x: ImageList
Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024)
y: MultiCategoryList
Pneumothorax,Atelectasis;Effusion,Nodule,No Finding,Mass
Path: ../input/images_003/images;

Valid: LabelList (1712 items)
x: ImageList
Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024)
y: MultiCategoryList
Atelectasis;Consolidation;Edema;Infiltration;Pneumonia,Effusion;Infiltration,Cardiomegaly,Effusion,Infiltration
Path: ../input/images_003/images;

Test: None

Thanks in advance