Concatenation of two ImageLists

Hi @sgugger, based on your advice, I used the following logic to concatenate two ImageLists:

data_wiki = ((ImageList.from_df(df_age, path, cols=['full_path'], folder='../wiki-face-data/wiki_crop/wiki_crop/')
              .split_none()
              .label_from_df(label_cls=FloatList))
             .add(ImageList.from_folder(path_utk).split_none().label_from_func(extract_age, label_cls=FloatList))
             .split_by_rand_pct(0.2, seed=42)
             .transform(tfms, resize_method=ResizeMethod.CROP, padding_mode='border', size=224)
             .databunch(bs=64*2, num_workers=0)
             .normalize(imagenet_stats))

But I got the error below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-b6c65e6de23c> in <module>()
      7                                               symmetric_warp(magnitude=(-0.1,0.1)) ])
      8 
----> 9 data_wiki = (ImageList.from_df(df_age, path, cols=['full_path'], folder ='../wiki-face-data/wiki_crop/wiki_crop/').split_none().label_from_df(label_cls=FloatList)).add(ImageList.from_folder(path_utk).split_none().label_from_func(extract_age, label_cls=FloatList)).split_by_rand_pct(0.2, seed=42).transform(tfms, resize_method=ResizeMethod.CROP, padding_mode='border', size=224).databunch(bs=64*2,num_workers=0).normalize(imagenet_stats)

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    461         assert isinstance(fv, Callable)
    462         def _inner(*args, **kwargs):
--> 463             self.train = ft(*args, from_item_lists=True, **kwargs)
    464             assert isinstance(self.train, LabelList)
    465             kwargs['label_cls'] = self.train.y.__class__

TypeError: add() got an unexpected keyword argument 'from_item_lists'

How can I resolve this error?

Best Regards
Abhik

You cannot add LabelLists together, only ItemLists, which won’t work in your case since you have labeled data.
You should do some preprocessing here to gather all your data in a single dataframe first.
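For example, a rough sketch of that preprocessing (the column names, paths and the extract_age call below are illustrative and will need adapting to your actual dataframes):

from fastai.vision import *
import pandas as pd

# One row per image: a path that ImageList.from_df can open, plus the age label
df_wiki = pd.DataFrame()
df_wiki['full_path'] = '../wiki-face-data/wiki_crop/wiki_crop/' + df_age['full_path']
df_wiki['age'] = df_age['age']                      # illustrative label column name

utk_files = get_image_files(path_utk, recurse=True)
df_utk = pd.DataFrame({'full_path': [str(f) for f in utk_files],
                       'age': [extract_age(f) for f in utk_files]})

df_all = pd.concat([df_wiki, df_utk], ignore_index=True)

data = (ImageList.from_df(df_all, path='.', cols='full_path')
        .split_by_rand_pct(0.2, seed=42)
        .label_from_df(cols='age', label_cls=FloatList)
        .transform(tfms, size=224)
        .databunch(bs=128)
        .normalize(imagenet_stats))

With everything in one dataframe you only ever create one ImageList, so there is nothing to add together.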

Oh OK… thanks for the quick reply…

In order to move quickly, I will instead train on one dataset and validate on the other.

I am using the following to do this:

learn.validate(data_2.valid_dl)

This gives me a single number. Does this number mean the average loss on data_2?

Also, is it possible that instead of randomly splitting one dataset into train and validation, I train on the entire data_1 with data_2 as the validation set? i.e. it trains on data_1, calculates the loss on data_2, and tracks this loss over the epochs.

If you want to do that, you need to actually change learn.data’s validation dataloader :slight_smile:

So do learn.data.valid_dl = data_2.valid_dl
On the second point, that is the overall loss for the dataset, but right now it is just being computed on your normal validation set :slight_smile:
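In code, something along these lines (a sketch; it assumes data_2 was built with its own validation split and that the learner was created from data_1):

learn.data.valid_dl = data_2.valid_dl        # validation now runs on data_2
learn.fit_one_cycle(5)                       # trains on data_1's train_dl, reports loss/metrics on data_2
print(learn.validate(learn.data.valid_dl))   # returns [valid_loss, metric_1, ...] averaged over data_2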

Thanks Zach for the quick reply…

Do you mean like this:

learn = Learner(data_1, model, metrics = mean_absolute_error, model_dir = "/temp/model/", bn_wd=False, opt_func=opt_func,
               callback_fns=[ShowGraph]).mixup(stack_y=False, alpha=0.2)

learn.loss_func = L1LossFlat()

learn.data.valid_dl = data_2.valid_dl

and then the usual learn.fit_one_cycle(…)?

That would let you train on that data, and if we wanted to evaluate on it we would run learn.validate(), yes. For the concatenation, I would follow sgugger’s advice and do some preprocessing to make it one dataframe, or something along those lines. I reserve this trick for just running a quick test-set accuracy check.

Cool, I will surely follow @sgugger’s advice for pre-processing, but that may take some time and some thinking behind it :slight_smile:

As of now, I will use your suggestion.

Basically, I will not split data_1 into train/validation while creating the ImageDataBunch, and then I will use the above technique to validate the loss on data_2…

Considering that data_1 and data_2 are pretty much similar data (with obviously different images), do you think this technique is worth exploring?

Are the classes the same? Could one think of data_2 as a continuation of data_1? As long as one is your train set that you split into train/valid, you could then consider the other as a test set, so long as the answers to those two questions are yes.

Yes, the classes are the same, and it can be considered one big dataset cut into two similar datasets (data_1, data_2)…

If the imbalance between the two isn’t too bad, I don’t see why you couldn’t do this. Unless the data is reasonably balanced, though, you could be missing out on valuable training data. E.g. data_1 has 1000 photos and data_2 has 2000 photos: not an ideal situation. Data_1 has 1000 photos and data_2 has 200-300: somewhat ideal. A few options to consider.


Nice, data_1 and data_2 have almost the same class distributions, and data_1 has around 25k images while data_2 has around 13k images… so I think I will go ahead and give it a try :slight_smile:

Thanks so much for engaging with me :slight_smile:

Hello,

The code is available at the kernel:
https://www.kaggle.com/riteshsinha/databunch-fast-ai-chest-x-ray-model

I am trying to concatenate images from two folders, where the structure is not straightforward.

print(os.listdir("../input")) gives:
['images_007', 'images_003', 'images_012', 'Data_Entry_2017.csv', 'images_004', 'train_val_list.txt', 'ARXIV_V5_CHESTXRAY.pdf', 'images_002', 'test_list.txt', 'FAQ_CHESTXRAY.pdf', 'images_005', 'README_CHESTXRAY.pdf', 'BBox_List_2017.csv', 'images_001', 'LOG_CHESTXRAY.pdf', 'images_008', 'images_011', 'images_009', 'images_006', 'images_010']

The images are under the images_XXX directories (images_001, images_007, etc.).
The call to create an ImageDataBunch does not succeed, I think because the files are nested further down in subdirectories.

FileNotFoundError: [Errno 2] No such file or directory: '../input/00000001_000.png'

So I thought I would create individual ImageLists and then concatenate them together.
The approach is in the Kaggle kernel linked above, and I wanted an opinion on whether this is the right way to do it.
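Roughly, the idea in the kernel looks like this (a sketch from memory; the exact folder names, split and CSV column names may differ from the notebook):

from fastai.vision import *
import pandas as pd

path = Path('../input')
df = pd.read_csv(path/'Data_Entry_2017.csv')

# Collect the images from every images_XXX/images subfolder into one ImageList
folders = sorted(path.glob('images_*'))
il = ImageList.from_folder(folders[0]/'images')
for f in folders[1:]:
    il = il.add(ImageList.from_folder(f/'images'))

# Map each filename to its '|'-separated findings (multi-label), then label the items
labels = dict(zip(df['Image Index'], df['Finding Labels']))
src = (il.split_by_rand_pct(0.1, seed=42)
         .label_from_func(lambda o: labels[o.name].split('|')))

data = (src.transform(get_transforms(), size=224)
           .databunch(bs=32)
           .normalize(imagenet_stats))

Since the add() happens on plain ImageLists before any labelling, it avoids the error from earlier in this thread.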

This is a multi-label problem, so if you could have a look at the labels and confirm them, that would be great!

The output is as below.

LabelLists;

Train: LabelList (15409 items)
x: ImageList
Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024)
y: MultiCategoryList
Pneumothorax,Atelectasis;Effusion,Nodule,No Finding,Mass
Path: ../input/images_003/images;

Valid: LabelList (1712 items)
x: ImageList
Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024),Image (3, 1024, 1024)
y: MultiCategoryList
Atelectasis;Consolidation;Edema;Infiltration;Pneumonia,Effusion;Infiltration,Cardiomegaly,Effusion,Infiltration
Path: ../input/images_003/images;

Test: None

Thanks in advance