How to filter data bunch images based on train_val_list.txt and test_list.txt?

bilalUWE · April 16, 2019, 6:15pm

Hi,

I have all my data in the images folder. I want to split these images based on image names provided in the train_val_list.txt and test_list.txt.

I loaded all the data using the following command, but I am stuck on how to filter it further.

src = (ImageList.from_csv(root_dir, 'data_labels.csv', folder='images')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim='|'))

And

data = (src.transform(tfms, size=128)
        .databunch().normalize(imagenet_stats))

Where in the FastAI library should I specify that the training and validation sets are taken from train_val_list.txt and the test set from test_list.txt?

Any guidance would be appreciated.

Many Thanks

Kind Regards
Bilal

shawn · April 16, 2019, 8:39pm

You need to replace the split_by_rand_pct call with a function that will allow you to split by filename. split_by_fname_file seems like the logical choice here.

You can read about the Data Block API in the documentation, here: https://docs.fast.ai/data_block.html#ItemList.split_subsets
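To make "split by filename" concrete, here is a minimal plain-Python sketch of the idea, not the library's implementation. It assumes a list file holds one image name per line; `load_names` and `split_by_names` are hypothetical helper names used only for illustration.

```python
from pathlib import Path


def load_names(list_file):
    """Read one file name per line, skipping blank lines (assumed file format)."""
    text = Path(list_file).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}


def split_by_names(items, valid_names):
    """Partition items into (train, valid) by comparing base file names."""
    train, valid = [], []
    for item in items:
        # Compare only the file name, so 'images/a.png' matches 'a.png'.
        if Path(item).name in valid_names:
            valid.append(item)
        else:
            train.append(item)
    return train, valid
```

As far as I recall from the fastai v1 docs, split_by_fname_file treats the names in the given file as the validation set, so the list file you pass it should contain validation names only.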

bilalUWE · April 16, 2019, 9:04pm

Thx for the reply.

There is a small point of confusion. The full dataset needs to be split into two sets: 1) train/validation and 2) test. Then the train/validation set needs to be split into 1) train and 2) validation, at 80% and 20% respectively. I think I might still need split_by_rand_pct. What do you think?
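The two-stage split described here can be sketched in plain Python, independent of any library. This is only an illustration under assumptions: `two_stage_split` is a hypothetical helper, the names are plain file names, and the seed is arbitrary.

```python
import random


def two_stage_split(all_names, test_names, valid_pct=0.2, seed=42):
    """Hold out the test names first, then randomly split the rest into train/valid."""
    test_names = set(test_names)
    # Stage 1: separate the test images from the train/validation pool.
    test = [n for n in all_names if n in test_names]
    train_val = [n for n in all_names if n not in test_names]
    # Stage 2: shuffle the pool reproducibly and take valid_pct of it for validation.
    rng = random.Random(seed)
    rng.shuffle(train_val)
    n_valid = int(len(train_val) * valid_pct)
    valid, train = train_val[:n_valid], train_val[n_valid:]
    return train, valid, test
```

With this shape, the random 80/20 step is applied only to the train/validation pool, so the test images never leak into training.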

kushaj · April 16, 2019, 9:08pm

I generally avoid adding test data to the same databunch, as in most cases I am not able to. You only need the test dataset when you are ready for deployment, so having it as a separate dataloader would be sufficient.

bilalUWE · April 16, 2019, 9:11pm

Yeah kushaj,

You are right. I don’t need my test data for training the model. But the dataset I am using has all the images inside one folder, and the providers have supplied train_val_list.txt and test_list.txt to separate the two sets. Since I am new to FastAI, I don’t know if there is anything built in to process this type of data.

shawn · April 16, 2019, 10:24pm

What you want is completely achievable with FastAI… but it’s not necessarily going to be an “off the rack” solution that just automatically works with your data set. Spend some time looking at the documentation I linked to, and I think you’ll figure it out pretty soon. If you still end up stuck, come back and ask for help.

Edit: Also, Lesson 3 covers the Data Block API.

kushaj · April 17, 2019, 11:47am

Move the image list into a CSV file whose second column holds the labels. Then you can use from_csv and split_by_rand_pct.
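One way to build such a CSV is to filter the existing labels file down to the rows named in train_val_list.txt. The sketch below uses only the standard library and rests on assumptions: the labels CSV has a header row with the file name in its first column, the list file has one name per line, and `write_subset_csv` and the output file name are hypothetical.

```python
import csv


def write_subset_csv(labels_csv, names_txt, out_csv):
    """Copy the header and every row whose first column is listed in names_txt."""
    with open(names_txt) as f:
        keep = {line.strip() for line in f if line.strip()}
    kept = 0
    with open(labels_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # assumed header row, copied unchanged
        for row in reader:
            # Keep only rows whose file name appears in the list file.
            if row and row[0] in keep:
                writer.writerow(row)
                kept += 1
    return kept
```

For example, write_subset_csv('data_labels.csv', 'train_val_list.txt', 'train_val_labels.csv') would produce a file that could replace data_labels.csv in the from_csv call from the original post, leaving split_by_rand_pct(0.2) to do the 80/20 split.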