Split_by_? in datablock

Hi guys,

I am working on an image classification problem. I have a master image folder with 20 subfolders, one per label:

ford (00001.png, 00002.png, …, 00150.png)
honda (00151.png, 00152.png, …, 00300.png)

I also have a valid.txt containing:

ford/00125
ford/00127
…
ford/00150
honda/00276
honda/00277

What split should I use, and how do I use it to parse the validation set given in valid.txt?

Thanks!

I believe it’s split_by_fname_file.
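For reference, here is roughly how I would call it (a minimal sketch, assuming fastai v1; if I remember the signature right, the file name is resolved against the list's own path unless you pass path= explicitly):

src = (ImageItemList.from_folder(path_images)
                    .split_by_fname_file('valid.txt', path=PATH)  # PATH = folder that actually contains valid.txt
                    .label_from_folder())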

Thanks for the reply!

Do you think I need to pass an additional function to retrieve just 00001.png, since the txt contains {folder_name}/{image_id}?

I tried split_by_fname_file and called data.valid_ds.x but the result was unfortunately ImageItemList (0 items).

Sounds like you could add .png to the end of each line in valid.txt.
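If valid.txt is small, a quick one-off in plain Python would do it (assuming the file sits in your working directory):

# read the names, append .png where it's missing, and write the file back
with open('valid.txt') as f:
    names = [line.strip() for line in f if line.strip()]
with open('valid.txt', 'w') as f:
    f.write('\n'.join(n if n.endswith('.png') else n + '.png' for n in names))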

Thanks for the reply.

valid.txt now contains ford/00001.png

Just tried that. I still suspect there is a parsing problem I haven't got quite right. Here is my code:

path_val = PATH + "valid.txt"

src = (ImageItemList.from_folder(path_images)
                    .split_by_fname_file(path_val)
                    .label_from_folder())

data = (src.transform(get_transforms())
           .databunch(bs=bs)
           .normalize(imagenet_stats))

data.valid_ds.x

ImageItemList (0 items)

Any idea what might be wrong?

Had a look at the source code:

def split_by_fname_file(...):
    ...
    return self.split_by_files(valid_names)

So it just calls split_by_files, which does not seem like the split I need?
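If I'm reading the v1 source right, split_by_files matches each item by its bare file name (o.name), so an entry like ford/00125.png never matches anything and the validation set comes back empty. A workaround is split_by_valid_func with a membership test on the path relative to the image folder (a sketch, assuming the items are full Path objects):

# build the set of folder/name entries from valid.txt, then split on membership
with open(path_val) as f:
    valid_set = set(line.strip() for line in f if line.strip())

src = (ImageItemList.from_folder(path_images)
                    .split_by_valid_func(lambda o: o.relative_to(path_images).as_posix() in valid_set)
                    .label_from_folder())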

First off, are you sure the path is right and that it is loading the images into your training set OK?

Yes, by doing data.train_ds.x I am getting ImageItemList (1000 items)

Hey! Maybe you should try .split_by_fname_file('valid.txt').

Thanks Junlin for the suggestion! My argument already points at valid.txt, via path_val = PATH + "valid.txt". I tried passing just the file name and got [Errno 2] No such file or directory.

Oh no, I just realized that I might have ford/00001 and also benz/00001, which essentially means I need a split that includes the folder name.

Okay, I tried using split_by_list instead.

I now have train_list = ['ford/00001.jpg', 'ford/00002.jpg', ...] and valid_list = ['ford/00125.jpg', 'ford/00126.jpg', ...]

after calling

src = (ImageItemList.from_folder(path_images)
                    .split_by_list(train_list, valid_list)
                    .label_from_folder())

the error is now:

'list' object has no attribute 'ignore_empty'

I suspect it's my lack of understanding of the source code, i.e., not knowing the correct way to pass the arguments (whether to include the folder prefix ford/ etc.).

Highly appreciate any help. Thank you!
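In fastai v1, split_by_list expects two ItemLists (e.g. slices of the original list), not plain Python lists of file names, which is what triggers 'list' object has no attribute 'ignore_empty'. Since your names carry the folder prefix, one alternative is to compute the validation indices yourself and use split_by_idx (a sketch, reusing your valid_list):

il = ImageItemList.from_folder(path_images)
valid_set = set(valid_list)  # entries like 'ford/00125.jpg'
# mark as validation every item whose path relative to path_images appears in valid_set
valid_idx = [i for i, o in enumerate(il.items)
             if o.relative_to(path_images).as_posix() in valid_set]
src = il.split_by_idx(valid_idx).label_from_folder()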

Your path_images may be incorrect.

I think it is correct, because I do get results when using the default split.

Running !ls path_images shows all the folders:

ford, honda, ...

Is that the full error? What is the full traceback?

Thanks @ilovescience, yes, that is the full error; nothing further in the stack. Any idea?

Same error here,

import numpy as np

# load the train/valid file-name lists from csv
train_fnames = np.loadtxt('train_fnames.csv', delimiter=",", dtype=str).tolist()
valid_fnames = np.loadtxt('valid_fnames.csv', delimiter=",", dtype=str).tolist()

src = SegmentationItemList.from_folder(path_img)

tfms = get_transforms(flip_vert=True, max_warp=0.1, max_rotate=20, max_zoom=2, max_lighting=0.3)

src = (src.split_by_list(train_fnames, valid_fnames)
          .label_from_func(get_y_fn, classes=codes))

data = (src.transform(tfms, size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

Error:

'list' object has no attribute 'ignore_empty'

Quick Fix

The easiest solution is to use split_by_files(valid_fnames), keeping only train and valid images in path_img: whatever isn't listed for validation goes into train (see the sketch below). My problem is that my test-set tiles are also in path_img.
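Roughly like this (a sketch, assuming bare file names that are unique across folders, since split_by_files appears to match on names only):

# everything listed in valid_fnames becomes validation; the rest of path_img becomes train
src = (SegmentationItemList.from_folder(path_img)
       .split_by_files(valid_fnames)
       .label_from_func(get_y_fn, classes=codes))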

Question
@sgugger I think I am looking for something like .split_by_idxs(train_idx=train_idx, valid_idx=valid_idx), but using file names instead of indices. Is it implemented somewhere? Is there a workaround?

Hello @sgugger, the above is still very relevant for us. Have you found a solution or suggestion for it? Thanks a lot!
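One possible workaround in the meantime: drop the test tiles first with filter_by_func, then split by name (a sketch, assuming fastai v1 and that train_fnames/valid_fnames hold bare, unique file names):

valid_set = set(valid_fnames)
keep = valid_set | set(train_fnames)  # train + valid only; everything else is test

src = (SegmentationItemList.from_folder(path_img)
       .filter_by_func(lambda o: o.name in keep)            # exclude the test tiles
       .split_by_valid_func(lambda o: o.name in valid_set)  # listed names go to validation
       .label_from_func(get_y_fn, classes=codes))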