Should I be getting different behavior with ImageDataBunch.from_folder vs. from_csv?

If I use ImageDataBunch.from_folder with this folder structure:

train/
    m/
    f/
valid/
    m/
    f/
tfms = get_transforms()
data = ImageDataBunch.from_folder(path='/home/user/example/',
                                  ds_tfms=tfms,
                                  size=128)
versus when I have all the images in one folder and use a CSV file that looks like this:
file, labels
img001.jpg, m
img002.jpg, f
img003.jpg, f
img004.jpg, m
tfms = get_transforms()

data2 = ImageDataBunch.from_csv(path='/home/user/example/',
                        folder='jpgs',
                        csv_labels='tagged_images.csv',
                        label_col=1,
                        sep=' ',
                        ds_tfms=tfms,
                        size=128,
                        valid_pct=0.2)

This gives the same results for data and data2:
print(data2.classes)
len(data2.classes),data2.c

['f', 'm']
(2, 2)

But I get very different behavior with fastai 1.0 (fastai 1.0.39, pytorch 1.0.0 py3.7_cuda10.0.130_cudnn7.4.1_1 [cuda100]). First, from_folder seems to treat m/f as mutually exclusive (an example is either 'm' or 'f', but never both or neither), while from_csv treats them as non-mutually exclusive (an example can be 'm', 'f', 'm f', or ''). In line with that, when I use create_cnn there seem to be two outputs in the second case instead of one, and accuracy/error_rate throw errors. I then also have trouble adapting
interp = ClassificationInterpretation.from_learner(learn1)
losses,idxs = interp.top_losses()
len(data.valid_ds)==len(losses)==len(idxs)
(i.e. len(losses) and len(idxs) are twice as long as data.valid_ds, which then breaks functions like plot_top_losses)
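To illustrate the shape mismatch I'm seeing (a plain-Python sketch with made-up numbers, not fastai internals):

```python
# Made-up numbers to illustrate the length mismatch.
n_valid = 4                      # hypothetical validation-set size
classes = ['f', 'm']             # two classes, as above

# Single-label case: one loss value per validation example.
single_label_losses = [0.10, 0.30, 0.20, 0.40]
assert len(single_label_losses) == n_valid

# What I observe in the second case looks like one loss value per
# example *per class*, flattened, so there are twice as many entries.
per_example_per_class = [[0.10, 0.90], [0.30, 0.70],
                         [0.20, 0.80], [0.40, 0.60]]
flat_losses = [l for row in per_example_per_class for l in row]
assert len(flat_losses) == 2 * n_valid   # len(losses) == 2 * len(valid_ds)
```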

Has anyone else run into this? Any solutions?

You're passing sep=' ' in your ImageDataBunch.from_csv call, which is used for multi-label problems. That's why you see those differences. For single-label classification, you shouldn't pass that argument.
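Roughly, the effect of sep can be sketched without fastai (a hypothetical helper and made-up CSV, not the library's actual parsing code; the last row shows a genuinely multi-labelled example):

```python
import csv
import io

# Hypothetical CSV; only the last row is actually multi-labelled.
text = "file,labels\nimg001.jpg,m\nimg002.jpg,f\nimg003.jpg,m f\n"

def parse_labels(csv_text, sep=None):
    """Mimic the effect of the sep argument: without sep, the whole label
    string is one class (single-label); with sep, the string is split
    into a list of classes (multi-label)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if sep is None:
        return [row['labels'] for row in reader]
    return [row['labels'].split(sep) for row in reader]

print(parse_labels(text))            # ['m', 'f', 'm f']
print(parse_labels(text, sep=' '))   # [['m'], ['f'], ['m', 'f']]
```

With sep, every example carries a *list* of labels, which is why the model gets one sigmoid output per class instead of a single softmax over mutually exclusive classes, even when each example happens to have exactly one label.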