Problems importing Kaggle data with labels in CSV file

Hi!

I’m struggling to import an image data set from Kaggle. It has all of the images in a single folder, and uses a CSV file to assign labels to them.

I have attempted to import this as a DataBunch using example code from other notebooks, but the functions used don’t seem to be available in fastai v3 (or at least I can’t get them to work).

I can import the CSV file successfully (e.g.)

   id                                        label
0  f38a6374c348f90b587e046aac6079959adf3835  0
1  c18f2d887b7ae4f6742ee445113fa1aef383ed77  1
2  755db6279dae599ebb4d39a9123cce439965282d  0
3  bc3f0c64fb968ff4a8bd33af6971ecae77c75e08  0
4  068aba587a4950175d04c680d38943fd488d6a9d  0

I can load the images successfully (e.g.)

np.random.seed(42)
src = (ImageItemList.from_csv(path, 'train/labels.csv', suffix='.tif')
       .random_split_by_pct(0.2)
       .label_from_df(sep=' ',  classes=[0, 1]))

data.show_batch(rows=3, figsize=(13,5))

But when I create a data bunch, it uses the file names as categories and not the labels in the CSV file.

Each file is labelled either 1 or 0 in the CSV file. I just can’t seem to get the right method and/or syntax to import this properly.

Can anyone please assist or point me to some example code?


The sep argument to label_from_df() is only for multi-label problems, where several labels are stored in one column of the df. It is then used to split them up. Since you have only one label per image, you don’t need to pass any argument to label_from_df(). The function also assumes that the label is in the second column, which is correct in this case.
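To make the role of sep concrete, here is a minimal plain-Python sketch of what splitting a label cell means. It is not fastai's implementation: the helper name split_labels is hypothetical, and the multi-label sample value is made up for illustration.

```python
from typing import List, Optional


def split_labels(raw: str, sep: Optional[str] = None) -> List[str]:
    """Return the labels stored in one CSV cell.

    With sep=None the whole cell is one label (single-label case).
    With a separator the cell is split into several labels (multi-label case).
    """
    if sep is None:
        return [raw]
    # Drop empty pieces caused by repeated or trailing separators.
    return [part for part in raw.split(sep) if part]


single = split_labels("0")                        # one label per image, as in this thread
multi = split_labels("cloudy primary", sep=" ")   # hypothetical multi-label cell
print(single, multi)
```

In the data set discussed here every cell holds a single 0 or 1, so no separator is needed.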

This should work properly:

src = (ImageItemList.from_csv(path, 'train/labels.csv', suffix='.tif')
       .random_split_by_pct(0.2)
       .label_from_df())

Thanks Rasmus,

I used your code and that seemed to resolve the error I was encountering. Now I have a different error.

path = '~/data/histo-cancer'
df = pd.read_csv('~/data/histo-cancer/train/labels.csv')

trn_tfms,_ = get_transforms(do_flip=True, flip_vert=True, max_rotate=30., max_zoom=1, max_lighting=0.05, max_warp=0.)

src = (ImageItemList.from_csv(path, 'train/labels.csv', suffix='.tif')
       .random_split_by_pct(0.2)
       .label_from_df()
)

data = (src.transform((tfms, _), size=224).databunch().normalize())

FileNotFoundError: [Errno 2] No such file or directory: '~/data/histo-cancer/./f1d38d1478da68a5a9ef3e5696a3909f8a5cbacc.tif'

It seems to be inserting /./ into the file path where it should be /train/. I’m not sure why.

The .from_csv() function also has a folder argument. Try setting folder="train" and see what happens. I have the train_labels.csv file in the folder above the train folder, and it works as I proposed. The problem might be that the default value for folder is ".".
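As a rough illustration of why a folder value of "." would produce the /./ seen in the error message, here is a small sketch using only the standard library. It mimics the path construction rather than reproducing fastai's code; the helper image_path is hypothetical.

```python
import posixpath


def image_path(root: str, folder: str, image_id: str, suffix: str = ".tif") -> str:
    """Join root, folder and file name without normalising the result."""
    return posixpath.join(root, folder, image_id + suffix)


root = "~/data/histo-cancer"
image_id = "f1d38d1478da68a5a9ef3e5696a3909f8a5cbacc"

default_path = image_path(root, ".", image_id)    # folder left at "."
train_path = image_path(root, "train", image_id)  # folder="train"

print(default_path)
print(train_path)
```

With folder set to "train", the joined path points into the train directory instead of the data set root.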

I usually prefer to use “from_df” (https://docs.fast.ai/vision.data.html#ImageDataBunch.from_df) to deal with lists of file names that need to be transformed, because it is more flexible (from_csv actually uses from_df internally).

In your example:

...
df = pd.read_csv(your_csv_file)
df['id'] = df['id'].apply(lambda x: './train/' + x + '.tif')
... 
il = ImageItemList.from_df(path=path, df=df, cols=['id'])
ils = il.random_split_by_pct()
ils = ils.label_from_df(cols=['label'])
...

NB: I’m not sure about the paths; I didn’t try the code…
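The idea behind the snippet above (turn each id into a relative file path and keep its label alongside) can be checked without fastai or pandas. The sketch below uses only the standard library; the helper rows_to_items is hypothetical, and it assumes a comma-separated file with an id,label header, which the thread does not show explicitly.

```python
import csv
import io
from typing import List, Tuple


def rows_to_items(csv_text: str, folder: str = "train", suffix: str = ".tif") -> List[Tuple[str, int]]:
    """Map each CSV row to (relative image path, integer label)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(f"{folder}/{row['id']}{suffix}", int(row["label"])) for row in reader]


# Two rows taken from the sample shown earlier in the thread.
sample = (
    "id,label\n"
    "f38a6374c348f90b587e046aac6079959adf3835,0\n"
    "c18f2d887b7ae4f6742ee445113fa1aef383ed77,1\n"
)

items = rows_to_items(sample)
for rel_path, label in items:
    print(rel_path, label)
```

Printing a few pairs like this is a quick way to confirm that the labels come from the label column and not from the file names.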

Take a look at “from_df” usage in this notebook:

Thanks Marius and Stefano. I moved the labels.csv up a folder in the hierarchy so now that works.

I tried both of your approaches and I was able to use both to create a data frame. I encountered an error when I tried to create a Data Bunch though. Despite having the correct file path and checking manually that the file exists at the command line, the code returns a file not found error. The path in the error message is correct.

I don’t understand why the code cannot find the file when it has the correct path, and I can copy/paste that exact same path in the CLI and find the file without a problem…

path = '~/data/histo-cancer/'
df = pd.read_csv('~/data/histo-cancer/train_labels.csv')

data = ImageDataBunch.from_df(path, df, folder="train", fn_col=0, label_col=1)
data.normalize(imagenet_stats)

FileNotFoundError                         Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    398                         if cls is not object
    399                                 and callable(cls.__dict__.get('__repr__')):
--> 400                             return _repr_pprint(obj, self, cycle)
    401
    402             return _default_pprint(obj, self, cycle)

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     for idx,output_line in enumerate(output.splitlines()):
    697         if idx:

~/anaconda3/lib/python3.6/site-packages/fastai/basic_data.py in __repr__(self)
     98
     99     def __repr__(self)->str:
--> 100         return f'{self.__class__.__name__};\n\nTrain: {self.train_ds};\n\nValid: {self.valid_ds};\n\nTest: {self.test_ds}'
    101
    102     @classmethod

~/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __repr__(self)
    460
    461     def __repr__(self)->str:
--> 462         x = f'{self.x}' # force this to happen first
    463         return f'{self.__class__.__name__}\ny: {self.y}\nx: {x}'
    464     def predict(self, res):

~/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __repr__(self)
     58         return self.items[i]
     59     def __repr__(self)->str:
---> 60         items = [self[i] for i in range(min(5,len(self.items)))]
     61         return f'{self.__class__.__name__} ({len(self)} items)\n{items}...\nPath: {self.path}'
     62

~/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
     58         return self.items[i]
     59     def __repr__(self)->str:
---> 60         items = [self[i] for i in range(min(5,len(self.items)))]
     61         return f'{self.__class__.__name__} ({len(self)} items)\n{items}...\nPath: {self.path}'
     62

~/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
     90
     91     def __getitem__(self,idxs:int)->Any:
---> 92         if isinstance(try_int(idxs), int): return self.get(idxs)
     93         else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
     94

~/anaconda3/lib/python3.6/site-packages/fastai/vision/data.py in get(self, i)
    264     def get(self, i):
    265         fn = super().get(i)
--> 266         res = self.open(fn)
    267         self.sizes[i] = res.size
    268         return res

~/anaconda3/lib/python3.6/site-packages/fastai/vision/data.py in open(self, fn)
    260     def open(self, fn):
    261         "Open image in fn, subclass and overwrite for custom behavior."
--> 262         return open_image(fn, convert_mode=self.convert_mode)
    263
    264     def get(self, i):

~/anaconda3/lib/python3.6/site-packages/fastai/vision/image.py in open_image(fn, div, convert_mode, cls)
    374     with warnings.catch_warnings():
    375         warnings.simplefilter("ignore", UserWarning) # EXIF warning from TiffPlugin
--> 376         x = PIL.Image.open(fn).convert(convert_mode)
    377     x = pil2tensor(x,np.float32)
    378     if div: x.div_(255)

~/anaconda3/lib/python3.6/site-packages/PIL/Image.py in open(fp, mode)
   2578
   2579     if filename:
-> 2580         fp = builtins.open(filename, "rb")
   2581         exclusive_fp = True
   2582

FileNotFoundError: [Errno 2] No such file or directory: '~/data/histo-cancer/train/c18f2d887b7ae4f6742ee445113fa1aef383ed77.tif'
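One possible cause, not confirmed anywhere in this thread: the path in the error still starts with a literal ~. A shell expands ~ to the home directory before running a command, which would explain why the same path works at the command line, but Python's built-in open() does not expand it. The sketch below shows the standard-library way to expand it; the helper name expand_home is hypothetical.

```python
import os


def expand_home(path: str) -> str:
    """Expand a leading '~' to the user's home directory, as a shell would."""
    return os.path.expanduser(path)


raw = "~/data/histo-cancer"
resolved = expand_home(raw)

print(raw, "->", resolved)
# The unexpanded form is looked up relative to the current directory,
# so these two checks can give different answers for the same data set.
print(os.path.isdir(raw), os.path.isdir(resolved))
```

If this is what is happening, passing the expanded path (for example path = expand_home('~/data/histo-cancer')) to the fastai call would be the thing to try. That is an untested suggestion.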

The same problem is happening to me as well. I have tried many things to fix it without success, and I am still looking for a solution. It has also affected my system: while printing, I am getting Epson error code e-01. It needs to be fixed.