Having problems with ImageDataBunch.from_df

I’m trying to apply the techniques we’ve learned in class to the Kaggle humpback whales competition. I’m working on Paperspace. I’ve been able to download the data successfully, but I’m having trouble creating the ImageDataBunch.

Here’s the relevant part of my notebook:

The full error message is below. My guess is that it’s looking for the Ids rather than the image names in the folder containing the images, but renaming the columns or swapping their order didn’t help. I’ve double-checked that all the images are where they’re supposed to be and that the paths are correct, so I’m fairly sure everything is in order there. Any help is appreciated!


KeyError Traceback (most recent call last)
/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process_one(self, item)
277 def process_one(self,item):
--> 278 try: return self.c2i[item] if item is not None else None
279 except:

KeyError: 'w_2fdf4cb'

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
in
----> 1 data = ImageDataBunch.from_df(path=path/'train', df=df, bs=bs)

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/vision/data.py in from_df(cls, path, df, folder, sep, valid_pct, fn_col, label_col, suffix, **kwargs)
123 src = (ImageItemList.from_df(df, path=path, folder=folder, suffix=suffix, cols=fn_col)
124 .random_split_by_pct(valid_pct)
--> 125 .label_from_df(sep=sep, cols=label_col))
126 return cls.create_from_ll(src, **kwargs)
127

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
391 self.valid = fv(*args, **kwargs)
392 self.__class__ = LabelLists
--> 393 self.process()
394 return self
395 return _inner

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self)
438 "Process the inner datasets."
439 xp,yp = self.get_processors()
--> 440 for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
441 return self
442

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
561 def process(self, xp=None, yp=None, filter_missing_y:bool=False):
562 "Launch the processing on self.x and self.y with xp and yp."
--> 563 self.y.process(yp)
564 if filter_missing_y and (getattr(self.x, 'filter_missing_y', None)):
565 filt = array([o is None for o in self.y])

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self, processor)
66 if processor is not None: self.processor = processor
67 self.processor = listify(self.processor)
---> 68 for p in self.processor: p.process(self)
69 return self
70

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self, ds)
284 ds.classes = self.classes
285 ds.c2i = self.c2i
--> 286 super().process(ds)
287
288 def __getstate__(self): return {'classes':self.classes}

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self, ds)
36 def __init__(self, ds:Collection=None): self.ref_ds = ds
37 def process_one(self, item:Any): return item
---> 38 def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
39
40 class ItemList():

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
36 def __init__(self, ds:Collection=None): self.ref_ds = ds
37 def process_one(self, item:Any): return item
---> 38 def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
39
40 class ItemList():

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process_one(self, item)
278 try: return self.c2i[item] if item is not None else None
279 except:
--> 280 raise Exception("Your validation data contains a label that isn't present in the training set, please fix your data.")
281
282 def process(self, ds):

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
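
The final exception says that some label in the validation split has no counterpart in the training split. The traceback shows from_df choosing that split at random (random_split_by_pct), so which rows land in validation is a matter of chance. Below is a minimal, library-free sketch of the same check on a plain list of labels. The function name, the toy labels, and the hand-picked validation indices are all made up for illustration; this is not fastai's own code.

```python
def labels_missing_from_train(labels, valid_idx):
    """Return validation labels that never occur among the training rows.

    `labels` is a plain list with one label per image (hypothetical input);
    `valid_idx` holds the row positions that were assigned to validation.
    """
    valid = set(valid_idx)
    train_labels = {lab for i, lab in enumerate(labels) if i not in valid}
    valid_labels = {labels[i] for i in valid}
    return sorted(valid_labels - train_labels)


# Toy data: "w_bbb" has a single image, the other classes have two each.
labels = ["new_whale", "new_whale", "w_aaa", "w_aaa", "w_bbb"]

# The only "w_bbb" row goes to validation, so training never sees that label.
print(labels_missing_from_train(labels, valid_idx=[1, 4]))  # ['w_bbb']

# Every validation label still has a row left in training.
print(labels_missing_from_train(labels, valid_idx=[1, 3]))  # []
```

Any non-empty result corresponds to the situation the exception complains about.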

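Edit: this is probably because the Humpback competition has many categories with very few examples, some with only one, so a random validation split can easily contain a label that never appears in the training set. I had searched for posts about ImageDataBunch and my error message before posting, but not about Humpback, sorry.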
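
One possible way around this, sketched below under stated assumptions, is to choose the validation rows yourself so that every class keeps at least one image in training and single-image classes never go to validation. The helper name safe_valid_indices, its parameters, and the toy labels are hypothetical. It works on a plain list of labels (for example, the Id column converted to a list) and uses only the standard library.

```python
import random
from collections import defaultdict


def safe_valid_indices(labels, valid_pct=0.2, seed=42):
    """Pick validation row indices so every validation label stays in training too."""
    rng = random.Random(seed)

    # Group row positions by label.
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)

    # Classes with a single image are skipped entirely. For every other class,
    # one randomly chosen row is reserved for training; the rest are candidates.
    candidates = []
    for idxs in by_label.values():
        if len(idxs) < 2:
            continue
        idxs = idxs[:]
        rng.shuffle(idxs)
        candidates.extend(idxs[1:])  # idxs[0] always stays in training

    rng.shuffle(candidates)
    n_valid = min(len(candidates), int(valid_pct * len(labels)))
    return sorted(candidates[:n_valid])


# Toy data: one large class, one small class, two single-image classes.
labels = ["new_whale"] * 10 + ["w_aaa"] * 3 + ["w_bbb", "w_ccc"]
valid_idx = safe_valid_indices(labels)

valid_set = set(valid_idx)
train_labels = {lab for i, lab in enumerate(labels) if i not in valid_set}
assert all(labels[i] in train_labels for i in valid_idx)
print(len(valid_idx), "validation rows")
```

If I remember the fastai v1 data block API correctly, a list of indices like this can be passed to split_by_idx in place of the random_split_by_pct call that from_df makes internally, but treat that as an assumption and check it against the version you have installed. Note that the validation set may end up smaller than valid_pct when most classes have only one image.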

@GiantSquid I’m having the same kind of error in this competition. Any thoughts on how you got around it?

This works for me:
data = ImageItemList.from_df(trn_df, '../input/train', suffix='.tif')