Custom ItemList, getting ForkingPickler broken pipe

Glad to see someone else working on the PetFinder competition by combining different data together! I participated in that competition as well, but I only combined text and tabular data (and got CNN features from a different model). The model was sensitive and hard to train, so I ended up training just a tabular NN. Looking forward to your post on combining all three!

I started working on it a little too late… I did not submit anything to the real competition. And now I can’t test on the public leaderboard; submissions seem to be closed. This is my first competition, so I would have liked to test on the competition test set, because the performance I am getting seems too good to be true.

I am getting a quadratic kappa score of 0.950 on a 10% validation set, while the best team on the public leaderboard has a quadratic kappa score of 0.502…

I haven’t done any cross-validation and this is my first competition, so I am remaining cautious… I would really like to make a submission on the test set before releasing anything, to see whether the performance I am getting generalizes.

Yeah, it seems too good to be true. The best kappa score I achieved with a neural net on this dataset is 0.41, using a 5-fold strategy based on RescuerID. What is your validation strategy? Do you pick a random 10% of the set? Do you do any feature engineering?

Indeed this seems too good to be true… I don’t have enough experience in data science to be 100% sure. I thought maybe I was not using the KappaScore from fastai correctly.

But then this morning I tested with a tabular_learner, loading only the tabular data from the competition’s train.csv file, and got around 0.32 quadratic kappa (without even processing the sentiment files).

But for my model using image, tabular and text data, I created a dataframe with one row per image, and I also parse the sentiment files for each PetID. Each row contains all the tabular information for the pet that the picture belongs to. So pets with more than one photo have more rows in the training dataset, which gives more weight to the pets with more photos. The csv has 14993 rows, but joined with the pictures, the dataset has 58311 rows (there are 58311 pictures in train_images).

But now my validation set potentially has multiple predictions for a single PetID, so I just average the predictions per PetID and round the result to the nearest AdoptionSpeed.
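A minimal sketch of that aggregation step, using only the standard library. The function name `aggregate_per_pet`, the sample values, and the 0 to 4 clipping range for AdoptionSpeed are assumptions made for illustration, not the poster's actual code.

```python
from collections import defaultdict


def aggregate_per_pet(pet_ids, predictions, min_speed=0, max_speed=4):
    """Average row-level predictions per pet, then round to the nearest class."""
    grouped = defaultdict(list)
    for pet_id, pred in zip(pet_ids, predictions):
        grouped[pet_id].append(pred)

    result = {}
    for pet_id, preds in grouped.items():
        mean = sum(preds) / len(preds)
        # int(mean + 0.5) rounds halves up for non-negative values;
        # Python's built-in round() would round halves to the nearest even number.
        label = int(mean + 0.5)
        # Clip to the assumed valid AdoptionSpeed range.
        result[pet_id] = min(max(label, min_speed), max_speed)
    return result


# Hypothetical row-level predictions: pet "a" has three images, pet "b" has one.
pet_ids = ["a", "a", "a", "b"]
predictions = [1.0, 2.0, 2.0, 3.6]
per_pet = aggregate_per_pet(pet_ids, predictions)
print(per_pet)
```

Note that averaging only fixes the shape of the predictions (one per pet); it does not by itself say anything about whether the validation rows are independent of the training rows.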

Using that even with just a tabular_learner, I get around 0.9329 kappa score.

I used the method quadratic_weighted_kappa from this notebook to test something different than fastai KappaScore…
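For reference, here is a small pure-Python sketch of the quadratic weighted kappa, which can serve as a third opinion when two implementations need cross-checking. It is not the code from the linked notebook; the function signature and the default of 5 classes are assumptions. scikit-learn's `cohen_kappa_score(y_true, y_pred, weights="quadratic")` computes the same metric.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer labels in range(n_classes)."""
    n = len(y_true)

    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement weight grows with the squared distance between classes.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected count if true and predicted labels were independent.
            expected = hist_true[i] * hist_pred[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    # Undefined (division by zero) if every true and predicted label is the
    # same single class; this sketch does not handle that case.
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
inverted = quadratic_weighted_kappa([0, 0, 4, 4], [4, 4, 0, 0])
print(perfect, inverted)
```

If several implementations agree on a suspiciously high score, the metric is probably fine and the validation split is the more likely culprit.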

So I am not sure if what I am doing is correct or not.

I think I know what went wrong with your validation approach. Because you upsample the dataset by duplicating the tabular data for pets with several images, and then pick a random 10% of the upsampled data as your validation set, there is a high chance that data for one pet is shared between the train and validation sets (e.g. pet A has 10 images, so you create 10 rows for it; the first 8 rows go into the train set and the last 2 into the validation set). This is known as ‘leaking information from the training set into the validation set’, and it is especially bad here since the only difference between the 10 rows is the image (the tabular columns are copied).

You can pick a better validation set by making sure it contains completely different pets from the training set (or ideally, different rescuer IDs; as I remember, the test set in that competition has a completely different set of rescuer IDs).
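A rough sketch of such a group-aware split, using only the standard library. The helper name `split_by_group` and the sample IDs are made up for illustration; the same function works whether the grouping key is PetID or RescuerID. scikit-learn's `GroupShuffleSplit` and `GroupKFold` offer the same idea as library classes.

```python
import random


def split_by_group(groups, valid_fraction=0.1, seed=42):
    """Return (train_idx, valid_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Hold out a fraction of the *groups*, not of the rows.
    n_valid = max(1, round(len(unique_groups) * valid_fraction))
    valid_groups = set(unique_groups[:n_valid])

    train_idx = [i for i, g in enumerate(groups) if g not in valid_groups]
    valid_idx = [i for i, g in enumerate(groups) if g in valid_groups]
    return train_idx, valid_idx


# Hypothetical upsampled dataset: one row per image, keyed by PetID.
pet_ids = ["a", "a", "a", "b", "c", "c", "d", "e", "e", "e"]
train_idx, valid_idx = split_by_group(pet_ids, valid_fraction=0.2)

train_pets = {pet_ids[i] for i in train_idx}
valid_pets = {pet_ids[i] for i in valid_idx}
print(train_pets, valid_pets)
```

Because whole groups are held out, the validation share of rows will only approximately match `valid_fraction` when groups differ in size.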

Makes complete sense! I will fix my code tonight!

For anyone interested, I published a notebook showing an example on how to use MixedItemList:

Etienne,

I am getting the same error on my toy dataset on Windows, before I upload, when using a custom ItemList (and, as you say, I do not get the error when not using that custom ItemList).

I am working in a jupyter notebook, and tried to put

if __name__ == '__main__':

before calling

lr_find(learn)
learn.recorder.plot()

But I still get the same error.

Could you explain the “…code doing the training loop…” part, i.e. where specifically did you put if __name__ == '__main__': ?

Yeah I had to put some of my code in a separate file (that I import in the notebook) for the Python multi-process to be happy… If the code was in the notebook it was not happy and threw this error.
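One likely explanation for why the separate file helps (an interpretation, not something confirmed in this thread): pickle serializes classes and functions by reference, that is, by module name and qualified name, rather than by copying their code. A worker process started with the spawn method, as on Windows, has to import that module to resolve the reference, and it cannot import definitions that only exist in notebook cells. The sketch below shows the by-reference behaviour with a standard-library class.

```python
import pickle
from collections import OrderedDict

# Pickling a class stores only *where to find it*, not its source code.
payload = pickle.dumps(OrderedDict)
print(payload)

# The payload contains the module name and the class name as plain strings.
has_module = b"collections" in payload
has_name = b"OrderedDict" in payload

# Unpickling re-imports the module and looks the name up again, so the
# result is the very same class object.
restored = pickle.loads(payload)
print(restored is OrderedDict)

# A class defined in a notebook cell would be referenced as "__main__.<ClassName>".
# Moving it into an importable .py file gives worker processes a module
# they can actually import.
```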

Thanks Etienne, much appreciated. I was able to put num_workers = 0 and accomplish as you describe. Thank you very much.

Yeah num_workers=0 fixes it, but then you lose multi-processing… Depending on your application this could be fine, but if you are processing images, this will be much much slower.

Is there a good way to add test sets correctly? :smirk:

If you are using MixedItemList, I could not make it work, because it doesn’t work like other ItemLists.

What I did in my code was to trick fastai into thinking the test set is actually the validation set, and then get predictions for the validation set.

Yes, thank you for your excellent work, it has helped me a lot. However, if I set the training set size to 0 in fastai, an error is reported :pensive:

Hopefully fixed in master now (thanks to @sgugger 's quick response): https://github.com/fastai/fastai/pull/2107

Thanks @Herman!

Hi Etienne,

Thanks a lot for your wonderful work and also for kindly sharing it with the rest of us (and special thanks to sgugger). Coincidentally, I found myself following the exact path you followed. I was wondering if you were able to use the learn.predict method after training?

I am using a model only with tabular and image data and successfully trained it. But no luck on predict. This was sort of an effort to compile a list of predictions on validation set and “plot_top_losses” manually for inspection :slight_smile:

But the funny thing is that the error I’m getting is a KeyError from pd.Categorical, called by fastai’s Categorify proc on the data. Below is what I think is causing the problem: the creation of a dataframe from two copies of the TabularLine (fastai.tabular.data line 45). I tried to temporarily replace the TabularLine inside my MixedItem with a custom object to make those two lines “happy”, but ended up creating more problems that I understand even less. I’ve included the full traceback from when I call predict.

Final note: I’m checking whether I made some obvious mistake. Otherwise, I think the answer to my (and any) MixedItemList-related question is to wait for v2 :slight_smile: Also, apologies for the long post.

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/tabular/data.py in process_one(self, item)
     44     def process_one(self, item):
     45         df = pd.DataFrame([item,item])
---> 46         for proc in self.procs: proc(df, test=True)

Partial code (which is proven to work with images only) and full traceback below:

import copy

predicted_list = []  # stays empty until the commented-out lines below are re-enabled
dup_valid_dl = copy.deepcopy(data.valid_dl)
for item in dup_valid_dl.dataset:
    # print(f'{item}')
    data_item, label = item
    # data_item.obj holds the parts of the mixed item: image first, tabular line second
    img_item, tab_item = data_item.obj[0], data_item.obj[1]
    print(f'Tab_Item:\t{tab_item}')
    print(f'Img_Item:\t{img_item}')
    print(f'Label:\t{label}')
    predicted = learn.predict( (data_item) )[0].data[0]  # this call raises the KeyError below
    # print(f'{predicted}')
    # truth = label.data
    # error = truth-predicted
    # predicted_list.append([item, np.abs(error), truth, predicted])
    # print(f'{truth:.3f}-{predicted:.3f}={error:.3f}')
    # print(f'======================================================================================')
# sort by absolute error, largest first
predicted_list = sorted(predicted_list, key=lambda x: -x[1])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'port_id'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-175-383315a78990> in <module>()
     20     print(my_df)
     21 
---> 22     predicted = learn.predict( (data_item) )[0].data[0]
     23 #     print(f'{predicted}')
     24 #     truth = label.data

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py in predict(self, item, return_x, batch_first, with_dropout, **kwargs)
    372     def predict(self, item:ItemBase, return_x:bool=False, batch_first:bool=True, with_dropout:bool=False, **kwargs):
    373         "Return predicted class, label and probabilities for `item`."
--> 374         batch = self.data.one_item(item)
    375         res = self.pred_batch(batch=batch, with_dropout=with_dropout)
    376         raw_pred,x = grab_idx(res,0,batch_first=batch_first),batch[0]

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_data.py in one_item(self, item, detach, denorm, cpu)
    178         "Get `item` into a batch. Optionally `detach` and `denorm`."
    179         ds = self.single_ds
--> 180         with ds.set_item(item):
    181             return self.one_batch(ds_type=DatasetType.Single, detach=detach, denorm=denorm, cpu=cpu)
    182 

~/anaconda3/envs/pytorch_p36/lib/python3.6/contextlib.py in __enter__(self)
     79     def __enter__(self):
     80         try:
---> 81             return next(self.gen)
     82         except StopIteration:
     83             raise RuntimeError("generator didn't yield") from None

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/data_block.py in set_item(self, item)
    606     def set_item(self,item):
    607         "For inference, will briefly replace the dataset with one that only contains `item`."
--> 608         self.item = self.x.process_one(item)
    609         yield None
    610         self.item = None

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/data_block.py in process_one(self, item, processor)
     88         if processor is not None: self.processor = processor
     89         self.processor = listify(self.processor)
---> 90         for p in self.processor: item = p.process_one(item)
     91         return item
     92 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/data_block.py in process_one(self, item)
    754         res = []
    755         for procs, i in zip(self.procs, item):
--> 756             for p in procs: i = p.process_one(i)
    757             res.append(i)
    758         return res

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/tabular/data.py in process_one(self, item)
     44     def process_one(self, item):
     45         df = pd.DataFrame([item,item])
---> 46         for proc in self.procs: proc(df, test=True)
     47         if len(self.cat_names) != 0:
     48             codes = np.stack([c.cat.codes.values for n,c in df[self.cat_names].items()], 1).astype(np.int64) + 1

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/tabular/transform.py in __call__(self, df, test)
    122         "Apply the correct function to `df` depending on `test`."
    123         func = self.apply_test if test else self.apply_train
--> 124         func(df)
    125 
    126     def apply_train(self, df:DataFrame):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/tabular/transform.py in apply_test(self, df)
    143         "Transform `self.cat_names` columns in categorical using the codes decided in `apply_train`."
    144         for n in self.cat_names:
--> 145             df.loc[:,n] = pd.Categorical(df[n], categories=self.categories[n], ordered=True)
    146 
    147 FillStrategy = IntEnum('FillStrategy', 'MEDIAN COMMON CONSTANT')

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'port_id'
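One plausible reading of the traceback above (a guess, not a confirmed diagnosis): the final frames go through `Int64Engine`, which suggests the dataframe built on line 45 has integer column labels instead of named columns. That is what pandas produces when the two "rows" are not Series with a named index. The sketch below reproduces only that pandas behaviour; the column name `port_id` is taken from the traceback, and the other names and values are invented.

```python
import pandas as pd

# A row taken from a dataframe is a Series whose index holds the column names.
row = pd.Series({"port_id": 7, "weight": 1.5})
df_named = pd.DataFrame([row, row])
print(list(df_named.columns))

# Plain values without labels end up with positional integer columns.
values = [7, 1.5]
df_positional = pd.DataFrame([values, values])
print(list(df_positional.columns))

# Looking up a column by name then fails the same way as in the traceback.
try:
    df_positional["port_id"]
    raised = False
except KeyError:
    raised = True
print(raised)
```

If this reading is right, the tabular processor would need to receive the original dataframe row rather than an already processed tabular item, but that has not been verified against MixedItemList.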

Hey @alpsayin, sorry, I haven’t touched anything deep learning related since May. But from what I remember, I did not try to make predict work. You can see in my code how I worked around the problem: https://github.com/EtienneT/fastai-petfinder/blob/master/Fastai%20PetFinder.ipynb. It is really not ideal, though.

Can’t wait to see what v2 will yield for this kind of scenario too.

??? If, in a Jupyter notebook, you enclose a block of code inside an
if name == 'main':
statement, you get a
NameError.

NameError Traceback (most recent call last)
in

NameError: name 'name' is not defined
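The NameError is what Python raises when the guard is typed without the double underscores, which forum formatting tends to swallow. A short sketch of both spellings follows; the `main` function is a placeholder, not code from this thread.

```python
def main():
    # Placeholder for the real work, e.g. building the learner and training.
    return "training loop would run here"


# Correct spelling: two underscores on each side of both identifiers.
# In a script or a notebook cell, __name__ is the string "__main__".
if __name__ == "__main__":
    print(main())

# The misspelled version refers to an undefined variable called `name`.
try:
    eval("name == 'main'", {})
    message = ""
except NameError as err:
    message = str(err)
print(message)
```

In a notebook the correctly spelled guard is always true, so it does not change which code the worker processes can import.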

Hi Elfayoumi,
I’m also wondering how to join several ImageLists in a MixedItemList. Is there a straightforward way to do that?
Thanks
