TabularPandas procs problems

Hi!

(Disclaimer: I already posted this question under the Lesson 6 official topics - Part 1 2022, but judging from the previous comments there does not seem to be much activity there, so I am creating a new topic here. Sorry if this breaks any rules.)

I am trying to use the TabularPandas tool for predicting future sales in the playground Kaggle competition Tabular Playground Series - Sep 2022.

The results are currently poor, despite relatively thorough analysis and feature engineering, so any tips on how to improve the notebook would be really appreciated.

However, I was wondering whether “procs” are actually needed when creating the TabularPandas object, because I get better results for RandomForest and XGBoost when I leave the procs out and instead use sklearn’s LabelEncoder to encode the relevant columns. But I am unable to train a neural network with the fastai framework: when I create a learner with tabular_learner from a dls built without procs, I get this error:

AttributeError                            Traceback (most recent call last)
Input In [174], in <cell line: 1>()
----> 1 learn = tabular_learner(dls, layers=[1000,500], config=config_tabular,
      2                         n_out=1,
      3                         #loss_func=F.mse_loss,
      4                         metrics=[exp_rmspe])

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/learner.py:42, in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, **kwargs)
     40 if layers is None: layers = [200,100]
     41 to = dls.train_ds
---> 42 emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
     43 if n_out is None: n_out = get_c(dls)
     44 assert n_out, "n_out is not defined, and could not be inferred from data, set dls.c or pass n_out"

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/model.py:32, in get_emb_sz(to, sz_dict)
     27 def get_emb_sz(
     28     to:Tabular|TabularPandas,
     29     sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default emb_sz_rule
     30 ) -> list: # List of embedding sizes for each category
     31     "Get embedding size for each cat_name in Tabular or TabularPandas, or populate embedding size manually using sz_dict"
---> 32     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/model.py:32, in <listcomp>(.0)
     27 def get_emb_sz(
     28     to:Tabular|TabularPandas,
     29     sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default emb_sz_rule
     30 ) -> list: # List of embedding sizes for each category
     31     "Get embedding size for each cat_name in Tabular or TabularPandas, or populate embedding size manually using sz_dict"
---> 32     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/basics.py:491, in GetAttr.__getattr__(self, k)
    489 if self._component_attr_filter(k):
    490     attr = getattr(self,self._default,None)
--> 491     if attr is not None: return getattr(attr,k)
    492 raise AttributeError(k)

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/transform.py:212, in Pipeline.__getattr__(self, k)
--> 212 def __getattr__(self,k): return gather_attrs(self, k, 'fs')

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/transform.py:173, in gather_attrs(o, k, nm)
    171 att = getattr(o,nm)
    172 res = [t for t in att.attrgot(k) if t is not None]
--> 173 if not res: raise AttributeError(k)
    174 return res[0] if len(res)==1 else L(res)

AttributeError: classes
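For reference, the no-procs approach I described looks roughly like this. The toy DataFrame and column names below are made up for illustration and are not the actual competition columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the competition data (columns are invented)
df = pd.DataFrame({
    "country": ["Norway", "Sweden", "Norway", "Finland"],
    "product": ["Mug", "Hat", "Hat", "Mug"],
    "num_sold": [120, 95, 80, 143],
})

# Encode each categorical column with its own LabelEncoder, keeping the
# fitted encoders so the test set can be transformed consistently later
encoders = {}
for col in ["country", "product"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

# LabelEncoder assigns codes in sorted order: Finland=0, Norway=1, Sweden=2
print(df["country"].tolist())  # -> [1, 2, 1, 0]
```

The resulting integer columns go straight into RandomForest or XGBoost, but since this bypasses fastai’s Categorify, the TabularPandas object has no `classes` attribute for `get_emb_sz` to read, which is where the traceback above comes from.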

Hello,

Categorify must be applied to your data if it contains categorical variables (and FillMissing if it contains NaNs) that you would like a Learner to handle. Additionally, it is strange that random forests and XGBoost are more accurate without procs: Normalize and Categorify should not affect tree-based models, since trees only depend on the ordering of feature values, not their scale. Could you train them a few times with various seeds and average their results to check whether the difference in performance is actually significant?
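A minimal sketch of that seed-averaging check, using synthetic data in place of your encoded features (swap in your own train/validation split and metric):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded competition features
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the same model under several seeds and average the validation error,
# so a single lucky or unlucky seed doesn't drive the comparison
scores = []
for seed in range(5):
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(mean_squared_error(y_valid, model.predict(X_valid)))

print(f"mean MSE over {len(scores)} seeds: "
      f"{np.mean(scores):.1f} +/- {np.std(scores):.1f}")
```

Run the same loop for both preprocessing variants; if their mean scores differ by less than the spread across seeds, the gap is probably noise.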

As for improvements, I strongly recommend you explore and play around with others’ notebooks. They can provide you with a good sense of what methods work well and offer novel perspectives for this task. Also, browse through past time-series forecasting competitions and study the winning solutions; numerous cutting-edge techniques for such datasets can be found there.

Please let me know if you have other questions.
