Tabular data

Hi all

I’m currently working on a tabular data project. I’ve finished building a quick model, and now it’s time to improve it.

After searching the forums I still have a few questions:

  1. Is it OK to use procs = [FillMissing, Categorify, Normalize] on the test set? What are the pros and cons? (I used them on the training set.)

  2. layers=[xx, yy]: what exactly does it mean? Is there a guide on how to tune these values?

Thank you!
Offir

The same preprocessing should be applied to your test set, so if you used all three on the training set, they should be applied there as well: those statistics and patterns are what your model was trained on.

To the second, that is how many neurons are in the fully connected layers that make up our tabular models. There is no real ‘method’ per se for choosing a size. Either [200, 100] or [1000, 500] (like in Rossmann) works well, but you could experiment and see what layer sizes work best 🙂
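
For instance (a minimal sketch using the fastai v1 tabular API; df, cat_names, cont_names, and dep_var are placeholders for your own data):

```python
from fastai.tabular import *

procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(0.2)
        .label_from_df(cols=dep_var)
        .databunch())

# layers=[200, 100]: two fully connected hidden layers with 200 and 100 neurons
learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)
```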

I saw some work a few months ago suggesting that three layers was the ‘ideal’ model size, e.g. layers=[200, 200, 200]. Food for thought 🙂

Thank you for your answer.

About the same preprocessing: for instance, in Jeremy’s notebook lesson6-rossmann.ipynb he didn’t use the procs for the test set.

Do you know why?

They’re automatically applied with .add_test 🙂

(The same pre-processors as your training data)
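
A minimal sketch of what that looks like (fastai v1; test_df and the other names are placeholders):

```python
# The test set inherits FillMissing/Categorify/Normalize from the training list,
# so the same statistics (e.g. normalization means and stds) are reused
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(0.2)
        .label_from_df(cols=dep_var)
        .add_test(TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names))
        .databunch())
```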

Oh OK.

If I don’t pass procs=procs in TabularList.from_df, will it automatically skip them in add_test too?

Yes, as your procs will be empty. But you should always do this (apply procs).

Thank you very much!

I’d like to ask you another question about the model, this time about the learn.model output:

  1. Embeddings: what do they actually mean? Can I change them manually?

  2. Two things in the printed model:
     * (bn_cont): BatchNorm1d(2, eps=1e-05): what does the 2 mean, and what is eps?
     * in_features=206, out_features=400: can I control these? What do they mean?

Thanks!

Embeddings are for your categorical data: (0): Embedding(4, 3) means you have four unique values for the first categorical variable, and three dimensions have been provided to describe that variable. The actual number of dimensions can be changed in the setup. The default is set in fastai/tabular/data.py:

```python
#def emb_sz_rule(n_cat:int)->int: return min(50, (n_cat//2)+1)
def emb_sz_rule(n_cat:int)->int: return min(600, round(1.6 * n_cat**0.56))

def def_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for n depending on classes if not given in sz_dict."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat, sz
```
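
As a sanity check, that rule reproduces the Embedding(4, 3) above: with four unique values, round(1.6 * 4**0.56) = round(3.48) = 3 dimensions.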

But you can directly specify the size in your code.
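
For example (a sketch, again fastai v1; ‘Store’ is a hypothetical column name):

```python
# Override the default rule: force a 50-dimensional embedding for the 'Store' category
learn = tabular_learner(data, layers=[200, 100], emb_szs={'Store': 50}, metrics=accuracy)
```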

For the in/out sizes, these are your layers. The first in size is the width of your data (including embeddings) and out is the first layer size. For the middle layers, you will see the previous out size become the next in size. Your final out size is the number of possible classifications.
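
Purely as an illustration (not fastai’s actual model code), the chain for layers=[200, 100] with 206 input features and two classes looks like:

```python
import torch.nn as nn

# Each layer's out_features becomes the next layer's in_features
model = nn.Sequential(
    nn.Linear(in_features=206, out_features=200),  # data width -> first layer size
    nn.ReLU(),
    nn.Linear(in_features=200, out_features=100),
    nn.ReLU(),
    nn.Linear(in_features=100, out_features=2),    # final out = number of classes
)
```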

As for tuning, I saw this recently but haven’t had time to investigate:
Get Better fastai Tabular Model with Optuna

Thank you!

AWESOME find @Ralph!

Jeremy’s twitter FTW…

Oh, thanks for mentioning my post.

In that post, I just tweaked the number of layers, the number of units in each layer, and the dropout ratio of each layer, though I could also have set the embedding size for each categorical attribute.
