Def_emb_sz is wrong in fastai/tabular/data.py

crcrpar · October 25, 2018, 3:21pm

IIRC, Jeremy’s rule of thumb for embedding sizes is min(50, (cardinality of a category + 1)//2).
For example, if a categorical variable has 3 distinct categories, then the corresponding embedding size should be min(50, (3 + 1) // 2) = 2. Correct so far?

https://github.com/fastai/fastai/blob/743b68489d6bbb1bb29d531dd798c602a1e0f802/fastai/tabular/data.py#L13-L17

def def_emb_sz(df, n, sz_dict):
    col = df[n]
    n_cat = len(col.cat.categories)+1  # extra cat for NA
    sz = sz_dict.get(n, min(50, (n_cat//2)+1))  # rule of thumb
    return n_cat,sz

However, the current implementation of def_emb_sz will return the cardinality of a category, because dict.get returns its second argument only if the key does not exist in the dict it is called on.
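
For reference, the arithmetic in the quoted snippet and the fallback behaviour of dict.get can be checked in isolation. The sketch below is plain Python, not fastai code: rule_as_stated and rule_in_snippet are hypothetical helper names, they take a raw category count instead of a DataFrame column, and the override value 10 is made up for illustration.

```python
def rule_as_stated(cardinality):
    # Rule of thumb as written in the post: min(50, (cardinality + 1) // 2).
    return min(50, (cardinality + 1) // 2)


def rule_in_snippet(cardinality, name, sz_dict):
    # Mirrors the quoted def_emb_sz, but takes a category count directly
    # (an assumption made here so the example needs no DataFrame).
    n_cat = cardinality + 1  # extra category for NA
    sz = sz_dict.get(name, min(50, (n_cat // 2) + 1))
    return n_cat, sz


# Key absent from sz_dict: dict.get falls back to the rule-of-thumb default.
print(rule_in_snippet(3, "col", {}))           # (4, 3)
# Key present in sz_dict: dict.get returns the stored value instead.
print(rule_in_snippet(3, "col", {"col": 10}))  # (4, 10)
# The rule as stated in the post gives a different size for the same column.
print(rule_as_stated(3))                       # 2
```

Under these assumptions, a column that is not listed in sz_dict gets the rule-of-thumb size rather than its cardinality, and the snippet's formula (which counts the extra NA category first) yields 3 where the rule as stated yields 2.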

My solution
IMO, sz_dict is not necessary, so I would like to reduce the arguments from (df, n, sz_dict) to (df, n).

Additional
I know this works when get_tabular_learner is invoked without an emb_szs dict, though the example code in the docs passes one.

jeremy · October 25, 2018, 6:09pm

The trick is to only use sz_dict to pass sizes where you don’t want to use the default.
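
A minimal sketch of that pattern, assuming a plain dict of category counts in place of a DataFrame. emb_sizes is a hypothetical helper (not a fastai function), and the column names and counts are invented for illustration; only the dict.get fallback and the size formula come from the snippet quoted above.

```python
def emb_sizes(cardinalities, sz_dict=None):
    # cardinalities: {column name: number of distinct categories} (illustrative input).
    # sz_dict: optional overrides, listing only the columns that should NOT use the default.
    sz_dict = sz_dict or {}
    sizes = {}
    for name, cardinality in cardinalities.items():
        n_cat = cardinality + 1  # extra category for NA, as in the quoted snippet
        default = min(50, (n_cat // 2) + 1)  # rule-of-thumb size
        sizes[name] = (n_cat, sz_dict.get(name, default))
    return sizes


# Made-up category counts for three columns.
cards = {"workclass": 9, "education": 16, "native-country": 41}

# Override a single column; the other two keep the default size.
sizes = emb_sizes(cards, sz_dict={"native-country": 10})
print(sizes["workclass"])       # (10, 6)
print(sizes["education"])       # (17, 9)
print(sizes["native-country"])  # (42, 10)
```

Without the override, "native-country" would get min(50, 42 // 2 + 1) = 22 in this sketch, so sz_dict only needs entries for the columns whose size should differ from the default.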

crcrpar · October 25, 2018, 11:13pm

So how about editing the docs?
They use sz_dict, which is confusing to me.

jeremy · October 26, 2018, 4:45am

I’m sure it wasn’t intended, but this comes across as rather unpleasant. We’re working very hard to help you, and receive no financial gains in return. So we do appreciate some gratitude, or at least patience and understanding.

The docs are correct. They are using sz_dict in the way I described - to override the defaults.

crcrpar · October 27, 2018, 12:34am

OK, thank you!