IIRC, Jeremy’s rule of thumb for embedding sizes is min(50, (cardinality of a category + 1)//2).
For example, if a categorical variable has 3 distinct values, then the corresponding embedding size should be min(50, (3 + 1) // 2) = 2. Correct so far?
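To make the arithmetic concrete, here is a minimal sketch of the rule as quoted above. The function name rule_of_thumb_emb_sz and the max_sz parameter are made up for illustration; this is not the library's code, just the formula from this post.

```python
def rule_of_thumb_emb_sz(cardinality: int, max_sz: int = 50) -> int:
    """Embedding width per the rule quoted in this post:
    min(50, (cardinality + 1) // 2). Hypothetical helper, not library code."""
    return min(max_sz, (cardinality + 1) // 2)


# A few cardinalities, including the 3-value example from the post.
for card in (3, 10, 1000):
    print(card, rule_of_thumb_emb_sz(card))
```

With 3 distinct values this gives 2, and large cardinalities are capped at 50.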
However, the current implementation of def_emb_szs will return the cardinality of a category, because dict.get returns its second argument only when the key is missing from the dict that was passed in.
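For reference, this is how dict.get treats its second argument in plain Python. The column names and sizes below are invented for the example, and nothing here is taken from the library source.

```python
# Hypothetical per-column overrides: only "workclass" has an explicit size.
sz_dict = {"workclass": 16}
default_sz = 2  # stands in for whatever the rule of thumb would compute

# Key present: the stored value wins and the default is ignored.
present = sz_dict.get("workclass", default_sz)

# Key missing: the second argument is returned.
missing = sz_dict.get("education", default_sz)

# Empty dict: every lookup falls back to the second argument.
empty = {}.get("workclass", default_sz)

print(present, missing, empty)
```

So the second argument only matters for keys that are not in the dict; any key that is present returns whatever value was stored for it.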
My solution
IMO, sz_dict is not necessary, so I'd like to reduce the arguments from (df, n, sz_dict) to (df, n).
Additional
I know this works when get_tabular_learner is invoked without an emb_szs dict, though the example code in the docs passes an emb_szs dict.
I’m sure it wasn’t intended, but this comes across as rather unpleasant. We’re working very hard to help you, and receive no financial gains in return. So we do appreciate some gratitude, or at least patience and understanding.
The docs are correct. They are using sz_dict in the way I described - to override the defaults.