Lesson 4: Is there a non-teaching reason not to have default embedding sizes?

My takeaway from the structured data part of lesson 4 is basically this: I now know what embeddings are, and I'm wondering why I'd ever stray from the code in the notebook:

# one (variable, cardinality) pair per categorical column; the +1 leaves a slot for missing values
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
# embedding width: roughly half the cardinality, capped at 50
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

i.e. roughly half the cardinality, but no bigger than 50 - the heuristic discussed in the video.
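To make that concrete, here's what the heuristic gives for two of the Rossmann columns (a quick sketch - the cardinalities are my own illustrative figures, not taken from the notebook output):

for name, n_cat in [("DayOfWeek", 7), ("Store", 1115)]:
    c = n_cat + 1                          # +1 slot for missing/unknown values
    print(name, c, min(50, (c + 1) // 2))
# DayOfWeek -> cardinality 8,    embedding size 4
# Store     -> cardinality 1116, embedding size capped at 50

So small categories get small embeddings, and anything huge gets clamped to 50 dimensions.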

Obviously generating these yourself and passing them as the first parameter makes for a good lesson in what embeddings are, how to use them and what shape they take - but is there a reason we'd stray from that heuristic often enough that it wouldn't make sense as a default?

After all, we’ve given the library our data frame and told it which variables are categorical. Why not just let the library generate our embedding sizes for us?

I’ll post this as a Github issue too, with a view toward would this make sense as a default, but I thought it’d be good here in case I’m missing something and there are different embeddings we’d sometimes want to use. :slight_smile: Related GH issue: https://github.com/fastai/fastai/issues/141
