In the tabular chapter, the cont_cat_split function is called with different max_card values for the DT/RF model and the NN model, as follows:
#DT/RF
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
#NN
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
The chapter does say that categorical columns are treated differently for the NN because it needs to create embeddings for them. It also indicates that embedding sizes greater than 10k should not be used, and hence 9k is used as the max cardinality.
What I am having trouble understanding is how a feature/column can be classified as continuous or categorical based on a limit that is really about how large an embedding should be.
Also, a max_card of 1 for the random forest seems too low in my opinion. Wouldn't any categorical column have more than one unique value?
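
To make the question concrete, here is a minimal sketch of the rule as I currently understand it. This is my assumption, not the confirmed fastai implementation: float columns are always continuous, integer columns are continuous only when they have more than max_card unique values, and everything else is categorical. The function name split_by_cardinality and the toy DataFrame are made up for illustration.

```python
import pandas as pd

def split_by_cardinality(df, max_card, dep_var=None):
    """Hypothetical stand-in for cont_cat_split (assumed rule, see above)."""
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the dependent variable is in neither list
        is_float = pd.api.types.is_float_dtype(df[col])
        is_int = pd.api.types.is_integer_dtype(df[col])
        # Assumed rule: floats are continuous; integers are continuous
        # only if they have more than max_card distinct values.
        if is_float or (is_int and df[col].nunique() > max_card):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

# Toy data, invented for this example
toy = pd.DataFrame({
    "price": [1.5, 2.0, 3.25, 4.0],        # float
    "year": [2001, 2002, 2003, 2004],      # int, 4 unique values
    "flag": [0, 1, 0, 1],                  # int, 2 unique values
    "state": ["a", "b", "a", "c"],         # string
    "target": [10.0, 11.0, 12.0, 13.0],
})

# Low threshold (like the DT/RF call): every integer column is continuous
cont_rf, cat_rf = split_by_cardinality(toy, max_card=1, dep_var="target")

# Higher threshold (like the NN call): low-cardinality integers are categorical
cont_nn, cat_nn = split_by_cardinality(toy, max_card=3, dep_var="target")
```

If this reading is right, max_card only affects integer columns, so a value of 1 would not turn string columns such as state into continuous ones. It would just mean that no integer column is treated as categorical for the random forest.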