Define cont_cat_split parameters?

I'm trying to understand the parameters in the following code from the 09_tabular notebook.

cont,cat = cont_cat_split(df, 1, dep_var=dep_var)

What does the “1” mean?

According to the docs, the second positional argument of cont_cat_split is max_card, so the 1 here is the value being passed in as max_card.

I’m still trying to understand max_card.

The documentation states:

cont_cat_split [source]

cont_cat_split(df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function works by determining if a column is continuous or categorical based on the cardinality of its values. If it is above the max_card parameter (or a float datatype) then it will be added to the cont_names else cat_names. An example is below:

Would max_card be the number of cat variables? In the documentation it defaults to 20. How would you define 20 vs. 1?

Sorry, I'm not really sure. It only comes up once in the fastbook and once in the docs, in the same place I linked to in my last reply.

What is the use of dep_var?
def cont_cat_split(df, max_card=20, dep_var=None):
    "Helper function that returns column names of cont and cat variables from given df."
    cont_names, cat_names = [], []
    for label in df:
        if label in L(dep_var): continue
        if ((pd.api.types.is_integer_dtype(df[label].dtype) and
            df[label].unique().shape[0] > max_card) or
            pd.api.types.is_float_dtype(df[label].dtype)):
            cont_names.append(label)
        else: cat_names.append(label)
    return cont_names, cat_names

Hey team! Thanks for this thread. I had the exact same question as @bbrown & @adithya_lbhat about the max_card & dep_var parameters.

Sharing my findings & understanding of those parameters in case anyone in the future has the same question and finds my answer useful!
(1) max_card: cont_cat_split decides whether a column is a categorical or a continuous variable from its dtype and, for integer columns, from the cardinality of the data, as the doc says. Float columns are always continuous, and columns that are neither integer nor float (strings, for example) are always categorical. Integer columns get one extra check: according to this post and this code line, an integer column whose cardinality is smaller than or equal to the max_card parameter is treated as a categorical variable, while one whose cardinality is above max_card is treated as continuous.

(2) dep_var: Based on this code line, it seems like dep_var is used to tell the fastai library that “hey this column is a column for dependent variables so no need to check whether it is continuous or categorical as we only need to do so for columns for independent variables”.
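To make both points concrete, here is a minimal sketch that mirrors the logic of the source quoted above using only pandas. The function name split_cont_cat, the toy DataFrame and its column names are made up for illustration; the real function uses fastai's L class to normalise dep_var, which is replaced here by a plain list.

```python
import pandas as pd

def split_cont_cat(df, max_card=20, dep_var=None):
    """Illustrative stand-in for cont_cat_split (not the fastai implementation)."""
    # Normalise dep_var to a list of column names (fastai uses L(dep_var) for this).
    if dep_var is None:
        dep_vars = []
    elif isinstance(dep_var, str):
        dep_vars = [dep_var]
    else:
        dep_vars = list(dep_var)

    cont_names, cat_names = [], []
    for label in df.columns:
        if label in dep_vars:
            continue  # dependent variables are skipped entirely
        col = df[label]
        is_int = pd.api.types.is_integer_dtype(col.dtype)
        is_float = pd.api.types.is_float_dtype(col.dtype)
        # Integers are continuous only when their cardinality exceeds max_card;
        # floats are always continuous; everything else is categorical.
        if (is_int and col.unique().shape[0] > max_card) or is_float:
            cont_names.append(label)
        else:
            cat_names.append(label)
    return cont_names, cat_names

# Hypothetical toy data: two integer columns, one float, one string, one target.
df = pd.DataFrame({
    "year":   [2001, 2002, 2003, 2004],         # integer, 4 distinct values
    "doors":  [2, 4, 2, 4],                     # integer, 2 distinct values
    "weight": [1.5, 2.0, 2.5, 3.0],             # float
    "colour": ["red", "blue", "red", "green"],  # string
    "price":  [10, 20, 30, 40],                 # dependent variable
})

low = split_cont_cat(df, max_card=1, dep_var="price")
mid = split_cont_cat(df, max_card=3, dep_var="price")
high = split_cont_cat(df, max_card=20, dep_var="price")
print(low)   # every integer column has more than 1 distinct value
print(mid)   # only "year" exceeds the threshold
print(high)  # no integer column exceeds the threshold
```

With max_card=1, as in the notebook line that started this thread, practically every integer column ends up continuous, because any column with at least two distinct values exceeds the threshold. In all three calls "price" appears in neither list, since it was passed as dep_var.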

In that case, the function definition should be changed, as it is confusing for no reason.

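An easier way to think of max_card is as a threshold for integer columns. Let’s say max_card = 5.

If an integer feature has 5 or fewer distinct values, it is considered a categorical variable. Otherwise, it is considered a continuous variable. (Float columns are always continuous and non-numeric columns are always categorical, whatever max_card is.)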
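The boundary case can be checked with a few lines of plain Python. This is only a sketch of the threshold comparison for a single integer column; is_categorical is a hypothetical helper, not a fastai function.

```python
def is_categorical(values, max_card):
    """Threshold rule for one integer column: categorical when the
    number of distinct values does not exceed max_card."""
    return len(set(values)) <= max_card

five_levels = [1, 2, 3, 4, 5, 1, 2]  # 5 distinct values
six_levels = [1, 2, 3, 4, 5, 6]      # 6 distinct values

print(is_categorical(five_levels, max_card=5))  # exactly at the threshold
print(is_categorical(six_levels, max_card=5))   # one above the threshold
```

A column sitting exactly at max_card is still categorical, because the source only treats an integer column as continuous when its cardinality is strictly greater than max_card.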

Under the subtitle “Using a Neural Network”:
To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the max_card parameter. If it’s lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you’ve tested whether there are better ways to group the variable,
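One way to choose a sensible max_card is to look at the number of distinct levels per column first. The sketch below does that with pandas; the DataFrame, its column names and the threshold of 9 are invented for illustration and do not come from the book.

```python
import pandas as pd

# Hypothetical data: one high-cardinality id column and two low-cardinality ones.
df = pd.DataFrame({
    "machine_id": range(1000),               # 1000 distinct integers
    "state": ["A", "B", "C", "D"] * 250,     # 4 distinct strings
    "usage_band": [0, 1, 2, 1] * 250,        # 3 distinct integers
})

# Number of distinct levels in each column.
cardinality = df.nunique()
print(cardinality)

max_card = 9  # assumed threshold, chosen only for this example
# Columns whose cardinality does not exceed the threshold.
at_or_below = cardinality[cardinality <= max_card].index.tolist()
print(at_or_below)
```

Here "machine_id" would be treated as continuous with this threshold, while "usage_band" would be categorical. "state" is a string column, so cont_cat_split would treat it as categorical regardless of max_card.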