I'm trying to understand the parameters of the following code in the 09_tabular notebook:
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
What does the “1” mean?
According to the docs, the 1 is the value being passed in for the max_card argument.
I’m still trying to understand max_card. The documentation states:
cont_cat_split(df, max_card=20, dep_var=None)
Helper function that returns column names of cont and cat variables from given df.
This function works by determining if a column is continuous or categorical based on the cardinality of its values. If it is above the max_card parameter (or a float datatype) then it will be added to the cont_names else cat_names.
Would max_card be the number of categorical variables? In the documentation it is set to 20 … how would you define 20 vs. 1?
Sorry, I'm not really sure. It only comes up once in the fastbook and once in the docs, in the same place I linked to in my last reply.
What is the use of dep_var?
def cont_cat_split(df, max_card=20, dep_var=None):
    "Helper function that returns column names of cont and cat variables from given df."
    cont_names, cat_names = [], []
    for label in df:
        if label in L(dep_var): continue
        if ((pd.api.types.is_integer_dtype(df[label].dtype) and
            df[label].unique().shape[0] > max_card) or
            pd.api.types.is_float_dtype(df[label].dtype)):
            cont_names.append(label)
        else: cat_names.append(label)
    return cont_names, cat_names
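To see what passing 1 versus the default 20 does in practice, here is a rough pandas-only sketch of the same logic. It is not the library code: the name cont_cat_split_sketch is made up, the DataFrame is invented data, and the handling of dep_var as a string or list is my stand-in for fastcore's L(dep_var).

```python
import pandas as pd

def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    """Pandas-only approximation of the function above (illustrative only)."""
    # Assumption: normalise dep_var by hand instead of using fastcore's L(...)
    if dep_var is None:
        skip = []
    elif isinstance(dep_var, str):
        skip = [dep_var]
    else:
        skip = list(dep_var)

    cont_names, cat_names = [], []
    for label in df.columns:
        if label in skip:
            continue  # the dependent variable lands in neither list
        col = df[label]
        is_int = pd.api.types.is_integer_dtype(col.dtype)
        is_float = pd.api.types.is_float_dtype(col.dtype)
        # Floats are always continuous; integers only if they exceed max_card
        if is_float or (is_int and col.unique().shape[0] > max_card):
            cont_names.append(label)
        else:
            cat_names.append(label)
    return cont_names, cat_names

# Invented example data
df = pd.DataFrame({
    "YearMade": [1998, 2004, 2004, 2010, 1998, 2001],              # int, 4 distinct values
    "MachineHours": [10.5, 200.0, 35.2, 0.0, 81.0, 9.9],           # float
    "UsageBand": ["Low", "High", "Medium", "Low", "Low", "High"],  # strings
    "SalePrice": [9.2, 10.1, 9.8, 11.0, 9.5, 10.4],                # dependent variable
})

cont_1, cat_1 = cont_cat_split_sketch(df, 1, dep_var="SalePrice")
cont_20, cat_20 = cont_cat_split_sketch(df, 20, dep_var="SalePrice")

print(cont_1, cat_1)    # YearMade is continuous when max_card=1
print(cont_20, cat_20)  # YearMade is categorical when max_card=20
```

So with max_card=1, any integer column with more than one distinct value ends up continuous, while string columns stay categorical either way.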
Hey team! Thanks for this thread. I had exactly the same question as @bbrown and @adithya_lbhat about the max_card and dep_var parameters.
I'm sharing my findings and understanding of those parameters in case anyone has the same question in the future and finds my answer useful!
(1) max_card: In general, cont_cat_split decides whether a column is a categorical or a continuous variable based on the cardinality of its data, as the docs say. Looking at the code, though, the cardinality check only applies to integer columns. According to this post and this code line, if a column consists of integers and its cardinality is smaller than or equal to max_card, it is treated as a categorical variable; with more distinct values than max_card it is treated as continuous. Float columns are always continuous, and columns of any other dtype are always categorical.
(2) dep_var: Based on this code line, it seems like dep_var is used to tell the fastai library, “hey, this column is the dependent variable, so there is no need to check whether it is continuous or categorical, as we only need to do that for the independent variables”. The column is simply skipped and appears in neither returned list.
In that case, the function definition should be changed, since it is confusing for no reason.
An easier way to think of max_card is as a threshold. Let’s say max_card = 5.
If an integer feature has 5 or fewer distinct values, it is considered a categorical variable. Otherwise, it is considered a continuous variable. (Float features are always treated as continuous.)
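The threshold idea can be checked in a few lines of plain Python. This is only a sketch of the comparison for an integer column: is_categorical_int_column is a hypothetical helper, not something from fastai, and the two columns are invented.

```python
def is_categorical_int_column(values, max_card):
    """Return True if an integer column with these values would count as categorical.

    Hypothetical helper for illustration; it mirrors the `> max_card` test
    in the cont_cat_split source quoted earlier in the thread.
    """
    n_distinct = len(set(values))
    return n_distinct <= max_card

max_card = 5
day_of_week = [0, 1, 2, 3, 4, 0, 1]                # 5 distinct values
year_made = [1998, 2001, 2004, 2007, 2010, 2013]   # 6 distinct values

print(is_categorical_int_column(day_of_week, max_card))  # at the threshold: categorical
print(is_categorical_int_column(year_made, max_card))    # above the threshold: continuous
```

Note that a column with exactly max_card distinct values still counts as categorical, because the source uses a strict > comparison for the continuous case.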
Under the subtitle “Using a Neural Network”:
To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the max_card parameter. If it’s lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you’ve tested whether there are better ways to group the variable,