Reset TabularPandas split changes loss function

I am working on a TabularPandas problem, and I am defining my data as follows:

    data = TabularPandas(
        df,
        [Categorify, FillMissing],
        categorical_variables,
        continuous_variables,
        y_names=dependent_variable,
        splits=split_indices,
    )

This gives me a CrossEntropy loss function when I instantiate a tabular_learner, and everything works great.
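
For reference, this is roughly how I build the learner (a sketch; the batch size and metrics are placeholders):

    dls = data.dataloaders(bs=64)
    learn = tabular_learner(dls, metrics=accuracy)
    print(learn.loss_func)  # FlattenedLoss of CrossEntropyLoss()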

At a later point I reset the data split (I am building an ensemble). I don't know if there is an easy way to do this, but what I am trying is to first stitch my data back into a single DataFrame:

    data_df = pd.concat([data.train.xs, data.valid.xs])
    data_df[data.y_names[0]] = pd.concat([data.train.y, data.valid.y])

and then to re-declare a TabularPandas object, exactly like before:

    data = TabularPandas(
        data_df,
        [Categorify, FillMissing],
        data.cat_names,
        data.cont_names,
        y_names=data.y_names[0],
        splits=splits,
    )

where I pass new lists of indices for splits.

The DataFrames seem correct (data.train.xs.head() gives the same result before and after if I reuse the original split indices). But now my loss is MSELoss, which is clearly not what I need. I could fix the loss manually, but I need to understand why it changes, because it may indicate other problems I am overlooking.
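
For reference, a manual override would presumably look like this sketch (using fastai's CrossEntropyLossFlat and the n_out argument; note that the targets produced by a regression setup would still be floats, so this alone may not be enough):

    # hypothetical manual fix: force two outputs and a classification loss
    learn = tabular_learner(dls, n_out=2, loss_func=CrossEntropyLossFlat())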

I double-checked that data.train.y and data.valid.y both have dtype int8.

Big thanks!

Hi Pablo,

I think the easiest way is to tell TabularPandas explicitly what type of problem you are targeting by setting

    y_block=CategoryBlock

in both TabularPandas calls; this should pick up the correct loss function automatically.
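
For example, your first call would become something like this (same names as in your snippet; note that I instantiate the block, CategoryBlock(), as the fastai docs do):

    data = TabularPandas(
        df,
        [Categorify, FillMissing],
        categorical_variables,
        continuous_variables,
        y_names=dependent_variable,
        y_block=CategoryBlock(),
        splits=split_indices,
    )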

As for why it switches from CrossEntropy to MSE: I don't know off the top of my head and would most likely have to look at your data. It could be that the transformations applied by Categorify and FillMissing are already enough to change the inferred loss function.


Hi @zonkyo! Thanks for your reply. I've finally figured out the problem, but first let me say that your fix is indeed the best approach :smiley:

The problem was that my original data was not what I thought it was! My labels column originally held "yes"/"no" values, which fastai correctly understood as a classification problem and automatically translated into 1/0 format. But the second time around my data was already in this 1/0 format, and fastai had more trouble automatically determining the type of problem.
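
If I read the fastai source correctly (this is my paraphrase of the heuristic, not the literal code), when y_block is not given the target block is inferred from the column dtype: a numeric column becomes a regression target, anything else a classification target. Something along these lines:

    import pandas as pd

    # after fastai's first pass my labels were int8, i.e. numeric
    df = pd.DataFrame({"label": pd.Series([1, 0, 1], dtype="int8")})
    ys = df[["label"]]
    is_numeric = len(ys.select_dtypes(include="number").columns) == len(ys.columns)
    print("regression (MSE)" if is_numeric else "classification (CrossEntropy)")
    # -> regression (MSE), which explains the MSELoss I was seeing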

There is still something else going on behind the scenes, because the 1/0 data works with the CrossEntropy loss only if we say we have a categorical problem; otherwise it protests about the format of the targets (it says it can't take "char", which is apparently int8). I suppose there is some transform or callback adapting the targets for classification problems, which is only active when we use y_block=CategoryBlock.
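
For the record, "char" is just PyTorch's name for the int8 tensor type, which is easy to verify:

    import torch

    t = torch.tensor([0, 1], dtype=torch.int8)
    print(t.type())         # torch.CharTensor -- so int8 targets show up as "char"
    print(t.long().type())  # torch.LongTensor -- the dtype CrossEntropyLoss expects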