Couple errors with TabularPandas, procs = Normalize

Hi everyone,

I’m currently working through the Practical Deep Learning for Coders course and recently completed Lesson 5. I wanted to try what I learned on a different project, so I downloaded the House Prices - Advanced Regression Techniques dataset from Kaggle.

My goal is to use the tabular learner to compare its predictions with models like polynomial regression and random forests. Initially, I thought this would be straightforward, but I’ve encountered two errors that I’m struggling to resolve.

Based on my search of the forums, it seems that removing Normalize from the procs list passed to TabularPandas might resolve the errors. However, without normalization, the model’s predictions become far less accurate.
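For context on why dropping Normalize hurts: as far as I understand it, Normalize rescales each continuous column to zero mean and unit standard deviation. The sketch below reproduces that idea with the standard library only. The helper name `standardize` and all the numbers are made up for illustration and are not taken from the notebook.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; fastai's exact estimator may differ
    return [(v - mu) / sigma for v in values]

# Made-up values on very different scales, similar in spirit to LotArea and YrSold
lot_area = [8450.0, 9600.0, 11250.0, 14260.0]
yr_sold = [2006.0, 2007.0, 2008.0, 2010.0]

lot_area_z = standardize(lot_area)
yr_sold_z = standardize(yr_sold)

# After standardizing, both columns live on a comparable scale
print([round(z, 2) for z in lot_area_z])
print([round(z, 2) for z in yr_sold_z])
```

Without this step, a column measured in thousands can swamp one measured in single digits, which is one plausible reason the predictions get worse when Normalize is removed.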

I’m unsure if I’m approaching this project incorrectly or if there’s a subtle detail causing the errors. Here’s a summary of the issues:

1st error.

2nd error

I’ve also shared my notebook for further insights:

I’d greatly appreciate any suggestions or guidance you can offer to help me overcome these challenges and effectively compare the learner with other models.

Thank you!

1st Error

I’m taking a look at your Colab notebook, and I think the issue is with some of the pre-processing you have done. I ran the following code and was able to get a successful output from show_batch(). Note that I added some things like dep_var and cont_cat_split to make it similar to the process I’ve seen in the textbook:

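import pandas as pd
import fastai

from fastai.tabular.all import *

df_train = pd.read_csv("/content/train.csv")

# Most frequent value of each column (not used further in this snippet)
df_train_modas = df_train.mode().iloc[0]

dep_var = 'SalePrice'

# Split the columns into continuous and categorical, leaving out the target
cont, cat = cont_cat_split(df_train, 1, dep_var=dep_var)

splits = RandomSplitter()(df_train)

# The DataFrame is the first argument to TabularPandas;
# .dataloaders() turns the result into the DataLoaders that show_batch() needs
dls = TabularPandas(
    df_train,
    splits = splits,
    cat_names = cat,
    cont_names = cont,
    procs=[Categorify, Normalize],
    y_names = 'SalePrice', y_block=RegressionBlock()
).dataloaders()

dls.show_batch()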

Here are the lines of your code that I didn’t run, which is likely where your error needs to be resolved:


no_pools = df_train[df_train.PoolArea == 0].index.to_list()
for i in no_pools:
  df_train.iloc[i,72] = "NA"

cont_names = ['LotFrontage','LotArea','YearBuilt','YearRemodAdd','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','GarageYrBlt','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','MoSold','YrSold','SalePrice']

cat_names = df_train.drop(columns=cont_names).columns.to_list()

vars = cont_names + cat_names
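
Two things stand out in those lines, offered as a guess rather than a confirmed diagnosis: the row-by-row loop can be replaced with one boolean-mask assignment, and SalePrice sits in cont_names even though it is also the target. Here is a minimal sketch on a made-up four-row frame, assuming column 72 in your DataFrame is PoolQC (I have not verified that against the notebook):

```python
import pandas as pd

# Toy stand-in for df_train; the values are invented for illustration
df = pd.DataFrame({
    "PoolArea": [0, 512, 0, 0],
    "PoolQC": [None, "Ex", None, None],
    "LotArea": [8450, 9600, 11250, 9550],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# One vectorized assignment instead of looping over index labels with iloc
df.loc[df["PoolArea"] == 0, "PoolQC"] = "NA"

dep_var = "SalePrice"
cont_names = ["PoolArea", "LotArea", "SalePrice"]

# Keep the target out of both feature lists so it is only used as y
cont_names = [c for c in cont_names if c != dep_var]
cat_names = [c for c in df.columns if c not in cont_names and c != dep_var]

print(df["PoolQC"].tolist())
print(cont_names, cat_names)
```

The original loop also feeds index labels into iloc, which expects positions. That only works while the DataFrame still has its default 0..n-1 index, so the label-based loc version is a bit safer.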

2nd Error

I think the problem with your get_preds call was that you were giving it a DataLoaders (plural) object instead of a single DataLoader. I was able to get predictions on dls.valid:
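
In code, the difference looks roughly like this. The helper name `validation_preds` is mine, and `learn` is assumed to be an already-trained fastai Learner such as the one returned by tabular_learner. The block deliberately does not import fastai, so it only shows the call pattern.

```python
def validation_preds(learn):
    """Return (predictions, targets) for the validation DataLoader of a fastai Learner."""
    # learn.dls is a DataLoaders (plural): a container holding .train and .valid.
    # get_preds expects a single DataLoader via dl=, so pick one of them.
    return learn.get_preds(dl=learn.dls.valid)

# Usage with a trained learner (not run here):
# preds, targs = validation_preds(learn)
```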

I’ll note that the training doesn’t look good (the MSE is huge), and I’m not sure what is causing that.
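
One thing worth keeping in mind, offered as a possibility rather than a diagnosis: MSE on raw sale prices is measured in squared dollars, so it looks enormous even when the typical error is moderate. Taking the square root brings it back to dollars. The error values below are invented purely to show the scale:

```python
import math

# Hypothetical per-house prediction errors in dollars (invented for illustration)
errors = [15000.0, -30000.0, 22000.0, -18000.0]

mse = sum(e ** 2 for e in errors) / len(errors)   # squared dollars
rmse = math.sqrt(mse)                             # back in dollars
mae = sum(abs(e) for e in errors) / len(errors)   # dollars

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"MAE:  {mae:,.0f}")
```

Errors in the tens of thousands of dollars already give an MSE in the hundreds of millions, so RMSE or MAE is easier to judge against the sale prices.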

Hey, thanks for checking out my notebook! I think I found where I messed up the pre-processing. Anyways, I didn’t know about the cont_cat_split method, so thanks for showing it.

I noticed you used 1 as max_card, while I used 26, and it worked well for me. Any specific reason for choosing 1?

Also, I switched from MSE to MAE as the loss function and ended up with an MAE of 24000. With the average sale price in the train dataframe at $180921, a $24000 error is about 13%, which seems like a lot! Surprisingly, this model scored better than my random forest one on Kaggle. Instead of continuing to experiment with a neural network, I’m considering using a GBM to reduce the error.
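
To put the error in relative terms, here is the arithmetic using the two figures above (the variable names are just for illustration):

```python
mae = 24_000          # mean absolute error in dollars, from the run above
mean_price = 180_921  # average SalePrice in the training DataFrame

relative_error = mae / mean_price
print(f"{relative_error:.1%}")  # prints 13.3%
```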

Once again, thanks for helping me with the dataloader problem; I was really stuck.

Awesome to hear about your progress with this problem! I chose max_card=1 for no real reason; it’s just what was used in the textbook example I was looking at, and I wanted to keep it the same.
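
For what it’s worth, here is my understanding of what max_card changes: float columns are always treated as continuous, integer columns are continuous only when they have more than max_card distinct values, and everything else is categorical. The function below is my own simplified re-implementation of that rule for illustration (`split_by_cardinality` is not a fastai function), run on a made-up frame:

```python
import pandas as pd

def split_by_cardinality(df, max_card, dep_var):
    """Simplified sketch of the continuous/categorical rule described above."""
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the target is neither a continuous nor a categorical feature
        is_int = pd.api.types.is_integer_dtype(df[col])
        is_float = pd.api.types.is_float_dtype(df[col])
        if is_float or (is_int and df[col].nunique() > max_card):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

# Invented rows; MoSold is an integer column with only a few distinct values
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0, 60.0],
    "MoSold": [2, 5, 9, 2],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

print(split_by_cardinality(toy, max_card=1, dep_var="SalePrice"))
print(split_by_cardinality(toy, max_card=26, dep_var="SalePrice"))
```

With max_card=1 nearly every integer column ends up continuous, while 26 moves low-cardinality integers such as month numbers into the categorical list, where the tabular learner gives them embeddings.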