Why is my simple fastai model not working on the Kaggle house price dataset?

I built a really simple model using fastai just to show another person how easy it is to build models with fastai, but it was embarrassing to realize that the model didn't work on the house price dataset I tested. I don't have a test set, but validation accuracy stays below 10%. I then tested the same code on the Titanic dataset and it gave around 80% accuracy without any modifications. My personal suspicion is that the NaN values in the house price dataset are causing problems, but I'm not sure, so could someone review the code below and check whether I forgot some important part? And if those NaNs are causing errors, is there an easy proc I can use, or do I need to write my own code?

from fastai.tabular import *
import pandas as pd

target_column = 'SalePrice'

df = pd.read_csv('/path/train.csv')

df = df.sample(frac=1).reset_index(drop=True)
valid_idx = range(0,250)#range(0,min(10000,max(int(len(df)*0.05),64)))
_,cat_names = cont_cat_split(df,dep_var=target_column)

procs = [FillMissing, Categorify, Normalize]

data = TabularDataBunch.from_df('.', df, target_column, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

learn = tabular_learner(data, layers=[200,200,100], metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)

To add to the above post, I tested learn.get_preds() by first looking at the predictions and then at the real labels. The predictions were something like 3.0607e-02, 1.9987e-03, 1.4917e-03, etc., but the labels are 238, 128, 383, 199, 159, etc.
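For reference, a minimal sketch of how those can be pulled out for comparison (assuming fastai v1's get_preds on the validation set, which the star import above already makes available):

preds, targets = learn.get_preds(ds_type=DatasetType.Valid)  # model outputs and true labels for the validation set
print(preds[:5])    # tiny decimals like 3.0607e-02
print(targets[:5])  # values like 238, 128, 383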

That's probably the float32 representation. Try print(nparray.view(np.int32))

Yeah, I understand that those are decimals, but they are off by a factor of hundreds. That's the main problem. Once you see the original post that I first deleted, you'll understand it better. Hope the moderator opens it soon.

My best suggestion is to look at my kernel. There are 2-3 variables I found that don't really work well on the housing dataset (maybe a few more): https://www.kaggle.com/muellerzr/regression-using-the-fastai-library

My public score is 0.19

Hope that can help answer some questions :slight_smile:

That's really cool! Thanks for sharing. I just don't understand: if there are two variables that aren't good for price prediction, shouldn't the model realize this on its own? Still, your model is learning something while mine is more like staying the same, and I just can't find what you are doing differently.

It will, by not learning :slight_smile: I ran lr_find() many times to break down which ones made my model not train (my LR plot was BLANK). To the rest, yes. You can run something like permutation importance to see what's really being utilized fully and what's not. Usually you want to keep everything in there (so long as nothing is detrimental); as Jeremy states, the curse of dimensionality isn't really a thing.
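The lr_find check being described is just the standard two fastai v1 calls:

learn.lr_find()        # learning-rate range test
learn.recorder.plot()  # a blank/flat plot here is the sign that the model isn't training at all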

Also, are you undoing the log of the predictions? Note how I do my predictions. We use RMSE, so we need to undo that.
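Roughly, undoing the log looks like this (a sketch assuming the model was trained on log targets, here shown on the validation set):

import numpy as np

preds, _ = learn.get_preds(ds_type=DatasetType.Valid)   # predictions come out in log space when log=True was used
dollar_preds = np.exp(preds.numpy()).flatten()          # exponentiate to get back to the original price scale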

I don't care so much about testing my results on this competition, just about seeing that the model gives some sensible predictions, so I'm not using RMSE.

So I have never heard this before, but can some bad variables cause a model to not learn at all? I don't understand how that happens; why can't the model just notice that these variables aren't useful? Is this common? And the final question: did you figure out why these variables aren't useful? Is the information just not related to the target, or are there many NaN values?

Well, 'dirty data' is what I found, or some other extraneous reason, hence why I don't use all the data they give us. I found those columns would break my Learner; that's how I'd notice an issue.

To the rest, the model will explain why something is or isn't useful. E.g. permutation importance will tell you directly which variable had the most impact. For a breakdown of where the misses were via variable distribution, the ClassConfusion widget will do this (it also works in Colab, see my repo if you're using that).

Basically, from permutation importance you could then go further and see if there were a lot of NaNs, etc., if you wanted.
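For illustration, a hand-rolled sketch of permutation importance (not a fastai built-in; score is a hypothetical function that returns the validation error for a given DataFrame):

import numpy as np

def permutation_importance(valid_df, columns, score):
    base_error = score(valid_df)                 # error with the data left untouched
    importance = {}
    for col in columns:
        shuffled = valid_df.copy()
        shuffled[col] = np.random.permutation(shuffled[col].values)  # break this column's link to the target
        importance[col] = score(shuffled) - base_error               # how much worse the error gets without it
    return importance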


I just noticed that there are 597 outputs in the model. It seems like fastai thinks that SalePrice is categorical data. How is this decision being made, and how can I avoid it? Is it something to do with SalePrice being int64 instead of float?

Sure :slight_smile: go look at the Rossmann notebook. You can pass in label_cls=FloatList when you label.

Oh… thanks, I might have already asked this previously. If you have the answer right away, how does it work here: TabularDataBunch.from_df('.', df, target_column, valid_idx=valid_idx, procs=procs, cat_names=cat_names, label_cls=FloatList)
It's saying that there is no argument like that.

And also, isn't there any automatic way to recognize this? I thought this might be great example code I could reuse every time I work with tabular data, and I don't want to have to remember to change that every time I switch from a categorical to a continuous target.

No there isn't :slight_smile: I use the data block API for my work and just build from the TabularList like it's shown in Rossmann: (TabularList.from_x).split_by_x.label_from_x.databunch()

Try setting it up the way the Rossmann notebook does it, copy/paste.
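For reference, a minimal sketch of that setup on the house price data (assuming fastai v1's data block API; root_mean_squared_error comes from fastai.metrics, swap in whatever regression metric you prefer):

cont_names, cat_names = cont_cat_split(df, dep_var=target_column)
data = (TabularList.from_df(df, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=target_column, label_cls=FloatList, log=True)  # continuous, log-transformed target
        .databunch())
learn = tabular_learner(data, layers=[200,200,100], metrics=root_mean_squared_error)
learn.fit_one_cycle(5, 1e-2)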

:+1: Do you think this is something that could be built into the library? In theory it seems like a problem that could be solved automatically.

I don't think so; I'm trying to think how that could be done automatically, since you could very easily have a classification task where the number of labels is absurdly large. Maybe for your own tasks where you're comfortable with how it's done, though :slight_smile: (in my own research my number of classes was at one point >400)

But let's think of it this way: if the labels are strings, it's easy to just say they're categorical. Okay, so how should numerical targets be split? I think if there are more than 20 or so distinct values, then it's probably continuous. I'll probably propose this idea because it might work better than the current solution, but there should definitely still be an option to set it manually in case someone has more than 20 numerical values that are all labels.
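A hypothetical sketch of that heuristic (guess_label_cls is made up here, not part of fastai; the 20 cutoff mirrors the kind of cardinality threshold cont_cat_split applies to feature columns):

import pandas as pd

def guess_label_cls(df, dep_var, max_card=20):
    "Hypothetical helper: pick a label class from the target column itself."
    col = df[dep_var]
    if pd.api.types.is_numeric_dtype(col) and col.nunique() > max_card:
        return FloatList      # numeric with many distinct values -> treat as regression
    return CategoryList       # everything else -> treat as classification

# e.g. .label_from_df(cols=target_column, label_cls=guess_label_cls(df, target_column))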

Why does adding log=True to label_from_df (using the data block API) give so much better results? First I tried without it and the predictions were off by a lot, but then I used it and the model seemed to work pretty well. Isn't it just changing the targets to their log instead of the original values?

Just realized that I'm using mean absolute error, so is this log thing only making the model "better" because the variation of the targets is smaller, which is why the model's predictions look pretty close, and if I used squared error the result would be the same?

Actually, it's weird that when I'm not using log=True, the predictions are somewhere in the range of -20 to 20 or so, although the real values are in the thousands. If I use log, the prediction range is much closer to the real values.

I had a similar issue, but I couldn't find anything in the fastai documentation about passing log=True as an argument to label_from_df.

What's the theory behind this? I found a reference in the data block source code:

class FloatList(ItemList):
    "`ItemList` suitable for storing the floats in items for regression. Will add a `log` if this flag is `True`."
    def __init__(self, items:Iterator, log:bool=False, classes:Collection=None, **kwargs):
        super().__init__(np.array(items, dtype=np.float32), **kwargs)
        self.log = log
        self.copy_new.append('log')
        self.c = self.items.shape[1] if len(self.items.shape) > 1 else 1
        self.loss_func = MSELossFlat()

I think it might be because, if you don't use log, there's a chance the predictions over- or underflow, and that might harm the training. So maybe it's better for the model to keep the predictions in a smaller range while training.
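A quick numeric illustration of that scale difference (made-up prices, just to show the effect):

import numpy as np

prices = np.array([52000., 180000., 755000.])
print(np.log(prices))           # roughly [10.86, 12.10, 13.53] -- a small, tight range for the network to output
print(np.exp(np.log(prices)))   # back to [52000., 180000., 755000.] -- exp undoes the log at prediction time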