Experience with porting the Lesson 3 notebook to the Kaggle Housing challenge

I haven’t found much discussion of this on the forum, so I’m wondering what people’s experiences are. I got an RMSE of ~0.11 using a “minimalistic” conversion, but I have a few more general, higher-level questions.

Since the dataset only gives us years for built, remodeled, and garage, I didn’t do any feature extraction along the lines of “CompetitionOpenSince” (though see the sketch below for what an analogue might look like). There are also no recurring events, so I didn’t bother with any “Duration”-style feature extraction either.
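(If one did want a “CompetitionOpenSince”-style feature, age at time of sale seems like the obvious analogue. A minimal pandas sketch, using the actual column names from the competition CSV; whether this helps is an open question:)

```python
import pandas as pd

df = pd.read_csv('train.csv')

# Age-at-sale features, loosely analogous to Rossmann's "CompetitionOpenSince".
# YrSold, YearBuilt, YearRemodAdd, and GarageYrBlt all exist in the data.
df['HouseAge']  = df['YrSold'] - df['YearBuilt']
df['RemodAge']  = df['YrSold'] - df['YearRemodAdd']
df['GarageAge'] = df['YrSold'] - df['GarageYrBlt']  # stays nan where there is no garage
```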

Data cleaning for missing values seemed straightforward: in most columns NA/nan means “not there” (e.g. “doesn’t have a pool”), so replacing nan with the string “None” should be fine. The exception is “Electrical”, where I used “Unknown” instead. I think the few numeric columns with missing values could also just be set to 0.
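(For what it’s worth, here is roughly what that cleaning step looks like in pandas; the column lists are just the ones I mean above, not exhaustive:)

```python
import pandas as pd

df = pd.read_csv('train.csv')

# NA in these columns means "feature not present", per data_description.txt
not_there = ['Alley', 'BsmtQual', 'BsmtCond', 'FireplaceQu', 'GarageType',
             'GarageFinish', 'PoolQC', 'Fence', 'MiscFeature']
df[not_there] = df[not_there].fillna('None')

# The single missing Electrical value is genuinely unknown, not "absent"
df['Electrical'] = df['Electrical'].fillna('Unknown')

# Missing numeric values: no street frontage / no masonry veneer -> 0
df['LotFrontage'] = df['LotFrontage'].fillna(0)
df['MasVnrArea']  = df['MasVnrArea'].fillna(0)
```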

The division of variables into categorical and continuous ones seemed straightforward, too: everything measured in feet or sqft, dollars, or years is continuous; the rest is categorical. (There is inherent order in all those condition-style categories like good, average, poor, but I trust the NN would pick up on that. Sketch below.)
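(A sketch of how I’d express that split. One gotcha: MSSubClass is numeric in the CSV but is really a category code. The last two lines show how one could make the ordering of the quality ratings explicit, if you don’t want to rely on the NN finding it:)

```python
import pandas as pd

df = pd.read_csv('train.csv')
dep_var = 'SalePrice'

# Numeric dtype -> continuous, object dtype -> categorical
cont_vars = [c for c in df.columns
             if df[c].dtype != 'object' and c not in (dep_var, 'Id')]
cat_vars = [c for c in df.columns if df[c].dtype == 'object']

# MSSubClass is an integer code for the dwelling type, not a quantity
cont_vars.remove('MSSubClass')
cat_vars.append('MSSubClass')

# Optional: encode the inherent order of the quality ratings explicitly
qual_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
df['ExterQual'] = pd.Categorical(df['ExterQual'], categories=qual_order, ordered=True)
```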


For the NN, I played around with different layer sizes and dropouts, and this is where I’m running into trouble: I have a hard time building an intuition, from the results I’m getting, for which knob to tweak (sketch of the knobs below). If I’m overfitting, is dropout the thing to increase? What should the batch size be?
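(For concreteness, here is roughly where those knobs live in a fastai-v1-style tabular pipeline. `df`, `cat_vars`, and `cont_vars` are assumed from the setup above, `valid_idx` is an arbitrary split, and the specific values are just a starting point, not a recommendation:)

```python
from fastai.tabular import *

valid_idx = range(1100, 1460)  # hold out the last ~360 rows; arbitrary choice

procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols='SalePrice', label_cls=FloatList, log=True)  # predict log price
        .databunch(bs=64))                       # <- batch size knob

learn = tabular_learner(data,
                        layers=[200, 100],       # <- hidden layer sizes
                        ps=[0.1, 0.2],           # <- dropout per hidden layer
                        emb_drop=0.04,           # <- dropout on the embeddings
                        metrics=root_mean_squared_error)
learn.fit_one_cycle(10, 1e-2)
```

(Since `log=True` makes the target log(SalePrice), the reported RMSE should be directly comparable to the leaderboard metric.)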

What I often observe is that both training and validation loss decrease quite a bit, but the RMSE metric stays roughly the same.
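(One thing I figure is worth checking when the metric won’t budge: compute it by hand on the validation set and make sure the metric is measuring on the same scale as the loss. A small sketch, continuing from the learner above:)

```python
import torch

# Validation predictions and targets (both on the log scale here)
preds, targs = learn.get_preds(ds_type=DatasetType.Valid)
manual_rmse = torch.sqrt(((preds.view(-1) - targs.view(-1)) ** 2).mean())
print(manual_rmse)  # should match the metric printed during training
```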

Is the small size of the dataset a problem? Rossmann had close to a million rows; the housing challenge only has ~1,500.

And in general… how would one go about telling where time is best spent? On more/better feature engineering (e.g. finding and removing outliers, creating a column for the total number of bathrooms, etc.), or on better hyperparameter tuning? (Sketch of the kind of thing I mean below.)
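(In case it helps the discussion, examples of the feature engineering I have in mind, using real columns from the dataset; the outlier cutoffs are a common heuristic for Ames, not something I’ve validated:)

```python
import pandas as pd

df = pd.read_csv('train.csv')

# Collapse the four bathroom counts into a single number
df['TotalBath'] = (df['FullBath'] + 0.5 * df['HalfBath']
                   + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])

# Total above-ground + basement area
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# Crude outlier filter: a handful of huge but cheap houses distort the fit
df = df[~((df['GrLivArea'] > 4000) & (df['SalePrice'] < 300000))]
```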