Kaggle: Home Credit Competition


(David Salazar) #1

Hi!

Right now, there’s a Kaggle competition running on predicting credit delinquency. The problem is therefore a classification task on structured data, a type of problem that never actually comes up in the Deep Learning course. You can see, then, why this is a perfect competition for getting out of our ML comfort zone.

I just created a Kaggle kernel (I had trouble with the original Kaggle kernel, so I created another one) with the bare basics for using neural networks on the problem. I had to tweak the code from the Rossmann lecture a little bit, but I finally got it running with categorical embeddings and a weighted loss function to try to account for the class imbalance.

Right now, this model is still lagging behind boosted trees, as are all the other kernels using neural networks. If you have any recommendations or questions, I’d be happy to discuss them!


(Will) #2

Seems interesting. I’ll be sure to check this out and share any ideas.


(Will) #3

Some ideas for you:

Instead of filling the missing categoricals with some value, you could try filling them with ‘missing’ and adding a boolean column to indicate whether each value was missing or not. For that matter, before filling any missing data I would add a boolean indicator column. It’s not clear to me whether you had already done this before loading the merged data from feather.
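
A minimal pandas sketch of that idea (the toy frame and column names just stand in for the merged Home Credit data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged Home Credit data.
df = pd.DataFrame({'AMT_INCOME_TOTAL': [100000, np.nan, 250000],
                   'OCCUPATION_TYPE': ['Laborers', np.nan, 'Managers']})

# Add a boolean *_was_missing flag for every column that has NAs...
for col in list(df.columns):
    if df[col].isnull().any():
        df[col + '_was_missing'] = df[col].isnull()

# ...then fill the categorical (object-dtype) columns with the literal 'missing'.
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols] = df[cat_cols].fillna('missing')
```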

I would also focus on creating wider dense layers rather than narrower but deeper ones. So instead of going 100x100x100, try going 500x250.
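
For concreteness, a rough PyTorch sketch of the two shapes (this is just the fully connected head, not the full fastai learner; the input width of 300 and the dropout values are made-up placeholders):

```python
import torch.nn as nn

n_in = 300  # placeholder: width of concatenated embeddings + continuous features

# Narrower but deeper: 100 x 100 x 100
deep_narrow = nn.Sequential(
    nn.Linear(n_in, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 2),   # two-class output for a softmax / cross-entropy loss
)

# Wider but shallower: 500 x 250, as suggested above
wide_shallow = nn.Sequential(
    nn.Linear(n_in, 500), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(500, 250), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(250, 2),
)
```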

I would also take the embeddings learned from your best neural net and throw them into your boosted tree model to see if that can boost your best performance there.
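
One hedged sketch of how that could look, assuming a fastai 0.7-style model whose embedding layers sit in learn.model.embs (that attribute, the category-code column, and the df_trees frame are all assumptions about your pipeline, not the kernel’s actual code):

```python
import numpy as np
import pandas as pd

# Pull one learned embedding matrix out of the trained net and join it onto
# the tree model's dataframe as extra numeric columns.
emb = learn.model.embs[0].weight.data.cpu().numpy()    # shape: (n_categories, emb_dim)
emb_cols = ['OCCUPATION_TYPE_emb_%d' % i for i in range(emb.shape[1])]

emb_df = pd.DataFrame(emb, columns=emb_cols)
emb_df['OCCUPATION_TYPE_code'] = np.arange(len(emb_df))

# df_trees holds the label-encoded features used by the boosted tree model.
df_trees = df_trees.merge(emb_df, on='OCCUPATION_TYPE_code', how='left')
```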

With respect to your learning rate finder, try passing parameters that zoom in on the elbow of your learning rate curve. So in your case, for your first call to lr_find(), try learn.lr_find(start_lr=1e-5, end_lr=1e-3). I have found this to be effective on structured data, where there is a very small range of effective learning rates. You may find that the LR curve itself changes when doing this. You may also want to increase your batch size based on the choppiness of the loss curve the second time you call lr_find().
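
Assuming the fastai 0.7 API from the Rossmann lesson (which the kernel’s code is adapted from), that would look roughly like this, where learn is the structured learner from the notebook:

```python
# Zoomed-in LR search, as suggested above (fastai 0.7 API).
learn.lr_find(start_lr=1e-5, end_lr=1e-3)
learn.sched.plot()   # inspect the elbow and pick an LR just before it

# If the loss curve looks choppy the second time around, rebuild the
# ColumnarModelData with a larger bs= before calling lr_find() again.
```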

Instead of adding weights to the loss, try copying the underrepresented target rows to balance the dataset. Jeremy has reported this to be very effective, and it’s worked recently in some Kaggle competitions.
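
A simple pandas version of that idea (TARGET is the Home Credit label column; the duplication factor here is just a rough balancing choice):

```python
import pandas as pd

# `df` stands in for the training dataframe with the Home Credit TARGET column.
minority = df[df['TARGET'] == 1]
majority = df[df['TARGET'] == 0]
n_copies = len(majority) // len(minority)   # rough balancing factor

# Duplicate the minority rows, then shuffle.
df_balanced = (pd.concat([majority] + [minority] * n_copies)
                 .sample(frac=1, random_state=42)
                 .reset_index(drop=True))
```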

Another thing I’m playing with when doing this is applying a VERY slight feature transformation when copying these rows over. Randomly altering numerics by <1% can work, but I don’t have any rules of thumb there and am still experimenting.
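
A rough sketch of that jitter on the copied rows (minority and num_cols are placeholders from your own oversampling step; no claim that 1% is the right amount):

```python
import numpy as np

def jitter_numeric(frame, num_cols, scale=0.01, seed=0):
    """Multiply each numeric column by a factor drawn from [1 - scale, 1 + scale]."""
    rng = np.random.RandomState(seed)
    out = frame.copy()
    noise = rng.uniform(1 - scale, 1 + scale, size=(len(out), len(num_cols)))
    out[num_cols] = out[num_cols].values * noise
    return out

# e.g. jitter only the duplicated minority rows before concatenating them back.
minority_jittered = jitter_numeric(minority, num_cols, scale=0.01)
```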

Hope these help, report back with results!


#4

I just wanted to say thank you for putting the notebook together :slight_smile: I came across it on Kaggle even before I found this thread and it was a pleasure to read!

You share some really cool ideas in the notebook and I am gonna steal a few of them from you! :wink:


(David Salazar) #5

Thanks for all your suggestions. I will heed your advice and report back next weekend!


(Kodiak Labs) #6

@whamp: with respect to altering numerics by ~1% when copying them to upsample the under-represented class, you’re moving along the lines of the SMOTE technique for upsampling.

I think this can also work with categorical variables, but I have yet to find a reliable resource.
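
For what it’s worth, the imbalanced-learn package has a SMOTE variant, SMOTENC, that is meant to handle a mix of continuous and categorical features. A hedged sketch, not something the kernel currently uses:

```python
from imblearn.over_sampling import SMOTENC

# X, y are the feature matrix and target; cat_feature_idx is a placeholder
# list of the positional indices of the categorical columns in X.
sm = SMOTENC(categorical_features=cat_feature_idx, random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)
```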


(Rahim Shamsy) #7

For this competition, I am finding the data files too big to load in pandas. There are ways suggested online - using Dask - but I’d like to know what you did. Essentially, the issue is that the files are too big, and when I run pd.read_csv() I get a MemoryError.

Thanks


(Kodiak Labs) #8

@rshamsy: could you possibly use a stratified sample of the dataset that would fit into memory, and work from there?
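
For example, something along these lines would keep the TARGET class proportions while only holding a fraction of the rows (file and column names are the competition’s; the 20% fraction and chunk size are arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the main table and keep a stratified 20% subsample.
app = pd.read_csv('application_train.csv')
app_small, _ = train_test_split(
    app, train_size=0.2, stratify=app['TARGET'], random_state=42)

# Use the sampled IDs to filter the larger auxiliary files while reading them
# in chunks, so a full file never sits in memory at once.
keep_ids = set(app_small['SK_ID_CURR'])
bureau_small = pd.concat(
    chunk[chunk['SK_ID_CURR'].isin(keep_ids)]
    for chunk in pd.read_csv('bureau.csv', chunksize=500000))
```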


(Giuseppe Merendino) #9

Thank you for your notebook @davidsalazarvergara
I think that in this competition, those who make good use of the time series data will win.

@rshamsy I started by using H2Database, loading all the CSVs into tables for data exploration. I’ll create small samples to quickly test some solutions.


(Will) #10

Interesting, I wasn’t aware of that technique. I’ll have to check it out, thank you!


(David Salazar) #11

I am using Kaggle Kernels to try this competition and there’s no reason to use Dask. A couple of suggestions:

1. You should try to reduce the RAM usage of any dataset you load, and consequently of any dataset you create. Every column in the dataframe has a particular numpy dtype, and the defaults are sometimes overkill. If you shrink them, your machine will run faster.
2. Load the different datasets sequentially: i.e., load two of them, merge them, then delete the two you just loaded and call the garbage collector. Continue doing the same with the other datasets.
3. When writing to disk, use the df.to_feather method.
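
A rough sketch of points 1-3 (the downcasting rule is deliberately simple, and the file/key names are from the competition data; adapt as needed):

```python
import gc
import numpy as np
import pandas as pd

def shrink_dtypes(df):
    """Downcast the default int64/float64 columns to save RAM."""
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)
    return df

# 1. shrink dtypes as soon as each file is loaded
app = shrink_dtypes(pd.read_csv('application_train.csv'))
bureau = shrink_dtypes(pd.read_csv('bureau.csv'))

# 2. merge, then delete the originals and collect garbage before the next file
merged = app.merge(bureau, how='left', on='SK_ID_CURR')
del app, bureau
gc.collect()

# 3. feather needs a default RangeIndex
merged.reset_index(drop=True).to_feather('merged.fth')
```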


(David Salazar) #12

I have created another kernel with your suggestions. The results did improve: from 0.759 to 0.763.

  1. For the missing values, I was already handling them the way you said.
  2. I did tinker with the network architecture and it improved things.
  3. Oversampling the underrepresented target class was the change that improved my results the most.
  4. I have not yet played with slight feature transformations when oversampling. Maybe I will try the SMOTE technique that @KodiakLabs mentioned.

Thanks for your suggestions!


(Will) #13

Glad it’s working for you! I was getting similar results of ~0.76 or so with an architecture 6 layers deep, going 800, 600, 400, 200, 100, 20 with dropout of 0.4, 0.3, 0.2, 0.2, 0.1, 0.01.

These are far from optimal, but just some things I experimented with when I had time. I’m actually getting married this weekend, so my attention has been slightly diverted!

Have you tried putting the embedding matrix created in your neural net fitting process into your XGBoost (or similar) model as additional features? That has shown very good results in the past.


#14

Congrats Will! Go divert your attention 100%.


(Sophia Wang) #15

I encountered a similar problem. After talking with my teammates, we think it might not be necessary to load everything into pandas. Pandas is still able to handle one, two, or three files separately, and we are selecting the important features within a smaller set of features. What do you think about this approach?