Home credit default risk challenge

(Vineeth Kanaparthi) #1

I need help handling multiple tables of data. What strategies should I use to combine them? @jeremy

(Sanyam Bhutani) #2

You should check the Kaggle discussions for the competition. They might be a better place to ask this.

Loading everything into pandas pushes RAM to full usage. People have found that merging the tables with SQL is less memory-intensive.
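To illustrate the SQL route, here is a minimal sketch using Python's built-in sqlite3 module together with pandas. It is only one way to do it, not what the Kaggle posters necessarily used. The table names, column names and the helper functions `csv_to_sqlite` and `merge_in_sqlite` are made up for the example, and the tiny in-memory CSVs stand in for the real competition files.

```python
import io
import sqlite3

import pandas as pd


def csv_to_sqlite(conn, csv_source, table, chunksize=100_000):
    """Append a CSV to a SQLite table chunk by chunk, so only one chunk is in RAM."""
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        chunk.to_sql(table, conn, if_exists="append", index=False)


def merge_in_sqlite(conn, main, extra, key):
    """Aggregate `extra` to one row per key inside SQLite, then left-join it onto `main`.

    Table and column names are interpolated into the query, so pass trusted names only.
    """
    query = f"""
        SELECT m.*, a.n_rows
        FROM {main} AS m
        LEFT JOIN (
            SELECT {key}, COUNT(*) AS n_rows
            FROM {extra}
            GROUP BY {key}
        ) AS a ON m.{key} = a.{key}
    """
    return pd.read_sql_query(query, conn)


# Toy data (hypothetical columns); in practice csv_source would be a file path
# and the connection would point at a database file on disk.
conn = sqlite3.connect(":memory:")
csv_to_sqlite(conn, io.StringIO("client_id,target\n1,0\n2,1\n3,0\n"), "main_tbl", chunksize=2)
csv_to_sqlite(conn, io.StringIO("client_id,amount\n1,100.0\n1,50.0\n3,75.0\n"), "extra_tbl", chunksize=2)

result = merge_in_sqlite(conn, "main_tbl", "extra_tbl", "client_id")
print(result)
```

The point of the design is that the join and the aggregation happen inside the database, so pandas only ever holds one CSV chunk or the final, already reduced result.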

Also: Please don’t @ admins unless you’re sure that the thread is directly related to them.

(s.s.o) #3

You may check this example kernel, which uses the whole dataset.

(Michael) #4

Hello Sanyam,

Can you send me the URL to this post? I cannot find it on the Kaggle competition page.

I’m also currently struggling with the data wrangling in my (limited) RAM. :wink:

Thank you very much & best regards

(Michael) #5

So far I have been looking for a solution to this problem with dask (http://dask.pydata.org).
On my Paperspace machine, merging the data is no problem, but once everything is merged I cannot write it to disk without a MemoryError.
So far I haven’t found a solution on the net. I guess that if I want to use the Home Credit Default Risk dataset to look into the “rossmann fast.ai” approach, I have to switch to a more powerful AWS instance (as suggested in “Most effective ways to merge ‘big data’ on a single machine”), use another, hopefully smaller, dataset like https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data, or learn SQL. :wink:
Any suggestions?
Is the Paperspace machine really my bottleneck?

(David Salazar) #6

Hi, I am working on the same competition with a Paperspace CPU and have not run into your problem. A couple of recommendations:

  1. You should try to reduce the RAM usage of any dataset you load, and consequently of any dataset you create. Every column in the dataframe has a particular numpy dtype, but the defaults (typically int64 and float64) are sometimes overkill. If you downcast them, the dataframe takes less memory and operations on it get faster.
  2. Although you say the merging is no problem, consider the following. Load the different datasets sequentially: load two of them, merge them, then delete the two you just loaded and call the garbage collector. Continue doing the same with the other datasets.
  3. When writing to disk, use the df.to_feather method.
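
The first two recommendations could be sketched roughly as follows. This is an illustration under assumptions, not David's actual code: `downcast_numeric` and `merge_sequentially` are hypothetical helper names, the loaders are zero-argument callables (for example `lambda: pd.read_csv(path)`), and the toy frames use made-up columns. Note that downcasting floats to float32 trades some precision for memory.

```python
import gc

import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype that fits."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float64 -> float32 where possible; this loses some precision
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def merge_sequentially(base, loaders, key):
    """Left-join one extra table at a time onto `base`, freeing each after use."""
    merged = base
    for load in loaders:
        table = downcast_numeric(load())
        merged = merged.merge(table, on=key, how="left")
        del table      # drop the reference to the table just merged
        gc.collect()   # ask Python to reclaim the memory now
    return merged


# Toy example with hypothetical columns
raw = pd.DataFrame({
    "id": np.array([1, 2, 3], dtype="int64"),
    "x": np.array([0.5, 1.5, 2.5], dtype="float64"),
})
small = downcast_numeric(raw)
extra = pd.DataFrame({"id": [1, 2, 3], "y": [10, 20, 30]})
merged = merge_sequentially(small, [lambda: extra], key="id")
print(merged.dtypes)
```

Tables that hold several rows per key would need to be aggregated to one row per key before the join, otherwise the merged frame grows instead of staying the size of the base table. For the third point, `merged.to_feather(path)` needs the pyarrow package installed.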

Hope this helps!