Kaggle: Home Credit Competition


(David Salazar) #1

Hi!

Right now, there’s a Kaggle competition running on predicting credit delinquency. The problem is therefore a classification one with structured data, a kind of problem that is never actually covered in the Deep Learning Course. That is why this is a perfect competition to get out of our ML comfort zone.

I just created a Kaggle kernel (I had trouble with my first Kaggle kernel, so I created another one) with the bare basics to start using neural networks on the problem. I had to tweak the code from the Rossmann lecture a little, but I finally got it running with categorical embeddings and a weighted loss function to try to account for the class imbalance.

Right now, this model is still lagging behind boosted trees, as are all the other kernels using neural networks. If you have any recommendations or questions, I’d be happy to discuss them!


(Will) #2

Seems interesting. I’ll be sure to check this out and share any ideas.


(Will) #3

Some ideas for you:

Instead of imputing the missing categoricals, you could try filling them with ‘missing’ and adding a boolean column to indicate whether the value was missing or not. For that matter, before filling any missing data I would add a boolean indicator column. It’s not clear to me whether you had already done this before loading the merged data from feather.
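A minimal pandas sketch of the indicator-column idea, for illustration only. The column names and the helper `add_missing_indicators` are hypothetical, and filling numerics with the median is an assumption rather than something stated in the thread.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df, cat_cols, num_cols):
    """Return a copy with <col>_was_missing flags added before any filling."""
    out = df.copy()
    for col in list(cat_cols) + list(num_cols):
        # Record missingness first, so the information survives the fill.
        out[col + "_was_missing"] = out[col].isna()
    for col in cat_cols:
        # Treat "missing" as its own category.
        out[col] = out[col].fillna("missing")
    for col in num_cols:
        # Assumption: median fill; any other fill value would work the same way.
        out[col] = out[col].fillna(out[col].median())
    return out

# Hypothetical toy data, not the competition files.
toy = pd.DataFrame({
    "OCCUPATION": ["Driver", None, "Cook"],
    "INCOME": [100.0, np.nan, 300.0],
})
filled = add_missing_indicators(toy, ["OCCUPATION"], ["INCOME"])
```

The flags are computed before the fill on purpose: once the gaps are filled, the model can no longer tell an imputed value from a real one unless the indicator is there.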

I would also focus on creating wider dense layers rather than narrower but deeper ones. So instead of going 100x100x100, try going 500x250.

I would also take the embeddings learned from your best neural net and throw them into your boosted tree model to see if that can boost your best performance there.

With respect to your learning rate finder, try passing parameters that zoom in on the elbow of your learning rate curve. So in your case, the first time you call lr_find(), try learn.lr_find(start_lr=1e-5, end_lr=1e-3). I have found this to be effective on structured data, where there is a very small range of effective learning rates. You may find that the LR curve itself changes when doing this. You may also want to increase your batch size, depending on how choppy the loss curve is the second time you call lr_find().

Instead of adding weights, try copying the underrepresented target rows to balance the dataset. Jeremy has reported this to be very effective, and it’s worked recently in some Kaggle competitions.

Another thing I’m playing with when doing this is applying a VERY slight feature transformation when copying these rows over. Randomly altering numerics by <1% can work, but I don’t have any rules of thumb there and am still experimenting.
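The row-copying idea with a slight perturbation could be sketched as follows. This is an illustration under assumptions, not anyone's actual kernel code: the helper `oversample_minority`, the multiplicative jitter, and the toy columns are all hypothetical, and it assumes a binary target.

```python
import numpy as np
import pandas as pd

def oversample_minority(df, target, num_cols, jitter=0.01, seed=0):
    """Copy minority-class rows until both classes have the same count,
    scaling numeric columns of the copies by a factor within +/- jitter."""
    rng = np.random.default_rng(seed)
    counts = df[target].value_counts()
    minority = counts.idxmin()
    n_extra = int(counts.max() - counts.min())
    if n_extra == 0:
        return df.copy()
    extra = (
        df[df[target] == minority]
        .sample(n=n_extra, replace=True, random_state=seed)
        .copy()
    )
    # One random factor per copied value, drawn from [1 - jitter, 1 + jitter).
    factors = rng.uniform(1 - jitter, 1 + jitter, size=(n_extra, len(num_cols)))
    extra[num_cols] = extra[num_cols].to_numpy() * factors
    return pd.concat([df, extra], ignore_index=True)

# Hypothetical toy data: 8 negatives, 2 positives.
toy = pd.DataFrame({"AMT": np.arange(1.0, 11.0), "TARGET": [0] * 8 + [1] * 2})
balanced = oversample_minority(toy, "TARGET", ["AMT"])
```

Something like this should be applied to the training split only; copying rows before the train/validation split would leak near-duplicates into validation.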

Hope these help, report back with results!


#4

I just wanted to say thank you for putting the notebook together :slight_smile: I came across it on Kaggle even before I found this thread and it was a pleasure to read!

You share some really cool ideas in the notebook and I am gonna steal a few of them from you! :wink:


(David Salazar) #5

Thanks for all your suggestions. I will heed your advice and report back next weekend!


(Kodiak Labs) #6

@whamp: with respect to altering numerics by 1% when adding them to ‘up sample’ the under-represented data set, you’re moving along the lines of the SMOTE technique for upsampling.

I think this can also work with categorical variables, but I have yet to find a reliable resource.


(Rahim Shamsy) #7

For this competition, I am finding the data files too big to load into pandas. There are ways suggested online, such as using Dask, but I’d like to know what you did. Essentially, the issue is that the files are too big, and when I run pd.read_csv(), it raises a MemoryError.

Thanks


(Kodiak Labs) #8

@rshamsy: could you possibly use a stratified sample of the dataset that would fit into memory, and work from there?
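One way a stratified sample could be drawn without first loading the whole file is to read the CSV in chunks and sample each class within every chunk. This is a sketch under assumptions: the helper `sample_csv_stratified` and the toy CSV are hypothetical, and per-chunk sampling is only approximately stratified (a class with very few rows in a chunk can round down to zero).

```python
import io
import pandas as pd

def sample_csv_stratified(path_or_buffer, target, frac, chunksize=100_000, seed=0):
    """Read a CSV chunk by chunk, keeping the same fraction of each class."""
    parts = []
    for chunk in pd.read_csv(path_or_buffer, chunksize=chunksize):
        # Sampling within each class keeps the class ratio roughly intact.
        parts.append(chunk.groupby(target).sample(frac=frac, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Hypothetical in-memory CSV standing in for a large file: 10% positives.
rows = ["ID,TARGET"] + [f"{i},{1 if i % 10 == 0 else 0}" for i in range(100)]
demo_csv = io.StringIO("\n".join(rows))
small = sample_csv_stratified(demo_csv, "TARGET", frac=0.2, chunksize=50)
```

Only one chunk plus the accumulated sample is held in memory at a time, so `chunksize` can be tuned to the available RAM.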


(Giuseppe Merendino) #9

Thank you for your notebook @davidsalazarvergara
I think that those who make good use of the time series will win this competition.

@rshamsy I started using H2Database, loading all the CSVs into tables for data exploration. I’ll create small samples to quickly test some solutions.


(Will) #10

Interesting, I wasn’t aware of that technique. I’ll have to check it out, thank you!


(David Salazar) #11

I am using Kaggle Kernels to try this competition and there’s no reason to use Dask. A couple of suggestions:

1. You should try to reduce the RAM usage of any dataset you load, and consequently of any dataset you create. Every column in the dataframe has a particular numpy dtype, but the defaults are sometimes overkill. If you change them, the dataframes take less memory and your machine will be faster.
2. Load the different datasets sequentially: i.e., load two of them, merge them, then delete the two you just loaded and call the garbage collector. Continue doing the same with the other datasets.
3. When writing to disk, use the df.to_feather method.
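The suggestions above could look roughly like this in pandas. It is a sketch, not the kernel's actual code: the helper `downcast`, the toy tables, and their column names are illustrative, and downcasting floats to float32 trades some precision for memory.

```python
import gc
import numpy as np
import pandas as pd

def downcast(df):
    """Shrink numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Illustrative stand-ins for two of the competition tables.
app = downcast(pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                             "AMT_CREDIT": [1000.0, 2500.0, 400.0]}))
extra = downcast(pd.DataFrame({"SK_ID_CURR": [1, 3],
                               "DAYS_CREDIT": [-100, -7]}))

# Merge, then drop the inputs and ask the garbage collector to free them
# before loading the next table.
merged = app.merge(extra, on="SK_ID_CURR", how="left")
del app, extra
gc.collect()

# merged.to_feather("merged.feather")  # needs pyarrow installed; path is illustrative
```

Checking `df.memory_usage(deep=True).sum()` before and after `downcast` is a quick way to see how much a given table actually shrinks.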


(David Salazar) #12

I have created another kernel with your suggestions. The results did improve: from 0.759 to 0.763.

  1. For the missing values, I already was handling them the way you said.
  2. I did tinker with the network architecture and it improved.
  3. Oversampling the imbalanced target class was the one that improved my results the most.
  4. Have not yet played with slight feature transformation when oversampling. Maybe I will try the SMOTE technique that @KodiakLabs mentioned.

Thanks for your suggestions!


(Will) #13

Glad it’s working for you! I was getting similar results of ~.76 or so with an architecture 6 layers deep going 800, 600, 400, 200, 100, 20 and dropout of .4, .3, .2, .2, .1, .01.

These are far from optimal, just some things I experimented with when I had time. I’m actually getting married this weekend so my attention has been slightly diverted!

Have you tried putting the embedding matrix created in your neural net fitting process into your XGBoost (or similar) model as additional features? That has shown very good results in the past.
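As a rough illustration of what "embeddings as additional features" can mean: each categorical code is replaced by its learned embedding row, and the resulting columns are appended to the tabular features before fitting the tree model. The sketch below uses a random matrix as a stand-in for trained weights; how the real matrix is pulled out of the network depends on the library version and is not shown. The helper `embedding_features` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def embedding_features(codes, emb_matrix, prefix):
    """Look up one embedding row per integer category code."""
    vectors = emb_matrix[np.asarray(codes)]
    cols = [f"{prefix}_emb{i}" for i in range(emb_matrix.shape[1])]
    return pd.DataFrame(vectors, columns=cols)

# Stand-in for a trained embedding matrix: 4 categories, 3 dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 3))

# Hypothetical tabular data with an integer-coded categorical column.
tabular = pd.DataFrame({"OCCUPATION_code": [0, 2, 2, 3],
                        "INCOME": [1.0, 2.0, 3.0, 4.0]})
feats = embedding_features(tabular["OCCUPATION_code"], emb, "OCCUPATION")

# The widened frame is what would be passed to the boosted-tree model.
tree_input = pd.concat([tabular, feats], axis=1)
```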


#14

Congrats Will! Go divert your attention 100%.


(Sophia Wang) #15

I encountered a similar problem. After talking with my teammates, we think it might not be necessary to load everything into pandas. Pandas can still handle one, two, or three files separately, and we are selecting important features from a small set of features. What do you think about this approach?


(Rahim Shamsy) #16

Thanks, I will try this right now.


(Rahim Shamsy) #17

Thanks for the input @sophia.onion.

It seems like a good approach to pick only the features that you think are important. But in the exploration stage, you would want to test and see which ones are important, and the control should probably be the case where all features are present. Then you would go on and remove features to see what difference it makes. I would think that is the best testing sequence. You could work the other way round, starting with few features and subsequently adding more, but eventually you may run into the memory problem again.

Or are you trying something different?

Rahim


(Will) #18

So I’ve been trying to implement the SMOTE method by running

import pandas as pd
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=23, ratio='minority', n_jobs=-1)
%time X_train_res, y_train_res = sm.fit_sample(X_train, y_train)

# Convert back to DataFrames to get the dataloader to work
X_trn_res = pd.DataFrame(data=X_train_res, columns=df.columns)
y_trn_res = pd.DataFrame(data=y_train_res, columns=['TARGET'])
y_valid = pd.DataFrame(data=y_valid, columns=['TARGET'])

My code works all the way through training before attempting to implement the SMOTE code.
After inserting the code above, I have begun to get a KeyError in lr_find() and nothing else will run.

It feels like it has to be a simple issue of converting the numpy arrays output by the SMOTE code back into dataframes so I can use my existing code, but for the life of me I can’t figure out what’s wrong. Does anyone have experience with the imblearn package or SMOTE and getting it to play nicely with fast.ai?

Please forgive the huge error message dump but after a second day debugging with no progress I’m flailing for help a little bit here. There must be a way the dataloader is handling the indexes in a way that gets destroyed when being upsampled in the SMOTE algo and output as a np array and then reconstituted as a dataframe. At this point, I’m just going to keep things as arrays and convert the remaining dataframes to arrays and use the ColumnarModelData.from_arrays() method.

md2  = ColumnarModelData.from_data_frames('', trn_df = X_trn_res, val_df = X_valid, trn_y = y_trn_res.astype('int'),
                                         val_y = y_valid.astype('int'), cat_flds=cat_vars, bs=256, is_reg= False,is_multi=False)

m2 = MixedInputModel(emb_szs, n_cont = len(df.columns)-len(cat_vars), emb_drop = 0.1, out_sz = 2,
                    szs = [1000, 800, 600, 400, 200], drops = [0.5, 0.4, 0.3, 0.2, 0.1],y_range = None,
                    use_bn = False, is_reg = False, is_multi = False)

bm2 = BasicModel(m2.cuda(), 'binary_classifier')

learn2 = StructuredLearner(md2, bm2)    

learn2.lr_find()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\Lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 111286

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-63-19b58a336fcf> in <module>()
----> 1 learn2.lr_find()

~\fastai\courses\dl1\structured\fastai\learner.py in lr_find(self, start_lr, end_lr, wds, linear, **kwargs)
    328         layer_opt = self.get_layer_opt(start_lr, wds)
    329         self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)
--> 330         self.fit_gen(self.model, self.data, layer_opt, 1, **kwargs)
    331         self.load('tmp')
    332 

~\fastai\courses\dl1\structured\fastai\learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    232             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    233             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234             swa_eval_freq=swa_eval_freq, **kwargs)
    235 
    236     def get_layer_groups(self): return self.models.get_layer_groups()

~\fastai\courses\dl1\structured\fastai\model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
    135         if all_val: val_iter = IterBatch(cur_data.val_dl)
    136 
--> 137         for (*x,y) in t:
    138             batch_num += 1
    139             for cb in callbacks: cb.on_batch_begin()

~\Anaconda3\Lib\site-packages\tqdm\_tqdm.py in __iter__(self)
    928 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    929 
--> 930             for obj in iterable:
    931                 yield obj
    932                 # Update and possibly print the progressbar.

~\fastai\courses\dl1\structured\fastai\dataloader.py in __iter__(self)
     86                 # avoid py3.6 issue where queue is infinite and can result in memory exhaustion
     87                 for c in chunk_iter(iter(self.batch_sampler), self.num_workers*10):
---> 88                     for batch in e.map(self.get_batch, c):
     89                         yield get_tensor(batch, self.pin_memory, self.half)
     90 

~\Anaconda3\envs\fastai\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

~\Anaconda3\envs\fastai\lib\concurrent\futures\_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~\Anaconda3\envs\fastai\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~\Anaconda3\envs\fastai\lib\concurrent\futures\thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~\fastai\courses\dl1\structured\fastai\dataloader.py in get_batch(self, indices)
     73 
     74     def get_batch(self, indices):
---> 75         res = self.np_collate([self.dataset[i] for i in indices])
     76         if self.transpose:   res[0] = res[0].T
     77         if self.transpose_y: res[1] = res[1].T

~\fastai\courses\dl1\structured\fastai\dataloader.py in <listcomp>(.0)
     73 
     74     def get_batch(self, indices):
---> 75         res = self.np_collate([self.dataset[i] for i in indices])
     76         if self.transpose:   res[0] = res[0].T
     77         if self.transpose_y: res[1] = res[1].T

~\fastai\courses\dl1\structured\fastai\column_data.py in __getitem__(self, idx)
     35 
     36     def __getitem__(self, idx):
---> 37         return [self.cats[idx], self.conts[idx], self.y[idx]]
     38 
     39     @classmethod

~\Anaconda3\Lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

~\Anaconda3\Lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

~\Anaconda3\Lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

~\Anaconda3\Lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

~\Anaconda3\Lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 111286
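
A possible reading of the traceback above, offered as a guess rather than a verified diagnosis: the last library frames are in pandas' DataFrame.__getitem__ and _getitem_column, which suggests that one of the objects indexed in ColumnarDataset.__getitem__ is still a DataFrame. Indexing a DataFrame with a plain integer looks up a column label, not a row, so a row index such as 111286 raises KeyError. Since trn_y and val_y are passed as DataFrames in the call above, they are a likely candidate. The standalone sketch below only reproduces the pandas behaviour; it does not use fast.ai.

```python
import pandas as pd

# A one-column target frame, like the one rebuilt after SMOTE.
y_df = pd.DataFrame({"TARGET": [0, 1, 0, 1]})

caught = None
try:
    y_df[2]  # looks for a COLUMN labelled 2, not row 2
except KeyError as err:
    caught = err

# A plain 1-D numpy array supports positional row indexing.
y_arr = y_df["TARGET"].to_numpy()
row = y_arr[2]
```

If that is the cause, passing the target as an array (for example y_train_res as returned by SMOTE, or y_trn_res['TARGET'].values) instead of a DataFrame may be worth trying; this has not been checked against the fast.ai version shown in the traceback.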

#19

Definitely interested in classification of structured data using deep learning techniques / models.

Please let me know how I can get involved.


#20

Hi David

Why did you use the parameter out_sz = 2 when defining the MixedInputModel? I’m having a hard time trying to convert the predictions of my binary model (out_sz = 1) to probabilities between 0 and 1 like the sample submission. If the model with an output of size 2 returns the log of the probabilities, what does size 1 return?

My model returns values in the range (-0.08659633, 1.2345045), and their exponentials are in the range (0.9170472, 3.436675). Can I transform these numbers into what I want, or do I have to change the model?
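
For reference, a small numpy sketch of the two conversions that usually come up here. It rests on assumptions: that the size-2 model really outputs log-probabilities (as stated in the question), and that a size-1 model outputs a raw logit, which is only true if it was trained with a logit-based loss. The function names are illustrative.

```python
import numpy as np

def probs_from_log_probs(log_probs):
    """Two-column log-probabilities -> probability of class 1."""
    return np.exp(np.asarray(log_probs))[:, 1]

def sigmoid(logits):
    """Single raw logit per row -> probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits)))

# Hypothetical outputs, not taken from any real model.
two_col = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
p_two = probs_from_log_probs(two_col)

one_col = np.array([-2.0, 0.0, 2.0])
p_one = sigmoid(one_col)
```

If the size-1 model was instead trained as a regression on the 0/1 target, its outputs are not logits and exp() will not give probabilities, which would be consistent with values above 1. For a rank-based metric such as ROC AUC, any monotonic mapping into [0, 1] (for example clipping) leaves the score unchanged.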