Models in Kaggle kernel

(blevy) #1

I am running out of memory when trying to commit my model (RandomForestRegressor) to a Kaggle kernel. I’m only using 60 estimators and about 3m rows. One thought I had is to run the model locally or on a remote machine, store the fitted model, then upload the model as bytes. Has anyone successfully done this? I’ve tried saving the model as a pickle, but I think the size was too large. I’ve also tried compressing it and saving it as a bytes object, but I’m not sure how to get this object from my machine to the Kaggle kernel server.
Any input would be appreciated; I’m fairly sure there’s an easier way to handle RAM issues.
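
For reference, the save-compress-reload idea can be sketched with only the standard library's pickle and gzip modules. The function names save_compressed and load_compressed and the file name are hypothetical, and the dict below is just a stand-in for a fitted model. Getting the file onto the kernel is not shown; uploading it as a Kaggle dataset and attaching that dataset to the kernel is one option, stated here as an assumption.

```python
import gzip
import os
import pickle
import tempfile


def save_compressed(obj, path):
    """Pickle obj and write it gzip-compressed to path."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted model; any picklable object is handled the same way.
model = {"n_estimators": 60, "weights": [0.0] * 100_000}

# Hypothetical file name, written to the system temp directory.
path = os.path.join(tempfile.gettempdir(), "model.pkl.gz")
save_compressed(model, path)
restored = load_compressed(path)

# Compare the uncompressed pickle size with the size of the file on disk.
raw_size = len(pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL))
print(raw_size, os.path.getsize(path))
```

How much the file shrinks depends on the model; a real random forest will usually compress less than this repetitive stand-in. Only unpickle files you created yourself, since loading a pickle can run arbitrary code.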

(Maciej Kedziora) #2

Hi blev,

Please see the link below:

It has code to reduce the memory footprint of the data. It should help.
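
The linked code is not reproduced here, but a common way to shrink a DataFrame is to downcast numeric columns to the smallest dtype that holds their values. A minimal sketch, assuming pandas and NumPy are available; the function name downcast_numeric and the toy DataFrame are hypothetical:

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int8/int16/int32, depending on the value range
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 where possible; this loses some precision
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data standing in for the real training set.
df = pd.DataFrame({
    "a": np.arange(1000, dtype="int64") % 100,
    "b": np.linspace(0.0, 1.0, 1000),
})
small = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, after)
```

Check that float32 precision is acceptable for your target and features before applying this to everything.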
The other solution is NOT to use a GPU-enabled kernel.
The reason is that Kaggle gives you 4 CPUs and 17 GB of RAM for a CPU kernel, but only 2 CPUs and 14 GB of RAM for a GPU-enabled kernel.
If you didn’t enable the GPU, you are on CPU by default.
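
To check what a given kernel actually provides, a small standard-library sketch like the one below may help. The function name kernel_resources is hypothetical, and the RAM lookup relies on os.sysconf, which is only available on Unix-like systems, so it returns None elsewhere.

```python
import os


def kernel_resources():
    """Return (cpu_count, total_ram_gb); either value may be None if unknown."""
    cpus = os.cpu_count()
    ram_gb = None
    if hasattr(os, "sysconf"):
        try:
            page_size = os.sysconf("SC_PAGE_SIZE")
            n_pages = os.sysconf("SC_PHYS_PAGES")
            ram_gb = page_size * n_pages / 1024 ** 3
        except (ValueError, OSError):
            # These sysconf names are not defined on every platform.
            ram_gb = None
    return cpus, ram_gb


cpus, ram_gb = kernel_resources()
print("CPUs:", cpus, "RAM (GB):", ram_gb)
```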

Also, you can try to limit the number of rows used, either with the subset parameter of proc_df or by calling set_rf_samples with, let’s say, 500 000.
I would prefer the second option, as Jeremy said that each tree will then use a different 500 000 rows.

Next, you can try using fewer cores. In one of my kernels I had errors when running with n_jobs=-1, but the kernel ran fine with n_jobs=2.

Examples from the lessons:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=500000, na_dict=nas)
set_rf_samples(500000)  # to set samples
reset_rf_samples()  # to reset samples

If you use the second solution, please turn off oob_score in the random forest, as it will be super slow. Due to the implementation in scikit-learn, the OOB score is not calculated only on the unused rows of the sample (for example, 40% of 500 000) but on all rows in the dataset that the tree did not use (so if you set samples to 500 000 on 3m rows, it will use 2 500 000 rows to calculate the score).
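
Putting these suggestions together, a minimal sketch of a forest with per-tree row sampling, two worker processes, and OOB scoring turned off might look like this. It assumes scikit-learn 0.22 or later, where RandomForestRegressor has a max_samples parameter that plays a similar role to set_rf_samples; it does not use the course library, and the synthetic data and the scaled-down sample size of 500 are stand-ins for the real 3m rows and 500 000.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=2000)

model = RandomForestRegressor(
    n_estimators=60,
    max_samples=500,   # each tree draws its own 500 rows (stand-in for 500 000)
    n_jobs=2,          # fewer workers than n_jobs=-1, as suggested above
    oob_score=False,   # skip the slow out-of-bag scoring
    random_state=0,
)
model.fit(X, y)

# R^2 on the training data, only as a quick sanity check.
print(model.score(X, y))
```

With oob_score=False you need a separate validation set to estimate generalization error.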