I need help with handling multiple tables of data. What strategies should I use in combining those data? @jeremy
You should check the Kaggle discussions for the competition. They might be a better place to ask this.
Loading everything into pandas makes RAM usage blow up to the maximum. People have found that using SQL to merge the tables is less memory-intensive.
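To illustrate the SQL route, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are made up for the example and are not the competition's real schema. The idea is to aggregate and join inside the database so that only the compact result is pulled into Python.

```python
import sqlite3

# Hypothetical miniature versions of two tables; names are illustrative only.
conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data on disk
conn.execute("CREATE TABLE application (id INTEGER PRIMARY KEY, income REAL)")
conn.execute("CREATE TABLE bureau (id INTEGER, credit REAL)")
conn.executemany("INSERT INTO application VALUES (?, ?)", [(1, 50.0), (2, 80.0)])
conn.executemany("INSERT INTO bureau VALUES (?, ?)", [(1, 10.0), (1, 30.0), (2, 5.0)])

# Aggregate the child table inside the database, then join it to the main
# table, so the one-to-many rows never have to be materialised in pandas.
query = """
    SELECT a.id, a.income, b.n_credits, b.total_credit
    FROM application AS a
    LEFT JOIN (
        SELECT id, COUNT(*) AS n_credits, SUM(credit) AS total_credit
        FROM bureau
        GROUP BY id
    ) AS b ON a.id = b.id
    ORDER BY a.id
"""
rows = conn.execute(query).fetchall()
conn.close()
```

With real data you would load each csv into the database file once and then run queries like this against it.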
Also: Please don’t @ admins unless you’re sure that the thread is directly related to them.
can you send me the URL to this post (I cannot find it on the kaggle competition page)?
I’m also currently struggling with the data wrangling in my (limited) RAM.
Thank you very much & best regards
So far I have been looking for a solution to this problem with dask (http://dask.pydata.org).
On my Paperspace machine, merging the data is no problem, but once everything is merged I cannot write it to disk without a MemoryError.
So far I haven’t found a solution on the net. I guess if I want to use the Home Credit Default Risk dataset to look into the “rossmann fast.ai” approach, I have to switch to a more powerful AWS instance (as suggested in “Most effective ways to merge ‘big data’ on a single machine”), use another, hopefully smaller, dataset like https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data, or learn SQL.
Is the Paperspace machine really my bottleneck?
Hi, I am working on the same competition with a Paperspace CPU and have not run into your problem. A couple of recommendations:
- Reduce the RAM usage of every dataset you load, and consequently of every dataset you create. Every column in a dataframe has a particular numpy dtype, but the defaults (typically 64-bit) are often overkill. Downcasting them shrinks the dataframe, and operations on it run faster.
- Although you say the merging is no problem, consider loading the datasets sequentially: load two of them, merge them, then delete the two you just loaded and call the garbage collector. Continue the same way with the remaining datasets.
- When writing to disk, use the df.to_feather method.
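
The three recommendations above could be sketched roughly as follows. This is only an illustration: the tiny dataframes, the `id` key, and the helper names `downcast` and `merge_sequentially` are assumptions made for the example, and in practice each loader would be a `pd.read_csv` call.

```python
import gc

import numpy as np
import pandas as pd


def downcast(df):
    """Shrink numeric columns in place to the smallest dtype pandas will allow."""
    for col in df.select_dtypes(include=[np.integer]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=[np.floating]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df


def merge_sequentially(base, loaders, key):
    """Left-merge tables into `base` one at a time, freeing each one after use."""
    merged = downcast(base)
    for load in loaders:
        part = downcast(load())  # only one extra table is in memory at a time
        merged = merged.merge(part, on=key, how="left")
        del part                 # drop the reference ...
        gc.collect()             # ... and ask Python to reclaim the memory now
    return merged


# Hypothetical stand-ins for the competition tables.
app = pd.DataFrame({"id": [1, 2, 3], "income": [50.0, 80.0, 65.5]})
loaders = [
    lambda: pd.DataFrame({"id": [1, 2], "n_credits": [2, 1]}),
    lambda: pd.DataFrame({"id": [2, 3], "balance": [0.5, 1.5]}),
]
merged = merge_sequentially(app, loaders, key="id")

# Writing to disk; left commented out because to_feather needs pyarrow installed.
# merged.to_feather("merged.feather")
```

Note that columns which pick up missing values in a left merge come back as float64, so it can be worth calling `downcast` once more on the final result.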
Hope this helps!
@jeremy are there particular instances where structured data deep learning works better than others? I tried out the principles you outlined in the Rossmann data lecture on a currently running Kaggle challenge (Home Credit Default Risk), using their base dataset, and got an AUC of 0.503, which obviously ain’t great. Any pointers on which structured data problems embeddings work well for and when they do not? Thanks
If you got an AUC of ~0.5 then this is basically random chance, so there is some issue with your setup. On the Kaggle forum there is a very nice fastai starter kernel that might be worth using as a starting point.
thanks i will try and find that and check it out
Thanks, figured it out. I was actually submitting the 0/1 predictions instead of the probabilities. I now get an AUC of 0.738, which is pretty good given the dataset I am using, but lower than catboost.
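For anyone hitting the same thing, here is a small sketch of why hard 0/1 predictions can drag AUC down to about 0.5. The labels and probabilities are invented for the example, and `roc_auc` is a hand-written helper (not a library function) that computes AUC as the chance that a random positive is ranked above a random negative. When positives are rare and no probability crosses 0.5, every thresholded prediction is 0, the ranking information is lost, and all pairs tie.

```python
def roc_auc(y_true, scores):
    """AUC as P(random positive outranks random negative); ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: two positives among eight rows, all probabilities below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.05, 0.10, 0.20, 0.08, 0.30, 0.25, 0.40]
hard = [int(p >= 0.5) for p in probs]  # thresholding turns every prediction into 0

auc_probs = roc_auc(y_true, probs)  # ranking preserved: 11 of 12 pairs ordered correctly
auc_hard = roc_auc(y_true, hard)    # every pair ties, so this is exactly 0.5
```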
I think these are the ones radek was mentioning (Thanks for the tip, Radek!).
The data preprocessing kernel is also very good:
(thank you very much @davidsalazarvergara, I learned a lot!)
Loading the data for Corporación Favorita Grocery Sales Forecasting in Less Ram Machine
thanks kevin and michael
I was also trying to write a notebook using StructuredLearner and got close to the implementation you mentioned. On top of that, I applied all the phase-wise learnings (by Sylvain), different network depths, different dropouts, etc.
It turns out that the performance (ROC AUC) does not improve beyond 0.74, which is easily achieved with lightgbm or catboost. With that said, I’m trying to understand what is wrong, and this post felt like a nice place to brainstorm.
I think that after merging all the datasets (filling NAs with the mean for numeric variables and the mode for character variables), most of the variables just add noise. I did recursive feature elimination and found 122 variables that give a better AUC than the original 205. I tried my fastai approach with both sets of variables and neither could achieve a score beyond 0.74.
Is it the case that the variables do not have any latent information that embeddings could make use of?
If so, is that a dead end for deep learning on such an example? I always think that neural nets, being universal approximators, can adapt to anything given the freedom to learn latent information, so why do we hit this threshold?
Please suggest anything that you feel is worth trying.
You could try normalizing the inputs, but I do not think you will get a lot of mileage out of this. The winner of the Porto Seguro competition used RankGauss, but just subtracting the mean and dividing by the standard deviation should be okay as well.
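As a rough illustration of the two normalizations, here is a sketch using only the standard library. The `rank_gauss` function below is a simplified rank-to-normal-quantile variant written for this example, not necessarily the exact recipe used in the Porto Seguro solution, and it does not treat tied values specially. The income values are made up.

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def rank_gauss(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf stays finite.
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out


incomes = [20.0, 25.0, 30.0, 35.0, 1000.0]  # one extreme outlier
z = standardize(incomes)  # the outlier still dominates the scale
g = rank_gauss(incomes)   # only the ordering matters, so the outlier is tamed
```

The rank-based version depends only on the ordering of the values, which is why a single extreme value no longer stretches the scale.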
Other than that, it’s feature selection and feature engineering based on the other csv files, and k-fold training with predictions combined across multiple seeds.
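The k-fold plus multiple seeds part could look something like the sketch below. Everything here is an assumption for illustration: `kfold_seed_average` and the `fit_predict` callback are hypothetical names, and the "model" is a dummy that predicts the training base rate, standing in for whatever learner you actually use.

```python
import numpy as np


def kfold_seed_average(X, y, X_test, fit_predict, n_splits=5, seeds=(0, 1, 2)):
    """Average test predictions over every fold of every seed.

    `fit_predict(X_train, y_train, X_test, seed)` is a hypothetical callback
    that trains a model and returns one prediction per test row.
    """
    preds = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), n_splits)
        for k in range(n_splits):
            # Train on all folds except fold k. The held-out fold is unused in
            # this sketch; normally it serves for validation or early stopping.
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
            preds.append(fit_predict(X[train_idx], y[train_idx], X_test, seed))
    return np.mean(preds, axis=0)


def base_rate_model(X_train, y_train, X_test, seed):
    """Dummy stand-in for a real model: predicts the training base rate."""
    return np.full(len(X_test), y_train.mean())


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], dtype=float)
X_test = np.zeros((3, 2))
blended = kfold_seed_average(X, y, X_test, base_rate_model)
```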
You could also try a denoising autoencoder on the data you are using right now, but probably more as a fun exercise.
BTW I don’t think my posting this breaks Kaggle rules, as it’s literally a distillation of everything said in the Kaggle forums up until about 3 weeks ago, which was when I stopped following the discussion. But I would be very surprised if any of the top solutions used a different approach.
I’ve also had trouble using fast.ai when trying to match the results Kagglers get with LGBM.
I got 0.75+ using only the main table, and 0.77+ taking the features from an LGBM kernel, while that kernel itself scores closer to 0.79.
Makes me wonder if neural networks are state of the art for structured data problems.