I am running into a challenge: how to get the fastai tabular learner to perform as well as LightGBM. There are 3 recent Kaggle competitions (Microsoft, VSB, and the still-ongoing Santander) that use tabular data, AUROC, unbalanced data, and a binary classifier.
In all of them LightGBM seems to be doing better; can anyone help me understand why, and what we can do to change that?
Digging around on the forums I did find this older thread acknowledging a similar observation.
Some simple theories I plan on testing out:
Tabular with a CategoryList uses cross entropy for the loss, which is similar to but different from the AUROC that LightGBM is using. That difference could account for some of the gap.
LightGBM has a problem with overfitting, and it looks like the winning models use very deep trees. Perhaps I should be adding more, or more massive, layers.
Tabular seems sensitive to unbalanced data. By oversampling, I have been getting much better results. However, what is the right mix? Should I start with heavy oversampling and walk it back?
Better validation sets. I am not confident that my validation set is genuinely representative of the rest of the data.
LightGBM is hyperparameter sensitive. Perhaps I just haven't found the right hyperparameter "mix" yet.
Any suggestions are welcome! Although for Santander-specific ones we would need to team up to follow Kaggle's rules.
The Microsoft competition was a bit random, I would say; see how the final leaderboard was shuffled. I also tried to run deep learning methods on that dataset but without much success, though I didn't try too much. It would be interesting to experiment. Also interesting: a different boosting method I've used, namely CatBoost, showed a slightly worse result than LightGBM (though in Kaggle terms one would treat it as much worse), using the same features and validation method, even when I tried to tune its parameters with randomized grid search.
I found the same thing with Petfinder: just using the tabular data, I couldn't get as good performance with a neural net as I could with boosted trees. I also found my neural net couldn't outperform a linear model with embeddings (and some custom binning I implemented).
I've been meaning to go back to this and see whether I can create a neural network that works as effectively as boosted trees on the few most predictive features.
I think the best way forward would be to start with a straightforward competition and experiment with approaches.
Tabular with a CategoryList uses cross entropy for the loss, which is similar to but different from the AUROC that LightGBM is using. That difference could account for some of the gap.
I don't think it makes sense to use AUROC for the loss because it's a global metric. The LightGBM docs state that for binary classification they use cross entropy.
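For reference, here is a minimal LightGBM setup for this kind of problem (parameter values are illustrative only, and the X_train/y_train/X_valid/y_valid arrays are assumed to already exist): the binary objective is what actually gets optimized (log loss), while auc is just the evaluation metric used for monitoring and early stopping.

```python
import lightgbm as lgb

# Illustrative parameters only; X_train/y_train/X_valid/y_valid are assumed.
# "binary" optimizes log loss (cross entropy); "auc" is only the metric
# reported on the validation set and used for early stopping.
params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,
    "num_leaves": 63,
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=1000,
    valid_sets=[valid_set],
    early_stopping_rounds=50,  # newer LightGBM: callbacks=[lgb.early_stopping(50)]
)
```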
LightGBM has a problem with overfitting, and it looks like the winning models use very deep trees. Perhaps I should be adding more, or more massive, layers.
That agrees with Jeremy's approach of overfitting your model as a first step. It's definitely worth experimenting with dimensions and depth here.
Tabular seems sensitive to unbalanced data. By oversampling, I have been getting much better results. However, what is the right mix? Should I start with heavy oversampling and walk it back?
Oversampling to the validation/test set proportions is a good idea; my guess is walking it back wouldn't help much, but you could try it.
Better validation sets. I am not confident that my validation set is genuinely representative of the rest of the data.
That would be equally a problem with LightGBM (or, since they're probably using cross-validation, you could do that with your neural network too).
LightGBM is hyperparameter sensitive. Perhaps I just haven't found the right hyperparameter "mix" yet.
Definitely possible; following Jeremy's advice from lesson 8, I would first try to get it to overfit (model size as in 2, learning rate) and then explore regularization.
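If it helps, a minimal sketch of that oversampling experiment, assuming a pandas DataFrame train_df with a binary "target" column (names are placeholders): duplicate minority-class rows in the training split only, and leave the validation split at its natural class balance so the metric stays honest.

```python
import pandas as pd

def oversample_minority(train_df: pd.DataFrame, target: str = "target",
                        ratio: float = 1.0, seed: int = 42) -> pd.DataFrame:
    """Duplicate minority-class rows until minority/majority is roughly `ratio`."""
    counts = train_df[target].value_counts()
    minority_label = counts.idxmin()
    n_needed = int(counts.max() * ratio) - counts.min()
    if n_needed <= 0:
        return train_df
    extra = (train_df[train_df[target] == minority_label]
             .sample(n=n_needed, replace=True, random_state=seed))
    return pd.concat([train_df, extra]).sample(frac=1, random_state=seed)

# Start fully balanced (ratio=1.0) and walk it back toward the natural
# class balance; the validation set is never touched.
balanced_train = oversample_minority(train_df, ratio=1.0)
```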
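Concretely, these are the knobs I would turn for that overfit-then-regularize loop with a fastai v1 tabular learner (values are placeholders, and `data` is an already-built DataBunch):

```python
from fastai.tabular import tabular_learner

# Step 1: try to overfit -- bigger layers, no dropout, no weight decay.
learn = tabular_learner(data, layers=[1000, 500], ps=[0.0, 0.0], emb_drop=0.0)
learn.fit_one_cycle(5, 1e-2, wd=0.0)

# Step 2: once the training loss is well below the validation loss,
# add regularization back: dropout, embedding dropout, weight decay.
learn = tabular_learner(data, layers=[1000, 500], ps=[0.2, 0.1], emb_drop=0.05)
learn.fit_one_cycle(5, 1e-2, wd=0.1)
```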
In VSB I found the validation data behaved completely differently from the test data. I tried converting the signals into images and doing image classification: accuracy on the validation set was pretty high, but on the test set it was appalling. I tried the tabular method as well, and the result there was also pretty poor. The only parameters I felt were worth dabbling with were the hidden layer sizes, but that was just experimentation; I didn't find a solid way to set the hidden layer sizes properly.
I don't know if this is helpful, but Jeremy mentioned the entity embeddings paper in DL1. It is based on an old Kaggle competition (Rossmann). https://arxiv.org/pdf/1604.06737.pdf
I tried converting the signals into images and doing image classification
@karthik.subraveti I would love to see your image classification. I tried the same thing and will try to share it this week. It was HORRIBLE (worse than just predicting all 0s).
@tanyaroosta My big takeaway from the Rossmann paper was that embeddings are amazing and help every model. Unfortunately, the paper came out a year before LightGBM, so there are no LightGBM results in it; we would have to go back and run LightGBM to compare. However, even GBM was at .71 while the NN was at .7 in table 3 when the data was randomized. For time series (table 4) the NN did much better than the trees, which seems to echo a lesson from Practical Machine Learning lessons 3 or 4.
As I'm typing, this seems to echo the observations of @devforfu and @edwardjross that some of these Kaggle competitions might not be a NN strong point at this time (none of these are time series, and all have strange quirks in the data).
So new way ahead!
#1: Scrap it, or just use a callback to stop training based on AUROC (sketch below).
#2: Run tests on layers and sizes.
#3: Run more tests on oversampling.
#4: More cross-validation.
#5: Try to recreate lr_find(), but for hyperparameters.
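For #1, a small sketch of how I would track AUROC without touching the loss (fastai v1 style, assuming a binary `learn` tabular learner is already built and trained):

```python
from sklearn.metrics import roc_auc_score
from fastai.basic_data import DatasetType

# Score the validation set on AUROC after (or between) fit_one_cycle calls;
# preds[:, 1] is the predicted probability of the positive class.
preds, targets = learn.get_preds(ds_type=DatasetType.Valid)
valid_auc = roc_auc_score(targets.numpy(), preds.numpy()[:, 1])
print(f"validation AUROC: {valid_auc:.4f}")
```

Recent fastai v1 releases also ship an AUROC metric you can pass straight to the learner, if your version has it.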
I participated in the Microsoft competition mentioned, using both the LightGBM approach and NNs. Needless to say, with LightGBM I always obtained slightly better results than with NNs.
Caveat: I am a Kaggle rookie, so my experience with this is somewhat limited, but I had the same feeling as @devforfu. The data seemed a little bit random (plus leaderboard changes of more than 1000 positions).
Also, since I don't have enough experience, I would like to get clarification on one issue. The MS dataset needed a lot of curation (mainly entire rows of NaNs). I don't know if this is standard with Kaggle datasets, or maybe they wanted to recreate what we would encounter in normal industry datasets.
I also got better results with boosted trees (LightGBM and CatBoost) on several tabular competitions (without time series).
I am trying to combine embeddings from a NN with a boosted tree, to benefit from the deep knowledge representation of the NN and from the strong bias reduction of the boosted tree.
Maybe this combined approach could give good results.
The MS dataset needed a lot of curation… I don't know if this is standard
Jeremy brought up this topic in the ML course; I would reference it here. (While that discusses NaNs explicitly, the entire ML course is an excellent discussion of data curation.)
In the real world, we can go back and ask questions about the data after determining the importance, and find better ways to collect what is important. IMHO Kaggle is artificial because I can't go back and alter how the data is gathered, or add in even more. My train set and test set are fixed.
That is actually what the authors do in that paper too: they take their embeddings, put them through a random forest, and observe better performance. I think it is a good initial test.
I was thinking along those lines. In the Rossmann paper the authors found that adding embeddings to a gradient boosted tree brought performance from significantly worse than the NN to almost the same. But in this scenario, the embeddings came from a NN that was performing better than the original boosted tree.
I'm curious whether taking embeddings from a NN that performs slightly worse than LightGBM would still lead to improved performance.
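For anyone who wants to try it, a rough sketch of that experiment with a trained fastai v1 tabular learner; the `embeds` attribute and the batch layout are assumptions about the v1 TabularModel, so check them against your version. The idea is to replace each categorical code with its learned embedding vector and hand the widened matrix to LightGBM.

```python
import numpy as np
import torch
import lightgbm as lgb
from fastai.basic_data import DatasetType

def embed_features(learn, ds_type):
    """Run categorical codes through the trained embedding layers and
    concatenate them with the continuous variables."""
    model = learn.model.eval()
    xs, ys = [], []
    with torch.no_grad():
        for (x_cat, x_cont), y in learn.dl(ds_type):
            # one embedding table per categorical column
            emb = [e(x_cat[:, i]) for i, e in enumerate(model.embeds)]
            feats = torch.cat(emb + [x_cont], dim=1)
            xs.append(feats.cpu().numpy())
            ys.append(y.cpu().numpy())
    return np.concatenate(xs), np.concatenate(ys)

# DatasetType.Fix is the training set without shuffling/dropped batches.
X_tr, y_tr = embed_features(learn, DatasetType.Fix)
X_va, y_va = embed_features(learn, DatasetType.Valid)

booster = lgb.train({"objective": "binary", "metric": "auc"},
                    lgb.Dataset(X_tr, label=y_tr),
                    valid_sets=[lgb.Dataset(X_va, label=y_va)],
                    num_boost_round=500)
```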
You should definitely check out the Porto Seguro winning methodology, which used denoising autoencoders to create latent representations of the data and then fed those into an ensemble of models downstream.
I've got a version of this model working with v0.7 and I'm hoping to port it to v1 in the coming weeks.
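In the meantime, here is a bare-bones PyTorch sketch of the idea as I understand it from the write-up (my paraphrase, not the actual winning code; `X` is assumed to be a float tensor of already-scaled features): corrupt the inputs with swap noise, train the autoencoder to reconstruct the clean rows, then use the encoder activations as features downstream.

```python
import torch
import torch.nn as nn

def swap_noise(x: torch.Tensor, p: float = 0.15) -> torch.Tensor:
    """Replace a fraction p of each row's values with the value from a
    random other row of the same column ("swap noise")."""
    mask = torch.rand_like(x) < p
    shuffled = x[torch.randperm(x.size(0))]
    return torch.where(mask, shuffled, x)

class DenoisingAE(nn.Module):
    def __init__(self, n_features: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Full-batch training loop for brevity: reconstruct clean X from corrupted X.
model = DenoisingAE(n_features=X.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(10):
    recon, _ = model(swap_noise(X))
    loss = loss_fn(recon, X)
    opt.zero_grad(); loss.backward(); opt.step()

# Latent features to feed into the downstream ensemble.
with torch.no_grad():
    latent = model.encoder(X)
```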
I took part in the Microsoft competition as well and, like others, was disappointed at how uncompetitive NNs were. I had high hopes because it was a s*** ton of data and NNs typically do well with more data. One thing I did notice, though, was that one of the winners posted his kernels after the competition ended, and he did use NNs, but with some fancy feature engineering. One example that stood out: he took one component of the version number out of the original field (something like 1.2.3.12) and made a new column with, in this case, a value of 12. Apparently that was pretty important, as it was indicative of a high likelihood of malware, and a naive NN embeddings approach would miss it because 1.2.4.12 would be interpreted as a completely different category from 1.2.3.12. If you make a new column you can capture this relationship (a rough sketch of that split is below).
TL;DR: Sadly, even with NNs, feature engineering does not go away and you need to know your data. If we use our understanding to make new columns and categories, NNs may well be competitive with LightGBM.
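A quick pandas sketch of that kind of split (the column name and version values here are made up for illustration): break the dotted version string into separate numeric columns so the last component becomes its own feature instead of being buried inside a high-cardinality category.

```python
import pandas as pd

# Hypothetical column name; the real dataset has several version-like fields.
df = pd.DataFrame({"AppVersion": ["1.2.3.12", "1.2.4.12", "1.3.0.7"]})

# Split "a.b.c.d" into four numeric columns; now 1.2.3.12 and 1.2.4.12
# share the same last-component feature (12) instead of being two
# unrelated categories.
parts = df["AppVersion"].str.split(".", expand=True).astype(float)
parts.columns = ["ver_1", "ver_2", "ver_3", "ver_4"]
df = pd.concat([df, parts], axis=1)
print(df)
```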
Good point. Yeah, naive implementations of LightGBM still outperformed NNs. I tried changing the layers and embedding sizes, but the score did not improve much. I wonder if there is some golden rule for choosing an architecture for a particular dataset that could boost the performance.
I actually tried one trick I thought for sure would work: I made a new column with the LightGBM predictions. I thought for sure that combining that with the NN would yield something better, and my training score got very high, but it wasn't that great on the test set.
Is there a way StratifiedKFold can easily be called?
I have found another contender using a NN that gets almost the same scores as the leading LightGBM kernels, and it handles StratifiedKFold with Keras.
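On the StratifiedKFold question: scikit-learn's splitter only hands back index arrays, so it plugs into fastai just as easily as Keras. A hedged sketch with the fastai v1 tabular API, where df, dep_var, cat_names, cont_names, and procs are placeholders you would already have defined:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from fastai.tabular import TabularList, tabular_learner
from fastai.basic_data import DatasetType

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(df))  # out-of-fold predictions, assuming a 0/1 target

for fold, (train_idx, valid_idx) in enumerate(skf.split(df, df[dep_var])):
    data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                procs=procs)
            .split_by_idx(valid_idx)
            .label_from_df(cols=dep_var)
            .databunch())
    learn = tabular_learner(data, layers=[200, 100])
    learn.fit_one_cycle(3, 1e-2)
    preds, targets = learn.get_preds(ds_type=DatasetType.Valid)
    oof[valid_idx] = preds.numpy()[:, 1]  # assumes class index 1 is the positive class
    print(f"fold {fold}: AUROC {roc_auc_score(targets.numpy(), preds.numpy()[:, 1]):.4f}")

print("overall OOF AUROC:", roc_auc_score(df[dep_var], oof))
```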
I think that using a larger dataset helps the modeling.