Optimizing Tabular Data vs. LightGBM

I am running into a challenge getting tabular to perform as well as LightGBM. Three recent Kaggle competitions (Microsoft, VSB, and the ongoing Santander) all involve tabular data, AUROC as the metric, unbalanced data, and a binary classifier.

LightGBM seems to be doing better on all of them. Can anyone help me understand why, and what we can do to change that?

Digging around on the forums I did find this older thread acknowledging a similar observation.

Some simple theories I plan on testing out:

  1. Tabular with a CategoryList uses cross entropy for the loss, which is similar to but different from the AUROC that LightGBM is reporting. The mismatch could account for some of the gap.

  2. LightGBM is prone to overfitting, and the winning models seem to use very deep trees. Perhaps I should be adding more, or much larger, layers.

  3. Tabular seems sensitive to unbalanced data. By oversampling, I have been getting much better results. However, what is the right mix? Should I start with heavy oversampling and then walk it back?

  4. Better validation sets. I am not confident that my validation set is genuinely representative of the rest of the data.

  5. LightGBM is hyperparameter sensitive. Perhaps I just haven’t found the right hyperparameter “mix” yet.

Any suggestions are welcome! Although for Santander-specific ones, we would need to team up to follow Kaggle’s rules.


The Microsoft competition was a bit random, I would say. See how much the final leaderboard was shuffled. I also tried to run deep learning methods on that dataset but without much success, though I didn’t try too hard; it would be interesting to experiment more. It is also interesting that a different boosting method I’ve used, namely CatBoost, showed a slightly worse result than LightGBM (though in Kaggle terms one would treat the difference as significant), using the same features and validation method, even when I tried to tune its parameters with randomized grid search.


I found the same thing with PetFinder: just using the tabular data, I couldn’t get as good performance with a neural net as I could with boosted trees. I also found my neural net couldn’t outperform a linear model with embeddings (and some custom binning I implemented).

I’ve been meaning to go back to this and see whether I can create a neural network that works as effectively as boosted trees on the few most predictive features.

I think the best way forward would be to start with a straightforward competition and experiment with approaches.

  1. Tabular with a CategoryList uses cross entropy for the loss, which is similar to but different from the AUROC that LightGBM is reporting. The mismatch could account for some of the gap.

I don’t think it makes sense to use AUROC as the loss because it’s a global metric. In the LightGBM docs they state that for binary classification they use cross entropy as the objective, with AUC only as an evaluation metric.
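
For reference, this is roughly what a baseline LightGBM binary setup looks like (a sketch only: `X_train`/`y_train`/`X_valid`/`y_valid` are assumed to exist, and passing `early_stopping_rounds` directly to `lgb.train` is the older, pre-4.0 API). The `binary` objective (log loss / cross entropy) is what actually gets optimized; `auc` is only monitored for early stopping.

```python
import lightgbm as lgb

params = {
    "objective": "binary",   # optimized with log loss (cross entropy)
    "metric": "auc",         # monitored for early stopping, not optimized
    "learning_rate": 0.05,
    "num_leaves": 31,
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=1000,
    valid_sets=[valid_set],
    early_stopping_rounds=50,
)
```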

  2. LightGBM is prone to overfitting, and the winning models seem to use very deep trees. Perhaps I should be adding more, or much larger, layers.

That agrees with Jeremy’s approach of overfitting your model as a first step. It’s definitely worth experimenting with dimensions and depth here.
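
Something like this is what I’d try in fastai v1 (a rough sketch from memory of the v1 API; `df`, `cat_names`, `cont_names` and `dep_var` stand in for your own data):

```python
from fastai.tabular import *  # fastai v1 style imports

procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(0.2, seed=42)
        .label_from_df(cols=dep_var)
        .databunch())

# baseline vs. a wider/deeper variant to see which overfits first
learn_small = tabular_learner(data, layers=[200, 100], metrics=accuracy)
learn_big   = tabular_learner(data, layers=[1000, 500, 250],
                              ps=[0.2, 0.2, 0.2], metrics=accuracy)
learn_big.fit_one_cycle(5, 1e-2)
```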

  3. Tabular seems sensitive to unbalanced data. By oversampling, I have been getting much better results. However, what is the right mix? Should I start with heavy oversampling and then walk it back?

Oversampling to the validation/test set proportions is a good idea; my guess is walking it back wouldn’t help much, but you could try it.
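
For the oversampling itself, a minimal pandas sketch (assuming a DataFrame `train_df` with a binary `target` column; the names are placeholders, and you should only resample the training split, never the validation set):

```python
import pandas as pd

def oversample(df, target_col="target", ratio=1.0, seed=42):
    "Duplicate minority-class rows until minority size = ratio * majority size."
    pos = df[df[target_col] == 1]
    neg = df[df[target_col] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    n_needed = int(len(majority) * ratio) - len(minority)
    if n_needed <= 0:
        return df
    extra = minority.sample(n=n_needed, replace=True, random_state=seed)
    return pd.concat([df, extra]).sample(frac=1, random_state=seed)  # shuffle

train_df = oversample(train_df, ratio=1.0)   # 1.0 = fully balanced
# ratio < 1.0 lets you "walk it back" toward the original distribution
```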

  4. Better validation sets. I am not confident that my validation set is genuinely representative of the rest of the data.

That would be equally a problem for LightGBM (or, since they’re probably using cross-validation, you could do that with your neural network too).

  5. LightGBM is hyperparameter sensitive. Perhaps I just haven’t found the right hyperparameter “mix” yet.

Definitely possible; following Jeremy’s advice from lesson 8, I would try to get it to overfit first (model size as in point 2, learning rate) and then explore regularizing.


In VSB I found the validation data behaved completely differently from the test data. I tried converting the signals into images and doing image classification; on the validation set the accuracy was pretty high, but on the test set it was appalling. I tried the tabular method as well, and the result there was also pretty poor. The only parameters I felt were worth dabbling with were the hidden layer sizes, but that was just experimentation; I didn’t find a solid way to set the hidden layer sizes properly.

I don’t know if this is helpful, but Jeremy mentioned this paper in DL1. It is based on an old Kaggle competition: https://arxiv.org/pdf/1604.06737.pdf

This has been very helpful!

I tried converting the signals into images and doing image classification

@karthik.subraveti I would love to see your image classification approach. I tried the same thing and will try to share it this week. It was HORRIBLE (worse than just predicting all 0s).

@tanyaroosta My big takeaway from the Rossmann paper was that embeddings are amazing and help every model. Unfortunately, it came out a year before LightGBM, so LightGBM results do not appear; we would have to go back and run LightGBM to compare. Even so, GBM was at 0.71 while the NN was at 0.70 in Table 3 when the data was randomized. On the time-series split (Table 4) the NN did much better than trees, which echoes a lesson from Practical Machine Learning, lesson 3 or 4.

As I’m typing, this brings me back to the observations of @devforfu and @edwardjross that some of these Kaggle competitions might not be a NN strong point at this time (none of these are time series, and all have strange quirks in the data).

So, a new way ahead!
Scrap #1; or just use a callback to stop training on AUROC (rough sketch after this list)
#2 Run tests on layer counts and sizes
#3 Run more tests on oversampling
#4 More cross-validation
#5 Try to recreate lr_find() but for hyperparameters
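
For the AUROC callback, something along these lines is what I have in mind (a rough, untested sketch from memory of the fastai v1 callback API, scoring with sklearn’s roc_auc_score; `AUROCMetric` is my own name). Since AUROC is a global metric, the idea is to accumulate all validation predictions for the epoch and score them once, rather than averaging per batch.

```python
import torch
from sklearn.metrics import roc_auc_score
from fastai.callback import Callback

class AUROCMetric(Callback):
    "Accumulate validation preds/targets and report ROC AUC once per epoch."
    def on_epoch_begin(self, **kwargs):
        self.preds, self.targs = [], []

    def on_batch_end(self, last_output, last_target, train, **kwargs):
        if not train:  # only collect on the validation pass
            self.preds.append(torch.softmax(last_output, dim=1)[:, 1].detach().cpu())
            self.targs.append(last_target.detach().cpu())

    def on_epoch_end(self, last_metrics, **kwargs):
        score = roc_auc_score(torch.cat(self.targs).numpy(),
                              torch.cat(self.preds).numpy())
        return {'last_metrics': last_metrics + [score]}

# usage (hedged): learn = tabular_learner(data, layers=[200, 100],
#                                         metrics=[AUROCMetric()])
```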


I participated in the Microsoft competition mentioned, using both the LightGBM approach and NNs. Needless to say, with LightGBM I always obtained slightly better results than with NNs.

Caveat: I am a Kaggle rookie, so my experience with this is somewhat limited, but I had the same feeling as @devforfu. The data seemed a little bit random (plus leaderboard changes of more than 1000 positions).

Also, since I don’t have enough experience, I would like to get a clarification on one issue. The MS dataset needed a lot of curation (mainly entire rows of NaNs). I don’t know if this is standard with Kaggle datasets, or if maybe they wanted to recreate what we would encounter in normal industry datasets.

I also get better results with boosted trees (LightGBM and CatBoost) on several tabular competitions (ones without time series).
I am trying to combine embeddings from a NN with boosted trees, to benefit from the deep knowledge representation of the NN and the strong bias reduction of the boosted tree.
Maybe this combined approach could give good results.

MS dataset needed a lot of curation…I don’t know if this is the standard

Jeremy brought up this topic in the ML course; I would reference that here. (While that discussion covers NaNs explicitly, the entire ML course is an excellent discussion of data curation.)

In the real world, we can go back and ask questions about the data after determining feature importance, and find better ways to collect what matters. IMHO Kaggle is artificial because I can’t go back and alter the data we are gathering or add more. My train set and test set are fixed.

Big thanks for the reference. I will definitely have to check it out.

What if you used a NN to learn embeddings, then used those embeddings in a LightGBM model?


That is actually what the authors do in that paper too. They take their embeddings, put them through a random forest, and observe better performance. I think it is a good initial test.

I was thinking along those lines. In the Rossmann paper the authors found that adding embeddings to a gradient boosted tree brought performance from significantly worse than the NN to almost the same. But in this scenario, the embeddings came from a NN that was performing better than the original boosted tree.

I’m curious if taking embeddings from a NN that performs slightly worse than a LightGBM would lead to improved performance.
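
Roughly what I have in mind (a sketch only, reusing the fastai v1 `learn`/`train_df`/`cat_names` names from earlier posts, with `target` as a placeholder label column). One simplification to note: in practice you would have to reuse the exact category mapping fastai’s Categorify produced so the codes line up with the embedding rows.

```python
import pandas as pd
import lightgbm as lgb

def embed_features(learn, df, cat_names):
    "Replace each categorical column with the rows of its learned embedding matrix."
    df = df.copy()
    for i, col in enumerate(cat_names):
        emb = learn.model.embeds[i].weight.detach().cpu().numpy()
        # simplification: fastai reserves index 0 for #na#, so offset codes by 1
        codes = pd.Categorical(df[col]).codes + 1
        emb_df = pd.DataFrame(emb[codes], index=df.index,
                              columns=[f"{col}_emb{j}" for j in range(emb.shape[1])])
        df = pd.concat([df.drop(columns=col), emb_df], axis=1)
    return df

train_emb = embed_features(learn, train_df, cat_names)
gbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
gbm.fit(train_emb.drop(columns="target"), train_df["target"])
```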


You should definitely check out the Porto Seguro winning methodology, which used denoising autoencoders to create latent representations of the data and then fed those into an ensemble of models downstream.

I’ve got a version of this model working with v0.7 and I’m hoping to port it to v1 in the coming weeks.
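
For anyone curious, the idea is roughly this (a toy PyTorch sketch, not the winner’s actual code; `X` is assumed to be an already-normalized `(n_rows, n_features)` float tensor, and training is full-batch for brevity): corrupt inputs with swap noise, train the autoencoder to reconstruct the clean row, then use the encoder activations as features downstream.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_features, hidden=256, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def swap_noise(x, p=0.15):
    "Replace a fraction p of each row's values with values from random other rows."
    mask = torch.rand_like(x) < p
    shuffled = x[torch.randperm(x.size(0))]
    return torch.where(mask, shuffled, x)

model = DenoisingAE(X.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(20):
    recon = model(swap_noise(X))
    loss = loss_fn(recon, X)        # reconstruct the *clean* row
    opt.zero_grad(); loss.backward(); opt.step()

latent_features = model.encoder(X).detach()  # feed these to downstream models
```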


I took part in the Microsoft competition as well and, like others, was disappointed at how uncompetitive NNs were. I had high hopes because it was a s*** ton of data and NNs typically do well with more data. One thing I did notice, though, was that one of the winners posted his kernels after the competition ended, and he did use NNs, but with some fancy feature engineering. One thing that stood out was that he took a version number out of an original field, something like 1.2.3.12, and made a new column containing, in this case, the value 12. Apparently that was pretty important, as it was indicative of a high likelihood of malware, and a naive NN embeddings approach would fail because 1.2.4.12 would be interpreted as a different category than 1.2.3.12. If you make a new column, you can capture this relationship.
TL;DR: Sadly, even with NNs, feature engineering does not go away and you need to know your data. If we use our understanding to make new columns and categories, NNs may well be competitive with LightGBM.
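
Something along these lines (a hypothetical illustration only; `AvSigVersion` is just an example column name, not necessarily the one he used): split the dotted version string into separate numeric columns so that 1.2.3.12 and 1.2.4.12 share the parts they have in common.

```python
import pandas as pd

# toy frame with a dotted version string
df = pd.DataFrame({"AvSigVersion": ["1.2.3.12", "1.2.4.12", "1.2.3.7"]})
parts = df["AvSigVersion"].str.split(".", expand=True).astype(int)
parts.columns = [f"AvSigVersion_{i}" for i in range(parts.shape[1])]
df = pd.concat([df, parts], axis=1)
print(df)
```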


Interesting point, thanks for sharing.

Feature engineering will always be beneficial. My confusion remains over why LightGBM would outperform the NN with the exact same data.

Oversampling has drastically helped me on my quest. I think LightGBM can deal with the imbalance better than our NN can.


Good point, yeah: naive implementations of LightGBM still outperformed NNs. I tried changing the layers and embedding sizes, but the score did not improve much. I wonder if there is some golden rule for choosing an architecture for a particular dataset that could boost the performance.

I actually tried one trick I thought for sure would work: I made a new column with the LightGBM predictions. I thought combining that with NNs would yield something better, and my training score got very high, but then it wasn’t that great on the test set. :sweat_smile:

Is there a way StratifiedKFold can easily be called?

I have found another contender using a NN that gets almost the same scores as the leading LightGBM kernels, and it handles StratifiedKFold with Keras.

I think that using a larger dataset helps the modeling.
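
For what it’s worth, calling StratifiedKFold from scikit-learn directly is straightforward (a sketch: `X` and `y` are numpy arrays, and `train_model` is a stand-in for whatever model you are fitting, NN or LightGBM):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    X_tr, X_va = X[train_idx], X[valid_idx]
    y_tr, y_va = y[train_idx], y[valid_idx]
    model = train_model(X_tr, y_tr)            # hypothetical helper
    preds = model.predict_proba(X_va)[:, 1]
    scores.append(roc_auc_score(y_va, preds))
    print(f"fold {fold}: AUROC = {scores[-1]:.4f}")
print("mean AUROC:", np.mean(scores))
```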