I am running into a challenge getting tabular to perform as well as LightGBM. Three recent Kaggle competitions (Microsoft, VSB, and the ongoing Santander) all involve tabular data, AUROC as the evaluation metric, unbalanced classes, and a binary classification target.
In all three, LightGBM seems to be doing better. Can anyone help me understand why, and what we can do to change that?
Digging around on the forums, I did find this older thread acknowledging a similar observation.
Some simple theories I plan on testing out:
Tabular with a CategoryList uses cross entropy as its loss, which is related to but different from the AUROC that LightGBM is being evaluated on. That mismatch could account for some of the gap.
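To make the mismatch concrete, here is a small sketch (scikit-learn on toy numbers, not from any of these competitions) showing that cross entropy and ROC AUC can disagree: a model can have better-calibrated probabilities yet a worse ranking.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])

# Model A: confident and mostly well calibrated, but mis-ranks one pair.
p_a = np.array([0.1, 0.2, 0.6, 0.5, 0.9])
# Model B: timid probabilities (worse log loss) but a perfect ranking.
p_b = np.array([0.2, 0.3, 0.4, 0.45, 0.5])

print("A: log_loss=%.3f  auc=%.3f" % (log_loss(y_true, p_a), roc_auc_score(y_true, p_a)))
print("B: log_loss=%.3f  auc=%.3f" % (log_loss(y_true, p_b), roc_auc_score(y_true, p_b)))
# A wins on cross entropy, B wins on AUC: the two objectives can pull apart.
```

So a tabular model trained purely on cross entropy is optimizing a proxy, not the leaderboard metric itself.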
LightGBM is prone to overfitting, and the strong configurations appear to use very deep trees. Perhaps I should be adding more, or larger, layers to the tabular model.
Tabular seems sensitive to unbalanced data. By oversampling the minority class, I have been getting much better results. But what is the right mix? Should I start with heavy oversampling and then walk it back?
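One way to experiment with the mix is to make the positive fraction an explicit knob. A plain-NumPy sketch (`oversample` and `pos_fraction` are names I made up for illustration) that duplicates minority rows until positives hit a target fraction, so it is easy to start high and walk it back:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positives, Kaggle-style imbalance

def oversample(X, y, pos_fraction):
    """Duplicate minority rows (with replacement) until positives
    make up pos_fraction of the returned data."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Solve n_pos / (n_pos + n_neg) = pos_fraction for n_pos.
    n_pos = int(pos_fraction * len(neg_idx) / (1 - pos_fraction))
    sampled = rng.choice(pos_idx, size=n_pos, replace=True)
    idx = np.concatenate([neg_idx, sampled])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_bal, y_bal = oversample(X, y, pos_fraction=0.5)
print(y_bal.mean())  # ~0.5 positives after rebalancing
```

Sweeping `pos_fraction` from 0.5 down toward the natural rate, while watching validation AUROC, would answer the "right mix" question empirically.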
Better validation sets. I am not confident that my validation set is genuinely representative of the rest of the data.
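For the representativeness worry, a stratified split at least guarantees every fold carries the overall class balance. A sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)  # rare positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    # each validation fold's positive rate matches the overall rate
    print(f"fold {fold}: positive rate {y[valid_idx].mean():.3f} "
          f"(overall {y.mean():.3f})")
```

Averaging the model over all five folds would also smooth out an unlucky single split.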
LightGBM is sensitive to its hyperparameters. Perhaps I just haven't found the right hyperparameter "mix" yet.
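For reference, this is the kind of starting point I have in mind. Every number here is a guess to be tuned, not a recommendation, and the commented-out training call assumes the standard `lightgbm` Python API:

```python
# Hypothetical starting parameters for a LightGBM binary classifier on
# imbalanced tabular data; tune everything against a held-out fold.
params = {
    "objective": "binary",
    "metric": "auc",            # evaluate directly on the competition metric
    "learning_rate": 0.05,
    "num_leaves": 31,           # main capacity knob; large values overfit fast
    "max_depth": -1,            # unlimited depth; constrain via num_leaves
    "min_data_in_leaf": 50,     # raise to fight overfitting in deep trees
    "feature_fraction": 0.8,    # column subsampling per tree
    "bagging_fraction": 0.8,    # row subsampling
    "bagging_freq": 1,
    "scale_pos_weight": 10.0,   # upweight the rare positive class
    "verbosity": -1,
}

# Usage (sketch):
# import lightgbm as lgb
# booster = lgb.train(params, lgb.Dataset(X_train, y_train),
#                     valid_sets=[lgb.Dataset(X_valid, y_valid)],
#                     num_boost_round=5000,
#                     callbacks=[lgb.early_stopping(200)])
```

Note that `metric: "auc"` only drives evaluation and early stopping; LightGBM still fits the binary log loss internally, which is the same proxy situation tabular is in.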
Any suggestions are welcome! Although for Santander-specific ones, we would need to team up to follow Kaggle's rules.