Kaggle Comp: Mercedes-Benz Greener Manufacturing ... How to improve a regression problem?

I entered this competition but have a pretty low score and have been unable to improve it.

There aren’t a lot of training or test examples (fewer than 5,000 in each set), and having almost made my way through Part 1, I don’t think we’ve done much with regard to solving regression-based problems (e.g., feature selection and augmentation, what to do when X goes wrong, etc…)

Anyhow, just looking for general advice on how to approach something like this competition. Things like …

  • What kind of network architectures should I explore?

  • What hyperparameters should I be looking at?

  • How to do feature selection, including manipulating current features or adding additional features

Any general methodology to follow for these kinds of regression problems would also be helpful.

While I haven’t had a chance to look at that competition yet, just the kind of data, small amount of training data, etc, makes me initially think “XGBoost”… just sayin’… :slight_smile:

What is XGBoost?

Having spent a couple hours on kaggle, it seems like EVERYONE is using it for this competition (and a bunch of others). As such, it definitely appears like something worth learning and so … I have some questions:

  1. Is it a Neural Network? If so, how does it differ from the architectures we’ve been developing in Part 1?

  2. When is something like XGBoost a preferable option to building a NN and vice versa? What kinds of problems is it best suited to solving, as compared to building a neural network in Keras?

And thanks for the comment!

This competition is very well suited to practicing with ensembles. At the very least, fork some of the best public kernels to get some variety, and play with this great guide from one of the masters: https://mlwave.com/kaggle-ensembling-guide/
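To make the ensembling idea concrete, here is a minimal sketch of the simplest form, a (weighted) average of several models' predictions. The `blend` helper and the three prediction lists are hypothetical, made up for illustration; in practice the lists would come from the submission files of the kernels you forked.

```python
def blend(predictions, weights=None):
    """Weighted average of several models' predictions, row by row.

    predictions: list of equal-length lists, one list per model.
    weights: optional list with one weight per model (defaults to equal weights).
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    # zip(*predictions) yields, for each row, the tuple of all models' predictions.
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]


# Hypothetical predictions from three different models for the same three test rows.
model_a = [100.0, 95.0, 110.0]
model_b = [102.0, 93.0, 108.0]
model_c = [98.0, 97.0, 112.0]

blended = blend([model_a, model_b, model_c])
# Give the model you trust most a larger say in the final prediction.
blended_weighted = blend([model_a, model_b, model_c], weights=[2.0, 1.0, 1.0])
```

Averaging tends to help most when the models make different kinds of errors, which is why forking a variety of kernels matters more than forking several near-identical ones.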


XGBoost is an ensembling algorithm built on gradient boosted decision trees: it adds trees one after another, each one partitioning the data in a way that reduces a loss function. It's not a neural network, but like a NN it defines a loss function and optimizes it to solve a machine learning problem.
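To show what "boosted trees that minimize a loss" means, here is a toy sketch in plain Python. It is not XGBoost itself and leaves out everything that makes XGBoost fast and robust (regularization, second-order gradients, multiple features, deeper trees). It assumes a squared-error loss, a single numeric feature, and depth-1 trees (stumps); the function names and the data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return t, lmean, rmean


def predict_stump(stump, x):
    t, lmean, rmean = stump
    return lmean if x <= t else rmean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up data with a jump between x=4 and x=5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.1, 3.9, 8.0, 9.1, 9.8, 11.2]

baseline = fit_boosted(xs, ys, n_rounds=0)   # predicts the mean everywhere
weak = fit_boosted(xs, ys, n_rounds=1)
strong = fit_boosted(xs, ys, n_rounds=100)
print(mse(baseline, xs, ys), mse(weak, xs, ys), mse(strong, xs, ys))
```

The training error shrinks as rounds are added because each new stump is fit to whatever the ensemble so far still gets wrong. On a real dataset you would measure error on held-out data instead, since more rounds eventually overfit.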