Handling mislabeled tabular data with XGBoost

cmauck10 · February 6, 2023, 7:30pm

Hey FastAi!

When I searched for methods to find errors in tabular data, I wasn’t able to find much, so I decided to write my own notebook and article.

In the article, I outline why data-centric techniques matter and how you can get model-agnostic gains by improving the quality of your data. Not only does this improve model performance on its own, but it also leaves room for the usual model improvements (hyperparameter tuning, architecture optimization, etc.).

I go into more detail in the article, but at a high level I:

1. Trained a baseline XGBoost model on the original data (67% accuracy)
2. Used data-centric techniques (the open-source cleanlab library) to find label issues within the dataset
3. Dropped the examples flagged as mislabeled from the training set
4. Retrained the same XGBoost model on the better-quality data (90% accuracy)

In doing this, classification accuracy rose from 67% to 90%. Put another way, the error rate fell from 33% to 10%, a relative reduction in error of about 70%.
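
In case the 70% figure looks surprising next to a 23-point accuracy gain, here is the arithmetic as a tiny sketch. The function name `relative_error_reduction` is just a label I made up for illustration; the two accuracies are the ones quoted above.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error that was eliminated."""
    err_before = 1.0 - acc_before  # 0.33 for 67% accuracy
    err_after = 1.0 - acc_after    # 0.10 for 90% accuracy
    return (err_before - err_after) / err_before


print(f"{relative_error_reduction(0.67, 0.90):.1%}")  # prints 69.7%
```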

I hope this will be of value to those working with tabular data and machine learning. Check out the blog and let me know your thoughts in the comments!