NCAA Basketball Predictions

I was curious how one might go about using deep learning to predict sports outcomes and events. I know Kaggle runs a large competition each year on the NCAA March Madness tournament, looking at likelihoods of winning. I was thinking of taking it a step further and trying to find the lower seeds that have a chance of upsetting a higher seed, based on yearly head-to-head data and old tournament information. I am stuck on the format of the model. The ideal input would be just the two team IDs, with an output of a score spread, but that doesn't make as much sense. Shouldn't the input be all the data fields except for the output spread?

Am I on the right track or way off with what the input should be?

Second, is this actually a transfer learning problem? The model would be trained on all historic data (pre-tournament data) and tested on tournament data, but then you wouldn't know the additional features that the historic games had.

Any guidance is much appreciated as we roll into not only the NCAA tournament but also other sports playoffs and the MLB season. I love sports and love ML, not sure why this one has me in a mental pretzel.

I cannot help much with what features to engineer for a model (the inputs), but looking at the Kaggle notebooks from previous years should give you a good idea of what seems to work and what does not.

Also, this appears to be a pure tabular data problem. This means that we generally do not do pre-training and transfer learning. You would use the normal train/validate/test split for the data you have, taking time order into account if you think the time effect is important. This is what I would do first to understand the data and to see what can be a good baseline for your model. I would use a random forest (RF) or maybe XGBoost first, then deep learning.
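To make the tabular framing a bit more concrete, here is a minimal sketch of one way to build the inputs and do a time-aware split. Everything in it is an assumption for illustration: the `Game` record, the `TEAM_STATS` lookup, the stat names, and the numbers are all made up. The idea is that each row uses only information known before tip-off (here, differences in season averages), and the spread is the target.

```python
from collections import namedtuple

# Hypothetical record: spread = team_a points minus team_b points.
Game = namedtuple("Game", "season team_a team_b spread")

# Hypothetical pre-game season averages per (season, team); made-up numbers.
TEAM_STATS = {
    (2021, "A"): {"ppg": 75.0, "opp_ppg": 68.0},
    (2021, "B"): {"ppg": 70.0, "opp_ppg": 72.0},
    (2022, "A"): {"ppg": 78.0, "opp_ppg": 66.0},
    (2022, "B"): {"ppg": 71.0, "opp_ppg": 70.0},
    (2023, "A"): {"ppg": 74.0, "opp_ppg": 69.0},
    (2023, "B"): {"ppg": 73.0, "opp_ppg": 71.0},
}

GAMES = [
    Game(2021, "A", "B", 6),
    Game(2022, "A", "B", 9),
    Game(2023, "A", "B", -2),
]


def matchup_features(season, team_a, team_b, stats=TEAM_STATS):
    """Model input: stat differences (team_a minus team_b), in sorted key order."""
    a = stats[(season, team_a)]
    b = stats[(season, team_b)]
    return [a[key] - b[key] for key in sorted(a)]


def split_by_season(games, valid_season, test_season):
    """Train on earlier seasons, validate and test on later ones (no shuffling)."""
    train = [g for g in games if g.season < valid_season]
    valid = [g for g in games if g.season == valid_season]
    test = [g for g in games if g.season == test_season]
    return train, valid, test


train, valid, test = split_by_season(GAMES, valid_season=2022, test_season=2023)
X_train = [matchup_features(g.season, g.team_a, g.team_b) for g in train]
y_train = [g.spread for g in train]
```

The resulting `X_train` and `y_train` lists could then be handed to whatever baseline model you prefer, such as a random forest regressor.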

Then you can try more complex modeling. Ideas:

  • using deep learning with tabular data + some NLP on text information collected about the games
  • using deep learning to model the probabilities of winning (both mean and std) so that you can use Bayesian analysis
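As a rough illustration of what "mean and std of a win probability" can look like, here is a small sketch that skips the deep learning part entirely and uses a plain Beta-Binomial update on a head-to-head record. The function name and the uniform Beta(1, 1) prior are assumptions chosen for the example.

```python
import math


def beta_posterior(wins, losses, prior_a=1.0, prior_b=1.0):
    """Posterior mean and std of a win probability under a Beta prior.

    With a Beta(prior_a, prior_b) prior and a binomial likelihood, the
    posterior is Beta(prior_a + wins, prior_b + losses).
    """
    a = prior_a + wins
    b = prior_b + losses
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical record: team A beat team B in 3 of 4 meetings.
mean, std = beta_posterior(wins=3, losses=1)
print(f"P(A beats B): mean={mean:.3f}, std={std:.3f}")
```

With so few games the std stays large, which is the kind of uncertainty a point estimate alone would hide.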

I hope it helps you a little.

Yes, this does help a little. Instead of running a regression, I was thinking more along the lines of Monte Carlo simulations with deep learning. If the two teams play each other x number of times, with the fuzziness of player-to-player interaction and team-strength-to-team-strength interaction estimated from all the games played in the season, what might the stats of those games come out to be on average?

For example: on average, team A will win and can expect to score roughly this many points, against roughly this many points for team B.

Cannot help much more here, unfortunately. You may want to explore statistical modeling with tools like PyMC3. They use Markov chain Monte Carlo (MCMC) sampling to fit the models, and this may be what you need. There are a lot of tutorials on these tools. Good luck with the project.
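For the "play the game many times and look at the averages" idea, a very simple sketch is below. It is plain forward Monte Carlo with the standard library, not MCMC and not PyMC3. The normal score model, the shared standard deviation of 10 points, and the function name are all assumptions for illustration; in practice the means would come from whatever model you fit.

```python
import random
import statistics


def simulate_matchup(mean_a, mean_b, sd=10.0, n_games=10_000, seed=0):
    """Simulate n_games between two teams with independent normal scores."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    scores_a, scores_b = [], []
    wins_a = 0
    for _ in range(n_games):
        score_a = rng.gauss(mean_a, sd)
        score_b = rng.gauss(mean_b, sd)
        scores_a.append(score_a)
        scores_b.append(score_b)
        if score_a > score_b:
            wins_a += 1
    avg_a = statistics.fmean(scores_a)
    avg_b = statistics.fmean(scores_b)
    return {
        "p_a_wins": wins_a / n_games,
        "avg_a": avg_a,
        "avg_b": avg_b,
        "avg_spread": avg_a - avg_b,
    }


# Hypothetical expected scores for team A and team B.
result = simulate_matchup(mean_a=75.0, mean_b=68.0)
print(result)
```

The output gives the share of simulated games team A wins along with the average score for each side, which is the kind of summary described above.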