Given a dataset of 400 entries and about 500 features about soccer games, teams and players of the 2018/2019 season, is it possible to build a powerful DL model to predict upcoming scores of the end of the season?
Are you trying to predict the total game point for each team by the end of the season? Or score for each match in the fixture?
There are 30 rounds in 2018/2019, 26 rounds have been already played. The goal is to predict all scores of the remaining 4 rounds.
IMHO I think asking whether it is possible is kinda pointless, it’s more fruitful if we discuss how to tackle the problem instead.
If it were me, I would try to go after some more stable statistics first, like number of shots on goal. The score would then be a derivation of these kind of metrics, factoring in qualitative stuffs like morale/determination/etc. and unobservable things like luck.
I tried solving this problem using Deep Learning. I used a feed forward neural network to do that. The input was the stats of the team, players, seasons, etc… up to 24 features. The output was the number score. But I could not get better than 50% on the validation set. This problem is tricky because there are no pre-trained model to do transfer learning. Also, the training examples set is small. I wanted to use GANs to generate more training examples but was not sure it could work as nobody has done that before. My question is whether Deep Learning is well suited for this problem or not. As I was able to achieve 50% too with a trivial SVM.
For a high dimensionality problem like this, perhaps try the tree approach first so that you have a rough idea of feature importance.
You might also want to pay attention to the loss function. Rather than 0-1 loss, how about incorporating the odds from betting house and try to beat it?
I’ve been applying fastai tabular models to different datasets. In my little experience, on this small datasets it underperform regular ML. May be your output could be win/tie/lose ? I think it would be more feasible.
I was thinking about implementing some NLP to analyze tweets and see what the majority of people are saying about the games to make the prediction better. Also, since I have the time, I was thinking about modeling this problem as a time series forecasting using RNN. Thank you for your answer.
Thank you for your answer. The model gets 3 points if you predict the right outcome and an extra 3 points if you predict the score accurately.