Predicting Bike counts in Munich (Project Chapter 9 Tabular Modelling Deep Dive)

To get some practical experience, I picked a project of my own to practice the concepts introduced in Chapter 9 “Tabular Modelling Deep Dive” of the fastai_v2 course.

I tried to predict bike counts in the city of Munich based on weather information and the date.

You can see the notebook here: https://gist.github.com/svendroid/015f2e807ff1f436b69d54bfe9687ee4

I thought it might be helpful for others. I am also not sure whether the results are really useful; maybe someone else has done a similar project and we could compare our solutions.

I wrote down my open questions and thoughts in the following conclusion:

I tried two approaches, a RandomForest and a neural net, and both seem to work in general. But I still have no real feeling for whether the models are helpful and what the RMSE (root mean squared error) says about the quality of my model. Are these good models that would be helpful?

Should I use another metric? Are there other metrics besides RMSE that would make it easier for me to grasp the performance of my model? Should I maybe use a relative metric, e.g. a percentage metric? Or would mean absolute error (MAE) help me? The OOB score is relative, from 0 (bad) to 1 (good). It was around 0.88. Does that mean it is an okay model?
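To make the metric question more concrete, here is a minimal sketch that computes RMSE, MAE, a percentage error (MAPE), and R² side by side and compares them against a naive "always predict the mean" baseline. The counts are made-up numbers, not values from the notebook, and the helper names are only for illustration. As far as I know, the `oob_score_` of scikit-learn's `RandomForestRegressor` is an R² value by default, so the R² line is probably the closest match to the 0.88 mentioned above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (bikes)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: the average miss in bikes, less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any true value is 0."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Share of the target's variance explained; 1 is perfect, 0 matches the mean baseline."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up daily counts and predictions, purely for illustration.
y_true = [1000, 2000, 3000, 4000]
y_pred = [1100, 1900, 3200, 3800]

# Naive baseline: always predict the mean count.
baseline = [sum(y_true) / len(y_true)] * len(y_true)

print(f"RMSE model:    {rmse(y_true, y_pred):.1f} bikes")
print(f"RMSE baseline: {rmse(y_true, baseline):.1f} bikes")
print(f"MAE:           {mae(y_true, y_pred):.1f} bikes")
print(f"MAPE:          {mape(y_true, y_pred):.1f} %")
print(f"R^2:           {r2(y_true, y_pred):.2f}")
```

An error figure on its own is hard to judge; putting it next to a simple baseline, or expressing it as a percentage of a typical daily count, usually makes it easier to interpret.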

Or does the real value of the predictions only become visible once I have a real use case in which they are used? E.g. is a prediction that is 100 below the real value helpful for traffic planners?

RandomForest

  • Results change on every rerun. Therefore: how comparable are the results across the different changes? Should I fix a seed for the RandomForest, and is this possible?
  • What does it mean if the predictions get worse with more trees?
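On the seed question: scikit-learn's `RandomForestRegressor` accepts a `random_state` argument, and fixing it should make reruns reproducible. The sketch below uses synthetic stand-in data (not the bike data) and a hypothetical helper `fit_forest`. It also prints the OOB score for a few seeds per tree count, which is one way to check whether "worse with more trees" is a real effect or just run-to-run noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; in the real project X would be the weather/date features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

def fit_forest(n_trees, seed):
    """Fit a forest with a fixed seed so the bootstrap samples are reproducible."""
    model = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=seed)
    return model.fit(X, y)

# Same seed, same data: the two forests should give identical predictions.
a = fit_forest(50, seed=42)
b = fit_forest(50, seed=42)
same = np.allclose(a.predict(X), b.predict(X))
print("identical predictions with the same seed:", same)

# Spread of the OOB score across seeds, for two forest sizes.
for n_trees in (30, 150):
    scores = [fit_forest(n_trees, seed=s).oob_score_ for s in range(3)]
    print(n_trees, "trees:", np.round(scores, 3))
```

If the difference between two tree counts is smaller than the spread across seeds, it is probably noise and not a sign that more trees hurt.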

Future Steps

  • I really like the waterfall chart for the RandomForest, which shows how much influence the different features have. Is there something similar for a neural net, e.g. like the visualisation of CNN layers for image classification?
  • Add Bavarian holidays to the data
  • Use not only one counting station but the total of all six
  • Use datasets where weekdays and weekends are separated. Could I train a “Should I bike or take the bus?” model on this?
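Regarding feature influence for a neural net: one model-agnostic option is permutation importance, which only needs a predict function and therefore works for a neural net as well as for a RandomForest. It is not the same thing as the waterfall chart (that one explains a single prediction, this one gives a global ranking), so treat the following as a rough sketch. `toy_predict` is a made-up stand-in for a trained model, and the feature names and coefficients are invented.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, rows, y_true, n_features, seed=0):
    """Increase in RMSE when one feature column is shuffled; bigger means more important."""
    rnd = random.Random(seed)
    base = rmse(y_true, predict(rows))
    increases = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rnd.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        increases.append(rmse(y_true, predict(shuffled)) - base)
    return increases

# Hypothetical stand-in for a trained model: counts rise with temperature,
# fall with rain, and ignore the third (pure noise) feature.
def toy_predict(rows):
    return [2000 + 50 * temp - 300 * rain for temp, rain, noise in rows]

rnd = random.Random(1)
rows = [[rnd.uniform(-5, 30), rnd.uniform(0, 5), rnd.random()] for _ in range(200)]
y = toy_predict(rows)  # targets generated by the toy model itself, for illustration

imp = permutation_importance(toy_predict, rows, y, n_features=3)
for name, value in zip(["temperature", "rain", "noise"], imp):
    print(f"{name:12s} RMSE increase: {value:.1f}")
```

For per-prediction explanations closer to the waterfall chart, libraries such as SHAP are meant to work with neural nets too, but I have not checked how well that fits fastai tabular models.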

Would be really happy about feedback! :slight_smile: Thanks