I have completed the ML course up to Lesson 3 and have been trying a couple of competitions on Kaggle to learn and understand the code better. One of the competitions is the TFI Restaurant Prediction:
As I am up to Lesson 3, I have extracted features from the date and converted the text data to categories. The training set is very small, so I am using the OOB score and the RMSE on the training data as a starting point.
The problem I am facing is that both the OOB score and the RMSE fluctuate widely across runs. For example:
1st run: -0.04 (OOB)
2nd run: -0.15
3rd run: -0.01
What can I do in such scenarios?
I have had similar issues when trying the ML course techniques on the Titanic dataset.
I think the reason is at least partially in the dataset itself:
The training set consists of only 137 examples. Assuming you set aside 20% for validation, the validation set has only about 27 examples, so getting even one of them badly wrong has a huge influence on the score. That is not statistically sound, and it could explain the large fluctuations between runs, since each run randomly selects different validation examples. Maybe try your exact same notebook on a larger dataset and see if the problems persist…
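To make the split-size argument concrete, here is a rough simulation using only the standard library. It assumes made-up Gaussian residuals in place of a real model's errors (not the competition data), and the function name `val_rmse_spread` is just a placeholder. It repeatedly draws a 20% validation split and measures how much the RMSE moves from split to split.

```python
import random
import statistics

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def val_rmse_spread(n_total, val_frac=0.2, n_splits=200, seed=0):
    """Std. dev. of validation RMSE across random splits of one fixed dataset."""
    rng = random.Random(seed)
    # Made-up per-row residuals standing in for a model's errors;
    # this is an assumption for illustration, NOT the competition data.
    residuals = [rng.gauss(0, 1) for _ in range(n_total)]
    n_val = max(1, int(n_total * val_frac))
    # Each "run" picks a different random validation subset.
    scores = [rmse(rng.sample(residuals, n_val)) for _ in range(n_splits)]
    return statistics.stdev(scores)

small = val_rmse_spread(137)    # about 27 validation rows
large = val_rmse_spread(13700)  # about 2740 validation rows
print(f"RMSE std across splits, 137 rows:   {small:.3f}")
print(f"RMSE std across splits, 13700 rows: {large:.3f}")
```

The exact numbers depend on the assumed residuals, but the spread for the small set should come out several times larger, which is the effect described above.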
I used all 137 rows to avoid any splits and relied on the OOB score to make sense of the data.
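One way to check how much of the OOB fluctuation is just seed-to-seed noise is to refit the same forest with several seeds and compare the spread for a small and a large number of trees. The sketch below assumes scikit-learn's `RandomForestRegressor` and uses a synthetic 137-row dataset as a stand-in, so the scores will not match the ones above; the helper `oob_scores` and the tree counts are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a tiny training set (137 rows, 10 features);
# the real competition data is NOT used here.
rng = np.random.default_rng(0)
X = rng.normal(size=(137, 10))
y = X[:, 0] + rng.normal(scale=2.0, size=137)  # weak signal, lots of noise

def oob_scores(n_trees, seeds=range(5)):
    """OOB R^2 of the same forest refit with different random seeds."""
    scores = []
    for seed in seeds:
        m = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                                  random_state=seed)
        m.fit(X, y)
        scores.append(m.oob_score_)
    return np.array(scores)

few = oob_scores(40)
many = oob_scores(300)
print("40 trees: ", few.round(3), "spread", np.ptp(few).round(3))
print("300 trees:", many.round(3), "spread", np.ptp(many).round(3))
```

With more trees the seed-to-seed spread usually narrows, and fixing `random_state` makes a single run repeatable. Neither step adds information to a 137-row dataset, so a score near zero can still simply mean there is little signal.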
After drawing some plots and graphs of the data, I found some really interesting things going on:
1. The field description says:
“Type : Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile”
But there are no instances of MB in the train data.
2. Out of 137 rows, 50 belong to a single city, Istanbul, which is ~36% of the data.
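Checks like these two can be scripted so they are easy to repeat. This is a minimal sketch on hypothetical toy rows, not the real training data: `Type` comes from the field description, while the `City` column name and the city values are assumptions.

```python
from collections import Counter

# Hypothetical toy rows standing in for the training data.
# "Type" is from the field description; the "City" key is an assumed name.
rows = [
    {"City": "Istanbul", "Type": "FC"},
    {"City": "Istanbul", "Type": "IL"},
    {"City": "Ankara", "Type": "IL"},
    {"City": "Izmir", "Type": "DT"},
    {"City": "Istanbul", "Type": "FC"},
]

declared_types = {"FC", "IL", "DT", "MB"}  # categories listed in the field description

# Categories that are declared but never observed in the rows.
type_counts = Counter(r["Type"] for r in rows)
missing_types = declared_types - set(type_counts)

# Share of rows that belong to the most frequent city.
city_counts = Counter(r["City"] for r in rows)
top_city, top_n = city_counts.most_common(1)[0]
top_share = top_n / len(rows)

print("Types never seen in train:", sorted(missing_types))
print(f"Largest city: {top_city} ({top_share:.0%} of rows)")
```

If the data is already in a pandas DataFrame, `Series.value_counts(normalize=True)` on the relevant column gives the same shares.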
How does someone work with data like this?