This is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! To edit, click on the little pencil icon at the bottom of this post.
<<< Wiki: Lesson 1 | Wiki: Lesson 3 >>>
Lesson resources
Notes from @melissa.fabros:
Intro to ML foundations via Kaggle’s Bluebook for Bulldozers competition
Let's investigate R² and RMSE (root mean square error), aka demystify the math
It's important to review the evaluation criteria for any Kaggle competition. Looking at how Bluebook for Bulldozers is evaluated lets us investigate what RMSE (root mean square error) and R² really mean.
Let’s translate the math notation using Bulldozers!
We have data on bulldozer sales. In a sales ledger, Jeremy wrote: on January 1, Yuka sold one bulldozer; on January 2, she sold four bulldozers; and finally on January 3, Yuka sold five bulldozers.
In math, this ledger can be translated as Y(i) = [1, 4, 5]
Y is the real data you have -- Yuka's actual sales over 3 days.
(i) = the index/position (aka day) where each value appears in your data set.
So, Y(1) = 1 bulldozer, Y(2) = 4 bulldozers, Y(3) = 5 bulldozers
ȳ ("y-bar") = the mean of your data. Here, the mean of Yuka's sales (aka ȳ) is (1 + 4 + 5) / 3 ≈ 3.3 bulldozers per day.
In English, if you knew nothing else about Yuka's business, always guessing about 3.3 bulldozers a day would be the most naive reasonable prediction.
For this data Y(i), the RMSE of that naive guess (predicting ȳ every day) is √(((1 - 3.3)² + (4 - 3.3)² + (5 - 3.3)²) / 3) ≈ 1.7.
And in December, we paid Freddie to make a model that predicted sales for the same Jan 1-3 period: F(i) = [3, 1, 10]
Here, F is predicted bulldozer sales over three days --> F(i) = [3, 1, 10]
There is an F(i) predicted value corresponding to every real data point Y(i).
In English, Freddie predicted Yuka might sell 3 bulldozers on the first day, then 1 on the second, and finally 10.
Is Freddie's model smarter than if we had just predicted that Yuka would sell the mean, about 3.3 bulldozers, every day? Freddie's RMSE is √(((1 - 3)² + (4 - 1)² + (5 - 10)²) / 3) ≈ 3.6, about twice the naive guess's 1.7. Welp, nope… oops.
RMSE is one way to keep score of your model's success, and R² tells you how that score compares to a baseline benchmark for a dataset: how good is your custom model versus the most naive, dumb model (always predicting the average of all known data)? R² = 1 means perfect predictions, R² = 0 means you're doing no better than the mean, and R² < 0 means you're doing worse (Freddie's R² here is about -3.4). You want your model to score at least as well as that naive guess; ideally you can do better!
It's less important to learn/memorize the formula; it's more important to understand what's happening conceptually and to be able to explain the intuition behind the math notation.
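To make the arithmetic concrete, here's a minimal Python sketch (plain numpy; not from the lecture or the kernel mentioned below) that reproduces the Yuka/Freddie numbers above:

```python
import numpy as np

y = np.array([1, 4, 5])    # Yuka's actual sales, Y(i)
f = np.array([3, 1, 10])   # Freddie's predictions, F(i)

y_bar = y.mean()                                  # ~3.33, the naive "always guess the mean" prediction
rmse_naive = np.sqrt(((y - y_bar) ** 2).mean())   # ~1.70
rmse_freddie = np.sqrt(((y - f) ** 2).mean())     # ~3.56

# R^2 compares the model's squared error to the naive mean model's squared error
r2 = 1 - ((y - f) ** 2).sum() / ((y - y_bar) ** 2).sum()   # ~ -3.4, i.e. worse than guessing the mean

print(rmse_naive, rmse_freddie, r2)
```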
I've (@mrgold) uploaded a Kaggle kernel, easily-understand-r-2-aka-rmse, with @melissa.fabros's Freddie example above
Introduction to validation sets
Creating your validation set is one of the most important things you can do in machine learning practice. Very often people in industry say they made an ML model and it worked in research conditions, then the model failed miserably in production (aka real life with new, unseen data), because they trained it on the entirety of their data. The model overfits (or "memorizes") the current data and doesn't generalize to new data. Portioning off a validation set from your training data lets you understand how your model will work in the real world with novel data.
Kaggle does a good job of recreating real-life data conditions.
For example, the Bulldozers training data covers an earlier timespan, and the test set (the data against which your model is scored) covers later dates. Your score on Kaggle's public leaderboard is based on this public test set. But Kaggle often has yet another private dataset to assess your model. Many Kaggle competitors overfit their models to the public test set and sit at the top of the leaderboard until the private-data assessment, while other competitors jump up in the rankings because their models performed better on the new data.
You'll want to create a validation set that recreates the conditions of Kaggle's test set from the data given to you as training data. Save this validation data for your own assessments of the model.
Q: What is a validation set?
You hold out data from the training data (the data where you know the answers to the problem statement) and never look at it until after you build your model and it's ready to be evaluated. Never use this data for any training of the model. For the model it will be novel information it has never seen before, so you can look at its predictions while also knowing the actual results, and compare each prediction against the actual data point to assess model accuracy. (Going back to Yuka and Jeremy's bulldozer business, Freddie predicted 10 bulldozer sales on day 3, when actual sales were 5.)
You can use sklearn tools to portion out subsamples of the training sets for validation or dev sets.
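For instance, here's a minimal sketch of both kinds of split (the toy DataFrame, column names, and split sizes are illustrative assumptions, not the course's actual code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a training DataFrame that is sorted by sale date.
df = pd.DataFrame({'saledate': pd.date_range('2011-01-01', periods=100),
                   'SalePrice': range(100)})

# Time-ordered data (like Bulldozers): hold out the most recent rows,
# so the validation set mimics Kaggle's later-dates test set.
n_valid = 20
train_df, valid_df = df[:-n_valid], df[-n_valid:]

# Data with no time ordering: a random split is fine.
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)
```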
How to develop models quickly
You want to be fast: have a model ingest data and quickly give you predictions, so you can eyeball the accuracy of the model and start to tune it or fix what's broken.
- Really good hardware: you want to buy the fastest CPUs and/or GPUs you can afford. Fast hardware, such as NVMe drives or SSDs and powerful GPUs, lets you train and evaluate models against large or complex datasets quickly. The sooner you know something is wrong, the sooner you can fix it. It's hard to iterate over models and the data-ingestion pipeline if you have to wait hours before you get a result. Computer science people often optimize to reduce "expensive" hardware calculations; data people should optimize for speed of development. Modern computing hardware makes what is often considered "expensive" processing negligible for the data scientist.
- Create another "dev_set" from the training data, small enough that any step of your pipeline runs in around 10 seconds, so you know that the part of the pipeline you're building works (see the sketch after this list). You need to iterate and tweak the pipeline and model quickly. If you want to test the whole data pipeline from ingestion to validation, use a larger subset of training data, but you don't have to use all the training data just to know that your end-to-end pipeline works.
- Once you're done iterating, you can train the model on all the training data overnight. If you have access to more than one GPU, you can train one model on one GPU while you iterate on another model (maybe one with a different architecture, or the same model with different hyperparameters).
- In the end, you'll have four sets of data: dev_set, validation_set, train, and test (aka the evaluation set). The dev and validation sets are portioned out from the train dataset. Once you're done iterating, you can recombine the dev_set with the train dataset for overall model training. The validation set never gets used for model training, only for model evaluation.
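Here's a minimal sketch of carving out a dev_set for those ~10-second iterations (the toy data, sizes, and the RandomForestRegressor baseline are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the full, already-numericalized training data.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(100_000, 10)),
                        columns=[f'x{i}' for i in range(10)])
train_df['y'] = train_df.sum(axis=1) + rng.normal(size=len(train_df))

# Small random dev_set: big enough to be meaningful, small enough that a fit
# finishes in seconds while you debug the pipeline.
dev_df = train_df.sample(n=5_000, random_state=42)

m = RandomForestRegressor(n_estimators=10, n_jobs=-1)
m.fit(dev_df.drop(columns='y'), dev_df['y'])

# Once the pipeline works end to end, refit on all of train_df (overnight if needed).
```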
This week, try to experiment. Some ideas:
- Explore different datasets,
- try to write your own functions,
- try to use different libraries,
- use some different plot styles or plot libraries