Wiki/lesson thread: Lesson 2

This is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! To edit, click on the little pencil icon at the bottom of this post. Here’s a pic of what to look for:

<<< Wiki: Lesson 1 | Wiki: Lesson 3 >>>

Lesson resources


Notes from @melissa.fabros:

Intro to ML foundations via Kaggle’s Bluebook for Bulldozers competition

Let's investigate R² and RMSE (root mean squared error), aka demystify the math

It's important to review the evaluation criteria for any Kaggle competition. Evaluating Bluebook for Bulldozers lets us investigate what RMSE (root mean squared error) really means.

Let’s translate the math notation using Bulldozers!
We have data on bulldozer sales. In a sales ledger, Jeremy wrote: on January 1, Yuka sold one bulldozer; on January 2, she sold four bulldozers; and finally, on January 3, Yuka sold five bulldozers.

In math, this ledger can be translated as Y(i) = [1, 4, 5]
Y is the real data you have -- Yuka's actual sales over 3 days.
(i) = the index/position (aka the day) where each value appears in your dataset.
So, Y(1) = 1 bulldozer, Y(2) = 4 bulldozers, Y(3) = 5 bulldozers.

ȳ = the mean of your dataset. Here, the mean of Yuka's sales (ȳ) is (1 + 4 + 5) / 3 ≈ 3.3. A crude measure of the spread is the range (highest value minus lowest value, or 5 - 1 = 4), so every value in the dataset sits within a span of 4.
In English, guessing that Yuka sells about 3 or 4 bulldozers on any given day is a pretty reasonable naive guess.
For this data Y(i), simply predicting the mean every day gives an RMSE of about 1.7 -- that's the baseline any model should beat.

And in December, we paid Freddie to build a model that predicted sales for the same Jan 1-3 period: F(i) = [3, 1, 10]
Here, F is the predicted bulldozer sales over three days --> F(i) = [3, 1, 10]
There is a predicted value F(i) corresponding to every real data point Y(i).
In English, Freddie predicted Yuka might sell 3 bulldozers on the first day, then 1 on the second, and finally 10.

Is Freddie's model smarter than if we just predicted that Yuka would sell about 4 bulldozers per day (her mean of ≈3.3, rounded up)? Freddie's errors are 2, -3 and 5 bulldozers, an RMSE of about 3.6 -- worse than the ~1.8 you'd get from the flat guess of 4 -- and the spread (range) of his predictions is 9. Whelp, nope… oops.

RMSE is one way to keep score of your model's success. You want your model to score at least as well as a good guess. Basically, RMSE gives you a baseline benchmark for a dataset: how good is your custom model compared to the most naive model possible (always predicting the average of the known data)? Ideally you can do better!
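A minimal sketch of those numbers in plain numpy (the variable names are just for this example):

import numpy as np

y = np.array([1, 4, 5])    # Yuka's actual sales, Y(i)
f = np.array([3, 1, 10])   # Freddie's predictions, F(i)

def rmse(pred, actual):
    """Root mean squared error between predictions and actuals."""
    return np.sqrt(((pred - actual) ** 2).mean())

baseline = np.full(len(y), y.mean())   # always predict the mean, ~3.33

print(rmse(baseline, y))   # ~1.70 -- the naive baseline
print(rmse(f, y))          # ~3.56 -- Freddie does worse than the baseline

# R^2 compares the model's squared error to the baseline's squared error;
# a negative value means the model is worse than just predicting the mean.
r2 = 1 - ((f - y) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(r2)                  # ~ -3.38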

It's less important to learn/memorize the formula; it's more important to understand what's happening conceptually and to be able to explain the intuition behind the math notation.

I (@mrgold) have uploaded a Kaggle kernel, easily-understand-r-2-aka-rmse, with @melissa.fabros's Freddie example above.

Introduction to validation sets

Creating your validation set is one of the most important things you can do in machine learning practice. Very often people in industry say they built an ML model and it worked in research conditions, but then the model failed miserably in production (aka real life with new, unseen data) because they trained it on the entirety of their data. The model overfits (or "memorizes") the current data and doesn't generalize to new data. Portioning off a validation set from your training data lets you understand how your model will perform in the real world on novel data.

Kaggle does something well in recreating real-life data conditions.
For example, the Bulldozers training data covers one time span, and the test set -- the data your model is scored against -- covers later dates. Your score on Kaggle's public leaderboard is based on this public test set. But Kaggle often has yet another private dataset to assess your model. Many Kaggle competitors overfit their models to the public test set and sit at the top of the leaderboard until the private assessment, at which point other competitors jump up in the rankings because their models performed better on the new data.

You'll want to create a validation set that recreates the conditions of Kaggle's test set from the data given to you as training data. You save this validation data for your own assessments of the model.

Q: What is a validation set?
You hold out data from the training data (the data where you know the answers to the problem statement) and never look at it until your model is built and ready to be evaluated. Never use this data for any training of the model. To the model it is novel information it has never seen before, so you can look at its predictions while also knowing the actual results, and compare each prediction against the actual data point to assess model accuracy. (Going back to Yuka and Jeremy's bulldozer business: Freddie predicted 10 bulldozer sales on day 3, when actual sales were 5.)

You can use sklearn tools to portion out subsamples of the training set for validation or dev sets.
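For example, here's a rough sketch on toy data (the column names and sizes are made up); for time-ordered data like Bulldozers you generally want the second, date-based style of split rather than a random one:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a training table: a date column, one feature, and a target.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'saledate': pd.date_range('2006-01-01', periods=1000, freq='D'),
    'feature': rng.normal(size=1000),
    'SalePrice': rng.normal(size=1000),
})

# Random split: fine when rows are independent of time.
train_rand, valid_rand = train_test_split(df, test_size=0.2, random_state=42)

# Time-based split: mimic the test set by holding out the most recent rows.
df = df.sort_values('saledate')
n_valid = 200
train_df, valid_df = df[:-n_valid], df[-n_valid:]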

How to develop models quickly

You want to be fast: have a model ingest data and quickly give you predictions, so you can eyeball the model's accuracy and start to tune it or fix what's broken.

  • Really good hardware: buy the fastest CPUs and/or GPUs you can afford. Fast hardware, such as NVMe drives or SSDs and powerful GPUs, lets you train and evaluate models against large or complex datasets quickly. The sooner you know something is wrong, the sooner you can fix it. It's hard to iterate over models and the data-ingestion pipeline if you have to wait hours before you get a result. Computer science people often optimize to reduce "expensive" hardware computations; data people should optimize for speed of development. Modern computing hardware makes what is often considered "expensive" processing negligible for the data scientist.
  • Create another "dev_set" from the training data, small enough that any step you run on it completes in around 10 seconds, so you know that the part of the pipeline you're building works (see the sketch after this list). You need to iterate and tweak the pipeline and model quickly. If you want to test the whole data pipeline from data ingestion to validation, use a larger subset of the training data, but you don't have to use all the training data just to know that your end-to-end pipeline works.
  • Once you're done iterating, you can train the model on all the training data overnight. If you have access to more than one GPU, you can train one model on one GPU while you iterate on another model (maybe one with a different architecture, or the same model with different hyperparameters).
  • In the end you'll have four sets of data: dev_set, validation_set, train, and test (aka the evaluation set). The dev and validation sets are portioned out from the train dataset. Once you're done iterating, you can recombine the dev_set with the train dataset for overall model training. The validation set never gets used for model training, only for model evaluation.
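A minimal sketch of the dev-set idea (sizes and names here are just illustrative):

import numpy as np
import pandas as pd

# Toy frame standing in for the full training data.
train_df = pd.DataFrame({'x': np.arange(100_000), 'y': np.arange(100_000) * 2.0})

def make_dev_set(df, n=5000, seed=42):
    """Grab a small random subsample for fast iteration.
    Pick n so that each experiment runs in roughly 10 seconds on your machine."""
    return df.sample(n=min(n, len(df)), random_state=seed)

dev_df = make_dev_set(train_df, n=5000)   # iterate on this
# ...once the pipeline works end to end, retrain on the full train_df (e.g. overnight).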

This week, try to experiment. Some ideas:

  • Explore different datasets,
  • try to write your own functions,
  • try to use different libraries,
  • use different plot styles or plotting libraries.
8 Likes

In class, you mentioned that we could use a subset of data for training in order to speed things up. The code looks like this:

df_trn, y_trn = proc_df(df_raw, 'SalePrice', subset=30000)  # take a 30,000-row subset and split off the target
X_train, _ = split_vals(df_trn, 20000)                      # first 20,000 rows for training
y_train, _ = split_vals(y_trn, 20000)

In this way, since proc_df() randomly samples 30000 rows from the original dataset, will the training set overlap with our validation set?

Yes, that’s why we call split_vals afterwards. Since we pull out the first 2/3 of the data, it shouldn’t overlap with the validation set.
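For context, split_vals in the lesson notebook is just a positional slice, roughly like this (from memory, so double-check against the notebook):

def split_vals(a, n):
    # Everything before position n becomes the training portion, the rest the validation portion.
    return a[:n].copy(), a[n:].copy()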

Hmmm… although on reflection I suspect that proc_df is also randomly shuffling the data!.. Let me look into this - may need to fix a little problem.

OK this is all fixed now (I think!) Do a git pull and check out the ‘Speeding Things Up’ section of lesson 1 to see it in action.

1 Like

Thanks :grinning:
Now, proc_df() extracts the first 30000 rows instead of randomly shuffling. I think it looks good!

I have trouble drawing the tree with draw_tree(m.estimators_[0], df_trn, precision=3).

After having run the entire lesson1 notebook, I get the following error when drawing the tree:

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

Am I missing something?

Sounds like you haven’t installed graphviz.

https://www.graphviz.org/

2 Likes

@fastai1 - I’m running fastai on a local Windows installation and had the same error as you. If you’re also on Windows, try this:

pip uninstall graphviz
conda install python-graphviz

This solution was recommended on the graphviz github issues page.

9 Likes

I just completed Lesson 2 and I am curious about the subsampling section of the discussion. Given that bootstrapping already subsamples the data, why do we need a separate subsampling step? Especially given that Jeremy clarifies that bootstrapping doesn't work if we use the set_rf_samples function.

My understanding is that subsampling is just another way of getting our model to run faster so we can test it. Is that correct?

Can someone show the calculation for how Freddie's spread becomes 9? F(i) = [3, 1, 10] --> 9

Yes, the primary motivation for set_rf_samples() is to reduce training time so that you can iterate quickly when tuning your model. You'd be surprised at how few samples you need to get decent accuracy!
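Roughly, usage looks like this (sketched with random data; set_rf_samples() comes from the old fastai 0.7 library used in this course and monkey-patches scikit-learn internals, so it may not work with newer sklearn versions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples, reset_rf_samples  # fastai 0.7 ML-course helpers

# Random data standing in for the processed Bulldozers frame.
X = np.random.rand(100_000, 10)
y = X @ np.arange(10) + np.random.rand(100_000)

set_rf_samples(20_000)                  # each tree is now grown on a 20k-row sample
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X, y)                             # much faster than growing every tree on all 100k rows

reset_rf_samples()                      # restore full bootstrap samples before the final training run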

@mrgold Strictly speaking it's the range (spread) of Freddie's predictions, not the variance: highest value (10) minus lowest value (1), so 10 - 1 = 9. All of Freddie's predicted values fall between 1 and 10, so if you had to guess his prediction for any given day, any number between 1 and 10 would be plausible.
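In code, and to keep the terms straight (the "9" here is the range of Freddie's predictions, which is not the same thing as the statistical variance):

import numpy as np

f = np.array([3, 1, 10])      # Freddie's predictions
print(f.max() - f.min())      # 9     -- the range / spread used in the notes above
print(f.var())                # ~14.9 -- the statistical variance, a different quantity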

1 Like

It seems to me that using oob_score with this dataset is not a good metric for hyper-parameter tuning. oob_score is based on a random sample, but this data is time-based. The score against the validation set should be used for tuning, particularly if automating via grid search. Am I right?

1 Like

(Making a good Kaggle test set is kind of a different beast, and I will ignore that here to focus on the general case.) Dealing with time series is always tricky, but the out-of-bag score should be fine, depending on how you create the training set for your model. Make sure to sort by date, then maybe grab the last 20% for validation, and train your model so that the most recent dates get more weight than observations from earlier times (that's the sample_weight argument to the fit() method). If the most recent data is similar to the data beyond your 80% cutoff, the out-of-bag score should be reasonable.

It's useful to use the out-of-bag score because it's much faster than doing cross-validation, and it comes for free with the fit.

All that said, you are right: extrapolation with random forests is not good. They're going to predict that the future looks exactly like the most recent data in the training set. If the validation set is much different, you would in fact see the out-of-bag score not matching the validation score.

One can consider adding a feature to a random forest model that gives it a time-sensitive hook, or you can move to a generalized linear model, etc.
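Here's a rough, self-contained sketch of that OOB-vs-validation comparison on synthetic, time-ordered data (everything here is made up purely to illustrate the idea):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rmse(pred, actual):
    return np.sqrt(((pred - actual) ** 2).mean())

# Synthetic time-ordered data: the target drifts upward over time, so "future"
# rows look different from the training period (a crude stand-in for Bulldozers).
n = 20_000
t = np.arange(n) / n
X = np.column_stack([np.random.rand(n), t])
y = 3 * X[:, 0] + 2 * t + np.random.rand(n) * 0.1

# Hold out the most recent 20% as the validation set.
cut = int(n * 0.8)
X_train, X_valid = X[:cut], X[cut:]
y_train, y_valid = y[:cut], y[cut:]

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)

print('OOB R^2:        ', m.oob_score_)                        # random rows from the training period
print('Validation RMSE:', rmse(m.predict(X_valid), y_valid))   # genuinely "future" rows

# If the OOB score looks great but the validation error is much worse, the model is
# probably failing to extrapolate forward in time rather than simply overfitting.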

1 Like

Not sure that’s entirely true. OOB is much less ideal than a time-based validation set. In the course for the bulldozers dataset we always print the score on the validation set, for this reason. OOB is really just useful for when you have a real shortage of data, or when you are explicitly looking to figure out if your model accuracy issues are due to extrapolation problems.

5 Likes

OK, thanks for the correction. Makes sense. OOB will underestimate the error you'd find with a true "future time" validation set. I like that comparison idea: compare OOB to validation error to highlight extrapolation weaknesses. Got it, thanks.

Isn’t OOB also useful though when you don’t have a time-based set (even with lots of data)?

For sure

When I download the Bulldozers dataset from Kaggle, it isn't sorted by date, and when we create the validation set the data isn't differentiated by date. So are the OOB score and the validation set any good?

2 Likes

Yup you’ll need to sort it.
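Something like this, assuming you've downloaded Train.csv from the competition page (the path and the 12,000-row validation size follow the lesson notebook's convention, so adjust as needed):

import pandas as pd

df_raw = pd.read_csv('Train.csv', low_memory=False, parse_dates=['saledate'])
df_raw = df_raw.sort_values('saledate')           # oldest sales first

n_valid = 12000                                   # roughly the size of Kaggle's test set
train_df, valid_df = df_raw[:-n_valid], df_raw[-n_valid:]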

Can someone please list Kaggle competitions with similar datasets that we can practice on and submit to the leaderboard, since this Bulldozers challenge no longer has a "submit predictions" option?

2 Likes