Another treat! Early access to Intro To Machine Learning videos

(Kaitlin Duck Sherwood) #500

In Lecture 3, @jeremy showed how you could figure out what the relative importance of features is by messing with one feature at a time, shuffling that column to neutralize its effect on the model.

Q: Why shuffle the values instead of replacing all of that column’s values with the mean of the column? (Or would that work just as well?)
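For readers who have not seen the lecture, the shuffling technique described above can be sketched roughly as follows. This is a minimal illustration, not the lecture's code: `model` is a hypothetical stand-in for a trained model, the data is synthetic, and the helper name `permutation_importance` is made up here.

```python
import random

def model(row):
    # Hypothetical stand-in for a trained model: depends strongly on
    # feature 0, weakly on feature 1, and not at all on feature 2.
    return 3.0 * row[0] + 0.5 * row[1]

def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, col, rng):
    """Increase in MSE after shuffling one column and leaving the rest intact."""
    baseline = mse(rows, targets)
    shuffled_col = [r[col] for r in rows]
    rng.shuffle(shuffled_col)
    corrupted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
    return mse(corrupted, targets) - baseline

rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
targets = [model(r) for r in rows]  # noise-free targets, so the baseline MSE is 0

# One score per feature: the bigger the jump in error, the more the model relied on it.
importances = [permutation_importance(rows, targets, c, rng) for c in range(3)]
```

Shuffling the feature the model ignores leaves the error unchanged, while shuffling the feature it leans on most raises the error the most.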

(fabrice) #501

In Lesson 6 @jeremy mentioned: ‘In practice, you’ll care more about the results of your simulation than your predictive model directly’

From his experience on ML adoption in real-world problems, he suggests a combination of predictive, simulation & optimization models to better leverage raw data and produce smarter actionable outcomes.
BTW, great details are in his article ‘Designing great data products’.

Q: How can I run a simulation with optimization using, for example, the feature importance of my RF model?
Do I have to use software like Matlab, or can I do that in Python (maybe scipy)?
If anyone happens to have any practical pointers, that would be greatly appreciated. Many thanks.


Hi fabrice,

I’ll take a stab at answering your question. You can run a simulation with whatever software you feel most comfortable with: Python, Excel, R, SQL Server, etc. Simulations and optimizations are typically domain-specific, so it’s up to you to figure out what will be held constant in your simulations and what will be optimized over. For instance, you may be looking to maximize revenue for your company while holding profit constant, or to maximize profit while holding revenue constant. It depends on you and the goals of your company. There is no explicit connection with feature importance, but you can use feature importance to ask and answer certain questions in your simulation, such as “If I didn’t include this important but controversial variable in my model, what would the effect be on sales over the next five years?”

I hope that helps, but the onus is really on you as the practitioner to define your simulation and optimization. Certain tools may be better equipped than others, but likely many tools could accomplish the job.
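To make the idea concrete, here is a minimal sketch in plain Python of a simulation wrapped in a simple optimization. Everything in it is an assumption for illustration: `predict_demand` is a hypothetical stand-in for a trained predictive model, and `UNIT_COST` and `MIN_PROFIT` are made-up business numbers. It maximizes revenue over a grid of candidate prices while requiring profit to stay above a floor.

```python
def predict_demand(price):
    # Hypothetical stand-in for a trained predictive model (for example the
    # predict method of a random forest): units sold falls as price rises.
    return max(0, 100 - 2 * price)

UNIT_COST = 10     # assumed cost per unit
MIN_PROFIT = 780   # assumed constraint: profit must not drop below this

def simulate(price):
    """Simulate one scenario: what happens if we set this price?"""
    units = predict_demand(price)
    return {
        "price": price,
        "revenue": price * units,
        "profit": (price - UNIT_COST) * units,
    }

# Simulation: evaluate every candidate value of the lever we control.
scenarios = [simulate(p) for p in range(0, 51)]

# Optimization: among scenarios that satisfy the constraint, maximize revenue.
feasible = [s for s in scenarios if s["profit"] >= MIN_PROFIT]
best = max(feasible, key=lambda s: s["revenue"])
```

A grid search like this is enough when there are only a few levers with a small range. For continuous or higher-dimensional levers, a library optimizer such as the ones in `scipy.optimize` could take the place of the grid.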


Kaitlin, it’s not a terrible idea but that would alter the distribution of the variable. The distribution would be a point mass with no variance. Your evaluation metric might be sensitive to the distribution of the predictions and greatly altering the distribution of the predictor variables could significantly alter that distribution. What if it just so happened that the mean value of that feature produces the highest predictions?

This technique is interesting, but I wonder whether it’s biased towards high-cardinality variables. A low-cardinality but important categorical variable wouldn’t actually be that “corrupted” by the shuffling process, because many of the observations would retain the same value as in the original data, so the difference in the evaluation metric seems like it would be muted.
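Both points above can be checked on synthetic data. The sketch below uses made-up columns and hypothetical helper names. It shows that shuffling keeps a column's spread while mean replacement collapses it to zero, and that a binary column keeps roughly half of its rows unchanged after a shuffle whereas a column of unique values keeps almost none.

```python
import random
import statistics

rng = random.Random(0)
n = 1000

def shuffled(values, rng):
    out = list(values)
    rng.shuffle(out)
    return out

def fraction_unchanged(values, rng):
    """Share of rows whose value is the same after shuffling the column."""
    new = shuffled(values, rng)
    return sum(a == b for a, b in zip(values, new)) / len(values)

# Point 1: shuffling preserves the distribution, mean replacement does not.
column = [rng.gauss(50, 10) for _ in range(n)]
mean_filled = [statistics.mean(column)] * n
spread_original = statistics.pstdev(column)
spread_shuffled = statistics.pstdev(shuffled(column, rng))
spread_mean_filled = statistics.pstdev(mean_filled)  # a point mass: no variance

# Point 2: a low-cardinality column is only partly "corrupted" by shuffling.
binary = [rng.randint(0, 1) for _ in range(n)]
unique_ids = list(range(n))
unchanged_binary = fraction_unchanged(binary, rng)
unchanged_unique = fraction_unchanged(unique_ids, rng)
```

This only measures how many rows change. Whether that actually biases the importance scores on a real model is left open here, as it is in the post.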

(fabrice) #504

Thanks Patrick for your prompt reply and your suggestions. Would you by any chance have some references with an introduction or general guidelines for simulation/optimization? It looks like Solver is used quite often, so I will have a look at that too. Thanks again.

(Medhat Omr) #506

I tried to add the link to Lecture 12 video directly into the post but I couldn’t so I quoted it again here
Lecture 12

Thanks a lot @timlee!

(Aditya) #507

Different Test Set and Training Set Distribution

The distribution of the test set is different from that of the training set. I want to sub-sample observations from the training set that closely resemble the test set.

One of the critical assumptions one makes when building a machine learning model for future prediction is that unseen (test) data comes from the same distribution as the training data. In reality, however, this rather simple assumption breaks easily, and the distribution of incoming data changes over time for many reasons.

How can I do this?

(Muralidharan Surendran) #508

@jeremy how do we view the ML Videos on I think they are only available via youtube.

(Abhishek Sharma) #509

Yes, currently they are only available via links on this thread. This is early access.

(Muralidharan Surendran) #510

@alephthoughts thank you.

(Callum) #511

A good candidate for a note when this becomes available on would be the Kaggle data download process. With the official Kaggle CLI the instructions are:

First time:

  1. Accept the competition rules
  2. Go to<username>/account, select ‘Create API token’, scp it to whichever VM in ~/.kaggle
  3. pip install kaggle on your VM

Then to download a competition, cd to data directory on VM and:
kaggle competitions download -c bluebook-for-bulldozers -p .
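If you would rather drive the steps above from a notebook or script, a small Python wrapper might look like the sketch below. The helper names are hypothetical, and the token file name `kaggle.json` is an assumption not stated in the post (it is the name the Kaggle site gives the downloaded token). The sketch only builds the command and locates the token. Running it is left to `subprocess.run`.

```python
from pathlib import Path

def kaggle_token_path(home=None):
    """Where the official CLI is assumed to look for the API token."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"

def download_command(competition, dest="."):
    """Build the same command shown above as an argument list."""
    return ["kaggle", "competitions", "download", "-c", competition, "-p", dest]

cmd = download_command("bluebook-for-bulldozers")

# To actually download (requires the kaggle package and a token in place):
#   import subprocess
#   if kaggle_token_path().exists():
#       subprocess.run(cmd, check=True)
```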

(Reshama Shaikh) #512

that’s available in kaggle download cli

(Callum) #513

This is the official Kaggle one though: I prefer it because you don’t have to pass a username/password, they use proper revokable API tokens. :slight_smile:

(Surya Mohan) #514

Thanks for these amazing videos, @jeremy. I started with the deep learning videos and then moved to these after completing them. For anyone who is just starting out in machine learning, I would recommend finishing these Machine Learning videos first and then moving on to the Deep Learning videos, or trying to complete them concurrently. There is so much amazing and transferable information in these two courses, and doing them together makes the content come together much more clearly.

(Eric Perbos-Brinck) #515

ML1 Video Timelines updated with Lessons 7 & 8.

Note: iirc, the remaining lessons focus on Neural Networks so you may have the choice between continuing or switching to the Deep Learning series of Jeremy & Rachel.

(imbolc) #516

Such a great idea to measure confidence with the std of the trees’ predictions! I’m just wondering whether there is a way to normalize this number so that it scales linearly from 0 to 1?
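As a rough illustration of the idea, the sketch below computes the standard deviation of per-tree predictions for each observation and then rescales it with min-max scaling. The prediction values are made up, and min-max scaling is just one possible answer to the question, not something suggested in the course.

```python
import statistics

# Hypothetical per-tree predictions: one row per observation, one value per tree.
per_tree_preds = [
    [10.0, 10.0, 10.0, 10.0],   # trees agree completely
    [9.0, 11.0, 10.0, 10.0],    # mild disagreement
    [6.0, 14.0, 8.0, 12.0],     # strong disagreement
]

# Population standard deviation across trees for each observation.
stds = [statistics.pstdev(row) for row in per_tree_preds]

# Min-max scaling to [0, 1]; fall back to zeros if every std is identical.
lo, hi = min(stds), max(stds)
scaled = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in stds]
```

Note that the scaled value is relative to the observations in this batch: 0 means "the most agreed-upon prediction here" and 1 means "the least", so scores from different datasets are not directly comparable.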

(Eric Perbos-Brinck) #517

ML1 Video Timelines updated with Lesson 9.


Has anyone else hit an error following along with ml1/lesson5-nlp.ipynb ?
When it ran texts_labels_from_folders, it errored on a text file (11325_4.txt) with

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 194: character maps to <undefined>

I changed my texts_labels_from_folders() to open it as encoding utf-8 which fixed it, but I’m not sure why my environment needed that or if this is more widespread? (I haven’t looked into if python / fastai have defaults in play here)
(local windows environment, py3.6)

(Aditya) #519

Actually you have to set the encoding in Windows before opening a file…
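The behaviour above can be reproduced on any platform. When `open()` is called without an `encoding` argument, Python falls back to a locale-dependent default, which on many Windows setups is cp1252, where byte 0x8d has no mapping. The sketch below is illustrative only: the file name and text are made up, and cp1252 is passed explicitly to mimic that default.

```python
import os
import tempfile

text = "\u010d"  # 'č' is stored as the bytes C4 8D in UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "review.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Mimic a Windows default by decoding the UTF-8 file as cp1252.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
        error_message = ""
    except UnicodeDecodeError as err:
        cp1252_failed = True
        error_message = str(err)

    # Passing the encoding explicitly reads the file correctly everywhere.
    with open(path, encoding="utf-8") as f:
        recovered = f.read()
```

Passing `encoding="utf-8"` explicitly whenever a text file is opened avoids depending on the platform default.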

(nok) #520


Sorry for this stupid question: how can I make predictions for the test set with this class? I couldn’t find a way to make predictions with this bag-of-words model. I struggled for quite a while and tried to read the source code, but couldn’t understand the library well.