Another treat! Early access to Intro To Machine Learning videos

(Aditya) #312

How to deal with a dataset in which 95% of the columns contain only 0 and 1…

That’s because the whole dataset has already been encoded.

(Eric Perbos-Brinck) #313

@jeremy now that the Part 1 v2 section is open to the public, and this thread with the 12 videos is visible too:

  • Can we link/share some of your demonstrations (like “Bag of Little Bootstraps” BLB, or “Out-of-Bag”) on small volume sites like KaggleNoobs ?

  • Will there be a wiki + video timelines for ML1, as per DL1/DL2 ?
    Also in some videos, Jeremy mentioned that undocumented changes in Fastai library are usually discussed/explained in ML1 forums directly (ex: adding the ‘nas’ variable to proc_df())

I’m on Lesson 4 of ML1 and it’s a fantastic “behind-the-scenes” resource for DL1 v2, as you walk through code details of your fastai library.
Plus I love the teaching atmosphere with your graduate students, fond memories :+1:

(Tyler Morgan) #314

I’ve made it to video 4 of the ML courses, and I want to make sure I understand the difference between using set_rf_samples and using the parameter bootstrap=True. It appears that both are performing bagging (i.e. they are selecting a random subsample with replacement). Is the purpose of set_rf_samples just to let you choose the subsample size vs. bootstrap=True which appears to automatically select the subsample size for you?

(Jeremy Howard) #315

Yes to all

(Alan O'Donnell) #316

@tyler I think bootstrap=True by itself always uses a sample size of len(x_trn). (Because of sampling with replacement, each tree ends up seeing about 1 - 1/e ≈ 63% of the distinct rows in the full dataset.) It seems like sklearn’s random forests don’t give you an easy way to change the sample size though, hence the need for set_rf_samples, which just goes ahead and mutates sklearn.ensemble.forest :slight_smile:
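
A quick way to check the 63% figure is to draw one bootstrap sample and count the distinct rows. This is a minimal sketch using only the standard library; `unique_fraction`, the row count, and the seed are arbitrary choices for illustration, not part of sklearn or fastai.

```python
import math
import random

def unique_fraction(n_rows, seed=0):
    """Draw n_rows indices with replacement; return the share of distinct rows seen."""
    rng = random.Random(seed)
    sample = [rng.randrange(n_rows) for _ in range(n_rows)]
    return len(set(sample)) / n_rows

observed = unique_fraction(100_000)
expected = 1 - 1 / math.e  # limit of 1 - (1 - 1/n)**n as n grows
print(f"observed {observed:.3f}, expected {expected:.3f}")
```

The remaining 37% or so of rows are the ones a given tree never sees.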

(Tyler Morgan) #317

That makes more sense. In my head, the replacement happened between each tree but it actually replaces between each sample? So a single tree might train a few times on some points and none on others?

(Alan O'Donnell) #318

Yeah, each tree samples its own training set with replacement, so tree A looks at one 63%-ish subset, tree B looks at some other 63%ish subset, and each tree can see a single training example multiple times. I’m new to this stuff and I still find that to be a really interesting idea that I don’t totally get (When does it make sense to use bootstrapping?).

But for random forests, I think the idea is that it’s good if each individual tree trains on a somewhat different training set from all the other trees, since that way the trees’ errors are less correlated and averaging them reduces the forest’s variance. (Feel like I’m handwaving though.)


The data is sampled once for each tree - you sample with replacement before the tree starts looking at the data and the rows of data it will ever have access to stay constant as the tree is being constructed.

A tree will look at the data once for each level it adds, but some of the examples it looks at might be duplicates (say you have rows A B C D: one tree might be looking at A A B C, another one at A B C B, etc.)
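
The A B C D example can be reproduced in a few lines. This is only a sketch with the standard library: the number of trees and the seed are made up, and real forests sample row indices rather than letters.

```python
import random

rows = ["A", "B", "C", "D"]
rng = random.Random(42)

# One bootstrap sample per tree, drawn once before that tree is grown.
# The sample then stays fixed for every level of that tree.
tree_samples = [rng.choices(rows, k=len(rows)) for _ in range(3)]

for i, sample in enumerate(tree_samples):
    unseen = sorted(set(rows) - set(sample))
    print(f"tree {i}: trains on {sample}, never sees {unseen}")
```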

(Eric Perbos-Brinck) #321

I did the Video Timelines for the first 6 lessons of “Intro to Machine Learning”, and will update this post as I move forward.

Note: I changed the format a bit, adding more details/keywords for search purposes, as Jeremy dives deeper into explaining the “behind-the-scenes” code of the fastai library in ML1 than in DL1.

WIP: let me know if you find bugs or suggest changes :sunglasses:

Lesson 1 video timeline

  • 00:02:14 AWS or Crestle Deep Learning

  • 00:05:14 lesson1-rf notebook Random Forests

  • 00:10:14 ?display documentation, ??display source code

  • 00:12:14 Blue Book for Bulldozers Kaggle competition: predict auction sale price,
    Download Kaggle data to AWS using a nice trick with the Firefox JavaScript console to get a full cURL link,
    Using Jupyter “New Terminal”

  • 00:23:55 using !ls {PATH} in Jupyter Notebook

  • 00:26:14 Structured Data vs data like Computer Vision, NLP, Audio,
    ’$ vim’ in /fastai$, ‘low_memory=False’, ‘parse_dates’,
    Python 3.6 format string f’{PATH}Train.csv’,

  • 00:33:14 Why Jeremy doesn’t do a lot of EDA,
    Bulldozers metric is RMSLE: the RMSE of the difference between the logs of the prices

  • 00:36:14 Intro to Random Forests, in general doesn’t overfit, no need to set up a validation set.
    The “silly concepts” of the Curse of Dimensionality and the No Free Lunch theorem,
    Brief history of ML and lots of theory vs practice in the 90’s.

  • 00:43:14 RandomForestRegressor, RandomForestClassifier
    Stack Trace: how to fix an error

  • 00:48:14 Continuous and categorical variables, add_datepart()

  • 00:57:14 Dealing with strings in data (“low, medium, high” etc.), which must be converted into numeric coding, with train_cats() creating a mapping of integers to the strings.
    Warning: make sure to use the same mapping string-numbers in Training and Test sets,
    Use “apply_cats” for that,
    Change order of index of .cat.categories with .cat.set_categories.

  • 01:07:14 Pre-processing to replace categories with their numeric codes,
    Handle missing continuous values,
    And split the dependent variable into a separate variable.
    proc_df() and fix_missing()

  • 01:14:01 ‘split_vals()’
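
The warning at 00:57:14 about reusing the same string-to-number mapping on Training and Test sets can be illustrated with a small sketch. The helpers `fit_categories` and `encode` are hypothetical names written for this example; they are not the fastai train_cats()/apply_cats() implementations, and the alphabetical ordering and the -1 code for unseen strings are assumptions made here.

```python
def fit_categories(values):
    """Build a string -> integer code mapping from the training column."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def encode(values, mapping, unknown=-1):
    """Apply an existing mapping; strings not seen in training get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

train_col = ["low", "high", "medium", "low"]
test_col = ["medium", "high", "very high"]

mapping = fit_categories(train_col)      # learned from training data only
train_codes = encode(train_col, mapping)
test_codes = encode(test_col, mapping)   # same mapping reused for the test set
print(mapping, train_codes, test_codes)
```

Building a fresh mapping on the test column instead would silently assign different integers to the same strings, which is the failure the warning is about.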

Lesson 2

  • 00:03:30 symlink (symbolic link) to the fastai directory

  • 00:06:15 understand the RMSLE relation to RMSE, and why use np.log(‘SalePrice’) with RMSE as a result

  • 00:09:01 proc_df, numericalize

  • 00:11:01 R² (r-square, rsquare) and root mean squared error (RMSE),
    What the R² formula (and formulas in general) does, and how to understand it

  • 00:17:30 Creating a good validation set, ‘split_vals()’ explained
    "I don’t trust ML, we tried it, it looked great, we put it in production, it didn’t work" because the validation set was not representative !

  • 00:21:01 overfitting over-fitting underfitting ‘don’t look at test set !’,
    Example of failed methodology in sociology, psychology,
    Using PEP8 (or not) for ML prototyping models

  • 00:29:01 RMSE function and RandomForestRegressor,
    Speeding things up with a smaller dataset (subset = ),
    Use of ‘_’ underscore in Python

  • 00:32:01 Single Tree model and visualize it,

  • 00:47:01 Bag of Little Bootstraps (bagging), ensembling

  • 00:57:01 scikit-learn ExtraTreesRegressor, which randomly tries variables

  • 01:04:01 m.estimators_,
    Using list comprehension

  • 01:10:00 Out-of-bag (OOB) score

  • 01:13:45 Automate hyperparameters hyper-parameters with grid-search gridsearch
    Randomly subsample the dataset to reduce overfitting with ‘set_rf_samples()’, code detail at 1h18m25s

  • 01:17:20 Tip for Favorita Grocery competition,

  • 01:30:20 Looking at ‘fiProductClassDesc’ column with .cat.categories and
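
The RMSLE/RMSE relation from 00:06:15 can be sketched as follows. This is an illustration with made-up prices, assuming the plain natural log (as with np.log) rather than log1p; `rmse` and `rmsle` are helper names chosen here, not fastai functions.

```python
import math

def rmse(pred, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def rmsle(pred, actual):
    """RMSE computed on the logs of the values."""
    return rmse([math.log(p) for p in pred], [math.log(a) for a in actual])

prices = [10_000.0, 25_000.0, 60_000.0]   # made-up sale prices
preds = [12_000.0, 20_000.0, 66_000.0]    # made-up predictions

# Take the log once up front, then train and score in log space.
log_prices = [math.log(p) for p in prices]
log_preds = [math.log(p) for p in preds]

# Plain RMSE on the logged values is the RMSLE on the raw values.
print(rmse(log_preds, log_prices), rmsle(preds, prices))
```

Working in log space means errors are measured as ratios, so being off by 10% costs the same on a cheap item as on an expensive one.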

Lesson 3

  • 00:02:44 When to use or not Random Forests (unstructured data like CV or Sound works better with DL),
    Collaborative filtering for Favorita

  • 00:05:10 dealing with missing values present in Test but not Train (or vice-versa) in ‘proc_df()’ with “nas” dictionary whose keys are names of columns with missing values, and the values are the medians.

  • 00:09:30 Starting Favorita notebook,
    The ability to explain the goal of a Kaggle competition or a project,
    What are independent and dependent variables ?
    Star schema warehouse database, snowflake schema

  • 00:15:30 Use dtypes to read data without ‘low_memory = False’

  • 00:20:30 Use ‘shuf’ to read a sample of large dataset at start

  • 00:26:30 Take the Log of the sales with ‘np.log1p()’,
    Apply ‘add_datepart()’,

  • 00:28:30 Models,
    ‘np.array(trn, dtype=np.float32)’,
    Use ‘%prun’ to find lines of code that take a long time to run

  • 00:33:30 We only get reasonable results, but nothing great on the leaderboard: WHY ?

  • 00:43:30 Quick look at Rossmann grocery competition winners,
    Looking at the choice of validation set with Favorita Leaderboard by Terence Parr (his @ pseudo here ?)

  • 00:50:30 lesson2-rf_interpretation notebook,
    Why is ‘nas’ an input AND an output variable in ‘proc_df()’

  • 00:55:30 How confident are we in our predictions (based on tree variance) ?
    Using ‘set_rf_samples()’ again.
    ‘parallel_trees()’ for multithreads parallel processing,
    EROPS, OROPS, Enclosure

  • 01:07:15 Feature importance with ‘rf_feat_importance()’

  • 01:12:15 Data leakage example,
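
Why ‘nas’ is both an input and an output (00:05:10 and 00:50:30) is easier to see in a toy version. The function below is a hypothetical stand-in written for this sketch, not the fastai proc_df() code; the column name and values are invented, and None stands in for a missing value.

```python
from statistics import median

def fill_with_median(values, name, nas):
    """Fill None with a median. `nas` is read (reuse a stored median)
    and written (store a newly computed one)."""
    has_missing = any(v is None for v in values)
    if name not in nas and has_missing:
        nas[name] = median([v for v in values if v is not None])
    if name in nas:
        return [nas[name] if v is None else v for v in values]
    return list(values)

nas = {}
train_year = [1998, None, 2004, 2001]
test_year = [None, 2010]

train_filled = fill_with_median(train_year, "year", nas)  # computes and stores the median
test_filled = fill_with_median(test_year, "year", nas)    # reuses the training median
print(nas, train_filled, test_filled)
```

Passing the dictionary returned from the training run into the test run is what keeps the two sets filled with the same values.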

Lesson 4

  • 00:00:04 How to deal with version control and notebooks ? Make a copy and rename it with “tmp-blablabla” so it’s hidden from Git Pull

  • 00:01:50 Summarize the relationship between hyperparameters in Random Forests, overfitting and collinearity.
    ‘set_rf_samples()’, ‘oob_score = True’,
    ‘min_samples_leaf=’ 8m45s,
    ‘max_features=’ 12m15s

  • 00:18:50 Random Forest Interpretation lesson2-rf_interpretation,

  • 00:26:50 ‘to_keep = fi[fi.imp>0.005]’ to remove less important features,
    high cardinality variables 29m45s,

  • 00:32:15 Two reasons why Validation Score is not good or getting worse: overfitting, and validation set is not a random sample (something peculiar in it, not in Train),
    The meaning of the five numbers results in ‘print_score(m)’, RMSE of Training & Validation, R² of Train & Valid & OOB.
    We care about the RMSE of Validation set.

  • 00:35:50 How Feature Importance is normally done in Industry and Academics outside ML: they use Logistic Regression Coefficients, not Random Forests Feature/Variable Importance.

  • 00:39:50 Doing One-hot encoding for categorical variables,
    Why and how ‘max_n_cat=7’ works, based on cardinality 49m15s, ‘numericalize’

  • 00:55:05 Removing redundant features using a dendrogram and ‘.spearmanr()’ for rank correlation, ‘get_oob(df)’, ‘to_drop = []’ variables, ‘reset_rf_samples()’

  • 01:07:15 Partial dependence: how important features relate to the dependent variable, ‘ggplot() + stat_smooth()’, ‘plot_pdp()’

  • 01:21:50 What is the purpose of interpretation, what to do with that information ?

  • 01:30:15 What is EROPS / OROPS ?

  • 01:32:25 Tree interpreter

Lesson 5

  • 00:00:04 Review of Training, Test set and OOB score, intro to Cross-Validation (CV),
    In Machine Learning, we care about Generalization Accuracy/Error.

  • 00:11:35 Kaggle Public and Private test sets for Leaderboard,
    the risk of using a totally random validation set, rerun the model including Validation set.

  • 00:22:15 Is my Validation set truly representative of my Test set? Build 5 very different models and score them on Validation and on Test. Examples with Favorita Grocery.

  • 00:28:10 Why building a representative Test set is crucial in the Real World machine learning (not in Kaggle),
    Sklearn make train/test split or cross-validation = bad in real life (for Time Series) !

  • 00:31:04 What is Cross-Validation and why you shouldn’t use it most of the time (hint: random is bad)

  • 00:38:04 Tree interpretation revisited, lesson2-rf_interpretation.ipynb, waterfall plot for increase and decrease in tree splits,
    ‘ti.predict(m, row)’

  • 00:48:50 Dealing with Extrapolation in Random Forests,
    RF can’t extrapolate like Linear Model, avoid Time variables as predictors if possible ?
    Trick: find the differences between Train and Valid sets, ie. any temporal predictor ? Build a RF to identify components present in Valid only and not in Train ‘x,y = proc_df(df_ext, ‘is_valid’)’,
    Use it in Kaggle by putting Train and Test sets together and add a column ‘is_test’, to check if Test is a random sample or not.

  • 00:59:15 Our final model of Random Forests, almost as good as Kaggle #1 (Leustagos & Giba)

  • 01:03:04 What to expect for the in-class exam

  • 01:05:04 Lesson3-rf_foundations.ipynb, writing our own Random Forests code.
    Basic data structures code, class ‘TreeEnsemble()’, ‘np.random.seed(42)’ to seed the pseudo-random number generator
    How to make a prediction in Random Forests (theory) ?

  • 01:21:04 class ‘DecisionTree()’,
    Bonus: Object-Oriented-Programming (OOP) overview, critical for PyTorch
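
The point at 00:28:10 and 00:31:04 about random splits being a poor fit for time series can be made concrete with a time-ordered holdout. This is a sketch under assumptions: `time_split` is a hypothetical helper (not fastai’s split_vals()), and the sales rows are invented.

```python
from datetime import date

def time_split(rows, n_valid, key):
    """Sort by `key` and hold out the most recent n_valid rows for validation."""
    ordered = sorted(rows, key=key)
    return ordered[:-n_valid], ordered[-n_valid:]

sales = [
    {"day": date(2017, 8, 3), "units": 7},
    {"day": date(2017, 8, 1), "units": 5},
    {"day": date(2017, 8, 4), "units": 6},
    {"day": date(2017, 8, 2), "units": 9},
]

train, valid = time_split(sales, n_valid=1, key=lambda r: r["day"])
print([r["day"] for r in train], [r["day"] for r in valid])
```

Every validation row is later than every training row, which mirrors how the model will be used: trained on the past, scored on the future.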

Lesson 6

Note: this lesson has a VERY practical discussion with USF students about the use of Machine Learning in business/corporation, Jeremy shares his experience as a business consultant (McKinsey) and entrepreneur in AI/ML. Deffo not PhD’s stuff, too real-life.

  • 00:00:04 Review of previous lessons: Random Forests interpretation techniques,
    Confidence based on tree variance,
    Feature importance,
    Removing redundant features,
    Partial dependence…
    And why do we do Machine Learning, what’s the point ?
    Looking at PowerPoint ‘intro.ppx’ in Fastai GitHub: ML applications (horizontal & vertical) in real-life.
    Churn (which customer is going to leave) in Telecom: google “jeremy howard data products”,
    drive-train approach with ‘Defined Objective’ -> ‘Company Levers’ -> ‘Company Data’ -> ‘Models’

  • 00:10:01 "In practice, you’ll care more about the results of your simulation than your predictive model directly ",
    Example with Amazon ‘not-that-smart’ recommendations vs optimization model.
    More on Churn and Machine Learning Applications in Business

  • 00:20:30 Why is it hard/key to define the problem to solve,
    ICYMI: read “Designing great data products” by Jeremy (March 28, 2012)
    Healthcare applications like ‘Readmission risk’. Retail applications examples.
    There’s a lot more than what you read about Facebook or Google applications in Tech media.
    Machine Learning in Social Sciences today: not much.

  • 00:37:15 More on Random Forests interpretation techniques.
    Confidence based on tree variance

  • 00:42:30 Feature importance, and Removing redundant features

  • 00:50:45 Partial dependence (or dependance)

  • 01:02:45 Tree interpreter (and a great example of effective technical communications by a student)
    Using Excel waterfall chart from Chris
    Using ‘’, a command-line wrapper for git that makes you better at GitHub.

  • 01:16:15 Extrapolation, with a 20 mins session of live coding by Jeremy

Unofficial release of part 1 v2
(Jeremy Howard) #323

Thanks @EricPB!

(Eric Perbos-Brinck) #324

ML1 Video Timelines updated with Lesson 4.

(Aditya) #325

Can CNNs be used to detect Photoshopped images?

I wonder how that’s done, since we can’t tell at all whether they are Photoshopped when they all look alike…

(Andrei S) #326

Thanks for the awesome videos!
Can you suggest any tips on how to handle unbalanced data with RF? I’m doing binary classification and 90% of the samples in the dataset belong to one class. As a result, the model almost always predicts that class. Accuracy and precision are good, but recall is quite bad. What can be done in this case? Thanks!


Duplicating rows of the minority class so that you have 50% of each class might be worth a shot.
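
The duplication idea could look something like this. It is a minimal sketch using only the standard library; `oversample` is a name invented here, the integer rows are stand-ins for real feature rows, and the 90/10 split mirrors the question above.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes
    until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # empty for the largest class
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

rows = list(range(100))          # stand-in feature rows
labels = [0] * 90 + [1] * 10     # 90% / 10% class split
bal_rows, bal_labels = oversample(rows, labels)
print(Counter(bal_labels))
```

Apply this to the training split only, after the validation set has been held out; otherwise copies of the same row end up on both sides and the validation score looks better than it should.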

(Andrei S) #328

In Lesson 8 we use the fit() and predict() methods to train the model, and it looks like these methods belong to the fastai library. What is the original PyTorch way to fit/train models and predict with them?

(Aditya) #329

No, it isn’t…
Isn’t it the Decision Trees method that we are using?
(predict()), which in turn calls the original predefined one…

(Andrei S) #330

@ecdrid Thanks for your answer. I meant mnist and neural networks (Lesson 8) speaking of fit and predict

(Eric Perbos-Brinck) #331

ML1 Video Timelines updated with Lesson 5.

(Even Oldridge) #332

Definitely, although in the most recent paper I found I think they’re using RNNs:

(Delgermurun Purevkhuu) #333

The Corporación Favorita Grocery Sales Forecasting competition has just finished. Waiting for Jeremy’s source code :slight_smile: