Another treat! Early access to Intro To Machine Learning videos

(Jeremy Howard) #315

Yes to all

(Alan O'Donnell) #316

@tyler I think bootstrap=True by itself always uses a sample size of len(x_trn). (Which because of sampling with replacement etc. works out to seeing about 1 - 1/e = 63% of the full dataset.) It seems like sklearn’s random forests don’t give you an easy way to change the sample size though, hence the need for set_rf_samples, which just goes ahead and mutates sklearn.ensemble.forest :slight_smile:
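A quick way to sanity-check that 63% figure, as a minimal sketch (assuming numpy; the size and seed are arbitrary):

import numpy as np

# One bootstrap sample: n draws, with replacement, from n rows.
n = 100_000
rng = np.random.default_rng(42)
idx = rng.integers(0, n, size=n)

# Fraction of distinct rows the sample touches: about 1 - 1/e.
print(len(np.unique(idx)) / n)   # ~0.632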

(Tyler Morgan) #317

That makes more sense. In my head, the resampling happened across trees, but it actually samples with replacement within each tree's training set? So a single tree might train on some points a few times and on others not at all?

(Alan O'Donnell) #318

Yeah, each tree samples its own training set with replacement, so tree A looks at one 63%-ish subset, tree B looks at some other 63%-ish subset, and each tree can see a single training example multiple times. I’m new to this stuff, and I still find it a really interesting idea that I don’t totally get (when does it make sense to use bootstrapping?).

But for random forests, I think the idea is that it’s good if each individual tree trains on a somewhat different training set from all the other trees, since that way the trees are less correlated and the forest’s averaged predictions have lower variance. (Feel like I’m handwaving though.)


The data is sampled once for each tree - you sample with replacement before the tree starts looking at the data and the rows of data it will ever have access to stay constant as the tree is being constructed.

A tree will look at the data once when adding each level, but some of the examples it looks at might be duplicates (say you have rows A B C D; one tree might be looking at A A B C, another one at A B C B, etc.)
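To make the A/B/C/D example concrete, here is a tiny sketch (assuming numpy; the rows and seed are made up) of each tree drawing its own fixed bootstrap sample before training:

import numpy as np

rows = np.array(["A", "B", "C", "D"])
rng = np.random.default_rng(0)

# Each tree resamples once, up front; duplicates within a tree are expected.
for tree_id in range(3):
    sample = rng.choice(rows, size=len(rows), replace=True)
    print(f"tree {tree_id} sees: {list(sample)}")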

(Eric Perbos-Brinck) #321

I did the Video Timelines for the first 8 lessons of “Intro to Machine Learning”, and will update this post as I move forward.

Note: I changed the format a bit, adding more details/keywords for search purposes, as Jeremy dives deeper into explaining the “behind-the-scenes” code of the Fastai library in ML1 vs DL1.

WIP: let me know if you find bugs or want to suggest changes :sunglasses:

Lesson 1 video timeline

  • 00:02:14 AWS or Crestle for Deep Learning

  • 00:05:14 lesson1-rf notebook Random Forests

  • 00:10:14 ‘?display’ for documentation, ‘??display’ for source code

  • 00:12:14 Blue Book for Bulldozers Kaggle competition: predict auction sale price,
    Downloading Kaggle data to AWS using a nice trick with the Firefox JavaScript console to get a full cURL link,
    Using Jupyter “New Terminal”

  • 00:23:55 using !ls {PATH} in Jupyter Notebook

  • 00:26:14 Structured Data vs data like Computer Vision, NLP, Audio,
    ‘vim’ in /fastai, ‘low_memory=False’, ‘parse_dates’,
    Python 3.6 format string f’{PATH}Train.csv’,

  • 00:33:14 Why Jeremy doesn’t do a lot of EDA,
    Bulldozers RMSLE: the difference between the logs of the prices

  • 00:36:14 Intro to Random Forests: in general they don’t overfit, no need to set up a validation set.
    The “silly concepts” of the Curse of Dimensionality and the No Free Lunch theorem,
    Brief history of ML, and lots of theory vs practice in the ’90s.

  • 00:43:14 RandomForestRegressor, RandomForestClassifier
    Stack Trace: how to fix an error

  • 00:48:14 Continuous and categorical variables, add_datepart()

  • 00:57:14 Dealing with strings in data (“low, medium, high” etc.), which must be converted into numeric codes, with train_cats() creating a mapping of integers to the strings.
    Warning: make sure to use the same string-to-number mapping in the Training and Test sets,
    Use “apply_cats” for that,
    Change the order of the index of .cat.categories with .cat.set_categories.

  • 01:07:14 Pre-processing to replace categories with their numeric codes,
    Handle missing continuous values,
    And split the dependent variable into a separate variable.
    proc_df() and fix_missing() (see the sketch after this timeline)

  • 01:14:01 ‘split_vals()’
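For reference, a hedged plain-pandas sketch of what the train_cats() / proc_df() / fix_missing() steps above boil down to (the column names are hypothetical; the real fastai functions handle many more cases):

import pandas as pd

df = pd.DataFrame({"UsageBand": ["Low", "High", None, "Medium"],
                   "MachineHours": [100.0, None, 250.0, 80.0],
                   "SalePrice": [9500, 21000, 14000, 8000]})

# train_cats()-style step: strings become ordered pandas categoricals.
df["UsageBand"] = df["UsageBand"].astype("category").cat.set_categories(
    ["Low", "Medium", "High"], ordered=True)

# proc_df()/fix_missing()-style steps: median-fill continuous values with
# an _na flag, replace categories with their codes, split off the target.
df["MachineHours_na"] = df["MachineHours"].isna()
df["MachineHours"] = df["MachineHours"].fillna(df["MachineHours"].median())
df["UsageBand"] = df["UsageBand"].cat.codes + 1   # 0 is reserved for missing
y = df.pop("SalePrice")
print(df, y, sep="\n")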

Lesson 2

  • 00:03:30 symlink (symbolic link) to the fastai directory

  • 00:06:15 Understanding how RMSLE relates to RMSE, and why we take np.log(‘SalePrice’) so we can use RMSE

  • 00:09:01 proc_df, numericalize

  • 00:11:01 R² (rsquare) and root mean squared error (RMSE),
    What the R² formula (and formulas in general) does, and how to understand it

  • 00:17:30 Creating a good validation set, ‘split_vals()’ explained.
    “I don’t trust ML: we tried it, it looked great, we put it in production, it didn’t work”, because the validation set was not representative!

  • 00:21:01 Overfitting over-fitting underfitting, ‘don’t look at the test set!’,
    Example of failed methodology in sociology and psychology,
    Using PEP8 (or not) for ML prototyping models

  • 00:29:01 RMSE function and RandomForestRegressor,
    Speeding things up with a smaller dataset (subset = ),
    Use of ‘_’ underscore in Python

  • 00:32:01 Building a single-tree model and visualizing it,

  • 00:47:01 Bagging, the Bag of Little Bootstraps, ensembling

  • 00:57:01 scikit-learn ExtraTreesRegressor randomly tries variables

  • 01:04:01 m.estimators_,
    Using list comprehension

  • 01:10:00 Out-of-bag (OOB) score (see the sketch after this timeline)

  • 01:13:45 Automate hyperparameters hyper-parameters with grid-search gridsearch
    Randomly subsample the dataset to reduce overfitting with ‘set_rf_samples()’, code detail at 1h18m25s

  • 01:17:20 Tip for Favorita Grocery competition,

  • 01:30:20 Looking at the ‘fiProductClassDesc’ column with .cat.categories
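A minimal, self-contained sketch of the rmse() helper and OOB-score pattern from this lesson (synthetic data stands in for the bulldozers set; see 01:10:00 above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rmse(pred, actual):
    return np.sqrt(((pred - actual) ** 2).mean())

# Synthetic stand-in data, just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)
X_trn, X_val, y_trn, y_val = X[:800], X[800:], y[:800], y[800:]

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(X_trn, y_trn)
print("valid RMSE:", rmse(m.predict(X_val), y_val))
print("OOB R^2:", m.oob_score_)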

Lesson 3

  • 00:02:44 When to use Random Forests or not (unstructured data like Computer Vision or Sound works better with DL),
    Collaborative filtering for Favorita

  • 00:05:10 Dealing with missing values present in Test but not in Train (or vice-versa) in ‘proc_df()’, with the “nas” dictionary whose keys are the names of columns with missing values and whose values are the medians.

  • 00:09:30 Starting the Favorita notebook,
    The ability to explain the goal of a Kaggle competition or a project,
    What are independent and dependent variables?
    Star schema in a data warehouse, snowflake schema

  • 00:15:30 Use dtypes to read data without ‘low_memory = False’

  • 00:20:30 Use ‘shuf’ to read a sample of a large dataset at the start

  • 00:26:30 Take the log of the sales with ‘np.log1p()’,
    Apply ‘add_datepart()’,

  • 00:28:30 Models,
    ‘np.array(trn, dtype=np.float32)’,
    Use ‘%prun’ to find lines of code that take a long time to run

  • 00:33:30 We only get reasonable results, but nothing great on the leaderboard: WHY?

  • 00:43:30 Quick look at the Rossmann Store Sales competition winners,
    Looking at the choice of validation set for the Favorita Leaderboard by Terence Parr (his @ handle here?)

  • 00:50:30 Lesson2-rf interpretation,
    Why ‘nas’ is both an input AND an output variable in ‘proc_df()’

  • 00:55:30 How confident are we in our predictions (based on tree variance)?
    Using ‘set_rf_samples()’ again.
    ‘parallel_trees()’ for multi-threaded parallel processing (see the sketch after this timeline),
    EROPS, OROPS, Enclosure

  • 01:07:15 Feature importance with ‘rf_feat_importance()’

  • 01:12:15 Data leakage example,
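A hedged sketch of the two interpretation ideas above, confidence from tree variance (00:55:30) and feature importance (01:07:15), using sklearn directly on toy data rather than the fastai helpers:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data so the sketch runs on its own; only feature 0 matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)
m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X, y)

# Confidence from tree variance: predict with every tree, look at the
# per-row spread across trees (this is what parallel_trees() parallelizes).
preds = np.stack([t.predict(X[:5]) for t in m.estimators_])
print("mean:", preds.mean(axis=0))
print("std: ", preds.std(axis=0))

# Feature importance: sklearn's built-in, akin to rf_feat_importance().
print("importances:", m.feature_importances_)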

Lesson 4

  • 00:00:04 How to deal with version control and notebooks? Make a copy and rename it with “tmp-blablabla” so it’s hidden from git pull

  • 00:01:50 Summarizing the relationship between hyperparameters in Random Forests, overfitting and collinearity.
    ‘set_rf_samples()’, ‘oob_score = True’,
    ‘min_samples_leaf=’ 8m45s,
    ‘max_features=’ 12m15s

  • 00:18:50 Random Forest Interpretation lesson2-rf_interpretation,

  • 00:26:50 ‘to_keep = fi[fi.imp>0.005]’ to remove less important features,
    high cardinality variables 29m45s,

  • 00:32:15 Two reasons why the Validation Score is not good or is getting worse: overfitting, or the validation set is not a random sample (something peculiar in it that is not in Train),
    The meaning of the five numbers returned by ‘print_score(m)’: RMSE of Training & Validation, R² of Training, Validation & OOB.
    We care about the RMSE of the Validation set.

  • 00:35:50 How Feature Importance is normally done in industry and academia outside ML: with Logistic Regression coefficients, not Random Forests Feature/Variable Importance.

  • 00:39:50 Doing one-hot encoding for categorical variables,
    Why and how ‘max_n_cat=7’ works, based on cardinality 49m15s, ‘numericalize’

  • 00:55:05 Removing redundant features using a dendrogram and ‘.spearmanr()’ for rank correlation (see the sketch after this timeline), ‘get_oob(df)’, ‘to_drop = []’ variables, ‘reset_rf_samples()’

  • 01:07:15 Partial dependence: how important features relate to the dependent variable, ‘ggplot() + stat_smooth()’, ‘plot_pdp()’

  • 01:21:50 What is the purpose of interpretation, and what to do with that information?

  • 01:30:15 What is EROPS / OROPS?

  • 01:32:25 Tree interpreter
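A minimal sketch of the dendrogram step from 00:55:05 (assuming numpy/scipy; the toy matrix deliberately contains one redundant column):

import numpy as np
import scipy.stats
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + rng.normal(scale=0.01, size=200)   # redundant feature

# Spearman rank correlation -> distance matrix -> hierarchical clustering.
corr = scipy.stats.spearmanr(X).correlation
dist = squareform(1 - np.abs(corr), checks=False)
z = hc.linkage(dist, method="average")
# hc.dendrogram(z)   # plotted, columns 0 and 3 merge almost immediately
print(z)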

Lesson 5

  • 00:00:04 Review of Training, Test set and OOB score, intro to Cross-Validation (CV),
    In Machine Learning, we care about Generalization Accuracy/Error.

  • 00:11:35 Kaggle Public and Private test sets for the Leaderboard,
    The risk of using a totally random validation set; re-running the model including the Validation set.

  • 00:22:15 Is my Validation set truly representative of my Test set? Build 5 very different models and score them on Validation and on Test. Examples with Favorita Grocery.

  • 00:28:10 Why building a representative Test set is crucial in real-world machine learning (not on Kaggle),
    Sklearn’s train/test split or cross-validation = bad in real life (for Time Series)!

  • 00:31:04 What is Cross-Validation and why you shouldn’t use it most of the time (hint: random is bad)

  • 00:38:04 Tree interpretation revisited, lesson2-rf_interpretation.ipynb, waterfall plot for increases and decreases along tree splits,
    ‘ti.predict(m, row)’

  • 00:48:50 Dealing with extrapolation in Random Forests:
    an RF can’t extrapolate the way a Linear Model can, so avoid time variables as predictors if possible.
    Trick: find the differences between the Train and Valid sets, i.e. is there any temporal predictor? Build an RF to identify what is present in Valid but not in Train, with x,y = proc_df(df_ext, ‘is_valid’) (see the sketch after this timeline),
    Use it on Kaggle by putting the Train and Test sets together and adding an ‘is_test’ column, to check whether Test is a random sample or not.

  • 00:59:15 Our final model of Random Forests, almost as good as Kaggle #1 (Leustagos & Giba)

  • 01:03:04 What to expect for the in-class exam

  • 01:05:04 Lesson3-rf_foundations.ipynb, writing our own Random Forests code.
    Basic data structures code, class ‘TreeEnsemble()’, ‘np.random.seed(42)’ as the pseudo-random number generator seed,
    How to make a prediction in Random Forests (theory)?

  • 01:21:04 class ‘DecisionTree()’,
    Bonus: Object-Oriented-Programming (OOP) overview, critical for PyTorch
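A hedged sketch of the ‘is_valid’ extrapolation trick from 00:48:50 (all names here are hypothetical; ‘t’ plays the role of a temporal predictor):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df_trn = pd.DataFrame({"x": np.random.rand(500), "t": np.arange(500)})
df_val = pd.DataFrame({"x": np.random.rand(100), "t": np.arange(500, 600)})

# Label each row with its origin, then try to predict that label.
df_ext = pd.concat([df_trn.assign(is_valid=0), df_val.assign(is_valid=1)])
m = RandomForestClassifier(n_estimators=40, n_jobs=-1)
m.fit(df_ext.drop(columns="is_valid"), df_ext["is_valid"])

# A feature that separates Train from Valid (here the time index) dominates.
print(dict(zip(["x", "t"], m.feature_importances_)))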

Lesson 6

Note: this lesson has a VERY practical discussion with USF students about the use of Machine Learning in business/corporations; Jeremy shares his experience as a business consultant (McKinsey) and entrepreneur in AI/ML. Definitely not PhD stuff, very real-life.

  • 00:00:04 Review of previous lessons: Random Forests interpretation techniques,
    Confidence based on tree variance,
    Feature importance,
    Removing redundant features,
    Partial dependence…
    And why do we do Machine Learning? What’s the point?
    Looking at the PowerPoint ‘intro.ppx’ in the Fastai GitHub: ML applications (horizontal & vertical) in real life.
    Churn (which customer is going to leave) in Telecom: google “jeremy howard data products”,
    The Drivetrain Approach: ‘Defined Objective’ -> ‘Company Levers’ -> ‘Company Data’ -> ‘Models’

  • 00:10:01 “In practice, you’ll care more about the results of your simulation than about your predictive model directly”,
    Example with Amazon’s ‘not-that-smart’ recommendations vs an optimization model.
    More on Churn and Machine Learning applications in business

  • 00:20:30 Why defining the problem to solve is both hard and key,
    ICYMI: read “Designing great data products” by Jeremy, from March 28, 2012.
    Healthcare applications like ‘readmission risk’. Retail application examples.
    There’s a lot more than what you read about Facebook or Google applications in the Tech media.
    Machine Learning in the Social Sciences today: not much.

  • 00:37:15 More on Random Forests interpretation techniques.
    Confidence based on tree variance

  • 00:42:30 Feature importance, and Removing redundant features

  • 00:50:45 Partial dependence (or dependance)

  • 01:02:45 Tree interpreter (and a great example of effective technical communications by a student)
    Using Excel waterfall chart from Chris
    Using ‘hub’, a command-line wrapper for git that makes you better at GitHub.

  • 01:16:15 Extrapolation, with a 20 mins session of live coding by Jeremy

Lesson 7

  • 00:00:01 Review of the previous Random Forest lessons,
    Lots of historical/theoretical techniques in ML that we don’t use anymore (like SVMs),
    Use of ML in Industry vs Academia, Decision-Tree Ensembles

  • 00:05:30 How big does the Validation Set need to be? How much does the accuracy of your model matter?
    Demo with Excel, the T-distribution and n>22 observations in every class,
    Variance of a binomial: np(1-p); Standard Error (stdev of the mean): stdev/sqrt(n) (see the sketch after this timeline)

  • 00:18:45 Back to Random Forest from scratch.
    “Basic data structures” reviewed

  • 00:32:45 Single branch:
    Find the best split for a given variable with ‘find_better_split’, using the Excel demo again

  • 00:45:30 Speeding things up

  • 00:55:00 Full single tree

  • 01:01:30 Predictions with ‘predict(self,x)’,
    and ‘predict_row(self, xi)’

  • 01:09:05 Putting it all together,
    Cython, an optimising static compiler for Python

  • 01:18:01 “Your mission, for next class, is to implement”:
    Confidence based on tree variance,
    Feature importance,
    Partial dependence,
    Tree interpreter.

  • 01:20:15 Reminder: how to ask for help on the Fastai forums,
    Getting a screenshot, resizing it.
    For lines of code, create a “Gist” using the ‘Gist-it’ extension (“Create/Edit Gist of Notebook”) with ‘nbextensions_configurator’ on Jupyter Notebook; also ‘Collapsible Headings’, ‘Chrome Clipboard’, ‘Hide Header’

  • 01:23:15 We’re done with Random Forests; now we move on to Neural Networks.
    Random Forests can’t extrapolate, they just average data they have already seen; Linear Regression can extrapolate, but only in very limited ways.
    Neural Networks give us the best of both worlds.
    Intro to SGD for MNIST, unstructured data.
    Quick comparison with Fastai/Jeremy’s Deep Learning course.
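A small worked example of the validation-set-size arithmetic from 00:05:30 above (a sketch; the 0.95 accuracy and the n values are made up):

import math

# For a binary outcome with observed accuracy p on n validation rows, the
# standard error of that estimate is stdev/sqrt(n) = sqrt(p * (1 - p) / n).
def accuracy_se(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10_000):
    print(f"n={n:>6}: 0.95 +/- {accuracy_se(0.95, n):.4f}")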

Lesson 8

  • 00:00:45 Moving from Decision-Tree Ensembles to Neural Nets with MNIST,
    lesson4-mnist_sgd.ipynb notebook

  • 00:08:20 About Python ‘pickle’ pros & cons for Pandas, vs ‘feather’,
    Flattening a tensor

  • 00:13:45 Reminder on the jargon: a vector in math is a 1d array in CS,
    a rank 1 tensor in deep learning.
    A matrix is a 2d array or a rank 2 tensor, rows are axis 0 and columns are axis 1

  • 00:17:45 Normalizing the data: subtracting the mean and dividing by the stddev.
    Important: use the mean and stddev of the Training data for the Validation data as well.
    Use the ‘np.reshape()’ function

  • 00:34:25 Slicing into a tensor, ‘plots()’ from Fastai lib.

  • 00:38:20 Overview of a Neural Network,
    Michael Nielsen’s universal approximation theorem: a visual proof that neural nets can compute any function,
    Why you should blog (by Rachel Thomas)

  • 00:47:15 Intro to PyTorch & Nvidia GPUs for Deep Learning,
    Website to buy a laptop with a good GPU:
    Using cloud services like or AWS (and how to gain access to EC2 with a “Request limit increase”)

  • 00:57:45 Create a Neural Net for Logistic Regression in PyTorch (see the sketch after this timeline),
    ‘net = nn.Sequential(nn.Linear(28*28, 10), nn.LogSoftmax()).cuda()’
    ‘md = ImageClassifierData.from_arrays(path, (x,y), (x_valid, y_valid))’
    A loss function such as ‘nn.NLLLoss()’, i.e. Negative Log Likelihood Loss, aka Cross-Entropy (binary or categorical),
    Looking at the Loss with Excel

  • 01:09:05 Let’s fit the model, then make predictions on the Validation set.
    ‘fit(net, md, epochs=1, crit=loss, opt=opt, metrics=metrics)’
    Note: PyTorch doesn’t use the word “loss” but the word “criterion”, hence ‘crit=loss’
    ‘preds = predict(net, md.val_dl)’
    ‘preds.shape’ -> (10000, 10)
    ‘preds.argmax(axis=1)[:5]’: argmax returns the index of the largest value, which here is the predicted digit itself.
    ‘np.mean(preds == y_valid)’ to check how accurate the model is on the Validation set.

  • 01:16:05 A second pass on Michael Nielsen’s universal approximation theorem:
    a Neural Network can approximate any other function to arbitrarily close accuracy, as long as it’s large enough.

  • 01:18:15 Defining Logistic Regression ourselves, from scratch, not using PyTorch ‘nn.Sequential()’
    Demo explanation with drawings by Jeremy.
    Look at Excel ‘entropy_example.xlsx’ for Softmax and Sigmoid

  • 01:31:05 Assignments for the week, student question on ‘forward(self, x)’
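For the 00:57:45 entry above (and the later question in this thread about doing this without fastai’s fit()/predict() wrappers), a hedged sketch of the same logistic-regression net in plain, current PyTorch; LogSoftmax now requires dim=, and .cuda() is omitted so it runs anywhere:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
loss_fn = nn.NLLLoss()                      # the "criterion"
opt = torch.optim.SGD(net.parameters(), lr=0.1)

# One training step on a random stand-in batch for MNIST.
x = torch.randn(64, 28 * 28)
y = torch.randint(0, 10, (64,))
loss = loss_fn(net(x), y)
loss.backward()
opt.step()
opt.zero_grad()

# Predictions: argmax over the 10 class scores, as in the notebook.
preds = net(x).argmax(dim=1)
print(loss.item(), (preds == y).float().mean().item())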

Lesson 9

Jeremy starts with a selection of students’ posts.

Back to the course.

  • 00:09:01 Why write a post on your learning experience, for you and for newcomers.

  • 00:09:50 Using SGD on MNIST for digit recognition
    . lesson4-mnist_sgd.ipynb notebook

  • 00:11:30 Training the simplest Neural Network in PyTorch
    (long step-by-step demo, 30 mins approx)

  • 00:46:55 Intro to Broadcasting: “The MOST important programming concept in this course and in Machine Learning”
    . Performance comparison between C and Python
    . SIMD: “Single Instruction Multiple Data”
    . Multiple processors/cores and CUDA

  • 00:52:10 Broadcasting in detail (see the sketch after this timeline)

  • 01:05:50 Broadcasting goes back to the days of APL (late 1950s) and the J language
    . More on Broadcasting

  • 01:12:30 Matrix Multiplication (and what is not matrix multiplication).
    . Writing our own training loop.
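A tiny illustration of the broadcasting idea from 00:46:55 onwards (assuming numpy; the arrays are arbitrary):

import numpy as np

m = np.arange(12).reshape(3, 4)    # matrix: rank-2 tensor, axes 0 and 1
v = np.array([1, 10, 100, 1000])   # vector: rank-1 tensor

# v is (virtually) replicated down the rows of m, and the loop runs in
# optimized C/SIMD rather than in Python.
print(m + v)
print(m + v[None, :])              # the same, with the unit axis explicit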

The detailed syllabus by @timlee:

Unofficial release of part 1 v2
(Jeremy Howard) #323

Thanks @EricPB!

(Eric Perbos-Brinck) #324

ML1 Video Timelines updated with Lesson 4.

(Aditya) #325

Can CNNs be used to detect Photoshopped images?

I wonder how that’s done, since we don’t know at all whether they are Photoshopped; they all look alike…

(Andrei S) #326

Thanks for the awesome videos!
Can you suggest any tips on how to handle unbalanced data with an RF? I’m doing binary classification and the dataset has 90% samples of one class. As a result the model predicts this one class almost always. Accuracy and precision are good, but recall is quite bad. What can be done in this case? Thanks!


Duplicating data so that you have 50% of each class might be worth giving a shot
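A minimal sketch of that oversampling idea (assuming numpy; the 90/10 data is synthetic):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positives

# Re-draw minority rows (with replacement) until the classes are ~50/50.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
idx = np.concatenate([neg, pos, extra])
X_bal, y_bal = X[idx], y[idx]
print(y_bal.mean())   # ~0.5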

(Andrei S) #328

In Lesson 8 we use the fit() and predict() methods to train the model, and it looks like these methods belong to the fastai library. What is the original PyTorch way to fit/train and predict models?

(Aditya) #329

No, it isn’t…
It’s the Decision Trees method which we are using?
(‘predict()’, which in turn calls the original predefined one…)

(Andrei S) #330

@ecdrid Thanks for your answer. I meant MNIST and neural networks (Lesson 8), speaking of fit and predict.

(Eric Perbos-Brinck) #331

ML1 Video Timelines updated with Lesson 5.

(Even Oldridge) #332

Definitely, although in the most recent paper I found I think they’re using RNNs:

(Delgermurun Purevkhuu) #333

The Corporación Favorita Grocery Sales Forecasting competition has just finished. Waiting for Jeremy’s source code :slight_smile:


I’m watching lesson 2. At 33:47, Jeremy uses graphviz to draw a decision tree with draw_tree(m.estimators_[0], df_trn, precision=3).

I’m using Crestle. When I run this command, I get the error shown below. Did I miss something that I needed to do to set up Crestle?

FileNotFoundError                         Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/graphviz/backend.py in pipe(engine, format, data, quiet)
    153             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
--> 154             startupinfo=STARTUPINFO)
    155     except OSError as e:

/usr/lib/python3.6/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    708                                 errread, errwrite,
--> 709                                 restore_signals, start_new_session)
    710         except:

/usr/lib/python3.6/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1343                             err_msg += ': ' + repr(err_filename)
-> 1344                     raise child_exception_type(errno_num, err_msg, err_filename)
   1345                 raise child_exception_type(err_msg)

FileNotFoundError: [Errno 2] No such file or directory: 'dot': 'dot'

During handling of the above exception, another exception occurred:

ExecutableNotFound                        Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

/usr/local/lib/python3.6/dist-packages/graphviz/files.py in _repr_svg_(self)
    105     def _repr_svg_(self):
--> 106         return self.pipe(format='svg').decode(self._encoding)
    108     def pipe(self, format=None):

/usr/local/lib/python3.6/dist-packages/graphviz/files.py in pipe(self, format)
    123         data = text_type(self.source).encode(self._encoding)
--> 125         outs = backend.pipe(self._engine, format, data)
    127         return outs

/usr/local/lib/python3.6/dist-packages/graphviz/backend.py in pipe(engine, format, data, quiet)
    155     except OSError as e:
    156         if e.errno == errno.ENOENT:
--> 157             raise ExecutableNotFound(args)
    158         else:  # pragma: no cover
    159             raise

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0x7eff4d873400>

(Alan O'Donnell) #335

@ldlt I think your problem is that you need to install the Graphviz command-line program itself, not just the Python wrapper package. See here:, where they link to. I’m not very familiar with Crestle, but I think you can get access to a shell?


Yeah, I have access to a shell although it looks like I can’t apt-get install it. I guess I can build graphviz from source, but I’d rather not.