Another treat! Early access to Intro To Machine Learning videos


Thank you!

(vittorio) #671

Hi! I’m trying to understand embedding matrices, but I still don’t entirely grasp the concept behind them.

  • In lesson 11, embedding matrices are presented as a computational trick: a way to represent a one-hot encoded matrix implicitly (like a sparse lookup) so as to speed up the matrix multiplication.
  • In lesson 12, the concept seems different to me. Day of week (and every other categorical variable) is encoded in a matrix with a certain number of rows and columns (roughly half the cardinality, and not more than 50), and that is called an embedding matrix.
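Here is a tiny numpy check I did of the lesson-11 view (my own toy numbers, not from the course): multiplying a one-hot matrix by a weight matrix gives exactly the same result as just looking up rows of that weight matrix.

```python
import numpy as np

# Toy check: one-hot matrix times weight matrix == row lookup.
rng = np.random.default_rng(0)
cardinality, emb_size = 7, 4                   # e.g. day of week -> 4-dim embedding
W = rng.normal(size=(cardinality, emb_size))   # the embedding matrix

days = np.array([0, 3, 6])                     # a batch of category codes
one_hot = np.eye(cardinality)[days]            # shape (3, 7)

print(np.allclose(one_hot @ W, W[days]))       # True
```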

So, what is the right view? Thanks

(Etienne Tremblay) #672

In one of the lectures, @jeremy mentions his article Designing great data products, and I found the ideas fascinating. I am trying to find other resources (maybe books or other MOOCs) that go deeper into this idea.

Using predictive models as inputs to a simulation of the business lets business people run what-if scenarios about the future, and also find optimized settings of the levers to pull to maximize the desired outcome.

After some Googling, the only thing I can find is about Response Surface Methodology (RSM), but I would really like to read more on the subject.

Most articles on the web focus on algorithms and getting better accuracy on Kaggle competitions, but I actually think this Drivetrain Approach would be much more helpful in a business setting.

Thank you for the great resources. I devoured those lectures on random forests; they were really interesting. Now on to the neural network stuff!

(Stas Bekman) #673

@EricPB, here is a small correction for the video timelines.

You have in lesson 6:

01:16:15 Extrapolation, with a 20 mins session of live coding by Jeremy

but Jeremy doesn’t start on it until 1:24:50. At 1:16:15 he suggests he is about to start on it, and then he goes back to the tree interpreter section.

Moreover, while the video lesson explains why RF trees can’t handle extrapolation, it does not work through section 7 (Extrapolation) of lesson2-rf_interpretation.ipynb, and there are no notes whatsoever in that notebook section. If I understand that section correctly, it identifies and removes time-dependent features.

Thank you for your great work!

(Vineeth Kanaparthi) #674

In one of the lessons, Jeremy talks about adding a column of random numbers and then removing all columns whose importance is lower than that of the random column. Does anyone know which lesson, and maybe even the timestamp?


Intriguing: ggplot is used in one of the ML1 notebooks but, to my knowledge, is never imported. A search of the .py files reveals no import statement.

What trickery is this? Can anyone tell me how it works? Thanks

Well, would you believe it: ‘plotnine’. I have my answer.

(Navin Kumar) #676

This course on ML has been a wonderful learning experience. Thanks to Jeremy & Rachel. I just finished lesson 12.

I would also like to learn about unsupervised learning.

Could I get any pointers to resources where I could learn it in a similar way to fastai,
i.e. read a book or watch videos, then practice what’s been learnt in a Jupyter notebook?
To me this style of learning by doing is more effective than just watching academic lecture videos…

(Kofi Asiedu Brempong) #677

I am thinking of using deep learning to see if I could better my score. We could work together on that. Your thoughts?

(sid) #678

Sure. How do you want to collaborate?

(Navin Kumar) #679

I believe you are referring to the feature importance section of lesson 4. The timelines of all the ML lessons are posted here: Another treat! Early access to Intro To Machine Learning videos

Hope it helps…

(Utkarsh Mishra) #680

Can anyone help me with the part where Jeremy explains extrapolation in random forests (lecture 5)?
There are a few things I am confused about.

  1. When we create a new is_valid column, use it as the dependent variable, and try to predict it with the random forest, what does the score (0.9999875) signify? In other words, what does a score close to 1 mean? Can anyone explain in detail?

  2. Why do we use rf_feat_importance? What does the importance of the different features signify?
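To make my question concrete, here is a toy version of the trick I put together (my own synthetic data, not the bulldozers set): one feature drifts with time, and the last rows are labelled as the validation period.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    'saleElapsed': np.arange(n),               # time-dependent feature
    'hours': rng.integers(0, 10_000, size=n),  # time-independent feature
})
# last 20% of rows play the role of the validation set
df['is_valid'] = (df.index >= n - 400).astype(int)

m = RandomForestClassifier(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(df.drop('is_valid', axis=1), df.is_valid)
score = m.score(df.drop('is_valid', axis=1), df.is_valid)  # very close to 1
```

Is the idea that, since the score is near 1 and saleElapsed dominates the feature importances, the forest can tell training rows from validation rows, so saleElapsed is a time-dependent feature we might consider dropping?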


Are there notes available for the ML videos like the ones available for the DL videos?

(Stas Bekman) #682

Full autogenerated transcripts of the videos are now available and need help with proofreading. Please see: DL1, DL2, ML1 Transcripts Project - Proofreading Help Needed!

(Stas Bekman) #683

Here you go:

note to admins: it’d be very useful to create a category for IntroML. We are at 650 posts and counting, and it’s not easy to navigate, follow and find things in a thread this big.

(Stas Bekman) #684

Your question prompted me to create:

as I remembered seeing the answer to your question, but I just couldn’t remember where. So now you can all search the video transcripts (to a degree, until they are better proofread) and find the answers! Yay!

So now that I was able to grep(1) the transcript, I found you an answer given by a student:

Lesson 06. 00:42:30 Feature importance, and Removing redundant features:

“You know, I think that’s basically to find out which of those features are important for your
model. So you take each feature and you randomly sample all the values in the feature, and you
see how the predictions change. If they are very different, it means that feature was actually
important; if it’s fine to take any random values, it wasn’t.”

and here is the original explanation by Jeremy:

Lesson 3, some time after 01:12:15:

transcript quote:

“…take the column and randomly shuffle it, randomly permute just that column. Now YearMade has
exactly the same distribution (the same mean and standard deviation), but it’s going to have no
relationship with the dependent variable at all, because we totally randomly reordered it. So
before, we might have found our R squared was 0.89, and then after we shuffle YearMade we check
again and now it’s like 0.80.”

Both are from the transcript pdf (see the download link above).
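And to make the quoted idea concrete, here is a minimal sketch of that shuffling procedure on toy data of my own (plain numpy/sklearn, nothing fastai-specific): shuffle one column, remeasure R², and read the drop as that feature’s importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# target driven almost entirely by column 0
y = 3 * X[:, 0] + 0.2 * X[:, 2] + rng.normal(scale=0.1, size=1000)

m = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)
base = m.score(X, y)                      # R^2 before shuffling

drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # permute just that column
    drops.append(base - m.score(Xp, y))   # importance = drop in R^2
# drops[0] is by far the largest: column 0 drives the target
```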

(Nick) #685

With respect to feature importances: it turns out that the default approach used to compute the importances in sklearn is not based on permutations. I just stumbled across a cool blog post from Terence where he explains that in detail. He also has a library which uses the same approach Jeremy talked about.


thank you


That is pretty cool; I will make changes as required to improve it.

(Pierre Guillou) #688

[EDIT, 01/07/2018] Hi, here is my Medium post, “Fastai | How to start?”. I hope it can help new participants start this ML course or the DL ones. Feel free to ask me for more information.

Lesson 1 (ML)

(notes from the video of the fastai lesson 1 about ML)


ML Fastai

Notebooks of the lesson 1

DL Fastai


Notebook: Intro to Random Forests

  • 2 lines at the top of the notebook to pick up changes to a modified fastai file without relaunching the notebook : %load_ext autoreload; %autoreload 2
  • 1 line at the top of the notebook to display plots inside the notebook : %matplotlib inline
  • (TIP) : do not do too much EDA (Exploratory Data Analysis) on the data before training, in order to avoid creating bias
  • define the objective (loss function) : here, RMSLE

Learn how to use a Jupyter notebook

  • Setting up a GPU and learning how to use a Jupyter notebook are very important points ! (knowing Python and pandas as well :slight_smile:)
  • shift+enter : run the current cell
  • get information about a function in a Jupyter notebook : ?function_name (get documentation), ??function_name (get source code)
  • to get information about the arguments of a function, you can hit shift+tab after the name of the function (hit it 1 to 3 times to get more and more details on the arguments)
  • You can run a bash command in a Jupyter notebook using ! (exclamation mark) :
    ** !ls {PATH} (Python variables must be written inside {})
    ** !ls -lh : get the size of a file
    ** !wc -l file_name : get the number of rows of a csv file
  • (from @pierreguillou) There are also magic commands in Jupyter notebooks, prefixed with % (percent sign)

Use the site Kaggle (ML & DL competitions)

  • Blue Book for Bulldozers
  • How to get the data :
    ** 1) download it to your computer, then use scp to upload it to AWS, for example
    ** 2) with Firefox, you can use the Developer tools (ctrl+shift+I) >> ‘Network’ tab : click on Download, cancel the download, and you get the link (copy as cURL) for downloading. Then, paste this curl command into a terminal : curl "https://....." -o Then, you can mkdir a folder and unzip the file (sudo apt-get install unzip if you don’t have unzip)
    ** 3) (from @pierreguillou) with Google Chrome, use CurlWget following the same steps as with Firefox

Pandas : the library to deal with data in notebooks about ML (or DL)


  • popular library to deal with csv files (list of tutorials about pandas)
  • a pandas DataFrame looks like an R DataFrame (and a column of a pandas DataFrame is a pandas Series)
  • pandas works well with numpy : you can apply a numpy function to a pandas Series. Ex: df_raw.SalePrice = np.log(df_raw.SalePrice)
  • you can import pandas yourself, but it is already imported as pd by the fastai imports : from fastai.imports import * (check the fastai imports file)
  • Remove a column from a DataFrame : DataFrame.drop(column_name, axis=1)
  • pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"]) :
    ** low_memory=False : read the whole file at once so that dtypes are inferred consistently
    ** parse_dates=[] : give the names of all the columns that contain dates (they will be converted to the DateTime dtype)
  • (TIP) In a Jupyter notebook, if you type a variable name and press ctrl+enter, whether it is a DataFrame, video, HTML, etc., Jupyter will generally figure out a way of displaying it for you :slight_smile:
  • df_raw.tail() : display the last rows of the DataFrame (df_raw.tail().T = transposed)
  • SalePrice is the dependent variable.
  • save/load a DataFrame using feather (Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow) :
    ** save with df_raw.to_feather('tmp/bulldozers-raw')
    ** load with df_raw = pd.read_feather('tmp/bulldozers-raw')
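The pandas steps above can be sketched end to end (a made-up two-row csv standing in for Train.csv):

```python
import io
import numpy as np
import pandas as pd

# A made-up two-row csv standing in for {PATH}Train.csv
csv = io.StringIO("SalePrice,saledate,MachineID\n1000,2011-01-05,1\n2000,2011-03-02,2\n")

df_raw = pd.read_csv(csv, low_memory=False, parse_dates=["saledate"])
df_raw.SalePrice = np.log(df_raw.SalePrice)   # RMSLE objective: work in log space

X = df_raw.drop('SalePrice', axis=1)          # remove a column (axis=1)
print(df_raw.saledate.dtype)                  # datetime64[ns]
```

to_feather/read_feather work the same way on a frame like this, provided pyarrow is installed.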

Random Forest

  • (fastai definition) : Random Forest is a kind of universal machine learning technique.
    ** It is a way of predicting something of any kind of type : categorical (ex: dog or cat) or continuous (ex: price).
    ** In general, it does not overfit, and it is easy to stop it overfitting.
    ** You do not need a separate validation set in general : it can tell you how well it generalizes even if you only have one dataset.
    ** It does not assume that your data is normally distributed.
    ** It does not assume that the relationships are linear.
    ** It requires little feature engineering.
  • Definition : Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
  • 2 concepts (which turn out to be wrong in practice) :
    ** curse of dimensionality : the more columns you have, the emptier the space of your data becomes (the more dimensions you have, the more points sit on the edges). In theory, this means that the distance between points becomes much less meaningful. But the world of machine learning has become very empirical, and it turns out that in practice, building models on lots of columns works really well.
    ** no free lunch theorem : the claim is that there is no type of model that works well for every kind of dataset. But nowadays, empirical researchers study which techniques work well a lot of the time, and ensembles of decision trees, of which random forest is one, are perhaps the technique that most often comes out on top.
  • importation of the RandomForest models : from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
  • Regression (RandomForestRegressor) : prediction of continuous variables
  • Classification (RandomForestClassifier) : prediction of categorical variables
  • both come from scikit-learn : the most important machine learning package in Python (but not always the best : XGBoost is better than its Gradient Boosting Trees)
  • regression does not mean linear regression
  • 2 lines :
    ** you create a model : m = RandomForestRegressor(n_jobs=-1)
    ** you train the model by passing first the independent variables and then the dependent variables :'SalePrice', axis=1), df_raw.SalePrice)

Missing values and features engineering

  • key point before running an ML/DL model : you must turn your data into numbers in order to train the model !
    ** either continuous numbers
    ** or categories coded as a single number each. For example, you must transform a datetime dtype into many columns of categorical numbers such as year, month, day, is-it-a-holiday?, etc. (it really depends on what you are doing) : this is feature engineering.
  1. First, use the add_datepart() function on the datetime column (without expanding your datetime into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities) : the DateTime column will be deleted and new integer columns will be added (nameYear, nameMonth, nameDayofweek, etc., where name comes from the original column name).
  2. Then, apply train_cats() to the whole DataFrame to convert the string columns to the pandas category dtype (behind the scenes, it stores an integer per row and keeps a mapping between these integers and their corresponding string values). To get the same mapping on the validation/test set as on the training DataFrame, use apply_cats(test_dataframe, training_dataframe).
    ** When there is no value in a cell, the corresponding integer (in is -1.
    ** Once a DataFrame column is a category, you can use the cat attribute to access information. Ex: (get the list of category names) or (get the list of corresponding codes)
    ** If you prefer an ordinal category with another order, you can do :['High', 'Medium', 'Low'], ordered=True, inplace=True)
  3. Finally, turn your DataFrame into a fully numerical one without missing values. To do that, use the proc_df() function : it splits the dependent variable off into a separate variable, replaces categories with their numeric codes (adding +1 to all values, so that the -1 of missing values becomes 0), and handles missing continuous values (each missing continuous value is replaced by the median of the column, and a _na column is created with 1 where the value was missing and 0 elsewhere).

Run the RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1), y)


I tried the deep learning course initially, but ended up following these instead. I would have to say that the classroom format actually makes these better than the average online course, for two reasons:

  1. Topics are recapped at the appropriate points.
  2. The inclusion of students’ questions with the instructor’s answers meant that a point was either clarified or I was encouraged to think about exactly why I had the right answer.