Another treat! Early access to Intro To Machine Learning videos

I believe you are referring to the feature importance part of Lesson 4 … The timelines of all ML lessons are posted here: Another treat! Early access to Intro To Machine Learning videos

Hope it helps…

Can anyone help me by explaining the part where Jeremy covers extrapolation in random forests (Lecture 5)?
There are a few things I am confused about.

  1. When we create a new column ['is_valid'], use it as the dependent variable, and try to predict it with a random forest, what does the score (0.9999875) signify? In other words, what does a score close to 1 mean? Can anyone explain in detail?

  2. Why do we use rf_feat_importance? What does the importance of the different features signify?
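
A minimal sketch of the setup the question describes, with hypothetical names (x is the combined train+validation feature DataFrame, n_valid the number of validation rows at its end):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

is_valid = np.zeros(len(x))      # 0 = training row
is_valid[-n_valid:] = 1          # 1 = validation row
m = RandomForestClassifier(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(x, is_valid)
m.oob_score_   # a score close to 1 means the forest can easily tell training rows
               # from validation rows, i.e. the two sets come from different
               # distributions (here, different time periods), which is exactly
               # when a random forest has to extrapolate

rf_feat_importance is then run on this classifier to show which columns differ most between the two sets (the time-dependent ones), which is why it appears in that part of the lesson.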

Are there notes available for the ML videos, like those available for the DL videos?
Thanks

Full autogenerated transcripts of the videos are now available and need help with proofreading. Please see: Fast.ai DL1, DL2, ML1 Transcripts Project - Proofreading Help Needed!

Here you go:

Note to admins: it’d be very useful to create a category for IntroML. We are at 650 posts and counting; it’s not easy to navigate, follow, and find things when the thread is this big.

Your question prompted me to create:


as I remembered seeing the answer to your question but just couldn’t remember where. So now you can all search the video transcripts (to a degree, until they’re better proofread) and find the answers. Yay!

Now that I was able to grep(1) the transcript, I found you an answer, given by a student:

Lesson 6, 00:42:30, Feature importance, and Removing redundant features:

“You know, I think that’s basically to find out which of those features are important for
your model. So you take each feature, you randomly shuffle all the values in that feature,
and you see how the predictions change. If they are very different, it means that feature
was actually important, versus if it’s fine to take any random values.”

and here is the original explanation by Jeremy:

Lesson 3, some time after 01:12:15:

transcript quote:

“…take that column and randomly shuffle it, so randomly permute just that column. Now
YearMade has exactly the same distribution as before (same mean, same standard
deviation), but it’s going to have no relationship with the dependent variable at all,
because we totally randomly reordered it. So before, we might have found our R² was 0.89,
and then after we shuffle YearMade we check again, and now it’s like 0.8.”

Both are from the transcript pdf (see the download link above).
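
To make the mechanics concrete, here is a rough sketch of the permutation idea both quotes describe (hypothetical helper name; it assumes a fitted model m and a validation set X_val, y_val):

import numpy as np

def permutation_drop(m, X_val, y_val, col):
    baseline = m.score(X_val, y_val)        # e.g. an R^2 of 0.89
    X_shuf = X_val.copy()
    # shuffling keeps the same distribution (same mean, same standard deviation)
    # but destroys any relationship with the dependent variable
    X_shuf[col] = np.random.permutation(X_shuf[col].values)
    return baseline - m.score(X_shuf, y_val)  # a big drop means an important feature

The bigger the drop in the score after shuffling a column, the more the model was relying on that column.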

With respect to feature importances: it turns out that the default approach sklearn uses to compute importances is not based on permutations. I just stumbled across a cool blog post by Terence Parr where he explains this in detail: http://explained.ai/rf-importance/index.html. He also has a library, https://github.com/parrt/random-forest-importances, which uses the same permutation approach Jeremy talked about.
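
For what it’s worth, newer scikit-learn releases (0.22+) ship this idea as a built-in too; a minimal sketch, assuming a fitted model m and a validation set X_val, y_val:

from sklearn.inspection import permutation_importance

r = permutation_importance(m, X_val, y_val, n_repeats=10, random_state=42)
for i in r.importances_mean.argsort()[::-1]:   # most important first
    print(X_val.columns[i], r.importances_mean[i])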

thank you

That is pretty cool; I will make changes as required to improve it.

[EDIT, 01/07/2018] Hi, here is my Medium post, “Fastai | How to start ?”. I hope it can help new participants get started with this ML course or the DL ones. Feel free to ask me for more information.

Lesson 1 (ML)

(notes from the video of fastai ML Lesson 1)

Fastai

ML Fastai

Notebooks of the lesson 1

DL Fastai

GPU

Notebook: Intro to Random Forests

  • 2 lines at the top of the notebook to allow reloading a modified fastai file without restarting the notebook : %load_ext autoreload; %autoreload 2
  • 1 line at the top of the notebook to display plots inline in the notebook : %matplotlib inline
  • (TIP) : do not do too much EDA (Exploratory Data Analysis) on the data before training, in order to avoid creating bias
  • define the objective (loss function) : here, RMSLE (root mean squared log error); see the sketch below
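
As a reference, a minimal sketch of the RMSLE metric (assuming positive predictions and targets); in the course notebook the equivalent trick is to take np.log of SalePrice once and then use plain RMSE on the logged values:

import numpy as np

def rmsle(y_pred, y_true):
    # root mean squared log error = RMSE of the log-transformed values
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))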

Learn how to use a Jupyter notebook

  • Setting up a GPU and learning how to use the Jupyter notebook are very important points! (knowing Python and pandas as well)
  • shift+enter : run the cell
  • get information about a function in a Jupyter notebook : ?function_name (get the documentation), ??function_name (get the source code)
  • to get information about the arguments of a function, hit shift+tab after the name of the function (hit it 1 to 3 times to get more and more detail about the arguments)
  • You can run a bash command in a Jupyter notebook using ! (exclamation mark) :
    ** !ls {PATH} (Python variables must be written inside {})
    ** !ls -lh : get the size of files
    ** !wc -l file_name : get the number of rows of a csv file
  • (from @pierreguillou) There are also magic commands in Jupyter notebooks, using % (percent sign); see the example cell below
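
For example, a typical cell combining these tips might look like this (assuming pandas is imported as pd and PATH is defined):

?pd.read_csv        # show the docstring
??pd.read_csv       # show the source code
!ls -lh {PATH}      # run a shell command; {PATH} interpolates the Python variable
%time df = pd.read_csv(f'{PATH}Train.csv', low_memory=False)   # %time is a magic command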

Use the site Kaggle (ML & DL competitions)

  • https://www.kaggle.com/
  • Blue Book for Bulldozers
  • How to get the data :
    ** 1) download it to your computer and then use scp to upload it to AWS, for example
    ** 2) with Firefox, you can use the Developer Tools (ctrl+shift+I) >> ‘Network’ tab : click on Download, cancel the download, and you get the link (Copy as cURL) for downloading. You can then paste this curl command into a terminal : curl "https://....." -o bulldozers.zip. Then you can mkdir a folder and unzip the file (sudo apt-get install unzip if you don’t have unzip); see the cells sketched below
    ** 3) (from @pierreguillou) with Google Chrome, use the CurlWget extension, following the same steps as with Firefox
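
A sketch of those download steps as notebook cells (the URL is elided above; paste the one you copied from the browser, and the folder name is just an example):

!mkdir -p data/bulldozers
!curl "https://....." -o data/bulldozers/bulldozers.zip
!unzip data/bulldozers/bulldozers.zip -d data/bulldozers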

Language and libraries for the ML (or DL) notebooks

Pandas

  • popular library to deal with csv files (list of tutorials about pandas)
  • a pandas DataFrame looks like an R data.frame (and a column of a pandas DataFrame is a pandas Series)
  • pandas works well with numpy : you can apply a numpy function to a pandas Series. Ex: df_raw.SalePrice = np.log(df_raw.SalePrice)
  • you can import pandas yourself, but it is already imported as pd by the fastai imports : from fastai.imports import * (check the file imports.py)
  • Remove a column from a DataFrame : DataFrame.drop(column_name, axis=1)
  • pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"]) :
    ** low_memory=False : read the whole file before inferring the dtypes
    ** parse_dates=[] : give the names of all columns holding dates (they will be converted to the datetime dtype)
  • (TIP) In a Jupyter notebook, if you type a variable name and press ctrl+enter, whether it is a DataFrame, video, HTML, etc., it will generally figure out a way of displaying it for you
  • df_raw.tail() : display the last rows of the DataFrame (df_raw.tail().T = transposed)
  • SalePrice is the dependent variable.
  • save/load a DataFrame using feather (Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow) :
    ** save with df_raw.to_feather('tmp/bulldozers-raw')
    ** load with df_raw = pd.read_feather('tmp/bulldozers-raw')
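
Putting those together, a minimal sketch of the pandas workflow (assuming PATH points at the Bulldozers data and a tmp/ folder exists):

import numpy as np
import pandas as pd

df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=['saledate'])
df_raw.SalePrice = np.log(df_raw.SalePrice)     # RMSLE objective -> work with log(price)
df_raw.tail().T                                 # inspect the last rows, transposed

df_raw.to_feather('tmp/bulldozers-raw')         # fast on-disk save
df_raw = pd.read_feather('tmp/bulldozers-raw')  # ...and reload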

Random Forest

  • (fastai definition) : Random Forest is a kind of universal machine learning technique.
    ** It is a way of predicting something of any type : categorical (ex: dog or cat) or continuous (ex: a price).
    ** In general, it does not overfit, and it is easy to stop it from overfitting.
    ** You do not need a separate validation set in general : it can tell you how well it generalizes even when you have only one dataset.
    ** It does not assume that your data is normally distributed.
    ** It does not assume that the relationships are linear.
    ** It requires very little feature engineering.
  • Definition : Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
  • 2 concepts (which turn out to be wrong in practice) :
    ** curse of dimensionality : the more columns you have, the emptier the space of your data becomes (the more dimensions you have, the more points sit on the edges). In theory, that means the distance between points becomes much less meaningful. But machine learning has become very empirical, and it turns out that in practice, building models on lots of columns works really well.
    ** no free lunch theorem : the claim is that there is no type of model that works well for every kind of dataset. But nowadays there are empirical researchers who study which techniques work well much of the time, and ensembles of decision trees, of which random forests are one, are perhaps the technique that most often comes out on top. Fast.ai provides a standard way to pre-process the data properly and set the model’s parameters.
  • importing the Random Forest models : from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
  • Regression (RandomForestRegressor) : prediction of continuous variables
  • Classification (RandomForestClassifier) : prediction of categorical variables
  • both come from scikit-learn : the most important machine learning package in Python (but not always the best : XGBoost, for example, is better than its Gradient Boosting Trees)
  • regression does not mean linear regression
  • 2 lines :
    ** you create a model : m = RandomForestRegressor(n_jobs=-1)
    ** you train the model by passing it first the independent variables and then the dependent variable : m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)

Missing values and feature engineering

  • key points before running a ML/DL model ! You must turn your data into numbers in order to train your model ! Every column must be :
    ** either continuous
    ** or categorical, encoded as a single number. For example, you must transform the datetime dtype into many columns with categorical numbers such as year, month, day, is it a holiday ?, etc. (it really depends on what you are doing) : this is feature engineering.
  1. First, use the add_datepart() function on the datetime column (without expanding your date-time into these additional fields, you cannot capture any trend/cyclical behavior as a function of time at any of these granularities) : the DateTime column will be deleted and new integer columns will be added (Year, Month, Week, Dayofweek, etc., prefixed with the column name).
  2. Then, apply train_cats() to the whole DataFrame to move columns with string dtype to the pandas category dtype (moreover, behind the scenes, it maps each string value to an integer code). To get the same mapping on the validation/test set as on the training DataFrame, use apply_cats(test_dataframe, training_dataframe).
    ** When there is no value in a cell, the corresponding integer code (in cat.codes) is -1.
    ** Once a DataFrame column is a category, you can use the cat attribute to access information. Ex: df_raw.UsageBand.cat.categories (get the list of category names) or df_raw.UsageBand.cat.codes (get the list of corresponding codes)
    ** If you prefer an ordinal category with another order, you can do : df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)
  3. Finally, turn your DataFrame into a fully numerical one without missing values (and optionally normalized). For that, use the proc_df() function, which splits the dependent variable off into a separate variable, replaces categories with their numeric codes (adding +1 to all values, so the -1 code for missing values becomes 0), and handles missing continuous values (each missing continuous value is replaced by the column median, and a _na column is created with 1 where the value was missing and 0 elsewhere). See the sketch below.
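
A minimal sketch of those three steps, assuming the fastai 0.7 course library (fastai.structured) and the Bulldozers DataFrame from above:

from fastai.imports import *
from fastai.structured import add_datepart, train_cats, proc_df

add_datepart(df_raw, 'saledate')           # 1) expand the datetime column
train_cats(df_raw)                         # 2) strings -> pandas categories
df, y, nas = proc_df(df_raw, 'SalePrice')  # 3) all-numeric df, target y, dict of _na medians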

Run the RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)  # n_jobs=-1 : use all CPU cores
m.fit(df, y)                          # independent variables first, then the dependent variable
m.score(df, y)                        # returns R², where 1 is perfect and 0 is no better than predicting the mean

I tried the deep learning course initially, but ended up following these instead. I would have to say that the classroom format actually makes these better than the average online course, for two reasons:

It contains recaps of topics at the appropriate points.
The inclusion of students’ questions with the instructor’s answers meant that a point was either clarified, or I was encouraged to think about exactly why I had the right answer.

@jeremy had mentioned this might be happening. I would definitely love to see a machine learning forum created here to make it easier to discuss machine learning and the awesome lessons.

I’m facing the issue below and have tried a couple of things to fix it, but it doesn’t work:


a. tried to install graphviz with pip install graphviz, but it showed it was already installed.
b. added the path to the system environment variables and restarted the notebook, but it still doesn’t work.

Can anyone please help me?

Thanks,
Sumit

pip does not install the graphviz executable; you should download it yourself from https://www.graphviz.org/download/ or use conda: conda install -c anaconda graphviz

Here is an attempt at waterfall plots with plotnine; the ipynb code cells follow.
This is still a work in progress; any comments are welcome.

%load_ext autoreload
%autoreload 2

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *

b0 = pd.DataFrame({'desc': ['sales','returns','credit fees','rebates','late charges','shipping'],
        'amount': [350000,-30000,-7500,-25000,95000,-7000]})

def comma(x):
    """Format a number, or a sequence of numbers, with thousands separators."""
    # len() raises on scalars, so branch on dimensionality instead
    if np.ndim(x) > 0:
        return ["{:,.0f}".format(el) for el in x]
    return "{:,.0f}".format(x)


def waterfall_df(balance):
    """
    Expects a data frame with two columns, 'amount' and 'desc';
    returns it augmented with the columns the plot needs.
    """
    # keep the bars in the given order and classify each as increase/decrease
    balance.desc = pd.Categorical(balance.desc, categories=balance.desc)
    balance['types'] = ["increase" if v > 0 else "decrease" for v in balance.amount]
    # append a final 'net' bar holding the total
    total = balance.amount.sum()
    balance = balance.append({'amount': total, 'desc': 'net', 'types': 'net'}, ignore_index=True)
    # add an integer x-position column named 'ind'
    balance = pd.concat([balance, pd.Series([v for v in range(balance.shape[0])])], axis=1)
    cols = balance.columns.values
    cols[-1] = 'ind'
    balance.columns = cols
    balance.types = pd.Categorical(balance.types, categories=['decrease', 'increase', 'net'])
    balance.iloc[0, len(cols) - 2] = "net"   # treat the first bar as 'net': it starts from zero
    # each bar spans from the running total before it ('start') to after it ('end');
    # the final 'net' bar spans from the total back down to zero
    csum = balance.amount.cumsum()
    zero_s = pd.Series([0.0], index=[len(csum) - 1])
    balance['end'] = csum[0:len(csum) - 1].append(zero_s)
    balance['start'] = csum[0:len(csum)].shift(1).fillna(0)
    balance['cmap'] = ['#d83000' if v < 0 else '#242b73' for v in balance['amount']]

    return balance

def waterfall_plot(balance):
    ind = balance.ind.values
    end = balance.end.values
    start = balance.start.values
    end_lbl = comma(end)
    start_lbl = comma(start)
    nudge_end = [1 if e < s else -0.3 for e, s in zip(end,start)]
    nudge_start = [-0.3 if e < s else 1 for e, s in zip(end,start)]
    black = '#222222'
    y_min = balance.end.values.min()
    y_max = balance.end.values.max() + (0.2 * balance.end.values.max())

    p1 = (ggplot(balance, aes('ind', fill='types')) +
          geom_rect(aes(x='ind', xmin=ind - 0.45, xmax=ind + 0.45, ymin=end, ymax=start)) +
          xlab("") +
          ylab("") +
          theme_seaborn())
    # label each bar; note nudge_y is a geom_text parameter, not an aesthetic
    for s, e, i, t, a in zip(balance.start, balance.end, balance.ind, balance.types, balance.amount):
        if t == 'increase':
            p1 = p1 + geom_text(aes(x=i, y=e, label=a), nudge_y=1,
                                va='bottom', size=8, format_string="{:,.0f}")
        elif (t == 'net') & (e > 0):
            p1 = p1 + geom_text(aes(x=i, y=e, label=a), nudge_y=nudge_end[0],
                                va='bottom', size=8, format_string="{:,.0f}")
        elif (t == 'net') & (s > 0):
            p1 = p1 + geom_text(aes(x=i, y=s, label=a), nudge_y=nudge_start[-1],
                                va='bottom', size=8, format_string="{:,.0f}")
        elif t == 'decrease':
            p1 = p1 + geom_text(aes(x=i, y=e, label=a), nudge_y=-0.3,
                                va='top', size=8, format_string="{:,.0f}")
            
    # place the description labels along the top of the plot
    p1 = p1 + geom_label(aes(y=y_max, label='desc'), color=black, size=8, angle=20, va='center')
    return p1

waterfall_plot(waterfall_df(b0))

[screenshot of the resulting waterfall plot]

Try it on your data.

Are these videos enough for us to start working on machine learning models in the real world? Can you please help me with this?

What is artificial intelligence?

Sir, I cannot thank you enough!

We could each create an initial model and cross-check, to see what we can learn from each other.

Should we choose a different dataset, as House Prices has only 1,461 samples for training?
You could also email me at my username @ hotmail.com