I believe you are referring to the feature importance part of lesson 4. The timelines of all ML lessons are posted here: Another treat! Early access to Intro To Machine Learning videos
Hope it helps…
Can anyone help me by explaining the part in which Jeremy explains extrapolation in random forests (Lecture 5)?
There are a few things I am confused about.
2) Why do we use rf_feat_importance? What does the importance of the different features signify?
Are there notes available for the ML videos like there are for the DL videos?
Thanks
Full autogenerated transcripts of the videos are now available and needing help with proofreading. Please see: Fast.ai DL1, DL2, ML1 Transcripts Project - Proofreading Help Needed!
Here you go:
note to admins: it’d be very useful to create a category for IntroML - we are at 650 posts and counting - it’s not easy to navigate, follow and find things when it’s so big.
Your question prompted me to create:
as I remembered seeing the answer to your question, but I just couldn’t remember where. So now you all can search the video transcripts (to a degree until it’s better proofread) and find the answers! yay!
So now that I was able to grep(1) the transcript, I found an answer for you, given by a student:
Lesson 06. 00:42:30 Feature importance, and Removing redundant features:
“You know, I think that's basically to find out which of those features
are important for your model. So you take each feature and you randomly sample all the
values in the feature, and you see how the predictions are; if they're very different, it means
that that feature was actually important, as opposed to it being fine to take any random values.”
and here is the original explanation by Jeremy:
Lesson 3 some time after 01:12:15:
transcript quote:
“…take that column and randomly shuffle it, so randomly permute just that column. Now YearMade has
exactly the same distribution as before (same mean, standard deviation), but it's going to
have no relationship with the dependent variable at all, because we totally randomly reordered it.
So before we might have found our R squared was 0.89, and then after we
shuffle YearMade we check again and now it's like 0.8.”
Both are from the transcript pdf (see the download link above).
With respect to feature importances: it turns out that the default approach used to compute the importances in sklearn is not based on permutations. I just stumbled across a cool blog post from Terence where he explains that in detail: http://explained.ai/rf-importance/index.html. He also has a library, https://github.com/parrt/random-forest-importances, which uses the same approach Jeremy talked about.
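For reference, here is a minimal sketch of the permutation idea described above (shuffle one column of an already-fitted model's inputs and re-score, no refitting); X and y are placeholder names for your feature DataFrame and target, not anything from the course notebooks:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Assumes X (a DataFrame of numeric features) and y (the target) already exist.
m = RandomForestRegressor(n_jobs=-1)
m.fit(X, y)
baseline = r2_score(y, m.predict(X))

importances = {}
for col in X.columns:
    X_shuffled = X.copy()
    # Permute just this column: same distribution, but no relationship to y.
    X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
    importances[col] = baseline - r2_score(y, m.predict(X_shuffled))

# The larger the drop in score, the more important the feature.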
thank you
that is pretty cool, will make changes as required to improve it.
[EDIT, 01/07/2018] Hi, here is my Medium post about “Fastai | How to start?”. Hope it can help new participants start this ML course or the DL ones. Feel free to ask me for more information.
(Notes from the video of the fastai ML lesson 1)
- Fastai
- import pandas as pd
- ML Fastai
- Notebooks of lesson 1
- DL Fastai
- GPU
- Notebook “Intro to Random Forests”
- %load_ext autoreload; %autoreload 2
- %matplotlib inline
- Learn how to use a Jupyter notebook:
  - shift+enter: run the code
  - ?name_function: get the documentation
  - ??name_function: get the source code
  - shift+tab after the name of a function (hit it from 1 to 3 times to get more and more details on its arguments): another way to get information about functions
  - ! (exclamation mark): run a shell command from the notebook, e.g. !ls {PATH} (Python variables must be written inside {})
  - !ls -lh: get the size of a file
  - !wc -l file_name: get the number of rows of a csv file
  - % (percentage): prefix for Jupyter magic commands (e.g. %matplotlib inline)
- Use the site Kaggle (ML & DL competitions): curl "https://....." -o bulldozers.zip. Then you can mkdir a folder and unzip the file (sudo apt-get install unzip if you don't have unzip).
- Language used in the notebooks about ML (or DL): Python with f-strings, e.g. f'The {PATH} to data'
Pandas
- pd comes from the fastai imports: from fastai.imports import * (check the file imports.py)
- DataFrame.drop(column_name, axis=1)
- pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"]): low_memory=False parses all dtypes of the file; parse_dates=[] takes the names of the columns that hold dates (they will be converted to a DateTime dtype)
- ctrl+enter: run the cell and display its result, whether that is a DataFrame, video, HTML, etc.; Jupyter will generally figure out a way of displaying it for you
- df_raw.tail(): display the last rows of the DataFrame (df_raw.tail().T = transpose)
- SalePrice is the dependent variable: df_raw.SalePrice = np.log(df_raw.SalePrice)
- feather (Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow): df_raw.to_feather('tmp/bulldozers-raw') and df_raw = pd.read_feather('tmp/bulldozers-raw')
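Putting the pandas steps above together, a minimal sketch (the PATH value here is just an assumed location for the unzipped Kaggle bulldozers data; adjust it to wherever you put the files):

import os
import numpy as np
import pandas as pd

PATH = 'data/bulldozers/'  # assumed location of the unzipped Kaggle data

# low_memory=False reads more of the file before guessing dtypes;
# parse_dates converts the listed columns to datetime.
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=['saledate'])

# SalePrice is the dependent variable; take its log as in the notes above.
df_raw.SalePrice = np.log(df_raw.SalePrice)

# Save to / reload from feather for fast round-trips.
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')
df_raw = pd.read_feather('tmp/bulldozers-raw')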
Random Forest
- from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
- RandomForestRegressor: prediction of continuous variables
- RandomForestClassifier: prediction of categorical variables
- m = RandomForestRegressor(n_jobs=-1)
- m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)
Missing values and feature engineering
- add_datepart() on the datetime column (without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities): the DateTime column is deleted and new integer columns are added (Year, Dayofweek, etc., prefixed with the column name).
- train_cats() on the whole DataFrame converts columns with string dtype to the pandas category dtype (behind the scenes, it stores an integer per row and a mapping between these integers and the corresponding string values). To get the same mapping on the validation/test set as on the training DataFrame, use apply_cats(test_dataframe, training_dataframe).
- The category code (.cat.codes) of a missing value is -1.
- Use the .cat attribute to access category information. Ex: df_raw.UsageBand.cat.categories (list of category names) or df_raw.UsageBand.cat.codes (list of corresponding codes). To impose an order: df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)
- proc_df(): splits the dependent variable into a separate variable, replaces categories by their numeric codes (adding +1 to all values, so the -1 of missing values becomes 0), and handles missing continuous values (they are replaced by the median of the column, and a _na column is created with 1 where the value was missing and 0 otherwise).
Run the RandomForestRegressor
- m = RandomForestRegressor(n_jobs=-1)
- m.fit(df, y)
- m.score(df, y)
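Putting these preprocessing steps together, a minimal sketch; it assumes the old fastai 0.7 structured-data helpers (from fastai.structured import *), which is where add_datepart, train_cats and proc_df lived, and that proc_df returns the processed frame, the target and a dict of imputed medians:

from fastai.imports import *
from fastai.structured import *   # add_datepart, train_cats, proc_df (fastai 0.7)
from sklearn.ensemble import RandomForestRegressor

# df_raw is the bulldozers DataFrame loaded above, SalePrice already log-transformed.
add_datepart(df_raw, 'saledate')        # expand the datetime column into numeric parts
train_cats(df_raw)                      # turn string columns into pandas categories
df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)

# Split off the dependent variable, numericalise categories, fill missing values.
df, y, nas = proc_df(df_raw, 'SalePrice')

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
print(m.score(df, y))                   # R^2 on the training data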
I tried the deep learning course initially, but ended up following these instead. I would have to say that the classroom format actually makes these better than the average online course for two reasons:
It contains recapping of topics at the appropriate points.
The inclusion of students' questions with the instructor's answers meant that a point was either clarified or I was encouraged to think about exactly why I had the right answer.
@jeremy had mentioned this might be happening. I would definitely love to see a machine learning forum created here to make it easier to discuss machine learning and the awesome lessons.
I’m facing the issue below and have tried a couple of things to fix it, but it doesn’t work:
a. Tried to install graphviz with pip install graphviz, but it showed it was already installed.
b. Added the path to the system environment variables and restarted the notebook, but it still doesn’t work.
Can anyone please help me?
Thanks,
Sumit
pip does not install the graphviz executable; you should download it yourself from https://www.graphviz.org/download/ or use conda: conda install -c anaconda graphviz
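Once the executable is installed, a quick sanity check (not from the course notebooks, just a generic snippet) that the Python graphviz package can find the dot binary:

import graphviz

# Renders a trivial graph; this raises an ExecutableNotFound error
# if the graphviz binaries are not on your PATH.
src = graphviz.Source('digraph { a -> b }')
src.render('tmp_graph', format='png', cleanup=True)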
Here is an attempt at waterfall plots with plotnine; the ipynb code cells follow.
This is still a work in progress; any comments are welcome.
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
b0 = pd.DataFrame({'desc': ['sales','returns','credit fees','rebates','late charges','shipping'],
'amount': [350000,-30000,-7500,-25000,95000,-7000]})
def comma(x):
    """Format a number, or a sequence of numbers, with thousands separators."""
    if hasattr(x, '__len__'):
        return ["{:,.0f}".format(el) for el in x]
    return "{:,.0f}".format(x)
def waterfall_df(balance):
    """
    Expects a data frame with two columns named 'amount' and 'desc'.
    """
    balance.desc = pd.Categorical(balance.desc, categories=balance.desc)
    # classify each line item as an increase or a decrease
    balance['types'] = ["increase" if v > 0 else "decrease" for v in balance.amount]
    # append the net (total) row
    total = balance.amount.sum()
    balance = balance.append({'amount': total, 'desc': 'net', 'types': 'net'}, ignore_index=True)
    balance = pd.concat([balance, pd.Series([v for v in range(balance.shape[0])])], axis=1)
    cols = balance.columns.values
    cols[-1] = 'ind'
    #print(cols, type(cols), balance.types.unique())
    balance.columns = cols
    #print(balance.amount.cumsum())
    balance.types = pd.Categorical(balance.types, categories=['decrease', 'increase', 'net'])
    # the first bar starts from zero, so treat it as 'net'
    balance.iloc[0, len(cols) - 2] = "net"
    # cumulative sums give the start and end of each bar
    csum = balance.amount.cumsum()
    zero_s = pd.Series([0.0], index=[len(csum) - 1])
    balance['end'] = csum[0:len(csum) - 1].append(zero_s)
    balance['start'] = csum[0:len(csum)].shift(1).fillna(0)
    cmap = ['#d83000' if v < 0 else '#242b73' for v in balance['amount']]
    balance['cmap'] = cmap
    return balance
def waterfall_plot(balance):
    ind = balance.ind.values
    end = balance.end.values
    start = balance.start.values
    end_lbl = comma(end)
    start_lbl = comma(start)
    # nudge labels up or down depending on whether the bar falls or rises
    nudge_end = [1 if e < s else -0.3 for e, s in zip(end, start)]
    nudge_start = [-0.3 if e < s else 1 for e, s in zip(end, start)]
    black = '#222222'
    y_min = balance.end.values.min()
    y_max = balance.end.values.max() + (0.2 * balance.end.values.max())
    p1 = (ggplot(balance, aes('ind', fill='types')) +
          geom_rect(aes(x='ind', xmin=ind - 0.45, xmax=ind + 0.45, ymin=end, ymax=start)) +
          xlab("") +
          ylab("") +
          theme_seaborn())  # +
    # theme(
    #     axis_text = element_text(balance.desc, color='#555555', size=8, angle=45, va='bottom', margin={'t':10,'b':10})))
    #     axis_text_x=element_text(color=black)))
    # label each bar with its amount; nudge_y is a geom_text parameter, not an aesthetic
    for s, e, i, t, a in zip(balance.start, balance.end, balance.ind, balance.types, balance.amount):
        if t == 'increase':
            p1 = p1 + geom_text(
                aes(x=i, y=e, label=a), nudge_y=1, va='bottom', size=8, format_string="{:,.0f}")
        elif (t == 'net') & (e > 0):
            p1 = p1 + geom_text(
                aes(x=i, y=e, label=a), nudge_y=nudge_end[0], va='bottom', size=8, format_string="{:,.0f}")
        elif (t == 'net') & (s > 0):
            p1 = p1 + geom_text(
                aes(x=i, y=s, label=a), nudge_y=nudge_start[len(nudge_start) - 1],
                va='bottom', size=8, format_string="{:,.0f}")
        elif t == 'decrease':
            p1 = p1 + geom_text(
                aes(x=i, y=e, label=a), nudge_y=-0.3, va='top', size=8, format_string="{:,.0f}")
    p1 = p1 + geom_label(aes(y=y_max, label='desc'), color=black, size=8, angle=20, va='center')
    # p1 = p1 + scale_fill_manual(values=[('decrease', "indianred"), ('increase', "forestgreen"), ('net', "dodgerblue2")])
    return p1
waterfall_plot(waterfall_df(b0))
Try it on your own data, for example:
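(The numbers below are made up; the frame must have 'desc' and 'amount' columns, since waterfall_df expects them, and p.save() is plotnine's standard way of writing the figure to disk.)

my_data = pd.DataFrame({'desc': ['revenue', 'cost of goods', 'opex', 'tax'],
                        'amount': [100000, -45000, -30000, -6000]})
p = waterfall_plot(waterfall_df(my_data))
p.save('waterfall.png', width=8, height=5, dpi=150)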
Are these videos enough to say we can start working on machine learning models in the real world? Can you please help me with this?
Sir, I cannot thank you enough.
We could each create an initial model and cross-check to see what we can learn from each other.
Should we choose a different dataset, as House Prices has only 1461 samples for training?
You could also email me at my username @ hotmail.com.