[EDIT, 01/07/2018] Hi, here is my Medium post "Fastai | How to start ?". I hope it can help new participants start this ML course or the DL ones. Feel free to ask me for more information.
Lesson 1 (ML)
(notes from the video of the fastai lesson 1 about ML)
Fastai
- Blog : http://www.fast.ai/
- Forum : http://forums.fast.ai
- Fastai github : https://github.com/fastai/
** the fastai folder contains the Fastai library
** in this folder, the file imports.py contains the library imports, like `import pandas as pd`
- Fastai = a top-down teaching method (How to Learn Deep Learning (when you’re not a computer science PhD)) + an ML/DL library
- Advice from @jeremy :
** To start, focus on what things DO, not what they ARE.
** People learn by: doing (coding and building) and explaining what they’ve learned (by writing or helping others)
- Data science (prototyping models) is not software engineering.
ML Fastai
- Lesson 1 : https://www.youtube.com/watch?v=CzdWqFTmn0Y&feature=youtu.be
- Great blog about “Machine Learning 1: Lesson 1” from @hiromi
- Forum and list of videos : Another treat! Early access to Intro To Machine Learning videos
- Fast.ai “Intro to Machine Learning for Coders” Part 1 (2018): the complete collection of video timelines from @EricPB
- Fastai notebooks : https://github.com/fastai/fastai/tree/master/courses/ml1
Notebooks of the lesson 1
DL Fastai
- Deep Learning course : http://course.fast.ai/
- Forum Part 1 : http://forums.fast.ai/c/part1-v2
- Forum Part 2 : http://forums.fast.ai/c/part2-v2
GPU
- All guides on how to set up fastai on a GPU : Deep Learning Brasília - Revisão (lições 1, 2, 3 e 4)
- Local GPU : clone https://github.com/fastai/fastai after reading the guide Howto: installation on Windows
- Online GPU :
** from @pierreguillou : I recommend using Google Cloud (which comes with a $300 credit) or even Google Colab + Clouderizer (free)
** You can use as well Paperspace, AWS, Amazon SageMaker or Crestle with the fastai library (and notebooks) already installed.
Notebook “Intro to Random Forests”
- 2 lines at the top of the notebook to reload a modified fastai file automatically, without restarting the notebook :
%load_ext autoreload
%autoreload 2
- 1 line at the top of the notebook to display plots inside the notebook :
%matplotlib inline
- (TIP) : do not do too much EDA (Exploratory Data Analysis) on the data before training, in order to avoid creating bias
- define the objective (the evaluation metric) : here, RMSLE (Root Mean Squared Log Error)
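Since the notes only name the metric, here is a minimal numpy sketch of RMSLE, assuming the usual Kaggle definition (RMSE computed on log(1 + x)):

```python
import numpy as np

def rmsle(y_pred, y_true):
    # RMSLE: root mean squared error computed on log(1 + x)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# toy example: predictions close to the targets give a small error
print(rmsle(np.array([1000., 2000.]), np.array([1100., 1900.])))
```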
Learn how to use a Jupyter notebook
- Setting up a GPU and learning how to use the Jupyter notebook are very important points! (knowing Python and pandas as well)
- shift+enter : run the code of a cell
- get information about a function in a Jupyter notebook : `?function_name` (get the documentation), `??function_name` (get the source code)
- to get information about the arguments of a function, you can hit shift+tab after the name of the function (hit it 1 to 3 times to get more and more details on the arguments)
- You can run a bash command in a Jupyter notebook using `!` (exclamation mark) :
** `!ls {PATH}` (Python variables must be written inside {})
** `!ls -lh` : get the size of the files
** `!wc -l file_name` : get the number of rows of a csv file
- (from @pierreguillou) There are also magic commands in Jupyter notebooks, introduced by `%` (percentage sign)
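To make these shortcuts concrete, here is what a typical first cell could look like (a sketch: the PATH value is an assumption, adapt it to where you put the data):

```python
# Jupyter-only syntax: % magics and ! shell commands do not work in plain Python
%load_ext autoreload
%autoreload 2
%matplotlib inline

PATH = "data/bulldozers/"   # assumed data folder
!ls -lh {PATH}              # bash command; {PATH} interpolates the Python variable
!wc -l {PATH}Train.csv      # number of rows of the csv file
```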
Use the site Kaggle (ML & DL competitions)
- https://www.kaggle.com/
- Blue Book for Bulldozers
- How to get the data :
** 1) download it to your computer and then use scp to upload it (to AWS, for example)
** 2) with Firefox, you can use the Developer Tools (ctrl+shift+I) >> ‘Network’ tab : click on Download, cancel the download, and you get the link (“Copy as cURL”) for downloading. Then you can paste this curl command into a terminal : `curl "https://....." -o bulldozers.zip`. Then you can mkdir a folder and unzip the file there (`sudo apt-get install unzip` if you don’t have unzip).
** 3) (from @pierreguillou) with Google Chrome, use the CurlWget extension, following the same steps as with Firefox
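As an alternative sketch in pure Python (standard library only), assuming you already have a direct download link that works without authentication (Kaggle links usually need your session cookies, which is why the “Copy as cURL” trick above works better):

```python
import os
import urllib.request
import zipfile

url = "https://....."       # placeholder: paste a working download link here
os.makedirs("data/bulldozers", exist_ok=True)
zip_path = "data/bulldozers/bulldozers.zip"

urllib.request.urlretrieve(url, zip_path)   # download the archive
with zipfile.ZipFile(zip_path) as z:
    z.extractall("data/bulldozers")         # same as the mkdir + unzip steps above
```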
Languages to deal with notebooks about ML (or DL)
- Python (Python for Data Analysis: Data Wrangling with Pandas, Numpy, and Ipython)
- Numpy (mathematical operations on arrays, matrices, vectors, and high-dimensional tensors as if they were Python variables) : tutorial about numpy
- Pandas (structured data like csv/excel files with data in columns). Example : `pd.read_csv()` (pd is an alias for pandas, and read_csv() loads a csv file into a DataFrame)
- format strings : `f'The {PATH} to data'`
Pandas
- popular library to deal with csv files (list of tutorials about pandas)
- a pandas DataFrame looks like an R DataFrame (and a column of a pandas DataFrame is a pandas Series)
- pandas works well with numpy : you can apply a numpy function to a pandas Series. Ex: `df_raw.SalePrice = np.log(df_raw.SalePrice)`
- you can import pandas yourself, but it is already imported as `pd` by the fastai imports : `from fastai.imports import *` (check the file imports.py)
- remove a column from a DataFrame : `DataFrame.drop(column_name, axis=1)`
- `pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])` :
** `low_memory=False` : read the whole file at once so that pandas can infer consistent dtypes
** `parse_dates=[...]` : give the names of all the columns that contain dates (they will be converted to the datetime dtype)
- (TIP) In a Jupyter notebook, if you type a variable name and press ctrl+enter, whether it is a DataFrame, a video, HTML, etc., it will generally figure out a way of displaying it for you
- `df_raw.tail()` : display the last rows of the DataFrame (`df_raw.tail().T` = transposed)
- `SalePrice` is the dependent variable
- save/load a DataFrame using feather (Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow) :
** save with `df_raw.to_feather('tmp/bulldozers-raw')`
** load with `df_raw = pd.read_feather('tmp/bulldozers-raw')`
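Putting these Pandas points together, a minimal sketch of the load / transform / save cycle (the PATH value is an assumption; note that tmp/ must exist before saving to feather):

```python
import os
import numpy as np
import pandas as pd

PATH = 'data/bulldozers/'   # assumed location of the Kaggle files
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False,
                     parse_dates=["saledate"])

# the metric is RMSLE, so work with the log of the dependent variable
df_raw.SalePrice = np.log(df_raw.SalePrice)
print(df_raw.tail().T)      # last rows, transposed: one line per column

os.makedirs('tmp', exist_ok=True)   # feather needs the folder to exist
df_raw.to_feather('tmp/bulldozers-raw')
df_raw = pd.read_feather('tmp/bulldozers-raw')
```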
Random Forest
- (fastai definition) : Random Forest is a kind of universal machine learning technique.
** It is a way of predicting something of any kind : categorical (ex: dog or cat) or continuous (ex: price).
** In general it does not overfit, and it is easy to stop it from overfitting.
** In general you do not need a separate validation set : it can tell you how well it generalizes even when you only have one dataset.
** It does not assume that your data is normally distributed.
** It does not assume that the relationship is linear.
** It requires very little feature engineering.
- Definition : Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
- 2 theoretical concepts (which turn out to be wrong in practice) :
** curse of dimensionality : the more columns you have, the emptier the space of your data is (the more dimensions you have, the more points sit on the edges). In theory, this means that the distance between points is much less meaningful. But the world of machine learning has become very empirical, and it turns out that in practice, building models on lots of columns works really well.
** no free lunch theorem : the claim is that there is no type of model that works well for every kind of dataset. Nowadays there are empirical researchers who study which techniques work well most of the time; ensembles of decision trees, of which random forests are one, is perhaps the technique that most often comes out on top. Fastai provides a standard way to pre-process the data properly and to set their parameters.
- import of the Random Forest models : `from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier`
- Regression (`RandomForestRegressor`) : prediction of continuous variables
- Classification (`RandomForestClassifier`) : prediction of categorical variables
- both come from scikit-learn, the most important Machine Learning package in Python (but not always the best : XGBoost is better than its Gradient Boosting Trees)
- regression does not mean linear regression
- 2 lines :
** you create a model : `m = RandomForestRegressor(n_jobs=-1)`
** you train the model by passing first the independent variables and then the dependent variable : `m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)`
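A self-contained sketch of this two-line pattern on toy data (synthetic arrays stand in for the bulldozers DataFrame):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.random.rand(100, 5)           # 100 rows, 5 numeric features
y_cont = X.sum(axis=1)               # continuous target -> regression
y_cat = (y_cont > 2.5).astype(int)   # categorical target -> classification

m = RandomForestRegressor(n_jobs=-1)   # n_jobs=-1: use all CPU cores
m.fit(X, y_cont)
print(m.predict(X[:3]))

clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X, y_cat)
print(clf.predict(X[:3]))
```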
Missing values and feature engineering
- key points before running an ML/DL model! You must turn your data into numbers in order to train your model!
** either continuous numbers
** or categories encoded as a single number. For example, you must transform the datetime dtype into many columns of categorical numbers such as year, month, day, is it a holiday?, etc. (it really depends on what you are doing) : this is feature engineering.
- First, use the `add_datepart()` function on the datetime column (without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities) : the DateTime column is deleted and new integer columns are added (saleYear, saleDayofweek, etc., prefixed with the name of the original column).
- Then, apply `train_cats()` to the whole DataFrame to convert the string columns to the pandas category dtype (behind the scenes, it encodes each column with integers and creates a mapping between these integers and their corresponding string values). To get the same mapping on the validation/test set as on the training DataFrame, use `apply_cats(test_dataframe, training_dataframe)`.
** When there is no value in a cell, the corresponding integer (created in `cat.codes`) is -1.
** Once a DataFrame column is a category, you can use the `cat` attribute to access information. Ex: `df_raw.UsageBand.cat.categories` (get the list of category names) or `df_raw.UsageBand.cat.codes` (get the list of corresponding codes).
** If you prefer an ordinal category with a different order, you can do : `df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)`
- Finally, turn your DataFrame into a fully numerical one without missing values. To do that, use the `proc_df()` function : it splits the dependent variable off into a separate variable, replaces categories by their numeric codes (adding +1 to all values, so that the -1 code of missing values becomes 0), and handles missing continuous values (each missing continuous value is replaced by the median of its column, and a `_na` column is created with 1 where the value was missing and 0 elsewhere).
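A sketch of this whole pre-processing pipeline with the fastai (v0.7) structured-data helpers, assuming df_raw is the DataFrame loaded earlier (recent versions of proc_df also return a dict of na fill values; older ones return only df and y):

```python
# fastai 0.7 course library (courses/ml1); the fastai v1+ API is different
from fastai.imports import *
from fastai.structured import add_datepart, train_cats, proc_df

add_datepart(df_raw, 'saledate')   # replaces saledate by saleYear, saleDayofweek, ...
train_cats(df_raw)                 # strings -> pandas categories (integer codes)

# numericalize everything, fill missing continuous values with the median
# (adding _na indicator columns), and split off the dependent variable
df, y, nas = proc_df(df_raw, 'SalePrice')
```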
Run the RandomForestRegressor
m = RandomForestRegressor(n_jobs=-1)  # n_jobs=-1: train the trees on all CPU cores
m.fit(df, y)                          # df: independent variables, y: log(SalePrice)
m.score(df, y)                        # R² on the training set (1.0 is perfect)
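Note that m.score returns R² on the data you pass it; to see the competition metric, you can compute the RMSE in log space, which equals the RMSLE on the raw prices since y is already log(SalePrice). A small sketch:

```python
import numpy as np

def rmse(preds, targets):
    return np.sqrt(((preds - targets) ** 2).mean())

print(rmse(m.predict(df), y))   # RMSE in log space == RMSLE on raw prices
```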