[EDIT, 01/07/2018] Hi, here is my Medium post "Fastai | How to start ?". I hope it can help new participants start this ML course or the DL ones. Feel free to ask me for more information.
Lesson 1 (ML)
(notes from the video of the fastai lesson 1 about ML)
Fastai
- Blog : http://www.fast.ai/
- Forum : http://forums.fast.ai
- Fastai github : https://github.com/fastai/
** the fastai folder contains the Fastai library
** in this folder, the file imports.py contains the library imports, like `import pandas as pd`
- Fastai = a top-down teaching method (How to Learn Deep Learning (when you’re not a computer science PhD)) + an ML/DL library
- Advice from @jeremy :
** To start, focus on what things DO, not what they ARE.
** People learn by: doing (coding and building) and explaining what they’ve learned (by writing or helping others)
- Data science (prototyping models) is not software engineering.
ML Fastai
- Lesson 1 : https://www.youtube.com/watch?v=CzdWqFTmn0Y&feature=youtu.be
- Great blog about “Machine Learning 1: Lesson 1” from @hiromi
- Forum and list of videos : Another treat! Early access to Intro To Machine Learning videos
- Fast.ai “Intro to Machine Learning for Coders” Part 1 (2018): the complete collection of video timelines from @EricPB
- Fastai notebooks : https://github.com/fastai/fastai/tree/master/courses/ml1
Notebooks of the lesson 1
DL Fastai
- Deep Learning course : http://course.fast.ai/
- Forum Part 1 : http://forums.fast.ai/c/part1-v2
- Forum Part 2 : http://forums.fast.ai/c/part2-v2
GPU
- All guides on how to set up fastai on a GPU : Deep Learning Brasília - Revisão (lições 1, 2, 3 e 4)
- Local GPU : clone https://github.com/fastai/fastai after reading the guide Howto: installation on Windows
- Online GPU :
** from @pierreguillou : I recommend using Google Cloud (which comes with a $300 credit) or even Google Colab + Clouderizer (free)
** You can use as well Paperspace, AWS, Amazon SageMaker or Crestle with the fastai library (and notebooks) already installed.
Notebook “Intro to Random Forests”
- 2 lines at the top of the notebook to reload a modified fastai file automatically, without restarting the notebook :
%load_ext autoreload
%autoreload 2
- 1 line at the top of the notebook to display plots inside the notebook :
%matplotlib inline
- (TIP) : do not do too much EDA (Exploratory Data Analysis) on the data before training, in order to avoid creating bias
- define the objective (the evaluation metric) : here, RMSLE (Root Mean Squared Log Error)
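Since the notes only name the metric, here is a minimal numpy sketch of RMSLE, assuming the usual Kaggle definition (RMSE computed on log(1 + x)):

```python
import numpy as np

def rmsle(y_pred, y_true):
    # RMSLE: root mean squared error computed on log(1 + x)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# toy example: predictions close to the targets give a small error
print(rmsle(np.array([1000., 2000.]), np.array([1100., 1900.])))
```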
Learn how to use a Jupyter notebook
- Setting up a GPU and learning how to use the Jupyter notebook are very important points! (knowing Python and pandas as well)
- shift+enter : run the code of a cell
- get information about a function in a Jupyter notebook : `?function_name` (get the documentation), `??function_name` (get the source code)
- to get information about the arguments of a function, you can hit shift+tab after the name of the function (hit it 1 to 3 times to get more and more details on the arguments)
- You can run a bash command in a Jupyter notebook using `!` (exclamation mark) :
** `!ls {PATH}` (Python variables must be written inside {})
** `!ls -lh` : get the size of the files
** `!wc -l file_name` : get the number of rows of a csv file
- (from @pierreguillou) There are also magic commands in Jupyter notebooks, introduced by `%` (percentage sign)
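To make these shortcuts concrete, here is what a typical first cell could look like (a sketch: the PATH value is an assumption, adapt it to where you put the data):

```python
# Jupyter-only syntax: % magics and ! shell commands do not work in plain Python
%load_ext autoreload
%autoreload 2
%matplotlib inline

PATH = "data/bulldozers/"   # assumed data folder
!ls -lh {PATH}              # bash command; {PATH} interpolates the Python variable
!wc -l {PATH}Train.csv      # number of rows of the csv file
```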
Use the site Kaggle (ML & DL competitions)
- https://www.kaggle.com/
- Blue Book for Bulldozers
- How to get the data :
** 1) download it to your computer and then use scp to upload it (to AWS, for example)
** 2) with Firefox, you can use the Developer Tools (ctrl+shift+I) >> ‘Network’ tab : click on Download, cancel the download, and you get the link (“Copy as cURL”) for downloading. Then you can paste this curl command into a terminal : `curl "https://....." -o bulldozers.zip`. Then you can mkdir a folder and unzip the file there (`sudo apt-get install unzip` if you don’t have unzip).
** 3) (from @pierreguillou) with Google Chrome, use the CurlWget extension, following the same steps as with Firefox
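As an alternative sketch in pure Python (standard library only), assuming you already have a direct download link that works without authentication (Kaggle links usually need your session cookies, which is why the “Copy as cURL” trick above works better):

```python
import os
import urllib.request
import zipfile

url = "https://....."       # placeholder: paste a working download link here
os.makedirs("data/bulldozers", exist_ok=True)
zip_path = "data/bulldozers/bulldozers.zip"

urllib.request.urlretrieve(url, zip_path)   # download the archive
with zipfile.ZipFile(zip_path) as z:
    z.extractall("data/bulldozers")         # same as the mkdir + unzip steps above
```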
Languages to deal with notebooks about ML (or DL)
- Python (Python for Data Analysis: Data Wrangling with Pandas, Numpy, and Ipython)
- Numpy (mathematical operations on arrays, matrices, vectors, and high-dimensional tensors as if they were Python variables) : tutorial about numpy
- Pandas (structured data like csv/excel files with data in columns). Example : `pd.read_csv()` (pd is an alias for pandas, and read_csv() loads a csv file into a DataFrame)
- format strings : `f'The {PATH} to data'`
Pandas
- popular library to deal with csv files (list of tutorials about pandas)
- a pandas DataFrame looks like an R DataFrame (and a column of a pandas DataFrame is a pandas Series)
- pandas works well with numpy : you can apply a numpy function to a pandas Series. Ex: `df_raw.SalePrice = np.log(df_raw.SalePrice)`
- you can import pandas yourself, but it is already imported as `pd` by the fastai imports : `from fastai.imports import *` (check the file imports.py)
- remove a column from a DataFrame : `DataFrame.drop(column_name, axis=1)`
- `pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])` :
** `low_memory=False` : read the whole file at once so that pandas can infer consistent dtypes
** `parse_dates=[...]` : give the names of all the columns that contain dates (they will be converted to the datetime dtype)
- (TIP) In a Jupyter notebook, if you type a variable name and press ctrl+enter, whether it is a DataFrame, a video, HTML, etc., it will generally figure out a way of displaying it for you
- `df_raw.tail()` : display the last rows of the DataFrame (`df_raw.tail().T` = transposed)
- `SalePrice` is the dependent variable
- save/load a DataFrame using feather (Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow) :
** save with `df_raw.to_feather('tmp/bulldozers-raw')`
** load with `df_raw = pd.read_feather('tmp/bulldozers-raw')`
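Putting these Pandas points together, a minimal sketch of the load / transform / save cycle (the PATH value is an assumption; note that tmp/ must exist before saving to feather):

```python
import os
import numpy as np
import pandas as pd

PATH = 'data/bulldozers/'   # assumed location of the Kaggle files
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False,
                     parse_dates=["saledate"])

# the metric is RMSLE, so work with the log of the dependent variable
df_raw.SalePrice = np.log(df_raw.SalePrice)
print(df_raw.tail().T)      # last rows, transposed: one line per column

os.makedirs('tmp', exist_ok=True)   # feather needs the folder to exist
df_raw.to_feather('tmp/bulldozers-raw')
df_raw = pd.read_feather('tmp/bulldozers-raw')
```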
Random Forest
- (fastai definition) : Random Forest is a kind of universal machine learning technique.
** It is a way of predicting something of any kind : categorical (ex: dog or cat) or continuous (ex: price).
** In general it does not overfit, and it is easy to stop it from overfitting.
** In general you do not need a separate validation set : it can tell you how well it generalizes even when you only have one dataset.
** It does not assume that your data is normally distributed.
** It does not assume that the relationship is linear.
** It requires very little feature engineering.
- Definition : Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
- 2 theoretical concepts (which turn out to be wrong in practice) :
** curse of dimensionality : the more columns you have, the emptier the space of your data is (the more dimensions you have, the more points sit on the edges). In theory, this means that the distance between points is much less meaningful. But the world of machine learning has become very empirical, and it turns out that in practice, building models on lots of columns works really well.
** no free lunch theorem : the claim is that there is no type of model that works well for every kind of dataset. Nowadays there are empirical researchers who study which techniques work well most of the time; ensembles of decision trees, of which random forests are one, is perhaps the technique that most often comes out on top. Fastai provides a standard way to pre-process the data properly and to set their parameters.
- import of the Random Forest models : `from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier`
- Regression (`RandomForestRegressor`) : prediction of continuous variables
- Classification (`RandomForestClassifier`) : prediction of categorical variables
- both come from scikit-learn, the most important Machine Learning package in Python (but not always the best : XGBoost is better than its Gradient Boosting Trees)
- regression does not mean linear regression
- 2 lines :
** you create a model : `m = RandomForestRegressor(n_jobs=-1)`
** you train the model by passing first the independent variables and then the dependent variable : `m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)`
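A self-contained sketch of this two-line pattern on toy data (synthetic arrays stand in for the bulldozers DataFrame):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.random.rand(100, 5)           # 100 rows, 5 numeric features
y_cont = X.sum(axis=1)               # continuous target -> regression
y_cat = (y_cont > 2.5).astype(int)   # categorical target -> classification

m = RandomForestRegressor(n_jobs=-1)   # n_jobs=-1: use all CPU cores
m.fit(X, y_cont)
print(m.predict(X[:3]))

clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X, y_cat)
print(clf.predict(X[:3]))
```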
Missing values and feature engineering
- key points before running an ML/DL model! You must turn your data into numbers in order to train your model!
** either continuous numbers
** or categories encoded as a single number. For example, you must transform the datetime dtype into many columns of categorical numbers such as year, month, day, is it a holiday?, etc. (it really depends on what you are doing) : this is feature engineering.
- First, use the `add_datepart()` function on the datetime column (without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities) : the DateTime column is deleted and new integer columns are added (saleYear, saleDayofweek, etc., prefixed with the name of the original column).
- Then, apply `train_cats()` to the whole DataFrame to convert the string columns to the pandas category dtype (behind the scenes, it encodes each column with integers and creates a mapping between these integers and their corresponding string values). To get the same mapping on the validation/test set as on the training DataFrame, use `apply_cats(test_dataframe, training_dataframe)`.
** When there is no value in a cell, the corresponding integer (created in `cat.codes`) is -1.
** Once a DataFrame column is a category, you can use the `cat` attribute to access information. Ex: `df_raw.UsageBand.cat.categories` (get the list of category names) or `df_raw.UsageBand.cat.codes` (get the list of corresponding codes).
** If you prefer an ordinal category with a different order, you can do : `df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)`
- Finally, turn your DataFrame into a fully numerical one without missing values. To do that, use the `proc_df()` function : it splits the dependent variable off into a separate variable, replaces categories by their numeric codes (adding +1 to all values, so that the -1 code of missing values becomes 0), and handles missing continuous values (each missing continuous value is replaced by the median of its column, and a `_na` column is created with 1 where the value was missing and 0 elsewhere).
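A sketch of this whole pre-processing pipeline with the fastai (v0.7) structured-data helpers, assuming df_raw is the DataFrame loaded earlier (recent versions of proc_df also return a dict of na fill values; older ones return only df and y):

```python
# fastai 0.7 course library (courses/ml1); the fastai v1+ API is different
from fastai.imports import *
from fastai.structured import add_datepart, train_cats, proc_df

add_datepart(df_raw, 'saledate')   # replaces saledate by saleYear, saleDayofweek, ...
train_cats(df_raw)                 # strings -> pandas categories (integer codes)

# numericalize everything, fill missing continuous values with the median
# (adding _na indicator columns), and split off the dependent variable
df, y, nas = proc_df(df_raw, 'SalePrice')
```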
Run the RandomForestRegressor
m = RandomForestRegressor(n_jobs=-1)  # n_jobs=-1: train the trees on all CPU cores
m.fit(df, y)                          # df: independent variables, y: log(SalePrice)
m.score(df, y)                        # R² on the training set (1.0 is perfect)
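Note that m.score returns R² on the data you pass it; to see the competition metric, you can compute the RMSE in log space, which equals the RMSLE on the raw prices since y is already log(SalePrice). A small sketch:

```python
import numpy as np

def rmse(preds, targets):
    return np.sqrt(((preds - targets) ** 2).mean())

print(rmse(m.predict(df), y))   # RMSE in log space == RMSLE on raw prices
```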