Lecture 5
Build a tabular model from scratch
00:00 Tabular model from scratch
00:42 Review Titanic dataset and the two models in Excel
01:30 From Excel to Python
Linear model and neural net from scratch
01:57 Clean version of notebook
- What does a clean version of the notebook look like?
02:38 Get comfortable in Paperspace Gradient
- How to work with JupyterLab mode instead of the default mode?
- How to switch between JupyterLab mode and Jupyter notebook mode?
- Learn some useful keyboard shortcuts
04:30 Things to do with clean notebook
05:17 Same notebook runs on Kaggle and everywhere
06:42 Libraries and format setup
07:24 Read train.csv as Dataframe
- How to read and display a CSV file as a pandas DataFrame? #data-cleaning (sketch below)
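A minimal sketch of the pandas calls (the train.csv path is an assumption):

```python
import pandas as pd

df = pd.read_csv('train.csv')  # read the Titanic training data as a DataFrame
df                             # a bare name on the last line displays it in Jupyter
```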
07:47 Find and count missing data with pandas
09:32 Choose the mode value for the missing data
- What is the most common choice for replacing missing data, regardless of whether the column is categorical or continuous? mode #data-cleaning
- How to select the first mode value if there are two modes available for a column?
10:43 Be proactively curious
12:22 Replace missing data with mode values
- How to fill in the missing data with the mode values of those columns, with or without creating a new DataFrame? #data-cleaning (sketch below)
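A sketch of the missing-data steps above, following the lecture notebook's approach:

```python
df.isna().sum()                 # count missing values per column

modes = df.mode().iloc[0]       # iloc[0] keeps the first mode when a column has two

df.fillna(modes, inplace=True)  # inplace=True avoids creating a new DataFrame
```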
13:14 Keep things simple where we can
- Why use the world's simplest way of filling missing data?
- Does this simplest way work most of the time?
- Do we always know that a more complicated way would help?
13:54 Don't throw out rows or columns
14:53 Describe your data or columns
15:52 See your columns as histograms
- What to do with interesting columns? #data-describing
- What can you find out with a histogram?
- What is a long-tailed distribution of a column? What does it look like?
16:26 Log transformation on long-tailed columns
- Which models don't like long-tailed distributions in the data? #best-practice #data-describing
- What is the easiest way to turn a long-tailed distribution into a centered one?
- Where to find more about the log and the log curve?
- What does log do, in one sentence? 17:11
- How to avoid the problem of log(0)? adding 1
- What does the column's histogram look like after the log transformation? (see the sketch after this list)
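A sketch of the transformation; LogFare as the new column name follows the course notebook, if I recall correctly:

```python
import numpy as np

# add 1 first so log(0) can never happen
df['LogFare'] = np.log(df['Fare'] + 1)
df['LogFare'].hist()  # the histogram should now look much more centered
```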
17:53 Most likely long-tailed data
- What kinds of data are most likely to be long-tailed and need a log transformation? #best-practice
18:10 Describe non-numerical columns
- How to describe seemingly numerical but actually categorical columns? #data-describing
- How to describe all non-numeric columns at once? (sketch below)
- What does this description look like? (how does it differ from that of numeric data)
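A one-line sketch using pandas:

```python
# count, unique, top (most frequent) and freq for every non-numeric column
df.describe(include=[object])
```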
18:55 Apply coefficients on categorical columns
- How to apply coefficients to categorical columns? #data-cleaning
- What does it mean to apply dummy variables to categorical columns?
- What are the two ways of getting dummy variables, and what's Jeremy's view on them?
- What does the dummy-variable transformation of categorical variables look like? (sketch below)
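A sketch of the approach Jeremy prefers (pd.get_dummies rather than manual encoding); the column list matches the Titanic dataset's categorical columns:

```python
# one 0/1 column per category level, e.g. Sex_male, Sex_female
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])
```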
21:13 The secret power of the name column
- Can a model built only on the name column score No. 1 in the Titanic competition? #surprise
- Where to find more about it?
- This technique is not covered in this lecture
23:01 Tensor
- Why focus on PyTorch rather than NumPy?
- What data format does PyTorch require? How to do this data format conversion? #data-cleaning
- What is a tensor? Where did it come from?
- How to turn all independent columns into a single large tensor?
- What number type do tensors need? float
- How to check the shape of a tensor? (number of rows and columns)
- How to check the rank/dimensions/axes of a tensor? What is rank?
- What is the rank of a vector, a table/matrix, or a scalar? (see the sketch after this list)
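A sketch of the conversion; indep_cols is a hypothetical list of the independent column names kept after the steps above:

```python
import torch

t_dep = torch.tensor(df['Survived'].values)  # dependent column as a tensor

# stack all independent columns into one large float tensor
t_indep = torch.tensor(df[indep_cols].values, dtype=torch.float)

t_indep.shape       # e.g. torch.Size([891, 12]): rows and columns
len(t_indep.shape)  # the rank: 2 for a matrix, 1 for a vector, 0 for a scalar
```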
26:46 Create random coefficients
- Why don't we need a constant here as in Excel? #question
- How many coefficients do we need? How do we figure it out?
- How to create a vector of randomized numbers for the coefficients?
- How to center the coefficient values? Why is this important? #question (answered later)
27:54 Reproducibility of coefficients
- How to create the same set of random numbers for your coefficients each time the cell runs? (sketch below)
- When should you, and shouldn't you, use a random seed to make your results reproducible?
- How can not using a random seed help you understand your model intuitively?
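A sketch of the initialization; the seed value is arbitrary, and you can skip it when you want different random coefficients on each run:

```python
torch.manual_seed(442)              # optional: makes the cell reproducible

n_coeff = t_indep.shape[1]          # one coefficient per independent column
coeffs = torch.rand(n_coeff) - 0.5  # rand gives [0, 1); subtracting 0.5 centers on 0
```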
29:30 Broadcasting: data * coefficients operation on GPU
- What is broadcasting? Isn't it just matrix and vector multiplication?
- Where did it come from?
- What are the benefits of using broadcasting? (see the sketch after this list)
- simple code vs lots of boilerplate
- coded and optimized in C for GPU computation
- What's the rule of broadcasting, and where to find more about it?
- a great blog post helps in understanding broadcasting
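A sketch of the broadcast operation discussed here:

```python
# coeffs (a vector) is broadcast across every row of t_indep (a matrix):
# no Python loop or boilerplate, and it runs in optimized C/GPU code
t_indep * coeffs
```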
33:43 Normalization: make the same range of values for each column
- What would happen if the values of one column were much larger than the values of the other columns?
- Why make every data column have the same range of values?
- How to achieve the same range for all column values?
- What are the two major ways of doing normalization?
- Does Jeremy favor one over the other? (see the sketch after this list)
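A sketch of the max-based normalization used in the lecture notebook (dividing by the standard deviation is the other common way):

```python
vals, indices = t_indep.max(dim=0)  # per-column maximums
t_indep = t_indep / vals            # every column now lies in the 0-1 range
```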
37:08 Sum up to get predictions
- How to sum up the multiplication of each row with the coefficients, and do it for all rows?
- Is the summed-up number the prediction for each person/row of data?
37:32 A default choice for loss function
- How to make the model better? Gradient descent
- What is needed to do gradient descent? A loss function
- What does a loss function do? Measure the performance of the coefficients
- What is Jeremy's default/favorite choice of loss function?
- Why does Jeremy always write the loss function out manually when experimenting? (see the sketch after this list)
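A sketch of the prediction and of the simple default loss, mean absolute error:

```python
# multiply each row by the coefficients and sum across columns: one number per person
preds = (t_indep * coeffs).sum(axis=1)

# mean absolute error: how far the predictions are from the dependent values
loss = torch.abs(preds - t_dep).mean()
```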
38:24 Make notebook readable/understandable in the future
39:39 Update coefficients with gradient descent in PyTorch
- How to ask PyTorch to compute gradients on the coefficients?
- How to ask PyTorch to update values on the same coefficients tensor (not create a new one)?
- What does the loss function do besides giving us a loss value? What does it store?
- What function to run on the loss to calculate gradients for the coefficients?
- How to access the gradients of the coefficients, and how to interpret them?
- How to decide whether to subtract or add the gradients to the coefficients? #question
- How to choose the learning rate? #question
- How to calculate the updated loss with the renewed coefficients? (see the sketch after this list)
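A minimal sketch of the full gradient-descent step, assuming the t_indep, t_dep and coeffs tensors from above (the 0.1 learning rate matches the outline):

```python
coeffs.requires_grad_()  # trailing _ means in place: track gradients on this tensor

loss = torch.abs((t_indep * coeffs).sum(axis=1) - t_dep).mean()
loss.backward()          # fills in coeffs.grad

with torch.no_grad():
    coeffs.sub_(coeffs.grad * 0.1)  # update the same tensor in place (lr = 0.1)
    coeffs.grad.zero_()             # gradients accumulate, so reset them
    print(torch.abs((t_indep * coeffs).sum(axis=1) - t_dep).mean())  # updated loss
```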
41:54 Split the dataset
- Why did Jeremy randomly split the training and validation sets for the Titanic dataset? #data-splitting
- Why use fastai's random splitter function?
- How to create the training and validation sets with the splitter function? (sketch below)
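A sketch using fastai's splitter, which returns two lists of row indices:

```python
from fastai.data.transforms import RandomSplitter

trn_split, val_split = RandomSplitter(seed=42)(df)  # seed makes the split repeatable
trn_indep, val_indep = t_indep[trn_split], t_indep[val_split]
trn_dep, val_dep = t_dep[trn_split], t_dep[val_split]
```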
43:41 Encapsulate functions for model training
- How does Jeremy create functions like init_coeffs, update_coeffs, one_epoch, and train_model from the exploratory steps above? (see the sketch after this list)
- How to use the train_model function to see how well the model works?
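A sketch of how the exploratory steps might be wrapped up; the function names follow the outline, while the bodies are assumptions built from the steps already covered:

```python
def init_coeffs():
    return (torch.rand(n_coeff) - 0.5).requires_grad_()

def calc_preds(coeffs, indep):
    return (indep * coeffs).sum(axis=1)

def calc_loss(coeffs, indep, dep):
    return torch.abs(calc_preds(coeffs, indep) - dep).mean()

def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad():
        update_coeffs(coeffs, lr)
    print(f'{loss:.3f}', end='; ')  # watch the loss fall epoch by epoch

def train_model(epochs=30, lr=0.1):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for _ in range(epochs):
        one_epoch(coeffs, lr)
    return coeffs
```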
44:50 Titanic dataset is a good playground
45:24 Display coefficients
- How to display the final coefficients?
- How to interpret them? #question
- Can we make some sense of the values inside?
46:01 Accuracy as a metric
- Why not use accuracy as the loss function? #question
- What can we use an accuracy function for?
- What threshold did Jeremy use for survival?
- How to calculate accuracy and put it into a function? (sketch below)
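A sketch, reusing calc_preds and the validation tensors from above:

```python
def acc(coeffs):
    # a prediction above 0.5 counts as 'survived'; compare against the truth
    return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()
```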
48:07 Sigmoid function: ease coefficient optimization
- What do you see in the predictions that makes you think of using a sigmoid function? #surprise
- Why can the sigmoid function really make optimization easier for the model?
- Why is the plateau at each end of the function good for optimization? (it tolerates very large and very small predictions rather than forcing every prediction to get close to 1 or 0)
- Why is the nearly straight middle part of the function's plot also what we want? 48:58
- How to plot any function with just one line of code? What library is this? sympy
- How to update the calc_preds function with the sigmoid function easily in Jupyter? 50:52 (see the sketch after this list)
- Why center predictions on 0 before the sigmoid function? (a reply by Jeremy)
- Do you remember what Jeremy did to make predictions center on 0? (see how the initial coefficients are defined; a cell link on Kaggle)
- Why does allowing predictions to be large or small make weight optimization easier? (Jeremy's reply)
- How do Python and Jupyter make exploratory work so easy?
- How come the learning rate jumps from 0.1 before the sigmoid to 2 after using the sigmoid? #question 51:57
- When, or how often (as a rule), should you apply a sigmoid function to your predictions? 52:23
- Does the HF library specify whether it uses a sigmoid or not? (probably the others don't either)
- For optimization, what you need to watch is the input and the output, not the middle, for now. Why is that? 53:13
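A sketch of the two pieces discussed: the one-line sympy plot, and calc_preds updated to pass the sum through a sigmoid:

```python
import sympy
sympy.plot("1 / (1 + exp(-x))", xlim=(-5, 5))  # plot the sigmoid in one line

def calc_preds(coeffs, indep):
    # sigmoid squashes any sum into (0, 1), so very large or small sums are fine
    return torch.sigmoid((indep * coeffs).sum(axis=1))
```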
54:17 What if the test dataset has extra columns?
55:58 Submit to Kaggle
- How and why did Jeremy replace a missing Fare value with 0?
- How to apply the above data-cleaning steps to the test set?
- How to prepare the output column expected by Kaggle?
- How to create the submission file expected by Kaggle? (see the sketch after this list)
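A sketch of the submission steps, assuming the cleaning steps from earlier are repeated on the test set to produce a tst_indep tensor:

```python
tst_df = pd.read_csv('test.csv')
tst_df['Fare'] = tst_df['Fare'].fillna(0)  # the one missing Fare becomes 0
# ...then the same mode-filling, log, dummies, tensor and normalization steps...

tst_df['Survived'] = (calc_preds(coeffs, tst_indep) > 0.5).int()  # 0/1 as Kaggle expects
sub_df = tst_df[['PassengerId', 'Survived']]
sub_df.to_csv('sub.csv', index=False)
```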
Key steps from linear model to neural net
58:24 val_indep * coeffs vs val_indep @ coeffs
- What does val_indep * coeffs mean? Is it element-wise? Is it matrix-vector multiplication?
- What do we know about val_indep @ coeffs? Is it matrix-matrix multiplication?
- Is (val_indep * coeffs).sum(axis=1) equal to val_indep @ coeffs?
- What should we know about them to distinguish them properly? #question
- In val_indep @ coeffs, when coeffs is a matrix, do we need to specify its shape? 59:50
- How to initialize the coefficients as a one-column matrix rather than a vector? (sketch below)
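A sketch of the comparison, assuming coeffs starts as a vector as before:

```python
# with coeffs as a vector, these produce the same numbers:
(val_indep * coeffs).sum(axis=1)  # broadcast element-wise multiply, then sum
val_indep @ coeffs                # @ is Python's matrix-multiplication operator

# to use matrix rules throughout, make coeffs an (n_coeff, 1) matrix;
# val_indep @ coeffs then returns a (rows, 1) matrix instead of a vector
coeffs = (torch.rand(n_coeff, 1) - 0.5).requires_grad_()
```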
1:01:26 Transform existing vectors into matrices
- How to turn both trn_dep and val_dep from existing vectors into matrices that correspond to the coeffs matrix? (sketch below)
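A sketch of the reshaping:

```python
# None inserts a new trailing axis: shape (n,) becomes (n, 1),
# matching the (rows, 1) output of val_indep @ coeffs
trn_dep = trn_dep[:, None]
val_dep = val_dep[:, None]
```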
Building a neural net
1:03:31 Keep linear model and neural net comparable in output
- How to create a hidden layer with multiple neurons inside (i.e., a matrix of coefficients rather than a vector of coefficients)?
- Why divide the hidden layer's coefficients by the number of hidden activations (the matrix's columns)?
- Is it to make sure the outputs of the neural net and the previous linear model are comparable?
1:05:31 Build the output layer
1:06:10 Trying to get the training started
- Why does Jeremy subtract 0.3 from the coefficients of the output layer?
- What does it mean that this minus 0.3 gets the training started? #question
- (I guess Jeremy may have tried -0.5 first; experiment to find out)
1:06:26 Adding a constant or not
- Why don't we need a constant for layer 1 (think of the constant in the linear model)?
- Why must we have a constant for layer 2?
- Do the coefficients of layer 1, layer 2, and the constant all need their own gradients initiated?
1:07:05 Building the model
- What is a tuple, and how is it used to group and separate the three sets of coefficients?
- How to construct the prediction function by sending data through layer 1 and layer 2 and finally adding the constant? #model-building (see the sketch after this list)
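A sketch close to the lecture notebook's neural net, assuming n_coeff from earlier; the hidden width of 20 and the exact init constants are assumptions:

```python
import torch.nn.functional as F

n_hidden = 20  # assumed hidden width

def init_coeffs():
    layer1 = (torch.rand(n_coeff, n_hidden) - 0.5) / n_hidden  # keep sums comparable
    layer2 = torch.rand(n_hidden, 1) - 0.3                     # the -0.3 that gets training going
    const = torch.rand(1)[0]                                   # layer 2's constant
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()

def calc_preds(coeffs, indep):
    l1, l2, const = coeffs         # the tuple groups the three pieces
    res = F.relu(indep @ l1)       # hidden layer, then ReLU activation
    res = res @ l2 + const         # output layer plus its constant
    return torch.sigmoid(res)
```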
1:08:08 A neural net done, but super fiddly
- How to update all three sets of coefficients in a loop?
- Did you notice that the learning rate changed again? (1.4; last time it was 2, earlier 0.1)
- Why did Jeremy say getting this model to work was super fiddly?
From a neural net (1 hidden layer) to deep learning (2 hidden layers)
1:09:09 Initialize coefficients of all hidden layers and constants
- How to initialize the coefficients of 2 hidden layers plus 1 output layer and the constants, and get gradients ready for all of them, in one compact function? #model-building
- What is the shape of each coefficient matrix?
1:10:36 Building the 2 hidden layer model
- What are activation functions?
- What are the activation functions for the 2 hidden layers?
- What is the activation function for the output layer?
- What is the most common mistake in applying an activation function to the final layer? (see the sketch after this list)
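A sketch of the deep model in the spirit of the lecture notebook; the hidden sizes and the init scaling constants are the fiddly numbers mentioned below, and the exact values here are assumptions:

```python
def init_coeffs(hiddens=[10, 10]):  # two hidden layers
    sizes = [n_coeff] + hiddens + [1]
    # one coefficient matrix per layer; the scaling constants are the fiddly part
    layers = [(torch.rand(sizes[i], sizes[i + 1]) - 0.3) / sizes[i + 1] * 4
              for i in range(len(sizes) - 1)]
    consts = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(len(sizes) - 1)]
    for l in layers + consts:
        l.requires_grad_()
    return layers, consts

def calc_preds(coeffs, indep):
    layers, consts = coeffs
    res = indep
    for i, l in enumerate(layers):
        res = res @ l + consts[i]
        if i != len(layers) - 1:
            res = F.relu(res)      # ReLU between layers...
    return torch.sigmoid(res)      # ...but only a sigmoid on the final output
```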
1:11:46 Train the model
- Don't forget to update the gradients
- Which numbers did Jeremy still have to fiddle with to get right?
- Did this deep learning model improve the loss and accuracy?
1:12:12 Dissect and experiment with large functions
- How to experiment with a large function like init_coeffs by breaking it into small pieces and running them?
1:13:16 Tabular datasets: where deep learning does not shine
- How should we think about the fact that neither the neural net nor the deep learning model did better?
- What does it mean that a carefully designed algorithm beat all deep learning models in the Titanic competition?
- On what kinds of datasets does deep learning generally perform better?
Framework: DL without the super fiddling (notebook)
1:15:28 Why use a framework rather than build from scratch
- Why should you use a framework in real life rather than building it yourself as above?
- When to do it from scratch?
- What can a framework do for us?
- Can it automate the obvious things, like initialization, learning rate, dummy variables, normalization, etc.?
- Can I still make choices on the not-so-obvious things?
1:16:34 Feature engineering with pandas
1:18:25 Automate the obvious when preparing the dataset
- How does the framework make categorifying data, filling missing data, and normalization automatic?
- How to specify that the dependent column is a category?
1:19:37 Build multiple hidden layers with one line of code
1:19:56 Automate the search for the best learning rate (see the sketch below)
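A sketch of the framework version, assuming fastai's tabular API as used in the course notebook; the column lists are illustrative, and LogFare assumes the earlier feature engineering:

```python
from fastai.tabular.all import *

# Categorify, FillMissing and Normalize automate the cleaning done by hand above
splits = RandomSplitter(seed=42)(df)
dls = TabularPandas(
    df, splits=splits,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=['Sex', 'Pclass', 'Embarked'],      # illustrative column lists
    cont_names=['Age', 'SibSp', 'Parch', 'LogFare'],
    y_names='Survived', y_block=CategoryBlock(),  # dependent column as a category
).dataloaders(path='.')

learn = tabular_learner(dls, metrics=accuracy, layers=[10, 10])  # 2 hidden layers
learn.lr_find()  # plots loss vs learning rate and suggests one
```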
Predict and Submit with ease
1:21:29 Automate transformation of the test set in one line of code
- How to automatically apply all the transformations done to the training and validation sets to the test set? (sketch below)
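A sketch, assuming the learner above and the same feature engineering on the test DataFrame:

```python
tst_df = pd.read_csv('test.csv')
tst_df['Fare'] = tst_df['Fare'].fillna(0)
tst_df['LogFare'] = np.log(tst_df['Fare'] + 1)

# test_dl applies every preprocessing step learned from the training data
tst_dl = learn.dls.test_dl(tst_df)
preds, _ = learn.get_preds(dl=tst_dl)
```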
Experiment with Ensembling
1:23:03 Ensembling is easy with fastai
1:23:36 The simplest ensemble
- What does a simple ensemble look like?
- How to build, run, and predict with 5 models with ease?
- How different are those 5 models? (only their initial coefficients differ)
- How to combine their predictions?
- How much improvement does this simple ensemble get us? (see the sketch after this list)
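A sketch of the simplest ensemble, reusing the dls and tst_dl assumed above; the fit settings are illustrative:

```python
def one_model():
    learn = tabular_learner(dls, metrics=accuracy, layers=[10, 10])
    with learn.no_bar(), learn.no_logging():
        learn.fit(16, lr=0.03)        # illustrative epochs and learning rate
    return learn.get_preds(dl=tst_dl)[0]

# five models that differ only in their random initial coefficients
all_preds = [one_model() for _ in range(5)]
ens_preds = torch.stack(all_preds).mean(0)  # combine by averaging
```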
1:25:20 Ways to combine the predictions
- Why use the mean rather than the mode?
- What are the 3 ways of combining the predictions?
- Is one better than the others?
- What's Jeremy's suggestion?
How Random Forests really work
- Is this a good place to also learn pandas and numpy?
1:26:38 Why Random Forest
- What is the history between random forests and Jeremy?
- What does Jeremy think of random forests?
- Why are random forests so much easier and better?
- Why is the seemingly simple logistic regression so easy to get wrong?
1:28:34 Pandas categorical function
- How to import all the libraries you need at once?
- How to do fillna with pandas and log with NumPy?
- What does the pandas categorical function do for us?
- What is the friendly display after the function is applied?
- What is the actual data transformation under the hood? (see the sketch after this list)
- Key points: no dummy variables needed, and Pclass no longer needs to be treated as a category
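A sketch of the pandas call in question (column names from the Titanic dataset):

```python
# a categorical column displays friendly labels but stores integer codes
df['Sex'] = pd.Categorical(df.Sex)
df['Embarked'] = pd.Categorical(df.Embarked)

df.Sex.head()     # shows 'male'/'female' plus the category list
df.Sex.cat.codes  # the integers actually stored under the hood
```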
1:30:54 Binary splits: the basis of random forests
1:31:14 A binary split on gender
1:32:32 Build a binary split model on gender with sklearn
1:33:59 Build a binary split model on ticket price with sklearn
1:35:38 Build a score machine for binary splits, regardless of categorical or continuous columns
- What is a good split?
- Is it good that within each group the dependent values are similar?
- How to measure the similarity of values within a group? std
- How to compare standard deviations between two groups appropriately? (multiply by the group size)
- How to calculate the score for evaluating a split from the combined std of the two groups? (see the sketch after this list)
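A sketch of such a scoring function, along the lines of the lecture notebook (lower is better; the names are assumptions):

```python
def side_score(side, y):
    tot = side.sum()
    if tot <= 1:
        return 0
    # std measures how similar the dependent values are within the group;
    # weight by group size so tiny groups don't look artificially good
    return y[side].std() * tot

def score(col, y, split):
    lhs = col <= split  # boolean mask for the left-hand group
    return (side_score(lhs, y) + side_score(~lhs, y)) / len(y)
```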
1:39:01 Automate the score machine on all columns
- How to find the best binary split by trying out all possible split points of a column? (sketch below)
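A sketch building on the score function above, assuming numeric columns (categories via their integer codes); min_col is a hypothetical name:

```python
import numpy as np

def min_col(df, nm, dep='Survived'):
    col, y = df[nm], df[dep]
    unq = col.dropna().unique()
    # try every distinct value as a split point and keep the lowest score
    scores = np.array([score(col, y, v) for v in unq if not np.isnan(v)])
    idx = scores.argmin()
    return unq[idx], scores[idx]
```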
1:41:26 1R model as the baseline
- What is a random forest?
- What is a 1R model?
- How good was it in the ML world of the 90s?
- Should we always go for complicated models?
- Should we always start with a 1R model as the baseline for our problem?