Hi people!!! As part of this study group, we are starting an algorithms meetup to hone our expertise in using data structures and algorithms, which can be useful for interviews as well.
Preparing for LeetCode-style coding interviews can be very challenging because the material is scattered and finding a good explanation for a problem can take time. A friend and I prepared for these interviews together, and I intend to cover some of the patterns we learnt (related to data structures and algorithms) that were useful to us. We both got new jobs after weeks of preparation and iteratively figuring out how not to fail. Please note that I will just be sharing my experience and am by no means an expert (yet). I hope my experience will help others in solving such coding problems and nailing that interview!!!
People who are interested can join the Slack for our study group using the link in the first post of this thread. (We will be using the #coding_interview_prep channel for this specific purpose.)
The forum thread for reference and possible further discussion is linked below in Resources.
In Tendo’s notebook, the total size of the training set was 3256, so if we choose rows 800-1000 to be our validation set, then with 200 samples we have a validation set that is around 6% of the training set. Is that enough?
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
I didn’t quite gather whether we fully resolved this in the discussion.
Also, why rows 800-1000? Can we not achieve a more random split by using a ratio/percentage like in sklearn?
One reason could be that we want a contiguous set for our validation: much like with video frames, if we have adjacent frames with one in training and one in validation, then our model is not learning anything useful, it is cheating.
Any other explanations? Is 6% enough?
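For what it's worth, a random split is easy to get with the same fastai v1 API. Here is a minimal sketch, assuming the same df, path, procs, cat_names, cont_names and dep_var variables from Tendo's notebook (the 20% and seed values are just illustrative):

```python
from fastai.tabular import *

# Sketch only: a random 20% validation split instead of the fixed rows 800-1000.
# df, path, procs, cat_names, cont_names, dep_var are assumed from the notebook.
data = (TabularList.from_df(df, path=path, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_rand_pct(valid_pct=0.2, seed=42)   # like sklearn's test_size / random_state
        .label_from_df(cols=dep_var)
        .databunch())
```

Whether a random split is actually appropriate depends on the contiguity point raised above.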
Collaborative Filtering:
How do I differentiate between when to use collaborative filtering vs tabular?
A thought experiment. Taking the ‘US Salary’ example of Tabular, could I instead run Collaborative Filtering on that and come up with a recommendation for a salary?
Basic intuition for this is to look at it as:
Tabular :: Supervised
Collaborative Filtering :: Unsupervised
What are n_factors?
They are the hidden (latent) features that the model learns during training.
For example, deciding that some movies are family-friendly vs others not. Family-friendliness is one of the n_factors.
So, when we set up the learner, is the number of n_factors we choose one of the hyperparameters?
It could affect speed and accuracy, but more experiments are needed to determine how much (see the sketch below).
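For context, here is roughly where n_factors gets set, a minimal sketch using the fastai v1 collab API (the 40, y_range and ratings_df names are illustrative assumptions, not from the lecture):

```python
from fastai.collab import *

# Sketch only: n_factors is chosen when building the learner, so it is a
# hyperparameter (the width of each user/item embedding).
# ratings_df is assumed to be a DataFrame with user, item and rating columns.
data  = CollabDataBunch.from_df(ratings_df, seed=42, valid_pct=0.1)
learn = collab_learner(data, n_factors=40, y_range=[0, 5.5], wd=1e-1)
learn.fit_one_cycle(5, 5e-3)
```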
Just a reminder: we are having a meetup tomorrow (Sunday) at 4 PM GMT. We will focus on a projects showcase. This is the time for you to show off all your cool projects and get inspiration from others. To join, just use the same Zoom link when the time comes.
If you have watched lesson 5 only once or twice, try testing your understanding with the questions below. If you can answer them in two or three sentences each, then you have a good understanding of the lesson 5 concepts. Otherwise, consider reviewing the lecture/notes once again before moving on.
Why are ReLUs needed in neural networks (NNs)?
Is an affine function a linear function?
Does the bias-variance trade-off happen in deep learning as well?
What is variance?
Do too many parameters in a NN mean higher variance?
Why is freezing needed for fine-tuning? What happens when we freeze?
Why do we need to unfreeze & train the entire model?
Can you explain how learning rates are applied to the layers in each of the cases below?
1e-3
slice(1e-3)
slice(1e-5, 1e-3)
Can you identify the 3 different variants of GD? How many training samples are used, and when are the weights updated, in each variant? Does stochastic gradient descent mean using mini-batches & updating the loss after each mini-batch?
How/when do you update the weights? Describe the sequence of operations.
What is Learning Rate (LR) annealing? Why are we applying LR annealing?
Why do we apply the exponential in softmax?
What is the difference between a loss function & a cost function?
What is the difference between epoch and iteration?
Why do we need a cyclical learning rate? And what happens to momentum during one cycle?
Presenter: @gagan (Huge thanks for an excellent learning experience)
Disclaimer: Any mistakes in the notes are solely mine and not of the presenter. If you find any mistakes, please provide feedback in the comments so that I can correct them.
Agenda
Deep Learning internals with backpropagation
Fine-tuning in Transfer Learning
Learning Rate tricks: Annealing, Discriminative LR
Weight Decay
Momentum + RMSProp => ADAM
MNIST SGD with cross-entropy and softmax
Meeting Notes
Deep Learning internals with backpropagation
Weights & Biases are the parameters that are typically learned.
The outputs of the matrix multiplications (or, more generally, the results of each computation) are called activations.
Affine functions: linear functions plus a bias. Composing affine functions, e.g. (a*x + b) * c, just results in another affine function (still linear).
Why non-linearity? The real world is not linear, and a stack of affine functions is still affine, so we definitely need to introduce non-linearity in order to model the real world. That's why ReLUs are needed (see the sketch below).
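A tiny sketch of that point (plain PyTorch, arbitrary shapes and values): two stacked affine layers collapse into a single affine layer, so the ReLU in between is what actually adds modelling power.

```python
import torch

x = torch.randn(5, 3)
w1, b1 = torch.randn(3, 4), torch.randn(4)
w2, b2 = torch.randn(4, 2), torch.randn(2)

two_affines = (x @ w1 + b1) @ w2 + b2                # affine applied to an affine
one_affine  = x @ (w1 @ w2) + (b1 @ w2 + b2)         # a single equivalent affine
print(torch.allclose(two_affines, one_affine, atol=1e-5))  # True: still linear overall

with_relu = torch.relu(x @ w1 + b1) @ w2 + b2        # ReLU in between breaks the collapse
```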
Backpropagation / chain rule: how does a small change in one weight (e.g. w2) affect the final loss J(W)? Source: [1]
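A toy, hand-rolled illustration of that question (PyTorch autograd; the numbers are arbitrary): backward() applies the chain rule and gives dJ/dw2.

```python
import torch

x, y = torch.tensor([1.0, 2.0]), torch.tensor(3.0)
w1 = torch.randn(2, requires_grad=True)   # first-layer weights
w2 = torch.randn(1, requires_grad=True)   # second-layer weight

a    = torch.relu(x @ w1)                 # first activation
pred = a * w2                             # second (affine) layer
J    = ((pred - y) ** 2).mean()           # loss J(W), a scalar

J.backward()                              # backprop = chain rule
print(w2.grad)                            # dJ/dw2 = dJ/dpred * dpred/dw2
```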
A larger number of parameters does not necessarily mean higher variance. Deep learning models can have many parameters but use regularization to penalize complexity.
Fine-tuning in Transfer Learning
why freeze? What happens when we freeze?
It replaces the last layer with two newly added layers (layer groups), with a ReLU in between.
Earlier layer weights are good at identifying shapes, colors, etc., & hence are frozen. Only the last-layer weights (initially random) are set up for learning that is task-specific (e.g. classifying pets).
why unfreeze & train the entire model?
Earlier layer weights were trained entirely on a different dataset (e.g. ImageNet, Wiki). They contain useful information, but also information that is not useful for this specific dataset (e.g. classifying pets), so we unfreeze and train the entire model to adapt them (see the sketch below).
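A minimal sketch of the freeze → unfreeze workflow (fastai v1; `data` is assumed to be an ImageDataBunch like the pets one, and the epoch counts/LRs are illustrative):

```python
from fastai.vision import *

learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)                            # learner starts frozen: only the new head trains

learn.unfreeze()                                  # now the pretrained body can adapt too
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))  # small LRs for the early layers
```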
Learning Rate tricks : Annealing, Discriminative LR
Leslie Smith paper
plots loss vs learning rate
1e-3 -> all layer groups get the same LR.
slice(1e-3): final layer group = 1e-3, the rest = (1e-3)/3. Lower learning rates for the earlier layers because they are already near an optimum & we want to avoid overshooting.
slice(1e-5, 1e-3): 1e-5 applied to the first layer group, 1e-3 to the last layer group, and values somewhere in between for the middle groups.
Applying different learning rates to different layer groups is called Discriminative Learning Rates (see the sketch below).
3 layer groups is the default for CNNs.
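A sketch of how the three cases above would look in a fit call (fastai v1; `learn` is assumed to exist, e.g. the one from the freeze/unfreeze sketch, and the single epoch is illustrative):

```python
learn.fit_one_cycle(1, 1e-3)                 # same LR for every layer group
learn.fit_one_cycle(1, slice(1e-3))          # last group 1e-3, earlier groups 1e-3/3
learn.fit_one_cycle(1, slice(1e-5, 1e-3))    # LRs spread from 1e-5 (first) to 1e-3 (last)
```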
Gradient Descent
What is the difference between vanilla batch GD (adding “vanilla” to avoid confusion), SGD, and mini-batch GD? We use mini-batch GD (see the sketch below). For more information see this post [5].
Too large a batch size leads to an out-of-memory error (meaning the batch size is too high for memory).
SGD is used for online learning (in a production environment), i.e. learning (training) on the go, where batch_size is typically 1.
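A rough sketch of the three variants in code (PyTorch, synthetic toy data; only batch_size changes between them):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y  = torch.randn(50_000, 10), torch.randn(50_000, 1)   # toy data
model = torch.nn.Linear(10, 1)
opt   = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 100   # 50_000 -> (vanilla) batch GD, 1 -> online SGD, 100 -> mini-batch GD
for xb, yb in DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True):
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    opt.step()        # weights updated once per batch (one iteration)
    opt.zero_grad()
```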
Weight Decay: keeps the weights from becoming overly significant (large).
Weight Decay: (Source: Metacademy) When training neural networks, it is common to use “weight decay,” where after each update, the weights are multiplied by a factor slightly less than 1. This prevents the weights from growing too large, and can be seen as gradient descent on a quadratic regularization term.
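A tiny numeric sketch of that equivalence for plain SGD (illustrative numbers): multiplying the weights by a factor slightly below 1 is the same update as adding the gradient of a quadratic (L2) penalty.

```python
import torch

lr, wd = 0.1, 0.01
w    = torch.tensor(2.0)
grad = torch.tensor(0.5)                    # pretend gradient of the data loss w.r.t. w

w_decayed = w * (1 - lr * wd) - lr * grad   # "multiply weights by a factor < 1" view
w_l2      = w - lr * (grad + wd * w)        # "gradient of a quadratic penalty" view
print(torch.isclose(w_decayed, w_l2))       # the two views give the same update
```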
Epoch vs Iteration
Let us consider a training dataset with 50,000 instances. An epoch is one run of the training algorithm across the entire training set. If we set a batch size of 100, we get 500 batches in 1 epoch, i.e. 500 iterations. The iteration count accumulates over epochs, so in epoch 2 we get iterations 501 to 1000 for the same 500 batches, and so on. [7]
Learning Rate Annealing: reduce the LR as we are nearing convergence, so we take smaller and smaller steps.
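One concrete (illustrative) way to anneal in plain PyTorch; fastai's fit_one_cycle handles its own schedule, so this is just to show the idea:

```python
import torch

model = torch.nn.Linear(10, 1)
opt   = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)  # one common schedule

for epoch in range(10):
    # ... one epoch of training would go here ...
    sched.step()
    print(epoch, opt.param_groups[0]['lr'])   # LR shrinks as training approaches the end
```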
Polynomial functions (with enough terms) can model almost anything; we can introduce a lot of complexity by using a lot of parameters. Use regularization to penalize complexity while still using a lot of parameters. One common way to do regularization is weight decay.
Difference between Loss & Cost : The loss function (or error) is for a single training example, while the cost function is over the entire training set (or mini-batch for mini-batch gradient descent).
Andrew Ng: “Finally, the loss function was defined with respect to a single training example. It measures how well you’re doing on a single training example. I’m now going to define something called the cost function, which measures how well you’re doing on an entire training set. So the cost function J, which is applied to your parameters W and b, is going to be the average, 1 over m, of the sum of the loss function applied to each of the training examples in turn.”
Google says: The loss function computes the error for a single training example, while the cost function is the average of the loss functions of the entire training set
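In symbols (a standard formulation matching the quotes above), the cost J is just the average of the per-example loss L over the m training examples:

```latex
J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)
```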