Part 1, online study group

Hi gagan hope you’re having a wonderful day!

I found your first post informative and a real joy to read.

Cheers mrfabulous1 :smiley: :smiley:


Thanks @mrfabulous1!! I’m glad you found it useful!

Hi people!!! As part of this study group, we are starting an algorithms meetup to hone our expertise in using data structures and algorithms, which can be useful for interviews as well.

Preparing for LeetCode-style coding interviews can be very challenging because the material is scattered, and finding a good explanation for a problem can take time. A friend and I prepared for these interviews together, and I intend to cover some of the patterns we learnt (related to data structures and algorithms) that were useful to us. We both got new jobs after weeks of preparation and iteratively figuring out how not to fail. Please note that I will just be sharing my experience and am by no means an expert (yet). I hope my experience will help others solve such coding problems and nail that interview!!!

People who are interested can join the Slack for our study group using the link in the first post of this thread. (We will be using the #coding_interview_prep channel for this specific purpose.)


Just a reminder, there is a meetup today, at 4PM GMT :wink: We will focus on Lesson 4!


Hello all! first time in this meetup… just started lesson 4 today :slight_smile:


Right on time @oscarb :slightly_smiling_face:

Meeting Minutes of 02/02/2019

Presentation on Lesson 4 (Tabular and Collaborative Filtering)

Presenter: @Tendo

Thanks to @Tendo for the wonderful Colab notebooks!

Questions

Tabular Data:
  • What are the heuristics or the formula for determining the size of the hidden layers for the tabular learner?

    learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

    • Forum thread for reference and possible further discussion linked below in Resources
  • In Tendo’s notebook, the training set had 3256 rows in total, so choosing rows 800-1000 as our validation set gives 200 samples, which is around 6% of the training set. Is that enough?

    test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

    • I didn’t quite gather if we fully resolved this in the discussion
    • Also, why 800-1000? Can we not achieve a more random split by using ratio/percentage like in sklearn?
      • One reason could be that we want a contiguous set for validation: much like with video frames, if adjacent frames end up split between training and validation, the model is not really learning anything, it is cheating.
      • Any other explanations? Is 6% enough?
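To make the split question concrete, here is a small sketch in plain Python (not fastai or sklearn code) that compares the contiguous rows 800-1000 with a random split of the same size. The row count 3256 is taken from the notes above; the helper names and the seed are made up for illustration.

```python
import random

N_ROWS = 3256          # total rows, as quoted in the notes above
VALID_START, VALID_END = 800, 1000


def contiguous_split(n_rows, start, end):
    """Hold out one contiguous block of row indices for validation."""
    valid = list(range(start, end))
    train = [i for i in range(n_rows) if i < start or i >= end]
    return train, valid


def random_split(n_rows, valid_fraction, seed=42):
    """Hold out a random fraction of row indices (a ratio-style split)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_valid = round(n_rows * valid_fraction)
    return sorted(indices[n_valid:]), sorted(indices[:n_valid])


train_c, valid_c = contiguous_split(N_ROWS, VALID_START, VALID_END)
fraction = len(valid_c) / N_ROWS
train_r, valid_r = random_split(N_ROWS, fraction)

print(f"contiguous: {len(valid_c)} validation rows = {fraction:.1%} of all rows")
print(f"random:     {len(valid_r)} validation rows, first few: {valid_r[:5]}")
```

Both splits hold out the same number of rows; the difference is only whether neighbouring rows can land on opposite sides of the split, which is the "adjacent video frames" concern raised above.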

Collaborative Filtering:

  • How do I differentiate between when to use collaborative filtering vs tabular?
    • A thought experiment. Taking the ‘US Salary’ example of Tabular, could I instead run Collaborative Filtering on that and come up with a recommendation for a salary?
    • Basic intuition for this is to look at it as:
      • Tabular :: Supervised
      • Collaborative Filtering :: Unsupervised
  • What are n_factors?
    • They are the hidden features that the model learns after training
      • For example, deciding that some movies are family-friendly vs others not. Family-friendliness is one of the n_factors.
    • So, while we set up the learner, is the number of n_factors we choose one of the hyperparameters?
      • It could affect speed and accuracy, but need more experiments to determine.
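As a rough illustration of what n_factors controls, the sketch below builds one latent vector per user and per movie and scores a pair with a dot product. This is a simplified stand-in written in plain Python, not the fastai learner itself; the sizes, names and random initialisation are assumptions made for the example.

```python
import random

N_USERS, N_MOVIES = 4, 5
N_FACTORS = 3   # hyperparameter: length of every latent vector

rng = random.Random(0)


def init_embedding(n_rows, n_factors):
    """One small random vector of length n_factors per user or movie."""
    return [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_rows)]


user_factors = init_embedding(N_USERS, N_FACTORS)
movie_factors = init_embedding(N_MOVIES, N_FACTORS)


def predict(user_id, movie_id):
    """Predicted affinity = dot product of the two latent vectors."""
    u, m = user_factors[user_id], movie_factors[movie_id]
    return sum(uf * mf for uf, mf in zip(u, m))


n_params = (N_USERS + N_MOVIES) * N_FACTORS
print(f"score for user 0, movie 2: {predict(0, 2):.4f}")
print(f"learnable parameters with n_factors={N_FACTORS}: {n_params}")
```

Training would adjust these vectors so that the scores match known ratings; a factor such as "family-friendliness" could then emerge as one of the learned dimensions. The parameter count grows linearly with N_FACTORS, which is one reason the choice can affect speed and accuracy.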

Resources

Jeremy’s tweet on Tabular:


Awesome work @shimsan


Thank you @shimsan!


Just a reminder, we are having a meetup tomorrow (Sunday) at 4PM GMT. We will focus on a projects showcase. This is the time for you to show off all your cool projects and get inspiration from others :slightly_smiling_face: To join, just use the same Zoom link when the time comes.


The meetup will start in ~15 mins :partying_face: Join zoom !

Overview of Gradient Descent

What is Gradient Descent(GD)?

  • It is a type of optimization algorithm to find the minimum of a function (loss function in NN).

Nice Analogy for understanding GD :

  • A person stuck on a mountain and trying to get down with minimal visibility due to fog (Source: Wikipedia).

Algorithm

Source: [1]

Variants of Gradient Descent

Source [2]

  • Stochastic Gradient Descent: weights are updated using one sample at a time, so batch_size is 1. For 100 samples, the weights are updated 100 times per epoch.
  • Batch Gradient Descent: weights are updated using the whole dataset. For 100 samples, the weights are updated only once per epoch.
  • Mini Batch: a middle ground between the two above. The dataset is split into batches of a size we choose, with the samples in each batch chosen at random.
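The three variants differ only in how many samples feed each weight update, so a single loop with a batch_size argument can sketch all of them. The example below fits a one-weight model y ≈ w * x in plain Python; the data, learning rate and function name are made up for illustration and are not taken from the linked posts.

```python
import random


def gradient_descent(xs, ys, batch_size, epochs=1, lr=0.01, seed=0):
    """Fit y ~ w * x with squared error; return (w, number_of_weight_updates)."""
    rng = random.Random(seed)
    w, n_updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # batches are drawn at random
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the batch with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad                        # one weight update per batch
            n_updates += 1
    return w, n_updates


xs = [i / 100 for i in range(100)]   # 100 samples
ys = [2 * x for x in xs]             # true weight is 2

for name, bs in [("stochastic", 1), ("mini-batch", 10), ("batch", 100)]:
    w, n = gradient_descent(xs, ys, batch_size=bs)
    print(f"{name:<11} batch_size={bs:<3} updates per epoch={n:<3} w={w:.3f}")
```

With 100 samples, the update counts per epoch come out as 100, 10 and 1 respectively, matching the description above.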

[1] https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1
[2] https://suniljangirblog.wordpress.com/2018/12/13/variants-of-gradient-descent/

I hope this clarifies the different variants of gradient descent.

Lesson 5 - Questions

Audience: Beginner-Intermediate

If you have watched lesson 5 only once or twice, try testing your understanding using the questions below. If you can answer them in two or three sentences each, then you have a good understanding of the lesson 5 concepts. Otherwise, consider reviewing the lecture/notes once again before moving on.

  • Why are ReLUs needed in neural networks (NNs)?
  • Is an affine function a linear function?
  • Does the bias-variance trade-off happen in deep learning as well?
  • What is variance?
  • Do too many parameters in an NN mean higher variance?
  • Why is freezing needed for fine-tuning? What happens when we freeze?
  • Why do we need to unfreeze and train the entire model?
  • Can you explain how learning rates are applied to the layers in each of the cases below?
    • 1e-3
    • slice(1e-3)
    • slice(1e-5, 1e-3)
  • Can you identify the 3 different variants of GD? How many training samples are used, and when are the weights updated, in each variant? Does stochastic gradient descent mean using mini-batches and updating the weights after each mini-batch?
  • How/when do you update the weights? Describe the sequence of operations.
  • What is learning rate (LR) annealing? Why are we applying LR?
  • Why are we applying the exponential before softmax?
  • What is the difference between a loss function and a cost function?
  • What is the difference between epoch and iteration?
  • Why do we need a cyclical learning rate? And what happens to momentum during one cycle?
  • What are entropy and softmax?
  • When to use cross-entropy instead of, say, RMSE?

Meeting Minutes 16-02-2020

Topic: Lesson 5, Part 1

Start Time: Feb 16, 2020

Recording

Presenter: @gagan (Huge thanks for an excellent learning experience)

Disclaimer: Any mistakes in the notes are solely mine and not of the presenter. If you find any mistakes, please provide feedback in the comments so that I can correct them.

Agenda

  • Deep Learning internals with backpropagation
  • Fine-tuning in Transfer Learning
  • Learning Rate tricks: Annealing, Discriminative LR
  • Weight Decay
  • Momentum + RMSProp => ADAM
  • Mnist SGD with Cross-entropy and softmax

Meeting Notes

  • Deep Learning internals with backpropagation
    • Weights & Biases are the parameters that are typically learned.
    • The outputs of a matrix multiplication, or the results of any other calculation, are called activations.
    • Affine functions: linear functions, e.g. (a + b) * c. Composing one affine function with another results in just another affine (linear) function.
    • Why non-linearity? The real world is not linear, so to model it we need to introduce non-linearity. That is why ReLUs are needed.
    • Backpropagation: Chain rule: How does a small change in one weight (e.g. w2) affect the final loss J(W)? Source: [1]
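The chain-rule question above can be tried out on a network small enough to differentiate by hand. The sketch below uses one input, two weights (w1, w2), a ReLU in between and a squared-error loss; this tiny architecture and the numbers are assumptions chosen for illustration, not the network from the MIT slides.

```python
def relu(z):
    return max(0.0, z)


def forward(w1, w2, x):
    """Tiny network: x -> w1 -> ReLU -> w2 -> prediction."""
    z1 = w1 * x          # affine step (no bias, to keep the sketch short)
    a1 = relu(z1)        # activation
    y_hat = w2 * a1      # output
    return z1, a1, y_hat


def loss(w1, w2, x, y):
    return (forward(w1, w2, x)[2] - y) ** 2


def backward(w1, w2, x, y):
    """Chain rule: multiply local derivatives from the loss back to each weight."""
    z1, a1, y_hat = forward(w1, w2, x)
    dJ_dyhat = 2 * (y_hat - y)
    dJ_dw2 = dJ_dyhat * a1                      # dJ/dw2 = dJ/dy_hat * dy_hat/dw2
    dJ_da1 = dJ_dyhat * w2
    da1_dz1 = 1.0 if z1 > 0 else 0.0            # derivative of ReLU
    dJ_dw1 = dJ_da1 * da1_dz1 * x               # dJ/dw1 goes through a1 and z1
    return dJ_dw1, dJ_dw2


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
dJ_dw1, dJ_dw2 = backward(w1, w2, x, y)

# Numerical check: nudge w2 slightly and watch how the loss changes.
eps = 1e-6
numeric_dw2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(f"analytic dJ/dw2={dJ_dw2:.4f}, numeric dJ/dw2={numeric_dw2:.4f}")
```

The finite-difference check is a handy way to verify hand-written backpropagation code before trusting it.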


  • Bias/Variance tradeoff [2]

    • High bias(underfit)
    • High variance (overfit)
  • More parameters do not necessarily mean higher variance. Deep learning models can have many parameters and still use regularization to penalize complexity.

  • Fine-tuning in Transfer Learning

    • Why freeze? What happens when we freeze?
      • The last layer is replaced with two newly added layers (layer groups) with a ReLU in between.
      • Earlier layer weights are good at identifying shapes, colors, etc. and hence are frozen. Only the weights of the new last layers (initially random) are trained, so they learn what is task-specific (e.g. classifying pets).
    • Why unfreeze and train the entire model?
      • Earlier layer weights were trained entirely on a different dataset (e.g. ImageNet, Wiki). They contain useful information, but also information that is not useful for this specific task (e.g. classifying pets).
  • Learning Rate tricks : Annealing, Discriminative LR

    • Leslie Smith paper
    • plots loss vs learning rate
      • 1e-3 -> all layers same lr
      • slice(1e-3): final layers = 1e-3, rest = (1e-3)/3. The earlier layers get a lower learning rate because they are already near an optimum, which avoids overshooting.
      • slice(1e-5, 1e-3): 1e-5 is applied to the first layer group, 1e-3 to the last layer group, and values somewhere in between to the middle layers.
    • Applying different learning rates to different layer groups is called discriminative learning rates.
    • 3 layer groups are the default for a CNN.
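The three learning-rate cases above can be written out as a small helper that returns one learning rate per layer group. This is only a sketch of the behaviour as described in these notes, not the library's own code: the function name is hypothetical, the divisor of 3 follows the notes (the library version you have installed may use a different factor, so it is left as a parameter), and the "somewhere in between" values are assumed to be spaced multiplicatively.

```python
def lrs_per_group(lr, n_groups=3, single_slice_divisor=3):
    """Map a learning-rate spec to one learning rate per layer group.

    Follows the description in the notes; `single_slice_divisor` and the
    multiplicative spacing are assumptions of this sketch.
    """
    if not isinstance(lr, slice):
        return [lr] * n_groups                       # 1e-3: same lr everywhere
    if lr.start is None:
        # slice(1e-3): last group gets lr.stop, earlier groups a smaller value
        return [lr.stop / single_slice_divisor] * (n_groups - 1) + [lr.stop]
    if n_groups == 1:
        return [lr.stop]
    # slice(1e-5, 1e-3): first and last groups get the ends, the rest sit in between
    ratio = (lr.stop / lr.start) ** (1 / (n_groups - 1))
    return [lr.start * ratio ** i for i in range(n_groups)]


for spec in (1e-3, slice(1e-3), slice(1e-5, 1e-3)):
    print(spec, "->", [f"{v:.1e}" for v in lrs_per_group(spec)])
```

Note that Python's built-in slice(1e-3) stores the value as the stop and leaves start as None, which is how the single-value case can be told apart from the two-value case.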
  • Gradient Descent

    • What is the difference between vanilla Batch GD ("vanilla" added to avoid confusion), SGD, and Mini Batch GD? We use Mini Batch GD. For more information see this post [5]
    • Too large a batch size leads to an out-of-memory error (meaning the batch size is too high).
    • SGD is used during online learning (production environment), i.e. learning (training) on the go, where batch_size is typically 1.
  • MNIST - SGD
    See this post for code & explanation

    • Neural Network without hidden layers => Logistic Regression

    • Notebook walkthrough lesson5-sgd-mnist notebook

    • Mnist SGD with Cross-Entropy and Softmax (entropy_example.xlsx)

      • Softmax (& exponential): finite, positive range, e.g. the % of cattiness (the given feature) in this image. It guarantees that:
        • All the activations add up to 1
        • All of the activations are > 0
        • All of the activations are < 1
      • entropy : measure of chaos (lack of orderliness)
        • type of loss function
        • High Penalization for wrong answer.
        • Very low penalization for correct answer
      • Loss and cost are closely related.
      • Loss function (error for each sample): the difference between prediction and actual.
      • Cost function (entire dataset or mini-batch): the average of the losses for the mini-batch, hence the loss function is part of the cost function.
      • (Classification/categorical: cross-entropy. Regression/continuous: RMSE)
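The softmax guarantees and the loss/cost distinction above can be checked with a few lines of plain Python. The activations below are made-up numbers for a 3-class example, and the function names are chosen for this sketch; it does not reproduce entropy_example.xlsx.

```python
import math


def softmax(activations):
    """Exponentiate, then normalise so the outputs are positive and sum to 1."""
    shifted = [a - max(activations) for a in activations]   # numerical stability
    exps = [math.exp(a) for a in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(probs, target_index):
    """Loss for ONE sample: -log of the probability given to the true class."""
    return -math.log(probs[target_index])


def cost(batch_activations, batch_targets):
    """Cost for a mini-batch: the average of the per-sample losses."""
    losses = [cross_entropy_loss(softmax(a), t)
              for a, t in zip(batch_activations, batch_targets)]
    return sum(losses) / len(losses)


activations = [2.0, -1.0, 0.5]          # made-up raw outputs for 3 classes
probs = softmax(activations)
print("probabilities:", [round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
print("loss if class 0 is correct:", round(cross_entropy_loss(probs, 0), 3))
print("loss if class 1 is correct:", round(cross_entropy_loss(probs, 1), 3))
print("cost over both cases:", round(cost([activations, activations], [0, 1]), 3))
```

The model favours class 0 here, so the loss is small when class 0 is the right answer and much larger when class 1 is, which is the "high penalization for wrong answer" behaviour noted above.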
    • Weight Decay : Makes the weight not overly significant

      • Weight Decay: (Source: Metacademy) When training neural networks, it is common to use “weight decay,” where after each update, the weights are multiplied by a factor slightly less than 1. This prevents the weights from growing too large, and can be seen as gradient descent on a quadratic regularization term.

      • Epoch vs Iteration

        • Let us consider a training dataset with 50,000 instances. An epoch is one run of the training algorithm across the entire training set. If we set a batch size of 100, we get 500 batches in 1 epoch, or 500 iterations. The iteration count is accumulated over epochs, so that in epoch 2 we get iterations 501 to 1000 for the same 500 batches, and so on. [7]
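The epoch and iteration arithmetic above can be written out directly. The numbers come from the example in the notes; the helper name is made up for this sketch.

```python
import math

n_samples = 50_000
batch_size = 100

iterations_per_epoch = math.ceil(n_samples / batch_size)   # number of batches per epoch


def iteration_range(epoch):
    """First and last (1-based) iteration numbers that fall in a given epoch."""
    first = (epoch - 1) * iterations_per_epoch + 1
    last = epoch * iterations_per_epoch
    return first, last


for epoch in (1, 2, 3):
    first, last = iteration_range(epoch)
    print(f"epoch {epoch}: iterations {first} to {last}")
```

math.ceil is used so that a dataset size that is not a multiple of the batch size still counts its final, smaller batch as one iteration.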
  • Learning Rate Annealing: reduce the lr dramatically as we near convergence.

  • Polynomial functions can model anything. We can introduce a lot of complexity using a lot of parameters. Use Regularization to penalize complexity but still use a lot of parameters. One common way to do regularization is to use weight decay
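The statement that weight decay "can be seen as gradient descent on a quadratic regularization term" can be checked numerically for a single weight and a plain SGD step. In the sketch below the penalty is assumed to be (wd / 2) * w**2, and all numbers and function names are made up for illustration.

```python
def step_with_l2_penalty(w, grad, lr, wd):
    """Gradient step on loss + (wd / 2) * w**2: the penalty adds wd * w to the gradient."""
    return w - lr * (grad + wd * w)


def step_with_decay(w, grad, lr, wd):
    """Shrink the weight by a factor slightly below 1, then take the usual step."""
    return (1 - lr * wd) * w - lr * grad


w, grad, lr, wd = 3.0, 0.4, 0.1, 0.01   # made-up numbers
a = step_with_l2_penalty(w, grad, lr, wd)
b = step_with_decay(w, grad, lr, wd)
print(f"L2-penalty step: {a:.6f}, decay step: {b:.6f}, shrink factor: {1 - lr * wd}")
```

Both functions give the same new weight, and the shrink factor 1 - lr * wd is the "factor slightly less than 1" from the Metacademy description. The equivalence shown here is for a plain SGD step as written.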

  • Momentum + RMSProp => ADAM
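To show how the two ingredients combine, here is a minimal single-weight sketch of the Adam update: a running mean of gradients (the momentum part) divided by the root of a running mean of squared gradients (the RMSProp part), with bias correction. The toy loss, step count and function name are assumptions for illustration; the beta and eps values are the commonly used defaults.

```python
import math


def adam_minimise(grad_fn, w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a 1-D function with Adam, given a function returning its gradient."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # RMSProp: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimise(lambda w: 2 * (w - 3), w=0.0)
print(f"w after Adam: {w_final:.3f}")
```

In practice one would use a library optimizer rather than a hand-written loop; the point of the sketch is only to make the two running averages visible.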

  • A pure PyTorch setup for the Fashion-MNIST dataset.

Questions

See the above post

Advice / Action Items

  • Watch at least the lectures 1 & 2 of Intro to Deep Learning from MIT [1]
  • Go back and write backpropagation code in pure python [3].
  • Too large a batch size leads to an out-of-memory error (meaning the batch size is too high).

Resources

Misc

  • Difference between Loss & Cost : The loss function (or error) is for a single training example, while the cost function is over the entire training set (or mini-batch for mini-batch gradient descent).
  • Andrew Ng: “Finally, the loss function was defined with respect to a single training example. It measures how well you’re doing on a single training example. I’m now going to define something called the cost function, which measures how well you’re doing on the entire training set. So the cost function J, which is applied to your parameters W and B, is going to be the average, one over m times the sum of the loss function applied to each of the training examples in turn.”
  • Google says: The loss function computes the error for a single training example, while the cost function is the average of the loss functions of the entire training set

Thanks @msivanes for posting such detailed and review-again-worthy notes!!!

The meeting is on, join guys! :slightly_smiling_face:

Hi msivanes hope you’re having a wonderful day!

Thanks for creating excellent notes!

Cheers mrfabulous1 :smiley: :smiley:

Is this still going, in any format? I’d love to be a part of any study group. Thanks!

yep, we are super active now, join our Slack :slight_smile:

Two domains. One is predictive and the other has to do with astronomy.