Lesson 4 In-Class Discussion ✅

IBM open-sourced a framework to detect and mitigate bias in machine learning: https://github.com/IBM/AIF360. However, I don’t completely agree with their approaches to mitigating bias: they seem to involve a lot of manual work, and some of them work against the idea of machine learning.

I would expect that carefully choosing the right optimization target (loss function) would be helpful, as it allows you to penalize undesired biases alongside your actual target function. If you want a gender-neutral model, you could make that part of the objective, then let the optimizer do its job. In the Amazon hiring case, one could aim for a model that is good at predicting the hiring decision while at the same time being bad at predicting the gender, to prevent the model from learning gender-specific features. I haven’t faced this problem myself, but I would be curious if somebody has tried such an approach.
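The combined objective described above could be sketched like this. This is just an illustrative toy, not a tested debiasing method; `lambda_` and both loss inputs are hypothetical names, and a real implementation would need something like a gradient-reversal layer for the adversarial head:

```python
def combined_loss(hiring_loss: float, gender_loss: float, lambda_: float = 1.0) -> float:
    """Lower is better for the main task, but we *subtract* the
    adversary's loss: the optimizer is pushed toward representations
    where gender is hard to predict (i.e. gender_loss stays high)."""
    return hiring_loss - lambda_ * gender_loss

# If the model predicts hiring well (low hiring_loss) but gender
# poorly (high gender_loss), the combined objective is low:
obj = combined_loss(hiring_loss=0.5, gender_loss=0.2, lambda_=1.0)
```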

This worked for me… Thanks

Can I do transfer learning from WikiText-103 if I want to work with English text that resembles the English of Chaucer or Shakespeare, i.e. medieval to early Modern English? Or do we have to train the language model from scratch for that purpose?

  • For classification you try to predict the probability that a specific output corresponds to a specific class. Therefore every output value should be a probability and thus lie between 0 and 1. To achieve this you usually use a non-linear function (softmax, sigmoid, etc.) as your last layer to squash the output between 0 and 1. In addition (and this is wanted in classification tasks), these functions raise the probability of likely classes while decreasing the probability of all the unlikely classes, thereby pushing the network to choose one specific class over the others.
  • For a regression task you are not looking for probabilities as your output values but for real-valued numbers. In that case no activation function is wanted on the last layer, since you want to be able to approximate any possible real value, not probabilities.
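A minimal plain-Python sketch of the two output heads described above (softmax for classification, no activation for regression):

```python
import math

def softmax(logits):
    """Squash raw scores into probabilities in (0, 1) that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Classification head: raw scores become a probability distribution,
# and the largest logit gets an amplified share of the mass.
probs = softmax([2.0, 1.0, 0.1])

# Regression head: the raw linear output *is* the prediction --
# no activation, so any real value can be produced.
def regression_head(raw_output):
    return raw_output
```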

This could be useful in many similar applications, like predicting sales based on product description, viewing time based on post content, age based on writing sample and so on.


This thread Language Model Zoo 🦍 recommends running the following line if problems persist:


I think this is not as well known as it should be:


You input your language of choice and your specific vim editor, and it returns a .vimrc full of wonderful plugins and an optimized configuration.

It contains a lot more than the average vim user needs, but you can simply comment out anything you want to exclude.

It’s truly amazing.


Just to be clear, these extensions don’t actually let you use VSCode on the remote machine; you’re basically “mounting” the files on your local machine and SSHing them back and forth between your local machine and the remote machine.


Let me make sure I understand this. It sounds like you’re saying that yes, the LR (learning rate) finder does try different LRs on different mini batches. But this is “distorted” only in the same way that stochastic gradient descent is distorted. Is that right?

As I understand it, stochastic gradient descent (SGD) means gradient descent (GD) where you update the learnable model parameters after every sample. Plain old (whole-batch?) GD is when you update the learned parameters only after going through the entire training set (1 epoch). Mini-batch GD is where you update them after some in-between number of samples which fits comfortably in GPU memory, like 32. So mini-batch GD is still “stochastic” in the weak sense that the update per mini-batch depends on the random choice of the elements in that mini-batch, and you are suggesting the LR finder is stochastic in that same weak sense. That makes sense.
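The three update schedules can be compared on a toy 1-D problem (minimizing the mean squared distance to a few data points, whose minimizer is their mean); all names and hyperparameters here are illustrative:

```python
import random

def grad(w, batch):
    # gradient of mean((w - x)^2) with respect to w
    return sum(2 * (w - x) for x in batch) / len(batch)

def train(data, batch_size, lr=0.1, epochs=50, seed=0):
    """Generic loop: batch_size=len(data) is whole-batch GD,
    batch_size=1 is per-sample SGD, anything between is mini-batch GD."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        d = data[:]
        rng.shuffle(d)
        for i in range(0, len(d), batch_size):
            w -= lr * grad(w, d[i:i + batch_size])
    return w

data = [1.0, 2.0, 3.0, 4.0]          # mean (the optimum) is 2.5
w_full = train(data, batch_size=len(data))  # whole-batch GD
w_sgd  = train(data, batch_size=1)          # per-sample SGD
w_mini = train(data, batch_size=2)          # mini-batch GD
```

All three land near the optimum 2.5; the smaller the batch, the noisier each individual step, but the noise averages out over the epoch.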

What still puzzles me is that I think of mini-batch gradient descent as adding noise to the descent trajectory in a way that adds up to roughly the same overall trajectory over an epoch, because the bias due to any one mini-batch gets averaged out over all of them. But assessing the LR per mini-batch seems only to add noise, without any averaging process that corrects it. If one LR is measured as having higher-than-true loss because of the choice of mini-batch, then that error just sits there in the LR vs loss chart. It doesn’t get corrected by the next LR measurement being on average more correct.

I’ll dig into the Leslie Smith paper and see if he touches on this!

…or every minibatch. In my view, minibatch GD = SGD: the bigger the minibatch, the more accurately it approximates whole-batch GD.

Well, in lr_find we just want a good guess at the best learning rate. It is not intended to be perfect. If you run lr_find twice, there is no guarantee you will get the same curve.
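A toy sketch of the idea (this is not fastai’s actual `lr_find` implementation): increase the learning rate exponentially, one random minibatch per step, and record the loss. Because each step uses a different minibatch, rerunning with a different seed gives a slightly different curve, which is exactly the per-minibatch noise discussed above:

```python
import random

def loss_and_grad(w, batch):
    # toy objective: mean squared distance from w to the batch points
    loss = sum((w - x) ** 2 for x in batch) / len(batch)
    g = sum(2 * (w - x) for x in batch) / len(batch)
    return loss, g

def lr_find(data, lr_min=1e-3, lr_max=1.0, steps=20, bs=2, seed=0):
    rng = random.Random(seed)
    w = 0.0
    factor = (lr_max / lr_min) ** (1 / (steps - 1))  # exponential schedule
    curve, lr = [], lr_min
    for _ in range(steps):
        batch = rng.sample(data, bs)        # a *different* minibatch each step
        loss, g = loss_and_grad(w, batch)
        curve.append((lr, loss))            # one noisy (lr, loss) point
        w -= lr * g
        lr *= factor
    return curve

curve = lr_find([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```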

How, exactly?

Thank you for your thoughtful reply!

Unfreezing layers one by one enables the model to learn more, and more deeply. As an analogy, you can think of it as going through high school, undergrad, a master’s, and then a PhD, instead of jumping directly to the PhD.


I am very excited to start implementing the Excel Gradient Descent Solver Add-In on a variety of datasets and see what happens. Thank you for another inspirational lecture! The pro-tips are priceless. And I really appreciate @jeremy’s defense of Excel and dislike of Sheets! Our organization makes heavy use of Excel despite a massive volume of data being generated all the time. Excel is another quality tool in the Data Science toolbox. Not great for everything but excellent for certain applications.


Is there an easier way to use a custom tokenizer? Currently doing this:

class MyTokenizer(PreProcessor):
    # `tokenize` is my own tokenizing function
    def process_one(self, item):
        return tokenize(item)

    def process(self, ds):
        ds.items = [tokenize(item) for item in ds.items]

data_lm = TextList.from_csv(PATH, 'my_file.csv', col=0,
                            processor=[MyTokenizer(), NumericalizeProcessor()])

It’s a bit strange that in one place we need to pass col and in another cols.
It should be uniform.


In the imdb lesson, when doing


I got the error RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 7.43 GiB total capacity; 4.62 GiB already allocated; 816.94 MiB free; 1.51 GiB cached)
Before running this command the GPU was using only 375 MB of memory. Why is it taking so much memory that it can’t fit on my GPU?

Did somebody succeed in running this notebook on GCP?

Did you find a solution to

RuntimeError: CUDA error: out of memory

in the imdb notebook?
I have an 8 GB card and I cannot run learn.lr_find() without getting this error, even if I change the batch size or bptt.


I tried changing bs too, but got the same problem. Haven’t found any solution so far.
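For what it’s worth, one common workaround when the batch you want won’t fit in GPU memory (besides lowering bs/bptt) is gradient accumulation: run several small “micro-batches”, average their gradients, and apply one update, so the effective batch size stays large while peak memory stays small. Here is a framework-free toy sketch of the idea (all names are illustrative):

```python
def grad(w, batch):
    # gradient of mean((w - x)^2) with respect to w
    return sum(2 * (w - x) for x in batch) / len(batch)

def step_accumulated(w, big_batch, micro_bs, lr=0.1):
    """One update equivalent to a step on big_batch, but computed
    micro_bs samples at a time (equal-sized micro-batches assumed)."""
    acc, n_micro = 0.0, 0
    for i in range(0, len(big_batch), micro_bs):
        acc += grad(w, big_batch[i:i + micro_bs])
        n_micro += 1
    return w - lr * (acc / n_micro)

big = [1.0, 2.0, 3.0, 4.0]
w_direct = 0.0 - 0.1 * grad(0.0, big)        # one full-batch step
w_accum = step_accumulated(0.0, big, micro_bs=2)  # same step, half the peak batch
```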