Lesson 4 In-Class Discussion ✅

If I’m not mistaken, when asked about the magic number 2.6**4 in one of the learning rates, Jeremy explained the 2.6 but said the **4 would be explained later in lesson 4. Did I miss it? Why is it raised to the fourth power?
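For context, if I recall the notebooks correctly the factor usually appears in a call like learn.fit_one_cycle(..., slice(lr/(2.6**4), lr)), which spreads learning rates geometrically across the layer groups (the exact call and the group count below are assumptions on my part):

```python
# Hedged sketch: spreading learning rates geometrically across layer groups,
# the way slice(lr/(2.6**4), lr) does. The group count (5) is an assumption.
base_lr = 1e-2
ratio = 2.6
n_groups = 5
lrs = [base_lr / ratio ** (n_groups - 1 - i) for i in range(n_groups)]
# the lowest group trains at base_lr / 2.6**4, the top group at base_lr,
# and each group's rate is 2.6x the one below it
```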

1 Like

I removed “cols=” and just left the number in there to get past the error.

This may work too:

data = (TextList.from_csv(path, 'texts.csv', col='text')
                .random_split_by_pct(0.1)
                .label_from_df(cols=0)
                .databunch())

He didn’t explain it in the video. Maybe he’ll explain it later.

1 Like

Have you tried restarting the notebook kernel after getting the CUDA memory error? That resolved the issue for me. I am able to train the IMDB notebook with bs=50 on a 1080 Ti (11GB).

The correct syntax is:

data = (TextList.from_csv(path, 'texts.csv', col='text')
        .split_from_df(col=2)
        .label_from_df(cols=0)
        .databunch())

This is for fastai version:

import fastai
fastai.__version__
[9]: '1.0.24'
2 Likes

I ran into the same problem.
I trained the model up to: data_lm.save('tmp_lm')
Then I reset the kernel and loaded the data, skipping the step that saved tmp_lm. I loaded from: data_lm = TextLMDataBunch.load(path, 'tmp_lm')
I also decreased the backpropagation-through-time window from 70 to 50:
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3, bptt=50)
From there on I could run the remainder of the notebook.
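For intuition on why lowering bptt helps, here is a back-of-the-envelope sketch (a rough rule of thumb, not an exact memory model): the activations held per forward pass scale roughly with bs × bptt.

```python
# Back-of-the-envelope only: tokens processed per forward pass scale with
# bs * bptt, so cutting bptt from 70 to 50 trims that product by ~29%.
bs = 50
tokens_before = bs * 70   # 3500 tokens per batch
tokens_after = bs * 50    # 2500 tokens per batch
reduction = 1 - tokens_after / tokens_before
print(f"{reduction:.0%}")  # prints 29%
```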

2 Likes

@gbecon @seppedl I’m not sure exactly what fixed it, but thanks. Initially restarting the kernel didn’t help, but then I pulled the latest library (went from 1.0.22 to 1.0.24) and the latest lesson 3 notebook (it had changed a fair bit). Then I tried:

  • bs = 50
  • bptt = 50
  • Restarted the kernel after creating data_lm and then started from data_lm = TextLMDataBunch.load(path, 'tmp_lm')

Now it seems to be working. It’s a beast of an RNN! I’ve done some stuff before that maxed out my GPU, but not with bptt at 50; that was at a few hundred!

2 Likes

Spoke a bit too soon. It died again when I started to unfreeze it. Thankfully killing the kernel and loading learn.load('fit_head') seems to be working.

I suspect it’s a PyTorch issue; it seems to leave stuff sitting in VRAM. I’ve had similar issues before. I guess you often get away with it because you’re not maxing the card out.

It has not been included in the fastai library yet, but it will be possible for all applications once a labelling method for regression is created.

1 Like

IBM open-sourced a framework to detect and mitigate bias in machine learning: https://github.com/IBM/AIF360 However, I don’t completely agree with their approaches to mitigating bias: they seem to involve a lot of manual work, and some of them work against the idea of machine learning.

I would expect carefully choosing the right optimization target (loss function) to be helpful, as it allows you to penalize undesired biases alongside your actual target function. If you want a gender-neutral model, you could aim for that directly, then let the optimizer do its job. In the Amazon hiring case, one could aim for a model that is good at predicting the hiring decision while at the same time being bad at predicting the gender, to prevent the model from learning features that are gender specific. I haven’t faced this problem, but I would be curious if somebody has tried such an approach.
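A hypothetical sketch of that combined objective (the function name and the weighting term lam are made up for illustration): score the model by its hiring loss minus how badly it predicts gender, so a model that cannot recover gender is preferred.

```python
# Hypothetical combined objective: minimize the main task's loss while
# *maximizing* an auxiliary gender classifier's loss (weighted by lam).
def combined_loss(hiring_loss, gender_loss, lam=1.0):
    return hiring_loss - lam * gender_loss

# Equal hiring performance, but the model that hides gender scores lower (better):
fair = combined_loss(hiring_loss=0.2, gender_loss=0.7)    # ~ -0.5
biased = combined_loss(hiring_loss=0.2, gender_loss=0.1)  # ~ 0.1
```

In practice a setup like this is usually trained adversarially (e.g. with a gradient reversal layer), since a jointly trained gender head would otherwise just learn to be bad on purpose.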

This worked for me… Thanks

Can I do transfer learning from WikiText103 if I want to work with English text that reads more like the English of Chaucer or Shakespeare, i.e. medieval to early Modern English? Or do we have to train the language model from scratch for that purpose?

  • For classification you try to predict the probability that a specific output corresponds to a specific class. Therefore every output value should be a probability, with a range between 0 and 1. To achieve this you usually use a non-linear function (softmax, sigmoid, etc.) as your last layer to squash the output between 0 and 1. In addition (which is wanted in classification tasks), these functions raise the probability of likely classes while decreasing the probability of all the unlikely ones, forcing the network to choose one specific class over the others.
  • For a regression task you are not looking for probability values as your outputs, but for real-valued numbers. In that case no activation function is wanted, since you want to be able to approximate any possible real value, not probabilities.
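A minimal sketch of the contrast, in plain Python: a softmax squashes arbitrary scores into probabilities that sum to 1, while a regression head just passes the raw values through.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]
probs = softmax(logits)              # classification: each in [0, 1], sums to 1
raw = logits                         # regression: identity, any real value allowed
```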
2 Likes

This could be useful in many similar applications, like predicting sales based on product description, viewing time based on post content, age based on writing sample and so on.

1 Like

This thread Language Model Zoo 🦍 recommends running the following line if problems persist:

torch.cuda.empty_cache()
2 Likes

I think this is not as well known as it should be:

https://vim-bootstrap.com/

You input your language of choice and the specific vim editor and it returns .vimrc full of wonderful plugins and optimized configuration.

It includes a lot more than the average vim user needs, but you can simply comment out anything you want to exclude.

It’s truly amazing.

3 Likes

Just to be clear, these extensions don’t actually let you use VSCode on the remote machine; you’re basically “mounting” the files on your local machine and SSHing them back and forth between your local machine and the remote machine.

Interesting.

Let me make sure I understand this. It sounds like you’re saying that yes, the LR (learning rate) finder does try different LRs on different mini-batches. But this is “distorted” only in the same way that stochastic gradient descent is distorted. Is that right?

As I understand it, stochastic gradient descent (SGD) means gradient descent where you update the learnable model parameters after every sample. Plain old (whole-batch) gradient descent (GD) is when you update the learned parameters only after going through the entire training set (1 epoch). Mini-batch GD is where you update them after some in-between number of samples that fits comfortably in GPU memory, like 32. So mini-batch GD is still “stochastic” in the weak sense that the update per mini-batch depends on the random choice of the elements in that mini-batch, and you are suggesting the LR finder is stochastic in that same weak sense. That makes sense.

What still puzzles me is that I think of mini-batch gradient descent as adding noise to the descent trajectory in a way that adds up to roughly the same overall trajectory over an epoch, because the bias due to any one mini-batch gets averaged out over all of them. But assessing the LR per mini-batch seems only to add noise, without any additive process that corrects it. If one LR is measured with a higher-than-true loss because of the choice of mini-batch, that error just sits there in the LR-vs-loss chart. It doesn’t get corrected by the next LR measurement being more correct on average.

I’ll dig into the Leslie Smith paper and see if he touches on this!
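The averaging intuition above can be checked on a toy example (made-up data, plain Python): for a squared-error loss, the mean of the gradients over equal-sized mini-batches equals the full-batch gradient, even though each individual mini-batch gradient is noisy.

```python
import random

random.seed(0)
# toy data y ~ 3x, model y_hat = w*x, squared-error loss
data = [(x, 3 * x + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(100))]
w = 0.5

def grad(batch):
    # d/dw of mean((w*x - y)**2) over the batch
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

full = grad(data)
batches = [data[i:i + 10] for i in range(0, 100, 10)]
mini = [grad(b) for b in batches]
assert max(mini) != min(mini)                    # individually noisy...
assert abs(sum(mini) / len(mini) - full) < 1e-9  # ...but they average out exactly
```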

Or after every mini-batch. In my view mini-batch GD = SGD. The bigger the mini-batch, the more accurately SGD approximates GD.

Well, with lr_find we just want a good guess at the best learning rate. It is not intended to be perfect. If you run lr_find twice, there is no guarantee you will get the same curve.
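A minimal sketch of what (I believe) the LR range test does, on a made-up toy problem: one SGD step per candidate LR, with the LR growing geometrically and the per-mini-batch loss recorded as-is. Because each loss comes from a single random mini-batch, two runs draw different batches and trace different curves.

```python
import random

random.seed(1)
start_lr, end_lr, n_steps = 1e-5, 1.0, 50
mult = (end_lr / start_lr) ** (1 / (n_steps - 1))  # geometric LR growth

w, lr = 0.0, start_lr                # toy model y_hat = w*x, true w = 2
lrs, losses = [], []
for _ in range(n_steps):
    batch = [random.uniform(-1, 1) for _ in range(8)]   # one random mini-batch
    loss = sum((w * x - 2 * x) ** 2 for x in batch) / len(batch)
    g = sum(2 * x * (w * x - 2 * x) for x in batch) / len(batch)
    lrs.append(lr); losses.append(loss)
    w -= lr * g                      # one SGD step at this candidate LR
    lr *= mult
```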