Part 2 Lesson 10 wiki

rudraksh · April 12, 2018, 6:49pm

I seem to be getting a weird error when calling the fit function on the language model learner. Any help in this regard would be highly appreciated!

TypeError                                 Traceback (most recent call last)
<ipython-input-13-08ddcd7c7a23> in <module>()
      1 lr=1e-3
      2 lrs = lr
----> 3 learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    249         """
    250         self.sched = None
--> 251         layer_opt = self.get_layer_opt(lrs, wds)
    252         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    253 

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in get_layer_opt(self, lrs, wds)
    221             An instance of a LayerOptimizer
    222         """
--> 223         return LayerOptimizer(self.opt_fn, self.get_layer_groups(), lrs, wds)
    224 
    225     def fit(self, lrs, n_cycle, wds=None, **kwargs):

/usr/local/lib/python3.6/dist-packages/fastai/layer_optimizer.py in __init__(self, opt_fn, layer_groups, lrs, wds)
     15         if len(wds)==1: wds=wds*len(layer_groups)
     16         self.layer_groups,self.lrs,self.wds = layer_groups,lrs,wds
---> 17         self.opt = opt_fn(self.opt_params())
     18 
     19     def opt_params(self):

/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py in __init__(self, params, lr, betas, eps, weight_decay)
     27         defaults = dict(lr=lr, betas=betas, eps=eps,
     28                         weight_decay=weight_decay)
---> 29         super(Adam, self).__init__(params, defaults)
     30 
     31     def step(self, closure=None):

/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py in __init__(self, params, defaults)
     37 
     38         for param_group in param_groups:
---> 39             self.add_param_group(param_group)
     40 
     41     def __getstate__(self):

/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py in add_param_group(self, param_group)
    149             if not isinstance(param, Variable):
    150                 raise TypeError("optimizer can only optimize Variables, "
--> 151                                 "but one of the params is " + torch.typename(param))
    152             if not param.requires_grad:
    153                 raise ValueError("optimizing a parameter that doesn't require gradients")

TypeError: optimizer can only optimize Variables, but one of the params is float

rudraksh · April 13, 2018, 7:04am

Never mind, I figured it out after almost 3 hours of debugging. You need to pass betas explicitly as a keyword argument when defining the opt_fn; otherwise the tuple you pass is taken to be the parameters that need to be optimized.
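
A minimal sketch of this pitfall, assuming opt_fn is built with functools.partial. ToyAdam is a hypothetical stand-in that only mimics the positional order of an Adam-style constructor (params first, then lr, then betas) and its type check; it is not the real torch.optim.Adam, and the beta values are purely illustrative.

```python
from functools import partial


class ToyAdam:
    """Hypothetical stand-in that mirrors only the argument order of an Adam-style optimizer."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999)):
        self.params = list(params)
        # Mimic the optimizer's type check: plain floats are not parameters.
        for p in self.params:
            if isinstance(p, float):
                raise TypeError("optimizer can only optimize Variables, "
                                "but one of the params is float")
        self.lr, self.betas = lr, betas


model_params = ["w1", "w2"]  # placeholders standing in for real parameters

# Positional tuple: it lands in the `params` slot, so the two floats are
# treated as the things to optimize and the type check fails.
bad_opt_fn = partial(ToyAdam, (0.7, 0.99))
try:
    bad_opt_fn(model_params)
    bad_failed = False
except TypeError as err:
    bad_failed = True
    bad_message = str(err)

# Keyword argument: `params` stays free for the caller to fill in later.
good_opt_fn = partial(ToyAdam, betas=(0.7, 0.99))
opt = good_opt_fn(model_params)
```

With the keyword form, the object that later calls opt_fn(...) with the real parameters fills the first slot as intended.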

wgpubs · April 13, 2018, 8:16pm

Ever get confirmation of this?

I suspect the same, but wanted to make sure I wasn’t missing something.

wgpubs · April 13, 2018, 8:35pm

Are you running on Windows or Mac?

If you’re on Windows and have a GPU that isn’t supported by PyTorch, you have two options:

1. Set USE_GPU=False and continue to use the default fastai environment.
2. Install the CPU-only version of PyTorch for Windows:

conda env update -f environment-cpu.yml
activate fastai-cpu
conda uninstall pytorch
conda install -c peterjc123 pytorch-cpu

See here for more info: https://github.com/peterjc123/pytorch-scripts

binga · April 25, 2018, 1:19am

Hello,

I’m trying to manually evaluate the quality of the language model by seeding the model with a string and looking at the outputs. So I’ve taken a sentence, tokenized it, and passed it through the model by numericalizing -> creating a variable -> calling the PyTorch module.

When I pass in a sentence of 3 words, I’m receiving an output tensor of size 3 x 60000. (60000 is my vocab size). My question is - how do I interpret this output tensor?

Does np.argmax(output_tensor, 1) return the three indices with the highest probabilities, and are these the next three words?
Or does np.argmax(output_tensor, 1) output the words at positions 2, 3, 4 when we feed in the words at positions 1, 2, 3?

Has anyone tried doing this?
Thanks.

KevinB · April 25, 2018, 2:07am

You will want to look at index 0 of the predictions. I made a small predictor with a vocab size of 13; here is what mine looks like.
What I input:

needPrediction = np.array([[2,3,4]])
probs = learner.model(V(needPrediction))

My model just increments by one, so I’m looking for 3, 4, 5.

Variable containing:

Columns 0 to 7 
1st Prediction ------> -48.4365 -48.4331  -0.5324   5.1874   1.3968   1.2458   3.0119   1.8143
2nd Prediction ------> -34.0651 -34.0371  -0.7144   0.6367   4.0070   0.7774   0.6088   2.0590
3rd Prediction ------> -34.2130 -34.1164  -0.6074   0.4625   0.7307   3.4674   0.4744   1.1649

Columns 8 to 12 
 -2.2317  -0.1768  -0.7748   2.4104 -48.3797
 -0.2430   0.2249  -2.5451  -1.1012 -34.0218
  1.1998  -1.1252  -0.2692  -1.9911 -34.1119
[torch.cuda.FloatTensor of size 3x13 (GPU 0)]

I’m less confident about predictions[1] and predictions[2]. I believe these two are activations of the hidden layers, but I’m not quite sure why there would be two of them. Maybe somebody else could answer that.

binga · April 25, 2018, 2:46am

Thanks for looking into this Kevin.

The probs variable is a tuple. I looked at probs[0], which has size 3x13 in your example. If each row corresponds to the outputs 3, 4, 5 when I pass in 2, 3, 4, then what I’m looking for as the next output in the sequence is the 3rd prediction (to_np(probs[0].data)[2,:]), right?

KevinB · April 25, 2018, 2:54am

The way I get it is probs[0][-1] which gives me:

Variable containing:
-68.5039
-68.7718
 -3.5490
 -1.0605
 -1.5748
  2.6255
  1.6271
  2.0393
 -3.0633
 -0.8552
  1.3549
 -1.7153
-68.8291
[torch.cuda.FloatTensor of size 13 (GPU 0)]

[-1] always gives you the last row, which is probably what you want in most cases.
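
To make the indexing concrete, here is a small pure-Python sketch that reuses the 13 scores printed above. It assumes a batch containing a single sequence, so that row i of probs[0] scores the token following input position i; the argmax helper is my own and is not part of fastai.

```python
# Scores for the last position, copied from the probs[0][-1] output above.
last_row = [-68.5039, -68.7718, -3.5490, -1.0605, -1.5748, 2.6255, 1.6271,
            2.0393, -3.0633, -0.8552, 1.3549, -1.7153, -68.8291]


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# The last row scores the token that follows the whole seed [2, 3, 4],
# so its argmax is the model's guess for the next token id.
next_token = argmax(last_row)
print(next_token)  # 5
```

This agrees with the "increment by one" toy model: after 2, 3, 4 the most likely next id is 5.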

narvind2003 · April 25, 2018, 3:37am

Can you please share your code @binga? You might be looking at activations from each of the nh layers. If so you would want to pick the last one.

binga · April 25, 2018, 6:23am

Hi @narvind2003,

The first element of the output is not from the nh layers IMHO. I did print out the shapes and sizes though and re-checked it.

The notebook is available here - https://github.com/binga/fastai_notes/blob/master/experiments/notebooks/lang_models/Telugu_Language_Model_inference_test.ipynb
The inference piece is at this code block - In [21]:

Could you verify this inference piece? I’m not sure if I am interpreting the outputs right or my language model isn’t good enough and I’ve to improve it further.

narvind2003 · April 25, 2018, 4:18pm

Sure.
[Edited ]
The linear decoder returns 3 things if you see lm_rnn.py:
result,outputs,output
In this case, you are in fact seeing predictions from the decoder via probs[0]. To get the same prediction from the output, try this roundabout method.

narvind2003 · April 25, 2018, 9:37pm

Also, did you try adjusting your dropouts based on the size of your corpus?

binga · April 25, 2018, 9:40pm

I’m currently trying your approach… I’ll get back to you with my findings.

I have a similar number of tokens to the imdb corpus, so I haven’t changed the dropout values yet.

narvind2003 · April 26, 2018, 12:53am

If you look at lm_rnn.py, Jeremy simply returns 3 things from the decoder/pooling classifier: result, outputs, output.
So probs[0] is the result of the decoder/classifier.

sgugger · April 26, 2018, 1:55am

To be more precise, the output of the RNNLearner is indeed a tuple with three things: decoded, raws, outs

decoded, the first one, is the result of the last hidden state that went through the decoder. With a softmax, you can turn it into the probs of each word. Its shape is sequence_length * batch_size by vocab_size.
raws, the hidden layers of our LSTMs. There are three of them in the language model (which is our nl), so it’s a list of three tensors, each of size sequence_length by batch_size by the hidden size of its corresponding LSTM.
outs, same as raws, but after the last dropout layer.

The reason it returns all of this, and not only the decoded output, is that sometimes (when you want to build an attention layer on top of your model for instance) you need the hidden states.
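
A rough pure-Python sketch of how one might read decoded, under stated assumptions: the numbers are made up, and the row_index helper assumes the sequence_length * batch_size rows are laid out time step first (row = t * batch_size + b), which is my assumption about the flattening rather than something confirmed in this thread.

```python
import math


def softmax(scores):
    """Convert raw decoder scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def row_index(t, b, batch_size):
    """Row of `decoded` for time step t and batch item b (assumed time-major layout)."""
    return t * batch_size + b


# Toy `decoded`: sequence_length=2, batch_size=2, vocab_size=3 (made-up numbers).
decoded = [
    [2.0, 0.5, 0.1],  # t=0, b=0
    [0.1, 0.2, 3.0],  # t=0, b=1
    [0.3, 2.5, 0.2],  # t=1, b=0
    [1.5, 1.5, 1.5],  # t=1, b=1
]
seq_len, batch_size = 2, 2

# Probabilities over the vocab for the last time step of batch item 0.
probs = softmax(decoded[row_index(seq_len - 1, 0, batch_size)])
best = max(range(len(probs)), key=lambda i: probs[i])
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw decoded row gives the same word as taking it after the softmax.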

nok · April 30, 2018, 6:14pm

Does anyone have an idea how we should format the data for multi-label text (in the lecture Jeremy mentioned a csv file with “labels” followed by “text”)? That is, a text can belong to more than one class at the same time. I have tried looking into the source code, but I am not sure how to do it. Do we have a standard API for this kind of text dataset?

In addition, how can we tweak the model to support multi-output? I looked at lesson 9, where we did multi-class output for image classification, but could not figure out how to make it work for the LM. I would love some help, or just a pointer to what I should look at. Thanks!

wgpubs · May 1, 2018, 5:26pm

Starting to go back through part 2 class and I have a few questions on Lesson 10 and the imdb notebook:

1. When you build the .csv files for classification you eliminated the “unsup” labels for train.csv but not for test.csv, why?
2. In your fixup() method you replace a bunch of things with other values based on what you discovered after looking at 12 different datasets. Given a corpus, what can/should we do to figure out what should and should not be “fixed up”?
3. Instead of using xfld 1 for delimiting fields, would it not be better to use xfld_1, as the token “1” is likely to be used elsewhere in the corpus?
4. Thoughts on using the entire corpus to build the vocab rather than just the training set? I’ve seen both in Kaggle competitions and am wondering what the consensus is, as well as the pros/cons of both approaches.
5. At the end of this notebook you mention that “with bidir we get a 95.4% accuracy.” Did you do this by just using the pre-trained language model as is with the bwd_wt103.h5 weights -or- did you fully train a language model using the pre-trained weights to start with (as you did in the notebook)?
6. Training an LM even with pre-trained weights takes a long time. Is the ultimate objective to be able to use the encoder from a pre-trained LM to do classification without first training an LM on their particular corpus?

jeremy · May 1, 2018, 6:58pm

1. I don’t think there should be unsup in test.
2. I just looked for odd tokenization issues or markup in the docs manually.
3. The concept of “new field” can only be learned if xfld is a separate token. The RNN can learn about xfld 1 as a concept by using state.
4. If you’re training an LM, it makes sense to use the whole thing.
5. I repeated the whole process end to end for the backward model.
6. If you’ve got an LM that’s somewhat close to your target corpus, you could just fine-tune it briefly, or even skip straight to the classifier.
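
The point about xfld being a separate token can be illustrated with a plain whitespace split. This is only a sketch: join_fields is a hypothetical helper, the sample fields are made up, and the real notebook uses a proper tokenizer rather than str.split.

```python
def join_fields(fields, marker):
    """Prefix each field with a field marker built by `marker(i)` (illustrative helper)."""
    return " ".join(f"{marker(i)} {text}" for i, text in enumerate(fields, start=1))


fields = ["great movie", "would watch again"]

separate = join_fields(fields, lambda i: f"xfld {i}").split()
fused = join_fields(fields, lambda i: f"xfld_{i}").split()

# With "xfld 1" / "xfld 2" the same marker token appears before every field ...
shared_markers = {tok for tok in separate if tok == "xfld"}
# ... while "xfld_1" / "xfld_2" become unrelated vocabulary entries.
fused_markers = {tok for tok in fused if tok.startswith("xfld")}
```

With the separate form there is a single xfld token whose embedding is shared by every field boundary, and the number that follows is just another token the model sees next.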

nok · May 3, 2018, 7:21pm

I am trying to build a multi-label model, i.e. with an output over 7 classes such as [0,0,0,1,0,0,1].
I changed the model crit to F.binary_cross_entropy and get this error when running.

I’m struggling to debug this, as there are multiple classes being passed around.

I also tried to pass in an input to visualize the output, but failed.

tmp = iter(md.trn_dl)
*xs, y = next(tmp)
m(*VV(xs))
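
For reference, here is a pure-Python sketch of what a binary cross-entropy criterion computes on a multi-hot target like the one above. The raw scores are made up and this is not PyTorch's implementation; it only illustrates the formula. One thing worth checking in a setup like this is that F.binary_cross_entropy expects probabilities between 0 and 1, so raw scores need a sigmoid first. Whether that is related to the error in this post is not something the thread shows.

```python
import math


def sigmoid(x):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(probs, targets):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all labels."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


target = [0, 0, 0, 1, 0, 0, 1]                     # multi-hot label from the post
logits = [-2.0, -1.5, -3.0, 2.5, -0.5, -2.0, 1.0]  # made-up raw model scores
probs = [sigmoid(z) for z in logits]               # one probability per class

loss = binary_cross_entropy(probs, target)
predicted = [int(p > 0.5) for p in probs]          # independent per-class threshold
```

Each class is treated as its own yes/no decision, which is why several entries of the prediction can be 1 at the same time.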

jeremy · May 3, 2018, 9:23pm

You need to call reset on your model first, if it’s an RNN, when debugging it in this way. That’s what creates the initial hidden state.
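
A toy illustration of why the call order matters. ToyStatefulRNN is a hypothetical pure-Python stand-in, not fastai's class; it only mimics the idea that the hidden state does not exist until reset() has been called.

```python
class ToyStatefulRNN:
    """Hypothetical stand-in for a stateful RNN; not fastai's actual class."""

    def __init__(self, n_hidden):
        self.n_hidden = n_hidden
        self.hidden = None  # no state until reset() is called

    def reset(self):
        """Create a fresh initial hidden state."""
        self.hidden = [0.0] * self.n_hidden

    def __call__(self, x):
        if self.hidden is None:
            raise RuntimeError("hidden state missing: call reset() first")
        # Toy update: blend the previous state with the new input value.
        self.hidden = [0.5 * h + 0.5 * x for h in self.hidden]
        return self.hidden


m = ToyStatefulRNN(n_hidden=3)
try:
    m(1.0)
    failed_without_reset = False
except RuntimeError:
    failed_without_reset = True

m.reset()     # creates the initial hidden state
out = m(1.0)  # now the forward call works
```

Applied to the snippet in the post above, this would presumably mean calling m.reset() before m(*VV(xs)).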