TextClassificationInterpretation throws an error when used on a pretrained network

EDIT: I tried the title “Using TextClassificationInterpretation throws a cudnn Runtime error”, but the forum told me “Title seems unclear, is it a complete sentence?”. wat? :open_mouth:

I was quite excited by the new TextClassificationInterpretation, as I have been thinking about it for a while, but cannot get it to work.

I have the following installation on a Paperspace VM:

=== Software === 
python        : 3.6.7
fastai        : 1.0.46
fastprogress  : 0.1.19
torch         : 1.0.0
nvidia driver : 410.73
torch cuda    : 9.0.176 / is available
torch cudnn   : 7401 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 24449MB | Quadro P6000

=== Environment === 
platform      : Linux-4.4.0-128-generic-x86_64-with-debian-stretch-sid
distro        : #154-Ubuntu SMP Fri May 25 14:15:18 UTC 2018
conda env     : fastai
python        : /home/paperspace/anaconda3/envs/fastai/bin/python
sys.path      : /home/paperspace/anaconda3/envs/fastai/lib/python36.zip
/home/paperspace/anaconda3/envs/fastai/lib/python3.6
/home/paperspace/anaconda3/envs/fastai/lib/python3.6/lib-dynload

/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages
/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/IPython/extensions
/home/paperspace/.ipython

I have been working on a text classification project at work, and so far everything has gone quite well, with minor hiccups here and there. I don’t think the details are important, but if I try

from fastai.text import *  # TextClasDataBunch, text_classifier_learner, AWD_LSTM, ...

bs = 256
data_clas = TextClasDataBunch.load(path, 'saved_classifier_data', bs=bs)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.7)
learn.load('my_trained_classifier');
ci = TextClassificationInterpretation.from_learner(learn)
ci.show_intrinsic_attention("I want this to be classified please")

I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-93405a760cc1> in <module>
----> 1 ci.show_intrinsic_attention("I want this to be classified please")

~/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/text/models/awd_lstm.py in show_intrinsic_attention(self, text, class_id, **kwargs)
    256 
    257     def show_intrinsic_attention(self, text:str, class_id:int=None, **kwargs)->None:
--> 258         text, attn = self.intrinsic_attention(text, class_id)
    259         show_piece_attn(text.text.split(), to_np(attn), **kwargs)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/text/models/awd_lstm.py in intrinsic_attention(self, text, class_id)
    245         cl = self.model[1](self.model[0].module(emb, from_embeddings=True))[0].softmax(dim=-1)
    246         if class_id is None: class_id = cl.argmax()
--> 247         cl[0][class_id].backward()
    248         attn = emb.grad.squeeze().abs().sum(dim=-1)
    249         attn /= attn.max()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    100                 products. Defaults to ``False``.
    101         """
--> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    103 
    104     def register_hook(self, hook):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88     Variable._execution_engine.run_backward(
     89         tensors, grad_tensors, retain_graph, create_graph,
---> 90         allow_unreachable=True)  # allow_unreachable flag
     91 
     92 

RuntimeError: cudnn RNN backward can only be called in training mode

As far as I can tell this problem should have gone away with PyTorch 1, but I am at a loss and don’t understand the code well enough to guess what I am doing wrong.
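For reference, the error does not seem to be specific to fastai. Something like this bare PyTorch snippet (a minimal sketch, assuming a GPU with cudnn enabled) should trigger the same RuntimeError:

import torch
import torch.nn as nn

# eval() is what triggers the restriction: cudnn's fused RNN kernels only
# keep the intermediate state needed for backward while in training mode
lstm = nn.LSTM(input_size=10, hidden_size=10).cuda().eval()
x = torch.randn(5, 1, 10, device='cuda', requires_grad=True)
out, _ = lstm(x)
out.sum().backward()  # RuntimeError: cudnn RNN backward can only be called in training mode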


It seems I may have found a way of making it work. I am reluctant to call it a solution, as I basically went by trial and error and there might be nuances in the code that I do not understand.

This works for me

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.7)
learn.load('saved_classifier');

def intrinsic_attention_modified(self, text:str, class_id:int=None):
    ids = self.data.one_item(text)[0]
    # Detach the embeddings so gradients can be tracked on them directly
    emb = self.model[0].module.encoder(ids).detach().requires_grad_(True)
    # Run the LSTM body in training mode so cudnn allows a backward pass
    self.model.train()
    self.model.zero_grad()
    self.model.reset()
    lstm_output = self.model[0].module(emb, from_embeddings=True)
    # Back to eval mode for the head, otherwise it complains about the dimensions
    self.model.eval()
    self.model.zero_grad()
    self.model.reset()
    cl = self.model[1](lstm_output)[0].softmax(dim=-1)
    if class_id is None: class_id = cl.argmax()
    cl[0][class_id].backward()
    attn = emb.grad.squeeze().abs().sum(dim=-1)
    attn /= attn.max()
    tokens = self.data.single_ds.reconstruct(ids[0])
    return tokens, attn

TextClassificationInterpretation.intrinsic_attention = intrinsic_attention_modified
ci = TextClassificationInterpretation.from_learner(learn)


ci.show_intrinsic_attention("please could you classify this sentence?")

It looks like PyTorch only calculates gradients for the RNN in training mode, so I split the network into its head and body: I put the body in training mode and switch back to eval mode for the head (otherwise it complains that the dimensions are wrong; my guess is that batchnorm does not work without an actual batch, but I might be horribly wrong). With this I seem to get reasonably good results.

EDIT: It seems, though, that the attention changes in an unpredictable fashion. I can only guess that this is due to the “training” I introduced, which might modify the gradients. I am way out of my comfort zone now.


Tagging @herrmann as he is the contributor who developed this feature.


I am having the same issue with this new feature. Trying to hack a fix, but unsuccessful so far :frowning:

Have you tried the changes I posted above?

Yes, I gave your changes a try, and they did allow the code to run, but it didn’t look like it was working correctly. The colors just got darker from the start of the sentence to the end. I was expecting words like “hated” and “amazing” to be emphasized in the color coding, but that didn’t seem to be the case. Are you getting good results with the modified code?

I am getting meaningful results, although I have not spent a long time with it yet. I changed the color map to “Purples”, as in my opinion it reflects the attention better. Are you sure you are not overfitting?

I will give it another try. I was just using the IMDB_SAMPLE dataset and only ran for a few epochs, so the trained model wasn’t great (~70%). Can I ask what your test cases are? Can you post an example of the output? Thanks.

Ran some additional tests with a better model (~94% on full IMDB) and got the following results.

This is a really cool feature. If the words highlighted in dark green are the ones that influence the prediction the most (and red the least), then it does work well in many cases. A little strange that xxbos was dark green in a few cases.

[Update] Some of the cells will show slightly different results if I run them multiple times, which is a little strange.

Yes, the colour map is not ideal: reds up to yellow-white have a score of 0-0.5 and everything above that is on the greenish side of things (you can hover your mouse over a word for the actual intrinsic attention score). My solution is to use a different colour map, and I found that Purples works quite well. Just import matplotlib.cm as cm and pass the argument cmap=cm.Purples (or whatever colour map you prefer) to the show_intrinsic_attention call.
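For example (a minimal usage sketch, reusing the ci object from above):

import matplotlib.cm as cm

# A sequential map reads naturally here: darker purple = higher attention
ci.show_intrinsic_attention("please could you classify this sentence?", cmap=cm.Purples)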

As for the instability of the answer, it surprised me a bit, but I reckon I just don’t know enough about what the code is doing. As soon as I have the time to read the relevant paper I will know more. In the meantime, I added a piece of code that runs the attention calculation 20 times and takes the mean for my data analysis. Horrendously slow, but it fits the purpose.
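Something along these lines (a minimal sketch of the averaging idea; mean_intrinsic_attention is just an illustrative name, not the actual code I am using):

import torch

def mean_intrinsic_attention(ci, text, n_runs=20):
    # Average the attention over several runs to smooth out the
    # run-to-run variability introduced by train-mode dropout
    attns = []
    for _ in range(n_runs):
        tokens, attn = ci.intrinsic_attention(text)
        attns.append(attn.detach())
    return tokens, torch.stack(attns).mean(dim=0)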

I am sorry I cannot share actual examples of what I have so far: I am working on my company’s data, which contains sensitive personal information, so I cannot make it public. I’ll try to share the code later today (I am currently a bit pressed for time).


I will give the Purples color map a try.

The instability might result from the way dropout is applied in training mode: different parts of the network get disabled on subsequent runs, producing slightly different outputs. Just a thought.
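You can see that effect in isolation with a bare dropout layer (a quick sketch):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 6)

drop.train()
print(drop(x))  # random mask, different on every call
print(drop(x))

drop.eval()
print(drop(x))  # identity in eval mode, so the output is stable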


Yes, I think you are right. And if I understand correctly, I turned on dropout by setting the LSTM to train mode. Which raises the question: why were we getting that error in the first place?

Good news: I think I have found a hacky way of making it more stable:

def intrinsic_attention_modified(self, text:str, class_id:int=None):
    """Calculate the intrinsic attention of the input w.r.t to an output `class_id`, or the classification given by the model if `None`.
    For reference, see the Sequential Jacobian section at https://www.cs.toronto.edu/~graves/preprint.pdf
    """
    # Train mode so cudnn allows the RNN backward pass, but with the dropout
    # (and batchnorm) layers manually forced back to eval mode for stability
    self.model.train()
    eval_dropouts(self.model)
    self.model.zero_grad()
    self.model.reset()
    ids = self.data.one_item(text)[0]
    emb = self.model[0].module.encoder(ids).detach().requires_grad_(True)
    lstm_output = self.model[0].module(emb, from_embeddings=True)
    # Back to eval mode before the head, otherwise the batch is too small
    self.model.eval()
    cl = self.model[1](lstm_output)[0].softmax(dim=-1)
    if class_id is None: class_id = cl.argmax()
    cl[0][class_id].backward()
    attn = emb.grad.squeeze().abs().sum(dim=-1)
    attn /= attn.max()
    tokens = self.data.single_ds.reconstruct(ids[0])
    return tokens, attn

def eval_dropouts(mod):
    module_name = mod.__class__.__name__
    if 'Dropout' in module_name or 'BatchNorm' in module_name: mod.training = False
    for module in mod.children(): eval_dropouts(module)

Essentially, if I manually set the dropout layers to eval mode, the cudnn error does not trigger and the results are stable (as dropout is skipped). I still have to put the network back into eval mode after the LSTM output has been computed, otherwise I get complaints about the batch not being large enough. This is a fairly minimal set of changes that seems to work, but I am pretty sure there is a better way of checking whether an nn module has dropouts or not…

EDIT: added batchnorms to the layers put manually in eval mode
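For what it’s worth, the same check can be written flat over model.modules() (just a sketch; note that an isinstance check against nn.Dropout would miss fastai’s custom dropout modules such as RNNDropout and WeightDropout, whose class names still contain “Dropout”):

import torch.nn as nn

def eval_dropouts_flat(model: nn.Module):
    # Set training=False on the matching modules only: calling .eval() on
    # WeightDropout would also flip the LSTM it wraps back to eval mode
    # and re-trigger the cudnn error
    for m in model.modules():
        if 'Dropout' in type(m).__name__ or 'BatchNorm' in type(m).__name__:
            m.training = False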


You should also manually put the BatchNorm layers in eval mode.


I put the whole model back into eval mode before evaluating the output of the head, so at least for the awd_lstm it should not be necessary, as I don’t see BatchNorm layers in the RNN itself? Anyway, yes, I might as well do it so that the hack is a little more general :slight_smile:

Ah, I missed that part! So it should be good then :slight_smile:
Do you want to suggest a PR with the updated code? It’s a bit hacky, but it works, and the current version is broken.

Sure thing, although having strings hardcoded in the function scares me.

I was trying to add a test to check both that the problem is there and that my patch fixes it, but I am now getting a different error.

This:

import numpy as np
import pandas as pd
from fastai.text import *

words = 'this is just a random set of words to use'.split()
df = pd.DataFrame([[np.random.randint(0, 3), ' '.join(np.random.choice(words, 10, replace=True))] for _ in range(128)])
db = TextClasDataBunch.from_df('.', df, df)
learn = text_classifier_learner(db, AWD_LSTM, pretrained=False)

ci = TextClassificationInterpretation.from_learner(learn)
ci.intrinsic_attention('something something')

now throws this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-71696478e2c3> in <module>()
     19 
     20 ci = TextClassificationInterpretation.from_learner(learn)
---> 21 ci.intrinsic_attention('something something')

<ipython-input-36-71696478e2c3> in intrinsic_attention(self, text, class_id)
      8         self.model.zero_grad()
      9         self.model.reset()
---> 10         cl = self.model[1](self.model[0].module(emb, from_embeddings=True))[0].softmax(dim=-1)
     11         if class_id is None: class_id = cl.argmax()
     12         cl[0][class_id].backward()

/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/Repos/fastai-fork/fastai/text/learner.py in forward(self, input)
    227 
    228     def forward(self, input:Tuple[Tensor,Tensor, Tensor])->Tuple[Tensor,Tensor,Tensor]:
--> 229         raw_outputs,outputs,mask = input
    230         output = outputs[-1]
    231         avg_pool = output.masked_fill(mask[:,:,None], 0).mean(dim=1)

ValueError: not enough values to unpack (expected 3, got 2)

Has anything changed?

EDIT: notice that if I try the same trick on v1.0.46, I don’t get this error, and the fix works

Ah yes, I added a mask to ignore padding in the mean and max pooling, so the encoder now returns three things. There may be things to adapt in the rest of the code to go along with it.

I put your code in and updated it for the new version of fastai to fix the issue. What’s your GitHub handle, so that I can properly thank you in the CHANGES.md file?
