Using QRNN in Language Models


QRNNs were introduced in this article by James Bradbury, Stephen Merity, Caiming Xiong and Richard Socher as an alternative to LSTMs. The main advantage is that they are 2 to 4 times faster (depending on your batch size/bptt) and can reach the same state-of-the-art results.
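For context, the speed-up comes from the fact that a QRNN computes all its gate values with convolutions, in parallel across timesteps, leaving only a cheap element-wise recurrence ("fo-pooling") to run sequentially. A minimal sketch of that recurrence in plain Python (the real implementation runs this as a fused CUDA kernel via cupy; the variable names here are my own):

```python
# Sketch of QRNN fo-pooling: z are candidate values, f forget gates,
# o output gates -- each a list of per-timestep scalars, all produced
# in parallel by convolutions before this loop runs.
def fo_pool(z, f, o):
    c, hs = 0.0, []
    for z_t, f_t, o_t in zip(z, f, o):
        c = f_t * c + (1.0 - f_t) * z_t  # gated running state
        hs.append(o_t * c)               # hidden state at this timestep
    return hs
```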

I’ve adapted their QRNN PyTorch implementation into the fastai library. To use it, you must first install the cupy package, then just add the option qrnn=True when you build a language model, for instance:

learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=drops[0], dropout=drops[1], 
                      wdrop=drops[2], dropoute=drops[3], dropouth=drops[4], qrnn=True)

To install the cupy package, just follow the instructions on their GitHub repo. It should be as easy as pip install cupy-cudaXX (with XX being 80, 90 or 91 depending on your CUDA version). Note that on Windows, you must install the Visual C++ build tools first for it to work (scroll the page a bit to find them on their own, without Visual Studio 2017).
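Once installed, a quick way to check that cupy landed in the right environment and exposes the CUDA bindings the QRNN code imports (check_cupy is a hypothetical helper for diagnosis, not part of fastai):

```python
import importlib.util

# Return a short diagnostic string rather than letting the QRNN
# import fail later with a less obvious ImportError.
def check_cupy():
    if importlib.util.find_spec("cupy") is None:
        return "cupy is not installed in this environment"
    if importlib.util.find_spec("cupy.cuda") is None:
        return "cupy is installed but cupy.cuda is missing"
    return "cupy looks OK"
```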

I’m currently trying to find a good set of hyper-parameters (all the dropouts have to change for instance) and will share a notebook as soon as I have something as good as the LSTM version of the Language Model.

Language Model Zoo :gorilla:

Great work! Can’t wait to try it. Is it also multi-GPU like the original repo?


It should be, though I have only tested it on one GPU for now.

(Ben Johnson) #4

Have you tried replacing the LSTM in ULMFit w/ the QRNN yet? If not, I can do it if you share the pretrained model with me.


I haven’t pretrained a model with QRNNs on wt103 yet, but I will as soon as I figure out the best set of hyper-parameters for training! Then I’ll share the result with the regular LSTM pretrained models.

(Divyansh Jha) #6

Great Work!!

(Monique Monteiro) #7


Have you tried it with your French language model?

Best regards,


Not yet. Right now, I’ve trained an English model and I’m looking at redoing the imdb notebook with it.

(Thomas Wolf) #9

That’s really cool!
Are you starting from their latest set of QRNN hyper-parameters in “An Analysis of Neural Language Modeling at Multiple Scales”?


Exactly. With just a few tweaks to use the 1cycle policy and to try to achieve super-convergence.
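For reference, the 1cycle policy (Leslie Smith’s schedule) ramps the learning rate up to a maximum and back down over the course of training. A simplified linear sketch (fastai’s actual implementation also cycles momentum inversely, which is omitted here; the function and parameter names are my own):

```python
# Simplified 1cycle learning-rate schedule: linear warm-up to lr_max
# over the first half of training, then linear decay back to lr_max/div.
def one_cycle_lr(step, total_steps, lr_max, div=10.0):
    lr_min = lr_max / div
    half = total_steps // 2
    if step < half:
        return lr_min + (lr_max - lr_min) * step / half
    return lr_max - (lr_max - lr_min) * (step - half) / (total_steps - half)
```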

(Sooheon Kim) #11


I’ve installed cupy-cuda91 (same as my cuda version), and tried to get this working, but it errors out here:

~/fastai/fastai/torchqrnn/ in <module>()
      2 import torch
      3 from torch.autograd import Variable
----> 4 from cupy.cuda import function
      5 from cupy.cuda.compiler import _NVRTCProgram
      6 from collections import namedtuple

ImportError: cannot import name 'function'

Are you using a different version of cupy which has that definition?


Did you install cupy inside the fastai environment?
I’m using the regular version of cupy from the github repo I mentioned in the first post.

(Sooheon Kim) #13

Yep, sorry, this was just installation wonkiness.


I’m getting KeyError: 'unexpected key "0.rnns.0.module.weight_ih_l0" in state_dict' when I run learner.model.load_state_dict(wgts), though it goes away when I remove qrnn=True. Has this happened to anyone else?


This line is to load the pretrained model from Jeremy, which has LSTMs, not QRNNs.
There is no pretrained QRNN model yet (I’m working on it, but for now my pretrained model doesn’t get as good results on imdb).
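If you do want to reuse whatever weights overlap between an LSTM checkpoint and a QRNN model (e.g. the embeddings and decoder), one workaround is to load only the keys whose names and shapes match; load_matching_weights below is a hypothetical helper sketch, not a fastai function:

```python
# Copy over only the entries of `wgts` that exist in the target model
# with the same shape, leaving QRNN-specific parameters untouched.
def load_matching_weights(model, wgts):
    own = model.state_dict()
    matched = {k: v for k, v in wgts.items()
               if k in own and own[k].shape == v.shape}
    own.update(matched)
    model.load_state_dict(own)
    return matched  # inspect which keys actually transferred
```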

(adrian) #16

Good stuff. I note you fixed a bug in wdrop in the fastai codebase a couple of weeks ago. I was getting an error when I set wdrop to zero and ended up just setting it to a very low value for experiments. On dropout hyper-parameters, I wrote up some test results here.

I was planning some more analysis, but I’ve reached the 80/20 effort mark, I think. I’m happy to run more experiments if you think of anything.

One thing I tried was to use a ‘reducer’ mask instead of full dropout on the weights, but it needs more checking to make sure it’s doing what I want before I show results.
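For comparison, the standard weight drop (DropConnect on the recurrent weights, as in AWD-LSTM) can be sketched as an element-wise Bernoulli mask with rescaling, here in plain Python (the rng argument is injectable only to make the sketch testable):

```python
import random

# Weight drop: zero each weight with probability p and rescale the
# survivors by 1/(1-p); a fresh mask is drawn every forward pass.
def weight_drop(weights, p, rng=random.random):
    if p == 0.0:
        return list(weights)  # the wdrop=0 edge case mentioned above
    return [0.0 if rng() < p else w / (1.0 - p) for w in weights]
```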


Yes, I saw your blog post. Very interesting stuff! I’ll look to see if I can find why there is a difference in the weight drop results.

(adrian) #18

I have been re-running the analysis with 100 epochs, which gives a different picture of things. I’ve added the results so far, as can be seen in figures 12-14. It will be another day or two before all runs are complete.

A few things I am not clear on and need to look into a bit more: why the change in loss after a number of epochs (10-20 with the other dropouts at zero, 30-50 with defaults for the others) exhibits such a smooth response, and why there is a distinct drop in loss just before this occurs.

As one might expect, having dropout on all parameters results in a far more stable loss response (Figure 13). It will be interesting to see the results of just using input dropout. It looks like having dropout only at the stage where weight drop and later (output) dropout are applied results in poorer performance over time no matter what you do (Figure 12 a), c)), but if we apply dropout at the front of the network we can still keep improving (Figure 12 b), embedding dropout >= 0.5).