QRNNs were introduced in this article by James Bradbury, Stephen Merity, Caiming Xiong and Richard Socher as an alternative to LSTMs. The main advantage is that they are two to four times faster (depending on your batch size and bptt) while reaching the same state-of-the-art results.
I’ve adapted their QRNN PyTorch implementation into the fastai library. To use it, you must first install the cupy package, then just add the option qrnn=True when you build a language model, for instance:
learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=drops, dropout=drops,
wdrop=drops, dropoute=drops, dropouth=drops, qrnn=True)
To install the cupy package, just follow the instructions on their GitHub repo. It should be as easy as pip install cupy-cudaXX (with XX being 80, 90 or 91, depending on your CUDA version). Note that on Windows, you must first install the Visual C++ build tools for it to work (scroll down the page a bit to find them on their own, without Visual Studio 2017).
I’m currently trying to find a good set of hyper-parameters (all the dropouts have to change for instance) and will share a notebook as soon as I have something as good as the LSTM version of the Language Model.
Great work! Can’t wait to try it. Is it also multi-GPU like the original repo?
It should be, though I have only tested it on one GPU for now.
Have you tried replacing the LSTM in ULMFit w/ the QRNN yet? If not, I can do it if you share the pretrained model with me.
I haven’t pretrained a model with QRNNs on wt103 yet, but I will as soon as I figure out the best set of hyper-parameters for training! Then I’ll share the result alongside the regular LSTM pretrained models.
Have you tried it with your French language model?
Not yet. Right now, I’ve trained an English model and I’m looking at redoing the imdb notebook with it.
That’s really cool!
Are you starting from their latest set of QRNN hyper-parameters in “An Analysis of Neural Language Modeling at Multiple Scales” (https://arxiv.org/abs/1803.08240) ?
Exactly, with just a few tweaks to use the 1cycle policy and try to achieve super-convergence.
I’ve installed cupy-cuda91 (same as my cuda version), and tried to get this working, but it errors out here:
~/fastai/fastai/torchqrnn/forget_mult.py in <module>()
2 import torch
3 from torch.autograd import Variable
----> 4 from cupy.cuda import function
5 from cupy.cuda.compiler import _NVRTCProgram
6 from collections import namedtuple
ImportError: cannot import name 'function'
Are you using a different version of cupy which has that definition?
Did you install cupy inside the fastai environment?
I’m using the regular version of cupy from the github repo I mentioned in the first post.
Yep, sorry, this was just installation wonkiness.
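For anyone hitting the same ImportError, a quick sanity check is to test whether cupy is importable in the active environment before fastai tries to use it. This is just an illustrative sketch (the `has_cupy` helper is not part of fastai or cupy):

```python
import importlib.util

def has_cupy():
    """Return True if the cupy package can be found in the current environment."""
    return importlib.util.find_spec("cupy") is not None

if has_cupy():
    # This is the import that forget_mult.py performs and that failed above.
    from cupy.cuda import function
else:
    print("cupy not found; install cupy-cudaXX matching your CUDA version "
          "inside the same environment you run fastai from")
```

Running this inside the fastai conda environment (rather than the base one) makes it obvious whether the package landed in the right place.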
I get KeyError: 'unexpected key "0.rnns.0.module.weight_ih_l0" in state_dict' when I run
learner.model.load_state_dict(wgts), though the error goes away when I remove
qrnn=True. Has this happened to anyone else?
That line loads the pretrained model from Jeremy, which uses LSTMs, not QRNNs, so the parameter names don’t match.
There is no pretrained QRNN model yet (working on this, but for now my pretrained model doesn’t get as good results on imdb).
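To see why the load fails, it can help to diff the key sets before calling load_state_dict. A minimal sketch (this helper is hypothetical, not part of fastai; the key names below are illustrative stand-ins for real state_dict keys):

```python
def diff_state_dict_keys(model_keys, weight_keys):
    """Return (unexpected, missing): keys present only in the saved weights,
    and keys present only in the model. Useful for spotting an
    LSTM-vs-QRNN parameter-name mismatch before load_state_dict raises."""
    model_keys, weight_keys = set(model_keys), set(weight_keys)
    return sorted(weight_keys - model_keys), sorted(model_keys - weight_keys)

# Plain strings standing in for real state_dict keys:
unexpected, missing = diff_state_dict_keys(
    ["0.rnns.0.linear.weight"],        # QRNN-style parameter name (illustrative)
    ["0.rnns.0.module.weight_ih_l0"],  # LSTM-style name from the pretrained file
)
# unexpected == ['0.rnns.0.module.weight_ih_l0']
# missing    == ['0.rnns.0.linear.weight']
```

With real models you would pass `learner.model.state_dict().keys()` and `wgts.keys()`; any non-empty `unexpected` list means the checkpoint was saved from a different architecture.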
Good stuff. I note you fixed a bug in wdrop in the fastai codebase a couple of weeks ago. I was getting an error when I set wdrop to zero and ended up just setting it to a very low value for my experiments. On the dropout hyper-parameters, I wrote up some test results here.
I was planning some more analysis, but I think I’ve reached the 80/20 effort mark. I’m happy to run more experiments if you think of anything.
One thing I tried was to use a ‘reducer’ mask instead of full dropout on the weights, but I need to do more checking to ensure it is doing what I want before showing results.
Yes, I saw your blog post. Very interesting stuff! I’ll look to see if I can find why there is a difference in the weight drop results.
I have been re-running the analysis with 100 epochs, which gives a different picture of things. I’ve added the results so far, as can be seen in figures 12-14. It will be another day or two before all runs are complete.
A few things I am not clear on and need to look into a bit more: why the loss, after a number of epochs (10-20 with the other dropouts at zero, 30-50 with defaults for the others), exhibits such a smooth response; and why there is a distinct drop in loss just before this occurs.
As one might expect, having dropout on all parameters results in a far more stable loss response (Figure 13). It will be interesting to see the results of just using input dropout. It looks like only having dropout at the stage where weight drop and, later, output dropout are applied results in poorer performance over time no matter what you do (Figure 12 a), c)), but if we apply dropout at the front of the network we can keep improving (Figure 12 b), embedding dropout >= 0.5).
Can you explain how you adapted it to the fastai library? Or generally how you would go about adapting other implementations?
How come the pretrained LM with QRNN wasn’t as good as the one without?
I’m training LMs using QRNNs, and so far I’m just writing down the params I used for each training session and the results after each session. I remember Jeremy changing the dropouts depending on the results from each session, so the whole process seemed a bit “organic”.
What’s a good way to document the LM training process?
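One lightweight approach, just as a sketch, is to append one JSON line per session to a log file; the file name, parameter names, and values below are all hypothetical placeholders, not recommendations:

```python
import json
import time

def log_session(path, params, metrics):
    """Append one training session's hyper-parameters and results as a JSON line."""
    record = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical session record (names and values are placeholders):
log_session("lm_training_log.jsonl",
            params={"qrnn": True, "bptt": 70, "dropouth": 0.3, "lr": 1e-3},
            metrics={"val_loss": 4.2})
```

Because each line is self-contained JSON, the log survives the “organic” tweak-and-rerun workflow and can later be loaded into a dataframe to compare sessions.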