QRNNs were introduced in this article by James Bradbury, Stephen Merity, Caiming Xiong and Richard Socher as an alternative to LSTMs. The main advantage is that they are 2 to 4 times faster (depending on your batch size/bptt) and can reach the same state-of-the-art results.
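The speed-up comes from the fact that the gate activations are computed in parallel across timesteps; only a cheap element-wise pooling step (the "forget mult", roughly h[t] = f[t] * x[t] + (1 - f[t]) * h[t-1]) is sequential. A minimal pure-Python sketch of that pooling step, just to illustrate the recurrence (not the actual CUDA kernel):

```python
def forget_mult(f, x, h0=0.0):
    """Sequential pooling step of a QRNN:
    h[t] = f[t] * x[t] + (1 - f[t]) * h[t-1].
    f and x are equal-length lists of floats, with f[t] in [0, 1]."""
    h, prev = [], h0
    for ft, xt in zip(f, x):
        prev = ft * xt + (1 - ft) * prev
        h.append(prev)
    return h

# With f[t] = 1 the output just copies x; with f[t] = 0 it carries h0 forward.
print(forget_mult([1.0, 0.5], [2.0, 4.0]))  # [2.0, 3.0]
```

Since each output element depends only on a couple of multiplies and adds, this loop is what the cupy/CUDA kernel parallelises over the batch and hidden dimensions.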
I’ve adapted their QRNN PyTorch implementation into the fastai library. To use it, you must first install the cupy package, then just add the option qrnn=True when you build a language model, for instance:
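Something along these lines (a sketch against the fastai v0.7 API; the model data object, optimizer and sizes here are placeholders, so adapt them to your own setup):

```python
# md is a LanguageModelData object built beforehand; em_sz/nh/nl are
# the embedding size, hidden size and number of layers you want.
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.4, dropout=0.4, wdrop=0.1,
                       dropoute=0.05, dropouth=0.2,
                       qrnn=True)  # swap the LSTM layers for QRNNs
```

The dropout values above are illustrative only; as noted below, the good hyper-parameters for QRNNs differ from the LSTM defaults.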
To install the cupy package, just follow the instructions on their GitHub repo. It should be as easy as pip install cupy-cudaXX (with XX being 80, 90 or 91 depending on your CUDA version). Note that on Windows, you must install the Visual C++ build tools first for it to work (scroll the page a bit to find them on their own, without Visual Studio 2017).
I’m currently trying to find a good set of hyper-parameters (all the dropouts have to change for instance) and will share a notebook as soon as I have something as good as the LSTM version of the Language Model.
I haven’t pretrained a model with QRNNs on wt103 yet, but I will as soon as I figure out the best set of hyper-parameters for training! Then I’ll share the result alongside the regular LSTM pretrained models.
That’s really cool!
Are you starting from their latest set of QRNN hyper-parameters in “An Analysis of Neural Language Modeling at Multiple Scales” (https://arxiv.org/abs/1803.08240) ?
I’ve installed cupy-cuda91 (same as my cuda version), and tried to get this working, but it errors out here:
~/fastai/fastai/torchqrnn/forget_mult.py in <module>()
2 import torch
3 from torch.autograd import Variable
----> 4 from cupy.cuda import function
5 from cupy.cuda.compiler import _NVRTCProgram
6 from collections import namedtuple
ImportError: cannot import name 'function'
Are you using a different version of cupy which has that definition?
I’m getting KeyError: 'unexpected key "0.rnns.0.module.weight_ih_l0" in state_dict' when I run learner.model.load_state_dict(wgts), though it goes away when I remove qrnn=True. Has this happened to anyone else?
This line is to load the pretrained model from Jeremy, which has LSTMs and not QRNNs.
There is no pretrained QRNN model yet (working on this, but for now my pretrained model doesn’t get as good results on imdb).
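A quick way to see why the load fails is to diff the key sets of the checkpoint and the model before calling load_state_dict. A small sketch with plain dicts (the helper name and the QRNN parameter name are mine, purely illustrative):

```python
def diff_state_dicts(model_keys, ckpt_keys):
    """Report checkpoint keys the model doesn't have ("unexpected")
    and model keys the checkpoint doesn't cover ("missing")."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    return {'unexpected': sorted(ckpt_keys - model_keys),
            'missing': sorted(model_keys - ckpt_keys)}

# QRNN layers register different parameter names than LSTM layers,
# so an LSTM checkpoint shows up as "unexpected" keys on a QRNN model.
lstm_ckpt = ['0.rnns.0.module.weight_ih_l0', '0.encoder.weight']
qrnn_model = ['0.rnns.0.linear.weight', '0.encoder.weight']  # hypothetical
print(diff_state_dicts(qrnn_model, lstm_ckpt))
```

Running diff on your actual learner.model.state_dict().keys() and wgts.keys() shows exactly which layers don't line up.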
Good stuff. I note you fixed a bug in wdrop in the fastai codebase a couple of weeks ago. I was getting an error when I set wdrop to zero and ended up just setting it to a very low value for experiments. On dropout hyper-parameters, I wrote up some test results here.
I was planning some more analysis, but I think I’ve reached the 80/20 effort mark. I’m happy to run more experiments if you think of anything.
One thing I tried was to use a ‘reducer’ mask instead of full dropout on the weights, but I need to do more checking to ensure it is doing what I want before showing results.
I have been re-running the analysis with 100 epochs, which gives a different picture of things. I’ve added the results so far, as can be seen in figures 12-14. It will be another day or two before all runs are complete.
A few things I am not clear on and need to look into a bit more: why the loss curve becomes so smooth after a number of epochs (10-20 with the other dropouts at zero, 30-50 with defaults for the others); and why there is a distinct drop in loss just before this occurs.
As one might expect, having dropout on all parameters results in a far more stable loss response (Figure 13). It will be interesting to see the results of using just input dropout. It looks like applying dropout only at the stage where weight drop and later dropouts (output dropout) act results in poorer performance over time no matter what you do (Figure 12 a), c)), but if we apply dropout at the front of the network we can still keep improving (Figure 12 b), embedding dropout >= 0.5).
How come the pretrained LM with QRNN wasn’t as good as the one without?
I’m training LMs using QRNN and so far I’m just writing down the params I used for each training session and the results after each session. I remember Jeremy changing the dropouts depending on the results from each session, so the whole thing seemed a bit “organic”.
What’s a good way to document the LM training process?
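One lightweight approach (just a sketch, nothing fastai-specific, and the file name and fields are my own invention) is to append each run's hyper-parameters and final metrics to a CSV, so the "organic" tweaking between sessions leaves a trail you can sort and compare later:

```python
import csv
import datetime
import os

def log_run(path, params, metrics):
    """Append one training run (hyper-params + results) as a row in a CSV log.
    Assumes you pass the same keys on every call, so the header stays valid."""
    row = {'timestamp': datetime.datetime.now().isoformat(), **params, **metrics}
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if new_file:
            writer.writeheader()  # write the column names once
        writer.writerow(row)

# Example: record a session's settings and the resulting validation loss.
log_run('lm_runs.csv',
        {'qrnn': True, 'wdrop': 0.1, 'dropouti': 0.4, 'epochs': 10},
        {'val_loss': 4.12})
```

A spreadsheet or pandas can then read the log directly, which makes it easy to see which dropout change moved the validation loss between sessions.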