Beginning of NLP


#1

It’s been a while since there was a new article here. We still have to do a big edit on the transforms/data augmentation but in the meantime, here is a bit of NLP. I’ll cover two things today: the AWD-LSTM from Stephen Merity (which is our basic LM in fastai) and the pre-processing of texts in general.

AWD-LSTM

As was the case with the previous fastai, the basic language model in fastai_v1 will be the AWD-LSTM from Stephen Merity. There has been a lot of talk about the Transformer model, and we’ll probably offer it as an alternative later on, but the date when the library needs to be ready is near, so we decided to focus on something we know well for now. All the new implementation is in notebook 007.

A language model is a model that is tasked with predicting the next word in a sentence, given all the previous ones. Language models often use a Recurrent Neural Network (RNN), which has the particularity of having a hidden state. Each time we feed a new word into the model, the hidden state is updated, but since it’s never reinitialized, it keeps track of everything that was said in the sentence, giving the model the ability to remember things (in a way).

This gives us a constraint compared to traditional Computer Vision: we can’t shuffle the batches during training; we have to feed them in the right order, otherwise it will mess up those hidden states. Sadly, GPU memory is limited, so we need to stop backpropagating gradients after a certain number of words has been processed, otherwise we’ll hit the traditional CUDA out-of-memory error. The number of words for which we keep the gradients is called bptt, for backpropagation through time. It’s usually a fixed number, but the first genius idea of the article is to vary it slightly (randomly) at each batch, so that when we run another epoch, the model isn’t fed exactly the same sequences of words in the same order.
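The variable-bptt trick can be sketched like this. This is a rough sketch following the scheme described in the AWD-LSTM paper (use the base bptt most of the time, half of it otherwise, then add Gaussian jitter); the probability, standard deviation, and the helper name `sample_seq_len` are illustrative, not the actual fastai code:

```python
import numpy as np

def sample_seq_len(bptt=70, p=0.95, std=5, min_len=5):
    """Randomly perturb the backprop-through-time window for each batch.

    Most of the time the base bptt is used, occasionally half of it,
    and Gaussian noise is added on top, so successive epochs don't see
    exactly the same word sequences at the same positions."""
    base = bptt if np.random.random() < p else bptt / 2
    return max(min_len, int(np.random.normal(base, std)))
```

The training loop would then call this once per batch to decide how many words to read before truncating backpropagation.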

The second main idea of the article is to put dropout everywhere. Really everywhere, but in an intelligent way. There is a difference from the usual dropout, which is why you’ll see an RNNDropout module inside notebook 007: we zero things, as is usual in dropout, but we always zero the same coordinates along the sequence dimension (which is the first dimension in PyTorch). This ensures consistency when updating the hidden state through whole sentences/articles. If we unroll our input as in the picture above, it means we zero the same coordinates inside i1, i2…
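Here is a minimal NumPy sketch of that idea (the real module operates on PyTorch tensors, and `rnn_dropout` is a hypothetical name): the mask is drawn once for the batch and feature dimensions, then broadcast across the sequence dimension, so the same coordinates are zeroed at every timestep.

```python
import numpy as np

def rnn_dropout(x, p=0.5, rng=None):
    """x has shape (seq_len, batch, n_features).

    A single Bernoulli mask of shape (1, batch, n_features) is drawn and
    broadcast over seq_len, so identical positions are zeroed at every
    timestep. Scaling by 1/(1-p) keeps the expected activation unchanged
    (inverted dropout)."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.binomial(1, 1 - p, size=(1,) + x.shape[1:]) / (1 - p)
    return x * mask
```

Standard dropout would instead draw an independent mask for every timestep, which breaks the consistency of the signal the hidden state sees along the sequence.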

This being given, there are five different dropouts:

  • the first one, embedding dropout, is applied when we look up the ids of our tokens inside the embedding matrix (to transform them from numbers to vectors of floats). We zero some of its rows, so random ids are mapped to a vector of zeros instead of their embedding vector.
  • the second one, input dropout, is applied to the result of that embedding lookup. We forget random pieces of the embedded input (but, as stated in the last paragraph, the same ones along the sequence dimension).
  • the third one is the weight dropout. It’s the trickiest to implement, as we randomly replace some weights of the hidden-to-hidden matrix inside the RNN with zeros: this needs to be done in a way that ensures the gradients are still computed and the initial weights still updated.
  • the fourth one is the hidden dropout. It’s applied to the output of one of the layers of the RNN before it’s used as the input of the next layer (again, the same coordinates are zeroed along the sequence dimension). This one isn’t applied to the last output, but rather…
  • the fifth one is the output dropout; it’s applied to the last output of the model (and, like the others, it’s applied consistently along the first dimension).
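Of those five, the weight dropout (DropConnect) is the subtle one. A NumPy sketch of just the masking step is below; in the actual PyTorch implementation the unmasked weights are stored as a separate "raw" parameter and the dropped copy is recomputed on every forward pass, which is what lets gradients still reach the original weights (the PyTorch version also rescales by 1/(1-p), omitted here; `weight_drop` is a hypothetical name):

```python
import numpy as np

def weight_drop(w_raw, p=0.5, rng=None):
    """DropConnect masking step: zero random entries of the
    hidden-to-hidden weight matrix. Re-drawing this mask from the raw
    weights on each forward pass is what keeps the raw weights trainable
    in the real implementation."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.binomial(1, 1 - p, size=w_raw.shape)
    return w_raw * mask
```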

Lastly, we have two additional regularizations called AR (for Activation Regularization) and TAR (for Temporal Activation Regularization). They look a bit like weight decay, which can be seen as adding to the loss a scaled factor of the sum of the squared weights.

  • for AR, we add to the loss a scaled factor of the sum of all the squares of the outputs (with dropout applied) of the various layers of the RNN. Where weight decay tries to get the network to learn small weights, this tries to get the model to produce smaller activations (intuitively)
  • for TAR, we add to the loss a scaled factor of the sum of the squares of h_(t+1) - h_t, where h_t is the output (before dropout is applied) of one layer of the RNN at time step t (word t of the sentence). This encourages the model to produce activations that don’t vary too fast.
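The two penalties above can be sketched in a few lines. This is a rough illustration (using a mean rather than a sum for the squared terms, with illustrative coefficients; `ar_tar_penalty` is a hypothetical name), not the fastai implementation:

```python
import numpy as np

def ar_tar_penalty(h_dropped, h_raw, alpha=2.0, beta=1.0):
    """AR penalizes large activations: alpha times the mean squared
    (dropout-applied) outputs. TAR penalizes fast-varying activations:
    beta times the mean squared difference between consecutive
    timesteps of the raw outputs.

    Both h_dropped and h_raw have shape (seq_len, batch, n_hid)."""
    ar = alpha * (h_dropped ** 2).mean()
    tar = beta * ((h_raw[1:] - h_raw[:-1]) ** 2).mean()
    return ar + tar
```

Note that for perfectly constant activations the TAR term vanishes, which matches the intuition that it only punishes activations that change quickly from one word to the next.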

Those are implemented in the callback RNNTrainer (previously in fastai it was the seq2seq_reg function).

Text preprocessing

This is covered in notebook 007a. The basic idea is to make it super easy for the end user and do all the necessary work inside the text dataset. This led to the class TextDataset, which can build such a dataset in multiple ways (folders with a train/valid structure, a csv file, tokens, ids). It takes a tokenizer if you define the dataset at a level below tokens, and will create (or use, if you pass it) the vocab that converts such tokens to ids.

The tokenizer is wrapped in the Tokenizer class. This one takes a tokenizer function (like a spacy tokenizer), a language, and a set of rules, such as removing multiple spaces or replacing an all-caps word by TOK_UP followed by the word without caps. Special tokens of the fastai_v1 library all begin with xx, so we have xxbos, xxeos, xxpad, xxunk…
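Two such rules might look like the following sketch. The function names and exact regexes are illustrative, not the actual fastai_v1 rules:

```python
import re

def rm_extra_spaces(t):
    """Collapse runs of multiple spaces into a single space."""
    return re.sub(r' {2,}', ' ', t)

def replace_all_caps(t, tok='TOK_UP'):
    """Replace an all-caps word by a marker token followed by the
    lowercased word, so capitalization is kept as an explicit signal
    without doubling the vocabulary."""
    return re.sub(r'\b([A-Z]{2,})\b',
                  lambda m: f'{tok} {m.group(1).lower()}', t)
```

Each rule is a plain string-to-string function, so the Tokenizer can simply apply the list of rules in order before (or after) calling the underlying tokenizer function.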

If you remember the imdb notebook, you know that tokenizing can take quite a bit of time, so the TextDataset class creates a folder where it stores all the computed arrays (tokens, then ids) so that it doesn’t recalculate them the next time you go through the notebook (unless you’ve changed something). All in all, the goal is to have something that is as easy to use as the ImageData in old fastai.


(Brian Muhia) #2

Great progress! I want to test this out. The best way is still to install fastai_v1 using pip install git+https://github.com/fastai/fastai_v1.git, right?


#3

No, you should clone the fastai_v1 repo. It’s not installable yet, just notebooks.


(WG) #4

Really looks good.

This is a gist I shared with Jeremy back when I first heard that v1 was happening.

Similar to your work here, I suggest a “pluggable” architecture wherein you can use the base classes for things like string cleaning, tokenization, and vocab, or create your own derived classes as needed. Additionally, I wrap the entire process of getting your tokens (both raw and numericalized), vocab, and labels with a single call to an “orchestration” class I call TextProcessor. With this in place, I can have trainable data in 3-5 or so lines of code with sensible defaults for preprocessing, tokenization, and vocab building.

It currently follows the sklearn method naming conventions (e.g., fit, fit_transform, transform) since they make sense and most folks can infer what they mean without looking at any documentation.

Anyways, I share this here because I didn’t want to infect your code with anything out of line with where the v1 framework is heading … but if you are open to my suggestions, you can infer them from the gist or I’d be glad to make some pull requests. Either way, I’m really liking where things are going.


(nirant) #5

I don’t think I ever appreciated how powerful the fastai abstractions for NLP (especially pre-processing and dataloaders) are, till I went ahead and wrote a simple torchtext notebook myself.

Fantastic stuff!

And this is despite the fact that text loaders are not as easy to use as ImageData (as already noted). Look forward to how this turns out!


(Jeremy Howard) #6

Many thanks for the reminder - we’ll definitely take a look at this and see if some ideas make sense to bring in to fastai_v1. Probably doesn’t make sense to do a PR since we’re still figuring out the design.


(Even Oldridge) #8

Great write-up Sylvain. I had a basic understanding of AWD-LSTM but this is much more detailed and is quite helpful. Just to confirm my understanding and maybe help clarify the text assuming this is going to be used as a description in a notebook, this consistency is happening at the batch level? And it remains consistent across bptt and the length of the sequence, but changes with the next batch?

I’m looking forward to the transformer implementation, as that’s an architecture I’ve studied a little. I also think we should be exploring attention for this architecture and more generally. I’m hoping it’ll be a part of Part 2 this year.


#9

Dropout is consistent across the sequence length, yes. The dropped coordinates will then most likely differ from batch to batch (or along other dimensions if applicable) since the mask is randomly drawn. For instance, if we have a 3D tensor (seq_len = 3 by batch_size = 2 by embedding size or hidden state = 3) like this
[[[0.1, 0.2, 0.3], [1., 2., 3.]],
 [[0.4, 0.5, 0.6], [4., 5., 6.]],
 [[0.7, 0.8, 0.9], [7., 8., 9.]]]
An RNN-dropped-out version could look like:
[[[0.1, 0., 0.3], [0., 2., 0.]],
 [[0.4, 0., 0.6], [0., 5., 0.]],
 [[0.7, 0., 0.9], [0., 8., 0.]]]
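That example can be reproduced with a fixed mask: one 0/1 value per (batch, feature) coordinate, broadcast over the sequence dimension (the usual 1/(1-p) rescaling is omitted here so the numbers match the example above):

```python
import numpy as np

# (seq_len=3, batch_size=2, n_features=3), as in the example above.
x = np.array([[[0.1, 0.2, 0.3], [1., 2., 3.]],
              [[0.4, 0.5, 0.6], [4., 5., 6.]],
              [[0.7, 0.8, 0.9], [7., 8., 9.]]])

# One mask entry per (batch, feature) coordinate; shape (1, 2, 3)
# broadcasts over seq_len, so every timestep is zeroed identically.
mask = np.array([[[1., 0., 1.], [0., 1., 0.]]])
dropped = x * mask
```

Running this gives exactly the dropped-out tensor shown above: the second feature of the first batch element and the first and third features of the second are zeroed at all three timesteps.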


(Yashu Seth) #10

I recently wrote a blog post on the AWD-LSTM paper giving a walk-through of the different techniques employed in it. Hope it can be useful.

What makes the AWD-LSTM great?