Lesson 4 - RNN model details and underlying fast.ai code

I’ve been poking around the library trying to understand what’s going on under the hood of the RNN model @jeremy presented in lesson 4, and reading the associated paper. I’m hoping there are others digging around and trying to understand it as well, so we can discuss this in more detail.

Right now I have a basic understanding of the architecture, but I’ve got a few questions.

First, in the linear decoder the decoder’s weights are initialized to those of the encoder, but it looks like that relationship isn’t maintained. I may be misreading the code, as I’m new to the library and this is my first foray, but it seems like the weight is set during init and not during forward or otherwise tied. With that in mind, is it safe to assume that the fast.ai library isn’t doing weight tying between the two layers, as discussed in the paper and in the papers they reference on that subject?

Second, the paper mentions non-monotonically triggered averaged SGD (NT-ASGD), which the authors identify as one of the most significant factors in the ablation study. Is this something that’s been explored in the library yet? And given its comparison to averaged SGD, where the limitations they discuss relate to the lack of clear methods for setting learning rate schedules, is this superseded by SGD with restarts?

It’s not just initializing, but setting them to the exact same object. So the weights are tied!
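For anyone else following along, here’s a minimal PyTorch sketch of what that assignment does (the layer sizes are invented for illustration; this is not the fast.ai code itself):

```python
import torch
import torch.nn as nn

vocab_size, emb_size = 1000, 200  # made-up sizes for illustration

encoder = nn.Embedding(vocab_size, emb_size)
decoder = nn.Linear(emb_size, vocab_size, bias=False)

# Tie the weights: both layers now share one Parameter object,
# so it's not a copy — updates through either layer hit the same tensor.
decoder.weight = encoder.weight

assert decoder.weight is encoder.weight  # same object, not equal values
```

Because it’s the same Parameter, the tie survives training: there’s nothing to re-sync in forward.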

It’s not, although as you say it seems to have a lot of overlap with SGDR (and particularly with snapshot ensembles). I haven’t tried comparing them, but it would be interesting to do so.
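For anyone curious, the non-monotonic trigger from the paper can be sketched roughly like this — my own paraphrase of the idea, not library code, and the function name and default are invented:

```python
def should_trigger_averaging(val_losses, n=5):
    """Sketch of the NT-ASGD trigger: keep running plain SGD, and switch
    to weight averaging only when the latest validation loss is no better
    than the best loss seen more than `n` checks ago."""
    if len(val_losses) <= n:
        return False
    return val_losses[-1] > min(val_losses[:-n])
```

The appeal is that the switch point is decided by validation behaviour rather than a hand-set schedule, which is exactly the limitation of plain ASGD the paper calls out.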


It’s not just initializing, but setting them to the exact same object. So the weights are tied!

Interesting, thanks for the quick reply. I’ll have to dig into the codebase a little more deeply. This explains why the RNN is symmetric (200->500->200) as well, which was another of the questions I was interested in.

So to clarify my understanding, is the target of the language model the embedding value of the word being predicted? i.e. the RNN output is a predicted embedding value. And in that case is the linear decoder layer just about finding the closest word to that embedding?

This was an idea I’ve seen before and was planning to experiment with here, but it looks like the network is already doing that which is great.

I’m also somewhat surprised that I don’t see a softmax activation happening in the linear layer to classify the word being predicted. Have I missed it, or is it unnecessary / undesirable for some reason?

I appreciate the discourse, I’m super interested in NLP but it’s hard to find implementation details at this level, let alone someone to talk to about why it’s done the way it’s done.

Yes that sounds about right.
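One way to see why the tied decoder amounts to a closest-word search: with tied weights, each output logit is just the dot product of the hidden state with one word’s embedding, i.e. a similarity score over the vocabulary. A rough PyTorch sketch (sizes invented, not fast.ai code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_size = 1000, 200  # made-up sizes

emb = nn.Embedding(vocab_size, emb_size)
hidden = torch.randn(1, emb_size)  # stand-in for the RNN's output state

# With tied weights the decoder is effectively hidden @ emb.weight.T:
# logit i is the dot product of the hidden state with word i's embedding.
logits = hidden @ emb.weight.t()
predicted_word = logits.argmax(dim=-1)  # index of the "closest" word
```

So the decoder isn’t doing a separate nearest-neighbour lookup; the linear projection with tied weights *is* the similarity computation.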

It’s built into the loss function. So it is there. See the pytorch docs for details.
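Concretely, PyTorch’s F.cross_entropy applies log_softmax internally and then computes negative log-likelihood, so the model’s last layer can emit raw logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw decoder outputs, no softmax
targets = torch.tensor([1, 0, 3, 9])  # correct word index per example

# cross_entropy == log_softmax + nll_loss, fused for numerical stability
loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
```

Fusing the two is also more numerically stable than applying a softmax layer and taking logs afterwards.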

It’s built into the loss function. So it is there. See the pytorch docs for details.

And that assignment is happening in the RNNLearner instantiation where it’s setting ‘crit’ to be a cross_entropy loss, correct?

Feels like it’s starting to come together. :) I’m still somewhat new to this style of programming in Python, but I’m trying to pick it up because it seems incredibly useful.

Thanks for the help.

Exactly right.