`RNN_Encoder` vs. `MultiBatchRNN`

Hi,

I wonder what the differences are between the `RNN_Encoder` used in the language model and the `MultiBatchRNN` used in the RNN classifier. Conceptually, I thought they were the same.

For example, after creating a language model using my own data, this is the structure I found:

```
SequentialRNN (
  (0): RNN_Encoder (
    (encoder): Embedding(3960, 200, padding_idx=1)
    (rnns): ModuleList (
      (0): WeightDrop (
        (module): LSTM(200, 500, dropout=0.05)
      )
      (1): WeightDrop (
        (module): LSTM(500, 500, dropout=0.05)
      )
      (2): WeightDrop (
        (module): LSTM(500, 200, dropout=0.05)
      )
    )
    (dropouti): LockedDropout (
    )
    (dropouth): LockedDropout (
    )
  )
  (1): LinearDecoder (
    (decoder): Linear (200 -> 3960)
    (dropout): LockedDropout (
    )
  )
)
```

And when I created the RNN classifier, this is what it looks like:

```
SequentialRNN (
  (0): MultiBatchRNN (
    (encoder): Embedding(3960, 200, padding_idx=1)
    (rnns): ModuleList (
      (0): WeightDrop (
        (module): LSTM(200, 500, dropout=0.3)
      )
      (1): WeightDrop (
        (module): LSTM(500, 500, dropout=0.3)
      )
      (2): WeightDrop (
        (module): LSTM(500, 200, dropout=0.3)
      )
    )
    (dropouti): LockedDropout (
    )
    (dropouth): LockedDropout (
    )
  )
  (1): PoolingLinearClassifier (
    (decoder): Linear (600 -> 4)
    (dropout): LockedDropout (
    )
  )
)
```

So I noticed the two encoders are different. After calling `load_encoder('<pre-trained language model encoder>')` as in the IMDB example, I checked the model structure again and found that the encoder is still a `MultiBatchRNN` (whereas I thought it would be replaced by an `RNN_Encoder`). So did we just copy all the weights over? If so, that means the two structures are highly similar, right?
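For what it’s worth, here is a minimal sketch of what I understand the weight copy to look like. This is my assumption about what `load_encoder` boils down to in fastai 0.7, not the verbatim library code, and the helper name is mine:

```python
import torch

# Sketch (my assumption, not the exact fastai 0.7 source): load_encoder
# appears to load the saved RNN_Encoder state_dict directly into the
# classifier's model[0]. This works because MultiBatchRNN subclasses
# RNN_Encoder, so every parameter name and shape matches; the module
# class itself is never swapped out, only its weights.
def load_encoder_sketch(classifier_model, encoder_path):
    state = torch.load(encoder_path, map_location='cpu')
    classifier_model[0].load_state_dict(state)
```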

On a not-so-related note, I noticed that even though my data has only 3 classes, the classifier says it has 4 (i.e., `(decoder): Linear (600 -> 4)`). I googled a little and found a thread explaining that “all vocabularies are defaultdict objects, so index zero is reserved for unknown or unseen inputs,” which accounts for the extra class. However, I’m not sure how having an extra, empty class affects performance.
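For concreteness, here is a small, hypothetical illustration of that explanation (the labels and mapping below are made up; fastai’s actual label handling may differ in detail):

```python
from collections import defaultdict

# A defaultdict-based label mapping returns 0 for any key it has never
# seen, so index 0 is effectively reserved and the real classes occupy 1..3.
classes = ['neg', 'neutral', 'pos']
label2idx = defaultdict(int, {c: i + 1 for i, c in enumerate(classes)})

print(label2idx['pos'])      # 3
print(label2idx['mystery'])  # 0 -> the reserved "unseen" slot
n_out = len(classes) + 1     # 4 outputs, matching Linear (600 -> 4)
```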

Note that PyTorch models are much more customizable than Keras models; specifically, we can write arbitrary custom code in the forward method of a module. So you’ll need to look at the source code to see how these differ. Take a look, and feel free to ask if it doesn’t make sense. We’ll cover it in lesson 7, BTW.
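As a pointer for that source dive: if memory serves, the difference is essentially confined to `forward`. `RNN_Encoder` processes one bptt-length batch at a time, while `MultiBatchRNN` subclasses it and walks an arbitrarily long document in bptt-sized chunks, keeping only roughly the last `max_seq` worth of outputs for the pooling classifier. A hedged paraphrase, trimmed for clarity (see `fastai/lm_rnn.py` for the real code; `self.bptt`, `self.max_seq`, and `self.concat` come from that class):

```python
from fastai.lm_rnn import RNN_Encoder  # fastai 0.7

# Paraphrase of fastai 0.7's MultiBatchRNN.forward, not a verbatim copy.
class MultiBatchRNNSketch(RNN_Encoder):
    def forward(self, input):
        sl, bs = input.size()              # sequence length, batch size
        # (the real code also zeroes the hidden state here)
        raw_outputs, outputs = [], []
        for i in range(0, sl, self.bptt):  # walk the document bptt tokens at a time
            r, o = super().forward(input[i: min(i + self.bptt, sl)])
            if i > (sl - self.max_seq):    # keep only the tail the classifier pools over
                raw_outputs.append(r)
                outputs.append(o)
        # stitch the kept chunks back together along the sequence dimension
        return self.concat(raw_outputs), self.concat(outputs)
```

Everything else (the embedding, the LSTMs, the dropouts) is inherited unchanged, which is why `load_encoder`’s weight copy lines up parameter for parameter.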