Help understanding padding in language model learner

As far as I can tell, FastAI does not use padding when training language models for next token prediction. However, the embedding layer of the AWD-LSTM model takes a padding_idx as input. This row of the embedding should never be touched, as there are no padding tokens in any of the batches. I wanted to check this, so I created this toy example:

>     from functools import partial
>     from fastai.text import *  # fastai v1
> 
>     # MolTokenizer, train_data and valid_data are defined elsewhere (not shown)
>     tok = Tokenizer(partial(MolTokenizer), pre_rules=[], post_rules=[])
>     path = './databunch/'
> 
>     bs=64; wd=1e-2; drops=0.0
>     train_sub = train_data.sample(n=256).copy()
>     valid_sub = valid_data.sample(n=256).copy()
>     db1 = TextLMDataBunch.from_df('./debug/', train_sub, valid_sub, bs=bs, tokenizer=tok, text_cols='sequence', min_freq=1, include_bos=False, include_eos=False)
>     db1.train_ds.x.vocab.stoi
defaultdict(<class 'int'>, {'xxpad': 0, 'L': 1, 'A': 2, 'G': 3, 'V': 4, 'E': 5, 'S': 6, 'I': 7, 'K': 8, 'R': 9, 'D': 10, 'T': 11, 'P': 12, 'N': 13, 'F': 14, 'Q': 15, 'Y': 16, 'M': 17, 'H': 18, 'C': 19, 'W': 20, 'GO': 21, 'xxfake': 23})
> 
>     config = awd_lstm_lm_config.copy()
>     config['pad_token']=0
>     learner   = language_model_learner(db1, AWD_LSTM, drop_mult=drops, wd=wd, pretrained=False, config=config)
>     learner.model[0].encoder.weight.data[0]=0.
> 
>     # Count how many padding tokens appear across all training batches
>     nums = 0
>     for xb, _ in db1.train_dl.dl:
>         nums += (xb == learner.data.train_ds.x.vocab.stoi['xxpad']).sum().item()
>     print(nums)
> 0

Thus, the padding token was not present in any of the batches.

I wanted to check what happens to the padding row of the embedding matrix during training:

> learner.fit_one_cycle(1, 1e-3)
> learner.model[0].encoder.weight.data[0]
    > tensor([-0.0092, -0.0105,  0.0060, -0.0091,  0.0001,  0.0049,  0.0060,  0.0014,
    >          0.0107,  0.0057, -0.0050,  0.0043,  0.0106, -0.0109, -0.0056, -0.0086,
    >          0.0117, -0.0116,  0.0051,  0.0104,  0.0084,  0.0087,  0.0039, -0.0081,
    >         -0.0076,  0.0080,  0.0104, -0.0089, -0.0115,  0.0106, -0.0092, -0.0038,
    >          0.0117, -0.0055,  0.0116,  0.0116, -0.0105,  0.0100,  0.0103,  0.0114,
    >          0.0054,  0.0090,  0.0060, -0.0100,  0.0088,  0.0115,  0.0097,  0.0084,
    >          0.0092,  0.0110, -0.0117, -0.0112, -0.0036,  0.0088,  0.0082,  0.0115,
    >         -0.0088,  0.0109, -0.0002,  0.0031,  0.0089, -0.0105,  0.0077, -0.0112,
    >         -0.0116,  0.0107,  0.0069,  0.0116, -0.0038, -0.0105,  0.0055, -0.0112,
    >         -0.0116,  0.0057,  0.0114, -0.0039, -0.0069, -0.0011,  0.0056, -0.0110,
    >         -0.0112, -0.0114,  0.0115,  0.0007, -0.0060, -0.0111,  0.0084,  0.0093,
    >          0.0097,  0.0008, -0.0055, -0.0116,  0.0102, -0.0105, -0.0076,  0.0065,
    >          0.0116, -0.0062,  0.0057, -0.0100, -0.0071, -0.0053, -0.0023, -0.0089,
    >         -0.0083,  0.0017, -0.0043, -0.0112,  0.0054, -0.0004,  0.0108,  0.0113,
    >         -0.0107,  0.0103, -0.0088, -0.0108, -0.0103,  0.0089, -0.0069,  0.0116,
    >          0.0109, -0.0082, -0.0112, -0.0018,  0.0044,  0.0036,  0.0115,  0.0105,
    >          0.0063,  0.0087, -0.0045, -0.0009,  0.0035, -0.0090,  0.0080,  0.0072,
    >          0.0113,  0.0004, -0.0108, -0.0114,  0.0046,  0.0061, -0.0009,  0.0113,
    >          0.0116,  0.0044, -0.0077, -0.0015,  0.0113, -0.0065,  0.0019, -0.0096,
    >          0.0070, -0.0110, -0.0084,  0.0088, -0.0059,  0.0058,  0.0017,  0.0114,
    >          0.0108, -0.0094,  0.0018, -0.0104,  0.0006, -0.0086, -0.0007, -0.0113,
    >          0.0070, -0.0077,  0.0043,  0.0066,  0.0113,  0.0061,  0.0111,  0.0055,
    >          0.0012, -0.0105, -0.0110, -0.0086, -0.0100, -0.0086,  0.0030,  0.0110,
    >         -0.0058, -0.0113,  0.0108, -0.0114,  0.0114, -0.0117, -0.0002, -0.0073,
    >          0.0105,  0.0114,  0.0113,  0.0033,  0.0007,  0.0116,  0.0092, -0.0062,
    >         -0.0029,  0.0009,  0.0077, -0.0002, -0.0067, -0.0078,  0.0054, -0.0106,
    >          0.0048,  0.0080,  0.0004,  0.0061, -0.0103,  0.0096,  0.0108, -0.0041,
    >          0.0065, -0.0025,  0.0094, -0.0111, -0.0117,  0.0072,  0.0033, -0.0061,
    >         -0.0107, -0.0015, -0.0018, -0.0108,  0.0014,  0.0039,  0.0082, -0.0006,
    >          0.0111,  0.0070, -0.0014,  0.0084, -0.0095, -0.0115,  0.0040,  0.0048,
    >         -0.0105,  0.0086,  0.0071, -0.0063,  0.0036, -0.0087,  0.0102,  0.0029,
    >         -0.0042,  0.0045, -0.0030, -0.0066,  0.0103, -0.0113,  0.0111, -0.0081,
    >         -0.0113,  0.0050,  0.0068,  0.0106,  0.0095, -0.0102, -0.0043, -0.0048,
    >          0.0064,  0.0065,  0.0081, -0.0115, -0.0066,  0.0061, -0.0085,  0.0113,
    >          0.0071,  0.0075, -0.0005, -0.0057,  0.0079, -0.0115,  0.0056,  0.0108,
    >         -0.0091,  0.0115,  0.0050, -0.0095, -0.0019, -0.0111, -0.0030, -0.0117,
    >          0.0096,  0.0112,  0.0109,  0.0117,  0.0067, -0.0051, -0.0022, -0.0111,
    >          0.0116,  0.0056, -0.0108, -0.0111,  0.0018,  0.0095,  0.0019,  0.0052,
    >          0.0075,  0.0071,  0.0078,  0.0044, -0.0063,  0.0066,  0.0100,  0.0113,
    >         -0.0011,  0.0103, -0.0044, -0.0116, -0.0052, -0.0092,  0.0101, -0.0006,
    >         -0.0104,  0.0054,  0.0090, -0.0110, -0.0116, -0.0064, -0.0106, -0.0097,
    >         -0.0116,  0.0074, -0.0110,  0.0077, -0.0114,  0.0115, -0.0103,  0.0065,
    >         -0.0020,  0.0075,  0.0109,  0.0030, -0.0113, -0.0098, -0.0090,  0.0016,
    >         -0.0045,  0.0113, -0.0037, -0.0114, -0.0019, -0.0112,  0.0099,  0.0049,
    >          0.0047,  0.0018,  0.0039,  0.0081,  0.0111, -0.0072, -0.0040, -0.0116,
    >          0.0102,  0.0008, -0.0113,  0.0049, -0.0108, -0.0092, -0.0024,  0.0099,
    >          0.0113, -0.0016,  0.0110, -0.0110,  0.0074,  0.0111, -0.0109, -0.0063,
    >          0.0089,  0.0115,  0.0111, -0.0108,  0.0041,  0.0010,  0.0066,  0.0069,
    >         -0.0013, -0.0080, -0.0075,  0.0113,  0.0079,  0.0113, -0.0056, -0.0016,
    >          0.0108,  0.0050, -0.0115,  0.0103, -0.0082,  0.0037, -0.0103, -0.0069],
    >        device='cuda:1')

This row is no longer all zeros, so it was updated during training.

Why is the padding index row of the embedding being updated via backprop if it is never present in any of the batches? Why does FastAI specify the padding index to the embedding layer if this token is not used when creating batches?

Any updates on that?

@marcossantana, nope! Let me know if you figure this out.

Interesting question. It’s true that a language model, which just takes a stream of concatenated text data as input, shouldn’t have any padding. But from the source code it looks like the same AWD-LSTM body is used for both language model and classifier. The only difference is the head of the model. And since you need padding for the classification model, the AWD_LSTM module takes pad_token as input, which by default is 1.

I’m not sure why there is a row corresponding to a padding token in the embedding matrix of the language model. But isn’t ‘xxunk’ usually index 0 in a vocab? Could you check your vocab if it contains ‘xxunk’?

Hi @stefan-ai, thank you for your response.

Initially, I assumed the language model would indeed use padding. I think padding is unnecessary only if the total length of the concatenated text is evenly divisible by batch_size. I think FastAI reuses some of the text data to fill out the array when there is a remainder rather than padding, but I am not positive. Do you have any thoughts here?

My vocab does not have xxunk. Here was the vocab for the example I listed above:

> defaultdict(<class 'int'>, {'xxpad': 0, 'L': 1, 'A': 2, 'G': 3, 'V': 4, 'E': 5, 'S': 6, 'I': 7, 'K': 8, 'R': 9, 'D': 10, 'T': 11, 'P': 12, 'N': 13, 'F': 14, 'Q': 15, 'Y': 16, 'M': 17, 'H': 18, 'C': 19, 'W': 20, 'GO': 21, 'xxfake': 23})

Yeah, xxunk is index 0 and xxpad is index 1. I inspected the embedding matrix and it’s as you mentioned: there are no zeroed rows corresponding to padding.

Right, I’m also not sure how fastai handles that case. That would be interesting to find out.

Regarding xxunk: If all the tokens in your corpus are in your vocab, then there is no need to have xxunk as a separate token I guess. But then I don’t know how the model reacts if it encounters unknown tokens during inference.

Edit: I just had another idea. Are you sure that the embedding vector for the padding token needs to be all zeros? Could it be that it’s initialized randomly?

@stefan-ai, yes there is no issue with randomly initializing the padding embedding. But if you look at my example, I initialized it to zero with this line:

learner.model[0].encoder.weight.data[0]=0.

What I am trying to understand is how the padding embedding is getting modified during back-prop if there are no padding tokens in the data, and hence the padding embedding gradients should always be zero.

Sorry, I overlooked that line. I absolutely agree with you that the embedding vector for the padding token shouldn’t be updated during back-prop. If there are no padding tokens in your training data, changing the embedding weights will have absolutely no effect on the loss. No idea why they are being updated…

@sgugger, any idea why the padding embedding is changing even though no padding tokens are present in the training data?

You never told your model that the padding index was 0 (1 is the default in fastai), so my guess is the embedding associated with index 0 got randomly initialized. It’s the one associated with index 1 that got initialized with 0.

You should look at both before training and after training to confirm, but I suspect your embeddings for the index 0 never changed during training.

@sgugger, thank you for your response. Can you point out to me where I failed to tell the model the padding index was zero? I modified the config file, and passed that directly to the learner. I am copying those lines again here, and showing where I also initialized the weights to zero:

>     config = awd_lstm_lm_config.copy()
>     config['pad_token']=0
>     learner   = language_model_learner(db1, AWD_LSTM, drop_mult=drops, wd=wd, pretrained=False, config=config)
>     learner.model[0].encoder.weight.data[0]=0.

Ah didn’t see that part, sorry.
Mmmm, can you check that the embeddings for 0 are zero indeed before training the model? If they are not, they may change because of weight decay.
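
To illustrate the weight decay point, here is a minimal sketch in plain Python, assuming a simple SGD update with weight decay (the hypothetical `sgd_step` helper is for illustration only and is not fastai's optimizer). A row that starts at exactly zero and receives no gradient stays at zero, while a randomly initialized row shrinks even with a zero gradient:

```python
def sgd_step(w, grad, lr=1e-3, wd=1e-2):
    """One SGD step with weight decay: w <- w - lr * (grad + wd * w)."""
    return [wi - lr * (gi + wd * wi) for wi, gi in zip(w, grad)]

zero_grad = [0.0, 0.0, 0.0]

# A padding row that was zeroed before training: weight decay has nothing to shrink.
zero_row = [0.0, 0.0, 0.0]
zero_row_after = sgd_step(zero_row, zero_grad)

# A randomly initialized row: it changes even though its gradient is zero.
random_row = [0.01, -0.02, 0.005]
random_row_after = sgd_step(random_row, zero_grad)

print(zero_row_after)
print(random_row_after)
```

So weight decay alone can explain a changing row only if that row was non-zero to begin with.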

I was playing around with fastai v2 nb 10 and tried to replicate the issue there. Turns out it shows the same behavior. I’m setting the embedding vector of the padding token to zero before training and even though there should be no padding tokens in the data, after training the values in the embedding vector are non-zero.

Yes, I figured out why it’s the case: the weights of the encoder and the decoder are tied. In the decoder we update the weights even for the pad index because it’s part of the final probabilities. I guess using something like ignore_index = 1 in the cross entropy could avoid that.
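
The weight tying explanation can be checked by hand. Below is a minimal sketch in plain Python, assuming a toy model where one matrix `W` serves as both the embedding table and the decoder, and where the recurrent body is left out entirely (the `tied_weight_grad` helper and the numbers are illustrative, not fastai code). The softmax assigns some probability to every vocabulary entry, so the decoder path sends a gradient to the padding row even though the padding token is neither an input nor a target:

```python
import math

def tied_weight_grad(W, x, y):
    """Gradient of the cross-entropy loss w.r.t. W for one (input, target) pair.

    W is used twice: as the embedding table (h = W[x]) and as the
    decoder (logits[k] = W[k] . h). The recurrent body is omitted.
    """
    h = list(W[x])  # encoder lookup
    logits = [sum(wk * hk for wk, hk in zip(row, h)) for row in W]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # d(loss)/d(logits) = softmax - one_hot(target)
    dlogits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]

    # Decoder path: every row k receives dlogits[k] * h, including unused tokens.
    grad = [[d * hk for hk in h] for d in dlogits]
    # Encoder path: only the row that was looked up receives a gradient.
    dh = [sum(dlogits[k] * W[k][j] for k in range(len(W))) for j in range(len(h))]
    grad[x] = [g + d for g, d in zip(grad[x], dh)]
    return grad

PAD = 0
W = [[0.0, 0.0, 0.0],    # padding row, zeroed as in the example above
     [0.5, -0.2, 0.1],
     [-0.3, 0.4, 0.2],
     [0.1, 0.1, -0.6]]

grad = tied_weight_grad(W, x=1, y=2)  # neither token is PAD
pad_grad = grad[PAD]
print(pad_grad)  # non-zero: the softmax probability of PAD times the hidden vector
```

In this toy setup the padding row's gradient comes only from the decoder side; the embedding lookup contributes nothing to it.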

Thank you very much for your help, everyone.

@sgugger, could you explain why there is no padding token in the data generated via the language model data bunch? How does FastAI handle the case where the total length of the text leaves a remainder when divided by the batch size? The language model preloader is a bit of a black box to me, despite multiple attempts to better understand it :)

My current understanding is that the lm_preloader is implemented as a callback with an on_batch_begin function. However, when I do

xb, yb = next( iter( learner.data.train_dl ) )

My xb data already comes out with dimensions (bs * bptt), which I thought wouldn’t happen until the lm_preloader callback’s on_batch_begin function had reshaped the data.

The remainder is dropped.
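
As a rough illustration of what dropping the remainder means, here is a simplified sketch in plain Python. It is not fastai's actual preloader; the `batchify` helper is hypothetical and assumes the corpus is one concatenated list of token ids that gets cut into `bs` equal-length rows:

```python
def batchify(tokens, bs):
    """Split one long token stream into `bs` equal-length rows, dropping the tail."""
    row_len = len(tokens) // bs
    usable = tokens[: row_len * bs]
    rows = [usable[i * row_len:(i + 1) * row_len] for i in range(bs)]
    dropped = tokens[row_len * bs:]
    return rows, dropped

stream = list(range(1, 24))  # 23 token ids, none of them the padding id 0
rows, dropped = batchify(stream, bs=4)

print(rows)     # 4 rows of 5 tokens each
print(dropped)  # the 3 leftover tokens that do not fill a row
```

Because the leftover tokens are discarded rather than padded out, no padding id ever needs to appear in the batches.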