As far as I can tell, FastAI does not use padding when training language models for next-token prediction. However, the embedding layer of the AWD-LSTM model still takes a padding_idx argument. Since no padding tokens should appear in any batch, that row of the embedding should never be updated. I wanted to verify this, so I put together the following toy example:
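My expectation comes from how padding_idx behaves in a plain PyTorch nn.Embedding, where the gradient for that row is kept at zero. Here is a minimal sketch outside of fastai, just to illustrate that expectation:

> import torch
> import torch.nn as nn
>
> # a plain nn.Embedding with padding_idx: the pad row's gradient stays zero
> emb = nn.Embedding(10, 4, padding_idx=0)
> x = torch.tensor([[1, 2, 3]])   # the padding index 0 never appears in the batch
> emb(x).sum().backward()
> print(emb.weight.grad[0])       # all zeros: the pad row receives no gradient
> print(emb.weight.grad[1])       # non-zero: rows that were actually looked up do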
> tok = Tokenizer(partial(MolTokenizer), pre_rules=[], post_rules=[])
> path = './databunch/'
>
> bs=64; wd=1e-2; drops=0.0
> train_sub = train_data.sample(n=256).copy()
> valid_sub = valid_data.sample(n=256).copy()
> db1 = TextLMDataBunch.from_df('./debug/', train_sub, valid_sub, bs=bs, tokenizer=tok, text_cols='sequence', min_freq=1, include_bos=False, include_eos=False)
> db1.train_ds.x.vocab.stoi
> defaultdict(<class 'int'>, {'xxpad': 0, 'L': 1, 'A': 2, 'G': 3, 'V': 4, 'E': 5, 'S': 6, 'I': 7, 'K': 8, 'R': 9, 'D': 10, 'T': 11, 'P': 12, 'N': 13, 'F': 14, 'Q': 15, 'Y': 16, 'M': 17, 'H': 18, 'C': 19, 'W': 20, 'GO': 21, 'xxfake': 23})
>
> config = awd_lstm_lm_config.copy()
> config['pad_token']=0
> learner = language_model_learner(db1, AWD_LSTM, drop_mult=drops, wd=wd, pretrained=False, config=config)
> learner.model[0].encoder.weight.data[0]=0.
>
> nums = 0
> for xb, _ in db1.train_dl.dl:
>     nums += (xb == learner.data.train_ds.x.vocab.stoi['xxpad']).sum().item()
> print(nums)
> 0
Thus, the padding token was not present in any of the batches.
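I also did a quick sanity check that the pad_token from the config actually reached the embedding layer (a sketch; it assumes the encoder is a standard nn.Embedding, which exposes a padding_idx attribute):

> # expect 0 if config['pad_token'] was passed through to the embedding
> print(learner.model[0].encoder.padding_idx)
> # expect 0.0, since the pad row was zeroed above
> print(learner.model[0].encoder.weight.data[0].abs().sum().item())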
Next, I checked what happens to the padding row of the embedding matrix during training:
> learner.fit_one_cycle(1, 1e-3)
> learner.model[0].encoder.weight.data[0]
> tensor([-0.0092, -0.0105, 0.0060, -0.0091, 0.0001, 0.0049, 0.0060, 0.0014,
> 0.0107, 0.0057, -0.0050, 0.0043, 0.0106, -0.0109, -0.0056, -0.0086,
> 0.0117, -0.0116, 0.0051, 0.0104, 0.0084, 0.0087, 0.0039, -0.0081,
> -0.0076, 0.0080, 0.0104, -0.0089, -0.0115, 0.0106, -0.0092, -0.0038,
> 0.0117, -0.0055, 0.0116, 0.0116, -0.0105, 0.0100, 0.0103, 0.0114,
> 0.0054, 0.0090, 0.0060, -0.0100, 0.0088, 0.0115, 0.0097, 0.0084,
> 0.0092, 0.0110, -0.0117, -0.0112, -0.0036, 0.0088, 0.0082, 0.0115,
> -0.0088, 0.0109, -0.0002, 0.0031, 0.0089, -0.0105, 0.0077, -0.0112,
> -0.0116, 0.0107, 0.0069, 0.0116, -0.0038, -0.0105, 0.0055, -0.0112,
> -0.0116, 0.0057, 0.0114, -0.0039, -0.0069, -0.0011, 0.0056, -0.0110,
> -0.0112, -0.0114, 0.0115, 0.0007, -0.0060, -0.0111, 0.0084, 0.0093,
> 0.0097, 0.0008, -0.0055, -0.0116, 0.0102, -0.0105, -0.0076, 0.0065,
> 0.0116, -0.0062, 0.0057, -0.0100, -0.0071, -0.0053, -0.0023, -0.0089,
> -0.0083, 0.0017, -0.0043, -0.0112, 0.0054, -0.0004, 0.0108, 0.0113,
> -0.0107, 0.0103, -0.0088, -0.0108, -0.0103, 0.0089, -0.0069, 0.0116,
> 0.0109, -0.0082, -0.0112, -0.0018, 0.0044, 0.0036, 0.0115, 0.0105,
> 0.0063, 0.0087, -0.0045, -0.0009, 0.0035, -0.0090, 0.0080, 0.0072,
> 0.0113, 0.0004, -0.0108, -0.0114, 0.0046, 0.0061, -0.0009, 0.0113,
> 0.0116, 0.0044, -0.0077, -0.0015, 0.0113, -0.0065, 0.0019, -0.0096,
> 0.0070, -0.0110, -0.0084, 0.0088, -0.0059, 0.0058, 0.0017, 0.0114,
> 0.0108, -0.0094, 0.0018, -0.0104, 0.0006, -0.0086, -0.0007, -0.0113,
> 0.0070, -0.0077, 0.0043, 0.0066, 0.0113, 0.0061, 0.0111, 0.0055,
> 0.0012, -0.0105, -0.0110, -0.0086, -0.0100, -0.0086, 0.0030, 0.0110,
> -0.0058, -0.0113, 0.0108, -0.0114, 0.0114, -0.0117, -0.0002, -0.0073,
> 0.0105, 0.0114, 0.0113, 0.0033, 0.0007, 0.0116, 0.0092, -0.0062,
> -0.0029, 0.0009, 0.0077, -0.0002, -0.0067, -0.0078, 0.0054, -0.0106,
> 0.0048, 0.0080, 0.0004, 0.0061, -0.0103, 0.0096, 0.0108, -0.0041,
> 0.0065, -0.0025, 0.0094, -0.0111, -0.0117, 0.0072, 0.0033, -0.0061,
> -0.0107, -0.0015, -0.0018, -0.0108, 0.0014, 0.0039, 0.0082, -0.0006,
> 0.0111, 0.0070, -0.0014, 0.0084, -0.0095, -0.0115, 0.0040, 0.0048,
> -0.0105, 0.0086, 0.0071, -0.0063, 0.0036, -0.0087, 0.0102, 0.0029,
> -0.0042, 0.0045, -0.0030, -0.0066, 0.0103, -0.0113, 0.0111, -0.0081,
> -0.0113, 0.0050, 0.0068, 0.0106, 0.0095, -0.0102, -0.0043, -0.0048,
> 0.0064, 0.0065, 0.0081, -0.0115, -0.0066, 0.0061, -0.0085, 0.0113,
> 0.0071, 0.0075, -0.0005, -0.0057, 0.0079, -0.0115, 0.0056, 0.0108,
> -0.0091, 0.0115, 0.0050, -0.0095, -0.0019, -0.0111, -0.0030, -0.0117,
> 0.0096, 0.0112, 0.0109, 0.0117, 0.0067, -0.0051, -0.0022, -0.0111,
> 0.0116, 0.0056, -0.0108, -0.0111, 0.0018, 0.0095, 0.0019, 0.0052,
> 0.0075, 0.0071, 0.0078, 0.0044, -0.0063, 0.0066, 0.0100, 0.0113,
> -0.0011, 0.0103, -0.0044, -0.0116, -0.0052, -0.0092, 0.0101, -0.0006,
> -0.0104, 0.0054, 0.0090, -0.0110, -0.0116, -0.0064, -0.0106, -0.0097,
> -0.0116, 0.0074, -0.0110, 0.0077, -0.0114, 0.0115, -0.0103, 0.0065,
> -0.0020, 0.0075, 0.0109, 0.0030, -0.0113, -0.0098, -0.0090, 0.0016,
> -0.0045, 0.0113, -0.0037, -0.0114, -0.0019, -0.0112, 0.0099, 0.0049,
> 0.0047, 0.0018, 0.0039, 0.0081, 0.0111, -0.0072, -0.0040, -0.0116,
> 0.0102, 0.0008, -0.0113, 0.0049, -0.0108, -0.0092, -0.0024, 0.0099,
> 0.0113, -0.0016, 0.0110, -0.0110, 0.0074, 0.0111, -0.0109, -0.0063,
> 0.0089, 0.0115, 0.0111, -0.0108, 0.0041, 0.0010, 0.0066, 0.0069,
> -0.0013, -0.0080, -0.0075, 0.0113, 0.0079, 0.0113, -0.0056, -0.0016,
> 0.0108, 0.0050, -0.0115, 0.0103, -0.0082, 0.0037, -0.0103, -0.0069],
> device='cuda:1')
Since this row is no longer all zeros, it must have been updated during training.
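One check that might help narrow this down (a sketch; it assumes the usual fastai v1 SequentialRNN layout, with the AWD_LSTM as model[0] and the LinearDecoder as model[1]) is to look at the gradient of the pad row after a single backward pass, and at whether the decoder weights are tied to the encoder embedding:

> learner.model.reset()
> learner.model.zero_grad()
> xb, yb = next(iter(db1.train_dl.dl))
> xb, yb = xb.to(learner.data.device), yb.to(learner.data.device)
> out = learner.model(xb)[0]
> learner.loss_func(out, yb).backward()
> # is the gradient of the pad row non-zero even though the token never appears?
> print(learner.model[0].encoder.weight.grad[0].abs().sum().item())
> # are the decoder weights the same tensor as the encoder embedding (weight tying)?
> print(learner.model[1].decoder.weight is learner.model[0].encoder.weight)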
Why is the padding-index row of the embedding updated via backprop if the padding token is never present in any of the batches? And why does FastAI specify the padding index to the embedding layer at all if this token is not used when creating batches?