Hi all,
Not sure if I'm missing something really basic here, but when I create dataloaders from a DataBlock for a language model, my train_dl and valid_dl show different seq_len (bptt) values.
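For reference, here is a minimal sketch of the kind of setup I mean (the DataFrame `df` and its 'text' column are just placeholders, not my actual data):

```python
from fastai.text.all import *

# Placeholder data: a DataFrame `df` with a 'text' column
dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(valid_pct=0.1)
).dataloaders(df, bs=128, seq_len=72)

xb, yb = dls.train.one_batch()
print(xb.shape)   # e.g. torch.Size([128, 72])
xb, yb = dls.valid.one_batch()
print(xb.shape)   # can show a different seq_len than the training batches
```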
I wanted to add a second part to the above question, but I held back as I was not even sure the first one made sense.
Now I will go all-in:
How can training not fail (it works perfectly fine)?
Specifically, how does the validation step work when the model gets trained on batches of torch.Size([128, 60]) but then validated on batches of torch.Size([128, 72])?
seq_len is just the length of the chunk you feed the model at each step. The model will produce exactly the same results whichever seq_len you use, thanks to its hidden state being carried over from one chunk to the next. It is only the gradients that will be different, since backpropagation through time is truncated at the chunk boundary (computed over 60 timesteps instead of 72 in this example).
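To make that concrete, here is a small standalone sketch (plain PyTorch, not fastai internals): feeding one long token stream to an LSTM in chunks of 60 versus 72 gives identical outputs, as long as the hidden state is carried across chunks. Only the gradient computation would see a different window.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 16)
lstm = nn.LSTM(16, 32, batch_first=True)

# One long token sequence; 360 is divisible by both 60 and 72
tokens = torch.randint(0, 100, (1, 360))

def run_in_chunks(seq_len):
    outs, hidden = [], None
    for i in range(0, tokens.shape[1], seq_len):
        chunk = tokens[:, i:i + seq_len]
        # The hidden state is carried over between chunks, so the model
        # sees one continuous sequence regardless of the chunk length
        out, hidden = lstm(emb(chunk), hidden)
        outs.append(out)
    return torch.cat(outs, dim=1)

# Same outputs whether the stream is fed in chunks of 60 or 72
print(torch.allclose(run_in_chunks(60), run_in_chunks(72), atol=1e-6))  # True
```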