It seems that the current Transformer implementation cannot support an input_mask.
When constructing the input for a Transformer encoder or BERT, we always pad the input, e.g.,
batch: A B C [pad] [pad] --> input_mask 1 1 1 0 0
batch: D E [pad] [pad] [pad] --> input_mask 1 1 0 0 0,
where the input_mask is applied in MultiHeadAttention to avoid attending to the padding positions.
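For illustration, here is a minimal sketch (not fastai's actual code) of how such a padding mask is typically applied inside scaled dot-product attention: the scores of padded key positions are set to -inf before the softmax, so they receive zero attention weight. The function name `masked_attention` and the tensor shapes are my own assumptions.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, input_mask):
    # q, k, v: (batch, seq_len, d); input_mask: (batch, seq_len), 1 = token, 0 = pad
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)      # (batch, seq, seq)
    # Broadcast the mask over query positions and block padded keys.
    scores = scores.masked_fill(input_mask[:, None, :] == 0, float('-inf'))
    attn = F.softmax(scores, dim=-1)                            # pad columns get weight 0
    return attn @ v

# Toy batch: 5 positions, the last two are padding.
q = k = v = torch.randn(1, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0]])
out = masked_attention(q, k, v, mask)
```

Because the padded keys get zero weight, changing the values at those positions should leave the output unchanged, which is exactly the behavior the current implementation seems to be missing.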
New feature discussed here
Am I wrong, or is it indeed not implemented in fastai.text.models.Transformer?
Also, the Transformer notebook here does not consider the input mask either.