Trash predictions for test set despite great values on dev set

You could avoid sampling as you suggest, but grouping sentences by length helps speed up inference (you can see the impact by watching how the tqdm time estimates get corrected during inference). We use a batch of shape bs x len, where bs is the batch size and len is the length of the longest sentence in a given batch (shorter sentences are padded to that length). For example, with bs=10, 10 sentences of length 200, and 90 sentences of length 30, you can get the following (a small sketch after the list works through the arithmetic):

  • by sorting: one batch of shape 10 x 200 and nine batches of shape 10 x 30, so 1 * 2000 + 9 * 300 = 4700 elements in total,
  • worst-case scenario: 10 batches of shape 10 x 200 (each with one long sentence and nine short ones), 20000 elements in total.

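To make the arithmetic concrete, here is a small, purely illustrative Python snippet (the helper name and the hard-coded lengths are made up for this example) that computes the total number of padded elements for both batching strategies:

```python
def padded_elements(lengths, bs):
    # Total number of elements when sentences are grouped into batches of
    # size bs and each batch is padded to its longest sentence.
    total = 0
    for i in range(0, len(lengths), bs):
        batch = lengths[i:i + bs]
        total += len(batch) * max(batch)
    return total

bs = 10
lengths = [200] * 10 + [30] * 90          # 10 long and 90 short sentences

# Sorted by length: all long sentences share a single batch.
print(padded_elements(sorted(lengths, reverse=True), bs))  # 2000 + 9 * 300 = 4700

# Worst case: every batch of 10 contains one long sentence and nine short ones.
worst_case = ([200] + [30] * 9) * 10
print(padded_elements(worst_case, bs))                     # 10 * 2000 = 20000
```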
You can watch one of the fast.ai videos where Jeremy describes the difference between SortSampler and SortishSampler. Please note that in our case (i.e., language modelling only) the order of samples doesn't matter, because we only needed infer.py to compute the perplexity.
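For completeness, the sketch below shows the general idea behind a sort-by-length sampler in plain PyTorch. It is not the fastai SortSampler implementation, and the class name is made up for illustration:

```python
from torch.utils.data import Sampler

class SortByLengthSampler(Sampler):
    """Yields dataset indices ordered from the longest sentence to the
    shortest, so that a sequential batch sampler produces batches that
    need very little padding."""

    def __init__(self, lengths):
        self.lengths = lengths  # length of each sentence in the dataset

    def __iter__(self):
        order = sorted(range(len(self.lengths)),
                       key=lambda i: self.lengths[i], reverse=True)
        return iter(order)

    def __len__(self):
        return len(self.lengths)
```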
