Pretrain MLM and finetune on GLUE with fastai - 2 - TextDataloader for sliding window over sentences or tokens

Suppose you want to pretrain a model with max_sequence_length=512, but the average length of your corpus is about 100~200, and you would like to concatenate sentences to fully use the max sequence length and get a broader context for prediction.

Here comes TextDataloader, which can use a sliding window over tokens or sentences, and deal with the bos (CLS) and eos (SEP) tokens however you want.

It is also a general dataloader: you can use it as a SortedDL or LMDataLoader.

See more examples and try it with these notebooks.


(Note that these are just examples; we can deal with bos and eos in other ways if we want.)

Lines mode: sliding window over sentences


Window mode: sliding window over tokens
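To make the two modes concrete, here is a minimal sketch of the idea in plain Python. The helper names (`lines_mode`, `window_mode`) and the exact packing rules are my own illustration under simplifying assumptions (every sentence fits in one sample), not TextDataloader's actual API:

```python
def lines_mode(sentences, max_len, bos='[CLS]', eos='[SEP]'):
    """Sliding window over sentences: pack whole tokenized sentences
    into a sample until adding the next one would exceed max_len."""
    samples, current = [], [bos]
    for sent in sentences:
        # +1 leaves room for the trailing eos token
        if len(current) + len(sent) + 1 > max_len and len(current) > 1:
            samples.append(current + [eos])
            current = [bos]
        current += sent
    if len(current) > 1:
        samples.append(current + [eos])
    return samples

def window_mode(tokens, max_len, stride=None, bos='[CLS]', eos='[SEP]'):
    """Sliding window over tokens: cut the concatenated token stream
    into windows of max_len - 2, leaving room for bos/eos."""
    body = max_len - 2
    stride = stride or body
    return [[bos] + tokens[i:i + body] + [eos]
            for i in range(0, len(tokens), stride)]

# Lines mode keeps sentence boundaries; window mode ignores them.
sents = [['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h', 'i']]
print(lines_mode(sents, max_len=6))
print(window_mode([t for s in sents for t in s], max_len=5))
```

With a `stride` smaller than `max_len - 2`, window mode produces overlapping windows, which is one way to give every token some left context.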

Question / Help

  • Does every sample used in BERT, RoBERTa, ELECTRA, …, always have only a CLS at the head and a SEP at the tail, no matter how long it is and whether it is a concatenation of tokens or lines? Does CLS…SEP…SEP appear only when finetuning on GLUE?
  • Is there a more efficient implementation?
  • Try it on your own or other problems to see if anything goes wrong.

Follow the series.

Help me in this thread by tagging someone who might be interested in it or could help!

Also follow my Twitter, Richard Wang, for updates on this series.

(Spoiler alert: caching the dataset, preparing GLUE data, and single/multi-task training on GLUE are in the pipeline!!)

=== 2020.05.18 ===

  1. Cache TextDataloader
    Initialization of TextDataloader (and also SortedDL/LMDataloader) takes time, so I made TextDataloader able to cache itself, but not the dataset inside it, so you only need to initialize it once, saving you the initialization time. (To cache the dataset, see huggingface/nlp.)

  2. Speed comparison
    TextDataloader is a new general dataloader that can also behave the same as SortedDL or LMDataloader, but is faster than, or as fast as, both of them in both initialization and batch loading.
    I added a speed comparison between TextDataloader and SortedDL/LMDataloader to the notebook.

Initialization: (Dataset pre created)

Batch Loading:

  • When using TextDataloader as SortedDL to load batches, TextDataloader is often very slightly slower (less than 0.5 s) in this setting. I hope somebody can help me figure out why…

  • When using TextDataloader as LMDataloader to load batches, TextDataloader is much faster. Note that, assuming seq_len=50, the shape of the last batch of LMDataloader will be something like (64, 30), while that of TextDataloader will be something like (37, 50), with some padding in the last sample. But you should easily be able to add some code to TextDataloader.__init__ to adjust the last batch_size samples and get the same behavior as LMDataloader.
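The shape difference comes from two different ways of forming the last batch from one token stream. Here is a toy reimplementation of the two strategies (my own sketch, not the actual code of either dataloader):

```python
def lm_style_batches(tokens, bs, seq_len):
    """LMDataloader-style: split the stream into bs contiguous rows,
    then slice along the sequence dimension; the last batch keeps bs
    rows but has a shorter sequence length."""
    n = len(tokens) // bs  # tokens per row (leftover tokens dropped)
    return [(bs, min(seq_len, n - start))
            for start in range(0, n, seq_len)]

def chunk_style_batches(tokens, bs, seq_len, pad_id=0):
    """TextDataloader-style: cut the stream into fixed-length chunks
    (padding the final chunk), then group chunks into batches; the
    last batch keeps seq_len but may have fewer rows."""
    chunks = [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]
    chunks[-1] = chunks[-1] + [pad_id] * (seq_len - len(chunks[-1]))
    return [(len(chunks[i:i + bs]), seq_len)
            for i in range(0, len(chunks), bs)]

# A toy stream sized so the LM-style last batch matches the (64, 30)
# shape mentioned above.
stream = list(range(64 * 50 + 64 * 30))
print(lm_style_batches(stream, bs=64, seq_len=50)[-1])     # (64, 30)
print(chunk_style_batches(stream, bs=64, seq_len=50)[-1])  # (39, 50)
```

In the chunk-style strategy, only the final chunk is padded; the LM-style strategy needs no padding but its last batch has a ragged sequence length.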


=== 2020.05.21 ===

  1. Load cache with different args
    When loading a cached TextDataloader, you can change settings of the TfmDL, such as bs, before_batch …, and also some of TextDataloader's (sort_by_len, …). It will also tell you when you try to make an invalid change.

  2. Loading bar
    Now you won't worry about whether there is a bug; you'll know it just needs more time.


Small update:

  • Now TextDataloader will always show a progress bar, and the bar will disappear after execution, giving you a clean space!

  • Fixed the broken Twitter link: Richard Wang