[HuggingFace/nlp] Create fastai Dataloaders, show batch, and create dataset for LM, MLM

This post includes: :hugs::hugs::hugs:

  1. Integration of fastai and hf/nlp - create fastai DataLoaders from `nlp.Dataset`s

  2. A method for creating an aggregated `nlp.Dataset` for LM, MLM, … from any `nlp.Dataset`.

and `LMTransform` is just a class with 3 methods:

class LMTransform(AggregateTransform):
  def __init__(self, hf_dset, max_len, text_col, x_text_col='x_text', y_text_col='y_text', **kwargs):
    self.text_col, self.x_text_col, self.y_text_col = text_col, x_text_col, y_text_col
    self._max_len = max_len + 1 # one extra token, because x and y are shifted by one
    self.residual_len, self.new_text = self._max_len, []
    super().__init__(hf_dset, inp_cols=[text_col], out_cols=[x_text_col, y_text_col], init_attrs=['residual_len', 'new_text'], **kwargs)

  def accumulate(self, text): # *inp_cols
    "text: a list of indices"
    usable_len = len(text)
    cursor = 0
    while usable_len != 0:
      use_len = min(usable_len, self.residual_len)
      self.new_text += text[cursor:cursor+use_len]
      self.residual_len -= use_len
      usable_len -= use_len
      cursor += use_len
      if self.residual_len == 0: # a full example has been accumulated
        self.commit_example(self.create_example())

  def create_example(self):
    # When all data has been read, the accumulated new_text might be less than two tokens long.
    if len(self.new_text) >= 2:
      example = {self.x_text_col:self.new_text[:-1], self.y_text_col:self.new_text[1:]}
    else:
      example = None # mark "don't commit this"
    # reset accumulators
    self.new_text = []
    self.residual_len = self._max_len

    return example
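To see concretely what the transform does, here is a minimal, dependency-free sketch of the same chunking idea (`chunk_for_lm` and the toy token ids are my own names for illustration, not from the notebook): token lists are concatenated and cut into fixed-length examples where y is x shifted right by one.

```python
def chunk_for_lm(token_lists, max_len):
    """Concatenate token lists and cut them into LM examples of max_len tokens.
    Each example consumes max_len + 1 tokens, since y is x shifted by one."""
    buf, examples = [], []
    for tokens in token_lists:
        buf += tokens
        while len(buf) >= max_len + 1:
            window, buf = buf[:max_len + 1], buf[max_len + 1:]
            examples.append({'x_text': window[:-1], 'y_text': window[1:]})
    # a leftover shorter than 2 tokens can't form an (x, y) pair and is dropped
    if len(buf) >= 2:
        examples.append({'x_text': buf[:-1], 'y_text': buf[1:]})
    return examples

examples = chunk_for_lm([[1, 2, 3], [4, 5, 6, 7]], max_len=3)
# → [{'x_text': [1, 2, 3], 'y_text': [2, 3, 4]}, {'x_text': [5, 6], 'y_text': [6, 7]}]
```

Note that, like `LMTransform`, consecutive windows don't overlap: each committed example consumes its `max_len + 1` tokens before the buffer starts filling the next one.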

See hf_nlp_aggregation_and_with_fastai.ipynb for the full code. It also includes turning a dataset into a DataLoader for ELECTRA.

“Pretrain MLM and finetune on GLUE with fastai”

This post is actually the 7th post of the series, and deprecates the 2nd and 3rd.

  1. MaskedLM callback and ELECTRA callback

  2. (deprecated) TextDataLoader - as fast as or faster than the standard approach, but also with a sliding window, caching, and a progress bar

  3. (deprecated) Novel Huggingface/nlp integration: train on and show_batch hf/nlp datasets

  4. Warm-up & linearly decaying lr schedule + discriminative lr

  5. General Multi-task learning

  6. Reproduce GLUE finetuning results of ELECTRA with fastai
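The MLM callback in point 1 follows the standard BERT-style masking recipe. As a rough, framework-free sketch of that recipe (my own function name and conventions, not the actual callback code): about 15% of positions become prediction targets; of those, 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, mlm_prob=0.15, rng=None):
    """BERT-style masking sketch. Returns (masked input ids, labels),
    where labels hold the original token at target positions and -100
    elsewhere (-100 is PyTorch's ignore_index convention)."""
    rng = rng or random.Random(0)
    inp, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mlm_prob:
            labels[i] = tok          # this position becomes a prediction target
            r = rng.random()
            if r < 0.8:
                inp[i] = mask_id     # 80%: replace with [MASK]
            elif r < 0.9:
                inp[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return inp, labels
```

For ELECTRA, the generator is trained on exactly these masked inputs, and the discriminator then predicts which tokens the generator replaced.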
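The lr schedule in point 4 boils down to a simple shape, which can be sketched as a plain function (my own naming, not the notebook's code): linear warm-up from 0 to a peak lr, then linear decay back to 0.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Linearly increase lr from 0 to peak_lr over warmup_steps,
    then linearly decay it back to 0 by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

warmup_linear_decay(50, 1000, 100, 1e-3)    # halfway through warm-up → 5e-4
warmup_linear_decay(100, 1000, 100, 1e-3)   # peak → 1e-3
warmup_linear_decay(1000, 1000, 100, 1e-3)  # end of training → 0.0
```

Discriminative lr then simply scales the value this schedule produces by a per-layer-group factor, so lower layers train with smaller learning rates.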

Things on the way

  • Reproduce ELECTRA (pretraining from scratch)
  • Try to improve GLUE finetuning with ranger, fp16, one_cycle, and recent papers.
  • Ensemble
  • wnli tricks

I’ll post all updates of this series on my Twitter (Richard Wang) so you won’t miss them.


Hey @Richard-Wang, love the work you’re doing. Just a thought: might it be a good idea to create a mega-thread for all the related work (e.g. all the HF/ELECTRA work) as opposed to creating separate posts? That way we could subscribe to 1 thread and get notified every time you update :slight_smile: Totally up to you though, I also see the value in keeping separate topics in separate discussions…


Hi @morgan, thanks for your feedback, and sorry that I’m not a very organized person. I plan to make only one more post, for pretraining; as for other work, I’ll take your advice and post below the existing threads.
I’d like to hear more of your opinions, thanks!


Didn’t expect so much attention😂

I added some docs to the code, so people interested in it can get a better understanding of how to apply it to their problems. I also fixed a small bug that accidentally passed the wrong dataset.