[HuggingFace/nlp] Create fastai Dataloaders, show batch, and create dataset for LM, MLM

This post includes: :hugs::hugs::hugs:

  1. Integration of fastai and hf/nlp - create fastai Dataloaders from `nlp.Dataset`s

  2. A method for creating an aggregated nlp.Dataset for LM, MLM, … from any nlp.Dataset.

For example, `LMTransform` is just a class with three methods:
```python
class LMTransform(AggregateTransform):
  def __init__(self, hf_dset, max_len, text_col, x_text_col='x_text', y_text_col='y_text', **kwargs):
    self.text_col, self.x_text_col, self.y_text_col = text_col, x_text_col, y_text_col
    self._max_len = max_len + 1  # keep one extra token so x and y (shifted by one) both have max_len tokens
    self.residual_len, self.new_text = self._max_len, []  # accumulators, reset after each committed example
    super().__init__(hf_dset, inp_cols=[text_col], out_cols=[x_text_col, y_text_col], init_attrs=['residual_len', 'new_text'], **kwargs)
    

  def accumulate(self, text): # *inp_cols
    "text: a list of indices"
    usable_len = len(text)
    cursor = 0
    while usable_len != 0:
      use_len = min(usable_len, self.residual_len)  # take as much as still fits in the current chunk
      self.new_text += text[cursor:cursor+use_len]
      self.residual_len -= use_len
      usable_len -= use_len
      cursor += use_len
      if self.residual_len == 0:  # the chunk is full, emit it as one example
        self.commit_example(self.create_example())

  def create_example(self):
    # when all data has been read, the accumulated new_text might be shorter than two tokens
    if len(self.new_text) >= 2: 
      example = {self.x_text_col:self.new_text[:-1], self.y_text_col:self.new_text[1:]}
    else:
      example = None # mark "don't commit this"
    # reset accumulators
    self.new_text = []
    self.residual_len = self._max_len

    return example
```

See hf_nlp_aggregation_and_with_fastai.ipynb for full code. The notebook also shows how to turn a dataset into dataloaders for ELECTRA.
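
To give a feel for how the pieces fit together, here is a minimal sketch. The tokenizer, the column names, and the `.map()` call on the transform are assumptions on my side; see the notebook for the exact API.

```python
import nlp                                # the HuggingFace library later renamed to `datasets`
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
wiki = nlp.load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')

# turn raw text into token ids; `accumulate` expects a list of indices per example
wiki = wiki.map(lambda e: {'text_idxs': tokenizer.encode(e['text'])})

# pack the corpus into fixed-length (x, y) pairs, where y is x shifted by one token
lm_dset = LMTransform(wiki, max_len=128, text_col='text_idxs').map()

# lm_dset is still an nlp.Dataset, so it can be wrapped into fastai Dataloaders
# with the integration described above and inspected with show_batch.
```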

“Pretrain MLM and finetune on GLUE with fastai”

This post is actually the 7th post of the series, and deprecates the 2nd and 3rd.

  1. MaskedLM callback and ELECTRA callback (a generic sketch of the masking recipe follows this list)

  2. (deprecated) TextDataLoader - as fast as or faster, and also supports sliding window, caching, and a progress bar

  3. (deprecated) Novel Huggingface/nlp integration: train on and show_batch hf/nlp datasets

  4. Warm up & linearly decay lr schedule + discriminative lr (see the schedule sketch after this list)

  5. General Multi-task learning

  6. Reproduce GLUE finetuning results of ELECTRA with fastai
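
For reference, the MaskedLM callback in item 1 follows the usual BERT masking recipe. Below is a generic, self-contained sketch of that recipe (not the callback itself); the function name, the 15% probability, and the 80/10/10 split are the standard defaults rather than values taken from the post.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_tokens_mask, mlm_prob=0.15):
    "BERT-style masking: select `mlm_prob` of tokens; 80% -> [MASK], 10% -> random token, 10% unchanged."
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, mlm_prob)
    prob.masked_fill_(special_tokens_mask, 0.)   # never mask [CLS]/[SEP]/padding (bool mask)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100                     # positions the MLM loss should ignore

    # 80% of the selected positions are replaced with [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # half of the remaining selected positions (10% overall) get a random token
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    return input_ids, labels                     # the last 10% are left unchanged
```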
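
Item 4's schedule (linear warm-up followed by linear decay) can be written with fastai's built-in schedulers. A rough sketch, where the 10% warm-up fraction and the helper name are my assumptions:

```python
from fastai.callback.schedule import SchedLin, combine_scheds, ParamScheduler

def warmup_linear_decay(lr, warmup_pct=0.1):
    "Ramp lr linearly from 0 to `lr` over `warmup_pct` of training, then decay linearly back to 0."
    sched = combine_scheds([warmup_pct, 1 - warmup_pct],
                           [SchedLin(0., lr), SchedLin(lr, 0.)])
    return ParamScheduler({'lr': sched})

# usage: learn.fit(3, cbs=warmup_linear_decay(2e-5))
```

Discriminative lr then scales the base lr differently per parameter group, which is presumably what the post's implementation adds on top of this schedule.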

Things on the way

  • Reproduce ELECTRA (pretraining from scratch)
  • Try to improve GLUE finetuning with ranger, fp16, one_cycle, and recent papers.
  • Ensemble
  • WNLI tricks

I’ll post all updates of this series on my Twitter (Richard Wang), so you won’t miss them.


Hey @Richard-Wang, love the work you’re doing. Just a thought: might it be a good idea to create a mega-thread for all the related work (e.g. all the HF/ELECTRA work) as opposed to creating separate posts? That way we could subscribe to one thread and get notified every time you update :slight_smile: Totally up to you though, I also see the value in keeping separate topics in separate discussions…


Hi @morgan, thanks for your feedback; sorry, I’m not the most organized person. I plan to make only one more post, for pretraining; as for other work, I’ll take your advice and post under the existing threads.
I’d like to hear more of your opinions, thanks!


Didn’t expect so much attention😂

I added some docs to the code, so people interested in it can get a better understanding of how to apply it to their own problems. I also fixed a small bug that accidentally passed the wrong dataset.