Hi @chansung18 @muellerzr, thanks for your interest.
AFAIK, fasthug is fastai + huggingface / transformers, which gets you model training and tokenizers (correct me if I'm wrong).
And hugdatafast is fastai + huggingface / nlp, which gets you DataLoaders.
You should be able to use hugdatafast with blurr or fasthug. (Although I haven't tried.)
Working on a blogpost/notebook using hugdatafast with blurr sometime this weekend (after I work through some PRs).
I think things should work out fine for the most part, though I'm not so sure about the use cases where post-processing is applied to the raw inputs to get things to line up right with the subword/BPE tokenization (I'm thinking of things like question answering and token classification/NER).
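(For anyone following along, the alignment I mean is the offset-mapping step that QA/NER pipelines rely on; a minimal illustration with plain transformers, nothing blurr- or hugdatafast-specific:)

```python
from transformers import AutoTokenizer

# Fast tokenizers can return character offsets for every subword piece,
# which is what QA / NER code uses to map labels back onto the raw text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
enc = tokenizer("HuggingFace is based in NYC.", return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for tok, (start, end) in zip(tokens, enc["offset_mapping"]):
    print(tok, (start, end))  # e.g. 'hugging' -> (0, 7), '##face' -> (7, 11)
```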
Update: add an example for preparing any huggingface/nlp dataset for a (traditional) language model, or implementing a custom context window.
Update to the update: I'm cancelling the updates.
Reason (just for notes, feel free to skip):
Originally I wanted to introduce LMTransform and CombineTransform, which can build a context window across examples. But then it occurred to me that there are few cases where we actually need a context window across examples: examples in a dataset are usually not consecutive text, so there is no point concatenating unrelated passages. These classes might only be useful for my personal use case.
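To make the idea concrete, "context window across examples" means something like the following plain-Python sketch (just the idea, not the actual LMTransform/CombineTransform code):

```python
def chunk_across_examples(token_id_lists, block_size=128):
    """Concatenate tokenized examples and re-split into fixed-size blocks,
    so a block can span the boundary between two original examples."""
    buffer = []
    for ids in token_id_lists:
        buffer.extend(ids)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # the incomplete tail is dropped here; a real transform might pad it instead

# Only sensible when consecutive examples really are consecutive text
# (e.g. lines of one document), which is exactly the point above.
blocks = list(chunk_across_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4))
# -> [[1, 2, 3, 4], [5, 6, 7, 8]]
```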
I tried working up an example with blurr, but I failed.
Maybe I'm missing something, but I think the problem is that the datasets/dataloaders returned from Hugdatafast only return the "input_ids" … whereas blurr is designed to return (and work against) all the other things associated with a text sequence depending on the architecture (e.g., input_ids, attention_mask, token_type_ids, etc.).
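For reference, here's the kind of per-sequence dict a transformers tokenizer produces, which is what blurr is built to carry through a batch (plain transformers here, not blurr internals):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("This movie was great.", return_tensors="pt")

# For a BERT-style model the encoding carries more than just the ids.
print(enc.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```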
So for example, here is what one batch of Hugdatafast looks like …
hugdatafast doesn't provide built-in preprocessing; you need to map your hf/nlp dataset yourself (write your own preprocessing logic), and then hugdatafast takes the preprocessed hf/nlp dataset all the way to the DataLoaders we are familiar with.
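In case it helps, here's roughly what that workflow looks like. I'm writing this from memory, so treat the hugdatafast names (HF_Datasets, cols, hf_toker) as approximate and check the README for the real signatures:

```python
from nlp import load_dataset             # huggingface/nlp (later renamed to `datasets`)
from transformers import AutoTokenizer
from hugdatafast import HF_Datasets      # class name from memory, check the hugdatafast docs

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cola = load_dataset("glue", "cola")

# 1. Preprocessing is yours: map the raw dataset into token ids however you like.
cola = {split: ds.map(lambda e: {"text_idxs": hf_tokenizer.encode(e["sentence"])})
        for split, ds in cola.items()}

# 2. hugdatafast then takes the preprocessed splits the rest of the way to fastai DataLoaders.
dsets = HF_Datasets(cola, cols=["text_idxs", "label"], hf_toker=hf_tokenizer)
dls = dsets.dataloaders(bs=64)
dls.show_batch()
```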