Any documentation or examples of Hugging Face training with text files in a folder instead of dataframes?

checkmate404 · May 20, 2022, 5:26am

I’m trying to follow along here:

But with a folder of text files. The closest I've gotten is this example using the blurr library, but I get CUDA out-of-memory errors every time (on a 16 GB GPU), even with a batch size of 1 and only a few text files in each folder.

Note that this example took a lot of tweaking, and I had to include some imports from this notebook, which was the basis for the example:

https://gist.github.com/tgalery/fa0de7b0c69ab48534b26a9151676fc1

gpt_lm.py

from pathlib import Path
from blurr.data.language_modeling import (AutoModelForCausalLM, BLURR, CausalLMStrategy,
                                          HF_LMBeforeBatchTransform, HF_CausalLMInput, HF_TextBlock, noop)
from fastai.text.all import mask2idxs, L
from fastai.data.block import DataBlock
from fastai.text.data import get_text_files, LMDataLoader


# Splitter for train and validation
def _parent_idxs(items, name):

(This file has been truncated.)

Does anyone have an example of Hugging Face training with a folder of text files as the data source that they can share?

stefan-ai · June 25, 2022, 4:25pm

The Hugging Face datasets library should have good support for loading text files from a folder (as well as CSV, JSON, Parquet…). Check out their documentation.
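As a rough illustration of that suggestion, here is a minimal sketch. It assumes a hypothetical layout with one subfolder per split (for example `train/` and `valid/`, each holding `.txt` files); the helper names `collect_text_files` and `load_text_folder` are made up for this example. It relies on the `text` loader of `load_dataset` and its `sample_by` option, which needs a reasonably recent version of datasets.

```python
from pathlib import Path


def collect_text_files(root):
    """Map each immediate subfolder of `root` to the sorted list of its .txt files.

    Assumed (hypothetical) layout: root/train/*.txt, root/valid/*.txt, ...
    Subfolders without any .txt files are skipped.
    """
    root = Path(root)
    data_files = {}
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        paths = sorted(str(p) for p in split_dir.glob("*.txt"))
        if paths:
            data_files[split_dir.name] = paths
    return data_files


def load_text_folder(root, whole_documents=True):
    """Load the folder as a DatasetDict with a single "text" column."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset  # pip install datasets

    # sample_by="document" keeps each file as one example;
    # the default, "line", yields one example per line of text.
    sample_by = "document" if whole_documents else "line"
    return load_dataset(
        "text",
        data_files=collect_text_files(root),
        sample_by=sample_by,
    )
```

Calling `load_text_folder("my_corpus")` (a placeholder path) should return one split per subfolder, which can then be tokenized with `Dataset.map` before training.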

jeremy · June 25, 2022, 9:21pm

Sounds like your documents are too big. You might want to try ULMFiT or Longformer.
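If document length is indeed the problem, a simpler workaround than switching models (and not one proposed in this thread) is to split each tokenized document into fixed-size blocks before batching, so no single example exceeds the model's context window. The sketch below is plain Python; the function name `split_into_blocks` and the default block size of 512 are assumptions chosen for illustration.

```python
def split_into_blocks(token_ids, block_size=512):
    """Split one long token sequence into consecutive blocks.

    Every block has `block_size` tokens except possibly the last,
    which holds whatever remains.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids), block_size)
    ]


# Example with a toy "document" of 10 token ids and a block size of 4.
blocks = split_into_blocks(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Shorter sequences reduce activation memory per example, which is often what drives out-of-memory errors even at a batch size of 1.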