Any documentation or examples of Hugging Face training with text files in a folder instead of DataFrames?

I’m trying to follow along here:

But with a folder of text files. The closest I got is with this example using the blurr library, but I get CUDA out of memory errors every time (on a 16GB GPU), even with a batch size of 1 and only a few text files in each folder.

Note that this example took a lot of tweaking, and I had to include some imports from this notebook, which was the base of this example:

Does anyone have any examples of Hugging Face training with a folder of text files as a data source that they can share?

The Hugging Face datasets library should have good support for loading text files from a folder (as well as CSV, JSON, Parquet…). Check out their documentation.
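In case it helps, here is a minimal sketch of one way to get from a folder of text files to a `datasets.Dataset`. It assumes a layout of one sub-folder per label with `.txt` files inside, which may not match yours; `read_labeled_folder` and `to_hf_dataset` are made-up helper names, not part of any library. Only `Dataset.from_dict` comes from datasets itself.

```python
from pathlib import Path


def read_labeled_folder(root):
    """Read root/<label>/*.txt into parallel lists of texts and labels.

    Assumed layout (hypothetical): each sub-folder name is the class label.
    Files directly under root, and non-.txt files, are ignored.
    """
    texts, labels = [], []
    for label_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for txt_file in sorted(label_dir.glob("*.txt")):
            texts.append(txt_file.read_text(encoding="utf-8"))
            labels.append(label_dir.name)
    return {"text": texts, "label": labels}


def to_hf_dataset(records):
    """Wrap the records in a datasets.Dataset (needs `pip install datasets`)."""
    # Imported lazily so read_labeled_folder works without the library.
    from datasets import Dataset

    return Dataset.from_dict(records)
```

Something like `to_hf_dataset(read_labeled_folder("data/train"))` (the path is a placeholder) should then give a dataset with `text` and `label` columns that can be tokenized with `.map(...)` as in the usual tutorials.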

Sounds like your documents are too big. You might want to try ULMFiT or Longformer.
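A quick way to check whether document length really is the problem is to look at how long the files are. This is only a rough sketch: `word_counts` is a made-up helper, and whitespace-separated words are just a proxy for token counts, since subword tokenizers usually produce more tokens than words.

```python
from pathlib import Path


def word_counts(folder):
    """Return (file name, whitespace word count) pairs, longest first."""
    counts = []
    # rglob also picks up .txt files in sub-folders (e.g. one per label).
    for txt_file in Path(folder).rglob("*.txt"):
        n_words = len(txt_file.read_text(encoding="utf-8").split())
        counts.append((txt_file.name, n_words))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

If the longest files run to thousands of words, that would be consistent with running out of memory even at a batch size of 1.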
