But with a folder of text files. The closest I got is this example using the blurr library, but I get CUDA out-of-memory errors every time (on a 16 GB GPU), even with a batch size of 1 and only a few text files in each folder.
Note that this example took a lot of tweaking, and I had to include some imports from this notebook, which was the basis for this example:
Does anyone have an example of Hugging Face training with a folder of text files as the data source that they can share?
The Hugging Face datasets library should have good support for loading text files from a folder (also CSV, JSON, Parquet…). Check out their documentation.
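Here is a rough sketch of what that could look like, in case it helps. It assumes a layout where each subfolder (say `train/` and `valid/`, names I made up) holds `.txt` files and becomes its own split; `collect_text_files` and `load_text_folder` are hypothetical helper names, not part of the library. The only library call is `load_dataset` with the built-in `"text"` loader.

```python
from pathlib import Path


def collect_text_files(root):
    """Map each immediate subfolder of `root` to a sorted list of its .txt files.

    Hypothetical helper: assumes one subfolder per split, e.g. root/train, root/valid.
    Subfolders without any .txt files are skipped.
    """
    root = Path(root)
    return {
        sub.name: sorted(str(p) for p in sub.glob("*.txt"))
        for sub in sorted(root.iterdir())
        if sub.is_dir() and any(sub.glob("*.txt"))
    }


def load_text_folder(root):
    """Load the folder as a DatasetDict with one split per subfolder."""
    # Imported here so the helper above works without the library installed.
    from datasets import load_dataset  # pip install datasets

    # sample_by="document" keeps each file as a single example;
    # the default ("line") yields one example per line instead.
    return load_dataset(
        "text",
        data_files=collect_text_files(root),
        sample_by="document",
    )
```

Calling `load_text_folder("my_data")` should give you a `DatasetDict` whose examples have a single `text` column, which you can then tokenize with `.map(...)`. This only covers loading; it does not by itself address the out-of-memory errors.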