I believe you should be able to use something like from_folder
. For large datasets you can’t fit in RAM you shouldn’t use Dataframes. Those inherently load the whole dataset into RAM first which won’t work in your case. You should instead probably just have .txt
files where each record is its own file and they are loaded on the fly. For such a large dataset it may be worth looking into writing a custom dataloader. Also you probably want to create a small subset to work with to get everything set up and then only train on the full dataset at the end.
Example from the Docs:
or
Example from the book:
I would focus on figuring out your dataset/dataloader rather than trying to hack the lr schedule.
You may also want to check out Huggingface Transformers. The co-author of the fast.ai book works on at HuggingFace on that library among others.