Loading text data iteratively

ps40 · May 18, 2023, 8:37pm

I have a very large text dataset which takes more than 100GB to load. I tried breaking it into pieces and training on one piece at a time, but it looks like training starts from scratch after each change of dls. How can I get around this? Here is some pseudocode for what I have tried:

for path in paths:
    train_df = pd.read_pickle(path)
    dls = dblock.dataloaders(train_df, bs=batchsize, val_bs=8, num_workers=16)
    learn.dls = dls
    learn.fit_one_cycle(…)

ps40 · May 19, 2023, 8:47pm

Apparently you need to recreate the learner for each new dls, like so:
learn = Learner(dls, learn.model, …)

This seems to do the trick.

matdmiller · May 21, 2023, 3:54pm

This may work but probably isn’t ideal as your learning rate schedule is getting reset for each chunk of new text. Ideally you want to set up your dataloader to be able to iterate over your entire dataset for each epoch. You shouldn’t need to load the entire dataset into RAM first. You should just need to reconfigure how your dataset and/or dataloader are set up.
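To make the schedule reset concrete, here is a rough sketch in plain Python. It is not fastai's actual code, just a one-cycle-style curve (cosine warm-up, then cosine anneal). The function names are made up for illustration, and the values for lr_max, div, div_final and pct_start are assumptions chosen to resemble typical one-cycle settings. It compares restarting the schedule for every chunk with running a single schedule across all chunks.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation: returns start at pct=0 and end at pct=1."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """One-cycle-style learning rate at training progress pct in [0, 1).

    All default values here are illustrative assumptions.
    """
    if pct < pct_start:
        # Warm-up phase: lr_max/div rises to lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Anneal phase: lr_max decays towards lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

n_chunks, steps_per_chunk = 4, 100
total_steps = n_chunks * steps_per_chunk

# Schedule restarted for every chunk (one fit_one_cycle call per chunk)
restarted = [one_cycle_lr((s % steps_per_chunk) / steps_per_chunk)
             for s in range(total_steps)]

# A single schedule spanning the whole dataset
single = [one_cycle_lr(s / total_steps) for s in range(total_steps)]

# Compare the learning rate at the first step of each chunk
for c in range(n_chunks):
    s = c * steps_per_chunk
    print(f"chunk {c}: restarted={restarted[s]:.2e}  single={single[s]:.2e}")
```

With the restarted schedule, every chunk goes through its own warm-up and peak, so later chunks are trained at a high learning rate again instead of continuing the decay.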

ps40 · May 21, 2023, 4:38pm

Hello, thanks for responding. Yes, the lr schedule will get reset to the original value. Thanks for pointing it out. Maybe I can manually use some kind of decay. I will look inside fastai code to see what they are doing.

I don’t know how to set up the dataloader that way. Can you point me to a sample, tutorial, or forum post where I can learn a bit more?

matdmiller · May 21, 2023, 10:00pm

I believe you should be able to use something like from_folder. For large datasets that don’t fit in RAM you shouldn’t use DataFrames; they inherently load the whole dataset into RAM first, which won’t work in your case. Instead, you should probably have .txt files where each record is its own file, loaded on the fly. For such a large dataset it may be worth looking into writing a custom dataloader. Also, you probably want to create a small subset to work with while getting everything set up, and only train on the full dataset at the end.

Example from the Docs:

or
Example from the book: github.com/fastai/fastbook/blob/master/10_nlp.ipynb

I would focus on figuring out your dataset/dataloader rather than trying to hack the lr schedule.
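As a rough illustration of the "one record per .txt file, loaded on the fly" idea, here is a minimal sketch in plain Python. It is not a fastai dataloader; iter_texts and iter_batches are hypothetical helper names, and the folder layout is an assumption. Only the list of file paths is held in memory, and each file is read when it is needed.

```python
import random
import tempfile
from pathlib import Path
from typing import Iterator, List

def iter_texts(folder: Path, shuffle: bool = True, seed: int = 0) -> Iterator[str]:
    """Yield the contents of every .txt file under folder, one file at a time."""
    paths = sorted(folder.rglob("*.txt"))  # only the paths are kept in memory
    if shuffle:
        random.Random(seed).shuffle(paths)
    for p in paths:
        yield p.read_text(encoding="utf-8")

def iter_batches(texts: Iterator[str], bs: int) -> Iterator[List[str]]:
    """Group a stream of texts into lists of at most bs items."""
    batch: List[str] = []
    for t in texts:
        batch.append(t)
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch

# Tiny demo on a throwaway folder; a real dataset would already be on disk.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(5):
        (root / f"{i}.txt").write_text(f"record {i}", encoding="utf-8")
    for batch in iter_batches(iter_texts(root, shuffle=False), bs=2):
        print(batch)
```

Each pass over iter_texts covers the whole dataset, so a single training schedule can span every record instead of being restarted per chunk.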

You may also want to check out Hugging Face Transformers. The co-author of the fast.ai book works at Hugging Face on that library, among others.

krasin · May 24, 2023, 4:52am

I don’t understand why this is not working. Can someone explain?

matdmiller · May 28, 2023, 6:17pm

Based on the pseudo code it should not be starting over from scratch, but it is far from ideal (at best) due to the one cycle learning rate schedule starting over for each loop iteration. There are a number of other things that could be going wrong with this approach as well that make it look like it’s starting from scratch, but not enough information was provided to know for sure.