Loading text data iteratively

matdmiller · May 21, 2023, 10:00pm

I believe you should be able to use something like from_folder. For large datasets you can’t fit in RAM you shouldn’t use Dataframes. Those inherently load the whole dataset into RAM first which won’t work in your case. You should instead probably just have .txt files where each record is its own file and they are loaded on the fly. For such a large dataset it may be worth looking into writing a custom dataloader. Also you probably want to create a small subset to work with to get everything set up and then only train on the full dataset at the end.

Example from the Docs:

or
Example from the book:

github.com

fastai/fastbook/blob/master/10_nlp.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "! [ -e /content ] && pip install -Uqq fastbook\n",
    "import fastbook\n",
    "fastbook.setup_book()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [

This file has been truncated. show original

I would focus on figuring out your dataset/dataloader rather than trying to hack the lr schedule.

You may also want to check out Huggingface Transformers. The co-author of the fast.ai book works on at HuggingFace on that library among others.