TextDataBunch for IMDB Dataset

There are plenty of docs on working with the smaller IMDB dataset behind URLs.IMDB_SAMPLE, but the larger dataset (URLs.IMDB) is structured differently.
It looks like this:

/tmp/aclImdb
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── neg  # .txt files under here
│   └── pos  # .txt files under here
└── train
    ├── neg  # .txt files under here
    └── pos  # .txt files under here

The course notebook works with this full dataset rather than the sample.
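
To make the folder layout concrete, here is a minimal sketch that reads one split into parallel lists of texts and labels using only the Python standard library. It is not the fastai loader or what the course notebook does; `read_imdb_split` is a hypothetical helper name, and it assumes the labels are exactly the `neg` and `pos` folder names and that the files are UTF-8 text.

```python
from pathlib import Path


def read_imdb_split(root, split):
    """Return (texts, labels) for one split ('train' or 'test').

    Hypothetical helper: assumes root/<split>/<neg|pos>/*.txt as in the tree above.
    """
    texts, labels = [], []
    for label in ("neg", "pos"):
        folder = Path(root) / split / label
        # Sort the file names so the order is the same on every run
        for txt_file in sorted(folder.glob("*.txt")):
            texts.append(txt_file.read_text(encoding="utf-8"))
            labels.append(label)  # the folder name is the label
    return texts, labels
```

Calling `read_imdb_split("/tmp/aclImdb", "train")` would give the training reviews with their `neg`/`pos` labels, and the same call with `"test"` gives the held-out split.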
