MemoryError: loading a 530 MB dataset with 24 GB of RAM

Hi guys,

I’ve got a dataset of about 100,000 articles totaling around 532 MB.
I’m using a TextList to load it on a P100 instance with 24 GB of RAM on Paperspace.com.
Loading more than 50,000 articles results in a MemoryError:

Surely 24 GB of RAM should be enough for a dataset of 530 MB, right?
The largest article is about 128 kB, whereas the average is 5.3 kB.

from fastai.text import *

path = "data/sv-wiki-articles-100k"
bs = 64

data_lm = (TextList.from_folder(path + '/', extensions=['.txt'])
          .use_partial_data(sample_pct=0.5)
          .split_by_rand_pct(0.1)
          .label_for_lm()
          .databunch(bs=bs))

I’ve tried systematically decreasing the batch size all the way down to 2, which doesn’t change the outcome.

Do you have any suggestions?

The error:

MemoryError                               Traceback (most recent call last)
<ipython-input-7-3c6a208eb876> in <module>
    1 data_lm = (TextList.from_folder(path + '/', extensions=['.txt'])
    2               .use_partial_data(sample_pct=1.0)
 -> 3               .split_by_rand_pct(0.1)
    4               .label_for_lm()
    5               .databunch(bs=bs))

    ...

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/core.py in array(a, dtype, **kwargs)
    271     if np.int_==np.int32 and dtype is None and is_listy(a) and len(a) and isinstance(a[0],int):
    272         dtype=np.int64
 -> 273     return np.array(a, dtype=dtype, **kwargs)
    274 
    275 class EmptyLabel(ItemBase):

MemoryError: 

Thanks for your help :slight_smile:

When I did an analysis with the FakeNewsCorpus on Paperspace for a little event, I could only use 1,200 articles; anything more would kill the memory. And if it says it’s out of memory, it’s out of memory. I believe we can limit this by pre-tokenizing our data (correct me if I’m wrong), but as that was just a weekend thing for me, I’m not certain.
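
Off the top of my head it would look something like the below. Treat it as a rough sketch rather than tested code: it assumes fastai v1’s Tokenizer/Vocab API, and the chunk size, the pickle file names and the 'sv' spaCy language code are just placeholders I made up.

import pickle
from collections import Counter
from fastai.text import *

path = Path("data/sv-wiki-articles-100k")
files = sorted(path.glob("**/*.txt"))
tokenizer = Tokenizer(lang='sv')   # spaCy's rule-based Swedish tokenizer
CHUNK_SIZE = 5000                  # articles per pass; tune to your RAM

# Pass 1: tokenize a few thousand articles at a time, dump the tokens to
# disk, and keep only the running word counts in memory.
counter, n_chunks = Counter(), 0
for i in range(0, len(files), CHUNK_SIZE):
    texts = [f.read_text(encoding='utf-8') for f in files[i:i + CHUNK_SIZE]]
    tokens = tokenizer.process_all(texts)
    for t in tokens: counter.update(t)
    with (path/f'tokens_{n_chunks}.pkl').open('wb') as out: pickle.dump(tokens, out)
    n_chunks += 1

# Build the vocab from the counts (60k tokens, min frequency 2, as in the
# course), with fastai's special tokens first, like Vocab.create does.
itos = [o for o, c in counter.most_common(60000) if c >= 2]
itos = defaults.text_spec_tok + [o for o in itos if o not in defaults.text_spec_tok]
vocab = Vocab(itos)
with (path/'itos.pkl').open('wb') as out: pickle.dump(vocab.itos, out)

# Pass 2: numericalize each chunk; the id arrays are much smaller than the
# raw token lists and can later go into TextLMDataBunch.from_ids.
for c in range(n_chunks):
    with (path/f'tokens_{c}.pkl').open('rb') as f: tokens = pickle.load(f)
    ids = [np.array(vocab.numericalize(t)) for t in tokens]
    with (path/f'ids_{c}.pkl').open('wb') as out: pickle.dump(ids, out)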

You can then just keep going and continue training on new databunches until you’ve worked through the whole dataset.
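
Without pre-tokenizing, the “one databunch at a time” idea could look roughly like this. Again just a sketch: the way I split the files into parts and the part count are placeholders, and I’m assuming fastai v1 and a Swedish LM trained from scratch (hence pretrained=False). The important bit is passing the first databunch’s vocab to the later TextLists so the token ids stay consistent across parts.

from fastai.text import *

path, bs, n_parts = "data/sv-wiki-articles-100k", 64, 4
files = sorted(Path(path).glob('**/*.txt'))
parts = [set(map(str, files[i::n_parts])) for i in range(n_parts)]   # deterministic split into n_parts pieces

def chunk_databunch(part, vocab=None):
    return (TextList.from_folder(path + '/', extensions=['.txt'], vocab=vocab)
            .filter_by_func(lambda fn: str(fn) in parts[part])
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=bs))

data_lm = chunk_databunch(0)                    # the vocab gets built on the first part
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(1, 1e-2)
learn.save('lm_part0')

for part in range(1, n_parts):                  # carry on with the remaining parts
    learn.data = chunk_databunch(part, vocab=data_lm.vocab)
    learn.fit_one_cycle(1, 1e-3)
    learn.save(f'lm_part{part}')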

Thank you, mullerzr.

From your experience, I gather that tokenizing can take a huge amount of memory.

I will try pretokenizing everything beforehand.
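
If I’ve understood it right, I could then build each databunch straight from the saved ids, something like this (reusing the placeholder file names from your sketch, and assuming I’m reading the TextLMDataBunch.from_ids signature correctly):

import pickle
from fastai.text import *

path = Path("data/sv-wiki-articles-100k")
bs = 64

with (path/'itos.pkl').open('rb') as f: vocab = Vocab(pickle.load(f))
with (path/'ids_0.pkl').open('rb') as f: ids = pickle.load(f)    # one chunk of numericalized articles

cut = int(len(ids) * 0.9)                                        # simple 90/10 train/validation split
data_lm = TextLMDataBunch.from_ids(path, vocab,
                                   train_ids=ids[:cut], valid_ids=ids[cut:], bs=bs)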

In your last sentence you suggest breaking the dataset up into multiple databunches. Did I understand that correctly?
Did you use the transfer learning / fine-tuning method shown in Lesson 4?