MemoryError: loading a 530 MB dataset with 24 GB of RAM

Hi guys,

I’ve got a dataset of about 100,000 articles totaling around 532 MB.
I’m using a TextList to load it on a P100 instance with 24 GB of RAM on Paperspace.com.
Loading more than 50,000 articles results in a MemoryError:

Surely 24 GB of RAM should be enough for a dataset of 530 MB, right?
The largest article is about 128 kB, whereas the average is 5.3 kB.

from fastai.text import *

path = "data/sv-wiki-articles-100k"
bs = 64

data_lm = (TextList.from_folder(path + '/', extensions=['.txt'])
          .use_partial_data(sample_pct=0.5)
          .split_by_rand_pct(0.1)
          .label_for_lm()
          .databunch(bs=bs))

I’ve tried systematically decreasing the batch size all the way down to 2, which doesn’t change the outcome.

Do you have any suggestions?

The error:

MemoryError                               Traceback (most recent call last)
<ipython-input-7-3c6a208eb876> in <module>
    1 data_lm = (TextList.from_folder(path + '/', extensions=['.txt'])
    2               .use_partial_data(sample_pct=1.0)
 -> 3               .split_by_rand_pct(0.1)
    4               .label_for_lm()
    5               .databunch(bs=bs))

    ...

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/core.py in array(a, dtype, **kwargs)
    271     if np.int_==np.int32 and dtype is None and is_listy(a) and len(a) and isinstance(a[0],int):
    272         dtype=np.int64
 -> 273     return np.array(a, dtype=dtype, **kwargs)
    274 
    275 class EmptyLabel(ItemBase):

MemoryError: 

Thanks for your help :slight_smile:

When I did an analysis with the FakeNewsCorpus on Paperspace for a little event, I could only use 1,200 articles; anything more would kill the memory. And if it says it’s out of memory, it’s out of memory. I believe we can limit this by pre-tokenizing our data (correct me if I’m wrong), but as that was just a weekend thing for me, I’m not certain.
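
Off the top of my head it would look something like the below. Treat it as a rough sketch rather than tested code: it assumes fastai v1’s Tokenizer/Vocab API, and the chunk size, the pickle file names and the 'sv' spaCy language code are just placeholders I made up.

import pickle
from collections import Counter
from fastai.text import *

path = Path("data/sv-wiki-articles-100k")
files = sorted(path.glob("**/*.txt"))
tokenizer = Tokenizer(lang='sv')   # spaCy's rule-based Swedish tokenizer
CHUNK_SIZE = 5000                  # articles per pass; tune to your RAM

# Pass 1: tokenize a few thousand articles at a time, dump the tokens to
# disk, and keep only the running word counts in memory.
counter, n_chunks = Counter(), 0
for i in range(0, len(files), CHUNK_SIZE):
    texts = [f.read_text(encoding='utf-8') for f in files[i:i + CHUNK_SIZE]]
    tokens = tokenizer.process_all(texts)
    for t in tokens: counter.update(t)
    with (path/f'tokens_{n_chunks}.pkl').open('wb') as out: pickle.dump(tokens, out)
    n_chunks += 1

# Build the vocab from the counts (60k tokens, min frequency 2, as in the
# course), with fastai's special tokens first, like Vocab.create does.
itos = [o for o, c in counter.most_common(60000) if c >= 2]
itos = defaults.text_spec_tok + [o for o in itos if o not in defaults.text_spec_tok]
vocab = Vocab(itos)
with (path/'itos.pkl').open('wb') as out: pickle.dump(vocab.itos, out)

# Pass 2: numericalize each chunk; the id arrays are much smaller than the
# raw token lists and can later go into TextLMDataBunch.from_ids.
for c in range(n_chunks):
    with (path/f'tokens_{c}.pkl').open('rb') as f: tokens = pickle.load(f)
    ids = [np.array(vocab.numericalize(t)) for t in tokens]
    with (path/f'ids_{c}.pkl').open('wb') as out: pickle.dump(ids, out)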

You can then just keep going and continue training on new databunches until you’ve worked through the whole dataset.
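
Without pre-tokenizing, the “one databunch at a time” idea could look roughly like this. Again just a sketch: the way I split the files into parts and the part count are placeholders, and I’m assuming fastai v1 and a Swedish LM trained from scratch (hence pretrained=False). The important bit is passing the first databunch’s vocab to the later TextLists so the token ids stay consistent across parts.

from fastai.text import *

path, bs, n_parts = "data/sv-wiki-articles-100k", 64, 4
files = sorted(Path(path).glob('**/*.txt'))
parts = [set(map(str, files[i::n_parts])) for i in range(n_parts)]   # deterministic split into n_parts pieces

def chunk_databunch(part, vocab=None):
    return (TextList.from_folder(path + '/', extensions=['.txt'], vocab=vocab)
            .filter_by_func(lambda fn: str(fn) in parts[part])
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=bs))

data_lm = chunk_databunch(0)                    # the vocab gets built on the first part
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(1, 1e-2)
learn.save('lm_part0')

for part in range(1, n_parts):                  # carry on with the remaining parts
    learn.data = chunk_databunch(part, vocab=data_lm.vocab)
    learn.fit_one_cycle(1, 1e-3)
    learn.save(f'lm_part{part}')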

Thank you, mullerzr.

From your experience, I gather that tokenizing can take a huge amount of memory.

I will try pretokenizing everything beforehand.
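
If I’ve understood it right, I could then build each databunch straight from the saved ids, something like this (reusing the placeholder file names from your sketch, and assuming I’m reading the TextLMDataBunch.from_ids signature correctly):

import pickle
from fastai.text import *

path = Path("data/sv-wiki-articles-100k")
bs = 64

with (path/'itos.pkl').open('rb') as f: vocab = Vocab(pickle.load(f))
with (path/'ids_0.pkl').open('rb') as f: ids = pickle.load(f)    # one chunk of numericalized articles

cut = int(len(ids) * 0.9)                                        # simple 90/10 train/validation split
data_lm = TextLMDataBunch.from_ids(path, vocab,
                                   train_ids=ids[:cut], valid_ids=ids[cut:], bs=bs)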

In your last sentence you suggest breaking the dataset up into multiple databunches. Did I understand that correctly?
Did you use the transfer learning / fine-tuning method shown in Lesson 4?