Large text files in unsupervised set cause MemoryError in transfer learning

#1

I am training a language model, loading data like (batch size 12):

data = (TextList.from_folder(path)
    .filter_by_folder(include=include_folders)
    .random_split_by_pct(0.1)
    .label_for_lm()
    .databunch(bs=batch_size))

When I select a folder with 48550 documents (316.4 MB) it works fine. If I choose a folder with 87141 documents (only 304.8 MB!) I get a “MemoryError”.

Any quick fix for this?

What is the fastai team's intuition about the ideal size of an unlabelled dataset for the language model?

1 Like

#2

I have now determined that the problem is caused neither by the number of files (reducing it to 50,000 did not help) nor by the total size (we already saw that 316.4 MB was fine while another set of 304.8 MB was not). The problem, apparently, is caused by individual very large files: in the problematic set I have files of up to 709.5 kB of plain text, whereas in the set that worked fine the largest was 120.5 kB. Removing all files over 100 kB from the problematic set solved the problem.
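In case it helps anyone else, this is roughly the check I ran, as a quick stdlib sketch (`files_over` is a hypothetical helper, not a fastai function):

```python
from pathlib import Path
import tempfile

def files_over(folder, max_kb=100):
    """Return paths of plain-text files larger than max_kb kilobytes."""
    return sorted(
        p for p in Path(folder).rglob("*.txt")
        if p.stat().st_size > max_kb * 1024
    )

# Example: write one small and one large file, then scan the folder.
tmp = Path(tempfile.mkdtemp())
(tmp / "small.txt").write_text("short doc")
(tmp / "big.txt").write_text("x" * 200_000)  # ~195 kB
print(files_over(tmp))  # only big.txt is flagged
```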

0 Likes

#3

That is weird; I have never seen an error like this.

0 Likes

#4

It IS weird! I suppose you have worked with files larger than that? If you have, then perhaps I can take a better look at the problematic files and see if I can find any more clues…

0 Likes

#5

I’m not sure if I have. I’m guessing it’s due to the multi-processing and the weird way memory is allocated by python then, so it would probably depend on the hardware. Did you try reducing the number of processes? You can pass n_cpus to the tokenizer for that, or change the value of defaults.cpus.
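In fastai v1 that would look something like `Tokenizer(n_cpus=1)`, or setting `defaults.cpus = 1`. As a rough, library-free sketch of why fewer worker processes can lower the peak (each worker materialises its own share of tokenized texts at once), assuming a toy whitespace tokenizer and a hypothetical `tokenize_all` helper:

```python
from multiprocessing import Pool

def tokenize(text):
    # Toy stand-in for a real tokenizer: whitespace split.
    return text.split()

def tokenize_all(texts, n_cpus=1):
    """Tokenize texts with up to n_cpus worker processes.

    With n_cpus=1 everything runs in one process, so only one
    intermediate result exists at a time: slower, but the peak
    memory footprint is smaller."""
    if n_cpus == 1:
        return [tokenize(t) for t in texts]
    with Pool(processes=n_cpus) as pool:
        return pool.map(tokenize, texts)

if __name__ == "__main__":
    docs = ["one small document", "another tiny text"]
    print(tokenize_all(docs, n_cpus=1))
```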

0 Likes

#6

I haven’t tried that, no. I will check that out and get back to you :slight_smile:

0 Likes

(Dale Evans) #7

I’ve come across this, and the easiest solution was to just split the data up using

split -l 100000 train.txt --additional-suffix=.txt
rm train.txt

you may need to fiddle with the max lines per file of course
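If you'd rather do the split from Python, here's a rough equivalent of the `split -l` command above (`split_file` is a hypothetical helper, not part of fastai):

```python
from pathlib import Path

def split_file(src, max_lines=100_000):
    """Split src into numbered .txt parts of at most max_lines lines
    each (roughly what `split -l` does), then remove the original."""
    src = Path(src)
    part, buf, written = 0, [], []
    with src.open() as f:
        for line in f:
            buf.append(line)
            if len(buf) >= max_lines:
                out = src.with_name(f"{src.stem}_{part:03d}.txt")
                out.write_text("".join(buf))
                written.append(out)
                part, buf = part + 1, []
    if buf:  # flush the final, possibly short, part
        out = src.with_name(f"{src.stem}_{part:03d}.txt")
        out.write_text("".join(buf))
        written.append(out)
    src.unlink()  # mirror the `rm train.txt` step
    return written
```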

0 Likes

#8

I’m sorry I haven’t checked this in a long time (almost a month since my last reply), but it’s still on my to-do list. Thanks for the suggestion, @daleevans :slight_smile:

0 Likes

#9

Hi @daleevans, could you please elaborate a bit on your answer? I am currently running into this problem in the language-model part, where I have many small texts in a folder, so I don’t know if I have any equivalent of train.txt

0 Likes

#10

Now on version 1.0.52:

The problem persists: I consistently get the MemoryError, even after removing most of the largest and shortest texts. I was about to try changing n_cpus as per your suggestion, but an intermediate step seems to solve the issue somehow.

I am simply passing this:

    tokenizer = Tokenizer(SpacyTokenizer, 'es')
    processor = [TokenizeProcessor(tokenizer=tokenizer),
                 NumericalizeProcessor(max_vocab=30000)]
    kwargs = {'processor': processor}

to

    data = (TextList.from_folder(path, **kwargs)
            .filter_by_folder(include=include_folders)
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=batch_size))

Note I have not included n_cpus anywhere yet! If I remove that kwarg, the error comes back. If I leave it in, it goes away (and the data loads amazingly fast). The same happens with ‘en’ for the tokenizer.

The problem is that this solution implies using SpacyTokenizer directly, and I wonder whether that removes some of the cool features of fastai’s text processing (all the special tokens used to mark repetitions, capitalization and so on). I can’t check this myself yet since I am fixing other version-upgrade issues.

0 Likes

#11

This is weird because the only thing you are changing from the default is setting max_vocab to 30,000 (instead of 60,000).

0 Likes

#12

Well, that is puzzling.

I have tested the code above with max_vocab=60000 and even Tokenizer(SpacyTokenizer, 'en'), and it works fine. But if I call TextList without **kwargs, I get the MemoryError again.

Is there anything else I could be changing by doing this? Otherwise, is there anything I can do to help diagnose the situation? I could, for example, copy the complete error message.

0 Likes

#13

Oh it’s silly! You’re not opening your files so you’re tokenizing the filenames. You should add OpenFileProcessor to your processor.
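The symptom is easy to reproduce outside fastai: without an open-file step, the "text" being tokenized is just the path string. A toy illustration, with a whitespace tokenizer standing in for the real one:

```python
from pathlib import Path
import tempfile

def tokenize(text):
    # Toy stand-in for a real tokenizer: whitespace split.
    return text.split()

tmp = Path(tempfile.mkdtemp()) / "doc.txt"
tmp.write_text("the actual document contents")

# Forgetting the open step: you tokenize the filename...
print(tokenize(str(tmp)))         # a single path-like token
# ...opening first yields the real tokens (OpenFileProcessor's job).
print(tokenize(tmp.read_text()))  # ['the', 'actual', 'document', 'contents']
```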

Side note: we are working on a tokenizer that saves files on disks instead of consuming all RAM, it will be in v1.1 (testing version should be out in a few weeks).

2 Likes

#14

Ahhh hahaha, that also explains why it goes way faster than it should! Alright then, I’ll add OpenFileProcessor to bring back the MemoryError and then see if I can solve it with n_cpus (which will make things very slow, but still).

0 Likes

#15

From what I’ve seen in my experiments, it won’t help. You just don’t have enough RAM to store all the tokenized texts, so you should preprocess bit by bit.

0 Likes

#16

That makes sense. Do I need to train on only part of the data at a time, or can I preprocess the data in parts and then build a single databunch? Perhaps this is explained somewhere?

0 Likes

#17

Your whole numericalized dataset should fit in RAM (otherwise you either have too little RAM or a dataset that’s way too huge!). You should do the preprocessing yourself and create a DataBunch directly from your ids.
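A minimal sketch of that workflow, assuming a toy whitespace tokenizer and an incrementally built vocab dict (`numericalize_in_chunks` is a hypothetical name; in fastai v1 the resulting ids would then go to something like `TextLMDataBunch.from_ids`):

```python
from pathlib import Path

def numericalize_in_chunks(paths, vocab, chunk_size=1000):
    """Yield one list of token ids per file, processing chunk_size
    files at a time so only a small slice of raw text is ever held
    in memory; vocab grows as new tokens appear."""
    for i in range(0, len(paths), chunk_size):
        for p in paths[i:i + chunk_size]:
            tokens = p.read_text().split()
            yield [vocab.setdefault(t, len(vocab)) for t in tokens]

# The small-integer ids produced here, not the raw texts, are what
# you would feed into a DataBunch built directly from ids.
```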

1 Like

#18

I have 2.3 GB of text for the language model. Is that awfully huge? Do you guys have an intuition about what’s a good amount of text to train the language model?

0 Likes

#19

It seems big, so you can probably train a good model on a quarter of that. Again, once v1.1 is out, you won’t have problems tokenizing the whole thing.

2 Likes

#20

Looking forward to v1.1! In the meantime I will probably train the language model on a reduced number of texts. Otherwise I will put together the databunch manually. Thanks!

0 Likes