Large text files in unsupervised set cause MemoryError in transfer learning

#1

I am training a language model, loading data like (batch size 12):

data = (TextList.from_folder(path)
    .filter_by_folder(include=include_folders)
    .random_split_by_pct(0.1)
    .label_for_lm()
    .databunch(bs=batch_size))

When I select a folder with 48550 documents (316.4 MB) it works fine. If I choose a folder with 87141 documents (only 304.8 MB!) I get a “MemoryError”.

Any quick fix for this?

What is the fastai team's intuition about the ideal size of an unlabelled dataset for the language model?

1 Like

#2

I have now determined that the problem is caused neither by the number of files (reducing it to 50,000 did not help) nor by the total size (we already saw that 316.4 MB was fine while another set of 304.8 MB was not). The problem, apparently, is caused by individual very large files: in the problematic set I have files of up to 709.5 kB of plain text, whereas in the set that worked fine the largest was 120.5 kB. Removing all files over 100 kB from the problematic set solved the problem.
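In case it helps anyone else, this is roughly the check I ran, as a quick stdlib sketch (`files_over` is a hypothetical helper, not a fastai function):

```python
from pathlib import Path
import tempfile

def files_over(folder, max_kb=100):
    """Return paths of plain-text files larger than max_kb kilobytes."""
    return sorted(
        p for p in Path(folder).rglob("*.txt")
        if p.stat().st_size > max_kb * 1024
    )

# Example: write one small and one large file, then scan the folder.
tmp = Path(tempfile.mkdtemp())
(tmp / "small.txt").write_text("short doc")
(tmp / "big.txt").write_text("x" * 200_000)  # ~195 kB
print(files_over(tmp))  # only big.txt is flagged
```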

0 Likes

#3

That is weird; I have never seen an error like this.

0 Likes

#4

It IS weird! I suppose you have worked with files larger than that? If you have, then perhaps I can take a better look at the problematic files and see if I can find any more clues…

0 Likes

#5

I’m not sure if I have. I’m guessing it’s due to the multi-processing and the weird way memory is allocated by python then, so it would probably depend on the hardware. Did you try reducing the number of processes? You can pass n_cpus to the tokenizer for that, or change the value of defaults.cpus.
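In fastai v1 that would look something like `Tokenizer(n_cpus=1)`, or setting `defaults.cpus = 1`. As a rough, library-free sketch of why fewer worker processes can lower the peak (each worker materialises its own share of tokenized texts at once), assuming a toy whitespace tokenizer and a hypothetical `tokenize_all` helper:

```python
from multiprocessing import Pool

def tokenize(text):
    # Toy stand-in for a real tokenizer: whitespace split.
    return text.split()

def tokenize_all(texts, n_cpus=1):
    """Tokenize texts with up to n_cpus worker processes.

    With n_cpus=1 everything runs in one process, so only one
    intermediate result exists at a time: slower, but the peak
    memory footprint is smaller."""
    if n_cpus == 1:
        return [tokenize(t) for t in texts]
    with Pool(processes=n_cpus) as pool:
        return pool.map(tokenize, texts)

if __name__ == "__main__":
    docs = ["one small document", "another tiny text"]
    print(tokenize_all(docs, n_cpus=1))
```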

0 Likes

#6

I haven’t tried that, no. I will check that out and get back to you :slight_smile:

0 Likes

(Dale Evans) #7

I’ve come across this, and the easiest solution was to just split the data up using

split -l 100000 train.txt --additional-suffix=.txt
rm train.txt

you may need to fiddle with the max lines per file of course
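If you'd rather do the split from Python, here's a rough equivalent of the `split -l` command above (`split_file` is a hypothetical helper, not part of fastai):

```python
from pathlib import Path

def split_file(src, max_lines=100_000):
    """Split src into numbered .txt parts of at most max_lines lines
    each (roughly what `split -l` does), then remove the original."""
    src = Path(src)
    part, buf, written = 0, [], []
    with src.open() as f:
        for line in f:
            buf.append(line)
            if len(buf) >= max_lines:
                out = src.with_name(f"{src.stem}_{part:03d}.txt")
                out.write_text("".join(buf))
                written.append(out)
                part, buf = part + 1, []
    if buf:  # flush the final, possibly short, part
        out = src.with_name(f"{src.stem}_{part:03d}.txt")
        out.write_text("".join(buf))
        written.append(out)
    src.unlink()  # mirror the `rm train.txt` step
    return written
```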

0 Likes

#8

I’m sorry I haven’t checked this in a long time (almost a month since my last reply), but it’s still on my to-do list. Thanks for the suggestion, @daleevans :slight_smile:

0 Likes

#9

Hi @daleevans, could you please elaborate a bit on your answer? I am currently running into this problem in the language-model part, where I have many small texts in a folder, so I don’t know if I have any equivalent of train.txt

0 Likes

#10

Now on version 1.0.52:

The problem persists: I consistently get the MemoryError, even after removing most of the largest and shortest texts. I was about to try changing n_cpus as per your suggestion, but an intermediate step seems to solve the issue somehow.

I am simply passing this:

    tokenizer = Tokenizer(SpacyTokenizer, 'es')
    processor = [TokenizeProcessor(tokenizer=tokenizer),
                 NumericalizeProcessor(max_vocab=30000)]
    kwargs = {'processor': processor}

to

    data = (TextList.from_folder(path, **kwargs)
            .filter_by_folder(include=include_folders)
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=batch_size))

Note I have not included n_cpus anywhere yet! If I remove that kwarg, the error comes back. If I leave it in, it goes away (and the data loads amazingly fast). The same happens with ‘en’ for the tokenizer.

The problem is that this solution implies using SpacyTokenizer directly, and I wonder whether that removes some of the cool features of fastai’s text processing (all the special tokens used to mark repetitions, capitalization and so on). I can’t check this myself yet since I am fixing other version-upgrade issues.

0 Likes

#11

This is weird because the only thing you are changing from the default is setting max_vocab to 30,000 (instead of 60,000).

0 Likes

#12

Well, that is puzzling.

I have tested the code above with max_vocab=60000 and even Tokenizer(SpacyTokenizer, 'en'), and it works fine. But if I call TextList without **kwargs, I get the MemoryError again.

Is there anything else I could be changing by doing this? Otherwise, is there anything I can do to help diagnose the situation? I could, for example, copy the complete error message.

0 Likes

#13

Oh it’s silly! You’re not opening your files so you’re tokenizing the filenames. You should add OpenFileProcessor to your processor.
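The symptom is easy to reproduce outside fastai: without an open-file step, the "text" being tokenized is just the path string. A toy illustration, with a whitespace tokenizer standing in for the real one:

```python
from pathlib import Path
import tempfile

def tokenize(text):
    # Toy stand-in for a real tokenizer: whitespace split.
    return text.split()

tmp = Path(tempfile.mkdtemp()) / "doc.txt"
tmp.write_text("the actual document contents")

# Forgetting the open step: you tokenize the filename...
print(tokenize(str(tmp)))         # a single path-like token
# ...opening first yields the real tokens (OpenFileProcessor's job).
print(tokenize(tmp.read_text()))  # ['the', 'actual', 'document', 'contents']
```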

Side note: we are working on a tokenizer that saves files on disks instead of consuming all RAM, it will be in v1.1 (testing version should be out in a few weeks).

2 Likes

#14

Ahhh hahaha, that also explains why it goes way faster than it should! Alright then, I’ll add OpenFileProcessor to bring back the MemoryError and then see if I can solve it with n_cpus (which will make things very slow, but still).

0 Likes

#15

From what I’ve seen in my experiments, it won’t help. You just don’t have enough RAM to store all the tokenized texts, so you should preprocess bit by bit.

0 Likes

#16

That makes sense. Do I need to train on only part of the data at a time, or can I preprocess the data in parts and then build a single databunch? Perhaps this is explained somewhere?

0 Likes

#17

Your whole numericalized dataset should fit in RAM (otherwise you either have too little RAM or a dataset that’s way too huge!). You should do the preprocessing yourself and create a DataBunch directly from your ids.
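A minimal sketch of that workflow, assuming a toy whitespace tokenizer and an incrementally built vocab dict (`numericalize_in_chunks` is a hypothetical name; in fastai v1 the resulting ids would then go to something like `TextLMDataBunch.from_ids`):

```python
from pathlib import Path

def numericalize_in_chunks(paths, vocab, chunk_size=1000):
    """Yield one list of token ids per file, processing chunk_size
    files at a time so only a small slice of raw text is ever held
    in memory; vocab grows as new tokens appear."""
    for i in range(0, len(paths), chunk_size):
        for p in paths[i:i + chunk_size]:
            tokens = p.read_text().split()
            yield [vocab.setdefault(t, len(vocab)) for t in tokens]

# The small-integer ids produced here, not the raw texts, are what
# you would feed into a DataBunch built directly from ids.
```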

1 Like

#18

I have 2.3 GB of text for the language model. Is that awfully huge? Do you guys have an intuition about what’s a good amount of text to train the language model?

0 Likes

#19

It seems big, so you can probably train a good model on a quarter of that. Again, once v1.1 is out, you won’t have problems tokenizing the whole thing.

2 Likes

#20

Looking forward to v1.1! In the meantime I will probably train the language model on a reduced number of texts. Otherwise I will put together the databunch manually. Thanks!

0 Likes