Tokenizer.process_all error BrokenProcessPool

Hi everyone, I’m trying to tokenize a large dataset and I got the following error:
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I got that error when doing:
tok = Tokenizer()
tok.process_all(texts)
where texts is an array containing around 500k pieces of text. I can load everything in memory without a problem (it takes 10 GB, but I have 32 available).

Any idea where this is coming from?

It may be due to Python's inefficient memory handling in multiprocessing: the data gets copied to each worker process, so you may run out of RAM even if you think you have enough.

I’m monitoring my RAM usage with htop while doing that, and everything appears fine (I never go over 15 GB of usage). Is it possible that I’m running out of RAM without being able to see it in htop?

No, normally not. Does it work on one CPU (set defaults.cpus = 1 to force that)? You may have a broken text.
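For reference, here is a minimal sketch of forcing single-process tokenization, assuming the fastai v1 text API (from fastai.text import * exposes defaults); set defaults.cpus before constructing the Tokenizer, since it reads the value at construction time:

from fastai.text import *

defaults.cpus = 1                  # disable the multiprocessing pool
tok = Tokenizer()                  # picks up n_cpus from defaults.cpus
tokens = tok.process_all(texts)    # texts is your list of raw strings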

I tried running it again, and this time I could see all the memory being used just before the crash. Thanks for the help.

After some digging into the library, I was able to solve my problem by creating a TokenizeProcessor. If anyone encounters the same problem as me, I recommend sticking to the base constructors (DataBunch.from_csv or similar), which take care of this for you. If, like me, you need to tokenize your dataset manually, a TokenizeProcessor should do the trick and handle the memory issue as well (by loading the dataset chunk by chunk).

A TokenizeProcessor is created automatically behind the scenes whichever API you’re using. What default did you change to make it work on your dataset?

Originally, I was creating a Tokenizer and calling process_all_core on my dataset, which caused the memory error. Instead, I created a TokenizeProcessor(tokenizer=mytokenizer) and used its process method on my dataset. This solved my issue.
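For anyone hitting the same issue, here is a minimal sketch of that approach, assuming fastai v1 (TokenizeProcessor.process expects an ItemList-like object with an .items attribute, so the raw strings are wrapped in a TextList here; mytokenizer and texts are placeholders for your own tokenizer and data):

from fastai.text import *

mytokenizer = Tokenizer()                                      # or your custom tokenizer
processor = TokenizeProcessor(tokenizer=mytokenizer, chunksize=10000)
item_list = TextList(texts)        # wrap the raw strings so process() can update them in place
processor.process(item_list)       # tokenizes the texts chunk by chunk
tokens = item_list.items           # the tokenized texts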

Aaaaah, I hadn’t understood! Yes, you don’t want to apply the tokenization to the whole dataset at once; that’s why there is the chunksize argument. Note that if you let the fastai library tokenize for you, it will do it in chunks of 10,000 entries :wink:
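For completeness, a minimal sketch of letting fastai handle the tokenization, assuming fastai v1 and a CSV with 'text' and 'label' columns (the file name and column names are placeholders); the chunksize argument defaults to 10,000 and can be overridden:

from fastai.text import *

data = TextDataBunch.from_csv(path='.', csv_name='my_texts.csv',
                              text_cols='text', label_cols='label',
                              chunksize=10000)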


Can you please explain how you created mytokenizer for your dataset?

Hello All,

I’m writing my first project in fastai, and I am facing the same issue while running the code on my local machine, but it runs fine in a Kaggle notebook.

If I understood correctly, do we have to write an iterator that feeds the data in small batches? I tried making my dataset smaller, just 10 entries, and I still get the same error.