BrokenProcessPool exception when building a LanguageModel

sebastian · October 11, 2018, 12:44pm

My apologies if this belongs somewhere else, but I keep getting a BrokenProcessPool exception when trying to build a LanguageModel.

tokenizer = Tokenizer(tok_func=SpacyTokenizer, lang="nl")
data_lm = text_data_from_csv(path=(DATA_PATH/"data").expanduser(),
                             tokenizer=tokenizer, 
                             data_func=lm_data,
                             chunksize=10_000,
                             n_labels=1,
                             max_vocab=100_000)

This results in the following exception during tokenization: BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The dataset I’m tokenizing is the Dutch Wikipedia. A test with a sample of 200,000 texts caused no problems, but the full dataset (about 2.7 million texts) raised the exception.

I’m trying to find out what might be causing this, before submitting it as an issue. Does anyone have any idea or suggestion? A smaller chunksize (1_000) didn’t help.

Thanks in advance!

sgugger · October 11, 2018, 1:39pm

Unfortunately, to get a clearer error message, you’ll need to run this on a single CPU (by passing n_cpus=1 as an argument). It’s going to take a lot longer, but that’s the only way to learn more about this error.
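
To illustrate why a single process helps, here is a minimal sketch that does not use fastai at all. The names tokenize_chunk and tokenize_all are hypothetical stand-ins, and the whitespace split only imitates a real tokenizer; the n_cpus parameter mirrors the argument mentioned above. With a process pool, a worker that dies abruptly only surfaces as BrokenProcessPool in the parent, while the serial path raises the underlying error directly with its own traceback.

```python
from concurrent.futures import ProcessPoolExecutor


def tokenize_chunk(texts):
    """Hypothetical stand-in tokenizer: split each text on whitespace."""
    return [text.split() for text in texts]


def tokenize_all(chunks, n_cpus=1):
    """Tokenize every chunk, serially when n_cpus == 1, else in a pool."""
    if n_cpus == 1:
        # Serial path: a failure raises right here, in the main process,
        # so the traceback points at the text that caused it.
        return [tokenize_chunk(chunk) for chunk in chunks]
    # Parallel path: if a worker process is terminated abruptly, the
    # parent only sees BrokenProcessPool, not the underlying cause.
    with ProcessPoolExecutor(max_workers=n_cpus) as pool:
        return list(pool.map(tokenize_chunk, chunks))


# Serial run on two tiny chunks of text.
chunks = [["een kleine test"], ["nog een zin"]]
tokens = tokenize_all(chunks, n_cpus=1)
```

Running the serial path on a small sample first, then on the full data, is one way to narrow down which input triggers the failure.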

dipesh_pal · December 23, 2019, 9:00am

Where do we have to run that?