BrokenProcessPool exception when building a LanguageModel


#1

My apologies if this belongs somewhere else, but I keep getting a BrokenProcessPool exception when trying to build a LanguageModel.

tokenizer = Tokenizer(tok_func=SpacyTokenizer, lang="nl")
data_lm = text_data_from_csv(path=(DATA_PATH/"data").expanduser(),
                             tokenizer=tokenizer, 
                             data_func=lm_data,
                             chunksize=10_000,
                             n_labels=1,
                             max_vocab=100_000)

This results in the following exception during tokenization: BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The dataset I’m tokenizing is the Dutch Wikipedia. A test with a sample of 200,000 texts caused no problems, but the full dataset (about 2.7 million texts) raised the exception.

I’m trying to find out what might be causing this, before submitting it as an issue. Does anyone have any idea or suggestion? A smaller chunksize (1_000) didn’t help.

Thanks in advance!


#2

Unfortunately, to get a clearer error message, you’ll need to run this on one CPU (by passing n_cpus=1 as an argument). It will take a lot longer, but that’s the only way to find out more about this error.
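
To illustrate why a single process helps, here is a minimal sketch using only Python's standard library. It is not the library's actual code: process_all and toy_tokenize are hypothetical names made up for this example, and the n_cpus parameter here only mimics the argument mentioned above. The idea is that with one CPU no worker processes are involved, so whatever goes wrong happens in the main process instead of being reported only as BrokenProcessPool.

```python
from concurrent.futures import ProcessPoolExecutor


def process_all(func, chunks, n_cpus=1):
    """Apply func to every chunk (hypothetical helper, for illustration only)."""
    if n_cpus <= 1:
        # Single-process path: no pool is created, so any failure inside
        # func surfaces directly in the main process.
        return [func(chunk) for chunk in chunks]
    # Multi-process path: if a worker process dies abruptly, the pool can
    # only report BrokenProcessPool, without saying what killed the worker.
    with ProcessPoolExecutor(max_workers=n_cpus) as executor:
        return list(executor.map(func, chunks))


def toy_tokenize(texts):
    # Hypothetical stand-in for the real tokenizer: split on whitespace.
    return [text.split() for text in texts]


if __name__ == "__main__":
    chunks = [["een korte zin", "nog een zin"], ["de laatste tekst"]]
    print(process_all(toy_tokenize, chunks, n_cpus=1))
```

Running the real tokenization this way on the full dataset will be slow, so it may be worth first trying it on the part of the data beyond the 200,000 texts that already worked.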