My apologies if this belongs somewhere else, but I keep getting a BrokenProcessPool
exception when trying to build a language model.
tokenizer = Tokenizer(tok_func=SpacyTokenizer, lang="nl")
data_lm = text_data_from_csv(path=(DATA_PATH/"data").expanduser(),
                             tokenizer=tokenizer,
                             data_func=lm_data,
                             chunksize=10_000,
                             n_labels=1,
                             max_vocab=100_000)
This results in the following exception during tokenization: BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
The dataset I’m tokenizing is the Dutch Wikipedia. A test with a sample of 200,000 texts caused no problems, but the full dataset (about 2.7 million texts) raises the exception.
I’m trying to find out what might be causing this before submitting it as an issue. Does anyone have any idea or suggestion? A smaller chunksize (1_000) didn’t help.
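For context on what I’ve found so far: BrokenProcessPool comes from Python’s concurrent.futures, not from fastai itself, and it means a worker process died without reporting back, commonly because the OS OOM killer terminated it on a large dataset. That would be consistent with the sample succeeding and the full corpus failing. A minimal, fastai-independent sketch that reproduces the exception by making a worker exit abruptly (os._exit here stands in for an OOM kill):

```python
# Sketch only: reproduces BrokenProcessPool by killing a worker process,
# simulating what happens when the OS OOM killer terminates a tokenizer worker.
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def die(_):
    # Worker exits without cleanup, like a process killed by the OS.
    os._exit(1)

def main():
    try:
        with ProcessPoolExecutor(max_workers=2) as ex:
            list(ex.map(die, range(4)))  # retrieving results raises the error
    except BrokenProcessPool:
        return "BrokenProcessPool raised"
    return "no error"

if __name__ == "__main__":
    print(main())
```

If memory really is the culprit, the fix would be reducing per-worker memory (fewer worker processes, or processing the CSV in smaller pieces) rather than a smaller chunksize alone, which matches my observation that chunksize=1_000 didn’t help.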
Thanks in advance!