Imdb.ipynb: chunksize cannot be larger than 5 or Python stops working

Hi guys,

My local machine runs Windows and has 256 GB of RAM; I run the notebook in Chrome.

In the “Language model tokens” section, when I run the code below with a chunksize larger than 5, the notebook crashes with the error “Python has stopped working”. Such a small chunksize makes tokenization very slow.

Has anyone else run into this problem? Thanks!

chunksize = 50  # the original value in the notebook is 24000
df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
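
For anyone not familiar with chunked reading: passing chunksize to pd.read_csv returns an iterator (a TextFileReader) that yields DataFrames of chunksize rows each, so the value above controls how many reviews get handed to the tokenizer per call of get_texts. A minimal standalone sketch of that iteration ('some.csv' is just a placeholder path, not a file from the notebook):

import pandas as pd

# standalone illustration of chunked reading; 'some.csv' is a placeholder
reader = pd.read_csv('some.csv', header=None, chunksize=50)
for i, chunk in enumerate(reader):
    # each `chunk` is a DataFrame with up to 50 rows, so a 25k-row train.csv
    # is split into ~500 chunks instead of ~2 with the original chunksize=24000
    print(i, chunk.shape)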


BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 tok_trn, trn_labels = get_all(df_trn, 1)
      2 tok_val, val_labels = get_all(df_val, 1)

<ipython-input-...> in get_all(df, n_lbls)
      3     for i, r in enumerate(df):
      4         print(i)
----> 5         tok_, labels_ = get_texts(r, n_lbls)
      6         tok += tok_;
      7         labels += labels_

<ipython-input-...> in get_texts(df, n_lbls)
      6
      7     # proc_all_mp can use all the cores of your CPU, in Jeremy's machine it accelerate from 1h+ to be 2mins
----> 8     tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
      9     return tok, list(labels)

E:\fastai2\fastai\courses\dl2\fastai\text.py in proc_all_mp(ss, lang)
     99         ncpus = ncpus or num_cpus()//2
    100         with ProcessPoolExecutor(ncpus) as e:
--> 101             return sum(e.map(Tokenizer.proc_all, ss, [lang]*len(ss)), [])
    102
    103

~\AppData\Local\Continuum\anaconda3\envs\fastai\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
    364     careful not to keep references to yielded objects.
    365     """
--> 366     for element in iterable:
    367         element.reverse()
    368         while element:

~\AppData\Local\Continuum\anaconda3\envs\fastai\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

~\AppData\Local\Continuum\anaconda3\envs\fastai\lib\concurrent\futures\_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~\AppData\Local\Continuum\anaconda3\envs\fastai\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
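
In case it helps with debugging, the workaround I'm experimenting with (just a sketch, I haven't confirmed it avoids the crash) is to skip the process pool and tokenize in a single process. Tokenizer.proc_all is the worker that proc_all_mp maps over each partition (you can see it in the traceback above), so calling it directly on the un-partitioned texts should give the same tokens, only without the multi-core speed-up and without spawning the worker processes that Windows appears to be killing:

# inside get_texts, replace the proc_all_mp line with a single-process call;
# Tokenizer.proc_all(ss, lang) tokenizes a flat list of strings, so
# partition_by_cores isn't needed here
tok = Tokenizer.proc_all(texts, 'en')
return tok, list(labels)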

@allensun I'm running into the same problem here. Did you manage to solve it?