@jeremy @sgugger I found that not all my cores are used while process-pooling the tokenization. Is there a reason we use os.cpu_count()//2 and not all the cores?
Yeah, it often seems hyper-threading leads to worse performance here. I haven’t tested it that carefully, however.
I’ve done testing across multiple cores, assigning varying numbers of workers — not using fastai, just the process pool library directly. Will share it soon.
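Roughly, the benchmark looks like this (a minimal sketch with a placeholder tokenize function standing in for the real spacy/df workload, not the final code I’ll share):

```python
# Sketch: time a tokenization job across different process-pool sizes,
# comparing half the cores (fastai's default) against all of them.
import os
import time
from multiprocessing import Pool


def tokenize(text):
    # Placeholder for the real spacy/df tokenization workload.
    return text.lower().split()


def bench(n_workers, corpus):
    """Return the wall-clock time to tokenize the corpus with n_workers processes."""
    start = time.time()
    with Pool(n_workers) as pool:
        pool.map(tokenize, corpus)
    return time.time() - start


if __name__ == "__main__":
    corpus = ["Some example sentence to tokenize ."] * 10_000
    for n in (1, os.cpu_count() // 2, os.cpu_count()):
        print(f"{n:2d} workers: {bench(n, corpus):.3f}s")
```

On hyper-threaded machines, os.cpu_count() reports logical cores, so "all cores" means two workers per physical core — which is where the slowdown Jeremy mentions can show up.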
Also, do you mean worse performance for DataFrame-oriented tokenization, spacy tokenization, or tokenization in general?
Spacy tokenization. Although that was when we used a different approach…