Re: fastai.text tokenizer n_cpus

@jeremy @sgugger I found that not all my cores are used when the tokenization runs over the process pool. Is there a reason we use os.cpu_count()/2 rather than all the cores?
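For context, here is a minimal sketch of the pattern being asked about: parallel tokenization over a process pool with the worker count defaulting to half the logical cores. The function names and the whitespace-split tokenizer are illustrative stand-ins, not fastai's actual implementation:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def tokenize(text):
    # Stand-in tokenizer: simple whitespace split.
    # fastai uses spaCy here; swap it in for a realistic test.
    return text.split()

def parallel_tokenize(texts, n_workers=None):
    # Default mirrors the behaviour in question: half the logical cores,
    # which on hyper-threaded machines means one worker per physical core.
    if n_workers is None:
        n_workers = max(1, (os.cpu_count() or 2) // 2)
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        # chunksize batches texts per task to reduce inter-process overhead
        return list(executor.map(tokenize, texts, chunksize=64))
```

Passing `n_workers=os.cpu_count()` would use every logical core instead.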

Yeah, it seems hyper-threading often leads to worse performance here. I haven't tested it that carefully, however.


I've done some testing across multiple core counts, varying n_cpus — not with fastai, just with the process pool library directly — and will share the results soon.
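A rough version of that kind of test — timing the same tokenization job at different worker counts — might look like the sketch below. The whitespace split stands in for a real tokenizer, and absolute timings will vary by machine; the interesting part is whether the numbers keep improving past the physical core count:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tok(text):
    # Stand-in tokenizer; replace with spaCy or fastai's Tokenizer
    # to measure the case actually discussed in this thread.
    return text.split()

def time_tokenize(texts, n_workers):
    # Returns wall-clock seconds for tokenizing all texts with n_workers.
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        list(executor.map(tok, texts, chunksize=256))
    return time.perf_counter() - start

if __name__ == "__main__":
    texts = ["some sample text to tokenize"] * 10_000
    for n in (1, 2, 4, 8):
        print(f"{n} workers: {time_tokenize(texts, n):.3f}s")
```

Note that with a tokenizer this cheap, process startup and pickling overhead can dominate, so a real comparison should use the actual spaCy pipeline and a realistic corpus.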

Also, do you mean worse performance for DataFrame-oriented tokenization, spaCy tokenization, or tokenization in general?

spaCy tokenization. Although that was when we used a different approach…