@jeremy @sgugger I found that not all my cores are used while process-pooling the tokenization. Is there a reason we use os.cpu_count()//2 and not all the cores?
Yeah, it often seems hyper-threading leads to worse performance here. I haven’t tested it that carefully, however.
I’ve done testing across multiple cores, assigning varying numbers of workers — not using fastai, just the process pool library directly. Will share it soon.
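Roughly, the benchmark looks like this (a minimal sketch with a placeholder tokenize function standing in for the real spacy/df workload, not the final code I’ll share):

```python
# Sketch: time a tokenization job across different process-pool sizes,
# comparing half the cores (fastai's default) against all of them.
import os
import time
from multiprocessing import Pool


def tokenize(text):
    # Placeholder for the real spacy/df tokenization workload.
    return text.lower().split()


def bench(n_workers, corpus):
    """Return the wall-clock time to tokenize the corpus with n_workers processes."""
    start = time.time()
    with Pool(n_workers) as pool:
        pool.map(tokenize, corpus)
    return time.time() - start


if __name__ == "__main__":
    corpus = ["Some example sentence to tokenize ."] * 10_000
    for n in (1, os.cpu_count() // 2, os.cpu_count()):
        print(f"{n:2d} workers: {bench(n, corpus):.3f}s")
```

On hyper-threaded machines, os.cpu_count() reports logical cores, so "all cores" means two workers per physical core — which is where the slowdown Jeremy mentions can show up.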
Also, do you mean worse performance for DataFrame-oriented tokenization, spacy tokenization, or tokenization in general?
Spacy tokenization. Although that was when we used a different approach…