Dataloader - parallel processing error - ParallelNative.cpp

Problem: the dataloader is not using all CPUs, so training is very slow.

My environment:

Python 3.6

I am creating a dataloader like this:

from fastai.text.all import *  # TextBlock, DataBlock, ColReader, RandomSplitter

textblock = TextBlock.from_df(
    '_VALUE',  # Which dataframe column to read
    is_lm=True,  # We only have X and no Y for the language model
    rules=[]  # Disable default fastai rules
)

datablock = DataBlock(
    blocks=textblock,  # That's how we read, tokenize and get X
    get_x=ColReader('text'),  # After going through TextBlock, tokens are in the column `text`
    splitter=RandomSplitter(0.2)  # Splitting to train/validation
)

dataloader = datablock.dataloaders(
    subset,  # Source of data
    bs=256,  # Batch size
)

When I attempt to train, I get a bunch of duplicate errors:

[W ParallelNative.cpp:206] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

From this article, it appears that PyTorch 1.7 has a bug related to parallel processing.
Setting an environment variable to 1 removes the error:

Possibly as a result, CPU load is very low (no parallel computing?) and training is slow. I am training on a CPU-only Mac, so I expect all CPUs to run at 100% to reach good speeds, but the dataloader seems to be the bottleneck due to the lack of parallel computing.
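To check whether threads are actually being limited, a small diagnostic along these lines may help. It is only a sketch: `thread_report` is a hypothetical helper name, and the two environment variables it reads are common thread-count settings that I am assuming are relevant here, not necessarily the variable the article refers to. The PyTorch calls are skipped if PyTorch is not installed.

```python
import os


def thread_report():
    """Collect CPU and thread settings that can affect CPU-only training."""
    report = {
        "cpu_count": os.cpu_count(),
        # Assumed-relevant thread-count variables; None means "not set".
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "MKL_NUM_THREADS": os.environ.get("MKL_NUM_THREADS"),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch not available in this environment
        return report
    report["torch"] = torch.__version__
    report["intraop_threads"] = torch.get_num_threads()
    report["interop_threads"] = torch.get_num_interop_threads()
    return report


if __name__ == "__main__":
    for key, value in thread_report().items():
        print(f"{key}: {value}")
```

If `intraop_threads` is 1 while `cpu_count` is much higher, low CPU load during training would be expected.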

I wonder if anyone has had the same problem?

The problem also does not happen if the dataloader has num_workers=0.
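To see what `num_workers` costs or gains, one pass over a loader can be timed for different worker counts. The sketch below uses a plain PyTorch `DataLoader` over synthetic tensors rather than the fastai pipeline above, so the numbers are only indicative; `time_one_pass` and `make_loader` are hypothetical helper names, and the worker counts and sizes are arbitrary.

```python
import time


def time_one_pass(batches):
    """Iterate once over any iterable of batches; return (batch_count, seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in batches)
    return count, time.perf_counter() - start


def make_loader(num_workers, n_items=2048, bs=256):
    """Build a PyTorch DataLoader over synthetic tensors (stand-in for real data)."""
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    data = TensorDataset(torch.randn(n_items, 64))
    return DataLoader(data, batch_size=bs, num_workers=num_workers)


if __name__ == "__main__":
    # The guard matters when worker processes are started with "spawn".
    try:
        for workers in (0, 2):
            count, seconds = time_one_pass(make_loader(workers))
            print(f"num_workers={workers}: {count} batches in {seconds:.3f}s")
    except ImportError:
        print("PyTorch is not installed; skipping the DataLoader comparison.")
```

With fastai, the same setting can be passed as a keyword argument to `datablock.dataloaders(...)`, for example `num_workers=0`.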

I am also using the exact same environment as you, @versus, except that my Python version is 3.8.3.

I am creating an instance of the ImageDataLoaders class and training it like this:

from fastai.vision.all import *  # untar_data, URLs, ImageDataLoaders, cnn_learner, ...

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")
def label_func(f): return f[0].isupper()
dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)

When I run the last two lines of code in a Jupyter notebook, I get the exact same error as you. I then tried setting:

torch.set_num_threads(1)  # call the function; assigning to it would only overwrite it

That resolved the problem, but training was too slow. If you have managed to solve it in the meantime, please let me know.

I was not able to solve it on a CPU-only machine. I tried Python 3.6 and 3.8, with the same result. The difference in 3.8 is that multiprocessing works differently, but it doesn’t seem to affect the problem.
The problem does not reproduce on GPU machines. There I am able to use multi-threaded data processing and there is no warning like this.
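The multiprocessing difference mentioned above most likely refers to Python 3.8 changing the default start method on macOS from fork to spawn; that is my reading of the post, not something it states. A quick way to confirm which method an interpreter uses is sketched below, with `describe_multiprocessing` as a hypothetical helper name.

```python
import multiprocessing as mp
import sys


def describe_multiprocessing():
    """Report the interpreter version, platform, and multiprocessing start method."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "platform": sys.platform,
        "start_method": mp.get_start_method(),
        "available_methods": mp.get_all_start_methods(),
    }


if __name__ == "__main__":
    for key, value in describe_multiprocessing().items():
        print(f"{key}: {value}")
```

Running this under both interpreters shows whether the two setups really differ in how worker processes are created.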

As I wrote in my first post, it seems to be a PyTorch bug reported here: