How to solve RuntimeError: DataLoader worker (pid(s)) exited unexpectedly?

I’m a beginner with deep learning, and I’m using Google Colab to run my code.
(PyTorch 1.4.0, torchvision 0.5.0)
I generated my DataBunch with this code:

tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.5)

src = (ImageList.from_folder(path=data_folder)
       .split_by_rand_pct(0.2)
       .label_from_folder())

img_data = (src.transform(tfms, size=128)
            .databunch()
            .normalize(imagenet_stats))

When I try to run

model = cnn_learner(img_data, models.resnet34, metrics=[accuracy, error_rate])
model.data = img_data
model.fit_one_cycle(3)

I get this RuntimeError: DataLoader worker (pid(s) XXX) exited unexpectedly:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    760         try:
--> 761             data = self._data_queue.get(timeout=timeout)
    762             return (True, data)

(13 frames omitted)
RuntimeError: DataLoader worker (pid 303) is killed by signal: Segmentation fault. 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    772             if len(failed_workers) > 0:
    773                 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 774                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    775             if isinstance(e, queue.Empty):
    776                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 303) exited unexpectedly

I’ve tried reducing my batch_size, but that didn’t help.
I’ve also searched for this error, and the usual suggestion is to set num_workers=0, but my code never creates a DataLoader directly, so I’m not sure where that argument would go (see the sketch below).
How can I solve this problem?
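Edit: one idea I’m looking at, if fastai v1’s .databunch() really does forward num_workers to the underlying DataLoader (I’m not 100% sure that’s the intended fix), a sketch of the change would be:

# Rebuild the DataBunch with num_workers=0 so data is loaded in the
# main process instead of worker subprocesses (sketch, not verified)
img_data = (src.transform(tfms, size=128)
            .databunch(num_workers=0)
            .normalize(imagenet_stats))

model = cnn_learner(img_data, models.resnet34, metrics=[accuracy, error_rate])
model.fit_one_cycle(3)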


By the way, I’ve also tried PyTorch 1.5.0 + torchvision 0.6.0; that combination doesn’t raise this error, but it does produce a warning:

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2854: 
UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. 
If you wish to keep the old behavior, please set recompute_scale_factor=True. 
See the documentation of nn.Upsample for details. 
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change ")

Also, with PyTorch 1.4.0 one epoch took about 20–30 minutes, but with PyTorch 1.5.0 one epoch takes over an hour.


OK, it seems the root of the problem was the PyTorch version.
After trying everything I could find on Google, I ran my code with the default versions on Colab (PyTorch 1.7.0 + torchvision 0.8.0) and it worked. The warnings are still there, though.
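If it helps anyone else, a quick sanity check of which versions the Colab runtime is actually using (nothing fancy):

import torch, torchvision, fastai

# Print the library versions in the current runtime;
# a mismatched torch/torchvision pair was the source of my trouble.
print(torch.__version__, torchvision.__version__, fastai.__version__)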
So I used

import warnings
warnings.filterwarnings("ignore")

to ignore them.
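A narrower option, if you’d rather silence only this particular warning instead of all of them, is to match its message text (taken from the warning above):

import warnings

# Ignore only the interpolate/upsample scale_factor UserWarning
warnings.filterwarnings(
    "ignore",
    message="The default behavior for interpolate/upsample with float scale_factor",
    category=UserWarning,
)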
The time per epoch is also back to something reasonable. What a strange problem.
Maybe fastai v2 caused it, or maybe Colab’s PyTorch build.
That’s the end of this question.
