How to solve RuntimeError: DataLoader worker (pid(s)) exited unexpectedly?

I’m a beginner with deep learning, and I’m using Google Colab to run my code.
(PyTorch 1.4.0, torchvision 0.5.0)
I generated my databunch with this code:

tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.5)

src = (ImageList.from_folder(path=data_folder)
       .split_by_rand_pct(0.2)
       .label_from_folder())

img_data = (src.transform(tfms, size=128)
            .databunch()
            .normalize(imagenet_stats))

When I try to run

model = cnn_learner(img_data, models.resnet34, metrics=[accuracy, error_rate])

I get RuntimeError: DataLoader worker (pid(s) XXX) exited unexpectedly:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/torch/utils/data/ in _try_get_data(self, timeout)
    760         try:
--> 761             data = self._data_queue.get(timeout=timeout)
    762             return (True, data)

13 frames
RuntimeError: DataLoader worker (pid 303) is killed by signal: Segmentation fault. 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/torch/utils/data/ in _try_get_data(self, timeout)
    772             if len(failed_workers) > 0:
    773                 pids_str = ', '.join(str( for w in failed_workers)
--> 774                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    775             if isinstance(e, queue.Empty):
    776                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 303) exited unexpectedly

I’ve tried reducing my batch_size, but that didn’t work.
I’ve also searched for this error, and the usual advice is to set num_workers=0, but I never call the DataLoader API directly in my code.
How could I solve this problem?


By the way, I’ve tried PyTorch 1.5.0 + torchvision 0.6.0; that combination doesn’t raise this error, but it prints a warning:

UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. 
If you wish to keep the old behavior, please set recompute_scale_factor=True. 
See the documentation of nn.Upsample for details. 
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change ")

And when I used PyTorch 1.4.0, each epoch took about 20-30 minutes, but with PyTorch 1.5.0 each epoch took over an hour.


OK, it seems the root of the problem was the PyTorch version.
After trying every fix I could find on Google, I ran my code with Colab’s default versions (PyTorch 1.7.0 + torchvision 0.8.0) and it worked. The warnings still exist, though.
So I used

import warnings
warnings.filterwarnings("ignore")

to ignore them.
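A blanket filterwarnings("ignore") hides every warning, including useful ones. If you only want to drop that interpolate/upsample message, a more targeted option is to filter by message and category (pure stdlib, so it behaves the same everywhere):

```python
import warnings

# Silence only the interpolate/upsample deprecation warning;
# the message argument is a regex matched against the start of the text.
warnings.filterwarnings(
    "ignore",
    message="The default behavior for interpolate/upsample",
    category=UserWarning,
)

# Quick check: the targeted warning is dropped, others still come through.
with warnings.catch_warnings(record=True) as caught:
    warnings.warn(
        "The default behavior for interpolate/upsample with float "
        "scale_factor will change", UserWarning)
    warnings.warn("an unrelated warning", UserWarning)

print([str(w.message) for w in caught])  # the interpolate warning is gone
```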
And the time per epoch also became reasonable again. What a strange problem.
Maybe fastai v2 caused the problem, or maybe Colab’s PyTorch build.
That’s the end of this question.
