RuntimeError: DataLoader worker is killed by signal

Just to be clear, I have also only ever gotten these errors on ‘image’ datasets, not text. But what I tried to show with the notebook is that it has nothing to do with the actual content of the data: the lists of filenames, stored as strings, and the dicts of labels are enough to cause this on their own, if they are large enough. I just meant that the problem would be even worse if you were using other large lists of objects, such as tokens for language models, within the dataloaders… And while I have not looked at it, I would assume the pytorch ImageFolder method also stores filenames in some sort of list; as long as that receives no special treatment, the same problems would therefore apply.
The case of image size causing “killed by bus” cannot be explained by my statements above…
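For illustration, a minimal sketch of the kind of mitigation this implies: store the filenames and labels in numpy arrays instead of Python lists, so the workers' reads do not touch per-object refcounts and trigger copy-on-write of those memory pages. The class and field names here are hypothetical, not from the notebook:

import numpy as np
from torch.utils.data import Dataset

class FilenameDataset(Dataset):
    "Hypothetical dataset storing filenames/labels as numpy arrays instead of Python lists."
    def __init__(self, filenames, labels):
        # A contiguous numpy buffer holds no per-item Python objects, so worker
        # processes reading it do not update refcounts and therefore do not
        # force copy-on-write of those pages.
        self.filenames = np.array(filenames)
        self.labels = np.array(labels, dtype=np.int64)
    def __len__(self):
        return len(self.filenames)
    def __getitem__(self, i):
        return str(self.filenames[i]), int(self.labels[i])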

And the problem we had on quickdraw only appears when everything combined doesn’t fit in RAM. As long as everything fits, i.e. the amount of RAM per process × num processes is available, the problem doesn’t appear, which is why it of course never pops up with the small datasets used in the lessons, and for most people this edge case will probably not matter either.

@balnazzar I am working on the whale competition where I have only ~75,000 images. I got the error:
RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault.

It is a local machine, and it did not fill more than 15% of the 64GB of RAM:
Ubuntu 16
64GB RAM
CUDA 10
PyTorch 1 stable

The error happened right after fit_one_cycle for a resnet50 model (image size 448), num_workers=4.

It seems that if I leave the data augmentation transforms at the fastai defaults, I do not get the error, i.e. using:
.transform(get_transforms(do_flip=False), size=SZ, resize_method=ResizeMethod.SQUISH)

instead of

.transform(get_transforms(do_flip=False, max_zoom=1.5, max_lighting=0.5, max_warp=0.7), size=SZ, resize_method=ResizeMethod.SQUISH)

Did you try sticking with the default transforms and still get the error?

And I think the error that you and I are getting is different from the memory leak in the case of @marcmuc and @devforfu, where the RAM is filled before reaching the end of the epoch.

I noticed a similar error in the Quick Draw competition, where I could not change any of the default transform parameters. It does not even seem related to the number of images in the dataset; this whale competition has only a few tens of thousands of images.

Edit: it seems there are certain limits for the transform arguments that cannot be exceeded. For example, max_warp=0.6 will fail immediately after learn.fit_one_cycle(8) in the pets notebook. However, if I set it to 0.5 (the default is 0.2), it will fail around the 2nd epoch. Keep in mind that the maximum value is only applied randomly (and rarely) to images, so if by chance it is applied somewhere in the 1st or a subsequent epoch, then it will fail.

I will run tests on which limits are acceptable in the lesson 1 pets notebook after resnet50 fit_one_cycle.
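A sketch of what such a test could look like, assuming the variables from the lesson 1 pets notebook (path_img, fnames, pat) and the create_cnn helper used there:

from fastai.vision import *

# Sweep max_warp and see which values survive a short fit without a
# DataLoader worker crash (values and epoch count are arbitrary choices).
for warp in [0.2, 0.4, 0.5, 0.6, 0.7]:
    data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                       ds_tfms=get_transforms(max_warp=warp),
                                       size=224, bs=64).normalize(imagenet_stats)
    learn = create_cnn(data, models.resnet50, metrics=error_rate)
    print(f"max_warp={warp}")
    learn.fit_one_cycle(1)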

Reproduced the error both on GCP and on a local machine.


Trace of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     19     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     20                                         pct_start=pct_start, **kwargs))
---> 21     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     22 
     23 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     25         loss = cb_handler.on_backward_begin(loss)
     26         loss.backward()
---> 27         cb_handler.on_backward_end()
     28         opt.step()
     29         cb_handler.on_step_end()

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in on_backward_end(self)
    229     def on_backward_end(self)->None:
    230         "Handle end of gradient calculation."
--> 231         self('backward_end', False)
    232     def on_step_end(self)->None:
    233         "Handle end of optimization step."

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    186         "Call through to all of the `CallbakHandler` functions."
    187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 
    190     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in <listcomp>(.0)
    186         "Call through to all of the `CallbakHandler` functions."
    187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 
    190     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/train.py in on_backward_end(self, **kwargs)
     75     def on_backward_end(self, **kwargs):
     76         "Clip the gradient before the optimizer step."
---> 77         if self.clip: nn.utils.clip_grad_norm_(self.learn.model.parameters(), self.clip)
     78 
     79 def clip_grad(learn:Learner, clip:float=0.1)->Learner:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/utils/clip_grad.py in clip_grad_norm_(parameters, max_norm, norm_type)
     30         total_norm = 0
     31         for p in parameters:
---> 32             param_norm = p.grad.data.norm(norm_type)
     33             total_norm += param_norm.item() ** norm_type
     34         total_norm = total_norm ** (1. / norm_type)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/tensor.py in norm(self, p, dim, keepdim)
    250     def norm(self, p="fro", dim=None, keepdim=False):
    251         r"""See :func: `torch.norm`"""
--> 252         return torch.norm(self, p, dim, keepdim)
    253 
    254     def btrifact(self, info=None, pivot=True):

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/functional.py in norm(input, p, dim, keepdim, out)
    716             return torch._C._VariableFunctions.frobenius_norm(input)
    717         elif p != "nuc":
--> 718             return torch._C._VariableFunctions.norm(input, p)
    719 
    720     if p == "fro":

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/utils/data/dataloader.py in handler(signum, frame)
    272         # This following call uses `waitid` with WNOHANG from C side. Therefore,
    273         # Python can still get and update the process status successfully.
--> 274         _error_if_any_worker_fails()
    275         if previous_handler is not None:
    276             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault. 

Yes. I suspected the transforms were part of the problem and tried to stick with the defaults. The error still persisted. But if I do NOT transform anything at all, the error does not pop up. Note that without the transformations the total amount of data going around is much smaller.
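To be concrete, by “not transform anything at all” I mean something like this (a sketch; src and SZ are placeholders for the labelled item lists and image size used above):

from fastai.vision import *

# No augmentation at all: skip get_transforms() entirely and only resize.
data_noaug = (src.transform(None, size=SZ, resize_method=ResizeMethod.SQUISH)
                 .databunch()
                 .normalize(imagenet_stats))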

Indeed. The memory of the DGX is hard to fill during a regular DL project.

However, it seems that lighting is involved too, apart from the amount of max_warp.

Your findings are quite interesting, so try tagging Jeremy. Since he’s participated in this thread, we should not be at risk of being quartered and beheaded.


Today I was working on the whale competition and it failed even with only max_warp=0.4 (after a few epochs), so I had to decrease it to 0.3. Going back to the fastai defaults did not give me errors.
Which means even the table in my previous post is not consistent across all datasets. So perhaps the defaults can also cause this error in some cases (default max_warp = 0.2)?

Did you get the same error as mine?

RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault. 

If I remember well, in the quickdraw competition even a slight change in max_zoom would trigger an error at some point in some epoch. The larger the argument value, the sooner the error pops up.

I bet @jeremy knows about this issue :slight_smile:. Perhaps we should wait a bit until other, more important things in fastai v1 are settled.


No. Mine was killed by the bus error signal :face_with_raised_eyebrow:

I was curious to know whether they (the fastai developers) get that kind of error too…

I have had segmentation faults before but can’t remember what solved them :frowning_face:. But this is from bookmarks I saved back then:

The last one definitely helped in some non-pytorch cases; it’s about increasing the stack size that the system provides for (Python) processes.

Maybe it helps.
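If it is useful, a minimal sketch (assuming Linux) of raising the stack limit from Python, the programmatic equivalent of ulimit -s:

import resource

# Raise the soft stack limit up to the hard limit for this process (and its
# future child processes); raising the hard limit itself usually needs root.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
print(f"stack soft limit raised from {soft} to {hard}")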


Hm, these are interesting observations, guys. I had some issues with transformations also, though not as critical as yours. I am trying to train a model on a face landmarks detection dataset, and in my case the data block API raises some warnings about data inconsistency.

I am going to continue experiments on various datasets using fastai and plain torch. It is probably worth trying some other ways to retrieve the data, as @marcmuc mentioned previously in one of his links about Redis caching.


Could you provide examples of such inconsistencies? Thanks!

Sure, will do! I am trying to re-write that code with the most recent version of the library, using a slightly modified dataset, to check if the issue still exists.

I can’t remember the exact error message (though I will try to replicate it later), but it was something related to the size of the target tensor. In my case, each target is a 2D array of (y, x) face landmark coordinates, so there are 42 (21 × 2) elements in that array. After the transformations were applied, I got a warning that my observations could not be gathered into a batch because their shapes were different, e.g. 19×2, 18×2, 20×2, etc. So it was as if some of the landmarks were “lost” during the transformation process.
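A minimal sketch of the sanity check I have in mind, assuming a fastai v1 DataBunch named data whose targets are ImagePoints:

import torch

# Walk the transformed training set and report any target whose landmark
# array no longer has the full 21x2 shape (i.e. points were dropped).
expected = torch.Size([21, 2])
for i in range(len(data.train_ds)):
    _, y = data.train_ds[i]
    if y.data.shape != expected:
        print(f"item {i}: shape {tuple(y.data.shape)} instead of {tuple(expected)}")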


For those who are getting:
RuntimeError: DataLoader worker (pid 173) is killed by signal: Bus error.

Maybe increasing the shared memory of the system will solve the issue. More details here:
https://www.kaggle.com/product-feedback/72606
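On a Linux box you can quickly check how much shared memory is available before touching any settings; a small /dev/shm (e.g. Docker’s 64 MB default) is a common cause of the Bus error. A minimal check:

import shutil

# DataLoader workers pass batches through shared memory (/dev/shm), so a tiny
# allocation there can kill workers with "signal: Bus error".
total, used, free = shutil.disk_usage('/dev/shm')
print(f"/dev/shm total: {total / 2**30:.2f} GiB, free: {free / 2**30:.2f} GiB")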

I am working on Kaggle and I uploaded a dataset of my own images.
But I get the error
DataLoader worker (pid 54) is killed by signal: Bus error.
when I try to inspect my data with
data.show_batch(rows=3, figsize=(7,6))

The thread on Kaggle.com (see above) has no solution whatsoever :frowning:

I mentioned the link here because this forum thread has been discussing such memory issues on local and remote servers for several months. Maybe increasing the shared memory on local or remote servers solves such errors. I know that the Kaggle kernels issue has to be fixed by the Kaggle team, and I didn’t say it would solve this issue for Kaggle kernels. Hopefully they will do it soon, as they promised.

By the way, here is a reference to the issue I had. I was thinking that it was somehow related to data augmentation, i.e. the landmarks falling outside of the image after various transformations. However, I then got another error:

UserWarning: 
There seems to be something wrong with your dataset, can't access self.train_ds[i] for all i in 
[65237, 47545, 8078, 53990, ..., 758]

I am going to try to reproduce this issue on some small/dummy dataset to see if it still exists in the library.


You know, I thought about the transformations too. The ideal setup would be a very small dataset where you can visualize exactly all the transforms performed on every image.

But that’s about vision. I had plenty of killed-by-signal errors back when I was working on text…

Thanks, though: your commitment to finding a solution is commendable.


RuntimeError: DataLoader worker (pid 81) is killed by signal: Bus error.
I am getting this error while running a kernel on Kaggle. Any solution, please?

Hi David,

Please see here. I think PyTorch 1.0.1 fixed this problem.

Yijin

I think Kaggle still doesn’t have a high enough shared memory limit for their Docker containers.

Some options:

  1. reduce your batch size, say to bs=16 maybe, instead of the default 64.
  2. reduce the number of workers. This will slow down your training.
  3. train on Colab instead of Kaggle. Colab fixed this issue in fall 2018.

I would favor option #1 or #3.
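For reference, a sketch of options 1 and 2 combined, assuming a fastai v1 data block pipeline (path and the dataset layout are placeholders):

from fastai.vision import *

# Smaller batches (option 1) and fewer workers (option 2) both reduce the
# shared-memory pressure inside Kaggle's Docker container.
data = (ImageList.from_folder(path)
        .split_by_rand_pct(0.2)
        .label_from_folder()
        .transform(get_transforms(), size=224)
        .databunch(bs=16, num_workers=2)
        .normalize(imagenet_stats))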


I ran into a memory problem yesterday (with 24 hours to go in a competition, of course!) and I’m sharing it here in case it’s a useful clue. The error messages were mostly about pin memory, and seemed to be related to a few lines in one_batch(), which is in basic_data.py and is called during the normalize() step of setting up a DataBunch. The lines are:

w = dl.num_workers
dl.num_workers = 0
try:     x,y = next(iter(dl))
finally: dl.num_workers = w   # <== this is where it crashed
if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)
...

In addition to resetting num_workers (and spawning/re-spawning worker processes?), it looks like there may also be movement onto/off the GPU at that point in the code.

Commenting out the lines that save and restore num_workers fixed the memory problem, but, not surprisingly, I got a new error about not being able to re-initialize CUDA in a forked process.

UPDATE: I totally solved the problem by fixing the code in a custom callback that I was writing. I think CUDA is just very restrictive about what it will accept; in my case the memory issues were caused by a combination of me not being careful enough about what was on/off the GPU, not being careful enough about tensor copy semantics, and using PyTorch functions that worked on the CPU but not on the GPU.
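For anyone hitting something similar, a minimal sketch of the kind of fix I mean, with a hypothetical callback that detaches tensors and moves them to the CPU before storing them:

from fastai.callback import Callback

class SafeLossHistory(Callback):
    "Hypothetical callback: store plain CPU floats, not live GPU tensors."
    def on_train_begin(self, **kwargs):
        self.losses = []
    def on_batch_end(self, last_loss, **kwargs):
        # Detach from the autograd graph and copy off the GPU so no graph or
        # device memory is kept alive across batches.
        self.losses.append(last_loss.detach().cpu().item())

# usage: learn.fit_one_cycle(1, callbacks=[SafeLossHistory()])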

John

I was running into this same issue on my Mac. It ran fine with num_workers=0, but not with more than zero unless I set the image size to 80 pixels or lower.

I made a few changes and now things are working:

Unfortunately, I did all of these things at once, so it’s unclear which of them actually solved the issue for me. If I have some time later, I may try to pinpoint it, but I’m in a bit of a rush right now.

I’m now able to train with num_workers=16 and 500-pixel images without issue. I hope this helps someone.


I was also having this issue, with errors about workers being unexpectedly killed and segmentation faults. I only increased my shared memory settings as described in the link (sharing again), and it solved the issue for me.