RuntimeError: DataLoader worker (pid 137) is killed by signal: Bus error


Hi,

@sgugger

I just did a clean/fresh machine load and a new git clone of the course-v3 master branch, and it still shows the same errors when running Lesson3-IMDB. The fixes do not seem to resolve all the issues.

I have had to revert to 1.0.42 and the matching notebook. That version works fine in my environment.

David

You’re right, I forgot to change one cell. It should work now.

I encounter this error when doing TTA:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/srv/conda/envs/notebook/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 16) is killed by signal: Bus error.

The code is for a competition and is running in a Docker container on their side. Any ideas? Maybe @sgugger or @gurvinder? For the heroes wanting to help: I am not familiar at all with how Docker containers work, so I would need a very plain explanation… :wink:

That error is a generic error from PyTorch saying… something failed.
So no one can help you without:

  1. seeing the code you ran
  2. seeing the error message you get when you add num_workers=0 in your call to DataBunch (to deactivate multiprocessing, which is what is throwing the cryptic error right now); see the sketch below
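
A minimal sketch of what I mean in point 2, assuming an image problem built with the fastai v1 factory methods (the path and folder layout here are placeholders, not taken from your code):

from fastai.vision import *

# num_workers=0 keeps data loading in the main process, so the real
# exception surfaces instead of a worker being killed by a Bus error
data = ImageDataBunch.from_folder(Path('data/my_dataset'), num_workers=0)

The same num_workers=0 argument can be passed to whichever DataBunch factory method you are actually using.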

Hi. Not sure if you scrolled through this thread or not (it’s only 25 posts! :) )

If it’s in a Docker container, it might be something to do with the shm size setting / lack of shm space. See this post above you, and give that a try?

If that does not solve your problem, then it’s as Sylvain said – you’ll need to post more details / code here for anyone to help…

Good luck!

Yijin

Thanks a lot, @utkb and @sgugger! See my code as of now:

learn = load_learner(path, 'export2.pkl')
learn.data.add_test(good_images)
preds, _ = learn.TTA(ds_type=DatasetType.Test)

My question is: where should I add num_workers=0 (this disables the worker processes and loads data in the main process, right?). I checked the TTA function but I do not see it as an argument.

Hi. Try it with load_learner - see the docs here and the sketch below. Also, did you try what I mentioned about shm? Did that do anything? Thanks.
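
The sketch I mean, assuming your fastai v1 version forwards extra keyword arguments from load_learner to the DataBunch it builds (good_images is the test set from your earlier snippet):

learn = load_learner(path, 'export2.pkl', test=good_images, num_workers=0)
preds, _ = learn.TTA(ds_type=DatasetType.Test)

With num_workers=0 the test DataLoader runs in the main process, so if something else is failing you should see the real traceback rather than the Bus error.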

Hi, @utkb. First of all, thanks for your help. Today, I tried it again with the following line modified:

learn = load_learner(path, 'export2.pkl', num_workers=0)

but I still get the same error. What is num_workers doing? As for the shm setting, I do not really understand where I should write it. I am not creating the container myself; the image is built on their end (I just submit the specifications of my conda environment), but if you could clarify it I will of course try it out.

Hi,

For the num_workers stuff you will need to ask on the PyTorch side, as I am not familiar with what exactly is happening there. As for the shm stuff, it was just a solution to problems people faced when running fastai from within a Docker container. Quite a few people managed to get things working by adding the argument --shm-size=256m when running the Docker container. I myself had to (and have been) using --shm-size=1024m, because 256m somehow was not enough for my setup. The overall command for running the Docker container was thus something like (add your own further arguments):

nvidia-docker run --name ContainerName --shm-size=1024m ImageName

Still might not solve your problem though. It has already been well observed that there are difficulties in getting PyTorch + fastai running properly and robustly in Docker containers…

Thanks.

Yijin
