RuntimeError: DataLoader worker (pid 137) is killed by signal: Bus error


Hi,

@sgugger

I just did a clean/fresh machine load and a new git clone of the course-v3 master branch, and it still shows the same errors when running Lesson3-IMDB. The fixes do not seem to resolve all the issues.

I have had to revert to 1.0.42 and the matching notebook. That version works fine in my environment.

David

You’re right, I forgot to change one cell. It should work now.

I encounter this error when doing TTA:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/srv/conda/envs/notebook/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 16) is killed by signal: Bus error.

The code is for a competition and is running in a Docker container on their side. Any ideas? Maybe @sgugger or @gurvinder? For the heroes wanting to help: I am not familiar at all with how Docker containers work, so I would need a very plain explanation… :wink:

That error is a generic error from PyTorch saying… something failed.
So no one can help you without:

  1. seeing the code you ran
  2. seeing the error message you get when you add num_workers=0 in your call to DataBunch (to deactivate multiprocessing, which is what is throwing the cryptic error right now); see the sketch below
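
A minimal sketch of what I mean in point 2, assuming an image problem built with the fastai v1 factory methods (the path and folder layout here are placeholders, not taken from your code):

from fastai.vision import *

# num_workers=0 keeps data loading in the main process, so the real
# exception surfaces instead of a worker being killed by a Bus error
data = ImageDataBunch.from_folder(Path('data/my_dataset'), num_workers=0)

The same num_workers=0 argument can be passed to whichever DataBunch factory method you are actually using.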

Hi. Not sure if you scrolled through this thread or not (it’s only 25 posts! :) )

If it’s in a Docker container, it might be something to do with the shm size setting / lack of shm space. See this post above you, and give that a try?

If that does not solve your problem, then it’s as Sylvain said – you’ll need to post more details / code here for anyone to help…

Good luck!

Yijin

Thanks a lot, @utkb and @sgugger! See my code as of now:

learn = load_learner(path, 'export2.pkl')
learn.data.add_test(good_images)
preds, _ = learn.TTA(ds_type=DatasetType.Test)

My question is: where should I add num_workers=0 (this disables the worker processes and loads data in the main process, right?). I checked the TTA function but I do not see it as an argument.

Hi. Try it with load_learner - see the docs here and the sketch below. Also, did you try what I mentioned about shm? Did that do anything? Thanks.
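
The sketch I mean, assuming your fastai v1 version forwards extra keyword arguments from load_learner to the DataBunch it builds (good_images is the test set from your earlier snippet):

learn = load_learner(path, 'export2.pkl', test=good_images, num_workers=0)
preds, _ = learn.TTA(ds_type=DatasetType.Test)

With num_workers=0 the test DataLoader runs in the main process, so if something else is failing you should see the real traceback rather than the Bus error.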

Hi, @utkb. First of all, thanks for your help. Today, I tried it again with the following line modified:

learn = load_learner(path, 'export2.pkl', num_workers=0)

but I still get the same error. What is num_workers doing? As for the shm setting, I do not really understand where I should write it. I am not creating the container myself; the image is built on their end (I just submit the specifications of my conda environment), but if you could clarify it I will of course try it out.

Hi,

For the num_workers stuff you will need to ask on the PyTorch side, as I am not familiar with what exactly is happening there. As for the shm stuff, it was just a solution to problems people faced when running fastai from within a Docker container. Quite a few people managed to get things working by adding the argument --shm-size=256m when running the Docker container. I myself had to (and have been) using --shm-size=1024m, because 256m somehow was not enough for my setup. The overall command for running the Docker container was thus something like (add your own further arguments):

nvidia-docker run --name ContainerName --shm-size=1024m ImageName

Still might not solve your problem though. It has already been well observed that there are difficulties in getting PyTorch + fastai running properly and robustly in Docker containers…

Thanks.

Yijin
