RuntimeError: DataLoader worker (pid 137) is killed by signal: Bus error

salil_23 · October 21, 2018, 12:46pm

When I am running the course-v3 notebook for lesson1-pets on Google Colab, it shows the above-mentioned error on line learn.fit_one_cycle(5) for both resnet34 and resnet50.

How can I resolve it?

sgugger · October 21, 2018, 1:24pm

This is an issue from PyTorch, so you should ask about it on their forum. From what I understand, it has to do with shared memory when multiprocessing, and setting num_workers to 0 will fix the problem (but make your training slower).

salil_23 · October 21, 2018, 1:45pm

Where exactly do I have to put num_workers=0?

sgugger · October 21, 2018, 6:12pm

Inside any function that creates a DataBunch. See the docs for more information.

matdmiller · October 21, 2018, 8:07pm

I run fastai in nvidia docker on my local rig. I was trying to get fast.ai v1 loaded and ready to go for tomorrow, but it wasn’t working and I was also getting this error. I was able to fix this by adding --ipc=host to the docker run command when starting my docker container. There was also a docker option for increasing the shared memory size, but I didn’t really understand what values I should increase it to, so I opted for the ipc flag instead. I know this doesn’t help you w/ Colab but I suspect this problem will come up again for others.

I did not have this issue when running fastai v0.7 in docker and I suspect it has something to do with the software changes in v1. I did upgrade my version of my Nvidia drivers to 396 and Docker-CE to the latest version immediately prior to building my fastai v1 container so it could have been related to those upgrades as well although I suspect it wasn’t.

sgugger · October 21, 2018, 8:44pm

It is an issue that is often mentioned in the pytorch repo with docker installs. fastai v0.7 didn’t use the pytorch dataloader, which is the source of this bug. This is probably why it appears now.

jeffhale · October 22, 2018, 2:02am

Was able to fix similar issue on Colab by setting num_workers=0 here:

data = ImageDataBunch.from_folder(
    path, 
    ds_tfms=get_transforms(), 
    tfms=imagenet_norm, 
    size=256,
    num_workers=0
)
img,label = data.valid_ds[-1]
img.show(title=data.classes[label])

vijaysai · October 22, 2018, 4:45am

@jeffhale I was reading your other post on setting up Google Colab. I was doing the same and got stuck with the same issue. I tried setting num_workers=0, but the training is taking forever. How long did training take to complete for resnet34 on the pets problem?

matdmiller · October 22, 2018, 5:26am

I think Colab would be pretty limiting from a performance perspective. With the bug described on this page, the fix of setting num_workers=0 for your image loading will slow things down considerably. In addition, Colab uses K80 GPUs (same as the AWS P2 instance), which are quite slow compared to a 1080ti. https://medium.com/initialized-capital/benchmarking-tensorflow-performance-and-cost-across-different-gpu-options-69bd85fe5d58

I tried setting num_workers=0 on my machine and training (on resnet50) went from 48s per epoch to 132s per epoch. I am using a 1080ti, which is ~4x faster than a K80, and I have NVMe storage.

jeffhale · October 22, 2018, 4:47pm

with size = 224

learn = ConvLearner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(1)

Took 12 minutes. Not blazing fast. But a huge improvement over a CPU and free.

epoch  train loss  valid loss  accuracy
1      0.045373    0.029649    0.989500
CPU times: user 10min 21s, sys: 1min 45s, total: 12min 7s
Wall time: 12min 8s

Then it took 14 more minutes to run:

learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)

Not so fast.

epoch  train loss  valid loss  accuracy
1      0.026436    0.016208    0.993500
CPU times: user 12min 18s, sys: 1min 46s, total: 14min 5s
Wall time: 14min 9s

accuracy(*learn.TTA())
CPU times: user 4min 44s, sys: 8.48 s, total: 4min 53s
Wall time: 4min 53s

tensor(0.9965)

So 30 minutes all together to run very few epochs.

The same test on Paperspace’s basic P4000 GPU setup took a little over 8 minutes.
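As a quick check of the "30 minutes" figure, the wall times reported above can be added up. This is only arithmetic on the numbers in this post; the 8-minute Paperspace value is an approximation of "a little over 8 minutes", so the ratio is a rough estimate.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xmin Ys' wall time to seconds."""
    return minutes * 60 + seconds

# Wall times copied from the Colab runs reported above.
wall_times = {
    "fit_one_cycle (frozen)": to_seconds(12, 8),
    "fit_one_cycle (unfrozen)": to_seconds(14, 9),
    "TTA": to_seconds(4, 53),
}

total = sum(wall_times.values())
print(f"Colab total: {total // 60} min {total % 60} s")  # Colab total: 31 min 10 s

# Assumption: "a little over 8 minutes" is treated as exactly 8 minutes,
# so the real ratio is slightly lower than this.
paperspace_approx = to_seconds(8, 0)
print(f"Colab / Paperspace: about {total / paperspace_approx:.1f}x")
```

So the three steps come to just over 31 minutes on Colab, a bit under 4x the Paperspace time.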

gurvinder · October 24, 2018, 8:15pm

Yeah, it is a shared memory issue. By default, it seems /dev/shm is not created when running in a container, and PyTorch needs it for the workers to communicate. So I created one with a size of 256MB and it worked fine.
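A minimal sketch for checking this from inside the container or notebook, using only the Python standard library. The helper names and the 256 MB threshold are illustrative assumptions taken from the size mentioned above; they are not a PyTorch requirement or part of any library.

```python
import os
import shutil

def shm_free_bytes(shm_path="/dev/shm"):
    """Return the free bytes on the shared-memory mount, or None if it is missing."""
    if not os.path.isdir(shm_path):
        return None
    return shutil.disk_usage(shm_path).free

def pick_num_workers(wanted, shm_path="/dev/shm", min_free_bytes=256 * 1024**2):
    """Fall back to 0 workers when shared memory is missing or looks too small.

    min_free_bytes is an arbitrary illustrative threshold (256 MB), not a
    documented limit.
    """
    free = shm_free_bytes(shm_path)
    if free is None or free < min_free_bytes:
        return 0
    return wanted

if __name__ == "__main__":
    print("free bytes in /dev/shm:", shm_free_bytes())
    print("num_workers to use:", pick_num_workers(wanted=4))
```

The returned value could then be passed as num_workers to whatever function creates the DataBunch, so the slower num_workers=0 path is only used when shared memory really is constrained.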

matdmiller · October 24, 2018, 9:24pm

Awesome, thanks! So to confirm, is this a docker volume mounted to that location inside the container, or the --shm-size=256m flag?

gurvinder · October 25, 2018, 5:07am

--shm-size=256m should be OK if you are running directly from docker. But if you are using Kubernetes, then you need to do this https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L41 and then mount it as https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L171
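For the plain docker case, the two options mentioned in this thread (--shm-size and --ipc=host) can be sketched as a small command builder. The helper and the image name "my-fastai-image" are hypothetical placeholders; only the two docker flags come from the posts above.

```python
import shlex

def docker_run_args(image, shm_size=None, ipc_host=False):
    """Build an argv list for `docker run` with the shared-memory options set.

    image is a placeholder name; shm_size is a string such as "256m".
    """
    args = ["docker", "run"]
    if ipc_host:
        # Share the host's IPC namespace (and its shared memory) with the container.
        args.append("--ipc=host")
    if shm_size is not None:
        # Give the container its own /dev/shm of the requested size.
        args.append(f"--shm-size={shm_size}")
    args.append(image)
    return args

# Print the commands rather than running them.
print(" ".join(shlex.quote(a) for a in docker_run_args("my-fastai-image", shm_size="256m")))
print(" ".join(shlex.quote(a) for a in docker_run_args("my-fastai-image", ipc_host=True)))
```

Normally you would pick one of the two options rather than both, and add whatever other flags your setup already uses.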

vibhorsood · October 26, 2018, 5:07am

@matdmiller

Where and how do I add --shm-size=256m in Colab or Clouderizer?

matdmiller · October 26, 2018, 12:35pm

I don’t know. I run docker/portainer locally. I’m pretty sure you can’t do this with colab, never heard of the other services.

utkb · November 10, 2018, 3:00am

256m did not seem enough for me when I ran the Docker container on my own machine. I changed it to --shm-size=1024m, and it works fine now. Thanks for pointing out this solution!

musedivision · December 16, 2018, 10:30am

This fix worked for me as well.
I’m running in docker on a p2 instance and added --shm-size 50G.
But I had also forgotten to specify the runtime --runtime=nvidia

jbencook · February 19, 2019, 9:58pm

Thanks for the --ipc=host tip - worked on my machine too.

dwcar49us · February 19, 2019, 10:41pm

@sgugger
I am having an issue today with fast.ai v1, DL course v3 Pt 1, lesson 3, IMDB. I set up a new account on a VAST.ai server using Linux Ubuntu 16.04 with an Nvidia RTX 2080Ti GPU, 52 GB memory & 12 cores. I used their Pytorch 1.0 image with CUDA 10.0, which is installed from a Docker image. I then updated conda and installed fastai as suggested in the Crestle setup docs.

conda update conda
conda install -c fastai fastai

I ran the NB 1 [lesson1-pets.ipynb] on pets from scratch successfully with no issues.

However when running lesson3-imdb.ipynb I get an error which appears to be due to changes in the fast.ai library. I am using fastai ver 1.0.45

at cell 10:

data = TextDataBunch.load(path)

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-10-4ff358ceed81> in <module>
----> 1 data = TextDataBunch.load(path)

/opt/conda/lib/python3.7/site-packages/fastai/text/data.py in load(cls, path, cache_name, processor, **kwargs)
    167                 Use `load_data` for data saved with v1.0.44 or later.""", DeprecationWarning) 
    168         cache_path = Path(path)/cache_name
--> 169         vocab = Vocab(pickle.load(open(cache_path/'itos.pkl','rb')))
    170         train_ids,train_lbls = np.load(cache_path/f'train_ids.npy'), np.load(cache_path/f'train_lbl.npy')
    171         valid_ids,valid_lbls = np.load(cache_path/f'valid_ids.npy'), np.load(cache_path/f'valid_lbl.npy')

FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/imdb_sample/tmp/itos.pkl'

There is no file data/imdb_sample/tmp/itos.pkl on my system. The IMDB files are there.

It appears that itos.pkl is a Vocab file, which would not have been generated yet, or downloaded if it is for the wikitext103 model.

Please advise if this is my error or an issue.

Thanks

sgugger · February 20, 2019, 12:45am

There was a breaking change in v1.0.45 and everyone forgot to update the IMDB notebook; I’ve just done it. Data should be loaded with load_data now (and it works in all the applications).
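To tell whether a FileNotFoundError like the one above comes from missing legacy cache files, a small standard-library check can list which of the files named in the traceback are absent. The helper name is hypothetical, and the default cache_name="tmp" is inferred from the path in the error message rather than taken from the fastai docs.

```python
from pathlib import Path

# File names taken from the traceback of the old TextDataBunch.load.
LEGACY_CACHE_FILES = [
    "itos.pkl",
    "train_ids.npy",
    "train_lbl.npy",
    "valid_ids.npy",
    "valid_lbl.npy",
]

def missing_legacy_cache_files(path, cache_name="tmp"):
    """Return the legacy cache files that do not exist under path/cache_name."""
    cache_path = Path(path) / cache_name
    return [name for name in LEGACY_CACHE_FILES if not (cache_path / name).exists()]

if __name__ == "__main__":
    # Path copied from the error message above; adjust for your own setup.
    print(missing_legacy_cache_files("/root/.fastai/data/imdb_sample"))
```

If every file is reported missing, the data was never saved in the old format, so the old loader has nothing to read and the data needs to be loaded with load_data as described above.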