RuntimeError: DataLoader worker (pid 137) is killed by signal: Bus error


(Salil Mishra) #1

When I run the course-v3 notebook for lesson1-pets on Google Colab, the call learn.fit_one_cycle(5) fails with the above-mentioned error for both resnet34 and resnet50.

How can I resolve it?


#2

This is an issue in PyTorch, so you should ask about it on their forum. From what I understand, it has to do with shared memory during multiprocessing, and setting num_workers to 0 will fix the problem (but make your training slower).


(Salil Mishra) #3

Where exactly do I have to put num_workers=0?


#4

Inside any function that creates a DataBunch. See the docs for more information.


(Mat M) #5

I run fastai in nvidia-docker on my local rig. I was trying to get fastai v1 loaded and ready to go for tomorrow, but it wasn’t working and I was getting this error too. I was able to fix it by adding --ipc=host to the docker run command when starting my Docker container. There was also a docker option for increasing the shared memory size, but I didn’t really understand what value I should increase it to, so I opted for the ipc flag instead. I know this doesn’t help you with Colab, but I suspect this problem will come up again for others.

I did not have this issue when running fastai v0.7 in docker and I suspect it has something to do with the software changes in v1. I did upgrade my version of my Nvidia drivers to 396 and Docker-CE to the latest version immediately prior to building my fastai v1 container so it could have been related to those upgrades as well although I suspect it wasn’t.


#6

It is an issue that is often mentioned in the pytorch repo with docker installs. fastai v0.7 didn’t use the pytorch dataloader, which is the source of this bug. This is probably why it appears now.


(Jeff Hale) #7

Was able to fix similar issue on Colab by setting num_workers=0 here:

data = ImageDataBunch.from_folder(
    path, 
    ds_tfms=get_transforms(), 
    tfms=imagenet_norm, 
    size=256,
    num_workers=0
)
img,label = data.valid_ds[-1]
img.show(title=data.classes[label])

(vijaysai) #8

@jeffhale I was reading your other post on setting up Google Colab. I was doing the same and got stuck with the same issue. I tried setting num_workers=0, but the training is taking forever. How long did training take to complete for resnet34 on the pets problem?


(Mat M) #9

I think Colab would be pretty limiting from a performance perspective. With the bug described on this page and the fix of setting num_workers=0 for your image loading, it will slow things down considerably. In addition, Colab uses K80 GPUs (same as the AWS P2 instance), which are quite slow compared to a 1080ti. https://medium.com/initialized-capital/benchmarking-tensorflow-performance-and-cost-across-different-gpu-options-69bd85fe5d58

I tried setting num_workers=0 on my machine and training (on resnet50) went from 48s per epoch to 132s per epoch. I am using a 1080ti, which is ~4x faster than a K80, and I have NVMe storage.


(Jeff Hale) #11

with size = 224

learn = ConvLearner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(1)

Took 12 minutes. Not blazing fast. But a huge improvement over a CPU and free. :grinning:

epoch  train loss  valid loss  accuracy
1      0.045373    0.029649    0.989500
CPU times: user 10min 21s, sys: 1min 45s, total: 12min 7s
Wall time: 12min 8s

Then it took 14 more minutes to run:

learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)

Not so fast.

epoch  train loss  valid loss  accuracy
1      0.026436    0.016208    0.993500
CPU times: user 12min 18s, sys: 1min 46s, total: 14min 5s
Wall time: 14min 9s

accuracy(*learn.TTA())
CPU times: user 4min 44s, sys: 8.48 s, total: 4min 53s
Wall time: 4min 53s

tensor(0.9965)

So about 30 minutes altogether to run very few epochs.

The same test on Paperspace’s basic P4000 GPU setup took a little over 8 minutes.
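
For reference, the Colab wall times quoted above can be added up with a few lines of standard-library Python. This is only a sketch of the arithmetic; the dictionary labels are descriptive names made up here, not anything from fastai.

```python
from datetime import timedelta

# Wall times reported above for the Colab run (labels are illustrative only).
wall_times = {
    "fit_one_cycle(1), frozen": timedelta(minutes=12, seconds=8),
    "fit_one_cycle(1), unfrozen": timedelta(minutes=14, seconds=9),
    "TTA": timedelta(minutes=4, seconds=53),
}

# Sum the durations, starting from a zero timedelta.
total = sum(wall_times.values(), timedelta())
print(total)  # 0:31:10
```

That comes to 31 minutes 10 seconds, which matches the "about 30 minutes" estimate.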


#12

Yeah, it is a shared memory issue. By default, it seems /dev/shm is not created when running in a container, and PyTorch needs it for the workers to communicate. So I created one with a size of 256MB and it worked fine.
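
A quick way to see how much shared memory a container actually has is to inspect /dev/shm from Python. The sketch below uses only the standard library. The helper names and the 256 MB threshold are assumptions for illustration (the threshold simply echoes the size mentioned in this thread, and the right value may well be larger for your data), so treat the suggestion it prints as a rough hint rather than a rule.

```python
import os
import shutil

SHM_PATH = "/dev/shm"  # usual shared-memory mount on Linux


def shm_free_bytes(path=SHM_PATH):
    """Return the free bytes on the shared-memory mount, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def suggest_num_workers(free_bytes, wanted_workers, min_free_bytes=256 * 1024**2):
    """Hypothetical heuristic: fall back to 0 workers when shared memory looks too small."""
    if free_bytes is None or free_bytes < min_free_bytes:
        return 0
    return wanted_workers


free = shm_free_bytes()
print("free shared memory (bytes):", free)
print("suggested num_workers:", suggest_num_workers(free, wanted_workers=4))
```

If the suggestion comes back as 0, either pass num_workers=0 when creating the DataBunch or give the container more shared memory.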


(Mat M) #13

Awesome, thanks! So to confirm, is this a docker volume mounted to that location inside the container, or --shm-size=256m?


#14

--shm-size=256m should be OK if you are running directly from Docker. But if you are using Kubernetes, then you need to do this: https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L41 and then mount it as in https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L171
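
To make the two Docker options from this thread concrete, here is a small sketch that assembles the docker run command line in Python. The function name and the image name my-fastai-image are hypothetical. The flags themselves (--ipc=host, --shm-size, --runtime=nvidia) are the ones mentioned in this thread. The sketch treats --ipc=host and --shm-size as alternatives, which is an assumption about how you would normally use them.

```python
import shlex


def docker_run_command(image, shm_size=None, ipc_host=False, runtime=None):
    """Build a docker run argument list (hypothetical helper, for illustration)."""
    args = ["docker", "run"]
    if runtime:
        args.append(f"--runtime={runtime}")
    if ipc_host:
        # Share the host's IPC namespace, including its shared memory.
        args.append("--ipc=host")
    elif shm_size:
        # Otherwise give the container its own, larger /dev/shm.
        args.append(f"--shm-size={shm_size}")
    args.append(image)
    return args


# "my-fastai-image" is a placeholder image name.
cmd = docker_run_command("my-fastai-image", shm_size="256m", runtime="nvidia")
print(" ".join(shlex.quote(a) for a in cmd))
```

The returned list can be passed to subprocess.run as-is, or printed and pasted into a shell.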


(vibhor sood) #15

@matdmiller

Where and how do I add --shm-size=256m in Colab or Clouderizer?


(Mat M) #16

I don’t know. I run docker/portainer locally. I’m pretty sure you can’t do this with Colab, and I have never heard of the other service.


(Yijin) #17

256m did not seem enough for me when I ran the Docker container on my own machine. I changed it to --shm-size=1024m, and it works fine now. Thanks for pointing out this solution!


(Patrick Mccaffrey) #18

This fix worked for me as well.
I’m running in docker on a p2 instance and added --shm-size 50G.
But I had also forgotten to specify the runtime --runtime=nvidia :sweat_smile:


(Ben) #19

Thanks for the --ipc=host tip - worked on my machine too.


(David Carroll) #20

@sgugger
I am having an issue today with fastai v1, DL course v3 Pt 1, lesson 3, IMDB. I set up a new account on a VAST.ai server running Ubuntu 16.04 with an Nvidia RTX 2080Ti GPU, 52 GB memory and 12 cores. I used their PyTorch 1.0 image with CUDA 10.0, which is installed from a Docker image. I then updated conda and installed fastai as suggested in the Crestle setup docs.

conda update conda
conda install -c fastai fastai

I ran the NB 1 [lesson1-pets.ipynb] on pets from scratch successfully with no issues.

However when running lesson3-imdb.ipynb I get an error which appears to be due to changes in the fast.ai library. I am using fastai ver 1.0.45

at cell 10:

data = TextDataBunch.load(path)

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-10-4ff358ceed81> in <module>
----> 1 data = TextDataBunch.load(path)

/opt/conda/lib/python3.7/site-packages/fastai/text/data.py in load(cls, path, cache_name, processor, **kwargs)
    167                 Use `load_data` for data saved with v1.0.44 or later.""", DeprecationWarning) 
    168         cache_path = Path(path)/cache_name
--> 169         vocab = Vocab(pickle.load(open(cache_path/'itos.pkl','rb')))
    170         train_ids,train_lbls = np.load(cache_path/f'train_ids.npy'), np.load(cache_path/f'train_lbl.npy')
    171         valid_ids,valid_lbls = np.load(cache_path/f'valid_ids.npy'), np.load(cache_path/f'valid_lbl.npy')

FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/imdb_sample/tmp/itos.pkl'

There is no file data/imdb_sample/tmp/itos.pkl on my system. The IMDB files are there.

It appears that itos.pkl is a Vocab file, which would not have been generated yet, or downloaded if it is for the wikitext103 model.

Please advise if this is my error or an issue.

Thanks


#21

There was a breaking change in v1.0.45 and everyone forgot to update the IMDB notebook. I’ve just done it. Data should be loaded with load_data now (and it works in all the applications).