I run fastai in nvidia docker on my local rig. I was trying to get fast.ai v1 loaded and ready to go for tomorrow, it wasn’t working, and I was getting this same error. I was able to fix it by adding --ipc=host to the docker run command when starting my container. There is also a docker flag for increasing the shared-memory size, but I didn’t really understand what value I should increase it to, so I opted for the --ipc flag instead. I know this doesn’t help you with Colab, but I suspect this problem will come up again for others.
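For reference, a hedged sketch of the full run command; the image name is a placeholder, and the GPU runtime flag depends on which nvidia-docker version you have:

# --ipc=host shares the host's IPC namespace (and thus its /dev/shm) with the container
docker run --runtime=nvidia --ipc=host -it <your-fastai-image>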
I did not have this issue when running fastai v0.7 in docker, and I suspect it has something to do with the software changes in v1. I did upgrade my Nvidia drivers to 396 and Docker CE to the latest version immediately before building my fastai v1 container, so it could also be related to those upgrades, although I suspect it wasn’t.
It is an issue that is often mentioned in the pytorch repo in connection with docker installs. fastai v0.7 didn’t use the pytorch DataLoader, which is the source of this bug; that is probably why it only appears now.
@jeffhale I was reading your other post on setting up Google Colab. I was doing the same and got stuck with the same issue. I tried setting num_workers=0, but training takes forever. How long did training take to complete for resnet34 on the pets problem?
I tried setting num_workers=0 on my machine and training (on resnet50) went from 48s per epoch to 132s per epoch. I am using a 1080 Ti, which is ~4x faster than a K80, and I have NVMe storage.
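For anyone wondering where that setting goes, a minimal sketch against the fastai v1 pets data; path_img, fnames and pat are the lesson 1 variables, assumed here:

from fastai.vision import *

# num_workers=0 loads batches in the main process, so no worker
# subprocesses and no /dev/shm usage, but it is much slower
data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=224,
                                   num_workers=0)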
Yeah, it is a shared-memory issue. By default Docker gives the container only a tiny /dev/shm, which pytorch needs for its workers to communicate. So I created one with a size of 256MB and it worked fine.
256m did not seem to be enough for me when I ran the Docker container on my own machine. I changed it to --shm-size=1024m, and it works fine now. Thanks for pointing out this solution!
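In case anyone else needs it, the flag goes on the run command; a sketch, with the image name as a placeholder:

# Docker's default /dev/shm is only 64MB; raise it so DataLoader workers have room
docker run --shm-size=1024m -it <your-fastai-image>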
This fix worked for me as well.
I’m running in docker on a p2 instance and added --shm-size 50G.
But I had also forgotten to specify the runtime with --runtime=nvidia.
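So roughly, a sketch of what the corrected command looks like (the image name is a placeholder; size --shm-size to your instance’s RAM):

# --runtime=nvidia exposes the GPU; --shm-size gives the workers shared memory
docker run --runtime=nvidia --shm-size=50g -it <your-fastai-image>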
@sgugger
I am having an issue today with fast.ai v1, DL course v3 Part 1, lesson 3, IMDB. I set up a new account on a Vast.ai server running Ubuntu 16.04 with an Nvidia RTX 2080 Ti GPU, 52 GB of memory, and 12 cores. I used their PyTorch 1.0 image with CUDA 10.0, which is installed from a Docker image. I then updated conda and installed fastai as suggested in the Crestle setup docs.
conda update conda
conda install -c fastai fastai
I ran notebook 1 (lesson1-pets.ipynb) on pets from scratch successfully with no issues.
However, when running lesson3-imdb.ipynb I get an error which appears to be due to changes in the fast.ai library. I am using fastai version 1.0.45.
at cell 10:
data = TextDataBunch.load(path)
FileNotFoundError Traceback (most recent call last)
<ipython-input-10-4ff358ceed81> in <module>
----> 1 data = TextDataBunch.load(path)
/opt/conda/lib/python3.7/site-packages/fastai/text/data.py in load(cls, path, cache_name, processor, **kwargs)
167 Use `load_data` for data saved with v1.0.44 or later.""", DeprecationWarning)
168 cache_path = Path(path)/cache_name
--> 169 vocab = Vocab(pickle.load(open(cache_path/'itos.pkl','rb')))
170 train_ids,train_lbls = np.load(cache_path/f'train_ids.npy'), np.load(cache_path/f'train_lbl.npy')
171 valid_ids,valid_lbls = np.load(cache_path/f'valid_ids.npy'), np.load(cache_path/f'valid_lbl.npy')
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/imdb_sample/tmp/itos.pkl'
There is no file data/imdb_sample/tmp/itos.pkl on my system. The IMDB files are there.
It appears that itos.pkl is a Vocab file, which would not have been generated yet, or would have been downloaded if it were for the wikitext103 model.
There was a breaking change in v1.0.45 and everyone forgot to update the IMDB notebook; I’ve just done it. Data should be loaded with load_data now (and it works in all the applications).
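For anyone updating their own code, the new pattern looks roughly like this; a sketch assuming the v1.0.45 API and the imdb_sample CSV, with the default save filename:

from fastai.text import *

# build the DataBunch once and serialize it (writes data_save.pkl by default)
data = TextDataBunch.from_csv(path, 'texts.csv')
data.save()

# on later runs, reload with load_data instead of the deprecated TextDataBunch.load
data = load_data(path)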
I just did a clean/fresh machine load and a new git clone of the course-v3 master branch, and it still shows the same errors when running lesson3-imdb. The fixes do not seem to resolve all the issues.
I have had to revert to 1.0.42 and the matching notebook. That version works fine in my environment.
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/srv/conda/envs/notebook/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/srv/conda/envs/notebook/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 16) is killed by signal: Bus error.
The code is for a competition and runs in a docker container on their side. Any ideas? Maybe @sgugger or @gurvinder. For the heroes wanting to help: I am not familiar at all with the workings of Docker containers, so I would need a very plain explanation…