@jeffhale I was reading your other post on setting up Google Colab. I was doing the same and got stuck with the same issue. I tried setting num_workers=0, but the training is taking forever. How much time did it take for training to complete with resnet34 on the pets problem?
I think Colab would be pretty limiting from a performance perspective. With the bug described on this page and the fix of setting num_workers=0 for your image loading, it will slow things down considerably. In addition, Colab uses K80 GPUs (same as an AWS P2 instance), which are quite slow compared to a 1080ti. https://medium.com/initialized-capital/benchmarking-tensorflow-performance-and-cost-across-different-gpu-options-69bd85fe5d58
I tried setting num_workers=0 on my machine and training (on resnet50) went from 48s per epoch to 132s per epoch. I am using a 1080ti, which is ~4x faster than a K80, and I have NVMe storage.
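The slowdown is easy to see with a stdlib analogue (all names here are mine, not fastai's or PyTorch's): num_workers=0 keeps loading in the main process, while num_workers>0 fans work out to subprocesses — and that inter-process channel is what needs /dev/shm inside a container.

```python
import multiprocessing as mp
import time

def load_item(i):
    # Stand-in for decoding/augmenting one image.
    time.sleep(0.005)
    return i * i

def load_batch(indices, num_workers=0):
    """num_workers=0 loads serially in the main process (the workaround
    discussed above); num_workers>0 fans out to subprocesses, which is
    what needs shared memory inside a container."""
    if num_workers == 0:
        return [load_item(i) for i in indices]
    with mp.Pool(num_workers) as pool:
        return pool.map(load_item, indices)

if __name__ == "__main__":
    idx = list(range(32))
    # Same results either way; only the wall time differs.
    assert load_batch(idx, num_workers=0) == load_batch(idx, num_workers=4)
```

With real image decoding the serial path loses the overlap between loading and GPU compute, which is where the 48s → 132s per-epoch difference comes from.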
With size = 224:
learn = ConvLearner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(1)
Took 12 minutes. Not blazing fast. But a huge improvement over a CPU and free.
epoch train loss valid loss accuracy
1 0.045373 0.029649 0.989500
CPU times: user 10min 21s, sys: 1min 45s, total: 12min 7s
Wall time: 12min 8s
Then it took 14 more minutes to run:
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)
Not so fast.
epoch train loss valid loss accuracy
1 0.026436 0.016208 0.993500
CPU times: user 12min 18s, sys: 1min 46s, total: 14min 5s
Wall time: 14min 9s
accuracy(*learn.TTA())
CPU times: user 4min 44s, sys: 8.48 s, total: 4min 53s
Wall time: 4min 53s
tensor(0.9965)
So 30 minutes altogether to run very few epochs.
The same test on Paperspace’s basic P4000 GPU setup took a little over 8 minutes.
Yeah, it is a shared memory issue. By default, /dev/shm
is not created when running in a container, and PyTorch needs it for the workers to communicate. So I created one with a size of 256MB and it worked fine.
Awesome, thanks! So to confirm: is this a Docker volume mounted to that location inside the container, or the --shm-size=256m flag?
--shm-size=256m
should be OK if you are running directly from Docker. But if you are using Kubernetes, then you need to do this https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L41 and then mount it as https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L171
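The linked deployment boils down to a memory-backed emptyDir mounted at /dev/shm. A minimal sketch (container name, image, and size limit are placeholders, not values from the linked chart):

```yaml
# Pod spec fragment: memory-backed emptyDir serving as /dev/shm
spec:
  containers:
  - name: trainer            # placeholder name
    image: your/image:tag    # placeholder image
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory         # tmpfs instead of node disk
      sizeLimit: 1Gi         # cap so workers can't eat node memory
```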
I don’t know. I run docker/portainer locally. I’m pretty sure you can’t do this with Colab; I’ve never heard of the other services.
256m did not seem enough for me when I ran the Docker container on my own machine. I changed it to --shm-size=1024m, and it works fine now. Thanks for pointing out this solution!
This fix worked for me as well.
I’m running in Docker on a p2 instance and added --shm-size 50G.
But I had also forgotten to specify the runtime with --runtime=nvidia.
Thanks for the --ipc=host tip - worked on my machine too.
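Collecting the flags mentioned in this thread into one place (image name is a placeholder; --runtime=nvidia is the pre-19.03 GPU syntax used above):

```shell
# Either enlarge the container's shared memory…
docker run --runtime=nvidia --shm-size=1024m your/image:tag

# …or share the host's IPC namespace (and its /dev/shm) instead:
docker run --runtime=nvidia --ipc=host your/image:tag
```

--ipc=host gives the container the host's full /dev/shm, so there is no size to tune, at the cost of losing IPC isolation.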
@sgugger
I am having an issue today with fast.ai v1, DL course v3 Pt 1, lesson 3, IMDB. I set up a new account on a VAST.ai server using Linux Ubuntu 16.04 with an Nvidia RTX 2080Ti GPU, 52 GB memory, and 12 cores. I used their PyTorch 1.0 image with CUDA 10.0, which is installed from a Docker image. I then updated conda and installed fastai as suggested in the Crestle setup docs.
conda update conda
conda install -c fastai fastai
I ran the NB 1 [lesson1-pets.ipynb] on pets from scratch successfully with no issues.
However, when running lesson3-imdb.ipynb I get an error which appears to be due to changes in the fast.ai library. I am using fastai version 1.0.45.
at cell 10:
data = TextDataBunch.load(path)
FileNotFoundErrorTraceback (most recent call last)
<ipython-input-10-4ff358ceed81> in <module>
----> 1 data = TextDataBunch.load(path)
/opt/conda/lib/python3.7/site-packages/fastai/text/data.py in load(cls, path, cache_name, processor, **kwargs)
167 Use `load_data` for data saved with v1.0.44 or later.""", DeprecationWarning)
168 cache_path = Path(path)/cache_name
--> 169 vocab = Vocab(pickle.load(open(cache_path/'itos.pkl','rb')))
170 train_ids,train_lbls = np.load(cache_path/f'train_ids.npy'), np.load(cache_path/f'train_lbl.npy')
171 valid_ids,valid_lbls = np.load(cache_path/f'valid_ids.npy'), np.load(cache_path/f'valid_lbl.npy')
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/imdb_sample/tmp/itos.pkl'
There is no file data/imdb_sample/tmp/itos.pkl on my system. The IMDB files are there.
It appears that itos.pkl is a Vocab file, which would not have been generated yet, or downloaded if it is for the wikitext103 model.
Please advise if this is my error or an issue.
Thanks
There was a breaking change in v1.0.45 and everyone forgot to update the IMDB notebook; I’ve just done it. Data should be loaded with load_data
now (and it works in all the applications).
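A sketch of the change as described above (guarded so it is a no-op where fastai is not installed; the exact import path is my recollection of the v1 API, check the fastai docs):

```python
try:
    from fastai.text import TextDataBunch, load_data
    # Old (deprecated in v1.0.45; expects the tmp/itos.pkl cache that
    # triggered the FileNotFoundError above):
    #   data = TextDataBunch.load(path)
    # New: pair `data.save()` with the module-level function instead:
    #   data = load_data(path)
except ImportError:
    load_data = None  # fastai v1 not available in this environment
```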
Thanks
Hi,
I just did a clean/fresh machine load and a new git clone from Master branch course-v3 and it still shows the same errors in running Lesson3-IMDB. The fixes do not seem to resolve all the issues.
I have had to revert to 1.0.42 and the matching notebook. That version works fine in my environment.
David
Forgot one cell to change, you’re right. It should work now.
I encounter this error when doing TTA:
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/srv/conda/envs/notebook/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/srv/conda/envs/notebook/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 16) is killed by signal: Bus error.
The code is for a competition and is running in a Docker container on their side. Any ideas? Maybe @sgugger or @gurvinder. For the heroes wanting to help: I am not familiar at all with the workings of Docker containers, so I would need a very plain explanation…
That error is a generic error from PyTorch saying… something failed.
So no one can help you without:
- seeing the code you ran
- seeing the error message you get when you add
num_workers=0
in your call to DataBunch (to deactivate multiprocessing, which is the thing throwing the cryptic error right now)
Hi. Not sure if you scrolled through this thread or not (it’s only 25 posts! :)
If it’s in a Docker container, it might be something to do with the setting / lack of shm space. See this post above, and give that a try?
If that does not solve your problem, then it’s as Sylvain said – you’ll need to post more details / code here for anyone to help…
Good luck!
Yijin
Thanks a lot, @utkb and @sgugger! See my code as of now:
learn = load_learner(path, 'export2.pkl')
learn.data.add_test(good_images)
preds, _ = learn.TTA(ds_type=DatasetType.Test)
My question is where I should add num_workers=0. I checked the TTA function but I do not see it as an argument.