When I run the course-v3 notebook for lesson1-pets on Google Colab, it shows the above-mentioned error on the line learn.fit_one_cycle(5), for both resnet34 and resnet50.
How do I resolve it?
This is an issue from PyTorch, so you should ask about it on their forum. From what I understand it has to do with shared memory during multiprocessing, and setting num_workers to 0 will fix the problem (but make your training slower).
Inside any function that creates a DataBunch. See the docs for more information.
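For example, a minimal sketch for the lesson1-pets data (path_img, fnames, and pat are the variables used in that notebook; treat this as an illustration rather than the exact cell):
data = ImageDataBunch.from_name_re(
    path_img, fnames, pat,
    ds_tfms=get_transforms(), size=224,
    num_workers=0  # load data in the main process, avoiding the shared-memory error
)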
I run fastai in nvidia docker on my local rig. I was trying to get fast.ai v1 loaded and ready to go for tomorrow, it wasn't working, and I was also getting this error. I was able to fix it when starting my docker container by adding --ipc=host to the docker run command. There is also a docker flag for increasing the shared memory size, but I didn't really understand what value I should increase it to, so I opted for the ipc flag instead. I know this doesn't help you with Colab, but I suspect this problem will come up again for others.
I did not have this issue when running fastai v0.7 in docker, and I suspect it has something to do with the software changes in v1. I did upgrade my Nvidia drivers to 396 and Docker-CE to the latest version immediately before building my fastai v1 container, so it could also have been related to those upgrades, although I suspect it wasn't.
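For reference, a sketch of the kind of command I mean (the image name and port mapping are placeholders for whatever you normally run):
docker run --runtime=nvidia --ipc=host -p 8888:8888 my-fastai-image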
It is an issue that is often mentioned in the pytorch repo with docker installs. fastai v0.7 didn't use the pytorch DataLoader, which is the source of this bug; that is probably why it appears now.
I was able to fix a similar issue on Colab by setting num_workers=0 here:
data = ImageDataBunch.from_folder(
    path,
    ds_tfms=get_transforms(),
    tfms=imagenet_norm,
    size=256,
    num_workers=0  # load data in the main process to avoid the shared-memory error
)
img, label = data.valid_ds[-1]
img.show(title=data.classes[label])
@jeffhale I was reading your other post on setting up Google Colab. I was doing the same and got stuck with the same issue. I tried setting num_workers=0, but the training is taking forever. How much time did it take for the training to complete for resnet34 on the pets problem?
I think Colab would be pretty limiting from a performance perspective. With the bug described on this page and the fix of setting num_workers=0 for your image loading, it will slow things down considerably. In addition, Colab uses K80 GPUs (the same as an AWS P2 instance), which are quite slow compared to a 1080 Ti. https://medium.com/initialized-capital/benchmarking-tensorflow-performance-and-cost-across-different-gpu-options-69bd85fe5d58
I tried setting num_workers=0 on my machine and training (on resnet50) went from 48s per epoch to 132s per epoch. I am using a 1080 Ti, which is ~4x faster than a K80, and I have NVMe storage.
With size=224:
learn = ConvLearner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(1)
It took 12 minutes. Not blazing fast, but a huge improvement over a CPU, and it's free.
epoch train loss valid loss accuracy
1 0.045373 0.029649 0.989500
CPU times: user 10min 21s, sys: 1min 45s, total: 12min 7s
Wall time: 12min 8s
Then it took 14 more minutes to run:
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)
Not so fast.
epoch train loss valid loss accuracy
1 0.026436 0.016208 0.993500
CPU times: user 12min 18s, sys: 1min 46s, total: 14min 5s
Wall time: 14min 9s
accuracy(*learn.TTA())
CPU times: user 4min 44s, sys: 8.48 s, total: 4min 53s
Wall time: 4min 53s
tensor(0.9965)
So 30 minutes altogether to run very few epochs.
The same test on Paperspace’s basic P4000 GPU setup took a little over 8 minutes.
Yeah, it is a shared memory issue. By default it seems /dev/shm is not created when running in a container, and PyTorch needs it for the workers to communicate. So I created one with a size of 256MB and it worked fine.
Awesome, thanks! So to confirm, is this a docker volume mounted to that location inside the container, or the --shm-size=256m flag?
--shm-size=256m should be OK if you are running directly from docker. But if you are using Kubernetes then you need to do this https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L41 and then mount it like this https://github.com/Uninett/helm-charts/blob/master/repos/stable/deep-learning-tools/templates/deployment.yaml#L171
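For example, a sketch of the plain docker invocation with the shared-memory flag (the image name and port mapping are placeholders):
docker run --runtime=nvidia --shm-size=256m -p 8888:8888 my-fastai-image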
Where and how do I add --shm-size=256m in Colab or Clouderizer?
I don’t know. I run docker/portainer locally. I'm pretty sure you can't do this with Colab, and I've never heard of the other services.
256m did not seem enough for me when I ran the Docker container on my own machine. I changed it to --shm-size=1024m, and it works fine now. Thanks for pointing out this solution!
This fix worked for me as well.
I’m running in docker on a p2 instance and added --shm-size 50G.
But I had also forgotten to specify the runtime with --runtime=nvidia.
Thanks for the --ipc=host tip, it worked on my machine too.
@sgugger
I am having an issue today with fast.ai v1, DL course v3 Part 1, lesson 3, IMDB. I set up a new account on a VAST.ai server using Ubuntu 16.04 with an Nvidia RTX 2080 Ti GPU, 52 GB of memory, and 12 cores. I used their PyTorch 1.0 image with CUDA 10.0, which is installed from a Docker image. I then updated conda and installed fastai as suggested in the Crestle setup docs.
conda update conda
conda install -c fastai fastai
I ran the NB 1 [lesson1-pets.ipynb] on pets from scratch successfully with no issues.
However, when running lesson3-imdb.ipynb I get an error which appears to be due to changes in the fastai library. I am using fastai version 1.0.45.
At cell 10:
data = TextDataBunch.load(path)
FileNotFoundError Traceback (most recent call last)
<ipython-input-10-4ff358ceed81> in <module>
----> 1 data = TextDataBunch.load(path)
/opt/conda/lib/python3.7/site-packages/fastai/text/data.py in load(cls, path, cache_name, processor, **kwargs)
167 Use `load_data` for data saved with v1.0.44 or later.""", DeprecationWarning)
168 cache_path = Path(path)/cache_name
--> 169 vocab = Vocab(pickle.load(open(cache_path/'itos.pkl','rb')))
170 train_ids,train_lbls = np.load(cache_path/f'train_ids.npy'), np.load(cache_path/f'train_lbl.npy')
171 valid_ids,valid_lbls = np.load(cache_path/f'valid_ids.npy'), np.load(cache_path/f'valid_lbl.npy')
FileNotFoundError: [Errno 2] No such file or directory: '/root/.fastai/data/imdb_sample/tmp/itos.pkl'
There is no file data/imdb_sample/tmp/itos.pkl on my system. The IMDB files are there.
It appears that itos.pkl is a Vocab file, which would not have been generated yet (or downloaded, if it is for the wikitext-103 model).
Please advise if this is my error or an issue.
Thanks
There was a breaking change in v1.0.45 and everyone forgot to update the IMDB notebook; I've just done it. Data should now be loaded with load_data (and it works in all the applications).
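For anyone hitting the same cell, a minimal sketch of the new pattern, assuming the DataBunch was saved beforehand with data.save() (the filename 'data_save.pkl' below is just the default, used for illustration):
from fastai.text import *

# build the DataBunch once and save it
data = TextDataBunch.from_csv(path, 'texts.csv')
data.save('data_save.pkl')

# from v1.0.44 on, reload it with load_data instead of TextDataBunch.load
data = load_data(path, 'data_save.pkl')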