Thanks for the update!
Does that go away if you use num_workers=0 ? If so, then I think that’s the issue we’re concerned about here…
@jeremy
That unfortunately turns out to be quite nondeterministic! The only place I’ve gotten stuck is in the fit method, the first time it is called with the “cycle_len” parameter. I’m not sure num_workers made any difference. It once got stuck when I serially executed all the cells with 4 threads for data workers; a different time, it didn’t. With 0 workers, however, I don’t think it got stuck even once.
Let’s wait and see what most people experience if, and after, they’ve upgraded to a pytorch version built for Cuda 8.0. I’m assuming that might be a lead on the issue. If that doesn’t work, we can try something else.
@apil.tamang I’ve been using pytorch for Cuda 8.0 so I can confirm that wasn’t the source of the issue for my side
@jeremy My gpu is currently in use (in the middle of a long kaggle training run) but I can def share ssh details to access my box for debugging, if nobody else has already shared theirs by the time it’s done!
Could you please find out which versions of CUDA and cuDNN you’re using, as major.minor.patch for both? This link shows how to find the cuDNN version: https://stackoverflow.com/questions/31326015/how-to-verify-cudnn-installation
I’m on Ubuntu 16.04, CUDA 8.0.61, cuDNN 6.0.21
Also, are you unable to run any fit(…) method with a decent number of workers, or is it just a very occasional thing like I’m having?
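For anyone who wants to report the cuDNN version in the same major.minor.patch form, here is a minimal sketch that reads it from the cudnn.h header, the same idea as the grep in the StackOverflow link. The header path /usr/local/cuda/include/cudnn.h is an assumption about a typical install and may differ on your machine; the function name parse_cudnn_version is made up for this example.

```python
import re
from pathlib import Path

# Assumed default location of the header; adjust if your install differs.
DEFAULT_HEADER = Path("/usr/local/cuda/include/cudnn.h")


def parse_cudnn_version(header_text):
    """Return 'major.minor.patch' from the text of cudnn.h, or None if a field is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 6"
        match = re.search(r"^\s*#define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)


if __name__ == "__main__":
    if DEFAULT_HEADER.exists():
        print("cuDNN", parse_cudnn_version(DEFAULT_HEADER.read_text()))
    else:
        print("cudnn.h not found at", DEFAULT_HEADER)
```

If the header is not at that path, pass the text of wherever your cudnn.h actually lives to parse_cudnn_version.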
I’ve had this same issue. A weird fix that I’ve found: if I delete the “tmp” folder in my dogscats directory it works fine, but otherwise it freezes. So I’d have to delete it every time before using ImageClassifierData and ConvLearner.pretrained.
I can confirm that num_workers = 0 fixes the issue, and with num_workers = 4 (default) I am able to replicate it. I’m on the same setup as @apil.tamang: Ubuntu 16.04, CUDA 8.0.61, cuDNN 6.0.21 (but python3.5.2 instead of anaconda).
@apil.tamang I have the exact same setup as you lol: Ubuntu 16.04, CUDA 8.0.61, cuDNN 6.0.21 (also python 3.6)
I have detailed my issue a few times here in the thread as it’s pretty specific, but basically I can run any of the fit(…) methods with num_workers=4 as long as I either set precompute=True or skip directly to the augmentation section; after that I can run anything regardless of the precompute setting.
I also did change my shared memory to 8gb by running the command sudo nano /proc/sys/kernel/shmmni and changing from 4096 to 8192.
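To compare settings across machines, a small sketch like the one below prints the current System V shared memory limits. One thing to keep in mind when reading the output: kernel.shmmni is the maximum number of shared memory segments, so going from 4096 to 8192 raises a segment count rather than setting a size; shmmax (bytes per segment) and shmall (total pages) are the size limits. Whether any of these is related to the hang is not established in this thread, and the helper name read_kernel_int is made up for this example.

```python
from pathlib import Path


def read_kernel_int(path):
    """Read a single integer from a /proc/sys-style file."""
    return int(Path(path).read_text().split()[0])


if __name__ == "__main__":
    # shmmni: max number of segments, shmmax: max bytes per segment,
    # shmall: max total shared memory, counted in pages
    for name in ("shmmni", "shmmax", "shmall"):
        proc_file = Path("/proc/sys/kernel") / name
        if proc_file.exists():
            print(name, "=", read_kernel_int(proc_file))
        else:
            print(name, "not available on this system")
```

Posting these three numbers along with the CUDA and cuDNN versions should make it easier to spot differences between the boxes that hang and the ones that don’t.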
@jamesrequa @metachi have you tried increasing your shared memory (see earlier in thread for details)?
@jeremy Yep see my comment here, sorry I must have edited it right at the moment you posted!
Exact same problems for me too.
Regarding “learn = ConvLearner.pretrained(resnet34, data, precompute=False)” being very slow: this was due to cuda 9 for me.
Once I uninstalled cuda9 and installed cuda8, this problem was solved.
Regarding the second problem, about running the fit function: I was getting the error even with 1 worker, but it seems to work with 0 workers. I was able to reproduce the problem every time.
I will try the suggestions mentioned in the thread to see if they work for me.
I am able to reliably reproduce the problem in Jupyter. But when I run python from bash to execute this script:
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
PATH = "data/dogscats/"
sz=224
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)
log_preds = learn.predict()
log_preds.shape
preds = np.argmax(log_preds, axis=1) # from log probabilities to 0 or 1
probs = np.exp(log_preds[:,1])
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
lrf=learn.lr_find()
tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(1e-2, 1)
learn.precompute=False
learn.fit(1e-2, 3, cycle_len=1)
I get the following exception instead (and there is no freezing at the start like there is in Jupyter):
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fce2b447e48>>
Traceback (most recent call last):
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 241, in __del__
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 235, in _shutdown_workers
File "/home/z/anaconda3/lib/python3.6/threading.py", line 521, in set
File "/home/z/anaconda3/lib/python3.6/threading.py", line 364, in notify_all
File "/home/z/anaconda3/lib/python3.6/threading.py", line 347, in notify
TypeError: 'NoneType' object is not callable
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fce2b447c88>>
Traceback (most recent call last):
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 241, in __del__
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 235, in _shutdown_workers
File "/home/z/anaconda3/lib/python3.6/threading.py", line 521, in set
File "/home/z/anaconda3/lib/python3.6/threading.py", line 364, in notify_all
File "/home/z/anaconda3/lib/python3.6/threading.py", line 347, in notify
TypeError: 'NoneType' object is not callable
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fce2b270fd0>>
Traceback (most recent call last):
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 241, in __del__
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 235, in _shutdown_workers
File "/home/z/anaconda3/lib/python3.6/threading.py", line 521, in set
File "/home/z/anaconda3/lib/python3.6/threading.py", line 364, in notify_all
File "/home/z/anaconda3/lib/python3.6/threading.py", line 347, in notify
TypeError: 'NoneType' object is not callable
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fce2b22ac18>>
Traceback (most recent call last):
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 241, in __del__
File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 235, in _shutdown_workers
File "/home/z/anaconda3/lib/python3.6/threading.py", line 521, in set
File "/home/z/anaconda3/lib/python3.6/threading.py", line 364, in notify_all
File "/home/z/anaconda3/lib/python3.6/threading.py", line 347, in notify
TypeError: 'NoneType' object is not callable
Has anybody else tried to reproduce this outside of Jupyter? Maybe I just copied the code incorrectly.
It still seems like you’re using multiple workers (judging from the “Process X:” lines)
Interesting observation @alexvonrass. Deleting the tmp folder also solved my problem.
Increasing shmmni does not solve the problem for me either.
If a few others can also try this workaround and confirm it solves their problem too, we can add code to the notebook to delete the folder before running ConvLearner (as a workaround).
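For anyone who wants to try the tmp folder workaround from a notebook cell, here is a minimal sketch. It assumes the folder is literally named “tmp” and sits directly under the data path, as described above; clear_tmp is a made-up helper name, not something in fastai. Anything cached in that folder is deleted and will have to be regenerated on the next run.

```python
import shutil
from pathlib import Path


def clear_tmp(data_path, tmp_name="tmp"):
    """Delete <data_path>/<tmp_name> if it exists; return True when something was removed."""
    tmp_dir = Path(data_path) / tmp_name
    if tmp_dir.is_dir():
        shutil.rmtree(tmp_dir)
        return True
    return False


PATH = "data/dogscats/"  # same path as in the script earlier in the thread
# Run this before ImageClassifierData.from_paths / ConvLearner.pretrained
removed = clear_tmp(PATH)
print("tmp folder removed" if removed else "no tmp folder found")
```

If the hang goes away only when this reports that a folder was removed, that would support the observation that the cached tmp folder is involved.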
I also just tried increasing my shared memory and could still replicate the issue with 4 workers.
I’ve been able to replicate this on a new AWS instance. Looking into it!
I was attempting to shrink the notebook to the minimum number of instructions able to replicate the issue at the learn.precompute=False point, and I noticed a suspicious behavior: when all the plot_val... instructions in the “Analyzing results: looking at pictures” section are commented out, the issue disappears. Maybe there could also be some display memory contention?
I did a fresh reinstall of my Python environment just before my testing; I hope I’m not pulling the current debugging effort off track because of something I overlooked.
The issue seems to be due to opencv. I’ve just pushed a change that seems to fix it - try doing a git pull and see if the lesson 1 notebook now works for you.
Problem still persists for me
But I think I might have a different problem because setting num_workers=0 does not solve the issue.
To clarify, only loading the model is slow here (with or without precompute), .fit is working fine.
I actually had this exact issue with it taking ~1 minute. The num_workers=0 solution worked.
However I removed the num_workers option after that and it’s working great (runs in about 7s). Tested a kernel restart and it works now without any issues.
Edit: Mine’s a pretty powerful ubuntu rig on google cloud
Edit2: This is without pulling in Jeremy’s latest fix