Very slow loading of the ConvNet pretrained model in lesson 1

I can confirm that num_workers=0 fixes the issue, and with num_workers=4 (the default) I am able to replicate it. I’m on the same setup as @apil.tamang: Ubuntu 16.04, CUDA 8.0.61, cuDNN 6.0.21 (but Python 3.5.2 instead of Anaconda).
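For anyone else hitting this, here’s a minimal sketch of the workaround (assuming the lesson 1 setup with PATH, sz and resnet34 already defined, and that your fastai version accepts a num_workers argument on ImageClassifierData.from_paths - adjust the keyword if yours differs):

# Sketch: force single-process data loading to avoid the hang
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz), num_workers=0)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)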

@apil.tamang I have the exact same setup as you lol: Ubuntu 16.04, CUDA 8.0.61, cuDNN 6.0.21 (also Python 3.6)

I have detailed my issue a few times earlier in the thread as it’s pretty specific, but basically: I can run any of the fit(…) calls with num_workers=4 as long as I set precompute=True; alternatively, if I skip directly to the augmentation section, I can run everything after that regardless of the precompute setting.

I also increased my shared memory setting by editing /proc/sys/kernel/shmmni (via sudo nano), changing the value from 4096 to 8192.
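If you want to check your kernel shared-memory settings before and after the change, a small sketch (Linux only; reading /proc needs no special privileges):

# Print the kernel shared-memory limits discussed in this thread
for name in ("shmmni", "shmmax", "shmall"):
    with open("/proc/sys/kernel/" + name) as f:
        print(name, f.read().strip())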

@jamesrequa @metachi have you tried increasing your shared memory (see earlier in thread for details)?

@jeremy Yep, see my comment here; sorry, I must have edited it right at the moment you posted!

Exact same problems for me too.
Regarding “learn = ConvLearner.pretrained(resnet34, data, precompute=False)” being very slow, this was due to CUDA 9 for me.
Once I uninstalled CUDA 9 and installed CUDA 8, the problem was solved.

Regarding the second problem about running the fit function, I was getting the error even with 1 worker, but it seems to work with 0 workers. I was able to reproduce the problem every time.

I will try the suggestions mentioned in the thread and see whether they work for me.

I am able to reliably reproduce the problem in Jupyter. But when I run python from bash to execute this script:

from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

PATH = "data/dogscats/"
sz = 224

# First pass: train with precomputed activations
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)

# Predictions on the validation set
log_preds = learn.predict()
log_preds.shape  # no-op outside Jupyter, kept from the notebook
preds = np.argmax(log_preds, axis=1)  # from log probabilities to 0 or 1
probs = np.exp(log_preds[:,1])        # probability of the class at index 1

# Learning rate finder
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
lrf = learn.lr_find()

# Data augmentation, then training with precompute turned off
tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(1e-2, 1)
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)

I get the following exception instead (and there is no freezing at the start like there is in Jupyter):

Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fce2b447e48>>
Traceback (most recent call last):
  File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 241, in __del__
  File "/home/z/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 235, in _shutdown_workers
  File "/home/z/anaconda3/lib/python3.6/threading.py", line 521, in set
  File "/home/z/anaconda3/lib/python3.6/threading.py", line 364, in notify_all
  File "/home/z/anaconda3/lib/python3.6/threading.py", line 347, in notify
TypeError: 'NoneType' object is not callable

(The same "Exception ignored in: <bound method DataLoaderIter.__del__ ...>" traceback is repeated three more times, for different DataLoaderIter objects.)

Has anybody else tried to reproduce this outside of Jupyter? Maybe I just copied the code incorrectly.

It still seems like you’re using multiple workers (judging from the “Process X:” lines)

Interesting observation @alexvonrass. Deleting the tmp folder also solved my problem.

Increasing shmmni does not solve the problem for me either.

If a few others can also try this workaround and confirm that it solves their problem too, we could add code to the notebook that deletes the folder before running ConvLearner (as a workaround); see the sketch below.
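A minimal sketch of that workaround, assuming the precomputed activations are cached in a tmp folder under PATH (adjust the path if your fastai version caches them elsewhere):

import os, shutil

# Remove the cached activations before creating the learner
tmp_dir = os.path.join(PATH, 'tmp')
if os.path.exists(tmp_dir):
    shutil.rmtree(tmp_dir)

learn = ConvLearner.pretrained(resnet34, data, precompute=True)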


I also just tried increasing my shared memory and could still replicate the issue with 4 workers.

I’ve been able to replicate this on a new AWS instance :slight_smile: Looking into it!


I was attempting to shrink the notebook down to the minimum number of instructions that still replicates the issue at the learn.precompute=False point, and I noticed a suspicious behavior: when all the plot_val... instructions in the “Analyzing results: looking at pictures” section are commented out, the issue disappears. Maybe there is also some display memory contention?

I did a fresh reinstall of my Python environment just before testing; I hope I’m not steering the current debugging efforts off track because of something I overlooked.

The issue seems to be due to OpenCV. I’ve just pushed a change that seems to fix it - try doing a git pull and see if the lesson 1 notebook now works for you.


The problem still persists for me :slightly_frowning_face:
But I think I might have a different problem, because setting num_workers=0 does not solve the issue.
To clarify, only loading the model is slow here (with or without precompute); .fit is working fine.

I actually had this exact issue, with it taking ~1 minute. The num_workers=0 solution worked.
However, I removed the num_workers option after that and it’s working great (runs in about 7s). I tested a kernel restart and it now works without any issues.

Edit: Mine’s a pretty powerful Ubuntu rig on Google Cloud.

Edit 2: This is without pulling in Jeremy’s latest fix.

The first time you run it, it will always take about a minute, since it has to precompute the activations (we’ll discuss this in class). Then when you run again, it’ll be fast.
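If you want to see this for yourself, here is a rough timing sketch (a hypothetical harness, reusing data and resnet34 from the notebook): the first call builds the activation cache, the second reuses it.

import time

for attempt in (1, 2):
    start = time.time()
    # First call precomputes and caches activations; the second reuses the cache
    learn = ConvLearner.pretrained(resnet34, data, precompute=True)
    print("attempt %d: %.1fs" % (attempt, time.time() - start))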

Ohh that explains it.

@jeremy - Can you explain how you figured this out?

Not sure I can! The behavior and stack trace clearly showed a race condition or something similar, and it had to be in some other library, since if it were in PyTorch lots of people would have seen the problem. Since OpenCV is pretty complex, I figured it might be there, and googled a bit.
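For anyone curious, the general class of workaround for OpenCV racing with multiprocessing data-loader workers is to turn off OpenCV’s internal thread pool before the workers are spawned - a sketch of that idea (not necessarily the exact change pushed to the repo):

import cv2

# Disable OpenCV's internal threading so it doesn't fight with the
# multiprocessing DataLoader workers
cv2.setNumThreads(0)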

Thanks Jeremy! Worked for me this time around. Nice work!

@metachi I just pushed a different (and hackier!) approach which is much faster - can you check again and confirm it still works for you?