Lesson 1 - Notebook stuck in create_cnn


(Imran) #1

Hi,

I have set up fastai and am running the lesson 1 notebook. I was able to download the tar files and unzip them, and I replaced the “pat” regex to suit the Windows path format. However, the notebook keeps executing indefinitely (without aborting or getting killed) when I run the “create_cnn” step.

learn = create_cnn(data, models.resnet34, metrics=error_rate)

However, if I replace the “learn” line with the code below, I am able to proceed.

def get_model(pretrained=True, model_name='resnet34', **kwargs):
    arch = models.resnet34(pretrained, **kwargs)
    return arch

learn = Learner(data, get_model(), metrics=[accuracy])

I’ve got Windows 10, GTX 1080, latest NVIDIA drivers (GeForce), updated latest available version of pytorch and fastai library. Can someone suggest how to debug and resolve this issue?


(Rohit Singh) #3

What was the problem? I’m seeing a similar hang in create_cnn on an AWS windows machine.


(Spencer Uresk) #4

I’m having this problem also.

Windows 10
Python 3.7
FastAI 1.0.43
PyTorch 1.0.1 (py3.7_cuda100_cudnn7_1)

It looks like it hangs in training and if I do a dump of the process, the top 30 or so lines look like this:

caffe2+0x42344
libiomp5md!_kmp_invoke_microtask+0x93
libiomp5md!__kmp_fork_call+0x4c2
libiomp5md!__kmpc_fork_call+0xdd
caffe2+0x41a48
caffe2+0x44af9
caffe2+0x442c8
caffe2+0xd904
caffe2!THFloatBlas_gemm+0x274
caffe2!THFloatTensor_addmm+0x5d9
caffe2!THNN_FloatSpatialConvolutionMM_updateOutput+0x5eb
libiomp5md!_vckmp_invoke_microtask+0x8d
libiomp5md!__vckmp_invoke_task_func+0x4c
libiomp5md!__kmp_fork_call+0x170a
libiomp5md!_vcomp_fork+0x6d
caffe2!THNN_FloatSpatialConvolutionMM_updateOutput+0x320
caffe2!THDiskFile_isLittleEndianCPU+0x772ab
caffe2!at::native::thnn_conv2d_forward+0xa4
caffe2!at::TypeDefault::thnn_conv2d_forward+0x158
torch!torch::autograd::VariableType::thnn_conv2d_forward+0x82b
caffe2!at::native::thnn_conv2d+0x9b
caffe2!at::TypeDefault::thnn_conv2d+0x158
torch!torch::autograd::VariableType::thnn_conv2d+0x27d
caffe2!at::native::_convolution_nogroup+0x57f
caffe2!at::TypeDefault::_convolution_nogroup+0x174
torch!torch::autograd::VariableType::_convolution_nogroup+0x2e1
caffe2!at::native::_convolution+0x1830
caffe2!at::TypeDefault::_convolution+0x1a1
torch!torch::autograd::VariableType::_convolution+0x380
caffe2!at::native::convolution+0x148
caffe2!at::TypeDefault::convolution+0x180
torch!torch::autograd::VariableType::convolution+0x102
caffe2!at::native::conv2d+0xf4
caffe2!at::TypeDefault::conv2d+0x161
torch!torch::autograd::VariableType::conv2d+0xd0

Trying to dig into it further, but I’m a little out of my depth with some of this stuff so it is slow.

I thought perhaps this was purely a PyTorch thing and maybe could be reproduced outside fastai, but I ran through a bunch of PyTorch examples and they all seem to work so far.
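One way to check whether the hang reproduces outside fastai is to run a single CPU convolution (the operation at the bottom of the dump above) in a child process with a timeout, so that a hang is reported instead of freezing the session. This is only a sketch: PROBE, run_probe, the tensor shapes and the 60-second timeout are illustrative choices, not anything taken from fastai or PyTorch. Setting OMP_NUM_THREADS=1 is an untested guess based on the OpenMP frames (libiomp5md) in the dump, not a confirmed fix.

```python
import os
import subprocess
import sys

# Hypothetical probe: one CPU convolution forward pass, shaped like the
# first layer of a ResNet. It needs torch installed in the child interpreter.
PROBE = """
import torch
conv = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
x = torch.randn(1, 3, 64, 64)
with torch.no_grad():
    print(tuple(conv(x).shape))
"""

def run_probe(code, timeout_s=60, extra_env=None):
    """Run `code` in a fresh interpreter and return (status, output).

    status is "ok", "error" (non-zero exit) or "hung" (timeout expired).
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True, timeout=timeout_s, env=env,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "hung", ""
    if proc.returncode != 0:
        return "error", proc.stderr.strip()
    return "ok", proc.stdout.strip()

if __name__ == "__main__":
    print("default threads:", run_probe(PROBE))
    # Assumption: limiting OpenMP to one thread may change the behaviour.
    print("single thread:  ", run_probe(PROBE, extra_env={"OMP_NUM_THREADS": "1"}))
```

If the first call reports "hung" and the second reports "ok", that would point at the threaded CPU code path rather than at fastai itself.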


#5

I’m having the same issue with create_cnn on my local Win7 machine. After spending some time debugging, I found that it gets stuck while executing the num_features_model function. To work around this, I wrote a custom function based on the create_cnn source code that, instead of calling num_features_model, takes a num_features:int value as a parameter:

def _resnet_split(m:nn.Module): return (m[0][6],m[1])
def my_create_cnn(data:DataBunch, arch:Callable, num_features:int, cut:Union[int,Callable]=None, pretrained:bool=True, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
                custom_head:Optional[nn.Module]=None, split_on:Optional[SplitFuncOrIdxList]=None,
                bn_final:bool=False, **learn_kwargs:Any)->Learner:
    body = create_body(arch, pretrained, cut)
    head = custom_head or create_head(num_features, data.c, lin_ftrs, ps=ps, bn_final=bn_final)
    model = nn.Sequential(body, head)
    learn = Learner(data, model, **learn_kwargs)
    learn.split(ifnone(split_on, _resnet_split))
    if pretrained: learn.freeze()
    apply_init(model[1], nn.init.kaiming_normal_)
    return learn

I use this function instead of create_cnn, for example:

learn = my_create_cnn(data, models.resnet34, 1024, metrics=error_rate)

Here num_features is 1024 for resnet18 and resnet34, and 4096 for resnet50.
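These values can be derived rather than memorised. The original create_cnn computes the head size as num_features_model(body) * 2: the ResNet body outputs 512 channels for resnet18 and resnet34 and 2048 for resnet50, and the fastai head concatenates average pooling and max pooling, which doubles that number. A small sketch of the arithmetic (BODY_CHANNELS and head_input_features are hypothetical names used only for this illustration):

```python
# Output channels of the convolutional body once the original classifier
# has been cut off (standard ResNet widths).
BODY_CHANNELS = {"resnet18": 512, "resnet34": 512, "resnet50": 2048}

def head_input_features(arch_name):
    """Features seen by the head: avg-pool and max-pool outputs concatenated."""
    return 2 * BODY_CHANNELS[arch_name]

for name in ("resnet18", "resnet34", "resnet50"):
    print(name, head_input_features(name))
```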


(Spencer Uresk) #6

Thanks! Still trying to figure out the core problem, but this at least let me work around it temporarily.


#7

Hi there,

I have tried using the fix posted by ZhekaMeka, but ran into another issue:

File “”, line 4
body = create_body(arch, pretrained, cut)
^
IndentationError: expected an indented block

How exactly did you manage to fix it?


#8

Hello!
That’s a Python error which in this case means that the body of the function should be indented. I fixed the formatting of my answer, so it should now work if you copy and paste it into a Jupyter notebook.
Hope that helps!
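For anyone new to Python, the error can be reproduced without fastai. The sketch below compiles two versions of the same tiny function, one whose body lost its indentation and one with it restored (compiles, broken and fixed are names made up for this example):

```python
# The body of a function must be indented under its "def" line.
broken = "def f():\nreturn 1\n"      # body not indented
fixed = "def f():\n    return 1\n"   # body indented by four spaces

def compiles(source):
    """Return True if `source` is accepted by the Python compiler."""
    try:
        compile(source, "<cell>", "exec")
        return True
    except IndentationError as err:
        print("IndentationError:", err.msg)
        return False

print(compiles(broken))
print(compiles(fixed))
```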


(Imran) #9

I extracted the code for “create_cnn” and added a pdb breakpoint; this is where it’s stuck now:

from fastai.vision import *
from fastai.metrics import error_rate
import torch
torch.cuda.set_device(0)
bs = 64
path = untar_data(URLs.PETS); path
WindowsPath('C:/Users/Imran/.fastai/data/oxford-iiit-pet')
path_anno = path/'annotations'
path_img = path/'images'
fnames = get_image_files(path_img)
np.random.seed(2)
pat = re.compile(r'\\([^\\]+)_\d+.jpg$')
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs).normalize(imagenet_stats)

import pdb

def _default_split(m:nn.Module): return (m[1],)
def _resnet_split(m:nn.Module): return (m[0][6],m[1])

_default_meta = {'cut':-1, 'split':_default_split}
_resnet_meta = {'cut':-2, 'split':_resnet_split}

model_meta = {
    models.resnet18 :{**_resnet_meta}, models.resnet34: {**_resnet_meta},
    models.resnet50 :{**_resnet_meta}, models.resnet101:{**_resnet_meta}}

def cnn_config(arch):
    "Get the metadata associated with arch."
    torch.backends.cudnn.benchmark = True
    return model_meta.get(arch, _default_meta)

class Hook():
    "Create a hook on m with hook_func."
    def __init__(self, m:nn.Module, hook_func:HookFunc, is_forward:bool=True, detach:bool=True):
        self.hook_func,self.detach,self.stored = hook_func,detach,None
        f = m.register_forward_hook if is_forward else m.register_backward_hook
        self.hook = f(self.hook_fn)
        self.removed = False
    def hook_fn(self, module:nn.Module, input:Tensors, output:Tensors):
        "Applies hook_func to module, input, output."
        if self.detach:
            input = (o.detach() for o in input ) if is_listy(input ) else input.detach()
            output = (o.detach() for o in output) if is_listy(output) else output.detach()
        self.stored = self.hook_func(module, input, output)
    def remove(self):
        "Remove the hook from the model."
        if not self.removed:
            self.hook.remove()
            self.removed=True
    def __enter__(self, *args): return self
    def __exit__(self, *args): self.remove()

class Hooks():
    "Create several hooks on the modules in ms with hook_func."
    def __init__(self, ms:Collection[nn.Module], hook_func:HookFunc, is_forward:bool=True, detach:bool=True):
        self.hooks = [Hook(m, hook_func, is_forward, detach) for m in ms]
    def __getitem__(self,i:int)->Hook: return self.hooks[i]
    def __len__(self)->int: return len(self.hooks)
    def __iter__(self): return iter(self.hooks)
    @property
    def stored(self): return [o.stored for o in self]
    def remove(self):
        "Remove the hooks from the model."
        for h in self.hooks: h.remove()
    def __enter__(self, *args): return self
    def __exit__ (self, *args): self.remove()

def dummy_batch(m: nn.Module, size:tuple=(64,64))->Tensor:
    "Create a dummy batch to go through m with size."
    ch_in = in_channels(m)
    pdb.set_trace()
    return one_param(m).new(1, ch_in, *size).requires_grad_(False).uniform_(-1.,1.)

def dummy_eval(m:nn.Module, size:tuple=(64,64)):
    "Pass a dummy_batch in evaluation mode in m with size."
    return m.eval()(dummy_batch(m, size))

def _hook_inner(m,i,o): return o if isinstance(o,Tensor) else o if is_listy(o) else list(o)

def hook_outputs(modules:Collection[nn.Module], detach:bool=True, grad:bool=False)->Hooks:
    "Return Hooks that store activations of all modules in self.stored"
    return Hooks(modules, _hook_inner, detach=detach, is_forward=not grad)

def model_sizes(m:nn.Module, size:tuple=(64,64))->Tuple[Sizes,Tensor,Hooks]:
    "Pass a dummy input through the model m to get the various sizes of activations."
    with hook_outputs(m) as hooks:
        x = dummy_eval(m, size)
        return [o.stored.shape for o in hooks]

def num_features_model(m:nn.Module)->int:
    "Return the number of output features for model."
    sz = 64
    while True:
        try: return model_sizes(m, size=(sz,sz))[-1][1]
        except Exception as e:
            sz *= 2
            if sz > 2048: raise

def create_cnn1(data:DataBunch, arch:Callable, cut:Union[int,Callable]=None, pretrained:bool=True,
                lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
                custom_head:Optional[nn.Module]=None, split_on:Optional[SplitFuncOrIdxList]=None,
                bn_final:bool=False, **learn_kwargs:Any)->Learner:
    meta = cnn_config(arch)
    body = create_body(arch, pretrained, cut)
    nf = num_features_model(body) * 2
    head = custom_head or create_head(nf, data.c, lin_ftrs, ps=ps, bn_final=bn_final)
    model = nn.Sequential(body, head)
    learn = Learner(data, model, **learn_kwargs)
    learn.split(ifnone(split_on,meta['split']))
    if pretrained: learn.freeze()
    apply_init(model[1], nn.init.kaiming_normal_)
    return learn

learn = create_cnn1(data, models.resnet34, metrics=error_rate)
(5)dummy_batch()
(Pdb) n
--Return--
(5)dummy_batch()->tensor([[[[ 3…5320e-01]]]])
(Pdb) n
--Call--
c:\users\anaconda3\envs\fastai\lib\site-packages\torch\nn\modules\module.py(483)__call__()
-> def __call__(self, *input, **kwargs):
(Pdb) n
c:\users\anaconda3\envs\fastai\lib\site-packages\torch\nn\modules\module.py(484)__call__()
-> for hook in self._forward_pre_hooks.values():
(Pdb) n
c:\users\anaconda3\envs\fastai\lib\site-packages\torch\nn\modules\module.py(486)__call__()
-> if torch._C._get_tracing_state():
(Pdb) n
c:\users\anaconda3\envs\fastai\lib\site-packages\torch\nn\modules\module.py(489)__call__()
-> result = self.forward(*input, **kwargs)
(Pdb) n

Execution does not proceed after this.
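Stepping with pdb works, but the standard library can also report where a call is stuck without manual stepping. The sketch below uses faulthandler.dump_traceback_later to write the Python stack of every thread to a file if a call is still running after a timeout. The helper name call_with_hang_dump, the log file name and the timeouts are made up for this example. If the hang is inside native code, as the earlier process dump suggests, the output will only show the Python frame that called into it.

```python
import faulthandler
import os
import tempfile
import time

def call_with_hang_dump(fn, timeout_s, log_path):
    """Call fn(). If it is still running after timeout_s seconds, write the
    Python stack of every thread to log_path (the call itself keeps running)."""
    with open(log_path, "w") as log:
        # faulthandler writes straight to the file descriptor, so a real
        # file is used instead of the notebook's sys.stderr.
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            return fn()
        finally:
            faulthandler.cancel_dump_traceback_later()

if __name__ == "__main__":
    log_path = os.path.join(tempfile.gettempdir(), "hang_dump.txt")
    # Stand-in for the slow call. In the notebook this would be something like
    # lambda: create_cnn(data, models.resnet34, metrics=error_rate)
    call_with_hang_dump(lambda: time.sleep(2), timeout_s=1, log_path=log_path)
    with open(log_path) as log:
        print(log.read())
```

For a call that never returns, the log file can be opened from another window once the timeout has passed.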


(Rohit Singh) #10

A workaround:

We see this hang only with PyTorch 1.0.1, not with PyTorch 1.0.0, so downgrading to 1.0.0 avoids it.
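A notebook can check for the affected version up front. The sketch below only encodes what has been reported in this thread (1.0.1 hangs, 1.0.0 works); it is not an official compatibility list, and release_of and thread_status are hypothetical helper names.

```python
import re

# Versions as reported in this thread, nothing more.
REPORTED_HANGING = {"1.0.1"}
REPORTED_WORKING = {"1.0.0"}

def release_of(version):
    """Reduce e.g. '1.0.1.post2' or '1.0.1+cu100' to '1.0.1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return ".".join(match.groups())

def thread_status(version):
    """Classify a PyTorch version string by what this thread reports."""
    release = release_of(version)
    if release in REPORTED_HANGING:
        return "reported to hang"
    if release in REPORTED_WORKING:
        return "reported to work"
    return "not discussed in this thread"

print(thread_status("1.0.1"))
```

In a notebook the argument would come from the installed library, for example thread_status(torch.__version__) after import torch.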


#11

Hi,

Thank you for the quick response. I copied the code into the notebook, but I get another error:

---------------------------------------------------------------------------

NameError Traceback (most recent call last)
in
----> 1 def _resnet_split(m:nn.Module): return (m[0][6],m[1])
2 def my_create_cnn(data:DataBunch, arch:Callable, num_features:int, cut:Union[int,Callable]=None, pretrained:bool=True, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
3 custom_head:Optional[nn.Module]=None, split_on:Optional[SplitFuncOrIdxList]=None,
4 bn_final:bool=False, **learn_kwargs:Any)->Learner:
5 body = create_body(arch, pretrained, cut)

NameError: name 'nn' is not defined

I looked up nn for Python and only found the tf library, but importing that did not help. Should I define nn myself, and if so, how?

I am new to Python, which makes it hard to follow what’s going on. Any suggestions? Thanks for the help!


#12

I switched to PyTorch 1.0.0 now and it completed the function with the original create_cnn.

I also realized that my kernel was somehow stuck, and that the error about the missing definition of nn that I mentioned above could have been just a result of that.

Thanks for the help!


#13

I’m having the same problem (Win10, GTX 1060) and installed pytorch 1.0.0 now via conda, and now I’m getting a BrokenPipeError…

Does anyone know how to avoid that?

EDIT:
I was only trying out the very first tutorial code for fastAI vision when I got these errors (which was kind of a bummer, since things are not working even in the most basic tutorial). Here’s the full code:

from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)

data = ImageDataBunch.from_folder(path)

learn = create_cnn(data, models.resnet18, metrics=accuracy)

learn.fit(1)

#14

I am using python 3.6 and used the pip install command:

pip install /path/to/whl/file

the file I got from:

There I chose this file:
torch-1.0.0-cp36-cp36m-win_amd64.whl

My assumption was that 36 refers to the Python version, but that was just a hunch. It is working for me so far, though.
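The hunch is right: in a wheel file name the last three dash-separated fields are the Python tag, the ABI tag and the platform tag, so cp36 means CPython 3.6 and win_amd64 means 64-bit Windows. A small sketch that splits the name and compares it with the running interpreter (parse_wheel_name and matches_interpreter are names invented for this example, and the check only covers CPython tags):

```python
import sys

def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }

def matches_interpreter(python_tag, version_info=sys.version_info):
    """True if a CPython tag such as 'cp36' matches the given interpreter."""
    return python_tag == "cp%d%d" % (version_info[0], version_info[1])

info = parse_wheel_name("torch-1.0.0-cp36-cp36m-win_amd64.whl")
print(info)
print("matches this interpreter:", matches_interpreter(info["python_tag"]))
```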


#15

Ok, kind of weird. With pytorch 1.0.0, I get the BrokenPipeError only when I use Spyder (beta version 4), but it works fine with Jupyter notebook. I’m also using python 3.6, but with Anaconda.

Well, guess I’ll use Jupyter notebook for FastAI for now, though I hope there will be some proper solution / support for the whole thing in future.


(Leon Dummer) #16

Did you check the RAM usage when you got the error? For me, the broken pipe error seems to happen when the machine runs out of RAM.


(Christpoher Toth) #17

I’m also stuck in the exact same place.
After running create_cnn everything freezes.
It works on WSL, but of course there is no GPU support available there.

So disappointing to be right on the edge of being able to play with this and poof it just doesn’t work.


(Imran) #18

Please try with pytorch 1.0.0, it works in Windows with FastAI


(Christpoher Toth) #19

(fastai) C:\Users\Q\fastai>conda install pytorch==1.0.0
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  • pytorch==1.0.0

(fastai) C:\Users\Q\fastai>pip install torch==1.0.0
Collecting torch==1.0.0
Could not find a version that satisfies the requirement torch==1.0.0 (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==1.0.0


(Cody Quist) #20

Confirmed working solution

Check your current CUDA version by running this in a command prompt (cmd, PowerShell, etc.):
nvcc --version

Use the wheel, don’t use conda.
Download this IF you are using CUDA 9.0; otherwise you will have to find the link on your own:
https://download.pytorch.org/whl/cu90/torch-1.0.0-cp36-cp36m-win_amd64.whl

Make sure NOT to download PyTorch 0.4, but rather PyTorch 1.0.0.

After you download the wheel that matches your CUDA version, install it like any other pip package:
pip install torch-1.0.0-cp36-cp36m-win_amd64.whl
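To match the CUDA version to the folder name in the download URL, the release number printed by nvcc can be turned into a tag such as cu90. The sketch below parses the text that nvcc --version prints. The sample line is an illustration of the usual format of its last line, and the assumption that other releases follow the same cuXY naming is based only on the cu90 URL above and the cuda100 build string mentioned earlier in the thread. cuda_tag_from_nvcc is a hypothetical helper name.

```python
import re

def cuda_tag_from_nvcc(nvcc_output):
    """Turn the 'release X.Y' part of nvcc --version output into 'cuXY'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return "cu" + match.group(1) + match.group(2)

# Illustrative sample of the last line printed by nvcc --version.
sample = "Cuda compilation tools, release 9.0, V9.0.176"
print(cuda_tag_from_nvcc(sample))
```

The real output can be captured with subprocess.check_output(["nvcc", "--version"]) and decoded before being passed to the function.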


(Peter Zacho) #21

Thanks for sharing - really helped me progress. Fix works fine for me. I’m using pytorch 1.0.0 deployed using conda on Windows 10 / CPU.