Lesson 1 - Notebook stuck in create_cnn

(Imran) #1


I have set up fastai and am running the lesson 1 notebook. I was able to download the tar files and unzip them, and I have replaced the “pat” regex to suit the Windows path format. However, the notebook keeps executing indefinitely (without aborting or getting killed) when I run the “create_cnn” step.

learn = create_cnn(data, models.resnet34, metrics=error_rate)

However, if I replace the “learn” line with the code below, I am able to proceed.

def get_model(pretrained=True, model_name='resnet34', **kwargs):
    arch = models.resnet34(pretrained, **kwargs)
    return arch

learn = Learner(data, get_model(), metrics=[accuracy])

I’ve got Windows 10, a GTX 1080, the latest NVIDIA (GeForce) drivers, and the latest available versions of PyTorch and the fastai library. Can someone suggest how to debug and resolve this issue?

(Rohit Singh) #3

What was the problem? I’m seeing a similar hang in create_cnn on an AWS Windows machine.

(Spencer Uresk) #4

I’m having this problem also.

Windows 10
Python 3.7
FastAI 1.0.43
PyTorch 1.0.1 (py3.7_cuda100_cudnn7_1)

It looks like it hangs in training and if I do a dump of the process, the top 30 or so lines look like this:


Trying to dig into it further, but I’m a little out of my depth with some of this stuff so it is slow.

I thought perhaps this was purely a PyTorch thing and maybe could be reproduced outside fastai, but I ran through a bunch of PyTorch examples and they all seem to work so far.


I’m having the same issue with create_cnn on my local Win7 machine. After spending some time debugging, I found that it gets stuck while executing the num_features_model function. To work around this, I wrote a custom function based on the create_cnn source code that takes a num_features:int value as a parameter instead of calling num_features_model:

def _resnet_split(m:nn.Module): return (m[0][6],m[1])
def my_create_cnn(data:DataBunch, arch:Callable, num_features:int, cut:Union[int,Callable]=None, pretrained:bool=True, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
                custom_head:Optional[nn.Module]=None, split_on:Optional[SplitFuncOrIdxList]=None,
                bn_final:bool=False, **learn_kwargs:Any)->Learner:
    body = create_body(arch, pretrained, cut)
    head = custom_head or create_head(num_features, data.c, lin_ftrs, ps=ps, bn_final=bn_final)
    model = nn.Sequential(body, head)
    learn = Learner(data, model, **learn_kwargs)
    learn.split(ifnone(split_on, _resnet_split))
    if pretrained: learn.freeze()
    apply_init(model[1], nn.init.kaiming_normal_)
    return learn

I use this function instead of create_cnn, for example:

learn = my_create_cnn(data, models.resnet34, 1024, metrics=error_rate)

where num_features is 1024 for resnet34, 4096 for resnet50, and 1024 for resnet18.
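
For anyone wondering where those numbers come from: create_cnn computes the value as num_features_model(body) * 2, and the doubling is there because the fastai head concatenates average and max pooling over the body's output. The sketch below only reproduces that arithmetic; BODY_CHANNELS and head_input_features are hypothetical names made up for this illustration, and the channel counts are the standard torchvision ResNet widths.

```python
# Output channels of the convolutional body once the final pooling and
# classifier layers are cut off (standard torchvision ResNet widths).
BODY_CHANNELS = {"resnet18": 512, "resnet34": 512, "resnet50": 2048}

def head_input_features(arch_name: str) -> int:
    """Hypothetical helper: value to pass as num_features to my_create_cnn."""
    # The head concatenates average-pooled and max-pooled features,
    # so it sees twice as many channels as the body outputs.
    return BODY_CHANNELS[arch_name] * 2

for name in BODY_CHANNELS:
    print(name, head_input_features(name))
```

For an architecture that is not in the table, the body's channel count would have to be looked up separately before doubling it.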

(Spencer Uresk) #6

Thanks! Still trying to figure out the core problem, but this at least let me work around it temporarily.


Hi there,

I have tried using the fix posted by ZhekaMeka, but ran into another issue:

File “”, line 4
body = create_body(arch, pretrained, cut)
IndentationError: expected an indented block

How exactly did you manage to fix it?


That’s a Python error which, in this case, means that the body of the function needs to be indented. I fixed the formatting of my answer, so it should now work if you copy and paste it into a Jupyter notebook.
Hope that helps!

(Imran) #9

I extracted the code for “create_cnn” and added a pdb breakpoint, and this is where it’s stuck now:

from fastai.vision import *
from fastai.metrics import error_rate
import torch
bs = 64
path = untar_data(URLs.PETS); path
path_anno = path/'annotations'
path_img = path/'images'
fnames = get_image_files(path_img)
# Windows variant of the lesson's pattern: backslashes as path separators
pat = re.compile(r'\\([^\\]+)_\d+.jpg$')
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs).normalize(imagenet_stats)

import pdb

def _default_split(m:nn.Module): return (m[1],)
def _resnet_split(m:nn.Module): return (m[0][6],m[1])

_default_meta = {'cut':-1, 'split':_default_split}
_resnet_meta  = {'cut':-2, 'split':_resnet_split }

model_meta = {
    models.resnet18 :{**_resnet_meta}, models.resnet34: {**_resnet_meta},
    models.resnet50 :{**_resnet_meta}, models.resnet101:{**_resnet_meta}}

def cnn_config(arch):
    "Get the metadata associated with arch."
    torch.backends.cudnn.benchmark = True
    return model_meta.get(arch, _default_meta)

class Hook():
    "Create a hook on m with hook_func."
    def __init__(self, m:nn.Module, hook_func:HookFunc, is_forward:bool=True, detach:bool=True):
        self.hook_func,self.detach,self.stored = hook_func,detach,None
        f = m.register_forward_hook if is_forward else m.register_backward_hook
        self.hook = f(self.hook_fn)
        self.removed = False
    def hook_fn(self, module:nn.Module, input:Tensors, output:Tensors):
        "Applies hook_func to module, input, output."
        if self.detach:
            input  = (o.detach() for o in input ) if is_listy(input ) else input.detach()
            output = (o.detach() for o in output) if is_listy(output) else output.detach()
        self.stored = self.hook_func(module, input, output)
    def remove(self):
        "Remove the hook from the model."
        if not self.removed:
            self.hook.remove()
            self.removed=True
    def __enter__(self, *args): return self
    def __exit__(self, *args): self.remove()

class Hooks():
    "Create several hooks on the modules in ms with hook_func."
    def __init__(self, ms:Collection[nn.Module], hook_func:HookFunc, is_forward:bool=True, detach:bool=True):
        self.hooks = [Hook(m, hook_func, is_forward, detach) for m in ms]
    def __getitem__(self,i:int)->Hook: return self.hooks[i]
    def __len__(self)->int: return len(self.hooks)
    def __iter__(self): return iter(self.hooks)
    def stored(self): return [o.stored for o in self]
    def remove(self):
        "Remove the hooks from the model."
        for h in self.hooks: h.remove()
    def __enter__(self, *args): return self
    def __exit__ (self, *args): self.remove()

def dummy_batch(m: nn.Module, size:tuple=(64,64))->Tensor:
    "Create a dummy batch to go through m with size."
    ch_in = in_channels(m)
    pdb.set_trace()  # breakpoint added for debugging the hang
    return one_param(m).new(1, ch_in, *size).requires_grad_(False).uniform_(-1.,1.)

def dummy_eval(m:nn.Module, size:tuple=(64,64)):
    "Pass a dummy_batch in evaluation mode in m with size."
    return m.eval()(dummy_batch(m, size))

def _hook_inner(m,i,o): return o if isinstance(o,Tensor) else o if is_listy(o) else list(o)

def hook_outputs(modules:Collection[nn.Module], detach:bool=True, grad:bool=False)->Hooks:
    "Return Hooks that store activations of all modules in self.stored"
    return Hooks(modules, _hook_inner, detach=detach, is_forward=not grad)

def model_sizes(m:nn.Module, size:tuple=(64,64))->Tuple[Sizes,Tensor,Hooks]:
    "Pass a dummy input through the model m to get the various sizes of activations."
    with hook_outputs(m) as hooks:
        x = dummy_eval(m, size)
        return [o.stored.shape for o in hooks]

def num_features_model(m:nn.Module)->int:
    "Return the number of output features for model."
    sz = 64
    while True:
        try: return model_sizes(m, size=(sz,sz))[-1][1]
        except Exception as e:
            sz *= 2
            if sz > 2048: raise

def create_cnn1(data:DataBunch, arch:Callable, cut:Union[int,Callable]=None, pretrained:bool=True,
                lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
                custom_head:Optional[nn.Module]=None, split_on:Optional[SplitFuncOrIdxList]=None,
                bn_final:bool=False, **learn_kwargs:Any)->Learner:
    meta = cnn_config(arch)
    body = create_body(arch, pretrained, cut)
    nf = num_features_model(body) * 2
    head = custom_head or create_head(nf, data.c, lin_ftrs, ps=ps, bn_final=bn_final)
    model = nn.Sequential(body, head)
    learn = Learner(data, model, **learn_kwargs)
    learn.split(ifnone(split_on,meta['split']))
    if pretrained: learn.freeze()
    apply_init(model[1], nn.init.kaiming_normal_)
    return learn

learn = create_cnn1(data, models.resnet34, metrics=error_rate)
(Pdb) n
(5)dummy_batch()->tensor([[[[ 3…5320e-01]]]])
(Pdb) n
-> def __call__(self, *input, **kwargs):
(Pdb) n
-> for hook in self._forward_pre_hooks.values():
(Pdb) n
-> if torch._C._get_tracing_state():
(Pdb) n
-> result = self.forward(*input, **kwargs)
(Pdb) n

Execution does not proceed after this.
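
A possible alternative to stepping through pdb is to let Python's standard faulthandler module dump the stack once a call has been running too long. This is only a minimal sketch: run_with_hang_dump and the log file names are hypothetical, and the sleep call is a stand-in for the create_cnn call so that the snippet runs without fastai installed.

```python
import faulthandler
import os
import tempfile
import time

def run_with_hang_dump(fn, timeout=30.0, log_path=None):
    """Call fn(); if it is still running after `timeout` seconds, write the
    Python stack of every thread to `log_path`. The call itself keeps running."""
    if log_path is None:
        log_path = os.path.join(tempfile.gettempdir(), "hang_trace.log")
    with open(log_path, "w") as log:
        # Arm a watchdog that dumps all tracebacks once the timeout expires.
        faulthandler.dump_traceback_later(timeout, exit=False, file=log)
        try:
            return fn()
        finally:
            # Disarm the watchdog if fn() returned (or raised) in time.
            faulthandler.cancel_dump_traceback_later()

# Stand-in for a slow call. In the notebook this could instead wrap
# lambda: create_cnn(data, models.resnet34, metrics=error_rate)
demo_log = os.path.join(tempfile.gettempdir(), "hang_trace_demo.log")
result = run_with_hang_dump(lambda: time.sleep(0.5) or "done",
                            timeout=0.1, log_path=demo_log)
with open(demo_log) as f:
    print(f.read())  # stack captured while the call was still running
```

If the call never returns, the log file can be opened from another window after the timeout to see which Python frame it is sitting in.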

(Rohit Singh) #10

A workaround:

We don’t see this hang with PyTorch 1.0.0, only with PyTorch 1.0.1, so downgrading to 1.0.0 avoids it.
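
To check which version a notebook is actually running, something like the sketch below may help. It assumes only what is reported in this thread (hangs on Windows with 1.0.1, none with 1.0.0); is_reported_hanging_version is a hypothetical helper name, and the torch import is guarded so the snippet also runs where PyTorch is not installed.

```python
def is_reported_hanging_version(version: str) -> bool:
    """Hypothetical helper: True for the 1.0.1 release series, which is the
    one reported in this thread to hang in create_cnn on Windows."""
    release = version.split("+")[0]          # drop local tags such as "+cpu"
    return tuple(release.split(".")[:3]) == ("1", "0", "1")

try:
    import torch
    print(torch.__version__, is_reported_hanging_version(torch.__version__))
except ImportError:
    print("torch is not installed in this environment")
```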



Thank you for the quick response. I did copy the code into a notebook, but I get another error:


NameError Traceback (most recent call last)
----> 1 def _resnet_split(m:nn.Module): return (m[0][6],m[1])
2 def my_create_cnn(data:DataBunch, arch:Callable, num_features:int, cut:Union[int,Callable]=None, pretrained:bool=True, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
3 custom_head:Optional[nn.Module]=None, split_on:Optional[SplitFuncOrIdxList]=None,
4 bn_final:bool=False, **learn_kwargs:Any)->Learner:
5 body = create_body(arch, pretrained, cut)

NameError: name ‘nn’ is not defined

I looked up nn for Python and only found the tf library, but importing that did not help. Should I define nn myself, and if so, how?

I am new to Python, which makes it hard to follow what’s going on. Any suggestions? Thanks for the help!


I have now switched to PyTorch 1.0.0, and the original create_cnn runs to completion.

I also realized that my kernel was somehow stuck, and that the error about nn not being defined, which I mentioned above, could have just been a result of that.

Thanks for the help!


I’m having the same problem (Win10, GTX 1060). I have now installed PyTorch 1.0.0 via conda, and I’m getting a BrokenPipeError…

Does anyone know how to avoid that?

I was only trying out the very first tutorial code for fastai vision when I got these errors (which was kind of a bummer, since things are not working even in the most basic tutorial). Here’s the full code:

from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)

data = ImageDataBunch.from_folder(path)

learn = create_cnn(data, models.resnet18, metrics=accuracy)



I am using Python 3.6 and used the pip install command:

pip install /path/to/whl/file

I got the file from:

There I chose this file:

My assumption was that the 36 refers to the Python version, but that was just a hunch. It is working for me so far, though.
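
That hunch matches how wheel file names work: in a name such as torch-1.0.0-cp36-cp36m-win_amd64.whl, the cp36 part is the CPython version tag (3.6). A small sketch for checking a wheel name against the running interpreter follows; cpython_wheel_tag and wheel_matches_interpreter are hypothetical helpers, and they look only at the Python tag, not at the platform or CUDA build.

```python
import sys

def cpython_wheel_tag(version_info=sys.version_info) -> str:
    """Tag that CPython wheels use for this interpreter, e.g. 'cp36' for 3.6."""
    return "cp{}{}".format(version_info[0], version_info[1])

def wheel_matches_interpreter(wheel_name: str, version_info=sys.version_info) -> bool:
    """Hypothetical check: does the wheel's Python tag match this interpreter?"""
    return "-{}-".format(cpython_wheel_tag(version_info)) in wheel_name

print(cpython_wheel_tag())  # tag to look for in the wheel file name
print(wheel_matches_interpreter("torch-1.0.0-cp36-cp36m-win_amd64.whl"))
```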


OK, kind of weird. With PyTorch 1.0.0, I get the BrokenPipeError only when I use Spyder (beta version 4), but it works fine with Jupyter notebook. I’m also using Python 3.6, but with Anaconda.

Well, I guess I’ll use Jupyter notebook for fastai for now, though I hope there will be a proper solution and support for the whole thing in the future.

(Leon Dummer) #16

Did you check the RAM usage when you got the error? For me, the broken pipe error appears to happen when my machine runs out of RAM.

(Christpoher Toth) #17

I’m also stuck in the exact same place.
After running create_cnn everything freezes.
It works on WSL, but of course no GPU support is available there.

So disappointing to be right on the edge of being able to play with this and poof it just doesn’t work.

(Imran) #18

Please try PyTorch 1.0.0; it works on Windows with fastai.

(Christpoher Toth) #19

(fastai) C:\Users\Q\fastai>conda install pytorch==1.0.0
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  • pytorch==1.0.0

(fastai) C:\Users\Q\fastai>pip install torch==1.0.0
Collecting torch==1.0.0
Could not find a version that satisfies the requirement torch==1.0.0 (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==1.0.0

(Cody Quist) #20

Confirmed working solution

View your current version of CUDA by running the following in a command prompt (cmd, PowerShell, etc.):
nvcc --version

Use the wheel; don’t use conda.
Download this IF you are using CUDA 9.0; otherwise you will have to find the link on your own.

Make sure NOT to download PyTorch 0.4, but rather PyTorch 1.0.0.

After you download the PyTorch wheel for your CUDA version, just install it as a normal pip package:
pip install torch-1.0.0-cp36-cp36m-win_amd64.whl

(Peter Zacho) #21

Thanks for sharing - really helped me progress. Fix works fine for me. I’m using pytorch 1.0.0 deployed using conda on Windows 10 / CPU.