Sum_cpus causing a NameError

I am having an error using sum_cpus.

Since I see that sum_geom also exists in core.py, I thought I would try running sum_geom as well to check that the functions in core.py are loaded. It works fine, but sum_cpus does not. Can anyone point me to an understanding of this?

[screenshot: the error traceback from the notebook]
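
To narrow it down, here is a quick check (a minimal sketch, assuming the course repo’s fastai package is importable in this environment - adjust the import if your layout differs):

import inspect
import fastai.core as core

# which core.py is this environment actually loading?
print(inspect.getsourcefile(core))

# does the loaded module define the names in question?
print(hasattr(core, 'num_cpus'), hasattr(core, 'sum_geom'))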

Also, when the darknet notebook calls ConvLearner.from_model_data it causes an error. I can see that Learner.from_model_data exists in my copy of learner.py, and that ConvLearner is subclassed from Learner, so perhaps this is evidence of some sort of memory corruption?

[screenshot: the error traceback from ConvLearner.from_model_data]
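
And a quick sanity check on the inheritance side (again just a sketch, assuming the same fastai package is on the path):

import inspect
from fastai.conv_learner import ConvLearner
from fastai.learner import Learner

print(ConvLearner.__mro__)                  # should list Learner as a base class
print(hasattr(Learner, 'from_model_data'))  # the classmethod ConvLearner inherits
print(inspect.getsourcefile(Learner))       # which learner.py is actually loaded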

When is the last time you did a git pull? Also, in your title and first paragraph, I believe you are asking about num_cpus(), correct?

Hi Kevin

I do a git pull daily, but not a conda env update, as I am a bit wary of what might happen - will I get the wrong version of Pytorch?

Yes, I am asking about num_cpus() causing an error, and I have thrown in the additional fact that ConvLearner can’t locate one of the functions of its superclass. This is the same notebook, and maybe related behaviour - perhaps something isn’t chaining correctly…

I have rebooted my PC, to eliminate the possibility of some temporary glitch. But it seems to be related to the fastai conda environment - I can run the code in my base environment fine.

UPDATE:
I tried a conda env update in the fastai directory (after activating the fastai environment) - the only module that needed upgrading was parso-0.1.1 to parso-0.2.0. Unfortunately it made no difference.

There is a difference in pytorch versions between base and fastai on my system - in base I have 0.3.0b0+591e73e, but in fastai I have 0.3.1.post2
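
For reference, a check along these lines is how I compared the two environments (plain Python, nothing fastai-specific):

import sys
import torch

print(sys.executable)     # confirms which conda env the kernel is using
print(torch.__version__)  # 0.3.0b0+591e73e in base, 0.3.1.post2 in fastai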

Hi @KevinB, and @jeremy if you have time (sorry to try to drag you in, but this might be a simple thing for you and nobody else has helped!). This is still happening: I cannot run things properly from my fast.ai environment - the num_cpus() error is an example of it, but so is the problem locating a method in the superclass. By contrast, I can run things outside of the environment, where I also have Pytorch installed, albeit an earlier version (see above).

I am about to re-create my fast.ai environment, but wonder if there is a known problem I could fix before I do so…

Can you share your notebook? I am not sure what the problem could be. Re-creating your fast.ai environment is probably a good start.

Please don’t do that.

Thanks Kevin

There is little point in sharing the notebook - it happens in all notebooks, and the one in which I first saw it is the standard cifar10-darknet notebook.

I will re-create the fast.ai environment, and apologies to Jeremy - he wasn’t happy about my attempt to elicit his help!

What OS are you using, Chris?

Hi Arvind

I am using Windows 10, and because I have an older GPU (CUDA 5 compatible) I have restricted myself to Pytorch versions which will play nicely with it.

In my working system, which is just my base Anaconda Python 3.6, I have Pytorch version 0.3.0b0+591e73e - which is the peterjc123 version of 0.3.0.

On the non-working fastai system I have 0.3.1.post2 - which was a .whl from the peterjc123 site - that allows the use of the older GPU with a more up-to-date Pytorch. This has worked, but maybe no longer does.

I wonder if we can upgrade to Pytorch 0.4.0? It has put the CUDA 5 compatibility back in…

Understood. How about using the standard fastai conda install method?
We have official Windows support in Pytorch v0.4, and CUDA 5 compatibility as well.
If it doesn’t work, you can wipe out the conda env and we can see what can be done next.

Hi Arvind

When you say that “we have official windows support in Pytorch v0.4”, are you speaking from the perspective of Pytorch or of fast.ai? I ask because I can see that there are some architectural / “breaking” changes in Pytorch 0.4 and I suspect that fast.ai might not yet be compatible with them, and I see in the fast.ai environment.yml a directive for - pytorch<0.4.

Because of that directive, I think that if I install using the standard fast.ai conda install it will fetch the standard 0.3.1.post2 - which flat-out rejects my GPU and doesn’t run, which is why I resorted to the version from peterjc123. Is this a correct understanding?

That’s right, Chris. I meant from the Pytorch perspective.
So you can install default fast.ai env and inside it, upgrade pytorch to 0.4 and see if it works in the new env.
Jeremy mentioned elsewhere that he’s compiling his master version from source and pytorch 0.4 was working fine.

Great - thanks for the update :smile:

Hi Arvind

Unfortunately, despite removing the previous Pytorch and properly upgrading to 0.4.0, I seem to have the old difficulties with Pytorch recognizing my GPU that caused me to go for the peterjc123 version in the first place.

e.g.

import torch

print(torch.__version__)
0.4.0

torch.cuda.is_available()
True

torch
<module 'torch' from 'D:\\Anaconda3\\envs\\fastai\\lib\\site-packages\\torch\\__init__.py'>

torch.cuda.set_device(0)

torch.cuda.get_device_capability(0)
D:\Anaconda3\envs\fastai\lib\site-packages\torch\cuda\__init__.py:116: UserWarning: 
    Found GPU0 GeForce GTX 650 Ti which is of cuda capability 3.0.
    PyTorch no longer supports this GPU because it is too old.
    
  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
(3, 0)

x = torch.cuda.FloatTensor(1)
x
tensor([ 0.], device='cuda:0')


y = torch.FloatTensor(1).cuda()
y
tensor([ 0.], device='cuda:0')

x + y
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-cd60f97aa77f> in <module>()
----> 1 x + y

RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at c:\programdata\miniconda3\conda-bld\pytorch_1524549877902\work\aten\src\thc\generic/THCTensorMathPointwise.cu:265

I think I need to go back to a version from peterjc123 - my GPU’s compute capability is simply too old - it’s 3.0, not 5.0! He has a 0.4.0 version…
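
For anyone else with an older card, a minimal check (plain Pytorch, nothing fastai-specific, and just a sketch) that separates “CUDA is visible” from “the binaries actually contain kernels for this GPU”:

import torch

print(torch.__version__, torch.cuda.is_available())
major, minor = torch.cuda.get_device_capability(0)
print('compute capability %d.%d' % (major, minor))

try:
    # allocation succeeds even on unsupported cards; arithmetic is what
    # actually launches a kernel and fails with "no kernel image"
    x = torch.cuda.FloatTensor(1)
    _ = x + x
    print('kernel launch OK')
except RuntimeError as e:
    print('kernel launch failed:', e)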

Oh I see! Sorry, Chris. Please let us know if the peterjc123 version works for you.

Hi Arvind - yes, it works as far as doing standard Pytorch things goes - I don’t get the errors I described in my previous post. However, it has made no difference to the problems I am seeing in fast.ai, so I’m going to remove the environment and start from scratch.

Can you help me understand how to resolve something that I saw upon installation of the peterjc123 .whl - and tell me if it might have anything to do with the odd behaviour with fast.ai? Can I safely upgrade html5lib and downgrade regex - and will this happen automatically if I upgrade spaCy?

pip install torch-0.4.0a0+38aaa63-cp36-cp36m-win_amd64.whl
Processing d:\fastai\torch-0.4.0a0+38aaa63-cp36-cp36m-win_amd64.whl
spacy 2.0.8 has requirement html5lib==1.0b8, but you'll have html5lib 0.9999999 which is incompatible.
spacy 2.0.8 has requirement regex==2017.4.5, but you'll have regex 2017.11.9 which is incompatible.
Installing collected packages: torch
Successfully installed torch-0.4.0a0+38aaa63

AFAIK, we use HTML in lesson 4 and spaCy for tokenization. It shouldn’t matter even if your version is slightly older or newer, as we only use them for the core benefits those libraries provide.
Worst case, you can pip install the specific version within your env.
Let us know if that works for you.
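
If it helps, a quick way to see what actually ended up installed in the env (a sketch using plain setuptools, nothing fastai-specific):

import pkg_resources

# spaCy 2.0.8 pins html5lib==1.0b8 and regex==2017.4.5; compare with what is installed
for pkg in ('spacy', 'html5lib', 'regex'):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')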

Thanks - I’ll check it out after I’ve got the rest of it sorted!

I have recreated the fastai environment, uninstalled the Pytorch it put in place, then pip installed the working peterjc123 version 0.4.0. I got past the call to num_cpus() without an error, but I am still getting the problem with ConvLearner not seeing the data_path attribute in the parent class - I think this is a genuine bug - see below:

learn = ConvLearner.from_model_data(m, data)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-bb021e2adc7e> in <module>()
----> 1 learn = ConvLearner.from_model_data(m, data)

d:\FASTAI\fastai\fastai\learner.py in from_model_data(cls, m, data, **kwargs)
     44     @classmethod
     45     def from_model_data(cls, m, data, **kwargs):
---> 46         self = cls(data, BasicModel(to_gpu(m)), **kwargs)
     47         self.unfreeze()
     48         return self

d:\FASTAI\fastai\fastai\conv_learner.py in __init__(self, data, models, precompute, **kwargs)
     95     def __init__(self, data, models, precompute=False, **kwargs):
     96         self.precompute = False
---> 97         super().__init__(data, models, **kwargs)
     98         if hasattr(data, 'is_multi') and not data.is_reg and self.metrics is None:
     99             self.metrics = [accuracy_thresh(0.5)] if self.data.is_multi else [accuracy]

d:\FASTAI\fastai\fastai\learner.py in __init__(self, data, models, opt_fn, tmp_name, models_name, metrics, clip, crit)
     35         self.opt_fn = opt_fn or SGD_Momentum(0.9)
     36         self.tmp_path = tmp_name if os.path.isabs(tmp_name) else os.path.join(self.data.path, tmp_name)
---> 37         self.models_path = models_name if os.path.isabs(models_name) else os.path.join(self.data_path, models_name)
     38         os.makedirs(self.tmp_path, exist_ok=True)
     39         os.makedirs(self.models_path, exist_ok=True)

AttributeError: 'ConvLearner' object has no attribute 'data_path'

The ConvLearner declaration is class ConvLearner(Learner):

Learner contains a reference to self.data_path - is this a bug - should it read self.data.path as in the line above it in the Learner class definition? I edited the class definition and changed this to self.data.path, and I no longer get an error…

class Learner():
    def __init__(self, data, models, opt_fn=None, tmp_name='tmp', models_name='models', metrics=None, clip=None, crit=None):
        """
        Combines a ModelData object with a nn.Module object, such that you can train that
        module.
        data (ModelData): An instance of ModelData.
        models(module): chosen neural architecture for solving a supported problem.
        opt_fn(function): optimizer function, uses SGD with Momentum of .9 if none.
        tmp_name(str): output name of the directory containing temporary files from training process
        models_name(str): output name of the directory containing the trained model
        metrics(list): array of functions for evaluating a desired metric. Eg. accuracy.
        clip(float): gradient clip chosen to limit the change in the gradient to prevent exploding gradients Eg. .3
        """
        self.data_,self.models,self.metrics = data,models,metrics
        self.sched=None
        self.wd_sched = None
        self.clip = None
        self.opt_fn = opt_fn or SGD_Momentum(0.9)
        self.tmp_path = tmp_name if os.path.isabs(tmp_name) else os.path.join(self.data.path, tmp_name)
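        # BUG: there is no self.data_path attribute (the data is stored in self.data_); as on the line above, this should be self.data.path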
        self.models_path = models_name if os.path.isabs(models_name) else os.path.join(self.data_path, models_name)
        os.makedirs(self.tmp_path, exist_ok=True)
        os.makedirs(self.models_path, exist_ok=True)
        self.crit = crit if crit else self._get_crit(data)
        self.reg_fn = None
        self.fp16 = False

Yes, indeed there is a bug - until it is fixed, you can revert the file to a previous commit with git checkout b87dd295073d316a75f1d2e0cc6546aef0f1fbdc -- fastai/learner.py to get a version that works.
To find a commit id, you can install gitk, run gitk fastai/learner.py, and choose whatever commit you like - I personally chose the last one made by Jeremy.

Thanks Alessa :slight_smile: I see there is a pull request for this bug at the moment!