Recursion error; fastai v1.0.27, Windows 10


(Kevin Maher) #1

I am trying to run a simple intro model using fastai v1.0.27 on Windows 10 in the PyCharm IDE, and the learn.fit() method goes into infinite recursion. The code:

from fastai import *
from fastai.tabular import *
from fastai.text import *
from fastai.vision import *

PATH = "dogscats/"
sz = 224

def main():  # https://docs.fast.ai/
    data = ImageDataBunch.from_folder(PATH)
    learn = create_cnn(data, models.resnet18, metrics=accuracy)
    learn.fit(1)

if __name__ == '__main__':
    main()
    print('DONE!')

Response when run:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\data_block.py", line 378, in __getattr__
    res = getattr(self.x, k, None)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\data_block.py", line 378, in __getattr__
    res = getattr(self.x, k, None)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\data_block.py", line 378, in __getattr__
    res = getattr(self.x, k, None)
  [Previous line repeated 995 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

My system:

3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
Torch version: 0.4.1
Torch cuda available and enabled: True True
Cuda device: _CudaDeviceProperties(name='GeForce GTX 1080', major=6, minor=1, total_memory=8192MB, multi_processor_count=20)
Cuda version: 9.2
Torchvision version: 0.2.1
fastai version: 1.0.27

What am I leaving out or doing wrong that causes the recursion problem?

Vettejeep


(Kevin Maher) #2

I can change the error by setting the number of workers to zero.

data = ImageDataBunch.from_folder(PATH, num_workers=0)

But now I get:

Traceback (most recent call last):
  File "D:/$CatsDogs/cats.py", line 19, in <module>
    main()
  File "D:/$CatsDogs/cats.py", line 16, in main
    learn.fit(1)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 162, in fit
    callbacks=self.callbacks+callbacks)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 94, in fit
    raise e
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 82, in fit
    for xb,yb in progress_bar(data.train_dl, parent=pbar):
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastprogress\fastprogress.py", line 65, in __iter__
    for i,o in enumerate(self._gen):
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_data.py", line 47, in __iter__
    for b in self.dl:
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 314, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\torch_core.py", line 94, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 187, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 187, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 164, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 249 and 274 in dimension 2 at c:\programdata\miniconda3\conda-bld\pytorch_1533096106539\work\aten\src\th\generic/THTensorMath.cpp:3616

It now appears to be an image-size problem, but I don’t know how to set the image size. The code from the lesson 1 notebook does not seem to work in v1, or else I need to find out how the imports changed.
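Why the batch fails, as a sketch that uses NumPy arrays in place of image tensors: the default collate function stacks all images of a batch into one array, and stacking requires every image to have the same height and width, so photos of different sizes (249 vs 274 pixels in dimension 2 above) cannot share a batch until each is resized to a common size. Note that the `sz = 224` variable in the first snippet is never passed to the data bunch. In fastai v1 the target size is passed through the `size` argument, e.g. `ImageDataBunch.from_folder(PATH, ds_tfms=get_transforms(), size=224)` as in the v1 lesson notebooks; check the keyword names against your installed version. The `resize_nearest` and `collate` helpers below are hypothetical illustrations, not fastai functions.

```python
import numpy as np


def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of a (channels, height, width) array to (channels, size, size)."""
    _, h, w = img.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[:, rows[:, None], cols[None, :]]


def collate(images, size=None):
    """Stack images into one (batch, channels, height, width) array.

    Without a target size this is a plain stack, which fails on mixed image sizes.
    """
    if size is not None:
        images = [resize_nearest(img, size) for img in images]
    return np.stack(images, axis=0)


if __name__ == "__main__":
    batch = [np.zeros((3, 249, 300), dtype=np.float32),
             np.zeros((3, 274, 310), dtype=np.float32)]
    try:
        collate(batch)
    except ValueError as e:
        print("plain stack fails:", e)
    print(collate(batch, size=224).shape)  # (2, 3, 224, 224)
```

fastai applies its resizing as part of the dataset transforms, so by the time the DataLoader collates a batch every item already has the requested size.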

Vettejeep


(Kevin Maher) #3

If I run the example directly from the docs, I get a different error:

    path = untar_data(URLs.MNIST_SAMPLE)  # https://docs.fast.ai/
    data = ImageDataBunch.from_folder(path,num_workers=0)

    learn = create_cnn(data, models.resnet18, metrics=accuracy)
    learn.fit(1)

The error:

Traceback (most recent call last):
  File "D:/$CatsDogs/cats.py", line 41, in <module>
    main()
  File "D:/$CatsDogs/cats.py", line 38, in main
    learn.fit(1)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 162, in fit
    callbacks=self.callbacks+callbacks)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 94, in fit
    raise e
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 84, in fit
    loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 22, in loss_batch
    loss = loss_func(out, *yb)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py", line 1407, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.cuda.IntTensor for argument #2 'target'

Any ideas on how to get going with V1 in an IDE with Windows?

Thanks,
Vettejeep


(Jeremy Howard (Admin)) #4

I don’t know of anyone who has gotten it working on Windows yet.


(Tommaso Moroni) #5

I would like to add my experience with this bug.

On my machine it occurs if I run the example for preprocessing tabular data https://docs.fast.ai/tabular.html#Preprocessing-tabular-data.
The error occurs when I run (cat_x,cont_x),y = next(iter(data.train_dl)).

My system:

=== Software ===
python version : 3.7.1
fastai version : 1.0.27
torch version  : 0.4.1
torch cuda ver
torch cuda is  : **Not available**

=== Hardware ===
No GPUs available

=== Environment ===
platform       : Windows-10-10.0.17134-SP0
conda env      : ai
python         : C:\Users\Tommaso\Miniconda3\envs\ai\python.exe
sys.path       :
C:\Users\Tommaso\Miniconda3\envs\ai\python37.zip
C:\Users\Tommaso\Miniconda3\envs\ai\DLLs
C:\Users\Tommaso\Miniconda3\envs\ai\lib
C:\Users\Tommaso\Miniconda3\envs\ai
C:\Users\Tommaso\Miniconda3\envs\ai\lib\site-packages
C:\Users\Tommaso\Miniconda3\envs\ai\lib\site-packages\IPython\extensions
no supported gpus found on this system

I also get the error with fastai >= 1.0.24. I just tried version 1.0.22 and everything works fine (after accounting for the API changes to TabularDataBunch.from_df).

Please let me know if I can be of any assistance.

Tommaso


(rares t) #6

Can you please provide a guide for installing fastai v1 on Windows 10?


#7

As explained here, there is no official fastai v1 support for Windows yet.


(Tommaso Moroni) #8

Hi,
I have managed to make it work, but since I have very limited experience with Python, I would like to share the problem I found and my proposed solution with you. Would a pull request be a better way to do this?

Thanks
Tommaso


(s.s.o) #9

I know fastai does not support Windows 10 yet, but an unofficial PyTorch 1.0 is now available. I used the latest dev version, 1.0.29, with Python 3.7. When I run the tabular data sample I get the recursion error too.

The (cat_x,cont_x),y = next(iter(data.train_dl)) line causes the problem:

Traceback (most recent call last):
  File "d:\conda3\lib\site-packages\fastai\data_block.py", line 423, in __getattr__
    res = getattr(self.x, k, None)
  File "d:\conda3\lib\site-packages\fastai\data_block.py", line 423, in __getattr__
    res = getattr(self.x, k, None)
  File "d:\conda3\lib\site-packages\fastai\data_block.py", line 423, in __getattr__
    res = getattr(self.x, k, None)
  [Previous line repeated 996 more times]


#10

I don’t think this is specific to Windows 10; it looks like a general Windows issue.

All the examples were working for me on Windows 8.1 (as long as I set num_workers=0) before the course started. As noted above, the tabular example stopped working after v1.0.22, and in v1.0.27 the vision learner started throwing

RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #2 'target'

in the dogs_and_cats example.

I hadn’t noticed until Lesson 6 because everything else was still working. When someone has a fix, could they post it here?


(Yevgeny Menaker) #11

Hello,
I had to patch data_block.py to make it work. The issue is not really Windows 10 related; it happens because self.x and self.y are accessed from within the __getattr__() method.

As mentioned in this Stack Overflow post, attribute access needs to be done through self.__dict__ when inside __getattr__().

@Tommaso, is this what you did too, or did you make a different change?

This is the new implementation of __getattr__ in the LabelList class:

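def __getattr__(self, k:str)->Any:
    # Read x and y via __dict__ so a missing attribute (e.g. while unpickling) cannot re-enter __getattr__
    res = getattr(self.__dict__.get('x'), k, None)
    return res if res is not None else getattr(self.__dict__.get('y'), k)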

And the similar change for __getattr__ in the ItemLists class:

def __getattr__(self, k):
    ft = getattr(self.__dict__['train'], k)
    if not isinstance(ft, Callable): return ft
    fv = getattr(self.__dict__['valid'], k)
    assert isinstance(fv, Callable)
    def _inner(*args, **kwargs):
        self.train = ft(*args, **kwargs)
        assert isinstance(self.__dict__['train'], LabelList)
        self.valid = fv(*args, **kwargs)
        self.__class__ = LabelLists
        self.process()
        return self
    return _inner
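Why the original version recurses, as a minimal reproduction: the first traceback in this thread ends in pickle.load inside multiprocessing\spawn.py. On Windows, DataLoader worker processes are started with the spawn method, so the dataset is pickled in the parent and rebuilt in each worker (which is also why num_workers=0 makes that error disappear). Unpickling creates the object without running `__init__` and then looks up `__setstate__` on the still-empty instance. With the original `__getattr__`, that lookup evaluates `self.x`; since `x` is not in `__dict__` yet, `__getattr__('x')` runs, evaluates `self.x` again, and so on until the recursion limit. The classes below are hypothetical toy stand-ins, not fastai's `LabelList`.

```python
import pickle
from typing import Any


class BrokenLabelList:
    """Hypothetical toy class that forwards unknown attributes to x, then y."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getattr__(self, k: str) -> Any:
        # Evaluating self.x is itself an attribute lookup. If 'x' is not in the
        # instance __dict__ (as during unpickling), it re-enters __getattr__('x')
        # again and again until Python raises RecursionError.
        res = getattr(self.x, k, None)
        return res if res is not None else getattr(self.y, k)


class FixedLabelList:
    """Same toy class, using the __dict__-based lookup posted above."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getattr__(self, k: str) -> Any:
        # __dict__ is found by normal lookup, so this never re-enters __getattr__.
        # A missing x or y becomes None, so pickle's probe for __setstate__ ends
        # in a plain AttributeError, which pickle treats as "not defined".
        res = getattr(self.__dict__.get('x'), k, None)
        return res if res is not None else getattr(self.__dict__.get('y'), k)


def roundtrip(obj):
    """Pickle and unpickle obj, as spawn-based DataLoader workers effectively do."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    try:
        roundtrip(BrokenLabelList([1, 2, 3], ["a", "b", "c"]))
    except RecursionError:
        print("BrokenLabelList: RecursionError")
    fixed = roundtrip(FixedLabelList([1, 2, 3], ["a", "b", "c"]))
    print(fixed.x, fixed.y, fixed.count(2))  # [1, 2, 3] ['a', 'b', 'c'] 1
```

Pickling succeeds in both cases; only the unpickling side recurses, which matches the traceback starting in the worker's pickle.load.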

(Yevgeny Menaker) #12

Hey @cudawarped,
I also ran into a bunch of those on Windows 10, and one of the fastai discussions mentions that this is indeed Windows-related and that the variables need to be explicitly converted to long by calling the .long() method.

Here is the change for the target parameter. In my case it was line 177 of layers.py (I am using the latest clone of the fastai GitHub repo):

return self.func.__call__(input, target.view(-1).long(), **kwargs)

I also had to change it in metrics.py, line 39:

return (input==targs.long()).float().mean()

There may be more spots; I'll discover them as I go.
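The layers.py patch casts the target inside the loss call, so every caller is covered whatever dtype the labels arrive with. A framework-free sketch of the same idea, assuming NumPy in place of PyTorch: `strict_cross_entropy` mimics the dtype check behind the "Expected object of type ... LongTensor" error by accepting only int64 class indices, and the hypothetical `CastTargetLoss` wrapper flattens and casts the target before delegating, analogous to `target.view(-1).long()` in the patched line.

```python
import numpy as np


def strict_cross_entropy(logits: np.ndarray, target: np.ndarray) -> float:
    """Mean cross-entropy over the rows of logits.

    Accepts only int64 class indices, mimicking the check PyTorch's nll_loss
    applied to its 'target' argument.
    """
    if target.dtype != np.int64:
        raise TypeError(f"expected int64 target, got {target.dtype}")
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(target)), target].mean())


class CastTargetLoss:
    """Hypothetical wrapper: flatten the target and cast it to int64, then call func."""

    def __init__(self, func):
        self.func = func

    def __call__(self, logits, target, **kwargs):
        return self.func(logits, target.reshape(-1).astype(np.int64), **kwargs)


if __name__ == "__main__":
    logits = np.zeros((3, 2))                     # uniform scores over 2 classes
    target = np.array([0, 1, 1], dtype=np.int32)  # int32 labels, as seen on Windows here
    try:
        strict_cross_entropy(logits, target)
    except TypeError as e:
        print(e)
    print(CastTargetLoss(strict_cross_entropy)(logits, target))  # about 0.6931, i.e. log(2)
```

Casting at the loss boundary is a catch-all, but each consumer of the labels (such as the accuracy metric) still needs the same treatment, which is why metrics.py had to be patched as well.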

HTH


(Tommaso Moroni) #13

Hi @quantotto,
I tried to fix exactly the same part of the __getattr__() implementation, but I am not proficient enough in Python, so I stopped working on it.
From what I saw on various forums and online resources, infinite recursion in __getattr__() happens very often when the method does not handle missing attributes correctly.
Glad that you came up with a solution!
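One common way to make a forwarding `__getattr__` handle missing attributes safely, shown as an alternative sketch rather than what was posted in this thread: refuse special (dunder) names and the name of the forwarding target itself by raising `AttributeError` immediately, so neither pickle's `__setstate__` probe nor a missing `x` can recurse. The class name `Forwarder` is hypothetical.

```python
import pickle
from typing import Any


class Forwarder:
    """Hypothetical container that forwards unknown attributes to self.x."""

    def __init__(self, x):
        self.x = x

    def __getattr__(self, k: str) -> Any:
        # __getattr__ only runs after normal lookup has failed. If the name is 'x'
        # itself, or a special name such as '__setstate__', raise right away
        # instead of touching self.x, which would re-enter this method.
        if k == "x" or (k.startswith("__") and k.endswith("__")):
            raise AttributeError(k)
        return getattr(self.x, k)


if __name__ == "__main__":
    f = pickle.loads(pickle.dumps(Forwarder([3, 1, 2])))
    print(f.x, f.index(1))  # [3, 1, 2] 1
```

The trade-off is that dunder lookups are never forwarded, which is usually what a container wants anyway.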

Tommaso


(Yevgeny Menaker) #14

Cool, glad it helps!


#15

Hey @quantotto

Thank you for your suggestions, that is a great help.

Regarding the .long() conversion, I proposed a small change in this pull request, which seemed to work for me.

Alternatively, to avoid hacking the library, I found that

data.train_ds.y.items = np.int64(data.train_ds.y.items)
data.valid_ds.y.items = np.int64(data.valid_ds.y.items)

also works.
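Why this cast helps: in NumPy releases of that era (before 2.0), NumPy's default integer type followed the C `long`, which is 32 bits on Windows but 64 bits on Linux and macOS. Integer labels stored with the default dtype therefore came out as int32 on Windows, ended up as an `IntTensor`, and the loss rejected them where it expected a `LongTensor`. Forcing int64 when the labels are created sidesteps the platform difference. A minimal sketch, assuming the labels are held in a NumPy array (`as_int64_labels` is a hypothetical helper):

```python
import numpy as np


def as_int64_labels(labels) -> np.ndarray:
    """Return class labels as an int64 array, whatever the platform's default int is."""
    return np.asarray(labels, dtype=np.int64)


if __name__ == "__main__":
    # Use int32 explicitly so the demo reproduces the old Windows default on any OS.
    items = np.array([0, 1, 1, 0], dtype=np.int32)
    print(items.dtype, "->", as_int64_labels(items).dtype)  # int32 -> int64
    # np.int64(array), as used above, performs the same conversion.
    print(np.int64(items).dtype)  # int64
```

Converting once at the data level keeps the library untouched, at the cost of having to remember the cast for every new dataset.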


(Yevgeny Menaker) #16

Thanks a lot, @cudawarped! External change is even better. I’ll keep that in mind for my tests.


(s.s.o) #17

After quantotto's suggested changes I am able to run the tabular data sample. Thank you all!