Recursion error; fastai v1.0.27, Windows 10

Vettejeep · November 17, 2018, 9:30pm

I am trying to run a simple intro model using fastai v1.0.27 on Windows 10 in the PyCharm IDE, and the learn.fit() method goes into infinite recursion. The code:

from fastai import *
from fastai.tabular import *
from fastai.text import *
from fastai.vision import *

PATH = "dogscats/"
sz = 224

def main():  # https://docs.fast.ai/
    data = ImageDataBunch.from_folder(PATH)
    learn = create_cnn(data, models.resnet18, metrics=accuracy)
    learn.fit(1)

if __name__ == '__main__':
    main()
    print('DONE!')

Response when run:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\data_block.py", line 378, in __getattr__
    res = getattr(self.x, k, None)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\data_block.py", line 378, in __getattr__
    res = getattr(self.x, k, None)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\data_block.py", line 378, in __getattr__
    res = getattr(self.x, k, None)
  [Previous line repeated 995 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

My system:

3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
Torch version: 0.4.1
Torch cuda available and enabled: True True
Cuda device: _CudaDeviceProperties(name='GeForce GTX 1080', major=6, minor=1, total_memory=8192MB, multi_processor_count=20)
Cuda version: 9.2
Torchvision version: 0.2.1
fastai version: 1.0.27

What am I leaving out or doing wrong to cause the recursion problem?

Vettejeep

Vettejeep · November 18, 2018, 3:36am

I can change the error by setting the number of workers to zero.

data = ImageDataBunch.from_folder(PATH, num_workers=0)

But now I get:

    Traceback (most recent call last):
    File "D:/$CatsDogs/cats.py", line 19, in <module>
    main()
    File "D:/$CatsDogs/cats.py", line 16, in main
    learn.fit(1)
    File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 162, in fit
    callbacks=self.callbacks+callbacks)
    File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 94, in fit
    raise e
    File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 82, in fit
    for xb,yb in progress_bar(data.train_dl, parent=pbar):
    File "C:\ProgramData\Anaconda3\lib\site-packages\fastprogress\fastprogress.py", line 65, in __iter__
    for i,o in enumerate(self._gen):
    File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_data.py", line 47, in __iter__
    for b in self.dl:
    File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 314, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
    File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\torch_core.py", line 94, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
    File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 187, in default_collate
    return [default_collate(samples) for samples in transposed]
    File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 187, in <listcomp>
    return [default_collate(samples) for samples in transposed]
    File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 164, in default_collate
    return torch.stack(batch, 0, out=out)
    RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 249 and 274 in dimension 2 at c:\programdata\miniconda3\conda-bld\pytorch_1533096106539\work\aten\src\th\generic/THTensorMath.cpp:3616

It appears to be an image size problem now but I don’t know how to set the image size. The code from the lesson 1 notebook does not seem to work in V1, or I need to find out how the import changed.
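A minimal sketch of why the collate step fails, using NumPy's np.stack as a stand-in for torch.stack: a batch can only be stacked once every item has the same shape. The array shapes and the center_crop helper are made up for illustration and are not fastai code. In fastai v1 the resizing is normally requested when the data bunch is built (the documented examples pass a size argument together with ds_tfms), but treat that as an assumption to check against the installed version.

```python
import numpy as np

# Two fake "images" in channels-first layout with different heights,
# standing in for unresized photos (hypothetical shapes).
a = np.zeros((3, 249, 300))
b = np.zeros((3, 274, 300))


def try_stack(items):
    """Return the stacked batch, or None if the item shapes disagree."""
    try:
        return np.stack(items, axis=0)
    except ValueError:
        return None


def center_crop(img, size):
    """Crop a (C, H, W) array to (C, size, size) around its centre."""
    _, h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[:, top:top + size, left:left + size]


# Stacking the raw items fails because the heights differ.
unsized = try_stack([a, b])

# Once every item has the same size, the batch can be built.
batch = try_stack([center_crop(i, 224) for i in (a, b)])

print(unsized is None, batch.shape)
```

Cropping is used here only because it needs no interpolation; a real pipeline would more likely resize.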

Vettejeep

Vettejeep · November 18, 2018, 3:51am

If I run the example directly from the docs, I get a different error:

    path = untar_data(URLs.MNIST_SAMPLE)  # https://docs.fast.ai/
    data = ImageDataBunch.from_folder(path,num_workers=0)

    learn = create_cnn(data, models.resnet18, metrics=accuracy)
    learn.fit(1)

The error:

Traceback (most recent call last):
  File "D:/$CatsDogs/cats.py", line 41, in <module>
    main()
  File "D:/$CatsDogs/cats.py", line 38, in main
    learn.fit(1)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 162, in fit
    callbacks=self.callbacks+callbacks)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 94, in fit
    raise e
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 84, in fit
    loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py", line 22, in loss_batch
    loss = loss_func(out, *yb)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py", line 1407, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.cuda.IntTensor for argument #2 'target'

Any ideas on how to get going with V1 in an IDE with Windows?

Thanks,
Vettejeep

jeremy · November 18, 2018, 4:38am

I don’t know of anyone who has gotten it working on Windows yet.

Tommaso · November 18, 2018, 5:59am

I would like to add my experience of this bug.

On my machine it occurs if I run the example for preprocessing tabular data https://docs.fast.ai/tabular.html#Preprocessing-tabular-data.
The error occurs when I run (cat_x,cont_x),y = next(iter(data.train_dl)).

My system:

=== Software ===
python version : 3.7.1
fastai version : 1.0.27
torch version  : 0.4.1
torch cuda ver
torch cuda is  : **Not available**

=== Hardware ===
No GPUs available

=== Environment ===
platform       : Windows-10-10.0.17134-SP0
conda env      : ai
python         : C:\Users\Tommaso\Miniconda3\envs\ai\python.exe
sys.path       :
C:\Users\Tommaso\Miniconda3\envs\ai\python37.zip
C:\Users\Tommaso\Miniconda3\envs\ai\DLLs
C:\Users\Tommaso\Miniconda3\envs\ai\lib
C:\Users\Tommaso\Miniconda3\envs\ai
C:\Users\Tommaso\Miniconda3\envs\ai\lib\site-packages
C:\Users\Tommaso\Miniconda3\envs\ai\lib\site-packages\IPython\extensions
no supported gpus found on this system

I also got the error with fastai >= 1.0.24. I just tried with version 1.0.22 and everything works fine (of course taking account of API changes to TabularDataBunch.from_df).

Please let me know if I can be of any assistance.

Tommaso

Rares · November 23, 2018, 2:59pm

Can you please provide a guide for installing fastai v1 on Windows 10?

sgugger · November 23, 2018, 3:14pm

As explained here there is no official support for fastai v1 for Windows yet.

Tommaso · November 24, 2018, 5:48am

Hi,
I have managed to make it work, but since I have very limited experience with Python, I would like to share the problem I found and my proposed solution. Would a pull request be a better way to do this?

Thanks
Tommaso

s.s.o · November 24, 2018, 10:44am

I know fastai does not support Windows 10 yet. Now that an unofficial PyTorch 1.0 is available, I used the latest dev version 1.0.29 with Python 3.7. When I run the tabular data sample, I get the recursion error too.

The (cat_x,cont_x),y = next(iter(data.train_dl)) line causes the problem:

Traceback (most recent call last):
  File "d:\conda3\lib\site-packages\fastai\data_block.py", line 423, in __getattr__
    res = getattr(self.x, k, None)
  File "d:\conda3\lib\site-packages\fastai\data_block.py", line 423, in __getattr__
    res = getattr(self.x, k, None)
  File "d:\conda3\lib\site-packages\fastai\data_block.py", line 423, in __getattr__
    res = getattr(self.x, k, None)
  [Previous line repeated 996 more times]

cudawarped · November 30, 2018, 9:54am

I don’t think this is specific to Windows 10; it looks like a general Windows issue.

All the examples were working for me on Windows 8.1 (as long as I set num_workers=0) before the course started. And as noted above the tabular example stopped working in v1.0.22 and the vision learner started throwing the

RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #2 'target'

in the dogs_and_cats example in v 1.0.27.

I hadn’t noticed until Lesson 6 because everything else was still working. When someone has a fix can they post it here?

quantotto · December 5, 2018, 3:52am

Hello,
I had to patch data_block.py to make it work. The issue is not really Windows 10 related. It happens because ‘self.x’ and ‘self.y’ are accessed from within the __getattr__() method.

As mentioned in this Stack Overflow post, attribute access needs to be done through __dict__ when inside __getattr__().

@Tommaso, is this what you did as well, or did you make a different change?

This is the new implementation of __getattr__ for the LabelList class:

def __getattr__(self,k:str)->Any:
    res = getattr(self.__dict__.get('x'), k, None)
    return res if res is not None else getattr(self.__dict__.get('y'), k)

A similar change for __getattr__ of the ItemLists class:

def __getattr__(self, k):
    ft = getattr(self.__dict__['train'], k)
    if not isinstance(ft, Callable): return ft
    fv = getattr(self.__dict__['valid'], k)
    assert isinstance(fv, Callable)
    def _inner(*args, **kwargs):
        self.train = ft(*args, **kwargs)
        assert isinstance(self.__dict__['train'], LabelList)
        self.valid = fv(*args, **kwargs)
        self.__class__ = LabelLists
        self.process()
        return self
    return _inner
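A minimal sketch of why reading the attribute through __dict__ avoids the recursion, with no fastai involved. Unpickling, which is what the spawned worker processes do on Windows, creates the object without calling __init__ and then looks up __setstate__ on it. At that point 'x' is not set yet, so a __getattr__ that touches self.x calls itself until the recursion limit is hit. The class names Broken and Patched and the probe helper are hypothetical and exist only for this illustration.

```python
class Broken:
    """Delegates unknown attributes to self.x, like the original method."""

    def __init__(self, x):
        self.x = x

    def __getattr__(self, k):
        # If 'x' is not in __dict__ yet, self.x re-enters __getattr__('x').
        return getattr(self.x, k)


class Patched:
    """Same delegation, but reads 'x' from __dict__ as in the patch."""

    def __init__(self, x):
        self.x = x

    def __getattr__(self, k):
        # With 'x' missing this becomes getattr(None, k), which raises
        # AttributeError instead of recursing.
        return getattr(self.__dict__.get('x'), k)


def probe(cls):
    """Mimic unpickling: build the object without __init__, then look
    up __setstate__ the way pickle does before restoring state."""
    obj = cls.__new__(cls)
    try:
        return getattr(obj, '__setstate__', 'absent')
    except RecursionError:
        return 'recursion'


print(probe(Broken))   # recursion
print(probe(Patched))  # absent
```

Once the object is fully constructed, both classes delegate in the same way; only the half-built state differs.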

quantotto · December 5, 2018, 4:06am

Hey @cudawarped,
I also ran into a bunch of those on Windows 10. One of the fastai discussions mentions that this is indeed Windows related and that variables need to be explicitly converted to long by calling the .long() method.

That’s the change for the target parameter. In my case it was line 177 of layers.py file (I am using the latest clone of fastai github repo):

return self.func.__call__(input, target.view(-1).long(), **kwargs)

I also had to change it in metrics.py, line 39:

return (input==targs.long()).float().mean()

Maybe there are more spots; will be discovering on the go.

HTH

Tommaso · December 5, 2018, 5:50am

Hi @quantotto,
I tried to fix exactly the same part of __getattr__() implementation, but I am not proficient enough in Python so I stopped working on it.
From what I saw on various forums and online resources, infinite recursion in __getattr__() happens very often if the method does not handle missing attributes correctly.
Glad that you came up with a solution!

Tommaso

quantotto · December 5, 2018, 6:52am

Cool, glad it helps!

cudawarped · December 5, 2018, 9:44am

Hey @quantotto

Thank you for your suggestions, that is a great help.

Regarding the .long() conversion I proposed a small change in this pull request which seemed to work for me.

Alternatively, to avoid hacking the library, I found that

data.train_ds.y.items = np.int64(data.train_ds.y.items)
data.valid_ds.y.items = np.int64(data.valid_ds.y.items)

also works.
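A minimal sketch of what that conversion does, assuming the labels arrive as 32-bit integers. That was NumPy's default integer type on Windows before NumPy 2.0, which would explain why the error shows up only there. PyTorch keeps the NumPy dtype when it builds a tensor, so int32 labels become an IntTensor and int64 labels become the LongTensor that cross_entropy expects. The label values are made up, and the dtype is forced so the sketch behaves the same on every platform.

```python
import numpy as np

# Labels stored as 32-bit integers (dtype forced for the illustration).
labels = np.array([0, 1, 1, 0], dtype=np.int32)

# Same conversion as in the workaround: calling np.int64 on an array
# returns a new array whose elements are 64-bit integers.
labels64 = np.int64(labels)

# astype is the more explicit spelling of the same conversion.
labels64_alt = labels.astype(np.int64)

print(labels.dtype, labels64.dtype, labels64_alt.dtype)
```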

quantotto · December 5, 2018, 10:56am

Thanks a lot, @cudawarped! External change is even better. I’ll keep that in mind for my tests.

s.s.o · December 5, 2018, 11:45am

After quantotto's suggested changes I am able to run the tabular data sample. Thank you all.

Mirodil · December 10, 2018, 11:51pm

I have the same issue with this example. Have you found a solution?

jeremy · December 11, 2018, 12:06am

All the errors mentioned here should be fixed in master now.

s.s.o · December 11, 2018, 8:13pm

When I run the tabular data example from the docs with fastai version 1.0.36.post1 (master), I still get the recursion error.