Fastai v2 chat

Hi, I don't know whether I can ask this question here, but:

Should we also avoid converting nn.LayerNorm to fp16?
(All I know is that we don't convert batch norm to fp16.)

# fp16_utils.py
def convert_network(network, dtype):
    """
    Converts a network's parameters and buffers to dtype.
    """
    for module in network.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm) and module.affine is True: # <----- here
            continue
        convert_module(module, dtype)
        if isinstance(module, torch.nn.RNNBase) or isinstance(module, torch.nn.modules.rnn.RNNBase):
            module.flatten_parameters()
    return network

If the LayerNorm layers aren't very big, then maybe they should be excluded just like BatchNorm? Could be worth a PR; a rough sketch follows the docs quote below.

from the docs:
http://dev.fast.ai/callback.fp16#Problems-with-half-precision:

For the last problem, the tricks offered by NVIDIA are to leave the batchnorm layers in single precision (they don’t have many weights so it’s not a big memory challenge) and compute the loss in single precision (which means converting the last output of the model in single precision before passing it to the loss).
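A minimal sketch of what that extra exclusion could look like (hypothetical, not the actual apex/fastai code; convert_module is re-implemented here only to keep the sketch self-contained, and note that LayerNorm's affine flag is named elementwise_affine):

# hypothetical tweak, not the real fp16_utils.py
import torch
import torch.nn as nn

def convert_module(module, dtype):
    "Convert a module's immediate floating-point parameters and buffers to `dtype`."
    for p in module.parameters(recurse=False):
        if p.data.dtype.is_floating_point: p.data = p.data.to(dtype)
    for name, buf in module._buffers.items():
        if buf is not None and buf.is_floating_point():
            module._buffers[name] = buf.to(dtype)

def convert_network(network, dtype):
    "Like the apex version above, but leaves LayerNorm (not just BatchNorm) in fp32."
    norm_types = (nn.modules.batchnorm._BatchNorm, nn.LayerNorm)
    for module in network.modules():
        # BatchNorm exposes `affine`; LayerNorm exposes `elementwise_affine`
        if isinstance(module, norm_types) and (getattr(module, 'affine', False)
                                               or getattr(module, 'elementwise_affine', False)):
            continue  # keep the normalization layer entirely in fp32
        convert_module(module, dtype)
        if isinstance(module, nn.RNNBase): module.flatten_parameters()
    return network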

Based on the docs for autocast, PyTorch mixed precision casts a specific list of operations to fp32, and layer_norm is on that list.
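For instance (assuming a CUDA device; torch.cuda.amp is the PyTorch 1.6 API):

import torch
import torch.nn as nn

ln = nn.LayerNorm(10).cuda()
x = torch.randn(2, 10, device='cuda').half()
with torch.cuda.amp.autocast():
    # autocast runs layer_norm in fp32 even when the input is fp16
    print(ln(x).dtype)  # torch.float32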

I would like to make a PR, but this issue keeps me from even doing git checkout -b :cry:

1 Like

Unfortunately, since this change, my local fastai2 clone on my Ubuntu laptop got into a mess when I tried to enable the submodule and pull.

I was getting "Permission denied (publickey). fatal: Could not read from remote repository." when pulling the docs submodule. After messing things up even further by mucking around in various .gitxxx files and manually cloning docs without really knowing what I was doing, I decided to back up and re-clone afresh with:

git clone --recurse-submodules https://github.com/fastai/fastai2

Same error.

The full output is:

brian ~/Projects > git clone --recurse-submodules https://github.com/fastai/fastai2
Cloning into 'fastai2'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 9034 (delta 45), reused 35 (delta 15), pack-reused 8923
Receiving objects: 100% (9034/9034), 511.73 MiB | 10.62 MiB/s, done.
Resolving deltas: 100% (7231/7231), done.
Submodule 'docs' (git@github.com:fastai/fastai-docs.git) registered for path 'docs'
Cloning into '/home/brian/Projects/fastai2/docs'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:fastai/fastai-docs.git' into submodule path '/home/brian/Projects/fastai2/docs' failed
Failed to clone 'docs'. Retry scheduled
Cloning into '/home/brian/Projects/fastai2/docs'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:fastai/fastai-docs.git' into submodule path '/home/brian/Projects/fastai2/docs' failed
Failed to clone 'docs' a second time, aborting

After googling for what could be going on, I found a bug report mentioning that the submodule was set up to clone over SSH instead of HTTPS, and I thought that could be the problem.

I was finally successful with these 4 steps for a fresh checkout; thought I should share with anyone facing similar issues:

1. git clone https://github.com/fastai/fastai2
2. manually edit the fastai2/.gitmodules file to change the URL from git to https, i.e. https://github.com/fastai/fastai-docs
3. git submodule init
4. git submodule update

Success!
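As an aside, the same URL rewrite can be done without hand-editing .gitmodules (assuming the submodule is named docs, as in the log above):

git clone https://github.com/fastai/fastai2
cd fastai2
git config --file=.gitmodules submodule.docs.url https://github.com/fastai/fastai-docs.git
git submodule sync
git submodule update --init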

@bguan that should be fixed now. Sorry about that!

2 Likes

@jeremy no problem at all! Wishing you and the team godspeed in getting fastai v2 out!

1 Like

Hi all, since

  • fastai requires PyTorch 1.6 now,
  • to_fp16 leaves some things unimplemented (e.g. keeping LayerNorm out of fp16, as I showed above) and requires the user to cast the inputs of the loss function back to fp32,
  • the original fastai mixed precision detects modules (e.g. batch norm) to decide what not to cast, while PyTorch mixed precision detects ops, which means it is probably only PyTorch that can cover custom norm layers correctly,

shall we use native fp16 as the default now?
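For context, the native recipe such a default would wrap (the standard torch.cuda.amp pattern from the PyTorch 1.6 docs; dl, model, loss_func and opt assumed defined elsewhere):

import torch

scaler = torch.cuda.amp.GradScaler()
for xb, yb in dl:
    opt.zero_grad()
    with torch.cuda.amp.autocast():    # decides per-op (not per-module) what runs in fp16
        loss = loss_func(model(xb), yb)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 gradient underflow
    scaler.step(opt)                   # unscales grads first; skips the step on inf/nan
    scaler.update()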

1 Like

There is a to_native_fp16 now, iirc, so you can choose. (It's experimental.)
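i.e. something like this (going from memory on the exact method names):

learn = learn.to_fp16()         # fastai's own loss-scaling mixed precision
learn = learn.to_native_fp16()  # the experimental thin wrapper over torch.cuda.amp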

@muellerzr yeah, but can't we make it the default?

Make the mixed-precision _one_batch become one_batch, with an enable= flag to toggle mixed precision training, and deprecate the original to_fp16 to avoid users running into the issues mentioned above.

You can always put in a PR and see :wink: (as this is a Jeremy question and he can answer on the PR)

1 Like

[HELP]

I tried to clone my fork of fastai to make a PR:
git clone --recurse-submodules https://github.com/richarddwang/fastai2.git
But it seems to hit a problem when it comes to the submodule.

Cloning into 'fastai2'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 9415 (delta 16), reused 43 (delta 13), pack-reused 9359
Receiving objects: 100% (9415/9415), 504.88 MiB | 18.58 MiB/s, done.
Resolving deltas: 100% (7441/7441), done.
Checking connectivity... done.
Checking out files: 100% (216/216), done.
Submodule 'docs' (git@github.com:fastai/fastai-docs.git) registered for path 'docs'
Cloning into 'docs'...
The authenticity of host 'github.com (52.69.186.44)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,52.69.186.44' (RSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:fastai/fastai-docs.git' into submodule path 'docs' failed

Any suggestions? :joy:

@Richard-Wang sorry, I was just in the middle of making a change to that. It's fixed now.

2 Likes

I’m trying to find a robust way to access the “raw” dataset from any Learner (in order to track it).

So far it seems that something like learn.dls.train.dataset.items is pretty good. However, I can still be missing important files (label files, class-name files, etc).

Should I just track an entire folder and consider it the "raw" dataset? Another issue is that it won't always be easily accessible through learn.dls.path, for example when building manually from Datasets or TfmdLists, in which case it just stays Path('.').
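In code, what I'm relying on so far (these attributes exist for file-based datasets, but not necessarily for every Dataset):

items = learn.dls.train.dataset.items  # e.g. a list of file Paths for file-based datasets
root  = learn.dls.path                 # can be just Path('.') when dls was built manually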

You basically lose that ability once it's in the dataloader, because the dataset is fully loaded (for instance, I can't grab a filename from a PILImage if it was made from a numpy array stored in a file; that information just doesn't exist). I don't really have an answer besides what you describe: just give it the folder or file everything originates from. ($0.02 from thinking about this, as I've wondered it too :slight_smile: )

Yes, I think you're right.
Initially I just wanted to use torch.save on the DataLoaders, but I think large sets of images are lazily loaded, so to have a reproducible experiment we probably need both learn.dls and all the raw files (plus it's always good to track the "raw" dataset anyway).

Note: I'm thinking of adding this feature to the W&B callback, so it could be something like WandbCallback(track_dataset="../data/").
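Hypothetical usage; track_dataset here is the proposed argument, not something WandbCallback currently accepts:

import wandb
from fastai2.callback.wandb import WandbCallback

wandb.init(project='my-project')
# `track_dataset` is the proposal above, not part of the current WandbCallback signature
learn.fit_one_cycle(1, cbs=WandbCallback(track_dataset="../data/"))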

2 Likes

@jeremy Thanks for the fix.

I tried git clone --recurse-submodules https://github.com/richarddwang/fastai2.git again and got:

...
Submodule 'docs' (https://github.com/richarddwang/fastai-docs.git) registered for path 'docs'
Cloning into 'docs'...
Username for 'https://github.com': richarddwang   
Password for 'https://richarddwang@github.com': 
remote: Repository not found.
fatal: repository 'https://github.com/richarddwang/fastai-docs.git/' not found
fatal: clone of 'https://github.com/richarddwang/fastai-docs.git' into submodule path 'docs' failed

So I also forked fastai-docs, and it worked this time.
Just to confirm, is forking fastai-docs part of the intended workflow? :thinking:

If you're forking, yes, I guess you need to do both - sorry, I hadn't thought of that issue. Will add it to the readme.

1 Like

Hi all,
Does the below show that gradient clipping doesn't take effect when using to_fp16(clip=1.0)?

I tried to find the answer using ipdb; below is what I got.

What the output below shows: only the grads of the parameters in master_pgs get clipped; that's not the case for self.model's parameters, or for the parameters the optimizer actually uses to update.

tensor(-0.4746, device='cuda:0', dtype=torch.float16, grad_fn=) # p list(self.model.parameters())[50].grad.mean()
tensor(5.0221e-06, device='cuda:0') # p self.master_pgs[-3][0].grad.mean()

>>> unt 116 # from 114
> /home/yisiang/fastai2/fastai2/callback/fp16.py(116)after_backward()
13  114         if self.clip is not None:
    115             for group in self.master_pgs: nn.utils.clip_grad_norm_(group, self.clip)
--> 116         if self.dynamic:
    117             self.count += 1
    118             if self.count == self.scale_wait:

tensor(-0.4746, device='cuda:0', dtype=torch.float16, grad_fn=) # p list(self.model.parameters())[50].grad.mean()
tensor(3.1092e-06, device='cuda:0') # p self.master_pgs[-3]
tensor(6.3187e-07, device='cuda:0') #p self.opt.all_params(with_grad=True)[33][0].grad.mean()

>>> unt 116 # from 114
> /home/yisiang/fastai2/fastai2/callback/fp16.py(116)after_backward()
13  114         if self.clip is not None:
    115             for group in self.master_pgs: nn.utils.clip_grad_norm_(group, self.clip)
--> 116         if self.dynamic:
    117             self.count += 1
    118             if self.count == self.scale_wait:

tensor(6.3187e-07, device='cuda:0')  #p self.opt.all_params(with_grad=True)[33][0].grad.mean()
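For reference, clip_grad_norm_ clips in place only the tensors it is given; a standalone illustration (toy tensors standing in for an fp16 model param and its fp32 master copy):

import torch
import torch.nn as nn

model_p  = nn.Parameter(torch.randn(10).half())       # stand-in for an fp16 model param
master_p = model_p.detach().float().requires_grad_()  # stand-in for its fp32 master copy
model_p.grad  = torch.full((10,), 100.0).half()
master_p.grad = model_p.grad.float()

nn.utils.clip_grad_norm_([master_p], max_norm=1.0)
print(master_p.grad.norm())  # <= 1.0, clipped in place
print(model_p.grad.norm())   # ~316, untouched: clipping one copy doesn't affect the other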

Did I misunderstand something ? Or I should file an issue ?

@Richard-Wang please file an issue and I'll take a look to see if there's a problem.

1 Like

Hi,

I am facing this error when trying to use fastai2 on Kaggle. I have upgraded to torch 1.6.0 and installed fastai2 using pip. Would appreciate help from anyone.

AttributeError                            Traceback (most recent call last)
<ipython-input-…> in <module>
      7 from matplotlib import pyplot as plt
      8 get_ipython().run_line_magic('matplotlib', 'inline')
----> 9 from fastai2.torch_basics import *
     10 from fastai2.basics import *
     11 from fastai2.data.all import *

/opt/conda/lib/python3.7/site-packages/fastai2/torch_basics.py in <module>
      2 from .imports import *
      3 from .torch_imports import *
----> 4 from .torch_core import *
      5 from .layers import *

/opt/conda/lib/python3.7/site-packages/fastai2/torch_core.py in <module>
    244 # Cell
    245 if not hasattr(torch, 'as_subclass'):
--> 246     setattr(torch, 'as_subclass', torch.Tensor.as_subclass)
    247
    248 # Cell

AttributeError: type object 'Tensor' has no attribute 'as_subclass'
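A quick sanity check, in case the kernel is still importing an older torch despite the upgrade:

import torch
print(torch.__version__)                     # should print 1.6.0 if the upgrade took effect
print(hasattr(torch.Tensor, 'as_subclass'))  # the attribute the failing line relies on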