Meet Ranger - RAdam + Lookahead optimizer

@Jeremy we got it.

Here is our final code for xresnet:

class XResNet(nn.Sequential):
    def __init__(self, expansion, layers, c_in=3, c_out=1000, sa=False, sym=False, act_cls=defaults.activation):
        stem = []
        sizes = [c_in,16,32,64] if c_in<3 else [c_in,32,64,64]
        for i in range(3):
            stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls))

        block_szs = [64//expansion,64,128,256,512] +[256]*(len(layers)-4)
        blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2,
                                  sa = sa if i == (len(layers)-4) else False, sym=sym, act_cls=act_cls)
                  for i,l in enumerate(layers)]
        super().__init__(
            *stem,
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )
        init_cnn(self)

    def _make_layer(self, expansion, ni, nf, blocks, stride, sa, sym, act_cls):
        return nn.Sequential(
            *[ResBlock(expansion, ni if i==0 else nf, nf, stride if i==0 else 1,
                      sa if i == (blocks-1) else False, sym=sym, act_cls=act_cls)
              for i in range(blocks)])

The main changes are the stem filter sizes (32,64,64), the conditions for where the self-attention layer goes, and passing the activation function through to the ConvLayer.
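
For example, the 50-layer variant can be built like this (a usage sketch; expansion 4 with [3,4,6,3] blocks is the standard xresnet50 configuration, and it assumes a Mish class is in scope):

model = XResNet(4, [3, 4, 6, 3], c_out=10, sa=True, act_cls=Mish)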

We were able to get relatively stable results. Let me know if you'd rather I put in a PR to the repo :slight_smile:


Looks like v1 had a 32,32,64 stem. Changing it would make pretrained weights fail. Have you tested that one change and seen any difference?


I'm rerunning two sets of 5 runs right now. I'll update this in about 30 minutes to an hour with those results.

Thinking about it, 32 filters at layer 2 actually seems too low. The receptive field is effectively 5x5 from the input, and with 3 channels that's 5x5x3=75. 64 still seems like a lot. I wonder if 48 is actually the right number, i.e. 32,48,64. Maybe try that too?
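
If it helps the comparison, here's a quick way to see what each candidate stem costs in parameters (a rough sketch using plain nn.Conv2d, so BN/activation parameters are not counted, rather than the repo's conv_layer):

import torch.nn as nn

def stem(sizes):
    # three 3x3 convs: stride 2 for the first, stride 1 after, as in the XResNet stem
    return nn.Sequential(*[nn.Conv2d(sizes[i], sizes[i+1], 3, stride=2 if i == 0 else 1, padding=1)
                           for i in range(3)])

for sizes in ([3, 32, 32, 64], [3, 32, 48, 64], [3, 32, 64, 64]):
    n_params = sum(p.numel() for p in stem(sizes).parameters())
    print(sizes, f"{n_params:,} conv params")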


I guess there's also something to be said for just sticking with what the paper used, mind you! Which I think was 32,64,64, as you said.

Right as I finished! I was still able to get 72%(ish) with those filters. Still not the 75-76% we can get by directly porting over the code. I wonder if you could look at something for me. They look close to the exact same, but are they both essentially equivalent?

class ConvLayer(nn.Sequential):
    "Create a sequence of convolutional (`ni` to `nf`), ReLU (if `use_activ`) and `norm_type` layers."
    def __init__(self, ni, nf, ks=3, stride=1, padding=None, bias=None, ndim=2, norm_type=NormType.Batch, bn_1st=True,
                 act_cls=defaults.activation, transpose=False, init=nn.init.kaiming_normal_, xtra=None, **kwargs):
        if padding is None: padding = ((ks-1)//2 if not transpose else 0) # Ours: padding = ks//2
        bn = norm_type in (NormType.Batch, NormType.BatchZero)
        if bias is None: bias = not bn
        conv_func = _conv_func(ndim, transpose=transpose)
        conv = init_default(conv_func(ni, nf, kernel_size=ks, bias=bias, stride=stride, padding=padding, **kwargs), init)
        if   norm_type==NormType.Weight:   conv = weight_norm(conv)
        elif norm_type==NormType.Spectral: conv = spectral_norm(conv)
        layers = [conv]
        act_bn = []
        if act_cls is not None: act_bn.append(act_cls())
        if bn: act_bn.append(BatchNorm(nf, norm_type=norm_type, ndim=ndim))
        if bn_1st: act_bn.reverse()
        layers += act_bn
        if xtra: layers.append(xtra)
        super().__init__(*layers)

Vs

def conv1d(ni:int, no:int, ks:int=1, stride:int=1, padding:int=0, bias:bool=False):
    "Create and initialize a `nn.Conv1d` layer with spectral normalization."
    conv = nn.Conv1d(ni, no, ks, stride=stride, padding=padding, bias=bias)
    nn.init.kaiming_normal_(conv.weight)
    if bias: conv.bias.data.zero_()
    return spectral_norm(conv)

def conv(ni, nf, ks=3, stride=1, bias=False):
    return nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, bias=bias)

def noop(x): return x

def conv_layer(ni, nf, ks=3, stride=1, zero_bn=False, act=True):
    bn = nn.BatchNorm2d(nf)
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    layers = [conv(ni, nf, ks, stride=stride), bn]
    if act: layers.append(act_fn)
    return nn.Sequential(*layers)
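
(One quick check on the padding comment flagged in ConvLayer above: for the odd kernel sizes used here, the two formulas give the same padding.)

for ks in (1, 3):
    assert (ks - 1) // 2 == ks // 2   # identical for odd ks; they only differ for even ks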

In other news, I'm considering redoing imagenette and imagewoof to make the train/val split 50/50 (but using the same total size of train+val). The idea is that this would largely avoid the need for averaging multiple runs (bigger val set) and would let us experiment with data augmentation usefully in fewer epochs (smaller train set). It would mean creating a new leaderboard, but I think it's worth that one-time cost, personally... Any other reasons this might be a bad idea?


Seems like a great idea to me! I'd still (personally) expect some repeated runs (maybe 3-4) just to see what the variance is, but I expect it to be much lower than what we've been seeing so far.

Probably best to create an object of each type and print them out. Let me know if you see any differences...
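
For example, something like this (a sketch; it assumes both the v2 ConvLayer and the v1-style conv_layer above are importable in the same session):

a = ConvLayer(32, 64, ks=3, stride=1)   # fastai v2 ConvLayer with its defaults
b = conv_layer(32, 64, ks=3, stride=1)  # the v1-style conv_layer shown above
print(a)
print(b)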


They are in fact the exact same :slight_smile: So now we've figured out that the architectures are indeed the same, and those changes to the architecture are it. We're going back to the optimizer now to be sure; otherwise we'll keep scratching our heads.

What we found is that if we just plug our architecture into Less' code we are able to get ~75-76% (that uses fastai v1 and his implementation of Ranger). I got results that are even a little bit better than the current leaderboard. That means the problem is no longer in the architecture.

My current suspects are the optimizer itself (one subtle difference is that Less is using a RAdam threshold of 5; I've made a PR for that) and the RRC transform, which always drops my accuracy compared to a simple resize.
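
For context, that threshold is the cutoff on RAdam's rho_t term, which decides when the variance-rectified adaptive step kicks in. A rough sketch of the gate (from the paper's formula, not the exact fastai or Less code):

import math

def radam_rect(step, beta2=0.99, thresh=5):
    # The adaptive step is only taken once rho_t clears `thresh` -- 4 in the paper,
    # 5 in Less's implementation, which is the subtle difference mentioned above.
    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * step * beta2**step / (1 - beta2**step)
    if rho_t <= thresh:
        return None  # fall back to a plain momentum (non-adaptive) step
    # variance rectification term applied to the adaptive step
    return math.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf /
                     ((rho_inf - 4) * (rho_inf - 2) * rho_t))

With beta2=0.99 the two thresholds only disagree for about one early step.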

As I mentioned on the PR, this doesn't change anything apart from one iteration in training, so I don't think it has any link with your results. Feel free to use a separate implementation to compare, but theoretically it doesn't really make sense to me that the difference would come from that.


So as a summary of what has been tested/eliminated:

  • Fixed the XResNet architecture
  • Adjusted the transforms (so they are equivalent)
  • Verified the optimizers were good
  • Verified the training loop is good

What else could the difference possibly be coming from? (These were all the things we had adjusted for/figured out.)

I'm happy to investigate where the difference could come from; just give me two minimal implementations that show it (one that gets to the 75-76% in v1 and one in v2).


To do that, I just put our architecture into Less' code. Substitute mxresnet.py with this:

#FastAI's XResnet modified to use Mish activation function, MXResNet 
#https://github.com/fastai/fastai/blob/master/fastai/vision/models/xresnet.py
#modified by lessw2020 - github:  https://github.com/lessw2020/mish


from fastai.torch_core import *
import torch.nn as nn
import torch,math,sys
import torch.utils.model_zoo as model_zoo
from functools import partial
#from ...torch_core import Module
from fastai.torch_core import Module

import torch.nn.functional as F  # needed for F.softplus in Mish


class Mish(nn.Module):
    def __init__(self):
        super().__init__()
        print("Mish activation loaded...")

    def forward(self, x):
        # save ~1 second per epoch by inlining rather than x = x*(...) and then return x
        return x * torch.tanh(F.softplus(x))

#Unmodified from https://github.com/fastai/fastai/blob/5c51f9eabf76853a89a9bc5741804d2ed4407e49/fastai/layers.py
def conv1d(ni:int, no:int, ks:int=1, stride:int=1, padding:int=0, bias:bool=False):
    "Create and initialize a `nn.Conv1d` layer with spectral normalization."
    conv = nn.Conv1d(ni, no, ks, stride=stride, padding=padding, bias=bias)
    nn.init.kaiming_normal_(conv.weight)
    if bias: conv.bias.data.zero_()
    return spectral_norm(conv)



# Adapted from SelfAttention layer at https://github.com/fastai/fastai/blob/5c51f9eabf76853a89a9bc5741804d2ed4407e49/fastai/layers.py
# Inspired by https://arxiv.org/pdf/1805.08318.pdf
class SimpleSelfAttention(nn.Module):
    def __init__(self, n_in:int, ks=1, sym=False):
        super().__init__()
        self.conv = conv1d(n_in, n_in, ks, padding=ks//2, bias=False)
        self.gamma = nn.Parameter(tensor([0.]))
        self.sym = sym
        self.n_in = n_in

    def forward(self, x):
        if self.sym:
            # symmetry hack by https://github.com/mgrankin
            c = self.conv.weight.view(self.n_in, self.n_in)
            c = (c + c.t()) / 2
            self.conv.weight.data = c.view(self.n_in, self.n_in, 1)  # assign via .data so the Parameter isn't replaced by a plain tensor

        size = x.size()
        x = x.view(*size[:2], -1)   # (C,N)

        # changed the order of multiplication to avoid O(N^2) complexity:
        # (x*xT)*(W*x) instead of (x*(xT*(W*x)))
        convx = self.conv(x)                               # (C,C) * (C,N) = (C,N)  => O(NC^2)
        xxT = torch.bmm(x, x.permute(0,2,1).contiguous())  # (C,N) * (N,C) = (C,C)  => O(NC^2)
        o = torch.bmm(xxT, convx)                          # (C,C) * (C,N) = (C,N)  => O(NC^2)

        o = self.gamma * o + x
        return o.view(*size).contiguous()


__all__ = ['MXResNet', 'mxresnet18', 'mxresnet34', 'mxresnet50', 'mxresnet101', 'mxresnet152']

# or: ELU+init (a=0.54; gain=1.55)
act_fn = Mish() #nn.ReLU(inplace=True)

class Flatten(Module):
    def forward(self, x): return x.view(x.size(0), -1)

def init_cnn(m):
    if getattr(m, 'bias', None) is not None: nn.init.constant_(m.bias, 0)
    if isinstance(m, (nn.Conv2d,nn.Linear)): nn.init.kaiming_normal_(m.weight)
    for l in m.children(): init_cnn(l)

def conv(ni, nf, ks=3, stride=1, bias=False):
    return nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, bias=bias)

def noop(x): return x

def conv_layer(ni, nf, ks=3, stride=1, zero_bn=False, act=True):
    bn = nn.BatchNorm2d(nf)
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    layers = [conv(ni, nf, ks, stride=stride), bn]
    if act: layers.append(act_fn)
    return nn.Sequential(*layers)

class ResBlock(Module):
    def __init__(self, expansion, ni, nh, stride=1,sa=False, sym=False):
        nf,ni = nh*expansion,ni*expansion
        layers  = [conv_layer(ni, nh, 3, stride=stride),
                   conv_layer(nh, nf, 3, zero_bn=True, act=False)
        ] if expansion == 1 else [
                   conv_layer(ni, nh, 1),
                   conv_layer(nh, nh, 3, stride=stride),
                   conv_layer(nh, nf, 1, zero_bn=True, act=False)
        ]
        self.sa = SimpleSelfAttention(nf,ks=1,sym=sym) if sa else noop
        self.convs = nn.Sequential(*layers)
        # TODO: check whether act=True works better
        self.idconv = noop if ni==nf else conv_layer(ni, nf, 1, act=False)
        self.pool = noop if stride==1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x): return act_fn(self.sa(self.convs(x)) + self.idconv(self.pool(x)))

def filt_sz(recep): return min(64, 2**math.floor(math.log2(recep*0.75)))

class MXResNet(nn.Sequential):
    def __init__(self, expansion, layers, c_in=3, c_out=1000, sa = False, sym= False):
        stem = []
        sizes = [c_in,32,64,64]  #modified per Grankin
        for i in range(3):
            stem.append(conv_layer(sizes[i], sizes[i+1], stride=2 if i==0 else 1))
            #nf = filt_sz(c_in*9)
            #stem.append(conv_layer(c_in, nf, stride=2 if i==1 else 1))
            #c_in = nf

        block_szs = [64//expansion,64,128,256,512]
        blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2, sa = sa if i in[len(layers)-4] else False, sym=sym)
                  for i,l in enumerate(layers)]
        super().__init__(
            *stem,
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )
        init_cnn(self)

    def _make_layer(self, expansion, ni, nf, blocks, stride, sa=False, sym=False):
        return nn.Sequential(
            *[ResBlock(expansion, ni if i==0 else nf, nf, stride if i==0 else 1, sa if i in [blocks -1] else False,sym)
              for i in range(blocks)])

def mxresnet(expansion, n_layers, name, pretrained=False, **kwargs):
    model = MXResNet(expansion, n_layers, **kwargs)
    if pretrained: 
        #model.load_state_dict(model_zoo.load_url(model_urls[name]))
        print("No pretrained yet for MXResNet")
    return model

import fastai2
from fastai2.basics import *
from fastai2.callback.all import *
from fastai2.vision.all import *
class XResNet2(nn.Sequential):
    def __init__(self, expansion, layers, c_in=3, c_out=1000, sa=False, sym=False, act_cls=defaults.activation):
        stem = []
        sizes = [c_in,16,32,64] if c_in<3 else [c_in,32,64,64]
        for i in range(3):
            stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls))

        block_szs = [64//expansion,64,128,256,512] +[256]*(len(layers)-4)
        blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2,
                                  sa = sa if i==len(layers)-4 else False, sym=sym, act_cls=act_cls)
                  for i,l in enumerate(layers)]
        super().__init__(
            *stem,
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )
        init_cnn(self)

    def _make_layer(self, expansion, ni, nf, blocks, stride, sa, sym, act_cls):
        return nn.Sequential(
            *[ResBlock(expansion, ni if i==0 else nf, nf, stride if i==0 else 1,
                      sa if i==blocks-1 else False, sym=sym, act_cls=act_cls)
              for i in range(blocks)])
fastai2.vision.models.xresnet.XResNet = XResNet2

me = sys.modules[__name__]
for n,e,l in [
    [ 18 , 1, [2,2,2 ,2] ],
    [ 34 , 1, [3,4,6 ,3] ],
    [ 50 , 4, [3,4,6 ,3] ],
    [ 101, 4, [3,4,23,3] ],
    [ 152, 4, [3,8,36,3] ],
]:
    name = f'mxresnet{n}'
    setattr(me, name, partial(mxresnet, expansion=e, n_layers=l, name=name))
    
setattr(me, 'mxresnet50', partial(xresnet50, act_cls=MishJit))

and then run train.py with the parameters:

--woof 1 --size 128 --bs 64 --mixup 0 --epoch 5 --lr 4e-3 --gpu 0 --opt ranger --mom .95 --sched_type flat_and_anneal --ann_start 0.72 --sa 1
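
(For anyone reproducing this: flat_and_anneal holds the learning rate flat until ann_start and then anneals it down towards zero. A sketch of the shape, using a cosine anneal as an assumption rather than Less's exact scheduler code:)

import math

def flat_then_anneal(pct, lr=4e-3, ann_start=0.72):
    # pct = fraction of training completed (0..1)
    if pct < ann_start:
        return lr                                    # flat phase at the base lr
    p = (pct - ann_start) / (1 - ann_start)          # progress through the anneal phase
    return lr * (1 + math.cos(math.pi * p)) / 2      # anneal down towards 0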

I also removed the call to learn.to_fp16() in train.py because that seemed to improve results a bit (it might just be variance though).

Here is my code for v2.

@sgugger I want to try one more thing before I throw in the towel, on translating the transforms from v1 to v2. In v1 we used:

(ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs, num_workers=workers)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

In making our v2 version, what would be the equivalent of presize?

Eg:

batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]

dbch = dsrc.databunch(after_item=[ToTensor(),Resize(128), Flip()], 
                      after_batch=batch_tfms, 
                      bs=64, num_workers=nw)

No, you need RandomResizedCrop(size, min_scale=0.35) instead of your Resize.

Edit: also Flip is a batch transform. FlipItem is the item version.


Would I want to include it in before_batch? Eg:

dbch = dsrc.databunch(before_batch=[RandomResizedCrop(128, min_scale=0.35)],after_item=[ToTensor(),Resize(128), FlipItem()], 
                      after_batch=batch_tfms, 
                      bs=64, num_workers=nw)

Or would I just include it before the Resize call? (Asking because before_batch throws a device error.)

No, replace your Resize by RandomResizedCrop.
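
I.e. combining that with the Flip note above, something along these lines (a sketch that keeps the dsrc, nw and batch size from your earlier snippet):

batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats), Flip()]  # Flip belongs with the batch tfms

dbch = dsrc.databunch(after_item=[ToTensor(), RandomResizedCrop(128, min_scale=0.35)],
                      after_batch=batch_tfms,
                      bs=64, num_workers=nw)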


For some reason using RandomResizedCrop(size, min_scale=0.35) drops the accuracy from 72% to ~69%. I'm running more experiments to confirm that.

EDIT: Confirmed, and it's actually closer to 66%