Meet Ranger - RAdam + Lookahead optimizer

@Jeremy we got it.

Here is our final code for xresnet:

class XResNet(nn.Sequential):
    def __init__(self, expansion, layers, c_in=3, c_out=1000, sa=False, sym=False, act_cls=defaults.activation):
        stem = []
        sizes = [c_in,16,32,64] if c_in<3 else [c_in,32,64,64]
        for i in range(3):
            stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls))

        block_szs = [64//expansion,64,128,256,512] +[256]*(len(layers)-4)
        blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2,
                                  sa = sa if i == (len(layers)-4) else False, sym=sym, act_cls=act_cls)
                  for i,l in enumerate(layers)]
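        # Assemble the Sequential: stem, pooling, residual stages, then the head
        super().__init__(
            *stem,
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )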

    def _make_layer(self, expansion, ni, nf, blocks, stride, sa, sym, act_cls):
        return nn.Sequential(
            *[ResBlock(expansion, ni if i==0 else nf, nf, stride if i==0 else 1,
                      sa if i == (blocks-1) else False, sym=sym, act_cls=act_cls)
              for i in range(blocks)])

The main changes are the filter sizes (32,64,64), changing the self attention layer conditions, and including the activation function in the ConvLayer.

We could get relatively stable results. Let me know if you'd rather I put in a PR to the repo :)
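To make the self-attention condition concrete, here is a minimal sketch in plain Python that mirrors the two `sa if ... else False` expressions in the class above and reports which block would receive self attention. The helper name `sa_positions` is hypothetical and only for illustration; it is not part of fastai.

```python
def sa_positions(layers, sa=True):
    """Return (stage_index, block_index) pairs that would get self attention."""
    positions = []
    for i, n_blocks in enumerate(layers):
        # Same test as in __init__: only the stage at index len(layers)-4 keeps `sa`
        stage_sa = sa if i == (len(layers) - 4) else False
        for j in range(n_blocks):
            # Same test as in _make_layer: only the last block of that stage
            block_sa = stage_sa if j == (n_blocks - 1) else False
            if block_sa:
                positions.append((i, j))
    return positions

print(sa_positions([3, 4, 6, 3]))     # resnet50-style layout
print(sa_positions([3, 4, 6, 3, 3]))  # a layout with one extra stage
```

For a standard four-stage layout this selects the last block of the first stage, and the selected stage shifts by one for each extra stage.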


Looks like v1 had 32,32,64 stem. Changing it would make pretrained weights fail. Have you tested that one change and seen any difference?

I'm rerunning two sets of 5 runs right now. I'll update this in about 30 minutes to an hour with those results.

Thinking about it, 32 filters at layer 2 actually seems too low. The receptive field is effectively 5x5 from the input, and with 3 channels that's 5x5x3=75. 64 still seems like a lot. I wonder if 48 is actually the right number, i.e. 32,48,64. Maybe try that too?

I guess there's also something to be said for just sticking with what the paper used, mind you! Which I think was 32,64,64, as you said.
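As a rough worked version of that arithmetic, the sketch below takes the 5x5 receptive field quoted above as given and compares the candidate layer-2 widths against the 75 input values each output position sees. The function name `receptive_inputs` is hypothetical.

```python
def receptive_inputs(field, channels=3):
    """Number of raw input values inside a square receptive field."""
    return field * field * channels

layer1_inputs = receptive_inputs(3)  # a single 3x3 conv on RGB input
layer2_inputs = receptive_inputs(5)  # the 5x5 figure quoted in the post

print("layer 1 sees", layer1_inputs, "input values")
print("layer 2 sees", layer2_inputs, "input values")
for width in (32, 48, 64):
    # Ratio of output filters to input values seen by each output position
    print(width, "filters ->", round(width / layer2_inputs, 2))
```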

Right as I finished! I was still able to get 72% (ish) with the filters. That's still not the 75-76% we can get by directly porting over the code. I wonder if you could look at something for me: these two look nearly identical, but are they actually equivalent?

class ConvLayer(nn.Sequential):
    "Create a sequence of convolutional (`ni` to `nf`), ReLU (if `use_activ`) and `norm_type` layers."
    def __init__(self, ni, nf, ks=3, stride=1, padding=None, bias=None, ndim=2, norm_type=NormType.Batch, bn_1st=True,
                 act_cls=defaults.activation, transpose=False, init=nn.init.kaiming_normal_, xtra=None, **kwargs):
        if padding is None: padding = ((ks-1)//2 if not transpose else 0) # Ours: padding = ks//2
        bn = norm_type in (NormType.Batch, NormType.BatchZero)
        if bias is None: bias = not bn
        conv_func = _conv_func(ndim, transpose=transpose)
        conv = init_default(conv_func(ni, nf, kernel_size=ks, bias=bias, stride=stride, padding=padding, **kwargs), init)
        if   norm_type==NormType.Weight:   conv = weight_norm(conv)
        elif norm_type==NormType.Spectral: conv = spectral_norm(conv)
        layers = [conv]
        act_bn = []
        if act_cls is not None: act_bn.append(act_cls())
        if bn: act_bn.append(BatchNorm(nf, norm_type=norm_type, ndim=ndim))
        if bn_1st: act_bn.reverse()
        layers += act_bn
        if xtra: layers.append(xtra)
        super().__init__(*layers)


def conv1d(ni:int, no:int, ks:int=1, stride:int=1, padding:int=0, bias:bool=False):
    "Create and initialize a `nn.Conv1d` layer with spectral normalization."
    conv = nn.Conv1d(ni, no, ks, stride=stride, padding=padding, bias=bias)
    nn.init.kaiming_normal_(conv.weight)
    if bias: conv.bias.data.zero_()
    return spectral_norm(conv)

def conv(ni, nf, ks=3, stride=1, bias=False):
    return nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, bias=bias)

def noop(x): return x

def conv_layer(ni, nf, ks=3, stride=1, zero_bn=False, act=True):
    bn = nn.BatchNorm2d(nf)
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    layers = [conv(ni, nf, ks, stride=stride), bn]
    if act: layers.append(act_fn)
    return nn.Sequential(*layers)

In other news, I'm considering redoing imagenette and imagewoof to make the train/val split 50/50 (but using the same total size of train+val). The idea is that this would largely avoid the need for averaging multiple runs (bigger val set), and we could experiment usefully with data augmentation in fewer epochs (smaller train set). It would mean creating a new leaderboard, but I think it's worth that one-time cost, personally... Any other reasons this might be a bad idea?


Seems like a great idea to me! I'd still (personally) expect some repeated runs (maybe 3-4) just to see what the variance is, but I expect it to be much lower than what we've been seeing so far.
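To put a rough number on why a bigger validation set reduces the need for repeated runs, here is a minimal sketch using the binomial standard error of an accuracy estimate. The validation sizes 500 and 5000 are illustrative assumptions, not the actual dataset sizes, and this only covers sampling noise from the validation set, not run-to-run training variance.

```python
import math

def accuracy_std_err(acc, n_val):
    """Binomial standard error of an accuracy measured on n_val samples."""
    return math.sqrt(acc * (1.0 - acc) / n_val)

# Hypothetical validation set sizes, chosen only to show the trend
for n_val in (500, 5000):
    se = accuracy_std_err(0.75, n_val)
    print(f"n_val={n_val}: +/- {100 * se:.2f} percentage points (1 std err)")
```

Since the error shrinks with the square root of the validation size, a tenfold larger validation set cuts this noise by roughly a factor of three.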

Probably best to create an object of each type, and print them out. Let me know if you see any differences...

They are in fact exactly the same :) So now we've figured out that the architectures are indeed the same, and those changes to the architecture are all there is. We're going back to the optimizer now to be sure; otherwise we'll keep scratching our heads.
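One detail that can be checked without building any model is the padding comment in ConvLayer above (`(ks-1)//2` versus `ks//2`). The sketch below compares the two expressions for a few kernel sizes; the function names `padding_v2` and `padding_v1` are hypothetical labels for the two formulas.

```python
def padding_v2(ks, transpose=False):
    """Padding expression from the ConvLayer listing."""
    return (ks - 1) // 2 if not transpose else 0

def padding_v1(ks):
    """Padding expression from the conv() helper."""
    return ks // 2

for ks in range(1, 8):
    same = padding_v2(ks) == padding_v1(ks)
    print(f"ks={ks}: v2={padding_v2(ks)} v1={padding_v1(ks)} same={same}")
```

The two agree for every odd kernel size, which covers the 1x1 and 3x3 convolutions used in these networks, and differ only for even kernel sizes.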

What we found is that if we just plug our architecture into Less' code we are able to get ~75-76% (that uses fastai v1 and his implementation of Ranger). I got results that are even a little bit better than the current leaderboard. This means the problem is not in the arch anymore.

My current suspects are the optimizer itself (one subtle difference is that Less is using a RAdam threshold of 5; I've made a PR for that) and the RandomResizedCrop (RRC) transform, which always drops my accuracy compared to a simple resize.

As I mentioned on the PR, this doesn't change anything apart from one iteration in training, so I don't think this has any link with your results. Feel free to use a separate implementation to compare, but theoretically it doesn't really make sense to me that the difference would come from that.
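The "one iteration" remark can be checked numerically with the RAdam formula for the length of the approximated simple moving average. This is a minimal sketch assuming beta2 = 0.999 and a strict greater-than comparison against the threshold; the function names are hypothetical and this is not the code from either implementation.

```python
def sma_length(t, beta2=0.999):
    """RAdam's rho_t: length of the approximated SMA at step t (t >= 1)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    return rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)

def first_rectified_step(threshold, beta2=0.999, max_steps=1000):
    """First step at which rho_t exceeds the threshold (assumed strict >)."""
    for t in range(1, max_steps + 1):
        if sma_length(t, beta2) > threshold:
            return t
    return None

print("threshold 4:", first_rectified_step(4))
print("threshold 5:", first_rectified_step(5))
```

Under these assumptions rho_t stays just below t for small t, so the rectified update starts at step 5 with a threshold of 4 and at step 6 with a threshold of 5, a difference of a single iteration.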


So as a summary of what has been tested/eliminated:

  • Fixed the XResNet architecture
  • Adjusted the transforms (so they are equivalent)
  • Verified the optimizers were good.
  • Training loop is also good

Where else could it possibly be coming from? (These were all the things we had adjusted for or figured out.)

I'm happy to investigate where the difference could come from, just give me two minimal implementations that show it (one that gets to 75-76% in v1 and one in v2).


For the v1 one, I just put our architecture into Less' code. To do that, substitute in this:

#FastAI's XResnet modified to use Mish activation function, MXResNet 
#modified by lessw2020 - github:

from fastai.torch_core import *
import torch.nn as nn
import torch,math,sys
import torch.utils.model_zoo as model_zoo
from functools import partial
#from ...torch_core import Module
from fastai.torch_core import Module

import torch.nn.functional as F  #(uncomment if needed,but you likely already have it)

class Mish(nn.Module):
    def __init__(self):
        super().__init__()
        print("Mish activation loaded...")

    def forward(self, x):  
        #save 1 second per epoch with no x= x*() and then return x...just inline it.
        return x *( torch.tanh(F.softplus(x))) 


#Unmodified from
def conv1d(ni:int, no:int, ks:int=1, stride:int=1, padding:int=0, bias:bool=False):
    "Create and initialize a `nn.Conv1d` layer with spectral normalization."
    conv = nn.Conv1d(ni, no, ks, stride=stride, padding=padding, bias=bias)
    nn.init.kaiming_normal_(conv.weight)
    if bias: conv.bias.data.zero_()
    return spectral_norm(conv)

# Adapted from SelfAttention layer at
# Inspired by
class SimpleSelfAttention(nn.Module):
    def __init__(self, n_in:int, ks=1, sym=False):#, n_out:int):
        super().__init__()
        self.conv = conv1d(n_in, n_in, ks, padding=ks//2, bias=False)      
        self.gamma = nn.Parameter(tensor([0.]))
        self.sym = sym
        self.n_in = n_in
    def forward(self,x):
        if self.sym:
            # symmetry hack by
            c = self.conv.weight.view(self.n_in,self.n_in)
            c = (c + c.t())/2
            self.conv.weight = c.view(self.n_in,self.n_in,1)
        size = x.size()  
        x = x.view(*size[:2],-1)   # (C,N)
        # changed the order of multiplication to avoid O(N^2) complexity
        # (x*xT)*(W*x) instead of (x*(xT*(W*x)))
        convx = self.conv(x)   # (C,C) * (C,N) = (C,N)   => O(NC^2)
        xxT = torch.bmm(x,x.permute(0,2,1).contiguous())   # (C,N) * (N,C) = (C,C)   => O(NC^2)
        o = torch.bmm(xxT, convx)   # (C,C) * (C,N) = (C,N)   => O(NC^2)
        o = self.gamma * o + x
        return o.view(*size).contiguous()        

__all__ = ['MXResNet', 'mxresnet18', 'mxresnet34', 'mxresnet50', 'mxresnet101', 'mxresnet152']

# or: ELU+init (a=0.54; gain=1.55)
act_fn = Mish() #nn.ReLU(inplace=True)

class Flatten(Module):
    def forward(self, x): return x.view(x.size(0), -1)

def init_cnn(m):
    if getattr(m, 'bias', None) is not None: nn.init.constant_(m.bias, 0)
    if isinstance(m, (nn.Conv2d,nn.Linear)): nn.init.kaiming_normal_(m.weight)
    for l in m.children(): init_cnn(l)

def conv(ni, nf, ks=3, stride=1, bias=False):
    return nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, bias=bias)

def noop(x): return x

def conv_layer(ni, nf, ks=3, stride=1, zero_bn=False, act=True):
    bn = nn.BatchNorm2d(nf)
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    layers = [conv(ni, nf, ks, stride=stride), bn]
    if act: layers.append(act_fn)
    return nn.Sequential(*layers)

class ResBlock(Module):
    def __init__(self, expansion, ni, nh, stride=1,sa=False, sym=False):
        nf,ni = nh*expansion,ni*expansion
        layers  = [conv_layer(ni, nh, 3, stride=stride),
                   conv_layer(nh, nf, 3, zero_bn=True, act=False)
        ] if expansion == 1 else [
                   conv_layer(ni, nh, 1),
                   conv_layer(nh, nh, 3, stride=stride),
                   conv_layer(nh, nf, 1, zero_bn=True, act=False)
        ]
        self.sa = SimpleSelfAttention(nf,ks=1,sym=sym) if sa else noop
        self.convs = nn.Sequential(*layers)
        # TODO: check whether act=True works better
        self.idconv = noop if ni==nf else conv_layer(ni, nf, 1, act=False)
        self.pool = noop if stride==1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x): return act_fn(self.sa(self.convs(x)) + self.idconv(self.pool(x)))

def filt_sz(recep): return min(64, 2**math.floor(math.log2(recep*0.75)))

class MXResNet(nn.Sequential):
    def __init__(self, expansion, layers, c_in=3, c_out=1000, sa = False, sym= False):
        stem = []
        sizes = [c_in,32,64,64]  #modified per Grankin
        for i in range(3):
            stem.append(conv_layer(sizes[i], sizes[i+1], stride=2 if i==0 else 1))
            #nf = filt_sz(c_in*9)
            #stem.append(conv_layer(c_in, nf, stride=2 if i==1 else 1))
            #c_in = nf

        block_szs = [64//expansion,64,128,256,512]
        blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2, sa = sa if i in[len(layers)-4] else False, sym=sym)
                  for i,l in enumerate(layers)]
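        # Assemble the Sequential: stem, pooling, residual stages, then the head
        super().__init__(
            *stem,
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )
        init_cnn(self)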

    def _make_layer(self, expansion, ni, nf, blocks, stride, sa=False, sym=False):
        return nn.Sequential(
            *[ResBlock(expansion, ni if i==0 else nf, nf, stride if i==0 else 1, sa if i in [blocks -1] else False,sym)
              for i in range(blocks)])

def mxresnet(expansion, n_layers, name, pretrained=False, **kwargs):
    model = MXResNet(expansion, n_layers, **kwargs)
    if pretrained: 
        print("No pretrained yet for MXResNet")
    return model

import fastai2
from fastai2.basics import *
from fastai2.callback.all import *
from import *
class XResNet2(nn.Sequential):
    def __init__(self, expansion, layers, c_in=3, c_out=1000, sa=False, sym=False, act_cls=defaults.activation):
        stem = []
        sizes = [c_in,16,32,64] if c_in<3 else [c_in,32,64,64]
        for i in range(3):
            stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls))

        block_szs = [64//expansion,64,128,256,512] +[256]*(len(layers)-4)
        blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2,
                                  sa = sa if i==len(layers)-4 else False, sym=sym, act_cls=act_cls)
                  for i,l in enumerate(layers)]
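        # Assemble the Sequential: stem, pooling, residual stages, then the head
        super().__init__(
            *stem,
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )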

    def _make_layer(self, expansion, ni, nf, blocks, stride, sa, sym, act_cls):
        return nn.Sequential(
            *[ResBlock(expansion, ni if i==0 else nf, nf, stride if i==0 else 1,
                      sa if i==blocks-1 else False, sym=sym, act_cls=act_cls)
              for i in range(blocks)]) = XResNet2

me = sys.modules[__name__]
for n,e,l in [
    [ 18 , 1, [2,2,2 ,2] ],
    [ 34 , 1, [3,4,6 ,3] ],
    [ 50 , 4, [3,4,6 ,3] ],
    [ 101, 4, [3,4,23,3] ],
    [ 152, 4, [3,8,36,3] ],
]:
    name = f'mxresnet{n}'
    setattr(me, name, partial(mxresnet, expansion=e, n_layers=l, name=name))
setattr(me, 'mxresnet50', partial(xresnet50, act_cls=MishJit))

and then run with the parameters:

--woof 1 --size 128 --bs 64 --mixup 0 --epoch 5 --lr 4e-3 --gpu 0 --opt ranger --mom .95 --sched_type flat_and_anneal --ann_start 0.72 --sa 1
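For readers unfamiliar with the `flat_and_anneal` schedule named in those flags, here is a minimal sketch of the idea, assuming it means holding the learning rate constant for the first `ann_start` fraction of training and then cosine-annealing it toward zero. The function name `flat_and_anneal_lr` is hypothetical and this is not the training script's own code.

```python
import math

def flat_and_anneal_lr(step, total_steps, base_lr=4e-3, ann_start=0.72):
    """Flat learning rate, then cosine annealing toward zero (assumed shape)."""
    anneal_begin = int(total_steps * ann_start)
    if step < anneal_begin:
        return base_lr
    anneal_len = max(1, total_steps - anneal_begin)
    progress = (step - anneal_begin) / anneal_len  # 0 at start of annealing
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 100  # illustrative number of optimizer steps
for step in (0, 50, 72, 86, 99):
    print(step, f"{flat_and_anneal_lr(step, total):.6f}")
```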

I also removed the call to learn.to_fp16() because that seemed to improve results by a bit (might be just variance though).
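As a side note on the listing above, the comment in SimpleSelfAttention about reordering the multiplication can be illustrated with a rough multiply count for the two groupings. This is a sketch that counts scalar multiplications for dense matrix products only; the channel and position counts are illustrative assumptions and the function names are hypothetical.

```python
def matmul_cost(m, k, n):
    """Scalar multiplications for an (m,k) @ (k,n) dense product."""
    return m * k * n

def cost_reordered(c, n):
    """(x @ xT) @ (W @ x): every product costs C*C*N."""
    return matmul_cost(c, c, n) + matmul_cost(c, n, c) + matmul_cost(c, c, n)

def cost_naive(c, n):
    """x @ (xT @ (W @ x)): builds an (N,N) intermediate."""
    return matmul_cost(c, c, n) + matmul_cost(n, c, n) + matmul_cost(c, n, n)

c, n = 64, 32 * 32  # illustrative: 64 channels on a 32x32 feature map
print("reordered:", cost_reordered(c, n))
print("naive:    ", cost_naive(c, n))
print("ratio:    ", round(cost_naive(c, n) / cost_reordered(c, n), 1))
```

The reordered grouping scales linearly in the number of spatial positions N, while the naive grouping scales quadratically, which matters when N is much larger than C.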

Here is my code for v2.

@sgugger I want to try one more thing before I throw in the towel, which is translating the transforms from v1 to v2. In v1 we used:

            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs, num_workers=workers)
            .presize(size, scale=(0.35,1))

In making our v2, what would be the equivalent for presize?


batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]

dbch = dsrc.databunch(after_item=[ToTensor(),Resize(128), Flip()], 
                      bs=64, num_workers=nw)

No, you need RandomResizedCrop(size, min_scale=0.35) instead of your Resize.

Edit: also Flip is a batch transform. FlipItem is the item version.
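To illustrate what a random resized crop with `min_scale=0.35` does differently from a plain resize, here is a minimal sketch of the general technique in plain Python: sample a crop covering between 35% and 100% of the image area, with a random aspect ratio, which would then be resized to the target size. The aspect ratio range (3/4 to 4/3), the retry count, and the function name `sample_crop_box` are assumptions for illustration; this is not fastai2's implementation.

```python
import math
import random

def sample_crop_box(width, height, min_scale=0.35, ratio=(3 / 4, 4 / 3),
                    rng=random, attempts=10):
    """Return (left, top, w, h) of a random crop; falls back to the full image."""
    area = width * height
    for _ in range(attempts):
        target_area = rng.uniform(min_scale, 1.0) * area
        # Sample the aspect ratio log-uniformly so 3/4 and 4/3 are equally likely
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            left = rng.randint(0, width - w)
            top = rng.randint(0, height - h)
            return left, top, w, h
    return 0, 0, width, height  # no valid crop found: use the whole image

rng = random.Random(0)
for _ in range(3):
    print(sample_crop_box(160, 160, rng=rng))
```

Each epoch therefore sees a different sub-region of every image, which is a stronger augmentation than a fixed resize and can plausibly cost accuracy in very short training runs.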

Would I want to include it in before_batch? E.g.:

dbch = dsrc.databunch(before_batch=[RandomResizedCrop(128, min_scale=0.35)],after_item=[ToTensor(),Resize(128), FlipItem()], 
                      bs=64, num_workers=nw)

Or would I just include it before the Resize call? (Asking because before_batch throws a device error.)

No, replace your Resize with RandomResizedCrop.

For some reason, using RandomResizedCrop(size, min_scale=0.35) drops the accuracy from 72% to ~69%. I'm running more experiments to confirm that.

EDIT: Confirmed, and it's actually closer to 66%.