Meet Ranger - RAdam + Lookahead optimizer

Are you adjusting the learning rate for mixed precision?

Will do, thanks. I'll be offline for a day or two, so I'll check back here then and hopefully get QHAdam pushed.


EDIT: @muellerzr just checked re: eps, and it looks like @LessW2020 had left it outside the sqrt in both Ranger and RangerQH. RAdam in fastai v2 also has it outside, and my port of QHAdam also leaves it outside.

def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nu_1, nu_2, step, grad_avg, eps, **kwargs):
    # debias the exponential moving averages, as in Adam
    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)
    # quasi-hyperbolic update: interpolate between the raw gradient and the
    # debiased moving averages using nu_1 / nu_2; note eps sits outside the sqrt
    p.data.addcdiv_(-lr, ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1)),
                    (((1 - nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps)
    return p

qhadam_step._defaults = dict(eps=1e-8)


#export
def QHAdam(params, lr, mom=0.999, sqr_mom=0.999, nu_1=0.7, nu_2=1.0, eps=1e-8, wd=0., decouple_wd=True):
    "An `Optimizer` for Adam with `lr`, `mom`, `sqr_mom`, `nus`, `eps` and `params`"
    from functools import partial
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [partial(average_grad, dampening=True), partial(average_sqr_grad, dampening=True), step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nu_1=nu_1, nu_2=nu_2,
                     mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
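
For anyone who wants to try it, here's a minimal usage sketch (this assumes the fastai v2 dev Learner / xresnet50 API used later in this thread; dbch is whatever DataBunch you already have, and the fit call is just illustrative):

from functools import partial

opt_func = partial(QHAdam, nu_1=0.7, nu_2=1.0)
learn = Learner(dbch, xresnet50(c_out=10), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit(5, 4e-3)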

@muellerzr Bit of a delay here; I had a small hand surgery last week, so typing was a little slow :slight_smile:

I compared v1 and v2 with Ranger but without any image transforms, and got pretty much the same result after averaging over 5 runs:

V1 notebook: (69.4+72.3+70.3+68.4+69.6)/5 = 70.0%

  • To make sure Ranger was doing something: a single-run accuracy with Adam instead of Ranger was 64%

V2 notebook: (68.4+71.6+71+69+71.2)/5 = 70.2%

Note that for v2 I used after_item=[ToTensor(), Resize(128)] to do the resizing, which squishes the image, the same behaviour as the v1 resize. Your previous notebook used a random crop, I think, which would change the data shown to the model and might explain some of the difference.
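
For reference, a sketch of the two item pipelines being compared (same v2 names as in the code further down; the rest of the DataBunch setup is unchanged):

# squish resize, matching the v1 default resize behaviour
after_item = [ToTensor(), Resize(128)]

# random-resized crop, what the earlier notebook used
after_item = [ToTensor(), RandomResizedCrop(128, min_scale=0.35)]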

Will add the transforms back and let you know how it looks!


Awesome @morgan! I've been busy doing keypoints for a bit so haven't looked into it. Can't wait to hear an update. Great work! :slight_smile:

So after adding transforms, a big difference in performance emerges: 73.6% vs 69.08%, V1 vs V2. So it's probably our usage of the v2 transforms that is driving the difference (maaaybe a small chance it's a difference in the implementation of the transforms themselves, but that's unlikely I'd guess).

Will try to do an ablation test tomorrow to see if I can narrow down the culprit. Note that I still need to look properly for a v2 version of the 3rd transform below ("resize and crop"); a rough v1 → v2 mapping follows the list.

Transforms used (V1 naming)

  • flip_lr
  • presize(128, scale=(0.35,1)) (Resize images to size using RandomResizedCrop)
  • size=128 (equivalent to resize and crop; the "no transform" version above used size=(128,128), which is equal to squish)
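
Roughly, the v2 equivalents used in the code below (the third one is the open question mentioned above):

flip_lr(p=0.5)               → FlipItem(0.5)
presize(128, scale=(0.35,1)) → RandomResizedCrop(128, min_scale=0.35)
size=128 (resize and crop)   → v2 equivalent still to be confirmed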

Fastai V1 Result

(73.6+74.2+73.8+72+74.6)/5 = 73.64%

Databunch code:

img_ls = ImageList.from_folder(src).split_by_folder(train='train', valid='val').label_from_folder()
img_ls = img_ls.transform(([flip_lr(p=0.5)], []), size=(128))
data = img_ls.databunch(bs=64, num_workers=nw).presize(128, scale=(0.35,1)).normalize(imagenet_stats)

Fastai V2 Result

(66.8+71+67.8+71+68.8)/5 = 69.08%

Databunch code:

tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_tfms = [FlipItem(0.5)]
dsrc = DataSource(items, tfms, splits=split_idx)

batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]

dbch = dsrc.databunch(item_tfms=item_tfms,
                      after_item=[ToTensor(), RandomResizedCrop(128, min_scale=0.35)],
                      after_batch=batch_tfms,
                      bs=64,
                      num_workers=nw)

I tried it out again and included a c_out (which was missing before and was leading to slightly higher losses). Good news: after one of my runs I hit our 74.8%!!! It looks like the bug in the head of the models was the issue (which Jeremy fixed yesterday). @morgan I'm running it for five runs, but here is my code:

learn = Learner(dbch, xresnet50(sa=True, c_out=10), opt_func=opt_func, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
fit_fc(learn, 5, 4e-3)

The average of five was not as good though, [70.4, 74.8, 71.0, 71.2, 73.0], but much better!!! (the average is 72.08%)


@muellerzr I'm guessing you're using custom functions defined in this notebook? I tried running with only fastai and my accuracy is around 65% :sleepy:, here is the code in case someone wants to check it out.

I was not, I'll post a new notebook later (when I have the time to do so, I'm all over the place this week) but I'll let you know.

I reread your bit @lgvaz, and yes, I was using those custom functions that were in the notebook. Try that and see if it helps. If you get the same result as I did, I'll compare what's in there vs the library.


I can confirm that running the code on that notebook gets the accuracy to 72%.

I'll compare all the functions from my nb and his nb and try to spot the difference.


@lgvaz it looks like the two models are slightly different. First, we should be specifying act_cls=MishJit when building the model.

It also looks like some of the convolution sizes are different. E.g. ours in group 1 has a Conv2d(32, 64, ...) whereas fastai has Conv2d(32, 32, ...). Along with this, in our implementation the ResBlock has a final activation whereas fastai does not.

I found where that is: line 431 in layers.py

self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None, ndim=ndim, **kwargs)

@jeremy I'm going to tag you in here. If we can show a version working with the activation change and the filter change, can we post a PR for the layers?

Along with this, you do not have an activation present in the last ConvLayer of a ResBlock, but we do.


@muellerzr Maybe this is what you were saying with:

But I think there's a bug in xresnet.py line 23:

stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1))

act_cls is not being passed here, which causes the stem to use ReLU instead of MishJit.
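
A quick way to see this (just a sketch, assuming the v2 dev xresnet is an nn.Sequential so indexing gives the first stem ConvLayer):

m = xresnet50(sa=True, c_out=10, act_cls=MishJit)
print(m[0])  # first stem ConvLayer; shows ReLU if act_cls isn't forwarded to the stem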


There's a bug there too (I think; let me look at it before I say anything definitive), but I'm saying that in your call to xresnet you should have xresnet50(sa=True, c_out=10, act_cls=MishJit)

Ow, I'm already doing that :sweat_smile:


Got it :slight_smile: To your bit, we'd want stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls if defaults.activation else act_cls())) there. I'm running this now. Let me know what else you find/try :slight_smile:

The last bit we need to solve is that 32/32 in the ResBlock (I think).


Got any improvements so far?

I'm having some problems modifying the fastai_dev source code (some permission errors), so I'm a little bit behind. But I'm also going to change it here and then we can compare results :smiley:

I'm just modifying it in a cell above where we make the model, so it overrides the library's code. What I've been doing is using a difference browser to look at what was different between everything and going from there.
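
For example, one way to do that kind of comparison (just a sketch; our_model is a placeholder for the notebook's custom xresnet, and the file names are arbitrary):

# dump both model reprs to text files, then diff them to spot
# differences in conv sizes and activations
with open('ours.txt', 'w') as f: f.write(str(our_model))
with open('fastai.txt', 'w') as f: f.write(str(xresnet50(sa=True, c_out=10, act_cls=MishJit)))
# then compare, e.g. with `diff ours.txt fastai.txt` or any diff browser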

Thanks, this is helpful. Let me know what you find and I can update the lib later this afternoon as needed.


We're working our way there, but we can confirm that the original mxresnet implementation gets 72.6% (average of five runs with a std of 0.5%).


OK, so the v2 results are within 1 std of that... Although it sounds like there are definite bugs in v2 (not surprising: some of those changes I made very recently and under time pressure!)


Correct. There were a few changes; we're working on verifying that it works correctly one more time, and then we'll make a PR with a few changes to the architecture design on the dev repo, and we can discuss what to do. :slight_smile: (For example with the sizes and make_layer.)
