Meet Ranger - RAdam + Lookahead optimizer

Are you adjusting the learning rate for mixed precision?

Will do, thanks. I'll be offline for a day or two, so I'll check back here then and hopefully get QHAdam pushed.


EDIT: @muellerzr just checked re: eps, and it looks like @LessW2020 had left it outside the sqrt in both Ranger and RangerQH. RAdam in fastai v2 also has it outside, and my port of QHAdam also leaves it outside.

def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nu_1, nu_2, step, grad_avg, eps, **kwargs):
    # debias the exponential moving averages, as in Adam
    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)
    # quasi-hyperbolic update: interpolate between the raw gradient and the
    # debiased moving averages using nu_1 / nu_2; note eps sits outside the sqrt
    p.data.addcdiv_(-lr, ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1)),
                    (((1 - nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps)
    return p

qhadam_step._defaults = dict(eps=1e-8)


#export
def QHAdam(params, lr, mom=0.999, sqr_mom=0.999, nu_1=0.7, nu_2=1.0, eps=1e-8, wd=0., decouple_wd=True):
    "An `Optimizer` for Adam with `lr`, `mom`, `sqr_mom`, `nus`, `eps` and `params`"
    from functools import partial
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [partial(average_grad, dampening=True), partial(average_sqr_grad, dampening=True), step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nu_1=nu_1, nu_2=nu_2,
                     mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
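
For anyone who wants to try it, here's a minimal usage sketch (this assumes the fastai v2 dev Learner / xresnet50 API used later in this thread; dbch is whatever DataBunch you already have, and the fit call is just illustrative):

from functools import partial

opt_func = partial(QHAdam, nu_1=0.7, nu_2=1.0)
learn = Learner(dbch, xresnet50(c_out=10), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit(5, 4e-3)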

@muellerzr Bit of a delay here; I had a small hand surgery last week, so typing was a little slow :slight_smile:

I compared v1 and v2 with Ranger but without any image transforms, and got pretty much the same result after averaging over 5 runs:

V1 notebook: (69.4+72.3+70.3+68.4+69.6)/5 = 70.0%

  • To make sure Ranger was doing something: a single-run accuracy with Adam instead of Ranger was 64%

V2 notebook: (68.4+71.6+71+69+71.2)/5 = 70.2%

Note that for v2 I used after_item=[ToTensor(), Resize(128)] to do the resizing, which squishes the image, the same behaviour as the v1 resize. Your previous notebook used a random crop, I think, which would change the data shown to the model and might explain some of the difference.
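
For reference, a sketch of the two item pipelines being compared (same v2 names as in the code further down; the rest of the DataBunch setup is unchanged):

# squish resize, matching the v1 default resize behaviour
after_item = [ToTensor(), Resize(128)]

# random-resized crop, what the earlier notebook used
after_item = [ToTensor(), RandomResizedCrop(128, min_scale=0.35)]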

Will add the transforms back and let you know how it looks!


Awesome @morgan! I've been busy doing keypoints for a bit so haven't looked into it. Can't wait to hear an update. Great work! :slight_smile:

So after adding transforms, a big difference in performance emerges: 73.6% vs 69.08%, V1 vs V2. So it's probably our usage of the v2 transforms that is driving the difference (maaaybe a small chance it's a difference in the implementation of the transforms themselves, but that's unlikely I'd guess).

Will try to do an ablation test tomorrow to see if I can narrow down the culprit. Note that I still need to look properly for a v2 version of the 3rd transform below ("resize and crop"); a rough v1 → v2 mapping follows the list.

Transforms used (V1 naming)

  • flip_lr
  • presize(128, scale=(0.35,1)) (Resize images to size using RandomResizedCrop)
  • size=128 (equivalent to resize and crop; the "no transform" version above used size=(128,128), which is equal to squish)
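
Roughly, the v2 equivalents used in the code below (the third one is the open question mentioned above):

flip_lr(p=0.5)               → FlipItem(0.5)
presize(128, scale=(0.35,1)) → RandomResizedCrop(128, min_scale=0.35)
size=128 (resize and crop)   → v2 equivalent still to be confirmed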

Fastai V1 Result

(73.6+74.2+73.8+72+74.6)/5 = 73.64%

Databunch code:

img_ls = ImageList.from_folder(src).split_by_folder(train='train', valid='val').label_from_folder()
img_ls = img_ls.transform(([flip_lr(p=0.5)], []), size=(128))
data = img_ls.databunch(bs=64, num_workers=nw).presize(128, scale=(0.35,1)).normalize(imagenet_stats)

Fastai V2 Result

(66.8+71+67.8+71+68.8)/5 = 69.08%

Databunch code:

tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_tfms = [FlipItem(0.5)]
dsrc = DataSource(items, tfms, splits=split_idx)

batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]

dbch = dsrc.databunch(item_tfms=item_tfms,
                      after_item=[ToTensor(), RandomResizedCrop(128, min_scale=0.35)],
                      after_batch=batch_tfms,
                      bs=64,
                      num_workers=nw)

I tried it out again and included a c_out (which was missing before and was leading to slightly higher losses). Good news: after one of my runs I hit our 74.8%!!! It looks like the bug in the head of the models was the issue (which Jeremy fixed yesterday). @morgan I'm running it for five runs, but here is my code:

learn = Learner(dbch, xresnet50(sa=True, c_out=10), opt_func=opt_func, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
fit_fc(learn, 5, 4e-3)

The average of five was not as good though, [70.4, 74.8, 71.0, 71.2, 73.0], but much better!!! (the average is 72.08%)


@muellerzr I'm guessing you're using custom functions defined in this notebook? I tried running with only fastai and my accuracy is around 65% :sleepy:, here is the code in case someone wants to check it out.

I was not, I'll post a new notebook later (when I have the time to do so, I'm all over the place this week) but I'll let you know.

I reread your bit @lgvaz, and yes, I was using those custom functions that were in the notebook. Try that and see if it helps. If you get the same result as I did, I'll compare what's in there vs the library.


I can confirm that running the code on that notebook gets the accuracy to 72%.

I'll compare all the functions from my nb and his nb and try to spot the difference.


@lgvaz it looks like the two models are slightly different. First, we should be specifying act_cls=MishJit when building the model.

It also looks like some of the convolution sizes are different. E.g. ours in group 1 has a Conv2d(32, 64, ...) whereas fastai has Conv2d(32, 32, ...). Along with this, in our implementation the ResBlock has a final activation whereas fastai does not.

I found where that is: line 431 in layers.py

self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None, ndim=ndim, **kwargs)

@jeremy I'm going to tag you in here. If we can show a version working with the activation change and the filter change, can we post a PR for the layers?

Along with this, you do not have an activation present in the last ConvLayer of a ResBlock, but we do.


@muellerzr Maybe this is what you were saying with:

But I think there's a bug in xresnet.py line 23:

stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1))

act_cls is not being passed here, which causes the stem to use ReLU instead of MishJit.
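
A quick way to see this (just a sketch, assuming the v2 dev xresnet is an nn.Sequential so indexing gives the first stem ConvLayer):

m = xresnet50(sa=True, c_out=10, act_cls=MishJit)
print(m[0])  # first stem ConvLayer; shows ReLU if act_cls isn't forwarded to the stem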


There's a bug there too (I think; let me look at it before I say anything definitive), but I'm saying that in your call to xresnet you should have xresnet50(sa=True, c_out=10, act_cls=MishJit)

Ow, I'm already doing that :sweat_smile:


Got it :slight_smile: To your bit, we'd want stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls if defaults.activation else act_cls())) there. I'm running this now. Let me know what else you find/try :slight_smile:

The last bit we need to solve is that 32/32 in the ResBlock (I think).


Got any improvements so far?

I'm having some problems modifying the fastai_dev source code (some permission errors), so I'm a little bit behind. But I'm also going to change it here and then we can compare results :smiley:

I'm just modifying it in a cell above where we make the model, so it overrides the library's code. What I've been doing is using a difference browser to look at what was different between everything and going from there.
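
For example, one way to do that kind of comparison (just a sketch; our_model is a placeholder for the notebook's custom xresnet, and the file names are arbitrary):

# dump both model reprs to text files, then diff them to spot
# differences in conv sizes and activations
with open('ours.txt', 'w') as f: f.write(str(our_model))
with open('fastai.txt', 'w') as f: f.write(str(xresnet50(sa=True, c_out=10, act_cls=MishJit)))
# then compare, e.g. with `diff ours.txt fastai.txt` or any diff browser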

Thanks, this is helpful. Let me know what you find and I can update the lib later this afternoon as needed.


We're working our way there, but we can confirm that the original mxresnet implementation gets 72.6% (average of five runs with a std of 0.5%).


OK, so the v2 results are within 1 std of that... Although it sounds like there are definite bugs in v2 (not surprising: some of those changes I made very recently and under time pressure!)


Correct. There were a few changes; we're working on verifying that it works correctly one more time, and then we'll make a PR with a few changes to the architecture design on the dev repo, and we can discuss what to do. :slight_smile: (For example with the sizes and make_layer.)
