Meet Ranger - RAdam + Lookahead optimizer

Yes. v2 augs are:

item_img_tfms = [ToTensor(), FlipItem(0.5), RandomResizedCrop(128, min_scale=0.35)]

and v1 are

.transform(([flip_lr(p=0.5)], []), size=size)

With size being 128.

Same architecture and everything as well.

Training loop is fit_flat_cos (just in case I also checked the original fc_fit)
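For anyone unfamiliar with the schedule, here is a rough sketch of what a flat-then-cosine learning rate looks like. The function name flat_cos_lr is hypothetical, and pct_start=0.75 and div_final=1e5 are assumed values for illustration only, so check the library for the real defaults.

```python
import math

def flat_cos_lr(pct, lr, pct_start=0.75, div_final=1e5):
    """Learning rate at training progress `pct` in [0, 1].

    Flat at `lr` until `pct_start`, then cosine-annealed down to lr/div_final.
    The default values are assumptions for illustration.
    """
    if pct < pct_start:
        return lr
    # Progress through the annealing phase, from 0 to 1.
    t = (pct - pct_start) / (1 - pct_start)
    lr_end = lr / div_final
    return lr_end + (lr - lr_end) * (1 + math.cos(math.pi * t)) / 2

# Sample the schedule at 0%, 10%, ..., 100% of training (4e-3 is an example value).
schedule = [flat_cos_lr(i / 10, 4e-3) for i in range(11)]
```

The idea is that the optimizer spends most of training at the full learning rate and only anneals in the final stretch.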

I think I found and solved the issue. It was in model declaration, the number of classes for the arch was still 1000. I’m surprised this didn’t throw an error? (mismatch error I mean during tensor calc’s). Let me run again real quick to double check.
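On why there was no mismatch error: as far as I understand it, cross-entropy only needs each label index to be smaller than the number of outputs, so labels 0 to 9 are perfectly valid against a 1000-way head. A minimal pure-Python sketch of that (the helper cross_entropy is written here for illustration, it is not the library loss):

```python
import math
import random

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; `target` is a class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

random.seed(0)
target = 3                                               # one of the 10 dataset labels
logits_10 = [random.gauss(0, 1) for _ in range(10)]      # head with 10 outputs
logits_1000 = [random.gauss(0, 1) for _ in range(1000)]  # head with 1000 outputs

loss_10 = cross_entropy(logits_10, target)      # valid
loss_1000 = cross_entropy(logits_1000, target)  # also valid: index 3 < 1000
```

With uniform logits the loss is log(10) for the 10-way head and log(1000) for the 1000-way head, so the oversized head trains without complaint but starts from a higher loss.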


You misunderstood: you should compare a training done with the same databunch object (either v1 or v2), the same training loop (either v1 or v2) and just change the optimizer, to make sure this is the problem.

For instance, your v2 data has rrc and I don’t see it in your v1 data.


Hmmm… I’ll look into that. After adjusting the c_out I still get 69%… but the loss does look better

Looks like that’s the equivalent of just Resize. 71%, getting better :slight_smile:

RRC was being applied:

(ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs, num_workers=workers)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

Still getting stuck there… odd… It’s not the transforms and it’s not the optimizer then (probably):

tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_img_tfms = [ToTensor(), FlipItem(0.5), RandomResizedCrop(128, min_scale=0.35), Resize(128)]

That’s as far as I can get today. @morgan let me know what you come up with :confused:


QHAdam

Below is a rough version of QHAdam, which can then be used with the Lookahead wrapper to give RangerQH. The parameter update definitely could be refactored to be more efficient/elegant.

def opt_func(ps, lr=defaults.lr): return Lookahead(QHAdam(ps, lr=lr, wd=1e-2,mom=0.9, eps=1e-6))

def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nus, step, grad_avg, eps, **kwargs):
    nu_1 = nus[0]
    nu_2 = nus[1]

    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)

    # nu-weighted mix of the raw gradient and its debiased moving average
    num = ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1))
    # same mix for the squared gradient; eps is added outside the sqrt
    denom = (((1-nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps

    p.data = p.data - lr * (num / denom)
    return p

def QHAdam(params, lr=1e-3, mom=0.9, sqr_mom=0.99, nus=[(.7, 1.0)], eps=1e-8, wd=0., decouple_wd=False):
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [average_grad, average_sqr_grad, step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nus=nus, mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
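For context on what the Lookahead wrapper does around the inner optimizer, here is a minimal sketch on a 1-D toy problem. It uses plain SGD as the inner optimizer purely to keep it short (Ranger uses RAdam, RangerQH would use QHAdam), and the function name lookahead_sgd plus the values of k and alpha are assumptions for illustration, not the library’s code.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, n_steps=100):
    """Minimise a 1-D function with SGD wrapped in Lookahead.

    grad_fn: gradient of the objective at a point.
    k, alpha: sync period and slow-weight step size (illustrative values).
    """
    slow = fast = w0
    for step in range(1, n_steps + 1):
        fast -= lr * grad_fn(fast)          # inner ("fast") optimizer step
        if step % k == 0:
            slow += alpha * (fast - slow)   # move slow weights toward fast
            fast = slow                     # restart fast weights from slow
    return slow

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w = lookahead_sgd(lambda w: 2 * (w - 3), w0=0.0)
```

Every k fast steps the slow weights move a fraction alpha of the way toward the fast weights, and the fast weights restart from there.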

Testing fastai Ranger v fastai2 Ranger

I tried removing all the transformations and shuffling in v1 and v2 for a fair comparison between both, but I keep getting an error when trying to remove all of the transforms in item_img_tfms in fastai2. I guess I need to get more familiar with the new data API. For fastai2 I used AffineCoordTfm to do the resizing, as I had seen Jeremy do it in the Kaggle RSNA comp. Happy to test both for Ranger if I can get fastai2 to give me images with no transforms :rofl:

fastai2 data (gives error)

src = Path('data/imagewoof/imagewoof')
items = get_image_files(src)
split_idx = GrandparentSplitter(valid_name='val')(items)
tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_img_tfms = [ToTensor()]
dsrc = DataSource(items, tfms, splits=split_idx)
batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]
dbunch = dsrc.databunch(shuffle_train=False, after_batch=batch_tfms+[AffineCoordTfm(size=128)], bs=64)

fastai v1 data

n_gpus = num_distrib() or 1
nw = min(8, num_cpus()//n_gpus)
img_ls = ImageList.from_folder(src).split_by_folder(valid='val').label_from_folder()
img_ls = img_ls.transform(([flip_lr(p=0.0)], []), size=128)
data = img_ls.databunch(bs=64, num_workers=nw).normalize(imagenet_stats)
data.train_dl = data.train_dl.new(shuffle=False)


I’ll take a look in a moment and see what I can find :slight_smile: I’ll update this post.

Looks like there is something missing with the initialization, among a few other things, in the fastai version of Mish with xresnet (first epoch I got 12% vs 29% with ours). Working on fixing it. I believe it’s due to some of the activation functions still being a ReLU (I posted a PR, which has been approved). I’ll see what else I can find later today.

However, here is what I tried. I didn’t do affine as there is a bug involving it, PyTorch, and Google Colab.

woof = DataBlock(blocks=(ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=GrandparentSplitter(valid_name='val'),
                 get_y=parent_label)

dbunch = woof.databunch(untar_data(URLs.IMAGEWOOF), bs=32, item_tfms=RandomResizedCrop(128),
                        batch_tfms=Normalize(*imagenet_stats))

This should only resize the image. Only 64% here…

Won’t that take a random crop as opposed to a squished or center crop? (Not even sure what fastai v1 .transform(size=...) does actually.) I guess it should be pretty close though.
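To make the three options concrete, here is a rough geometry-only sketch of which region of the source image each one keeps before resizing to the target size. These helpers are hypothetical and are not the fastai implementation; the aspect-ratio range (3/4, 4/3) is an assumed default, and min_scale=0.35 is just the value used earlier in the thread.

```python
import math
import random

def squish_box(w, h):
    """Squish: keep the whole image; the resize changes its aspect ratio."""
    return (0, 0, w, h)

def center_crop_box(w, h):
    """Center crop: largest centered square, resized without distortion."""
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return (left, top, left + side, top + side)

def random_resized_crop_box(w, h, min_scale=0.35, ratio=(3/4, 4/3), rng=random):
    """Random resized crop: random area fraction and aspect ratio.

    Returns None if the sampled box does not fit (real implementations
    typically retry and then fall back to something like a center crop).
    """
    area = rng.uniform(min_scale, 1.0) * w * h
    aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
    cw = int(round(math.sqrt(area * aspect)))
    ch = int(round(math.sqrt(area / aspect)))
    if cw > w or ch > h:
        return None
    left = rng.randint(0, w - cw)
    top = rng.randint(0, h - ch)
    return (left, top, left + cw, top + ch)

# Boxes are (left, top, right, bottom) for a 500x375 source image.
boxes = [squish_box(500, 375), center_crop_box(500, 375),
         random_resized_crop_box(500, 375)]
```

Squish and center crop always show the model the same pixels for a given image, while the random crop shows a different region every epoch, so the two setups are not seeing equivalent data.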

Sorry I haven’t been able to compare the two, I’ve been trying to get QHAdam done. I’m ready to submit a clean version of QHAdam, but blocked by tests failing for average_grad in the Optimizer notebook :frowning:


You’re good! I traced it back to a random resized crop being applied. Oh no that’s a headache :frowning: I put in the fix for the one ReLU that wasn’t being changed to Mish so all the pieces are now (should be) natively there for us to try and play with.


woohoo, so now they’re both showing the same performance?

Nope :confused: but now instead of copying code to debug we can use the library itself. Still yet to match the 73-78%

To add onto my earlier comment, presize does the RRC

Also @morgan see Jeremy’s note in the Practical 2.0 thread. He said to check and make sure eps was being used the same way in the 1.0 and 2.0 versions of RAdam (that Ranger used).


In my tests (on a totally different dataset), a learner with to_fp16() gets a lower score than when I train normally. In addition, I need to train longer to get close to Adam, and it still ends up lower… so the calculation precision might be the cause.


Are you adjusting the learning rate for mixed precision?

Will do, thanks. I’ll be offline for a day or two so I’ll check back here then and hopefully get QHAdam pushed.


EDIT: @muellerzr just checked re eps and it looks like @LessW2020 had left it outside the sqrt in both Ranger and RangerQH. RAdam in fastai v2 also has it outside and my port of QHAdam also leaves it outside.

def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nu_1, nu_2, step, grad_avg, eps, **kwargs):             
    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)
    p.data.addcdiv_(-lr, ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1)), 
                    (((1 - nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps)
    return p   

qhadam_step._defaults = dict(eps=1e-8)  


#export
def QHAdam(params, lr, mom=0.999, sqr_mom=0.999, nu_1=0.7, nu_2 = 1.0, eps=1e-8, wd=0., decouple_wd=True):
    "An `Optimizer` for QHAdam with `lr`, `mom`, `sqr_mom`, `nu_1`, `nu_2`, `eps` and `params`"
    from functools  import partial
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [partial(average_grad, dampening=True), partial(average_sqr_grad, dampening=True), step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nu_1=nu_1, nu_2=nu_2 , 
                     mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
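
As a sanity check on the update rule, here is the same step written for a single scalar in plain Python, with eps outside the sqrt as above. The function name qhadam_scalar_step and the state dict are hypothetical, and the mom/sqr_mom values are just illustrative; this is a sketch to check numbers by hand, not a replacement for the stepper.

```python
import math

def qhadam_scalar_step(p, g, state, lr=1e-3, mom=0.9, sqr_mom=0.99,
                       nu_1=0.7, nu_2=1.0, eps=1e-8):
    """One QHAdam update for a single scalar parameter `p` with gradient `g`."""
    state["step"] += 1
    t = state["step"]
    # Dampened exponential moving averages of g and g^2.
    state["grad_avg"] = mom * state["grad_avg"] + (1 - mom) * g
    state["sqr_avg"] = sqr_mom * state["sqr_avg"] + (1 - sqr_mom) * g * g
    # Bias-corrected averages.
    m_hat = state["grad_avg"] / (1 - mom ** t)
    v_hat = state["sqr_avg"] / (1 - sqr_mom ** t)
    # nu-weighted mix of the raw gradient terms and their averages.
    num = (1 - nu_1) * g + nu_1 * m_hat
    denom = math.sqrt((1 - nu_2) * g * g + nu_2 * v_hat) + eps  # eps outside sqrt
    return p - lr * num / denom

state = {"step": 0, "grad_avg": 0.0, "sqr_avg": 0.0}
p = qhadam_scalar_step(1.0, g=0.5, state=state)
```

On the first step the bias-corrected averages equal g and g squared, so the update is almost exactly lr in size. Setting nu_1 = nu_2 = 1 reduces the rule to plain Adam.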

@muellerzr bit of a delay here, had a small hand surgery last week, so typing was a little slow :slight_smile:

I compared v1 and v2 with Ranger but without any image transforms and got pretty much the same result after averaging over 5 runs:

V1 notebook: (69.4+72.3+70.3+68.4+69.6)/5 = 70.0%

  • to make sure Ranger was doing something, a 1-run accuracy for Adam instead of Ranger was 64%

V2 notebook: (68.4+71.6+71+69+71.2)/5 = 70.2%

Note for v2 I used after_item=[ToTensor(),Resize(128)] to do the resizing, which squishes the image, which is the same as the v1 resize. Your previous notebook used a random crop I think which would impact the data being shown to the model and might explain some difference.
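A quick way to check that the two no-transform averages really are within noise of each other is to look at the run-to-run spread as well as the mean. A small sketch using the run accuracies above (five runs is a tiny sample, so treat the spread as a rough guide only):

```python
from statistics import mean, stdev

# Per-run accuracies (%) from the no-transform comparison above.
v1_runs = [69.4, 72.3, 70.3, 68.4, 69.6]
v2_runs = [68.4, 71.6, 71.0, 69.0, 71.2]

for name, runs in (("v1", v1_runs), ("v2", v2_runs)):
    print(f"{name}: mean={mean(runs):.2f}  stdev={stdev(runs):.2f}")

# Difference between the two means, to compare against the spread.
gap = abs(mean(v1_runs) - mean(v2_runs))
```

The gap between the means comes out at roughly a quarter of a point, while each set of runs varies by more than a point, so this comparison cannot tell the two versions apart.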

Will add the transforms back and let you know how it looks!


Awesome @morgan! I’ve been busy doing key pints for a bit so haven’t looked into it. Can’t wait to hear an update. Great work! :slight_smile:

So after adding transforms a big difference in performance emerges: 73.6% vs 69.08%, V1 vs V2. So it’s probably our implementation of the v2 transforms that is driving the difference (maaaybe a small chance it’s a difference in the library’s own implementation of the transforms, but unlikely I’d guess).

Will try to do an ablation test tomorrow to see if I can narrow down the culprit. Note that I need to look properly for a v2 version of the 3rd transform below (“resize and crop”).

Transforms used (V1 naming)

  • flip_lr
  • presize(128, scale=(0.35,1)) (Resize images to size using RandomResizedCrop)
  • size=128 (equivalent to resize and crop; the “no transform” version above used size=(128,128), which is equal to squish)

Fastai V1 Result

(73.6+74.2+73.8+72+74.6)/5 = 73.64%

Databunch code:

img_ls = ImageList.from_folder(src).split_by_folder(train='train', valid='val').label_from_folder()

img_ls = img_ls.transform(([flip_lr(p=0.5)], []), size=(128))

data =img_ls.databunch(bs=64, num_workers=nw).presize(128, scale=(0.35,1)).normalize(imagenet_stats)

Fastai V2 Result

(66.8+71+67.8+71+68.8)/5 = 69.08%

Databunch code:

tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]

item_tfms = [FlipItem(0.5)]

dsrc = DataSource(items, tfms, splits=split_idx)

batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]

dbch = dsrc.databunch(item_tfms=item_tfms,
                      after_item=[ToTensor(), RandomResizedCrop(128, min_scale=0.35)],
                      after_batch=batch_tfms,
                      bs=64, 
                      num_workers=nw)

I tried it out again and included a c_out (which was missing before and leading to slightly higher losses). Good news is after one of my runs I got our 74.8%!!! It looks like the bug in the head of the models was the issue (which Jeremy fixed yesterday). @morgan I’m running it for five but here is my code:

learn = Learner(dbch,xresnet50(sa=True, c_out=10), opt_func=opt_func, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
fit_fc(learn, 5, 4e-3)

Average of five was not as good though, [70.4,74.8,71.0,71.2,73.0] but much better!!! (average is 72.08)


@muellerzr I’m guessing you’re using custom functions defined in this notebook? I tried running with only fastai and my accuracy is around 65% :sleepy:, here is the code in case someone wants to check out.

I was not, I’ll post a new notebook later (when I have the time to do so, I’m all over the place this week) but I’ll let you know.

Reread your bit @lgvaz and yes, I was using those custom functions that were in the notebook. Try that and see if it helps. If you get the same as I did I’ll compare what’s in there vs the library


I can confirm that running the code on that notebook gets the accuracy to 72%.

I’ll compare all the functions from my nb and his nb and try to spot the difference.


@lgvaz it looks like the two models are slightly different. First, we should be specifying act_cls=MishJit.

Also looks like some of the convolution sizes are different. E.g. ours in group 1 has a Conv2d(32,64,...) whereas fastai has Conv2d(32,32,...). Along with this, in our implementation the ResBlock has a final activation whereas fastai does not.

I found where that is. Line 431 in layers.py

self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None, ndim=ndim, **kwargs)

@jeremy I’m going to tag you into here. If we can show a version working with the activation change and filter change can we post a PR for the layers?

Along with this you do not have an activation present in the last ConvLayer of a resblock but we do
