Meet Ranger - RAdam + Lookahead optimizer

Awesome! Any chance you could skip the QH and have plain Ranger? (Just so I can compare.) Or tell me where to comment it out :slight_smile: (let me see if I can figure it out too). I think I may have found something. I believe that on line 11, v should be: v = math.sqrt((1-sqr_mom**step) * (r - 4) / (r_inf - 4) * (r-2) / r*r_inf / (r_inf -2)) for RAdam's step.
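
For comparison, here is the rectification term as it appears in the RAdam paper, written out in plain Python with the variable names used in this thread (sqr_mom, step, r, r_inf). This is only a reference sketch to check line 11 against, not the fastai2 code:

import math

def radam_rect(sqr_mom, step):
    # rho_infinity and rho_t from the RAdam paper
    r_inf = 2/(1 - sqr_mom) - 1
    r = r_inf - 2*step*sqr_mom**step/(1 - sqr_mom**step)
    if r <= 4: return None  # variance not yet tractable: the paper falls back to an un-rectified step
    return math.sqrt(((r - 4) * (r - 2) * r_inf) / ((r_inf - 4) * (r_inf - 2) * r))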

Let me see if that helps…

Still don't think it's quite working right; the initial loss is still 2.74. I think I need to figure out where this step goes: v/(1-sqr_avg**step)

Couldn't seem to find what's going on with RAdam…

@morgan does it look to you like the exponential average is being applied?

Just checked with RAdam on its own and it performs worse than Ranger, so Lookahead is doing something right… just don't know if the problem is with the fastai2 RAdam implementation, Lookahead, or their interaction…
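
For anyone following along: the Lookahead part on its own is conceptually small. The inner optimizer (RAdam here) takes k fast steps, then the slow weights are pulled a fraction alpha toward the fast weights and the fast weights restart from them. A minimal sketch of that synchronization step (the names are mine, not fastai2's):

import torch

def lookahead_sync(fast_p, slow_p, alpha=0.5):
    # run every k inner-optimizer steps: slow <- slow + alpha*(fast - slow), then restart fast from slow
    slow_p.add_(alpha * (fast_p - slow_p))
    fast_p.copy_(slow_p)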

Yep, that's exactly what I was puzzling over, but I wanted to check that I wasn't missing any fastai magic :smiley:

So v_t and m_t below don't look like they're being calculated in the current RAdam implementation, correct?

Looks that way to me. I've been focusing on understanding how to migrate it all over, so let me make sure I understand how to go about this: we should define it pre-step, grab the gradients, and apply it during the step? And pass them in as an argument? Or should we generate them during the step?

I would say pre-step, similar to how the state for average_grad and average_sqr_grad is calculated.

Sounds good. I'll try that out shortly and see what happens.

Something like this for the update to the first moment, I guess? I just need the first moment for QHAdam, so I'll test this in parallel to your work :slight_smile:

def exp_average(state, p, mom, **kwargs):
    # exponential moving average of the gradient, kept in the optimizer state
    if 'exp_avg' not in state: state['exp_avg'] = torch.zeros_like(p.data)
    state['exp_avg'] = (mom * state['exp_avg']) + ((1-mom) * p.grad.data)
    return state

I don't understand your remarks: hat v_t and hat m_t are computed, just not separately. debias1 and debias2 are the two terms you need to divide them by, and you can see them used in the line that applies the update to p.
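
For readers mapping the math to the code: the bias correction that debias1 and debias2 correspond to, written in the paper's notation (a reference sketch, not the fastai2 source):

def bias_corrected(avg, beta, step):
    # standard Adam bias correction: divide the running average by (1 - beta**step)
    return avg / (1 - beta**step)

# hat m_t = bias_corrected(grad_avg, mom, step)
# hat v_t = bias_corrected(sqr_avg, sqr_mom, step)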

Ah, thank you for the clarification @sgugger :slight_smile: (I'm very new to translating the math into code). Just trying to figure out why there's the accuracy difference. :confused:

No worries. Note that there are multiple factors that could impact accuracy from v1 to v2; the data augmentation is done quite differently, for instance. Is your comparison using the same data objects and the same training loop, with just the optimizers being different?

Yes. v2 augs are:

item_img_tfms = [ToTensor(), FlipItem(0.5), RandomResizedCrop(128, min_scale=0.35)]

and v1 augs are

.transform(([flip_lr(p=0.5)], []), size=size)

With size being 128.

Same architecture and everything as well.

The training loop is fit_flat_cos (just in case, I also checked the original fc_fit).

I think I found and solved the issue. It was in the model declaration: the number of classes for the arch was still 1000. I'm surprised this didn't throw an error (a mismatch error during the tensor calcs, I mean). Let me run again real quick to double-check.
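
A quick sanity check along those lines (a sketch only; it assumes the c_out argument mentioned a few posts below, that dbunch.c holds the number of classes, and that the xresnet model is a Sequential whose last module is the linear head):

model = xresnet50(c_out=dbunch.c)
assert model[-1].out_features == dbunch.c, "head size does not match the number of classes"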

You misunderstood: you should compare a training done with the same databunch object (either v1 or v2) and the same training loop (either v1 or v2), and just change the optimizer, to make sure the optimizer is the problem.

For instance, your v2 data has RRC and I don't see it in your v1 data.

Hmmm… I'll look into that. After adjusting the c_out I still get 69%… but the loss does look better.

Looks like that's the equivalent of just Resize. 71%, getting better :slight_smile:

RRC was being applied:

(ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs, num_workers=workers)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

Still getting stuck there… odd… It's not the transforms and it's not the optimizer then (probably):

tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_img_tfms = [ToTensor(), FlipItem(0.5), RandomResizedCrop(128, min_scale=0.35), Resize(128)]

That's as far as I can get today. @morgan let me know what you come up with :confused:

QHAdam

Below is a rough version of QHAdam, which can then be used with the Lookahead wrapper to give RangerQH. The parameter update could definitely be refactored to be more efficient/elegant.

def opt_func(ps, lr=defaults.lr): return Lookahead(QHAdam(ps, lr=lr, wd=1e-2, mom=0.9, eps=1e-6))

def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nus, step, grad_avg, eps, **kwargs):
    nu_1 = nus[0]
    nu_2 = nus[1]

    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)

    num   = ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1))
    denom = (((1-nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps

    p.data = p.data - lr * (num / denom)
    return p

def QHAdam(params, lr=1e-3, mom=0.9, sqr_mom=0.99, nus=[(.7, 1.0)], eps=1e-8, wd=0., decouple_wd=False):
    from functools import partial
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [average_grad, average_sqr_grad, step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nus=nus, mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
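
A rough end-to-end usage sketch for the above (hypothetical hyperparameter values; it assumes the fastai2 Learner and fit_flat_cos APIs referenced elsewhere in this thread):

learn = Learner(dbunch, xresnet50(c_out=dbunch.c), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit_flat_cos(5, lr=4e-3)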

Testing fastai Ranger vs fastai2 Ranger

I tried removing all the transformations and shuffling in v1 and v2 for a fair comparison between the two, but I keep getting an error when trying to remove all of the transforms in item_img_tfms in fastai2. I guess I need to get more familiar with the new data API. For fastai2 I used AffineCoordTfm to do the resizing, as I had seen Jeremy do it in the Kaggle RSNA comp. Happy to test both for Ranger if I can get fastai2 to give me images with no transforms :rofl:

fastai2 data (gives error)

src = Path('data/imagewoof/imagewoof')
items = get_image_files(src)
split_idx = GrandparentSplitter(valid_name='val')(items)
tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_img_tfms = [ToTensor()]
dsrc = DataSource(items, tfms, splits=split_idx)
batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]
dbunch = dsrc.databunch(shuffle_train=False, after_batch=batch_tfms+[AffineCoordTfm(size=128)], bs=64)

fastai v1 data

n_gpus = num_distrib() or 1
nw = min(8, num_cpus()//n_gpus)
img_ls = ImageList.from_folder(src).split_by_folder(valid='val').label_from_folder()
img_ls = img_ls.transform(([flip_lr(p=0.0)], []), size=128)
data = img_ls.databunch(bs=64, num_workers=nw).normalize(imagenet_stats)
data.train_dl = data.train_dl.new(shuffle=False)

I'll take a look in a moment and see what I can find :slight_smile: I'll update this post.

Looks like there is something missing with the initialization, among a few other things, in the fastai version of Mish with xresnet (first epoch I got 12% vs 29% with ours). Working on fixing it. I believe it's due to some of the activation functions still being a ReLU (posted a PR; the PR has been approved). I'll see what else I can find later today.

However, here is what I tried. I didn't use the affine transform, as there is a bug involving PyTorch, that transform, and Google Colab.

woof = DataBlock(blocks=(ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=GrandparentSplitter(valid_name='val'),
                 get_y=parent_label)

dbunch = woof.databunch(untar_data(URLs.IMAGEWOOF), bs=32, item_tfms=RandomResizedCrop(128),
                        batch_tfms=Normalize(*imagenet_stats))

This should only resize the image. Only 64% here…

Won't that take a random crop as opposed to a squished or center crop? (Not even sure what fastai v1's .transform(size=...) does, actually.) I guess it should be pretty close though.
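
One way to make the two pipelines closer might be to force a deterministic resize in v2, assuming the fastai2 Resize of this era accepts a method argument with ResizeMethod.Squish / Crop / Pad:

dbunch = woof.databunch(untar_data(URLs.IMAGEWOOF), bs=32,
                        item_tfms=Resize(128, method=ResizeMethod.Squish),
                        batch_tfms=Normalize(*imagenet_stats))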

Sorry I haven't been able to compare the two; I've been trying to get QHAdam done. I'm ready to submit a clean version of QHAdam, but I'm blocked by tests failing for average_grad in the Optimizer notebook :frowning:

You're good! I traced it back to a random resized crop being applied. Oh no, that's a headache :frowning: I put in the fix for the one ReLU that wasn't being changed to Mish, so all the pieces should now be natively there for us to try and play with.

Woohoo, so now they're both showing the same performance?

Nope :confused: but now, instead of copying code to debug, we can use the library itself. Still yet to match the 73-78%.

To add onto my earlier comment, presize does the RRC.

Also @morgan, see Jeremy's note in the Practical 2.0 thread. He said to check and make sure eps is being used the same way in the 1.0 and 2.0 versions of RAdam (the versions Ranger used).
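
For concreteness, the usual thing to check is whether eps is added inside or outside the square root; the two variants can differ by orders of magnitude when the second moment is tiny. This is generic Adam arithmetic, not a claim about either implementation:

import torch

sqr_avg, debias2, eps = torch.tensor(1e-12), 0.1, 1e-8
print((sqr_avg / debias2).sqrt() + eps)    # eps outside the sqrt -> ~3.2e-06
print(((sqr_avg / debias2) + eps).sqrt())  # eps inside the sqrt  -> ~1.0e-04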

In my tests (on a totally different dataset), a learner with to_fp16() scores lower than when I train normally; in addition, I need to train longer to get close to Adam, but the score is still lower… so the calculation precision might be the cause.
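
A generic illustration of why precision could matter (not a claim about where fastai computes the mixed-precision optimizer step): a typical eps of 1e-8 simply underflows to zero in float16.

import torch

print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-8, dtype=torch.float32))  # tensor(1.0000e-08)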
