Meet Ranger - RAdam + Lookahead optimizer

Awesome! Any chance you could skip the QH and have plain Ranger? (Just so I can compare.) Or tell me where to comment it out :slight_smile: (let me see if I can figure it out too). I think I may have found something. I believe that on line 11, v should be: v = math.sqrt((1-sqr_mom**step) * (r - 4) / (r_inf - 4) * (r-2) / r*r_inf / (r_inf -2)) for RAdam's step.
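
For comparison, here is the rectification term as it appears in the RAdam paper, written out in plain Python with the variable names used in this thread (sqr_mom, step, r, r_inf). This is only a reference sketch to check line 11 against, not the fastai2 code:

import math

def radam_rect(sqr_mom, step):
    # rho_infinity and rho_t from the RAdam paper
    r_inf = 2/(1 - sqr_mom) - 1
    r = r_inf - 2*step*sqr_mom**step/(1 - sqr_mom**step)
    if r <= 4: return None  # variance not yet tractable: the paper falls back to an un-rectified step
    return math.sqrt(((r - 4) * (r - 2) * r_inf) / ((r_inf - 4) * (r_inf - 2) * r))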

Let me see if that helps…

Still don't think it's quite working right; the initial loss is still 2.74. I think I need to figure out where this step goes: v/(1-sqr_avg**step)

Couldn't seem to find what's going on with RAdam…

@morgan does it look to you like the exponential average is being applied?

Just checked with RAdam on its own and it performs worse than Ranger, so Lookahead is doing something right… just don't know if the problem is with the fastai2 RAdam implementation, Lookahead, or their interaction…
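
For anyone following along: the Lookahead part on its own is conceptually small. The inner optimizer (RAdam here) takes k fast steps, then the slow weights are pulled a fraction alpha toward the fast weights and the fast weights restart from them. A minimal sketch of that synchronization step (the names are mine, not fastai2's):

import torch

def lookahead_sync(fast_p, slow_p, alpha=0.5):
    # run every k inner-optimizer steps: slow <- slow + alpha*(fast - slow), then restart fast from slow
    slow_p.add_(alpha * (fast_p - slow_p))
    fast_p.copy_(slow_p)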

Yep, that's exactly what I was puzzling over, but I wanted to check that I wasn't missing any fastai magic :smiley:

So v_t and m_t below don't look like they're being calculated in the current RAdam implementation, correct?

Looks that way to me. I've been focusing on understanding how to migrate it all over, so let me make sure I understand how to go about this: we should define it pre-step, grab the gradients, and apply it during the step? And pass them in as an argument? Or should we generate them during the step?

I would say pre-step, similar to how the state for average_grad and average_sqr_grad is calculated.

Sounds good. I'll try that out shortly and see what happens.

Something like this for the update to the first moment, I guess? I just need the first moment for QHAdam, so I'll test this in parallel to your work :slight_smile:

def exp_average(state, p, mom, **kwargs):
    # exponential moving average of the gradient, kept in the optimizer state
    if 'exp_avg' not in state: state['exp_avg'] = torch.zeros_like(p.data)
    state['exp_avg'] = (mom * state['exp_avg']) + ((1-mom) * p.grad.data)
    return state

I don't understand your remarks: hat v_t and hat m_t are computed, just not separately. debias1 and debias2 are the two terms you need to divide them by, and you can see them used in the line that applies the update to p.
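
For readers mapping the math to the code: the bias correction that debias1 and debias2 correspond to, written in the paper's notation (a reference sketch, not the fastai2 source):

def bias_corrected(avg, beta, step):
    # standard Adam bias correction: divide the running average by (1 - beta**step)
    return avg / (1 - beta**step)

# hat m_t = bias_corrected(grad_avg, mom, step)
# hat v_t = bias_corrected(sqr_avg, sqr_mom, step)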

Ah, thank you for the clarification @sgugger :slight_smile: (I'm very new to translating the math into code). Just trying to figure out why there's the accuracy difference. :confused:

No worries. Note that there are multiple factors that could impact accuracy from v1 to v2; the data augmentation is done quite differently, for instance. Is your comparison using the same data objects and the same training loop, with just the optimizers being different?

Yes. v2 augs are:

item_img_tfms = [ToTensor(), FlipItem(0.5), RandomResizedCrop(128, min_scale=0.35)]

and v1 augs are

.transform(([flip_lr(p=0.5)], []), size=size)

With size being 128.

Same architecture and everything as well.

The training loop is fit_flat_cos (just in case, I also checked the original fc_fit).

I think I found and solved the issue. It was in the model declaration: the number of classes for the arch was still 1000. I'm surprised this didn't throw an error (a mismatch error during the tensor calcs, I mean). Let me run again real quick to double-check.
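
A quick sanity check along those lines (a sketch only; it assumes the c_out argument mentioned a few posts below, that dbunch.c holds the number of classes, and that the xresnet model is a Sequential whose last module is the linear head):

model = xresnet50(c_out=dbunch.c)
assert model[-1].out_features == dbunch.c, "head size does not match the number of classes"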

You misunderstood: you should compare a training done with the same databunch object (either v1 or v2) and the same training loop (either v1 or v2), and just change the optimizer, to make sure the optimizer is the problem.

For instance, your v2 data has RRC and I don't see it in your v1 data.

Hmmm… I'll look into that. After adjusting the c_out I still get 69%… but the loss does look better.

Looks like that's the equivalent of just Resize. 71%, getting better :slight_smile:

RRC was being applied:

(ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs, num_workers=workers)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

Still getting stuck there… odd… It's not the transforms and it's not the optimizer then (probably):

tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_img_tfms = [ToTensor(), FlipItem(0.5), RandomResizedCrop(128, min_scale=0.35), Resize(128)]

That's as far as I can get today. @morgan let me know what you come up with :confused:

QHAdam

Below is a rough version of QHAdam, which can then be used with the Lookahead wrapper to give RangerQH. The parameter update could definitely be refactored to be more efficient/elegant.

def opt_func(ps, lr=defaults.lr): return Lookahead(QHAdam(ps, lr=lr, wd=1e-2, mom=0.9, eps=1e-6))

def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nus, step, grad_avg, eps, **kwargs):
    nu_1 = nus[0]
    nu_2 = nus[1]

    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)

    num   = ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1))
    denom = (((1-nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps

    p.data = p.data - lr * (num / denom)
    return p

def QHAdam(params, lr=1e-3, mom=0.9, sqr_mom=0.99, nus=[(.7, 1.0)], eps=1e-8, wd=0., decouple_wd=False):
    from functools import partial
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [average_grad, average_sqr_grad, step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nus=nus, mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
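
A rough end-to-end usage sketch for the above (hypothetical hyperparameter values; it assumes the fastai2 Learner and fit_flat_cos APIs referenced elsewhere in this thread):

learn = Learner(dbunch, xresnet50(c_out=dbunch.c), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit_flat_cos(5, lr=4e-3)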

Testing fastai Ranger vs fastai2 Ranger

I tried removing all the transformations and shuffling in v1 and v2 for a fair comparison between the two, but I keep getting an error when trying to remove all of the transforms in item_img_tfms in fastai2. I guess I need to get more familiar with the new data API. For fastai2 I used AffineCoordTfm to do the resizing, as I had seen Jeremy do it in the Kaggle RSNA comp. Happy to test both for Ranger if I can get fastai2 to give me images with no transforms :rofl:

fastai2 data (gives error)

src = Path('data/imagewoof/imagewoof')
items = get_image_files(src)
split_idx = GrandparentSplitter(valid_name='val')(items)
tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_img_tfms = [ToTensor()]
dsrc = DataSource(items, tfms, splits=split_idx)
batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]
dbunch = dsrc.databunch(shuffle_train=False, after_batch=batch_tfms+[AffineCoordTfm(size=128)], bs=64)

fastai v1 data

n_gpus = num_distrib() or 1
nw = min(8, num_cpus()//n_gpus)
img_ls = ImageList.from_folder(src).split_by_folder(valid='val').label_from_folder()
img_ls = img_ls.transform(([flip_lr(p=0.0)], []), size=128)
data = img_ls.databunch(bs=64, num_workers=nw).normalize(imagenet_stats)
data.train_dl = data.train_dl.new(shuffle=False)

I'll take a look in a moment and see what I can find :slight_smile: I'll update this post.

Looks like there is something missing with the initialization, among a few other things, in the fastai version of Mish with xresnet (first epoch I got 12% vs 29% with ours). Working on fixing it. I believe it's due to some of the activation functions still being a ReLU (posted a PR; the PR has been approved). I'll see what else I can find later today.

However, here is what I tried. I didn't use the affine transform, as there is a bug involving PyTorch, that transform, and Google Colab.

woof = DataBlock(blocks=(ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=GrandparentSplitter(valid_name='val'),
                 get_y=parent_label)

dbunch = woof.databunch(untar_data(URLs.IMAGEWOOF), bs=32, item_tfms=RandomResizedCrop(128),
                        batch_tfms=Normalize(*imagenet_stats))

This should only resize the image. Only 64% here…

Won't that take a random crop as opposed to a squished or center crop? (Not even sure what fastai v1's .transform(size=...) does, actually.) I guess it should be pretty close though.
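
One way to make the two pipelines closer might be to force a deterministic resize in v2, assuming the fastai2 Resize of this era accepts a method argument with ResizeMethod.Squish / Crop / Pad:

dbunch = woof.databunch(untar_data(URLs.IMAGEWOOF), bs=32,
                        item_tfms=Resize(128, method=ResizeMethod.Squish),
                        batch_tfms=Normalize(*imagenet_stats))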

Sorry I haven't been able to compare the two; I've been trying to get QHAdam done. I'm ready to submit a clean version of QHAdam, but I'm blocked by tests failing for average_grad in the Optimizer notebook :frowning:

You're good! I traced it back to a random resized crop being applied. Oh no, that's a headache :frowning: I put in the fix for the one ReLU that wasn't being changed to Mish, so all the pieces should now be natively there for us to try and play with.

Woohoo, so now they're both showing the same performance?

Nope :confused: but now, instead of copying code to debug, we can use the library itself. Still yet to match the 73-78%.

To add onto my earlier comment, presize does the RRC.

Also @morgan, see Jeremy's note in the Practical 2.0 thread. He said to check and make sure eps is being used the same way in the 1.0 and 2.0 versions of RAdam (the versions Ranger used).
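
For concreteness, the usual thing to check is whether eps is added inside or outside the square root; the two variants can differ by orders of magnitude when the second moment is tiny. This is generic Adam arithmetic, not a claim about either implementation:

import torch

sqr_avg, debias2, eps = torch.tensor(1e-12), 0.1, 1e-8
print((sqr_avg / debias2).sqrt() + eps)    # eps outside the sqrt -> ~3.2e-06
print(((sqr_avg / debias2) + eps).sqrt())  # eps inside the sqrt  -> ~1.0e-04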

In my tests (on a totally different dataset), a learner with to_fp16() scores lower than when I train normally; in addition, I need to train longer to get close to Adam, but the score is still lower… so the calculation precision might be the cause.
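
A generic illustration of why precision could matter (not a claim about where fastai computes the mixed-precision optimizer step): a typical eps of 1e-8 simply underflows to zero in float16.

import torch

print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-8, dtype=torch.float32))  # tensor(1.0000e-08)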
