Meet Ranger - RAdam + Lookahead optimizer

I believe this kind of implementation will result in losing the values of alpha and k, as well as the slow_weights.

1 Like

True, but since alpha and k are constant for a given training run, I haven’t had to deal with that so far (and I haven’t yet played around with alpha and k).

I considered passing alpha and k to the state_dict of base_optimizer to solve this. But what was the exact issue you had with Pickle interaction using your implementation?

EDIT: actually, you gave me an idea; I just pushed a commit that reconciles the state getter and loader and includes alpha and k! Cheers for that @grankin :slight_smile:
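A minimal sketch of the idea being discussed (carrying alpha, k and the slow weights through the state getter/loader so they survive a save/load) might look like this; the class and attribute names are illustrative, not the actual repository code:

```python
# Illustrative sketch: a Lookahead-style wrapper whose state_dict carries
# alpha, k and the slow weights, so nothing is lost across save/load.
class LookaheadSketch:
    def __init__(self, base_optimizer, alpha=0.5, k=6):
        self.optimizer = base_optimizer
        self.alpha = alpha
        self.k = k
        self.step_counter = 0
        # One slow copy per fast parameter, kept outside the autograd graph.
        self.slow_weights = [
            [p.detach().clone() for p in group["params"]]
            for group in base_optimizer.param_groups
        ]

    def step(self, closure=None):
        loss = self.optimizer.step(closure)
        self.step_counter += 1
        if self.step_counter % self.k == 0:
            for group, slows in zip(self.optimizer.param_groups, self.slow_weights):
                for fast, slow in zip(group["params"], slows):
                    slow += self.alpha * (fast.detach() - slow)  # slow <- slow + alpha*(fast - slow)
                    fast.data.copy_(slow)                        # fast <- slow
        return loss

    def state_dict(self):
        # Persist hyper-parameters and slow weights alongside the base optimizer state.
        return {
            "base": self.optimizer.state_dict(),
            "alpha": self.alpha,
            "k": self.k,
            "step_counter": self.step_counter,
            "slow_weights": self.slow_weights,
        }

    def load_state_dict(self, state):
        self.optimizer.load_state_dict(state["base"])
        self.alpha = state["alpha"]
        self.k = state["k"]
        self.step_counter = state["step_counter"]
        self.slow_weights = state["slow_weights"]
```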

1 Like

The issue is that I get different results if I do a save/load than if I don’t. I think the proper behaviour for save/load is as if it had never happened.

The Pickle loader doesn’t call the __init__ constructor, so alpha and k would be uninitialised after a load. And the slow_weights are lost.
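A tiny illustration of that point: unpickling restores whatever the state getter captured, but it never runs __init__ (the toy Wrapper class below is just for demonstration):

```python
import pickle


class Wrapper:
    def __init__(self):
        self.alpha = 0.5   # only set in __init__
        self.k = 6

    def __getstate__(self):
        # Pretend the state getter forgets alpha: only k survives pickling.
        return {"k": self.k}


w = Wrapper()
restored = pickle.loads(pickle.dumps(w))
print(hasattr(restored, "k"))      # True: restored from the pickled state
print(hasattr(restored, "alpha"))  # False: __init__ was never called on load
```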

I guess you did, but better to ask: have you set a seed for this check?

I see. Do you have a code reference for this pickle loader, so that I can take a look?

I use a seed and I get deterministic results between runs.

I’ve used fast.ai’s load https://github.com/fastai/fastai/blob/master/fastai/basic_train.py#L262

Which in turn uses PyTorch’s torch.load.

And PyTorch uses Python’s pickle: https://docs.python.org/3/library/pickle.html
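For reference, the save/load transparency check being discussed could be sketched like this with the fastai v1 API; create_learner is a hypothetical helper that rebuilds the same Learner from scratch, and 'ckpt' is just an illustrative checkpoint name:

```python
import random
import numpy as np
import torch


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


# Run A: train straight through.
set_seed()
learn_a = create_learner()   # hypothetical helper rebuilding the same Learner
learn_a.fit(2)

# Run B: identical setup, but with a save/load in the middle.
set_seed()
learn_b = create_learner()
learn_b.fit(1)
learn_b.save('ckpt')
learn_b.load('ckpt')
learn_b.fit(1)

# If save/load is truly transparent, both runs should end with the same losses.
```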

Oh, I thought you were talking about an alternative loading method. Since you pointed out the param duplication, I looked more closely at the PyTorch optimizer implementation and made a few changes to my implementation.

It should solve the duplication issue as well as improve support for the base optimizer’s methods (so that it’s a proper wrapper :sweat_smile:)
Let me know if the commit helps!

2 Likes

Most of the Lookahead / Ranger implementations have issues with state dict save/load, and adding parameters after Optimizer creation via add_param_group causes a crash.

I’ve gone through a few iterations starting from https://github.com/alphadl/lookahead.pytorch (which is closer to correct than the lonePatient implementation).

The current state is here; I’ve tested resumes with different optimizers. There could still be issues, but I think it’s close :slight_smile:

EDIT: Also, I added support in this one for resuming a checkpoint with Lookahead(OptA) that was created with just OptA.
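One way to make this survive save/load (and groups added later) is to keep the slow copies inside the base optimizer’s own serialized containers, here its per-parameter state, so the standard state_dict()/load_state_dict() machinery round-trips them, and to delegate add_param_group. The sketch below is an illustration of that idea, not the linked code:

```python
class LookaheadWrapper:
    def __init__(self, base_optimizer, alpha=0.5, k=6):
        self.optimizer = base_optimizer
        self.alpha = alpha
        self.k = k
        self._step_count = 0

    @property
    def param_groups(self):
        return self.optimizer.param_groups

    def step(self, closure=None):
        loss = self.optimizer.step(closure)
        self._step_count += 1
        if self._step_count % self.k == 0:
            for group in self.optimizer.param_groups:
                for p in group["params"]:
                    state = self.optimizer.state[p]
                    if "slow_buffer" not in state:               # created lazily
                        state["slow_buffer"] = p.detach().clone()
                    slow = state["slow_buffer"]
                    slow.add_(p.detach() - slow, alpha=self.alpha)  # slow += alpha*(fast - slow)
                    p.data.copy_(slow)                              # fast <- slow
        return loss

    def zero_grad(self):
        self.optimizer.zero_grad()

    # The slow buffers live in the base optimizer's state, so plain delegation
    # is enough for checkpoints to round-trip; it also lets a Lookahead(OptA)
    # resume from a checkpoint saved with plain OptA.
    def state_dict(self):
        return self.optimizer.state_dict()

    def load_state_dict(self, state_dict):
        self.optimizer.load_state_dict(state_dict)

    def add_param_group(self, param_group):
        self.optimizer.add_param_group(param_group)
```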

4 Likes

Wow, that’s a terrific job, it works like a charm! I did a save/load and got exactly the same validation loss.

1 Like

slow_weights are now stored in param_groups, I like that trick!

1 Like

Yes, I figured it would give more coherence to the implementation since we inherit from Optimizer. And it allows us to use the inherited methods to reduce the code base :wink: The only method that I’m not overriding is a private one that ensures smooth param_group addition for the slow weights.

Also, you might want to check the discussion in this thread. I added a param synchronization method for external calls, so that users can choose which state they want their model to be evaluated with. My current version is available here!
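A rough sketch of what such a synchronization hook can look like: copy the slow (averaged) weights into the model parameters before evaluation, keeping a backup of the fast weights to restore afterwards. The function names below are illustrative, not the actual API of the linked implementation; `wrapper` is assumed to be a Lookahead-style wrapper that stores a `slow_buffer` per parameter, as in the earlier sketch.

```python
def sync_with_slow_weights(wrapper):
    """Copy slow buffers into the (fast) parameters, backing up the fast ones."""
    backup = {}
    for group in wrapper.optimizer.param_groups:
        for p in group["params"]:
            state = wrapper.optimizer.state[p]
            if "slow_buffer" in state:
                backup[p] = p.detach().clone()
                p.data.copy_(state["slow_buffer"])
    return backup


def restore_fast_weights(backup):
    """Undo the swap after evaluation."""
    for p, fast in backup.items():
        p.data.copy_(fast)
```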

Cheers

2 Likes

Your procedure to use Ranger is not working for me. I am getting TypeError: 'module' object is not callable when I run this line: optar = partial(Ranger).

Thanks for your work !
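That error usually means the name Ranger is bound to a module object rather than to the optimizer class. Assuming the class lives in a file named ranger.py, the usual fix is to import the class itself:

```python
from functools import partial
from ranger import Ranger   # import the class, not the module (not `import ranger`)

optar = partial(Ranger)      # optar(model.parameters()) now builds a Ranger optimizer
```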

Using Ranger on my model, I did a save/load and started training again. The training loss behaves in a completely different way after this save/load step (the training speed decreases).

Does anyone have the same problem? I think the optimizer is not saved correctly. I get better performance if I specify with_opt=False when I load the model.
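For reference, the workaround mentioned looks like this in fastai v1 (`learn` is the existing Learner and 'ckpt' is just an illustrative checkpoint name); dropping the optimizer state on load means the possibly broken Lookahead state is rebuilt from scratch:

```python
learn.save('ckpt')
learn.load('ckpt', with_opt=False)   # reload the weights only, discard the optimizer state
```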

1 Like

Thanks for the feedback! It sounds like we are dragging around duplicate slow weights… I will take a look and try to fix it!

I didn’t get a chance to test it, but I believe the fix is to simply leverage what @rwightman did and move the slow weights into a state param group (which, as usual, is brilliant coding by him).
That way they are reloaded properly, which should correct this issue.
I’ll try to do that tomorrow, but at least I believe I know the issue, and copying @rwightman’s excellent idea should fix it.

I think the optimizer does not work with pretrained models when the model has different layer groups. For some reason, it stops after one epoch. Could you please look into this?

I’m testing an update now that should handle layer groups. What model are you using? I’ll test a run with the fix.
Thanks!

I am using an EfficientNetB3 from the Luke Melas repository with the following split:
learn.split( lambda m: (m._conv_head,) )

1 Like

I’ve posted a new version of Ranger - it has improved support for layer groups and a much tighter codebase all around (one-pass handling at the param level, no repeated loops, slow weights moved into the state dict, etc).

Can you please see if that resolves your issue?

New version 9.3.19

*Also, thanks to @rwightman, as I leveraged some of his code ideas related to putting the slow weights into a state dictionary versus how lonepatient originally did it.
I’m working on integrating @fgfm’s idea regarding partial sync next.
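To illustrate the "one pass at the param level" structure described above, here is a runnable sketch with a plain Adam-style inner update standing in for RAdam and the slow lookahead buffer kept in the per-parameter state. The hyper-parameter names (alpha, k) follow the thread; the rest is illustrative and not the released Ranger code:

```python
import torch
from torch.optim import Optimizer


class OnePassLookaheadAdam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, alpha=0.5, k=6):
        defaults = dict(lr=lr, betas=betas, eps=eps, alpha=alpha, k=k)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                    state["slow_buffer"] = p.detach().clone()   # lookahead slow copy

                state["step"] += 1
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                exp_avg.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)

                bias_c1 = 1 - beta1 ** state["step"]
                bias_c2 = 1 - beta2 ** state["step"]
                denom = (exp_avg_sq / bias_c2).sqrt().add_(group["eps"])
                p.addcdiv_(exp_avg, denom, value=-group["lr"] / bias_c1)

                # Lookahead, folded into the same loop: every k steps pull the
                # slow weights toward the fast ones, then reset the fast weights.
                if state["step"] % group["k"] == 0:
                    slow = state["slow_buffer"]
                    slow.add_(p - slow, alpha=group["alpha"])
                    p.copy_(slow)
        return loss
```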

3 Likes

Thanks! When I get the chance, I will try it out with layer groups and let you know!

2 Likes

Is there any kind of early consensus around how to handle .fit_one_cycle() in combination with RAdam or Ranger yet?
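For context, wiring Ranger into fastai v1 looks roughly like this (assuming the usual fastai.vision imports, an existing DataBunch named data, and Ranger importable as above); whether the one-cycle schedule is the right companion for RAdam/Ranger is exactly the open question here:

```python
from functools import partial
from fastai.vision import cnn_learner, models
from ranger import Ranger   # assumed import, see the earlier posts

learn = cnn_learner(data, models.resnet34, opt_func=partial(Ranger))
learn.fit_one_cycle(5, max_lr=1e-3)   # vs. e.g. a flat LR followed by cosine annealing
```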

1 Like