Make a Learner with no param groups? (updated, partial fix)

I’m trying to integrate a new SLS optimizer, which expects there to be no param groups…whereas fastai by default makes 2 param groups.

**Update** - fastai is splitting the parameters into l1 and l2 groups, so that’s why even with 1 layer group you still get 2 param groups. A function to avoid that split (at your own risk) is below…

I’m going through the source code now to see how to avoid that split but if someone already knows, please post!

Most optimizers appropriately loop through the param groups, but SLS is a whole different kind of optimizer (it uses a closure, etc.) - and it looks promising for self-tuning!
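For context, the closure pattern means the optimizer wants to be able to re-evaluate the loss itself inside step(). Roughly like this generic PyTorch sketch (not SLS’s exact API - the closure details differ there, so treat the names as illustrative only):

import torch

def training_step(model, opt, xb, yb, loss_func):
    # generic closure pattern (as used by e.g. torch.optim.LBFGS):
    # the optimizer may call closure() several times, e.g. during a line search
    def closure():
        opt.zero_grad()
        loss = loss_func(model(xb), yb)
        loss.backward()
        return loss
    return opt.step(closure)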

Issue:
batch step size 1
new step size 1
loss - tensor(2.4591, device='cuda:0', grad_fn=<AddBackward0>)

check for p.grad… <class 'list'>

169 = total params with 2 total groups

If you define the architecture beforehand and just use Learner it doesn’t use splits at all. (Look at the cnn_learner code for evidence of this)

If you can’t find that, tell me @LessW2020 and I can show you the lines tonight :slight_smile: You should also be able to do the same by unfreezing the model just after making a cnn_learner-based model (I think).
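Something roughly like this (untested sketch - assumes fastai v1 with the usual from fastai.vision import * setup, a `data` DataBunch that’s already built, and an `mxresnet50()` constructor standing in for however you actually build the model):

model = mxresnet50()                             # hypothetical constructor for MXResNet50
learn = Learner(data, model, metrics=accuracy)   # plain Learner - no cnn_learner split logic
print(len(learn.layer_groups))                   # should show a single layer group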

Thanks @muellerzr - but I don’t mean the model’s layer groups (i.e. the model itself); I mean the parameter groups within the optimizer.

Example:
MXResNet50 = 1 layer group if I check via
learn.layer_groups

However:

learn.opt = OptimWrapper with two parameter groups
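For concreteness, a check along these lines shows the mismatch (sketch - it assumes the wrapped torch optimizer sits at learn.opt.opt, which is how I read the v1 OptimWrapper):

print(len(learn.layer_groups))             # -> 1 layer group for MXResNet50
learn.create_opt(lr=1e-3)                  # builds the OptimWrapper
print(len(learn.opt.opt.param_groups))     # -> 2 param groups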

Real quick, v1 or v2 of fastai?

Oops, sorry - I’m running fastai 1.0.57. :slight_smile:

All good! Maybe line 106 can provide hints?

That looks like a great place to check. I was just debugging to make sure I’m really seeing 2 param groups, and I am:

batch step size 1
new step size 1
loss - tensor(2.4591, device='cuda:0', grad_fn=<AddBackward0>)

check for p.grad… <class 'list'>

169 = total params with 2 total groups
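(For anyone following along, a rough way to reproduce those counts - a debug sketch, again assuming the underlying torch optimizer is at learn.opt.opt:)

pgs = learn.opt.opt.param_groups
n_params = sum(len(pg['params']) for pg in pgs)
print(n_params, '= total params with', len(pgs), 'total groups')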

Hmm, actually the stats are for its own tracking of momentum/debias, etc.

The root issue is its self.param_groups, as in:

def step(self, closure=None):
    self.update_stats()
    for i,pg in enumerate(self.param_groups):
        for p in pg['params']:
            if p.grad is not None: self.on_step(p, pg, i)

I need to figure out where this param_groups is being set, and then I should be able to override it.
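For reference, param_groups ultimately comes from the list of dicts handed to the underlying torch optimizer’s constructor - one dict per group. Quick standalone check:

import torch
import torch.nn as nn

m = nn.Linear(4, 2)
weights = [p for n, p in m.named_parameters() if n == 'weight']
biases  = [p for n, p in m.named_parameters() if n == 'bias']
# two dicts in -> two param groups out
opt = torch.optim.SGD([{'params': weights}, {'params': biases}], lr=0.1)
print(len(opt.param_groups))   # 2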

Seems to be somewhere in callback.py where it creates the OptimWrapper…maybe that second param group is related to splitting out weight decay?

@classmethod
def create(cls, opt_func:Union[type,Callable], lr:Union[float,Tuple,List], layer_groups:ModuleList, wd:Floats=0., 
           true_wd:bool=False, bn_wd:bool=True)->optim.Optimizer:
    "Create an `optim.Optimizer` from `opt_func` with `lr`. Set lr on `layer_groups`."
    split_params = split_no_wd_params(layer_groups)   # <-- the split in question
    opt = opt_func([{'params': p, 'lr':0} for p in split_params])
    opt = cls(opt, wd=wd, true_wd=true_wd, bn_wd=bn_wd)
    opt.lr,opt.opt_func = listify(lr, layer_groups),opt_func
    return opt

[edit] That’s the splitter causing this issue - split_no_wd_params…

which is in torch_core:

def split_no_wd_params(layer_groups:Collection[nn.Module])->List[List[nn.Parameter]]:
    "Separate the parameters in `layer_groups` between `no_wd_types` and  bias (`bias_types`) from the rest."
    split_params = []
    for l in layer_groups:
        l1,l2 = [],[]
        for c in l.children():
            if isinstance(c, no_wd_types): l2 += list(trainable_params(c))
            elif isinstance(c, bias_types):
                bias = c.bias if hasattr(c, 'bias') else None
                l1 += [p for p in trainable_params(c) if not (p is bias)]
                if bias is not None: l2.append(bias)
            else: l1 += list(trainable_params(c))
        #Since we scan the children separately, we might get duplicates (tied weights). We need to preserve the order
        #for the optimizer load of state_dict
        l1,l2 = uniqueify(l1),uniqueify(l2)
        split_params += [l1, l2]
    return split_params

Thus you will always get 2 param groups even though there is only 1 layer group.
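You can verify that directly with a quick check (assuming a fastai v1 install where split_no_wd_params and flatten_model are importable from fastai.torch_core):

from fastai.torch_core import split_no_wd_params, flatten_model
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
layer_groups = [nn.Sequential(*flatten_model(model))]   # a single layer group
split = split_no_wd_params(layer_groups)
print(len(layer_groups), len(split))   # 1 layer group -> 2 param groups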

Unfortunately that blows up this new SLS optimizer, so let me try to override it, at least to get SLS into testing…

making a new function that will keep things as one param group - in progress:

def filter_all_params_no_split(layer_groups:Collection[nn.Module])->List[List[nn.Parameter]]:
    "Gather every trainable parameter into a single param group (no wd/bias split)."
    pure = []
    buffer = []
    for l in layer_groups:
        for c in l.children():
            buffer += list(trainable_params(c))
    pure += [uniqueify(buffer)]
    return pure
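One way to wire it in is a monkey-patch along these lines (rough sketch, at your own risk - depending on how callback.py imported the original, you may need to patch the name in fastai.callback as well as fastai.torch_core):

import fastai.torch_core
import fastai.callback

# point both likely lookup paths at the no-split version
fastai.torch_core.split_no_wd_params = filter_all_params_no_split
fastai.callback.split_no_wd_params = filter_all_params_no_split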

That solves it - now there’s only 1 param group. On to the other SLS errors, but at least I’m past this issue :slight_smile:

batch step size 1
new step size 1
loss - tensor(2.4583, device='cuda:0', grad_fn=<AddBackward0>)

check for p.grad… <class 'list'>

169 = total params with 1 total groups
