Make a Learner with no param groups? (updated, partial fix)

I’m trying to integrate a new SLS optimizer, which expects there to be no param groups…whereas fastai by default makes 2 param groups.

**Update** - fastai is splitting the parameters into l1 and l2 groups, so that’s why even with 1 layer group you still get 2 param groups. A function to avoid that split (at your own risk) is below…

I’m going through the source code now to see how to avoid that split but if someone already knows, please post!

Most optimizers appropriately loop through the param groups, but SLS is a whole different kind of optimizer (it uses a closure, etc.) - and it looks promising for self-tuning!
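For context, the closure pattern means the optimizer wants to be able to re-evaluate the loss itself inside step(). Roughly like this generic PyTorch sketch (not SLS’s exact API - the closure details differ there, so treat the names as illustrative only):

import torch

def training_step(model, opt, xb, yb, loss_func):
    # generic closure pattern (as used by e.g. torch.optim.LBFGS):
    # the optimizer may call closure() several times, e.g. during a line search
    def closure():
        opt.zero_grad()
        loss = loss_func(model(xb), yb)
        loss.backward()
        return loss
    return opt.step(closure)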

Issue:
batch step size 1
new step size 1
loss - tensor(2.4591, device='cuda:0', grad_fn=<AddBackward0>)

check for p.grad… <class 'list'>

169 = total params with 2 total groups

If you define the architecture beforehand and just use Learner it doesn’t use splits at all. (Look at the cnn_learner code for evidence of this)

If you can’t find that, tell me @LessW2020 and I can show you the lines tonight :slight_smile: You should also be able to do the same by unfreezing the model just after making a cnn_learner-based model (I think).
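Something roughly like this (untested sketch - assumes fastai v1 with the usual from fastai.vision import * setup, a `data` DataBunch that’s already built, and an `mxresnet50()` constructor standing in for however you actually build the model):

model = mxresnet50()                             # hypothetical constructor for MXResNet50
learn = Learner(data, model, metrics=accuracy)   # plain Learner - no cnn_learner split logic
print(len(learn.layer_groups))                   # should show a single layer group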

Thanks @muellerzr - but I don’t mean the model’s layer groups (i.e. the model itself); I mean the parameter groups within the optimizer.

Example:
MXResNet50 = 1 layer group if I check via
learn.layer_groups

However:

learn.opt = OptimWrapper with two parameter groups
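For concreteness, a check along these lines shows the mismatch (sketch - it assumes the wrapped torch optimizer sits at learn.opt.opt, which is how I read the v1 OptimWrapper):

print(len(learn.layer_groups))             # -> 1 layer group for MXResNet50
learn.create_opt(lr=1e-3)                  # builds the OptimWrapper
print(len(learn.opt.opt.param_groups))     # -> 2 param groups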

Real quick, v1 or v2 of fastai?

Oops, sorry - I’m running fastai 1.0.57. :slight_smile:

All good! Maybe line 106 can provide hints?

That looks like a great place to check. I was just debugging to make sure I’m really seeing 2 param groups, and I am:

batch step size 1
new step size 1
loss - tensor(2.4591, device='cuda:0', grad_fn=<AddBackward0>)

check for p.grad… <class 'list'>

169 = total params with 2 total groups
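(For anyone following along, a rough way to reproduce those counts - a debug sketch, again assuming the underlying torch optimizer is at learn.opt.opt:)

pgs = learn.opt.opt.param_groups
n_params = sum(len(pg['params']) for pg in pgs)
print(n_params, '= total params with', len(pgs), 'total groups')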

Hmm, actually the stats are for its own tracking of momentum/debias, etc.

The root issue is its self.param_groups, as in:

def step(self, closure=None):
    self.update_stats()
    for i,pg in enumerate(self.param_groups):
        for p in pg['params']:
            if p.grad is not None: self.on_step(p, pg, i)

I need to figure out where this param_groups is being set, and then I should be able to override it.
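For reference, param_groups ultimately comes from the list of dicts handed to the underlying torch optimizer’s constructor - one dict per group. Quick standalone check:

import torch
import torch.nn as nn

m = nn.Linear(4, 2)
weights = [p for n, p in m.named_parameters() if n == 'weight']
biases  = [p for n, p in m.named_parameters() if n == 'bias']
# two dicts in -> two param groups out
opt = torch.optim.SGD([{'params': weights}, {'params': biases}], lr=0.1)
print(len(opt.param_groups))   # 2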

Seems to be somewhere in callback.py where it creates the OptimWrapper…maybe that second param group is related to splitting out weight decay?

@classmethod
def create(cls, opt_func:Union[type,Callable], lr:Union[float,Tuple,List], layer_groups:ModuleList, wd:Floats=0., 
           true_wd:bool=False, bn_wd:bool=True)->optim.Optimizer:
    "Create an `optim.Optimizer` from `opt_func` with `lr`. Set lr on `layer_groups`."
    split_params = split_no_wd_params(layer_groups)   # <-- the split in question
    opt = opt_func([{'params': p, 'lr':0} for p in split_params])
    opt = cls(opt, wd=wd, true_wd=true_wd, bn_wd=bn_wd)
    opt.lr,opt.opt_func = listify(lr, layer_groups),opt_func
    return opt

[edit] That’s the splitter causing this issue - split_no_wd_params…

which is in torch_core:

def split_no_wd_params(layer_groups:Collection[nn.Module])->List[List[nn.Parameter]]:
    "Separate the parameters in `layer_groups` between `no_wd_types` and  bias (`bias_types`) from the rest."
    split_params = []
    for l in layer_groups:
        l1,l2 = [],[]
        for c in l.children():
            if isinstance(c, no_wd_types): l2 += list(trainable_params(c))
            elif isinstance(c, bias_types):
                bias = c.bias if hasattr(c, 'bias') else None
                l1 += [p for p in trainable_params(c) if not (p is bias)]
                if bias is not None: l2.append(bias)
            else: l1 += list(trainable_params(c))
        #Since we scan the children separately, we might get duplicates (tied weights). We need to preserve the order
        #for the optimizer load of state_dict
        l1,l2 = uniqueify(l1),uniqueify(l2)
        split_params += [l1, l2]
    return split_params

Thus you will always get 2 param groups even though there is only 1 layer group.
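You can verify that directly with a quick check (assuming a fastai v1 install where split_no_wd_params and flatten_model are importable from fastai.torch_core):

from fastai.torch_core import split_no_wd_params, flatten_model
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
layer_groups = [nn.Sequential(*flatten_model(model))]   # a single layer group
split = split_no_wd_params(layer_groups)
print(len(layer_groups), len(split))   # 1 layer group -> 2 param groups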

Unfortunately that blows up this new SLS optimizer, so let me try to override it, at least to get SLS into testing…

making a new function that will keep things as one param group - in progress:

def filter_all_params_no_split(layer_groups:Collection[nn.Module])->List[List[nn.Parameter]]:
    "Gather every trainable parameter into a single param group (no wd/bias split)."
    pure = []
    buffer = []
    for l in layer_groups:
        for c in l.children():
            buffer += list(trainable_params(c))
    pure += [uniqueify(buffer)]
    return pure
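One way to wire it in is a monkey-patch along these lines (rough sketch, at your own risk - depending on how callback.py imported the original, you may need to patch the name in fastai.callback as well as fastai.torch_core):

import fastai.torch_core
import fastai.callback

# point both likely lookup paths at the no-split version
fastai.torch_core.split_no_wd_params = filter_all_params_no_split
fastai.callback.split_no_wd_params = filter_all_params_no_split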

That solves it - now there’s only 1 param group. On to the other SLS errors, but at least I’m past this issue :slight_smile:

batch step size 1
new step size 1
loss - tensor(2.4583, device='cuda:0', grad_fn=<AddBackward0>)

check for p.grad… <class 'list'>

169 = total params with 1 total groups
