Seems to be somewhere in callback.py where it creates the OptimWrapper… maybe that second param group is related to splitting out weight decay?
```python
@classmethod
def create(cls, opt_func:Union[type,Callable], lr:Union[float,Tuple,List], layer_groups:ModuleList, wd:Floats=0.,
           true_wd:bool=False, bn_wd:bool=True)->optim.Optimizer:
    "Create an `optim.Optimizer` from `opt_func` with `lr`. Set lr on `layer_groups`."
    split_params = split_no_wd_params(layer_groups)   # <-- the split in question
    opt = opt_func([{'params': p, 'lr':0} for p in split_params])
    opt = cls(opt, wd=wd, true_wd=true_wd, bn_wd=bn_wd)
    opt.lr,opt.opt_func = listify(lr, layer_groups),opt_func
    return opt
```
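For reference, here is a minimal plain-PyTorch illustration (my own sketch, not fastai code) of why the param-group count is driven entirely by the length of split_params: each entry becomes one dict handed to the optimizer constructor.

```python
import torch.nn as nn, torch.optim as optim

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
# stand-in for what split_no_wd_params returns: a list of parameter lists
split_params = [list(model[0].parameters()), list(model[1].parameters())]
opt = optim.SGD([{'params': p, 'lr': 0} for p in split_params])
print(len(opt.param_groups))   # 2 -> one param group per entry in split_params
```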
[edit] That's the splitter causing this issue: split_no_wd_params, which is in torch_core:
```python
def split_no_wd_params(layer_groups:Collection[nn.Module])->List[List[nn.Parameter]]:
    "Separate the parameters in `layer_groups` between `no_wd_types` and bias (`bias_types`) from the rest."
    split_params = []
    for l in layer_groups:
        l1,l2 = [],[]
        for c in l.children():
            if isinstance(c, no_wd_types): l2 += list(trainable_params(c))
            elif isinstance(c, bias_types):
                bias = c.bias if hasattr(c, 'bias') else None
                l1 += [p for p in trainable_params(c) if not (p is bias)]
                if bias is not None: l2.append(bias)
            else: l1 += list(trainable_params(c))
        #Since we scan the children separately, we might get duplicates (tied weights). We need to preserve the order
        #for the optimizer load of state_dict
        l1,l2 = uniqueify(l1),uniqueify(l2)
        split_params += [l1, l2]
    return split_params
```
So you will always get two param groups per layer group, even when there is only one layer group.
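A quick sanity check of that (assuming fastai v1, where split_no_wd_params lives in torch_core): a single layer group comes back as two lists, the weight-decayed params and the no-wd params (biases plus batchnorm params).

```python
import torch.nn as nn
from fastai.torch_core import split_no_wd_params

layer_group = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
split = split_no_wd_params([layer_group])
print(len(split))               # 2, even though there is only 1 layer group
print([len(p) for p in split])  # wd list vs no-wd list (biases + batchnorm params)
```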
Unfortunately that blows up this new SLS optimizer, so let me try to override it, at least to get SLS into testing…
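As a first stab at the override, something like the sketch below might do. FlatOptimWrapper is just a hypothetical name, and the rest of OptimWrapper (e.g. the lr/wd setters, which assume paired wd/no-wd groups) would likely still need adjusting before SLS runs end to end, so treat this as a starting point only.

```python
from fastai.callback import OptimWrapper
from fastai.core import listify, uniqueify
from fastai.torch_core import trainable_params

class FlatOptimWrapper(OptimWrapper):
    "Sketch: build the inner optimizer with one flat param group per layer group, skipping the wd split."
    @classmethod
    def create(cls, opt_func, lr, layer_groups, wd=0., true_wd=False, bn_wd=True):
        # one parameter list per layer group instead of the (wd, no-wd) pair
        flat_params = [uniqueify(list(trainable_params(l))) for l in layer_groups]
        opt = opt_func([{'params': p, 'lr': 0} for p in flat_params])
        opt = cls(opt, wd=wd, true_wd=true_wd, bn_wd=bn_wd)
        opt.lr, opt.opt_func = listify(lr, layer_groups), opt_func
        return opt
```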