Modifying parameters while using distributed


So I made a callback to modify some neural network parameters of a model (sample code at the end) while training.

My question is: what happens if I train on multiple GPUs? My neural network got ruined for some reason after a few epochs: training error was okay, but validation error was massive, and I suspect it has something to do with using two GPUs. The problem is that each attempt takes about 8 hours, and the issue only becomes noticeable near the end :frowning:

class MyCallback(LearnerCallback):
    def __init__(self, learn:Learner):
        super().__init__(learn)
        self.important_parameter = learn.model[17].weight # or whatever

    def on_batch_end(self, **kwargs):
        with torch.no_grad():
            # modify the weights in place here, e.g. (placeholder):
            self.important_parameter.clamp_(-1, 1)

So my specific question is: will this callback work?


Do you actually want to update the weights with a callback? First of all, why would you need to do that? Secondly, that’s not what your code above would do. It would only update self.important_parameter, which was first initialized with the weight value.
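The distinction here is that Python assignment rebinds a name, while an in-place operation mutates the object that both names point to. A minimal pure-Python analogue (a list stands in for the tensor; the same aliasing behaviour holds for a torch.Tensor with in-place ops like mul_):

```python
# 'weight' stands in for learn.model[17].weight.
weight = [1.0, 2.0, 3.0]
important_parameter = weight          # alias: two names, one object

# In-place mutation (analogous to tensor.mul_(0.5)) is visible
# through every alias, so the model's weight really changes.
for i in range(len(important_parameter)):
    important_parameter[i] *= 0.5
print(weight)                         # [0.5, 1.0, 1.5]

# Rebinding (analogous to important_parameter = tensor * 0.5)
# only changes what the local name points to; 'weight' is untouched.
important_parameter = [0.0, 0.0, 0.0]
print(weight)                         # still [0.5, 1.0, 1.5]
```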

Yep, I want to update some parameter weights with a callback.

For example, weight decay does that inside fastai, though it actually happens in the optimizer step, not in a callback. I looked at the weight decay code, and to be honest, I don’t understand why it works in a multi-GPU environment:

def step(self)->None:
    "Set weight decay and step optimizer."
    # weight decay outside of optimizer step (AdamW)
    if self.true_wd:
        for lr,wd,pg1,pg2 in zip(self._lr,self._wd,self.opt.param_groups[::2],self.opt.param_groups[1::2]):
            for p in pg1['params']:, 1 - wd*lr)
            if self.bn_wd:
                for p in pg2['params']:, 1 - wd*lr)
        self.set_val('weight_decay', listify(0, self._wd))
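Stripped of the fastai plumbing, that loop just shrinks each parameter in place by a factor of 1 - wd*lr before the gradient step is taken (decoupled, AdamW-style weight decay). A minimal sketch with plain floats, where simple SGD stands in for the optimizer and the names are placeholders:

```python
def decoupled_step(p, g, lr=0.1, wd=0.01):
    """One decoupled-weight-decay update on a scalar parameter:
    decay is applied directly to the weight, outside the gradient
    step (mirroring the line above)."""
    p = p * (1 - wd * lr)   # weight decay, applied to the weight itself
    p = p - lr * g          # ordinary optimizer step on the gradient
    return p

# decay first: 2.0 * (1 - 0.001) = 1.998, then step: 1.998 - 0.05 ≈ 1.948
new_p = decoupled_step(2.0, 0.5)
```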

Now, I’m not doing weight decay, but something kind of like that: making sure the weights satisfy certain constraints I need (don’t worry about why).

I’ve tested the above code on a single GPU and it works and does what I think it does. But with multiple GPUs, since I can’t use Jupyter, it’s harder to see whether the models are getting out of sync. I think the modification might be getting applied to only one of the two models. But then again, the code above (from fastai) would suffer from the same problem… wouldn’t it?

I also tried like this:

class MyCallback(LearnerCallback):
    def __init__(self, learn:Learner):
        super().__init__(learn)

    def on_batch_end(self, **kwargs):
        important_parameter = self.learn.model.module[17].weight # or whatever
        with torch.no_grad():
            # modify the weights in place here, e.g. (placeholder):
            important_parameter.clamp_(-1, 1)

But I’m seeing the same problem :frowning:

Thank you for your reply.

Notice that in this new version I had to do learn.model.module[17], because nn.DataParallel apparently wraps the model and exposes it under a “module” attribute… but I’m somewhat confused.
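That extra level of indirection can be pictured with a tiny wrapper class (the class names here are just placeholders for illustration; the real nn.DataParallel also replicates the model and scatters batches, which this sketch ignores):

```python
class Inner:
    """Stands in for the original model (e.g. an nn.Sequential)."""
    def __init__(self):
        self.weight = "the parameter"

class ParallelWrapper:
    """Stands in for nn.DataParallel: it stores the wrapped model
    under the attribute `module`, so every lookup gains one level."""
    def __init__(self, module):
        self.module = module

model = ParallelWrapper(Inner())
# Before wrapping you would write model.weight; after wrapping,
# the original object lives one level down:
print(model.module.weight)   # prints "the parameter"
```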

Sorry, I found my error: it had nothing to do with distributed training, and everything to do with the fact that I was using fp16.
