Thank you so much, Kornel — I greatly appreciate the advice and assistance with this. Is this what you mean?
def your_custom_loss(out, label):
raw_logits, concat_logits, part_logits, _, top_n_prob = out
creterion = torch.nn.CrossEntropyLoss()
part_loss = list_loss(part_logits.view(4 * 6, -1), label.unsqueeze(1).repeat(1, 6).view(-1)).view(4, 6)
raw_loss = creterion(raw_logits, label)
concat_loss = creterion(concat_logits, label)
rank_loss = ranking_loss(top_n_prob, part_loss)
partcls_loss = creterion(part_logits.view(4 * 6, -1),
label.unsqueeze(1).repeat(1, 6).view(-1))
total_loss = rank_loss + raw_loss + concat_loss + partcls_loss
total_loss = torch.FloatTensor(total_loss)
total_loss.type() == torch.FloatTensor
total_loss.size() == torch.Size([1])
return total_loss.squeeze(0)
Edit:
It seems there are two things going on in the model’s training:
raw_optimizer.zero_grad()
part_optimizer.zero_grad()
concat_optimizer.zero_grad()
partcls_optimizer.zero_grad()
raw_logits, concat_logits, part_logits, _, top_n_prob = net(img)
part_loss = model.list_loss(part_logits.view(batch_size * PROPOSAL_NUM, -1),
label.unsqueeze(1).repeat(1, PROPOSAL_NUM).view(-1)).view(batch_size, PROPOSAL_NUM)
raw_loss = creterion(raw_logits, label)
concat_loss = creterion(concat_logits, label)
rank_loss = model.ranking_loss(top_n_prob, part_loss)
partcls_loss = creterion(part_logits.view(batch_size * PROPOSAL_NUM, -1),
label.unsqueeze(1).repeat(1, PROPOSAL_NUM).view(-1))
total_loss = raw_loss + rank_loss + concat_loss + partcls_loss
total_loss.backward()
raw_optimizer.step()
part_optimizer.step()
concat_optimizer.step()
partcls_optimizer.step()
First there is this. Then:
for i, data in enumerate(trainloader):
with torch.no_grad():
img, label = data[0].cuda(), data[1].cuda()
batch_size = img.size(0)
_, concat_logits, _, _, _ = net(img)
# calculate loss
concat_loss = creterion(concat_logits, label)
# calculate accuracy
_, concat_predict = torch.max(concat_logits, 1)
total += batch_size
train_correct += torch.sum(concat_predict.data == label.data)
train_loss += concat_loss.item() * batch_size
progress_bar(i, len(trainloader), 'eval train set')
train_acc = float(train_correct) / total
train_loss = train_loss / total
The first is the partial unsupervised learning this model has. The second is the actual accuracy.
Edit 2: the first snippet is what the loss_fn should be; the second is the metric.