Training loop, optimizer, scheduler API

We’re about to look at creating a better API for the training loop, hyperparam scheduler, optimizer, etc. I’ll make this a wiki so folks can add missing pieces as they find them. Here are some things that need to be supported:

  • Everything from the training phase API
    • For each phase, change data, batch size, optimizer
    • Schedule any hyperparam (lr, momentum, wd, beta2, eps) according to any function, including handling momentum (which corresponds to beta[0] in Adam) or beta[1] (which is alpha in RMSprop); see the sketch just after this list
    • AdamW-style weight decay, as well as regular wd
  • Discriminative (per-layer) wd and lr, including different params for weights vs bias vs batchnorm
  • Call reset at appropriate times for RNNs
  • Full set of callbacks
    • Try to use callbacks for as many features as possible, or find some other way to easily allow them to be customized
  • All the bits necessary for half precision training
    • Maintain single precision copy of weights
    • batchnorm in single precision (is this automated by PyTorch now?)
    • loss scaling
  • Moving average for metrics
  • Regularization added to the loss for the backprop (like seq2seq_reg in the RNNs)
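
For instance, scheduling an arbitrary hyperparam in plain PyTorch just means writing into the optimizer’s param groups each step. A minimal sketch (set_hyper and the cosine shape are just illustrations, not a settled API):

```
import math

def set_hyper(opt, name, value):
    # Write a hyperparam into every param group of a PyTorch optimizer.
    for pg in opt.param_groups:
        pg[name] = value

def cosine(start, end, pct):
    # Anneal from `start` to `end` along a half cosine as `pct` goes 0 -> 1.
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

# At step t of n_steps inside the training loop:
#   set_hyper(opt, 'lr', cosine(1e-2, 1e-4, t / n_steps))
#   set_hyper(opt, 'weight_decay', cosine(1e-2, 1e-3, t / n_steps))
# For Adam, momentum lives in the first element of pg['betas']:
#   set_hyper(opt, 'betas', (cosine(0.95, 0.85, t / n_steps), 0.99))
```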

Some ideas are embedded in this early project from @mcskinner. In the fastai_v1 repo there’s an “extending the training loop” section in to_refactor.ipynb, with some working code that isn’t complete and needs refactoring.

Questions/comments/etc welcome!


Qualitative Epoch Metadata

One thing I’ve noticed is that TensorFlow will give a bit of qualitative info about each epoch, e.g. “No improvement in validation loss”, and I think it has a feature to automatically stop training when it starts overfitting.

Would be nice to have, especially for people just getting started.
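
A minimal sketch of what that could look like as a callback (the hook name and its return-value convention are assumptions here, not an existing API):

```
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.wait = float('inf'), 0

    def on_epoch_end(self, val_loss):
        # Return True to tell the training loop to stop.
        if val_loss < self.best - self.min_delta:
            self.best, self.wait = val_loss, 0
            return False
        self.wait += 1
        print('No improvement in validation loss.')
        return self.wait >= self.patience
```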

Docs / Schedule Configuration

On the documentation side, it would help to have a default configuration for the different learning rate schedulers, plus a docstring about what tweaks you can make and when it makes sense to make them.

Currently, I think most of the optimizers that have multiple params are configured by tuple. In fastai_v1, what do you think about having them be configured either by a dict or by a configuration class?

Old:

```
learn.fit(..., use_clr_beta=(100, 1, 0.9, 0.8))
```

Proposed:

```
learn.fit(sched=ClrBeta(div=100, pct=1, momentums=[0.9, 0.8]))
```

The parameter names are optional, of course, and without them the code is about the same length. Conveniently, this class gives a nice place to put docstrings about an optimizer.

The only wrinkle is that in the current code, the optimizers need a couple of other parameters, namely a layer_opt, the number of batches, and an on_cycle_end callback. I’d handle that by putting a to_sched method on the user-facing classes to turn a schedule configuration into a runnable scheduler:

```
def to_sched(self, layer_opt, nb, on_cycle_end):
    return ClrBetaScheduler(self, layer_opt, nb, on_cycle_end)
```
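
Fleshed out a bit, the user-facing class might look like this (a sketch only; the parameter meanings and defaults are illustrative, and ClrBetaScheduler is the hypothetical runnable scheduler from above):

```
class ClrBeta:
    """One-cycle CLR schedule configuration.

    div:       ratio between the peak lr and the starting lr
    pct:       percentage of the cycle spent annealing at the end
    momentums: (max, min) momentum, cycled opposite to the lr
    """
    def __init__(self, div=100, pct=10, momentums=(0.95, 0.85)):
        self.div, self.pct, self.momentums = div, pct, momentums

    def to_sched(self, layer_opt, nb, on_cycle_end):
        return ClrBetaScheduler(self, layer_opt, nb, on_cycle_end)
```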

Those are easily implemented in the existing callback structure AFAICT, which I assume we’ll be keeping as-is (unless someone comes up with something better).

The training phase API doesn’t do it this way - the older tuple-based approach was just a temporary hack. @mcskinner’s code also has some further development down this path.


@rcoh I would be curious to hear your opinions on my attempt at a generalized scheduler and its corresponding usage in a fitting loop setup.

Feel free to submit issues / PRs / etc to keep me moving. Same goes to you @sgugger if something catches your eye for integration or porting.

@jeremy actually I have a hunch that the full callback API is not strictly necessary. The update-on-event strategy is effective but a bit reactive. When possible, I prefer to be proactive and define the schedule up-front as data.

That was easy enough to do for the hyperparameters, though it doesn’t yet cover all of the current fastai callbacks. The other two use cases I noticed were telemetry (tracking loss or other metrics) and early stopping based on loss (e.g. to abort learning rate finding).

Those are both very easily handled with on_{batch,epoch}_end, but I’m probably going to explore that area a bit and see if I find any alternatives that I like. In particular, making the stats telemetry a first-class primitive seems likely to have a lot of benefits, e.g. with remote stats collectors (I think Paperspace offers something).
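
A rough sketch of what I mean by defining the schedule up-front as data (illustrative only, not the actual repo code): the whole schedule is a plain data structure that the loop merely reads.

```
import bisect

class Schedule:
    """A hyperparam schedule defined up-front as (pct_of_training, value) knots."""
    def __init__(self, knots):
        self.pcts = [p for p, _ in knots]
        self.vals = [v for _, v in knots]

    def __call__(self, pct):
        # Linearly interpolate between the two knots surrounding `pct`.
        i = min(bisect.bisect_right(self.pcts, pct), len(self.pcts) - 1)
        p0, p1 = self.pcts[i - 1], self.pcts[i]
        v0, v1 = self.vals[i - 1], self.vals[i]
        return v0 + (v1 - v0) * (pct - p0) / (p1 - p0)

# A triangular (CLR-style) lr schedule, fully known before training starts:
lr_sched = Schedule([(0.0, 1e-4), (0.5, 1e-2), (1.0, 1e-4)])
```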

@sgugger just pushed something inspired by your code, FYI. :)

Will be interested to see what you come up with - although I don’t think I’d want to get rid of the callbacks, since they allow users to customize things that we didn’t necessarily even think of! :)


Like Jeremy said, I used your idea of ProgrammableOptimizer to handle hyper-parameter changes easily; it’s a good one.
For the callbacks, their use isn’t limited to hyper-parameter settings/telemetry/early stopping: the goal is to remove everything from the training loop (except the calls to the callbacks) and avoid the way it became so crowded in the current fastai library. For beginners in particular, it’s going to be easier to read the code that way.

Also, one callback can then be assigned to a specific task: telemetry, doing true weight decay, dealing with fp16 training, taking care of the LR schedule… Again, it’s easier for someone to delve into the code since all the parts relevant to a specific task are in one place and you don’t have to track different pieces across different modules.
Finally, like Jeremy said, it’ll allow anyone to implement something we didn’t think of or that hasn’t been invented yet. The way I see it, callbacks are going to be much more flexible than in current fastai, and much more used for all the functionality we add on top of the basic training.
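
To make that concrete, here’s roughly the shape of training loop this leads to (a sketch only; the hook names and the CallbackHandler-style cb object are illustrative, not the final fastai_v1 API):

```
def fit(epochs, model, loss_fn, opt, train_dl, cb):
    # `cb` dispatches each event to every registered callback;
    # the loop itself is nothing but the core steps plus the hooks.
    cb.on_train_begin()
    for epoch in range(epochs):
        cb.on_epoch_begin(epoch)
        for xb, yb in train_dl:
            loss = loss_fn(model(xb), yb)
            loss.backward()
            cb.on_backward_end()               # e.g. true weight decay, fp16 loss scaling
            opt.step()
            opt.zero_grad()
            if cb.on_batch_end(loss): return   # e.g. lr finder aborting early
        if cb.on_epoch_end(): return           # e.g. early stopping, telemetry
    cb.on_train_end()
```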


I’m glad to hear you found something useful!

And I’m not at all opposed to that philosophy regarding callbacks as part of an open-closed approach to the fitting loop :)

I’m gonna continue exploring in the minimalist direction anyhow; maybe it’ll turn up something interesting for the common cases.

Discriminative (per-layer) wd and lr, including different params for weights vs bias vs batchnorm

Discriminative weight decay sounds really interesting. Have you seen results showing this to be effective yet? Seems to make total sense given how effective discriminative learning rates are.

The ‘imagenet in 4 mins’ paper found it was critical to remove wd from bias and batchnorm: http://arxiv.org/abs/1807.11205. Other than that, I don’t think I’ve seen per-layer wd changes. I’m pretty sure they’ll turn out to be important, but I don’t think we’ve gotten anything to work yet - is that right @sgugger?
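
In plain PyTorch, removing wd from bias and batchnorm amounts to putting those parameters in a param group with weight_decay=0. A sketch (model stands for whatever network you’re training, and the lr/wd values are placeholders):

```
import torch
import torch.nn as nn

def split_decay_params(model):
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    decay, no_decay = [], []
    for module in model.modules():
        for name, p in module.named_parameters(recurse=False):
            # Biases and all batchnorm params go in the no-decay group.
            if isinstance(module, bn_types) or name == 'bias':
                no_decay.append(p)
            else:
                decay.append(p)
    return decay, no_decay

decay, no_decay = split_decay_params(model)
opt = torch.optim.SGD(
    [{'params': decay,    'weight_decay': 1e-4},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=0.1, momentum=0.9)
```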


I can’t say I have experimented a lot with those.
You also have to remember that the regularization already varies per layer as soon as you use discriminative learning rates: weights become weights - lr * wd * weights (or a similar update that still involves lr if you do L2 regularization instead of weight decay), so I’m not sure adding discriminative wds on top is really going to have an impact.
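
Spelled out, with a per-layer learning rate the decay step for layer $i$ is

$$w_i \leftarrow w_i - \mathrm{lr}_i \cdot \mathrm{wd} \cdot w_i$$

so the effective decay rate $\mathrm{lr}_i \cdot \mathrm{wd}$ already differs per layer even with a single global wd.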


Given the conversation here: Changing Criterion During Training Provides Good Results, and some personal experiments along these same lines that seemed promising in language modelling, I wonder if you might want to consider adding loss as something that you could schedule. I think it’s a very interesting and almost entirely unexplored area in deep learning, and varying from one loss to another (and possibly back) would make for some interesting experimentation.

The easiest way would be to allow two losses with an lr-like schedule that lets you switch between the two, w*loss_1 + (1-w)*loss_2, but you may want to make it even more expressive.

Just a thought, and I realize you have a lot to consider when building this, so something this experimental and unlikely to be widely used may not be a priority. But I thought I’d bring it up.


…I wonder if you might want to consider adding loss as something that you could schedule.

This sounds like a great idea.

The easiest way would be to allow two losses with an lr-like schedule that lets you switch between the two, w*loss_1 + (1-w)*loss_2, but you may want to make it even more expressive.

I’d think it’d be just two pieces of information you need to create a schedule: the loss function and the criterion for when it kicks in. Both could be passed as arbitrary functions paired together. One would be the loss function itself, of course. The other would be the function that determines when you switch to it. It could be as simple as a certain epoch number, but it could also be based on a loss threshold, for example. Perhaps this is better defined as an abstract class (interface), but I’m not sure if that fits with the design spirit of fast.ai. The only thing you’d have to be careful about when using this is that the criteria for when the loss functions kick in don’t overlap/conflict. I think the basic logic would be to just advance to the next function in the list as soon as its “threshold function” returns true.

It -seems- fairly easy to do… Not sure if I’m missing something obvious.
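
A quick sketch of that basic logic (all the names here are made up; `state` is whatever the threshold functions need, e.g. the epoch number or last validation loss):

```
class ScheduledLoss:
    """Walk through (loss_fn, advance_fn) stages, moving to the next loss
    as soon as the current stage's advance_fn returns True."""
    def __init__(self, stages):
        self.stages, self.i = stages, 0

    def step(self, state):
        # Call once per epoch (or batch) with the current training state.
        if self.i + 1 < len(self.stages) and self.stages[self.i][1](state):
            self.i += 1

    def __call__(self, preds, target):
        return self.stages[self.i][0](preds, target)

# e.g. switch to a second loss once validation loss drops below 0.5:
# sched = ScheduledLoss([(loss_a, lambda s: s['val_loss'] < 0.5),
#                        (loss_b, lambda s: False)])
```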

I don’t think we want to limit ourselves to binary (on/off) losses or thresholds, as you may want to do a 50%/50% weighted loss; but if you did want binary switching, you could do it using step functions. You’d just have to be sure that for every part of the schedule there was at least some loss.

Now that I think about it, it might make more sense to pass in a list of functions and a corresponding list of schedules so that we aren’t limited in expressiveness. There have been a number of times when I’ve used a combined loss function.

Initially I was thinking @Sylvain’s learning rate mechanism made sense, as it allows for stepwise, sawtooth, cyclic functions, etc. Also, counter to my initial thoughts, for full expressiveness I wouldn’t even want to limit the values to 0-1. You might have loss functions that you want to weight relative to each other, and you also might have loss functions that you want to apply negatively rather than positively. A list of that type would allow a lower-level function to combine the losses by multiplying each by its weight schedule.


I don’t think we want to limit ourselves to binary (on/off) losses or thresholds, as you may want to do a 50%/50% weighted loss; but if you did want binary switching, you could do it using step functions. You’d just have to be sure that for every part of the schedule there was at least some loss.

Ahh…like a “loss attention” sort of thing…that sounds like an interesting idea!

Initially I was thinking @Sylvain’s learning rate mechanism made sense, as it allows for stepwise, sawtooth, cyclic functions, etc. Also, counter to my initial thoughts, for full expressiveness I wouldn’t even want to limit the values to 0-1. You might have loss functions that you want to weight relative to each other, and you also might have loss functions that you want to apply negatively rather than positively. A list of that type would allow a lower-level function to combine the losses by multiplying each by its weight schedule.

Here’s the thing though: with the amount of flexibility you’re describing, it sounds like we’re just back to what’s already going on in the current library: pass in a singular loss function that does whatever it wants (it could take care of all this for you; it’s code, after all). Anything closer to a configuration-based approach makes assumptions that may prove to be doing more harm than good later (I like to call it a “configuration straitjacket”). If the ideal is simply a huge amount of flexibility, you can’t get more flexible than a single arbitrary function.


My understanding is that the function part isn’t hard; it’s the scheduling side. You could pass the shape functions you want as a parameter to the loss function, but it still has to know where you are in the schedule. Something that would make things more flexible would be a general way to find out where you are in the schedule, what the lr is, what the last epoch’s loss is, etc., but I’m guessing that would be costly, especially if called during a loss function.


Check out the 004 notebook that @sgugger just pushed - it should allow you to play with custom loss function schedules (although you’ll need to write your own callback).


You don’t even need the added flexibility of the callbacks in fastai_v1 for this. You can already experiment in fastai with a custom loss that looks like this:

```
import torch.nn as nn

class VariableLoss(nn.Module):
    def __init__(self, loss_fn1, loss_fn2, w):
        super().__init__()  # required before assigning attributes on an nn.Module
        self.loss_fn1, self.loss_fn2, self.w = loss_fn1, loss_fn2, w

    def forward(self, x, y):  # nn.Module routes __call__ to forward
        return self.w * self.loss_fn1(x, y) + (1 - self.w) * self.loss_fn2(x, y)
```

Then in a callback, you can define your schedule for w (a binary switch or whatever). Of course, this will also work in fastai_v1.
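
For instance, a callback along these lines could anneal w linearly from 1 to 0 over training (a sketch; the on_batch_end hook mirrors fastai’s callback style, but the exact signature here is an assumption):

```
class AnnealLossWeight:
    """Linearly anneal VariableLoss.w from 1 to 0 over `n_batches` steps."""
    def __init__(self, var_loss, n_batches):
        self.var_loss, self.n_batches, self.t = var_loss, n_batches, 0

    def on_batch_end(self, *args):
        self.t += 1
        self.var_loss.w = max(0.0, 1.0 - self.t / self.n_batches)
```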


Interesting. I’ll have to check it out. I need to get more familiar with callbacks. I wasn’t aware they were so general! I’ll have to take another look at your original post on them.

@jeremy Have you used the AllenNLP library? https://allennlp.org/
If not, a look at how it modularizes components by using dependency injection would be very helpful.