Training loop, optimizer, scheduler API

@jeremy actually I have a hunch that the full callback API is not strictly necessary. The update-on-event strategy is effective but a bit reactive. When possible, I prefer to be proactive and define the schedule up-front as data.

That was easy enough to do for the hyperparameters, though it doesn’t yet cover everything the current fastai callbacks do. The other two use cases I noticed were telemetry (tracking loss or other metrics) and early stopping based on loss (e.g. to abort learning rate finding).

Those are both very easily handled with on_{batch,epoch}_end, but I’m probably going to explore that area a bit and see if I find any alternatives that I like. In particular, making the stats telemetry a first-class primitive seems likely to have a lot of benefits e.g. with remote stats collectors (I think Paperspace offers something).
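
For concreteness, here is a toy sketch of those two cases (invented names, not the fastai API): the loop fires `on_batch_end` after each batch and stops when any callback returns `True`.

```python
class Recorder:
    """Telemetry: just accumulate the losses it is shown."""
    def __init__(self):
        self.losses = []
    def on_batch_end(self, loss):
        self.losses.append(loss)

class LossDivergence:
    """Early stopping: ask the loop to halt once the loss explodes,
    e.g. to abort an LR-finder run."""
    def __init__(self, factor=4.0):
        self.factor, self.best = factor, float('inf')
    def on_batch_end(self, loss):
        self.best = min(self.best, loss)
        return loss > self.factor * self.best  # True -> stop training

def train(batch_losses, callbacks):
    """Stand-in training loop: iterate 'batches', fire callbacks."""
    for loss in batch_losses:
        if any(cb.on_batch_end(loss) for cb in callbacks):
            break
```

The real question is whether these deserve dedicated primitives rather than generic hooks; the sketch just shows how little machinery the hook version needs.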

@sgugger just pushed something inspired by your code, FYI. :slight_smile:

Will be interested to see what you come up with - although I don’t think I’d want to get rid of the callbacks, since they allow users to customize things that we didn’t necessarily even think of! :slight_smile:

2 Likes

Like Jeremy said, I used your idea of ProgrammableOptimizer to handle hyper-parameter changes easily; it’s a good one.
For the callbacks, the use isn’t limited to hyper-parameter settings/telemetry/early stopping: the goal is to remove everything from the training loop (except the calls to the callbacks) and avoid the crowding that happened in the current fastai library. For beginners in particular, it’s going to be easier to read the code that way.

Also, one callback can then be assigned to a specific task: telemetry, doing true weight decay, dealing with fp16 training, taking care of the LR schedule… Again, it’s easier for someone to delve into the code since all the parts relevant to a specific task are in one place and you don’t have to track different pieces in different modules.
Finally, like Jeremy said, it’ll allow anyone to implement something we didn’t think of or that hasn’t been invented yet. The way I see it, callbacks are going to be much more flexible than in current fastai, and much more used for all the functionality we add on top of the basic training.

5 Likes

I’m glad to hear you found something useful!

And I’m not at all opposed to that philosophy regarding callbacks as part of an open-closed approach to the fitting loop :slight_smile:

I’m gonna continue exploring in the minimalist direction anyhow, maybe it’ll turn up something interesting for the common cases.

Discriminative (per-layer) wd and lr, including different params for weights vs bias vs batchnorm

Discriminative weight decay sounds really interesting. Have you seen results showing this to be effective yet? Seems to make total sense given how effective discriminative learning rates are.

The ‘imagenet in 4 mins’ paper found it was critical to remove wd from bias and batchnorm: http://arxiv.org/abs/1807.11205 . Other than that, I don’t think I’ve seen per-layer wd changes. I’m pretty sure they’ll turn out to be important, but I don’t think we’ve gotten anything to work yet - is that right @sgugger?
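
The split the paper describes can be done before the parameters ever reach the optimizer. A rough sketch (the name-matching heuristic and `split_decay_groups` are mine, not fastai’s; PyTorch optimizers do accept a per-group `weight_decay` like this):

```python
def split_decay_groups(named_params, wd=1e-2):
    """Split (name, param) pairs into two optimizer param groups:
    ordinary weights get weight decay, biases and batchnorm
    parameters get none. Matching on names is a heuristic; real
    code would check the owning module's type instead."""
    decay, no_decay = [], []
    for name, p in named_params:
        module = name.rsplit('.', 1)[0]
        if name.endswith('.bias') or 'bn' in module:
            no_decay.append(p)
        else:
            decay.append(p)
    return [{'params': decay, 'weight_decay': wd},
            {'params': no_decay, 'weight_decay': 0.0}]
```

The resulting list can be passed straight to an optimizer constructor, e.g. `torch.optim.SGD(split_decay_groups(model.named_parameters()), lr=0.1)`.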

1 Like

I can’t say I have experimented a lot with those.
You also have to remember that some per-layer regularization already happens as soon as you use discriminative learning rates: weights become weights - lr * wd * weights (or a similar update that still involves lr if you do L2 regularization instead of weight decay), so I’m not sure adding discriminative wds on top is really going to have an impact.
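
To make that point concrete, here is the decoupled decay update in isolation: with a shared wd, different per-layer lrs already give different effective shrinkage per layer.

```python
def decay_step(w, lr, wd):
    """One decoupled weight-decay update: w <- w - lr * wd * w."""
    return w - lr * wd * w

# Same wd, discriminative lrs: early layers get a small lr, later
# layers a large one, so their weights shrink at different rates.
wd = 0.1
w_early = decay_step(1.0, 1e-3, wd)  # shrinks by factor 1 - 1e-4
w_late  = decay_step(1.0, 1e-2, wd)  # shrinks by factor 1 - 1e-3
```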

1 Like

Given the conversation here: Changing Criterion During Training Provides Good Results
and some personal experiments that I’ve done along these same lines that seemed promising in language modelling, I wonder if you might want to consider adding loss as something that you could schedule. I think it’s a very interesting and almost entirely unexplored area in deep learning, and varying from one loss to another (and possibly back) would make for some interesting experimentation.

The easiest way would be to allow two losses with an lr-like schedule that lets you blend between the two as w*loss_1 + (1-w)*loss_2, but you may want to make it even more expressive.

Just a thought, and I realize you have a lot to consider when building this so something so experimental and unlikely to be widely used may not be a priority. But I thought I’d bring it up.

5 Likes

…I wonder if you might want to consider adding loss as something that you could schedule.

This sounds like a great idea.

The easiest way would be to allow two losses with an lr-like schedule that lets you blend between the two as w*loss_1 + (1-w)*loss_2, but you may want to make it even more expressive.

I’d think it’d be just two pieces of information you need to create a schedule- loss function and the criteria for when it kicks in. Both could be passed as arbitrary functions paired together. One would be the loss function itself, of course. The other would be the function that determines when you switch to it. It could be as simple as a certain epoch number but it could also be based on a loss threshold, for example. Perhaps this is better defined as an abstract class (interface) but I’m not sure if that fits with the design spirit of fast.ai. The only thing you’d have to be careful about when using this is that the criteria for when the loss functions kick in don’t overlap/conflict. I think the basic logic would be to just advance to the next function in the list as soon as its “threshold function” returns true.
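
One possible sketch of that (loss function, trigger) pairing; the class and its interface are hypothetical, not an existing fastai API. Following the logic above, the schedule advances to the next loss as soon as that loss’s trigger returns true, and never goes back.

```python
class LossSchedule:
    """Hold (trigger, loss_fn) stages; trigger(state) -> bool decides
    when that stage's loss kicks in (epoch count, loss threshold, ...)."""
    def __init__(self, stages):
        self.stages, self.idx = stages, 0
    def current(self, state):
        # advance while the *next* stage's trigger fires
        while (self.idx + 1 < len(self.stages)
               and self.stages[self.idx + 1][0](state)):
            self.idx += 1
        return self.stages[self.idx][1]
```

Usage might look like `LossSchedule([(lambda s: True, mse), (lambda s: s['epoch'] >= 3, l1)])`, with the training loop calling `current(state)` each batch to pick the active loss.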

It -seems- fairly easy to do… Not sure if I’m missing something obvious.

I don’t think we want to limit ourselves to binary (on/off) losses or thresholds, since you may want a 50%/50% weighted loss; but if you did want binary behaviour, you could get it using step functions. You’d just have to be sure that for every part of the schedule there was at least some loss.

Now that I think about it, it might make more sense to pass in a list of functions and a corresponding list of schedules so that we aren’t limited in expressiveness. There have been a number of times when I’ve used a combined loss function.

Initially I was thinking @Sylvain’s learning rate mechanism made sense, as it allows for stepwise, sawtooth, cyclic functions, etc. Also, counter to my initial thoughts, for full expressiveness I wouldn’t even want to limit the values to 0-1. You might have loss functions that you want to weight relative to each other, and you also might have loss functions that you want to apply negatively rather than positively. A list of that type would allow a lower-level function to combine the losses by multiplying each by its weight schedule.
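
That combination step could be as small as this (a sketch; the function name and the idea of passing training progress t in [0, 1] are my assumptions):

```python
def combined_loss(losses, schedules, t, pred, target):
    """Weighted sum of losses; each schedule maps training progress
    t in [0, 1] to a weight, which may be negative or exceed 1."""
    return sum(sched(t) * loss(pred, target)
               for loss, sched in zip(losses, schedules))

# Toy example: linearly crossfade from an L1-style to an L2-style loss.
l1 = lambda p, y: abs(p - y)
l2 = lambda p, y: (p - y) ** 2
schedules = [lambda t: 1 - t, lambda t: t]
```

At t=0 only the first loss contributes, at t=1 only the second, and any shape of schedule (step, sawtooth, cyclic) slots in the same way.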

3 Likes

I don’t think we want to limit ourselves to binary (on/off) losses or thresholds, since you may want a 50%/50% weighted loss; but if you did want binary behaviour, you could get it using step functions. You’d just have to be sure that for every part of the schedule there was at least some loss.

Ahh…like a “loss attention” sort of thing…that sounds like an interesting idea!

Initially I was thinking @Sylvain’s learning rate mechanism made sense, as it allows for stepwise, sawtooth, cyclic functions, etc. Also, counter to my initial thoughts, for full expressiveness I wouldn’t even want to limit the values to 0-1. You might have loss functions that you want to weight relative to each other, and you also might have loss functions that you want to apply negatively rather than positively. A list of that type would allow a lower-level function to combine the losses by multiplying each by its weight schedule.

Here’s the thing though: with the amount of flexibility you’re describing, it sounds like we’re just back to what’s already going on in the current library: pass in a single loss function that does whatever it wants (it could take care of all this for you; it’s code, after all). Anything closer to a configuration-based approach makes assumptions that may prove to be doing more harm than good later (I like to call it a “configuration straitjacket”). If the ideal is simply a huge amount of flexibility, you can’t get more flexible than a single arbitrary function.

2 Likes

My understanding of it is that the function part isn’t hard, it’s the scheduling side. You could pass the shape functions you want as a parameter to the loss function, but it still has to know where you are in the schedule. Something that would make things more flexible would be a general way to call and find out where you are in the schedule, what the lr is, what the last epoch’s loss is etc, but I’m guessing that would be costly, especially if called during a loss function.

1 Like

Check out the 004 notebook that @sgugger just pushed - it should allow you to play with custom loss function schedules (although you’ll need to write your own callback).

3 Likes

You don’t even need the added flexibility of the callbacks in fastai_v1 for this. You can already experiment in fastai with a custom loss that looks like this:

class VariableLoss(nn.Module):
    def __init__(self, loss_fn1, loss_fn2, w):
        super().__init__()
        self.loss_fn1,self.loss_fn2,self.w = loss_fn1,loss_fn2,w

    def forward(self, x, y):
        return self.w * self.loss_fn1(x,y) + (1-self.w) * self.loss_fn2(x,y)

Then in a callback, you can define your schedule for the w (binary switch or whatever). Of course, this will also work in fastai_v1.
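
The callback side could be as simple as this sketch (the class name and the `on_epoch_begin` hook are illustrative, not fastai’s API): linearly fade w from 1 to 0 over n_epochs, so training starts on the first loss and ends on the second.

```python
class LossFadeCallback:
    """Drive the w of a VariableLoss-style object linearly from
    1 (pure loss_fn1) down to 0 (pure loss_fn2) over n_epochs."""
    def __init__(self, variable_loss, n_epochs):
        self.vl, self.n_epochs = variable_loss, n_epochs
    def on_epoch_begin(self, epoch):
        self.vl.w = max(0.0, 1.0 - epoch / self.n_epochs)
```

A step schedule, a cyclic one, or anything driven by the recorded loss would just be a different body for `on_epoch_begin`.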

5 Likes

Interesting. I’ll have to check it out. I need to get more familiar with callbacks. I wasn’t aware they were so general! I’ll have to take another look at your original post on them.

@jeremy Have you used the AllenNLP library? https://allennlp.org/
If not, a look at how it modularizes components using dependency injection would be very helpful.

Yes I just visited AI2 last week! :slight_smile:

Jeremy, it was good to host you and Sebastian last Friday at AI2. Let me know if I can help with anything related to AI2.

@sebastianruder and I enjoyed our visit! :slight_smile: If you have any ideas for cool stuff from AllenNLP (or elsewhere) that might be a good fit for fastai_v1, please do let us know.

I have been trying to get a handle on how to smoothly transition between losses.

In one of the lessons @jeremy talked about curriculum learning, where the network is trained on increasingly harder images; for example, the harder images would have classes that look very similar. But my experience with this form of curriculum learning hasn’t been great. I started with a bunch of very disparate classes. Every epoch or two, I would introduce a new class similar to one already trained on, while keeping the learning rate the same. After all the classes were added, I started annealing. The accuracy was poorer than with the standard approach of training on all classes from the start with CLR. Maybe I need to anneal while introducing new classes.

My idea was to refine this further by changing the loss function as well: instead of CE, use EMD (Earth Mover’s Distance). With an EMD loss, signals from similar classes are not lost, as they are with CE, which treats all classes as orthogonal. But this hasn’t panned out well either; maybe my assigned distances between classes didn’t match, but the EMD loss wasn’t converging.
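
For reference, one common squared variant of EMD over ordered, equally spaced classes compares the two cumulative distributions; this is a plain-Python sketch (a real version would work on batched tensors), showing why a near-miss costs less than a distant one.

```python
def emd_loss(pred_probs, target_probs):
    """Squared EMD between two distributions over ordered classes:
    the sum of squared differences of their running CDFs."""
    cdf_p = cdf_t = 0.0
    total = 0.0
    for p, t in zip(pred_probs, target_probs):
        cdf_p += p
        cdf_t += t
        total += (cdf_p - cdf_t) ** 2
    return total
```

Unlike CE, predicting a class adjacent to the target gives a strictly smaller loss than predicting a distant class.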

I have a gut feeling that slowly transitioning from CE to EMD while also exposing the network to more difficult classes should work, but I’m not sure.

@jeremy, any pointers on this?