Given the conversation here: Changing Criterion During Training Provides Good Results
and some personal experiments I've done along the same lines that seemed promising in language modelling, I wonder if you might want to consider adding the loss function as something you could schedule. I think it's a very interesting and almost entirely unexplored area in deep learning, and varying from one loss to another (and possibly back) would make for some interesting experimentation.
The easiest way would be to allow two losses with an LR-like schedule that lets you blend between the two, w*loss_1 + (1-w)*loss_2, but you may want to make it even more expressive. A minimal sketch of what I mean is below.
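Something like this rough PyTorch sketch, just to make the idea concrete (the `ScheduledLoss` class, the `progress` attribute, and the linear schedule are all illustrative assumptions on my part, not an existing API):

```python
import torch.nn as nn

class ScheduledLoss(nn.Module):
    """Blend two criteria with a weight w that follows a schedule.

    loss_1 and loss_2 are any criteria sharing a call signature;
    schedule maps training progress in [0, 1] to a weight w in [0, 1].
    """
    def __init__(self, loss_1, loss_2, schedule):
        super().__init__()
        self.loss_1, self.loss_2 = loss_1, loss_2
        self.schedule = schedule
        self.progress = 0.0  # fraction of training completed, updated by the training loop

    def forward(self, output, target):
        w = self.schedule(self.progress)
        return w * self.loss_1(output, target) + (1 - w) * self.loss_2(output, target)

# Example: anneal linearly from pure L1 loss to pure MSE over training.
criterion = ScheduledLoss(nn.L1Loss(), nn.MSELoss(), schedule=lambda t: 1 - t)

# Inside the training loop, something would need to update the progress each step:
# criterion.progress = step / total_steps
```

A library version could presumably reuse whatever machinery already drives LR schedules, so the weight could follow any existing schedule shape rather than just a linear ramp.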
Just a thought, and I realize you have a lot to consider when building this, so something this experimental and unlikely to be widely used may not be a priority. But I thought I'd bring it up.