Improved Loss with Curriculum Learning (Paper and Video)

This paper on curriculum learning shows surprising improvements. Not many people are talking about it online, which is why I am sharing it here.
The core idea is to first estimate the difficulty of each training example, which can be done using an already trained network. Training then starts with only the simpler examples and adds more complicated ones over time. They demonstrate an overall improvement with this strategy.
It is also one of the few papers that considers the techniques from Leslie Smith.
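The easy-to-hard idea can be sketched in a few lines. This is a toy sketch: the loss-based scoring and the linear pacing function are my own hypothetical choices for illustration, not the paper's exact algorithm.

```python
import numpy as np

def difficulty_scores(losses):
    # rank examples by the loss a pretrained "scoring" network assigns them:
    # low loss = easy, high loss = hard
    return np.argsort(losses)

def curriculum_subset(order, step, total_steps, start_frac=0.3):
    # pacing function: start with the easiest `start_frac` of the data and
    # linearly grow to the full dataset by the end of training
    frac = start_frac + (1 - start_frac) * step / total_steps
    n = max(1, int(len(order) * frac))
    return order[:n]          # indices of the examples to train on at this step

# toy example: 10 examples, losses assigned by a pretrained scorer
losses = np.array([0.2, 1.5, 0.1, 0.9, 2.3, 0.4, 0.05, 1.1, 0.7, 3.0])
order = difficulty_scores(losses)
print(len(curriculum_subset(order, step=0, total_steps=100)))    # small, easy subset
print(len(curriculum_subset(order, step=100, total_steps=100)))  # whole dataset
```

At each training step you would sample batches only from the returned subset, so the network sees hard examples only late in training.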

On The Power of Curriculum Learning in Training Deep Networks

I only found this paper because of this video:


Thanks for the link!
There was another paper on curriculum learning where they found that slowly increasing the dropout rate, so the network starts with an easy setting that is steadily made harder, performed better than setting a fixed dropout rate and keeping it the same.

I would thus surmise that combining both aspects (a slow dropout warmup and slowly increasing the ‘difficulty’ of images, as in this paper) should be synergistic.
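For reference, the scheduled-dropout part can be sketched as a plain pacing function whose result you assign to a dropout layer's `p` each batch. This is a minimal sketch with a linear ramp; that paper's exact schedule may differ.

```python
def scheduled_dropout_p(step, total_steps, p_final=0.5):
    # start at an "easy" setting (no dropout) and steadily make training
    # harder by ramping the dropout probability up to p_final
    return p_final * min(1.0, step / total_steps)

# e.g. in a training loop you would set: dropout_layer.p = scheduled_dropout_p(...)
for step in (0, 500, 1000):
    print(step, scheduled_dropout_p(step, total_steps=1000))
```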


I guess you are talking about this paper:

Intuitively, it is plausible that those two should work nicely together. I totally agree with you on that.

I am very curious to see how this sort of idea evolves. Compared to neural network architecture and optimizers, it seems that this area hasn’t been researched as extensively.


Good find re: the paper, and yes, that’s the one!
I’d like to implement a callback that schedules dropout as a percentage, test that out, and then test breaking up the classes into easy/medium/hard to see how the two affect things together.
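The easy/medium/hard split could start from something as simple as a pretrained model's per-class accuracy. This is a hypothetical scoring choice for illustration, and splitting into thirds is arbitrary.

```python
import numpy as np

def split_classes_by_difficulty(per_class_acc):
    # rank classes by a pretrained model's per-class accuracy and split
    # them into easy / medium / hard thirds (most accurate = easiest)
    order = np.argsort(per_class_acc)[::-1]
    n = len(order)
    return order[: n // 3], order[n // 3 : 2 * n // 3], order[2 * n // 3 :]

# toy example: 6 classes with per-class accuracies from a pretrained model
acc = np.array([0.95, 0.40, 0.80, 0.55, 0.90, 0.30])
easy, medium, hard = split_classes_by_difficulty(acc)
print(easy, medium, hard)
```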

And I agree with you - these are areas that haven’t really been tested nearly as much as architecture!

I am interested to find out whether this kind of approach works for EfficientNet. It seems that other techniques that are especially popular here are not working with it:

Intuitively, there is a high chance it does.

I have my own implementation of EfficientNet that I will try to test today against XResNet. Next I’ll work on setting up the curriculum dropout and test that on both.
Not sure I can get all that done today (do have to work as well lol) but will post with any findings!


I got a basic curriculum dropout setup in the callback framework. I’ll test it out in the morning on XResNet and post out the code for it.


I hit some issues with the code because there is a bug in the notebook callback handler (it calls begin_epoch twice), but with some hardcoding it’s all working.

Here’s the code - I greatly welcome any improvements:

class dp_sched(Callback):
    "Schedule the dropout probability of `layer` from 0 up to `dropout_final`."

    def __init__(self, layer, dropout_final=.5, batch_size=538, num_epochs=1):
        self.layer = layer              # the dropout layer whose p we schedule
        self.dp_final = dropout_final
        self.batch_size = batch_size    # batches per epoch; computing it with len( caused a recursive err(?)
        self.num_epochs = num_epochs
        self.current_epoch = 0
        self.total_iterations = 0
        self.warmup_sets = 0
        self.middle_sets = 0
        self.full_dropout_sets = 0

    def begin_fit(self, **kwargs):
        self.n_epochs = max(1, self.num_epochs)   # avoid 0 epochs
        self.total_iterations = self.batch_size * self.n_epochs
        # main calculations for when to apply each dropout phase
        self.warmup_sets = int(self.total_iterations * .1)
        self.full_dropout_sets = self.warmup_sets * 2
        self.middle_sets = self.total_iterations - (self.warmup_sets + self.full_dropout_sets)
        self.start_full_dropout = self.warmup_sets + self.middle_sets
        print("breakout of sets: warmup", self.warmup_sets, " middle", self.middle_sets,
              " final", self.full_dropout_sets)

    def begin_epoch(self):
        self.current_epoch += 1

    def begin_batch(self):
        # self.iter (the current batch index) is supplied by the callback framework
        if self.iter < self.warmup_sets:
            self.layer.p = 0                 # warmup phase: no dropout
        elif self.iter >= self.start_full_dropout:
            self.layer.p = self.dp_final     # final phase: full dropout
        else:
            # middle phase: exponential ramp from 0 up to dp_final
            i = self.iter - self.warmup_sets
            pct = round(i / self.middle_sets, 2)
            dp_pct = 1 - round((1 / self.middle_sets) ** pct, 2)
            self.layer.p = round(dp_pct * self.dp_final, 2)
The paper shows that you need a smooth curve on the dropout adjustments - if you make big jumps, it causes the CNN to ‘forget’ and reset to some degree.
The paper has an algorithm to produce their curve, but I couldn’t make it work…and their GitHub repo just has a schedule they generated from Matlab or Excel, not actual code.
Anyway, I created a similar curve by doing:
0-10% - 0% dropout for the first 10% of total iterations
10-70% - an exponentially increasing dropout rate, from 0% up to the full rate, updated at each batch
70-100% - the full dropout rate
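The three phases above can be written as a standalone function to sanity-check smoothness. This mirrors the percentages described and uses the same exponential form as the callback's ramp; the 1000-iteration run is just a toy check.

```python
def dp_curve(it, total_iters, dp_final=0.5):
    warmup = 0.10 * total_iters          # 0-10%: no dropout
    full_start = 0.70 * total_iters      # 70-100%: full dropout
    if it < warmup:
        return 0.0
    if it >= full_start:
        return dp_final
    middle = full_start - warmup         # 10-70%: exponential ramp
    pct = (it - warmup) / middle
    return dp_final * (1 - (1 / middle) ** pct)

# toy check: over 1000 iterations the curve only ever creeps upward,
# with no big jumps that would make the network "forget"
vals = [dp_curve(it, 1000) for it in range(1000)]
print(vals[0], round(vals[500], 3), vals[999])
```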

I’ll try and run it tomorrow on Imagenette with XResNet50 to see how it compares to leaderboard results.

I’m waiting for fastai v2 to come out and then can hopefully finalize it. Note that you currently have to manually find the dropout layer and pass it in…I’d like to automate that finding aspect.
Also, in the paper they had 3 dropout layers - one right after the inputs (90% retention, i.e. up to 10% dropout), one at the middle conv layer (75% retention) and one at the final layer before the flatten (up to 50% dropout).
So, technically we need 3 layers to mimic the paper.

Right now I’m just running with one before the final output basically.
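Automating the "find the dropout layer" step could be a simple walk over the model's modules. This is a sketch in plain PyTorch; the toy model and its dropout rates are illustrative, not the paper's architecture.

```python
import torch.nn as nn

def find_dropout_layers(model):
    # walk the module tree and collect every Dropout layer, so a scheduler
    # can be attached to each one without hand-picking them
    return [m for m in model.modules() if isinstance(m, nn.Dropout)]

# toy model with three dropout placements, loosely like the paper's setup
model = nn.Sequential(
    nn.Dropout(0.1),                       # right after the inputs
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Dropout(0.25),                      # middle conv block
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(0.5),                       # before the final layer
    nn.Linear(16, 10),
)
print(len(find_dropout_layers(model)))  # → 3
```

Each returned layer could then get its own dp_sched instance with its own final rate.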

As noted, appreciate any input on the very rough code above!


Wonderful work you have done here! I have a quick question: would this be viable for tabular datasets? Or do they not operate in this fashion or need this sort of training framework?

Thank you!

Hi Zachary,
I’m afraid I’ve done almost no work with tabular datasets, so I can’t comment.
However, after coding this up, a recent paper came out saying that data augmentation alone is the optimal method - superior to dropout and weight decay - and they recommend not using those at all, with performance results to back up their point.

I wrote a quick article about it that links to the paper. Very interesting results:


Fascinating read! You’re right, that is an extremely large gap! I may play around with this a bit on tabular models, as I don’t believe there’s a good way to ‘augment’ our data?

If anyone’s interested, I’ve implemented a simple version of this for fastai v1 here:

It seems to give a slight improvement, but nothing major (~0.5%).