A challenge for you all

I’ve managed to get an accuracy of 94.5% for 5 epochs in my notebook. Working with the curriculum learning approach and the reduced batch size, I’ve adjusted the initial weight generation to be efficient for Mish, as suggested here, rather than leaky ReLU. I also updated the curriculum learning ‘schedule’ to replace more at different epochs and used test time augmentation. The best accuracy I’d managed without test time augmentation was 94.2%.
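
On the Mish-adjusted init: a minimal sketch of the idea is below, estimating a gain for Mish empirically and using it in a Kaiming-style init. This is only an illustration of the approach; the exact adjustment used in the notebook may differ, and `mish_gain` / `init_weights` are names I’ve made up here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish_gain(n=1_000_000):
    # Empirical analogue of the sqrt(2) ReLU gain: pick the scale so that
    # Mish(x) has unit RMS when x ~ N(0, 1).
    x = torch.randn(n)
    return (1.0 / F.mish(x).pow(2).mean().sqrt()).item()

def init_weights(m, gain=mish_gain()):
    # Kaiming-style normal init, but with the Mish gain instead of the leaky ReLU one.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        fan_in = m.weight[0].numel()
        nn.init.normal_(m.weight, std=gain / fan_in ** 0.5)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# model.apply(init_weights)
```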

With 20 epochs and similar methods, I’ve managed to get an improved accuracy of 94.9% - notebook here

7 Likes

Along with everything in my previous attempts, I’ve tried adding Jeremy’s Random Erase augmentation as well, and that made an improvement for 20 and 50 epochs.

For 20 epochs, that’s taken the accuracy to 95.2% - notebook - and for 50 epochs an accuracy of 95.5% - notebook. That’s using dropout, a smaller batch size, Mish activation, curriculum learning, transforms including Random Erase, and test time augmentation.
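
For reference, a random-erase transform along these lines looks roughly like the sketch below, replacing a random patch with noise that matches the image statistics rather than zeroing it. Details (patch size, number of patches, noise statistics) may differ from what the notebook actually uses.

```python
import torch

def rand_erase(x, pct=0.2):
    # Replace one random rectangle (pct of each spatial dim) with Gaussian noise
    # matching the input's mean and std, rather than plain zeros.
    h, w = x.shape[-2], x.shape[-1]
    szy, szx = int(pct * h), int(pct * w)
    sty = int(torch.randint(0, h - szy, (1,)))
    stx = int(torch.randint(0, w - szx, (1,)))
    x[..., sty:sty + szy, stx:stx + szx] = torch.randn(szy, szx) * x.std() + x.mean()
    return x
```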

4 Likes

I was able to combine the winning approaches so far - ConvMixer from @tcapelle, curriculum learning from @tommyc and ideas from @christopherthomas to achieve 94.6% in 5 epochs. Notebook here.
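
For anyone who hasn’t seen it, a minimal ConvMixer along the lines of the “Patches Are All You Need” paper looks roughly like this; the variant in the notebook may differ in activation, norm, sizes and input channels.

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim, depth, kernel_size=5, patch_size=2, n_classes=10, in_ch=1):
    return nn.Sequential(
        # Patch embedding: a strided conv splits the image into patches.
        nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(), nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Depthwise conv mixes spatial locations, with a residual connection.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(), nn.BatchNorm2d(dim))),
            # Pointwise conv mixes channels.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, n_classes))
```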

I discovered a bug: we’re not shuffling the training set in miniai! So you folks might want to try rerunning your notebooks, particularly 20 and 50 epochs, and see if you get better results after pulling from git.
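
With plain PyTorch data loaders the fix amounts to passing shuffle=True for the training loader; the datasets below are dummy stand-ins just to show the flag.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real Fashion-MNIST datasets.
train_ds = TensorDataset(torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,)))
valid_ds = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))

bs = 64
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)       # reshuffle every epoch
valid_dl = DataLoader(valid_ds, batch_size=bs * 2, shuffle=False)  # validation order doesn't matter
```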

2 Likes

Made a few tweaks and got up to 94.7% for 5 epochs. Notebook here.

What tweaks?

Reduced Dropout to 0.1 from 0.2, increased depth to 11 from 10, used a patch size of 3x3 instead of 1x1
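
In terms of the ConvMixer sketch posted earlier in the thread, those tweaks amount to roughly the following; only the depth, patch size and dropout value come from this post, the rest is illustrative.

```python
# - dropout of 0.1 instead of 0.2 (wherever the notebook places its nn.Dropout layer)
# - depth 11 instead of 10, and 3x3 patches instead of 1x1
model = conv_mixer(dim=256, depth=11, patch_size=3)  # dim=256 is illustrative
```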

3 Likes

I created a custom scheduler, OneCycleLRWithPlateau. It adds a middle plateau phase to the OneCycleLR scheduler, which allows you to maintain the max learning rate for a tunable percentage of the training cycle. It is defined in the snippet below:

from torch.optim import lr_scheduler

class OneCycleLRWithPlateau(lr_scheduler.OneCycleLR):
    def __init__(self, *args, pct_start=0.1, pct_plateau=0.2, **kwargs):
        kwargs["pct_start"] = pct_start
        super().__init__(*args, **kwargs)
        # Rebuild the phase list with an extra plateau phase: warm up for pct_start
        # of the steps, hold max_lr for pct_plateau of the steps, then anneal back down.
        self._schedule_phases = [
            {   # warm-up: initial_lr -> max_lr
                'end_step': float(pct_start * self.total_steps) - 1,
                'start_lr': 'initial_lr',
                'end_lr': 'max_lr',
                'start_momentum': 'max_momentum',
                'end_momentum': 'base_momentum',
            },
            {   # plateau: hold max_lr (and base momentum)
                'end_step': float((pct_start + pct_plateau) * self.total_steps) - 2,
                'start_lr': 'max_lr',
                'end_lr': 'max_lr',
                'start_momentum': 'base_momentum',
                'end_momentum': 'base_momentum',
            },
            {   # anneal: max_lr -> initial_lr
                'end_step': self.total_steps - 1,
                'start_lr': 'max_lr',
                'end_lr': 'initial_lr',
                'start_momentum': 'base_momentum',
                'end_momentum': 'max_momentum',
            },
        ]

Using the configuration below increased the accuracy of @christopherthomas’s most recent submission from 94.5% to 94.7% over 5 epochs.

sched = partial(OneCycleLRWithPlateau, max_lr=0.01, total_steps=tmax, pct_start=0.1, pct_plateau=0.6)

⚠️ Reproducibility has been a challenge. Generally the configuration above will produce a pre-TTA accuracy of 94.3%, but the post-TTA accuracy can fluctuate from 94.5% to 94.7%.

1 Like

I was interested to see what the model was mis-classifying, and whether the models are inadequate or the images are so misleading that it might be almost impossible to classify them correctly. I took a model with a performance of 94.3% after 20 epochs and looked at a confusion matrix, which showed the following:

It seems that shirt is the class that is most frequently misclassified (by a considerable amount). The most common misclassification is with T-shirt / top, but Pullover and Coat are also common cases.

To see what was happening I plotted some of the most misclassified cases (first, images of shirts incorrectly classified as T-shirt / top; second, T-shirts/tops predicted as shirts). I then plotted 20 images of shirts that were correctly predicted as shirts.



It seems to me that, as a human, I would make some of the same mistakes as the model; there seems to be some visual overlap between the classes. Maybe this is why all of the models seem to be peaking at around the same point, and those that get better performance seem to show a lot of variability depending on seeds etc.
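
For anyone wanting to reproduce this kind of analysis, a rough sketch is below; `model`, `valid_dl` and `class_names` are placeholders for the corresponding objects in your notebook.

```python
import torch
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

preds, targs = [], []
model.eval()
with torch.no_grad():
    for xb, yb in valid_dl:                 # your validation DataLoader
        preds.append(model(xb).argmax(dim=1).cpu())
        targs.append(yb.cpu())

cm = confusion_matrix(torch.cat(targs), torch.cat(preds))
ConfusionMatrixDisplay(cm, display_labels=class_names).plot()  # class_names: the 10 Fashion-MNIST labels
```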

5 Likes

Put set_seed(1) just before you create the model and let us know what you get. Let’s all aim to use the same seed.
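
For anyone not using miniai, set_seed does roughly the following (an approximation, not a copy of the helper):

```python
import random
import numpy as np
import torch

def set_seed(seed, deterministic=False):
    # Seed every RNG source so runs are comparable across notebooks.
    torch.use_deterministic_algorithms(deterministic)
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)

set_seed(1)  # call this immediately before creating the model
```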

1 Like

I’ve tried increasing the number of filters to 256 from the 2nd ResBlock onwards, keeping them at 256 until the last ResBlock (32, 256, 256, 256, 256, 256). This has resulted in an accuracy of 94.8% with 5 epochs for a few training runs, with 94.6% being the accuracy before test time augmentation. It has, though, increased the model’s size to 5,427,508 parameters - notebook here
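
A rough, self-contained sketch of a model with those filter widths is below. The notebook uses the course’s ResBlock helper, so details such as the shortcut path and pooling (and hence the exact parameter count) may differ.

```python
import torch.nn as nn

def conv_block(ni, nf, stride=1, ks=3, act=True):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
              nn.BatchNorm2d(nf)]
    if act:
        layers.append(nn.Mish())
    return nn.Sequential(*layers)

class ResBlock(nn.Module):
    # Simplified residual block in the spirit of the course's ResBlock.
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        self.convs = nn.Sequential(conv_block(ni, nf, stride=stride),
                                   conv_block(nf, nf, act=False))
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.convs(x) + self.idconv(self.pool(x)))

def get_model(nfs=(32, 256, 256, 256, 256, 256), n_classes=10):
    # 32 filters in the first ResBlock, then 256 throughout, as described above.
    layers = [ResBlock(1, nfs[0], stride=1)]
    layers += [ResBlock(nfs[i], nfs[i + 1], stride=2) for i in range(len(nfs) - 1)]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(nfs[-1], n_classes)]
    return nn.Sequential(*layers)
```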

With this model and training for 5 epochs, it appears to result in a higher accuracy without the dropout layer. I will run my notebooks again as well with a seed of 1 rather than 42.

I did try running my previous notebooks for 5, 20 and 50 epochs again after the bug fix and didn’t get any improvements on the previous results.

1 Like

With almost the same model (with a dropout layer added back in) as my last best 5 epoch training run, and the previous 20 epoch curriculum learning values, I’ve got 95.4% accuracy with TTA after 20 epochs of training (notebook). That’s using a seed of 1.

I’ve re-run a copy of my last 5 epoch notebook with a seed of 1 and I’m still getting 94.8% accuracy after test time augmentation (notebook).

I’ve also trained with my updated ResBlock filters for 50 epochs with the same techniques (dropout, a smaller batch size, Mish activation, curriculum learning, and Random Erase, Crop and Horizontal Flip transforms), and that’s increased the accuracy to 95.6% after 50 epochs of training (notebook). The accuracy was 0.1% lower with test time augmentation. That’s again with a seed of 1.

1 Like

To measure the impact of the custom scheduler I ran @christopherthomas’s penultimate 5 epoch challenge notebook 5 times with the regular scheduler and 5 times with the custom scheduler and compared the average accuracy.

As shown in the table below, the custom scheduler improved the average post-TTA accuracy from 94.28% to 94.44%.

| Scheduler | Avg. Pre-TTA Accuracy (%) | Avg. Post-TTA Accuracy (%) |
| --- | --- | --- |
| Regular | 93.96 | 94.28 |
| Custom | 94.22 | 94.44 |

Note:

  1. The only change I made to Chris’s notebook is using set_seed(1) instead of set_seed(42).
  2. The custom scheduler used the following configuration:
    sched = partial(OneCycleLRWithPlateau, max_lr=0.01, total_steps=tmax, pct_start=0.1, pct_plateau=0.3)

Experiment Results (Granular)

| Exp no. | Scheduler | Pre-TTA Accuracy (%) | Post-TTA Accuracy (%) |
| --- | --- | --- | --- |
| 1 | Regular | 94.2 | 94.5 |
| 2 | Regular | 93.9 | 94.2 |
| 3 | Regular | 94.0 | 94.2 |
| 4 | Regular | 93.9 | 94.1 |
| 5 | Regular | 93.8 | 94.4 |
| 1 | Custom | 94.0 | 94.4 |
| 2 | Custom | 94.4 | 94.6 |
| 3 | Custom | 94.4 | 94.4 |
| 4 | Custom | 94.1 | 94.3 |
| 5 | Custom | 94.2 | 94.5 |

Is anyone else observing the same level of fluctuation when running their notebooks?

2 Likes

I had been seeing some variation in the accuracy between training runs of the same notebook, particularly with the TTA accuracy. I’d thought it was likely to be a result of dropout and also the random transforms during training, especially with so few epochs. With my most recent training with my revised model, without dropout, I’d only observed a variation of 0.1% across different training runs.

From that notebook, I’ve just tried removing the random transforms and removing dropout. The accuracy then appears to be consistent between 92.5% and 92.6% over 6 subsequent training runs.

1 Like

I’ve realised my revised model was actually smaller as the previous one had 13,288,004 parameters. I’ve tried increasing the number of filters to 320 for all but the first ResBlock and that’s resulted in 95.7% accuracy after TTA for 20 epochs of training! (notebook)

Running 50 epochs at the moment to see if that also improves.

1 Like

TL;DR I had an idea that it might be interesting to use spaced repetition for curriculum learning, similar to Anki. I have not implemented it yet, sharing in case someone else wants to try it too.

I first heard about “curriculum learning” above in this topic.

I was thinking about how to continue training during the inference stage, i.e. after a model is released to production, so it can continue to learn, and I came up with the idea of training using spaced repetition, like Anki does for human learning.

We would schedule items for training based on how well the model is doing with them. If it fails on an item, schedule that item for study again ASAP, e.g. in the next batch of the same epoch. If it succeeds on an item, schedule it out into the future a bit. When it repeatedly succeeds, it would revise that item at exponentially increasing intervals, e.g. 1 day, 2 days, 4 days, 8 days, … (or rather, some number of batches / epochs later)
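
A toy sketch of what such a scheduler might look like is below; this is purely illustrative (not implemented in any of the notebooks here), and the class name and intervals are made up.

```python
import random
from collections import defaultdict

class SpacedRepetitionSampler:
    """Anki-style scheduling of training items: items the model gets wrong are
    re-queued almost immediately, items it gets right are re-queued at
    exponentially growing intervals (measured in batches)."""

    def __init__(self, n_items, base_interval=1, factor=2):
        self.intervals = defaultdict(lambda: base_interval)
        self.due = {i: 0 for i in range(n_items)}  # batch index at which each item is next due
        self.factor = factor
        self.step = 0

    def next_batch(self, batch_size):
        due_items = [i for i, d in self.due.items() if d <= self.step]
        random.shuffle(due_items)
        self.step += 1
        return due_items[:batch_size]

    def update(self, item, correct):
        if correct:
            self.intervals[item] *= self.factor  # push further into the future
        else:
            self.intervals[item] = 1             # review again ASAP
        self.due[item] = self.step + self.intervals[item]
```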

I haven’t tried this yet for machine learning, but spaced repetition can work very well for human study, so I guess it would work well for machine learning too. It would also work well for adding new items to the training set later on (e.g. in production, or for subsequent stages of learning). I’ve personally used Anki to learn kanji and martial arts techniques; it’s surely much more effective than just reading through the whole dataset repeatedly, as we do in each normal epoch without curriculum learning.

It would make sense to combine this with other curriculum learning ideas, such as tackling the easier cases first.

edit: I found a paper mentioning this approach and some more advanced approaches, which cites some other papers on the topic too: https://arxiv.org/pdf/2011.00080.pdf

2 Likes

I’ve managed to improve the 50 epoch accuracy with TTA to 95.8% using a similar model to before with 256-filter ResBlocks and lowering the learning rate to 1e-2 (notebook). Also, I noticed the accuracy on the horizontally flipped images was often higher, and I’ve found that increasing the probability of the flip from the default 0.5 to between 0.6 and 0.75 appears to improve the accuracy with TTA. With that and 288-filter ResBlocks I’ve gained a small improvement in accuracy with TTA to 94.9% for 5 epochs of training (notebook).
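
For reference, the flip-probability tweak with torchvision transforms is just the following (the notebook may apply the flip as a batch transform instead):

```python
import torchvision.transforms as T

flip = T.RandomHorizontalFlip(p=0.6)  # raised from the default 0.5; values up to 0.75 also helped
```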

5 Likes

This is similar to the analysis I did here: Discord

1 Like

We have a way to cut down the training time by half or more for small batches if we use lazy metrics. Have a look at the code for DDPM. It should work here as well. (I will post a notebook that includes the changes, along with my resnet18d experiments.)

2 Likes