A challenge for you all

To measure the impact of the custom scheduler I ran @christopherthomas’s penultimate 5 epoch challenge notebook 5 times with the regular scheduler and 5 times with the custom scheduler and compared the average accuracy.

As shown in the table below, the custom scheduler improved the average post-TTA accuracy from 94.28% to 94.44%.

| Scheduler | Avg. Pre-TTA Accuracy (%) | Avg. Post-TTA Accuracy (%) |
|---|---|---|
| Regular | 93.96 | 94.28 |
| Custom | 94.22 | 94.44 |


  1. The only change I made to Chris’s notebook was using set_seed(1) instead of set_seed(42).
  2. The custom scheduler used the following configuration:
    sched = partial(OneCycleLRWithPlateau, max_lr=0.01, total_steps=tmax, pct_start=0.1, pct_plateau=0.3)
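
For anyone who hasn't looked at the scheduler, here is a rough, framework-free sketch of the kind of learning-rate curve those parameters suggest: warm up for `pct_start` of the steps, hold `max_lr` for `pct_plateau` of the steps, then anneal. The cosine shape and the `div_factor` / `final_div_factor` defaults are assumptions borrowed from PyTorch's `OneCycleLR`, and the function name is hypothetical. This is not the actual `OneCycleLRWithPlateau` code.

```python
import math


def _cos_anneal(start, end, t):
    """Cosine interpolation from start (t=0) to end (t=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * t))


def one_cycle_with_plateau_lr(step, total_steps, max_lr,
                              pct_start=0.1, pct_plateau=0.3,
                              div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0-based) for a warmup / plateau / anneal schedule.

    Assumed, not taken from the notebook: cosine warmup and annealing,
    and the two div factors.
    """
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps
    plateau_end = (pct_start + pct_plateau) * total_steps

    if warmup_end > 0 and step < warmup_end:
        # Phase 1: ramp up from initial_lr to max_lr.
        return _cos_anneal(initial_lr, max_lr, step / warmup_end)
    if step < plateau_end:
        # Phase 2: hold the maximum learning rate.
        return max_lr
    # Phase 3: anneal from max_lr down to min_lr by the last step.
    remaining = max(1.0, total_steps - 1 - plateau_end)
    t = min(1.0, (step - plateau_end) / remaining)
    return _cos_anneal(max_lr, min_lr, t)


schedule = [one_cycle_with_plateau_lr(s, 100, max_lr=0.01) for s in range(100)]
```

Plotting `schedule` is a quick way to check the shape before wiring anything into a training loop.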
Experiment Results (Granular)
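| Exp no. | Scheduler | Pre-TTA Accuracy (%) | Post-TTA Accuracy (%) |
|---|---|---|---|
| 1 | Regular | 94.2 | 94.5 |
| 2 | Regular | 93.9 | 94.2 |
| 3 | Regular | 94.0 | 94.2 |
| 4 | Regular | 93.9 | 94.1 |
| 5 | Regular | 93.8 | 94.4 |
| 1 | Custom | 94.0 | 94.4 |
| 2 | Custom | 94.4 | 94.6 |
| 3 | Custom | 94.4 | 94.4 |
| 4 | Custom | 94.1 | 94.3 |
| 5 | Custom | 94.2 | 94.5 |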
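
To put the run-to-run fluctuation in numbers, here is a small sketch that recomputes the averages and the sample standard deviation from the post-TTA column above. It only uses the values in the table; the variable names are mine.

```python
from statistics import mean, stdev

# Post-TTA accuracies (%) copied from the granular results table.
post_tta = {
    "Regular": [94.5, 94.2, 94.2, 94.1, 94.4],
    "Custom": [94.4, 94.6, 94.4, 94.3, 94.5],
}

summary = {
    name: {"mean": mean(accs), "stdev": stdev(accs)}
    for name, accs in post_tta.items()
}

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.2f} stdev={stats['stdev']:.2f}")

gap = summary["Custom"]["mean"] - summary["Regular"]["mean"]
print(f"gap between means: {gap:.2f}")
```

With only 5 runs each, the gap between the two means is of the same order as the spread within each group, so more runs would help confirm the improvement.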

Is anyone else observing the same level of fluctuation when running their notebooks?


I had been seeing some variation in the accuracy between training runs of the same notebook, particularly with the TTA accuracy. I’d thought it was likely a result of Dropout and the random transforms during training, especially with so few epochs. In my most recent training with my revised model, without Dropout, I’d only observed a variation of 0.1% across different training runs.

From that notebook, I’ve just tried removing the random transforms and removing dropout. The accuracy then appears to be consistent between 92.5% and 92.6% over 6 subsequent training runs.


I’ve realised my revised model was actually smaller as the previous one had 13,288,004 parameters. I’ve tried increasing the number of filters to 320 for all but the first ResBlock and that’s resulted in 95.7% accuracy after TTA for 20 epochs of training! (notebook)

Running 50 epochs at the moment to see if that also improves.


TL;DR I had an idea that it might be interesting to use spaced repetition for curriculum learning, similar to Anki. I have not implemented it yet, sharing in case someone else wants to try it too.

I first heard about “curriculum learning” above in this topic.

I was thinking about how to continue training during the inference stage, i.e. after a model is released to production, so it can continue to learn, and I came up with the idea of training using spaced repetition, like Anki does for human learning.

We would schedule items for training based on how well the model is doing with them. If it fails on an item, schedule that item for study again ASAP, e.g. in the next batch of the same epoch. If it succeeds on an item, schedule it out into the future a bit. When it repeatedly succeeds, it would revise that item at exponentially increasing intervals, e.g. 1 day, 2 days, 4 days, 8 days, … (or rather, some number of batches / epochs later)
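
A minimal sketch of that scheduling rule, just to make the idea concrete. Nothing here has been tried on a real model: the class name is hypothetical, a "step" stands for a batch or an epoch, and the doubling interval is the simplest version of the 1, 2, 4, 8, … pattern.

```python
class SpacedRepetitionScheduler:
    """Tracks when each training item is next due for review."""

    def __init__(self, items):
        # Every item starts due immediately with the shortest interval.
        self.interval = {item: 1 for item in items}
        self.due = {item: 0 for item in items}

    def due_items(self, step):
        """Items that should be trained on at this step."""
        return [item for item, due in self.due.items() if due <= step]

    def report(self, item, step, correct):
        """Record whether the model got `item` right at `step`."""
        if correct:
            # Success: push the item out, doubling the gap each time.
            self.due[item] = step + self.interval[item]
            self.interval[item] *= 2
        else:
            # Failure: reset and study it again as soon as possible.
            self.interval[item] = 1
            self.due[item] = step + 1


sched = SpacedRepetitionScheduler(["easy", "hard"])
sched.report("easy", step=0, correct=True)   # next due at step 1
sched.report("hard", step=0, correct=False)  # next due at step 1
sched.report("easy", step=1, correct=True)   # next due at step 3
sched.report("hard", step=1, correct=True)   # next due at step 2
```

In a real training loop, "correct" could be replaced by a loss threshold, and the due items would have to be packed into batches of a fixed size.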

I haven’t tried this yet for machine learning, but spaced repetition can work very well for human study, so I guess it could work well for machine learning too. It should also suit adding new items to the training set later on (e.g. in production, or for subsequent stages of learning). I’ve personally used Anki to learn kanji and martial arts techniques. For human study it’s much more effective than just reading the whole dataset repeatedly, which is what we do in each normal epoch without curriculum learning.

It would make sense to combine this with other curriculum learning ideas, such as tackling the easier cases first.

edit: I found a paper mentioning this approach and some more advanced approaches, which cites some other papers on the topic too: https://arxiv.org/pdf/2011.00080.pdf


I’ve managed to improve the 50 epoch accuracy with TTA to 95.8% using a similar model to before with 256 filter ResBlocks and lowering the learning rate to 1e-2 (notebook). Also, I noticed the accuracy on the horizontally flipped images was often higher, and I’ve found that increasing the probability of the flip from the default 0.5 to between 0.6 and 0.75 appears to improve the accuracy with TTA. With that and 288 filter ResBlocks I’ve gained a small improvement in accuracy with TTA, to 94.9% for 5 epochs of training (notebook).


This is similar to the analysis I did here: Discord


We have a way to cut the training time by half or more for small batches if we use lazy metrics. Have a look at the code for DDPM. It should work here as well. (I will post a notebook that includes the changes along with my resnet18d experiments.)


I’ve improved the speed of your notebook in an attempt to reproduce your results: from 566s per 5 epochs down to 131s. I think it might be useful for anyone still playing with the challenge.

Unfortunately, I was not able to get the 94.9%. Your notebook gave me 94.51% without any modification, the rewritten version trains up to 94.74% after TTA (94.44% without).

This is still a very good result, as without curriculum learning the same model trains to 93.9% with TTA.
But it suggests that model9 is sensitive to small changes in training; other models like resnet18d are less sensitive (they train up to 94.2% without TTA).

Here are the notebooks:


Great! What were the key changes you made to speed it up?

The changes are similar to the ddpm. The bottleneck was sending data and batch construction.

  1. I’ve hidden the latency by letting the second batch be created while the first one is processed on GPU. This improved the training time from 17s to 11s per epoch for a toy model, batch size 1024. (done via LazyProgressCB and LazyMetricsCB)

  2. Then I’ve cached our datasets as a list of dicts (it is way faster than the caching provided by Hugging Face datasets). This brought the time down to 1.23 sec per epoch. (cache_dataset_as_dict)

So we have a 14 times speed improvement on small models. I’ve applied the changes to Christopher’s notebook, rewriting the random sampler as well to make sure it does not sync to the CPU.
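
For readers who want the gist without opening the notebooks, here is a generic, stdlib-only sketch of the two ideas: building the next batch while the current one is being processed, and materialising a dataset once as a list of plain dicts. This is an illustration under my own assumptions, not the code behind LazyProgressCB, LazyMetricsCB or cache_dataset_as_dict; both function names below are hypothetical.

```python
import queue
import threading


def prefetch(batches, buffer_size=1):
    """Yield items from `batches`, producing them in a background thread.

    While the consumer works on one batch, the worker is already
    building the next one. Error handling in the worker is omitted.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for batch in batches:
            buffer.put(batch)
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def cache_as_list_of_dicts(dataset):
    """Read every sample once and keep it in memory as a plain dict."""
    return [dict(sample) for sample in dataset]


cached = cache_as_list_of_dicts([{"image": [0, 1], "label": 3},
                                 {"image": [1, 0], "label": 7}])
labels = [sample["label"] for sample in prefetch(cached)]
```

The caching only pays off when the whole dataset fits in memory, which is the case for a dataset of this size.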


How about resnet18d trained up to 95.78% in 285 sec and to 94.2% in 35s (on a 2080 Ti)?

I know this is not about the speed, but I hope to spark your interest again, as this dataset is very rewarding to work with when you use LazyMetrics/LazyProgress. (Jeremy, would you accept a PR for that?)

94.2% with vanilla resnet18d

The trick was to scale up the image to 64px and initialise resnet properly. Then vanilla resnet18d trains up to 94.2% with AdamW, with or without weight decay.

BTW, we are using AdamW, which has a default weight_decay of 0.01, even though our models have batch norm. It doesn’t seem to hurt though, at least for lr=2e-2.

Upscaling probably helps resnet retain information in later blocks. Here is the accuracy after each epoch contrasted with the actual image scale:

Getting to 95.78% and 95.82%

Scaling worked, but then I wasn’t able to get anywhere near the top results after applying our training improvements. General ReLU/Swish/Hardswish, @christopherthomas’s curriculum learning, @tommyc’s OneCycleWithPlateau and augmentation with TTA all have no visible impact for 5 epochs.

Thanks to @johnri99’s analyses I thought about implementing mixup and label smoothing, which I hoped would help with noise. They do, but only sometimes :). Combined with Google’s Lion optimiser, though, they let me improve from 95.63%/95.55% up to 95.78% (50 epochs, 289s).
Here are the relevant attributes in a small DSL that wraps learner:

run(timm_model('resnet18d', upscale32.bilinear(2), leaky=0.0, drop_rate=0.4),
    get_augcb(transforms.RandomCrop(28, padding=1, fill=-0.800000011920929)),
    lion(1e-3, bs=512),          # adamw(1e-2, bs=512) gives 95.54%
    mixup(0.4, lbl_smooth=0.1),  # if removed gives 95.53% (lion) 95.63% (adamw)
    epochs=50, tta=True)
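
In case mixup and label smoothing are new to anyone, here is a tiny pure-Python sketch of what the two do to a pair of samples. It is not the `mixup(...)` callback from the DSL above. I am assuming the 0.4 there is the Beta distribution parameter, and the function names are hypothetical.

```python
import random


def smooth_one_hot(label, num_classes, eps=0.1):
    """One-hot target with `eps` of the mass spread evenly over all classes."""
    off_value = eps / num_classes
    return [off_value + (1.0 - eps if i == label else 0.0)
            for i in range(num_classes)]


def sample_lambda(alpha=0.4, rng=random):
    """Mixing weight drawn from Beta(alpha, alpha), as in the mixup paper."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two inputs and their (soft) targets with the same weight."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


y_a = smooth_one_hot(0, num_classes=10, eps=0.1)
y_b = smooth_one_hot(3, num_classes=10, eps=0.1)
x_mixed, y_mixed = mixup_pair([1.0, 0.0], y_a, [0.0, 1.0], y_b, lam=0.75)
```

In practice the mixing is done on whole batches on the GPU, usually by pairing each batch with a shuffled copy of itself.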

Btw, Lion is an absolute marvel. It is almost as simple as SGD, it keeps less state than Adam (one moving average), and it does not hide the fact that it updates the model by the learning rate regardless of the gradient’s magnitude, which is quite pleasing (Adam does much the same when the update direction does not change, but it is harder to notice). Here is a simplified (3 loc) implementation that updates 1 parameter
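
A minimal sketch of a Lion step for a single scalar parameter, written from the published update rule with the paper's default betas. It is not necessarily the implementation referred to above, and the function name is mine.

```python
def lion_step(theta, m, grad, lr=1e-3, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter; returns (new_theta, new_m)."""
    # Interpolate between the momentum and the current gradient.
    c = beta1 * m + (1.0 - beta1) * grad
    # Only the sign is used, so the step size does not depend on |grad|.
    sign = (c > 0) - (c < 0)
    theta = theta - lr * (sign + weight_decay * theta)
    # The single moving average that Lion keeps as state.
    m = beta2 * m + (1.0 - beta2) * grad
    return theta, m


theta_small, m_small = lion_step(1.0, 0.0, grad=0.5, lr=0.1)
theta_large, m_large = lion_step(1.0, 0.0, grad=500.0, lr=0.1)
```

Both calls move the parameter by exactly `lr`, even though the gradients differ by a factor of 1000, which is the behaviour described above.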

I also trained resnet50d; it gets slightly better results more often than resnet18d. The best I have is 95.82%, which I got using dadaptation.DAdaptAdam (I don’t have an implementation of DAdaptAdam from scratch yet, as I’m still processing the paper). Here is the config:

run(timm_model('resnet50d', upscale32.bilinear(2), leaky=0.0, drop_rate=0.4),
    transforms.RandomCrop(28, padding=1, fill=-0.800000011920929),
    mixup(0.4, lbl_smooth=0.1),
    opt_func=partial(dadaptation.DAdaptAdam, weight_decay=0.0), base_lr=1,
    epochs=50, tta=True)

If you want to have a closer look here is a notebook with the relevant experiments extracted.


What PR are you proposing exactly?

Sorry for being unclear. I meant speeding up miniai by making MetricsCB and ProgressCB sync to the CPU only once per epoch by default, and maybe adding cache_dataset_as_dict, so it is easier for everyone to benefit from a quicker fit.

Sounds reasonable to me, if the PR doesn’t make things too complicated. (And ditto for the caching idea.)


Here you go, I’ve improved the code a bit to show loss plot during epoch, every n batches.


Hello everybody!
Just jumping in after looking at the code.
Speaking about Chris’s sampler: it looks like there is a “bug” here, but it works well anyway, so maybe it’s intended…
With p=0.5, we take the 50% of our samples with the lowest loss and replace them with the top-loss samples. So now half of the samples appear twice.
On the next epoch, if p != 0, we take the SAME list of samples and mix through it.
So we “overtrain” the model on those samples until p == 0, when we use the full list of indexes again.
And it works well. Maybe I understood the code wrong…
I simplified this code and made a new version the way I think it should be.
Will share it later.
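
To make the behaviour described above concrete, here is a small sketch of one resampling step as I read it. It is not Chris’s sampler code; the function name is hypothetical and it works on plain Python lists.

```python
def resample_by_loss(losses, p):
    """Indices for the next epoch: drop the lowest-loss fraction `p`
    and fill the gap with repeats of the highest-loss samples."""
    n = len(losses)
    k = int(n * p)
    # Sample indices ordered from highest to lowest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    kept = order[:n - k]   # everything except the k easiest samples
    repeats = order[:k]    # the k hardest samples, seen a second time
    return kept + repeats


losses = [0.1, 0.9, 0.5, 0.3]
half = resample_by_loss(losses, p=0.5)
full = resample_by_loss(losses, p=0.0)
```

With p=0.5 the hardest half of the samples appears twice and the easiest half not at all. If the next epoch resamples from this list rather than from the full dataset, the easy samples stay out until p reaches 0.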

Here is the link to the notebook with the changes.
Changes in TopLossesCallback: sort instead of topk, and don’t calculate top_loss when it isn’t needed.
There are two versions of CustomTrainingSampler: a simplified one, and one that mixes on the new samples list.

colab notebook

If you’d like to use the Mish activation, PyTorch has a fast built-in implementation: torch.nn.Mish


Glad to see people still using it to degrees of success. :slight_smile:

This is a bit cheeky but… I think this model technically has the highest score for 5 epochs. This is literally the same code as @christopherthomas’s, just with some small parameter tweaks, and it increases the final result from 0.949 to 0.95. Here’s the Colab link: Google Colab


I was going through the winning entry for 5 epochs and noticed it had a batch size of 256. Jeremy, what you said about “the ideal batch size is 1” stuck with me, so I gave it a shot and tried reducing the batch size to 1. That failed because the data had a different shape than expected, so I switched to 2. It then trained, but not very stably.
I tried reducing the learning rate, since it makes sense that you don’t want to learn too much from just two data points, but it didn’t really seem to help much.

lr = 1e-3

lr = 5e-5

lr = 5e-6

So then I tried a batch size of 32 with lr 1e-2, and that did work and trained a fair bit better.

Training still looks a bit more unstable than it should be though so I’m leaving it to run another time with a learning rate of 1e-3 and let’s see if that makes things a bit better. I’ll also leave one with batch size 16 and lr 1e-2 to run and we can see how that works out too.