A challenge for you all

I’ve improved the speed of your notebook in an attempt to reproduce your results: from 566s per 5 epochs down to 131s. I think it might be useful for anyone still playing with the challenge.

Unfortunately, I was not able to get the 94.9%. Your notebook gave me 94.51% without any modification; the rewritten version trains up to 94.74% after TTA (94.44% without).

This is still a very good result, as without curriculum learning the same model trains to 93.9% with TTA.
But it suggests that model9 is sensitive to small changes in training; other models like resnet18d are less sensitive (they train up to 94.2% without TTA).

Here are the notebooks:


Great! What were the key changes you made to speed it up?

The changes are similar to the DDPM ones. The bottleneck was sending data to the GPU and batch construction.

  1. I’ve hidden the latency by letting the second batch be created while the first one is processed on the GPU. This improved the training time from 17s to 11s per epoch for a toy model with batch size 1024 (done via LazyProgressCB and LazyMetricsCB; see the sketch after the next paragraph).

  2. Then I’ve cached our datasets as a list of dicts (it is way faster than the caching provided by huggingface datasets). This brought the time down to 1.23s per epoch (cache_dataset_as_dict; a sketch is below).
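
A rough sketch of the caching idea, assuming the dataset already yields tensors (e.g. via a transform); the real cache_dataset_as_dict is in the notebook, so the code below is only illustrative:

from torch.utils.data import DataLoader

def cache_dataset_as_dict(ds):
    # Iterate the (already tensor-producing) dataset once and keep every item
    # as a plain dict in a Python list, so batch construction never goes back
    # through the huggingface datasets backend.
    return [dict(item) for item in ds]

# A list of dicts of tensors is itself a valid map-style dataset, and the
# default collate function stacks each key into a batch:
# cached = cache_dataset_as_dict(train_ds)
# dl = DataLoader(cached, batch_size=1024, shuffle=True)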

So we have a 14x speed improvement on small models. I’ve applied the changes to Christopher’s notebook, rewriting the random sampler as well to make sure it does not sync to the CPU.
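
For illustration, a minimal sketch of the “sync to the CPU only once per epoch” idea behind LazyMetricsCB (the real callbacks are in the notebook; the class below is a simplified stand-in, not miniai’s MetricsCB interface):

import torch

class LazyAccuracy:
    # Accumulate correct/total as GPU tensors during the epoch; no .item() or
    # .cpu() per batch, so the CPU is free to build the next batch while the
    # GPU is still working.
    def __init__(self, device='cuda'):
        self.correct = torch.zeros((), device=device)
        self.total = torch.zeros((), device=device)

    def update(self, preds, targets):
        self.correct += (preds.argmax(dim=1) == targets).sum()
        self.total += targets.numel()

    def compute(self):
        # the single CPU sync, once per epoch
        return (self.correct / self.total).item()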


How about resnet18d trained up to 95.78% in 285s, and 94.2% in 35s (on a 2080 Ti)?

I know this is not about the speed, but I hope to spark your interest again, as this dataset is very rewarding to work with when you use LazyMetrics/LazyProgress. (Jeremy, would you accept a PR for that?)

94.2% with vanilla resnet18d

The trick was to scale up the image to 64px and initialise the resnet properly; then vanilla resnet18d trains up to 94.2% with AdamW, with or without weight decay.
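
A minimal sketch of the upscaling part, assuming timm and a single-channel batch (upscale32.bilinear(2) in the configs below plays a similar role; the “initialise properly” details live in the notebook):

import torch.nn.functional as F
import timm

def upscale(xb, scale=2):
    # bilinear upscaling of a (bs, 1, H, W) batch, e.g. 32px -> 64px
    return F.interpolate(xb, scale_factor=scale, mode='bilinear', align_corners=False)

# single-channel input, 10 classes; timm rebuilds the stem for in_chans=1
model = timm.create_model('resnet18d', pretrained=False, in_chans=1, num_classes=10)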

BTW, we are using AdamW, which has a default weight_decay of 0.01, even though our models have batch norm. It doesn’t seem to hurt, at least for lr=2e-2.

Upscaling probably helps the resnet retain information in the later blocks. Here is accuracy per epoch contrasted with the actual image scale:

Getting to 95.78% and 95.82%

Scaling worked, but then I wasn’t able to get anywhere near the top results after applying our training improvements. GeneralRelu/Swish/Hardswish, @christopherthomas’s curriculum learning, @tommyc’s OneCycleWithPlateau, and augmentation with TTA all had no visible impact over 5 epochs.

Thanks to @johnri99’s analyses I thought about implementing mixup and label smoothing, which I hoped would help with noise. It does, but only sometimes :). Combined with Google’s Lion optimiser it let me improve from 95.63%/95.55% up to 95.78% (50 epochs, 289s).
Here are the relevant attributes in a small DSL that wraps the learner:

# 95.78%
run(timm_model('resnet18d', upscale32.bilinear(2), leaky=0.0, drop_rate=0.4),
    get_augcb(transforms.RandomCrop(28, padding=1, fill=-0.800000011920929),
              transforms.RandomHorizontalFlip(0.5),
              RandErase(pct=0.2)),
    lion(1e-3, bs=512),          # adamw(1e-2, bs=512) gives 95.54%
    mixup(0.4, lbl_smooth=0.1),  # if removed: 95.53% (lion), 95.63% (adamw)
    epochs=50, tta=True)
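
For reference, a rough sketch of what mixup combined with label smoothing amounts to inside the training loop (the mixup(0.4, lbl_smooth=0.1) above is the DSL wrapper; the function below is only an illustration):

import torch
import torch.nn.functional as F

def mixup_loss(model, xb, yb, alpha=0.4, lbl_smooth=0.1):
    # one mixing coefficient for the whole batch and a random pairing of samples
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(xb.size(0), device=xb.device)
    preds = model(lam * xb + (1 - lam) * xb[perm])
    # label-smoothed cross entropy against both sets of targets, weighted by lam
    return (lam * F.cross_entropy(preds, yb, label_smoothing=lbl_smooth)
            + (1 - lam) * F.cross_entropy(preds, yb[perm], label_smoothing=lbl_smooth))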

BTW, Lion is an absolute marvel. It is almost as simple as SGD, it keeps less state (a single moving average), and it does not hide the fact that it updates the model by the learning rate regardless of the gradient magnitude, which is quite pleasing (Adam does the same when the update direction does not change, but it is harder to notice). Here is a simplified (3-line) implementation that updates one parameter:
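
(A sketch following the published Lion update rule; parameter names and defaults here are illustrative.)

import torch

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    # p, g, m: parameter, gradient, and momentum tensors of the same shape.
    # The step is lr * sign(...): the gradient magnitude never enters the update.
    p -= lr * (torch.sign(beta1 * m + (1 - beta1) * g) + wd * p)
    # a single exponential moving average of the gradient
    m.mul_(beta2).add_(g, alpha=1 - beta2)
    return p, m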

I also trained resnet50d; it gets slightly better results more often than resnet18d. The best I have is 95.82%, which I got using dadaptation.DAdaptAdam (I don’t have an implementation of DAdaptAdam from scratch yet, as I’m still processing the paper). Here is the config:

run(timm_model('resnet50d', upscale32.bilinear(2), leaky=0.0, drop_rate=0.4),
    get_augcb(transforms.RandomCrop(28, padding=1, fill=-0.800000011920929),
              upscale32.bilinear(2),
              transforms.RandomHorizontalFlip(0.5),
              RandErase(pct=0.2)),
    mixup(0.4, lbl_smooth=0.1),
    opt_func=partial(dadaptation.DAdaptAdam, weight_decay=0.0), base_lr=1,
    epochs=50, tta=True)

If you want to have a closer look, here is a notebook with the relevant experiments extracted.


What PR are you proposing exactly?

Sorry for being unclear. I meant speeding up miniai by making MetricsCB and ProgressCB sync to the CPU only once per epoch by default, and maybe adding cache_dataset_as_dict, so it is easier for everyone to benefit from a quicker fit.

Sounds reasonable to me, if the PR doesn’t make things too complicated. (And ditto for the caching idea.)


Here you go. I’ve improved the code a bit to show the loss plot during the epoch, every n batches.


Hello everybody!
I just jumped in, looking at the code.
Speaking about Chris’s sampler: it looks like there is a “bug” here, but it works well anyway, so maybe it’s intended…
With p=0.5, we take the 50% of our samples with the lowest loss and replace them with the top losses. So now half of the samples appear twice.
On the next epoch, if p != 0, we take the SAME list of samples and mix through it.
So we “overtrain” the model on those samples until p == 0, when we use the full list of indexes.
And it works well. Maybe I’ve misunderstood the code…
I’ve simplified this code and made a new version the way I think it should be.
I’ll share it later.

Here is a link to the notebook with the changes.
Changes to TopLossesCallback: sort instead of topk, and don’t calculate top losses when not needed.
Two versions of CustomTrainingSampler: a simplified one, and one that mixes in a new list of samples.

colab notebook
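
For illustration, a minimal sketch of the mixing behaviour described above (the real CustomTrainingSampler is in the linked notebook; the class, losses, and p here are only a stand-in):

import torch
from torch.utils.data import Sampler

class TopLossMixSampler(Sampler):
    # losses: 1-D tensor of per-sample losses from the previous epoch;
    # p: fraction of lowest-loss samples to replace with the highest-loss ones,
    # so the hardest p fraction is seen twice and the easiest p fraction is skipped.
    def __init__(self, losses, p=0.5):
        self.losses, self.p = losses, p

    def __iter__(self):
        n = len(self.losses)
        order = torch.argsort(self.losses, descending=True)  # hardest first
        k = int(self.p * n)
        idxs = torch.cat([order[:n - k], order[:k]])   # drop easiest k, repeat hardest k
        return iter(idxs[torch.randperm(n)].tolist())  # shuffle

    def __len__(self):
        return len(self.losses)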

If you’d like to use the Mish activation, PyTorch has a fast implementation: torch.nn.Mish.


Glad to see people still using it with varying degrees of success. :slight_smile:

This is a bit cheeky, but… I think this model technically has the highest score for 5 epochs. It is literally the same code as @christopherthomas’s, just with some small parameter tweaks, and it increases the final result from 0.949 to 0.95. Here’s the Colab link: Google Colab


I was going through the winning entry for 5 epochs and noticed it had a batch size of 256. Jeremy, what you said about “the ideal batch size is 1” stuck with me, so I gave it a shot and tried to reduce the batch size to 1. That failed because the data was a different shape than expected, so I switched to 2; it then trained, but not very stably.
I tried to reduce the learning rate, since it makes sense that you don’t want to learn too much from just two data points, but it didn’t really seem to help much.

(Training plots for lr = 1e-3, lr = 5e-5, and lr = 5e-6.)

So then I tried a batch size of 32 with lr = 1e-2, and that did work and trained a fair bit better.

Training still looks a bit more unstable than it should be, though, so I’m leaving it to run another time with a learning rate of 1e-3, and let’s see if that makes things a bit better. I’ll also leave one with batch size 16 and lr 1e-2 running, and we can see how that works out too.