Meet Ranger - RAdam + Lookahead optimizer

@sgugger can also confirm the 69%. Here is my v2 notebook:

Could the difference come from how, in v1, we already pass a size in the transforms and then do a presize as well?

Edit: I’ve updated the notebook to include a v1 run that reaches 74%.

I think the difference might be that the validation set is center-cropped in v1, but squished to the target size in v2.
Also, v1 resizes to size+32 before taking the center crop for the validation set.
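
To make the difference concrete, here is a rough sketch in plain PIL of the two validation-set behaviours being described (a sketch only, not the actual library code; the exact v1 resize mode is assumed here):

from PIL import Image

def v1_style_valid(img, size=128):
    # Assumed v1 behaviour: resize the shorter side to size+32, then centre-crop `size`.
    w, h = img.size
    scale = (size + 32) / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

def v2_style_valid(img, size=128):
    # v2 "squish": resize both sides to `size`, ignoring aspect ratio.
    return img.resize((size, size), Image.BILINEAR)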

2 Likes

How should we readjust the transforms to replicate this in v2? (Transforms are a part of v2 I haven’t dived into yet.)

Is it as simple as doing: if split_idx: self.cp_size = (self.cp_size[0], self.cp_size[1])

Looks like it’s most likely not that simple.

Just pushed so that RandomResizedCrop does exactly the same thing as presize in v1.

4 Likes

Awesome! Retesting now.

I did the following:

dbch = dsrc.databunch(after_item=[ToTensor(), RandomResizedCropC(128, min_scale=0.35), FlipItem()],
                      after_batch=batch_tfms,
                      bs=64, num_workers=nw)

(Assume RandomResizedCropC is a local copy of the newly pushed code; it was faster than re-pulling the library.)
Still managed only 70.6%.

See here: https://github.com/muellerzr/fastai-Experiments-and-tips/blob/master/ImageWoofTests/fastai_v2.ipynb

1 Like

I’m also getting ~69% after the modification (an improvement from the ~66% I was getting before).

1 Like

From what I see in the code you’re running, it doesn’t test whether the data is the same in v1/v2; it only shows that there is a problem somewhere. To identify the actual source, we need to use one version for everything except one thing (the data, the training loop, the optimizer) and test v1 vs v2 on that specific thing.
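
For example, to check the data specifically, one could dump the first batch from each version with augmentation disabled and compare the tensors offline. A minimal sketch (the helper and file names are hypothetical, not part of either library):

import torch

# Run this once in the v1 environment and once in the v2 environment,
# pointing at the same image files with augmentation turned off.
def dump_first_batch(dataloader, path):
    xb, yb = next(iter(dataloader))
    torch.save({"x": xb.cpu(), "y": yb.cpu()}, path)

# Then, in any single session, compare the two dumps:
# a, b = torch.load("batch_v1.pt"), torch.load("batch_v2.pt")
# print(torch.allclose(a["x"], b["x"], atol=1e-6), bool((a["y"] == b["y"]).all()))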

1 Like

Got it. I will get something this evening for you. :slight_smile:

Thank you for all the help. We really appreciate it :slight_smile:

These look the same to me, if I’m not mistaken:

([flip_lr(p=0.5)], []), size=128) (v1)
FlipItem(), Resize(128) (v2)

I tried following where the size goes to ensure it uses the same modes (squish and reflection), but it was a bit hard to follow through. Looking at the databunches, they look the same.

1 Like

Those are the same, yes.

1 Like

Okay awesome. I tried removing every transform so that only a resize is used, and still got 71% in v2 vs 77% in v1. I am now moving on to the model (copying everything from v2 over to v1, though it should still be exactly the same).

The fact that the difference persists after verifying that the exact same transforms are being used means I should set those aside for the time being.

1 Like

Finally! So the good news is I was able to get 70.8% on v1. The bad news is that it took the ‘new’ model to do it.

learn = Learner(data, myxresnet50(sa=True, c_out=10, act_cls=MishJit), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(),
                metrics=[accuracy])

(Assume myxresnet50 is the new version we used.)

Now that I’ve made it here, I’m going to port the other mxresnet architecture over directly and see if that still works.

2 Likes

Great debugging process @muellerzr, and impressive patience and tenacity. (Those 2 are the most important skills IMHO, and are hard to learn!)

3 Likes

Thank you for the kind words, Jeremy. Working night shifts with Netflix on the screen helps ease the tension :wink:

So, an update on the model part: straight mxresnet also did not solve the issue, and using the version we have in v1 with the same activation and all achieved the 74% (meaning running in v1 with the v2 model and activation).

1 Like

If you’re 100% sure the models are the same (you should save the printouts to files and diff them to be totally sure; a quick way to do that is sketched after this list), then other possibilities are:

  • Init
    • Conv
    • Linear
    • BN
    • For each of the above: bias and weights
  • Whether bias is on or off for each layer
  • The implementation of forward() since that doesn’t appear in the model output
  • (I may have missed things too)
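
A minimal sketch of the printout-diff check mentioned above (the file names are arbitrary, and each dump is assumed to be written in its own environment):

import difflib

# In each environment, dump the model's repr to a file
# (`learn` is the Learner built in that version):
with open("model_v1.txt", "w") as f:   # use "model_v2.txt" in the v2 run
    f.write(repr(learn.model))

# Then diff the two files in any session where both exist:
with open("model_v1.txt") as f1, open("model_v2.txt") as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile="v1", tofile="v2")
print("".join(diff) or "no differences")
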
2 Likes

The models and initializers should all be exactly the same, should they not? I do not use any of the v2 library when it comes to initializing/generating the model; I instead used the original repository’s implementation.

(I know that at a bare minimum I can scratch off the forward being an issue :slight_smile: )

To verify that, I outputted them to text documents. There are minor variations; for example:

v1:

[[ 3.7118e-01,  2.4901e-01,  3.4924e-01],
          [ 6.1605e-02, -1.8064e-01, -5.7317e-01],
          [-4.0040e-02,  1.2741e-01, -1.8215e-01]]

v2:

[[-2.9317e-01, -5.0387e-01, -2.1530e-01],
          [-3.7322e-01, -2.0616e-02,  6.7522e-02],
          [ 1.2084e-01,  2.0487e-01, -2.5438e-01]]

These are all somewhat close to zero, though, so can we count them as the same?

If so, the initialized layers are all exactly the same (found by scrolling through a very large text document). For those wanting to go through the same exercise, here is what I did:

learn = Learner(...)
# Write each parameter tensor's printout to a file so the two versions can be diffed.
with open("params.txt", "w") as text_file:
    params = list(learn.model.parameters())
    for item in params:
        text_file.write("%s\n" % item)

Now for directly addressing your list:

  • Init - Verified all are the same
    • Conv
    • Linear
    • BN
    • For each of the above: bias and weights
  • Whether bias is on or off for each layer - Biases are off on both models until last layer
  • The implementation of forward() since that doesn’t appear in the model output - Both are same architecture, so same forward
  • (I may have missed things too)

The next thing I am going to do is test out the optimizers (just to make sure). I’m going to baseline Adam in v1 and v2.
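
One minimal way to baseline that (a sketch only: keep everything seeded and identical, swap only the optimizer line for each version's Adam, and diff the saved weights afterwards; plain torch.optim.Adam is shown as a placeholder):

import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(8, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # swap this line for the v1 / v2 Adam

# One identical, fully seeded step.
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

# Save and diff this file across the two runs.
torch.save(model.state_dict(), "after_one_step.pt")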

1 Like

Hi, it turns out I’ve been looking at the exact same problem of trying to tie out versions of Imagenette leaderboard models on the fastai v1 and v2 codebases! It looks like we’ve found a lot of the same issues, but in case it’s useful, here’s a work-in-progress notebook with details. In particular, I managed to tie out the forward computation on the two codebases.

4 Likes

@david_page awesome work! Correct me if I’m wrong, but it looks like you fixed the issue? (As the v1 and v2 results line up.) I noticed you used mxresnet18. Could you try with 50? (I will later today too.)

Also very impressed you ported over the 1.0 optimizer directly… I need to study that more :slight_smile:

Thanks. I’m pretty sure there are still differences, e.g. v1 is not applying weight decay to biases and batch norm weights, whilst I think the v2 code in my notebook is applying weight decay everywhere. But I haven’t seen large differences in final accuracy. Of course, I’m using the v1 optimiser and a common DALI dataloader, so there’s plenty more to port over to get a completely v2 version.
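
For reference, a rough sketch (plain PyTorch, not the code from either fastai version) of excluding biases and norm-layer weights from weight decay via parameter groups:

import torch
import torch.nn as nn

def param_groups(model, wd=1e-2):
    # Put biases and batch-norm parameters in a group with weight_decay=0,
    # and everything else in a group with the requested weight decay.
    decay, no_decay = [], []
    for module in model.modules():
        for name, p in module.named_parameters(recurse=False):
            if name == "bias" or isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                no_decay.append(p)
            else:
                decay.append(p)
    return [{"params": decay, "weight_decay": wd},
            {"params": no_decay, "weight_decay": 0.0}]

# opt = torch.optim.Adam(param_groups(model), lr=1e-3)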

I’ve checked the forward computation ties out on xresnet50 and I think accuracies were similar but I haven’t done enough runs to be certain. Will take a look at that + other things tomorrow - let me know if you find anything before then!

2 Likes

Will do! You solved the biggest issue for me with directly using the Ranger optimizer (including an lr in the call to learn). Thanks for the work!

Tested resnet50, got the same results (89.9%)

2 Likes