Meet Ranger - RAdam + Lookahead optimizer

@sgugger can also confirm the 69%. Here is my v2 notebook:

Could the difference come from how, in v1, we already pass a size in the transforms and then do a presize as well?

Edit: I’ve updated the notebook to include a v1 run that reaches 74%.

I think the difference might be that the validation set is center-cropped in v1, but squished to the target size in v2.
Also, v1 resizes to size+32 before taking the center crop for the validation set.
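
To make the difference concrete, here is a rough sketch in plain PIL of the two validation-set behaviours being described (a sketch only, not the actual library code; the exact v1 resize mode is assumed here):

from PIL import Image

def v1_style_valid(img, size=128):
    # Assumed v1 behaviour: resize the shorter side to size+32, then centre-crop `size`.
    w, h = img.size
    scale = (size + 32) / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

def v2_style_valid(img, size=128):
    # v2 "squish": resize both sides to `size`, ignoring aspect ratio.
    return img.resize((size, size), Image.BILINEAR)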

2 Likes

How should we readjust the transforms to replicate this in v2? (Transforms are a part of v2 I haven’t dived into yet.)

Is it as simple as doing: if split_idx: self.cp_size = (self.cp_size[0], self.cp_size[1])

Looks like it’s most likely not that simple.

Just pushed so that RandomResizedCrop does exactly the same thing as presize in v1.

4 Likes

Awesome! Retesting now.

I did the following:

dbch = dsrc.databunch(after_item=[ToTensor(), RandomResizedCropC(128, min_scale=0.35), FlipItem()],
                      after_batch=batch_tfms,
                      bs=64, num_workers=nw)

(Assume RandomResizedCropC is a local copy of the newly pushed code; it was faster than re-pulling the library.)
Still managed only 70.6%.

See here: https://github.com/muellerzr/fastai-Experiments-and-tips/blob/master/ImageWoofTests/fastai_v2.ipynb

1 Like

I’m also getting ~69% after the modification (an improvement from the ~66% I was getting before).

1 Like

From what I see in the code you’re running, it doesn’t test whether the data is the same in v1/v2; it only shows that there is a problem somewhere. To identify the actual source, we need to use one version for everything except one thing (the data, the training loop, the optimizer) and test v1 vs v2 on that specific thing.
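
For example, to check the data specifically, one could dump the first batch from each version with augmentation disabled and compare the tensors offline. A minimal sketch (the helper and file names are hypothetical, not part of either library):

import torch

# Run this once in the v1 environment and once in the v2 environment,
# pointing at the same image files with augmentation turned off.
def dump_first_batch(dataloader, path):
    xb, yb = next(iter(dataloader))
    torch.save({"x": xb.cpu(), "y": yb.cpu()}, path)

# Then, in any single session, compare the two dumps:
# a, b = torch.load("batch_v1.pt"), torch.load("batch_v2.pt")
# print(torch.allclose(a["x"], b["x"], atol=1e-6), bool((a["y"] == b["y"]).all()))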

1 Like

Got it. I will get something this evening for you. :slight_smile:

Thank you for all the help. We really appreciate it :slight_smile:

These look the same to me, if I’m not mistaken:

([flip_lr(p=0.5)], []), size=128) (v1)
FlipItem(), Resize(128) (v2)

I tried following where the size goes to ensure it uses the same modes (squish and reflection), but it was a bit hard to follow through. Looking at the databunches, they look the same.

1 Like

Those are the same, yes.

1 Like

Okay awesome. I tried removing every transform so that only a resize is used, and still got 71% in v2 vs 77% in v1. I am now moving on to the model (copying everything from v2 over to v1, though it should still be exactly the same).

The fact that the difference persists after verifying that the exact same transforms are being used means I should set those aside for the time being.

1 Like

Finally! So the good news is I was able to get 70.8% on v1. The bad news is that it took the ‘new’ model to do it.

learn = Learner(data, myxresnet50(sa=True, c_out=10, act_cls=MishJit), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(),
                metrics=[accuracy])

(Assume myxresnet50 is the new version we used.)

Now that I’ve made it here, I’m going to port the other mxresnet architecture over directly and see if that still works.

2 Likes

Great debugging process @muellerzr, and impressive patience and tenacity. (Those 2 are the most important skills IMHO, and are hard to learn!)

3 Likes

Thank you for the kind words, Jeremy. Working night shifts with Netflix on the screen helps ease the tension :wink:

So, an update on the model part: straight mxresnet also did not solve the issue, and using the version we have in v1 with the same activation and all achieved the 74% (meaning running in v1 with the v2 model and activation).

1 Like

If you’re 100% sure the models are the same (you should save the printouts to files and diff them to be totally sure; a quick way to do that is sketched after this list), then other possibilities are:

  • Init
    • Conv
    • Linear
    • BN
    • For each of the above: bias and weights
  • Whether bias is on or off for each layer
  • The implementation of forward() since that doesn’t appear in the model output
  • (I may have missed things too)
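
A minimal sketch of the printout-diff check mentioned above (the file names are arbitrary, and each dump is assumed to be written in its own environment):

import difflib

# In each environment, dump the model's repr to a file
# (`learn` is the Learner built in that version):
with open("model_v1.txt", "w") as f:   # use "model_v2.txt" in the v2 run
    f.write(repr(learn.model))

# Then diff the two files in any session where both exist:
with open("model_v1.txt") as f1, open("model_v2.txt") as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile="v1", tofile="v2")
print("".join(diff) or "no differences")
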
2 Likes

The models and initializers should all be exactly the same, should they not? I do not use any of the v2 library when it comes to initializing/generating the model; I instead used the original repository’s implementation.

(I know that at a bare minimum I can scratch off the forward being an issue :slight_smile: )

To verify that, I outputted them to text documents. There are minor variations; for example:

v1:

[[ 3.7118e-01,  2.4901e-01,  3.4924e-01],
          [ 6.1605e-02, -1.8064e-01, -5.7317e-01],
          [-4.0040e-02,  1.2741e-01, -1.8215e-01]]

v2:

[[-2.9317e-01, -5.0387e-01, -2.1530e-01],
          [-3.7322e-01, -2.0616e-02,  6.7522e-02],
          [ 1.2084e-01,  2.0487e-01, -2.5438e-01]]

These are all somewhat close to zero, though, so can we count them as the same?

If so, the initialized layers are all exactly the same (found by scrolling through a very large text document). For those wanting to go through the same exercise, here is what I did:

learn = Learner(...)
# Write each parameter tensor's printout to a file so the two versions can be diffed.
with open("params.txt", "w") as text_file:
    params = list(learn.model.parameters())
    for item in params:
        text_file.write("%s\n" % item)

Now for directly addressing your list:

  • Init - Verified all are the same
    • Conv
    • Linear
    • BN
    • For each of the above: bias and weights
  • Whether bias is on or off for each layer - Biases are off on both models until last layer
  • The implementation of forward() since that doesn’t appear in the model output - Both are same architecture, so same forward
  • (I may have missed things too)

The next thing I am going to do is test out the optimizers (just to make sure). I’m going to baseline Adam in v1 and v2.
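
One minimal way to baseline that (a sketch only: keep everything seeded and identical, swap only the optimizer line for each version's Adam, and diff the saved weights afterwards; plain torch.optim.Adam is shown as a placeholder):

import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(8, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # swap this line for the v1 / v2 Adam

# One identical, fully seeded step.
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

# Save and diff this file across the two runs.
torch.save(model.state_dict(), "after_one_step.pt")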

1 Like

Hi, it turns out I’ve been looking at the exact same problem of trying to tie out versions of Imagenette leaderboard models on the fastai v1 and v2 codebases! It looks like we’ve found a lot of the same issues, but in case it’s useful, here’s a work-in-progress notebook with details. In particular, I managed to tie out the forward computation on the two codebases.

4 Likes

@david_page awesome work! Correct me if I’m wrong, but it looks like you fixed the issue? (As the v1 and v2 results line up.) I noticed you used mxresnet18. Could you try with 50? (I will later today too.)

Also very impressed you ported over the 1.0 optimizer directly… I need to study that more :slight_smile:

Thanks. I’m pretty sure there are still differences, e.g. v1 is not applying weight decay to biases and batch norm weights, whilst I think the v2 code in my notebook is applying weight decay everywhere. But I haven’t seen large differences in final accuracy. Of course, I’m using the v1 optimiser and a common DALI dataloader, so there’s plenty more to port over to get a completely v2 version.
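
For reference, a rough sketch (plain PyTorch, not the code from either fastai version) of excluding biases and norm-layer weights from weight decay via parameter groups:

import torch
import torch.nn as nn

def param_groups(model, wd=1e-2):
    # Put biases and batch-norm parameters in a group with weight_decay=0,
    # and everything else in a group with the requested weight decay.
    decay, no_decay = [], []
    for module in model.modules():
        for name, p in module.named_parameters(recurse=False):
            if name == "bias" or isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                no_decay.append(p)
            else:
                decay.append(p)
    return [{"params": decay, "weight_decay": wd},
            {"params": no_decay, "weight_decay": 0.0}]

# opt = torch.optim.Adam(param_groups(model), lr=1e-3)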

I’ve checked the forward computation ties out on xresnet50 and I think accuracies were similar but I haven’t done enough runs to be certain. Will take a look at that + other things tomorrow - let me know if you find anything before then!

2 Likes

Will do! You solved the biggest issue for me with directly using the Ranger optimizer (including an lr in the call to learn). Thanks for the work!

Tested resnet50, got the same results (89.9%)

2 Likes