Are you adjusting the learning rate for mixed precision?
Will do, thanks. I'll be offline for a day or two, so I'll check back here then and hopefully get QHAdam pushed.
EDIT: @muellerzr just checked re `eps` and it looks like @LessW2020 had left it outside the sqrt in both Ranger and RangerQH. RAdam in fastai v2 also has it outside, and my port of QHAdam also leaves it outside.
```python
def qhadam_step(p, lr, mom, sqr_mom, sqr_avg, nu_1, nu_2, step, grad_avg, eps, **kwargs):
    debias1 = debias(mom, 1-mom, step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)
    p.data.addcdiv_(-lr, ((1-nu_1) * p.grad.data) + (nu_1 * (grad_avg / debias1)),
                    (((1 - nu_2) * (p.grad.data)**2) + (nu_2 * (sqr_avg / debias2))).sqrt() + eps)
    return p

qhadam_step._defaults = dict(eps=1e-8)
```
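For reference, here is a standalone NumPy sketch of the same debiased update (hypothetical helper names, not the fastai code; `debias` is assumed to mirror fastai's bias-correction term for dampened running averages), with `eps` added outside the sqrt as discussed:

```python
import numpy as np

def debias(mom, damp, step):
    # assumed fastai-style bias-correction term for a dampened average:
    # with damp = 1 - mom this reduces to the usual 1 - mom**step
    return damp * (1 - mom**step) / (1 - mom)

def qhadam_step_np(p, grad, grad_avg, sqr_avg, step, lr=1e-3,
                   mom=0.999, sqr_mom=0.999, nu_1=0.7, nu_2=1.0, eps=1e-8):
    # grad_avg and sqr_avg are dampened running averages, so divide
    # by the debias terms before mixing with the raw gradient
    d1 = debias(mom, 1-mom, step)
    d2 = debias(sqr_mom, 1-sqr_mom, step)
    num = (1-nu_1) * grad + nu_1 * (grad_avg / d1)
    den = np.sqrt((1-nu_2) * grad**2 + nu_2 * (sqr_avg / d2)) + eps  # eps OUTSIDE the sqrt
    return p - lr * num / den
```

With `nu_1 = nu_2 = 1.0` this reduces to a plain Adam step, which is a quick sanity check on a port.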
```python
#export
def QHAdam(params, lr, mom=0.999, sqr_mom=0.999, nu_1=0.7, nu_2=1.0, eps=1e-8, wd=0., decouple_wd=True):
    "An `Optimizer` for QHAdam with `lr`, `mom`, `sqr_mom`, `nu_1`, `nu_2`, `eps` and `params`"
    from functools import partial
    steppers = [weight_decay] if decouple_wd else [l2_reg]
    steppers.append(qhadam_step)
    stats = [partial(average_grad, dampening=True), partial(average_sqr_grad, dampening=True), step_stat]
    return Optimizer(params, steppers, stats=stats, lr=lr, nu_1=nu_1, nu_2=nu_2,
                     mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
```
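The stepper/stat split above follows fastai v2's `Optimizer` design. A toy standalone sketch of the pattern (hypothetical names, not the fastai API): stats accumulate per-parameter state, then steppers consume that state to update the parameter:

```python
def average_grad(p, grad, mom=0.9, grad_avg=0.0, **kw):
    # stat: maintain an exponential moving average of the gradient
    return {'grad_avg': mom * grad_avg + (1 - mom) * grad}

def step_stat(p, grad, step=0, **kw):
    # stat: count steps (what the debias terms use)
    return {'step': step + 1}

def momentum_step(p, grad, lr, grad_avg, **kw):
    # stepper: update the parameter using the collected stats
    return p - lr * grad_avg

def opt_step(p, grad, lr, state, stats, steppers):
    # run all stats first, then all steppers, threading state through
    for stat in stats:
        state.update(stat(p, grad, **state))
    for stepper in steppers:
        p = stepper(p, grad, lr, **state)
    return p, state
```

This is why `qhadam_step` only needs `average_grad`, `average_sqr_grad`, and `step_stat` registered as stats: the step function just reads `grad_avg`, `sqr_avg`, and `step` out of the state.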
@muellerzr a bit of a delay here: I had small hand surgery last week, so typing was a little slow.
I compared v1 and v2 with Ranger, but without any image transforms, and got pretty much the same result after averaging over 5 runs:

V1 notebook: (69.4+72.3+70.3+68.4+69.6)/5 = 70.0%

- to make sure Ranger was doing something, a 1-run accuracy with Adam instead of Ranger was 64%

V2 notebook: (68.4+71.6+71+69+71.2)/5 = 70.2%
Note for v2 I used `after_item=[ToTensor(), Resize(128)]` to do the resizing, which squishes the image, the same as the v1 resize. Your previous notebook used a random crop, I think, which would change the data being shown to the model and might explain some of the difference.
Will add the transforms back and let you know how it looks!
Awesome @morgan! I've been busy doing keypoints for a bit so haven't looked into it. Can't wait to hear an update. Great work!
So after adding transforms a big difference in performance emerges: 73.6% vs 69.08%, V1 vs V2. So it's probably our implementation of the v2 transforms that is driving the difference (maybe a small chance it's a difference in the implementation of the transforms themselves, but unlikely I'd guess).
Will try to do an ablation test tomorrow to see if I can narrow down the culprit. Note that I still need to look properly for a v2 version of the 3rd transform below ("resize and crop").
Transforms used (V1 naming):
- flip_lr
- presize(128, scale=(0.35,1)) (resize images to `size` using `RandomResizedCrop`)
- size=128 (equivalent to "resize and crop"; the "no transform" version above used size=(128,128), which is equal to "squish")
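To make the crop-vs-squish distinction concrete, here is a hypothetical pure-Python sketch (simplified; it keeps the original aspect ratio, unlike the real transform, which also samples an aspect ratio) of what a RandomResizedCrop-style transform samples, versus a squish resize:

```python
import random

def random_resized_crop_box(w, h, scale=(0.35, 1.0)):
    # pick a crop covering a random fraction of the image area in `scale`;
    # the returned box is then resized to the target size, dropping pixels
    frac = random.uniform(*scale)
    cw, ch = int(w * frac**0.5), int(h * frac**0.5)
    x0 = random.randint(0, w - cw)
    y0 = random.randint(0, h - ch)
    return x0, y0, cw, ch

def squish_resize_box(w, h):
    # "squish" uses the whole image: no pixels dropped,
    # but the aspect ratio is distorted when resizing to a square
    return 0, 0, w, h
```

So with `scale=(0.35, 1)` the model can see as little as ~35% of the image per sample, which is a very different data distribution from always seeing the squished full image.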
Fastai V1 Result

(73.6+74.2+73.8+72+74.6)/5 = 73.64%

Databunch code:

```python
img_ls = ImageList.from_folder(src).split_by_folder(train='train', valid='val').label_from_folder()
img_ls = img_ls.transform(([flip_lr(p=0.5)], []), size=(128))
data = img_ls.databunch(bs=64, num_workers=nw).presize(128, scale=(0.35,1)).normalize(imagenet_stats)
```
Fastai V2 Result

(66.8+71+67.8+71+68.8)/5 = 69.08%

Databunch code:

```python
tfms = [[PILImage.create], [parent_label, lbl_dict.__getitem__, Categorize()]]
item_tfms = [FlipItem(0.5)]
dsrc = DataSource(items, tfms, splits=split_idx)
batch_tfms = [Cuda(), IntToFloatTensor(), Normalize(*imagenet_stats)]
dbch = dsrc.databunch(item_tfms=item_tfms,
                      after_item=[ToTensor(), RandomResizedCrop(128, min_scale=0.35)],
                      after_batch=batch_tfms,
                      bs=64,
                      num_workers=nw)
```
I tried it out again and included a `c_out` (which was missing before and leading to slightly higher losses). Good news is after one of my runs I got our 74.8%!!! It looks like the bug in the head of the models was the issue (which Jeremy fixed yesterday). @morgan I'm running it for five, but here is my code:

```python
learn = Learner(dbch, xresnet50(sa=True, c_out=10), opt_func=opt_func,
                loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
fit_fc(learn, 5, 4e-3)
```
Average of five was not as good though, [70.4, 74.8, 71.0, 71.2, 73.0], but much better!!! (average is 72.08%)
@muellerzr I'm guessing you're using custom functions defined in this notebook? I tried running with only fastai and my accuracy is around 65%. Here is the code in case someone wants to check it out.
I was not. I'll post a new notebook later (when I have the time to do so, I'm all over the place this week) but I'll let you know.
Reread your bit @lgvaz, and yes, I was using those custom functions that were in the notebook. Try that and see if it helps. If you get the same as I did, I'll compare what's in there vs the library.
I can confirm that running the code in that notebook gets the accuracy to 72%.
I'll compare all the functions from my nb and his nb and try to spot the difference.
@lgvaz it looks like the two models are slightly different. First, we should be specifying `act_cls=MishJit`. Also some of the convolution sizes are different, e.g. ours in group 1 has `Conv2d(32, 64, ...)` whereas fastai has `Conv2d(32, 32, ...)`. Along with this, in our implementation the `ResBlock` has a final activation whereas fastai's does not.
I found where that is. Line 431 in layers.py:

```python
self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None, ndim=ndim, **kwargs)
```
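For anyone following along, a hypothetical NumPy sketch of the standard ResBlock pattern being discussed (not the fastai code): the identity-path 1x1 conv gets no activation (hence `act_cls=None` above), and the single final activation is applied after the residual addition:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def res_block(x, conv_path, id_path=None, final_act=relu):
    # conv_path: the main stack of convs (its last layer has no activation)
    # id_path:   the shortcut; a 1x1 conv with NO activation when channel
    #            counts differ, otherwise the identity (None here)
    identity = x if id_path is None else id_path(x)
    # the one final activation comes AFTER the addition, so the
    # identity signal is never distorted before being added back in
    return final_act(conv_path(x) + identity)
```

Putting an extra activation at the end of the conv path (as in the implementation above) changes what the shortcut sums with, which is the discrepancy being tracked down here.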
@jeremy I'm going to tag you in here. If we can show a version working with the activation change and filter change, can we post a PR for the layers? Along with this, you do not have an activation present in the last `ConvLayer` of a `ResBlock`, but we do.
@muellerzr Maybe this is what you were saying with:

But I think there's a bug in xresnet.py line 23:

```python
stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1))
```

`act_cls` is not being passed here, which causes the stem to use `ReLU` instead of `MishJit`.
There's a bug there too (I think, let me look at it before I say anything definitive), but I'm saying in your call to `xresnet`: `xresnet50(sa=True, c_out=10, act_cls=MishJit)`
Oh, I'm already doing that.
Got it. To your bit, we'd want `stem.append(ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1, act_cls=act_cls if defaults.activation else act_cls()))` there. I'm running this now. Let me know what else you find/try.
Last bit we need to solve is that 32/32 in the resblock (I think)
Got any improvements so far?
I'm having some problems modifying the fastai_dev source code (some permission errors), so I'm a little bit behind. But I'm also going to change it here and then we can compare results.
I'm just modifying it in a cell above where we make the model, and it will override the library's code. What I've been doing is using a diff browser to look at what was different between everything and going from there.
Thanks this is helpful. Let me know what you find and I can update the lib later this afternoon as needed.
We're working our way there, but we can confirm that the original `mxresnet` implementation gets 72.6% (average of five runs with a std of 0.5%).
OK so the v2 results are within 1 std of that… Although it sounds like there are definite bugs in v2 (not surprising; some of those changes I made very recently and under time pressure!)
Correct. There were a few changes. We're working on verifying that it works correctly one more time, and then we'll make a PR with a few changes to the architecture design on the dev repo, and we can discuss what to do (for example with the sizes and `make_layer`).