One thing to check - in your code, your conv defaults to padding=ks//2…

def conv(ni, nf, ks=3, stride=1, groups=1, bias=False):
    return nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, groups=groups, bias=bias)
but in most of the TF implementations they use padding='same', which I am writing as padding=0 (to produce the same size).
Otherwise we are ending up with very similar stuff overall, and I'm liking how you did the actual EfficientNet class setup… I'm changing my code there to mimic yours (I had a more hardcoded setup).
Yes, I tried to simplify things quite a bit (it could obviously be refactored further). I don't really like the whole string-code language for the model (s3k3r5i2o6, etc.), so I changed that as well. For example, I combined all the channel sizes into one list.
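Just to illustrate what I mean - plain lists instead of the encoded strings (the values below are only illustrative, don't take them as the exact B0 settings):

# Illustrative only - one entry per stage, instead of the encoded block-args strings
out_channels = [16, 24, 40, 80, 112, 192, 320]   # per-stage output channels
kernel_sizes = [3, 3, 5, 3, 5, 5, 3]             # per-stage depthwise kernel size
strides      = [1, 2, 2, 2, 1, 2, 1]             # per-stage stride
num_repeats  = [1, 2, 2, 3, 3, 4, 1]             # blocks per stage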
I still think padding=ks//2 is right unless I missed something. The reason is that if you have ks=3 (with stride=1), you need padding=1 to keep the same image size.
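Quick sanity check of that stride=1 case (toy example):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)
# ks=3, padding=ks//2=1, stride=1 keeps the spatial size unchanged
y = nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1)(x)
print(y.shape)  # torch.Size([1, 8, 64, 64])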
Regarding padding='same'… ok, you have a good point regarding ks//2 and ks=3, though I don't think it's exactly the same as TF padding='same', but it's closer than my 0 pad.
Here's a whole thread about it in PyTorch - a feature request that's been open for a year to add this, along with a number of code samples showing various ways to compute 'same' padding. It's affected by stride, dilation, etc., though since we are by default using square kernels and (usually) square images, maybe ks//2 does most of it.
My current understanding is that with stride = 2, padding='same' halves the height and width of the input.
The TensorFlow method is explained in the answer here:
There’s a Conv2dSamePadding implementation here:
Reading the TF link above, with stride = 2 and ks = 3, we pad 1 on the bottom and right and 0 on the top and left if the dimensions are even, and 1 everywhere if the dims are odd.
I drew a 4x4 input on the back of an envelope; I think that padding 1 all around is equivalent to pad_left=1, pad_top=1, pad_right=0, pad_bottom=0 (because with stride=2 the bottom and right padding never get used).
So it's just a matter of having the 1-pixel padding on the top vs the bottom, and on the left vs the right. Does it really matter?
For stride = 2 and ks = 5, odd-sized images again get padding = 2 (= ks//2) all around. Even-sized images get 1 on top, 1 on the left, 2 on the right and 2 on the bottom. I think it's the same story as with ks=3, with the left/right and top/bottom padding reversed.
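To double-check the arithmetic, here's a rough sketch of the per-dimension rule TF uses for 'same' padding (my reading of the links above); it reproduces the cases discussed:

import math

def same_pad(in_size, ks, stride, dilation=1):
    # TF-style 'same' padding for one dimension: output size = ceil(in_size / stride)
    out_size = math.ceil(in_size / stride)
    pad = max((out_size - 1) * stride + (ks - 1) * dilation + 1 - in_size, 0)
    return pad // 2, pad - pad // 2   # (top/left, bottom/right)

print(same_pad(4, 3, 2))  # (0, 1): even dim, ks=3 -> 0 on top/left, 1 on bottom/right
print(same_pad(5, 3, 2))  # (1, 1): odd dim, ks=3 -> 1 everywhere
print(same_pad(4, 5, 2))  # (1, 2): even dim, ks=5 -> 1 on top/left, 2 on bottom/right
print(same_pad(5, 5, 2))  # (2, 2): odd dim, ks=5 -> 2 everywhere (= ks//2)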
Hopefully I got this right.
I don’t think there is any issue with rectangular images that I’ve seen, but I could be wrong.
I'm going to keep ks//2 for now, but if things don't work out I'll probably try the Conv2dSamePadding module I posted.
I said earlier I hadn’t planned to load pretrained weights soon. But it turns out this model doesn’t appear (if my implementation is right) to train from scratch very fast/well.
So I’m going to need to validate with the pretrained weights.
I am working on a bit of a hacky way, which I’ll detail here before having tried it out:
The good news is that my implementation has the same list of weight tensors, in the same order and with the same dimensions, as this implementation (but they are named differently…):
(or at least I checked with B0 and B3)
His pretrained weights, as converted from TF to PyTorch, are available here:
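Roughly, the positional copy I have in mind would look like this (completely untried sketch; my_model and the checkpoint filename are just placeholders):

import torch

pretrained_sd = torch.load('pretrained_b0.pth', map_location='cpu')  # placeholder filename
my_sd = my_model.state_dict()  # my_model = our EfficientNet instance

# Sanity check: same number of tensors, and matching shapes position by position
assert len(my_sd) == len(pretrained_sd)
for (my_k, my_t), (src_k, src_t) in zip(my_sd.items(), pretrained_sd.items()):
    assert my_t.shape == src_t.shape, f'{my_k} vs {src_k}: {my_t.shape} != {src_t.shape}'

# Copy values across by position, ignoring the mismatched names
new_sd = {my_k: src_t for my_k, (src_k, src_t) in zip(my_sd.keys(), pretrained_sd.items())}
my_model.load_state_dict(new_sd)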
I used @Seb’s refactoring for the main class loading (thanks for this work Seb!), and other than minor syntax differences for the rest, I think we ended up at the same place, which is good.
I did put the 'same' padding code in ahead of my conv function to try to be as exact as possible. I don't know that it will make much difference, but I'll try to do a test.
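For reference, the general shape of such a conv is roughly this (my own rough version, not necessarily identical to the Conv2dSamePadding module linked earlier):

import math
import torch.nn as nn
import torch.nn.functional as F

class Conv2dSame(nn.Conv2d):
    "Conv2d that pads asymmetrically (TF 'same' style) before convolving."
    def forward(self, x):
        ih, iw = x.shape[-2:]
        kh, kw = self.kernel_size
        sh, sw = self.stride
        dh, dw = self.dilation
        pad_h = max((math.ceil(ih / sh) - 1) * sh + (kh - 1) * dh + 1 - ih, 0)
        pad_w = max((math.ceil(iw / sw) - 1) * sw + (kw - 1) * dw + 1 - iw, 0)
        # F.pad order is (left, right, top, bottom)
        x = F.pad(x, [pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2])
        return F.conv2d(x, self.weight, self.bias, self.stride, 0, self.dilation, self.groups)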
Hopefully tomorrow I can run it on Imagenette and check it out, and work on expanding it to B1-B7.
@Seb, if you can get the TF weights loading that will be outstanding!
I don’t think it will work doing it that way, because, even though the weight tensors are in the same order and have the same shape, the tensors have different names.
Nice! One subtle issue I found is that, when using a functional version of dropconnect, the following line has to be in the forward function of MBConv and not in the init; otherwise inference doesn't seem to turn off dropconnect (the partial captures self.training at init time, so it never sees eval mode, I guess):
self.dc = partial(drop_connect,p=self.drop_connect_rate, training=self.training) if self.drop_connect_rate else noop
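In other words, something like this (skeleton only, assuming a functional drop_connect(x, p, training) along the lines of the TF version; the real MBConv layers are omitted):

import torch
import torch.nn as nn

def drop_connect(x, p, training):
    # Functional drop connect: randomly drop the whole residual branch per sample
    if not training or p == 0.:
        return x
    keep_prob = 1. - p
    mask = torch.rand(x.shape[0], 1, 1, 1, device=x.device, dtype=x.dtype) < keep_prob
    return x / keep_prob * mask

class MBConvSketch(nn.Module):
    "Skeleton only - shows where the drop connect call goes."
    def __init__(self, drop_connect_rate=0.2):
        super().__init__()
        self.drop_connect_rate = drop_connect_rate

    def forward(self, x):
        out = x  # stand-in for the real expand/depthwise/SE/project layers
        if self.drop_connect_rate:
            # Pass self.training here, at call time, so model.eval() is respected
            out = drop_connect(out, self.drop_connect_rate, self.training)
        return out + x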
I was looking at Google's recommendation on how to train this on the TPU and found the snippet below on this page:
For a single Cloud TPU device, the procedure trains the EfficientNet model (efficientnet-b0 variant) for 350 epochs and evaluates every fixed number of steps. Using the specified flags, the model should train in about 23 hours. With real imagenet data, the settings should obtain ~76.5% top-1 accuracy on ImageNet validation dataset.
A typical ResNet-50 is trained for 90 epochs. I don't know what this means in terms of training time for this network: ~10x fewer FLOPs and ~5x fewer parameters, but 4x+ more epochs. You might have a better idea based on your runs. I'm curious though.
It might be better to start from the existing weights ported to PyTorch and then fine-tune in PyTorch. This way you can train B0-B4 faster.
How long did it take for you to train this network for 80 epochs? What card are you using to train?
It took me about 4-5 days to train ResNet-50 on ImageNet with a 2080 Ti in FP32. This is why I am suggesting you start with existing weights and, preferably, a smaller dataset - it would make your train/test/fix-bugs loop shorter.
@Surya501 I have only been training with Imagewoof so far (10 dog classes from Imagenet).
I don't have an impression of efficiency so far. The recommended image size for B3 is 300, which would make things go very slowly if followed for the entire training. Convergence in the first few epochs is not that great, and if they indeed require more epochs, this won't train as fast.
But let’s see what kind of results others get before making any conclusion.