Yes it is, and it’s what the current leaderboard is based on. The baselines were run with the original setup, which we found gives a fair comparison.
I’ve made a start by using twist in some of the “standard” models using fastai v2 dev: https://github.com/pete88b/data-science/blob/master/fastai-things/train-imagewoof-with-TwistLayer.ipynb
hope it helps
Thanks, @pete88b. As I mentioned, there’s a lot of waste in parameters.
If you are testing on ResNeXt, did you give each conv2d a groups argument? (That’s as much as I understand of ResNeXt.)
Briefly, I’m adding two extra conv2d (that I call convx and convy), but you can see that I “symmetrized” the weights, so instead of 9 parameters for each filter/channel/feature-map, it’s only 4 in effect. I also copied the convx weights into convy so the entire convy is extraneous. Of course once we have tested various possibilities we could write the TwistLayer more efficiently.
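To make the "4 instead of 9" count concrete, here is a minimal pure-Python sketch (not the actual TwistLayer code) of what `(W - W.flip(2).flip(3)) / 2` does to a single 3x3 kernel. The helper names `rot180` and `antisymmetrize` are made up for illustration.

```python
def rot180(k):
    """Reverse both axes of a 3x3 kernel (a 180-degree rotation)."""
    return [row[::-1] for row in k[::-1]]

def antisymmetrize(k):
    """Return (k - rot180(k)) / 2, entry by entry."""
    r = rot180(k)
    return [[(k[i][j] - r[i][j]) / 2 for j in range(3)] for i in range(3)]

kernel = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]
odd = antisymmetrize(kernel)
for row in odd:
    print(row)
```

After the operation the centre entry is zero and each entry is the negative of the one opposite it, so only 4 of the 9 values remain free.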
You asked where you can learn about TwistLayer. It’s related to the Neural ODE paper (and others) that interprets ResNet as solving a differential equation. I wrote about the mathematics here, but at the time I didn’t actually know ResNet (even now I know very little beyond ResNet) and I should do a complete rewrite.
I can open up a separate thread to answer questions. [Update: new thread here]
@liuyao that sounds very interesting. Since you’re decreasing the param count, you should get the best benefits with more epochs (so try 200) and less regularization (so try less mixup and larger random resize crop area).
Might not have followed the developments in this thread correctly, but can somebody briefly explain to me what’s a Twist Layer?
I’ve simplified it a bit and it seems to be doing better (I’ll update with results). Here’s the conv_twist layer, replacing each 3x3 convolution. I don’t know if I can explain more briefly than the code:
    import numpy as np
    import torch
    import torch.nn as nn

    class conv_twist(nn.Module):  # replacing 3x3 Conv2d
        def __init__(self, ni, nf, stride=1):
            super(conv_twist, self).__init__()
            self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
            self.convx = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
            self.convy = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
            self.convx.weight.data = (self.convx.weight - self.convx.weight.flip(2).flip(3)) / 2
            self.convy.weight.data = self.convx.weight.transpose(2,3).flip(2)
            # self.radii = nn.Parameter(torch.Tensor(nf), requires_grad=True)
            self.center_x = nn.Parameter(torch.Tensor(nf), requires_grad=True)
            self.center_y = nn.Parameter(torch.Tensor(nf), requires_grad=True)
            # self.radii.data.uniform_(0.3, 0.7)
            self.center_x.data.uniform_(-0.7, 0.7)
            self.center_y.data.uniform_(-0.7, 0.7)

        def forward(self, x):
            # make convx a first-order operator by symmetrizing it
            self.convx.weight.data = (self.convx.weight - self.convx.weight.flip(2).flip(3)) / 2
            self.convy.weight.data = (self.convy.weight - self.convy.weight.flip(2).flip(3)) / 2
            # make convy a 90 degree rotation of convx:
            # self.convy.weight.data = self.convx.weight.transpose(2,3).flip(2)
            x1 = self.conv(x)
            _, c, h, w = x1.size()
            # Per-channel coordinates relative to the learned centre.
            # Note: the [2] / [1] indexing (column / row index) and the "- 1" shift are
            # reconstructed so the shapes broadcast and the image sits in the [-1,1] square.
            XX = torch.from_numpy(np.indices((1,h,w))[2]*2/w - 1).type(x.dtype).to(x.device) - self.center_x.view(-1,1,1)
            YY = torch.from_numpy(np.indices((1,h,w))[1]*2/h - 1).type(x.dtype).to(x.device) - self.center_y.view(-1,1,1)
            # mask = ramp_func((XX**2+YY**2)/(self.radii.type(x.dtype).to(x.device).view(-1,1,1)**2))
            return x1 + (XX * self.convx(x) + YY * self.convy(x))  # * mask
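For anyone wondering what the XX factor looks like, here is a rough pure-Python sketch of the horizontal offsets for one channel. It assumes the pixel columns are rescaled to run from -1 to just under 1, and `offsets_from_center` is a hypothetical helper, not part of conv_twist.

```python
def offsets_from_center(width, center):
    """Column coordinates rescaled to [-1, 1), measured from `center`."""
    return [2 * j / width - 1 - center for j in range(width)]

inside = offsets_from_center(4, 0.0)    # centre in the middle of the image
outside = offsets_from_center(4, 1.5)   # centre to the right of the image
print(inside)
print(outside)
```

With the centre inside the image the first-order term changes sign across the feature map; with the centre outside it keeps one sign everywhere.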
| Size (px) | Epochs | Model | Mixup | Accuracy | # Runs |
|---|---|---|---|---|---|
| 128 | 5 | RMS + twist | 0 | 70.95% | 5, mean |
| 128 | 20 | RMS + twist | 0 | 85.24% | 5, mean |
| 128 | 80 | RMS + twist | 0.2 | 87.81% | 1 |
| 128 | 80 | RMS + twist | 0.5 | 88.52% | 1 |
| 128 | 200 | RMS + twist | 0.2 | 88.70% | 1 |
| 256 | 200 | RMS + twist | 0.2 | 91.52% | 1 |

| Size (px) | Epochs | Model | Mixup | Accuracy | # Runs |
|---|---|---|---|---|---|
| 256 | 200 | RMS + twist | 0.5 | 95.87% | 1 |
@a_yasyrev, if you could help test with your ResNet trick + MaxBlurPool, that would be very nice.
Any literature references for this?
Not that I know of. As I mentioned above, the initial observation in Neural ODE paper (and probably others) is related, but I don’t know about this particular implementation.
Maybe I can write about it in the fastpages blog
I inadvertently replaced all the 1x1 convolutions (in addition to the 3x3’s) by conv_twist, and it bumped the acc up…
The resnet50 uses 1x1 in the bottleneck block (1x1 3x3 1x1), as well as some of the skip connections.
I don’t know what to make of it.
(Update: It probably only happens with 5 epochs… Ok, not a fair comparison. Still, that it doesn’t just break down is a surprise.)
Found time to do the long runs.
I added nbdev to the repo with the notebooks, so now you can find all results and links to the notebooks on the ‘doc’ page: https://ayasyrev.github.io/imagenette_experiments/.
On 128 the improvements are small, but on 192 and 256 they are good enough.
128px:
80 eps - 87.63% (now 87.20%)
200 eps - 88.30% (87.20%)
192px:
80 eps - 89.69% (89.21%)
200 eps - 90.35% (89.54%)
256px:
80 eps - 90.63% (90.48%)
200 eps - 91.14% (90.38%)
Did 3-4 runs.
Worth mentioning that I used start_pst 0.4-0.2 for the long runs.
I have started testing the twist layer.
Let’s move this discussion to a separate thread!
I think we need more than one leaderboard with only the best score. It may be a good idea to have another page with a list, so that we can compare results not only against the best but against different models or settings, for example against the default xresnet or resnet.
We could have a template for submissions, with results at different sizes, links to notebooks, etc. We could submit results to some folder, then a script would build the list, and the best results could be moved to the main leaderboard.
Sorry for using this thread as a lab notebook… but today I was trying to tweak the “hyperparameter” (the 0.7 used to initialize the parameters in conv_twist), and the 5-epoch accuracy went further up as I increased it. The way I think of it, (center_x, center_y) is the point around which conv_twist is “twisting” the image, which at this scale occupies the [-1,1]x[-1,1] square.
turning it up to 1.5 (so the center can be outside the image)
[0.741156 0.73632 0.744973 0.751591 0.72563 ] 0.7399338 0.008722797
[0.757954 0.747264 0.748282 0.750064 0.75719 ] 0.75215065 0.004522691
[0.759735 0.755663 0.75999 0.753372 0.761262] 0.75800455 0.0029829554
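The two trailing numbers on each line look like the mean and the (population) standard deviation of the five runs. A quick way to check the first line, using only the standard library:

```python
from statistics import fmean, pstdev

runs = [0.741156, 0.73632, 0.744973, 0.751591, 0.72563]
mean = fmean(runs)
std = pstdev(runs)  # population std, the same convention as numpy's default
print(round(mean, 4), round(std, 4))
```

This agrees with the reported 0.7399338 and 0.008722797 up to float rounding.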
(I’m not sure we can still call this ResNet, as it is using 3x3 exclusively.)
I will start a new thread [Update: new thread here]. This reminds me of the PolyMath projects (of Terry Tao and others), a series of online collaborative projects in research mathematics in recent years. Participants would post small comments to a blog post describing the problem, and once in a while (when comments reach the hundreds) the host would start a new post, summarizing what they have learned, and discussion would take off from there. A couple of success stories. I’m new to fastai, and it is probably what’s happening here too.
I did a short test with Twist, only on 128px, for 5 and 20 epochs. Looks very good!
5 (3 runs) - 75.13%
20 (2 runs) - 86.39%
Here is notebook - https://github.com/ayasyrev/imagenette_experiments/blob/master/Woof_twist_s128_e5_7513_e20_8639.ipynb
Right now I’m trying to finalize a couple of experiments with new tricks. I hope to post about it tomorrow (2 days at most). I also tried them with Twist and have some findings: the results are VERY promising!
So, on 5 epochs, 76-76.5%. Need to do more runs and checks… Stay tuned!
I stumbled over this publication, which (somehow) reminded me of your twist layer:
TL;DR: The Ghost module creates additional feature maps from a conv module but with “cheaper” operations.
I’ve been playing around trying to beat the Imagewoof leaderboards and ran into something odd regarding the activations in ResNet. Figured I’d share what I have so far.
TL;DR: Using DoubleMish instead of Mish after the residual connection improves accuracy by 1%. Other tweaks to the activation functions bring the total improvement to 2%. At least for 128px, 5 epochs, XResNet-18.
Basically I was looking at the default XResNet-18 architecture and found that the way activations are applied along the residual path is inconsistent in scaling blocks versus non-scaling blocks. In a non-scaling block the residual connection adds non-activated values to activated values. Whereas in a scaling block both are non-activated because the residual side passes through a pooling layer, conv, and batchnorm first.
So I was messing around trying to fix that and noticed that in general adding activations along the residual paths tended to improve accuracy. I tried something crazy, went back to default XResNet-18, but with DoubleMish instead of Mish after every residual connect (i.e. here:
return self.act(self.convpath(x) + self.idpath(x))). DoubleMish is just
Testing against Imagewoof, 128px, 5 epochs, 20 runs each, I saw a mean accuracy of 67.9% for default XResNet18 and 69.1% with the above tweak.
I’ve since rigged up a custom XResNet18 model where the activation functions can be specified in a variety of places. Running evolutionary search using that I’ve gotten mean accuracy up to 69.9%. The discovered configuration for that adds Mish after the stem, adds DoubleMish before the final AdaptiveAvgPool, adds DoubleMish before the convolutional path in each resblock, adds Mish before the pooling along idpath, adds Mish after the convolution in idpath, and uses DoubleMish instead of Mish after residual connections.
Bit of a laundry list of changes. I’m still experimenting to see which of those are important. The DoubleMish after residual connection alone is +1% so that’s at least significant. Using DoubleMish instead of Mish everywhere does not provide a benefit. So it’s only useful in key locations. I find that quite odd.
Some things to note:
I was never able to match the accuracies on the leaderboard, even when I attempted to replicate the same configurations. So I just run all my experiments referenced to the default Imagenette example code. XResNet18 is the simplest to mess with. All of this is at 128px, 5 epochs, 20 runs each. The evolutionary algorithm uses None, Mish, or DoubleMish in all the configurable activation locations. Ranger optimizer with fit_flat_cos. 64 batch size. 1e-2 lr for all runs.
Obviously this might be overfitting the hyperparameters to Imagewoof. I figure after all my experimentation is done I can try the resulting architecture against other datasets to see if the improvement is consistent.
Since I’m not varying LR it’s possible that plays a role in the differences, but Ranger tends to be forgiving. Again, I can always fine-tune LR once I have an architecture I want to validate.
Given the limited training, it’s possible this doesn’t scale to higher epochs. I’ve only got a measly 970 to experiment on right now, so I’m working with what I’ve got.
Here’s a gist with the core bits of the code: https://gist.github.com/fpgaminer/57232ab085b8be1decb0906c4eb03356 I’ll try to release a more complete notebook once I’ve finished experimenting. Right now the experimental notebook is … a nightmare of cells.
Here’s a CSV of all the experimental runs thus far: https://gist.github.com/fpgaminer/04be013f894997f92bb33a89bc39fc76
That’s all I’ve got for now. Just really strange that more aggressive activation specifically along the residual path leads to improved accuracy. Though Mish itself is quite surprising. It’s so similar to Swish and yet whatever subtle difference is there is significant. So maybe it’s not so odd that DoubleMish is useful in certain places.
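For reference, a small sketch comparing the standard definitions of Mish and Swish numerically, in plain Python so it runs anywhere. DoubleMish is left out because its definition isn't reproduced in this post.

```python
import math

def softplus(x):
    """Softplus: log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def swish(x):
    """Swish (SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:+.1f}  mish={mish(x):+.4f}  swish={swish(x):+.4f}")
```

The two curves have the same overall shape but differ by a noticeable margin around x = 1.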
I seem to have found a simple way to get consistent good results across the board. (It has nothing to do with what I posted above, and I’ll see if I can incorporate that in.) In fastai’s implementation of the Bottleneck block in ResNet50, change the middle one (3x3) as follows:
    ConvLayer(ni, nh, 1, act_fn=act_fn, bn_1st=bn_1st),
    ConvLayer(nh, nh*4, 3, groups=nh, act=False, bn_1st=bn_1st),
    ConvLayer(nh*4, nf, 1, zero_bn=zero_bn, act=False, bn_1st=bn_1st)
This seems to be what MobileNet etc. did with so-called “depthwise separable convolution” with a depth multiplier=4. Not sure why it isn’t more widely used.
[Actually, looking at fastaiv2’s source code it seems that dw=True would be doing just that.]
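To get a feel for what the groups argument does to the weight count of the middle 3x3, here is a back-of-the-envelope sketch. `conv_params` is a hypothetical helper and nh = 64 is only an assumed width for illustration; biases are ignored.

```python
def conv_params(n_in, n_out, kernel_size, groups=1):
    """Weight count of a bias-free 2D convolution."""
    assert n_in % groups == 0 and n_out % groups == 0
    return (n_in // groups) * n_out * kernel_size * kernel_size

nh = 64  # assumed bottleneck width, for illustration only
dense_3x3 = conv_params(nh, nh, 3)                    # ordinary 3x3, nh -> nh
depthwise_x4 = conv_params(nh, nh * 4, 3, groups=nh)  # groups=nh, nh -> nh*4
print(dense_3x3, depthwise_x4)
```

Note that the 1x1 that follows now takes nh*4 inputs instead of nh, so that layer grows fourfold in weights.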
I’ll fill in the results on ImageWoof as they come in, compared with the current leaderboard (though I know many of you have better results but didn’t PR.)
| Size (px) | Epochs | Leaderboard entry (as of 7/7) | Leaderboard acc (as of 7/7) | New acc (%) | # Runs |
|---|---|---|---|---|---|
| 128 | 5 | fastai2 train_imagenette.py 2020-01 + MaxBlurPool | 73.37% | 77.83 (81.89) | 5 |
| 128 | 20 | fastai2 train_imagenette.py 2020-01 + MaxBlurPool | 85.52% | 86.62 (88.77) | 5 |
| 128 | 80 | fastai2 train_imagenette.py 2020-01 | 87.20% | 88.13 (90.22) | 1 |
| 128 | 200 | fastai2 train_imagenette.py 2020-01 | 87.20% | (90.71) | 1 |
| 192 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 77.87% | 81.15 (81.76) | 5 |
| 192 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 87.85% | 88.37 | 5 |
| 192 | 80 | fastai2 train_imagenette.py 2020-01 | 89.21% | 89.89 (91.44) | 1 |
| 192 | 200 | fastai2 train_imagenette.py 2020-01 | 89.54% | 90.32 | 1 |
| 256 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 78.84% | 82.33 | 5 |
| 256 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 88.58% | 89.53 | 5 |
| 256 | 80 | fastai2 train_imagenette.py 2020-01 | 90.48% | 90.93 | 1 |
| 256 | 200 | fastai2 train_imagenette.py 2020-01 | 90.38% | | 1 |
In parentheses are the results when applying depthwise (x4) to the stem as well, sticking with the [3,32,64,64] that fastai has optimized. Unfortunately it slows down the training considerably (with batch size=16). There must be ways to do better, timewise.
You may run the code directly here (thanks to @a_yasyrev) https://github.com/liuyao12/imagenette_experiments/blob/master/Woof_ResNet_separable.ipynb or implement it easily in your own network (ResNet or not)
Very interesting! It reminds me of the lecture where Jeremy said something along the lines of if you see an old network without resnets, try adding them. Maybe now we can add, ‘if you see a network without separable depthwise convolutions, try adding them’. Staying tuned…
In your code, you didn’t use BN or an activation function in that middle block.
Did you try with Act and Bn and some variants?
Yes, at first I had both act and bn, then I took out act for most of the 80 and 200 runs for it seems to perform better. Then, in order to run depthwise (stem+body) on 192px and keep at bs=16 (on a single GPU on colab), I had to remove bn. Sorry I didn’t document it in the notebook.
Turns out I forgot to do stride=2 at the very first layer (3 channels to 32 channels), which meant the entire network was four times as costly (clocking in at 7 min per epoch, instead of 2 min). But when I put the stride=2 back, it doesn’t seem to do as well. So the higher numbers in parentheses are attributable to the model being 4 times bigger, rather than to depthwise (x4) as I thought. I’m re-doing the 80 epoch run to see if I can beat the all-time record of 91.44. [Update: Indeed, with no striding at the beginning, the model without any of my tweaks is able to achieve 91.21 after 80 epochs. So I didn’t find a better model after all.]
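The "four times as costly" figure follows from the usual output-size formula for a convolution. A minimal sketch, assuming a 3x3 kernel with padding 1 for that first layer (`conv_out_size` is a made-up helper name):

```python
def conv_out_size(n, kernel_size=3, stride=1, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (n + 2 * padding - kernel_size) // stride + 1

size = 128
with_stride = conv_out_size(size, stride=2)
without_stride = conv_out_size(size, stride=1)
# Every later layer sees this many times more spatial positions without the stride
cost_ratio = (without_stride ** 2) / (with_stride ** 2)
print(with_stride, without_stride, cost_ratio)
```

Dropping the stride doubles the height and width of every later feature map, hence roughly 4x the compute, even though the number of weights is unchanged.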