I am working on image inpainting/completion at the moment, and it seems to really benefit from dilated convolutions (see this paper); dilated convolutions have also been used successfully for segmentation in DeepLabv3.
I modified the xresnet in fastai2 to support dilation like they do in this paper (dilated resnets), i.e. use dilation in both convolutions of a basic block, and only in the middle convolution of a bottleneck block.
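A rough PyTorch sketch of that placement (hypothetical names and shapes, not the actual fastai code): in a basic block both 3x3 convs are dilated, while in a bottleneck only the middle 3x3 conv is, since the 1x1 convs have no spatial extent to dilate.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the conv stack of a bottleneck block: only the
# middle 3x3 conv carries the dilation; padding=dilation keeps a 3x3
# conv size-preserving at stride 1.
def bottleneck_convs(ni, nh, nf, stride=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(ni, nh, 1, bias=False),  # 1x1, never dilated
        nn.Conv2d(nh, nh, 3, stride=stride,
                  padding=dilation, dilation=dilation, bias=False),  # dilated 3x3
        nn.Conv2d(nh, nf, 1, bias=False),  # 1x1, never dilated
    )

out = bottleneck_convs(64, 64, 256, dilation=2)(torch.randn(1, 64, 16, 16))
print(out.shape)  # torch.Size([1, 256, 16, 16]) -- spatial size preserved
```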
Should I send a pull request with these changes? Would this be useful for users?
Sure, I'm just starting to run experiments today, and because there are no pretrained models it will take a while until I get the results.
Will share them when I get them.
I expect it would! If you're able to support it without adding too much complexity to the code, or changing the behavior of existing networks, I think that would be great.
@j.laute Definitely a valuable addition for using the networks in segmentation/object detection tasks. A few comments:
Regarding your current impl, a suggestion about the location of the dilations. For each stride-2 conv, the next level of dilation should kick in after that conv, and then that dilation should remain in effect for the rest of the network, with each subsequent stride -> dilation adding on top of the previous. This means that for a sequence of blocks, the first block (with the stride-2 conv) has a different dilation than the next blocks in that sequence. Also, for the basic block (two 3x3 convs), the second 3x3 conv needs a different dilation from the first, strided one.
Also, in networks that support this sort of dilation, I often see it implemented at the model-creation interface with an output_stride=x arg instead of the user needing to know what to specify there. It's fairly straightforward to then compute the list of dilations from that. Supporting output strides of 8, 16, and the default of 32 is most common.
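As a sketch of that mapping (a hypothetical helper, not from any particular implementation, and assuming the stem contributes a stride of 2), the per-stage strides and dilations could be derived from output_stride like this:

```python
# Hypothetical helper: derive per-stage (stride, dilation) settings from a
# requested output_stride. Stages default to stride 2; once the cumulative
# network stride reaches output_stride, each further stage trades its
# stride-2 for a doubled dilation. The first (would-be strided) conv of a
# dilated stage keeps the previous dilation ('first_dilation').
def stage_cfg(num_stages=4, output_stride=32):
    net_stride = 2  # assumed stride already applied by the stem
    dilation = 1
    cfg = []
    for _ in range(num_stages):
        if net_stride >= output_stride:
            first, dilation = dilation, dilation * 2
            cfg.append(dict(stride=1, first_dilation=first, dilation=dilation))
        else:
            cfg.append(dict(stride=2, first_dilation=dilation, dilation=dilation))
            net_stride *= 2
    return cfg

print(stage_cfg(4, 8))
# output_stride=8: stages 1-2 stride 2, stages 3-4 dilated (2, then 4)
```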
A good test is to verify that the accuracy of a pretrained classification model (with adaptive global pooling, of course) does not degrade when you apply the dilations. It'll run slower and use more GPU memory, but accuracy should be pretty close to the original, +/- a bit (usually +).
Let me repeat so I am sure that I understand fully:
Suppose we have an xresnet18 with 2 resblocks per stage, i.e. layers=[2, 2, 2, 2] and dilations=[1, 2, 4, 8].
With the current code it would look like this:
  stage 1:
    block 1:
      conv1: stride=2, dilation=1
      conv2: stride=1, dilation=1
    block 2:
      conv1: stride=1, dilation=1
      conv2: stride=1, dilation=1
  stage 2:
    block 1:
      conv1: stride=2, dilation=2
      conv2: stride=1, dilation=2
    block 2:
      conv1: stride=1, dilation=2
      conv2: stride=1, dilation=2
  stage 3:
    block 1:
      conv1: stride=2, dilation=4
      conv2: stride=1, dilation=4
    block 2:
      conv1: stride=1, dilation=4
      conv2: stride=1, dilation=4
  stage 4:
    block 1:
      conv1: stride=2, dilation=8
      conv2: stride=1, dilation=8
    block 2:
      conv1: stride=1, dilation=8
      conv2: stride=1, dilation=8
and you are suggesting that it should look like this:
  stage 1:
    block 1:
      conv1: stride=2, dilation=1
      conv2: stride=1, dilation=1
    block 2:
      conv1: stride=1, dilation=1
      conv2: stride=1, dilation=1
  stage 2:
    block 1:
      conv1: stride=2, dilation=1
      conv2: stride=1, dilation=2
    block 2:
      conv1: stride=1, dilation=2
      conv2: stride=1, dilation=2
  stage 3:
    block 1:
      conv1: stride=2, dilation=2
      conv2: stride=1, dilation=4
    block 2:
      conv1: stride=1, dilation=4
      conv2: stride=1, dilation=4
  stage 4:
    block 1:
      conv1: stride=2, dilation=4
      conv2: stride=1, dilation=8
    block 2:
      conv1: stride=1, dilation=8
      conv2: stride=1, dilation=8
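A minimal PyTorch sketch of a basic block under this scheme (hypothetical helper names, not the fastai code), where the first 3x3 conv keeps the previous stage's dilation and the second uses the current one:

```python
import torch
import torch.nn as nn

# Hypothetical basic block: two 3x3 convs with independent dilations.
# padding=dilation keeps each 3x3 conv size-preserving at stride 1.
def dilated_basic_block(ni, nf, stride=1, first_dilation=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(ni, nf, 3, stride=stride,
                  padding=first_dilation, dilation=first_dilation, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True),
        nn.Conv2d(nf, nf, 3, stride=1,
                  padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(nf),
    )

# First block of a dilated stage: the stride-2 is replaced by stride 1,
# conv1 keeps dilation 1, conv2 moves to dilation 2.
block = dilated_basic_block(64, 128, stride=1, first_dilation=1, dilation=2)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32]) -- no downsampling
```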
The output stride as a parameter I understand and will implement.
I just checked the official implementation here and saw that they also remove the stride-2 at the beginning of the blocks that use dilation; should I do that as well?
Also, could you explain or share a resource that explains why the first conv should have the dilation of the previous stage?
Thanks for the help
@j.laute yes, the second part of your example is correct, but with stride=2 changed to stride=1 in the blocks where you apply dilation (as you say in the follow-up). Typically I've only seen stages 3 & 4 modified, for a total network stride of 8 or 16. What you show is a stride-4 network; still valid, but not very common, as it'd have very large feature maps.
A stride-4 network that changes stage 2, or even a stride-2 one (edit: whoops, stage 1 is already stride 1, so the stem and all stages would be modified) would still be valid. I just don't see it very often, but it definitely wouldn't be wrong to support them. DeepLab typically has an output_stride=8 or 16 choice for the backbone. I think RetinaNets sometimes use an output_stride=16 config. Many networks based on FPN don't alter the stride of the backbone and just tap each stage at its default stride.
@rwightman @jeremy I implemented the feedback here and created a Colab notebook with some initial examples here.
Apart from some issues with weight loading and saving, I noticed two problems:
Contrary to what you suggested (if I understood you correctly), there is a massive drop in accuracy when running inference at a different output stride than the one the model was trained with (see notebook). Fine-tuning with dilation seems to give similar or slightly better results in my initial experiments.
How should we handle xresnets with more than 4 stages, i.e. the 'xresnetxx_deep' and 'xresnet_deeper' versions? They have output strides of 128 and 512 respectively, achieved by adding stages with just one resblock.
For example, xresnet34 has [3, 4, 6, 3] resblocks, xresnet34_deep has [3, 4, 6, 3, 1, 1], and xresnet34_deeper has [3, 4, 6, 3, 1, 1, 1, 1].
This implementation supports the unchanged case, but I am unsure how to handle these models. Edit: not a priority at this moment.
In the meantime I will run more experiments (segmentation, detection) and think about how to fix the weight loading issues.
I don't have time to check the code right now, but two things:
the weights should remain 100% compatible
the classification accuracy of a pretrained model should not change much
For reference, a few days back I did a quick ResNet34 check on one of my own models + pretrained classification weights with this scheme implemented: output_stride=32 (no dilation): 75.1%; output_stride=16: 74.94%; output_stride=8: 75.17%.
The reason the weights aren't 100% compatible (they need some renaming) is that, unlike the original resnet, which uses a 1x1 stride-2 convolution on the identity path in the first block of a stage, the xresnet uses a 2x2 average pooling followed by a 1x1 stride-1 conv in these blocks, with the average pooling present only if that block has stride 2. So the name change is 'idpath.1.conv' -> 'idpath.0.conv'.
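As a sketch of that rename (a hypothetical helper; only the 'idpath.1.' -> 'idpath.0.' prefix is taken from the message above), the pretrained state dict could be remapped before loading:

```python
# Hypothetical helper: when the stride-2 AvgPool2d is dropped from the
# identity path, the 1x1 conv shifts from index 1 to index 0, so the
# pretrained keys need their prefix remapped accordingly.
def remap_idpath_keys(state_dict):
    return {k.replace('idpath.1.', 'idpath.0.'): v
            for k, v in state_dict.items()}
```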
I will try my implementation with a standard resnet and report results.
Right, the 'D'-style shortcut that xresnet uses makes this more challenging. My impl does support weight compatibility with that shortcut, but I haven't figured out a way around the performance loss with that shortcut present. To avoid the weight-compat issue, you can replace the AvgPool2d with an identity block when dilation != 1.
The 2x2 pooling kernel makes it challenging to alter the pooling stride, as the padding needs to be asymmetric to maintain matching dimensions between the shortcut and the conv path. I had a thought to explore it at one point, but didn't dig in.
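A minimal sketch of that workaround (assumed module layout, not the exact fastai/xresnet code): keep the shortcut's Sequential layout, but swap the stride-2 AvgPool2d for an Identity when the block is dilated instead of strided, so the parameterless slot at index 0 leaves the pretrained 'idpath.1.*' keys matching:

```python
import torch.nn as nn

# Hypothetical 'D'-style shortcut: AvgPool2d only when downsampling; an
# Identity placeholder otherwise keeps the 1x1 conv at index 1, so no key
# renaming is needed when loading pretrained weights.
def d_shortcut(ni, nf, stride=1):
    pool = nn.AvgPool2d(2, ceil_mode=True) if stride == 2 else nn.Identity()
    return nn.Sequential(pool,
                         nn.Conv2d(ni, nf, 1, bias=False),
                         nn.BatchNorm2d(nf))
```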