I am working on image inpainting/completion at the moment, and it seems to really benefit from dilated convolutions (see this paper); dilated convolutions have also been used successfully for segmentation in DeepLabv3.
I modified the xresnet in fastai2 to support dilation like they do in this paper (dilated resnets), i.e. use dilation in both convolutions of a basic block, and only in the middle convolution of a bottleneck block.
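A rough PyTorch sketch of that placement (hypothetical names and shapes, not the actual fastai code): in a basic block both 3x3 convs are dilated, while in a bottleneck only the middle 3x3 conv is, since the 1x1 convs have no spatial extent to dilate.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the conv stack of a bottleneck block: only the
# middle 3x3 conv carries the dilation; padding=dilation keeps a 3x3
# conv size-preserving at stride 1.
def bottleneck_convs(ni, nh, nf, stride=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(ni, nh, 1, bias=False),  # 1x1, never dilated
        nn.Conv2d(nh, nh, 3, stride=stride,
                  padding=dilation, dilation=dilation, bias=False),  # dilated 3x3
        nn.Conv2d(nh, nf, 1, bias=False),  # 1x1, never dilated
    )

out = bottleneck_convs(64, 64, 256, dilation=2)(torch.randn(1, 64, 16, 16))
print(out.shape)  # torch.Size([1, 256, 16, 16]) -- spatial size preserved
```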
Should I send a pull request with these changes? Would this be useful for users?
Sure, I'm just starting to run experiments today, and because there are no pretrained models it will take a while until I get the results.
Will share them when I get them.
I expect it would! If you're able to support it without adding too much complexity to the code, or changing the behavior of existing networks, I think that would be great.
@j.laute Definitely a valuable addition for using the networks in segmentation/object detection tasks. A few comments:
Regarding your current impl, a suggestion about the location of the dilations. For each stride-2 conv, the next level of dilation should kick in after that conv, and then that dilation should remain in effect for the rest of the network, with each subsequent stride -> dilation adding on top of the previous. This means that for a sequence of blocks, the first block (with the stride-2 conv) has a different dilation than the next blocks in that sequence. Also, for the basic block (two 3x3 convs), the second 3x3 conv needs a different dilation from the first, strided one.
Also, in networks that support this sort of dilation, I often see it implemented at the model-creation interface with an output_stride=x arg instead of the user needing to know what to specify there. It's fairly straightforward to then compute the list of dilations from that. Supporting output strides of 8, 16, and the default of 32 is most common.
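As a sketch of that mapping (a hypothetical helper, not from any particular implementation, and assuming the stem contributes a stride of 2), the per-stage strides and dilations could be derived from output_stride like this:

```python
# Hypothetical helper: derive per-stage (stride, dilation) settings from a
# requested output_stride. Stages default to stride 2; once the cumulative
# network stride reaches output_stride, each further stage trades its
# stride-2 for a doubled dilation. The first (would-be strided) conv of a
# dilated stage keeps the previous dilation ('first_dilation').
def stage_cfg(num_stages=4, output_stride=32):
    net_stride = 2  # assumed stride already applied by the stem
    dilation = 1
    cfg = []
    for _ in range(num_stages):
        if net_stride >= output_stride:
            first, dilation = dilation, dilation * 2
            cfg.append(dict(stride=1, first_dilation=first, dilation=dilation))
        else:
            cfg.append(dict(stride=2, first_dilation=dilation, dilation=dilation))
            net_stride *= 2
    return cfg

print(stage_cfg(4, 8))
# output_stride=8: stages 1-2 stride 2, stages 3-4 dilated (2, then 4)
```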
A good test is to verify that the accuracy of a pretrained classification model (with adaptive global pooling, of course) does not degrade when you apply the dilations. It'll run slower and use more GPU memory, but accuracy should be pretty close to the original, +/- a bit (usually +).
Let me repeat so I am sure that I understand fully:
Suppose we have an xresnet18 with 2 resblocks per stage, i.e. layers=[2, 2, 2, 2] and dilations=[1, 2, 4, 8].
With the current code it would look like this:
  stage 1:
    block 1:
      conv1: stride=2, dilation=1
      conv2: stride=1, dilation=1
    block 2:
      conv1: stride=1, dilation=1
      conv2: stride=1, dilation=1
  stage 2:
    block 1:
      conv1: stride=2, dilation=2
      conv2: stride=1, dilation=2
    block 2:
      conv1: stride=1, dilation=2
      conv2: stride=1, dilation=2
  stage 3:
    block 1:
      conv1: stride=2, dilation=4
      conv2: stride=1, dilation=4
    block 2:
      conv1: stride=1, dilation=4
      conv2: stride=1, dilation=4
  stage 4:
    block 1:
      conv1: stride=2, dilation=8
      conv2: stride=1, dilation=8
    block 2:
      conv1: stride=1, dilation=8
      conv2: stride=1, dilation=8
and you are suggesting that it should look like this:
  stage 1:
    block 1:
      conv1: stride=2, dilation=1
      conv2: stride=1, dilation=1
    block 2:
      conv1: stride=1, dilation=1
      conv2: stride=1, dilation=1
  stage 2:
    block 1:
      conv1: stride=2, dilation=1
      conv2: stride=1, dilation=2
    block 2:
      conv1: stride=1, dilation=2
      conv2: stride=1, dilation=2
  stage 3:
    block 1:
      conv1: stride=2, dilation=2
      conv2: stride=1, dilation=4
    block 2:
      conv1: stride=1, dilation=4
      conv2: stride=1, dilation=4
  stage 4:
    block 1:
      conv1: stride=2, dilation=4
      conv2: stride=1, dilation=8
    block 2:
      conv1: stride=1, dilation=8
      conv2: stride=1, dilation=8
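A minimal PyTorch sketch of a basic block under this scheme (hypothetical helper names, not the fastai code), where the first 3x3 conv keeps the previous stage's dilation and the second uses the current one:

```python
import torch
import torch.nn as nn

# Hypothetical basic block: two 3x3 convs with independent dilations.
# padding=dilation keeps each 3x3 conv size-preserving at stride 1.
def dilated_basic_block(ni, nf, stride=1, first_dilation=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(ni, nf, 3, stride=stride,
                  padding=first_dilation, dilation=first_dilation, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True),
        nn.Conv2d(nf, nf, 3, stride=1,
                  padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(nf),
    )

# First block of a dilated stage: the stride-2 is replaced by stride 1,
# conv1 keeps dilation 1, conv2 moves to dilation 2.
block = dilated_basic_block(64, 128, stride=1, first_dilation=1, dilation=2)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32]) -- no downsampling
```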
The output stride as a parameter I understand and will implement.
I just checked the official implementation here and saw that they also remove the stride-2 at the beginning of the blocks that use dilation; should I do that as well?
Also, could you explain or share a resource that explains why the first conv should have the dilation of the previous stage?
Thanks for the help
@j.laute yes, the second part of your example is correct, but with stride=2 changed to stride=1 in the blocks where you apply dilation (as you say in the follow-up). Typically I've only seen stages 3 & 4 modified, for a total network stride of 8 or 16. What you show is a stride-4 network; still valid, but not very common, as it'd have very large feature maps.
A stride-4 network that changes stage 2, or even a stride-2 one (edit: whoops, stage 1 is already stride 1, so the stem and all stages would be modified) would still be valid. I just don't see it very often, but it definitely wouldn't be wrong to support them. DeepLab typically has an output_stride=8 or 16 choice for the backbone. I think RetinaNets sometimes use an output_stride=16 config. Many networks based on FPN don't alter the stride of the backbone and just tap each stage at its default stride.
@rwightman @jeremy I implemented the feedback here and created a Colab notebook with some initial examples here.
Apart from some issues with weight loading and saving, I noticed two problems:
Contrary to what you suggested (if I understood you correctly), there is a massive drop in accuracy when running inference at a different output stride than the one the model was trained with (see notebook). Fine-tuning with dilation seems to give similar or slightly better results in my initial experiments.
How should we handle xresnets with more than 4 stages, i.e. the 'xresnetxx_deep' and 'xresnet_deeper' versions? They have output strides of 128 and 512 respectively, achieved by adding stages with just one resblock.
For example, xresnet34 has [3, 4, 6, 3] resblocks, xresnet34_deep has [3, 4, 6, 3, 1, 1], and xresnet34_deeper has [3, 4, 6, 3, 1, 1, 1, 1].
This implementation supports the unchanged case, but I am unsure how to handle these models. Edit: not a priority at this moment.
In the meantime I will run more experiments (segmentation, detection) and think about how to fix the weight loading issues.
I don't have time to check the code right now, but two things:
the weights should remain 100% compatible
the classification accuracy of a pretrained model should not change much
For reference, a few days back I did a quick ResNet34 check on one of my own models + pretrained classification weights with this scheme implemented: output_stride=32 (no dilation): 75.1%; output_stride=16: 74.94%; output_stride=8: 75.17%.
The reason the weights aren't 100% compatible (they need some renaming) is that, unlike the original resnet, which uses a 1x1 stride-2 convolution on the identity path in the first block of a stage, the xresnet uses a 2x2 average pooling followed by a 1x1 stride-1 conv in these blocks, with the average pooling present only if that block has stride 2. So the name change is 'idpath.1.conv' -> 'idpath.0.conv'.
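As a sketch of that rename (a hypothetical helper; only the 'idpath.1.' -> 'idpath.0.' prefix is taken from the message above), the pretrained state dict could be remapped before loading:

```python
# Hypothetical helper: when the stride-2 AvgPool2d is dropped from the
# identity path, the 1x1 conv shifts from index 1 to index 0, so the
# pretrained keys need their prefix remapped accordingly.
def remap_idpath_keys(state_dict):
    return {k.replace('idpath.1.', 'idpath.0.'): v
            for k, v in state_dict.items()}
```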
I will try my implementation with a standard resnet and report results.
Right, the 'D'-style shortcut that xresnet uses makes this more challenging. My impl does support weight compatibility with that shortcut, but I haven't figured out a way around the performance loss with that shortcut present. To avoid the weight-compat issue, you can replace the AvgPool2d with an identity block when dilation != 1.
The 2x2 pooling kernel makes it challenging to alter the pooling stride, as the padding needs to be asymmetric to maintain matching dimensions between the shortcut and the conv path. I had a thought to explore it at one point, but didn't dig in.
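A minimal sketch of that workaround (assumed module layout, not the exact fastai/xresnet code): keep the shortcut's Sequential layout, but swap the stride-2 AvgPool2d for an Identity when the block is dilated instead of strided, so the parameterless slot at index 0 leaves the pretrained 'idpath.1.*' keys matching:

```python
import torch.nn as nn

# Hypothetical 'D'-style shortcut: AvgPool2d only when downsampling; an
# Identity placeholder otherwise keeps the 1x1 conv at index 1, so no key
# renaming is needed when loading pretrained weights.
def d_shortcut(ni, nf, stride=1):
    pool = nn.AvgPool2d(2, ceil_mode=True) if stride == 2 else nn.Identity()
    return nn.Sequential(pool,
                         nn.Conv2d(ni, nf, 1, bias=False),
                         nn.BatchNorm2d(nf))
```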