Support dilated convolutions in xresnet

I am working on image inpainting/completion at the moment, and it seems to really benefit from dilated convolutions (see this paper); dilated convolutions have also been used successfully for segmentation in DeepLab v3.

I modified the xresnet in fastai2 to support dilation the way they do in this paper (dilated resnets), i.e. use dilation in both convolutions of a basic block, and only in the middle convolution of a bottleneck block.
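For the dilated convs to drop in without changing feature-map sizes, the padding has to grow with the dilation. A minimal PyTorch sketch (a hypothetical helper, not the actual fastai2 code, which handles this inside its own conv layer):

```python
import torch
import torch.nn as nn

# For a 3x3 conv, padding must equal the dilation to preserve spatial size
# (the effective kernel size is 3 + 2*(dilation - 1)).
def dilated_conv3x3(ni, nf, stride=1, dilation=1):
    return nn.Conv2d(ni, nf, 3, stride=stride,
                     padding=dilation, dilation=dilation, bias=False)

x = torch.randn(1, 64, 32, 32)
y = dilated_conv3x3(64, 64, dilation=2)(x)
print(tuple(y.shape))  # (1, 64, 32, 32): same spatial size as the input
```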

Should I send a pull request with these changes? Would this be useful for users?

Cheers, Johannes


Could you show an example with the improvement in the framework? I’d certainly be interested in seeing that :slight_smile:


Sure, I’m just starting to run experiments today, and because there are no pretrained models it will take a while until I get the results.
Will share them when I get them. :slight_smile:


I’d recommend something like ImageWoof as a comparison on 20 epochs (you can add on the rest of the techniques that were found successful too)

I expect it would! If you’re able to support it without adding too much complexity to the code, or changing the behavior of existing networks, I think that would be great.

The PR is submitted here; I’ll continue with experiments now :slight_smile:


@j.laute Definitely a valuable addition for using the networks in segmentation/object detection tasks. A few comments:

Regarding your current impl, a suggestion on the location of the dilations. For each stride=2 conv, the next level of dilation should kick in after that conv, and that dilation should then remain in effect for the rest of the network, with each subsequent stride -> dilation adding on top of the previous. This means that for a sequence of blocks, the first block (the one with the stride-2 conv) has a different dilation than the following blocks in that sequence. Also, for the basic block (two 3x3 convs), the second 3x3 conv needs a different dilation from the first, strided one.
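This per-block scheme could be sketched as a small helper (hypothetical names; `prev_dilation` is the dilation carried over from the previous stage):

```python
# For one stage of n_blocks, return a (first_conv_dilation, dilation) pair
# per block: the first block's (formerly strided) conv keeps the previous
# stage's dilation, and every conv after it uses the stage's new dilation.
def block_dilations(n_blocks, prev_dilation, dilation):
    return [(prev_dilation, dilation)] + [(dilation, dilation)] * (n_blocks - 1)

print(block_dilations(2, prev_dilation=1, dilation=2))  # [(1, 2), (2, 2)]
```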

Also, in networks that support this sort of dilation, I often see it implemented at the model-creation interface with an output_stride=x arg, instead of the user needing to know what to specify where. It’s fairly straightforward to then compute the list of dilations from that. Supporting output strides of 8, 16, and the default of 32 is most common.
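Computing the per-stage strides and dilations from output_stride might look like this (a sketch assuming a standard resnet layout of stem stride 4 and four stages with default strides [1, 2, 2, 2]; all names are hypothetical):

```python
# Turn a requested output_stride (8, 16, or the default 32) into a
# (stride, dilation) pair per stage: once the running stride would exceed
# the target, convert further striding into dilation instead.
def stage_plan(output_stride=32, stem_stride=4, default_strides=(1, 2, 2, 2)):
    current, dilation, plan = stem_stride, 1, []
    for s in default_strides:
        if current * s > output_stride:
            dilation *= s          # freeze resolution, dilate instead
            plan.append((1, dilation))
        else:
            current *= s
            plan.append((s, dilation))
    return plan

print(stage_plan(8))   # [(1, 1), (2, 1), (1, 2), (1, 4)]
print(stage_plan(16))  # [(1, 1), (2, 1), (2, 1), (1, 2)]
```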


A good test is to verify that the accuracy of a pretrained classification model (with adaptive global pooling, of course) does not degrade when you apply the dilations. It’ll run slower and use more GPU memory, but should be pretty close to the original accuracy, +/- a bit (usually +).


Thanks a lot for the PR @j.laute, and for the comments @rwightman.

Ross knows much more about this than me, so I’m happy when he’s happy :slight_smile:


Thanks for the feedback!

Let me repeat it back so I am sure that I understand fully:
Suppose we have an xresnet18 with 2 resblocks per stage, i.e. layers=[2,2,2,2] and dilations=[1,2,4,8].

With the current code it would look like this:

  1. stage:
    1. block
      • conv1: stride=2, dilation=1
      • conv2: stride=1, dilation=1
    2. block
      • conv1: stride=1, dilation=1
      • conv2: stride=1, dilation=1
  2. stage:
    1. block
      • conv1: stride=2, dilation=2
      • conv2: stride=1, dilation=2
    2. block
      • conv1: stride=1, dilation=2
      • conv2: stride=1, dilation=2
  3. stage:
    1. block
      • conv1: stride=2, dilation=4
      • conv2: stride=1, dilation=4
    2. block
      • conv1: stride=1, dilation=4
      • conv2: stride=1, dilation=4
  4. stage:
    1. block
      • conv1: stride=2, dilation=8
      • conv2: stride=1, dilation=8
    2. block
      • conv1: stride=1, dilation=8
      • conv2: stride=1, dilation=8

and you are suggesting that it should look like this:

  1. stage:
    1. block
      • conv1: stride=2, dilation=1
      • conv2: stride=1, dilation=1
    2. block
      • conv1: stride=1, dilation=1
      • conv2: stride=1, dilation=1
  2. stage:
    1. block
      • conv1: stride=2, dilation=1
      • conv2: stride=1, dilation=2
    2. block
      • conv1: stride=1, dilation=2
      • conv2: stride=1, dilation=2
  3. stage:
    1. block
      • conv1: stride=2, dilation=2
      • conv2: stride=1, dilation=4
    2. block
      • conv1: stride=1, dilation=4
      • conv2: stride=1, dilation=4
  4. stage:
    1. block
      • conv1: stride=2, dilation=4
      • conv2: stride=1, dilation=8
    2. block
      • conv1: stride=1, dilation=8
      • conv2: stride=1, dilation=8

The output stride as a parameter I understand and will implement.

I just checked the official implementation here and saw that they also remove the stride 2 at the beginning of the blocks that use dilation; should I do that as well?
Also, could you explain (or share a resource that explains) why the first conv should have the dilation of the previous stage?
Thanks for the help :slight_smile:

@j.laute Yes, the second part of your example is correct, but with stride=2 changed to stride=1 in the blocks where you apply dilation (as you say in the follow-up). Typically I’ve only seen stages 3 & 4 modified, for a total network stride of 8 or 16. What you show is a stride-4 network; still valid, but not very common, and it’d have very large feature maps.

So we would have for a total network stride of 8:
stages 1 & 2 unmodified, and then

  3. stage:
    1. block
      • conv1: stride=1, dilation=1
      • conv2: stride=1, dilation=2
    2. block
      • conv1: stride=1, dilation=2
      • conv2: stride=1, dilation=2
  4. stage:
    1. block
      • conv1: stride=1, dilation=2
      • conv2: stride=1, dilation=4
    2. block
      • conv1: stride=1, dilation=4
      • conv2: stride=1, dilation=4

and for a network stride of 16 just stage 4 modified.

This means the maximum dilation used would be 4, as in the example above, unless the network had more stages. Is that right?

Yes, that’s correct.

A stride-4 network that changes stage 2, or even a stride-2 one (edit: whoops, stage 1 already has stride 1, so that would mean modifying the stem as well as the stages) would still be valid. I just don’t see it very often, but it definitely wouldn’t be wrong to support them. DeepLab typically offers an output_stride=8 or 16 choice for the backbone. I think RetinaNets sometimes use an output_stride=16 config. Many networks based on FPN don’t alter the stride of the backbone and just tap each stage at its default stride.

@rwightman @jeremy I implemented the feedback here and created a Colab notebook with some initial examples here.

Apart from some issues with weight loading and saving, I noticed two problems:

  1. Unlike what you suggested (if I understood you correctly), there is a massive drop in accuracy when running inference at a different output stride than the model was trained on (see notebook); fine-tuning with dilation seems to give similar/slightly better results in my initial experiments.

  2. How should we handle xresnets with more than 4 stages, i.e. the “xresnetxx_deep” and “xresnet_deeper” versions? They have output strides of 128 and 512 respectively, achieved by adding stages with just one resblock.
    For example, xresnet34 has [3, 4, 6, 3] resblocks, xresnet34_deep has [3,4,6,3,1,1] and xresnet34_deeper has [3,4,6,3,1,1,1,1].
    This implementation supports the unchanged case, but I am unsure how to handle these models.
    edit: not a priority at this moment

In the meantime I will run more experiments (segmentation, detection) and think about how to fix the weight loading issues.

Thanks again for helping with this :slight_smile:


I wouldn’t worry about them too much - they’re experimental ideas and not everything works well with them.

Thanks, I will not worry about it then.
These models are not broken by the default settings.


I don’t have time to check the code right now, but two things…

  1. the weights should remain 100% compatible
  2. the classification accuracy of a pretrained model should not change much

For reference, a few days back I did a quick ResNet34 check on one of my own models + pretrained classification weights with this scheme implemented: output_stride=32 (no dilation), 75.1%; output_stride=16, 74.94%; output_stride=8, 75.17%.


The reason the weights aren’t 100% compatible (they need some renaming) is that, unlike the original resnet, which uses a 1x1 stride-2 convolution on the identity path in the first block of a stage, the xresnet uses a 2x2 average pooling followed by a 1x1 stride-1 conv in these blocks, with the average pooling only present if that block has stride 2. So the name change is “idpath.1.conv” -> “idpath.0.conv”.
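The rename can be done with a simple key remap on the state dict before loading (a sketch; the example key below is hypothetical):

```python
# Move the identity-path conv from index 1 to index 0 for blocks where the
# stride-2 AvgPool2d was dropped, so pretrained weights still load.
def remap_idpath_keys(state_dict):
    return {k.replace('idpath.1.conv', 'idpath.0.conv'): v
            for k, v in state_dict.items()}

old = {'layers.1.0.idpath.1.conv.weight': 'W'}
print(remap_idpath_keys(old))  # {'layers.1.0.idpath.0.conv.weight': 'W'}
```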

I will try my implementation with a standard resnet and report results.

Right, the ‘D’-style shortcut that xresnet uses makes this more challenging. My impl does support weight compatibility with that shortcut, but I haven’t figured out a way around the performance loss with that shortcut present. To avoid the weight-compatibility issue, you can replace the AvgPool2d with an identity block when dilation != 1.
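The identity-block swap could look like this for the formerly-strided blocks (a hypothetical constructor, assuming the identity path is an nn.Sequential of [pool, 1x1 conv]); since nn.Identity holds no parameters, the conv keeps its module index and the state_dict keys stay compatible:

```python
import torch.nn as nn

# Keep the AvgPool2d only when the block actually strides; when dilation
# disables the stride, substitute nn.Identity so module indices (and
# therefore pretrained weight names) stay the same.
def make_idpath(ni, nf, stride, dilation):
    pool = nn.AvgPool2d(2, ceil_mode=True) if stride == 2 and dilation == 1 \
           else nn.Identity()
    return nn.Sequential(pool, nn.Conv2d(ni, nf, 1, bias=False))
```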

The 2x2 pooling kernel makes it challenging to alter the pooling stride, as the padding needs to be asymmetric to maintain matching dimensions between the shortcut and the conv path. I had a thought to explore it at one point, but didn’t dig in.
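One way that asymmetric padding could be explored: pad only on the right and bottom before the 2x2 pool, so the pooled output keeps the size the conv path produces (an untested sketch of the idea; note the zero padding slightly biases the averages at the border):

```python
import torch
import torch.nn as nn

# Asymmetric (right/bottom only) padding: a 2x2 kernel at stride 1 over an
# input padded from H x W to (H+1) x (W+1) yields H x W again.
pool = nn.Sequential(nn.ZeroPad2d((0, 1, 0, 1)),   # (left, right, top, bottom)
                     nn.AvgPool2d(2, stride=1))

x = torch.randn(1, 8, 14, 14)
print(tuple(pool(x).shape))  # (1, 8, 14, 14)
```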