Introducing Convolutional Layer with a Twist

I’m putting together a notebook that hopefully explains the Convolutional Layer with a Twist to everyone. For those who have never heard of it: I recently used it on ResNet for the Imagenette/Imagewoof challenge, and it appears to work pretty well (with some surprises). Since it can replace any 3x3 convolutional layer in your model, I’d very much like to see it applied to other kinds of CNN models (detection, GANs, etc.) and datasets.

I’m trying to use fastai’s fastpages to “publish” it, and it’s still a draft. For one thing, I couldn’t find instructions for writing LaTeX, which it promises to support. [Update: I was dumb… simply using $ signs in markdown works.] (A Jupyter notebook, LaTeX, and a public comment section are a killer combination, in my view.)

I’d also like to ask for suggestions of a notebook with code that takes an image, turns it into a PyTorch tensor, passes it through a neural net, and “plots” the outcome (or a feature map) as an image. That would help me a lot. Thanks in advance!

You mean like class activation maps?

Thanks, I should try that at some point. But for this it’s probably too big a hammer.

It seems that kornia is a great library that’s equivalent to OpenCV but built on PyTorch. I think I can find an example or two there that could get me started.

Why is it “too big a hammer”? It’s already in fastai v1, and I just linked code for doing it in fastai v2.

Looks like it’s showing what part of the original image is “activated” at a certain feature map, while what I’m looking for is just showing the feature map, of a model that is simply one Conv layer (not a model at all). I’ll take a look.

It’s too big a hammer because I’m new to fastai. Nothing against it.

Updated the notebook. I simply used OpenCV’s imread, and matplotlib to plot the before and after.

The math is still to be written. But if you don’t care about that, you can already play with it (with your own image).
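
In case it’s useful, the whole image → tensor → conv → plot round trip is only a few lines. A minimal sketch (the file name 'dog.jpg' and the single random 3x3 filter are just placeholders):

import cv2
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

img = cv2.cvtColor(cv2.imread('dog.jpg'), cv2.COLOR_BGR2RGB)           # HxWx3, uint8
x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255  # 1x3xHxW, in [0,1]

conv = nn.Conv2d(3, 1, kernel_size=3, padding=1, bias=False)           # one 3x3 conv "layer"
with torch.no_grad():
    fmap = conv(x)[0, 0]                                               # the single feature map

fig, (ax0, ax1) = plt.subplots(1, 2)
ax0.imshow(img); ax0.set_title('before')
ax1.imshow(fmap.numpy(), cmap='gray'); ax1.set_title('after')
plt.show()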

Crazy times. Picking up the discussion from Imagenette/Imagewoof Leaderboards

I’d like to point out one problem that seems to happen a lot, if not always: the training loss can stay well above the validation loss. When I asked about this, people said my training wasn’t done right, but I don’t know what to do about that.

Here’s one run, size=128, epochs=80, lr=4e-3, mixup=0.5

data path   /root/.fastai/data/imagewoof2
Learn path /root/.fastai/data/imagewoof2
epoch	train_loss	valid_loss	accuracy	top_k_accuracy	time
0	2.161465	2.075368	0.333673	0.801985	01:49
1	1.962958	1.690776	0.466531	0.893357	01:49
2	1.873881	1.507369	0.564266	0.928735	01:48
3	1.776708	1.422561	0.602698	0.937134	01:49
4	1.692935	1.365374	0.629677	0.946806	01:48
5	1.627976	1.234447	0.697378	0.958514	01:50
6	1.584895	1.212052	0.705523	0.969967	01:50
7	1.551299	1.135982	0.740646	0.966149	01:49
8	1.511394	1.122625	0.745737	0.965386	01:50
9	1.484772	1.075874	0.766353	0.969203	01:50
10	1.474622	1.066350	0.772970	0.972767	01:50
11	1.456863	1.008994	0.796131	0.976839	01:51
12	1.420721	0.993462	0.804276	0.976584	01:51
13	1.395644	0.980982	0.804276	0.975821	01:50
14	1.376894	0.972562	0.804276	0.979384	01:48
15	1.348262	0.946901	0.825655	0.975057	01:49
16	1.354564	0.976460	0.808348	0.976839	01:48
17	1.348575	0.980201	0.803512	0.979639	01:47
18	1.340504	0.970298	0.809621	0.975057	01:48
19	1.311690	0.929736	0.831255	0.976330	01:48
20	1.291686	0.924482	0.836854	0.978112	01:48
21	1.306390	0.945454	0.828710	0.972003	01:48
22	1.275748	0.917969	0.833800	0.981166	01:49
23	1.278674	0.901997	0.841435	0.979893	01:49
24	1.276080	0.921529	0.826928	0.980148	01:49
25	1.268407	0.921693	0.832273	0.981420	01:49
26	1.238546	0.900284	0.838381	0.982438	01:49
27	1.241901	0.884923	0.851616	0.979893	01:50
28	1.232780	0.900848	0.838890	0.981929	01:50
29	1.222811	0.894986	0.841181	0.978875	01:49
30	1.211469	0.905154	0.839145	0.980911	01:48
31	1.221968	0.933286	0.834818	0.977348	01:49
32	1.232239	0.889966	0.851616	0.979639	01:50
33	1.218874	0.894085	0.852634	0.978112	01:49
34	1.196597	0.892226	0.847544	0.980911	01:50
35	1.188675	0.891138	0.849071	0.978366	01:49
36	1.183693	0.879083	0.849071	0.979384	01:49
37	1.173107	0.890478	0.847544	0.980402	01:48
38	1.171805	0.887006	0.850598	0.977602	01:48
39	1.171755	0.880086	0.859252	0.976839	01:47
40	1.182200	0.891929	0.835582	0.980911	01:47
41	1.172583	0.871285	0.862051	0.977857	01:46
42	1.167472	0.890622	0.846780	0.980402	01:47
43	1.152007	0.890563	0.843217	0.983202	01:47
44	1.161367	0.880082	0.855688	0.979893	01:48
45	1.136539	0.867136	0.855434	0.980148	01:49
46	1.162885	0.863992	0.856961	0.982438	01:49
47	1.150314	0.893648	0.849835	0.979639	01:47
48	1.150542	0.906185	0.840672	0.977602	01:48
49	1.132196	0.868540	0.858997	0.978366	01:49
50	1.138945	0.873602	0.859506	0.975566	01:49
51	1.141462	0.882345	0.849835	0.975821	01:49
52	1.123753	0.884568	0.846526	0.977348	01:49
53	1.131320	0.878340	0.850853	0.978875	01:50
54	1.102213	0.879064	0.854161	0.978112	01:51
55	1.121324	0.876642	0.855434	0.979893	01:50
56	1.129894	0.868407	0.846526	0.980657	01:49
57	1.103063	0.864340	0.858997	0.979130	01:49
58	1.137609	0.870569	0.849835	0.979639	01:49
59	1.094992	0.869274	0.860779	0.978621	01:49
60	1.099305	0.858406	0.862306	0.977093	01:49
61	1.123867	0.855571	0.864597	0.978621	01:48
62	1.081016	0.867963	0.860524	0.976075	01:48
63	1.079784	0.842200	0.862560	0.982184	01:49
64	1.093885	0.847684	0.860779	0.980148	01:50
65	1.064888	0.839558	0.868669	0.980911	01:50
66	1.076255	0.846076	0.864851	0.974294	01:50
67	1.061109	0.827110	0.874523	0.978621	01:50
68	1.054539	0.831806	0.870196	0.975821	01:50
69	1.040568	0.823592	0.871723	0.981166	01:49
70	1.049583	0.828491	0.870705	0.980148	01:49
71	1.030693	0.818572	0.874268	0.980911	01:49
72	1.041651	0.820911	0.876050	0.979130	01:49
73	1.015236	0.818820	0.876559	0.980911	01:49
74	1.022821	0.813191	0.877068	0.981675	01:49
75	1.029228	0.804716	0.880886	0.983202	01:50
76	1.019998	0.803970	0.879613	0.981675	01:52
77	1.021127	0.803941	0.883431	0.982184	01:52
78	1.025089	0.801169	0.881395	0.982184	01:51
79	1.011409	0.803271	0.881904	0.981675	01:50

I guess my question would be, for those who have done lots of training (from scratch), how often do you see train_loss > valid_loss? Is that a bad sign and how do you “fix” it?

Did you compare your setup to a standard setup?

Where is your setup different from the standard setup? (This should be the best hint on where to start looking.)

Are there differences between the train and valid phase? (E.g., with MixUp you can see a similar change in the loss, if MixUp is only active during training, which should be the case.)

Did you do a lot of hyperparameter tuning and maybe overfit to your validation set?

How big is your validation set?

What happens when you use another train/valid split?

You are using a very high mixup value, so you are augmenting hard. If you compare training with augmentation vs. without it, you will usually see the same picture. If your training loss is still higher than the validation loss, it means you can train longer. Or, if you want a higher result on this setting, reduce mixup.
Is it Imagenette or Imagewoof?

Thank you both for the feedback. I used @a_yasyrev’s last notebook and ran it for 80 epochs. I’ll try it with a lower mixup.

It’s Imagewoof2, so I thought that already had a fixed train/valid split.

Am I correct that mixup does not do “rotation” of the images? I think I tried to manually add rotation in transforms (no mixup), but didn’t see much improvement.

I think that’s right; with mixup it is reasonable that train_loss > valid_loss.
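
For anyone else puzzled by the gap: mixup blends pairs of images and their targets during training only, so the training loss is measured against soft, mixed targets while the validation loss is measured against clean one-hot targets. A rough sketch of the idea (not fastai’s exact implementation):

import torch

def mixup_batch(x, y_onehot, alpha=0.5):
    # blend each sample with a randomly chosen partner from the same batch
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]                  # mixed images
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]    # mixed (soft) targets
    return x_mix, y_mix

# train: loss is computed against y_mix, which is never exactly one-hot, so it stays higher
# valid: mixup is off, loss is computed against the clean targets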

I did a run with mixup=0.2, and the losses are like this:

[loss plot]

Last five epochs:

epoch	train_loss	valid_loss	accuracy	top_k_accuracy	time
75	0.806449	0.819721	0.878086	0.979384	01:40
76	0.811659	0.817289	0.878595	0.978112	01:40
77	0.817209	0.812791	0.881649	0.979384	01:40
78	0.810033	0.815445	0.877832	0.979130	01:39
79	0.816474	0.812994	0.879868	0.979130	01:40 

Maybe with ConvTwist, mixup as high as 0.5 is justified?

Try different variants. If 0.2 is not enough, try a higher rate.
Very good results!

I did a 200-epoch run, but the result is no better than the 80-epoch runs above. I also wanted to see what the centers are at the end of training.

[Update: that was with mixup=0.2. I then did a 200-epoch run with mixup=0.5 and reached 88.65 or 88.83 (but without ConvTwist it also reached 88.57 and 88.80, compared to 87.20 on the leaderboard). But I just found out that the stem part of ResNet50 didn’t use ConvTwist.]

Well, I’ve been making a lot of changes, but haven’t seen definitive improvements…

(I’ve only been testing on Woof2, size=128, epochs=80.)

First, for each of the additional conv operations that I said cost 4 parameters per filter, I can reduce it to 2 parameters without seeing much difference. And for long runs, it seems that I can do away with the “centers”. I also added an iters parameter that runs the same ConvTwist layer multiple times. Then I also played with the groups parameter in Conv2d (aka cardinality in ResNeXt), which reduces the overall model size without sacrificing performance. (Maybe that’s because ResNet50 was designed for ImageNet, and for a 10-class dataset a much smaller network can work just as well. If that hypothesis is true, one could randomly “freeze” half of the connections at the outset and achieve the same accuracy.)
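
To make the size reduction concrete, here is a minimal sketch (the channel count of 256 and groups=32, i.e. groups of 8 channels, are just illustrative, not the actual widths in the model) comparing the weight count of a full 3x3 conv with a grouped one:

import torch.nn as nn

full    = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)             # 256*256*3*3 = 589,824 weights
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False, groups=32)  # 256*(256//32)*3*3 = 18,432 weights

print(sum(p.numel() for p in full.parameters()), sum(p.numel() for p in grouped.parameters()))
# 589824 18432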

What I feel could be a possible direction is to tweak the “block pattern”. Not the block one normally means in a neural network, but the block as in “block-diagonal” matrix. With groups=2 the kernel is like a 2x2 block diagonal matrix. What if we make it off-diagonal?

Here’s the new version of ConvTwist

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTwist(nn.Module):  # replacing 3x3 Conv2d
    def __init__(self, ni, nf, stride=1, groups=2, iters=4):
        super(ConvTwist, self).__init__()
        self.twist = True
        self.same = ni==nf and stride==1
        if not (ni%groups==0 and nf%groups==0): groups = 1
        elif ni%64==0: groups = ni//32
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False, groups=groups)
        if self.twist:
            std = self.conv.weight.std().item()
            self.coeff_Ax = nn.Parameter(torch.empty((nf,ni//groups)).normal_(0, std), requires_grad=True)
            self.coeff_Ay = nn.Parameter(torch.empty((nf,ni//groups)).normal_(0, std), requires_grad=True)
        self.iters = iters
        self.stride = stride
        self.groups = groups

    def kernel(self, coeff_x, coeff_y):
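        # D_x and D_y are Sobel-style derivative filters in the x and y directions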
        D_x = torch.Tensor([[-1,0,1],[-2,0,2],[-1,0,1]]).to(coeff_x.device)
        D_y = torch.Tensor([[1,2,1],[0,0,0],[-1,-2,-1]]).to(coeff_x.device)
        return coeff_x[:,:,None,None] * D_x + coeff_y[:,:,None,None] * D_y

    def full_kernel(self, kernel): # permuting the groups
        if self.groups==1: return kernel
        n = self.groups
        a,b,_,_ = kernel.size()
        a //= n
        KK = torch.zeros((a*n,b*n,3,3)).to(kernel.device)
        KK[:a,-b:] = kernel[:a]
        for i in range(1,n):
            KK[a*i:a*(i+1),b*(i-1):b*i] = kernel[a*i:a*(i+1)]
        return KK

    def _conv(self, inpt, kernel=None):
        permute = True
        if kernel is None:
            kernel = self.conv.weight
        if self.groups==1 or permute==False:
            return F.conv2d(inpt, kernel, padding=1, stride=self.stride, groups=self.groups)
        else:
            return F.conv2d(inpt, self.full_kernel(kernel), padding=1, stride=self.stride, groups=1)

    def forward(self, inpt):
        out = self._conv(inpt)
        if self.twist is False:
            return out
        _,_,h,w = out.size()
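        # XX, YY: x/y coordinate grids over the feature map, scaled to roughly [-1, 1]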
        XX = torch.from_numpy(np.indices((1,1,h,w))[3]*2/w-1).type(out.dtype).to(out.device) 
        YY = torch.from_numpy(np.indices((1,1,h,w))[2]*2/h-1).type(out.dtype).to(out.device)
        kernel_x = self.kernel(self.coeff_Ax, self.coeff_Ay)
        kernel_y = kernel_x.transpose(2,3).flip(3)  # make conv_y a 90 degree rotation of conv_x
        out = out + XX * self._conv(inpt, kernel_x) + YY * self._conv(inpt, kernel_y)
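        # when input/output shapes match, apply the twisted conv repeatedly, each pass scaled by 1/iters
        # (the input is added up front and subtracted again at the end)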
        if self.same and self.iters>1:
            out = inpt + out / self.iters
            for _ in range(self.iters-1):
                out = out + (self._conv(out) + XX * self._conv(out, kernel_x) + YY * self._conv(out, kernel_y)) / self.iters
            out = out - inpt
        return out
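
A quick sanity check with random input (with ni=64, the constructor above picks groups = 64//32 = 2):

x = torch.randn(2, 64, 32, 32)
layer = ConvTwist(64, 64)
print(layer(x).shape)   # torch.Size([2, 64, 32, 32])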

And here’s the notebook that you can directly play with

As much as I’d like to see ConvTwist fare better, I’ve been testing the “permuting groups” idea further, which is mostly orthogonal to Twist. Still following the Bottleneck architecture of ResNet50 (but deeper), I split the 3x3 conv into groups of 8 channels each (which significantly reduces the model size). The permutation that seems to work well consists of “4-cycles”, if you know some basic group theory.

On size=128, epochs=80, it achieves 88.62 with layers=[4,6,8,10].

On size=192, epochs=80, it achieves 89.79 with layers=[4,6,8,10]. How does it seem to you, @a_yasyrev? You have been testing on this setting, right?

On size=192, epochs=5, it scores 80.87 on 5-run average. (The notebook above has been updated to contain the results.)

Edit: I should add that “permuting the groups” has a nice interpretation in terms of the small-world networks of Watts and Strogatz from the late '90s. (Here’s a beautiful rewrite of the paper.) The name comes from the saying “it’s a small world” when two people first meet and find they have some six-degrees connection. You don’t need everyone connected to everyone else (too costly); having small hubs (school friends, coworkers) and occasional connections between them is sufficient to produce the small-world phenomenon. In ResNet, at least for our 10-class toyset, having the convolutions fully connected on the channel level may be wasteful.

Unsurprisingly, there are DL papers that have “small world” in the title, but I can’t judge whether they are talking about the same thing. (Update: I realized “permuting the groups” is like shuffling the channels, and that reminds me of ShuffleNet.) By the way, where does one find an expert’s “review” of a paper, say a year later, that gives a fair take on whether the idea pans out or not? There is just too much stuff on arXiv.

It’s cool!
Tried the new version (still a work in progress): on size 192, 5 epochs, layers [3,4,6,3], I got 78.2, std 0.009, over 5 runs.
On [4,6,8,10] I got 79.6; anyway, it’s very good!

I think mixup 0.5 is too much! With 0.2 I got 90.20 (only 1 run yet), but without MaxBlurPool.

Great! That’s why we need more people testing 🙂

It seems that, to my surprise, groups of 4 channels can also do pretty well.

Let me illustrate the “sparse connections” and “permuting the groups” by drawing on a Game of Life board…


This 64x64 grid represents a 64x64x3x3 kernel, dividing the 64 channels into 16 groups of 4 channels each. The 16 groups are further grouped into 4 “meta-groups”, and the permutation is cyclic within each meta-group.
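
In the code above this permutation lives in the kernel (as in full_kernel); an equivalent way to think about it is shuffling the channels between grouped convs. A rough sketch of that reading, assuming the cyclic-within-meta-group pattern just described (the helper name is mine, not the notebook’s):

import torch

def permute_groups(x, groups, perm):
    # x: (N, C, H, W); split the channels into `groups` blocks and reorder the blocks by `perm`
    N, C, H, W = x.shape
    x = x.view(N, groups, C // groups, H, W)[:, perm]
    return x.reshape(N, C, H, W)

# 16 groups arranged as 4 meta-groups of 4, with a cyclic shift inside each meta-group
perm = torch.tensor([1, 2, 3, 0,  5, 6, 7, 4,  9, 10, 11, 8,  13, 14, 15, 12])
y = permute_groups(torch.randn(1, 64, 8, 8), groups=16, perm=perm)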

Next I’d like to add connections between the meta-groups, like this:

Added the Twist module to my model constructor:
https://ayasyrev.github.io/model_constructor/Twist/

A short example of how to use it and quickly modify it is here: https://github.com/ayasyrev/imagenette_experiments/blob/master/Twist_experiments.ipynb
