Help me understand Lesson 10 (Part 2)! :)

Applying BatchNorm and non-linearities in between the layers changes things a bit, but the smaller (serial) convolutions in essence “tie” some of the parameters together, similarly to, for example, hierarchical/multilevel models.

So yes, in theory larger convolutions have more “freedom”, i.e. more parameters they can learn in order to combine pixels together, but in practice smaller convolutions might very well perform as well if not better thanks to the additional non-linearity (and save some computation in the process).

I suspect that we don’t have much literature on 2x2 kernels because of the weirdness connected with their implementation! Another example off the top of my head: think about what happens when you want to reduce the image/activation size. With 3x3 convolutions you apply a stride of 2: you halve the height and width of the output layer (w.r.t. the inputs) but you still retain some overlap between consecutive kernel applications (which you want, in order to maintain some correlation between the activations)!

If you use a 2x2 kernel there’s no way to have overlapping applications that also reduce the image size (other than by 1 pixel at a time if you don’t pad one side [if you pad both sides you actually increase the size by 1])!
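To make the size bookkeeping concrete, here is a quick PyTorch sketch (the input size and kernels are arbitrary, just to show the output shapes):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 64, 64)                       # dummy single-channel image
k3 = torch.randn(1, 1, 3, 3)
k2 = torch.randn(1, 1, 2, 2)

# 3x3, stride 2, padding 1: halves H and W, and consecutive windows overlap by one pixel
print(F.conv2d(x, k3, stride=2, padding=1).shape)   # torch.Size([1, 1, 32, 32])

# 2x2, stride 2: also halves H and W, but the windows tile the image with no overlap
print(F.conv2d(x, k2, stride=2).shape)              # torch.Size([1, 1, 32, 32])

# 2x2, stride 1: windows overlap, but the size only shrinks by one pixel
print(F.conv2d(x, k2, stride=1).shape)              # torch.Size([1, 1, 63, 63])

# 2x2, stride 1, padded on both sides: the size actually grows by one pixel
print(F.conv2d(x, k2, stride=1, padding=1).shape)   # torch.Size([1, 1, 65, 65])
```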

Probably the marginal (if any) gain associated with it didn’t warrant any real-world application, contrary to the gains of 3x3 vs. the bigger 5x5 or 7x7 kernels…


Hi @radek, I was thinking that the number of weights was the only criterion for learning, but you correctly point out that depth is also a factor, through the added non-linearities.

But then, could you explain why you think that a 3x3 conv can learn “more” than two consecutive 2x2 convs?

But the very next layer in 2x2 will “do the overlapping”. e.g. instead of a single 3x3 Conv with stride 2 you would have a 2x2 Conv with stride 2 followed by a 2x2 Conv with stride 1.
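For concreteness, a rough sketch of that substitution in PyTorch; the one-sided zero-pad before the second conv is my assumption for keeping the output the same size as the stride-2 3x3 conv would give:

```python
import torch
import torch.nn as nn

# Stride-2 3x3 conv (the usual downsampling layer)
down3x3 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

# Proposed replacement: 2x2 with stride 2, then 2x2 with stride 1.
# The one-sided pad before the second conv keeps the spatial size unchanged.
down2x2 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=2, stride=2),
    nn.ZeroPad2d((0, 1, 0, 1)),          # pad right and bottom only
    nn.Conv2d(32, 32, kernel_size=2, stride=1),
)

x = torch.randn(8, 16, 64, 64)
print(down3x3(x).shape)                  # torch.Size([8, 32, 32, 32])
print(down2x2(x).shape)                  # torch.Size([8, 32, 32, 32])
```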


Without non-linearities sure, but imagine if one of the activations is zeroed by a ReLU!

I’m not arguing they have no place, but as I’ve written before I think that if you add up these ‘tricky bits’, the historical perspective, and the added computation (mainly for batchnorm; relu is quite cheap; plus the sequential computation), I can see why, even with a small reduction in weights, the community hasn’t really exploited this additional factorization.

I agree it would be interesting to see if there’s a difference, but I suspect that if there is one it won’t be very relevant… I might explore the idea a little if I find the time this week.

Also, a 2x2 kernel is very restrictive in terms of what it can represent (even if we’re dealing with continuous weights, you don’t want two kernels to pick up the same thing, just slightly scaled differently), so maybe the set of useful filters warrants added consideration at such a small kernel size, going back to point 1 of my post.


Hm… Very good point. Is it crazy to have a sequence of convolutions without a ReLU in between?

I’ll work on this rn actually, and will post results here :grin:


Not Radek :slight_smile: , but you can consider a 2x2 kernel

a b
c d

and a 3x3 kernel

a b 0
c d 0
0 0 0

They both do the same thing to your input! But then, as you start substituting the zeros in the 3x3 with other weights you can see how the 3x3 can be more general in the transformation it applies to the inputs!
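If you want to check this numerically, here is a tiny PyTorch sketch; note that the 3x3 output is one pixel smaller per side, so we compare it against the matching crop of the 2x2 output:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)
k2 = torch.randn(1, 1, 2, 2)

# Embed the 2x2 kernel in the top-left corner of a 3x3 kernel, zeros elsewhere
k3 = torch.zeros(1, 1, 3, 3)
k3[:, :, :2, :2] = k2

out2 = F.conv2d(x, k2)                   # shape (1, 1, 15, 15)
out3 = F.conv2d(x, k3)                   # shape (1, 1, 14, 14)

# At every position where both kernels fit, the results coincide
print(torch.allclose(out3, out2[:, :, :14, :14], atol=1e-6))   # True
```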

For serial applications (e.g. two 2x2 kernels vs one 3x3 kernel) the argument is slightly trickier (and it only holds exactly without non-linearities involved); to be completely rigorous we need to introduce the input x into the equations (I might write it up later tonight), but in short you can already see that the second 2x2 kernel introduces 4 more parameters, so the 3x3 still “wins” by one free parameter!


That is a very nice observation :slight_smile:

My assumption is that people have experimented with various kernel sizes and that 3x3 turned out to be the best, at least on Imagenet and within the context of modern architectures.

Let’s define our quantities! The input patch is

$$X = \begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}$$

a 3x3 kernel is made of

$$A = \begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{pmatrix}$$

and the two 2x2 kernels are

$$B = \begin{pmatrix} b_1 & b_2 \\ b_3 & b_4 \end{pmatrix} \qquad C = \begin{pmatrix} c_1 & c_2 \\ c_3 & c_4 \end{pmatrix}$$

Now, let’s apply the 3x3 convolution to the input; the result is

$$y_A = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5 + a_6 x_6 + a_7 x_7 + a_8 x_8 + a_9 x_9$$

When instead we apply the 2x2 convolutions in series, we get first

$$Z = \begin{pmatrix} z_1 & z_2 \\ z_3 & z_4 \end{pmatrix} = \begin{pmatrix} b_1 x_1 + b_2 x_2 + b_3 x_4 + b_4 x_5 & \; b_1 x_2 + b_2 x_3 + b_3 x_5 + b_4 x_6 \\ b_1 x_4 + b_2 x_5 + b_3 x_7 + b_4 x_8 & \; b_1 x_5 + b_2 x_6 + b_3 x_8 + b_4 x_9 \end{pmatrix}$$

by applying the first kernel to the input, and then

$$y_{BC} = c_1 z_1 + c_2 z_2 + c_3 z_3 + c_4 z_4$$

by applying the second kernel to it!

If we rearrange things a bit we can see that we can collect the x's into

$$y_{BC} = c_1 b_1 x_1 + (c_1 b_2 + c_2 b_1) x_2 + c_2 b_2 x_3 + (c_1 b_3 + c_3 b_1) x_4 + (c_1 b_4 + c_2 b_3 + c_3 b_2 + c_4 b_1) x_5 + (c_2 b_4 + c_4 b_2) x_6 + c_3 b_3 x_7 + (c_3 b_4 + c_4 b_3) x_8 + c_4 b_4 x_9$$

and this is what we compare with the initial application of the 3x3 to the input (recall that it was this):

$$y_A = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5 + a_6 x_6 + a_7 x_7 + a_8 x_8 + a_9 x_9$$

If we equate the two results we can see that for them to be equivalent we must have

$$\begin{aligned} a_1 &= c_1 b_1 & a_2 &= c_1 b_2 + c_2 b_1 & a_3 &= c_2 b_2 \\ a_4 &= c_1 b_3 + c_3 b_1 & a_5 &= c_1 b_4 + c_2 b_3 + c_3 b_2 + c_4 b_1 & a_6 &= c_2 b_4 + c_4 b_2 \\ a_7 &= c_3 b_3 & a_8 &= c_3 b_4 + c_4 b_3 & a_9 &= c_4 b_4 \end{aligned}$$

You can easily verify that when you try to solve this with respect to the kernel A (considering all of B and C as given), it is trivially a system of 9 equations in 9 unknowns: each equation directly gives you one of the values of A!

(In layman’s terms: the 3x3 convolution can exactly reproduce the result of the two 2x2 convolutions if we do not take the non-linearities into account.)

The problem comes when you try to solve it the other way around, with respect to B and C! Unless we restrict some values of A (exactly one, to be precise), this is a system of 9 equations in 8 (!!!) unknowns, the values $b_1, \dots, b_4, c_1, \dots, c_4$, so in general the system has no solution!

Again, informally speaking, this means that (non-linearities notwithstanding) the two 2x2 kernels cannot in general reproduce exactly the result of a single 3x3 convolution!

The same of course applies to a 5x5 versus two 3x3 kernels, and we know that in practice we often get satisfactory results anyway, but this is how you prove “formally” that a single bigger kernel is in general more expressive.
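If you prefer a numerical sanity check of the “easy” direction, here is a small PyTorch sketch that builds the equivalent 3x3 kernel A from two given 2x2 kernels B and C using the relations above, and verifies the outputs coincide (no non-linearities in between):

```python
import torch
import torch.nn.functional as F

b = torch.randn(2, 2)
c = torch.randn(2, 2)

# Build the equivalent 3x3 kernel: a[m, n] = sum of c[r, s] * b[p, q] over p + r = m, q + s = n.
# PyTorch's conv2d is a cross-correlation, which matches the indexing used above.
a = torch.zeros(3, 3)
for p in range(2):
    for q in range(2):
        for r in range(2):
            for s in range(2):
                a[p + r, q + s] += c[r, s] * b[p, q]

x = torch.randn(1, 1, 8, 8)
serial = F.conv2d(F.conv2d(x, b.view(1, 1, 2, 2)), c.view(1, 1, 2, 2))  # 2x2 then 2x2
single = F.conv2d(x, a.view(1, 1, 3, 3))                                # one 3x3
print(torch.allclose(serial, single, atol=1e-5))                        # True
```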

I hope that’s clear! :slight_smile:

EDIT: argh! equations are not rendered! Give me a sec while I work on a solution! :stuck_out_tongue:
EDIT2: saved? yes! Thanks codecogs


This is extremely interesting. I just finished the experiments on this and the results are spot on with what you just described. You can find the code here.

I ran this experiment on the Imagewoof dataset; the baseline accuracy I get with the standard 3x3 convs is 0.65.

I first tried to simply replace all 3x3 convs with 2x2 convs; the accuracy (not surprisingly) went down to 0.60.

The next step was to replace all 3x3 convs with two 2x2 convs with a non-linearity after every conv. I got a result close to 0.60 again.

For the final step, I replaced all 3x3 convs with two 2x2 convs, but with an activation only after the second 2x2 conv, and then… exactly 0.65 accuracy, exactly on point with your prediction above.
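I haven’t checked how the linked code implements it, but a minimal sketch of the kind of block described (two 2x2 convs, activation only after the second) could look like this in plain PyTorch; the one-sided padding and the BatchNorm placement are assumptions on my part:

```python
import torch.nn as nn

def conv_2x2_pair(ni, nf, stride=1):
    """Stand-in for a 3x3 conv layer: two 2x2 convs, non-linearity only after the second."""
    layers = []
    if stride == 1:
        layers.append(nn.ZeroPad2d((0, 1, 0, 1)))     # one-sided pad keeps the spatial size
    layers += [
        nn.Conv2d(ni, nf, kernel_size=2, stride=stride, bias=False),  # no ReLU here
        nn.ZeroPad2d((0, 1, 0, 1)),
        nn.Conv2d(nf, nf, kernel_size=2, stride=1, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True),                        # the only activation in the block
    ]
    return nn.Sequential(*layers)

# Usage sketch: replaces what would otherwise be a 3x3 conv layer, e.g. 64 -> 128, stride 2
# block = conv_2x2_pair(64, 128, stride=2)
```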

Wow, this is incredible… Very very good work @marco_b


Thanks for running the experiments!

I’m glad some mathematical intuition still holds in this field ! :rofl:


Au contraire… The 3x3 loses by one free parameter. A sequence of two consecutive 2x2 convolutions can encode the spatial dependence within a 3x3 receptive field with only eight parameters, while a 3x3 convolution needs nine parameters to do the same job!
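The count itself is easy to verify with a throwaway snippet (single input/output channel, no bias):

```python
import torch.nn as nn

one_3x3 = nn.Conv2d(1, 1, kernel_size=3, bias=False)
two_2x2 = nn.Sequential(nn.Conv2d(1, 1, 2, bias=False), nn.Conv2d(1, 1, 2, bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_3x3), count(two_2x2))   # 9 vs 8 weights for the same 3x3 receptive field
```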

Sorry, the math did not render† well on my Windows 10 machine running Firefox or on my Android.

But I do understand what you did, and it’s quite a nice demonstration (math is beautiful!). The bottom line is that the nine parameters of the 3x3 filter can be written directly in terms of the eight parameters of the two 2x2 filters, but the eight parameters of the two 2x2 filters cannot in general be recovered from a given 3x3 filter (that direction yields an overdetermined system of nine equations in eight unknowns).

More importantly, the result of @lgvaz’s excellent work comparing implementations of the two schemes confirms the intuition that two 2x2 convolutions in series can do the same job as the 3x3 convolution, but with about 11% fewer weights. This is essentially the point I was making to @radek.

† You might try typesetting the LaTeX equations directly into the message, i.e. not using a 3rd-party app like codecogs.

For example, the following code
$T = \frac{T_0}{\sqrt{1-\frac{v^2}{c^2}}}$,

produces the following output (when you strip off the enclosing ``)
T = \frac{T_0}{\sqrt{1-\frac{v^2}{c^2}}}

There are many latex cheatsheets online that’ll get you up and running quickly. Here’s one.


Interestingly enough doing two 2x2 convolutions in series takes more time to train.


Ah yes, this is what I’d expected in my comments to @radek. The single 3x3 conv is computed in one parallel pass on the GPU, whereas a serial application of 2x2 convs introduces a sequential dependency and slows things down (by how much, incidentally?)


The 3x3 takes 56s on average on my GPU, while the 2x2 version takes 71s, so roughly 27% slower.
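For anyone who wants a rough forward-pass comparison on their own hardware, here is a sketch; the batch size, channel counts, and spatial size are arbitrary, and this of course ignores the rest of the training loop:

```python
import time
import torch
import torch.nn as nn

dev = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(64, 64, 56, 56, device=dev)

conv3x3 = nn.Conv2d(64, 64, 3, padding=1, bias=False).to(dev)
pad = nn.ZeroPad2d((0, 1, 0, 1))
conv2x2_a = nn.Conv2d(64, 64, 2, bias=False).to(dev)
conv2x2_b = nn.Conv2d(64, 64, 2, bias=False).to(dev)

def bench(fn, n=50):
    with torch.no_grad():
        fn()                                  # warm-up
        if dev == 'cuda':
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(n):
            fn()
        if dev == 'cuda':
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / n

print('3x3       :', bench(lambda: conv3x3(x)))
print('2x2 -> 2x2:', bench(lambda: conv2x2_b(pad(conv2x2_a(pad(x))))))
```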


So perhaps that’s why there isn’t widespread use of sequences of 2x2 convs?

Great thread!
I've been doing some experiments with resnet / xresnet, trying to understand it better. I refactored it a lot and had some interesting thoughts, and then Radek started this!
I tried modifying resnet and found that the best results (with Imagewoof) come when the act_fn is placed before BatchNorm. Same with the stem, where the best result is when the stem's BatchNorm comes after the maxpool.
Then I took the stem from xresnet, but the results were not as good. It works better with the xresnet stem (3x3 convs) with ReLU before BatchNorm and the last BatchNorm after the maxpool: that is as good as the base 7x7 conv stem, but no better.
I'll check on longer runs.
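Just to make sure I'm reading that ordering right, here is a sketch of my interpretation (channel sizes are only illustrative): 3x3 convs with ReLU placed before BatchNorm, and the stem's last BatchNorm moved to after the maxpool.

```python
import torch.nn as nn

def conv_act_bn(ni, nf, stride=1):
    # 3x3 conv with the activation placed *before* BatchNorm, as described above
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(nf),
    )

# xresnet-style 3x3 stem, but with the last BatchNorm moved to after the maxpool
stem = nn.Sequential(
    conv_act_bn(3, 32, stride=2),
    conv_act_bn(32, 32),
    nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(64),
)
```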

That’s what I did at first, but the subset of LaTeX supported by the Markdown implementation of this forum is apparently not enough to render my equations; that’s why I had to switch to codecogs. They’re simple GIFs now, so they should render on any device… Maybe you looked at the post before I edited it and switched to the external tool?

By the way, it’s not like codecogs is a speech/handwriting-recognition app: I still had to type the LaTeX in there, so I’m not sure what you mean by suggesting a LaTeX cheat sheet :sweat_smile:


Back on topic,

The same could be said for two 3x3 convs versus a 5x5 conv as well, though. Most likely it’s a combination of reasons…

I think point 2 is probably “solved” by now. I still don’t understand point 1, a.k.a. “you’re just shuffling the numbers”, but unless proven wrong I’ll just assume that what he meant was similar to what I expressed here,

so in the end we do agree even on this topic; it was just expressed in a way that made it hard for me to really see it!

The equations still do not render in Firefox, either on Android or Windows 10 64-bit. I also checked the Chrome browser on Android, with the same result.

They render fine for me both on desktop and Android

They are now embedded images, so there’s no reason other than a weird / overaggressive ad blocker or similar that they would not render… they’re GIFs! I’ve sent you a PM since this is probably going too off-topic otherwise :slight_smile: