Help me understand Lesson 10 (Part 2)! :)

marco_b · November 19, 2019, 6:47pm

Not Radek , but you can consider a 2x2 kernel

a b
c d

and a 3x3 kernel

a b 0
c d 0
0 0 0

They both do the same thing to your input! But then, as you start substituting the zeros in the 3x3 with other weights you can see how the 3x3 can be more general in the transformation it applies to the inputs!

For serial applications (e.g. 2 2x2 kernels vs 1 3x3 kernel) the argument is slightly trickier (and totally possible only without non-linearities involved) but we need to introduce the input x in the equation to be completely rigorous (I might write it up later tonight), but in synthesis you can still see that the second 2x2 kernel will introduce 4 other parameters, so the 3x3 still “wins” by one free parameter!

radek · November 19, 2019, 8:30pm

That is a very nice observation

My assumption is that people have experimented with various kernel sizes and that 3x3 turned out to be the best, at least on Imagenet and within the context of modern architectures.

marco_b · November 19, 2019, 9:11pm

Let’s define our quantities! The input patch is

$X = \begin{pmatrix} x_1 & x_2 & x_3 \ x_4 & x_5 & x_6 \ x_7 & x_8 & x_9 \end{pmatrix}$

a 3x3 kernel is made of

$A = \begin{pmatrix} a_1 & a_2 & a_3 \ a_4 & a_5 & a_6 \ a_7 & a_8 & a_9 \end{pmatrix}$

and the two 2x2 kernels are

$B= \begin{pmatrix} b_1 & b_2 \ b_3 & b_4 \end{pmatrix}$$ $$C= \begin{pmatrix} c_1 & c_2 \ c_3 & c_4 \end{pmatrix}$

Now, let’s apply the 3x3 convolution to the input, the result is

When instead we apply the 2x2 convolution in serie, we get first

$\begin{pmatrix} b_1 x_1 + b_2 x_2 + b_3 x_4 + b_4 x_5 & b_1 x_2 + b_2 x_3 + b_3 x_5 + b_4 x_6 \ b_1 x_4 + b_2 x_5 + b_3 x_7 + b_4 x_8 & b_1 x_5 + b_2 x_6 + b_3 x_8 + b_4 x_9 \end{pmatrix}$

by applying the first kernel to the input and then

$c_1 \left( b_1 x_1 + b_2 x_2 + b_3 x_4 + b_4 x_5 \right ) + c_2 \left( b_1 x_2 + b_2 x_3 + b_3 x_5 + b_4 x_6 \right )+ c_3 \left(b_1 x_4 + b_2 x_5 + b_3 x_7 + b_4 x_8 \right) + c_4 \left( b_1 x_5 + b_2 x_6 + b_3 x_8 + b_4 x_9 \right)$

by applying the second kernel to it!

If we rearrange things a bit we can see that we can collect the x s into

and so this is where we compare this result with the initial application of the 3x3 to the inputs (recall that it was this below)

If we equate the two results we can see that for them to be equivalent we must have

$\begin{cases} (c_1 b_1) = a_1 \ (c_1 b_2 + c_2 b_1) = a_2 \ (c_2 b_2) = x_3 \ (c_1 b_3+c_3 b_1) = a_4 \ (c_1 b_4 + c_2 b_3 + c_3 b_2 + c_4 b_1) = a_5 \ (c_2 b_4+c_4 b_2) = a_6 \ (c_3 b_3) = a_7 \ (c_3 b_4 +c_4 b_3) = a_8 \ (c_4 b_4) = a_9 \end{cases}$

but you can verify easily that when you try to solve this wrt to the kernel A (considering all of B abc C as given), this is trivially a system of 9 equations in 9 unknowns. I’ve laid out all you need to find the values of A!

(In layman terms, the 3x3 convolution can exactly reproduce the result of the two 2x2 convolutions if we do not take the non-linearities into account)

The problem comes from when you try to solve it the other way around, wrt to B and C! Unless we restrict some values of A, exactly one to be precise, this is a system of 9 equations in 8 (!!!) unnknowns, the values b_1, ... b_4, c_1, ... c_4, so the system has no solution!

Again, informally speaking this means that (non-linearity notwithstanding) the two 2x2 kernel cannot in general reproduce exaclty the result of a single 3x3 convolution!

The same of course apply to 5x5 versus two 3x3 kernels though, and we know that in practice often we get satisfactory results as well, but this is how you prove “formally” that a single bigger kernel is in general more expressive.

I hope that clear!

EDIT: argh! equations are not rendered! Give me a sec while I work on a solution!
EDIT2: saved? yes! Thanks codecogs

lgvaz · November 19, 2019, 9:32pm

This is extremely interesting. I just finished the experiments on this and the results are spot on on what you just described. You can find the code here.

I run this experiment under the imagewoof dataset, the base accuracy I get by running the 3x3 convs is 0.65.

I first tried to just replace all 3x3 convs with 2x2 convs, the accuracy (not surprisingly) went down to 0.60.

Next step was to replace all 3x3 convs with two 2x2 convs having a non-linearity after every conv. Got a result close to 0.60 again.

For the final step, I replaced all 3x3 convs with two 2x2 convs, but only the second 2x2 conv had a activation, and then… Exactly 0.65 accuracy, exactly on point with:

Wow, this is incredible… Very very good work @marco_b

marco_b · November 19, 2019, 10:17pm

Thanks for running the experiments!

I’m glad some mathematical intuition still holds in this field !

jcatanza · November 19, 2019, 11:08pm

Au contraire… The 3x3 loses by one free parameter. A sequence of two consecutive 2x2 convolutions can encode the spatial dependence within a 3x3 receptive field with only eight parameters, while a 3x3 convolution needs nine parameters to do the same job!

jcatanza · November 19, 2019, 11:23pm

Sorry, the math did not render ^{\dagger} well on my Windows 10 machine running Firefox or on my Android.

But I do understand what you did, and it’s quite a nice demonstration (Math is beautiful!) The bottom line is that the eight parameters in the two 2x2 filters can be determined in terms of the nine parameters in the 3x3 filter (which yields an overdetermined system of equations), but the nine parameters of the 3x3 filter cannot be recovered in terms of the eight parameters of the two 2x2 filters (which yields an underdetermined system of equations).

More importantly, the result of @lgvaz’ excellent work comparing implementations of the two systems confirms the intuition that two 2x2 convolutions in series does the same job as the 3x3 convolution, but with 11% less memory required. This is essentially the point I was making to @radek

^{\dagger} You might try typesetting the latex equations directly into the message, i.e. not using a 3rd party app like codecogs.

For example, the following code
$T = \frac{T_0}{\sqrt{1-\frac{v^2}{c^2}}}$ ,

produces the following output (when you strip off the enclosing ``)
T = \frac{T_0}{\sqrt{1-\frac{v^2}{c^2}}}

There are many latex cheatsheets online that’ll get you up and running quickly. Here’s one.

lgvaz · November 19, 2019, 11:27pm

Interestingly enough doing two 2x2 convolutions in series takes more time to train.

jcatanza · November 19, 2019, 11:39pm

Ah yes this is what I’d expected in my comments to @radek. The 9 parameters of the 3x3 conv are solved in parallel on the GPU but a serial application of 2x2 convs presents a bottleneck and slows things down (by how much, incidentally?)

lgvaz · November 19, 2019, 11:50pm

3x3 is taking on avg 56s on my gpu, while 2x2 71s. So 26% slower

jcatanza · November 19, 2019, 11:58pm

So perhaps that’s why there isn’t widespread use of sequences of 2x2 convs?

a_yasyrev · November 20, 2019, 7:44am

Great tread!
I doing some experiments with resnet / xresnet, tryng better understend it. Refactor it a lot, have some interesting thougts, and then Radek start it!
I tried modify resnet and find best results (with woof) is when act_fn before Bn. Same with stem, and best result when in stem bn after maxpool.
Then i take stem from xresnet, but results not so good, Better with xresnet stem(3x3) with relu before bn, and last bn after maxpoool - it so good as base 7x7 conv, but non better,
Will check on long runs.

marco_b · November 20, 2019, 8:23am

That’s what I did at first, but the subset of LaTeX that works on the Markdown implementation of this forums is apparently not enough to render my equations, that’s why I had to switch to codecogs. They’re simple GIF now, so they should render on any device … maybe you looked at the post before I edited and switched to the external tool?

By the way, it’s not like codecogs is a speech/handwriting recognition app, I still had to type those in there so I’m not sure what you mean by suggesting a LaTeX Cheatsheet

Back on topic,

The same could be said for the two 3x3 convs versus a 5x5 conv as well though. Most likely a combination of reasons…

I think point 2 is probably “solved” by now, I still don’t understand point 1, aka “you’re just shuffling the numbers” but If not proven wrong I’ll just assume that what he meant was similar to what I expressed here

marco_b:

I think you want to gradually increase the portion of the image you’ve looked at and the number of filters/features ‘together’, because the 3x3 kernels look at a very small part of the image and (in that space, i.e. the original image) there’s only so much you can “catch” with a 3x3 kernel (edges, small corners, … ) and it’s better in my opinion to use more computations at higher layers where the receptive fields capture representations of bigger patches of the original image!

This has more to do with the “semantics” of the image classification (or representation rather) rather then the actual computations done. It has to do with the fact that we’re talking about the actual input image and not just a 1 x width x height tensor of activations of some kind!

so in the end we do agree even on this topic, just expressed in a way that made it hard for me to rally see it!

jcatanza · November 20, 2019, 8:57am

The equations still do not render in Firefox, either on Android or Windows 10 64-bit. I also checked the Chrome browser on Android, with the same result.

marco_b · November 20, 2019, 9:25am

They render fine for me both on desktop and Android

They are now embedded images, there’s no reason other that weird / overaggressive adblocker or similar that they would not render … they’re GIFs! I’ve send you a PM since this is probably going too off-topic otherwise