Let’s define our quantities! The input patch is

a 3x3 kernel is made of

and the two 2x2 kernels are
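
In symbols, one way to write these down (the row-by-row numbering is my assumption; any consistent labelling works just as well):

$$
X = \begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}, \quad
A = \begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{pmatrix}, \quad
B = \begin{pmatrix} b_1 & b_2 \\ b_3 & b_4 \end{pmatrix}, \quad
C = \begin{pmatrix} c_1 & c_2 \\ c_3 & c_4 \end{pmatrix}
$$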

Now let’s apply the 3x3 convolution to the input. The result is
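
Treating "convolution" as the deep-learning cross-correlation (no kernel flip) and numbering the entries row by row, both of which are assumptions on my side, a 3x3 kernel applied to a 3x3 patch gives a single number (called $y$ here):

$$
y = \sum_{i=1}^{9} a_i x_i = a_1 x_1 + a_2 x_2 + \dots + a_9 x_9
$$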

When instead we apply the two 2x2 convolutions in series, we first get

by applying the first kernel to the input and then

by applying the second kernel to it!
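
Spelled out under assumed conventions (row-by-row numbering, stride 1, no padding, no kernel flip; the names $z_1, \dots, z_4$ and $y'$ are mine), the first 2x2 kernel turns the 3x3 patch into a 2x2 map:

$$
\begin{aligned}
z_1 &= b_1 x_1 + b_2 x_2 + b_3 x_4 + b_4 x_5 \\
z_2 &= b_1 x_2 + b_2 x_3 + b_3 x_5 + b_4 x_6 \\
z_3 &= b_1 x_4 + b_2 x_5 + b_3 x_7 + b_4 x_8 \\
z_4 &= b_1 x_5 + b_2 x_6 + b_3 x_8 + b_4 x_9
\end{aligned}
$$

and the second kernel collapses that map into one number:

$$
y' = c_1 z_1 + c_2 z_2 + c_3 z_3 + c_4 z_4
$$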

If we rearrange things a bit, we can collect the x terms into
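
Here is that regrouping written out, assuming row-by-row numbering of $x_1 \dots x_9$, $b_1 \dots b_4$ and $c_1 \dots c_4$ and calling the output of the two stacked 2x2 kernels $y'$ (my notation). Each input ends up with one coefficient built from products of a $b$ and a $c$:

$$
\begin{aligned}
y' ={}& (b_1 c_1)\,x_1 + (b_2 c_1 + b_1 c_2)\,x_2 + (b_2 c_2)\,x_3 \\
&+ (b_3 c_1 + b_1 c_3)\,x_4 + (b_4 c_1 + b_3 c_2 + b_2 c_3 + b_1 c_4)\,x_5 + (b_4 c_2 + b_2 c_4)\,x_6 \\
&+ (b_3 c_3)\,x_7 + (b_4 c_3 + b_3 c_4)\,x_8 + (b_4 c_4)\,x_9
\end{aligned}
$$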

This is where we compare with the earlier result of applying the 3x3 kernel to the input, which was simply each of the nine inputs weighted by the matching entry of A.

If we equate the two results we can see that for them to be equivalent we must have
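
Matching the coefficient of each input on both sides gives nine conditions. This is a sketch under assumed row-by-row numbering of the entries of A, B and C:

$$
\begin{aligned}
a_1 &= b_1 c_1, & a_2 &= b_2 c_1 + b_1 c_2, & a_3 &= b_2 c_2, \\
a_4 &= b_3 c_1 + b_1 c_3, & a_5 &= b_4 c_1 + b_3 c_2 + b_2 c_3 + b_1 c_4, & a_6 &= b_4 c_2 + b_2 c_4, \\
a_7 &= b_3 c_3, & a_8 &= b_4 c_3 + b_3 c_4, & a_9 &= b_4 c_4
\end{aligned}
$$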

But you can easily verify that solving this with respect to the kernel A (taking all of B and C as given) is trivial: it is a system of 9 equations in 9 unknowns, and each equation already gives one entry of A explicitly. I’ve laid out all you need to find the values of A!
(In layman’s terms, the 3x3 convolution can exactly reproduce the result of the two 2x2 convolutions if we do not take the non-linearities into account.)
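
If you want to check this numerically, here is a minimal NumPy sketch. The helper names `correlate_valid` and `compose_kernels` are made up for this post, and I am assuming stride 1, no padding and no kernel flip (the usual deep-learning "convolution").

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def compose_kernels(B, C):
    """Build the single kernel that matches applying B first, then C."""
    bh, bw = B.shape
    ch, cw = C.shape
    A = np.zeros((bh + ch - 1, bw + cw - 1))
    for i in range(ch):
        for j in range(cw):
            # C[i, j] weights the copy of B shifted by (i, j)
            A[i:i + bh, j:j + bw] += C[i, j] * B
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))   # input patch
B = rng.normal(size=(2, 2))   # first 2x2 kernel
C = rng.normal(size=(2, 2))   # second 2x2 kernel

two_step = correlate_valid(correlate_valid(X, B), C)
one_step = correlate_valid(X, compose_kernels(B, C))
print(np.allclose(two_step, one_step))  # prints True
```

The same check works on larger inputs, since the equivalence holds at every position of the sliding window.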
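The problem comes when you try to solve it the other way around, with respect to B and C! Unless we restrict some values of A, this is a system of 9 equations in only 8 (!!!) unknowns, the values b_1, ... b_4, c_1, ... c_4, so the system is overdetermined and in general has no solution! (Strictly speaking, scaling B by t and C by 1/t leaves every product of a b and a c unchanged, so the 8 unknowns carry only 7 independent degrees of freedom, and A has to satisfy two constraints rather than one.)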
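
To see the failure on a concrete kernel, here is an example of my own. It assumes row-by-row numbering and the coefficient matching $a_1 = b_1 c_1$, $a_2 = b_2 c_1 + b_1 c_2$, $a_3 = b_2 c_2$, and so on. Take

$$
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
$$

Then

$$
\begin{aligned}
a_1 = b_1 c_1 = 1,\quad a_9 = b_4 c_4 = 1 &\;\Rightarrow\; b_1, c_1, b_4, c_4 \neq 0 \\
a_3 = b_2 c_2 = 0,\quad a_2 = b_2 c_1 + b_1 c_2 = 0 &\;\Rightarrow\; b_2 = c_2 = 0 \\
a_7 = b_3 c_3 = 0,\quad a_4 = b_3 c_1 + b_1 c_3 = 0 &\;\Rightarrow\; b_3 = c_3 = 0 \\
a_5 = b_4 c_1 + b_1 c_4 = 0 &\;\Rightarrow\; b_4^2 + b_1^2 = 0
\end{aligned}
$$

The last implication comes from multiplying by $b_1 b_4$ and using $b_1 c_1 = b_4 c_4 = 1$. No real $b_1 \neq 0$ satisfies it, so no pair of real 2x2 kernels reproduces this A.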
Again, informally speaking, this means that (non-linearity notwithstanding) the two 2x2 kernels cannot in general reproduce exactly the result of a single 3x3 convolution!
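
One way to put a number on "cannot in general" is to count how many independent directions in the 9-dimensional space of 3x3 kernels you can reach by nudging B and C. The sketch below builds the Jacobian of the map from (B, C) to the equivalent 3x3 kernel and asks NumPy for its rank. The function name `product_jacobian` is hypothetical, and I am assuming row-by-row numbering, stride 1 and no kernel flip.

```python
import numpy as np

def product_jacobian(B, C):
    """Jacobian of (B, C) -> A, where A is the 3x3 kernel equivalent to B then C.

    Rows: the 9 entries of A (row-major).
    Columns: b_1..b_4 followed by c_1..c_4 (row-major).
    """
    J = np.zeros((9, 8))
    for k in range(2):
        for l in range(2):
            # Derivative w.r.t. B[k, l]: a copy of C placed at offset (k, l)
            dA = np.zeros((3, 3))
            dA[k:k + 2, l:l + 2] = C
            J[:, 2 * k + l] = dA.ravel()
            # Derivative w.r.t. C[k, l]: a copy of B placed at offset (k, l)
            dA = np.zeros((3, 3))
            dA[k:k + 2, l:l + 2] = B
            J[:, 4 + 2 * k + l] = dA.ravel()
    return J

rng = np.random.default_rng(0)
B = rng.normal(size=(2, 2))
C = rng.normal(size=(2, 2))
rank = np.linalg.matrix_rank(product_jacobian(B, C), tol=1e-9)
print(rank)
```

For generic B and C the rank should come out as 7 rather than 8, because scaling B up and C down by the same factor changes nothing. Either way it stays below 9, which is the point: the stacked 2x2 kernels only cover a thin slice of all possible 3x3 kernels.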
The same of course applies to a 5x5 kernel versus two 3x3 kernels (25 equations in 18 unknowns), even though we know that in practice the stacked small kernels often give satisfactory results as well. Still, this is how you prove “formally” that a single bigger kernel is in general more expressive.
I hope that’s clear!
EDIT: argh! equations are not rendered! Give me a sec while I work on a solution! 
EDIT2: saved? yes! Thanks codecogs