Applying BatchNorm and non-linearities between the layers changes things a bit, but the smaller (serial) convolutions in essence “tie” some of the parameters together, similarly to, for example, hierarchical/multilevel models.
So yes, in theory larger convolutions have more “freedom”, i.e. more parameters with which they can learn to combine pixels, but in practice stacked smaller convolutions may very well perform just as well, if not better, thanks to the additional non-linearities (and they save some computation in the process).
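To make the savings concrete, here's a quick back-of-the-envelope sketch (the channel count is just an illustrative assumption): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer parameters.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Learnable parameters in a single k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

c = 64  # hypothetical channel count, purely for illustration

# One 5x5 conv vs. a stack of two 3x3 convs: same 5x5 receptive field.
single_5x5 = conv_params(5, c, c)                       # 25*64*64 + 64
stacked_3x3 = conv_params(3, c, c) + conv_params(3, c, c)  # 2*(9*64*64 + 64)

print(single_5x5)   # 102464
print(stacked_3x3)  # 73856 -- ~28% fewer parameters, plus an extra non-linearity
```

The gap widens further against 7x7 (which a stack of three 3x3 convs matches in receptive field).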
I suspect that we don’t have much literature on 2x2 kernels because of the weirdness connected with their implementation! Another example off the top of my head: think about what happens when you want to reduce the image/activation size. With 3x3 convolutions you apply a stride of 2: you halve the height and width of the output layer (wrt the inputs), but you still retain some overlap between consecutive kernel applications (which you want, in order to maintain some correlation between the activations)!
If you use a 2x2 kernel there’s no way to have overlapping applications that also reduce the image size (other than by 1 pixel at a time, if you don’t pad one side [if you pad both sides you actually increase the size by 1])!!
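The same output-size formula shows the 2x2 problem (again with an illustrative 224-pixel input): at stride 2 you halve the size but the windows never overlap, and at stride 1 you either shrink by a single pixel or, if you pad both sides, grow by one.

```python
def conv_out_size(size, kernel, stride, pad=0):
    """Standard convolution output-size formula: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# 2x2 kernel, stride 2: size halves, but windows [2i, 2i+1] never overlap.
print(conv_out_size(224, kernel=2, stride=2))          # 112

# 2x2 kernel, stride 1, no padding: windows overlap, but you only lose 1 pixel.
print(conv_out_size(224, kernel=2, stride=1))          # 223

# 2x2 kernel, stride 1, padding on both sides: the output actually *grows* by 1.
print(conv_out_size(224, kernel=2, stride=1, pad=1))   # 225
```

So with 2x2 you're forced to choose between overlap and meaningful downsampling, which 3x3 at stride 2 gives you simultaneously.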
Probably the marginal gain (if any) associated with 2x2 kernels didn’t warrant any real-world application, contrary to the gains of 3x3 vs bigger 5x5 or 7x7 kernels…