I think part of the answer is simply because AlexNet did it. Initial layer for that network used an 11 x 11 kernel.
The only paper I can remember directly contrasting multiple 3 x 3 kernels vs a single 5 x 5 or larger kernel came down in favor of multiple 3 x 3 kernels for exactly the reason mentioned above. (The exact paper eludes me right now - think it was one from Google.)
So I don’t think there is a really good reason to do it this way (although if someone else can think of a reason I’m all ears).
That said, there is a less compelling reason to use a larger initial kernel size - the first layer filters are much more visually appealing when you include them as a figure in your paper: