In lesson 8, a list of steps for tackling overfitting is shown. Step 3 is “Generalizable architectures”, such as adding batch norm layers or using dense nets. Step 5 is “Reduce architecture complexity”, such as having “less layers or less activations”.
I think I kind of get the difference between complexity and generalizability. But isn’t it true that the two aren’t completely independent of each other? For example, is it possible that reducing the number of layers might also reduce generalizability? Also, since batch norm layers have trainable parameters, doesn’t adding them also increase the complexity in some way?
I don’t have a complete answer, but I can offer some angles that might help unpack these questions.
I think of overfitting as “memorizing the training examples instead of generalizing from them”.
Reducing architecture complexity reduces the model’s capacity to memorize, so it is forced to generalize instead. Up to a point, of course: if the model is made too simple, it can neither memorize nor generalize.
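To put a rough number on "complexity", here is a minimal sketch that counts the trainable parameters of a plain fully connected net. The layer sizes are made up for illustration, and parameter count is only a crude proxy for the capacity to memorize.

```python
def count_params(layer_sizes):
    """Count weights and biases in a fully connected net.

    layer_sizes is a list like [inputs, hidden1, ..., outputs].
    Each layer has n_in * n_out weights plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Hypothetical architectures, not taken from the lesson
deeper = count_params([784, 512, 512, 10])
simpler = count_params([784, 128, 10])

print(deeper, simpler)
```

With these made-up sizes the deeper net has 669,706 parameters and the simpler one has 101,770, so "less layers or less activations" cuts the number of things the model can tune by a large factor.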
You’re right that batchnorm adds trainable parameters. But recall that it also normalizes the activations per filter between layers. That means inputs that once created very different activations (in their means and standard deviations) now look more similar to the next layer. Because some of the information that once distinguished particular examples has been removed, the model cannot memorize them as readily and has to learn a more general rule for classifying them.
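Here is a small sketch of just the normalization step, in plain Python with no framework, to show what "look more similar" means. The activation values are invented, and the learned scale and shift are left out on purpose.

```python
import statistics

def batchnorm_one_channel(values, eps=1e-5):
    """Normalize one filter's activations across a batch.

    Uses the batch mean and (population) variance, as batchnorm does
    in training mode. The learned scale/shift is omitted here.
    """
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Hypothetical activations for the same filter with very different scales
small = [0.1, 0.2, 0.3, 0.4]
large = [100.0, 200.0, 300.0, 400.0]

norm_small = batchnorm_one_channel(small)
norm_large = batchnorm_one_channel(large)

print(norm_small)
print(norm_large)
```

The two inputs differ by a factor of 1000, but after normalization both have mean close to 0 and standard deviation close to 1, and the values are nearly identical. The next layer can no longer tell them apart by scale or offset.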
When batchnorm follows a linear or conv layer, those extra parameters (that set the new mean and s.d.) are redundant with the previous layer’s weights and biases, in the sense that the same scale and shift could have been learned by that layer’s own parameters. They are easier to train, but they do not add much to the model’s inherent complexity (dimensionality).
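The redundancy can be checked with a tiny scalar sketch. It only covers the learned scale and shift (called gamma and beta here, with made-up values) and ignores the normalization by batch statistics, so treat it as an illustration of the argument rather than a full batchnorm.

```python
def linear(x, w, b):
    """A one-input, one-output linear layer."""
    return w * x + b

def scale_shift(z, gamma, beta):
    """The learned affine part of batchnorm (normalization step omitted)."""
    return gamma * z + beta

# Hypothetical parameter values
w, b = 2.0, -1.0
gamma, beta = 0.5, 3.0

# gamma * (w*x + b) + beta == (gamma*w) * x + (gamma*b + beta)
w_folded = gamma * w
b_folded = gamma * b + beta

xs = [-2.0, 0.0, 1.5, 4.0]
stacked = [scale_shift(linear(x, w, b), gamma, beta) for x in xs]
folded = [linear(x, w_folded, b_folded) for x in xs]

print(stacked)
print(folded)
```

Both lists come out the same, so a single linear layer with adjusted weight and bias represents the same function as the linear layer followed by the scale and shift. The extra two parameters per channel change how easily that function is reached during training, not the set of functions available.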
HTH some, and experts please feel free to make corrections.