Why Large Weights Induce Overfitting

What is the property that makes large weights induce overfitting? Any intuitive explanation would be helpful.

I am guessing you are asking about this in regard to weight decay?

I think we could think of it this way. Our dataset is one for identifying dogs versus cats. Let's say there is a specific image of a cat that has a bee in the background. The network could learn to activate on the bee in order to identify the cat in that one image. This is bad, because that feature is only useful for identifying that specific image, i.e. overfitting to it.

Weight decay makes the network “forget” what is not useful for identifying a large number of images, as it penalizes weights that aren't useful across many images.
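A minimal sketch of what the L2 penalty does during an SGD step (the `alpha` and `lr` values here are made up for illustration). The penalty `alpha * w^2` contributes a gradient of `2 * alpha * w`, so the pull toward zero is proportional to the size of the weight itself:

```python
# Hypothetical sketch: one SGD step with an L2 penalty alpha * w^2.
# The penalty adds a gradient of 2 * alpha * w, which always nudges
# the weight toward zero, in proportion to how large it is.
def sgd_step(w, data_grad, alpha=0.01, lr=0.1):
    decay_grad = 2 * alpha * w            # pull toward zero, stronger for larger w
    return w - lr * (data_grad + decay_grad)

w = 5.0                                   # a large "bee" weight
for _ in range(100):
    w = sgd_step(w, data_grad=0.0)        # feature absent: no data gradient
# with no data gradient opposing it, w shrinks steadily toward zero
```

A weight whose feature keeps appearing gets a data gradient each batch that can outweigh this shrinkage; a weight whose feature rarely appears only gets the shrinkage.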

Thanks for your answer @marii . How does it forget the specific weight that is inducing overfitting without affecting the other weights (cat eyes, cat ears, tail, …)? In your example, some weight is giving the “bee” importance, and with L1 or L2 regularization we forget that specific weight. I just can't understand this, because when we add alpha*w^2 to the cost function, it reduces all weights, not only the bee weight.

Batch:      1    2    3    4    5    6
Bee:        1    0    0    0    0    0
BeeWeight:  1    0.9  0.8  0.7  0.6  0.5
Cat Eye:    1    1    0    1    1    1
EyeWeight:  1    2    1.9  2.9  3.9  4.9

Here is a simplified chart. I am simplifying the problem a lot here, but the general idea is: if the feature is present in a batch, the weight gets +1.1 from the positive data gradient and -0.1 from weight decay, for a net +1. If the feature is absent, there is no data gradient (no activation), so the weight only gets the -0.1 from weight decay.
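To make the "no gradient due to no activation" point concrete, here is a hedged sketch of the chain rule for a single linear neuron `y = w * x` (the numbers are made up). The data gradient on `w` is `x * dL/dy`, so when the bee feature is absent (`x = 0`) the data gradient vanishes and only the decay term moves the weight:

```python
# Hypothetical sketch: gradient on w for a linear neuron y = w * x,
# trained with an alpha * w^2 penalty.
def grad_w(x, upstream_grad, w, alpha=0.01):
    data_grad = x * upstream_grad      # chain rule: dL/dw = x * dL/dy
    decay_grad = 2 * alpha * w         # from the alpha * w^2 penalty
    return data_grad + decay_grad

print(round(grad_w(x=0.0, upstream_grad=1.0, w=3.0), 2))  # 0.06: decay only
print(round(grad_w(x=1.0, upstream_grad=1.0, w=3.0), 2))  # 1.06: data + decay
```

So backpropagation never has to "tell" the bee neuron anything on cat-only batches; its input is zero, its data gradient is zero, and decay is the only force acting on it.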

This is a very simplified example to make a point: the gradient keeps the important features alive, since they are present across all batches, even though they are continually penalized by weight decay.
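The chart above can be replayed as a tiny simulation (the +1.1 data gradient and -0.1 decay per batch are the same made-up numbers as in the chart):

```python
# Replay of the toy chart: +1.1 when the feature appears in a batch,
# -0.1 weight decay applied to every weight in every batch.
present = {"bee": [1, 0, 0, 0, 0, 0], "eye": [1, 1, 0, 1, 1, 1]}
weights = {"bee": 0.0, "eye": 0.0}

for batch in range(6):
    for feat in weights:
        if present[feat][batch]:
            weights[feat] += 1.1   # data gradient: feature helped classify
        weights[feat] -= 0.1       # weight decay hits every weight, every batch
    print(batch + 1, weights)      # matches the BeeWeight / EyeWeight rows
```

After batch 6 the bee weight has decayed to 0.5 and keeps falling, while the eye weight has grown to 4.9, exactly as in the chart.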


Thanks, this is the icing on the cake @marii !!


  1. If the feature is there, it gets +1.1 due to the positive gradient, and if it's not there it receives -0.1. Why this difference in magnitude? Shouldn't the two be almost equal in magnitude?

  2. Your explanation was perfect; I just missed one fine intuitive detail. Can you explain, in simple steps, how backpropagation keeps pushing the bee neuron's weights lower and lower, ultimately to zero, as new batches are fed in? I want to understand how backpropagation tells a neuron that fires for the bee to shut down when the images are labelled as cats and that overfitting neuron has a high weight.