I searched for an answer, so I apologize if this has been discussed already.
The idea is as follows: when doing SGD, apply a penalty to weight updates that have a relatively small magnitude. The purpose would be to move in the direction of “decoupling” which groups of parameters recognize a given feature. I figure that a given training example might provide information that is relevant to only a small subset of the parameters rather than all of them at once. And because all the parameter values are interconnected, by the time you’ve updated parameters D through Z, the updates to A, B, and C might no longer be as effective at moving down the loss function.
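To make the idea concrete, here's a minimal sketch of one possible interpretation: instead of literally penalizing small updates, it simply suppresses the smallest-magnitude gradient components so that each step only moves a small subset of the parameters. The function name `sparse_sgd_step` and the `keep_frac` parameter are hypothetical, just for illustration; this is not a tested method, only a way to phrase the question in code.

```python
import numpy as np

def sparse_sgd_step(params, grads, lr=0.01, keep_frac=0.1):
    """Apply an SGD step, but only to the top keep_frac fraction of
    gradient components by magnitude; the rest are zeroed out."""
    # Pool all gradient magnitudes to find a global threshold.
    flat = np.concatenate([g.ravel() for g in grads])
    k = max(1, int(keep_frac * flat.size))
    # Magnitude of the k-th largest component overall.
    thresh = np.partition(np.abs(flat), -k)[-k]
    new_params = []
    for p, g in zip(params, grads):
        # Keep only the large-magnitude components of this update.
        mask = np.abs(g) >= thresh
        new_params.append(p - lr * g * mask)
    return new_params
```

For example, with `keep_frac=0.5` and gradients `[0.5, 0.01]`, only the first component gets updated; the second parameter is left untouched. Whether a hard mask like this or a soft penalty term is the better formalization of the original idea is exactly the kind of thing I'd be curious about.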
Does this idea even make any sense? I’d be curious to learn about the results of anyone who has tried it.