Updating the large gradient parameters more than the small ones?

I searched for an answer before posting, so I apologize if this has already been discussed.

The idea is as follows: when doing SGD, apply a penalty to the weight updates that have a relatively small magnitude, so the parameters with the largest gradients get updated the most. The purpose would be to move in the direction of “decoupling” which groups of parameters recognize a certain feature. I figure that a given training example might provide information that is relevant to a small subset of the parameters rather than to all of them at once. And because all the parameter values are interconnected, by the time you’ve updated parameters D through Z, the updates you computed for A, B, and C might no longer be as effective at moving down the loss surface.
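To make it concrete, here is a rough sketch of the kind of step I have in mind, written against PyTorch just for illustration. The quantile cutoff and the `penalty` factor are arbitrary placeholders I made up, not anything I’ve validated:

```python
import torch

def sgd_step_penalizing_small_grads(params, lr=0.1, cutoff_quantile=0.5, penalty=0.1):
    """One SGD step where entries whose gradient magnitude falls below the
    per-tensor quantile only get a fraction (`penalty`) of their usual update.
    Both knobs are made up for illustration."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            # Threshold separating "small" from "large" gradient entries.
            cutoff = g.abs().flatten().quantile(cutoff_quantile)
            # Scale down the small-gradient entries, leave the rest alone.
            scale = torch.where(g.abs() < cutoff,
                                penalty * torch.ones_like(g),
                                torch.ones_like(g))
            p -= lr * scale * g

# Toy usage:
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
torch.nn.functional.mse_loss(model(x), y).backward()
sgd_step_penalizing_small_grads(model.parameters())
```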

Does this idea even make any sense? I’d be curious to learn about the results of anyone who has tried it.

Interesting idea, and definitely worth experimenting with to validate your intuitions! Just for context, I believe the usual approach is the opposite of what you describe. Both the RMSProp and Adam optimizers give a higher effective learning rate to parameters with small gradients (on the assumption that the loss surface is flat there, so we can move faster) and a lower effective learning rate to parameters with large gradients (to avoid divergence).
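To make that concrete, here is a stripped-down sketch of the Adam-style update rule (not the actual torch.optim.Adam code, just an illustration of why small gradients translate into bigger steps):

```python
import torch

def adam_like_step(params, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal Adam-style update: each parameter's step is divided by the
    running RMS of its own gradients, so parameters that have been seeing
    small gradients get a larger effective step, and vice versa.
    (RMSProp is the same idea without the momentum term and bias correction.)"""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            m, v, t = state.get(id(p), (torch.zeros_like(p), torch.zeros_like(p), 0))
            t += 1
            m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
            v = beta2 * v + (1 - beta2) * g * g   # second moment (squared grads)
            m_hat = m / (1 - beta1 ** t)          # bias corrections
            v_hat = v / (1 - beta2 ** t)
            p -= lr * m_hat / (v_hat.sqrt() + eps)  # small v  =>  bigger step
            state[id(p)] = (m, v, t)
```

The division by `v_hat.sqrt()` is what makes the step size per-parameter: a weight whose gradients have been consistently small gets divided by a small number and therefore moves relatively further per step.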

Thanks for the response! Consider me surprised that they do the opposite, haha. I’ll keep an eye out while working through the course for where that shows up in the API. I appreciate your input, Darek.