I know this topic is discussed piecemeal in various places, but I'm wondering whether there is a single resource that clearly shows, for each kind of layer (bias, batch norm, layer norm, etc.), what the best practices are for things like:
- Should weight decay be applied (or any other hyperparameter chosen based on layer type)?
- Should the layer always be marked trainable, even during gradual unfreezing?
- Should the layer be enabled/disabled in different phases (training vs. eval)?
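For the first bullet, the convention I've seen most often (e.g. in minGPT-style training loops) is to apply weight decay only to rank-2+ tensors (matmul/conv weights) and exclude biases and norm scales/shifts, which are 0-D/1-D. A framework-free sketch of that grouping heuristic (the function name and signature are my own, not any library's API):

```python
def split_decay_groups(named_params):
    """named_params: iterable of (name, ndim) pairs for each parameter tensor.

    Returns (decay, no_decay) lists of parameter names: weights of linear/conv
    layers get weight decay; biases and norm parameters (ndim < 2) do not.
    """
    decay, no_decay = [], []
    for name, ndim in named_params:
        # biases and batch/layer-norm scales and shifts are scalar or 1-D
        (no_decay if ndim < 2 else decay).append(name)
    return decay, no_decay


decay, no_decay = split_decay_groups(
    [("fc.weight", 2), ("fc.bias", 1), ("ln.weight", 1), ("ln.bias", 1)]
)
# decay     -> ["fc.weight"]
# no_decay  -> ["fc.bias", "ln.weight", "ln.bias"]
```

In an actual framework you would then pass two optimizer parameter groups, one with the chosen `weight_decay` and one with `weight_decay=0.0`, rather than a single flat parameter list.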
Maybe this could even be made a wiki, with resources linked for each question above (and with folks able to add other questions)?
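To make the third bullet concrete, here is a toy illustration (pure Python, not a framework API) of why batch norm behaves differently across phases: in training it normalizes with batch statistics while updating a running average; in eval it uses the frozen running statistic instead.

```python
class ToyBatchNorm:
    """Minimal sketch of batch-norm's train-vs-eval behavior (mean only)."""

    def __init__(self, momentum=0.1):
        self.training = True
        self.momentum = momentum
        self.running_mean = 0.0

    def __call__(self, batch):
        if self.training:
            # training phase: use the batch mean, and update the running mean
            mean = sum(batch) / len(batch)
            self.running_mean += self.momentum * (mean - self.running_mean)
        else:
            # eval phase: use the stored running statistic, no updates
            mean = self.running_mean
        return [x - mean for x in batch]
```

This is also why "frozen" norm layers during fine-tuning are a subtle case: even with gradients disabled, a norm layer left in training mode keeps updating its running statistics.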