I am trying to confirm the direction of the chain rule in backprop. When calculating the gradients for a particular hidden layer, say layer 10 of a 16-layer network, are we considering layers 1-10 or layers 10-16? I am assuming it is the former.
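To make the question concrete, this is the chain-rule expression I have in mind for the weight gradient at layer 10 (my own notation, assuming a 16-layer feed-forward network with scalar loss $L$, activations $a_\ell$, and weights $W_\ell$):

$$\frac{\partial L}{\partial W_{10}} = \frac{\partial L}{\partial a_{16}} \cdot \frac{\partial a_{16}}{\partial a_{15}} \cdots \frac{\partial a_{11}}{\partial a_{10}} \cdot \frac{\partial a_{10}}{\partial W_{10}}$$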
from keras.layers import Dense

layers = model.layers
# Get the index of the first Dense layer...
first_dense_idx = next(index for index, layer in enumerate(layers) if isinstance(layer, Dense))
# ...and set this and all subsequent layers to trainable
for layer in layers[first_dense_idx:]: layer.trainable = True
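For context, this is roughly how I train after flipping the flags (a minimal sketch; the optimizer, loss, and the train_batches/val_batches names are my own placeholders, and as far as I know changes to layer.trainable in Keras only take effect once the model is recompiled):

# Recompile so the updated trainable flags take effect,
# then fit to compute new weights for the unfrozen layers
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_batches, validation_data=val_batches, epochs=1)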
With the code above we are calculating new weights for every layer from the first dense layer onward. And then we calculate new weights once again, for layers 12-16, with:
for layer in layers[12:]: layer.trainable = True
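For what it's worth, here is my reading of that second stage as a sketch (the freeze-everything step is my assumption, since the snippet above only unfreezes layers[12:], and the model/data names are placeholders again):

# Stage 2: freeze all layers, unfreeze only layers 12 onward,
# then recompile and train again so only those weights update
for layer in layers:
    layer.trainable = False
for layer in layers[12:]:
    layer.trainable = True
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_batches, validation_data=val_batches, epochs=1)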
So it looks like we can arbitrarily choose layers again and recalculate their weights. Is there a benefit to performing one backprop run immediately after another, especially when, as in this example, the model architecture has remained unchanged after the first backprop?