I figured this out thanks to your help, and also thanks to a very helpful comment on GitHub.
It turns out that scaling the weights made the model output less confident predictions, which helped the loss (a clipping effect). I also think there was some luck involved in the new parameters producing such a much better loss. There might be something else going on here, but mostly it was the clipping effect.
Nonetheless, this is all not really important. The important bit is that I learned you shouldn't rescale weights when you change the dropout p value in Keras! Keras does something rather smart: instead of scaling the weights down at test time, it scales the activations up at train time (this is known as inverted dropout). That way, whatever the dropout value was, layer.get_weights() always returns weights at the correct scale. I tested this and it is indeed the case.
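To make the idea concrete, here is a minimal NumPy sketch of inverted dropout (not Keras's actual implementation, just an illustration of the scaling trick): kept activations are divided by (1 - p) at train time, so the expected activation stays the same and nothing needs rescaling at test time.

```python
import numpy as np

def inverted_dropout(x, p, rng):
    """Inverted dropout: drop each unit with probability p, then scale
    the survivors by 1/(1-p) at *train* time. Test time is the identity,
    so weights never need rescaling when p changes."""
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)

# The mean train-time activation stays ~1.0 regardless of p,
# matching the (identity) test-time behavior.
for p in (0.2, 0.5, 0.8):
    y = inverted_dropout(x, p, rng)
    print(p, y.mean())
```

This is why changing p in Keras doesn't require touching the stored weights: the correction already happened during training.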
Let me cc @jeremy and @rachel, as this seems like a very useful tidbit to know.