I assume that STLR is used to jump out of local minima and resume learning, but there are other LR scheduling techniques such as tf.train.cosine_decay_restarts. I also haven't seen STLR applied in any other papers. So what would the difference be if some other learning-rate decay technique with restarts were applied instead?
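For reference, here is a minimal sketch of the two schedules being compared: STLR as defined in the ULMFiT paper (Howard & Ruder, 2018) and cosine decay with warm restarts (SGDR, which is what tf.train.cosine_decay_restarts implements). The parameter values below are illustrative defaults, not taken from any specific paper:

```python
import math

def stlr(t, T=100, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular LR: a short linear warm-up to lr_max,
    then one long linear decay back down. No restarts."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def cosine_restarts(t, first_period=50, t_mul=2.0, lr_max=0.01, lr_min=0.0):
    """Cosine decay with warm restarts: the LR follows a cosine curve
    down to lr_min, then jumps back to lr_max; each period is t_mul
    times longer than the last."""
    period = first_period
    while t >= period:
        t -= period
        period = int(period * t_mul)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))
```

The key structural difference is visible here: STLR is a single triangle over the whole run, whereas the cosine schedule periodically resets the LR to its maximum, which is what would let training escape a sharp minimum late in a run.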
Thanks.