Optimizing loss by "walking" instead of "stepping" (idea)

joshfp · November 5, 2018, 8:06am

Hi everyone,

While playing with SGD on lesson 2, I came up with a simple idea to improve training and avoiding the need to tune the learning rate.

Vanilla SGD is based on steps: it computes the gradient, makes an small step (according to the learning rate) and then repeat it for the new weights. It looks like a kind of blind algorithm, that only can scan the very close surrounding through its white cane. Additionally, tuning the learning rate is critical to achieve a good performance.

Basically, the idea is to take “walks”, instead of “steps”. The sequence for each iteration is as follows:

Compute the gradient: getting a promising direction to explore.
Walk: Compute multiple steps (sets of new weights) by using the same gradient, but different learning rates. It can be thought as walking multiple steps in a linear path following the gradient direction, but of different distances.
Compute the loss for every step and choose the step with the lower loss.
Compute the gradient of the best step and use it to update the weights (the others non-optimal steps are dismissed).

Here is the notebook with the implementation and some experiments I run:

Some intuitions:

In a training loop, I think backprop is the more computationally demanding task. So, it makes sense to squeeze the hard-to-compute gradients and use them multiple times with different learning rates.
I guess it could complement with optimized learning algorithms, like adam, RMSprop, etc. Unlike those algorithms that leverage on the training history (momentum) to adjust the learning rate, this technique can adjust the learning rate very fast in every iteration. I will try to run some experiments.
An optimization could be to use vanilla SGD on most iterations and this technique once in a while. This would allow to run faster, but also providing the freedom to adjust learning rates quickly.

Benchmark:
I performed same basic benchmark, with the following conclusions:

Usually trains faster than vanilla SGD, since it converges in less epochs.
It shows low-variance training time, unlike vanilla SGD that trains a little bit faster if tuned with the optimal learning rate, but can take much longer with sub-optimal learning rates.

There are a lot of questions that came to my mind:

Does this technique already exist?
Is it useful/applicable or there are important drawbacks?
How it would behave with more complex networks and higher dimensional data?

I’m eager to receive some feedback!

devforfu · November 5, 2018, 1:05pm

Interesting! I guess it would be also helpful to train your algorithm on something more challenging then quadratic surface. And probably you can add time measurements, like (or using profiler):

import time
start = time.time()
# train your algorithm
elapsed = time.time() - start

Also, I would say that probably SGD with momentum is not that blind because it allows you to ignore little “pits” on your way and to roll further

Anyway, it is always a good idea to make a try, and compare your algorithms with other implementations using “classical” datasets and compare your performance/parameters with vanilla ones.

joshfp · November 5, 2018, 1:30pm

Thanks for you feedback! I will try to train on CIFAR10 and will share the results.

eunosM3 · November 5, 2018, 11:51pm

I’m looking forward to your updated attempt.

marii · November 6, 2018, 12:03am

Not sure if you are familiar with it but “Adam” offers a changing learning rate in order to attempt to solve this problem.

ramanan · November 6, 2018, 7:59am

Nice idea and worth experimenting.

In terms of other approaches you might be interested in Beam Search where k paths are explored in parallel over several steps before making a decision on the ‘best’ path.

My initial thoughts are:

Tradeoff to either spend compute time on making more steps or optimising every step length. With fastai the largest/optimal value for the initial learning rate is identified using lr_find (which tries a wide range of values). Your approach is a bit like using lr_find on every SGD step. I’ve found it best to stick with the lr_find value until it stops working e.g. loss stops going down. In which case we can unfreeze and run lr_find again.
Randomness is a good thing. With SGDR the learning rate is modulated during every epoch. In addition each step direction has a measure of randomness (depending on the batch size). Both of these help us escape early local minima. Perhaps your approach needs to incorporate some randomness as well.

Good effort and look forward to seeing more results.