Why are NN gradients less than 1?

I have a guess, probably incorrect, regarding the behavior of gradients.
In the very first iteration of a NN, we are just guessing our weights. Looking at some specific weight, it should be pretty far from its final value, so I expect its very first gradient to be pretty big, at least bigger than 1. For example, like in Jeremy’s picture

[image: grad_illustration]

We see here that the slope is steeper than 45 degrees, so the gradient is greater than 1.
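
As a minimal sketch of that intuition (a toy parabola in PyTorch, not one of the course models):

import torch

# Toy parabola: loss = w**2, with its minimum at w = 0.
w = torch.tensor(3.0, requires_grad=True)
loss = w ** 2
loss.backward()
print(w.grad)  # tensor(6.) -- away from the minimum, the slope easily exceeds 1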

I wanted to explore this. I took 3 or 4 different models from the fastbook course and still can’t find gradients (absolute values) greater than 1.
For example

from fastai.vision.all import *  # also brings in numpy as np, pandas as pd, torch.nn.functional as F, and L

def abs_10_power(x):
    # Order of magnitude of the absolute values: int(log10(|x|)), with 1e-30
    # added to avoid log10(0). Note that astype(int) truncates toward zero,
    # so any |x| in [0.1, 10) maps to 0.
    return (x.abs() + 1e-30).log10().flatten().numpy().astype(int)

class GradCallback(HookCallback):
    def __init__(self):
        # is_forward=False registers backward hooks, so `hook` sees gradients
        # rather than activations.
        super().__init__(is_forward=False)

    def before_fit(self):
        super().before_fit()
        self.out_grads = L()

    def hook(self, m, i, o):
        # For a backward hook, `o` holds the gradients w.r.t. the module's outputs.
        self.out_grads.append(abs_10_power(o[0]))

path = untar_data(URLs.MNIST_SAMPLE)
Path.BASE_PATH = path
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)

gcb = GradCallback()
learn.fit_one_cycle(1, 0.1, cbs=gcb)

# Plot the largest order of magnitude recorded at each hook call.
pd.Series(gcb.out_grads, name='batch_grads')\
  .apply(np.max).to_frame('log10_max_grad')\
  .reset_index().rename(columns={'index': 'iteration'})\
  .plot.scatter('iteration', 'log10_max_grad');

[plot: log10_max_grad by iteration]

Hi Sergey,

The parabola is just a particular example to illustrate SGD. When you go further from the minimum, the slope increases for a parabola, but that may not be the case in general for the loss-vs.-weights function. I am really not sure; it would depend on whether the loss function’s slope is bounded. It’s likely that for any loss we commonly use, the slope increases indefinitely as you get far from the minimum. What we actually rely on in SGD, though, is that the gradient gets smaller as we approach a local minimum; that does not mean the opposite must necessarily be true.
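
Whether the slope stays bounded really does depend on the setup. A toy contrast (my own illustration, not from the lesson):

import torch
import torch.nn.functional as F

# 1) Plain MSE on a single scalar weight: the slope grows linearly with the
#    distance from the optimum (w = 1), so it is unbounded.
w = torch.tensor(100.0, requires_grad=True)
((w - 1.0) ** 2).backward()
print(w.grad)  # tensor(198.)

# 2) Binary cross-entropy with logits, taken w.r.t. the logit itself: the
#    gradient is sigmoid(logit) - target, so its magnitude never exceeds 1
#    no matter how wrong the prediction is.
logit = torch.tensor(100.0, requires_grad=True)
F.binary_cross_entropy_with_logits(logit, torch.tensor(0.0)).backward()
print(logit.grad)  # close to 1.0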

I am delighted that you investigated your question with an actual experiment. A couple of comments:

  • There is no reason to test against the number one. The scale of the gradient is determined by the architecture and the loss function. (For example, putting a constant factor in front of the loss function scales every gradient by that factor; see the sketch after this list.) That’s why different models need different learning rates.

  • So you would really want to test whether the gradient gets arbitrarily big as you move farther away. If you look at the gradients along a training path that already starts from a reasonable weight initialization, I doubt you will see any large values.

  • A possibly more general way to measure the gradient is to take its norm (length) rather than the maximum abs() component; the sketch after this list prints both.

  • To observe some huge gradient values, try setting the learning rate too high. To be more intentional, you might create a loss function that rewards large losses, and train!
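
To make the scaling and norm points concrete, here is a small PyTorch sketch (a toy linear model on random data, nothing to do with the notebook above):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for factor in (1.0, 100.0):
    model.zero_grad()
    # A constant factor on the loss rescales every gradient by the same factor,
    # so "bigger or smaller than 1" says nothing by itself.
    loss = factor * F.mse_loss(model(x), y)
    loss.backward()
    g = model.weight.grad
    print(factor, g.abs().max().item(), g.norm().item())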

I hope these angles help. I am interested, theoretically, in knowing whether gradients get arbitrarily large toward the edges of weight space, so please post what you discover. :slightly_smiling_face:

I agree about the parabola.
My assumptions about gradients and the underlying parameter-loss surface come from the following picture, which I have in mind from lesson 14 of the fastbook course:


So here, the only way to keep the max gradients smaller than 1 through the whole training process, for all iterations, would be to land in one of the local minima. I don’t think that is a likely case.
Regarding your notes:

  1. I don’t test only the very first gradient. I look at all gradients, aggregated with max or mean per module, on every iteration.
  2. I’ll check that.
  3. I understand your point. My idea is to try to find “any” partial gradient greater than 1.
  4. Thanks for the idea, I checked it in the colab (a rough sketch of such a loss is below).
    The training loss gets bigger

    The gradients are still suspiciously small
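
In case anyone wants to try it, a rough sketch of a loss that rewards large losses (a hypothetical neg_cross_entropy, not necessarily what the colab does):

import torch.nn.functional as F

# Hypothetical loss that rewards large losses: the negative of cross-entropy.
# Minimising it is the same as maximising the ordinary cross-entropy, so
# training pushes the weights away from any minimum of the original loss.
def neg_cross_entropy(preds, targs):
    return -F.cross_entropy(preds, targs)

# e.g. learn = cnn_learner(dls, resnet18, pretrained=False,
#                          loss_func=neg_cross_entropy, metrics=accuracy)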

If you want to play with gradients too, I created a colab: