Haven’t used S4TF at all, and I’m not very familiar with the maths, so I could be off here, but I have written a couple of custom backward kernels, so I have some familiarity with the mechanics.
Yeah, you need a scalar for backward(); I’ve used stuff like:
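(For concreteness, my_forward/my_backward below stand in for your custom kernel pair; the elementwise square is just an assumption of mine so the snippet runs end to end.)

import torch

def my_forward(x):
    # toy elementwise op standing in for the real forward kernel
    return x ** 2

def my_backward(grad_out, x):
    # chain rule for the toy op: d(x**2)/dx = 2*x, scaled by the incoming gradient
    return (2 * x * grad_out,)  # 1-tuple, to match the unpacking below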
x = torch.randn(10, requires_grad=True)  # Input; requires_grad so autograd tracks it
y = my_forward(x)
loss = y.mean()
grad_out, = torch.autograd.grad(loss, y, retain_graph=True)  # retain_graph needed to be able to call grad again
# grad_out is now the gradient of loss w.r.t. output of my_forward (y)
# my_backward calculates the input gradient from the input and grad_out, so I pass x
grad_inp, = my_backward(grad_out, x)
assert torch.allclose(grad_inp, torch.autograd.grad(loss, x, retain_graph=True)[0])  # grad returns a tuple; retain again for the check below
I think that you should find (for an elementwise op, at least) that after
y.backward(torch.ones_like(y))
you get x.grad == grad_inp / grad_out. Note y.backward(...) returns None and accumulates the result into x.grad, so you compare against x.grad rather than a return value.
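A quick check of that relation, continuing the toy snippet above (so still assuming the elementwise my_forward):

x.grad = None  # clear anything accumulated so far
y.backward(torch.ones_like(y))  # seed the backward pass with an all-ones upstream gradient
assert torch.allclose(x.grad, grad_inp / grad_out)  # holds here because the Jacobian of an elementwise op is diagonal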
I’m not entirely across the details (the code is largely copied from elsewhere), but it seems to work, though the above may not, as it’s just an adaptation of this code, which is a little harder to follow.
Doesn’t help with how to do it in S4TF, but it might at least help you understand what’s going on on the PyTorch side, to let you just do l.backward() or y.backward(torch.ones_like(y)). This post also looks to have some nice details on how y.backward(...) works in PyTorch.
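To make the scalar-versus-vector point concrete, here’s a small self-contained sketch (the names and the sum loss are my own choices) showing that l.backward() is just y.backward(g) with g = dl/dy filled in for you:

import torch

x = torch.randn(5, requires_grad=True)
y = x * 3

# scalar route: PyTorch implicitly seeds backward with d(l)/d(l) == 1.0
l = y.sum()
l.backward(retain_graph=True)
grad_via_scalar = x.grad.clone()

# vector route: supply dl/dy yourself; for l = y.sum() that's all ones
x.grad = None
y.backward(torch.ones_like(y))
assert torch.allclose(grad_via_scalar, x.grad)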
I gather part of your issue is that whatever you’re using in S4TF behaves more like
.backward() with no explicit gradient argument. I frequently encountered the only-scalars issue with