Properly timing GPU work?

I spent some time trying to figure out why these two loops take the same 0.01 ms:

    var big = Tensor<Float>(randomNormal: [50, 50])
    time(repeating: 10) { big = big • big }

    var evenBigger = Tensor<Float>(randomNormal: [10000, 10000])
    time(repeating: 10) { evenBigger = evenBigger • evenBigger }

and I realized that this is probably because there is no GPU sync, so the compute just runs asynchronously and the timer only measures the time to enqueue the ops. How do I measure this properly?

Ah, I think I figured it out. Copying a scalar from the tensor back to the host forces a GPU sync:

    // Copy a scalar back to the host to force a GPU sync.
    _ = tmp[0, 0].scalar
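
So my measurement now looks roughly like this (same time(repeating:) helper as above; reading an element of the result back is what forces the wait):

    var evenBigger = Tensor<Float>(randomNormal: [10000, 10000])
    time(repeating: 10) {
        let tmp = evenBigger • evenBigger
        // Reading one scalar back to the host blocks until the matmul
        // has actually finished, so the timer measures real GPU work.
        _ = tmp[0, 0].scalar
    }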

Lemme know if there is some better way to do this :slight_smile:

Side question: most of the code I’ve seen uses scalarized(). I didn’t know about scalar - but that’s much nicer! Is there any reason we’re not using that? Does it do something different?

Huh, I didn’t know about scalarized().

It looks like .scalar returns an optional (because the tensor may not be zero-dimensional), and scalarized() aborts if the input has more than one scalar, so it doesn’t return an optional. That is a crazy subtle distinction for such similar names.
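
If I’m reading it right, the difference looks something like this (a quick sketch, untested):

    let matrix = Tensor<Float>(randomNormal: [2, 2])
    let single = Tensor<Float>(42)

    // .scalar returns an optional: nil for the 2x2 tensor,
    // the value for the zero-dimensional one.
    print(matrix.scalar as Any)  // nil
    print(single.scalar as Any)  // Optional(42.0)

    // scalarized() returns a non-optional value, but aborts if the
    // tensor holds more than one scalar.
    print(single.scalarized())   // 42.0
    // matrix.scalarized()       // would abort at runtime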

I filed TF-454 - t.scalar vs t.scalarized() is super confusing - to track sorting this out.

I’d rather it was called scalar and behaved like scalarized() if that’s an option.