Just learned that the GeForce RTX cards (not the Titan RTX) have half-speed FP32 accumulate when running in FP16 mode. Ugh.
Does anyone know if this is enough of a bottleneck to make mixed precision training not worth it? It seemed that way with the GTX 10-series, where FP16 itself was crippled at 1/64 speed. I don’t know enough about MPT to figure this out, and I can’t find any benchmarks. I was hoping to buy an RTX machine after finishing the fast.ai classes, thinking I’d have full MPT capability at my fingertips, but reality may prove different.
EDIT: There are some benchmarks here, but none of these have full “uncrippled” capability so it’s hard to see what the difference would be with the uncrippled Titan RTX.
Well, it appears the answer was staring me in the face the whole time. The Titan V is actually not crippled, so comparing FP16 vs FP32 on it, at least in the context of ResNet-50, shows that the RTX 20-series cards lose roughly 15-20% performance on account of the half-speed accumulate. While less than ideal, the speedup from MPT is still substantial enough to make it worth it, even on a crippled card.
Interesting, thanks for sharing! I’m not sure I’m following you 100%, though: isn’t an operation like a convolution done entirely in FP16?
I don’t really understand MPT (I’m just trying to plan ahead for the future - still on part 1, lesson 3!), but I believe the multiplies are done in FP16 while the accumulation step (summing the products) is done in FP32 to avoid losing accuracy. Hopefully someone with a better understanding can correct me if needed.
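To see why the accumulator precision matters, here’s a small NumPy sketch (purely illustrative, not tied to any particular GPU): summing many small FP16 values with an FP16 accumulator stalls once the running total gets large enough that the increments round away, while an FP32 accumulator over the same FP16 inputs stays accurate. This is the “accumulate” step that runs at half rate on the GeForce RTX cards.

```python
import numpy as np

# 10,000 small FP16 values that should sum to roughly 1.0
vals = np.full(10000, 0.0001, dtype=np.float16)

# FP16 accumulator: once the total reaches ~0.5, the FP16 spacing
# (~4.9e-4 there) exceeds twice the increment, so additions round to zero
fp16_sum = np.float16(0.0)
for v in vals:
    fp16_sum = np.float16(fp16_sum + v)

# FP32 accumulator over the same FP16 inputs: stays close to 1.0
fp32_sum = np.float32(0.0)
for v in vals:
    fp32_sum += np.float32(v)

print(fp16_sum)  # stalls well short of 1.0
print(fp32_sum)  # close to 1.0
```

The same idea is why tensor-core matrix multiplies in MPT take FP16 inputs but keep the dot-product accumulator in FP32.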