Mixed precision training

Thanks for the awesome explanation. What does your overall memory consumption look like when using mixed precision vs. single precision? Are you able to increase your batch sizes in practice, or does keeping a single-precision copy of your weights cancel out the half-precision gains with respect to overall memory usage? I’m sure it varies based on which model you use, but any insights on what you’ve actually seen in practice would be interesting to know.