I ran a bunch of benchmarks on someone's modified version of the Pets notebook, using my RTX 2080.
| ResNet | Precision | Batch Size | Image Size | Time (m:ss) | Error Rate |
|---|---|---|---|---|---|
| 34 | fp16 | 100 | 224 | 4:47 | 0.053 |
| 34 | fp32 | 100 | 224 | 5:16 | 0.055 |
| 34 | fp32 | 48 | 320 | 7:29 | 0.053 |
| 50 | fp32 | 32 | 320 | 9:19 | 0.044 |
| 50 | fp16 | 32 | 320 | 8:40 | 0.043 |
| 50 | fp16 | 32 | 299 | 8:07 | 0.044 |
| 50 | fp16 | 64 | 299 | 7:20 | 0.041 |
I think I have a bottleneck somewhere, since someone else posted similar times on a 1060. On the bright side, doubling the batch size was as simple as wrapping the learner in `to_fp16(…)`.
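For anyone who wants to reproduce this, here's a minimal sketch of the mixed-precision setup, assuming the standard fastai v1 Pets pipeline from the course notebook; the `size=299, bs=64` settings match the last row of the table, and the epoch count is just illustrative:

```python
from fastai.vision import *

# Standard Pets data setup from the course notebook.
path = untar_data(URLs.PETS)
fnames = get_image_files(path/'images')
pat = r'/([^/]+)_\d+.jpg$'  # extract the breed label from the filename

data = ImageDataBunch.from_name_re(
    path/'images', fnames, pat,
    ds_tfms=get_transforms(), size=299, bs=64,
).normalize(imagenet_stats)

# to_fp16() converts the model to half precision (with loss scaling),
# which roughly halves activation memory and is what let me double
# the batch size here.
learn = cnn_learner(data, models.resnet50, metrics=error_rate).to_fp16()
learn.fit_one_cycle(4)
```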