Interesting… It worked out of the box for my LM transfer learning work on an AWS p3.2xlarge where the half-precision flag is used. But I didn't check the wall times to see whether the speedup was actually due to the half-precision cores.
Can you override the half flag and see if it still works?
I'm experimenting on a dataset with the 1cycle LR schedule, using a cycle length of 10. I warmed up the GPU with one run and then did the following test from a Python script.
The first run is with fp16 and the second is without. As you can see, the runtime is 5:58 with fp16 and 6:14 without. The seconds per iteration are almost the same (roughly 35-37), so the difference is insignificant, right?
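For reference, this is how I'd double-check wall times like these. CUDA runs asynchronously, so the clock should only be read after a synchronize; the learn.fit call below is just a placeholder for whatever the script actually runs:

```python
import time
import torch

def timed(run):
    torch.cuda.synchronize()          # finish any pending GPU work before starting the clock
    start = time.time()
    run()
    torch.cuda.synchronize()          # wait for all kernels launched by the run to finish
    return time.time() - start

# e.g. elapsed = timed(lambda: learn.fit(lr, 1, cycle_len=10))   # placeholder call
```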
I'm currently using 96x96 images to quickly test different models. Do you suspect that, since each epoch of around 30k images finishes in around 36 seconds, the transformations could be a bottleneck? Currently, CPU utilization is hovering around 90-95%.
To establish if this is the problem, should I load in slightly larger images and see the difference? Would that be a reasonably good direction to debug?
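Another check I could do is to time the data pipeline on its own, without touching the model (a rough sketch; train_dl stands for whatever loader I'm using):

```python
import time

def loader_throughput(train_dl, n_batches=100):
    it = iter(train_dl)
    start = time.time()
    for _ in range(n_batches):
        next(it)                       # decode + transforms only, no GPU work
    elapsed = time.time() - start
    print(f"{n_batches} batches in {elapsed:.1f}s ({n_batches / elapsed:.1f} batches/s)")
```

If that alone comes close to the 36 seconds per epoch, the CPU-side transforms are the limit rather than the GPU.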
You could try using torchvision transforms and the regular PyTorch DataLoader - we noticed in DAWNBench that the fastai transform pipeline is slower when using lots of cores (we're working on fixing this). You should install pillow-simd with accelerated libjpeg using this:
Looks like learn.half() works perfectly in a notebook too! Using it with the torchvision dataloaders gave a really good speedup of almost 40%. Now that the model is in half-precision mode, I can use a larger batch size (almost double) and get some more speedup on my datasets.
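Roughly, the setup looks like this in plain PyTorch (a sketch; the path, normalization stats, model, optimizer and criterion are placeholders for my actual code):

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# torchvision transform pipeline in place of the fastai one
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(96),                # the 96x96 images I'm testing with
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],      # ImageNet stats; swap in your own
                         [0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder('data/train', train_tfms)   # placeholder path
train_dl = torch.utils.data.DataLoader(
    train_ds, batch_size=512, shuffle=True,          # batch size roughly doubled thanks to fp16
    num_workers=8, pin_memory=True)                  # pinned memory allows async GPU copies

model = models.resnet50(pretrained=True).cuda().half()      # stand-in model in fp16
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for x, y in train_dl:
    x = x.cuda(non_blocking=True).half()             # inputs have to be fp16 to match the model
    y = y.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(x).float(), y)            # compute the loss in fp32 for stability
    loss.backward()
    optimizer.step()
```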
I haven't gotten around to benchmarking with and without pillow-simd yet, but that's alright.
To put things in perspective, the model that took 11+ hours to train now runs in ~4 hours. The tricks are: torchvision loaders + fp16 + 2x batch size + pillow-simd + data prefetching.
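The data prefetching bit is basically this idea (a rough sketch of overlapping the next batch's host-to-GPU copy on a side CUDA stream - the general pattern, not the exact code from the repo):

```python
import torch

class DataPrefetcher:
    """Wraps a DataLoader and copies the next batch to the GPU on a side stream."""
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_x = self.next_y = None
        self._preload()

    def _preload(self):
        try:
            x, y = next(self.loader)
        except StopIteration:
            self.next_x = self.next_y = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking copies need pin_memory=True on the DataLoader
            self.next_x = x.cuda(non_blocking=True).half()
            self.next_y = y.cuda(non_blocking=True)

    def __iter__(self):
        while self.next_x is not None:
            torch.cuda.current_stream().wait_stream(self.stream)   # make sure the copy has landed
            x, y = self.next_x, self.next_y
            self._preload()            # kick off the next copy while this batch is being used
            yield x, y

# usage: for x, y in DataPrefetcher(train_dl): ...
```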
This week when I get a chance, I’ll try it on MultiGPU and report the results.
All thanks to the imagenet-fast repo - it contains a lot of small tricks that make training way faster. These aren't documented in a single place on the web except the DAWN repo. Maybe I should write a post about this.
I’m having some issues getting FP16 training working on a V100. It looks like my gradients are too large?
Here's the weird part: every time I run the notebook, the number in the error is the same - 6291456. Loss scaling doesn't help; I've tried scaling the loss by factors from 10 up to 1,000,000, but the problem persists and the 6291456 value stays the same. Any ideas on debugging this?
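For reference, this is roughly the loss-scaling recipe I've been following (a generic sketch with fp32 master weights; the names and model are illustrative, not fastai's internals):

```python
import torch
import torch.nn as nn
from torchvision import models

loss_scale = 512.0                                    # one of the static factors I tried

model = models.resnet50().cuda()                      # stand-in for my actual model
master_params = [p.detach().clone().float() for p in model.parameters()]   # fp32 master weights
for p in master_params:
    p.requires_grad_(True)
model = model.half()                                  # the training copy lives in fp16
optimizer = torch.optim.SGD(master_params, lr=0.01)   # optimizer steps on the fp32 copies
criterion = nn.CrossEntropyLoss()

def fp16_step(x, y):
    # x, y are assumed to already be on the GPU
    loss = criterion(model(x.half()).float(), y)      # loss computed in fp32
    model.zero_grad()
    (loss * loss_scale).backward()                    # scale up so small grads don't underflow in fp16
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float() / loss_scale   # unscale into fp32 grads
    optimizer.step()
    for master, p in zip(master_params, model.parameters()):
        p.data.copy_(master.data)                     # copy updated fp32 weights back into the fp16 model
    return loss.item()
```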
Also, I'm not sure if this is relevant, but I'm using conv_learner to add a custom head. When I call learn.half() I get the error 'Model' object has no attribute 'fc_model', raised from the line if not isinstance(self.models.fc_model, FP16): self.models.fc_model = FP16(self.models.fc_model).
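In the meantime I guess I can do the conversion by hand on the custom model in plain PyTorch, something like this (a sketch; keeping BatchNorm layers in fp32, which I understand is the usual advice for half-precision training):

```python
import torch.nn as nn

def to_half_keep_bn(model: nn.Module) -> nn.Module:
    model.half()                                      # cast weights and buffers to fp16
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.float()                                 # keep BatchNorm in fp32 for stable statistics
    return model
```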