OT: you need more RAM, at the very least twice as much.
IT: I’ll post my benchmarks ASAP
This is a pretty tiny system, and as Jeremy noted, less than 8GB of RAM will no doubt be frustrating. Still, it got through everything, it would seem, at least with a batch size of 12.
But something weird happened with learn.lr_find(), which stopped at 61.96% and turned the progress-bar widget red. There were no error messages, and everything else continued to work, but perhaps the learning rate got miscalculated? I suspect this was due to too few resources (memory)? Perhaps @jeremy knows more about what may have gone wrong. Anyway, I think I’ll move over to one of the online solutions, which have worked great for me previously.
OS: Ubuntu 18.04.1 LTS on a 60 GB partition (dual boot with Windows 10)
RAM: 16GB
CPU: Intel i5-8600K CPU @ 3.60GHz
HD: Samsung NVMe SSD Controller SM961/PM961
GPU: GeForce GTX 1060 6GB
Benchmarks:
Training: resnet34
learn.fit_one_cycle(4): Total time: 01:56 (single gpu)
after the “Unfreezing, fine-tuning, and learning rates” steps:
learn.fit_one_cycle(1): Total time: 00:38 (single gpu)
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 01:14 (single gpu)
Training: resnet50
Batch size (bs) set to 12
learn.fit_one_cycle(5): Total time: 09:02 (single gpu)
after Unfreeze:
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 02:30 (single gpu)
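For reference, the sequence being timed above is the stock lesson 1 notebook; a minimal sketch assuming the fastai v1 API (cnn_learner was create_cnn in older v1 releases):

```python
from fastai.vision import *

# Lesson 1 pets setup (fastai v1); bs is the knob varied in these
# benchmarks, e.g. bs=12 so resnet50 fits on a 6 GB card.
bs = 12
path = untar_data(URLs.PETS)
fnames = get_image_files(path/'images')
pat = r'/([^/]+)_\d+.jpg$'
data = ImageDataBunch.from_name_re(path/'images', fnames, pat,
                                   ds_tfms=get_transforms(), size=224,
                                   bs=bs).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(5)                             # "frozen" timing
learn.unfreeze()
learn.fit_one_cycle(1, max_lr=slice(1e-6, 1e-4))   # "unfrozen" timing
```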
That’s great!
That’s the expected behavior. We’ll learn more next week.
The LR finder (at least the old one) did underperform on small batch sizes: it has to see enough data to give you a reliable estimate of loss vs. LR.
Note also that using different batch sizes for the finder and the actual training should NOT be done, since the optimal LR is a function (among other things) of the batch size.
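To make that concrete, a sketch (continuing the fastai v1 setup above) that keeps the finder and the training run on the same DataBunch, i.e. the same batch size. The early abort is also why the progress bar turned red at ~62% above:

```python
# Same DataBunch (and thus the same bs) for the finder and for training,
# so lr_find sees the same gradient noise the real run will see.
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.lr_find()        # aborts early once the loss diverges -- expected
learn.recorder.plot()  # pick max_lr from the plot (3e-3 is just an example)...
learn.fit_one_cycle(4, max_lr=3e-3)  # ...then train with the same bs
```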
If you want a more real-world benchmark for dual-GPU training time, then you need to double your batch size. I realize you need to fix parameters for benchmarking to get good comparisons, but I can’t think of a reason you wouldn’t want to do this.
Adding my own measurements to the list (the low end…)
and who needs spreadsheets, we have markdown
My 1060 measurements closely mirror those of @Edward (but I was able to use bs=32)
Task | 1050 | 1060 | 1080 Ti | 2080 Ti | K80 | V100*
---|---|---|---|---|---|---
GPU Mem | 4 GB | 6 GB | 11 GB | 11 GB | 12 GB | 16 GB
CUDA / Driver | 9.2 / 396.54 | 9.2 / 396.54 | | | |
System | Dell XPS-15 | – | – | – | AWS p2.xl | Sagemaker p3.2xl
CPU | i7-7700HQ @ 2.8 | i5-7600K @ 3.8 | Intel 6850K | i7-7700K | E5-2686 | ?
RAM | 16 GB | 16 GB | 64 GB | 32 GB | 61 GB | 61 GB
Storage for training data | Samsung NVMe | SSD | Samsung NVMe 960 | Samsung SM961 NVMe | SSD | SSD
OS | Ubuntu 18.04.1 | Ubuntu 18.04.1 | Ubuntu 18.04.1 | Ubuntu 18.04.1 | Ubuntu 16.04.5 | ?
resnet34 (bs=64) | | | | | |
F learn.fit_one_cycle(4) | 04:03 | 02:01 | 01:10 | 01:23 | 03:55 | 01:56
U learn.fit_one_cycle(1) | 01:22 | 00:40 | 00:21 | | 01:02 | 00:29
resnet50 (bs=48) | | | | | |
F learn.fit_one_cycle(5) | 17:24¹ | 09:02² | 04:21 | 03:21 | 15:45 | 03:41
F learn.fit_one_cycle(5), fp16 | | | | 02:46 | |
U learn.fit_one_cycle(1) sl(1e-6,1e-4) | 04:52¹ | 02:24² | 01:09 | 00:51 | 04:04 | 00:46
U learn.fit_one_cycle(1) sl(1e-6,1e-4), fp16 | | | | 00:40 | |
¹ bs=16
² bs=32
F = frozen, U = learn.unfreeze()
* V100 times taken from the v3 lesson 1 video (Jeremy runs an ml.p3.2xl on SageMaker, as can be seen in the video)
Substantial gains. But less than what I was expecting.
What about accuracy in FP16 vs 32?
Sorry, to be clear: the 1080ti and 2080ti numbers were taken from @FourMoBro’s and @digitalspecialists’ posts.
It would be interesting to add the achieved accuracy to the comparison, because I’d also like to know whether the batch size actually influences the accuracy after the same number of epochs…
Mhh, this would be an entirely different matter.
My question was more about whether (and how much) FP16 taxes accuracy. Since the rounding/truncation-sensitive parts of mixed precision are kept in FP32, it should not impact accuracy. But it would be useful to see some comparisons just to be sure. The hyperparameters should be left untouched, obviously.
Agreed, but both questions would be answered by adding the same metric.
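For anyone who wants to run that comparison: to_fp16() is the real fastai v1 switch for mixed precision, the rest is an assumed harness that keeps every hyperparameter identical across the two runs:

```python
def run(fp16: bool):
    learn = cnn_learner(data, models.resnet50, metrics=error_rate)
    if fp16:
        learn = learn.to_fp16()  # fp16 compute, fp32 master weights
    learn.fit_one_cycle(5)       # identical hyperparameters in both runs
    return learn.validate()      # [val_loss, error_rate] on the validation set

print('fp32:', run(False))
print('fp16:', run(True))
```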
Here are the times I got for the lowest end of the Salamander GPU servers (AWS cloud).
Server: Salamander 1x K80 GPU.
OS: Ubuntu 16.04.5 LTS
RAM: 61GB
CPU: 4x vCPU (Intel E5-2686)
HD: SSD (couldn’t find model)
GPU: K80 (12GB RAM)
Benchmarks, following the default steps and values in the notebook:
Training: resnet34
learn.fit_one_cycle(4): Total time: 03:55
learn.fit_one_cycle(1): Total time: 01:02
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 02:03
Training: resnet50
learn.fit_one_cycle(5): Total time: 15:45
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 04:04
Man, I really need Markdown practice… at any rate…
Well, I started that challenge this evening, but I didn’t get as far as I would have liked. I did a few experiments using both GPUs in parallel and watched the GPU memory usage as it worked. The first thing I did was simply increase the epoch count to 25 in both the initial fit and the unfreeze operations.
I can get the error down to .051231, with the training loss ending at .038438 and the val loss at .219566, for a resnet34. resnet50 was better, with the error down to .042848 and train/val losses at .018975 and .184779. Based on the memory usage and the downward trend in the training loss, I decided to try more epochs and adjust the batch size.
So I increased the epochs to 50 and set bs=256 for resnet34 and bs=64 for resnet50. I did not touch the learning rate; I should have, because the graph made no sense to me. I need to check the code and see whether the bs change is reflected in the LR code or not. Anyways… resnet34 had worse results for error (.059215) and val loss (.258286), but train loss was .012579. resnet50 was also worse, with train/val/error being (.004768/.215050/.048780). Again, I need to check out that LR code tomorrow.
But for now, the copper can cool and everything can rest for fresh eyes tomorrow.
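On the LR/bs question: as far as I know fastai does not rescale the learning rate when you change bs, so if you want to compensate, the usual heuristic is the linear scaling rule (Goyal et al. 2017; an outside rule of thumb, not something the notebook applies):

```python
# Linear scaling rule of thumb: scale the LR in proportion to the
# batch size change relative to a setting that is known to work.
base_lr, base_bs = 3e-3, 64   # example values, not from this thread
new_bs = 256
scaled_lr = base_lr * new_bs / base_bs   # -> 1.2e-2 for bs=256
learn.fit_one_cycle(50, max_lr=scaled_lr)
```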
OK, I’ll check up on how the LR should be set for the batch size. Thanks!
Very interesting! So the K80 is actually closer to a 1050 than a 1060 speed-wise; if memory is not an issue, the 1060 is preferable?!
Added it to the table https://forums.fast.ai/t/lesson-1-pets-benchmarks/27681/9?u=marcmuc
My benchmarks, low end.
OS: Ubuntu 16.04.1
CPU: Core i3 (4 cores)
Disk: HDD (RAID edition)
RAM: 16 GB
Running in Docker
resnet34:
learn.fit_one_cycle(4) - 03:16
unfreeze, learn.fit_one_cycle(1) - 00:46
resnet50, bs=48:
learn.fit_one_cycle(5) - 05:49
unfreeze, learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)) - 01:17
My dated CPU and/or memory are apparently bottlenecking my NVMe drive and 2080, as I am getting resnet34 times similar to the 1060, and neither a wide range of batch sizes nor fp32 vs. fp16 affects the times.
What I can add to the benchmarks is that at fp32 my max batch size is 192, and at fp16 I can get at least 350. I get an OOM error at 380, so not quite double the capacity. I’d find the precise max, but I am already late for work ;).
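If anyone wants the precise max without hand-probing, here is a hypothetical helper (make_learner is an assumed callable that builds a fastai learner for a given bs) that bisects on the batch size and catches the CUDA OOM RuntimeError PyTorch raises:

```python
import torch

def max_batch_size(make_learner, lo=192, hi=384):
    """Bisect for the largest bs that survives a training cycle without OOM."""
    while lo < hi:
        bs = (lo + hi + 1) // 2
        try:
            make_learner(bs).fit_one_cycle(1)  # or a single batch, for speed
            lo = bs                  # it fit, so try something larger
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise                # only swallow CUDA OOM, re-raise the rest
            hi = bs - 1
        finally:
            torch.cuda.empty_cache()  # release cached blocks between trials
    return lo
```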
resnet34, no changes: 1:13
RAM: 32 GB
GPU: 1080ti
CPU: AMD Ryzen 7
Storage: cheap SATA SSD
Rounding down by 20% now, are we?
Google Colab
OS: Ubuntu 18.04 Bionic Beaver
CPU: 2x vCPU, Xeon E5
RAM: 12 GB
GPU: Tesla K80, 12 GB
resnet34:
learn.fit_one_cycle(4): 6min 56s
unfreeze, learn.fit_one_cycle(1): 1min 50s
unfreeze, learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): 3min 41s
resnet50:
learn.fit_one_cycle(5): 16min 48s
unfreeze, learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): 4min 18s