Lesson-1-pets Benchmarks

Let’s try to keep this thread on topic…

The purpose of this thread is to share your experience running the “lesson-1-pets.ipynb” Jupyter notebook on various platforms. Many people are eager to get their local servers up and running, or even to build a brand-new box with the latest hardware. Others may prefer a paid cloud option. Either way there is a cost: money upfront to build a local server, or pay-as-you-go with a cloud option (once credits run out). Hopefully, in the end, people will see which platform suits their current situation best. All I am asking is that people share their processing times for the various sections of the notebook; that is it. This is not an install help thread, nor a “what does this do” thread, nor a “mine is better” thread.

If you have a local server, please list the relevant components. For the cloud options, note which configuration you chose, etc.

I will get this started:

I have a local server, here are the specs:
OS: Ubuntu 18.04.1 LTS
RAM: 64GB
CPU: Intel 6850K
HD: Samsung Nvme 960
GPU: 1080ti x 2

Benchmarks:
Training: resnet34
learn.fit_one_cycle(4): Total time: 01:10 (single gpu)
learn.fit_one_cycle(4): Total time: 01:12 (dual gpu)

after Unfreezing, fine-tuning, and learning rates
learn.fit_one_cycle(1): Total time: 00:21 (single gpu)
learn.fit_one_cycle(1): Total time: 00:19 (dual gpu)

learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 00:42 (single gpu)
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 00:37 (dual gpu)

Training: resnet50
learn.fit_one_cycle(5): Total time: 04:21 (single gpu)
learn.fit_one_cycle(5): Total time: 03:03 (dual gpu)

after Unfreeze:
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 01:09 (single gpu)
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 00:46 (dual gpu)

As you can see in this example, running multiple GPUs for resnet34 did not improve performance; it performed about the same as a single GPU. For resnet50, dual GPUs cut the time by roughly 30% (04:21 down to 03:03).

Please share your cloud experiences, or your local server experience if you have one. Over time, I or someone else will create a spreadsheet to track them all.

Thanks again.

edit: I run the notebook as-is. For a single GPU, I change nothing. To test dual GPUs, I simply added `learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])` before fitting. If you make any deviations from the code, please note them.
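
For anyone who wants to reproduce the dual-GPU runs, here is a minimal sketch of where that one line slots in. The data/learner setup just mirrors the lesson-1-pets code with fastai v1 naming (`ImageDataBunch.from_name_re`, `create_cnn`); treat those details as assumptions and check them against your copy of the notebook.

```python
import torch
from fastai.vision import *
from fastai.metrics import error_rate

# data pipeline as in the lesson-1-pets notebook (sketch)
path = untar_data(URLs.PETS)
path_img = path/'images'
fnames = get_image_files(path_img)
pat = r'/([^/]+)_\d+.jpg$'  # label-extraction regex used by the notebook

data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=224, bs=64
                                  ).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet34, metrics=error_rate)

# the only change for the dual-GPU timings: wrap the model in DataParallel
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])

learn.fit_one_cycle(4)  # timed as reported above
```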

7 Likes

@FourMoBro Interesting to compare the initial numbers with your results, given a lesser-powered GPU, while appreciating that the GPU isn’t everything. You’ve got me wondering about my HD I/O. Here are the key run times.

OS: Ubuntu 18.04.1 LTS
RAM: 32GB
CPU: Intel i7-7700K
HD: Samsung SM961 Polaris M.2 NVMe
GPU: Titan XP, 2080 Ti. CUDA 9.2, Driver 410.

Benchmarks:
Training: resnet34
learn.fit_one_cycle(4): Total time: 01:23 (2080 ti)
learn.fit_one_cycle(4): Total time: 01:23 (2080 ti fp16)
learn.fit_one_cycle(4): Total time: 01:25 (Titan XP)

Training: resnet50
learn.fit_one_cycle(5): Total time: 03:21 (2080 ti)
learn.fit_one_cycle(5): Total time: 02:46 (2080 ti fp16)
learn.fit_one_cycle(5): Total time: 03:58 (Titan XP)

resnet50 after Unfreeze:
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 00:51 (2080 ti)
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 00:40 (2080 ti fp16)
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 01:02 (Titan XP)

No dual GPU improvement.
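
For reference, the fp16 rows refer to mixed-precision training. Below is a minimal sketch of one way to turn it on in fastai v1; whether this matches the exact call used for the runs above is an assumption, and `data` stands for the notebook’s usual ImageDataBunch.

```python
# sketch: same learner as the notebook, converted to mixed precision with
# fastai v1's to_fp16(); batch size, epochs and learning rates left unchanged
learn = create_cnn(data, models.resnet50, metrics=error_rate).to_fp16()
learn.fit_one_cycle(5)
```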

2 Likes

I’d also be interested to hear if anyone can get more accurate results by changing learning rates, the number of epochs, etc. Can anyone get under 4% error reliably? How about 3%? :slight_smile:

5 Likes

OT: you need more RAM, at the very least twice as much.

IT: I’ll post my benchmarks ASAP :wink:

This is a pretty tiny system and, as Jeremy noted, less than 8GB of RAM will no doubt be frustrating. Still, it seems to have got through everything, at least with a batch size of 12.
But I did have something weird happen with learn.lr_find(), which stopped at 61.96% and turned the thermometer progress widget red. There were no error messages, and everything else just continued to work, but perhaps the learning rate got miscalculated? I suspect this was due to too few resources (memory)? Perhaps @jeremy knows more about what may have gone wrong. Anyway, I think I’ll move over to one of the online solutions, which have worked great for me previously.

OS: Ubuntu 18.04.1 LTS on a 60 Gb partition (dualboot with Windows 10)
RAM: 16GB
CPU: Intel i5-8600K CPU @ 3.60GHz
HD: Samsung NVMe SSD Controller SM961/PM961
GPU: GeForce GTX 1060 6GB

Benchmarks:
Training: resnet34
learn.fit_one_cycle(4): Total time: 01:56 (single gpu)

after Unfreezing, fine-tuning, and learning rates
learn.fit_one_cycle(1): Total time: 00:38 (single gpu)

learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 01:14 (single gpu)

Training: resnet50
Batch size (bs) set to 12
learn.fit_one_cycle(5): Total time: 09:02 (single gpu)

after Unfreeze:
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 02:30 (single gpu)

That’s great!

That’s the expected behavior. We’ll learn more next week.

3 Likes

The LR finder (at least the old one) did underperform on small batch sizes. It has/had to see enough data in order to give you a reliable estimate about LR vs. loss.
Note also that using different batch sizes for the finder and the actual training should NOT be done, since the optimal LR is a function (among other things) of the batch size.
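
In practice that means: if you had to shrink the batch size to fit in memory, re-run the finder at that same batch size before picking an LR. A minimal sketch with fastai v1 naming, reusing the notebook’s `path_img`/`fnames`/`pat` variables (assumptions, check against your notebook):

```python
# sketch: run the LR finder on the same DataBunch (same bs) you will train with;
# bs=12 mirrors the reduced batch size reported above
data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=224, bs=12
                                  ).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet50, metrics=error_rate)
learn.lr_find()        # stopping early with a red bar is normal: it halts once the loss diverges
learn.recorder.plot()  # read max_lr off this plot
learn.fit_one_cycle(5, max_lr=1e-3)  # 1e-3 is only a placeholder; use your plot
```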

2 Likes

If you want a more real-world benchmark for dual-GPU training time, then you need to double your batch size. I realize you need to fix parameters for benchmarking to get good comparisons, but I can’t think of a reason you wouldn’t want to do this.
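
For example, a sketch of what that would look like, assuming the notebook’s default bs=64 for resnet34 and the DataParallel wrapping from the first post:

```python
# sketch: with two GPUs via DataParallel, double the DataBunch batch size so
# each GPU still sees a batch of 64
data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=224, bs=128
                                  ).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])
learn.fit_one_cycle(4)
```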

Adding my own measurements to the list (the low end…),
and who needs spreadsheets, we have markdown :wink:
My 1060 measurements closely mirror those of @Edward (but I was able to use bs=32).

Single GPU Benchmarks

| Task | 1050 | 1060 | 1080 Ti | 2080 Ti | K80 | V100* |
|---|---|---|---|---|---|---|
| GPU Mem | 4 GB | 6 GB | 11 GB | 11 GB | 12 GB | 16 GB |
| CUDA - Driver | 9.2 - 396.54 | 9.2 - 396.54 | | | | |
| System | Dell XPS-15 | | | | AWS p2.xl | Sagemaker p3.2xl |
| CPU | i7-7700HQ@2.8 | i5-7600K@3.8 | Intel 6850K | i7-7700K | E5-2686 | ? |
| RAM | 16GB | 16GB | 64GB | 32GB | 61GB | 61GB |
| Storage for training data | Samsung NVMe SSD | | Samsung Nvme 960 | Samsung SM961 NVMe | SSD | SSD |
| OS | Ubuntu 18.04.1 | Ubuntu 18.04.1 | Ubuntu 18.04.1 | Ubuntu 18.04.1 | Ubuntu 16.04.5 | ? |
| **resnet34 (bs=64)** | | | | | | |
| F learn.fit_one_cycle(4) | 04:03 | 02:01 | 01:10 | 01:23 | 03:55 | 01:56 |
| U learn.fit_one_cycle(1) | 01:22 | 00:40 | 00:21 | | 01:02 | 00:29 |
| **resnet50 (bs=48)** | | | | | | |
| F learn.fit_one_cycle(5) | 17:24¹ | 09:02² | 04:21 | 03:21 | 15:45 | 03:41 |
| F learn.fit_one_cycle(5) fp16 | | | | 02:46 | | |
| U learn.fit_one_cycle(1) sl(1e-6,1e-4) | 04:52¹ | 02:24² | 01:09 | 00:51 | 04:04 | 00:46 |
| U learn.fit_one_cycle(1) sl(1e-6,1e-4) fp16 | | | | 00:40 | | |

¹ bs=16
² bs=32
F = frozen, U = learn.unfreeze()

* V100 taken from the v3 lesson 1 (Jeremy runs ml.p3.2xl on sagemaker as can be seen in the video)

8 Likes

Substantial gains. But less than what I was expecting.

What about accuracy in FP16 vs 32?

1 Like

Sorry, to be clear: the 1080 Ti and 2080 Ti numbers were taken from @FourMoBro’s and @digitalspecialists’ posts.
It would be interesting to add the achieved accuracy to the comparison, because I would also like to see whether the batch size actually influences the accuracy reached after the same number of epochs…

Mhh, this would be an entirely different matter.

My question was more about whether (and how much) FP16 taxes accuracy. Since the rounding/truncation-sensitive parts of mixed precision have been left in FP32, it should not impact accuracy. But it would be useful to see some comparisons just to be sure. The hyperparameters should be left untouched, obviously.
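
A sketch of how that comparison could be run (fastai v1 naming assumed, `data` being the notebook’s ImageDataBunch; same architecture, epochs and LRs, only the precision differs):

```python
# sketch: train the same model twice with identical hyperparameters,
# once in fp32 and once in mixed precision, then compare validation metrics
learn32 = create_cnn(data, models.resnet50, metrics=error_rate)
learn32.fit_one_cycle(5)

learn16 = create_cnn(data, models.resnet50, metrics=error_rate).to_fp16()
learn16.fit_one_cycle(5)

# validate() returns [val_loss, error_rate]; expect some run-to-run noise,
# so a few repeats would be needed for a fair comparison
print('fp32:', learn32.validate())
print('fp16:', learn16.validate())
```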

Agreed, but both would be answered by adding the same metric :wink:

1 Like

Here are the times I got for the lowest end of the Salamander GPU servers (AWS cloud).

Server: Salamander 1x K80 GPU.
OS: Ubuntu 16.04.5 LTS
RAM: 61GB
CPU: 4x vCPU (Intel E5-2686)
HD: SSD (couldn’t find model)
GPU: K80 (12GB RAM)

Benchmarks, following the default steps and values in the notebook:

Training: resnet34
learn.fit_one_cycle(4): Total time: 03:55
learn.fit_one_cycle(1): Total time: 01:02
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 02:03

Training: resnet50
learn.fit_one_cycle(5): Total time: 15:45
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 04:04

Man, I really need Markdown practice… at any rate…

Well, I started that challenge this evening, but I didn’t get as far as I would have liked. I did a few experiments using both GPUs in parallel and watched the GPU memory usage as they ran. The first thing I did was simply increase the number of epochs to 25 in both the initial fit and the unfreeze fits.

For resnet34 I can get the error down to .051231, with the training loss ending at .038438 and the validation loss at .219566. resnet50 was better, with the error down to .042848 and train/val losses at .018975 and .184779. Based on memory usage and the downward trend in the training loss, I decided to try more epochs and to adjust the batch size.

So I increased the epochs to 50, with bs=256 for resnet34 and bs=64 for resnet50. I did not touch the learning rate; I should have, because the graph made no sense to me. I need to check the code and see whether the bs change is reflected in the LR code or not. Anyway… resnet34 had worse results for error (.059215) and val loss (.258286), though the train loss was .012579. resnet50 was also worse, with train/val/error at .004768/.215050/.048780. Again, I need to check out that LR code tomorrow.
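
For concreteness, roughly what that second resnet50 run looks like in notebook code (a sketch with fastai v1 naming and the notebook’s `path_img`/`fnames`/`pat`; the lr_find re-run at the end is the part I still need to do, since the appropriate LRs shift when the batch size changes):

```python
# sketch: bs=64, 50 epochs frozen, then unfreeze; size=299 as in the notebook's
# resnet50 section (assumption); the max_lr slice is the notebook default and
# probably needs re-tuning after the batch-size change
data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=299, bs=64
                                  ).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(50)

learn.unfreeze()
learn.lr_find()          # re-check the LR range at this batch size
learn.recorder.plot()
learn.fit_one_cycle(50, max_lr=slice(1e-6, 1e-4))
```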

But for now, the copper can cool and everything can rest for fresh eyes tomorrow.

1 Like

OK, I’ll check up on setting the LR for my batch size. Thanks!

1 Like

Very interesting! So actually the K80 is closer to a 1050 than a 1060 speedwise, so if memory is not an issue, the 1060 is preferable?!
Added it to the table https://forums.fast.ai/t/lesson-1-pets-benchmarks/27681/9?u=marcmuc

My benchmarks, low end.
OS: Ubuntu 16.04.1
CPU: Core i3 (4 cores)
HD: HDD (RAID edition)
RAM: 16GB
Running in Docker

resnet34:
learn.fit_one_cycle(4) - 03:16
unfreeze, learn.fit_one_cycle(1) - 00:46

resnet50, bs=48:
learn.fit_one_cycle(5) - 05:49
unfreeze, learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)) - 01:17

1 Like

My dated CPU and/or memory are apparently bottlenecking my NVMe drive and 2080, as I am getting resnet34 times similar to the 1060, and a wide range of batch sizes and fp32 vs fp16 don’t affect the times.

What I can add to the benchmarks is that at fp32 my max batch size is 192, and at fp16 I can get at least 350. I get an OOM error at 380, so not quite double the capacity. I’d find the precise max, but I am already late for work ;).

1 Like

resnet34, no changes: 01:13

RAM: 32GB
GPU: 1080 Ti
CPU: AMD Ryzen 7
HD: cheap SATA SSD

M