Fastai notebook computation time benchmarks

Today (1/28/18) I decided to benchmark how long certain parts of the notebooks take to run. With Paperspace and other cloud services being the preferred options for running this course, I wanted a place where people could compare the different cloud options as well as setups created by students running their own “servers”. Does it make sense to run the class in the cloud? What about building your own server? What about the hardware you already have? My hope is that this is a place where we can simply share observations. It is not about being faster (yet); it is just about showing the options and how they compare.

I do not dabble in cloud computing, simply because I have a custom-built machine that handles DL/ML very well. It is currently set up as a dual-boot Win10/Ubuntu system with fastai running natively. I have also recently purchased a laptop with an Nvidia card for another purpose. Today I thought it would be good to see how the setups compare, starting with the two main operations in the lesson1 notebook.

To set up, I pulled the latest fastai repo and also ran a conda env update. Here are my results:

While not a complete apples-to-apples test, the laptop came in last place, even with its 6GB 1060 card.
The custom-built desktop with the 1080 Ti was more than 50% faster than the laptop. I was quite surprised.
The same machine booted into Ubuntu was 40% faster than under Windows! It could be that Ubuntu runs on an NVMe drive while Win10 runs on an SSD, or it could be drivers, but I was impressed.

As time allows, I will add tabs for each notebook with the computational operations identified. In the meantime, I have put my results into Google Slides. If anyone would like to contribute to the slides, send me a message and I will share the link.

Community Results:

Thanks, I’ll post my results relative to that part by tomorrow. I imagine that with pytorch using the gpu I should attain something in between your laptop and your monster rig (given that you are using just one 1080ti). I’ll keep you posted.

For the moment, let me say that I obtained the result I posted on a notebook I wrote from scratch (as I usually do). The same code in the standard notebook provided by the fastai repo runs almost instantaneously the first time you execute it. I’m clearly making some stupid mistake, having slept 6 hours in 3 days. My best guess is that the GPU is initialized by PyTorch but not released once it finishes its job. Either that, or I’m suffering from memory leaks.

  • GPU: Titan V

  • Rig: Desktop, i7

  • Operating System: Ubuntu 16.04

  • RAM: 32 G

  • User: @prairieguy

  • Date: 1/29/2018

  • Data Augmentation:, 3, cycle_len=1), no-load( only): 2:01 min

  • Fine Tuning:, 3, cycle_len=1, cycle_mult=2), no-load( only): 8:19 min

  • GPU: Titan Xp

  • Rig: Desktop, i7

  • Operating System: Ubuntu 16.04

  • RAM: 32 G

  • User: @prairieguy

  • Date: 1/31/2018

  • Data Augmentation:, 3, cycle_len=1), no-load( only): 2:51 min

  • Fine Tuning:, 3, cycle_len=1, cycle_mult=2), no-load( only): 13:16 min

Hmm, the Titan V needs roughly 30% less time than the Titan Xp, even without using tensor cores.

However, great rig!
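
As a quick check on the comparison above, here is a minimal Python sketch that converts the posted m:ss times into seconds and reports both the fraction of time saved and the speedup factor. The helper names `to_seconds` and `compare` are made up for this illustration; the timings are the ones posted above.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '2:01' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def compare(fast, slow):
    """Return (fraction of time saved, speedup factor) of fast relative to slow."""
    f, s = to_seconds(fast), to_seconds(slow)
    time_saved = (s - f) / s   # share of the slower run's time that is saved
    speedup = s / f            # how many times faster the quicker run is
    return time_saved, speedup


# Titan V vs Titan Xp, using the times posted in this thread
for name, fast, slow in [("Data Augmentation", "2:01", "2:51"),
                         ("Fine Tuning", "8:19", "13:16")]:
    saved, speedup = compare(fast, slow)
    print(f"{name}: {saved:.0%} less time, {speedup:.2f}x speedup")
```

Note that "less time" and "faster" are not the same number: about 29% less time on data augmentation corresponds to a speedup of about 1.41x.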

My benchmark on, 3, cycle_len=1)

i7 Haswell, GTX 1070, 16gb, Windows 10, no load on gpu, but a lot of load on the cpu.

3/3 [05:41<00:00, 113.99s/it]

Considering that the GTX 1080 Ti did it in 4:16 on Windows, my wall time seems OK.

The thing I find most interesting is the big discrepancy @FourMoBro saw between Windows and Linux.
It cannot be entirely (or even partly) attributed to the NVMe SSD, a fortiori because he has 64GB of RAM (so I rule out swapping).

GPU: Titan Xp

Rig: AMD 1950x

Operating System: Ubuntu 16.04, kernel 4.15

RAM: 64 G

Disk: Samsung 960 Pro NVME

Date: 2/19/2018

Data Augmentation:, 3, cycle_len=1) 1:56 min

Fine Tuning:, 3, cycle_len=1, cycle_mult=2): 10:27 min

How do you do the benchmarking? I would also like to run a benchmark on my 1080 Ti / Samsung 960 Pro NVMe. Thanks

Which parameters in the lesson1 notebook did you capture for the benchmark? Thanks
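
For anyone wanting to capture wall time the same way, here is a minimal sketch in plain Python. Inside Jupyter the `%%time` cell magic reports wall time directly; the context manager below does the same job outside a notebook. The names `wall_timer` and `format_mmss` are hypothetical helpers for this illustration, and the summed loop is only a stand-in for whatever training call you want to measure.

```python
import time
from contextlib import contextmanager


def format_mmss(seconds):
    """Format a duration in seconds as m:ss, e.g. 341 -> '5:41'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


@contextmanager
def wall_timer(label, results):
    """Record the wall-clock time of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with wall_timer("placeholder workload", results):
    # Stand-in for the notebook cell being benchmarked (an assumption,
    # replace with the actual training call).
    total = sum(i * i for i in range(100_000))

for label, elapsed in results.items():
    print(f"{label}: {format_mmss(elapsed)} ({elapsed:.3f} s)")
```

When comparing machines, it helps to report the same cell, the same parameters, and whether the GPU or CPU was under other load at the time, as the posts above do.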

My benchmark on, 3, cycle_len=1)

i7 Haswell, GTX 1070, 16gb, Windows 10, no load on gpu, but a lot of load on the cpu.

3/3 [05:41<00:00, 113.99s/it]

For direct comparison, I can add that the same GTX 1070 8GB on a similar Windows 10 laptop, worth about the same $2k, executes that cycle in 7:37.

Mmhh… I think, but I’m not sure, that the mobile 1070 is less powerful than the desktop version.

I set up my own DL machine recently and wanted to benchmark it to see if my setup is up to the mark.

Here are my system details. I’m running Ubuntu 16.04.

Motherboard: Asus 320M-K
Processor: Ryzen 5
DDR4 RAM: Corsair 16GB RAM
SATA HDD: 2TB Seagate
SMPS Power supply 750W

Here is my benchmark data for different batch sizes on the lesson1 cats and dogs classification. I used 5 epochs. I am not sure whether this is the correct parameter to benchmark against. Any input will help.

Batch Size   trn_loss   val_loss   Accuracy   Wall Time (seconds)
64           0.031134   0.028481   0.989      15.7
128          0.028619   0.029348   0.989      14.1
256          0.032689   0.022995   0.991      13.2
512          0.038162   0.025427   0.9895     12.7
1024         0.055639   0.02597    0.988      12.2
2048         0.08693    0.034631   0.987      11.4
4096         0.165338   0.048062   0.983      11.5
8192         0.303578   0.060767   0.9795     10.1
16384        0.346356   0.091748   0.98       6.15
32768        0.651255   0.262653   0.927      4.66
65536        0.676977   0.250999   0.9475     4.74
131072       0.56841    0.24005    0.9415     4.73

I stopped at batch size 131072. I am not sure how much more load my GPU can take, but before going further I need to know whether I’m on the right track with the benchmarking.
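
One thing that may explain the flat wall times at the largest batch sizes: once the batch size exceeds the number of training images, every epoch is a single batch, so raising it further would change nothing. A rough sketch, assuming a training set of 23,000 images (an assumption for the lesson1 cats and dogs data; the post does not state it) and a hypothetical helper `batches_per_epoch`:

```python
import math

TRAIN_IMAGES = 23_000  # assumed training-set size, not stated in the post


def batches_per_epoch(n_samples, batch_size):
    """Batches needed to cover n_samples once (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


for bs in [64, 1024, 16384, 32768, 65536, 131072]:
    n = batches_per_epoch(TRAIN_IMAGES, bs)
    print(f"batch size {bs:>6}: {n} batch(es) per epoch")
```

Under that assumption, 16384 gives 2 batches per epoch and everything from 32768 upward gives 1, which would be consistent with the timings levelling off around 4.7 seconds.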