Lesson-1-pets Benchmarks

One doubt: we are running the notebook with np.random.seed(2). Doesn't this ensure that the split in the DataBunch is always the same?

Is there a thread/doc somewhere detailing which parts of the whole process are reproducible when using the same random seed and which aren't? I haven't dug into it yet, but at first glance nothing is the same across multiple runs with the same seed.

I intentionally avoid using reproducible runs, because I don’t think it’s a good idea to hide the randomness. However, it is important to use the same validation set each time. That’s why I set the random seed before I create the validation set.
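
In practice that just means seeding numpy in the same cell that builds the DataBunch. A minimal sketch (path_img, fnames and pat come from the lesson-1 notebook; the size and bs values are only placeholders):

from fastai.vision import *
import numpy as np

# Seeding numpy immediately before the DataBunch is created fixes the
# train/valid split; weight init, augmentation and dropout stay random.
np.random.seed(2)
data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=224, bs=64)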

OK, that makes sense, and I agree with your choice of default behaviour. But is there an easy way to force all of the (non-CUDA-related) randomization to be consistent between runs? For debugging purposes, for example.

Yes, but you need to re-run the np.random.seed(2) cell before you re-create the ImageDataBunch; otherwise the seed isn't retained on the next iteration.

Thank you to @sgugger for pointing this out in the dev channel!
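
For the debugging use case, a rough sketch of pinning the usual non-CUDA sources of randomness at the top of the notebook (this is just the generic numpy/PyTorch recipe, not an official fastai switch; re-run it before re-creating the DataBunch as noted above):

import random
import numpy as np
import torch

seed = 2
random.seed(seed)                 # Python's built-in RNG
np.random.seed(seed)              # numpy (controls the train/valid split)
torch.manual_seed(seed)           # PyTorch CPU RNG (weight init, dropout, ...)
torch.cuda.manual_seed_all(seed)  # PyTorch RNGs on all GPUs
torch.backends.cudnn.deterministic = True   # trade speed for determinism
torch.backends.cudnn.benchmark = False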

I have had the same experience (outside of the fastai context). Although the sklearn docs say their estimators obey numpy random seeds, I found that definitely not to be the case (e.g. for random forest). Not sure how fastai behaves/uses this. I never investigated it further, but I think it has to do with the scoping of the variables/context.
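
For what it's worth, the usual way to get reproducible sklearn fits is to pass random_state to the estimator itself rather than relying on the global numpy seed. A small sketch, unrelated to fastai:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
# Pin the estimator's own RNG instead of depending on np.random.seed()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)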

Well, that explains it then. I definitely didn't run this cell every time.

Thank you very much for the clarification.

Ronaldo

Well, I have finally been able to break the 4% barrier without having to touch “wd” or the like. The only “advanced” thing I did was to enable both of my GPUs using “DataParallel”. I will try to post a link to GitHub once I figure that out later tonight.

In short, here is what I changed from the original:

bs = 64
np.random.seed(265)
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=320, bs=bs)
learn = create_cnn(data, models.resnet50, metrics=error_rate)
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])
learn.fit_one_cycle(10, max_lr=slice(1e-3))
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(10e-5, 5e-5))

This yielded (final epoch of the fine-tuning run):
epoch 10: train_loss 0.026167, valid_loss 0.131423, error_rate 0.036760 (00:44)

learn.validate()  # returns [valid_loss, error_rate]
[0.13142301, 0.03675970295158838]

Edit: here is the GitHub link to my notebook

@FourMoBro - good job. Try the exact same thing with a single GPU, except make all your learning rates half as big. You should get about the same result. Let us know how you go!

Should the learning rates or the batch size be halved?

Just the learning rates. (You can also try with the same learning rates; now that I think of it, the particular approach you used probably won't require any change, but it's worth experimenting!)
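
In case it helps, one way to read that suggestion: keep everything else from the dual-GPU run and just halve the max_lr values in the two fit calls, i.e.

learn.fit_one_cycle(10, max_lr=slice(5e-4))
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(5e-5, 2.5e-5))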

I guess I am running 3 more notebooks or so…

I can't just "single-GPU it" without other changes. I originally used bs=64 with bs=bs in the IDB function. Leaving bs=64 results in a CUDA out-of-memory error on a single GPU. Changing to bs=32, or leaving bs=64 but passing bs=bs//2 to the IDB function, gives me that [1, 4096] type error. So, to get it to run on the single GPU, I set bs=48 and bs=bs in the IDB call. It is just about done. Then I will halve the lrs for both single and dual GPU and post links. It may have to wait until morning for the official results.

So that was a kind of data snooping, right?

For Dual GPUs:
Original Trial “Copy 1”: bs=64, error rate of 0.036760
Copy 4: dual GPU, bs=64, arbitrary halving of Copy 1 lrs gave me an error rate of 0.038802

For Single GPUs:
Copy 2: bs=48, error rate of 0.036079
Copy 5: bs=48, arbitrary halving of Copy 2 lrs gave me an error rate of 0.040163
Copy 5: bs=48, lr changed in fine tuning step based upon lr_plot gave an error rate of 0.035398

Links to notebooks:

Why did you use size=320? Thanks!

Good question! I was trying to replicate @ronaldokun's results, and in one of his notebook links he shows size=320. I could have sworn the official course nbs had that size value at one point or another, but I cannot find a source for that. So I guess I got lucky in the end.

Thank you.

If some version of the original nb had that value, I’d like to hear Jeremy’s opinion too.

I have a GCP instance:

Here are my installed library versions:

from fastai.utils import *
show_install()


=== Software === 
python version  : 3.7.0
fastai version  : 1.0.30
torch version   : 1.0.0.dev20181120
nvidia driver   : 410.72
torch cuda ver  : 9.2.148
torch cuda is   : available
torch cudnn ver : 7401
torch cudnn is  : enabled

=== Hardware === 
nvidia gpus     : 2
torch available : 1
  - gpu0        : 16130MB | Tesla V100-SXM2-16GB
  - gpu1        : 16130MB | Tesla V100-SXM2-16GB

=== Environment === 
platform        : Linux-4.9.0-8-amd64-x86_64-with-debian-9.6
distro          : #1 SMP Debian 4.9.130-2 (2018-10-27)
conda env       : base
python          : /opt/anaconda3/bin/python
sys.path        : 
/home/jupyter/fastai-course-v3/nbs/dl1
/opt/anaconda3/lib/python37.zip
/opt/anaconda3/lib/python3.7
/opt/anaconda3/lib/python3.7/lib-dynload
/opt/anaconda3/lib/python3.7/site-packages
/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions
/home/jupyter/.ipython

Hardware specs:
OS: Debian 4.9.130-2 (2018-10-27)
RAM: 52GB
CPU: 8 vCPU (skylake)
HD: 200 GB hdd
GPU: V100 x 2

Benchmarks:
Training: resnet34
learn.fit_one_cycle(4): Total time: 01:47 (single gpu)
learn.fit_one_cycle(4): Total time: 01:56 (dual gpu)

after Unfreezing, fine-tuning, and learning rates
learn.fit_one_cycle(1): Total time: 00:27 (single gpu)
learn.fit_one_cycle(1): Total time: 00:27 (dual gpu)

learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 00:53 (single gpu)
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4)): Total time: 00:54 (dual gpu)

Training: resnet50
learn.fit_one_cycle(5): Total time: 03:11 (single gpu)
learn.fit_one_cycle(5): Total time: 03:16 (dual gpu)

after Unfreeze:
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 00:44 (single gpu)
learn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4)): Total time: 00:41 (dual gpu)

As you can see in this example, running multiple GPUs for resnet34 did not improve performance; it performed about the same as a single GPU.

P.S.: I ran the notebook as-is. For a single GPU, I changed nothing. To test dual GPU, I simply added "learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])" before fitting.
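
In other words, the only change relative to the stock notebook is one extra line between creating the learner and fitting. A sketch using the resnet34 section as an example (data and the usual fastai imports come from the notebook):

learn = create_cnn(data, models.resnet34, metrics=error_rate)
# Wrap the model so each batch is split across both GPUs
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])
learn.fit_one_cycle(4)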

To test dual GPU fairly, I think we should double the batch size. Otherwise, as expected, there will not be any speed difference compared with the single-GPU run.
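
Something like this, scaling bs with the number of GPUs so the per-GPU batch size stays constant (a sketch; 64 per GPU and size=224 are just placeholder values):

num_gpus = 2                      # however many GPUs DataParallel will use
bs = 64 * num_gpus                # keep the per-GPU batch size constant
data = ImageDataBunch.from_name_re(path_img, fnames, pat,
                                   ds_tfms=get_transforms(), size=224, bs=bs)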

I have tested a GCP instance with 8 GPUs to see the speedup I can get by running this nb on 1, 2, 4, and 8 GPUs.

I will try to make graphs for the accuracies and training time.

I increased bs by the same ratio as the GPU count in each test.

In summary, what I found:
The 2-GPU model trains much faster (almost twice the speed), with a minimal decrease in accuracy.
The 4-GPU model is about as fast as 1 GPU, or a little faster (a waste of GPUs).
The 8-GPU model is about as fast as 1 GPU, or a little slower (a waste of GPUs).

However, when I ran 4 duplicate notebooks, each training a 2-GPU model, each ran at almost the same speed as the 2-GPU scenario above did when run alone.

This was surprising!
I tried to max out the other specs, so I don't suspect any other bottleneck.
Specs of the GCP instance:
V100 x8
256GB RAM
64 vCPU Xeon skylake
500GB SSD

Especially given that the V100 GPUs should be using 300 GB/s NVLink 2.0, there should not be any degradation of the speedup when running 4 or 8 GPUs in parallel.

Digging deeper, I found that I can inspect the topology of how the GPUs are connected using:
nvidia-smi topo -m
which revealed that not all GPUs are connected with NVLink. Each V100 comes with 6 channels of NVLink 2.0 (each link = 50 GB/s), and here the 6 links connect only to the 2 adjacent GPUs (3 links each, i.e. 150 GB/s per neighbour). Unfortunately, I think the 6 NVLinks are kind of wasted in GCP; two links between 2 GPUs (or even 1 link) would be enough for DL applications. (Maybe other applications need more NVLink bandwidth.) Nvidia's DGX-2 uses a better topology for DL, IMHO.

GPUs that are not connected through NVLink have to do peer-to-peer transfers over a PLX PCIe switch, which is obviously not fast enough to scale training of large models to 4 or 8 GPUs.

In a later post I will share the graphs and snapshots of my mini-study on running fastai models on multiple GPUs.

Here are the charts:

As you can see, using 4 or 8 GPUs to train one model is a waste of GPUs, while training on 2 GPUs gives you roughly a 1.5x speedup over 1 GPU.
However, running 4 models on 8 GPUs (each model on 2 parallel GPUs) works fine. Again, this is because not all GPUs are connected together by NVLink, and the PCIe lanes don't seem to be enough.

Here is the GCP V100 x 8 topology:

htop + nvidia-smi + iotop:

lscpu:
