Reproducing "How to train your ResNet" using fastai

In the demo notebook from cifar10-fast, I ran everything up to “network visualization”, then I ran (copied from the training cell):

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 512
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5

train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)

And then just added the following cell:

%%timeit
next(iter(train_batches))

For fastai I just ran the first few cells of your notebook to recreate the databunch, and ran this (used bs=512):

%%timeit
next(iter(data.train_dl))

That 4.36 s was with bs=512; I get 2.81 s with bs=128.

It depends on your machine too, I guess. You’d have to compare with the cifar10-fast version on your own machine.

Edit: also, I’m using your first dataloader, not the in-memory cached/loaded version.

This is baffling me. I am running on the same Colab runtime, first the following (for cifar10-fast):

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 128
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5
train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)
%time next(iter(train_batches))

which outputs CPU times: user 13.9 ms, sys: 2.39 ms, total: 16.3 ms

then (for fastai):

data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms)
cifar_stats = ([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])
data = data.normalize(cifar_stats)
%time next(iter(data.train_dl))

which outputs CPU times: user 22 ms, sys: 81.3 ms, total: 103 ms.

That is more than 6x slower in total time, but not as drastic as what you observed, @Seb.

The iter will create the PyTorch dataloader, which will create the worker processes. This is quite slow (and could be a lot slower on cloud machines than on native hardware; it’s also reportedly quite slow on Windows). Try separately doing it = iter(data.train_dl) and %time next(it). Also check a subsequent next as well; I’m not sure if everything is kicked off on creation or some of it only on first access.
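Concretely, something like this (a sketch using the same data databunch created above) separates the two costs:

it = iter(data.train_dl)   # builds the PyTorch DataLoader iterator / spawns workers
%time next(it)             # first batch: may still include startup cost
%time next(it)             # second batch: steady-state cost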

On that, you might want to reduce num_workers in databunch creation. It uses the CPU count by default, which includes hyperthreading CPUs, so it’s not ideal there, and it may also not be the best choice on cloud providers with virtual CPUs, where overall CPU usage may be throttled and the overhead of extra workers outweighs the benefits. It’s also worth trying num_workers=0 to eliminate the multi-process overhead (though that may slow things down a lot if you’re CPU-bound).
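For example (a sketch against the fastai v1 databunch creation used in this thread; I believe num_workers is forwarded through from_folder to the underlying dataloaders):

# Fewer workers than the default (defaults.cpus), or none at all:
data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms, num_workers=4)
# data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms, num_workers=0)
data = data.normalize(cifar_stats)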


Thanks! This is great to know. Here are the results:

When running %time next(it) the first time, it outputs CPU times: user 15 ms, sys: 73 ms, total: 88 ms; when running it again, it outputs CPU times: user 1.65 ms, sys: 816 µs, total: 2.46 ms.

This is on a standard ImageDataBunch, not an in-memory dataset.

Do we know that we are comparing apples to apples now or is this giving fastai an unfair advantage?

The original defaults to num_workers=0, so it’s a fair comparison.

OK, so this means we are probably not going to find the bottleneck in the data loading. Good to know, though!

Batch size = 128.

Cifar10-fast (didn’t look into workers)

First iteration: 5-30ms
Second iteration: same order of magnitude

Fast.ai:

  1. Workers = 0
    First iteration: 150ms, up to 1.6s
    Second iteration: xxx ms

  2. Default workers (16?)
    First iteration: 3s
    Second: 2ms

Ranges are a bit guesstimated.
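For anyone wanting to reproduce this kind of comparison, a rough timing harness could look like the sketch below. This is not the exact code behind the numbers above; it assumes path, tfms and cifar_stats from the earlier cells, and that num_workers is forwarded by from_folder.

import time

for nw in (0, 16):
    data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms, num_workers=nw)
    data = data.normalize(cifar_stats)
    it = iter(data.train_dl)
    # time the first and second batch fetch separately
    t0 = time.time(); next(it); t1 = time.time(); next(it); t2 = time.time()
    print(f"num_workers={nw}: first next {t1 - t0:.3f}s, second next {t2 - t1:.3f}s")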

Did you run this with cifar10-fast as well?

This should be comparable.

I don’t believe this is comparable; you would need to do it = iter(train_batches) and then run next(it), just like you did with fastai’s databunch.

Sorry, you probably meant timing only the next call on cifar10-fast. Here it is:

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 128
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5
train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)
it = iter(train_batches)
%time next(it)

outputs

CPU times: user 7.95 ms, sys: 961 µs, total: 8.91 ms
Wall time: 8.71 ms

and running %time next(it) again:

CPU times: user 3 ms, sys: 4.73 ms, total: 7.73 ms
Wall time: 9.21 ms

So it seems like cifar10-fast is actually much slower on the data iterator speeds. All the more interesting why it is faster than fastai v1 overall.

I’d say this is a bit inconclusive. Fastai’s first “next” was 88ms, which is much slower, but the second one was much faster.

I think we should compare a full loop of batches.

I wrote this ugly code :grimacing:

%%time

# Drain the iterator: keep calling next(it) until the epoch is exhausted.
noerror = True
while noerror:
    try:
        next(it)
    except StopIteration:
        noerror = False

cifar10-fast: 4.44 s
fastai (num_workers=0): 1 min 48 s
fastai (num_workers=default): 11 s
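A slightly tidier way to time a full pass over the batches (just a sketch, using the same data databunch as above) would be:

%%time
# One full epoch's worth of batches from the fastai training dataloader.
for _ in data.train_dl:
    pass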

I wonder if your fastai epochs are really slow when using workers=0.

But I am not setting workers in any of my code.

Just fyi: I am going to take a few steps back and learn some general performance profiling skills and then come back to this.


I believe fast.ai defaults to defaults.cpus if you don’t specify num_workers; that’s 16 on my machine.
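For reference, you can check both numbers from the notebook (a quick sketch; it assumes the usual from fastai.vision import * has already been run, which brings defaults into scope in fastai v1):

import multiprocessing

print(defaults.cpus)                 # what fastai v1 reportedly uses as num_workers by default
print(multiprocessing.cpu_count())   # raw CPU count, hyperthreads included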

@gkk I just saw your post about speeding up dataloaders and was wondering if you could have a look at the above and give your impression on whether the performance slowdown in fastai compared to the myrtle.ai PyTorch model could be related to dataloader speed. Thanks, Greg!

Have you looked at GPU utilization? Is it low? See my comment here:

If you see GPU utilization being low and CPU utilization being high, it’s easier to believe the training is CPU-bound.
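One way to check (a sketch; it assumes nvidia-smi is on the PATH, as it is on Colab GPU runtimes) is to poll it from a shell or a training callback, e.g.:

import subprocess, time

# Print GPU and GPU-memory utilization a few times, a couple of seconds apart.
for _ in range(5):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,utilization.memory", "--format=csv,noheader"],
        capture_output=True, text=True)
    print(out.stdout.strip())
    time.sleep(2)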

Just came across an interesting post (and possibly thread) on the PyTorch forums while looking at another issue. It looks at small-file performance, so it may not all apply, but it is worth noting:

Myrtle used pin_memory=True, so that doesn’t explain the difference in this comparison, but it matters for general performance (and given the link above relating to categorical data, I think it would likely help especially there): pin_memory=True, I think, means each batch has to be copied from its original memory into the pinned memory area, so it may especially affect lots of small items.
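For anyone experimenting with that, pin_memory is a plain PyTorch DataLoader argument rather than anything fastai-specific; a minimal sketch (train_set here standing in for any map-style Dataset):

from torch.utils.data import DataLoader

# pin_memory=True puts each fetched batch into page-locked host memory, which
# allows faster (and, with non_blocking=True, asynchronous) host-to-device copies.
loader = DataLoader(train_set, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True, drop_last=True)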