Reproducing "How to train your ResNet" using fastai

The original defaults to num_workers=0, so that's a fair comparison.

OK, so this means we are probably not going to find the bottleneck in the data loading. Good to know, though!

Batch size = 128.

Cifar10-fast (didn't look into workers):

First iteration: 5-30ms
Second iteration: same order of magnitude

Fast.ai:

  1. Workers = 0
    First iteration: 150ms, up to 1.6s
    Second iteration: xxx ms

  2. Default workers (16?)
    First iteration: 3s
    Second: 2ms

Ranges are a bit guesstimated.
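
For reference, a rough sketch of how the fastai numbers above were taken (my reconstruction, not the exact code; the dataset download, folder layout and from_folder call are assumptions on my part):

from fastai.vision import *  # fastai v1

path = untar_data(URLs.CIFAR)  # CIFAR-10 in train/test folder layout
# num_workers=0 for case 1; leave it unset to get the default for case 2
data = ImageDataBunch.from_folder(path, valid='test', bs=128, num_workers=0)

it = iter(data.train_dl)
%time next(it)   # first iteration (includes worker/transform warm-up)
%time next(it)   # second iteration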

Did you run this with cifar10-fast as well?

This should be comparable.

I don’t believe this is comparable; you would need to do it = iter(train_batches) and then run next(it), just like you did with fastai’s DataBunch.

Sorry, you probably meant timing only next() on cifar10-fast. Here it is:

# Setup taken from the cifar10-fast demo (PiecewiseLinear, Crop, FlipLR, Cutout,
# Transform and Batches are cifar10-fast helpers; Batches wraps a torch DataLoader).
epochs = 24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 128
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5
train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)

# Time pulling a single batch from the training iterator.
it = iter(train_batches)
%time next(it)

This outputs:

CPU times: user 7.95 ms, sys: 961 µs, total: 8.91 ms
Wall time: 8.71 ms

and running %time next(it) again:

CPU times: user 3 ms, sys: 4.73 ms, total: 7.73 ms
Wall time: 9.21 ms

So it seems like cifar10-fast is actually much slower on the data iterator itself. That makes it all the more interesting that it is faster than fastai v1 overall.

I’d say this is a bit inconclusive. Fastai’s first “next” was 88ms, which is much slower, but the second one was much faster.

I think we should compare a full loop of batches.

I wrote this ugly code :grimacing:

%%time

# Drain the remaining batches, one next() at a time.
noerror = True
while noerror:
    try:
        next(it)
    except StopIteration:
        noerror = False

cifar10-fast: 4.44s
fastai (workers = 0): 1min 48s
fastai (workers = default): 11s
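
For what it's worth, the same full-pass timing can be written a bit more simply (same idea, just letting the for loop absorb the StopIteration; swap in iter(data.train_dl) for the fastai case):

%%time
# Exhaust the rest of the iterator without doing any work on the batches.
for _ in it:
    pass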

I wonder if your fastai epochs are realllly slow when using workers=0.

But I am not setting workers in any of my code.

Just fyi: I am going to take a few steps back and learn some general performance profiling skills and then come back to this.


I believe fastai defaults to defaults.cpus if you don't specify num_workers; that's 16 on my machine.
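
If you want to check that default on your own machine, something like this should show it (defaults lives in fastai.core in v1; treat the exact import path as my assumption):

from fastai.core import defaults
print(defaults.cpus)  # workers fastai v1 uses when num_workers isn't passed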

@gkk I just saw your post about speeding up dataloaders and was wondering if you could have a look at the above to get your impression of whether the performance slowdown in fastai compared to the myrtle.ai PyTorch model could be related to dataloader speed. Thanks, Greg!

Have you looked at GPU utilization? Is it low? See my comment here:

If you see GPU utilization being low and CPU utilization being high, it’s easier to believe the training is CPU-bound.
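
For reference, one easy way to watch GPU utilization from a notebook cell while an epoch runs elsewhere (assumes nvidia-smi is on the PATH):

# Print GPU utilization and memory once per second; interrupt the kernel to stop.
!nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1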

Just came across an interesting post (and possibly thread) on the PyTorch forums while looking at another issue. It looks at small-file performance, so not all of it may apply, but of note:

Myrtle used pin_memory=True, so it's not an explanation for the comparison, but it matters for general performance (and, given the link above to the categorical case, I think this would likely help there especially). pin_memory=True means, I think, that each item has to be copied from its original memory into the pinned memory area, so it may especially affect lots of small items.
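
For context, pin_memory is a standard torch.utils.data.DataLoader argument; a minimal sketch of how it is typically set (train_set here is a placeholder dataset, and the worker count is arbitrary):

from torch.utils.data import DataLoader

# pin_memory=True copies each batch into page-locked host memory, which speeds up
# host-to-GPU transfers (especially with non_blocking=True) at the cost of an extra
# host-side copy per batch, so lots of small items feel it the most.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=8, pin_memory=True, drop_last=True)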