Reproducing "How to train your ResNet" using fastai

In the demo notebook from cifar10-fast, I ran everything up to “network visualization”, then I ran (copied from the training cell):

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 512
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5

train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)

And then just added the following cell:

%%timeit
next(iter(train_batches))

For fastai I just ran the first few cells of your notebook to recreate the databunch, and ran this (used bs=512):

%%timeit
next(iter(data.train_dl))

That 4.36 s was with bs=512; I get 2.81 s with bs=128.

It depends on your machine too, I guess. You’d have to compare with the cifar10-fast version on your own machine.

Edit: also, I’m using your first dataloader, not the in-memory cached/loaded version.

This is baffling me. I am running on the same Colab runtime, first the following (for cifar10-fast):

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 128
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5
train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)
%time next(iter(train_batches))

which outputs CPU times: user 13.9 ms, sys: 2.39 ms, total: 16.3 ms

then (for fastai):

data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms)
cifar_stats = ([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])
data = data.normalize(cifar_stats)
%time next(iter(data.train_dl))

which outputs CPU times: user 22 ms, sys: 81.3 ms, total: 103 ms.

That is more than 6x slower in total time, but not as drastic as what you observed, @Seb.

The iter will create the PyTorch dataloader, which will create the worker processes. This is quite slow (and could be a lot slower on cloud machines than on native hardware; it’s also reportedly quite slow on Windows). Try separately doing it = iter(data.train_dl) and %time next(it). Also check a subsequent next as well; I’m not sure if everything is kicked off on creation or some of it only on first access.
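Concretely, something like this (a sketch using the same data databunch created above) separates the two costs:

it = iter(data.train_dl)   # builds the PyTorch DataLoader iterator / spawns workers
%time next(it)             # first batch: may still include startup cost
%time next(it)             # second batch: steady-state cost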

On that, you might want to reduce num_workers in databunch creation. It uses the CPU count by default, which includes hyperthreading CPUs, so it’s not ideal there, and it may also not be the best choice on cloud providers with virtual CPUs, where overall CPU usage may be throttled and the overhead of extra workers outweighs the benefits. It’s also worth trying num_workers=0 to eliminate the multi-process overhead (though that may slow things down a lot if you’re CPU-bound).
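For example (a sketch against the fastai v1 databunch creation used in this thread; I believe num_workers is forwarded through from_folder to the underlying dataloaders):

# Fewer workers than the default (defaults.cpus), or none at all:
data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms, num_workers=4)
# data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms, num_workers=0)
data = data.normalize(cifar_stats)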


Thanks! This is great to know. Here are the results:

When running %time next(it) the first time, it outputs CPU times: user 15 ms, sys: 73 ms, total: 88 ms; when running it again, it outputs CPU times: user 1.65 ms, sys: 816 µs, total: 2.46 ms.

This is on a standard ImageDataBunch, not an in-memory dataset.

Do we know that we are comparing apples to apples now or is this giving fastai an unfair advantage?

The original defaults to num_workers=0, so it’s a fair comparison.

OK, so this means we are probably not going to find the bottleneck in the data loading. Good to know, though!

Batch size = 128.

Cifar10-fast (didn’t look into workers)

First iteration: 5-30ms
Second iteration: same order of magnitude

Fast.ai:

  1. Workers = 0
    First iteration: 150ms, up to 1.6s
    Second iteration: xxx ms

  2. Default workers (16?)
    First iteration: 3s
    Second: 2ms

Ranges are a bit guesstimated.
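For anyone wanting to reproduce this kind of comparison, a rough timing harness could look like the sketch below. This is not the exact code behind the numbers above; it assumes path, tfms and cifar_stats from the earlier cells, and that num_workers is forwarded by from_folder.

import time

for nw in (0, 16):
    data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms, num_workers=nw)
    data = data.normalize(cifar_stats)
    it = iter(data.train_dl)
    # time the first and second batch fetch separately
    t0 = time.time(); next(it); t1 = time.time(); next(it); t2 = time.time()
    print(f"num_workers={nw}: first next {t1 - t0:.3f}s, second next {t2 - t1:.3f}s")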

Did you run this with cifar10-fast as well?

This should be comparable.

I don’t believe this is comparable; you would need to do it = iter(train_batches) and then run next(it), just like you did with fastai’s databunch.

Sorry, you probably meant timing only the next call on cifar10-fast. Here it is:

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 128
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5
train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)
it = iter(train_batches)
%time next(it)

outputs

CPU times: user 7.95 ms, sys: 961 µs, total: 8.91 ms
Wall time: 8.71 ms

and running %time next(it) again:

CPU times: user 3 ms, sys: 4.73 ms, total: 7.73 ms
Wall time: 9.21 ms

So it seems like cifar10-fast is actually much slower on the data iterator speeds. All the more interesting why it is faster than fastai v1 overall.

I’d say this is a bit inconclusive. Fastai’s first “next” was 88ms, which is much slower, but the second one was much faster.

I think we should compare a full loop of batches.

I wrote this ugly code :grimacing:

%%time

# Drain the iterator: keep calling next(it) until the epoch is exhausted.
noerror = True
while noerror:
    try:
        next(it)
    except StopIteration:
        noerror = False

cifar10-fast: 4.44 s
fastai (num_workers=0): 1 min 48 s
fastai (num_workers=default): 11 s
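A slightly tidier way to time a full pass over the batches (just a sketch, using the same data databunch as above) would be:

%%time
# One full epoch's worth of batches from the fastai training dataloader.
for _ in data.train_dl:
    pass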

I wonder if your fastai epochs are really slow when using workers=0.

But I am not setting workers in any of my code.

Just fyi: I am going to take a few steps back and learn some general performance profiling skills and then come back to this.


I believe fast.ai defaults to defaults.cpus if you don’t specify num_workers; that’s 16 on my machine.
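For reference, you can check both numbers from the notebook (a quick sketch; it assumes the usual from fastai.vision import * has already been run, which brings defaults into scope in fastai v1):

import multiprocessing

print(defaults.cpus)                 # what fastai v1 reportedly uses as num_workers by default
print(multiprocessing.cpu_count())   # raw CPU count, hyperthreads included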

@gkk I just saw your post about speeding up dataloaders and was wondering if you could have a look at the above and give your impression on whether the performance slowdown in fastai compared to the myrtle.ai PyTorch model could be related to dataloader speed. Thanks, Greg!

Have you looked at GPU utilization? Is it low? See my comment here:

If you see GPU utilization being low and CPU utilization being high, it’s easier to believe the training is CPU-bound.
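One way to check (a sketch; it assumes nvidia-smi is on the PATH, as it is on Colab GPU runtimes) is to poll it from a shell or a training callback, e.g.:

import subprocess, time

# Print GPU and GPU-memory utilization a few times, a couple of seconds apart.
for _ in range(5):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,utilization.memory", "--format=csv,noheader"],
        capture_output=True, text=True)
    print(out.stdout.strip())
    time.sleep(2)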

Just came across an interesting post (and possibly thread) on the PyTorch forums while looking at another issue. It looks at small-file performance, so it may not all apply, but it is worth noting:

Myrtle used pin_memory=True, so that doesn’t explain the difference in this comparison, but it matters for general performance (and given the link above relating to categorical data, I think it would likely help especially there): pin_memory=True, I think, means each batch has to be copied from its original memory into the pinned memory area, so it may especially affect lots of small items.
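For anyone experimenting with that, pin_memory is a plain PyTorch DataLoader argument rather than anything fastai-specific; a minimal sketch (train_set here standing in for any map-style Dataset):

from torch.utils.data import DataLoader

# pin_memory=True puts each fetched batch into page-locked host memory, which
# allows faster (and, with non_blocking=True, asynchronous) host-to-device copies.
loader = DataLoader(train_set, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True, drop_last=True)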