Reproducing "How to train your ResNet" using fastai

By “different reasons” I mean that it is not spending any measurable time in _local_scalar_dense, but is still slower. I will come back with details as soon as possible.

If you find insights with v1, chances are they will be useful for v2. I’d say keep digging with v1 until it feels appropriate to go to v2.

Update: I decided to first reproduce Part 1: Baseline with the code provided by myrtle.ai and compare that to fastai v1 to see where we stand. Both notebooks were run on the same Google Colab runtime:

Reproduce Original Part1: Baseline
Reproduce Baseline with fastaiv1

As you can see, an epoch takes around 67% longer with fastai v1. An obvious next step is to exactly reproduce the learning rate schedule (I was using OneCycle here) and see if that changes anything. I will keep you updated here.
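For reference, the baseline uses a piecewise-linear learning rate schedule rather than one-cycle. A minimal sketch of what that schedule computes, assuming the knots from the cifar10-fast demo notebook (quoted further down in this thread):

import numpy as np

epochs = 24

def lr_at(epoch):
    # Ramp linearly from 0 to 0.4 over the first 5 epochs, then decay linearly back to 0,
    # i.e. the same shape as PiecewiseLinear([0, 5, epochs], [0, 0.4, 0]) in cifar10-fast.
    return float(np.interp(epoch, [0, 5, epochs], [0, 0.4, 0]))

print([round(lr_at(e), 2) for e in range(epochs + 1)])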

Please have a look at “TODO” comments in the fastai v1 notebook. They point to things that I need to make sure are identical to the baseline, e.g. that we are using exactly the same loss function. If you can help by showing me how to do those things in fastai, I would very much appreciate it.

Nice work.

From a quick look, it seems like the baseline is using an entirely in-memory dataset. It’s using torchvision.datasets.CIFAR10, which appears to load everything into memory in __init__.
That’s going to make a pretty big difference, especially when testing on Colab, where I don’t think the disk performance is great (likely network disks, de-prioritised relative to paying cloud users). Even on the paid cloud servers you also tested on, disk performance isn’t always great depending on how you set things up (e.g. IO performance tends to be tied to disk size, so you need to provision a fairly large disk to get good performance).

FastAI doesn’t have any built-in support for in-memory image datasets, though you should be able to subclass an existing ItemList to add it fairly easily. More built-in in-memory datasets would certainly be nice in FastAI, but there are also plenty of cases where that just won’t work because the dataset is too big. Still, this likely accounts for a fair chunk of the gap.
For testing, I think the docs have some information on using a torch Dataset with fastai, which might be easy enough. You might lose some things like show_batch, but that should be fine as long as training works.
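For example, something along these lines might work (an untested sketch; it assumes fastai v1’s DataBunch.create is happy with plain torch Datasets and default collation, which I believe is the case but haven’t verified here):

import torchvision
import torchvision.transforms as T
from fastai.basic_data import DataBunch  # fastai v1

# torchvision's CIFAR10 loads the whole dataset into memory in __init__
tfm = T.Compose([T.ToTensor()])
train_ds = torchvision.datasets.CIFAR10('data', train=True, download=True, transform=tfm)
valid_ds = torchvision.datasets.CIFAR10('data', train=False, download=True, transform=tfm)

# Wrap the plain torch Datasets in a DataBunch; you lose niceties like show_batch,
# but training should work as long as batches come out as (input, target) tensors.
data = DataBunch.create(train_ds, valid_ds, bs=512, num_workers=2)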

I’ll try to run your tests locally off my fast-ish NVMe disk to get a feel for the difference that makes (though it might take a little while to find the GPU time).

Thanks Tom, that is a very good point! I will make that my first priority.

Could this also be a performance gain for datasets that don’t fit entirely in memory: cache the upcoming batches in memory, use idle CPU time to swap batches in and out while the GPU is busy, and only fall back to disk when necessary?

I’m pretty sure the PyTorch data loaders pre-fetch batches, at least when you use multiple workers, which would be standard if disk IO or CPU preprocessing is a bottleneck. I’m not so sure in the single-worker case, but with multiple workers it already uses a queue to pull batches from them.
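Roughly, in plain PyTorch terms (illustrative only; train_ds here stands for whatever map-style dataset you’re using):

from torch.utils.data import DataLoader

# With num_workers > 0, worker processes prepare batches ahead of the training loop and
# push them onto an internal queue, so disk IO / CPU transforms overlap with GPU compute.
# With num_workers=0 everything runs synchronously in the main process instead.
train_dl = DataLoader(train_ds, batch_size=512, shuffle=True,
                      num_workers=4, drop_last=True)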

Optimising this area would also be a little complicated in v1, as the PyTorch dataloader handles moving data to the GPU. Workers prepare CPU tensors in separate processes; these are transferred to the main process, collated, and then put on the GPU (though you can provide a custom function for this). You probably don’t want to be playing around with the GPU tensors much, as you risk causing CUDA memory issues, so there might be limited routes for optimisation here.
In v2 a lot of this is moved into fastai, which has its own dataloader (re-using just the multi-processing worker machinery from PyTorch), so there are likely more routes for optimisation there, and more potential bits that aren’t yet optimised.
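To illustrate the v1-style flow described above (this is not fastai’s actual code, just the shape of it): the workers produce CPU tensors, and the host-to-GPU copy happens once per batch in the main process:

import torch

class DeviceLoader:
    # Illustrative wrapper: iterate a CPU DataLoader and move each collated batch to the GPU.
    def __init__(self, dl, device='cuda'):
        self.dl, self.device = dl, torch.device(device)

    def __len__(self):
        return len(self.dl)

    def __iter__(self):
        for xb, yb in self.dl:
            # The workers only ever produce CPU tensors; the copy to the GPU happens
            # here, in the main process, once per batch.
            yield xb.to(self.device), yb.to(self.device)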

Thanks, that makes sense, especially when looking at v2.

Interestingly, I didn’t see a big speed up using a memory cached dataset: https://gist.github.com/davidpfahler/8c397cffe57734a03183003129a108dd

Admittedly, this time I only ran for 1 epoch, but the times were 02:14 (standard dataset) versus 02:09 and 02:11 for two different implementations of a memory-cached ImageList I found on this forum. Even if we assume this is representative, it would at best be a 3.7% improvement (134 s down to 129 s).

Interesting. I did realise after posting that OS caching might mitigate a lot of this, but I’m surprised to see it so effective.
In spite of the somewhat limited avenues, the dataloader might still be a good place to look for improvements, as I’d expect delays there to have a pretty big impact. Given you can’t really pre-load your GPU tensors, it’s important to get them onto the GPU quickly so it can start on the new batch. I would think the asynchronous nature of CUDA covers a lot of the other delays in the transform stuff once the GPU is working.
In v2, where everything is in fastai, I’d particularly look at where the pin_memory() happens relative to the cuda()/to(). As I understand it, pin_memory moves the tensor from paged memory into the non-paged (page-locked) area for transfer, so having it in pinned memory before calling cuda would be good (assuming the cuda transfer is only done at the start of the batch to avoid OOM issues).
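A minimal sketch of the ordering I mean (pin first, then an asynchronous copy):

import torch

x = torch.randn(512, 3, 32, 32)                 # batch in ordinary pageable host memory
x_pinned = x.pin_memory()                       # copy into page-locked (non-paged) memory
x_gpu = x_pinned.to('cuda', non_blocking=True)  # async host->GPU copy; it is only truly
                                                # asynchronous because the source is pinned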

I’m about to look into data loading, but in the process I found a couple of little things:

  1. You write data.normalize(cifar_stats); shouldn’t it be data = data.normalize(cifar_stats)?

  2. Your batch size is 128, while the original one is 512. Maybe this helps a bit?

Note: when trying to run the code from the cifar10-fast repo, make sure to look at this pull request if you run into an error right after downloading the data.

If I use your original fastai databunch, next(iter(data.train_dl)) takes 4.36 s, while in the cifar10-fast code, next(iter(train_batches)) takes 19.7 ms.

I would keep digging into the data loading!

I am trying to reproduce Part 1: Baseline first, which uses a batch_size of 128. I’m not sure about the normalize API, but I will try.

That is very interesting! Can you please tell me how you measured that?

I don’t think the normalization thing will help speed; I just think it’s not an in-place operation, but I could be wrong.

If I am running %time next(iter(data.train_dl)) on the fastai v1 databunch, it takes only 19ms:

CPU times: user 19.7 ms, sys: 78.9 ms, total: 98.6 ms
Wall time: 644 ms

How did you get 4.36s?

In the demo notebook from cifar10-fast, I ran everything up to “network visualization”, then I ran (copied from the training cell):

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 512
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5

train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)

And then just added the following cell:

%%timeit
next(iter(train_batches))

For fastai I just ran the first few cells of your notebook to recreate the databunch, and ran this (used bs=512):

%%timeit
next(iter(data.train_dl))

4.36 s was with bs=512.
I get 2.81 s with bs=128.

Depends on your machine too I guess. You’d have to compare with the cifar10-fast version on your own machine.

Edit: also I’m using your first dataloader, not the in memory cached/loaded version.

This is baffling me. I am running on the same colab runtime, first the following (for cifar10-fast):

epochs=24
lr_schedule = PiecewiseLinear([0, 5, epochs], [0, 0.4, 0])
batch_size = 128
transforms = [Crop(32, 32), FlipLR(), Cutout(8, 8)]
N_runs = 5
train_batches = Batches(Transform(train_set, transforms), batch_size, shuffle=True, set_random_choices=True, drop_last=True)
test_batches = Batches(test_set, batch_size, shuffle=False, drop_last=False)
%time next(iter(train_batches))

which outputs CPU times: user 13.9 ms, sys: 2.39 ms, total: 16.3 ms

then (for fastai):

data = ImageDataBunch.from_folder(path, valid='test', bs=128, ds_tfms=tfms)
cifar_stats = ([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])
data = data.normalize(cifar_stats)
%time next(iter(data.train_dl))

which outputs CPU times: user 22 ms, sys: 81.3 ms, total: 103 ms.

That is more than 6x slower in total CPU time, but not as drastic as what you observed, @Seb.

The iter will create the PyTorch dataloader, which will create the worker processes. This is quite slow (and can be a lot slower on cloud machines than on native hardware; it’s also reportedly quite slow on Windows). Try separately doing it = iter(data.train_dl) and %time next(it). Also check a subsequent next as well; I’m not sure whether everything is kicked off on creation or some of it only on first access.
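For example, in separate notebook cells, so the worker startup and per-batch costs show up separately:

%time it = iter(data.train_dl)   # dataloader/worker startup happens here
%time next(it)                   # first batch (may still include some one-off cost)
%time next(it)                   # subsequent batches reflect steady-state time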

On that note, you might want to reduce num_workers in the databunch creation. It uses the CPU count by default, which includes hyperthreaded CPUs, so that’s not ideal, and it may also not be the best choice on cloud providers with virtual CPUs, where overall CPU usage may be throttled and the overhead of extra workers outweighs the benefits. It’s also worth trying num_workers=0 to eliminate the multi-process overhead entirely (though that may slow things down a lot if you’re CPU-bound).
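For example (assuming num_workers is passed through by ImageDataBunch.from_folder, as I believe it is in fastai v1; path and tfms as in your notebook):

from fastai.vision import ImageDataBunch  # fastai v1; path and tfms defined as in the notebook above

# Compare a few worker counts; num_workers=0 removes the multi-process overhead entirely
# (batches are produced in the main process), which separates IPC/startup cost from
# genuine CPU-bound preprocessing time.
for nw in (0, 2, 4):
    data = ImageDataBunch.from_folder(path, valid='test', bs=128,
                                      ds_tfms=tfms, num_workers=nw)
    it = iter(data.train_dl)
    next(it)  # time this (e.g. with %time) for each setting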

Thanks! This is great to know. Here are the results:

When running %time next(it) the first time, it outputs CPU times: user 15 ms, sys: 73 ms, total: 88 ms; when running it again, it outputs CPU times: user 1.65 ms, sys: 816 µs, total: 2.46 ms.

This is on a standard ImageDataBunch, not an in-memory dataset.

Do we know that we are comparing apples to apples now or is this giving fastai an unfair advantage?