Did you try a larger instance? Thinking especially of CPUs, so you could try a high-cpu type, though you'd want enough memory to keep the dataset in cache (it's not too big so shouldn't need masses). The Google examples use a highmem-96 instance (96 CPUs, 624GB RAM) for ImageNet training, presumably highmem for the sake of caching the dataset.
I gather in your table num_workers is the number of data loader workers and that's per TPU process. In that case, with num_workers=4, I think you'll be starting 40 processes across only 16 CPUs, likely leading to a lot of overhead.
The minimal increase in performance between 2 and 4 workers with bs32 may be because you don't have enough CPUs, not because that is a sufficient number of workers, with the inefficiency of the extra processes resulting in an overall slowdown at 8 workers.
For the non-TPU tests, is that distributed or parallel (i.e. is that the total batch size, as in parallel, or the per-device batch size, as in distributed)? Either way you're using significantly smaller batches for the T4 I think: 128 if that's distributed, versus a minimum of 256 for the bs32 TPU config.
Alternatively (or probably ideally additionally), you may want to remove/reduce the transforms to minimise the CPU workload here.
Also, I noticed that, according to the guide, to maximise TPU performance you should use batch and feature sizes that are a multiple of 128; otherwise it looks like some of the parallelism on the TPU may not be fully used.
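To make that concrete, a rough sketch of how you might apply that guideline. This assumes (on my part) that it maps to the TPU's 128x128 matrix units; `round_up` is just an illustrative helper, not anything from the guide:

```python
# Rough illustration of the "multiples of 128" guideline above (assumption:
# it relates to the TPU's 128x128 matrix units). round_up is a made-up helper.
import torch.nn as nn

def round_up(n, multiple=128):
    return ((n + multiple - 1) // multiple) * multiple

batch_size = round_up(100)                    # 100 -> 128
conv = nn.Conv2d(in_channels=round_up(100),   # 100 -> 128
                 out_channels=round_up(200),  # 200 -> 256
                 kernel_size=3, padding=1)
```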
Beyond that, profiling may be needed to really know what's happening. It doesn't look especially straightforward to configure though. I'm also not sure if there are any PyTorch-specific issues; it looks like the profiling runs on the actual TPU, so it might be fine.
I selected n1-standard-16 because that was what was recommended on the GitHub guide:
Make sure the compute VM is within the same zone as the TPU node you created or else performance will suffer, also ideally create a VM that has at least 16 cores ( n1-standard-16 ) to not be VM compute/network bound.
I will try a larger instance then.
I thought it was for the CPU. I also just used what this said.
I used distributed.
I am not sure if this is true. My understanding is it works similarly to the TPU MP code, as it also spawns a separate process for each GPU.
I also thought of that, but I originally thought it then wouldn't be a good comparison. The advantage of fastai is that it comes with all these augmentations and other neat tricks, so I thought it was best to keep them in my benchmarking code. However, I will try it again without the augmentations.
I did try the batch size of 128, but I will try image size of 256, or maybe even 128.
Looks like that recommendation is for the colab multi-threading examples, so that’s 8 workers total whereas if you’re using that for the multi-process then you’ll be using .
Yeah, I gather GPU distributed will work like TPU multi-process, with 1 process per device and whatever dataloader batch_size you set applied per device, whereas GPU parallel will work like TPU multi-threaded, with the set batch_size divided by the number of devices. So yeah, it's the same, but you have twice as many TPU devices as GPUs, so twice as many workers and double the overall batch size, but still the same number of CPUs.
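For concreteness, a quick sketch of those batch-size semantics with made-up sizes, using plain PyTorch loaders rather than the fastai ones:

```python
# Sketch of the batch-size semantics above (sizes are illustrative only).
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(1024, 3, 224, 224), torch.zeros(1024).long())

# "Parallel" (DataParallel / TPU multi-threaded): one shared loader, so
# batch_size is the TOTAL batch; 4 GPUs each see 128 / 4 = 32 images per step.
parallel_dl = DataLoader(ds, batch_size=128, num_workers=4)

# "Distributed" (DDP / TPU multi-process): one loader per process, so
# batch_size is PER DEVICE; 4 GPUs -> 4 * 32 = 128 total, 8 TPU cores -> 256.
per_device_dl = DataLoader(ds, batch_size=32, num_workers=4)
```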
Yeah, you may want augmentations in the comparison as they are a primary fastai feature, was just suggesting removing them to try and reduce CPU bottlenecking without using more expensive instances.
Yeah, I wasn’t meaning you hadn’t tried that, just noting that seems to be the optimal setting.
Image size shouldn't need to be a multiple, just batch size and number of features. I think the convolutions are effectively done per pixel, so I don't think image size needs to be optimised, for the same reason you can freely scale the image size without needing to change your convolutions.
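E.g., a quick sketch of that point:

```python
# Convolution weights depend only on channel counts ("features") and kernel
# size, not on image size, so the same layer handles different resolutions.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(conv(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
print(conv(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 64, 256, 256])
```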
@TomB just letting you know, I was able to obtain free cloud TPUs for 90 days so I will again ramp up efforts to get this working in fastai v1, and hopefully also fastai v2.
I did get to try without transforms, and from 5:02 the time is now 4:53. So indeed it is probably CPU- or I/O-limited. Increasing the batch size to 256 sometimes works, but sometimes bs=256, and especially bs=512, leads to a MemoryError, which is honestly surprising because I thought TPUs could handle higher batch sizes.
I still was confused about the following points you mentioned:
Is there something missing here? Also, I'm fairly confident num_workers is for the CPU.
Sorry I don’t understand why this is true.
What do features mean in this context? The dimensions of the layers?
I am not really sure what else I could do apart from maybe testing with a better CPU and with a larger model/dataset. After all, it seems that it is currently limited by the CPU and/or I/O. But if things don't work out after that, I think the best option would be to move to fastai v2 and come up with a training loop that puts everything on the TPU as early as possible.
The num_workers setting selects the number of workers per DataLoader. With multi-threading you have a single DataLoader shared by all threads, so DataLoader.num_workers is the total number of workers. But with the distributed, multi-process method you have a separate DataLoader per device, so if you use the same setting for num_workers you actually have 8 times as many worker processes in the distributed case, leading to a lot more contention among processes for the physical CPU cores, which can cause significant inefficiency. When I said TPU process I meant the CPU process started for a particular device, as opposed to the DataLoader worker processes; I'll use the less confusing term distributed process.
So given 8 TPU devices and num_workers=4, you have 8 distributed processes, and each one starts 4 dataloader worker processes, for a total of 40 processes. Whereas in the multi-threaded case this would just be 5 processes with num_workers=4.
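Roughly like this, as a sketch (the dataset and training body are placeholders, not your actual benchmark code):

```python
# Why num_workers multiplies in the multi-process case: each spawned
# distributed process builds its OWN DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.zeros(1024).long())

def train_fn(rank):
    device = xm.xla_device()
    # 8 distributed processes * 4 workers each = 32 dataloader workers,
    # plus the 8 distributed processes themselves -> 40 processes in total.
    loader = DataLoader(dataset, batch_size=32, num_workers=4)
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        # ... training step ...

if __name__ == '__main__':
    xmp.spawn(train_fn, nprocs=8, start_method='fork')
```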
You're comparing 4 GPUs to 8 TPU cores. So with, say, num_workers=4 and bs=32, for the GPUs you have 4 distributed workers and 16 dataloader workers with a total batch size across all devices of 128, while in the TPU case you have 8 distributed workers and 32 dataloader workers with a total batch size of 256.
Yeah, the in_channels/out_channels on the convolutions.
It could be CPU/IO. That’s certainly a fairly easy one to test.
I’d also probably try a straight PyTorch model from one of the TPU examples as various operations don’t perform well on the TPU so fastai model specific things could be causing issues.
Do you mean just optimising the loading or fully preloading the entire dataset onto the TPU? The latter would seem pretty limiting.
Not really sure how you'd optimise it. It should already be putting stuff on the device early, as torch_xla handles all of that, pulling from the DataLoader and moving batches onto the device in the background. So as long as loading/transforms aren't limiting that, I'm not sure how much you could optimise it.
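i.e. something like this is already happening under the hood (a sketch, assuming the torch_xla ParallelLoader API; the tiny dataset is just for illustration):

```python
# ParallelLoader pulls batches from the CPU DataLoader and moves them to the
# TPU device in background threads, so host-side loading/transforms are what
# usually ends up being the limit.
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

loader = DataLoader(TensorDataset(torch.randn(256, 3, 224, 224)), batch_size=32)
device = xm.xla_device()
device_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
for (xb,) in device_loader:  # batches arrive already on the TPU device
    pass
```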
You’d likely have to profile it to see what’s slowing it down to know what if anything could be optimised.
Hey, sorry I didn't get a chance to look at this as I am traveling right now, but I wanted to respond to a couple of your points.
This makes sense. My follow-up question would be whether, if I have 8 GPU devices, I will also get 40 processes. fastai.distributed uses DistributedDataParallel, which seems to use multi-processing as well, so it seems that this would be true, right?
Yes, I realize it isn't a fair comparison, but I just wanted to show that even 4 GPUs can outperform the 8 TPU cores, and I didn't want to spend on 8 GPUs in GCP just yet.
Yes, I am not sure if there is much else to optimize, but if there is, it will be easier to do so in fastai v2. I was mainly thinking about the transforms though.
Yeah I think the DistributedDataParallel should be the same with the same number of GPUs.
Also, it wasn't really that the 4 vs. 8 device comparison was unfair, but rather that you could be seeing a fairly significant slowdown as there aren't enough actual CPUs available, so a lot of time is wasted just switching between the many processes. Using too many processes can actually slow down overall performance.
This probably isn't likely to fully account for the poor performance of the TPU, but it's still probably worth trying extra CPUs or fewer workers to rule this out. Fewer workers obviously makes it more likely the CPU will be a bottleneck, but it might work in combination with minimal transforms.
Thanks for clearing it up. I will try to use a more powerful CPU with the TPU, and compare with 8 GPUs. However, I will have to wait until some stuff regarding GCP credits and TPU usage gets worked out.
@sgugger Regarding TPU implementation in fastai2, I assume it is not allowed to create a new fit function or modify the fit function in any way, correct?
The only callback alternative I can think of is this:
Keep the fit function, but nest it in a callback. Using the callback, cancel the fit function as it stands (raise CancelFitException) and instead spawn 8 processes of fit within the after_cancel_fit function. Therefore, it is a nested fit function; see the sketch below. Sure, it's not clean (as opposed to PyTorch Lightning's implementation), but it seems to be the only way to add TPU functionality through callbacks alone.
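A very rough sketch of that idea only. The names `TPUFitCallback`, `_xla_fit_worker` and the `_in_xla_worker` flag are hypothetical, the import paths may differ between fastai2 versions, and the event names follow fastai2 at the time (`begin_fit` / `after_cancel_fit`):

```python
# Sketch of the "nested fit" idea: cancel the outer fit, then spawn the real
# fit in 8 TPU processes. Not working code, just the shape of it.
from fastai2.basics import Callback, CancelFitException
import torch_xla.distributed.xla_multiprocessing as xmp

def _xla_fit_worker(rank, learn_builder, n_epoch):
    # Hypothetical worker: rebuild the Learner in this process, flag it so the
    # inner fit isn't cancelled again, then run the real fit on this TPU core.
    learn = learn_builder()
    learn._in_xla_worker = True
    learn.fit(n_epoch)

class TPUFitCallback(Callback):
    def __init__(self, learn_builder, n_epoch):
        self.learn_builder, self.n_epoch = learn_builder, n_epoch

    def begin_fit(self):
        if getattr(self.learn, '_in_xla_worker', False): return  # inner fit runs normally
        raise CancelFitException()                                # outer fit: cancel it...

    def after_cancel_fit(self):
        # ...and spawn one nested fit per TPU core instead
        xmp.spawn(_xla_fit_worker, args=(self.learn_builder, self.n_epoch),
                  nprocs=8, start_method='fork')
```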
The only other alternative I can think of is a CLI-only interface like with distributed GPU training. However, I would prefer an implementation that works fully in notebooks. While spawning processes for multiple GPUs did not work in notebooks, it works for TPUs in notebooks (as demonstrated in previous work), so I would prefer to keep it in a notebook environment (like PyTorch Lightning).
We haven’t looked at distributed training on TPUs for v2 yet as we had to finish the book. If there is a need to make small changes to the training loop to make it work, we will add those small changes.
Hi @ilovescience! I'd like to try using a TPU with fastai1. Did you put all your findings in one place, a repo maybe, where I could take a look at your approach? I found some pieces of code here and there in the thread, but I am not sure what is still relevant 6 months later…