Yeah, not sure there’s much difference between the two (was kinda clutching at straws to come up with any pros/cons when giving the two options). Just doing it in the callback is probably a bit less boilerplate code, but perhaps a tiny bit harder to follow.
And yeah, think the key thing is getting something that addresses all the issues and then seeing how it looks to see if there’s a cleaner way. And to provide a clear base for looking at v2 support (without yet trying to hit the still somewhat moving target there). There’s likely not too much point focusing on v1, as I think there are still enough kinks in the torch_xla stuff that any real use is more a v2 timeframe thing.
Interestingly, I got this error when running learn.fit(2). It seems to come at the end of validation for epoch 1, so I am surprised it didn’t show up before when I was running learn.fit(1):
File "/content/tpu_distributed_fastai.py", line 117, in train_loop
learn.fit(2)
File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 200, in fit
fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 106, in fit
cb_handler=cb_handler, pbar=pbar)
File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 66, in validate
if average: return (to_np(torch.stack(val_losses)) * nums).sum() / nums.sum()
RuntimeError: stack expects a non-empty TensorList
Hrmm, something to do with the loss calculation in loss_batch I’d guess.
Maybe a detach issue (I checked, and as you’d hope, isinstance(an_xla_tensor, Tensor)==True). Tried to add some printing there in a colab notebook but it killed the kernel (might have been the wrong notebook, or left it in a bad state).
Maybe check the type of the returned losses. Either in on_backward_end or perhaps by wrapping learn.loss_func.
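Something quick like this might show what’s coming back (untested sketch; debug_loss is just a throwaway name):

```python
# Untested sketch: wrap learn.loss_func to print what kind of tensor it returns,
# to see whether XLA tensors are making it into fastai's loss bookkeeping.
def debug_loss(loss_func):
    def _inner(*args, **kwargs):
        loss = loss_func(*args, **kwargs)
        print(type(loss), getattr(loss, 'device', None))  # expect an xla device here
        return loss
    return _inner

learn.loss_func = debug_loss(learn.loss_func)
```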
There’s also a bit of stuff around FlattenedLoss that might be breaking with XLA tensors.
We’re using a custom class GetAttr in v2 that handles this and gets us proper error messages without failing silently.
Hello @TomB
Unfortunately, I was busy yesterday and most of today with other work. However, with whatever time I had, I was able to fix the error, mainly by putting print statements everywhere.
Indeed, I did not properly deal with creating a new ParallelLoader each epoch, so I was running into it having no more iterations left.
The code now runs fine with 2 epochs. I will post the code soon (after I remove all the print statements and unnecessary debugging code).
Great work, sounds like you’ve got all, or at least most, issues sorted, ignoring any restructuring. Any performance gains now it’s all actually working, or still significantly slower?
Here is the code:
Regarding performance gains, I will look into this more carefully this weekend. I think the actual epoch times may be faster, with most of the time going to setting up the TPU and putting things onto it.
Also, I will look into creating a wrapper for PerDeviceLoader instead of having to re-initialize it every epoch.
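Something like this is roughly what I have in mind for that wrapper (untested sketch against the torch_xla API I’m using; the class name is just a placeholder):

```python
import torch_xla.distributed.parallel_loader as pl

class TPUDataLoader:
    "Sketch: re-create the ParallelLoader on every iteration so each epoch gets a fresh one."
    def __init__(self, dl, device):
        self.dl, self.device = dl, device

    def __iter__(self):
        # ParallelLoader/PerDeviceLoader are exhausted after one pass,
        # so build a new one each time the loader is iterated.
        return pl.ParallelLoader(self.dl, [self.device]).per_device_loader(self.device)

    def __len__(self): return len(self.dl)
```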
Finally, the PyTorch XLA team will look into making multiprocessing work in Colab (it currently doesn’t).
Thanks again @TomB for all your help so far.
EDIT: There we go, looks like colab multiprocessing might get sorted in torch_xla, if some changes get adopted in PyTorch anyway. Or see my notebook for the suggested quick hack until then. Note though that the hack will probably break torch CUDA tensors in data loaders (though of course you can’t use both at once on colab).
Not sure this actually matters to the fastai callback though. The error happens if you try to define the function you spawn in the notebook. The callback should probably spawn a training function from its own code rather than have the user call xmp.spawn directly. Not quite sure how that works when some of the code is in your notebook, but given that dataloaders use multiprocessing and you can have code they run (like a custom ItemList) defined in a notebook, I think that should work with the callback.
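Something roughly like this is what I’m imagining for the callback side (untested sketch; train_tpu and _xla_run are made-up names, and I’m assuming a fork start method to dodge the notebook pickling issue, hence the CUDA caveat above):

```python
import torch_xla.distributed.xla_multiprocessing as xmp

def _xla_run(rank, learn, epochs):
    # Hypothetical per-process entry point: lives in the callback's module
    # (importable), not in the notebook. Only the spawning structure is shown;
    # the real version would also do the per-process device/data setup.
    learn.fit(epochs)

def train_tpu(learn, epochs, nprocs=8):
    # 'fork' avoids the pickling problem with notebook-defined code,
    # at the cost of breaking CUDA tensors in data loader workers.
    xmp.spawn(_xla_run, args=(learn, epochs), nprocs=nprocs, start_method='fork')
```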
Yeah, was going to note that the first epoch will always be really slow on TPU. It’s doing a lot of work there to compile stuff to XLA and set it up on the TPU, so even in TensorFlow I think the first epoch is fairly slow. So given you’ve only ever really run the first epoch before now, that could be a big part of it.
Nah, thanks for all your work. Glad to have been of some assistance.
Thanks for sharing this fix! Currently it works fine, but there is an error when rerunning the code cell, which I will post about on the issue.
Let’s hope the PyTorch XLA team will implement a fix in their library soon.
Thanks to the fix, the progress bars are coming up properly, with two progress bars (one for epochs, one for batches) appearing for each process. Unfortunately, I don’t see any improvement in time, which is really odd. However, I have only been testing with MNIST_SAMPLE and a ResNet50, so I will test with more complex datasets and larger models.
Maybe this thread is interesting for you:
@TomB ok I am starting to see a slight speed-up by changing the model from ResNet50 to ResNet152. With a regular K80 GPU on Colab, it takes 44 sec, but with TPU, each of the processes takes 30 sec (excluding the first epoch). Of course, this speed-up is not impressive, but I haven’t changed the batch size (which can probably be larger on the TPU) or any aspect of the underlying data loading. And it is already better than the single TPU experiments I did, where I saw no improvement in any case. Of course, with larger models and more complex datasets, we will probably see even better speed-up.
Don’t know that TPU cores are individually that fast; you just have access to a lot of them, with better scaling than using many GPUs. To give some idea, looking at GCP prices it’s about $0.45/hr for a K80 and 56.25c/hr per TPU core ($4.50/hr for a full 8-core device), both at non-preemptible rates. Though I think cost-wise it’s fairly borderline between GPU and TPU.
Also going to depend a bit on your NN model. TPUs don’t handle any dynamically shaped tensors well. Was that a fastai model? They may not be well optimised for TPU.
I just used the same code from before.
Also, I tried with a ResNet152 and CIFAR10 and it was about 2x faster than K80, even with tuning the batch size. I got up to a batch size of 512 and could not go higher as then there would not be enough data points for a batch when the dataset is distributed. But it probably could allow higher batch sizes.
With Cifar10, given the small size, it might be interesting to see how much difference it makes to run it in memory. The PyTorch Cifar10 dataset is in-memory IIRC. This was tested in the performance testing thread and it didn’t make too much difference (caching makes up for it). But that was with the much lower batch sizes of a single GPU (and may have been on GCP not colab). Having it on-disk still means having to call OS functions before it hits the cache.
Think there’s an in-memory ItemList around. Or from the docs it looks like adding the .c and .loss_func to the PyTorch dataset might work for training.
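i.e. something like this (going off the docs, untested):

```python
# Untested sketch from the DataBunch docs: give a plain (in-memory) PyTorch
# dataset the attributes fastai v1 looks for, then wrap it in a DataBunch.
from fastai.basic_data import DataBunch
from torch import nn
from torchvision import datasets, transforms

tfms = transforms.ToTensor()
train_ds = datasets.CIFAR10('data', train=True,  download=True, transform=tfms)
valid_ds = datasets.CIFAR10('data', train=False, download=True, transform=tfms)

for ds in (train_ds, valid_ds):
    ds.c = 10                             # number of classes
    ds.loss_func = nn.CrossEntropyLoss()  # loss fastai will pick up

data = DataBunch.create(train_ds, valid_ds, bs=512)
```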
Yes, however I don’t think this is a good representation of actual datasets.
I am trying Food101 right now (has 101,000 images!). I might also try in GCP, which has a better CPU, in case it is CPU limited.
I also have to make a wrapper for PerDeviceLoader.
Yeah, testing against in-memory datasets would help eliminate some of the slowdown that colab’s inadequate CPU power likely causes. As noted in that Kaggle thread, a recommended pipeline is to feed straight from a Google Cloud Storage bucket (a managed key/value store), which would eliminate all data handling overhead (in terms of available CPU power).
Also allows for a better comparison with the torch_xla Cifar10 samples. In that vein, trying a PyTorch model would also identify differences there. Plus it’s pretty easy to use a PyTorch TPU-optimised model instead of a fastai model, so that would be a reasonable recommendation to make to callback users if there is a difference.
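e.g. swapping the model should just be something like this (untested sketch, reusing the `data` from the snippet above, or any DataBunch):

```python
# Sketch: hand fastai a plain torchvision ResNet instead of a cnn_learner-built
# model, closer to what the torch_xla samples train.
import torchvision
from fastai.basic_train import Learner
from fastai.metrics import accuracy

model = torchvision.models.resnet50(num_classes=data.c)  # num_classes from the DataBunch
learn = Learner(data, model, metrics=accuracy)
```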
Update:
I was able to set up TPU with GCP and will probably run some benchmarking experiments tomorrow.
Sorry for the delay @TomB @sgugger as I was busy with work and trying to get the correct settings for GCP. Unfortunately, the results weren’t as promising as I hoped.
Here are the results for now.
Reported are the times of the third epoch (allowing the times to stabilize, as the first epoch is usually slower).
Food101 w/ ResNet152
| Accelerator | CPU | Image size | Batch size | num_workers | Time (mm:ss) | Notes |
|---|---|---|---|---|---|---|
| TPU v3-8 | n1-standard-16 | 224 | 32 | 2 | 6:56 | |
| TPU v3-8 | n1-standard-16 | 224 | 32 | 4 | 6:46 | |
| TPU v3-8 | n1-standard-16 | 224 | 32 | 8 | 7:12 | |
| TPU v3-8 | n1-standard-16 | 224 | 64 | 4 | 5:13 | |
| TPU v3-8 | n1-standard-16 | 224 | 128 | 4 | 5:02 | |
| TPU v3-8 | n1-standard-16 | 224 | 128 | 4 | 4:22 | Using bfloat16 |
| 4x Tesla T4 | n1-standard-16 | 224 | 32 | 4 | 5:05 | |
| 4x Tesla T4 | n1-standard-16 | 224 | 32 | 4 | 3:11 | Using fp16 |
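(In case it’s useful for anyone reproducing this: torch_xla’s bfloat16 mode is switched on with the XLA_USE_BF16 environment variable, which I’m assuming is what was used for the bfloat16 run.)

```python
# torch_xla reads this at startup; set it before anything touches the TPU.
import os
os.environ['XLA_USE_BF16'] = '1'
```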
The cost of the 4x Tesla T4 VM instance is $1.323/hr while the cost of the TPU setup is $2.563/hr, so the TPU was not at all as cost-efficient as I had hoped.
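For a rough sense of cost per epoch from those numbers (ignoring startup and idle time):

```python
# Back-of-the-envelope cost of one (third) epoch from the table above:
tpu_bf16 = 2.563 * (4 + 22/60) / 60   # TPU v3-8, bfloat16, 4:22 -> ~$0.19
t4_fp16  = 1.323 * (3 + 11/60) / 60   # 4x Tesla T4, fp16,  3:11 -> ~$0.07
```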
If you guys have any ideas to improve the performance of fastai w/ TPU, please let me know. I am probably going to work on a Kaggle competition for now, but I will still work on this intermittently.