Yeah, not sure there’s much difference between the two (was kinda clutching at straws to come up with any pros/cons when giving the two options). Just doing it in the callback is probably a bit less boilerplate code, but perhaps a tiny bit harder to follow.
And yeah, think the key thing is getting something that addresses all the issues and then seeing how it looks to see if there’s a cleaner way. And to provide a clear base for looking at v2 support (without yet trying to hit the still somewhat moving target there). There’s likely not too much point focusing on v1, as I think there are still enough kinks in the torch_xla stuff that any real use is more a v2 timeframe thing.
Interestingly, I got this error when running learn.fit(2). It seems to come at the end of validation for epoch 1, so I am surprised it didn’t show up before when I was running learn.fit(1):
File "/content/tpu_distributed_fastai.py", line 117, in train_loop
learn.fit(2)
File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 200, in fit
fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 106, in fit
cb_handler=cb_handler, pbar=pbar)
File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 66, in validate
if average: return (to_np(torch.stack(val_losses)) * nums).sum() / nums.sum()
RuntimeError: stack expects a non-empty TensorList
Hrmm, something to do with the loss calculation in loss_batch I’d guess.
Maybe a detach issue (I checked, and as you’d hope, isinstance(an_xla_tensor, Tensor)==True). Tried to add some printing there in a colab notebook but it killed the kernel (might have been the wrong notebook, or left it in a bad state).
Maybe check the type of the returned losses. Either in on_backward_end or perhaps by wrapping learn.loss_func.
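Something quick like this might show what’s coming back (untested sketch; debug_loss is just a throwaway name):

```python
# Untested sketch: wrap learn.loss_func to print what kind of tensor it returns,
# to see whether XLA tensors are making it into fastai's loss bookkeeping.
def debug_loss(loss_func):
    def _inner(*args, **kwargs):
        loss = loss_func(*args, **kwargs)
        print(type(loss), getattr(loss, 'device', None))  # expect an xla device here
        return loss
    return _inner

learn.loss_func = debug_loss(learn.loss_func)
```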
There’s also a bit of stuff around FlattenedLoss that might be breaking with XLA tensors.
We’re using a custom class GetAttr in v2 that handles this and gets us proper error messages without failing silently.
Hello @TomB
Unfortunately, I was busy yesterday and most of today with other work. However, with whatever time I had, I was able to fix the error, mainly by putting print statements everywhere.
Indeed, I did not properly deal with creating a new ParallelLoader each epoch, so I was running into it having no more iterations left.
The code now runs fine with 2 epochs. I will post the code soon (after I remove all the print statements and unnecessary debugging code).
Great work, sounds like you’ve got all, or at least most, issues sorted, ignoring any restructuring. Any performance gains now it’s all actually working, or still significantly slower?
Here is the code:
Regarding performance gains, I will look into this more carefully this weekend. I think the actual epoch times may be faster, with most of the time going to setting up the TPU and putting things onto it.
Also, I will look into creating a wrapper for PerDeviceLoader instead of having to re-initialize it every epoch.
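Something like this is roughly what I have in mind for that wrapper (untested sketch against the torch_xla API I’m using; the class name is just a placeholder):

```python
import torch_xla.distributed.parallel_loader as pl

class TPUDataLoader:
    "Sketch: re-create the ParallelLoader on every iteration so each epoch gets a fresh one."
    def __init__(self, dl, device):
        self.dl, self.device = dl, device

    def __iter__(self):
        # ParallelLoader/PerDeviceLoader are exhausted after one pass,
        # so build a new one each time the loader is iterated.
        return pl.ParallelLoader(self.dl, [self.device]).per_device_loader(self.device)

    def __len__(self): return len(self.dl)
```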
Finally, the PyTorch XLA team will look into making multiprocessing work in Colab (it currently doesn’t).
Thanks again @TomB for all your help so far.
EDIT: There we go, looks like colab multiprocessing might get sorted in torch_xla, if some changes get adopted in PyTorch anyway. Or see my notebook for the suggested quick hack until then. Note though that the hack will probably break torch CUDA tensors in data loaders (though of course you can’t use both at once on colab).
Not sure this actually matters to the fastai callback though. The error happens if you try to define the function you spawn in the notebook. The callback should probably spawn a training function from its own code rather than have the user call xmp.spawn directly. Not quite sure how that works when some of the code is in your notebook, but given that dataloaders use multiprocessing and you can have code they run (like a custom ItemList) defined in a notebook, I think that should work with the callback.
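Something roughly like this is what I’m imagining for the callback side (untested sketch; train_tpu and _xla_run are made-up names, and I’m assuming a fork start method to dodge the notebook pickling issue, hence the CUDA caveat above):

```python
import torch_xla.distributed.xla_multiprocessing as xmp

def _xla_run(rank, learn, epochs):
    # Hypothetical per-process entry point: lives in the callback's module
    # (importable), not in the notebook. Only the spawning structure is shown;
    # the real version would also do the per-process device/data setup.
    learn.fit(epochs)

def train_tpu(learn, epochs, nprocs=8):
    # 'fork' avoids the pickling problem with notebook-defined code,
    # at the cost of breaking CUDA tensors in data loader workers.
    xmp.spawn(_xla_run, args=(learn, epochs), nprocs=nprocs, start_method='fork')
```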
Yeah, was going to note that the first epoch will always be really slow on TPU. It’s doing a lot of work there to compile stuff to XLA and set it up on the TPU, so even in TensorFlow I think the first epoch is fairly slow. So given you’ve only ever really run the first epoch before now, that could be a big part of it.
Nah, thanks for all your work. Glad to have been of some assistance.
Thanks for sharing this fix! Currently it works fine, but there is an error when rerunning the code cell, which I will post about on the issue.
Let’s hope the PyTorch XLA team will implement a fix in their library soon.
Thanks to the fix, the progress bars are coming up properly, with two progress bars (one for epochs, one for batches) appearing for each process. Unfortunately, I don’t see any improvement in time, which is really odd. However, I have only been testing with MNIST_SAMPLE and a ResNet50, so I will test with more complex datasets and larger models.
Maybe this thread is interesting for you:
@TomB ok I am starting to see a slight speed-up by changing the model from ResNet50 to ResNet152. With a regular K80 GPU on Colab, it takes 44 sec, but with TPU, each of the processes takes 30 sec (excluding the first epoch). Of course, this speed-up is not impressive, but I haven’t changed the batch size (which can probably be larger on the TPU) or any aspect of the underlying data loading. And it is already better than the single TPU experiments I did, where I saw no improvement in any case. Of course, with larger models and more complex datasets, we will probably see even better speed-up.
Don’t know that TPU cores are individually that fast; you just have access to a lot of them, with better scaling than using many GPUs. To give some idea, looking at GCP prices it’s about $0.45/hr for a K80 and 56.25c/hr per TPU core ($4.50/hr for a full 8-core device), both at non-preemptible rates. Though I think cost-wise it’s fairly borderline between GPU and TPU.
Also going to depend a bit on your NN model. TPUs don’t handle any dynamically shaped tensors well. Was that a fastai model? They may not be well optimised for TPU.
I just used the same code from before.
Also, I tried with a ResNet152 and CIFAR10 and it was about 2x faster than K80, even with tuning the batch size. I got up to a batch size of 512 and could not go higher as then there would not be enough data points for a batch when the dataset is distributed. But it probably could allow higher batch sizes.
With Cifar10, given the small size, it might be interesting to see how much difference it makes to run it in memory. The PyTorch Cifar10 dataset is in-memory IIRC. This was tested in the performance testing thread and it didn’t make too much difference (caching makes up for it). But that was with the much lower batch sizes of a single GPU (and may have been on GCP not colab). Having it on-disk still means having to call OS functions before it hits the cache.
Think there’s an in-memory ItemList around. Or from the docs it looks like adding the .c and .loss_func to the PyTorch dataset might work for training.
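i.e. something like this (going off the docs, untested):

```python
# Untested sketch from the DataBunch docs: give a plain (in-memory) PyTorch
# dataset the attributes fastai v1 looks for, then wrap it in a DataBunch.
from fastai.basic_data import DataBunch
from torch import nn
from torchvision import datasets, transforms

tfms = transforms.ToTensor()
train_ds = datasets.CIFAR10('data', train=True,  download=True, transform=tfms)
valid_ds = datasets.CIFAR10('data', train=False, download=True, transform=tfms)

for ds in (train_ds, valid_ds):
    ds.c = 10                             # number of classes
    ds.loss_func = nn.CrossEntropyLoss()  # loss fastai will pick up

data = DataBunch.create(train_ds, valid_ds, bs=512)
```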
Yes, however I don’t think this is a good representation of actual datasets.
I am trying Food101 right now (has 101,000 images!). I might also try in GCP, which has a better CPU, in case it is CPU limited.
I also have to make a wrapper for PerDeviceLoader.
Yeah, testing against in-memory datasets would help eliminate some of the slowdown that colab’s inadequate CPU power likely causes. As noted in that Kaggle thread, a recommended pipeline is to feed straight from a Google Cloud Storage bucket (a managed key/value store), which would eliminate all data handling overhead (in terms of available CPU power).
Also allows for a better comparison with the torch_xla Cifar10 samples. In that vein, trying a PyTorch model would also identify differences there. Plus it’s pretty easy to use a PyTorch TPU-optimised model instead of a fastai model, so that would be a reasonable recommendation to make to callback users if there is a difference.
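e.g. swapping the model should just be something like this (untested sketch, reusing the `data` from the snippet above, or any DataBunch):

```python
# Sketch: hand fastai a plain torchvision ResNet instead of a cnn_learner-built
# model, closer to what the torch_xla samples train.
import torchvision
from fastai.basic_train import Learner
from fastai.metrics import accuracy

model = torchvision.models.resnet50(num_classes=data.c)  # num_classes from the DataBunch
learn = Learner(data, model, metrics=accuracy)
```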
Update:
I was able to set up TPU with GCP and will probably run some benchmarking experiments tomorrow.
Sorry for the delay @TomB @sgugger as I was busy with work and trying to get the correct settings for GCP. Unfortunately, the results weren’t as promising as I hoped.
Here are the results for now.
Reported are the times of the third epoch (allowing the times to stabilize, as the first epoch is usually slower).
Food101 w/ ResNet152
| Accelerator | CPU | Image size | Batch size | num_workers | Time (mm:ss) | Notes |
|---|---|---|---|---|---|---|
| TPU v3-8 | n1-standard-16 | 224 | 32 | 2 | 6:56 | |
| TPU v3-8 | n1-standard-16 | 224 | 32 | 4 | 6:46 | |
| TPU v3-8 | n1-standard-16 | 224 | 32 | 8 | 7:12 | |
| TPU v3-8 | n1-standard-16 | 224 | 64 | 4 | 5:13 | |
| TPU v3-8 | n1-standard-16 | 224 | 128 | 4 | 5:02 | |
| TPU v3-8 | n1-standard-16 | 224 | 128 | 4 | 4:22 | Using bfloat16 |
| 4x Tesla T4 | n1-standard-16 | 224 | 32 | 4 | 5:05 | |
| 4x Tesla T4 | n1-standard-16 | 224 | 32 | 4 | 3:11 | Using fp16 |
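(In case it’s useful for anyone reproducing this: torch_xla’s bfloat16 mode is switched on with the XLA_USE_BF16 environment variable, which I’m assuming is what was used for the bfloat16 run.)

```python
# torch_xla reads this at startup; set it before anything touches the TPU.
import os
os.environ['XLA_USE_BF16'] = '1'
```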
The cost of the 4x Tesla T4 VM instance is $1.323/hr while the cost of the TPU setup is $2.563/hr, so the TPU was not at all as cost-efficient as I had hoped.
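For a rough sense of cost per epoch from those numbers (ignoring startup and idle time):

```python
# Back-of-the-envelope cost of one (third) epoch from the table above:
tpu_bf16 = 2.563 * (4 + 22/60) / 60   # TPU v3-8, bfloat16, 4:22 -> ~$0.19
t4_fp16  = 1.323 * (3 + 11/60) / 60   # 4x Tesla T4, fp16,  3:11 -> ~$0.07
```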
If you guys have any ideas to improve the performance of fastai w/ TPU, please let me know. I am probably going to work on a Kaggle competition for now, but I will still work on this intermittently.