Fastai v2 TPU support

ilovescience · July 24, 2020, 7:07pm

fastai v2 TPU support development thread

This is a thread documenting my efforts adding TPU support to fastai v2. This GitHub repository will be updated with the necessary code.

History

Sometime in October, I discovered the existence of PyTorch XLA (even before the public announcement at PyTorch DevCon 2019). Since then, I had been working on adding TPU support to fastai v1. See here for the original discussion. Originally, I had decided to work on fastai v1 first and then move to fastai v2. I documented my efforts working on fastai v1 over here. While I successfully developed code for single-core and multi-core TPU training with fastai v1, it was much slower than expected and no more efficient than a multi-GPU setup. I received a lot of help from @TomB, @sgugger, and people from the PyTorch XLA team.

After a while, I got busy with classes and research. At that point I decided to switch to fastai v2, since it was becoming much more popular and everybody was likely going to migrate over anyway. Thankfully, much of the code was transferable. However, I ran into some issues due to changes in the PyTorch XLA API and differences between fastai v1 and fastai v2. If I remember correctly, the next thing I had to do was create a new type of DataLoader (similar to DistributedDL) that is compatible with PyTorch XLA. The last time I was able to work on this was in April, since I was busy with classes, research, and more.

I had some discussions with @TomB, which unfortunately we kept private since we weren’t sure about the community’s interest in such discussion and since Jeremy and Sylvain were busy with other work. But now that the community has shown much more interest (ex: some discussion here and recent discussion in the Discord channel), I figured I would open the discussion up again and document my efforts, as well as get help from the community and maybe discuss the best route (ex: a complicated callback vs. a different training loop) to include TPU support in fastai v2.

I look forward to working with the fast.ai community in adding TPU support to fastai v2, in order to make it one of the very few deep learning libraries with such capabilities!

NOTE: Later today or tomorrow I will add details about the kinds of tasks that are needed and what the next steps are.

tyoc213 · July 24, 2020, 9:49pm

Hi there, I’m collaborating with @butchland on fastai_xla_extensions, which originated from the Global PyTorch Hackathon invitation; we got to know each other through the SF Study Group. After a lot of trial and error about how to do the optimizer step, we found that an optimizer that just does the required step makes it work on TPU. If you like, we can (g)meet next week; even now we can use something like the Discord channel, if needed and allowed, if you’d like to exchange “notes” on the different approaches.

github.com

butchland/fastai_xla_extensions/blob/master/fastai_xla_extensions/core.py#L32


    def fake_device():
        gpu_available = torch.cuda.is_available()
        return torch.device(torch.cuda.current_device()) if gpu_available else torch.device('cpu')
    xm = SimpleNamespace(
        optimizer_step = fake_opt_step,
        xla_device = fake_device
    )




# Cell
class XLAOptimProxy:
    "Proxy optimizer to override `opt.step` with Pytorch XLA sync method `xm.optimizer_step` "
    def __init__(self,opt):
        self.opt = opt


    def xla_step(self):
        xm.optimizer_step(self.opt,barrier=True) # sync on gradient update


    def __getattr__(self,name):
        if name == 'step': # override proxying for step method
            return getattr(self,'xla_step')
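
For readers following along, here is a minimal, self-contained sketch of how a proxy like this is usually completed: `step` is rerouted, and every other attribute is forwarded to the wrapped optimizer. This is not the repository's actual code. `DummyOpt`, `OptimProxySketch`, and the stand-in `xm` namespace are hypothetical and exist only so the example runs without a TPU or PyTorch XLA installed.

```python
from types import SimpleNamespace

calls = []  # records what the stand-in XLA step was asked to do

def fake_opt_step(opt, barrier=False):
    # Stand-in for xm.optimizer_step: record the call, then step the optimizer.
    calls.append(("optimizer_step", barrier))
    opt.step()

# Hypothetical replacement for torch_xla.core.xla_model so this runs anywhere.
xm = SimpleNamespace(optimizer_step=fake_opt_step)

class DummyOpt:
    "Hypothetical optimizer with just the members this sketch needs."
    def __init__(self):
        self.lr, self.n_steps = 0.1, 0
    def step(self):
        self.n_steps += 1

class OptimProxySketch:
    "Forward everything to `opt`, except `step`, which goes through `xm`."
    def __init__(self, opt):
        self.opt = opt
    def xla_step(self):
        xm.optimizer_step(self.opt, barrier=True)
    def __getattr__(self, name):
        # Only reached when normal lookup fails, so `opt` and `xla_step`
        # never pass through here.
        if name == "step":
            return self.xla_step
        return getattr(self.opt, name)

proxy = OptimProxySketch(DummyOpt())
proxy.step()  # routed through the stand-in xm.optimizer_step
```

Because `__getattr__` is only a fallback, the proxy stays transparent for things like `proxy.lr`, while `proxy.step()` always goes through the XLA-aware path.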

The other things are mostly just things that need to be done. But as Jeremy once said, something like “it should be easy”.

Currently it works on a single TPU, but we have found some “problems” or slow parts when it runs on TPU, so if anybody reading knows about TPUs and can link some optimization documents, or explain how to track down specific performance issues on TPU, that would be great. Then later we can start on distributed training.

We haven’t yet asked the people making fastai2 for help, but hopefully now that we have more attention and Jeremy is back we can start asking a lot of things.

ilovescience · July 26, 2020, 4:13am

Thank you for sharing. @butchland also mentioned the project in the Discord channel. I reviewed your code, and it seems to support only a single core, while I am currently working on multiple cores. Additionally, I would like to point out that the desired approach is to keep everything in a callback unless it’s truly necessary to monkey-patch something or change the existing classes/functionality. That is how multi-GPU training and mixed precision training are implemented, and hopefully most of TPU training will be that way as well. See this, my fastai v1 single-core implementation, for inspiration…

Let me know if you have any questions or ideas!

ilovescience · July 26, 2020, 4:33am

So I started looking into stuff yesterday and today and found out a couple things.

First, there was a change in PyTorch 1.6 that would break fastai2, as discussed here. I think I just need to somehow add a generator attribute to DataLoader and _FakeLoader but I have to investigate this further.

In the meantime, I just decided to use torch-xla v1.5 (as opposed to the nightly version, which seems to require PyTorch 1.6), which is missing some features, but the core features are there. I had a minor bug where the DataLoader was not put on the TPU device, which was pretty easy to fix. But I cannot find where in the code fastai2 puts the DataLoader on the GPU if present.

Next, I discovered a problem with pickling. See the below error:

2020-07-26 01:20:48.054366: E    2888 tensorflow/compiler/xla/xla_client/tf_logging.cc:11] XLA tensors do not have storage
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/distributed/parallel_loader.py", line $65, in _worker
    batch = xm.send_cpu_data_to_device(batch, device)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 518, in send_cpu_data_to_device
    return ToXlaTensorArena(convert_fn, select_fn).transform(data)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 291, in transform
    return self._replace_tensors(inputs)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 285, in _replace_tensors
    convert_fn)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 167, in for_each_instance_rewrite
    return _for_each_instance_rewrite(value, select_fn, fn, rwmap)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 153, in _for_each_instance_rewrite
    result.append(_for_each_instance_rewrite(x, select_fn, fn, rwmap))
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 153, in _for_each_instance_rewrite
    result.append(_for_each_instance_rewrite(x, select_fn, fn, rwmap))
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 155, in _for_each_instance_rewrite
    result = copy.copy(value)
  File "/anaconda3/envs/torch-xla-1.5/lib/python3.6/copy.py", line 96, in copy
    rv = reductor(4)
  File "/home/tmabraham/fastai2/fastai2/torch_core.py", line 252, in __reduce_ex__
    args = (type(self), self.storage(), self.storage_offset(), tuple(self.size()), self.stride())
  File "/home/tmabraham/fastai2/fastai2/torch_core.py", line 272, in _f
    res = getattr(super(TensorBase, self), fn)(*args, **kwargs)
RuntimeError: torch_xla/csrc/tensor_impl.cpp:142 : XLA tensors do not have storage

For multiprocessing, the training loop function needs to be pickled, and TensorBase implements the appropriate pickling function __reduce_ex__. However, it passes self.storage() as an argument, and XLA tensors do not have storage. Looking through the PyTorch Tensor code (the TensorBase code is similar), you can see here that there’s separate pickling functionality for XLA tensors. It looks like this might be needed for proper TPU functionality?
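
To make the failure mode concrete, here is a toy sketch of a `__reduce_ex__` that branches on the device, so the storage-based path is never taken for a tensor without storage. Everything in it is hypothetical: `ToyTensor` and `_rebuild_toy` are plain-Python stand-ins, not fastai2's `TensorBase` or PyTorch's real pickling code. It only illustrates the idea of serializing a host copy plus the device name. `copy.copy` is used because, as in the traceback above, it goes through `__reduce_ex__(4)`, the same hook pickling uses.

```python
import copy

def _rebuild_toy(values, device):
    # Rebuild function returned by __reduce_ex__; receives the saved arguments.
    return ToyTensor(values, device)

class ToyTensor:
    "Hypothetical stand-in for a tensor whose storage() fails on XLA devices."
    def __init__(self, values, device="cpu"):
        self.values, self.device = list(values), device

    def storage(self):
        if self.device.startswith("xla"):
            raise RuntimeError("XLA tensors do not have storage")
        return self.values

    def cpu_values(self):
        # Stand-in for copying the data to the host first.
        return list(self.values)

    def __reduce_ex__(self, proto):
        if self.device.startswith("xla"):
            # Device tensor: save a host copy plus the device name.
            return (_rebuild_toy, (self.cpu_values(), self.device))
        # Regular tensor: the storage-based path is fine.
        return (_rebuild_toy, (self.storage(), self.device))

t = ToyTensor([1, 2, 3], device="xla:0")
shallow = copy.copy(t)  # calls t.__reduce_ex__(4), then _rebuild_toy(...)
```

Without the device check, `copy.copy(t)` would raise the same "do not have storage" error as the traceback, since the storage-based branch would call `storage()` on the XLA tensor.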

So I will likely have to make such changes to the fastai2 codebase, but I have not contributed to fastai2 before. I have contributed to fastai, following the git guide. Given the nbdev approach to fastai2, what are the major differences in library development? I assume I again clone the repository and make changes in a different branch, but in the notebooks?

I will try to make the necessary changes tomorrow…

butchland · July 26, 2020, 12:20pm

Hi @ilovescience,

Yes, we’re currently focused on a single TPU core and trying to see where the bottlenecks are before implementing multiple cores.

Thanks for your suggestions to reduce the amount of monkey-patching – I’ve since updated it to now use callbacks as well.

As for the monkey-patching, I had to do some of that (especially to get default_device to return a TPU) because I don’t think Sylvain or Jeremy initially considered an environment where, if a GPU is not available, the default would be anything other than a CPU… (which is the case when a TPU is available)
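
To illustrate the fallback order being described (GPU first, then TPU, then CPU), here is a tiny sketch. `pick_default_device` and its boolean arguments are hypothetical and are not fastai2's `default_device`; in real code the flags would come from checks such as `torch.cuda.is_available()` and whether `torch_xla` can be imported.

```python
def pick_default_device(cuda_available, xla_available):
    "Return the device kind to use by default, as a plain string."
    if cuda_available:
        return "cuda"  # keep the existing behaviour when a GPU is present
    if xla_available:
        return "xla"   # no GPU, but a TPU is reachable through PyTorch XLA
    return "cpu"       # nothing else available

# Hypothetical environments, just to show the three outcomes.
choices = {
    "gpu box": pick_default_device(True, False),
    "tpu vm": pick_default_device(False, True),
    "laptop": pick_default_device(False, False),
}
```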

We’ll probably have to provide a PR in the fastai2 codebase to handle this.

In any case, our goal is to make it so that using a TPU on fastai would require minimal changes to your existing fastai notebooks or code.

Best regards and keep us updated on your work with multiple TPU cores!

Butch

cc: @tyoc213

kcturgutlu · July 26, 2020, 2:32pm

This is a very exciting thread, thanks for creating it!

I personally experienced a lot of speed improvement (~10x to 20x) using TPUs (8 cores) in several Kaggle competitions so far. So it’s definitely a must-have in our native fastai2 code. I believe it will be more widely used by the community, even surpassing GPU usage, if it has an easy-to-use interface similar to to_distributed().

Would it make sense to systematically tackle this problem and perhaps divide the workload? I would be more than happy to help. I previously attempted to create a similar Learner class in here for fastai-v1 to work with multicore TPU. A callback is definitely much better, as noted in this thread.

I think it would be more important to get multicore working since most TPU devices offered publicly (Kaggle, Colab) are of that kind and it would allow us to use TPUs for the main reason - speed.

You seem to be far ahead in terms of exploration done so far, so please let me know if there are any areas that I can help with.

tyoc213 · July 26, 2020, 2:54pm

Indeed, TPUs are little monsters, but we have found some caveats about performance in particular places. We have tried to keep track of what we are testing in notebooks. For example, we used a callback doing only the required optimizer step, and it ran with nothing more, but we left that behind because we didn’t understand much at the time (still don’t).

I meet with Butch during the week, at about 8 to 10 CT or so; maybe we will share the link for anyone who wants to hang out (and see what we are stuck on), and now that the Discord server has respawned we can join a channel and just talk.

You can fork the 2 repos; you will get it, because you already know how to do it on your own. And yes, we can use some help.

ilovescience · July 26, 2020, 7:30pm

@kcturgutlu @tyoc213 @butchland

If we want to do this, I think it would be best to discuss with @jeremy and potentially even the PyTorch XLA team. I would be happy to lead such discussions.

Also tagging @TomB, who was involved in the preliminary work and has demonstrated great expertise in the field; I would love to have him join our discussions if he’s available.

ilovescience · July 26, 2020, 8:22pm

You’ll probably have to ask Jeremy about that. Personally, I don’t think this is something that’s strictly necessary, but it’s likely a decision that Jeremy will need to make about whether or not to include TPUs as a default. But I guess I didn’t really have a problem with that and instead was talking about the separate optimizer classes you created, which is not necessary.

Yep, my goal is the same!

I see you are using my kernel (developed with the help of the PyTorch XLA team and @abhi1thakur).
Which version of the kernel is the working one? The latest one is just a quick save, and while there is an older working one, I am not sure if that’s your final fastai version or if there’s more to it?

Exactly my thoughts! I have tried single-core TPU training with very little benefit. Hence, I have been focusing on multi-core TPU training.

Anyway, I will work on it more today and keep you guys updated in this thread!

kcturgutlu · July 26, 2020, 8:35pm

Yeah right, thanks a lot, and sorry for not crediting you guys here lol. Cool, @abhi1thakur is also here! My late thanks to you for the great TPU kernels in many recent competitions. Let me edit my post to add the working version. I really liked @abhi1thakur’s approaches on how to use either multicore for speeding up a single experiment or running multiple parallel single-core experiments, e.g. different hyperparams or parallel cross validation. Kaggle is a great place for learning TPUs IMO. I agree on moving forward with the help of @jeremy and the PyTorch XLA team, at least with their guidance if not full support.

jeremy · July 27, 2020, 8:05pm

TPU support is my first priority after getting fastai2 and course-v4 out the door. I haven’t looked at it at all yet. Goal would be to try to have it working without changing the training loop if possible - i.e. make it a callback.

ilovescience · July 27, 2020, 8:19pm

That sounds great!

I guess the best approach @butchland @tyoc213 @kcturgutlu is that we could work separately, and when fastai2/course-v4 is released, we could meet up and discuss design decisions and systematically approach the remaining tasks? Since @butchland and @tyoc213 are already working on single-core TPU while I am working on multi-core TPU, we could keep working on this for the next couple of weeks till the fastai2 release.

I agree this is optimal, but because of the multiprocessing approach to TPU training, the training loop has to be spawned once on each of the 8 cores. There may be a way to change the fit function under the hood without requiring much change to the user’s code. But the current approach I am working on is as follows:

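import torch_xla.distributed.xla_multiprocessing as xmp
from fastai2.vision.all import *

def train_loop(index):
    # index is the process index that xmp.spawn passes to each spawned process
    train_df = ...
    food = DataBlock(...)
    dls = food.dataloaders()
    learn = cnn_learner(dls, model, metrics).to_tpu_distributed()  # adds the TPU callback
    learn.fit(3)

if __name__ == "__main__":
    xmp.spawn(train_loop, nprocs=8, args=())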

The full code is here. How we can make this even easier for the user is an example of what we may need to discuss.

ilovescience · August 3, 2020, 3:56am

I realize another thing that needs to be discussed and fixed later down the line is the progress bars, which are repeated 8 times for each process. Anyway I’ll look into that once I progress further.
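
One possible direction, sketched here only as an illustration: gate any reporting on the process index so that a single process writes output. `make_reporter` is a hypothetical helper and not part of fastai2 or PyTorch XLA; in real code the index check could be replaced by PyTorch XLA's `xm.is_master_ordinal()`. The loop below just simulates the 8 process indices instead of actually spawning processes.

```python
def make_reporter(index, master_index=0):
    "Return a print-like function that is silent on every process except one."
    def report(*args):
        if index != master_index:
            return None          # non-master processes stay quiet
        line = " ".join(str(a) for a in args)
        print(line)
        return line
    return report

# Simulate the 8 processes, each of which would receive its own index.
outputs = [make_reporter(i)("epoch", 0, "done") for i in range(8)]
```

Only the entry for index 0 produces a line; the other seven return `None`, which is the behaviour one would want from a progress bar under multi-core training.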

I fixed some pickling problems. I monkey-patched TensorBase and Optimizer so that they are pickled correctly and can be accessed correctly by PyTorch XLA. Now I get an error 4 batches into training, for which I raised an issue in the PyTorch XLA repository (since many of the errors are not very clear).

ilovescience · August 6, 2020, 7:53pm

I see @butchland and @tyoc213 discovered an issue with batch transforms on the TPU:

Keep us updated on the progress of this issue!

ilovescience · August 8, 2020, 4:37am

TPU wiki (at Jeremy’s suggestion):

butchland · August 8, 2020, 6:33am

I just wanted to document a proposed workaround for the problem of the slow batch transforms while waiting for the PyTorch XLA team to find a solution for the affine grid sample calls (affine_grid_generator and grid_sample2d), which, if @tyoc213’s interpretation of the debug output is correct, generate an aten call, which I believe means the data is transferred to the CPU, processed on the CPU, and transferred back to the TPU…

The idea is that if the dataloader is running on a TPU, it should execute all the batch transforms on the CPU and move the batch to the TPU afterwards…

This is much faster than the current process, since my performance profiling shows that running the batch transforms on TPUs is even slower than running them on the CPU, most probably because of the aten calls.
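
A rough sketch of the idea, under the assumption that batches can be transformed on the host and then moved in one go. All names here are hypothetical (`cpu_then_device`, `double`, `add_one`, `to_tpu`), and plain lists stand in for tensors so the example runs without PyTorch XLA; it is not the fastai_xla_extensions implementation.

```python
def cpu_then_device(batches, batch_tfms, to_device):
    "Yield batches with all `batch_tfms` applied on the host, then moved once."
    for batch in batches:
        for tfm in batch_tfms:
            batch = tfm(batch)   # every transform runs before the transfer
        yield to_device(batch)   # single host-to-device move per batch

# Hypothetical stand-ins: lists for tensors, a tagging function for the move.
def double(batch):
    return [x * 2 for x in batch]

def add_one(batch):
    return [x + 1 for x in batch]

def to_tpu(batch):
    return ("xla:0", batch)

moved = list(cpu_then_device([[1, 2], [3, 4]], [double, add_one], to_tpu))
```

The point of the ordering is that the device only ever sees fully transformed batches, so no transform has to fall back from the TPU to the CPU mid-pipeline.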

I’ve made a github enhancement issue to track this implementation (in case anyone is interested)

tyoc213 · November 29, 2020, 5:19am

Just wanted to let people here know that we are still working on this. We can now develop on our own computers, which makes debugging some things easier: https://tyoc213.github.io/blog/xla/fastai/2020/11/28/compiling-xla-locally.html

tyoc213 · December 13, 2020, 11:35pm

A little update, about something that was missing: https://tyoc213.github.io/blog/xla/fastai/2020/12/13/finding-nemo-a-bug-journey.html

butchland · December 15, 2020, 9:33am

I think this is the right link? : https://tyoc213.github.io/blog/xla/fastai/2020/12/13/finding-nemo-a-bug-journey.html