How to create a callback using torch.multiprocessing (TPU)

I did note that the recorder.silent option Sylvain mentioned doesn’t seem to actually stop the progress bars, just the message printing. The progress bars are created in fit and validate (the base functions, not the Learner methods), so you can’t control them without rewriting the whole training loop, which is obviously not a good option. It may not be an issue to have the progress bars in the subprocesses, given you don’t see that output with %run. It does mean they are spewing output, though, which could cause issues if you want to display that to users in case there’s error information in there. But I think the main error reporting was coming through catching exceptions in the subprocesses and sending them to the main process.
So it might be fine to just stick some sort of recorder in the subprocesses that sends progress back to the main process for reporting (using standard multiprocessing communication) and ignore the standard progress reporting. Or I guess you could monkey-patch master_bar in the subprocesses, or maybe add something to fastprogress so you can attach your own custom progress bar type. I can imagine this might be useful in other cases (e.g. I’d been interested in remote progress tracking, which could work that way and would get around the current issue where losing your web connection to the notebook kills progress display even though training is still working).
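A minimal sketch of that first idea; the callback name, queue wiring and rank argument are all hypothetical, not existing fastai or torch_xla API:

import torch.multiprocessing as mp
from fastai.basic_train import LearnerCallback

class QueueProgress(LearnerCallback):
    "Push per-batch progress from a training subprocess back to the main process."
    def __init__(self, learn, queue, rank):
        super().__init__(learn)
        self.queue, self.rank = queue, rank

    def on_batch_end(self, last_loss, iteration, **kwargs):
        # Send plain CPU floats only; don't put XLA/CUDA tensors on the queue.
        self.queue.put((self.rank, iteration, float(last_loss)))

# Main process (sketch): queue = mp.get_context('spawn').SimpleQueue(), pass it to
# each worker, then poll queue.get() in a loop to drive whatever display you want.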

Indeed I noticed this as well. However, I wanted @sgugger to confirm this as there could be something we are missing. I will look into the options you listed.
I think the distributed callback saves the results from each GPU into a numpy array file after each epoch. I will probably do something like that.

Looks like the multi-process samples print from within each process for within-epoch stats, then at the end of each epoch they print and log to a TensorBoard summary. It doesn’t look like they ever send the results back to the main process.
In the threading samples it prints within each epoch and the results are returned to the main thread after each epoch.
The numpy writing I’ve seen was just in the colab examples, which are threaded, so you get worker result collation from the torch_xla API, unlike in the multiprocessing case. But we could be looking at different samples, so if you see numpy logging in a multiprocessing sample then send a link.

I was talking about the fastai distributed callback; it seems you are talking about the TPU code.

Ah, OK, yeah, TPU code. You could use the fastai distributed method and just save a file from each process. Or it shouldn’t be too hard to do multiprocessing comms. There’s multiprocessing.queues.SimpleQueue, which handles multiple producers. With torch.multiprocessing you can even send Tensors over it, and it uses shared memory so no copies are needed (they would have to be CPU tensors; I don’t imagine it will like XLA ones, and CUDA tensors are already fiddly).
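A rough sketch of what that could look like, assuming the standard torch.multiprocessing behaviour of sending CPU tensors through shared memory (the worker function and shapes here are just placeholders):

import torch
import torch.multiprocessing as mp

def worker(rank, queue):
    # Pretend these are per-epoch results; they must be CPU tensors.
    preds = torch.randn(4, 10)
    queue.put((rank, preds))

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    queue = ctx.SimpleQueue()
    procs = [ctx.Process(target=worker, args=(r, queue)) for r in range(2)]
    for p in procs: p.start()
    results = [queue.get() for _ in procs]  # tensors arrive via shared memory
    for p in procs: p.join()
    print({rank: t.shape for rank, t in results})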

On the %run/!python thing, it looks like %run is messing up spawning because it does some odd module loading. In JupyterLab I get an error trying to %run a multiprocessing script. Doing !python works and actually displays console progress bars, but they don’t work correctly: it creates one bar per process, but then only the last bar counts up. I guess all the sub-processes end up overwriting that last bar because they’re just printing (and sending shell escapes to move back to the beginning of the last line to update).
The error is similar to what you get trying to spawn from functions within a notebook. It needs to find the file with the function you want to spawn, but if the module isn’t set up right it can’t do that. That might be easier to fix for %run, as at least the code exists as a python file, whereas in a notebook it doesn’t (I was actually playing around a bit with how to do this and think it might be possible with a few tricks).
One thing you might try in colab is using subprocess to launch the script. Then you can hook the console output from the script, process it, and display it to the user in the main process. You could then have a run_tpu_train_script function for colab (outside of colab you’d just run the script).
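As a rough sketch of that idea (the function name and script path are hypothetical, just following the naming suggested above):

import subprocess, sys

def run_tpu_train_script(script='train_tpu.py', *args):
    "Launch the training script and stream its console output back to the notebook."
    proc = subprocess.Popen([sys.executable, script, *args],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            text=True, bufsize=1)
    for line in proc.stdout:  # could parse progress here rather than just printing it
        print(line, end='')
    return proc.wait()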

Here’s the testing script I used.

Thanks for this gist! I see you have also run into the __spec__ issue!

I will look into this option of using subprocess to launch the script. Right now, though, I will just use !python and worry about this issue when I get to displaying TPU distributed training progress.

Yeah, it’s only for progress, should run the same.

Oh, and are you sure you’ve actually run the model with the fastai setup you gave in the linked post? I ask because every time I tried to run a fastai cnn_learner (or even a simple_cnn) it crashed, killing my colab runtime (I was using threading), without any related error logs. I thought it was related to the AdaptiveAveragePool, as when I changed a resnet model from the TPU colab samples from AveragePool to AdaptiveAveragePool it also crashed. Though I’m not sure I actually tested with only that change; I might have also changed the dataset and transforms (I was trying to run a model on something other than CIFAR).

The same model and fitting works with GPU. Is that what you were asking?

No, I was wondering whether you actually got a fastai model running on TPU? That’s what I thought was causing me issues: I could only get their models to run on TPU. I ran the model manually on CPU to check I wasn’t messing up sizes (but didn’t try GPU).

Oh yeah, I also have a single-TPU callback and it was running, but now I am not completely sure it’s actually right.

Here’s the callback:

from fastai import *
from fastai.core import *
from fastai.torch_core import *
from fastai.vision import *
from fastai.basic_train import *
import torch_xla.core.xla_model as xm  # torch_xla's XLA device/optimizer helpers

def to_device(b:Collection[Tensor], device:torch.device)->Collection[Tensor]:
    "Recursively map lists of tensors in `b` to `device`."
    return recurse(lambda x: x.to(device), b)

def batch_to_device(b:Collection[Tensor], device:torch.device)->Collection[Tensor]:
    "Move the input and target of batch `b` to the TPU."
    return [to_device(b[0], device), to_device(b[1], device)]

class SingleTPUTraining(LearnerCallback):
    def __init__(self, learn:Learner):
        super().__init__(learn)

    def on_train_begin(self, **kwargs:Any)->None:
        # Grab the XLA device, move the model onto it and add a transform
        # that moves each batch onto it as well.
        self.device = xm.xla_device()
        self.learn.model = self.learn.model.to(self.device)
        self.learn.data.add_tfm(partial(batch_to_device, device=self.device))

    def on_backward_end(self, **kwargs:Any)->None:
        # Step the underlying optimizer through torch_xla so the pending XLA graph runs.
        xm.optimizer_step(self.learn.opt.opt, barrier=True)

def _to_tpu(learn:Learner) -> Learner:
    learn.callback_fns.append(SingleTPUTraining)
    return learn

Learner.to_tpu = _to_tpu
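
For reference, usage would presumably look something like this (just following how the patch above reads; the data and architecture are placeholders, not something I’ve run):

from fastai.vision import *

data = ImageDataBunch.from_folder(untar_data(URLs.CIFAR), valid='test')  # any DataBunch
learn = cnn_learner(data, models.resnet18, metrics=accuracy).to_tpu()
learn.fit_one_cycle(1)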

Cool, will try that when I have some more time. One thing to be careful of is that the torch_xla stuff will (possibly fairly silently) fall back to CPU mode if it can’t find a TPU device (based on environment variables, I think), so that’s one to watch out for. It should be pretty obvious if you try to run 8 parallel CPU models, but you might not notice with one, given all the logging that comes out. Though I ran out of system RAM just trying to run one model, admittedly with various test tensors around (interestingly, colab then let me upgrade to 32GB of RAM, which was a nice find; not sure if you’ve found that).
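One quick way to check, I think (this is my assumption about how to verify it, based on the env var the colab TPU samples set; the xla_device_hw helper may depend on your torch_xla version):

import os
import torch_xla.core.xla_model as xm

print('XRT_TPU_CONFIG =', os.environ.get('XRT_TPU_CONFIG'))  # unset usually means CPU fallback
device = xm.xla_device()
print('device:', device)
print('hardware:', xm.xla_device_hw(device))  # should report 'TPU' rather than 'CPU'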

I am unsure if it is running on CPU. It does take longer on the TPU than on the GPU, or sometimes about the same time. This was true even when increasing model and dataset complexity (larger ResNet models, MNIST, CIFAR10, CIFAR100).

If you do find an error, please let me know!

For GPU or CPU?

I do not see this option in either case though. Probably because I use Colab so much :joy:
I am 90% sure there is a deep learning algorithm deciding which resources to make available! I haven’t gotten access to a Tesla T4 for like 2 weeks now :frowning:

Also, for monkey-patching __len__, it should be just fine to return the total length evenly distributed over all the TPUs, right?

Also, for monkey-patching dl.dataset, I think it makes sense to just return the entire dataset.

Though these should be added to the XLA library and I will raise an issue tomorrow about these methods/attributes for ParallelLoader.

That was a TPU instance. And yeah, I wouldn’t be surprised if they limit it; I presume, like Kaggle, they get a lot of use from a subset of users.

OK, not sure why I got those crashes before; I just ran a cnn_learner fine, with otherwise everything from the cifar10 colab sample (so PyTorch in-memory dataset and 8-core threading). Though it didn’t work properly. First epoch was OK:
[xla:7](0) Loss=2.47026 Rate=32.41 GlobalRate=32.41
But then:
[xla:2](40) Loss=nan Rate=295.93 GlobalRate=312.05
And once it got the nan it killed the model. Looks like it got a nan loss in the first epoch and then the step kills the whole model (I’m familiar with this from making the CUDA version of Mish).
That was using:

import torch.nn as nn
import torch.nn.functional as F
from fastai.vision import models, create_cnn_model

class ResNet18(nn.Module):
    def __init__(self, nc:int=10):
        super().__init__()
        # fastai's cnn_learner body/head built on a non-pretrained resnet18
        self.resnet = create_cnn_model(base_arch=models.resnet18, pretrained=False, nc=nc)

    def forward(self, x):
        x = self.resnet(x)
        return F.log_softmax(x, dim=1)

as they had a softmax at the end of the model. So I’m not sure what happened there. I think the softmax should mean the output ranges are OK, and the shapes should both be (batch_size, n_classes). The only thing I can think of is that fastai might be using some layers that aren’t stable on TPU. No idea how you debug a TPU model, or whether you can even hook it. Though if it’s just because of the weird mixing then it’s not a big issue.

In terms of performance: Completed 20 epochs in 455.39s, 22.77s/epoch.
Notebook is here.

EDIT: Hmm, changing from from fastai.vision import * to just importing cnn_learner and models seemed to make it better. Maybe luck, but I’d tried a couple of times before and it went to nan in the second epoch (so after one step). It got through a few epochs this time, with the loss rapidly increasing until it overflowed to nan.

ParallelLoader’s __len__ should probably be len(self._loader), so not divided by the number of devices, while PerDeviceLoader’s would be len(self._loader)/len(self._loader._devices) (though probably exposed as self._loader.per_device_len or some such). The PerDeviceLoader should be what the Learner uses.
But yeah, there’s probably a good chance of getting them included upstream, unless there was some reason for leaving them out, which I can’t see.
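For reference, a minimal sketch of the monkey-patching being discussed, assuming the internal _loader/_devices attributes mentioned above (untested, and the integer division is a guess at the right rounding):

import torch_xla.distributed.parallel_loader as pl

def _parallel_len(self):
    # total batches across all devices: just the wrapped DataLoader's length
    return len(self._loader)

def _per_device_len(self):
    # batches this device will see: total divided by the number of devices
    return len(self._loader) // len(self._loader._devices)

def _per_device_dataset(self):
    # expose the underlying dataset so fastai code reading dl.dataset keeps working
    return self._loader._loader.dataset

pl.ParallelLoader.__len__ = _parallel_len
pl.PerDeviceLoader.__len__ = _per_device_len
pl.PerDeviceLoader.dataset = property(_per_device_dataset)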

Yeah, I think this should be OK generally. TensorFlow 2.0 has an eager mode mirroring the PyTorch way (though I’m not sure if this is used for TPU or just GPU). This isn’t actually as different from PyTorch as it appears, because while PyTorch is eager, the results still aren’t immediate on GPU: operations get started immediately but are basically just added to a queue (in the GPU drivers/runtime). So this is more an issue the torch_xla stuff has to deal with.

I’d think the main thing is potential performance issues, because it takes longer to get results from a TPU than from a GPU. So the pause from calling .cpu() on a GPU tensor may be smaller than on a TPU tensor, and the access patterns in fastai may be particularly non-optimal for TPU. In fact I have a bit of a hunch that the pretty big performance gap between fastai and plain PyTorch training is due to fastai accessing results too quickly. In particular, after calculating the batch loss fastai immediately accesses the item for smoothing and callbacks, making everything block until the forward pass is complete (and probably similar with gradients). I plan to have a look at this, though it’s hard to know what can be done (and it’s almost certainly at best a v2 thing, if not v3 or so).
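To illustrate the blocking point (this is just standard PyTorch asynchronous-execution behaviour, not anything fastai-specific):

import torch
import torch.nn.functional as F

device = torch.device('cuda')  # the same idea applies to an XLA device
model = torch.nn.Linear(10, 1).to(device)
x, y = torch.randn(64, 10, device=device), torch.randn(64, 1, device=device)

loss = F.mse_loss(model(x), y)  # queued on the device, returns immediately
smoothed = loss.item()          # .item() forces the host to wait for the result here
kept = loss.detach()            # keeping it as a tensor lets more work be queued first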

It turns out my current monkey-patched code will now work fine, because it takes the length from the sampler, which has now been assigned to the distributed sampler.

Also, I raised an issue

@TomB
Unfortunately, they did not want to fix this issue. I will keep the monkey-patched solution for now.

Interestingly, I have noticed that if I print out which XLA device is being used, I always get xla:0, so there’s probably still a bug even though I added a DistributedSampler.
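For comparison, this is roughly how the torch_xla multiprocessing samples set up the per-process device and sampler (names other than the torch_xla/PyTorch ones are placeholders), which might help narrow down where the device assignment goes wrong:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(rank, dataset):
    device = xm.xla_device()  # each spawned process should get its own core
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset,
        num_replicas=xm.xrt_world_size(),  # number of TPU cores
        rank=xm.get_ordinal(),             # this process's core index
        shuffle=True)
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, sampler=sampler)
    print(xm.get_ordinal(), device)        # per-process sanity check
    # ...build the learner and train on `device` here...

# xmp.spawn(_mp_fn, args=(dataset,), nprocs=8, start_method='fork')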