How to create a callback using torch.multiprocessing (TPU)

Unfortunately, this did not work. I added self.recorder.silent = True under on_train_begin but I still have the script printing <IPython.core.display.HTML object> from the progress bar.

Sorry just to clarify - this is fastai v2’s syntax.

Ah ok. Right now I am working on fastai v1, but when I try to implement TPU support for fastai v2, I will keep this in mind. Thanks!

Not sure if you already discovered this.
Any reason why you didn’t follow the standard example practice of putting all the initialisation code in functions? This isn’t just a stylistic choice, it alters the way the spawn works. If you have code outside of a function then those in-memory variables are inherited in the spawned (via OS fork) process, as it gets a copy of the parent process’s memory. This can cause issues, as things like semaphores can’t be copied to the child process (you can also deadlock the child process: if a Lock is held then it is weirdly partially inherited as being locked in the child, but it will not unlock when the parent does). If instead you put it in a function, then when the child process is created it imports the module you spawned from and runs it, so everything actually gets created from scratch in the child process.

So this might help with some of your semaphore issues.

Which initialization code?

I think when writing the code, I was going for isolating the fit function and sending just the fit function to TPU.

The code that actually creates the learner. You create it outside a function. I’m suggesting instead:

import torch.multiprocessing as mp
from fastai.vision import *   # untar_data, URLs, ImageDataBunch, cnn_learner, models, accuracy

def train_loop(index):   # spawn calls this with the process index as the first argument
  path = untar_data(URLs.MNIST_SAMPLE)
  data = ImageDataBunch.from_folder(path)
  learn = cnn_learner(data, models.resnet50, metrics=accuracy).to_tpu_distributed()
  learn.fit(1)

if __name__ == '__main__':
  mp.spawn(train_loop, nprocs=8)   # one process per TPU core

So not having the learner copied.

You’ll note the samples create the dataloaders etc. inside a function which isn’t called from the main process; the main process only calls mp.spawn.

But now you create a separate dataset and learner for each core. That doesn’t make any sense to me. Is there something I am missing?

Yes, you’re supposed to create a completely separate model, dataloader and optimiser (so a Learner) for each core. This applies to both the multiprocess and threaded approaches; it’s inherent to how the TPU operates, not an implementation detail of multiprocessing.
I think a key bit you’re missing is the DistributedSampler. That’s what ensures each core runs on a different selection of indices from the dataloader: you give it the number of cores and the ‘rank’ (i.e. index) of the core (so presumably each core basically does only those samples where index % num_cores == rank). As for the optimiser, xm.optimizer_step is what aggregates the gradients from the optimiser of each core to perform a unified update. For the model I think there’s some TPU magic there which automatically keeps the models in sync between the various cores.
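
Roughly, the sampler wiring looks like this (a sketch, not the actual example code; I’m assuming torch_xla’s xm.xrt_world_size() and xm.get_ordinal() for the number of cores and this core’s rank):

import torch
import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size):
  sampler = DistributedSampler(
    dataset,
    num_replicas=xm.xrt_world_size(),   # total number of cores
    rank=xm.get_ordinal())              # this core's index
  return DataLoader(dataset, batch_size=batch_size, sampler=sampler)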

Note also that with multiprocessing, even if you passed the learner to each spawned process it wouldn’t actually be the same learner: they each get a different copy. This is torch.multiprocessing, which is just a wrapper around Python’s multiprocessing with tensor handling, so it isn’t TPU/PyTorch specific. That’s just how it works. If you do:

import torch.multiprocessing as mp

a = [1, 2, 3]

def mp_fn(my_mp_id, a):   # spawn passes each process its index (my_mp_id) as the first argument
  a.append(my_mp_id)      # only modifies this process's copy of a

if __name__ == '__main__':
  mp.spawn(mp_fn, args=(a,), nprocs=2)

Where my_mp_id is assumed to be created by spawn (the xm.get_device() in TPU does this), the spawned processes will each end up with a different a. They each get a copy of the initial value, but then they diverge. This also applies if you didn’t pass a to mp_fn and just used the existing value; it’s not specific to spawn arguments.

Indeed I did miss the DistributedSampler. I missed it because I was looking at the API guide, where it isn’t mentioned. The actual example code seems to be a better guide to what needs to be done.

Here, the model and the dataset are initialized in each process too.

I looked more into how PyTorch and fastai do distributed training with GPUs. I see here we also need to initialize the model and the dataset in each process. However, I realized there are two important components (correct me if I am wrong):

  1. DistributedSampler to send the right data to the right device
  2. DistributedDataParallel to replicate the model on each of the GPUs and keep parameters, gradients, and updates in sync (roughly as in the sketch below).
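
For reference, my rough understanding of the standard GPU setup is something like this (a simplified sketch with placeholder names, and assuming the usual MASTER_ADDR/MASTER_PORT environment setup for init_process_group):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def gpu_train(rank, world_size, dataset, model):
  dist.init_process_group('nccl', rank=rank, world_size=world_size)
  torch.cuda.set_device(rank)
  sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)  # 1. split the data
  loader = DataLoader(dataset, batch_size=64, sampler=sampler)
  model = DDP(model.to(rank), device_ids=[rank])                             # 2. sync the model
  # ...then the usual loop: forward, loss.backward(), optimizer.step()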

So for XLA we still do 1, but what’s doing 2? The optimizer stepping function?

Also, I am still a complete beginner at multiprocessing and parallelized training, but I find it odd that we don’t explicitly specify that the model is supposed to be the same between processes in terms of parameter updates. Without the XLA-specific code, we would just be spawning processes with different Learners that have no communication with each other. How do the XLA functions know that the models are the same?

Did you see DistributedDataParallel used with TPU? It’s not used in the repo examples. It might be used for distributed TPU training. In the GCP examples they set up multiple machines to feed different cores (or I think sets of cores, with the multi-process loading used on each machine). It looks like DistributedDataParallel is used for keeping the data and model synced across machines (and is not TPU specific), while within a machine you use either ParallelLoader for multi-process or DataParallel for multiple threads (both TPU specific).

Yeah, I was a bit confused by how you don’t seem to actually explicitly parallelise the model (compared to DistributedDataParallel, which returns a parallelised model). Though actually I guess that kind of makes sense: you don’t really need to parallelise it. You init the model before moving to the device, so model.to() can put it on each device. Then the only thing that alters it is the optimiser, which is handled in the XLA optimizer_step. Or, looked at another way, the model.to(an_xla_device) is what parallelises it and optimizer_step keeps it in sync.
In the threading DataParallel there are actually multiple copies of the model, because being in-memory you have to create multiple copies; in multi-process they are isolated in the different process address spaces.
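
So per core it ends up looking roughly like this (a sketch from memory, not the exact xla example code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def train_one_core(model, loader, optimizer, loss_fn):
  device = xm.xla_device()
  model = model.to(device)                           # this core's copy of the model
  para_loader = pl.ParallelLoader(loader, [device])
  for x, y in para_loader.per_device_loader(device):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer)                     # all-reduces the gradients across cores, then steps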

I think there might also be some TPU magic there too, because you don’t really want to keep doing transfers between host and TPU, so parts of that syncing may well be on the actual TPU.

I brought up DistributedDataParallel because I thought it was the GPU equivalent of multi-core TPU training, and I wanted to understand how that works to hopefully get an idea of how TPU training would work. Indeed, ParallelLoader and DistributedSampler are the way to parallelize the dataset, with ParallelLoader being TPU-specific. Also, xm.optimizer_step is keeping the gradients in sync as you mentioned. This is confirmed over here. I guess xm.optimizer_step sees that it’s the same model being updated across the cores and is able to figure out how to sync them up.

I will probably raise an issue on the TPU GitHub page asking any additional questions about how exactly the parallelizing of the model training occurs on the TPU.

I will add DistributedSampler and fix the training loop now.

However, I am still not happy with the current interface that I have. I still wish we could do learn = learn.to_tpu_distributed() and all the training happens on the TPU. This is what we do with distributed training and fp16 training already in fastai. If you have any ideas to be able to do this, please let me know.

Another thing I wanted to note is that PyTorch XLA under the hood works like TensorFlow. It actually does the operations lazily, constructing a static graph and executing it when the results are needed, which is just like how TensorFlow does it. I assume this is because the underlying C++ code for TPUs was based on the TensorFlow workflow?
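
For instance, my understanding is that something like this just gets traced rather than run eagerly, and only actually executes when the graph is cut or a result is needed (assuming xm.mark_step is the cut point):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(3, 3, device=device)
y = x @ x        # recorded in the graph, not executed yet
xm.mark_step()   # cuts the graph and sends it to the TPU to run
print(y)         # materializing the result also forces execution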

I point this out because I was wondering if this could affect use of TPUs with fastai in any way. I cannot think of anything right now, but please let me know if you find something that could be affected by this.

@sgugger do you know why the library has its own distributed sampler? It seems very similar to PyTorch’s own sampler.

I think it wasn’t the same in each version (the docstring led me to think they didn’t always provide the option to not shuffle).

Thanks! You are right, making shuffling optional was only added in July after the distributed training callback was added in May.

I will use PyTorch’s distributed sampler then as they are equivalent.
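
i.e. something like this (with a stand-in dataset, and the rank and world size hard-coded as placeholders):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

ds = TensorDataset(torch.arange(100.))                                   # stand-in dataset
sampler = DistributedSampler(ds, num_replicas=8, rank=0, shuffle=False)  # shuffle is now optional
print(list(sampler))                                                     # this rank's subset of indices, in order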

@sgugger I also wanted to ask about the progress bar. Is the silencing somehow controlled by this file? I am asking because it seems like it didn’t work (as I mentioned earlier) and I want to figure out how to get rid of it.

Is it also possible to get a command-line tqdm-like progress bar rather than the tqdm_notebook-like progress bar that fastai typically uses? I understand fastai has its own fastprogress library, so how can this be controlled in fastprogress?

Also, if you have any ideas for improving the current TPU interface that I am working on, such that a callback is all that is necessary, please let me know!

It does this automatically. How are you running it, through %run in colab?

You can find the check fastprogress uses here. It has a specific check for colab, by trying to import google.colab, so even when you’re running a script that import is going to be on your Python import path and it will run in notebook mode.
This should generally be fine: if you %run a script it will find the notebook shell name (or the colab import) and will use notebook mode. When doing %run I get a notebook-style progress bar, while running the same script from the console gives a console one.

So I guess either colab can’t handle the notebook output with %run, unlike Jupyter, or it’s related to the specifics of the script and the fact the progress bars are presumably in the spawned processes.
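
If you did want to force the console-style bar, I think fastprogress has a force_console_behavior helper you can patch over the bars fastai v1 imports. From memory it’s something like this (untested, and you may need to patch other fastai modules that import the bars too):

import fastprogress
import fastai.basic_train, fastai.basic_data

# replace the notebook bars fastai v1 imported with the console versions
master_bar, progress_bar = fastprogress.force_console_behavior()
fastai.basic_train.master_bar, fastai.basic_train.progress_bar = master_bar, progress_bar
fastai.basic_data.master_bar, fastai.basic_data.progress_bar = master_bar, progress_bar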

However, I don’t think just getting it to show a console progress bar will work. Then you’ll just have 8 different console progress bars in the 8 different processes (or however many cores you run on). That’s not likely to be great. The progress likely needs to be reported from the child processes to the main process for unified display there.
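
One thing the xla examples do to sidestep that is to only report from one process, e.g. a sketch like:

import torch_xla.core.xla_model as xm

def report(msg):
  if xm.is_master_ordinal():   # True only in the process driving the 'master' core
    print(msg)                 # so you get one line instead of one per process

# xm.master_print(msg) does this check for you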

Thanks for correcting me. I had been using !python instead. With %run the progress bar isn’t showing at all. I don’t know if this is a bug or something to do with my TPU code.

Actually, I am not sure the processes are being spawned correctly, because it should print hello for each process but it didn’t print anything… However, it still ends with the same process terminated error.

Ah OK, running !python will actually launch another Python process, while %run will re-use the same kernel as the notebook. If I use !python in a notebook then I get console output rather than the notebook output I get with %run, because now the check doesn’t find an IPython shell. But the google colab check will still think it’s a notebook. Then you have notebook output being sent to the shell, and passed through to the notebook kernel. This may be garbling the tqdm output somehow, or the colab kernel doesn’t try to interpret it in the same way.
It could also be an interaction with the sub-process stuff; that’s going to involve either torch.multiprocessing or Python’s multiprocessing echoing the subprocess stdout to the main process stdout.

Yeah, I am still wondering though why it’s not printing out the hello statement I keep in the function that I send for spawning. I think somehow with %run the output from the processes is “lost”? And also, if it is using the IPython kernel, I am unsure whether the processes are even being spawned properly. This might be a dumb question, but I am wondering if using the IPython kernel restricts it to one process? But it still finishes with the same error from multiprocessing, so I am unsure.