Unfortunately, this did not work. I added self.recorder.silent = True under on_train_begin, but I still have the script printing <IPython.core.display.HTML object> from the progress bar.
Sorry, just to clarify - this is fastai v2's syntax.
Ah ok. Right now I am working on fastai v1, but when I try to implement TPU support for fastai v2, I will keep this in mind. Thanks!
Not sure if you already discovered this.
Any reason why you didn't follow the standard example practice of putting all the initialisation code in functions? This isn't just a stylistic choice, it alters the way the spawn works. If you have code outside of a function then those in-memory variables are inherited in the spawned (via OS fork) process, as it gets a copy of the parent process's memory. This can cause issues, as things like semaphores can't be copied to the child process (you can also deadlock the child process: if a Lock is held then it is weirdly partially inherited as being locked in the child, but will not unlock when the parent does). If instead you put it in a function, then when the child process is created it imports the module you spawned from and runs it, so everything is actually created from scratch in the child process.
So this might help with some of your semaphore issues.
Which initialization code?
I think when writing the code, I was going for isolating the fit function and sending just the fit function to TPU.
The code that actually creates the learner. You create it outside a function. I'm suggesting instead:
def train_loop(index):  # spawn passes the process index as the first argument
    path = untar_data(URLs.MNIST_SAMPLE)
    data = ImageDataBunch.from_folder(path)
    learn = cnn_learner(data, models.resnet50, metrics=accuracy).to_tpu_distributed()
    learn.fit(1)

if __name__ == '__main__':
    mp.spawn(train_loop)
So not having the learner copied.
You'll note the samples create dataloaders etc. inside a function, which isn't called from the main process; the main process only calls mp.spawn.
But now you create a separate dataset and learner for each core. That doesn't make sense to me. Is there something I am missing?
Yes, you're supposed to create a completely separate model, dataloader and optimiser (so a Learner) for each core. This applies to both multiprocess and threaded; it's inherent to how the TPU operates, not an implementation detail of multiprocessing.
I think a key bit you're missing is the DistributedSampler. That's what makes each core run on a different selection of indexes from the dataloader: you give it the number of cores and the 'rank' (i.e. index) of the core, so presumably each core basically does only those samples where index % num_cores == rank. As for the optimiser, xm.optimizer_step is what aggregates the gradients from the optimiser of each core to perform a unified update. For the model I think there's some TPU magic which automatically keeps the models in sync between the various cores.
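For what it's worth, the per-core training function I have in mind looks roughly like the sketch below. It's just an illustration, assuming the torch_xla APIs (xm.xla_device, xm.get_ordinal, xm.xrt_world_size, xm.optimizer_step, xmp.spawn) and with hypothetical make_dataset()/make_model() helpers standing in for your own setup code:

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_fn(index):
    device = xm.xla_device()                      # this core's XLA device
    dataset = make_dataset()                      # hypothetical helper
    sampler = DistributedSampler(
        dataset,
        num_replicas=xm.xrt_world_size(),         # total number of cores
        rank=xm.get_ordinal())                    # this core's rank
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    model = make_model().to(device)               # hypothetical helper
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        loss = F.cross_entropy(model(xb), yb)
        loss.backward()
        # barrier=True because this sketch doesn't use ParallelLoader;
        # the step is what all-reduces gradients across cores
        xm.optimizer_step(optimizer, barrier=True)
        optimizer.zero_grad()

if __name__ == '__main__':
    xmp.spawn(train_fn, nprocs=8)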
Note also that with multiprocessing, even if you passed the learner to each spawned process it wouldn't actually be the same learner. They each get a different copy. This being torch.multiprocessing, which is just a wrapper around Python's multiprocessing with tensor handling, so not TPU/PyTorch specific. That's just how it works. If you do:
a = [1, 2, 3]

def mp_fn(my_mp_id, a):
    # my_mp_id is the process index that spawn passes in
    a.append(my_mp_id)  # modifies this process's own copy of a

mp.spawn(mp_fn, args=(a,))
Where my_mp_id is assumed to be created by spawn (the xm.get_device() in TPU does this), the spawned processes will each end up with a different a. They'll get a copy of the initial value but then it diverges. This also applies if you didn't pass a to mp_fn but just used the existing value; it's not specific to spawn arguments.
Indeed I did miss the DistributedSampler. I missed it since I was looking in the API guide but it's not there. The actual example code seems to be a better guide as to what needs to be done.
Here, the model and the dataset are initialized in each process too.
I looked more into how PyTorch and fastai do distributed training with GPUs. I see here we also need to initialize the model and the dataset in each process. However, I realized there are two important components (correct me if I am wrong):
- DistributedSampler to send the right data to the right device
- DistributedDataParallel to replicate the model on each of the GPUs and keep parameters, gradients, and updates in sync.
So for XLA we still do 1, but what's doing 2? The optimizer stepping function?
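For reference, my mental model of the standard GPU recipe is roughly the sketch below (just an illustration, not fastai's actual callback code; it assumes MASTER_ADDR/MASTER_PORT are already set for init_process_group and uses hypothetical make_dataset()/make_model() helpers):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(rank, world_size):
    # assumes MASTER_ADDR/MASTER_PORT are set in the environment
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    dataset = make_dataset()                                                   # hypothetical helper
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)  # component 1
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    model = DDP(make_model().cuda(rank), device_ids=[rank])                    # component 2
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for xb, yb in loader:
        loss = F.cross_entropy(model(xb.cuda(rank)), yb.cuda(rank))
        loss.backward()          # DDP all-reduces the gradients during backward
        optimizer.step()
        optimizer.zero_grad()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)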
Also, I am still a complete beginner at multiprocessing and parallelized training, but I find it odd that we don't explicitly specify that a model is supposed to be the same between processes in terms of parameter updates. Without the XLA-specific code, we would just be spawning processes with different Learners that have no communication with each other. How do the XLA functions know that the models are the same?
Did you see DistributedDataParallel used with TPU? It's not used in the repo examples. It might be used for distributed TPU training: in the GCP examples they set up multiple machines to feed different cores (or I think sets of cores, with multi-process loading used on each machine). It looks like DistributedDataParallel is used for keeping the data and model synced across machines (and is not TPU specific), while within a machine you use either ParallelLoader for multi-process or DataParallel for multiple threads (both TPU specific).
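Within a single process, the ParallelLoader usage is roughly this. Just a sketch, with device, loader, model and optimizer created as in the multi-core sketch earlier:

import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
para_loader = pl.ParallelLoader(loader, [device])       # wraps an ordinary DataLoader
for xb, yb in para_loader.per_device_loader(device):    # batches arrive already on this core
    loss = F.cross_entropy(model(xb), yb)
    loss.backward()
    xm.optimizer_step(optimizer)                        # no barrier needed with ParallelLoader
    optimizer.zero_grad()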
Yeah, I was a bit confused by how you don't seem to actually explicitly parallelise the model (compared to DistributedDataParallel, where it returns a parallelised model). Though actually I guess that kind of makes sense. You don't really need to parallelise it. You init the model before moving it to the device, so model.to() can put it on each device. Then the only thing that alters it is the optimiser, which is handled in the XLA optimizer_step. Or, looked at another way, the model.to(an_xla_device) is what parallelises it and optimizer_step keeps it in sync.
In the threading DataParallel there are actually multiple copies of the model, because being in the same memory space you have to create multiple copies; in multi-process they are isolated in the different process address spaces.
I think there might also be some TPU magic there too, because you don't really want to keep doing transfers between host and TPU, so parts of that syncing may well be on the actual TPU.
I brought up DistributedDataParallel because I thought it was the GPU equivalent of multi-TPU-core training. I wanted to understand how that works to hopefully get an idea of how TPU training would work. Indeed, ParallelLoader and DistributedSampler are the way to parallelize the dataset, with ParallelLoader being TPU-specific. Also, xm.optimizer_step is keeping the gradients in sync, as you mentioned. This is confirmed over here. I guess xm.optimizer_step sees that it's the same model across the cores that is being updated and is able to figure out how to sync them up.
I will probably raise an issue on the TPU GitHub page asking any additional questions regarding how exactly the parallelization of the model training occurs on the TPU.
I will add DistributedSampler and fix the training loop now.
However, I am still not happy with the current interface that I have. I still wish we could do learn = learn.to_tpu_distributed() and have all the training happen on the TPU. This is what we already do with distributed training and fp16 training in fastai. If you have any ideas on how to make this possible, please let me know.
Another thing I wanted to note is that PyTorch XLA under the hood works like TensorFlow: it performs the operations lazily, constructing a graph and executing it only when the results are needed. I assume this is because the underlying C++ code for TPUs was based on the TensorFlow workflow?
I point this out because I was wondering if this could affect use of TPUs with fastai in any way. I cannot think of anything right now, but please let me know if you find something that could be affected by this.
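A tiny illustration of what I mean by the lazy behaviour (a sketch; xm.mark_step() is, as far as I understand, the torch_xla call that forces the pending graph to be compiled and executed):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.ones(3, 3, device=device)
b = a + 2           # recorded in the pending XLA graph, not executed yet
c = b.matmul(a)     # still just building the graph
xm.mark_step()      # compiles and runs the accumulated graph on the device
print(c)            # fetching the values would also force execution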
@sgugger do you know why the library has its own distributed sampler? It seems very similar to PyTorch's own sampler.
I think it wasn't the same in each version (the docstring led me to think they didn't always provide the option to not shuffle).
Thanks! You are right, making shuffling optional was only added in July after the distributed training callback was added in May.
I will use PyTorch's distributed sampler then, as they are equivalent.
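For the record, the PyTorch sampler now exposes the shuffle flag directly, which was the missing option discussed above. A minimal sketch, where dataset stands in for whatever dataset object gets wrapped:

from torch.utils.data.distributed import DistributedSampler

# shuffle can now be turned off directly on PyTorch's sampler
sampler = DistributedSampler(dataset, num_replicas=8, rank=0, shuffle=False)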
@sgugger I also wanted to ask about the progress bar. Is the silencing somehow controlled by this file? I am asking because it seems like it didn't work (as I mentioned earlier) and I want to figure out how to get rid of it.
Is it also possible to get a command-line tqdm-like progress bar rather than the tqdm_notebook-like progress bar that fastai typically uses? I understand fastai has its own fastprogress library, so how can this be controlled in fastprogress?
Also, if you have any ideas for improving the current TPU interface that I am working on such that a callback is all that is necessary, please let me know!
It does this automatically. How are you running it, through %run in colab?
You can find the check fastprogress uses here. It has a specific check for colab by trying to import google.colab, so even if you are running a script that module is still on your Python import path and it will run in notebook mode.
This should generally be fine: if you %run a script it will find the notebook shell name (or the colab import) and will use notebook mode. When doing %run I get a notebook-style progress bar, while running the same script from the console gives a console one.
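If you ever do want to force the console-style bars, fastprogress has a force_console_behavior helper, if I remember correctly (worth checking against the version you have installed; you may also need to re-point wherever fastai imported these names):

import fastprogress
from fastprogress import force_console_behavior

# Swap the notebook bars for the console ones globally (assuming this helper exists in your version)
master_bar, progress_bar = force_console_behavior()
fastprogress.fastprogress.master_bar = master_bar
fastprogress.fastprogress.progress_bar = progress_bar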
So I guess either colab can't handle the notebook output with %run, unlike Jupyter, or it's related to the specifics of the script and the fact that the progress bars are presumably in the spawned processes.
However, I don't think just getting it to show a console progress bar will work. Then you'll just have 8 different console progress bars in the 8 different processes (or however many cores you run on). That's not likely to be great. The progress likely needs to be reported from the child processes to the main process for unified display there.
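Something along the lines of this rough sketch of the idea (plain Python multiprocessing, nothing TPU-specific; the worker and batch counts are made up):

import multiprocessing as mp

def worker(rank, queue, n_batches=100):
    for i in range(n_batches):
        # ... train on one batch ...
        queue.put((rank, i + 1))        # report progress to the parent
    queue.put((rank, None))             # signal that this worker is done

if __name__ == '__main__':
    n_workers = 8
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, queue)) for r in range(n_workers)]
    for p in procs:
        p.start()
    done = 0
    while done < n_workers:
        rank, step = queue.get()
        if step is None:
            done += 1
        else:
            print(f'worker {rank}: batch {step}', end='\r')   # one unified display in the parent
    for p in procs:
        p.join()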
Thanks for correcting me. I had been using !python instead. With %run the progress bar isn't showing at all. I don't know if this is a bug or something to do with my TPU code.
Actually, I am not sure if the processes are spawned correctly, because it should print hello for each process but it didn't print anything… However, it ends with the same process-terminated error.
Ah OK, running !python will actually launch another python process, while %run will re-use the same kernel as the notebook. If I use !python in a notebook then I get console output rather than the notebook output I get with %run, because now the check doesn't find an IPython shell. But the google colab check will still think it's a notebook. Then you have notebook output being sent to the shell, then through to the notebook kernel. This may be garbling the tqdm output somehow, or the colab kernel doesn't try to interpret it in the same way.
It could also be an interaction with the sub-process stuff; that's going to involve either torch.multiprocessing or Python's multiprocessing echoing the subprocess stdout to the main process stdout.
Yeah, I am still wondering though why it's not printing out the hello statement I keep in the function that I send for spawning. I think somehow with %run the output from the processes is 'lost'? And also, if it is using the IPython kernel, I am unsure if the processes are even being spawned properly. This might be a dumb question, but I am wondering if using the IPython kernel restricts it to one process? But it is still finishing with the same error from multiprocessing, so I am unsure.