As per the distributed training tutorial, it seems that each process loads the pre-processed data separately and then sends it to its GPU for training. Is it possible to use only a single copy of the data, split into random subsets by the different processes, so that CPU memory utilization is reduced?
Are you sure? Have you collected stats showing that host memory usage is indeed multiplied, or is that a surmise? The source code of `ImageDataBunch.normalize()` seems to suggest otherwise.
AFAIK, data loader instantiation doesn't load the data into memory at that point; each data item is accessed lazily, on demand.
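To illustrate the lazy-access point, here's a toy map-style dataset (not fastai's actual implementation) where constructing the dataset allocates almost nothing and the real read only happens in `__getitem__`:

```python
# Toy sketch: a map-style dataset defers all I/O to __getitem__, so
# instantiating it (or a loader over it) materializes no items.
class LazyDataset:
    def __init__(self, paths):
        self.paths = list(paths)   # only lightweight references stored
        self.loads = 0             # count how many items were materialized

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        self.loads += 1            # the actual read happens here, on demand
        return f"tensor-from:{self.paths[i]}"

ds = LazyDataset([f"img_{i}.jpg" for i in range(1000)])
print(ds.loads)  # 0 -- nothing loaded at instantiation
_ = ds[42]
print(ds.loads)  # 1 -- one item loaded lazily
```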
to_distributed() creates a new data loader which makes sure each process will only fetch non-overlapping slices of the original data.
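The non-overlapping split works roughly like PyTorch's `DistributedSampler` (which is what the fastai wrapper relies on): every rank walks the same index space but keeps only its own stride. A minimal sketch of the idea, not the library code:

```python
# Round-robin index split: rank r of world_size w takes indices r, r+w, r+2w, ...
def rank_indices(n_items, rank, world_size):
    return list(range(rank, n_items, world_size))

world_size = 4
shards = [rank_indices(10, r, world_size) for r in range(world_size)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
# every index appears exactly once across all ranks -- no overlap, no gap
assert sorted(i for s in shards for i in s) == list(range(10))
```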
The variable name `data` in the tutorial may be slightly misleading. It's not a copy; it's a "data loading object".
Yes, memory utilization does seem to multiply, and I am using it for ULMFiT. I have a separate script to preprocess and save the data, and a distributed training script that loads this data and trains the model. (For image processing, I believe transformations are done on the fly.)
Ah, glad you clarified it's a language modeling task and not image processing.
I’m not familiar with ULMFiT’s pipeline to advise if preprocessing can be done on the fly there.
I wanted the data to be preprocessed because, once that is done, the same data can be used to train models with different hyperparameters. (Preprocessing takes a while, so I use a CPU-heavy environment for it, and then GPU instances for the actual training.)
After digging around PyTorch's distributed launch utility, I now understand that it uses `subprocess.Popen()`, hence memory sharing may not be possible. Instead, `torch.multiprocessing` is required, which is built on top of Python's `multiprocessing` module (and hence something like `multiprocessing.shared_memory` is feasible).
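For reference, `multiprocessing.shared_memory` (Python 3.8+) lets any process attach to a named block without copying it. A minimal sketch (here a second handle in the same process stands in for a worker attaching by name):

```python
from multiprocessing import shared_memory

payload = b"preprocessed-data"
# create a named block once and fill it -- this is the single host copy
shm_a = shared_memory.SharedMemory(create=True, size=len(payload))
shm_a.buf[:len(payload)] = payload

# any other process can attach by name; no data is duplicated
shm_b = shared_memory.SharedMemory(name=shm_a.name)
view = bytes(shm_b.buf[:len(payload)])
print(view)  # b'preprocessed-data'

shm_b.close()
shm_a.close()
shm_a.unlink()  # free the block once everyone is done
```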
Of course, `to_parallel` can be used, but I did not find any docs saying this is officially supported, and when I tried it, I got this warning:
UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greately increasing memory usage. To compact weights again call flatten_parameters().
The PyTorch docs also don't recommend this approach, due to the GIL.
Perhaps then you'll need to create your own dataloader(s) that read from a preprocessed source in shared memory. Then you could conceivably reuse that in distributed training, or in plain multi-GPU training…
- In the main process, load the data into memory, and convert it to a `shared_memory` object.
- Spawn new processes and pass that object as an argument through `mp.Process(target=entry_func, args=(..., shared_mem_object, ...))`. Note, not
- Then in `entry_func`, use your custom dataloader to pick up `shared_mem_object`, create a learner from it, then initialize the distributed training group, and train…
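The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `entry_func`, the pickled list standing in for tokenized text, and the per-rank slicing are all hypothetical stand-ins for a real dataloader plus `init_process_group()` and the training loop:

```python
import multiprocessing as mp
import pickle
from multiprocessing import shared_memory

def entry_func(rank, world_size, shm_name, q):
    # each worker attaches to the one shared copy -- no per-process duplicate
    shm = shared_memory.SharedMemory(name=shm_name)
    data = pickle.loads(bytes(shm.buf))
    # a real version would build a dataloader over this rank's slice,
    # call init_process_group(), create a learner, and train;
    # here we just report which shard this rank would see
    q.put((rank, data[rank::world_size]))
    shm.close()

def main():
    data = list(range(8))                  # stands in for preprocessed data
    blob = pickle.dumps(data)
    shm = shared_memory.SharedMemory(create=True, size=len(blob))
    shm.buf[:len(blob)] = blob             # step 1: one shared copy

    world_size = 2
    q = mp.Queue()
    procs = [mp.Process(target=entry_func, args=(r, world_size, shm.name, q))
             for r in range(world_size)]   # step 2: spawn workers
    for p in procs:
        p.start()
    shards = dict(q.get() for _ in procs)  # step 3: workers consume shards
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
    return shards

if __name__ == "__main__":
    print(main())
```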
Yes, it's somewhat involved. But a dataloader that can take a `shared_memory` object as its source is nonetheless an interesting idea.
Coincidentally, I'm working on an `mp.Process()`-based distributed training launcher which uses an alternative multiprocessing library (with a compatible API) called `multiprocess` instead of the standard `torch.multiprocessing`, because the latter cannot handle lambdas and doesn't work in interactive IPython or Jupyter notebooks. If you're interested, I can point you to a prototype notebook I'm doodling/experimenting with.
Shared memory is also one of the tricks used to significantly boost training speed in reinforcement learning, as a group at USC successfully demonstrated and announced this week: