How does ULMFiT maintain hidden state across consecutive BPTT sequence chunks when multiple GPUs are used?
This is what I assume:
- The training data is tokenized and concatenated into one long token stream
- The stream is split into `n_gpus` contiguous subsets, and each subset is sent to a separate GPU
- Within each GPU, the subset is further divided by batch size and sliced into consecutive BPTT-length chunks (so hidden states are maintained independently on each GPU, with no state shared between GPUs)
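To make my assumption concrete, here is a small sketch (my own illustration, not fastai's actual code) of the splitting scheme I have in mind. The function name `shard_and_batch` and all parameters are hypothetical; the point is that chunk `i+1` starts exactly where chunk `i` ended in every parallel stream, so the hidden state carried across chunks is consistent within a GPU:

```python
import numpy as np

def shard_and_batch(tokens, n_gpus, batch_size, bptt):
    """Hypothetical sketch of multi-GPU BPTT batching for a language model.

    Each GPU gets one contiguous shard of the concatenated token stream.
    Within a shard, tokens are reshaped into `batch_size` parallel streams,
    then sliced into consecutive bptt-length chunks. Because each chunk
    continues where the previous one stopped, hidden state can be carried
    across chunks *within* a GPU; GPUs never share state.
    """
    tokens = np.asarray(tokens)
    per_gpu = len(tokens) // n_gpus
    shards = []
    for g in range(n_gpus):
        shard = tokens[g * per_gpu:(g + 1) * per_gpu]
        # drop the tail so the shard reshapes evenly into batch_size rows
        n = (len(shard) // batch_size) * batch_size
        streams = shard[:n].reshape(batch_size, -1)   # (batch, seq_len)
        chunks = [streams[:, i:i + bptt]              # consecutive bptt slices
                  for i in range(0, streams.shape[1], bptt)]
        shards.append(chunks)
    return shards  # shards[gpu][chunk] -> array of shape (batch, <=bptt)

shards = shard_and_batch(range(1000), n_gpus=2, batch_size=4, bptt=10)
```

With `tokens = range(1000)`, `shards[0][1][0, 0]` is `shards[0][0][0, -1] + 1`, i.e. the second chunk picks up exactly where the first left off in stream 0 of GPU 0, which is the continuity the RNN's hidden state relies on.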