ULMFiT - Distributed training

How does ULMFiT maintain hidden state across consecutive bptt sequence chunks when multiple GPUs are used?

This is what I assume:

  • Training data is concatenated into one long token stream and tokenized
  • The stream is divided into n_gpus contiguous subsets, and each subset is sent to a separate GPU
  • Within each GPU, its subset is further divided by batch size and sliced into consecutive bptt-length chunks (so hidden states are maintained separately on each GPU)
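The scheme above can be sketched in plain Python. This is only my reading of the assumed pipeline, not fastai's actual implementation: the token stream is sharded across GPUs, each shard is reshaped into batch_size parallel streams, and each stream is cut into consecutive bptt chunks, so chunk i+1 starts exactly where chunk i ended and the hidden state only ever needs to carry over within one GPU's shard.

```python
def shard_and_batch(tokens, n_gpus, batch_size, bptt):
    """Hypothetical helper: split a token stream into per-GPU lists of
    (batch_size x bptt) chunks, mirroring the assumed pipeline above."""
    # Step 1: one contiguous shard per GPU (remainder tokens are dropped).
    shard_len = len(tokens) // n_gpus
    shards = [tokens[i * shard_len:(i + 1) * shard_len] for i in range(n_gpus)]

    per_gpu = []
    for shard in shards:
        # Step 2: reshape the shard into batch_size parallel streams.
        stream_len = len(shard) // batch_size
        streams = [shard[b * stream_len:(b + 1) * stream_len]
                   for b in range(batch_size)]
        # Step 3: slice every stream into consecutive bptt-length chunks;
        # chunk boundaries are contiguous, which is what lets each GPU
        # carry its RNN hidden state from one chunk to the next.
        chunks = [[s[t:t + bptt] for s in streams]
                  for t in range(0, stream_len - bptt + 1, bptt)]
        per_gpu.append(chunks)
    return per_gpu

tokens = list(range(64))
batches = shard_and_batch(tokens, n_gpus=2, batch_size=2, bptt=4)
# GPU 0 sees tokens 0..31 and GPU 1 sees tokens 32..63; within a GPU,
# row b of chunk i+1 continues row b of chunk i, so hidden state is
# meaningful only within that GPU and is never shared across GPUs.
```

Under this reading, no hidden state ever needs to cross GPU boundaries, because each GPU's chunks are continuations of its own shard only; the GPUs would synchronize gradients, not hidden states.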