How does ULMFiT maintain hidden state across consecutive BPTT sequence chunks when multiple GPUs are used?
This is what I assume:
- The training data is tokenized and concatenated into one long token stream
- The stream is split into `n_gpus` contiguous subsets, and each subset is sent to a separate GPU
- Within each GPU, the subset is further divided by batch size and sliced into consecutive BPTT-length chunks (so hidden states are maintained independently on each GPU, with no state shared between GPUs)
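To make my assumption concrete, here is a small sketch (my own illustration, not fastai's actual code) of the splitting scheme I have in mind. The function name `shard_and_batch` and all parameters are hypothetical; the point is that chunk `i+1` starts exactly where chunk `i` ended in every parallel stream, so the hidden state carried across chunks is consistent within a GPU:

```python
import numpy as np

def shard_and_batch(tokens, n_gpus, batch_size, bptt):
    """Hypothetical sketch of multi-GPU BPTT batching for a language model.

    Each GPU gets one contiguous shard of the concatenated token stream.
    Within a shard, tokens are reshaped into `batch_size` parallel streams,
    then sliced into consecutive bptt-length chunks. Because each chunk
    continues where the previous one stopped, hidden state can be carried
    across chunks *within* a GPU; GPUs never share state.
    """
    tokens = np.asarray(tokens)
    per_gpu = len(tokens) // n_gpus
    shards = []
    for g in range(n_gpus):
        shard = tokens[g * per_gpu:(g + 1) * per_gpu]
        # drop the tail so the shard reshapes evenly into batch_size rows
        n = (len(shard) // batch_size) * batch_size
        streams = shard[:n].reshape(batch_size, -1)   # (batch, seq_len)
        chunks = [streams[:, i:i + bptt]              # consecutive bptt slices
                  for i in range(0, streams.shape[1], bptt)]
        shards.append(chunks)
    return shards  # shards[gpu][chunk] -> array of shape (batch, <=bptt)

shards = shard_and_batch(range(1000), n_gpus=2, batch_size=4, bptt=10)
```

With `tokens = range(1000)`, `shards[0][1][0, 0]` is `shards[0][0][0, -1] + 1`, i.e. the second chunk picks up exactly where the first left off in stream 0 of GPU 0, which is the continuity the RNN's hidden state relies on.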