If you have a batch of 64, does that mean each process of the GPU is processing a batch of 64 items or that the GPU is processing 64 items in parallel? Or something else?
I’m looking at the gradient accumulation section of the notebook “Scaling Up: Road to the Top, Part 3” on Kaggle. I’m trying to understand why gradient accumulation reduces the GPU’s peak memory usage.
I can try answering your question about gradient accumulation. The way to think about it is that the GPU has a memory area where tensors reside, get copied in, and get freed. So when a batch of 64 samples is processed, those 64 samples (and the activations computed from them) are loaded into GPU memory, gradients are computed, and then that memory is freed to make space for the next batch.
Now let’s say we choose to accumulate gradients over 2 batches, each of size 32. When processing each batch, only 32 samples (and their activations) occupy GPU memory; once they’re processed, they’re freed. Hence the peak memory consumption comes from the space taken by 32 samples only, roughly half of the original case.
The result of the second case, adding up the gradients over 2 batches, is almost identical to computing the gradient of a single batch of 64 items (provided each small batch’s loss is scaled so the accumulated gradient averages rather than doubles); hence the two are comparable in everything except memory usage.
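To make the equivalence concrete, here is a minimal PyTorch sketch (a toy linear model with random data, not the notebook's actual code) showing that two accumulated micro-batches of 32, each with its loss divided by 2, produce essentially the same gradient as one batch of 64:

```python
import torch

torch.manual_seed(0)

# Toy model and data, purely for illustration
model = torch.nn.Linear(10, 1)
x = torch.randn(64, 10)
y = torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction

# Case 1: one full batch of 64
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Case 2: two micro-batches of 32, gradients accumulated in .grad.
# Dividing each loss by 2 makes the sum of the two micro-batch means
# equal the mean over the full batch of 64.
model.zero_grad()
for xb, yb in ((x[:32], y[:32]), (x[32:], y[32:])):
    (loss_fn(model(xb), yb) / 2).backward()  # backward() adds into .grad
acc_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, acc_grad, atol=1e-5))
```

The key detail is that `backward()` accumulates into `.grad` rather than overwriting it, which is exactly what makes this trick possible without any extra bookkeeping.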
Thank you. I think I understand. My understanding now is that you can split a batch of 64 into 2 batches of 32 and send them to the GPU sequentially. By accumulating the gradients from both batches before stepping the optimizer, you end up adjusting the weights the same way you would have if you had sent a single batch of 64 to the GPU and updated the parameters from that one result.
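That understanding maps directly onto a training loop. A hedged sketch (the names `model`, `opt`, and the synthetic `data` list are placeholders, not the notebook's code) of the "step once per accumulated group" pattern:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 2  # 2 micro-batches of 32 -> effective batch of 64

# Stand-in for a DataLoader yielding micro-batches of 32
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(4)]
w0 = model.weight.detach().clone()

opt.zero_grad()
for i, (xb, yb) in enumerate(data):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so grads average
    loss.backward()                              # accumulates into .grad
    if (i + 1) % accum_steps == 0:
        opt.step()       # one weight update per effective batch of 64
        opt.zero_grad()  # reset for the next accumulated group
```

Note that `zero_grad()` is called only after each `step()`, not after every micro-batch; clearing it every iteration would defeat the accumulation.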
In the context of the notebook “Scaling Up: Road to the Top, Part 3” from Kaggle, the point is that gradient accumulation allows effective training of large models on GPUs with limited memory. By accumulating gradients over multiple forward and backward passes, the GPU can train larger models, or use larger effective batch sizes, than it could if it had to process the entire batch at once.