A few queries re: cell 5 of the Kaggle notebook “Scaling Up: Road to the Top, Part 3”
- It seems strange that only one leg of the if/else does a return.
- I remember from transcribing there was a question about whether `count>64` should be `count>=64`.
- While testing the GPU memory required by a larger model: when a teach() call succeeds, i.e. fine-tuning completes without error, then `gc.collect()` and `torch.cuda.empty_cache()` work fine. That is, running gpu_report() says: 15039.000 MB GPU memory, and then running gpu_report() again says: 1541.000 MB GPU memory.
But when a “CUDA out of memory” error occurs, running gpu_report() does not release the memory, and I can’t find a way to release it without a kernel restart. I thought the traceback context might be holding some reference that prevents GC, but deleting the cell so the pink output cell disappears doesn’t help. Any hints on how to release memory after an error, without a kernel restart?
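
To make the traceback hypothesis concrete, here is a minimal sketch in plain Python (no CUDA needed). It shows that a stored traceback keeps the locals of the failed function alive, and that `traceback.clear_frames` lets them go. `BigBuffer`, `failing_step` and `run_and_release` are hypothetical stand-ins, and whether this is really what pins the memory in this notebook is an assumption, not something confirmed here.

```python
import gc
import traceback
import weakref


class BigBuffer:
    """Hypothetical stand-in for a large GPU tensor."""
    def __init__(self, n):
        self.data = bytearray(n)


def failing_step(probe):
    buf = BigBuffer(1_000_000)        # local variable living in this frame
    probe.append(weakref.ref(buf))    # lets us check later whether buf was freed
    raise RuntimeError("simulated CUDA out of memory")


def run_and_release():
    probe = []
    saved_tb = None
    try:
        failing_step(probe)
    except RuntimeError as exc:
        # Keeping the traceback mimics an interactive shell remembering the last error.
        saved_tb = exc.__traceback__

    gc.collect()
    alive_before = probe[0]() is not None   # frame locals are still referenced

    traceback.clear_frames(saved_tb)        # drop the locals of the finished frames
    saved_tb = None
    gc.collect()
    alive_after = probe[0]() is not None    # buffer has now been collected

    return alive_before, alive_after


print(run_and_release())  # (True, False)
```

In a notebook the analogous attempt would be to call `traceback.clear_frames(sys.last_traceback)` after the failed cell, if that attribute is set, and then run `gc.collect()` and `torch.cuda.empty_cache()` again. Treat that as something to try rather than a guaranteed fix, since other references (for example a learner object still bound to a name) can hold memory too.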
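
On the `count>64` versus `count>=64` question above, a small simulation shows what the two comparisons would do. This is only a sketch: it assumes `count` grows by the batch size on every batch and is reset to zero whenever the weights are updated, which is an assumption about the loop rather than a copy of the notebook cell. `batches_per_step` is a hypothetical helper name.

```python
def batches_per_step(batch_size, threshold=64, inclusive=False, n_batches=12):
    """Return how many batches were accumulated before each weight update."""
    count = 0
    batches_since_step = 0
    steps = []
    for _ in range(n_batches):
        count += batch_size
        batches_since_step += 1
        # The only difference between the two variants is this comparison.
        fire = count >= threshold if inclusive else count > threshold
        if fire:
            steps.append(batches_since_step)
            count = 0
            batches_since_step = 0
    return steps


print(batches_per_step(32, inclusive=False))  # [3, 3, 3, 3]
print(batches_per_step(32, inclusive=True))   # [2, 2, 2, 2, 2, 2]
```

Under these assumptions, with a batch size of 32 the strict `>` only updates after 96 items, while `>=` updates after exactly 64. So if the intent is an effective batch size of 64, `>=` matches it.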
P.S. From here I got a small script to print memory-resident tensors (with the counts done in Excel), but I don’t know enough to analyse it.
| Resident Tensor | Count |
|---|---|
| `<class 'torch.Tensor'> torch.Size([1024])` | 449 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([1024])` | 149 |
| `<class 'torch.Tensor'> torch.Size([32, 197, 1024])` | 73 |
| `<class 'torch.Tensor'> torch.Size([3072, 1024])` | 72 |
| `<class 'torch.Tensor'> torch.Size([4096, 1024])` | 72 |
| `<class 'torch.Tensor'> torch.Size([3072])` | 72 |
| `<class 'torch.Tensor'> torch.Size([1024, 4096])` | 72 |
| `<class 'torch.Tensor'> torch.Size([1024, 1024])` | 72 |
| `<class 'torch.Tensor'> torch.Size([4096])` | 72 |
| `<class 'torch.Tensor'> torch.Size([32, 197, 4096])` | 46 |
| `<class 'torch.Tensor'> torch.Size([32, 16, 197, 197])` | 24 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([4096])` | 24 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([3072, 1024])` | 24 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([1024, 4096])` | 24 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([3072])` | 24 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([1024, 1024])` | 24 |
| `<class 'torch.nn.parameter.Parameter'> torch.Size([4096, 1024])` | 24 |
| `<class 'torch.Tensor'> torch.Size([])` | 9 |
| `<class 'torch.Tensor'> torch.Size([512])` | 8 |
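
The counting step done in Excel could also be done in Python. The script itself is not shown above, so this sketch assumes it emits one line per tensor in the same form as the first column of the table. `tally_lines` and the sample lines are hypothetical, for illustration only.

```python
from collections import Counter


def tally_lines(lines):
    """Count identical non-empty lines; return (line, count) pairs, most common first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


# Made-up sample in the assumed "type shape" format, one line per resident tensor.
sample = [
    "<class 'torch.Tensor'> torch.Size([1024])",
    "<class 'torch.nn.parameter.Parameter'> torch.Size([1024])",
    "<class 'torch.Tensor'> torch.Size([1024])",
    "<class 'torch.Tensor'> torch.Size([512])",
    "<class 'torch.Tensor'> torch.Size([1024])",
]

for line, n in tally_lines(sample):
    print(f"{line} | {n}")
```

Collecting the script's output into a list instead of printing it, then passing that list to `tally_lines`, should give a table like the one above without the spreadsheet round trip.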