Just as a sanity check: I am using fastai as part of a bigger model, and I am moving tensors between different GPUs. This is the status of my VRAM at each step:
Just loading ResNet18: 8.2 GB
Moving the [24, 35, 256, 256] Tensor onto the same GPU: 8.2 GB
Clearing torch cache: 7.6 GB
Running inference: 16.1 GB
I was really hoping to fit a bigger batch size onto one GPU, but it seems that ResNet18 alone takes 8.2 GB of VRAM, and then when I run predict (because I have to keep track of the gradients) the footprint effectively doubles. Is there anything I can do to fit more?
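For context, here is roughly the pattern I am running, as a sketch: a plain torchvision resnet18 and a random batch stand in for my actual fastai model and my real [24, 35, 256, 256] input. If I understand correctly, the activation storage for backward is what doubles the footprint, so no_grad() would avoid it whenever the gradients are not actually needed:

import torch
from torchvision.models import resnet18

device = torch.device('cuda:0')
model = resnet18().to(device).eval()                 # stand-in for my fastai model
batch = torch.randn(8, 3, 256, 256, device=device)   # stand-in for my real input

# Without no_grad(), the forward pass keeps every intermediate activation
# alive for a potential backward pass; with it, they are freed immediately.
with torch.no_grad():
    preds = model(batch)

print(torch.cuda.memory_allocated(device) / 1024**3, 'GiB allocated')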
I have tried clearing the torch cache afterwards too, but unfortunately that only brings it down to 14.8 GB (?). That number really confuses me, though I know PyTorch seems to deallocate VRAM in rather unintuitive ways. Any thoughts?
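Concretely, this is what I mean by clearing the torch cache (preds is a placeholder for the output I am actually holding on to): drop the Python references first, then let the caching allocator hand its blocks back to the driver.

import gc
import torch

del preds                  # placeholder: whatever references I still hold
gc.collect()
torch.cuda.empty_cache()

# empty_cache() only releases reserved-but-unallocated blocks; anything
# still referenced by a live tensor stays allocated, which I suspect is
# why I only get down to 14.8 GB rather than the post-load baseline.
print('allocated:', torch.cuda.memory_allocated('cuda:0') / 1024**3, 'GiB')
print('reserved: ', torch.cuda.memory_reserved('cuda:0') / 1024**3, 'GiB')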
This is what I get when I call e.g. torch.cuda.memory_stats('cuda:0'):
OrderedDict([('active.all.allocated', 397), ('active.all.current', 294), ('active.all.freed', 103), ('active.all.peak', 296), ('active.large_pool.allocated', 176), ('active.large_pool.current', 107), ('active.large_pool.freed', 69), ('active.large_pool.peak', 109), ('active.small_pool.allocated', 221), ('active.small_pool.current', 187), ('active.small_pool.freed', 34), ('active.small_pool.peak', 188), ('active_bytes.all.allocated', 75536306176), ('active_bytes.all.current', 7809503232), ('active_bytes.all.freed', 67726802944), ('active_bytes.all.peak', 12895096832), ('active_bytes.large_pool.allocated', 75522199552), ('active_bytes.large_pool.current', 7801064448), ('active_bytes.large_pool.freed', 67721135104), ('active_bytes.large_pool.peak', 12886658048), ('active_bytes.small_pool.allocated', 14106624), ('active_bytes.small_pool.current', 8438784), ('active_bytes.small_pool.freed', 5667840), ('active_bytes.small_pool.peak', 10425856), ('allocated_bytes.all.allocated', 75536306176), ('allocated_bytes.all.current', 7809503232), ('allocated_bytes.all.freed', 67726802944), ('allocated_bytes.all.peak', 12895096832), ('allocated_bytes.large_pool.allocated', 75522199552), ('allocated_bytes.large_pool.current', 7801064448), ('allocated_bytes.large_pool.freed', 67721135104), ('allocated_bytes.large_pool.peak', 12886658048), ('allocated_bytes.small_pool.allocated', 14106624), ('allocated_bytes.small_pool.current', 8438784), ('allocated_bytes.small_pool.freed', 5667840), ('allocated_bytes.small_pool.peak', 10425856), ('allocation.all.allocated', 397), ('allocation.all.current', 294), ('allocation.all.freed', 103), ('allocation.all.peak', 296), ('allocation.large_pool.allocated', 176), ...
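To try to make sense of that dump, I have been cross-checking it against the convenience helpers, which, if I read the docs correctly, just expose the allocated_bytes.all.* entries from the same stats:

import torch

gib = 1024 ** 3
stats = torch.cuda.memory_stats('cuda:0')

# These should line up with the dump above: ~7.3 GiB currently allocated,
# ~12.0 GiB at peak.
print('current:', stats['allocated_bytes.all.current'] / gib)
print('peak:   ', stats['allocated_bytes.all.peak'] / gib)

# The same numbers via the shortcut functions:
print(torch.cuda.memory_allocated('cuda:0') / gib)
print(torch.cuda.max_memory_allocated('cuda:0') / gib)

# memory_summary() pretty-prints the whole stats dict, which is easier to scan:
print(torch.cuda.memory_summary('cuda:0'))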
But even from those numbers it is unclear to me whether, e.g., the network has been loaded several times, or what exactly happened. The best profile I managed to get was this:
Module | Self CPU total | CPU total | CUDA total | Number of Calls
-------------------|----------------|-----------|------------|----------------
DynamicUnet | | | |
└── layers | | | |
├── 0 | | | |
│├── 0 | 2.741ms | 10.859ms | 12.501ms | 1
│├── 1 | 1.726ms | 5.110ms | 5.562ms | 1
│├── 2 | 73.121us | 73.121us | 73.152us | 1
│├── 3 | 78.982us | 140.970us | 319.712us | 1
│├── 4 | | | |
││├── 0 | | | |
│││├── conv1 | 157.844us | 551.133us | 1.376ms | 1
│││├── bn1 | 1.478ms | 4.353ms | 4.451ms | 1
│││├── relu | 87.206us | 87.206us | 135.104us | 2
│││├── conv2 | 1.423ms | 5.609ms | 6.421ms | 1
│││└── bn2 | 1.440ms | 4.254ms | 4.308ms | 1
││├── 1 | | | |
│││├── conv1 | 1.475ms | 5.817ms | 6.651ms | 1
│││├── bn1 | 1.577ms | 4.666ms | 4.749ms | 1
│││├── relu | 86.952us | 86.952us | 86.624us | 2
│││├── conv2 | 1.446ms | 5.704ms | 6.514ms | 1
│││└── bn2 | 1.479ms | 4.372ms | 4.458ms | 1
│├── 5 | | | |
││├── 0 | | | |
│││├── conv1 | 148.000us | 512.436us | 1.405ms | 1
│││├── bn1 | 1.475ms | 4.361ms | 4.389ms | 1
│││├── relu | 72.473us | 72.473us | 72.448us | 2
│││├── conv2 | 1.552ms | 6.127ms | 6.903ms | 1
│││├── bn2 | 1.354ms | 3.999ms | 4.026ms | 1
│││├── downsample | | | |
││││├── 0 | 1.433ms | 5.650ms | 5.770ms | 1
││││└── 1 | 1.554ms | 4.596ms | 4.604ms | 1
││├── 1 | | | |
│││├── conv1 | 144.136us | 494.567us | 1.277ms | 1
│││├── bn1 | 1.490ms | 4.407ms | 4.433ms | 1
│││├── relu | 72.034us | 72.034us | 71.680us | 2
│││├── conv2 | 1.017ms | 3.984ms | 4.770ms | 1
│││└── bn2 | 2.620ms | 7.796ms | 7.806ms | 1
│├── 6 | | | |
││├── 0 | | | |
│││├── conv1 | 124.746us | 418.001us | 1.454ms | 1
│││├── bn1 | 909.269us | 2.664ms | 2.668ms | 1
│││├── relu | 70.754us | 70.754us | 70.880us | 2
│││├── conv2 | 1.062ms | 4.173ms | 4.892ms | 1
│││├── bn2 | 129.418us | 328.596us | 331.456us | 1
│││├── downsample | | | |
││││├── 0 | 174.019us | 615.275us | 769.600us | 1
││││└── 1 | 129.678us | 326.513us | 324.672us | 1
││├── 1 | | | |
│││├── conv1 | 1.015ms | 3.979ms | 4.763ms | 1
│││├── bn1 | 147.766us | 382.280us | 388.000us | 1
│││├── relu | 88.495us | 88.495us | 87.584us | 2
│││├── conv2 | 140.468us | 482.008us | 1.276ms | 1
│││└── bn2 | 164.807us | 417.227us | 415.392us | 1
│├── 7 | | | |
││├── 0 | | | |
│││├── conv1 | 191.440us | 689.014us | 1.743ms | 1
│││├── bn1 | 131.265us | 330.271us | 340.544us | 1
│││├── relu | 82.368us | 82.368us | 82.528us | 2
│││├── conv2 | 979.864us | 3.836ms | 4.666ms | 1
│││├── bn2 | 148.646us | 348.405us | 361.120us | 1
│││├── downsample | | | |
││││├── 0 | 115.742us | 384.877us | 529.152us | 1
││││└── 1 | 128.043us | 323.512us | 340.320us | 1
││├── 1 | | | |
│││├── conv1 | 122.644us | 409.273us | 1.262ms | 1
│││├── bn1 | 125.543us | 316.654us | 328.416us | 1
│││├── relu | 71.609us | 71.609us | 70.624us | 2
│││├── conv2 | 139.532us | 476.890us | 1.276ms | 1
│││└── bn2 | 127.482us | 320.873us | 334.560us | 1
├── 1 | 143.091us | 367.767us | 367.840us | 1
├── 2 | 59.685us | 59.685us | 59.904us | 1
├── 3 | | | |
│├── 0 | | | |
││├── 0 | 1.102ms | 4.311ms | 5.933ms | 1
││└── 1 | 59.741us | 59.741us | 59.936us | 1
│├── 1 | | | |
││├── 0 | 147.827us | 502.397us | 2.185ms | 1
││└── 1 | 98.236us | 98.236us | 45.824us | 1
├── 4 | | | |
│├── shuf | | | |
││├── 0 | | | |
│││├── 0 | 149.902us | 518.996us | 960.096us | 1
│││└── 1 | 44.511us | 44.511us | 44.224us | 1
││└── 1 | 157.236us | 369.259us | 372.832us | 1
│├── bn | 131.621us | 329.947us | 336.576us | 1
│├── conv1 | | | |
││├── 0 | 1.140ms | 4.481ms | 7.528ms | 1
││└── 1 | 44.099us | 44.099us | 44.192us | 1
│├── conv2 | | | |
││├── 0 | 152.624us | 531.870us | 3.707ms | 1
││└── 1 | 45.796us | 45.796us | 46.560us | 1
│└── relu | 41.246us | 41.246us | 40.992us | 1
├── 5 | | | |
│├── shuf | | | |
││├── 0 | | | |
│││├── 0 | 164.613us | 580.142us | 2.344ms | 1
│││└── 1 | 42.791us | 42.791us | 42.592us | 1
││└── 1 | 150.595us | 365.887us | 463.872us | 1
│├── bn | 143.109us | 368.781us | 405.504us | 1
│├── conv1 | | | |
││├── 0 | 1.239ms | 4.876ms | 11.391ms | 1
││└── 1 | 60.048us | 60.048us | 59.968us | 1
│├── conv2 | | | |
││├── 0 | 1.306ms | 5.144ms | 11.662ms | 1
││├── 1 | 45.441us | 45.441us | 45.280us | 1
││├── 2 | | | |
│││├── query | | | |
││││└── 0 | 185.265us | 636.181us | 941.568us | 1
│││├── key | | | |
││││└── 0 | 175.794us | 604.437us | 911.008us | 1
│││├── value | | | |
││││└── 0 | 173.481us | 597.576us | 2.081ms | 1
│└── relu | 42.526us | 42.526us | 42.208us | 1
├── 6 | | | |
│├── shuf | | | |
││├── 0 | | | |
│││├── 0 | 166.451us | 583.985us | 4.087ms | 1
│││└── 1 | 965.870us | 965.870us | 967.488us | 1
││└── 1 | 1.008ms | 2.928ms | 3.214ms | 1
│├── bn | 143.730us | 335.790us | 434.560us | 1
│├── conv1 | | | |
││├── 0 | 1.114ms | 4.378ms | 16.913ms | 1
││└── 1 | 886.677us | 886.677us | 887.712us | 1
│├── conv2 | | | |
││├── 0 | 1.038ms | 4.070ms | 16.564ms | 1
││└── 1 | 931.969us | 931.969us | 932.928us | 1
│└── relu | 970.279us | 970.279us | 971.488us | 1
├── 7 | | | |
│├── shuf | | | |
││├── 0 | | | |
│││├── 0 | 1.200ms | 4.715ms | 11.392ms | 1
│││└── 1 | 1.050ms | 1.050ms | 1.051ms | 1
││└── 1 | 1.195ms | 3.496ms | 4.294ms | 1
│├── bn | 127.695us | 320.865us | 762.752us | 1
│├── conv1 | | | |
││├── 0 | 156.475us | 549.483us | 15.258ms | 1
││└── 1 | 1.049ms | 1.049ms | 1.052ms | 1
│├── conv2 | | | |
││├── 0 | 1.904ms | 7.512ms | 15.644ms | 1
││└── 1 | 990.908us | 990.908us | 991.808us | 1
│└── relu | 1.163ms | 1.163ms | 1.164ms | 1
├── 8 | | | |
│├── 0 | | | |
││├── 0 | 1.674ms | 6.616ms | 17.126ms | 1
││└── 1 | 1.562ms | 1.562ms | 1.564ms | 1
│└── 1 | 1.665ms | 4.906ms | 7.591ms | 1
├── 9 | 0.000us | 0.000us | 0.000us | 1
├── 10 | 1.581ms | 1.581ms | 1.582ms | 1
├── 11 | | | |
│├── convpath | | | |
││├── 0 | | | |
│││├── 0 | 1.618ms | 6.391ms | 51.245ms | 1
│││└── 1 | 2.639ms | 2.639ms | 2.642ms | 1
││└── 1 | | | |
││ └── 0 | 1.697ms | 6.623ms | 51.456ms | 1
│├── idpath | 0.000us | 0.000us | 0.000us | 1
│└── act | 36.439us | 36.439us | 36.192us | 1
└── 12 | | | |
└── 0 | 158.487us | 540.154us | 7.335ms | 1
But even that is fairly opaque to me, even though I can at least read the individual lines.
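For what it's worth, that table only shows timings, not allocations. My understanding is that the autograd profiler can attribute memory per op when profile_memory=True is passed; a sketch of what I would try next, again with a plain resnet18 and a random batch standing in for my actual DynamicUnet and input:

import torch
from torch.autograd import profiler
from torchvision.models import resnet18

device = torch.device('cuda:0')
model = resnet18().to(device).eval()                 # stand-in for my DynamicUnet
batch = torch.randn(8, 3, 256, 256, device=device)   # stand-in for my real input

# profile_memory=True should add self/total CUDA memory columns per op,
# showing where the VRAM goes rather than just where the time goes.
with profiler.profile(use_cuda=True, profile_memory=True) as prof:
    with torch.no_grad():
        model(batch)

print(prof.key_averages().table(sort_by='self_cuda_memory_usage', row_limit=15))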