GPU Profiling (aka why does resnet18 take 8 GB of VRAM?)

Just as a sanity check: I am using fastai as part of a bigger model, and I am moving tensors between different GPUs. This is the status of my VRAM at each step:
Just loading ResNet18: 8.2 GB
Moving the [24, 35, 256, 256] tensor onto the same GPU: 8.2 GB
Clearing the torch cache: 7.6 GB
Running inference: 16.1 GB
I was really hoping to fit a bigger batch size onto one GPU, but it seems that ResNet18 alone takes 8.2 GB of VRAM, and when I run predict the usage effectively doubles (because I have to keep track of the gradient). Is there anything I can do to fit more?
I have also tried clearing the torch cache afterwards, but unfortunately this only brings usage down to 14.8 GB (?). This number is really confusing, although I know that PyTorch seems to deallocate VRAM in a very strange way. Any thoughts?
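
For scale, the raw size of the input tensor can be estimated by hand. This is a back-of-the-envelope sketch, assuming the tensor is float32 (4 bytes per element; the dtype is not stated above). `tensor_bytes` is a hypothetical helper, not a PyTorch function:

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Raw storage of a dense tensor; 4 bytes per element assumes float32."""
    return prod(shape) * bytes_per_element

shape = [24, 35, 256, 256]
size = tensor_bytes(shape)
print(f"{size} bytes = {size / 2**20:.0f} MiB = {size / 2**30:.3f} GiB")
```

Under that assumption the tensor is roughly 0.2 GiB, so the input alone is small next to the 8.2 GB reading. It may fit into memory the caching allocator had already reserved, which could be why the number does not move when the tensor is copied over.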

This is what I get when I call e.g. torch.cuda.memory_stats('cuda:0'):

OrderedDict([('active.all.allocated', 397), ('active.all.current', 294), ('active.all.freed', 103), ('active.all.peak', 296), ('active.large_pool.allocated', 176), ('active.large_pool.current', 107), ('active.large_pool.freed', 69), ('active.large_pool.peak', 109), ('active.small_pool.allocated', 221), ('active.small_pool.current', 187), ('active.small_pool.freed', 34), ('active.small_pool.peak', 188), ('active_bytes.all.allocated', 75536306176), ('active_bytes.all.current', 7809503232), ('active_bytes.all.freed', 67726802944), ('active_bytes.all.peak', 12895096832), ('active_bytes.large_pool.allocated', 75522199552), ('active_bytes.large_pool.current', 7801064448), ('active_bytes.large_pool.freed', 67721135104), ('active_bytes.large_pool.peak', 12886658048), ('active_bytes.small_pool.allocated', 14106624), ('active_bytes.small_pool.current', 8438784), ('active_bytes.small_pool.freed', 5667840), ('active_bytes.small_pool.peak', 10425856), ('allocated_bytes.all.allocated', 75536306176), ('allocated_bytes.all.current', 7809503232), ('allocated_bytes.all.freed', 67726802944), ('allocated_bytes.all.peak', 12895096832), ('allocated_bytes.large_pool.allocated', 75522199552), ('allocated_bytes.large_pool.current', 7801064448), ('allocated_bytes.large_pool.freed', 67721135104), ('allocated_bytes.large_pool.peak', 12886658048), ('allocated_bytes.small_pool.allocated', 14106624), ('allocated_bytes.small_pool.current', 8438784), ('allocated_bytes.small_pool.freed', 5667840), ('allocated_bytes.small_pool.peak', 10425856), ('allocation.all.allocated', 397), ('allocation.all.current', 294), ('allocation.all.freed', 103), ('allocation.all.peak', 296), ('allocation.large_pool.allocated', 176), , ... 
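
The byte counters are easier to read once converted to GiB. This is a minimal sketch that uses only values copied from the output above. `summarize` is a hypothetical helper, and it assumes the dict carries the same keys that were printed here:

```python
GIB = 2**30

# Values copied from the memory_stats output above (all in bytes).
stats = {
    "allocated_bytes.all.current": 7809503232,
    "allocated_bytes.all.peak": 12895096832,
    "allocated_bytes.all.allocated": 75536306176,
    "allocated_bytes.all.freed": 67726802944,
}

def summarize(stats):
    """Convert the headline byte counters to GiB."""
    return {
        "current_gib": stats["allocated_bytes.all.current"] / GIB,
        "peak_gib": stats["allocated_bytes.all.peak"] / GIB,
        "cumulative_allocated_gib": stats["allocated_bytes.all.allocated"] / GIB,
        "cumulative_freed_gib": stats["allocated_bytes.all.freed"] / GIB,
    }

for name, value in summarize(stats).items():
    print(f"{name}: {value:.2f}")
```

Here `current` is what live tensors hold right now and `peak` is the high-water mark. `allocated` and `freed` are running totals over the life of the process, and their difference equals `current`, so the roughly 75 GB figure does not mean 75 GB was ever held at once. These counters cover only memory handed out by PyTorch's allocator, so they will probably not match an external reading such as the 8.2 GB figure exactly.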

From this output it is unclear to me whether, for example, the network has been loaded several times, or what exactly happened. The best profile I got was this:

Module             | Self CPU total | CPU total | CUDA total | Number of Calls
-------------------|----------------|-----------|------------|----------------
DynamicUnet        |                |           |            |
└── layers         |                |           |            |
 ├── 0             |                |           |            |
 │├── 0            | 2.741ms        | 10.859ms  | 12.501ms   | 1
 │├── 1            | 1.726ms        | 5.110ms   | 5.562ms    | 1
 │├── 2            | 73.121us       | 73.121us  | 73.152us   | 1
 │├── 3            | 78.982us       | 140.970us | 319.712us  | 1
 │├── 4            |                |           |            |
 ││├── 0           |                |           |            |
 │││├── conv1      | 157.844us      | 551.133us | 1.376ms    | 1
 │││├── bn1        | 1.478ms        | 4.353ms   | 4.451ms    | 1
 │││├── relu       | 87.206us       | 87.206us  | 135.104us  | 2
 │││├── conv2      | 1.423ms        | 5.609ms   | 6.421ms    | 1
 │││└── bn2        | 1.440ms        | 4.254ms   | 4.308ms    | 1
 ││├── 1           |                |           |            |
 │││├── conv1      | 1.475ms        | 5.817ms   | 6.651ms    | 1
 │││├── bn1        | 1.577ms        | 4.666ms   | 4.749ms    | 1
 │││├── relu       | 86.952us       | 86.952us  | 86.624us   | 2
 │││├── conv2      | 1.446ms        | 5.704ms   | 6.514ms    | 1
 │││└── bn2        | 1.479ms        | 4.372ms   | 4.458ms    | 1
 │├── 5            |                |           |            |
 ││├── 0           |                |           |            |
 │││├── conv1      | 148.000us      | 512.436us | 1.405ms    | 1
 │││├── bn1        | 1.475ms        | 4.361ms   | 4.389ms    | 1
 │││├── relu       | 72.473us       | 72.473us  | 72.448us   | 2
 │││├── conv2      | 1.552ms        | 6.127ms   | 6.903ms    | 1
 │││├── bn2        | 1.354ms        | 3.999ms   | 4.026ms    | 1
 │││├── downsample |                |           |            |
 ││││├── 0         | 1.433ms        | 5.650ms   | 5.770ms    | 1
 ││││└── 1         | 1.554ms        | 4.596ms   | 4.604ms    | 1
 ││├── 1           |                |           |            |
 │││├── conv1      | 144.136us      | 494.567us | 1.277ms    | 1
 │││├── bn1        | 1.490ms        | 4.407ms   | 4.433ms    | 1
 │││├── relu       | 72.034us       | 72.034us  | 71.680us   | 2
 │││├── conv2      | 1.017ms        | 3.984ms   | 4.770ms    | 1
 │││└── bn2        | 2.620ms        | 7.796ms   | 7.806ms    | 1
 │├── 6            |                |           |            |
 ││├── 0           |                |           |            |
 │││├── conv1      | 124.746us      | 418.001us | 1.454ms    | 1
 │││├── bn1        | 909.269us      | 2.664ms   | 2.668ms    | 1
 │││├── relu       | 70.754us       | 70.754us  | 70.880us   | 2
 │││├── conv2      | 1.062ms        | 4.173ms   | 4.892ms    | 1
 │││├── bn2        | 129.418us      | 328.596us | 331.456us  | 1
 │││├── downsample |                |           |            |
 ││││├── 0         | 174.019us      | 615.275us | 769.600us  | 1
 ││││└── 1         | 129.678us      | 326.513us | 324.672us  | 1
 ││├── 1           |                |           |            |
 │││├── conv1      | 1.015ms        | 3.979ms   | 4.763ms    | 1
 │││├── bn1        | 147.766us      | 382.280us | 388.000us  | 1
 │││├── relu       | 88.495us       | 88.495us  | 87.584us   | 2
 │││├── conv2      | 140.468us      | 482.008us | 1.276ms    | 1
 │││└── bn2        | 164.807us      | 417.227us | 415.392us  | 1
 │├── 7            |                |           |            |
 ││├── 0           |                |           |            |
 │││├── conv1      | 191.440us      | 689.014us | 1.743ms    | 1
 │││├── bn1        | 131.265us      | 330.271us | 340.544us  | 1
 │││├── relu       | 82.368us       | 82.368us  | 82.528us   | 2
 │││├── conv2      | 979.864us      | 3.836ms   | 4.666ms    | 1
 │││├── bn2        | 148.646us      | 348.405us | 361.120us  | 1
 │││├── downsample |                |           |            |
 ││││├── 0         | 115.742us      | 384.877us | 529.152us  | 1
 ││││└── 1         | 128.043us      | 323.512us | 340.320us  | 1
 ││├── 1           |                |           |            |
 │││├── conv1      | 122.644us      | 409.273us | 1.262ms    | 1
 │││├── bn1        | 125.543us      | 316.654us | 328.416us  | 1
 │││├── relu       | 71.609us       | 71.609us  | 70.624us   | 2
 │││├── conv2      | 139.532us      | 476.890us | 1.276ms    | 1
 │││└── bn2        | 127.482us      | 320.873us | 334.560us  | 1
 ├── 1             | 143.091us      | 367.767us | 367.840us  | 1
 ├── 2             | 59.685us       | 59.685us  | 59.904us   | 1
 ├── 3             |                |           |            |
 │├── 0            |                |           |            |
 ││├── 0           | 1.102ms        | 4.311ms   | 5.933ms    | 1
 ││└── 1           | 59.741us       | 59.741us  | 59.936us   | 1
 │├── 1            |                |           |            |
 ││├── 0           | 147.827us      | 502.397us | 2.185ms    | 1
 ││└── 1           | 98.236us       | 98.236us  | 45.824us   | 1
 ├── 4             |                |           |            |
 │├── shuf         |                |           |            |
 ││├── 0           |                |           |            |
 │││├── 0          | 149.902us      | 518.996us | 960.096us  | 1
 │││└── 1          | 44.511us       | 44.511us  | 44.224us   | 1
 ││└── 1           | 157.236us      | 369.259us | 372.832us  | 1
 │├── bn           | 131.621us      | 329.947us | 336.576us  | 1
 │├── conv1        |                |           |            |
 ││├── 0           | 1.140ms        | 4.481ms   | 7.528ms    | 1
 ││└── 1           | 44.099us       | 44.099us  | 44.192us   | 1
 │├── conv2        |                |           |            |
 ││├── 0           | 152.624us      | 531.870us | 3.707ms    | 1
 ││└── 1           | 45.796us       | 45.796us  | 46.560us   | 1
 │└── relu         | 41.246us       | 41.246us  | 40.992us   | 1
 ├── 5             |                |           |            |
 │├── shuf         |                |           |            |
 ││├── 0           |                |           |            |
 │││├── 0          | 164.613us      | 580.142us | 2.344ms    | 1
 │││└── 1          | 42.791us       | 42.791us  | 42.592us   | 1
 ││└── 1           | 150.595us      | 365.887us | 463.872us  | 1
 │├── bn           | 143.109us      | 368.781us | 405.504us  | 1
 │├── conv1        |                |           |            |
 ││├── 0           | 1.239ms        | 4.876ms   | 11.391ms   | 1
 ││└── 1           | 60.048us       | 60.048us  | 59.968us   | 1
 │├── conv2        |                |           |            |
 ││├── 0           | 1.306ms        | 5.144ms   | 11.662ms   | 1
 ││├── 1           | 45.441us       | 45.441us  | 45.280us   | 1
 ││├── 2           |                |           |            |
 │││├── query      |                |           |            |
 ││││└── 0         | 185.265us      | 636.181us | 941.568us  | 1
 │││├── key        |                |           |            |
 ││││└── 0         | 175.794us      | 604.437us | 911.008us  | 1
 │││├── value      |                |           |            |
 ││││└── 0         | 173.481us      | 597.576us | 2.081ms    | 1
 │└── relu         | 42.526us       | 42.526us  | 42.208us   | 1
 ├── 6             |                |           |            |
 │├── shuf         |                |           |            |
 ││├── 0           |                |           |            |
 │││├── 0          | 166.451us      | 583.985us | 4.087ms    | 1
 │││└── 1          | 965.870us      | 965.870us | 967.488us  | 1
 ││└── 1           | 1.008ms        | 2.928ms   | 3.214ms    | 1
 │├── bn           | 143.730us      | 335.790us | 434.560us  | 1
 │├── conv1        |                |           |            |
 ││├── 0           | 1.114ms        | 4.378ms   | 16.913ms   | 1
 ││└── 1           | 886.677us      | 886.677us | 887.712us  | 1
 │├── conv2        |                |           |            |
 ││├── 0           | 1.038ms        | 4.070ms   | 16.564ms   | 1
 ││└── 1           | 931.969us      | 931.969us | 932.928us  | 1
 │└── relu         | 970.279us      | 970.279us | 971.488us  | 1
 ├── 7             |                |           |            |
 │├── shuf         |                |           |            |
 ││├── 0           |                |           |            |
 │││├── 0          | 1.200ms        | 4.715ms   | 11.392ms   | 1
 │││└── 1          | 1.050ms        | 1.050ms   | 1.051ms    | 1
 ││└── 1           | 1.195ms        | 3.496ms   | 4.294ms    | 1
 │├── bn           | 127.695us      | 320.865us | 762.752us  | 1
 │├── conv1        |                |           |            |
 ││├── 0           | 156.475us      | 549.483us | 15.258ms   | 1
 ││└── 1           | 1.049ms        | 1.049ms   | 1.052ms    | 1
 │├── conv2        |                |           |            |
 ││├── 0           | 1.904ms        | 7.512ms   | 15.644ms   | 1
 ││└── 1           | 990.908us      | 990.908us | 991.808us  | 1
 │└── relu         | 1.163ms        | 1.163ms   | 1.164ms    | 1
 ├── 8             |                |           |            |
 │├── 0            |                |           |            |
 ││├── 0           | 1.674ms        | 6.616ms   | 17.126ms   | 1
 ││└── 1           | 1.562ms        | 1.562ms   | 1.564ms    | 1
 │└── 1            | 1.665ms        | 4.906ms   | 7.591ms    | 1
 ├── 9             | 0.000us        | 0.000us   | 0.000us    | 1
 ├── 10            | 1.581ms        | 1.581ms   | 1.582ms    | 1
 ├── 11            |                |           |            |
 │├── convpath     |                |           |            |
 ││├── 0           |                |           |            |
 │││├── 0          | 1.618ms        | 6.391ms   | 51.245ms   | 1
 │││└── 1          | 2.639ms        | 2.639ms   | 2.642ms    | 1
 ││└── 1           |                |           |            |
 ││ └── 0          | 1.697ms        | 6.623ms   | 51.456ms   | 1
 │├── idpath       | 0.000us        | 0.000us   | 0.000us    | 1
 │└── act          | 36.439us       | 36.439us  | 36.192us   | 1
 └── 12            |                |           |            |
  └── 0            | 158.487us      | 540.154us | 7.335ms    | 1

But even that is not very clear to me, even though I can at least read the individual lines.
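
One way to make a table like this easier to scan is to rank rows by their CUDA total. This is a minimal sketch with a hypothetical `to_ms` helper and a handful of rows copied from the table above; the dotted path labels are shorthand for positions in the tree, not names the profiler printed. Note that these columns report time, not memory, so the ranking shows where inference is slow rather than where the VRAM goes:

```python
def to_ms(text):
    """Parse a profiler duration such as '2.741ms' or '73.121us' into milliseconds."""
    text = text.strip()
    if text.endswith("us"):
        return float(text[:-2]) / 1000.0
    if text.endswith("ms"):
        return float(text[:-2])
    raise ValueError(f"unrecognised duration: {text!r}")

# "CUDA total" values copied from a few rows of the table above.
cuda_total = {
    "layers.0.0": "12.501ms",
    "layers.0.2": "73.152us",
    "layers.6.conv1.0": "16.913ms",
    "layers.8.0.0": "17.126ms",
    "layers.11.convpath.0.0": "51.245ms",
    "layers.11.convpath.1.0": "51.456ms",
}

# Sort the rows from slowest to fastest.
ranked = sorted(cuda_total.items(), key=lambda item: to_ms(item[1]), reverse=True)
for path, duration in ranked:
    print(f"{path:<24} {to_ms(duration):8.3f} ms")
```

In this sample the two `convpath` convolutions in layer 11 come out on top, at about 51 ms each.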