torch.cuda.empty_cache() very, very slow performance

In short, my issue is: very slow performance freeing GPU memory with NVIDIA/CUDA.


In detail:

I’ve trained a transformer NLP classifier which I now have to use for inference. The following project requirements lead to the performance issue:

  1. I have an average of about 20K unique titles (e.g. "Nike running shoes for women") on which I have to run inference in real time, classifying each title into one of ~5K categories using the trained model.

  2. I have a server with a pair of K80 GPUs; each K80 GPU has 12 GB of memory.

  3. After trial-and-error tuning, I’ve found that I can process a maximum of about 4.5K title embeddings per batch/iteration on one of the GPUs; anything more than that exceeds the memory of a single GPU (an illustrative sketch of the dataloader setup is included after the timing output below).

  4. Below is a sample demo code snippet further illustrating the issue (the test below was run on a P100 GPU on Google Colab, which is at least twice as fast as the K80 - I wanted to be sure the issue is not related to the CUDA driver installation on my server):

for i, batch in enumerate(self.test_dataloader):
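    # NOTE: self.dump(...) is a small timing helper; it prints the message plus the seconds
    # elapsed since the previous dump call (these are the numbers in the output below)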

    self.dump('start empty cache...', i, 1)
    # torch.cuda.empty_cache()
    self.dump('end empty cache...', i, 1)

    self.dump('start to device...', i, 1)
    batch = tuple(t.to(device) for t in batch)  # move the batch tensors to the GPU (or CPU, depending on device)
    self.dump('end to device...', i, 1)

    # ... forward pass (outputs), logits, detach - see the timing output below ...

Output:

start empty cache... 0.00082
end empty cache... 1.9e-05
start to device... 3e-06
end to device... 0.001179 - HERE - time is good
start outputs... 8e-06
end outputs... 0.334536
logits... 6e-06
start detach... 1.7e-05
end detach... 0.004036

start empty cache... 0.335932
end empty cache... 4e-06
start to device... 3e-06
end to device... 16.553849 - HERE - time is ridiculously high - it takes 16 seconds to move the tensors to the GPU
start outputs... 2.3e-05
end outputs... 0.020878
logits... 7e-06
start detach... 1.4e-05
end detach... 0.00036

start empty cache... 0.00082
end empty cache... 6e-06
start to device... 4e-06
end to device... 17.385204 - HERE - time is ridiculously high
start outputs... 2.9e-05
end outputs... 0.021351
logits... 4e-06
start detach... 1.3e-05
end detach... 1.1e-05

...
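For context, here is an illustrative sketch of the setup behind self.test_dataloader (not my real project code - the tensor shapes and names are stand-ins): the ~20K tokenized titles are served in batches of at most ~4.5K, the largest size that fits in the 12 GB of a single K80 GPU (point 3 of the requirements above).

import torch
from torch.utils.data import TensorDataset, DataLoader

MAX_BATCH = 4500                                   # found by trial and error (point 3 above)

# dummy stand-ins for the tokenized titles; the real tensors come from the tokenizer
input_ids = torch.randint(0, 30000, (20000, 64), dtype=torch.long)
attention_mask = torch.ones_like(input_ids)

test_dataset = TensorDataset(input_ids, attention_mask)
test_dataloader = DataLoader(test_dataset, batch_size=MAX_BATCH, shuffle=False)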

  1. I’ve also tried uncommenting the torch.cuda.empty_cache() line - the result is the same: after the first iteration, when the GPU is full, it takes the same amount of time (16 seconds on a P100 and over 30 seconds on a K80) to free/recycle the GPU memory held by the ~4.5K title embedding tensors.

  2. In the code snippet above, the equivalent of torch.cuda.empty_cache() happens internally as part of tensor.to(device). Since the GPU is full after the first iteration, PyTorch’s caching allocator has to free its cached blocks before it can allocate space for the next batch, i.e. before it can move the next batch of tensors to the GPU (see the instrumented loop sketch after this list).

  3. I’ve confirmed in NVIDIA documentation and on other forums that this behavior is ‘expected’. Memory is cached in order to speed up training, and actually freeing the GPU memory is a serious, time-consuming task.

  4. Before I engage in writing complex inference code to juggle between the 2 GPUs and the CPU (a rough sketch of what I mean is included after the email below), I am kindly asking for your opinion: am I missing something obvious here?

  5. Below is an email answering a non-technical colleague from my office - I hope I haven’t misled him too much :slight_smile:
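In case it helps, here is roughly how I would instrument the loop to pin down where the time goes (a sketch, not my real project code; model and test_dataloader are placeholders for my actual objects). CUDA calls are asynchronous, so the explicit torch.cuda.synchronize() calls make sure each measured interval reflects the work of that step, and the allocator counters should show whether cached blocks are being flushed during .to(device), as described in point 2 above:

import time
import torch

# `model` and `test_dataloader` are placeholders for the real trained classifier and data
device = torch.device('cuda:0')
model.eval()                                   # inference mode (no dropout etc.)

with torch.no_grad():                          # don't build/keep the autograd graph during inference
    for i, batch in enumerate(test_dataloader):
        torch.cuda.synchronize(device)
        t0 = time.perf_counter()

        batch = tuple(t.to(device) for t in batch)
        torch.cuda.synchronize(device)
        t1 = time.perf_counter()

        outputs = model(*batch)
        torch.cuda.synchronize(device)
        t2 = time.perf_counter()

        stats = torch.cuda.memory_stats(device)
        print(f'batch {i}: to(device) {t1 - t0:.3f}s, forward {t2 - t1:.3f}s, '
              f'allocated {torch.cuda.memory_allocated(device) / 2**30:.2f} GiB, '
              f'reserved {torch.cuda.memory_reserved(device) / 2**30:.2f} GiB, '
              f'cache-flush retries {stats["num_alloc_retries"]}')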

thanks in advance,
Albert

Hi, it’s not a system/driver issue or anything like that,

it’s just the way GPUs behave (strangely, it turns out). Our server guy has also done some investigation.

I’ve confirmed this by running the same test code I ran on the server on Google Colab - same behavior: it takes a long, long time for the GPU to free memory.

This specific domain is called GPU inference.
I have two theories:

a) it’s still a relatively new area, and some aspects are under-developed - companies (NVIDIA) mostly invest in higher computation speed, mainly for training, not for inference

b) not many people actually run inference in real time the way we want to. Inference is usually done on small quantities in real time - say, face recognition on Snapchat, TikTok or Instagram - just one or a couple of faces, not tens of thousands of faces that have to be recognized in real time (which is our feed scenario: guess the categories of, say, 20K titles in real time, within several seconds)

So I guess companies on the scale of Facebook or Google, with practically unlimited GPU power (Google’s electricity usage is about that of a medium-sized city), just add more GPUs - for something like the categories project, with an expected average feed of 10K unique titles, they’d use 10-16 GPUs.

We currently have 2 K80 GPUs.

So I will have to write code to juggle between the 2 GPUs and the CPU. If the project is successful, we can invest in more modern GPUs.
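For reference, this is roughly the kind of juggling I have in mind - a very rough, untested sketch; it assumes the classifier fits on each GPU, that batches are independent, and load_model() is a placeholder for however the trained model is actually loaded:

import torch

devices = [torch.device('cuda:0'), torch.device('cuda:1')]
models = [load_model().to(d).eval() for d in devices]    # one copy of the classifier per K80 GPU

all_logits = []
with torch.no_grad():
    for i, batch in enumerate(test_dataloader):
        d = devices[i % 2]                               # round-robin between the two GPUs
        m = models[i % 2]
        batch = tuple(t.to(d, non_blocking=True) for t in batch)
        logits = m(*batch)[0]                            # assuming the model returns (logits, ...)
        all_logits.append(logits.cpu())                  # pull results back to the CPU right away

predictions = torch.cat(all_logits).argmax(dim=1)        # predicted category index per title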

Just wanted to let readers know - I’ve made progress with this issue in this Stack Overflow thread: