In short, my issue is: very slow performance on NVIDIA GPUs when CUDA frees GPU memory.
I’ve trained a transformer NLP classifier, which I have to use for inference. I have the following project requirements, leading to the performance issue:
I have an average of about 20K unique titles (e.g. "Nike running shoes for women"), on which I should run inference (classify each title into one of ~5K categories) - in real time, using the said trained model.
I have a server with a pair of K80 GPUs. Each K80 GPU core has 12 GB of memory.
after trial-and-error tuning, I’ve found that I can fit a maximum of about 4.5K titles (embeddings) per batch/iteration on one of the GPUs. Anything more exceeds the single GPU core’s memory
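For scale: 20K titles at ~4.5K per batch means about five forward passes per feed. The batching itself is trivial; a minimal sketch in plain Python (the helper name is mine, not from my actual code):

```python
def make_batches(items, batch_size=4500):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

titles = [f"title-{i}" for i in range(20000)]
batches = make_batches(titles)
# 20000 titles at 4500 per batch -> 5 batches (4 full ones + 1 of 2000)
```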
below is a sample demo code snippet, further illustrating the issue (the test below was run on a P100 GPU on Google Colab - at least twice as fast as the K80 - because I had to be sure the issue is not related to the CUDA driver installation on my server):
```python
for i, batch in enumerate(self.test_dataloader):
    self.dump('start empty cache...', i, 1)
    # torch.cuda.empty_cache()
    self.dump('end empty cache...', i, 1)
    self.dump('start to device...', i, 1)
    batch = tuple(t.to(device) for t in batch)  # move batch to GPU (or CPU)
    self.dump('end to device...', i, 1)
    # ... do stuff further in code ...
```
```
start empty cache... 0.00082
end empty cache... 1.9e-05
start to device... 3e-06
end to device... 0.001179     - HERE - time is good
start outputs... 8e-06
end outputs... 0.334536
logits... 6e-06
start detach... 1.7e-05
end detach... 0.004036
start empty cache... 0.335932
end empty cache... 4e-06
start to device... 3e-06
end to device... 16.553849    - HERE - time is ridiculously high - 16 seconds to move a tensor to the GPU
start outputs... 2.3e-05
end outputs... 0.020878
logits... 7e-06
start detach... 1.4e-05
end detach... 0.00036
start empty cache... 0.00082
end empty cache... 6e-06
start to device... 4e-06
end to device... 17.385204    - HERE - time is ridiculously high
start outputs... 2.9e-05
end outputs... 0.021351
logits... 4e-06
start detach... 1.3e-05
end detach... 1.1e-05
...
```
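The `dump` helper isn’t shown in the snippet; a minimal sketch of what such a per-step timer might look like (the class and its API are my assumption, not the actual code). One caveat when reading the timings: CUDA kernel launches are asynchronous, so a plain wall-clock timer can charge time to whichever line first forces synchronization (for example a `.to(device)` copy) rather than the line that actually did the work; calling `torch.cuda.synchronize()` before each timestamp makes the numbers attributable.

```python
import time

class StepTimer:
    """Hypothetical stand-in for the dump() helper: prints the label and
    the wall-clock seconds elapsed since the previous call."""
    def __init__(self):
        self._last = time.perf_counter()

    def dump(self, label, batch_idx, verbosity=1):
        # For honest GPU timings you would call torch.cuda.synchronize()
        # here first, so pending kernels are charged to the right step.
        now = time.perf_counter()
        elapsed = now - self._last
        self._last = now
        if verbosity:
            print(f"{label} {elapsed:.6g} (batch {batch_idx})")
        return elapsed
```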
I’ve also tried uncommenting the `torch.cuda.empty_cache()` line - it makes no difference: after the first iteration, once the GPU is full, it takes the same time (16 seconds on a P100 and over 30 seconds on a K80) to dump/recycle the GPU memory held by the ~4.5K title-embedding tensors.
In the code snippet above, `torch.cuda.empty_cache()` is effectively performed internally as part of `tensor.to(device)` (moving a tensor to the GPU). Since the GPU is full after the first iteration, PyTorch internally frees the cached memory and only then does the `.to(GPU)`, i.e. moves the next batch of tensors to the GPU.
I’ve confirmed this behavior is ‘expected’ per NVIDIA specs and other forums. It looks like memory is being ‘cached’ in order to optimize training, and dumping/freeing that GPU memory is a genuinely time-consuming operation.
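The caching idea can be illustrated with a toy model - this is not PyTorch’s actual allocator, just the principle: freed blocks go back to a per-size free list instead of being returned to the driver, so reuse is a cheap lookup, while flushing the cache (as `empty_cache()` does) forces expensive driver-level allocations later.

```python
class CachingPool:
    """Toy model of a caching allocator (NOT PyTorch's real one)."""
    def __init__(self):
        self.free = {}           # size -> list of cached blocks
        self.driver_allocs = 0   # counts expensive "cudaMalloc"-like calls

    def alloc(self, size):
        if self.free.get(size):
            return self.free[size].pop()   # fast path: reuse a cached block
        self.driver_allocs += 1            # slow path: ask the driver
        return bytearray(size)

    def release(self, block):
        # "Freeing" just caches the block for reuse; nothing goes back
        # to the driver.
        self.free.setdefault(len(block), []).append(block)

    def empty_cache(self):
        # Like torch.cuda.empty_cache(): drop all cached blocks, so the
        # next alloc of any size must hit the slow driver path again.
        self.free.clear()

pool = CachingPool()
a = pool.alloc(1024)
pool.release(a)
b = pool.alloc(1024)   # served from the cache, no new driver allocation
# pool.driver_allocs == 1
```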
Before I engage in writing complex inference code to juggle between the 2 GPUs and the CPU, I am kindly asking for your opinion: am I missing something obvious here?
Below is an email answering a non-technical colleague from my office - I hope I haven’t misled him too much.
Thanks in advance.
Hi, it’s not a system/driver issue or anything like that;
it’s just the way GPUs behave (strangely, it turns out). Our server admin has also done some investigation.
I’ve confirmed this by running the same test code I ran on the server on Google Colab’s cloud - same behavior: it takes a long, long time for the GPU to dump its memory.
This specific domain is called GPU inference.
I have 2 theories
a) it’s still a relatively new field, and some aspects are under-developed - companies (NVIDIA) mostly invest in higher calculation speed, mainly for training, not for inference
b) not many people actually run inference in real time the way we want to. They usually run inference on small quantities in real time - say, face recognition on Snapchat, TikTok or Instagram - it’s just one face or a couple of faces, not tens of thousands of items that must be recognized in real time (as in our feed scenario: guess categories for, say, 20K titles in real time, within several seconds)
So I guess companies on the scale of Facebook and Google, with practically unlimited GPU power (Google’s electricity usage is about that of a medium-sized city), just add more GPUs - for a project like our categories one, with an expected average feed of 10K unique titles, they’d use 10-16 GPUs.
We have currently 2 K80 GPUs
So I’ll need to write code to juggle between the 2 GPUs and the CPU. And if the project is successful, we can invest in more modern GPUs.
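The simplest form of that juggling is round-robin dispatch: send alternating batches to each K80 core so one GPU computes while the other frees/loads memory. A minimal sketch of the scheduling part in plain Python (device strings and names are illustrative; in PyTorch, `torch.nn.DataParallel` is the built-in way to split a batch across GPUs, though it splits within a batch rather than across batches):

```python
def round_robin(batches, devices):
    """Assign each batch to a device in turn; returns (device, batch) pairs."""
    return [(devices[i % len(devices)], b) for i, b in enumerate(batches)]

batches = [f"batch-{i}" for i in range(5)]
plan = round_robin(batches, ["cuda:0", "cuda:1"])
# plan[0] == ("cuda:0", "batch-0"), plan[1] == ("cuda:1", "batch-1"), ...
```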