Proper setup for running inference for a single example in a multi-threaded context

Hi,
I’m trying to set up a service that runs inference on single items (batches of size one), but handles many requests in parallel. I’m having trouble getting it to actually scale across concurrent requests.
I’ve created a simple gRPC server to serve model predictions; it is called by another service.
I’ve loaded the model via load_learner and I’m calling the model directly:

import torch

learner = load_learner(...)
model = learner.model
model.eval()  # make sure layers like dropout/batchnorm are in inference mode
....
def predict(inp):
    input_tensor = ...
    with torch.no_grad():  # no gradient bookkeeping needed when serving
        res = model(input_tensor)
    return res
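
For context, the serving side is set up roughly like this. This is a simplified sketch, not my exact code: the PredictorServicer class, the predictor_pb2 / predictor_pb2_grpc modules and the response fields are placeholders standing in for what gets generated from my .proto.

from concurrent import futures
import grpc

# placeholder modules generated from my .proto (names are not the real ones)
import predictor_pb2
import predictor_pb2_grpc

class PredictorServicer(predictor_pb2_grpc.PredictorServicer):
    def Predict(self, request, context):
        # build the input tensor from the request and run the model
        res = predict(request)
        return predictor_pb2.PredictResponse(...)

# requests are handled on a thread pool, so predict() is called
# concurrently from multiple threads in the same process
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
predictor_pb2_grpc.add_PredictorServicer_to_server(PredictorServicer(), server)
server.add_insecure_port('[::]:50051')
server.start()
server.wait_for_termination()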

I’m benchmarking the inference service with Apache JMeter. As I increase the number of concurrent requests, the response time grows roughly linearly:
1 user -> 6 ms
2 users -> 9 ms
3 users -> 13 ms
4 users -> 18 ms
5 users -> 24 ms
10 users -> 53 ms
20 users -> 105 ms

I was expecting the latency to stay roughly constant at least up to the number of CPU cores (8). Is there any configuration I could tweak to achieve real parallelism?
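
For example, I don’t know whether PyTorch’s CPU thread settings are the right knobs here. Is something along these lines what I should be looking at? (Just to illustrate what I mean by “configuration”; I haven’t verified that these help.)

import torch

# limit the intra-op threads a single forward pass uses, so that
# concurrent requests don't all contend for the same 8 cores?
torch.set_num_threads(1)

# size of the inter-op thread pool; must be set before the first forward pass
torch.set_num_interop_threads(1)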
