I’m trying to set up a service that runs inference on single items (batches of size one), but in parallel. I’m having trouble getting requests to actually run concurrently.
I’ve created a simple gRPC server to serve model predictions. The server is called by another service.
I’ve loaded the model via `load_learner` and I’m calling the model directly:

```python
learner = load_learner(...)
model = learner.model
...

def predict(inp):
    input_tensor = ...
    res = model(input_tensor)
```
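In full, the call path looks roughly like the runnable sketch below. The `nn.Linear` is just a stand-in for the real model that `load_learner` returns (which I can't include here), and the `eval()`/`no_grad()` calls reflect how I understand inference should be run:

```python
import torch
import torch.nn as nn

# Stand-in for learner.model; the real model comes from load_learner(...).
model = nn.Linear(4, 2)
model.eval()  # put the model in inference mode (disables dropout etc.)

def predict(input_tensor: torch.Tensor) -> torch.Tensor:
    # Run the forward pass without autograd bookkeeping.
    with torch.no_grad():
        return model(input_tensor)

out = predict(torch.randn(1, 4))  # single item, i.e. a batch of size one
```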
I’m benchmarking the inference service with Apache JMeter. As I increase the number of concurrent requests, the response time grows linearly:
1 user -> 6 ms
2 users -> 9 ms
3 users -> 13 ms
4 users -> 18 ms
5 users -> 24 ms
10 users -> 53 ms
20 users -> 105 ms
I was expecting response times to stay roughly constant at least up to the number of CPU cores (8). Is there any configuration I could tweak to achieve real parallelism?
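One knob I'm aware of but haven't confirmed is PyTorch's own intra-op thread pool: a single forward pass may already fan out across all cores, so concurrent requests would contend for the same CPUs. A minimal sketch of the relevant settings (standard `torch` APIs, as I understand them):

```python
import torch

# How many threads a single op may use; defaults to roughly the core count,
# so one request can already occupy the whole machine.
print(torch.get_num_threads())

# Inter-op parallelism (independent ops run concurrently).
print(torch.get_num_interop_threads())

# Capping intra-op threads leaves cores free for request-level parallelism.
torch.set_num_threads(1)
```

The `OMP_NUM_THREADS` / `MKL_NUM_THREADS` environment variables reportedly control the same pool when set before the process starts.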