Noob question on maximising GPU utilization

Hi everyone,

I’m trying to apply the collaborative filtering model from lesson 4 to a real-world problem with 100 million examples.

If I set too many factors, I get an out-of-memory error, but if I use, say, 10 factors, the model goes through the fitting process. However, when I run nvidia-smi it seems only around 2 GB out of 12 GB of memory is used. I’ve tried fiddling with the batch size, but it’s not increasing my GPU memory utilization…

My model is taking a while to fit with so many examples. I suspect it’s because the GPU isn’t being fully utilized, but I’m not sure what I’m doing wrong.

Please help :slight_smile:


I guess a follow-up question here is why did we choose batch_size=64 in lesson 4?

In nvidia-smi, if the process ID of your Python code shows up, it means the GPU is indeed being used.
Perhaps your model is too shallow to use all 12 GB?
The time it takes also depends on the speed of the cores, not just the memory.
Do you have a Maxwell card, or a Pascal?
There have been substantial speed improvements in Pascal over Maxwell.

Thanks @harveyslash,

I do believe the GPU is being used since nvidia-smi does show an increase in memory usage and indicates the PID of the python process.

The problem is I believe it’s being under-used.

I’m using an AWS P2 instance. CPU usage does spike to 100% on a single core, so that may be an issue; I’m open to trying a bigger P* instance if that would help. Memory usage is around 50%, but that’s mostly from reading the 10 GB CSV.

I believe the low GPU usage may be due to the architecture of the model since it is shallow (it’s just an embedding layer - I’m trying to fit the collaborative filtering example from lesson 4 but on new data).

I’ve got a lot of data (1.3 billion examples) but I’m sampling down to just 300m, which I’d rather avoid if possible, but I think I’d need a higher-memory instance (the P2 only has 60 GB). I’ve tried various combinations of batch size and learning rate, and it’s slow and not getting much better. My hope is that’s because I’m under-utilizing the GPU somehow, and that I can fiddle with the batch size or something to increase GPU utilization — but is there a downside if I increase it too much?

Details of the model below:

n_factors = 8
reg_strength = 1e-4
batch_size= 2**18 # 262144

from keras.layers import Input, Embedding, Flatten, merge
from keras.models import Model
from keras.regularizers import l2
from keras.optimizers import Adam

user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(reg_strength))(user_in)
target_in = Input(shape=(1,), dtype='int64', name='target_in')
m = Embedding(n_targets, n_factors, input_length=1, W_regularizer=l2(reg_strength))(target_in)
x = merge([u, m], mode='dot')  # dot product of the two embedding vectors
x = Flatten()(x)
model = Model([user_in, target_in], x)
model.compile(Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])

and here is the shape of the inputs:

n_users, n_targets # (7161769, 4503329)
trn.shape          # (240001071, 3) 

model summary below:
Layer (type)             Output Shape    Param #     Connected to
user_in (InputLayer)     (None, 1)       0
target_in (InputLayer)   (None, 1)       0
embedding_1 (Embedding)  (None, 1, 8)    57294152    user_in[0][0]
embedding_2 (Embedding)  (None, 1, 8)    36026632    target_in[0][0]
merge_1 (Merge)          (None, 1, 1)    0           embedding_1[0][0], embedding_2[0][0]
flatten_1 (Flatten)      (None, 1)       0           merge_1[0][0]
Total params: 93320784

Thanks! :slight_smile:

For such a shallow model, your GPU is being left idle.
Some things that I would try to make the GPU do more work :

  • Increase the batch size. It seems you’ve already done this; your batch size is extremely large.
  • Make the CPU do as little work as possible at load time. Save any augmentation-related work (if you’re doing any) to a bcolz array first, so that while the model is reading data, it can send it off to the GPU as fast as possible.

Other than that, it’s quite difficult to make such a shallow model use more GPU (it simply needs data to process). The CPU/disk will start bottlenecking.
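To illustrate the precompute-then-stream idea, here’s a minimal sketch using NumPy memory-mapping instead of bcolz (the pattern is the same; the file name and array sizes here are made up):

```python
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'users.npy')

# One-off preprocessing step: do the expensive CPU work (parsing,
# augmentation, etc.) once and persist the result to disk.
users = np.random.randint(0, 1000, size=300_000).astype('int64')
np.save(path, users)

# At training time: memory-map the file instead of re-reading the CSV,
# so fetching a batch is just a cheap slice copy, not a parse.
users_mm = np.load(path, mmap_mode='r')
batch_size = 2**18  # 262144, as in the post above
batch = np.asarray(users_mm[:batch_size])
print(batch.shape)  # (262144,)
```

bcolz works the same way conceptually (`bcolz.carray` with a `rootdir`, then chunked reads), with the added benefit of compression.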


Ah ok, thank you, that makes sense.

One more quick question - is there any reason not to increase batch size as much as possible? It’s faster but what’s the downside?

After a certain size, you won’t get much speed improvement. This is because of other bottlenecks, like the CPU and disk.

Most of the time, models are large enough to take as much as 11 GB while backpropagating, so only a batch size of, say, 3 is possible.

Again, your model is really too shallow to be able to use all of that 12 GB.
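A rough back-of-envelope check (my own arithmetic, using the parameter counts from the model summary earlier in the thread) shows why this model’s weights come nowhere near 12 GB:

```python
# Embedding parameter counts from the model summary above
user_emb   = 7161769 * 8   # user embedding: n_users x n_factors
target_emb = 4503329 * 8   # target embedding: n_targets x n_factors
total = user_emb + target_emb
print(total)               # 93320784, matching "Total params" above

# float32 weights at 4 bytes each
weight_mb = total * 4 / 2**20
print(round(weight_mb))    # roughly 356 MB
```

Even tripling that for Adam’s two extra moment buffers per weight only gets you to around 1 GB, so most of the 12 GB will sit idle no matter the batch size.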

Cool, makes sense - thanks!

The one thing that can help your GPU utilization is asynchronous mini-batch allocation.

Basically, it means that you prepare (compute) the next batch and upload the data to the GPU while it is processing the current one. There is special support for that in the CUDA library. See here: (great post BTW).

However, I don’t know whether Keras actually supports this.
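The overlap idea can be sketched in plain Python with a background thread and a bounded queue. This only overlaps batch preparation with consumption; a truly asynchronous host-to-GPU copy needs pinned memory via the CUDA API, which this sketch doesn’t attempt:

```python
import threading
import queue
import time

def prefetch(make_batch, n_batches, depth=2):
    """Prepare batches on a background thread while the consumer works."""
    q = queue.Queue(maxsize=depth)  # bounded: at most `depth` batches ahead

    def producer():
        for i in range(n_batches):
            q.put(make_batch(i))    # blocks if the consumer falls behind
        q.put(None)                 # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# Toy usage: "preparing" a batch takes 10 ms and "training" on it takes
# 10 ms; with prefetching the two largely overlap.
def make_batch(i):
    time.sleep(0.01)  # stand-in for CPU-side batch preparation
    return [i] * 4

out = []
for b in prefetch(make_batch, 5):
    time.sleep(0.01)  # stand-in for the GPU step
    out.append(b[0])
print(out)  # [0, 1, 2, 3, 4]
```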


fit_generator() indeed runs the data generator in parallel with the GPU. I am attaching the docs of the fit_generator() method:

Fits the model on data generated batch-by-batch by a Python generator.
The generator is run in parallel to the model, for efficiency.
For instance, this allows you to do real-time data augmentation
on images on CPU in parallel to training your model on GPU.
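For a two-input model like the one earlier in the thread, a generator in the shape fit_generator expects might look like this (a sketch with toy NumPy arrays; the commented fit_generator call assumes a compiled Keras 1 model named `model`):

```python
import numpy as np

def batch_generator(users, targets, labels, batch_size):
    """Yield ([user_batch, target_batch], label_batch) forever,
    which is the contract Keras generators must satisfy."""
    n = len(users)
    while True:  # Keras expects the generator to loop indefinitely
        idx = np.random.randint(0, n, size=batch_size)
        yield [users[idx], targets[idx]], labels[idx]

# Toy data standing in for the real id/label arrays
users   = np.arange(10000, dtype='int64')
targets = np.arange(10000, dtype='int64')
labels  = np.random.randint(0, 2, size=10000).astype('float32')

gen = batch_generator(users, targets, labels, batch_size=4096)
inputs, y = next(gen)
print(inputs[0].shape, y.shape)  # (4096,) (4096,)

# With a compiled model, training would look like (not run here):
# model.fit_generator(gen, samples_per_epoch=len(users), nb_epoch=1)
```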


Does it also upload the new batch to the GPU asynchronously? Like this:

My GPU usage fluctuates a lot during training. It rises to almost 100% and drops, sometimes to 0%, in a cycle of usually about 2 seconds. I tried different batch sizes but did not see any notable difference. Is this normal? I notice, however, that when I increase my batch size a lot, my CPU usage goes down a bit (from almost 100%). Any suggestions?

I would say it’s pretty normal.

In my case, when I run on an AWS p3, GPU utilization averages around 95%, but when I run on my local PC with an 8700K and a Titan Xp, it fluctuates between 60% and 100%, averaging around 80%. Any ideas why that is and how I can improve it? I am already using the largest batch size that fits in GPU memory.