Out of Memory Even When Attempting To Train on Small Batch of Large Dataset


#1

I’m attempting to apply the Lesson 4 principles to an old (2012) Kaggle competition. Specifically, I’m training a model to predict whether a user would click on a given ad. The training file has over 140 million rows, and I’m able to load all of them into a DataFrame. However, I run out of memory when I run the learning rate finder (i.e., the data fits in system memory but is too much for the GPU).

So I loaded only 10 million rows and was able to run training end to end. Thinking batches of 10 million rows at a time would be a safe bet for training my model, I loaded 20 million rows and attempted to train the model on the first 10 million. However, I get an out-of-memory error (I also tried adjusting the batch down to only 1 million rows, but still get an out-of-memory error). It’s as if the model attempts to load all 20 million rows into the GPU’s memory anyway, instead of only the selected rows that I pass into it.

Would greatly appreciate any help (notebook attached).

Thank you

kddCup2012.pdf (56.9 KB)


#2

A couple of things. First, the data, once expanded into all 350 embeddings, is going to be much wider than the original DataFrame. Second, you likely have more system memory than GPU memory.

I’ve built a structured model about twice as wide as yours, but I have been running a batch size of 2048. I think 10 million rows x 350 columns is expecting a lot from your GPU.
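
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every value is stored as a 4-byte float32 and counts only a single dense rows x columns matrix. The function name `dense_matrix_gib` is made up for this illustration, and real GPU usage adds weights, gradients, and optimizer state on top.

```python
# Rough size of a dense float32 matrix of shape (rows, columns).
# Assumption: 4 bytes per value (float32); this ignores model weights,
# gradients, and optimizer state, which all add to real GPU usage.
BYTES_PER_FLOAT32 = 4


def dense_matrix_gib(n_rows: int, n_cols: int) -> float:
    """Return the size in GiB of an n_rows x n_cols float32 matrix."""
    return n_rows * n_cols * BYTES_PER_FLOAT32 / 2**30


full_set = dense_matrix_gib(10_000_000, 350)  # all 10 million rows at once
mini_batch = dense_matrix_gib(2048, 350)      # one mini-batch of 2048 rows

print(f"10M rows x 350 cols:  {full_set:.1f} GiB")
print(f"2048 rows x 350 cols: {mini_batch * 1024:.2f} MiB")
```

Under those assumptions the full 10 million x 350 matrix comes to roughly 13 GiB, while a single 2048-row mini-batch is only a few MiB.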


#3

That’s not the issue. I’m able to run the model on 10 million rows with 354 embeddings (see the second notebook, attached to this comment, where I successfully load and train on 10 million rows).

Batch size isn’t the issue here either (a batch size of 4096 consumes ~7 GB of GPU RAM).

The issue arises when I load 20 million rows but attempt to train only on the first 10 million. Logically, the GPU should have the same workload as if I had loaded 10 million rows and trained on all 10 million, no? And if that’s the case, why am I getting an out-of-memory error?
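
One possibility worth ruling out, offered as an assumption rather than something confirmed in this thread: if the embedding tables are sized from the categories found in the whole loaded DataFrame rather than from the rows selected for training, then loading 20 million rows produces larger tables even when only the first 10 million rows are trained on. The sketch below illustrates the arithmetic with made-up cardinalities, a hypothetical helper `embedding_table_gib`, and an assumed width rule of min(50, (c + 1) // 2).

```python
# Hypothetical illustration: embedding table size depends on the number of
# distinct categories per column, not on how many rows are used for training.
# All cardinalities below are invented; they are not from the KDD Cup 2012 data.

def embedding_width(cardinality: int, max_width: int = 50) -> int:
    """Assumed width heuristic: half the cardinality, capped at max_width."""
    return min(max_width, (cardinality + 1) // 2)


def embedding_table_gib(cardinalities, bytes_per_value: int = 4) -> float:
    """Total float32 size in GiB of one embedding table per column."""
    total_bytes = sum(
        c * embedding_width(c) * bytes_per_value for c in cardinalities
    )
    return total_bytes / 2**30


cards_10m = [3_000_000, 500_000, 20_000]  # distinct values seen in 10M rows
cards_20m = [5_500_000, 800_000, 22_000]  # distinct values seen in 20M rows

print(f"tables sized from 10M rows: {embedding_table_gib(cards_10m):.2f} GiB")
print(f"tables sized from 20M rows: {embedding_table_gib(cards_20m):.2f} GiB")
```

If this applies, comparing the number of distinct categories per column between the 10 million and 20 million row loads should show the difference. Reducing the number of training rows would not help in that case, because the tables are allocated in full regardless of which rows are fed through them.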

kddCup2012_Load10millionRows.pdf (111.8 KB)


#4

Oh, OK. I read ‘adjusting the batch down to only 1 million rows’ and jumped to the wrong interpretation of batch.