Problem with fit_generator in Keras 2 using TensorFlow

Just attempting to build a simple linear classifier for the StateFarm competition against a sample dataset (training = 3000 examples, validation = 1000 examples). Using Keras 2 and TF.

The code below takes FOREVER to run (in fact, I can’t get it to complete). I’m not sure if I’m misunderstanding something and/or have misconfigured something with the new API, but given that my model is a simple linear one, I would expect it to run very fast given the number of examples in my sample datasets.

batch_size = 4

lm.fit_generator(train_batches, steps_per_epoch=train_batches.n // batch_size, epochs=2,
                 validation_data=val_batches, validation_steps=val_batches.n // batch_size)

Any ideas what is going on?

Weird and counter-intuitive … but if I increase the batch_size to 64 it runs fast.

Why?

It is my understanding that a smaller batch size is computationally less intensive and so recommended when running on a machine with limited hardware (e.g., smaller GPU, less RAM). At least when using Keras 1.x and Theano, lowering the batch size was always the answer to speed up your model fitting and alleviate out of memory exceptions.

I think there could be 2 reasons why you are facing this issue.

  1. If you are using smaller batches, then the GPU is not utilising its full memory.

  2. Another likely problem is that the GPU is sitting idle between steps, waiting for the CPU-side generator to produce the next batch; with very small batches this per-step overhead dominates.

You can try two different things. Increase the batch size like you did, and also enable parallel data loading in fit_generator to speed up your preprocessing: pass workers=4 together with either pickle_safe=True or, in later Keras 2 releases, use_multiprocessing=True (the argument was renamed).
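A sketch of that call, reusing the names from the original post (`lm`, `train_batches`, `val_batches`) and assuming a Keras version where the flag is called `use_multiprocessing`:

```python
# Sketch only: assumes lm, train_batches, val_batches, and batch_size
# are defined as in the original post, and a recent Keras 2 install.
lm.fit_generator(
    train_batches,
    steps_per_epoch=train_batches.n // batch_size,
    epochs=2,
    validation_data=val_batches,
    validation_steps=val_batches.n // batch_size,
    workers=4,                 # parallel generator workers to keep the GPU fed
    use_multiprocessing=True,  # older Keras 2 releases call this pickle_safe
)
```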

Think about this line: steps_per_epoch=train_batches.n/batch_size. The training set stays the same size, but when you increase batch_size to 64 you run 16 times fewer steps per epoch.
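To put numbers on that, using the 3,000-example training sample from the original post:

```python
train_n = 3000  # training examples in the sample set (from the original post)

steps_small = train_n // 4    # steps per epoch at batch_size=4
steps_large = train_n // 64   # steps per epoch at batch_size=64

print(steps_small, steps_large, steps_small // steps_large)  # 750 46 16
```

So each epoch at batch_size=4 launches roughly 16 times as many steps, and the fixed per-step overhead (progress-bar updates, generator calls, GPU kernel launches) is paid 16 times as often.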


I understand that, but hasn’t that always been the case even with Keras 1.x? A smaller batch size necessarily requires more iterations through the training set to finish off a single epoch.

Interestingly enough, I found some posts on the web that seem to indicate the problem has something to do with Keras 2 and Jupyter notebooks; namely that the default “verbose” setting of 1 is creating some kind of race condition when there are a lot of steps.

Simply changing verbose=2 fixes this problem, for me at least.
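For reference, this is the only change needed, mirroring the call from the original post (sketch only, same assumed names):

```python
# verbose=2 prints one summary line per epoch instead of a per-step
# progress bar, which avoids flooding the Jupyter notebook output.
lm.fit_generator(train_batches, steps_per_epoch=train_batches.n // batch_size, epochs=2,
                 validation_data=val_batches, validation_steps=val_batches.n // batch_size,
                 verbose=2)
```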