I was not happy with deep learning training performance, especially on an old Z800. It turns out that the CPU cannot prepare batches fast enough to keep the GPU busy; more importantly, I figured out how to work around that bottleneck. With data augmentation enabled, an epoch now takes only 70 seconds on my system, a huge improvement.
Caveat: I am doing part 1 using a Python 3 environment and Keras 2, which has a slightly different API. This approach might also work with Keras 1 (it’s really, really simple: just add workers=N to fit_generator), but I didn’t test it (I don’t know whether Keras 1 provides threadsafe iterators).
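To make the one-line change concrete, here is a minimal, self-contained sketch assuming Keras 2 with a standard backend. The model, the tiny synthetic dataset, and the worker count of 4 are illustrative placeholders, not values from my actual setup:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.preprocessing.image import ImageDataGenerator

# Tiny synthetic "image" dataset standing in for real training data.
x_train = np.random.rand(256, 16, 16, 3).astype('float32')
y_train = np.random.randint(0, 2, size=(256, 1))

# ImageDataGenerator's iterators are threadsafe in Keras 2, so several
# worker threads can pull augmented batches in parallel.
gen = ImageDataGenerator(rotation_range=10, horizontal_flip=True)
batches = gen.flow(x_train, y_train, batch_size=64)

# A trivial model just to demonstrate the call.
model = Sequential([Flatten(input_shape=(16, 16, 3)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# The one-line change: workers=N runs batch preparation (including
# augmentation) in parallel on the CPU so the GPU is never starved.
model.fit_generator(batches, steps_per_epoch=len(x_train) // 64,
                    epochs=1, workers=4, verbose=0)
```

With a real dataset you would typically use flow_from_directory instead of flow; the workers argument applies the same way.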
My notebook contains more information on this approach: