Need troubleshooting tips for degraded validation speed

jp_beaudry · July 10, 2017, 4:48am

I humbly come to you looking for troubleshooting advice on how to correct degraded performance on my system, undoubtedly due to uncontrolled changes in my environment.

In short
Training (aka batch processing) is still as fast as it was when I started and on par with other posts from lessons 1 and 2. But validation now takes forever (about 2s per image), so it’s unworkable even on the sample dataset.

Question: what happens during validation that I might have sabotaged?

In long
My setup: AWS P2, fastai AMI, Python 2.7, Keras 2

In my Jupyter notebook, the batches get processed very fast until the last one, where it hangs for a long time. For example:

I’m confident the GPU is being used for both training and validation. Below is a screenshot with nvidia-smi and htop output. I suspect the GPU is asked to do something inefficient/dumb, but I don’t know what it is.

Here is a notebook screenshot hinting that cudnn and cnmem are enabled:

I did some basic Python profiling, using lprun, to understand where the time is spent. I think it has confirmed what is plain to see in Jupyter: most of the time is spent in test_on_batch() (keras/engine/training.py) and not very much in train_on_batch() (same source file).

Next steps

I understand I have the nuclear option: wipe out my EC2 instance and restart. I may still do that.

But this feels debuggable. I just don’t know where to go next. Thanks in advance for your advice.

jp_beaudry · July 13, 2017, 2:27am

I have finally found the issue with my setup. Despite the shame and hurt ego, I’m posting my solution here in case it helps someone else.

The issue was indeed with uncontrolled, thoughtless changes to the call to self.model.fit_generator() in vgg16.py

At one point, I got annoyed with all the Keras warnings about not abiding by the v2 API, so I started making changes. When I realized I had severely degraded my training/testing performance, I rolled back my changes, but clumsily and incompletely.

At the root of my problem is that the change between Keras 1 and 2 is not only in the parameter names but also in the values they expect. Specifically, Keras 1 has samples_per_epoch and nb_val_samples, which accept the number of training and validation samples one has. But in Keras 2, the parameters steps_per_epoch and validation_steps want the number of steps (batches) needed to go through the entirety of batches and validation_data, respectively.

See the different signature with my parameters below.

#Class default
self.model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=nb_epoch, validation_data=val_batches, nb_val_samples=val_batches.nb_sample)

#Compliant with Keras 2 API 
self.model.fit_generator(batches, steps_per_epoch=((batches.samples/batches.batch_size)+1), epochs=nb_epoch, validation_data=val_batches, validation_steps=((val_batches.samples/val_batches.batch_size)+1))

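My error was that I was passing the number of validation samples to the validation_steps parameter. Validation then ran one step per sample instead of one step per batch, so each pass processed batch_size times more images than needed. The net effect looked like testing with an effective batch size of 1.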
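For anyone converting their own call, here is a minimal sketch of the step arithmetic. The helper name steps_for is hypothetical (it is not part of Keras or the course code), and it rounds up rather than always adding 1, so it does not schedule an extra step when the sample count divides evenly by the batch size.

```python
import math


def steps_for(num_samples, batch_size):
    """Hypothetical helper: number of batches needed to see every sample once.

    float() forces true division, so this also behaves under Python 2.7,
    where int / int truncates.
    """
    return int(math.ceil(num_samples / float(batch_size)))


# Passing the sample count as the step count runs one batch per sample,
# i.e. batch_size times more work than a single pass over the data.
num_val_samples, batch_size = 2000, 64
correct_steps = steps_for(num_val_samples, batch_size)
mistaken_steps = num_val_samples
```

With a Keras 2 generator this would presumably be passed as validation_steps=steps_for(val_batches.samples, val_batches.batch_size), and likewise for steps_per_epoch.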

What makes this more difficult to troubleshoot is that Keras has no progress indicator or count for the validation part of fit_generator(). In the training portion, you at least see how many steps the routine will take. Not so for validation.

I probably would have caught my issue sooner if I had more carefully read this awesome post by kzuiderveld.

Not only does it show the right parameters for calling fit_generator() in Keras 2, it also shows a visual troubleshooting procedure.