Performance Issue during training [Keras + TF backend + GTX 1080 Ti]

I am trying to solve a performance problem I am facing when running the MNIST script (notebook mnist.ipynb).
The first `fit_generator` call takes about 50 seconds on my computer, compared to about 5 s in the notebook. Since this was surprising, I investigated further and found a benchmark (Keras Backend Performance Benchmark) reporting the same kind of results as the notebook (4 to 5 seconds per epoch with my type of GPU).
The computer I am using is described below (I have added the versions of the installed Python libraries):

o OS: Windows 10 64 bits
o CPU: Intel I7-7700K
o RAM: 16GB
o GPU: Nvidia GTX 1080 Ti
o SSD 500GB
o Python 3.5.2
o Keras 2.0.8
o Tensorflow 1.1.0
o Numpy 1.13.3

I tried the training on a different computer with similar hardware specifications and found the same kind of performance.
The Keras configuration file `keras.json` contains:

{
"backend": "tensorflow",
"image_data_format": "channels_last",
"epsilon": 1e-07,
"floatx": "float32"
}
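As a quick sanity check (a sketch, independent of the slowdown itself), the file above can be validated as strict JSON; note that curly quotes, which copy-paste sometimes introduces, make the file unparseable:

```python
import json

# Expected contents of ~/.keras/keras.json. Strict JSON requires
# straight double quotes; curly quotes break parsing with a ValueError.
raw = '''{
"backend": "tensorflow",
"image_data_format": "channels_last",
"epsilon": 1e-07,
"floatx": "float32"
}'''

config = json.loads(raw)
print(config["backend"])
print(config["image_data_format"])
```

Running this confirms the backend and data format Keras will pick up at import time.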

I am wondering whether a hardware issue could explain the difference in performance, or whether there is a problem somewhere in the Python/CUDA configuration.
If you have any information to share about this, please let me know.

Best regards.

Note: below is the output of the MNIST training script when run from a command-line prompt.

D:\share>python MNIST.py
Using TensorFlow backend.
N train = 60000, N test = 10000, H = 28, W = 28
(60000, 28, 28)
(10000, 28, 28)
(60000, 28, 28, 1)
(10000, 28, 28, 1)
Train on 60000 samples, validate on 10000 samples
Epoch 1/1
2017-10-04 18:13:35.779914: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.780035: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.780308: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.780879: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.781247: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.781694: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.782015: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:35.782414: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-04 18:13:36.013740: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 11.00GiB
Free memory: 9.12GiB
2017-10-04 18:13:36.013881: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:908] DMA: 0
2017-10-04 18:13:36.014910: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:918] 0: Y
2017-10-04 18:13:36.015295: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
60000/60000 [==============================] - 59s - loss: 0.1328 - acc: 0.9603 - val_loss: 0.0388 - val_acc: 0.9876
9920/10000 [============================>.] - ETA: 0s0.0388096875181 0.9876

Hi,
I have a similar DL setup (except that I run Ubuntu) and had a similar performance issue when training for the Cats vs. Dogs competition. Changing the backend from TensorFlow to Theano gave me performance similar to that mentioned in the notebook. I am not sure about the warnings you are getting, though.

@himanshu
Thank you for your reply.
I tested a Chainer script with the same MNIST model and got the expected performance, so I think the hardware is fine; based on the information you gave, the problem seems to lie with the TensorFlow library.
I will try updating to the latest TensorFlow version and see where that gets me.
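For anyone comparing epoch times across machines or TensorFlow versions, a minimal timing harness can rule out measurement noise. This is only a sketch: `train_one_epoch` here is a pure-Python placeholder standing in for your actual `model.fit(...)` call on one epoch.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Placeholder workload; replace with a single-epoch model.fit(...) call.
def train_one_epoch():
    return sum(i * i for i in range(100000))

result, elapsed = time_call(train_one_epoch)
print("epoch took %.3f s" % elapsed)
```

Timing the same call before and after a TensorFlow upgrade makes the comparison with the notebook's 4-5 s per epoch concrete.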
Thank you again.