Why is my loss coming down very slowly?


I am training a Siamese network using a variant of AlexNet (I added some Inception-like features, e.g. depth concatenation instead of a plain 11x11 conv layer).

Hardware: K80 GPU on Amazon AWS

The goal is face identification.

Dataset: FaceScrub, duly requested and obtained from the original authors. I am using only half of the dataset due to space constraints.

Augmentation: eye detection and face alignment, zoom in, zoom out, Gaussian noise, clockwise and anti-clockwise rotation, and random combinations of these.

Training always starts with 50% of the samples coming from the augmented set (at least one of the images in each Siamese triplet is from the augmented set).
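The batch composition above can be sketched roughly like this (a minimal illustration with toy triplet IDs; the function name and the way triplets are stored are my assumptions, not the actual training code):

```python
import random

def build_batch(clean_triplets, aug_triplets, batch_size, aug_fraction=0.5):
    """Compose a batch where aug_fraction of the triplets contain at
    least one augmented image (hypothetical helper, not the real code)."""
    n_aug = int(batch_size * aug_fraction)
    batch = random.sample(aug_triplets, n_aug)
    batch += random.sample(clean_triplets, batch_size - n_aug)
    random.shuffle(batch)  # mix augmented and clean triplets
    return batch

# Toy triplets: (anchor, positive, negative) image IDs; "A..." marks
# triplets whose anchor comes from the augmented set.
clean = [("a%d" % i, "p%d" % i, "n%d" % i) for i in range(100)]
aug = [("A%d" % i, "p%d" % i, "n%d" % i) for i in range(100)]
batch = build_batch(clean, aug, batch_size=32)
print(len(batch))  # 32, of which 16 triplets include an augmented image
```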

An L2 loss term is also added (it accounts for about 15% of the total loss at the start of training).
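For concreteness, here is a minimal NumPy sketch of a triplet loss plus an L2 term, assuming the L2 term is weight regularization and the margin value is arbitrary (both are my assumptions about the setup, not taken from the post):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding batches (margin is an assumed value)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared anchor-negative distance
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

def total_loss(anchor, positive, negative, weights, l2_coeff):
    """Triplet loss plus an L2 weight penalty; l2_coeff would be tuned so the
    penalty is roughly 15% of the total at the start of training."""
    l2 = l2_coeff * sum(np.sum(w ** 2) for w in weights)
    return triplet_loss(anchor, positive, negative) + l2
```

If the 15% share comes purely from the regularizer, a large part of the early loss drop can simply be weight shrinkage rather than better embeddings, which is worth checking when judging training progress.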

My layer architecture looks like this:
NOTE: each conv layer is followed by ReLU and a 3x3 max pool with stride 2, unless noted otherwise
c1a - 3x3 conv layer on the grayscale input
LRN - local response normalization
c1b - 5x5 conv layer on the grayscale input
LRN - local response normalization
c1 - depth concat of c1a and c1b
c2 - 5x5 conv layer on c1
c3 - 3x3 conv on c2, no max pool
c4 - 3x3 conv on c3, no max pool
c5 - 3x3 conv layer on c4
Flatten c5
Dropout at 50%
Dense layer with sigmoid
Dropout at 50%
Dense layer with sigmoid to produce the facial embedding

My problem is that even after 16,000 batches, the loss has only come down from 3.5 to 1.3. It has been running for more than 24 hours now.

Should I continue this training or abort it?

If you look at the gradients below, the layers c1a, c1b and the like have the largest gradients (both the per-layer max and the mean values), while the layers closer to the output have smaller gradients. This seems counter-intuitive to me: I expected the layers closest to the output to have the largest gradients, with magnitudes shrinking toward the input (the usual vanishing-gradient picture), but I observe the opposite. What do you guys think?

I use the Adam optimizer. The gradients below were calculated using the "computeGradients" API call.
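The per-layer statistics in the log can be reproduced with a small helper like the one below (a sketch; the function name is mine, and the commented TF1-style usage assumes an `optimizer.compute_gradients(loss)`-style API):

```python
import numpy as np

def grad_stats(name, grad):
    """Summarize a gradient tensor the way the log below does."""
    g = np.asarray(grad, dtype=np.float64).ravel()
    return {"name": name, "mean": g.mean(), "std": g.std(),
            "median": np.median(g), "max": g.max(), "min": g.min()}

# With a TF1-style optimizer this would be fed from compute_gradients, e.g.:
#   for grad, var in optimizer.compute_gradients(loss):
#       print(grad_stats(var.name, sess.run(grad, feed_dict=...)))
s = grad_stats("c1aFilter:0", [[1.0, -1.0], [3.0, 1.0]])
print(s["mean"], s["max"], s["min"])  # 1.0 3.0 -1.0
```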

GRADIENTLOSS at iteration 16000
GRADIENTs for VECTOR/c1aFilter:0
Mean: 0.010530296,Std: 0.08831231,Median: 0.0037496341,Max: 0.34390756,Min: -0.20923898
Mean: 0.014000396,Std: 0.1359879,Median: 0.023553304,Max: 0.36028913,Min: -0.29602805
GRADIENTs for VECTOR/c1bFilter:0
Mean: 0.003996292,Std: 0.060440883,Median: -0.0041986424,Max: 0.22272776,Min: -0.14110538
Mean: 0.017228054,Std: 0.1255607,Median: 0.0052211885,Max: 0.3601057,Min: -0.23093922
GRADIENTs for VECTOR/c1cFilter:0
Mean: -0.030846026,Std: 0.13179947,Median: -0.0036058915,Max: 0.9246183,Min: -1.4013114
Mean: -0.19204378,Std: 0.4582537,Median: -0.20062844,Max: 1.1991782,Min: -1.1817778
GRADIENTs for VECTOR/c2Filter:0
Mean: 0.00085157086,Std: 0.015450722,Median: 1.1536713e-34,Max: 0.17678176,Min: -0.107637085
Mean: -0.00999434,Std: 0.17034349,Median: 0.01284295,Max: 0.5719912,Min: -0.5927037
GRADIENTs for VECTOR/c3Filter:0
Mean: -6.849167e-05,Std: 0.004842096,Median: -1.1726105e-34,Max: 0.07113129,Min: -0.09903844
Mean: 0.00089150603,Std: 0.04053916,Median: 0.0,Max: 0.15465239,Min: -0.12784223
GRADIENTs for VECTOR/c4Filter:0
Mean: 3.3387943e-05,Std: 0.0011260371,Median: 5.408392e-35,Max: 0.008018609,Min: -0.0108721405
Mean: 2.0310885e-05,Std: 0.008101722,Median: 0.0,Max: 0.0338539,Min: -0.024426486
GRADIENTs for VECTOR/c5Filter:0
Mean: -0.00012642296,Std: 0.0016604387,Median: -1.2935101e-34,Max: 0.012831819,Min: -0.016537711
Mean: -0.0005427574,Std: 0.0049887337,Median: -0.00052600756,Max: 0.011803339,Min: -0.013275687
GRADIENTs for VECTOR/dense/kernel:0
Mean: 2.623862e-06,Std: 0.00025185716,Median: 3.61513e-35,Max: 0.0032541258,Min: -0.0039389245
GRADIENTs for VECTOR/dense/bias:0
Mean: 5.296417e-06,Std: 0.0002104719,Median: 6.043442e-06,Max: 0.0006891279,Min: -0.0009231151
GRADIENTs for VECTOR/dense2/kernel:0
Mean: -1.4274097e-05,Std: 0.0006795507,Median: -7.807914e-06,Max: 0.004824273,Min: -0.0046136547
GRADIENTs for VECTOR/dense2/bias:0
Mean: -4.609408e-05,Std: 0.00067597226,Median: -0.00010044244,Max: 0.0018383545,Min: -0.0024985338
Testing Loss = 1.3011401