Strange behavior of batch size and batchnorm

Hey team,

I'm working on a medical segmentation task and the results look rather promising, with Dice scores above 96%. While tweaking the architecture I noticed two strange behaviors. First, reducing the batch size from 32 to 16 drops the score to 92.5%. I haven't seen anything like this before and don't quite understand why it happens. Any ideas? I have to reduce the batch size because of RAM limitations when adding batchnorm.
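One possible factor worth checking: with batchnorm in the model, the per-batch mean and variance estimates get noisier as the batch shrinks, roughly by a factor of sqrt(32/16) ≈ 1.41. A minimal numpy sketch (the activation distribution and function names are hypothetical, just to illustrate the effect):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations for one channel: true mean 0, true std 1.
activations = rng.normal(0.0, 1.0, size=10_000)

def batch_stat_spread(batch_size, n_batches=1000):
    """Std of per-batch means: how noisy BN's batch statistics are."""
    means = [rng.choice(activations, size=batch_size).mean()
             for _ in range(n_batches)]
    return float(np.std(means))

# Smaller batches -> noisier normalization statistics.
print(batch_stat_spread(32))
print(batch_stat_spread(16))
```

If the score drop only appears once batchnorm is in the network, this extra noise in the normalization statistics (on top of the noisier gradients that smaller batches already cause) would be a plausible suspect.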

The second strange thing is that batchnorm does improve results a little, but it also overfits much more heavily right from the start.

No BN:
Epoch 1/100
1916/1916 [==============================] - 45s - loss: -0.6758 - val_loss: -0.7485
Epoch 2/100
1916/1916 [==============================] - 43s - loss: -0.8538 - val_loss: -0.7376
Epoch 3/100
1916/1916 [==============================] - 43s - loss: -0.8777 - val_loss: -0.8587

BN:
Epoch 1/100
1916/1916 [==============================] - 69s - loss: -0.8081 - val_loss: -0.5754
Epoch 2/100
1916/1916 [==============================] - 67s - loss: -0.9111 - val_loss: -0.3100
Epoch 3/100
1916/1916 [==============================] - 67s - loss: -0.9315 - val_loss: -0.6321

So yes, with BN it trains faster, but it also overfits far more heavily. In the end BN gives better results, but the initial epochs make me wonder if something else is going on.

I'm using the latest version of Keras on the TF backend, and the architecture is a U-Net variant. Oh, and the dice loss approaches -1 as the best possible score.
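For readers unfamiliar with that sign convention: a common formulation is the negative Dice coefficient, which reaches -1 at perfect overlap. A numpy sketch (a Keras version would use backend ops; the `smooth` term is a typical choice, not necessarily what's used here):

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """Negative Dice coefficient: -1 means perfect overlap, ~0 means none."""
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    intersection = np.sum(y_true * y_pred)
    return -(2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

mask = np.array([[1, 0], [0, 1]], dtype=float)
print(dice_loss(mask, mask))  # -1.0 for a perfect prediction
```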