Cross entropy error: -log(1-1) == INF

Edit:
Normalizing the input solved my problem.


Old post:


Situation:
My network outputs a fully confident guess of 1.0 for a class whose target is 0, which leads to a numerical problem: an infinite loss. This happens on 40% of my examples.

Question:
Is there a better way to avoid this than clipping some mass off of any guess that’s too confident?

Details:
@“Numerical problem”:
When the target is 0, my loss term is -log(1-guess). Since my guess in this case is 1, I get an infinity:
-log(1-1) == -log(0) == -(-INF) == INF
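
You can reproduce the infinity directly:

import numpy as np

guess = 1.0
print(-np.log(1 - guess))  # RuntimeWarning: divide by zero encountered in log; prints inf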

@“Clipping”:
e.g. I could convert 1.0 into 0.99999 and 0.0 into 0.00001 to avoid the infinities.
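
A minimal sketch of that clipping (the 1e-5 margin matches the example above; the exact epsilon is a judgment call):

import numpy as np

EPS = 1e-5  # margin matching the 0.99999 / 0.00001 example above

def clip_outputs(outputs, eps=EPS):
    # squeeze predictions into [eps, 1 - eps] so log never sees exactly 0
    return np.clip(outputs, eps, 1.0 - eps)

# clip_outputs(np.array([0.0, 0.5, 1.0])) -> array([1e-05, 0.5, 0.99999])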

Code:
Below is a network with no hidden layers and no backprop. It forward-propagates one example and evaluates the loss. (X and y are assumed to be loaded already: flattened images and one-hot labels.)

import numpy as np
from math import sqrt

def softmax(before_activation):
    # naive softmax over the raw scores (no max-subtraction for numerical stability)
    denom = sum(np.exp(val) for val in before_activation)
    return [np.exp(val) / denom for val in before_activation]

def average_cross_entropy(outputs, targets):
    # per-class binary cross-entropy against a one-hot target, averaged
    cross_entropies = [] # one for each class
    for idx in range(len(targets)):
        output = outputs[idx]
        target = targets[idx]
        if target == 1:
            cross_entropy = -np.log(output)      # -log(guess) for the true class
        elif target == 0:
            cross_entropy = -np.log(1-output)    # -log(1-guess) for the others
        cross_entropies.append(cross_entropy)
    return np.mean(cross_entropies)

NB_CLASSES = 10
DIM = 28

nb_inputs = DIM**2
weight_vectors = []
for idx in range(NB_CLASSES):
    # scale by 1/sqrt(fan-in) to keep the initial activations modest
    weight_vector = np.random.randn(nb_inputs) / sqrt(nb_inputs)
    weight_vectors.append(weight_vector)
W_xy = np.stack(weight_vectors)
b_xy = np.zeros(NB_CLASSES)

inp = X[0] # X assumed loaded: flattened DIM*DIM images
before_activation = np.dot(W_xy, inp) + b_xy # raw class scores
outputs = softmax(before_activation)
targets = y[0] # y assumed loaded: one-hot encoded labels
loss = average_cross_entropy(outputs, targets)
print(loss)
print(before_activation)
print(outputs)

Output:

inf

[-43.0171411228,
-32.4964130702,
-69.6699137943,
61.9357325336,
-16.1263929412,
83.2185173319,
-65.9866026591,
69.332167027,
86.1949970892,
142.178409547]

[3.720439707675717e-81,
 1.379394245020582e-76,
 9.8956009181180396e-93,
 1.4159507887560133e-35,
 1.7745891439322245e-69,
 2.4776739641183605e-26,
 3.9362608740754598e-91,
 2.3082287497671358e-32,
 4.860857519397263e-25,
 1.0]
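
Sanity check on why the softmax saturates: the largest raw score (~142.18) beats the runner-up (~86.19) by about 56, and exp(-56) is about 4.9e-25, which matches the second-largest output above:

import numpy as np
print(np.exp(86.1949970892 - 142.178409547))  # ~4.86e-25, the runner-up output

With the raw scores spread over a range of ~200, a top output of exactly 1.0 (to float64 precision) is expected.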

My plan:

  • Add a hidden layer with ReLUs, and see if it helps
  • Maybe the ReLUs and a second weight matrix will stop the extreme outputs
  • If it doesn’t, add batchnorm (with and without the hidden layer)
  • I think this will work, since dividing by the standard deviation seems to pull the activations together aggressively enough to make outputs of exactly 0.0 or 1.0 very unlikely.
  • However, I haven’t seen anyone use batchnorm on the output layer, and someone on reddit said it was a bad idea, so maybe I should add a hidden layer and apply batchnorm to that instead.
  • Normalize the input? (A sketch of what I mean is below this list.)
  • When Jeremy built his no-hidden-layer model in the statefarm-sample notebook, he used batchnorm on the input.
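
For reference, a minimal sketch of the input normalization idea (assuming simple per-feature standardization over the training set; this is the fix reported in the edit at the top, though the exact scheme may differ):

import numpy as np

def standardize(X, eps=1e-8):
    # zero mean, unit variance per input feature;
    # eps guards against division by zero for constant pixels
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)

# X_norm = standardize(X) # then run the forward pass above on X_norm[0]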