# Cross entropy error: -log(1-1) = INF

Edit:
Normalizing the input solved my problem.
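For anyone hitting the same problem: the fix was standardizing the pixels before the forward pass. A minimal sketch of what I mean (the fake data and variable names are mine; in practice the mean and std come from the training set):

```python
import numpy as np

# Fake MNIST-style data standing in for the real X
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 784)).astype(np.float64)

mean = X.mean()
std = X.std()
X_norm = (X - mean) / std  # roughly zero mean, unit variance

print(X_norm.mean(), X_norm.std())
```

With inputs on this scale, the pre-activations stay small enough that softmax no longer saturates to exactly 1.0.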

Old post:

Situation:
My network is outputting a very confident guess of 1.0 when the target is 0. This leads to a numerical problem. This happens 40% of the time.

Question:
Is there a better way to avoid this than clipping some mass off of any guess that’s too confident?

Details:
@“Numerical problem”:
When the target is 0, the loss term is `-log(1-guess)`. Since my guess in this case is exactly 1, I get an infinity:
`-log(1-1) == -log(0) == -(-INF) == INF`
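You can reproduce the blow-up directly in NumPy (it also raises a `divide by zero encountered in log` RuntimeWarning, silenced here):

```python
import numpy as np

guess = 1.0
with np.errstate(divide="ignore"):  # silence the divide-by-zero warning
    loss = -np.log(1 - guess)
print(loss)  # inf
```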

@“Clipping”:
e.g. I could convert 1.0 into 0.99999 and 0.0 into 0.00001 to avoid the infinities.
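The clipping I have in mind would look something like this (the epsilon value is arbitrary):

```python
import numpy as np

EPS = 1e-5  # arbitrary small margin

def clip_outputs(outputs):
    # Squeeze predictions into [EPS, 1 - EPS] so that neither
    # log(p) nor log(1 - p) can ever hit log(0)
    return np.clip(outputs, EPS, 1.0 - EPS)

outputs = np.array([0.0, 0.3, 1.0])
clipped = clip_outputs(outputs)
print(clipped)                 # all values now strictly inside (0, 1)
print(-np.log(1 - clipped))    # finite now
```

It works, but it feels like a hack, hence the question above.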

Code:
Below is a network with no hidden layers and no backprop. It forward props one example and evaluates the loss.

```python
import numpy as np
from math import sqrt

def softmax(before_activation):
    denom = sum(np.exp(val) for val in before_activation)
    return [np.exp(val) / denom for val in before_activation]

def average_cross_entropy(outputs, targets):
    cross_entropies = []  # one for each class
    for idx in range(len(targets)):
        output = outputs[idx]
        target = targets[idx]
        if target == 1:
            cross_entropy = -np.log(output)
        elif target == 0:
            cross_entropy = -np.log(1 - output)
        cross_entropies.append(cross_entropy)
    return np.mean(cross_entropies)

NB_CLASSES = 10
DIM = 28

nb_inputs = DIM**2
weight_vectors = []
for idx in range(NB_CLASSES):
    weight_vector = np.random.randn(nb_inputs) / sqrt(nb_inputs)
    weight_vectors.append(weight_vector)
W_xy = np.stack(weight_vectors)
b_xy = np.zeros(NB_CLASSES)

inp = X[0]  # flattened 28x28 image; X and y are loaded elsewhere
before_activation = np.dot(W_xy, inp) + b_xy
outputs = softmax(before_activation)
targets = y[0]  # one-hot encoded
loss = average_cross_entropy(outputs, targets)
print(loss)
print(before_activation)
print(outputs)
```

Output:

```
inf

[-43.0171411228,
-32.4964130702,
-69.6699137943,
61.9357325336,
-16.1263929412,
83.2185173319,
-65.9866026591,
69.332167027,
86.1949970892,
142.178409547]

[3.720439707675717e-81,
1.379394245020582e-76,
9.8956009181180396e-93,
1.4159507887560133e-35,
1.7745891439322245e-69,
2.4776739641183605e-26,
3.9362608740754598e-91,
2.3082287497671358e-32,
4.860857519397263e-25,
1.0]
```
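One alternative to clipping that I've seen mentioned: compute the loss directly from the logits with a log-sum-exp, so the probability is never materialized and `log(0)` never happens. A sketch, using (rounded) values from my output above. Note this is the standard multiclass cross entropy, which only keeps the true-class term, not the per-class binary form in my `average_cross_entropy`:

```python
import numpy as np

def log_softmax(logits):
    # log(softmax(z)) = z - max(z) - log(sum(exp(z - max(z))))
    # Subtracting the max keeps exp() from overflowing for huge logits
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([-43.0, -32.5, -69.7, 61.9, -16.1,
                   83.2, -66.0, 69.3, 86.2, 142.2])
target_idx = 9  # pretend class 9 is the true class

loss = -log_softmax(logits)[target_idx]
print(loss)  # finite, even though softmax(logits)[9] rounds to 1.0
```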

My plan:

• Add a hidden layer with ReLUs and see if it helps. Maybe the ReLUs plus a second weight matrix will stop the extreme outputs.
• If that doesn’t help, add batchnorm (with and without the hidden layer). I think this will work, since dividing by the standard deviation pulls the activations together aggressively enough to make outputs of exactly 0.0 or 1.0 very unlikely. However, I haven’t seen anyone use batchnorm on the output layer, and someone on reddit said it was a bad idea, so maybe I should add a hidden layer and put the batchnorm on that instead.
• Normalize the input? When Jeremy built his no-hidden-layer model in the statefarm-sample notebook, he used batchnorm on the input.