**Edit:**

Normalizing the input solved my problem.

Old post:

**Situation:**

My network is outputting a fully confident guess of 1.0 when the target is 0, which causes a numerical problem. This happens on about 40% of examples.

**Question:**

Is there a better way to avoid this than clipping some mass off of any guess that’s too confident?

**Details:**

@“Numerical problem”:

When 0 is the target, my loss function is `-log(1-guess)`. And since in this case my guess is 1, I get an infinity:

`-log(1-1) == -log(0) == -(-INF) == INF`
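This is easy to reproduce in numpy (suppressing the divide-by-zero warning so the script doesn't spam the console):

```python
import numpy as np

guess = 1.0
with np.errstate(divide="ignore"):
    loss = -np.log(1 - guess)  # -log(0) == -(-inf) == inf
print(loss)  # inf
```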

@“Clipping”:

e.g. I could convert 1.0 into 0.99999 and 0.0 into 0.00001 to avoid the infinities.
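A minimal sketch of that clipping with `np.clip` (the `1e-5` bound is just the example value above, not a recommendation):

```python
import numpy as np

def clipped(outputs, eps=1e-5):
    """Clip probabilities away from exactly 0 and 1 so -log() stays finite."""
    return np.clip(outputs, eps, 1 - eps)

probs = np.array([0.0, 0.5, 1.0])
safe = clipped(probs)          # [1e-05, 0.5, 0.99999]
losses = -np.log(1 - safe)     # finite everywhere now
```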

**Code:**

Below is a network with no hidden layers and no backprop. It forward props one example and evaluates the loss.

```
import numpy as np
from math import sqrt

def softmax(before_activation):
    denom = sum(np.exp(val) for val in before_activation)
    return [np.exp(val) / denom for val in before_activation]

def average_cross_entropy(outputs, targets):
    cross_entropies = []  # one for each class
    for idx in range(len(targets)):
        output = outputs[idx]
        target = targets[idx]
        if target == 1:
            cross_entropy = -np.log(output)
        elif target == 0:
            cross_entropy = -np.log(1 - output)
        cross_entropies.append(cross_entropy)
    return np.mean(cross_entropies)

NB_CLASSES = 10
DIM = 28
nb_inputs = DIM**2

# scale initial weights by 1/sqrt(fan_in)
weight_vectors = []
for idx in range(NB_CLASSES):
    weight_vector = np.random.randn(nb_inputs) / sqrt(nb_inputs)
    weight_vectors.append(weight_vector)
W_xy = np.stack(weight_vectors)
b_xy = np.zeros(NB_CLASSES)

inp = X[0]  # flattened; X is loaded elsewhere
before_activation = np.dot(W_xy, inp) + b_xy
outputs = softmax(before_activation)

targets = y[0]  # one-hot encoded; y is loaded elsewhere
loss = average_cross_entropy(outputs, targets)
print(loss)
print(before_activation)
print(outputs)
```

Output:

```
inf
[-43.0171411228,
-32.4964130702,
-69.6699137943,
61.9357325336,
-16.1263929412,
83.2185173319,
-65.9866026591,
69.332167027,
86.1949970892,
142.178409547]
[3.720439707675717e-81,
1.379394245020582e-76,
9.8956009181180396e-93,
1.4159507887560133e-35,
1.7745891439322245e-69,
2.4776739641183605e-26,
3.9362608740754598e-91,
2.3082287497671358e-32,
4.860857519397263e-25,
1.0]
```
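For what it's worth, the infinity can also be avoided without clipping by never materializing the probabilities at all: compute the loss directly from the pre-activation values using the log-sum-exp trick. A sketch (note this computes the standard categorical cross-entropy, i.e. only the `-log(p)` term for the target class, which differs slightly from the per-class average in the code above; the logits and target index are made up for illustration):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log(softmax(z)): z - logsumexp(z)."""
    z = np.asarray(z, dtype=float)
    m = z.max()  # subtract the max so exp() never overflows
    return z - (m + np.log(np.sum(np.exp(z - m))))

def cross_entropy_from_logits(z, targets):
    """Cross-entropy for a one-hot target, computed entirely in log space."""
    return -np.dot(np.asarray(targets, dtype=float), log_softmax(z))

# hypothetical logits of the same magnitude as the output above
z = np.array([-43.0, -32.5, -69.7, 61.9, -16.1, 83.2, -66.0, 69.3, 86.2, 142.2])
targets = np.zeros(10)
targets[3] = 1.0  # hypothetical one-hot target

loss = cross_entropy_from_logits(z, targets)
# finite, even though softmax(z) rounds the top class to exactly 1.0
```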

**My plan:**

- Add a hidden layer with ReLUs, and see if it helps.
  - Maybe the ReLUs and a second weight matrix will stop the extreme outputs.
- If it doesn’t, add batchnorm (with and without the hidden layer).
  - I think this will work, since dividing by the standard deviation seems to pull everything together aggressively enough to make 0.0 and 1.0 very unlikely.
  - However, I haven’t seen anyone use batchnorm on the output layer, and someone on reddit said it was a bad idea; so maybe I should add a hidden layer and use batchnorm on it.
- Normalize the input?
  - When Jeremy built his no-hidden-layer model in the statefarm-sample notebook, he used batchnorm on the input.
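Since the edit at the top says normalizing the input is what solved it, here is a minimal sketch of one common meaning of "normalize": per-feature standardization to mean 0 and std 1. The random `X` is a stand-in for a batch of flattened images, not real data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(100, 28 * 28))  # stand-in for flattened images

# standardize each pixel position across the batch
mean = X.mean(axis=0)
std = X.std(axis=0) + 1e-8  # epsilon guards against zero-variance pixels
X_norm = (X - mean) / std

# pre-activations stay small now, so softmax is far less likely to
# saturate to exactly 1.0
```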