Inverting the softmax without knowing input activations

So I got kind of a hack idea last week: take a pretrained model that outputs softmax probabilities and have it output element-wise sigmoid probabilities instead (as would be done in multi-label classification). The problem is that the models are often sklearn models that don’t make it easy to get at the input activations that go into the softmax. I’d like to be able to output the sigmoid probabilities for such a model even though all I usually have are the softmax outputs. Reasons for this include being able to define thresholds that are independent of the number of classes present, and being able to easily repurpose a model trained on single-label classification for multi-label classification, particularly in “gray area” situations where one might want to present the end user with multiple options.
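For concreteness, here’s a minimal sketch of the two outputs I’m talking about, assuming for a moment that the raw activations were available (the numbers are made up; I’m using scipy here just for illustration):

import numpy as np
from scipy.special import softmax, expit  # expit is the element-wise sigmoid

activations = np.array([0.5, 2.0, 1.8])   # hypothetical pre-softmax scores
softmax_probs = softmax(activations)      # what the pretrained model gives me
sigmoid_probs = expit(activations)        # what I'd like to output instead

print(softmax_probs)  # sums to 1, so sensible thresholds depend on the number of classes
print(sigmoid_probs)  # each entry lies in (0, 1) independently of the others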

If the softmax were fully invertible this wouldn’t be a problem, but it turns out that the softmax is only invertible up to a constant. Assuming the i^{th} component of the softmax output y is given by

y_i = \frac{1}{Z} e^{x_i},

where Z is the normalization constant, its inverse is given by

x_i = \log(y_i) + \log(Z).

The problem with this is that given only y it’s seemingly impossible to reverse engineer and figure out what Z is. Given y, basically all you know is that

\sum_i y_i = \frac{1}{Z} \sum_i e^{x_i} = 1.

You could try adopting conventions like setting Z=1, but if the input activations aren’t roughly already normalized, this would skew the sigmoid probabilities either unreasonably high or unreasonably low, which would defeat the whole point of using this approach to get class-size invariant thresholds.
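Here’s a quick numerical illustration of both issues, i.e. that activations at very different scales give the same softmax output, and that pretending Z = 1 can land the sigmoids far from where they should be (numbers made up):

import numpy as np
from scipy.special import softmax, expit

x = np.array([0.5, 2.0, 1.8])
x_shifted = x + 5.0                 # a model running "hot" by a constant offset

print(np.allclose(softmax(x), softmax(x_shifted)))  # True: the shift is absorbed into Z

x_hat = np.log(softmax(x))          # the "pretend Z = 1" inverse
print(expit(x_hat))                 # all below 0.5, since the log of a probability is negative
print(expit(x))                     # the sigmoids I actually wanted
print(expit(x_shifted))             # ...or these, if the model really was running hot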

Does anyone know of a reasonable, model-agnostic way to estimate Z given only a pretrained model (often not a neural net), raw input data, and softmax outputs?

Note: To be clear, I know this isn’t the ideal way to do multi-label classification, for several reasons. At least for the near future, though, production constraints mean I can’t create training data with multiple labels and can’t significantly overhaul the models currently in use.


Hmm, interesting question. My first guess is that, yeah, for the reasons you mention, this isn’t going to work… but it is an interesting question :slight_smile:

I see you mention that production constraints won’t give you multiple-label training data, but now I’m curious how well a model that just takes those softmax outputs as its inputs could do at learning the sigmoid activations you’re looking for.
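Something like this toy setup is what I have in mind. It’s only a sketch: it assumes you can simulate activations whose scale resembles your real model’s (a big if), and since many activation vectors map to the same softmax output, the best such a model could learn is an average over them:

import numpy as np
from scipy.special import softmax, expit
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated activations -- the scale chosen here is a pure assumption.
sim_activations = rng.normal(loc=0.0, scale=2.0, size=(5000, 4))

inputs = softmax(sim_activations, axis=1)   # what the pretrained model would hand you
targets = expit(sim_activations)            # what you'd like to recover

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(inputs, targets)
print(reg.predict(softmax(rng.normal(0.0, 2.0, size=(3, 4)), axis=1)))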

I’m not super familiar with scikit-learn, but could you just dig into/monkey-patch/hack their models and take the raw activations by force?
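For what it’s worth, some scikit-learn estimators do expose the raw per-class scores directly, e.g. LogisticRegression has decision_function. Just a sketch (and it clearly won’t cover every model type):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X[:3])   # raw per-class scores, before any softmax
probs = clf.predict_proba(X[:3])        # the normalized probabilities the model reports
print(scores)
print(probs)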

Speaking mathematically, softmax is not invertible, so what you want to do is impossible.

I agree with Alan that you would have to extract the activations from inside the model.

The question of whether multi-class activations would be equivalent to multi-label models trained on one class at a time interests me too (if that’s in fact your idea). I even posted here about it once, but got no responses.

Please let us know what you figure out.

Yeah, the problem with trying to extract the input activations by force is that the approach needs to be model-agnostic. Not all models in sklearn are neural nets. I’d like to be able to do this with, for example, random forest models too, which output prediction probabilities that “look like softmax outputs” but aren’t (they’re more or less just averaged from the tree predictions). The suggested approach might work with softmax / logistic regression or MLPs, but that’s about it, I think.

What I was hoping for was not to find Z exactly (which is impossible), but to use stochastic methods or something to estimate Z in a way that would put the element-wise sigmoid outputs at a reasonable scale. I’d have to think a good bit about how to do this though, which sort of defeats the “quick hack” requirement.

At this point I think it’s probably better just to get class-size invariant “probabilities” by re-normalizing the softmax outputs using min-max scaling or something similar. As much as I’d love to get the multilabel hack to work, the normalization constant seems like too much of a pain to estimate.
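Something like the following is what I mean (just a sketch; the function name is made up, and the outputs are scores rather than calibrated probabilities):

import numpy as np

def minmax_rescale(softmax_probs):
    # Spread the softmax outputs over [0, 1] per prediction, so a fixed
    # threshold doesn't shrink in meaning as the number of classes grows.
    p = np.asarray(softmax_probs, dtype=float)
    return (p - p.min()) / (p.max() - p.min())

print(minmax_rescale([0.05, 0.15, 0.30, 0.50]))  # approx. [0., 0.22, 0.56, 1.]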

The function is not invertible because if vector x produces vector y, then x+c1 (where c is a scalar, 1 is a vector of 1s) also generates the same y. We need to eliminate this one degree of freedom of the input for a given y.

I arbitrarily suggest that we impose x^T 1 = 0 (i.e. require that the entries of x sum to 0). In this case, it is easy to derive the following solution for x under this restriction:

x_i = \log(y_i) - \frac{1}{n} \sum_{j=1}^{n} \log(y_j),

where n is the dimensionality of the vectors x and y.
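A quick sketch of that convention in code, using scipy’s softmax to check the round trip (the function name is mine):

import numpy as np
from scipy.special import softmax

def invert_softmax_zero_sum(y):
    # Inverse of softmax under the convention that the activations sum to 0.
    log_y = np.log(y)
    return log_y - log_y.mean()

y = softmax(np.array([1.0, 2.0, 3.0, 6.0]))
x = invert_softmax_zero_sum(y)
print(x)                           # [-2. -1.  0.  3.], which sums to 0
print(np.allclose(softmax(x), y))  # True: same softmax output as the original activations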

I wrote this small function to invert softmax probabilities.
If you leave out the last line you get the activations back (shifted so the smallest one is 1).

def invert_softmax_prob(probs):
    probs = probs.log()              # log of softmax = activations - log(Z)
    probs = probs - probs.min() + 1  # shift so the smallest value becomes 1
    return probs.div(probs.sum())    # re-normalize so the result sums to 1

Here is a small example:

from fastbook import tensor
raw_activations = tensor([1.,2.,3.,6.]) # some made-up raw activations
activations = raw_activations.exp() # softmax first uses exp()
softmax_probs = activations.div(activations.sum())  # then divides by the sum
print(softmax_probs)

# now we have generated artificial softmax probabilities 
# next let's invert them

probs = softmax_probs.log()
probs = probs - probs.min() + 1
print(probs) # tensor([1., 2., 3., 6.]) again, but only because the smallest raw activation was 1; in general you only recover the activations up to an additive shift

If I’m missing something please correct me!

All the best, Leon