Inverting the softmax without knowing input activations

So I got kind of a hack idea last week: take a pretrained model that outputs softmax probabilities and have it output element-wise sigmoid probabilities instead (as would be done in multi-label classification). The problem is that the models are often sklearn models that don’t make it easy to get at the input activations that go into the softmax. I’d like to be able to output the sigmoid probabilities for such a model even though all I usually have are the softmax outputs. Reasons for this include being able to define thresholds that are independent of the number of classes present, and being able to easily repurpose a model trained on single-label classification for multi-label classification, particularly in “gray area” situations where one might want to present the end user with multiple options.
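For concreteness, here’s a minimal sketch of the two outputs I’m talking about, assuming for a moment that the raw activations were available (the numbers are made up; I’m using scipy here just for illustration):

import numpy as np
from scipy.special import softmax, expit  # expit is the element-wise sigmoid

activations = np.array([0.5, 2.0, 1.8])   # hypothetical pre-softmax scores
softmax_probs = softmax(activations)      # what the pretrained model gives me
sigmoid_probs = expit(activations)        # what I'd like to output instead

print(softmax_probs)  # sums to 1, so sensible thresholds depend on the number of classes
print(sigmoid_probs)  # each entry lies in (0, 1) independently of the others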

If the softmax were fully invertible this wouldn’t be a problem, but it turns out that the softmax is only invertible up to a constant. Assuming the i^{th} component of the softmax output y is given by

y_i = \frac{1}{Z} e^{x_i},

where Z is the normalization constant, its inverse is given by

x_i = \log(y_i) + \log(Z).

The problem with this is that given only y it’s seemingly impossible to reverse engineer and figure out what Z is. Given y, basically all you know is that

\sum_i y_i = \frac{1}{Z} \sum_i e^{x_i} = 1.

You could try adopting conventions like setting Z=1, but if the input activations aren’t roughly already normalized, this would skew the sigmoid probabilities either unreasonably high or unreasonably low, which would defeat the whole point of using this approach to get class-size invariant thresholds.
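Here’s a quick numerical illustration of both issues, i.e. that activations at very different scales give the same softmax output, and that pretending Z = 1 can land the sigmoids far from where they should be (numbers made up):

import numpy as np
from scipy.special import softmax, expit

x = np.array([0.5, 2.0, 1.8])
x_shifted = x + 5.0                 # a model running "hot" by a constant offset

print(np.allclose(softmax(x), softmax(x_shifted)))  # True: the shift is absorbed into Z

x_hat = np.log(softmax(x))          # the "pretend Z = 1" inverse
print(expit(x_hat))                 # all below 0.5, since the log of a probability is negative
print(expit(x))                     # the sigmoids I actually wanted
print(expit(x_shifted))             # ...or these, if the model really was running hot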

Does anyone know of a reasonable, model-agnostic way to estimate Z given only a pretrained model (often not a neural net), raw input data, and softmax outputs?

Note: To be clear, I know this isn’t the ideal way to do multi-label classification, for several reasons. At least for the near future, though, production constraints mean I can’t create training data with multiple labels and can’t significantly overhaul the models currently in use.


Hmm, interesting question. My first guess is that, yeah, for the reasons you mention, this isn’t going to work… but it is an interesting question :slight_smile:

I see you mention that production constraints won’t give you multiple-label training data, but now I’m curious how well a model that just takes those softmax outputs as its inputs could do at learning the sigmoid activations you’re looking for.
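Something like this toy setup is what I have in mind. It’s only a sketch: it assumes you can simulate activations whose scale resembles your real model’s (a big if), and since many activation vectors map to the same softmax output, the best such a model could learn is an average over them:

import numpy as np
from scipy.special import softmax, expit
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated activations -- the scale chosen here is a pure assumption.
sim_activations = rng.normal(loc=0.0, scale=2.0, size=(5000, 4))

inputs = softmax(sim_activations, axis=1)   # what the pretrained model would hand you
targets = expit(sim_activations)            # what you'd like to recover

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(inputs, targets)
print(reg.predict(softmax(rng.normal(0.0, 2.0, size=(3, 4)), axis=1)))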

I’m not super familiar with scikit-learn, but could you just dig into/monkey-patch/hack their models and take the raw activations by force?
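For what it’s worth, some scikit-learn estimators do expose the raw per-class scores directly, e.g. LogisticRegression has decision_function. Just a sketch (and it clearly won’t cover every model type):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X[:3])   # raw per-class scores, before any softmax
probs = clf.predict_proba(X[:3])        # the normalized probabilities the model reports
print(scores)
print(probs)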

Speaking mathematically, softmax is not invertible, so what you want to do is impossible.

I agree with Alan that you would have to extract the activations from inside the model.

The question of whether multi-class activations would be equivalent to multi-label models trained on one class at a time interests me too (if that’s in fact your idea). I even posted here about it once, but got no responses.

Please let us know what you figure out.

Yeah, the problem with trying to extract the input activations by force is that the approach needs to be model-agnostic. Not all models in sklearn are neural nets. I’d like to be able to do this with, for example, random forest models too, which output prediction probabilities that “look like softmax outputs” but aren’t (they’re more or less just averaged from the tree predictions). The suggested approach might work with softmax / logistic regression or MLPs, but that’s about it, I think.

What I was hoping for was not to find Z exactly (which is impossible), but to use stochastic methods or something to estimate Z in a way that would put the element-wise sigmoid outputs at a reasonable scale. I’d have to think a good bit about how to do this though, which sort of defeats the “quick hack” requirement.

At this point I think it’s probably better just to get class-size invariant “probabilities” by re-normalizing the softmax outputs using min-max scaling or something similar. As much as I’d love to get the multilabel hack to work, the normalization constant seems like too much of a pain to estimate.
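Something like the following is what I mean (just a sketch; the function name is made up, and the outputs are scores rather than calibrated probabilities):

import numpy as np

def minmax_rescale(softmax_probs):
    # Spread the softmax outputs over [0, 1] per prediction, so a fixed
    # threshold doesn't shrink in meaning as the number of classes grows.
    p = np.asarray(softmax_probs, dtype=float)
    return (p - p.min()) / (p.max() - p.min())

print(minmax_rescale([0.05, 0.15, 0.30, 0.50]))  # approx. [0., 0.22, 0.56, 1.]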

The function is not invertible because if vector x produces vector y, then x+c1 (where c is a scalar, 1 is a vector of 1s) also generates the same y. We need to eliminate this one degree of freedom of the input for a given y.

I arbitrarily suggest that we impose x^T 1 = 0 (i.e. require that the entries of x sum to 0). In this case, it is easy to derive the following solution for x under this restriction:

x_i = \log(y_i) - \frac{1}{n} \sum_{j=1}^{n} \log(y_j),

where n is the dimensionality of the vectors x and y.
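A quick sketch of that convention in code, using scipy’s softmax to check the round trip (the function name is mine):

import numpy as np
from scipy.special import softmax

def invert_softmax_zero_sum(y):
    # Inverse of softmax under the convention that the activations sum to 0.
    log_y = np.log(y)
    return log_y - log_y.mean()

y = softmax(np.array([1.0, 2.0, 3.0, 6.0]))
x = invert_softmax_zero_sum(y)
print(x)                           # [-2. -1.  0.  3.], which sums to 0
print(np.allclose(softmax(x), y))  # True: same softmax output as the original activations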

I wrote this small function to invert softmax probabilities.
If you leave out the last line you get the activations back (shifted so the smallest one is 1).

def invert_softmax_prob(probs):
    probs = probs.log()              # log of softmax = activations - log(Z)
    probs = probs - probs.min() + 1  # shift so the smallest value becomes 1
    return probs.div(probs.sum())    # re-normalize so the result sums to 1

Here is a small example:

from fastbook import tensor
raw_activations = tensor([1.,2.,3.,6.]) # some made-up raw activations
activations = raw_activations.exp() # softmax first uses exp()
softmax_probs = activations.div(activations.sum())  # then divides by the sum
print(softmax_probs)

# now we have generated artificial softmax probabilities 
# next let's invert them

probs = softmax_probs.log()
probs = probs - probs.min() + 1
print(probs) # tensor([1., 2., 3., 6.]) again, but only because the smallest raw activation was 1; in general you only recover the activations up to an additive shift

If I’m missing something please correct me!

All the best, Leon