Last week I had a somewhat hacky idea: take a pretrained model that outputs softmax probabilities and have it output element-wise sigmoid probabilities instead, as one would in multi-label classification. The problem is that the models are often sklearn models, which don’t make it easy to get at the input activations (logits) that feed the softmax, so usually all I have are the softmax outputs. My reasons for wanting this include being able to define thresholds that are independent of the number of classes, and being able to easily repurpose a model trained for single-label classification for multi-label classification, particularly in “gray area” situations where one might want to present the end user with multiple options.

If the softmax were fully invertible this wouldn’t be a problem, but it turns out that the softmax is only invertible **up to a constant**. Assuming the i^{th} component of the softmax output y is given by

y_i = \frac{1}{Z} e^{x_i},

where Z = \sum_j e^{x_j} is the normalization constant, the inverse of the softmax is given by

x_i = \log(y_i) + \log(Z).

The problem is that, given only y, it seems impossible to recover Z: adding the same constant c to every x_i multiplies Z by e^c and leaves y unchanged. Given y, all you know is that

\sum_i y_i = \frac{1}{Z} \sum_i e^{x_i} = 1.
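
To make the "up to a constant" point concrete, here is a minimal NumPy sketch. The logits are made-up values and `softmax` is my own helper, not an sklearn function. It only illustrates that shifting every logit by the same amount leaves the softmax output unchanged, and that \log(y_i) differs from x_i by the same offset \log(Z) for every i.

```python
import numpy as np

def softmax(x):
    # Subtracting the max is for numerical stability only; it does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, -1.0, 0.5])   # hypothetical logits, chosen for illustration
y = softmax(x)
y_shifted = softmax(x + 7.0)     # same logits plus an arbitrary constant

# Both logit vectors map to the same softmax output.
same_output = np.allclose(y, y_shifted)

# log(y) recovers x only up to the additive constant log(Z).
offset = x - np.log(y)           # every entry should equal log(Z)
log_Z = np.log(np.exp(x).sum())
constant_offset = np.allclose(offset, log_Z)
```

Since `x` and `x + 7.0` produce identical outputs, no function of y alone can tell them apart, which is exactly why Z can't be read off the softmax output.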

You could try adopting a convention like setting Z=1, but if the input activations aren’t already roughly normalized, this would skew the sigmoid probabilities either unreasonably high or unreasonably low, which would defeat the whole point of using this approach to get thresholds that don’t depend on the number of classes.
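
As a rough illustration of how much the choice of Z matters, here is a small sketch. `sigmoid_from_softmax` and its `log_Z` parameter are hypothetical names of mine, and the softmax output and the value 3.0 are made up. With Z=1 the sigmoid of \log(y_i) simplifies to y_i / (1 + y_i), so no class can score above 0.5; any other guess for \log(Z) shifts every score together.

```python
import numpy as np

def sigmoid_from_softmax(y, log_Z=0.0):
    # Reconstruct logits as x_i = log(y_i) + log(Z), then apply an element-wise sigmoid.
    # log_Z is a guess (assumption); it cannot be recovered from y alone.
    x = np.log(y) + log_Z
    return 1.0 / (1.0 + np.exp(-x))

y = np.array([0.90, 0.07, 0.03])              # hypothetical softmax output

p_z1 = sigmoid_from_softmax(y)                # Z = 1 convention (log_Z = 0)
p_big = sigmoid_from_softmax(y, log_Z=3.0)    # an arbitrary alternative guess

# Under Z = 1 the result is exactly y / (1 + y), which is capped at 0.5.
matches_closed_form = np.allclose(p_z1, y / (1.0 + y))
capped_at_half = bool((p_z1 <= 0.5).all())
```

So even a very confident softmax output of 0.90 maps to a sigmoid score below 0.5 under Z=1, while a larger assumed \log(Z) pushes all scores up. The threshold you would pick depends entirely on that guess.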

Does anyone know of a reasonable, model-agnostic way to estimate Z given only a pretrained model (often not a neural net), raw input data, and softmax outputs?

Note: To be clear, I know that this isn’t the ideal way to do multi-label classification, for several reasons. At least in the near future, though, production constraints mean we can’t create training data with multiple labels or significantly overhaul the models currently in use.