OK I’m not sure I’m thinking that clearly ATM, but here’s what I think is going on:
- The Excel version only makes sense as a probability when used with binarized features. It calculates the proportion of documents in a class that contain the word w.
- The python version works for both binarized and raw features. It calculates the proportion of word occurrences in a class that are the word w (see the sketch after this list).
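If it helps, here's a minimal numpy sketch of the distinction I'm trying to draw. The toy matrix, labels, and variable names are made up purely for illustration, and I've left out smoothing:

```python
import numpy as np

# Toy term-document count matrix (rows = documents, cols = vocab words) and
# binary class labels -- both made up purely for illustration.
X = np.array([[2, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 3]])
y = np.array([1, 1, 0, 0])

X_pos = X[y == 1]          # documents in the positive class

# "Excel" version (Bernoulli-style): proportion of positive documents that
# contain word w -- only meaningful on binarized features.
p_doc = (X_pos > 0).sum(axis=0) / X_pos.shape[0]

# "python" version (multinomial-style): proportion of all word occurrences
# in the positive class that are word w.
p_word = X_pos.sum(axis=0) / X_pos.sum()

print(p_doc)    # [0.5  0.5  0.5 ]
print(p_word)   # [0.5  0.25 0.25]
```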
I wrote the Excel version without looking up any references, just using (what I thought was) common sense. The python one I wrote months ago, and I think I actually worked from a reference book. As @groverpr says, the python one appears to be more in line with the generally accepted approach.
However, my Excel “common sense” approach makes more intuitive sense to me. It’s interesting that @kcturgutlu shows it’s actually better. (Although when the probabilities are used as a feature in the logistic regression, the two versions end up identical, FYI.)
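For reference, this is roughly what I mean by feeding the ratio into the logistic regression as a feature. The helper name, the smoothing, and the regularization value are just guesses for illustration, not necessarily what either notebook actually does:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nb_lr(X, y, eps=1.0):
    # Binarize, then compute per-word probabilities for each class with a
    # little smoothing (using the "documents containing w" definition here,
    # but either definition could be dropped in).
    X_bin = (X > 0).astype(float)
    p = (X_bin[y == 1].sum(axis=0) + eps) / ((y == 1).sum() + eps)
    q = (X_bin[y == 0].sum(axis=0) + eps) / ((y == 0).sum() + eps)
    r = np.log(p / q)                  # log ratio per word

    # Scale the binarized features by r and fit the logistic regression on that.
    clf = LogisticRegression(C=0.1)
    clf.fit(X_bin * r, y)
    return clf, r
```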
Does this analysis seem about right? Has anyone come across materials that show something more like my Excel approach?
@yinterian any thoughts?