I read the paper and the lesson notes, but something was still missing for me. I hope this is right and, if so, that it can help others fill the gaps I had myself regarding this lesson.
With Naive Bayes we obtain probabilities in [0…1]:
Prob(class=1|document) = P(1|d)
or
Prob(class=0|document) = P(0|d)
If we divide them, we have
y = P(1|d) / P(0|d), with values in [0…inf) *, being
y > 1 → class = 1
y < 1 → class = 0
* It will never actually be inf, because we will add something later
(the trick introduced below) that makes P(0|d) > 0
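For instance, a tiny Python sketch of this rule (the two probabilities are made-up numbers, just to show the thresholding):

```python
p1_given_d = 0.7                    # hypothetical P(class=1 | document)
p0_given_d = 0.3                    # hypothetical P(class=0 | document)

y = p1_given_d / p0_given_d         # ratio, values in [0, inf)
prediction = 1 if y > 1 else 0      # y > 1 -> class 1, y < 1 -> class 0
print(y, prediction)                # ~2.33 -> class 1
```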
That’s all for operations on the left side of the equation.
On the right side we have
P(1|d) = P(d|1)*P(c=1) / P(d)
and
P(0|d) = P(d|0)*P(c=0) / P(d)
Dividing them (the P(d) terms cancel):
y = P(1|d) / P(0|d)
y = ( P(d|1) * P(c=1) ) / ( P(d|0) * P(c=0) )
being
P(d|1)= product(fi * P(fi|1)) = ∏ (fi * P(fi|1))
P(d|0)= product(fi * P(fi|0)) = ∏ (fi * P(fi|0))
The ∏ runs across all the features fi contained in d (the document).
However, P(fi|1) and P(fi|0) are estimated across all documents D (vertically, down the data matrix).
and
P(c=1) = N_cases_c=1 / N_total   (the fraction of documents with c=1)
P(c=0) = N_cases_c=0 / N_total = 1 - P(c=1)
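As a quick sketch of those two pieces in Python/NumPy (all numbers are hypothetical; I take the product over the features actually present in d, and the priors as the fraction of documents per class):

```python
import numpy as np

P_fi_given_1 = np.array([0.57, 0.29, 0.14])  # hypothetical P(fi|1), estimated across all documents
d = np.array([1, 0, 1])                      # feature vector of one document (binary here)

# P(d|1): product only over the features contained in d
P_d_given_1 = np.prod(P_fi_given_1[d > 0])   # 0.57 * 0.14 ≈ 0.08

# Priors: fraction of documents in each class
c = np.array([1, 1, 0, 0])                   # hypothetical class labels of all documents
P_c1 = (c == 1).sum() / len(c)               # N_cases_c=1 / N_total = 0.5
P_c0 = 1 - P_c1
print(P_d_given_1, P_c1, P_c0)
```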
The trick mentioned before is that we introduce 2 extra rows in our data matrix D, one per class, with every feature set to 1 (a small sketch follows below):
c=0 f=[1,1,1,......]
c=1 f=[1,1,1,......]
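A minimal NumPy sketch of that trick, using a tiny made-up count matrix (4 documents × 3 features):

```python
import numpy as np

# Hypothetical feature-count matrix: one row per document, one column per feature fi
D = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 1],
])
c = np.array([1, 1, 0, 0])          # class of each document

# The trick: add one all-ones row per class, so no feature has a zero count in either class
# and therefore P(fi|0), P(fi|1) (and hence P(d|0)) can never be exactly 0
D_smooth = np.vstack([D, np.ones(D.shape[1]), np.ones(D.shape[1])])
c_smooth = np.append(c, [0, 1])
print(D_smooth)
print(c_smooth)
```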
Dividing the two posteriors again:
P(1|d) / P(0|d) = ( P(c=1) / P(c=0) ) * ∏ ( P(fi|1) / P(fi|0) )
If we define
pi = P(fi|1) = ∑ (fi when c=1) / N_c=1   (summed across all documents)
qi = P(fi|0) = ∑ (fi when c=0) / N_c=0   (summed across all documents)
P(d|1) = ∏ (fi * P(fi|1)) = ∏ (fi * pi)
P(d|0) = ∏ (fi * P(fi|0)) = ∏ (fi * qi)
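Continuing that sketch (same hypothetical matrix, with the two extra rows already added), pi and qi could be estimated like this; note that I normalize by the total feature count of each class, which is just one possible reading of N_c=1 and N_c=0:

```python
import numpy as np

# The smoothed matrix from the previous sketch: 4 real documents + the two all-ones rows
D_smooth = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 1],
    [1, 1, 1],   # extra c=0 row
    [1, 1, 1],   # extra c=1 row
], dtype=float)
c_smooth = np.array([1, 1, 0, 0, 0, 1])

# pi = P(fi|1) and qi = P(fi|0), estimated "vertically" across all documents
p = D_smooth[c_smooth == 1].sum(axis=0) / D_smooth[c_smooth == 1].sum()
q = D_smooth[c_smooth == 0].sum(axis=0) / D_smooth[c_smooth == 0].sum()
print(p)   # [0.571..., 0.285..., 0.142...] -> every pi > 0 thanks to the extra row
print(q)   # [0.125, 0.25, 0.625]
```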
With these definitions, dividing again gives
P(1|d) / P(0|d) = ( P(c=1) / P(c=0) ) * ∏ (pi/qi)
Now we take logs
log(P(1|d) / P(0|d)) =
= log( ( P(c=1) / P(c=0) ) * ∏ (pi/qi) )
= log( ∏ (pi/qi) ) + log( P(c=1) / P(c=0) )
= ∑ log(pi/qi) + log( P(c=1) / P(c=0) )
= ∑ ri + b
where ri = log(pi/qi), b = log( P(c=1) / P(c=0) ), and the ∏ and ∑ run over the features fi contained in d
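With the hypothetical p, q and priors from the sketches above, ri and b (and the sum over one document) would look like this:

```python
import numpy as np

p = np.array([4/7, 2/7, 1/7])       # hypothetical pi = P(fi|1)
q = np.array([1/8, 2/8, 5/8])       # hypothetical qi = P(fi|0)
P_c1, P_c0 = 0.5, 0.5               # hypothetical priors

r = np.log(p / q)                   # ri = log(pi / qi), one value per feature
b = np.log(P_c1 / P_c0)             # b = log(P(c=1) / P(c=0)), a scalar

# For one document d, sum the ri of the features contained in d and add b
d = np.array([1, 0, 1])             # hypothetical (binary) feature vector of d
pre_prediction = r[d > 0].sum() + b
print(r, b, pre_prediction)         # r ≈ [1.52, 0.13, -1.48], b = 0, pre_prediction ≈ 0.04
```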
That last expression, ∑ ri + b, looks very much like
pre_prediction = wi @ ri + b
where:
- wi are the fi elements in d (the document's feature vector)
- ri = log(pi/qi), computed across all documents
- @ is the matrix (dot) product
- b is a scalar
Because pre_prediction is log(P(1|d) / P(0|d)):
if pre_prediction > 0 → prediction = 1
otherwise → prediction = 0
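A minimal sketch of that prediction step, reusing the (rounded) hypothetical r and b values from above and a binary feature vector w:

```python
import numpy as np

r = np.array([1.52, 0.13, -1.48])   # hypothetical ri = log(pi/qi) values
b = 0.0                             # hypothetical b = log(P(c=1)/P(c=0))

w = np.array([1, 0, 1])             # binary feature vector of the document d
pre_prediction = w @ r + b          # dot product plus the scalar bias
prediction = 1 if pre_prediction > 0 else 0
print(pre_prediction, prediction)   # ~0.04 -> class 1
```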
So we have transformed
- probabilities into a ratio of probabilities (and then taken the logs)
- a product (∏) into a summation (∑) by taking the logs
- Naive Bayes equations into something very similar to Logistic Regression
I also created a public sheet (something I was missing in the course material) to calculate and play with all of this myself.
You can copy it & play with it from here.
Corrections and additions are welcome.
Issue:
We have
pre_prediction = wi @ ri + b
pre_prediction = log(P(1|d) / P(0|d)) = ∑ ri + b
BUT, those two expressions are equal ONLY if the fi
elements in
d = [ f1, f2, f3, … ]
are binary, that is, either 1 or 0.
That may explain why using the binary form of the features matrix produces slightly better results. I'd say it is not that the binary form is inherently better; the problem is that wi @ ri + b with raw counts is not fully consistent with the derivation above and therefore produces worse results.
I hope someone can explain this difference.
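To see the difference numerically, here is a small hypothetical comparison of the two encodings (the r, b and counts are made up); the scores diverge as soon as some count is larger than 1:

```python
import numpy as np

r = np.array([1.52, 0.13, -1.48])   # hypothetical ri values
b = 0.0                             # hypothetical bias

counts = np.array([3, 0, 1])        # raw counts of each feature in d
binary = (counts > 0).astype(int)   # presence/absence form: [1, 0, 1]

# With counts, each ri is added fi times; with binary features, once per present feature
print(counts @ r + b)               # 3*1.52 + 1*(-1.48) = 3.08
print(binary @ r + b)               # 1*1.52 + 1*(-1.48) = 0.04
# Only the binary form matches the "sum of ri over the features contained in d" above
```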