Lesson 10 and 11: Naive Bayes transformed into Logistic Regression

(Ruben) #1

I read the paper and the lesson notes, but something was still missing for me. I hope this write-up is right; if so, it may help others fill the same gaps I had with this lesson.

With Naive Bayes we obtain probabilities [0…1]

Prob(class=1|document) = P(1|d)
or
Prob(class=0|document) = P(0|d)

If we divide them, we have

y = P(1|d) / P(0|d) with values in [0…inf) *, where
y > 1 -> class = 1
y < 1 -> class = 0

  • It will never be inf because we will add something later (the two all-ones rows introduced below) that guarantees P(0|d) > 0

That's all for the left-hand side of the equation.

On the right side we have

P(1|d) = P(d|1)*P(c=1) / P(d)
and
P(0|d) = P(d|0)*P(c=0) / P(d)

Dividing them (the P(d) terms cancel)

y = P(1|d) / P(0|d)

y = ( P(d|1) * P(c=1) ) / ( P(d|0) * P(c=0) )

where

P(d|1) = product( fi * P(fi|1) ) = ∏ ( fi * P(fi|1) )
P(d|0) = product( fi * P(fi|0) ) = ∏ ( fi * P(fi|0) )

The ∏ runs across the fi contained in d (horizontally, within one document).
The P(fi|1) and P(fi|0), in contrast, are estimated across all documents D (vertically, down the columns of the term-document matrix).

and

P(c=1) = N_cases_c=1 / N_total = (number of documents with c=1) / (total number of documents)
P(c=0) = N_cases_c=0 / N_total = 1 - P(c=1)

The trick mentioned before is that we will introduce 2 extra rows into our data matrix D, one per class, filled with ones:

c=0  f=[1,1,1,......]
c=1  f=[1,1,1,......]

These all-ones rows make every P(fi|c) strictly positive (it is the same as adding 1 to every count, i.e. Laplace smoothing), which is the "something" promised in the footnote above; see the sketch right below.
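Here is a minimal numpy sketch of what those two extra rows do, assuming a hypothetical term-document count matrix x and a label vector y (both invented for illustration); appending the all-ones rows is the same as adding 1 to every count before estimating P(fi|c):

```python
import numpy as np

# Hypothetical toy data: rows = documents, columns = token counts
x = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)
y = np.array([1, 1, 0, 0])          # class of each document

# The trick: one extra all-ones row per class
x_aug = np.vstack([x, np.ones((2, x.shape[1]))])
y_aug = np.concatenate([y, [1, 0]])

# P(fi|c) estimated "vertically", across all documents of each class
p = x_aug[y_aug == 1].sum(0) / (y_aug == 1).sum()   # P(fi|1), never 0
q = x_aug[y_aug == 0].sum(0) / (y_aug == 0).sum()   # P(fi|0), never 0
```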

Dividing them

P(1|d) / P(0|d) = ( P(d|1) / P(d|0) ) * ( P(c=1) / P(c=0) )

If we define

pi = P(fi|1) = ∑ (fi when c=1) / N_c=1   (summed across all documents of class 1, including the extra all-ones row)
qi = P(fi|0) = ∑ (fi when c=0) / N_c=0   (summed across all documents of class 0, including the extra all-ones row)

P(d|1) = ∏ ( fi * P(fi|1) ) = ∏ ( fi * pi )
P(d|0) = ∏ ( fi * P(fi|0) ) = ∏ ( fi * qi )

then

P(1|d) / P(0|d) = ( ∏ (pi/qi) ) * ( P(c=1) / P(c=0) )

(the fi factors appear in both the numerator and the denominator, so they cancel)

Now we take logs

log( P(1|d) / P(0|d) ) =
= log( ( ∏ (pi/qi) ) * ( P(c=1) / P(c=0) ) )
= log( ∏ (pi/qi) ) + log( P(c=1) / P(c=0) )
= ∑ log(pi/qi) + log( P(c=1) / P(c=0) )
= ∑ ri + b

where ri = log(pi/qi), b = log( P(c=1) / P(c=0) ), and the ∏ and ∑ run over the fi contained in d.
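Continuing the same numpy sketch (the names are my own, chosen to mirror the formulas above), the log-count ratios ri and the bias b fall out directly:

```python
r = np.log(p / q)                                # ri = log(pi/qi), one per feature
b = np.log((y == 1).mean() / (y == 0).mean())    # b = log(P(c=1)/P(c=0)), priors from the original y
```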

Which looks very much like

pre_prediction = w @ r + b

where:

  • w is the vector of feature values fi of document d
  • r is the vector of ri = log(pi/qi), estimated across all documents
  • @ is matrix multiplication
  • b is a scalar

Because pre_prediction is log(P(1|d) / P(0|d)):
if pre_prediction > 0 -> prediction = 1
otherwise -> prediction = 0
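Again continuing the sketch, this decision rule is just a matrix product followed by a threshold at 0; here x plays the role of all the w vectors stacked up:

```python
pre_prediction = x @ r + b                        # log(P(1|d)/P(0|d)) for every document at once
prediction = (pre_prediction > 0).astype(int)     # 1 if the log-ratio is positive, else 0
```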

So we have transformed

  • the probabilities into a ratio of probabilities (and then taken its log)
  • a product (∏) into a summation (∑) by taking logs
  • the Naive Bayes equations into something very similar to Logistic Regression

I also created a public spreadsheet (something I was missing in the course material) to calculate and play with all of this myself.


You can copy it & play from here

Corrections and additions are welcome.


Issue:
We have
pre_prediction = w @ r + b
pre_prediction = log(P(1|d) / P(0|d)) = ∑ ri + b

BUT the two expressions coincide ONLY if the fi elements in
d = [ f1, f2, f3, … ]
are binary, that is, either 1 or 0 (only then does w @ r reduce to the sum of the ri over the features present in d).

That may explain why using the binarized form of the feature matrix produces slightly better results. I'd say it is not that binarization is intrinsically better; rather, with count features w @ r + b is no longer exactly the Naive Bayes log-ratio derived above, and that mismatch is what produces the slightly worse results.

I hope someone could explain this difference.
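For anyone who wants to poke at this difference, here is a hedged continuation of the toy sketch above (all variable names and data are mine, not from the lesson notebook): binarize the term-document matrix, re-estimate p, q and r from it, and compare the two pre_predictions.

```python
# Binarized version of the same toy data: fi becomes 1 if the token appears at all
x_bin = (x > 0).astype(float)

# Same trick rows and estimates, but on the binarized matrix
x_bin_aug = np.vstack([x_bin, np.ones((2, x_bin.shape[1]))])
p_bin = x_bin_aug[y_aug == 1].sum(0) / (y_aug == 1).sum()
q_bin = x_bin_aug[y_aug == 0].sum(0) / (y_aug == 0).sum()
r_bin = np.log(p_bin / q_bin)

pre_prediction_counts = x @ r + b          # count features
pre_prediction_binary = x_bin @ r_bin + b  # binarized features
print(pre_prediction_counts, pre_prediction_binary)
```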

