Lessons 10 and 11: Naive Bayes transformed into Logistic Regression

I read the paper and the lesson notes, but something was still missing for me. I hope what follows is right; if so, it may help others fill the same gaps I had with this lesson.

With Naive Bayes we obtain probabilities in [0…1]

Prob(class=1|document) = P(1|d)
or
Prob(class=0|document) = P(0|d)

If we divide them, we have

y = P(1|d) / P(0|d), with values in [0…inf)*, where
y > 1 -> class=1
y < 1 -> class=0

  • *It will never actually reach inf, because later we will add
    something that guarantees P(0|d) > 0
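As a tiny sketch in code (the numbers are made up, not taken from the lesson), the decision rule is just:

```python
# Hypothetical probabilities, only to illustrate the decision rule
p1 = 0.7     # P(class=1 | document)
p0 = 0.3     # P(class=0 | document)

y = p1 / p0                      # odds ratio, in [0, inf)
prediction = 1 if y > 1 else 0   # y > 1 -> class 1, y < 1 -> class 0
print(y, prediction)
```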

That’s all for operations on the left side of the equation.

On the right side we have

P(1|d) = P(d|1)*P(c=1) / P(d)
and
P(0|d) = P(d|0)*P(c=0) / P(d)

Dividing them (the P(d) denominators cancel out),

y = P(1|d) / P(0|d)

y = ( P(d|1)*P(c=1) ) / ( P(d|0)*P(c=0) )

being (thanks to the naive independence assumption)

P(d|1) = ∏ P(fi|1)
P(d|0) = ∏ P(fi|0)

The ∏ runs across the fi contained in d (doc = d).
However, P(fi|1) and P(fi|0) are estimated across all documents (D) (vertically).

and

P(c=1) = N_docs_with_c=1 / N_total_docs
P(c=0) = N_docs_with_c=0 / N_total_docs = 1 - P(c=1)

The trick mentioned before is that we introduce 2 extra rows of ones into our data matrix D, one per class (see the sketch below):

c=0  f=[1,1,1,......]
c=1  f=[1,1,1,......]
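Here is a minimal NumPy sketch of these estimates with a tiny made-up term-document matrix (the numbers are mine, not the lesson's); the two rows of ones guarantee that every P(fi|c) is strictly positive:

```python
import numpy as np

# Made-up binary term-document matrix: rows = documents, columns = features fi
X_train = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 1],
                    [1, 0, 0]])
y_train = np.array([1, 1, 0, 0])          # class of each document

# Priors, estimated from the real documents only
prior1 = (y_train == 1).mean()            # P(c=1) = N_docs_with_c=1 / N_total_docs
prior0 = 1 - prior1                       # P(c=0)

# The trick: one extra all-ones row per class, so no P(fi|c) can be zero
X_all = np.vstack([X_train, np.ones(3), np.ones(3)])
y_all = np.append(y_train, [0, 1])

p = X_all[y_all == 1].sum(axis=0) / (y_all == 1).sum()   # pi = P(fi|1)
q = X_all[y_all == 0].sum(axis=0) / (y_all == 0).sum()   # qi = P(fi|0)
```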

Dividing them

P(1|d) / P(0|d) = ( ∏ (P(fi|1)/P(fi|0)) ) * ( P(c=1) / P(c=0) )

If we define

pi = P(fi|1) = ∑ (fi when c=1) / N_c=1 (across all documents)
qi = P(fi|0) = ∑ (fi when c=0) / N_c=0 (across all documents)

P(d|1) = ∏ P(fi|1) = ∏ pi
P(d|0) = ∏ P(fi|0) = ∏ qi

then

P(1|d) / P(0|d) = ( ∏ (pi/qi) ) * ( P(c=1) / P(c=0) )

Now we take logs

log(P(1|d) / P(0|d)) =
= log( ( ∏ (pi/qi) ) * ( P(c=1) / P(c=0) ) )
= log( ∏ (pi/qi) ) + log( P(c=1) / P(c=0) )
= ∑ log(pi/qi) + log( P(c=1) / P(c=0) )
= ∑ ri + b

with ri = log(pi/qi), b = log(P(c=1)/P(c=0)), and the ∏ and ∑ running over the fi contained in d.
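Continuing the toy sketch above (still just an illustration, not the course notebook):

```python
# p, q, prior1, prior0 come from the previous snippet
r = np.log(p / q)                  # ri = log(pi / qi), one value per feature
b = np.log(prior1 / prior0)        # b = log(P(c=1) / P(c=0))

d = X_train[0]                     # one (binary) document
log_odds = r[d == 1].sum() + b     # ∑ ri over the features present in d, plus b
```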

Which looks very much like

pre_prediction = w @ r + b

where:

  • w is the vector of fi elements of d
  • r is the vector of ri = log(pi/qi), estimated across all documents
  • @ is matrix (dot) multiplication
  • b is a scalar

Because pre_prediction is log(P(1|d) / P(0|d)):
if pre_prediction > 0 -> prediction = 1
otherwise -> prediction = 0
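Still continuing the toy sketch, the whole thing for all documents at once would be something like:

```python
# One pre_prediction per document, then threshold at 0
pre_preds = X_train @ r + b              # works because X_train is binary
preds = (pre_preds > 0).astype(int)      # > 0 -> class 1, otherwise class 0
print(pre_preds, preds)
```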

So we have transformed

  • probabilities into a ratio of probabilities (and then taken the logs)
  • a product (∏) into a summation (∑) by taking the logs
  • Naive Bayes equations into something very similar to Logistic Regression

I also created a public sheet (I was missing one in the course material) so I could calculate and play with all of this myself.

You can copy it & play with it from here.

Corrections and additions are welcome.


Issue:
We have
pre_prediction = w @ r + b
pre_prediction = log(P(1|d) / P(0|d)) = ∑ ri + b

BUT those two are equal ONLY if the fi elements in
d = [ f1, f2, f3, … ]
are binary, that is, either 1 or 0. With raw counts, w @ r weights each ri by its count instead of just summing the ri of the features present in d (see the quick check below).

That may explain why using the binarized feature matrix produces slightly better results. I'd say it is not that the binary form is intrinsically better; rather, with counts, w @ r + b no longer matches the derivation above and therefore produces worse results.

I hope someone can explain this difference.
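A quick check of this on the toy variables from the snippets above:

```python
# With a binary document, the dot product really is the sum of ri over present features
d_binary = np.array([1, 0, 1])
print(d_binary @ r, r[d_binary == 1].sum())   # identical

# With raw counts for the same document, each ri is weighted by its count instead,
# which is no longer the ∑ ri + b expression derived above
d_counts = np.array([3, 0, 2])
print(d_counts @ r)
```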


I also noticed this ("equal ONLY if the fi elements in d are binary"); glad to find confirmation from somebody else.

Unfortunately, I didn't get the part where Jeremy combines Naive Bayes with Logistic Regression. The result is better with that approach, and it has something to do with regularization. Can anybody explain this part in more detail, please?