Multiclass text classification with Naive Bayes and fastai

RamboFast · January 2, 2020, 10:31am

Hi

I am trying to follow Rachel Thomas path of sentiment classification with Naive Bayes. In the video she uses a binary dataset (pos. and neg. movie reviews). When it comes to apply Naive Bayes, this is what she does:

Defintion: log-count ratio r for each word f:

r = log (ratio of feature f in positive documents) / (ratio of feature f in negative documents)

where ratio of feature f in positive documents is the number of times a positive document has a feature divided by the number of positive documents.

p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

r = np.log(pr1/pr0)

Problem:
My dataset is not binary! Lets assume I have 5 labels: label_1,…,label_5

How do I get the log-count ratio r for multilabel dataset?

My approach:

p4 = np.squeeze(np.asarray(x[y.items==label_5].sum(0)))
p3 = np.squeeze(np.asarray(x[y.items==label_4].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==label_3].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==label_2].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==label_1].sum(0)))

log-count-ratio:
pr1 = (p1+1) / ((y.items==label_2).sum() + 1)
pr1_not = (p1+1) / ((y.items!=label_2).sum() + 1)
r_1 = np.log(pr1/pr1_not)

log-count-ratio:
pr2 = (p2+1) / ((y.items==label_3).sum() + 1)
pr2_not = (p2+1) / ((y.items!=label_3).sum() + 1)
r_2 = np.log(pr2/pr2_not)
...

Is this correct? Does it mean I get multiple ratios?