The Naïve Bayes Classifier, introduced in Lesson 10, is perhaps the simplest machine learning classifier to build, train, and predict with. In this post we’ll see how and why it works.
Part 1 reveals that the muchcelebrated Bayes’ Rule is just a simple statement about joint and conditional probabilities. But its blandness belies astonishing power, as we’ll see in Parts 2 and 3, where we assemble the machinery of the Naïve Bayes Classifier. Part 4 is a brief discussion, and Parts 5 and 6 list the advantages and disadvantages of the Naïve Bayes Classifier. Part 7 is a summary, and part 8 lists a few references that I’ve found useful. Constructive comments, criticism, and suggestions are welcome!
1. Prelude: Bayes' Rule
Given two events A and B, the joint probability P(A,B) is the probability of A and B occurring together. It can be written in either of two ways:
The first way
\large P(A,B) = P(AB)P(B)
Here P(AB) is a conditional probability: the probability that A occurs, given that B has occurred. This says that the probability of A and B occurring together is (the probability that A occurs given that B has occurred) times (the probability that B has occurred).
The second way
Interchanging A and B in the previous equation gives an equivalent expression:
\large P(B,A) = P(BA)P(A)
Note that P(B,A) and P(A,B) are equal, since joint probability doesn’t depend on order. Equating the righthand sides of the two equations yields the famous Bayes’ Rule:
\large P(AB)P(B) = P(BA)P(A)
2. How to Build a Bayes Classifier
Suppose we have a data set composed of N data points, each of which has M features and is labeled as one of a set of K classes. It would be nice to have a classifier that can predict the class label, given the feature values, right? If you think for a moment, the conditional class probability P(CD) is just the ticket! We only need to add a decision rule to tell us how to choose the class label.
The conditional class probabilities come from an application of Bayes’ Rule:
\large P(CD) = \frac{\displaystyle P(C)P(DC)}{\displaystyle P(D)}
where:
C is a class, and D is a datapoint consisting of values for each of the M features.
The terms in the expression of Bayes’ Rule are, from left to right:
 The posterior P(CD) is the probability that the data point D is in the class C
 The prior P(C) is the probability of occurrence of the class C in the dataset, i.e. the proportion of data points that have class C
 The likelihood P(DC) is the relative probability that we would observe data point D given that the class is C
 The evidence P(D) is the probability that the data point D could occur in nature
In practice, P(D) is often hard or impossible to calculate. Not to worry! Since P(D) is independent of the class C, as far as we are concerned it’s just a proportionality constant that we don’t care about.
The only consequence of not knowing P(D) is that the posterior will be a relative probability rather than a true probability. But that’s just fine, since the relative values of the posterior suffice for class prediction.
To complete the specification of our classifier, we adopt the MAP (Maximum A Posteriori) decision rule, which assigns the label to the class with the highest posterior.
To apply the classifier, we
3. The Naïve Bayes Classifier
Adopting the “naïve” assumption that feature values are independent for members of a given class allows a marvelous simplification in computing the likelihoods. Let the M feature values of a datapoint D have values (x_1, x_2, .., x_M). Then under the “naïve” assumption, the likelihood becomes a product of the independent likelihoods for each feature:
\large P(DC) = P(x_1, x_2, .., x_MC) = \prod_{i=1}^{M} P(x_iC)
The independent likelihoods are easy to compute for discrete features: P(x_iC) is the proportion of all datapoints in class C that have a value of x_i in the ith feature.
With this simplification, Bayes’ Rule becomes
\large P(CD) \propto P(C)\prod_{i=1}^{M} P(x_iC)
The above formula is subject to underflow error, since it’s often a product of very small numbers. Taking the logarithm transforms a product of small numbers to a sum of ordinary sized numbers, while preserving the order of positive numbers (i.e. if A > B then logA > logB). With this remedy, we have a working, practicalforcomputation version of the Naïve Bayes Classifier that is easy to implement in code:
\large logP(CD) = logP(C) + \sum_{i=1}^{M} logP(x_iC)
We have neglected the log of the proportionality constant on the right hand side; since we are only interested in relative values of the posterior, this will not affect class assignments.
For features with discrete values (such as words) the ith term in the sum is the log of the proportion of datapoints of class C whose ith feature value equals x_i. What if it happens that there are none? That term becomes log(0), the equation blows up, and we can’t make a prediction. To ensure this never happens, we can add 1 to the numerator and denominator of each likelihood factor. This is a variant of Laplace Smoothing.
4. Discussion
If there are continuousvalued features, such as height and weight, Naïve Bayes can be used for regression. In this case, training data is partitioned into its classes. In each class, the values of each feature are assumed to follow a normal distribution; the parameters of the Gaussian likelihood are the mean and variance of the feature values. Alternatively, a continuous feature could be discretized by binning its values, but doing so throws away information, and results could be sensitive to the binning scheme.
The Naïve Bayes Classifier is optimal (meaning it’s the most accurate possible classifier) whenever the “naïve” assumption holds, and even in some cases when it doesn’t. In most cases the assumption does not hold, with the result that the posterior probabilities are not accurate. Surprisingly, classifications made using the MAP decision rule are often quite accurate even if the probabilities are not!
Naïve Bayes should work best when the training data is representative of the parent population, so that the priors are accurate.
5. Advantages of the Naïve Bayes Classifier

For problems with a small amount of training data, it can achieve better results than other classifiers because it has a low propensity to overfit. That’s Occam’s Razor at work!

Training is quick, and consists of computing the priors and the likelihoods.

Prediction on a new data point is quick. First calculate the posterior for each class. Then apply the MAP decision rule: the label is the class with the maximum posterior.

The RAM memory footprint is modest, since these operations do not require the whole data set to be held in RAM at once.

CPU usage is modest: there are no gradients or iterative parameter updates to compute, since prediction and training employ only analytic formulae.

Scales linearly with number of features and number of data points, and is easy to update with new training data.

Because of its linear scaling, fast execution, small memory requirement, and light CPU usage, may provide viable solutions for massive problems (many rows and columns) that are too computeintensive for other methods.

Easily handles missing feature values – by retraining and predicting without that feature!
6. Disadvantages of the Naïve Bayes Classifier

Cannot incorporate feature interactions.

For regression problems, i.e. continuous realvalued data, there may not be a good way to calculate likelihoods. Binning the data and assigning discrete classes to the bins is suboptimal since it throws away information. Assuming each feature is normally distributed is workable, but could impact performance if features are not normally distributed. On the other hand, with enough training data in each class, you could estimate the likelihood densities directly, permitting accurate likelihood calculations for new data.

Performance is sensitive to skewed data – that is, when the training data is not representative of the class distributions in the overall population. In this case, the prior estimates will be incorrect.
7. Summary
In this post, you’ve learned how to build a Naïve Bayes Classifier from scratch. You are now empowered to wield Occam’s Razor to infer truths about the universe from observational data. I hope you’ve seen a glimmer of why Bayes’ Rule ranks among the most important practical mathematical discoveries of all time. Now, go forth and discover…
8. References
Application to text classification (Stanford CS124)
Nice tutorial with worked example and code
Wikipedia article