
The Naive Bayes algorithm is a classification algorithm that can classify data into discrete or continuous classes. The following are examples in which you can use a Naive Bayes classifier:

  1. Spam filter: Look at the contents of an email message and determine whether the message should appear in the recipient's inbox or their spam folder.
  2. Diagnostic: Give students a test with questions that span a range of difficulties and determine the level of performance of a given student.

The Naive Bayes Algorithm

A Naive Bayes classifier is applicable when each data point \(d\) has observed features \(x_1, x_2, \dots, x_n\) and belongs to exactly one class \(c\) from a known set of classes.

Note that the probabilities of \(x_1, x_2, \dots, x_n\) given \(C = c\) do not need to be truly independent, but assuming independence must give approximately correct results.

Given \(x_1, x_2, \dots, x_n,\) the probability that the class of \(d\) is \(c\) is \[P(C = c | x_1, x_2, \dots, x_n) = \frac{P(x_1, x_2, \dots, x_n|C = c)P(C = c)}{P(x_1, x_2, \dots, x_n)}\] by Bayes' Rule.

The Naive Bayes algorithm is called naive because we assume independence of \(x_1, x_2, \dots, x_n\) given \(C = c.\) By the independence assumption, we get \[\frac{P(x_1, x_2, \dots, x_n|C = c)P(C = c)}{P(x_1, x_2, \dots, x_n)} = \frac{P(x_1|C = c)P(x_2|C = c)\dots P(x_n|C = c)P(C = c)}{P(x_1, x_2, \dots, x_n)}\]

We will classify \(d\) as the most likely class \(c.\) So, the class is \[\text{argmax}_{c} \frac{P(x_1|C = c)P(x_2|C = c)\dots P(x_n|C = c)P(C = c)}{P(x_1, x_2, \dots, x_n)}\] Notice that the denominator does not depend on \(c,\) so dropping the denominator does not change the class. The class we assign to \(d\) is \[\hat{C} = \text{argmax}_{c} P(x_1|C = c)P(x_2|C = c)\dots P(x_n|C = c)P(C = c)\]

The values \(P(C = c)\) form a prior distribution over the classes. For example, if the application is spam filtering, then without knowing anything about a random email it is unlikely the email is spam. However, once you know the words \(x_1, x_2, \dots, x_n\) in the email, the probability that it is spam may be greater than the probability that it is ham.
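The whole procedure fits in a few lines of code. Below is a minimal sketch of the argmax step, assuming the priors \(P(C = c)\) and conditional probabilities \(P(x_i|C = c)\) are already available as dictionaries; the function name and the toy spam-filter numbers are purely illustrative. Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow and does not change the argmax.

```python
from math import log

def naive_bayes_classify(features, priors, likelihoods):
    """Return the class c maximizing P(C=c) * prod_i P(x_i | C=c).

    priors:      dict mapping class -> P(C=c)
    likelihoods: dict mapping class -> dict mapping feature -> P(x_i | C=c)
    Sums of log-probabilities are compared instead of raw products
    purely to avoid numerical underflow.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = log(prior) + sum(log(likelihoods[c][x]) for x in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy spam-filter illustration (made-up numbers):
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"winner": 0.20, "meeting": 0.01},
    "ham":  {"winner": 0.001, "meeting": 0.10},
}
print(naive_bayes_classify(["winner"], priors, likelihoods))  # -> "spam"
```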

Example 1

Naive Bayes can be used to classify points in a plane. The points are one of two classes: red or blue. You can draw points on the board to add data, and then hit the classify button to make predictions about every point on the board. The regions will be colored based on which class the classifier believes is most likely. So, the classifier is estimating that if a new point is added to the red region, the point will be in the red class, and it estimates that if a point is added to the blue region then it will be in the blue class.

Left click in the box to add red points, right click for blue
(or left click + command key for blue). Then click the Classify! button.
Use left click + shift key to remove a point.



The algorithm works as follows:
Let \((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\) be the coordinates of the red points. Define \(\mu_{x,r}\) to be the mean of the \(x\)-coordinates of the red points: \[\mu_{x,r} = \sum_{i=1}^n \frac{x_i}{n}\] Define \(\mu_{y,r}\) to be the mean of the \(y\)-coordinates of the red points: \[\mu_{y,r} = \sum_{i=1}^n \frac{y_i}{n}\] Define \(\sigma_{x,r}^2\) to be the variance of the \(x\)-coordinates of the red points: \[\sigma_{x,r}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu_{x,r})^2\] Define \(\sigma_{y,r}^2\) to be the variance of the \(y\)-coordinates of the red points: \[\sigma_{y,r}^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \mu_{y,r})^2\] Similar parameters can be found for the blue points.
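As a sketch of this step, assuming the red points are given as a plain list of \((x, y)\) pairs (the helper name and sample coordinates are illustrative), the parameters can be computed as follows:

```python
def mean_and_variance(values):
    """Sample mean and sample variance (n - 1 denominator),
    matching the formulas for mu and sigma^2 above."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, var

# red_points is a hypothetical list of (x, y) coordinates of red points.
red_points = [(1.0, 2.0), (1.5, 2.4), (0.8, 1.9)]
mu_x_r, var_x_r = mean_and_variance([x for x, _ in red_points])
mu_y_r, var_y_r = mean_and_variance([y for _, y in red_points])
```

The same helper applied to the blue points gives the blue parameters.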

The mean and variance are used to make an estimate using a normal distribution. Let \(R\) be the set of red points. The probability that a given data point \(d\) is in the red class can be determined once we know the coordinates of \(d.\) \[P(d \in R| d = (u,v)) = \frac{P(d = (u,v)| d \in R)P(d \in R)}{P(d = (u,v))}\] Let \(B\) be the set of blue data points. Without being given any information about a random data point, we can assume the probability a point is red or blue is equal to the proportion of red or blue points, respectively. So, letting \(|R|\) be the number of red points and \(|B|\) be the number of blue points, \begin{align} P(d \in R) &= \frac{|R|}{|R|+|B|} \\ P(d \in B) &= \frac{|B|}{|R|+|B|} \end{align} Then, for a given data point \(d\) with coordinates \((u,v),\) we can predict its class as follows:

Using a normal distribution for each of the \(x\)- and \(y\)-coordinates, we have \begin{align} P(d = (u,v) | d \in R)P(d \in R) & = \frac{1}{\sqrt{2\pi\sigma_{x,r}^2}}e^{-(u-\mu_{x,r})^2/ 2\sigma_{x,r}^2}\cdot \frac{1}{\sqrt{2\pi\sigma_{y,r}^2}}e^{-(v-\mu_{y,r})^2/ 2\sigma_{y,r}^2} \cdot \frac{|R|}{|R|+|B|} \\ & = \frac{|R|}{(|R|+|B|)2\pi\sqrt{\sigma_{x,r}^2\sigma_{y,r}^2}}e^{-((u-\mu_{x,r})^2/ 2\sigma_{x,r}^2+(v-\mu_{y,r})^2/ 2\sigma_{y,r}^2)} \end{align} and \begin{align} P(d = (u,v) | d \in B)P(d \in B) & = \frac{1}{\sqrt{2\pi\sigma_{x,b}^2}}e^{-(u-\mu_{x,b})^2/ 2\sigma_{x,b}^2}\cdot \frac{1}{\sqrt{2\pi\sigma_{y,b}^2}}e^{-(v-\mu_{y,b})^2/ 2\sigma_{y,b}^2} \cdot \frac{|B|}{|R|+|B|} \\ & = \frac{|B|}{(|R|+|B|)2\pi\sqrt{\sigma_{x,b}^2\sigma_{y,b}^2}}e^{-((u-\mu_{x,b})^2/ 2\sigma_{x,b}^2+(v-\mu_{y,b})^2/ 2\sigma_{y,b}^2)} \end{align} So, we get \begin{align} & P(d \in R| d = (u,v)) > P(d \in B| d = (u,v)) \Leftrightarrow \\ & P(d = (u,v)| d \in R)P(d \in R) > P(d = (u,v)| d \in B)P(d \in B) \Leftrightarrow \\ & \frac{|R|}{\sqrt{\sigma_{x,r}^2\sigma_{y,r}^2}}e^{-((u-\mu_{x,r})^2/ 2\sigma_{x,r}^2+(v-\mu_{y,r})^2/ 2\sigma_{y,r}^2)} > \frac{|B|}{\sqrt{\sigma_{x,b}^2\sigma_{y,b}^2}}e^{-((u-\mu_{x,b})^2/ 2\sigma_{x,b}^2+(v-\mu_{y,b})^2/ 2\sigma_{y,b}^2)} \end{align}
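A minimal sketch of this comparison, assuming each class's parameters are collected in a dictionary (the key names, layout, and sample numbers are assumptions made for illustration, not part of the original page):

```python
from math import exp, pi, sqrt

def normal_pdf(z, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return exp(-(z - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def predict_color(u, v, red, blue):
    """Return 'red' or 'blue' for the point (u, v).

    red and blue are dicts with keys 'mu_x', 'var_x', 'mu_y', 'var_y', 'count',
    e.g. as computed from the training points in the previous sketch.
    """
    total = red["count"] + blue["count"]
    score_red = (normal_pdf(u, red["mu_x"], red["var_x"])
                 * normal_pdf(v, red["mu_y"], red["var_y"])
                 * red["count"] / total)
    score_blue = (normal_pdf(u, blue["mu_x"], blue["var_x"])
                  * normal_pdf(v, blue["mu_y"], blue["var_y"])
                  * blue["count"] / total)
    return "red" if score_red > score_blue else "blue"

# Hypothetical parameters for each class:
red_params  = {"mu_x": 1.1, "var_x": 0.13, "mu_y": 2.1, "var_y": 0.07, "count": 3}
blue_params = {"mu_x": 4.0, "var_x": 0.50, "mu_y": 0.5, "var_y": 0.40, "count": 5}
print(predict_color(1.0, 2.0, red_params, blue_params))  # -> "red"
```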


Example 2

Some of the best mini-golf players have assembled to be ranked. Each player will be given a skill rating between \(0\) and \(5.\) Based on previously collected data and assumptions, the courses players may attempt have the following difficulties: \[\begin{array}{c|c} \text{Course} & \text{Difficulty} \\ \hline 1 & 0.5 \\ 2 & 1.1 \\ 3 & 1.8 \\ 4 & 2.2 \\ 5 & 2.5 \\ 6 & 2.8 \\ 7 & 3.1 \\ 8 & 3.9 \\ 9 & 4.4 \\ 10 & 4.8 \end{array}\] Players may or may not attempt all \(10\) courses.
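For the sketches that follow, the table above can be held in a simple dictionary mapping course number to difficulty (a representation chosen purely for illustration):

```python
# Course difficulties as listed in the table above.
DIFFICULTY = {
    1: 0.5, 2: 1.1, 3: 1.8, 4: 2.2, 5: 2.5,
    6: 2.8, 7: 3.1, 8: 3.9, 9: 4.4, 10: 4.8,
}
```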

Intuitively, a player of skill \(3\) should make par on courses \(1\) through \(6\) and should not make par on courses \(7\) through \(10.\) However, on a bad day a player of skill \(3\) may not make par on one of courses \(1\) through \(6\), or on a good day may make par on one of courses \(7\) through \(10\). So, the model needs to allow for some leniency. If a player does not make par on course \(6\) but does make par on courses \(7,\) \(9\) and \(10,\) then there should be a high probability that the player can make par on course \(8.\) The over-par score on course \(6\) should be considered a fluke.

The probability that a player of skill class \(c\) will make par on a course of difficulty \(d\) will be given by a logistic distribution (although other distributions could also work). Let \(D\) be the event that the player makes par on a course of difficulty \(d\) and let \(C\) be the player's skill class. Then \[P(D|C = c) = \frac{1}{1+e^{d-c}}\] Notice that as the difficulty \(d\) increases, the probability of making par decreases. Similarly, as the player's skill class \(c\) increases, the probability of making par increases.
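A quick sketch of the logistic model, with a sanity check of the monotonicity just described (the helper name and the sample numbers are illustrative):

```python
from math import exp

def p_par(d, c):
    """P(par | skill c) on a course of difficulty d, per the logistic model above."""
    return 1.0 / (1.0 + exp(d - c))

print(p_par(1.8, 3.0))  # ~0.77: an easy course for a skill-3 player
print(p_par(4.4, 3.0))  # ~0.20: a hard course for the same player
print(p_par(4.4, 4.5))  # ~0.52: the same course for a stronger player
```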

We will assume we have no information about a random player. So, our prior assumption about the player gives equal weight to all possible classes. The pdf is \(f(c) = \frac{1}{5}\) for \(0 \leq c \leq 5.\) Since \(f(c)\) is the same for all values \(c,\) we do not need to use it to compute the argmax. Therefore, to assign a player a class we need to compute something of the form \[\hat{C} = \text{argmax}_{c} P(x_1|C = c)P(x_2|C = c)\dots P(x_n|C = c)\] Let \(A\) be the set of difficulty levels of courses on which a player makes par, and \(B\) be the set of difficulty levels of courses on which a player does not make par. Let's further assume that difficulty \(0\) is easy enough that any player should make par and difficulty \(5\) is hard enough that no player could make par. Using \(P(D|C = c) = \frac{1}{1+e^{d-c}}\) for making par and \(P(\bar{D}|C = c) = 1-\frac{1}{1+e^{d-c}}\) for not making par, the class of the player will be \[\hat{C} = \text{argmax}_{c} \frac{1}{1+e^{-c}}\left(1-\frac{1}{1+e^{5-c}}\right)\left[\prod_{d \in A} \frac{1}{1+e^{d-c}}\right] \left[\prod_{d \in B} \left(1-\frac{1}{1+e^{d-c}}\right)\right]\]

The assumption that every player could make par on a difficulty \(0\) course and no player could make par on a difficulty \(5\) course prevents a player's estimated skill from tending to infinity or negative infinity. For example, without the assumption, if a player only attempts one course and makes par, then their estimated skill will be infinity. On the other hand, without the assumption, if a player attempts one course and does not make par, their estimated skill will be negative infinity.
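A minimal sketch of this classification step, scanning a grid of candidate skill values over \([0, 5]\); the grid resolution, helper names, and example record are assumptions made for illustration. The anchor terms for difficulty \(0\) and difficulty \(5\) are included, and log-probabilities are summed purely for numerical convenience, which does not change the argmax.

```python
from math import exp, log

def p_par(d, c):
    """P(par | skill c) on a course of difficulty d, per the logistic model above."""
    return 1.0 / (1.0 + exp(d - c))

def estimate_skill(par_difficulties, miss_difficulties, step=0.01):
    """Return the skill c in [0, 5] maximizing the product above.

    par_difficulties:  difficulties of courses the player parred (the set A)
    miss_difficulties: difficulties of courses the player missed par on (the set B)
    The anchor terms -- par on a difficulty-0 course, no par on a difficulty-5
    course -- keep the estimate from running off to plus or minus infinity.
    """
    best_c, best_score = 0.0, float("-inf")
    for i in range(int(5.0 / step) + 1):
        c = i * step
        score = log(p_par(0.0, c)) + log(1.0 - p_par(5.0, c))
        score += sum(log(p_par(d, c)) for d in par_difficulties)
        score += sum(log(1.0 - p_par(d, c)) for d in miss_difficulties)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Hypothetical record: par on courses 1-6 and 8, over par on courses 7, 9, 10.
parred = [0.5, 1.1, 1.8, 2.2, 2.5, 2.8, 3.9]
missed = [3.1, 4.4, 4.8]
print(estimate_skill(parred, missed))  # estimated skill between 3.5 and 4
```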

You can use the whiteboard to check which courses a player has attempted and whether they made par. The Naive Bayes algorithm will then be used to classify the player's skill.



