Support vector machines (SVMs) have many applications, including facial recognition software, handwriting recognition, and applications in bioinformatics. In this lesson we introduce SVMs with a simple application. Suppose data is sorted into \(2\) classes. We want to find a hyperplane that separates the data.
The goal is easiest to visualize in \(2\)-dimensions. The data in the following image is separated into a red class and a blue class:
The goal is to have the computer draw a line between the two classes. There are many ways a line could be drawn. Here is one way to draw a line that separates the data:
Here is another:
The first line looks like a more reasonable way to split the plane than the second: the distance from the first line to the blue class is roughly the same as its distance to the red class, whereas the second line is much closer to the blue class than to the red class.
A support vector machine allows us to find such a reasonable line, and it also builds in a tolerance for outliers in the case that no line can separate the data.
We have data points in \(\mathbb{R}^n\) that are in two classes. We shall give the classes numerical values by assigning one class (say red) \(-1\) and the other class (blue) \(1.\) We assume we have a set of training data of the form \((\overrightarrow{x},y)\) where \(\overrightarrow{x}\) is the feature vector and \(y \in \{-1, 1\}\) is the class.
The goal is to draw a hyperplane that separates the data into two groups. We can represent a hyperplane as the set of all vectors \(\overrightarrow{v}\) such that \[\overrightarrow{w} \cdot \overrightarrow{v} - b = 0\] where \(\overrightarrow{w}\) is the normal vector, or weight vector, and \(b\) is a constant.
First we look at the ideal case when the two classes can be separated by a plane. The goal is to find the plane that separates the points with equal distance between the plane and each class, and this distance should be as large as possible. For a given plane \(\overrightarrow{w} \cdot \overrightarrow{v} - b = 0,\) the parallel planes that lie an equal distance above and below it have equations \begin{align} & \overrightarrow{w} \cdot \overrightarrow{v} - b = 1 \\ & \overrightarrow{w} \cdot \overrightarrow{v} - b = -1 \end{align} The distance between the plane \(\overrightarrow{w} \cdot \overrightarrow{v} - b = 0\) and either of the other planes is \(\frac{1}{||\overrightarrow{w}||}.\) Maximizing this distance therefore means minimizing \(||\overrightarrow{w}||.\) In practice, maximizing the distance \(\frac{1}{||\overrightarrow{w}||}\) directly would involve taking a square root and then inverting the result. There is no computational benefit to these last two steps, so it is more practical to minimize \(||\overrightarrow{w}||^2\) instead.
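To see where the distance \(\frac{1}{||\overrightarrow{w}||}\) comes from, pick a point \(\overrightarrow{v}_0\) on the plane \(\overrightarrow{w} \cdot \overrightarrow{v} - b = 0\) and a point \(\overrightarrow{v}_1\) on the plane \(\overrightarrow{w} \cdot \overrightarrow{v} - b = 1.\) Subtracting the two equations gives \(\overrightarrow{w} \cdot (\overrightarrow{v}_1 - \overrightarrow{v}_0) = 1.\) The distance between the planes is the length of the projection of \(\overrightarrow{v}_1 - \overrightarrow{v}_0\) onto the unit normal \(\frac{\overrightarrow{w}}{||\overrightarrow{w}||},\) which is \[\frac{\overrightarrow{w} \cdot (\overrightarrow{v}_1 - \overrightarrow{v}_0)}{||\overrightarrow{w}||} = \frac{1}{||\overrightarrow{w}||}.\]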
The two planes \(\overrightarrow{w} \cdot \overrightarrow{v} - b = 1\) and \(\overrightarrow{w} \cdot \overrightarrow{v} - b = -1\) will separate the data so long as \begin{align} & \overrightarrow{w} \cdot \overrightarrow{x}_i - b \geq 1 \text{ when } y_i = 1\\ & \overrightarrow{w} \cdot \overrightarrow{x}_i - b \leq -1 \text{ when } y_i = -1 \end{align} Using the \(y_i\) classes, we can combine these into one condition: \[y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i - b) \geq 1\] We need to find the vector \(\overrightarrow{w}\) and constant \(b\) that satisfy this condition and minimize \(||\overrightarrow{w}||^2.\) Once we have found this \(\overrightarrow{w}\) and \(b,\) the classifier will predict that a new vector \(\overrightarrow{v}\) is in class \[f(\overrightarrow{v}) = \text{sgn}(\overrightarrow{w} \cdot \overrightarrow{v} - b)\] where \(\text{sgn}(x) = 1\) if \(x > 0\) and \(\text{sgn}(x) = -1\) if \(x < 0.\)
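In code, the prediction step is just a sign check. The following is a minimal Python sketch (not the code used by this lesson); the names `predict`, `w`, and `b` are our own:

```python
import numpy as np

def predict(w, b, v):
    """Classify v using the hyperplane w . v - b = 0: return 1 or -1."""
    return 1 if np.dot(w, v) - b > 0 else -1

# Example: the line x_1 + x_2 - 1 = 0 in the plane
w = np.array([1.0, 1.0])
b = 1.0
print(predict(w, b, np.array([2.0, 2.0])))    # 1  (blue side)
print(predict(w, b, np.array([-1.0, -1.0])))  # -1 (red side)
```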
Often, the data is not linearly separable, so we need some condition that allows for errors. Since a point that is correctly classified satisfies \(y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i - b) \geq 1,\) one way to measure the error in classifying \(\overrightarrow{x}_i\) is \[\text{max}(0, 1-y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i - b))\] The goal is to minimize both \(||\overrightarrow{w}||^2\) and the errors. So, in general, the goal is to minimize \[||\overrightarrow{w}||^2+C\sum_i\text{max}(0, 1-y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i - b))\] where \(C\) is a parameter that determines the cost of making a mistake. By adjusting \(C,\) we can determine whether it is more important that there is a large distance between the two classes or that there are few errors.
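As a concrete illustration of this objective, here is a short Python sketch of how it could be evaluated on a set of training points; the function name `objective` and the variable names are ours, not part of the lesson:

```python
import numpy as np

def objective(w, b, X, y, C):
    """Soft-margin objective ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i - b)).

    X is an (n, d) array of feature vectors and y holds the labels, each -1 or 1.
    """
    margins = y * (X @ w - b)               # y_i (w . x_i - b) for every point
    hinge = np.maximum(0.0, 1.0 - margins)  # the per-point errors
    return np.dot(w, w) + C * np.sum(hinge)

# Two training points: one blue (label 1) and one red (label -1)
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
print(objective(np.array([0.5, 0.5]), 1.0, X, y, C=1.0))  # 0.5: both points have margin >= 1
```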
You can draw points in the red class and blue class on the plane, then let the SVM separate the classes. The value of \(C\) determines the cost of an error. The light red and light blue regions represent the planes \(\overrightarrow{w} \cdot \overrightarrow{v} - b = 1\) and \(\overrightarrow{w} \cdot \overrightarrow{v} - b = -1,\) and they visually show the distance between the two groups. Notice that for higher values of \(C\) the distance will be smaller.
Left click in the box to add red points, right click for blue.
(Or left click + command key for blue.) Then click the Classify! button.
Use left click + shift key to remove a point.
This is a \(2\)-dimensional example, so our feature vectors are of the form \(\overrightarrow{x} = (x_1, x_2).\) The algorithm uses gradient descent, so it is easier to limit the solution space by penalizing the constant \(b\) as well as \(\overrightarrow{w}\) and the error. The algorithm finds a vector \(\overrightarrow{w} = (w_1, w_2)\) and a constant \(b\) that are a local minimum of \begin{align} & ||\overrightarrow{w}||^2+b^2+C\sum_i\text{max}(0, 1-y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i - b)) \\ & = w_1^2+w_2^2+b^2+C\sum_i\text{max}(0, 1-y_i(w_1x_1 + w_2x_2 - b)) \end{align}
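A bare-bones gradient descent loop for this minimization might look like the following sketch. The learning rate, step count, and the choice of subgradient where the maximum is not differentiable are our own assumptions, not details given in the lesson:

```python
import numpy as np

def fit_svm(X, y, C=1.0, lr=0.01, steps=2000):
    """Minimize ||w||^2 + b^2 + C * sum_i max(0, 1 - y_i (w . x_i - b)) by (sub)gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        margins = y * (X @ w - b)
        active = margins < 1  # points that currently contribute an error term
        # Subgradients of the objective with respect to w and b
        grad_w = 2 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = 2 * b + C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny example: red class (-1) on the left, blue class (1) on the right
X = np.array([[-2.0, 0.0], [-1.5, 1.0], [2.0, 0.0], [1.5, -1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = fit_svm(X, y, C=10.0)
print(np.sign(X @ w - b))  # should match y for separable data like this
```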
The following example shows that when the cost parameter is low, the algorithm allows a few points to be misclassified. In this picture, the cost of an error is low, the groups are separated by a large distance, and the algorithm finds that the red class is mostly on the left whereas the blue class is mostly on the right.
When the cost parameter is high, the algorithm is more sensitive to misclassified data points. The distance between the classes is small, and the algorithm finds that the red class is on the bottom whereas the blue class is on the top.