\(K\) Nearest Neighbors (or KNN for short) is a supervised learning algorithm that sorts data into classes. Suppose we are sorting the data into \(n\) classes \((C_1, C_2, \dots, C_n).\) We have a set of labeled data points which we use as training data. Let \(\overrightarrow{x}_1, \overrightarrow{x}_2, \dots, \overrightarrow{x}_m\) be the data points and let \(y_i\) be the class of data point \(\overrightarrow{x}_i.\)
Given a new data point \(\overrightarrow{x},\) the problem we are trying to solve is determining which class \(\overrightarrow{x}\) should belong to. The \(K\) Nearest Neighbors algorithm looks at the classes of the \(K\) nearest neighbors of \(\overrightarrow{x}\) (according to some distance measure) and assigns \(\overrightarrow{x}\) the class held by the majority of those neighbors.
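To make the procedure concrete, here is a minimal sketch of the algorithm in Python. It assumes Euclidean distance as the distance measure and a plain majority vote; the function name `knn_classify` is illustrative, not part of any standard library.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point with its Euclidean distance to the query point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest labeled points.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # Return the class that appears most often among those neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```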
It is possible for this vote to produce a tie. Ties can be broken in several ways, and the best choice depends on the classification problem KNN is being used to solve; one option is sketched below.
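As one illustration (by no means the only strategy), a tie can be broken by repeatedly dropping the farthest of the \(K\) neighbors until one class holds a strict majority. This is a hedged sketch under that assumption:

```python
from collections import Counter
import math

def knn_classify_tiebreak(train_points, train_labels, query, k):
    """KNN that resolves voting ties by shrinking k until a strict majority exists."""
    # Rank all training points by Euclidean distance to the query once.
    ranked = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    while k >= 1:
        tally = Counter(label for _, label in ranked[:k]).most_common()
        # Accept the vote when the leading class strictly outvotes the runner-up.
        if len(tally) == 1 or tally[0][1] > tally[1][1]:
            return tally[0][0]
        k -= 1  # tie: drop the farthest neighbor and vote again
```

Other reasonable choices include picking randomly among the tied classes or weighting each vote by inverse distance.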
We are classifying data into a red class and a blue class. Suppose we have the following training data: \[\begin{array}{cc} \text{Data} & \text{Class} \\ \hline (1,1) & \text{red} \\ (1,4) & \text{blue} \\ (5,2) & \text{red} \\ (8,5) & \text{red} \\ (6,3) & \text{blue} \end{array}\] A new data point is observed at \((1,3),\) and we want to predict whether it will be a blue point or a red point.
We will consider the classification for two different values of \(K.\) Since there are \(2\) classes, using odd values of \(K\) guarantees there will be no ties, as an odd number of votes cannot split evenly between two classes.
\(K = 1:\) We only need to look at the class of the nearest point. The nearest point to \((1,3)\) that we have already classified is \((1,4).\) Since \((1,4)\) is blue, the KNN will classify \((1,3)\) as blue.
\(K = 3:\) We need to consider the classes of the three nearest points to \((1, 3).\) The three nearest points are \((1,4),\) \((1,1),\) and \((5,2).\) Two of these points, \((1,1)\) and \((5,2),\) are in the red class, and one, \((1,4),\) is in the blue class. Since more of the nearby points are red, we will classify \((1,3)\) as red.
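Both cases can be checked with the `knn_classify` sketch from above, using the example's coordinates and labels:

```python
points = [(1, 1), (1, 4), (5, 2), (8, 5), (6, 3)]
labels = ["red", "blue", "red", "red", "blue"]

print(knn_classify(points, labels, (1, 3), k=1))  # blue: (1,4) is the single nearest point
print(knn_classify(points, labels, (1, 3), k=3))  # red: (1,1) and (5,2) outvote (1,4)
```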
As you can see from the example, the choice of \(K\) can determine which class the KNN will predict for a given data point. Small values of \(K\) make it more likely that only nearby points are considered, but they also give a less stable prediction. Large values of \(K\) give more stable predictions but may cause the KNN to consider points that are not near the given data point.
Drawing points from the red class and the blue class in the plane and then running the KNN algorithm over the whole plane reveals the regions over which future points would be classified as red or blue; the shape of these regions depends on the value of \(K.\)
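One way to approximate this picture in code is to classify every point of a dense grid and color it by the predicted class. The following sketch assumes NumPy and matplotlib are available and reuses `points`, `labels`, and `knn_classify` from the snippets above:

```python
import numpy as np
import matplotlib.pyplot as plt

K = 3  # try K = 1, 3, 5 and watch the regions change

# Predict the class at every point of a grid covering the example data.
xs, ys = np.meshgrid(np.linspace(0, 9, 200), np.linspace(0, 6, 200))
is_red = np.array([
    knn_classify(points, labels, (x, y), k=K) == "red"
    for x, y in zip(xs.ravel(), ys.ravel())
]).reshape(xs.shape)

# Shade the plane: one color where the prediction is red, another where it is blue.
plt.contourf(xs, ys, is_red.astype(float), levels=[-0.5, 0.5, 1.5],
             colors=["lightblue", "lightcoral"])
plt.scatter(*zip(*points), c=labels)
plt.title(f"KNN decision regions, K = {K}")
plt.show()
```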