AMH - 18__ml.18_kernel

Introduction

Suppose you have \(5\) runners who enter a race and get the following times: \begin{align} & 3 \text{ min } 56 \text{ sec} \\ & 4 \text{ min } 1 \text{ sec} \\ & 4 \text{ min } 17 \text{ sec} \\ & 6 \text{ min } 53 \text{ sec} \\ & 7 \text{ min } 2 \text{ sec} \\ \end{align} You are to classify the racers by skill level. You could do this in several ways. For example, maybe the first \(3\) runners are in a 'fast' class and the last \(2\) runners are in a 'slow' class. Or, the classes could be divided by the minute on which a runner finished. So the first runner is in class '3,' the next two are in class '4,' the next is in class '6' and the last is in class '7.' However, that division may be unfair because the second runner is only \(5\) seconds behind the first, and they were placed in a class with a runner who is \(16\) seconds slower.

The kernel density algorithm is a way to create classes algorithmically.

Description

Kernel densities can be used to form clusters of real-valued data in \(n\)-dimensions. For each data point \(d\) with value \(x,\) add a function \(f_d\) such that the maximum of \(f_d\) is \(d\) and \(f_d\) is decreasing as points get further from \(d.\) So, if \(\mathcal{D}\) is the set of all data points, you get a function \[f(x) = \sum_{d \in \mathcal{D}} f_d(x)\] Given two points \(x\) and \(y,\) one can define a line segment connecting \(x\) and \(y\) as a continuous function \(f_{(x,y)}(t),\) \(0 \leq t \leq 1,\) such that \(f_{(x,y)}(0) = x\) and \(f_{(x,y)}(1) = y.\) The points \(x\) and \(y\) are in the same class if there exists a line segment \(f_{x,y}(t)\) that does not have a local minimum in \((0,1).\) Otherwise, \(x\) and \(y\) are in different classes.

One way to visiualize the algorithm is to imagine a hill with a peak at each data point. If there are two nearby data points, then the hills at each point will combine to make one big hill. If two data points are on the same hill, then they are in the same class. If they are on different hills, they are in different classes.

A common function to use is a normal distribution curve. Given a variance \(\sigma^2\) and a data point \(d\) at \(x,\) the curve is \[f_d(t) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-(t-x)^2/\sigma^2}\] For larger clusters of points, use a higher variance. A high variance has short, flat curves. For smaller clusters, use a lower variance. A low variance hsa tall, thin curves.