For an intuitive example, suppose a friend looks at the following shapes and chooses one:
Your goal is to figure out which shape your friend chose. You can ask the following questions:
One large flaw with the above example is that it classifies every data point with \(100\%\) accuracy. Sometimes it is not possible to create such a decision tree, and often creating one would lead to overfitting the training data.
Tree that generates the classifier:
The data can be split halfway between the \(x\)- or \(y\)-values of any two consecutive points in the same node. For example, if the decision tree has the points \((1,3), (2,2), (6, 4), (12, 3), (1,8),\) then to find where the data can be split, order the \(x\) and \(y\) values:
\[x: 1, 1, 2, 6, 12\]
\[y: 2, 3, 3, 4, 8\]
The splits can be at \(x = 1.5, 4,\) or \(9,\) or \(y = 2.5, 3.5\) or \(6.\)
If the decision is to split along \(x = 4,\) then \((1,3), (2,2)\) and \((1,8)\) will branch from the node to one side and \((6,4)\) and \((12, 3)\) will branch to the other side. If the decision is to split along \(y = 6\) then \((1,3), (2,2), (6,4)\) and \((12, 3)\) will be on one side and \((1, 8)\) will be on the other.
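To make this concrete, the midpoints can be enumerated in a few lines of Python. The following is a minimal sketch (the function name `candidate_splits` is our own, not a library routine):

```python
def candidate_splits(points):
    """Candidate split values along each axis: the midpoints between
    consecutive distinct sorted coordinate values."""
    splits = {}
    for axis, name in ((0, "x"), (1, "y")):
        values = sorted({p[axis] for p in points})  # distinct values, sorted
        splits[name] = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return splits

points = [(1, 3), (2, 2), (6, 4), (12, 3), (1, 8)]
print(candidate_splits(points))
# {'x': [1.5, 4.0, 9.0], 'y': [2.5, 3.5, 6.0]}
```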
The tree will have a depth of at most \(4,\) meaning that after at most \(3\) decisions an output class will be estimated. The question is, how do we decide which splits to use?
The Gini index is a number from \(0\) to \(\frac{1}{2},\) inclusive, which measures how mixed a population with \(2\) classes is. A population with Gini index \(0\) contains only one class.
If the two classes in a population are class \(A\) and class \(B,\) then the Gini score of the population is
\[1-\left(\frac{|A|}{|A|+|B|}\right)^2-\left(\frac{|B|}{|A|+|B|}\right)^2 = \frac{2|A||B|}{(|A|+|B|)^2}\]
So, if the population is all class \(A,\) we get \(1-1^2-0^2=0.\) If the population is all class \(B,\) we get \(1-0^2-1^2=0.\) If the population is half \(A\) and half \(B\) we get \(1-\left(\frac{1}{2}\right)^2-\left(\frac{1}{2}\right)^2 = \frac{1}{2}.\)
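Since the two-class Gini index is a one-line computation, a small Python sketch (with our own function name `gini`) reproduces these three cases:

```python
def gini(a, b):
    """Gini index of a two-class population with a points of class A and
    b points of class B: 1 - (a/n)^2 - (b/n)^2 = 2ab/n^2."""
    n = a + b
    return 0.0 if n == 0 else 2 * a * b / n ** 2

print(gini(4, 0))  # 0.0 -- all class A
print(gini(0, 4))  # 0.0 -- all class B
print(gini(2, 2))  # 0.5 -- an even mix, the maximum for two classes
```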
In the example we have a red class and a blue class. Suppose also that we can split the tree along one of the characteristics \(c_1, c_2, \dots, c_n.\) Each decision about characteristic \(c_i\) divides the data into two groups, \(G_1(i)\) and \(G_2(i),\) where \(G_1(i)\) is the set of points with characteristic \(c_i\) and \(G_2(i)\) is the set of points without it. Let \(G_1(i,r)\) and \(G_1(i,b)\) be the sets of points in \(G_1(i)\) that are red and blue, respectively, and define \(G_2(i,r)\) and \(G_2(i,b)\) analogously for \(G_2(i).\) The Gini index of characteristic \(c_i\) is
\[\left(\frac{2|G_1(i,r)||G_1(i,b)|}{(|G_1(i,r)|+|G_1(i,b)|)^2}\right)\cdot\frac{|G_1(i)|}{|G_1(i)|+|G_2(i)|}+
\left(\frac{2|G_2(i,r)||G_2(i,b)|}{(|G_2(i,r)|+|G_2(i,b)|)^2}\right)\cdot\frac{|G_2(i)|}{|G_1(i)|+|G_2(i)|}\]
In words, this formula is
\[G_i = \big(\text{Gini index of }G_1(i)\big) \cdot \big(\text{proportion of points in }G_1(i)\big) + \big(\text{Gini index of }G_2(i)\big) \cdot \big(\text{proportion of points in }G_2(i)\big).\]
The split will be along the characteristic \(c_i\) that minimizes \(G_i.\) This split ensures that the groups, divided based on whether they have characteristic \(c_i,\) are as homogeneously red and blue as possible. The Gini index of a larger group is weighted more heavily than that of a smaller one; otherwise, a group containing a single point, which is trivially homogeneous, would dominate the score.
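A direct translation of the weighted formula might look like the sketch below; `split_gini` is our own name, and the `gini` helper is repeated from the previous sketch so this one runs on its own:

```python
def gini(a, b):
    """Two-class Gini index: 2ab/(a+b)^2."""
    n = a + b
    return 0.0 if n == 0 else 2 * a * b / n ** 2

def split_gini(groups):
    """Weighted Gini index of a split. Each group is a (red, blue) count
    pair; its Gini index is weighted by the fraction of all points that
    fall in that group."""
    total = sum(r + b for r, b in groups)
    return sum(gini(r, b) * (r + b) / total for r, b in groups)
```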
Suppose the data points \((1,3), (2,2), (6, 4), (12, 3), (1,8)\) are in the following classes:
\begin{align}
& (1,3): \text{red}\\
& (2,2): \text{red}\\
& (6,4): \text{blue}\\
& (12,3): \text{red}\\
& (1,8): \text{blue}\\
\end{align}
Splitting along \(x=4\) splits the data into \(\{(1,3), (2,2), (1,8)\}\) and \(\{(6,4), (12, 3)\},\) which have classes \(\{\text{red}, \text{red}, \text{blue}\}\) and \(\{\text{blue}, \text{red}\}.\) So, the Gini score is
\begin{align}
& \left(\frac{2 \cdot 2 \cdot 1}{(2 + 1)^2}\right)\cdot\frac{3}{5} + \left(\frac{2 \cdot 1 \cdot 1}{(1+1)^2}\right)\cdot \frac{2}{5} \\
& = \frac{4}{9} \cdot \frac{3}{5} + \frac{1}{2} \cdot \frac{2}{5} \\
& = \frac{7}{15} \approx 0.467
\end{align}
Splitting along \(y=6\) splits the data into \(\{(1,3), (2,2), (6,4), (12, 3)\}\) and \(\{(1,8)\},\) which have classes \(\{\text{red}, \text{red}, \text{blue}, \text{red}\}\) and \(\{\text{blue}\}.\) So, the Gini score is
\begin{align}
& \left(\frac{2 \cdot 3 \cdot 1}{(3+1)^2}\right)\cdot\frac{4}{5} + \left(\frac{2 \cdot 0 \cdot 1}{(0 + 1)^2}\right)\cdot \frac{1}{5} \\
& = \frac{3}{8} \cdot \frac{4}{5} + 0 \cdot \frac{1}{5} \\
& = \frac{3}{10} = 0.3
\end{align}
So, the split along \(y=6\) is better than the split along \(x=4\) since it has a lower Gini score. Notice that the groups appear to be more uniform.
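Both hand computations match the `split_gini` sketch from above:

```python
# Assumes gini and split_gini as defined in the earlier sketch.
print(split_gini([(2, 1), (1, 1)]))  # x = 4 split: 7/15 = 0.4666...
print(split_gini([(3, 1), (0, 1)]))  # y = 6 split: 3/10 = 0.3
```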
Among all possible splits under consideration, the split along \(y = 3.5\) splits the data into a red group and a blue group. So, the split has Gini score \(0\) and would be the split used by the algorithm.
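Putting the pieces together, a brute-force search over every candidate split (a minimal, self-contained sketch with our own function names, not a library implementation) confirms that \(y = 3.5\) is the split the algorithm would choose:

```python
def best_split(points, labels):
    """Score every candidate split with the weighted Gini index and
    return the (axis_name, split_value, score) triple with the lowest
    score. Labels are the strings "red" and "blue"."""
    best = None
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        for lo, hi in zip(values, values[1:]):
            split = (lo + hi) / 2
            sides = ([l for p, l in zip(points, labels) if p[axis] < split],
                     [l for p, l in zip(points, labels) if p[axis] >= split])
            score = sum(
                2 * g.count("red") * g.count("blue") / len(g) ** 2
                * len(g) / len(points)
                for g in sides if g
            )
            if best is None or score < best[2]:
                best = ("xy"[axis], split, score)
    return best

points = [(1, 3), (2, 2), (6, 4), (12, 3), (1, 8)]
labels = ["red", "red", "blue", "red", "blue"]
print(best_split(points, labels))  # ('y', 3.5, 0.0)
```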