For an intuitive example, suppose a friend looks at the following shapes and chooses one:
Your goal is to figure out which shape your friend chose. You can ask the following questions:
One large flaw with the above example is that it classifies every data point with \(100\%\) accuracy. Sometimes it is not possible to create such a decision tree, and often creating one would lead to overfitting the training data.
Tree that generates the classifier:
The data can be split halfway between the \(x\)- or \(y\)-values of any two consecutive points in the same node. For example, if the decision tree has the points \((1,3), (2,2), (6, 4), (12, 3), (1,8),\) then to find where the data can be split, order the \(x\) and \(y\) values:
\[x: 1, 1, 2, 6, 12\]
\[y: 2, 3, 3, 4, 8\]
The splits can be at \(x = 1.5, 4,\) or \(9,\) or \(y = 2.5, 3.5\) or \(6.\)
If the decision is to split along \(x = 4,\) then \((1,3), (2,2)\) and \((1,8)\) will branch from the node to one side and \((6,4)\) and \((12, 3)\) will branch to the other side. If the decision is to split along \(y = 6\) then \((1,3), (2,2), (6,4)\) and \((12, 3)\) will be on one side and \((1, 8)\) will be on the other.
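To make this concrete, the midpoints can be enumerated in a few lines of Python. The following is a minimal sketch (the function name `candidate_splits` is our own, not a library routine):

```python
def candidate_splits(points):
    """Candidate split values along each axis: the midpoints between
    consecutive distinct sorted coordinate values."""
    splits = {}
    for axis, name in ((0, "x"), (1, "y")):
        values = sorted({p[axis] for p in points})  # distinct values, sorted
        splits[name] = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return splits

points = [(1, 3), (2, 2), (6, 4), (12, 3), (1, 8)]
print(candidate_splits(points))
# {'x': [1.5, 4.0, 9.0], 'y': [2.5, 3.5, 6.0]}
```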
The tree will have a depth of at most \(4,\) meaning that after at most \(3\) decisions an output class will be estimated. The question is, how do we decide which splits to use?
The Gini index is a number from \(0\) to \(\frac{1}{2},\) inclusive, which measures how mixed a population with \(2\) classes is. A population with Gini index \(0\) contains only one class.
If the two classes in a population are class \(A\) and class \(B,\) then the Gini score of the population is
\[1-\left(\frac{|A|}{|A|+|B|}\right)^2-\left(\frac{|B|}{|A|+|B|}\right)^2 = \frac{2|A||B|}{(|A|+|B|)^2}\]
So, if the population is all class \(A,\) we get \(1-1^2-0^2=0.\) If the population is all class \(B,\) we get \(1-0^2-1^2=0.\) If the population is half \(A\) and half \(B\) we get \(1-\left(\frac{1}{2}\right)^2-\left(\frac{1}{2}\right)^2 = \frac{1}{2}.\)
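Since the two-class Gini index is a one-line computation, a small Python sketch (with our own function name `gini`) reproduces these three cases:

```python
def gini(a, b):
    """Gini index of a two-class population with a points of class A and
    b points of class B: 1 - (a/n)^2 - (b/n)^2 = 2ab/n^2."""
    n = a + b
    return 0.0 if n == 0 else 2 * a * b / n ** 2

print(gini(4, 0))  # 0.0 -- all class A
print(gini(0, 4))  # 0.0 -- all class B
print(gini(2, 2))  # 0.5 -- an even mix, the maximum for two classes
```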
In the example we have a red class and a blue class. Suppose also that we can split the tree along one of the characteristics \(c_1, c_2, \dots, c_n.\) Each decision about characteristic \(c_i\) divides the data into two groups, \(G_1(i)\) and \(G_2(i),\) where \(G_1(i)\) is the set of points with characteristic \(c_i\) and \(G_2(i)\) is the set of points without it. Let \(G_1(i,r)\) and \(G_1(i,b)\) be the sets of points in \(G_1(i)\) that are red and blue, respectively, and define \(G_2(i,r)\) and \(G_2(i,b)\) analogously for \(G_2(i).\) The Gini index of characteristic \(c_i\) is
\[\left(\frac{2|G_1(i,r)||G_1(i,b)|}{(|G_1(i,r)|+|G_1(i,b)|)^2}\right)\cdot\frac{|G_1(i)|}{|G_1(i)|+|G_2(i)|}+
\left(\frac{2|G_2(i,r)||G_2(i,b)|}{(|G_2(i,r)|+|G_2(i,b)|)^2}\right)\cdot\frac{|G_2(i)|}{|G_1(i)|+|G_2(i)|}\]
In words, this formula is
\[G_i = \big(\text{Gini index of }G_1(i)\big) \cdot \big(\text{proportion of points in }G_1(i)\big) + \big(\text{Gini index of }G_2(i)\big) \cdot \big(\text{proportion of points in }G_2(i)\big).\]
The split will be along the characteristic \(c_i\) that minimizes \(G_i.\) This split ensures that the groups, divided based on whether they have characteristic \(c_i,\) are as homogeneously red and blue as possible. The Gini index of a larger group is weighted more heavily than that of a smaller one; otherwise, a group containing a single point, which is trivially homogeneous, would dominate the score.
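A direct translation of the weighted formula might look like the sketch below; `split_gini` is our own name, and the `gini` helper is repeated from the previous sketch so this one runs on its own:

```python
def gini(a, b):
    """Two-class Gini index: 2ab/(a+b)^2."""
    n = a + b
    return 0.0 if n == 0 else 2 * a * b / n ** 2

def split_gini(groups):
    """Weighted Gini index of a split. Each group is a (red, blue) count
    pair; its Gini index is weighted by the fraction of all points that
    fall in that group."""
    total = sum(r + b for r, b in groups)
    return sum(gini(r, b) * (r + b) / total for r, b in groups)
```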
Suppose the data points \((1,3), (2,2), (6, 4), (12, 3), (1,8)\) are in the following classes:
\begin{align}
& (1,3): \text{red}\\
& (2,2): \text{red}\\
& (6,4): \text{blue}\\
& (12,3): \text{red}\\
& (1,8): \text{blue}\\
\end{align}
Splitting along \(x=4\) splits the data into \(\{(1,3), (2,2), (1,8)\}\) and \(\{(6,4), (12, 3)\},\) which have classes \(\{\text{red}, \text{red}, \text{blue}\}\) and \(\{\text{blue}, \text{red}\}.\) So, the Gini score is
\begin{align}
& \left(\frac{2 \cdot 2 \cdot 1}{(2 + 1)^2}\right)\cdot\frac{3}{5} + \left(\frac{2 \cdot 1 \cdot 1}{(1+1)^2}\right)\cdot \frac{2}{5} \\
& = \frac{4}{9} \cdot \frac{3}{5} + \frac{1}{2} \cdot \frac{2}{5} \\
& = \frac{7}{15} \approx 0.467
\end{align}
Splitting along \(y=6\) splits the data into \(\{(1,3), (2,2), (6,4), (12, 3)\}\) and \(\{(1,8)\},\) which have classes \(\{\text{red}, \text{red}, \text{blue}, \text{red}\}\) and \(\{\text{blue}\}.\) So, the Gini score is
\begin{align}
& \left(\frac{2 \cdot 3 \cdot 1}{(3+1)^2}\right)\cdot\frac{4}{5} + \left(\frac{2 \cdot 0 \cdot 1}{(0 + 1)^2}\right)\cdot \frac{1}{5} \\
& = \frac{3}{8} \cdot \frac{4}{5} + 0 \cdot \frac{1}{5} \\
& = \frac{3}{10} = 0.3
\end{align}
So, the split along \(y=6\) is better than the split along \(x=4\) since it has a lower Gini score. Notice that the groups appear to be more uniform.
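Both hand computations match the `split_gini` sketch from above:

```python
# Assumes gini and split_gini as defined in the earlier sketch.
print(split_gini([(2, 1), (1, 1)]))  # x = 4 split: 7/15 = 0.4666...
print(split_gini([(3, 1), (0, 1)]))  # y = 6 split: 3/10 = 0.3
```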
Among all possible splits under consideration, the split along \(y = 3.5\) splits the data into a red group and a blue group. So, the split has Gini score \(0\) and would be the split used by the algorithm.
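Putting the pieces together, a brute-force search over every candidate split (a minimal, self-contained sketch with our own function names, not a library implementation) confirms that \(y = 3.5\) is the split the algorithm would choose:

```python
def best_split(points, labels):
    """Score every candidate split with the weighted Gini index and
    return the (axis_name, split_value, score) triple with the lowest
    score. Labels are the strings "red" and "blue"."""
    best = None
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        for lo, hi in zip(values, values[1:]):
            split = (lo + hi) / 2
            sides = ([l for p, l in zip(points, labels) if p[axis] < split],
                     [l for p, l in zip(points, labels) if p[axis] >= split])
            score = sum(
                2 * g.count("red") * g.count("blue") / len(g) ** 2
                * len(g) / len(points)
                for g in sides if g
            )
            if best is None or score < best[2]:
                best = ("xy"[axis], split, score)
    return best

points = [(1, 3), (2, 2), (6, 4), (12, 3), (1, 8)]
labels = ["red", "red", "blue", "red", "blue"]
print(best_split(points, labels))  # ('y', 3.5, 0.0)
```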