K-Nearest Neighbor (KNN) and Decision Tree
1. Parametric Model Learning via Supervised Learning
Parametric model learning refers to the supervised learning approach in which the relationship between input variables and output variables is represented by a model with a fixed, finite number of parameters. The parameters of the model are learned or estimated directly from labeled training data. All machine learning algorithms we have studied thus far belong to this category.
Training Dataset
Given a training dataset consisting of input-output pairs:
$$ \left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \dots, \left(x^{(m)}, y^{(m)}\right) \right\} $$
Model Approximation
The primary goal is to identify a function $f_{\omega}$ that depends on a learning parameter $\omega$. This function is expected to approximate the true output $y$ based on the given inputs $x$ as accurately as possible:
$$f_{\omega}(x) \approx y$$
Loss Function
To quantitatively evaluate how well $f_{\omega}$ approximates the output, we define a loss function. The loss function measures the discrepancy between the predicted and actual values:
$$\ell \left(f_{\omega}\left(x^{(i)}\right), y^{(i)}\right)$$
Optimization Problem
The optimal learning parameters $\omega$ are determined by solving the following optimization problem, which aims to minimize the average loss across all training samples:
$$ \begin{aligned} \text{minimize} &\quad \frac{1}{m}\sum_{i=1}^{m}\ell\left(f_{\omega}\left(x^{(i)}\right), y^{(i)}\right) \\ \text{subject to} &\quad \omega \in \boldsymbol{\Omega} \end{aligned} $$
Prediction for New Data
Once the learning parameters $\omega$ are determined, the learned function $f_{\omega}(x)$ can be used to predict the output for previously unseen input data $x$:
$$\hat y = f_{\omega}(x) $$
Thus, parametric supervised learning provides a structured method to approximate relationships between input features and target outputs effectively.
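To make this workflow concrete, below is a minimal sketch (not part of the original pipeline) that fits a simple linear model $f_{\omega}(x) = \omega_1 x + \omega_0$ by gradient descent on the average squared loss; the data-generating numbers are arbitrary illustrations.
import numpy as np
np.random.seed(0)
m = 50
x = np.random.uniform(-5, 5, m)
y = 0.5*x + 2 + np.random.normal(0, 0.5, m)      # assumed "true" relationship plus noise
w1, w0 = 0.0, 0.0                                # learning parameters omega = (w1, w0)
lr = 0.01
for _ in range(2000):
    y_hat = w1*x + w0                            # f_omega(x)
    grad_w1 = (2/m)*np.sum((y_hat - y)*x)        # gradient of the average squared loss
    grad_w0 = (2/m)*np.sum(y_hat - y)
    w1 = w1 - lr*grad_w1
    w0 = w0 - lr*grad_w0
print(w1, w0)              # learned parameters
print(w1*1.0 + w0)         # prediction y_hat = f_omega(x) for a new input x = 1.0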
2. K-Nearest Neighbor (KNN)
2.1. Non-Parametric Methods in Machine Learning
A non-parametric method is a type of machine learning model that does not assume a fixed number of parameters or a specific functional form for the underlying relationship between inputs and outputs. These methods are flexible, data-driven, and adapt directly to the shape and characteristics of the dataset.
Among various non-parametric methods, we will primarily focus on K-Nearest Neighbor (KNN) due to its intuitive nature, simplicity, and effectiveness in demonstrating core principles of non-parametric learning.
2.2. K-Nearest Neighbor (KNN) Regression
We represent our regression model as follows:
$$y = f(x) + \varepsilon$$
where:
- $f(x)$ is the unknown function relating the input $x$ to the output $y$.
- $\varepsilon$ represents measurement errors and other discrepancies.
Given a good approximation of the function $f$, we can predict the value of $y$ at new data points $x_{\text{new}}$. One straightforward approach to achieve this is known as the nearest neighbor method, defined by:
$$\hat{y} = \text{avg}\left(y \mid x \in \mathcal{N}(x_{\text{new}})\right)$$
where $\mathcal{N}(x_{\text{new}})$ denotes the neighborhood of data points around the new input $x_{\text{new}}$. Typically, this neighborhood includes the $K$ closest points to $x_{\text{new}}$, measured using a suitable distance metric (e.g., Euclidean distance).
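Before turning to scikit-learn, the neighborhood average above can be written directly in NumPy. This is a minimal illustrative sketch; the function name and toy data are assumptions, not part of the notebook's example.
import numpy as np
def knn_regress(x_train, y_train, x_new, k = 3):
    dists = np.linalg.norm(x_train - x_new, axis = 1)   # distances to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the K closest points
    return y_train[nearest].mean()                      # average of their outputs
# tiny illustrative dataset
x_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(knn_regress(x_train, y_train, np.array([1.6]), k = 3))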
Data Generation
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
N = 100
w1 = 0.5
w0 = 2
x = np.random.normal(0, 15, N).reshape(-1,1)
y = w1*x + w0 + 5*np.random.normal(0, 1, N).reshape(-1,1)
plt.figure(figsize = (6, 4))
plt.title('Data Set')
plt.plot(x, y, '.', label = 'Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()
neighbors.KNeighborsRegressor from sklearn
In this case, we set $k=1$
from sklearn import neighbors
reg = neighbors.KNeighborsRegressor(n_neighbors = 1)
reg.fit(x, y)
Prediction of New Data
x_new = np.array([[5]])
pred = reg.predict(x_new)[0,0]
print(pred)
Plot
xp = np.linspace(-30, 30, 100).reshape(-1,1)
yp = reg.predict(xp)
plt.figure(figsize = (6, 4))
plt.title('k-Nearest Neighbor Regression')
plt.plot(x, y, '.', label = 'Original Data')
plt.plot(xp, yp, label = 'kNN')
plt.plot(x_new, pred, 'o', label = 'Prediction')
plt.plot([x_new[0,0], x_new[0,0]], [-30, pred], 'k--', alpha = 0.5)
plt.plot([-40, x_new[0,0]], [pred, pred], 'k--', alpha = 0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()
It is evident that the prediction above is overly sensitive to its immediate neighbors, leading to unstable results.
To mitigate overfitting or to make predictions less sensitive to noise and local fluctuations in data, one effective approach is to increase the number of neighbors, $k$, used in K-Nearest Neighbor regression:
Smaller $ k $:
- Predictions are highly sensitive to local variations, potentially leading to overfitting.
Larger $ k $:
- Results in smoother predictions and reduces sensitivity to noise.
- Balances out local irregularities by averaging across a larger neighborhood.
- May reduce variance at the expense of slightly increasing bias.
In practice, selecting an appropriate value for $ k $ is crucial and typically determined using cross-validation.
Below, we increase the neighborhood size to $ k = 21 $ and examine how this adjustment affects the behavior of the KNN regression model.
reg = neighbors.KNeighborsRegressor(n_neighbors = 21)
reg.fit(x, y)
pred = reg.predict(x_new)[0,0]   # re-compute the prediction for x_new with k = 21
xp = np.linspace(-30, 30, 100).reshape(-1,1)
yp = reg.predict(xp)
plt.figure(figsize = (6, 4))
plt.title('k-Nearest Neighbor Regression')
plt.plot(x, y, '.', label = 'Original Data')
plt.plot(xp, yp, label = 'Regression Result')
plt.plot(x_new, pred, 'o', label = 'Prediction')
plt.plot([x_new[0,0], x_new[0,0]], [-30, pred], 'k--', alpha = 0.5)
plt.plot([-40, x_new[0,0]], [pred, pred], 'k--', alpha = 0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()
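As noted earlier, $k$ is usually tuned with cross-validation. Below is a minimal sketch using scikit-learn's cross_val_score on the data x, y generated above; the candidate values of $k$ are arbitrary.
from sklearn.model_selection import cross_val_score
scores = {}
for k in [1, 3, 5, 11, 21, 41]:
    reg_k = neighbors.KNeighborsRegressor(n_neighbors = k)
    # 5-fold cross-validated R^2 (the default scorer for regressors)
    scores[k] = cross_val_score(reg_k, x, np.ravel(y), cv = 5).mean()
print(scores)
print('best k:', max(scores, key = scores.get))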
2.3. K-Nearest Neighbor (KNN) Classification
The K-Nearest Neighbor (KNN) algorithm can also be employed to classify a data point based on the classes of its closest neighbors in the feature space.
Formally, in KNN classification, an object is classified by identifying the class most frequently occurring among its $ k $ nearest neighbors, where $ k $ is a positive integer.
If $ k = 1 $, the object is classified directly by assigning it the same class as its single closest neighbor.
For $ k > 1 $, the object is assigned to the class determined by majority voting among the $ k $ nearest neighbors.
Choosing the value of $ k $ affects the accuracy and generalization of the classifier, making its selection an important step typically done through cross-validation.
How KNN classification works:
(1) Choosing $k$
- Suppose we set $k=3$. This means we consider the 3 closest neighbors of the new data point.
(2) Identifying Neighbors:
- Calculate distances from the new data point (blue) to all other points in the dataset, selecting the three nearest neighbors.
(3) Majority Voting:
Examine the classes of the three closest neighbors. If, for example:
- Two neighbors are green.
- One neighbor is red.
Then the new data point is classified as green, as it is the majority class among the nearest neighbors.
(4) Classification Result:
- Based on this majority vote, the new point is assigned to the dominant class among its neighbors.
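The voting procedure described above can also be sketched directly in NumPy. This is a minimal illustration with assumed toy data, separate from the scikit-learn example that follows.
from collections import Counter
def knn_classify(X_train, y_train, x_new, k = 3):
    dists = np.linalg.norm(X_train - x_new, axis = 1)   # distances to all labeled points
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])                   # count the classes among them
    return votes.most_common(1)[0][0]                   # majority class
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(['green', 'green', 'green', 'red', 'red', 'red'])
print(knn_classify(X_train, y_train, np.array([1, 1]), k = 3))   # expected: 'green'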
Data Generation
m = 1000
X = -1.5 + 3*np.random.uniform(size = (m,2))
y = np.zeros([m,1])
for i in range(m):
    if np.linalg.norm(X[i,:], 2) <= 1:
        y[i] = 1
C1 = np.where(y == 1)[0]
C0 = np.where(y == 0)[0]
theta = np.linspace(0, 2*np.pi, 100)
plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor = 'k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()
neighbors.KNeighborsClassifier from sklearn
In this case, we set $k=1$
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(n_neighbors = 1)
clf.fit(X, np.ravel(y))
Prediction for unseen data
X_new = np.array([1,1]).reshape(1,-1)
result = clf.predict(X_new)[0]
print(result)
Plot
res = 0.01
[X1gr, X2gr] = np.meshgrid(np.arange(-1.5, 1.5, res), np.arange(-1.5, 1.5, res))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
inC1 = clf.predict(Xp).reshape(-1,1)
inCircle = np.where(inC1 == 1)[0]
plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", alpha = 0.5, markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor='k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.plot(Xp[inCircle][:,0], Xp[inCircle][:,1], 's', alpha = 0.5, color = 'r', markersize = 1)
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()
The above example of KNN classification illustrates the method clearly. However, let's now consider what happens when outliers are present, especially with a small value of $k$:
Data Generation with Outliers
m = 1000
X = -1.5 + 3*np.random.uniform(size = (m,2))
y = np.zeros([m,1])
for i in range(m):
    if np.linalg.norm(X[i,:], 2) <= 1:
        if np.random.uniform() < 0.05:
            y[i] = 0
        else:
            y[i] = 1
    else:
        if np.random.uniform() < 0.05:
            y[i] = 1
        else:
            y[i] = 0
C1 = np.where(y == 1)[0]
C0 = np.where(y == 0)[0]
theta = np.linspace(0, 2*np.pi, 100)
plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor = 'k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()
When $k = 1$
clf = neighbors.KNeighborsClassifier(n_neighbors = 1)
clf.fit(X, np.ravel(y))
res = 0.01
[X1gr, X2gr] = np.meshgrid(np.arange(-1.5, 1.5, res), np.arange(-1.5, 1.5, res))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
inC1 = clf.predict(Xp).reshape(-1,1)
inCircle = np.where(inC1 == 1)[0]
plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", alpha = 0.5, markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor='k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.plot(Xp[inCircle][:,0], Xp[inCircle][:,1], 's', alpha = 0.5, color = 'r', markersize = 1)
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()
When $k = 11$
clf = neighbors.KNeighborsClassifier(n_neighbors = 11)
clf.fit(X, np.ravel(y))
res = 0.01
[X1gr, X2gr] = np.meshgrid(np.arange(-1.5, 1.5, res), np.arange(-1.5, 1.5, res))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
inC1 = clf.predict(Xp).reshape(-1,1)
inCircle = np.where(inC1 == 1)[0]
plt.figure(figsize = (6, 6))
plt.plot(Xp[inCircle][:,0], Xp[inCircle][:,1], 's', alpha = 0.5, color = 'r', markersize = 1)
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", alpha = 0.5, markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor='k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
# plt.legend(fontsize = 12)
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()
Important Lesson:
- The classification result strongly depends on the choice of $k$. Different values of $k$ can yield different classifications.
- A small $k$ leads to sensitive classifications (possible overfitting), while a larger $k$ gives smoother, more generalized decisions.
2.4. Parametric vs Non-Parametric Methods
At first glance, non-parametric methods such as K-Nearest Neighbor (KNN) might appear easier to understand and implement compared to parametric methods, particularly during the training stage. This simplicity arises because non-parametric models directly use the training data without requiring complex parameter optimization or learning algorithms.
However, during the evaluation or prediction stage, the situation changes significantly:
(1) Parametric Methods:
- Once trained, the model is represented by a fixed number of parameters.
- Predictions on new, unseen data are computed very quickly (near real-time), requiring minimal computational resources.
(2) Non-Parametric Methods (e.g., KNN):
- Lack an explicit parametric model and therefore must evaluate the entire dataset (or a substantial portion of it) each time new data is presented.
- Predicting the output for new data requires computing distances between the new point and all previously stored data points.
- Computational cost grows significantly, becoming a substantial burden with large datasets.
Therefore, while non-parametric methods offer simplicity and flexibility during training, parametric methods often excel in efficiency and scalability when making predictions, especially in large-scale or real-time applications.
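As a rough illustration of this trade-off (not a rigorous benchmark), the sketch below compares prediction time for a fitted linear model and a KNN regressor on an assumed synthetic dataset; the sizes are arbitrary.
import time
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
rng = np.random.default_rng(0)
X_big = rng.normal(size = (50000, 5))
y_big = X_big @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size = 50000)
X_query = rng.normal(size = (1000, 5))
lin = LinearRegression().fit(X_big, y_big)                    # parametric: a few coefficients
knn = KNeighborsRegressor(n_neighbors = 5).fit(X_big, y_big)  # non-parametric: stores the data
for name, model in [('linear (parametric)', lin), ('KNN (non-parametric)', knn)]:
    t0 = time.perf_counter()
    model.predict(X_query)
    print(name, 'prediction time: %.4f s' % (time.perf_counter() - t0))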
3. Decision Tree
A decision tree is a powerful and intuitive machine learning algorithm used for both classification and regression tasks. It models decisions in a tree-like structure, where data is split based on certain conditions to predict outcomes.
Advantages of Decision Trees
Simple to Understand and Interpret: Their tree structure mimics human decision-making processes
Feature Importance Identification: Decision trees naturally highlight the most influential features
Source: Artificial Intelligence at MIT by Prof. Patrick Henry Winston
from IPython.display import YouTubeVideo
YouTubeVideo('SXBG3RGr_Rc', width = "560", height = "315")
3.1. Decision Tree Algorithm
Suppose we have the following dataset with 4 features as inputs and the corresponding output values. Each row represents an observation where:
- Feature 1 to Feature 4 are input variables.
- The Output is the target variable, which can be either "Good" or "Bad".
This dataset can be used for building a decision tree model to predict the output based on the given features. We will demonstrate the process of constructing a decision tree through sequential steps, identifying the most effective features for data splitting at each stage.
(1) Step 1 (Feature Test): Homogeneous Set in the First Level
To begin, we split the dataset based on Feature 1.
Classification analysis:
- The first subset contains a mixture of two 'Good' and two 'Bad' instances.
- The second subset contains only three 'Bad' instances.
- The third subset contains only one 'Good' instance.
Observing the number of samples placed into homogeneous subsets (referred to here as the number of splits) after repeating the classification analysis with the remaining features:
- Feature 1: 4 splits (2nd and 3rd subsets)
- Feature 2: 3 splits (1st subset)
- Feature 3: 2 splits (2nd subset)
- Feature 4: 0 splits (none)
Since Feature 1 achieves the maximum number of splits, it is the most effective feature for the initial split. This ensures the maximum separation of data points, improving the tree's effectiveness in distinguishing classes at the first level.
(2) Step 2: Homogeneous Set in the Second Level
In the next step, we examine the subset where Feature 1 = ? (unknown), which still contains both 'Good' and 'Bad' instances. This indicates that further splitting is required to improve classification accuracy for this group.
The number of splits achieved by the remaining features within this subset is as follows:
- Feature 2: 4 splits
- Feature 3: 2 splits
- Feature 4: 0 splits
Since Feature 2 achieves the maximum number of splits in this subset, it is selected for the second-level split. As there are no remaining data points requiring further classification, the entire dataset can be successfully classified with these two consecutive steps.
(3) Final Decision Tree
Combining the results from both steps, the final decision tree can be summarized as follows:
- Step 1: Split the data based on Feature 1.
- Step 2: For the subset where Feature 1 = ?, further split the data based on Feature 2.
3.2. Disorder and Quality of Test
In large datasets, achieving completely homogeneous sets - subsets in which all data points belong to the same class - is often unrealistic. This is because larger datasets inherently exhibit greater variability, increased class overlap, and more significant noise, making complete separation challenging.
Thus, it becomes useful to define a concept that quantifies the degree of homogeneity or inhomogeneity of sets rather than relying solely on absolute purity.
To assess how disordered (or impure) a dataset is, we introduce disorder measures (also known as impurity measures). These measures quantify the degree of mixing among different classes within a given subset, providing a practical metric to evaluate and guide splitting decisions in decision trees.
Definition: Disorder of Single Set (Binary Case)
$$\begin{align*} D & = -p_G \log_2{p_G} - p_B \log_2{p_B} \\\\ & = -\frac{G}{T} \log_2 \frac{G}{T} - \frac{B}{T} \log_2 \frac{B}{T} \qquad \text{where }\; G: \text{Good}, \; B: \text{Bad},\; T: \text{Total}\\\\ & = -x \log_2 x - (1-x) \log_2 (1-x) \qquad \text{(binary entropy, with } x = G/T)\\ \end{align*} $$
where
- $p_G = \frac{G}{T}$: the proportion of 'Good' instances
- $p_B = \frac{B}{T}$: the proportion of 'Bad' instances
- $T = G + B$: the total number of instances in the set
This definition of disorder may seem unexpected at first. However:
- (Maximum disorder) If half of the total is 'Good', the disorder measure $D$ equals 1:
$$\text{When }\; \frac{G}{T} = \frac{1}{2} \quad \implies \quad D = -\frac{1}{2} \log_2 \frac{1}{2} -\frac{1}{2} \log_2 \frac{1}{2} = 1$$
- (Minimum disorder) If all of them are 'Good', the disorder measure $D$ equals 0:
$$\text{When }\;\frac{G}{T} = 1 \quad \implies \quad D = -1 \log_2 1 - 0 \log_2 0 = 0 \qquad \text{(using the convention } 0 \log_2 0 = 0)$$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0.01, 0.99, 100)
y = -x*np.log2(x) - (1-x)*np.log2(1-x)
plt.figure(figsize = (6, 4))
plt.plot(x, y, linewidth = 3)
plt.xlabel(r'$x$')
plt.grid(alpha = 0.3)
plt.show()
When two classes in the binary case are present in equal proportions (half and half), the disorder is maximized. This aligns with the disorder measure where $D = 1$, indicating the highest level of uncertainty or disorder.
Note: Entropy (Information Gain)
The disorder measure is actually derived from entropy. Entropy measures the level of uncertainty or disorder in a set.
$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i$$
Where:
$ C = $ Number of classes
$ p_i = $ Proportion of class $ i $ in the node
Entropy $ = 0 \rightarrow$ Pure node (homogeneous set)
Entropy $=$ High $\rightarrow$ Highly disordered node (mixed classes)
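A small helper that evaluates this entropy-based disorder from class counts makes the boundary cases explicit. This is an illustrative sketch that assumes the convention $0 \log_2 0 = 0$.
def disorder(counts):
    counts = np.asarray(counts, dtype = float)
    p = counts/counts.sum()          # class proportions p_i
    p = p[p > 0]                     # drop empty classes: 0*log2(0) -> 0
    return float(-np.sum(p*np.log2(p)))
print(disorder([2, 2]))   # half Good, half Bad -> 1.0 (maximum disorder)
print(disorder([4, 0]))   # all Good            -> 0.0 (minimum disorder)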
Quality of test
Now that we have defined the disorder of a single set, we need a way to evaluate an entire split. Since a test typically divides the dataset into multiple subsets, each containing a different number of samples, the quality of the test must account for this distribution.
Formally, the quality of a test is defined as follows:
$$ Q(\text{test}) = \sum_{i} D(\text{set}_i) \times \frac{n_i}{N} = \sum_{i} D(\text{set}_i) \times \frac{\text{# of samples in set}_i}{\text{# of samples in all sets}} $$
where:
- $ Q(\text{test}) $ = The overall disorder after the split
- $ D(\text{set}_i) $ = The disorder measure for subset $ i $
- $ n_i $ = Number of samples in subset $ i $
- $ N $ = Total number of samples across all subsets
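Reusing the disorder() helper sketched above, the quality of a test can be computed as the sample-weighted sum of subset disorders. This is a minimal sketch; the quick check reproduces the Feature 1 split ($Q = 0.5$) from the example below.
def quality_of_test(subsets):
    # subsets: one list of class labels per subset produced by the test
    N = sum(len(s) for s in subsets)
    Q = 0.0
    for s in subsets:
        counts = [s.count(c) for c in set(s)]   # class counts in this subset
        Q += disorder(counts)*len(s)/N          # D(set_i) * n_i / N
    return Q
# Feature 1 split: {G, G, B, B}, {B, B, B}, {G}
print(quality_of_test([['G','G','B','B'], ['B','B','B'], ['G']]))   # 0.5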
3.2.1. Example
Let's revisit the same problem to build a decision tree using the concepts of disorder and quality of test.
(1) Step 1 (Feature Test): Homogeneous Set in the First Level
To begin, we split the dataset based on Feature 1.
Classification analysis:
- The first subset: $D = 1$
- The second subset: $D= 0$
- The third subset: $D=0$
Quality of test
$$Q = 1 \times \frac{4}{8} + 0 \times \frac{3}{8} + 0 \times \frac{1}{8} = \frac{1}{2}$$
Repeating the same computation for the remaining features gives the following quality values:
- Feature 1:
$$Q = 1 \times \frac{4}{8} + 0 \times \frac{3}{8} + 0 \times \frac{1}{8} = 0.5 $$
- Feature 2:
$$Q = 0 \times \frac{3}{8} + \left (-\frac{3}{5}\log_2 \frac{3}{5} -\frac{2}{5}\log_2 \frac{2}{5} \right) \times \frac{5}{8} \approx 0.6$$
- Feature 3:
$$Q = 0.92 \times \frac{3}{8} + 0 \times \frac{2}{8} + 0.92 \times \frac{3}{8} \approx 0.69$$
- Feature 4:
$$Q = 0.92 \times \frac{3}{8} + 1 \times \frac{2}{8} + 0.92 \times \frac{3}{8} \approx 0.94$$
Since Feature 1 achieves the best quality of splits (i.e., the lowest $Q$ value), it is identified as the most effective feature for the initial split. Selecting Feature 1 ensures the greatest reduction in disorder, maximizing the separation of data points and enhancing the decision tree's ability to distinguish between classes at the first level.
(2) Step 2: Homogeneous Set in the Second Level
In the next step, we examine the subset where Feature 1 = ? (unknown), which still contains both 'Good' and 'Bad' instances. This indicates that further splitting is required to improve classification accuracy for this group.
The quality of splits achieved by the remaining features within this subset is as follows:
- Feature 2: $Q = 0$
- Feature 3: $Q = 0.5$
- Feature 4: $Q = 1$
Since Feature 2 achieves the best quality of test in this subset, it is selected for the second-level split. As there are no remaining data points requiring further classification, the entire dataset can be successfully classified with these two consecutive steps.
(3) Final Decision Tree
Combining the results from both steps, the final decision tree can be summarized as follows:
- Step 1: Split the data based on Feature 1.
- Step 2: For the subset where Feature 1 = ?, further split the data based on Feature 2.
3.3. Interpretation of Decision Trees
We have completed our study of how decision trees are constructed. Now, we will focus on interpreting the structure of the resulting decision tree.
Decision Tree: Feature Importance, and Node Location
In decision trees, feature importance and the location of a node within the tree are closely related. The positioning of a feature in the tree - particularly near the root or deeper in the branches - can provide insights into its significance.
(1) Features Near the Root Node
- Features appearing closer to the root tend to have higher importance.
- Splitting near the root affects a larger proportion of data points, making these splits more impactful in reducing impurity.
- Important features are often selected early in the tree's construction because they best separate the data.
(2) Features in Deeper Nodes
- Features appearing deeper in the tree are typically less important.
- These splits only apply to smaller data subsets, thus having a more limited impact on overall impurity reduction.
Decision Tree and the Concept of a Game
The structure and logic behind decision trees in machine learning closely resemble the thought process in certain games that involve strategic questioning, deduction, or decision-making. One notable example is the classic game "Twenty Questions."
A decision tree mimics the decision-making process used in many games by following a structured path of questions or conditions to arrive at a conclusion. Just like in games that require logical steps, a decision tree iteratively narrows down possible outcomes by making choices at each step.
How Decision Trees Mirror Game Strategies:
- Starting Point: Just like starting a game with broad possibilities, a decision tree begins at the root node, which represents the entire dataset.
- Step-by-Step Questioning: Each internal node in the tree represents a decision point, much like asking a question in a game to gather more information.
- Elimination of Options: Each split in the tree eliminates certain possibilities, similar to how strategic questions in games reduce potential answers.
- Final Outcome: The leaf nodes in a decision tree provide the predicted class or result, just as a player eventually identifies the correct answer in a game.
3.4. Implementation of Decision Tree
In this section, we will explore how to implement a Decision Tree in Python using the sklearn.tree module.
The key requirement is to convert categorical values into numerical values, as decision tree algorithms in Scikit-learn operate on numerical data.
Import Libraries
import numpy as np
import pandas as pd
from sklearn import tree
import matplotlib.pyplot as plt
Encode Categorical Variables
Since Scikit-learn requires numerical data, we will encode the categorical values.
data = {
'Feature 1': ['?', 'Yes', '?', 'No', '?', 'Yes', 'Yes', '?'],
'Feature 2': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes'],
'Feature 3': ['Low', 'High', 'High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
'Feature 4': ['Medium', 'Medium', 'Medium', 'High', 'Low', 'High', 'High', 'Low'],
'Output': ['Bad', 'Bad', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Bad']
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Encode categorical values into numbers
encoding = {'Yes': 1, 'No': 0, '?': -1, 'High': 2, 'Medium': 1, 'Low': 0, 'Good': 1, 'Bad': 0}
df.replace(encoding, inplace=True)
# Split data into features (X) and target (y)
X = df.drop('Output', axis=1)
y = df['Output']
Train a Decision Tree Model
model = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 3, random_state = 42)
model.fit(X, y)
Visualize the Decision Tree
Visualizing the decision tree helps in understanding its structure and decision logic.
plt.figure(figsize = (5, 7))
tree.plot_tree(model, feature_names = X.columns, class_names = ['Bad', 'Good'])
plt.show()
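As discussed in Section 3.3, features used near the root tend to carry the most importance. Scikit-learn exposes this through the fitted estimator's feature_importances_ attribute (impurity-based scores that sum to 1), shown here for the model trained above.
for name, importance in zip(X.columns, model.feature_importances_):
    print(name, round(importance, 3))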
Key Discussion Points on Decision Tree Implementation
When implementing a decision tree using Scikit-learn's DecisionTreeClassifier, several important aspects should be considered:
(a) Comparison with the manually derived decision tree
- The tree produced by Scikit-learn can be compared against the tree constructed by hand in the earlier steps.
(b) Conversion of categorical values into numerical values
- Decision trees in Scikit-learn require numerical input, meaning categorical variables must be encoded appropriately.
(c) It is important to recognize that the results obtained manually and those produced by Scikit-learn's DecisionTreeClassifier may differ.
- The DecisionTreeClassifier in Scikit-learn performs binary splits by default.
- Each node splits the data into two branches, even when the dataset could be divided into multiple subsets.
- In contrast, manual calculations often allow for multi-way splits (e.g., splitting data into three groups based on Feature 1).
3.5. Nonlinear Classification
This section will examine the use of decision trees for nonlinear classification.
Data Generation
X1 = np.array([[-1.1,0],[-0.3,0.1],[-0.9,1],[0.8,0.4],[0.4,0.9],[0.3,-0.6],
[-0.5,0.3],[-0.8,0.6],[-0.5,-0.5]])
X0 = np.array([[-1,-1.3], [-1.6,2.2],[0.9,-0.7],[1.6,0.5],[1.8,-1.1],[1.6,1.6],
[-1.6,-1.7],[-1.4,1.8],[1.6,-0.9],[0,-1.6],[0.3,1.7],[-1.6,0],[-2.1,0.2]])
X1 = np.asmatrix(X1)
X0 = np.asmatrix(X0)
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.show()
N = X1.shape[0]
M = X0.shape[0]
X = np.asarray(np.vstack([X1,X0]))
y = np.asarray(np.vstack([np.ones([N,1]), np.zeros([M,1])]))
tree.DecisionTreeClassifier
clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state = 0)
clf.fit(X,y)
Prediction
clf.predict([[0,1]])
Plot
# to plot
[X1gr, X2gr] = np.meshgrid(np.arange(-3, 3, 0.1), np.arange(-3, 3, 0.1))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
q = clf.predict(Xp)
q = np.asmatrix(q).reshape(-1,1)
C1 = np.where(q == 1)[0]
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], 'ro', label = 'C1')
plt.plot(X0[:,0], X0[:,1], 'bo', label = 'C0')
plt.plot(Xp[C1,0], Xp[C1,1], 'gs', markersize = 8, alpha = 0.1, label = 'Decision Tree')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.show()
From this example, several key points are worth discussing:
(1) Decision trees can easily handle nonlinear classification:
- Decision trees are well-suited for nonlinear classification because they partition the data space into smaller, manageable regions. By applying multiple sequential splits, decision trees can construct flexible decision boundaries that adapt to complex patterns in the data, making them highly effective even when the underlying relationship is nonlinear.
(2) Decision boundaries are parallel to the axes:
- Decision trees create axis-aligned decision boundaries because each split considers only one feature at a time. As a result, the decision boundaries are typically stepwise and rectangular, rather than smooth or curved.
(3) Increased complexity requires deeper trees:
- For data with strong nonlinear patterns or complex structures, the decision tree may require more splits to capture these patterns effectively. This leads to a deeper tree with more nodes and branches. While deeper trees improve model flexibility, they can also increase the risk of overfitting. Pruning or limiting tree depth can help mitigate this.
3.6. Multiclass Classification
A decision tree is also a flexible algorithm capable of efficiently handling multiclass classification problems, where the target variable can take on three or more distinct categories.
To illustrate how decision trees handle multiclass classification, consider the following example involving a dataset with three distinct classes.
from sklearn import tree
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## generate three simulated clusters
mu1 = np.array([1, 7])
mu2 = np.array([3, 4])
mu3 = np.array([6, 5])
SIGMA1 = 0.8*np.array([[1, 1.5],
[1.5, 3]])
SIGMA2 = 0.5*np.array([[2, 0],
[0, 2]])
SIGMA3 = 0.5*np.array([[1, -1],
[-1, 2]])
np.random.seed(10)
X1 = np.random.multivariate_normal(mu1, SIGMA1, 100)
X2 = np.random.multivariate_normal(mu2, SIGMA2, 100)
X3 = np.random.multivariate_normal(mu3, SIGMA3, 100)
y1 = 1*np.ones([100,1])
y2 = 2*np.ones([100,1])
y3 = 3*np.ones([100,1])
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], '.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], '.', label = 'C2')
plt.plot(X3[:,0], X3[:,1], '.', label = 'C3')
plt.xlabel('$X_1$')
plt.ylabel('$X_2$')
plt.legend(loc = 1)
plt.grid(alpha = 0.3)
plt.axis([-2, 10, 0, 12])
plt.show()
Decision tree with a maximum depth of 2
X = np.vstack([X1, X2, X3])
y = np.vstack([y1, y2, y3])
clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 2, random_state = 0)
clf.fit(X,y)
res = 0.3
[X1gr, X2gr] = np.meshgrid(np.arange(-2, 10, res), np.arange(0, 12, res))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
q = clf.predict(Xp)
q = np.asmatrix(q).reshape(-1,1)
C1 = np.where(q == 1)[0]
C2 = np.where(q == 2)[0]
C3 = np.where(q == 3)[0]
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], '.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], '.', label = 'C2')
plt.plot(X3[:,0], X3[:,1], '.', label = 'C3')
plt.plot(Xp[C1,0], Xp[C1,1], 's', color = 'blue', markersize = 8, alpha = 0.1)
plt.plot(Xp[C2,0], Xp[C2,1], 's', color = 'orange', markersize = 8, alpha = 0.1)
plt.plot(Xp[C3,0], Xp[C3,1], 's', color = 'green', markersize = 8, alpha = 0.1)
plt.xlabel('$X_1$')
plt.ylabel('$X_2$')
plt.legend(loc = 1)
plt.grid(alpha = 0.3)
plt.axis([-2, 10, 0, 12])
plt.show()
Decision tree with a maximum depth of 3
clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 3, random_state = 0)
clf.fit(X,y)
q = clf.predict(Xp)
q = np.asmatrix(q).reshape(-1,1)
C1 = np.where(q == 1)[0]
C2 = np.where(q == 2)[0]
C3 = np.where(q == 3)[0]
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], '.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], '.', label = 'C2')
plt.plot(X3[:,0], X3[:,1], '.', label = 'C3')
plt.plot(Xp[C1,0], Xp[C1,1], 's', color = 'blue', markersize = 8, alpha = 0.1)
plt.plot(Xp[C2,0], Xp[C2,1], 's', color = 'orange', markersize = 8, alpha = 0.1)
plt.plot(Xp[C3,0], Xp[C3,1], 's', color = 'green', markersize = 8, alpha = 0.1)
plt.xlabel('$X_1$')
plt.ylabel('$X_2$')
plt.legend(loc = 1)
plt.axis([-2, 10, 0, 12])
plt.show()
Decision tree with a maximum depth of 4
clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state = 0)
clf.fit(X,y)
q = clf.predict(Xp)
q = np.asmatrix(q).reshape(-1,1)
C1 = np.where(q == 1)[0]
C2 = np.where(q == 2)[0]
C3 = np.where(q == 3)[0]
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], '.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], '.', label = 'C2')
plt.plot(X3[:,0], X3[:,1], '.', label = 'C3')
plt.plot(Xp[C1,0], Xp[C1,1], 's', color = 'blue', markersize = 8, alpha = 0.1)
plt.plot(Xp[C2,0], Xp[C2,1], 's', color = 'orange', markersize = 8, alpha = 0.1)
plt.plot(Xp[C3,0], Xp[C3,1], 's', color = 'green', markersize = 8, alpha = 0.1)
plt.xlabel('$X_1$')
plt.ylabel('$X_2$')
plt.legend(loc = 1)
plt.axis([-2, 10, 0, 12])
plt.show()
As a decision tree becomes deeper, it can make more detailed classifications. While this improves the model's ability to capture complex patterns, it also increases the risk of overfitting. To avoid this, limiting the tree's depth is often used to improve its generalization.
3.7. Random Forest
Ensemble methods are machine learning techniques that combine multiple models to improve predictive performance. Among ensemble methods, Random Forest is one of the most widely used and effective approaches, particularly for classification and regression tasks.
Random Forest: A Powerful Ensemble Learning Algorithm
Random Forest is a widely used ensemble learning algorithm that builds multiple decision trees and combines their outputs to improve predictive performance. It is particularly effective for classification, regression, and feature importance analysis.
How Random Forest Works
Random Forest is based on the Bagging (Bootstrap Aggregating) technique, where multiple models are trained independently, and their predictions are aggregated for improved accuracy and robustness.
Step 1: Bootstrap Sampling
- The algorithm randomly selects multiple subsets of data from the original dataset (sampling with replacement).
- Each subset is used to train an individual decision tree.
Step 2: Random Feature Selection
- For each split in a tree, Random Forest considers only a random subset of features rather than all features.
- This feature randomness introduces diversity among the trees, enhancing the model's robustness.
Step 3: Tree Growth
- Each tree is grown fully or until a predefined stopping criterion (e.g., max_depth) is reached.
- Each tree may vary in structure due to random sampling and feature selection.
Step 4: Aggregation of Predictions
- For classification, Random Forest uses majority voting - the class predicted by the most trees is the final output.
- For regression, Random Forest averages the predictions from all trees.
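A minimal from-scratch sketch of the bagging idea (Steps 1, 3, and 4) is shown below, using the multiclass arrays X and y built in Section 3.6; it omits the per-split random feature selection of Step 2, which a true Random Forest adds on top.
np.random.seed(0)
n_trees = 25
forest = []
for _ in range(n_trees):
    idx = np.random.choice(len(X), size = len(X), replace = True)   # Step 1: bootstrap sample
    t = tree.DecisionTreeClassifier(max_depth = 3)                  # Step 3: grow a tree on it
    forest.append(t.fit(X[idx], np.ravel(y)[idx]))
# Step 4: aggregate by majority vote across the individual trees
all_preds = np.stack([t.predict(X) for t in forest]).astype(int)    # shape (n_trees, n_samples)
voted = np.array([np.bincount(col).argmax() for col in all_preds.T])
print('training accuracy of the voted ensemble:', np.mean(voted == np.ravel(y)))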
ensemble.RandomForestClassifier
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(n_estimators = 100, max_depth = 3, random_state = 0)
clf.fit(X,np.ravel(y))
res = 0.3
[X1gr, X2gr] = np.meshgrid(np.arange(-2, 10, res), np.arange(0, 12, res))
Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
q = clf.predict(Xp)
q = np.asmatrix(q).reshape(-1,1)
C1 = np.where(q == 1)[0]
C2 = np.where(q == 2)[0]
C3 = np.where(q == 3)[0]
plt.figure(figsize = (6, 4))
plt.plot(X1[:,0], X1[:,1], '.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], '.', label = 'C2')
plt.plot(X3[:,0], X3[:,1], '.', label = 'C3')
plt.plot(Xp[C1,0], Xp[C1,1], 's', color = 'blue', markersize = 8, alpha = 0.1)
plt.plot(Xp[C2,0], Xp[C2,1], 's', color = 'orange', markersize = 8, alpha = 0.1)
plt.plot(Xp[C3,0], Xp[C3,1], 's', color = 'green', markersize = 8, alpha = 0.1)
plt.xlabel('$X_1$')
plt.ylabel('$X_2$')
plt.legend(loc = 1)
plt.grid(alpha = 0.3)
plt.axis([-2, 10, 0, 12])
plt.show()