Machine Learning #6 — Tree-Based Models, Entropy, Gini Index, Decision Trees for Classification and Regression

Göker Güner
7 min read · May 31, 2024


Hello,
If you are not interested in the Turkish version of this article, you can skip this paragraph.

This article is the sixth in the Machine Learning series I am writing to reinforce and share what I have learned. For the fifth article in the series: link

For the Turkish version of this article: link

Hello! In the sixth part of our Machine Learning series, we take a deeper theoretical look at tree-based models, whose use we have already touched on briefly in earlier parts.

Tree-based models are powerful and flexible methods that solve classification and regression problems by recursively splitting a data set into branches.

In this part:

  • I will explain basic concepts such as entropy and Gini index, and detail how decision trees are created and used.
  • I will also show with examples how to apply decision trees for classification and regression problems.

Decision Trees for Classification

Given a labeled dataset, a classification tree learns a series of if-else questions about individual features in order to infer the labels. Unlike linear models, trees can capture non-linear relationships between features and labels. Additionally, trees do not require the features to be on the same scale, for example through standardization.

When we say classification tree, we actually mean a decision tree trained for the classification task.

For this study we use the Wisconsin Breast Cancer Dataset, which appears both in this article series and in many beginner-level resources on Artificial Intelligence. We can now import it into our project directly from the UC Irvine ML Repository, without having to download it as a .csv file.

You can see the sample usage by clicking on this link and pressing the “Import In Python” button.
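The snippet that button generates looks roughly like this (assuming the ucimlrepo package is installed; the dataset id, taken here to be 17, is the one shown on the dataset’s page):

```python
from ucimlrepo import fetch_ucirepo

# Fetch the Breast Cancer Wisconsin (Diagnostic) dataset straight from the
# UCI ML Repository; the id is the one shown by the "Import In Python" button
# (assumed to be 17 here).
breast_cancer = fetch_ucirepo(id=17)

X = breast_cancer.data.features   # pandas DataFrame of the 30 numeric features
y = breast_cancer.data.targets    # pandas DataFrame with the diagnosis label (M/B)
```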

As a reminder, using a decision tree can be as simple as this:
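A minimal sketch, assuming X and y were loaded as above (max_depth=2 is just an illustrative choice):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# X, y come from the import step above; ravel() turns the single-column
# target DataFrame into the 1-D array scikit-learn expects.
y_arr = y.values.ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y_arr, test_size=0.2, stratify=y_arr, random_state=1
)

# max_depth limits how many consecutive if-else questions the tree may ask.
dt = DecisionTreeClassifier(max_depth=2, random_state=1)
dt.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, dt.predict(X_test)))
```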

Why Decision Trees?

  • If there are non-linear relationships in your data and you want to capture these relationships,
  • If model interpretability is important to you and you want to visualize the decision-making process,
  • If you are working with categorical data and do not want to convert this data to numerical data,
  • If you want to highlight more important features by selecting features,
  • If you are working with multiclass classification problems, decision trees may have an advantage over logistic regression.

We will see the difference when we compare the two models numerically, but to better understand how this works, let’s first examine it visually. We can visualize the comparison with the help of a plotting function like the one sketched below.
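A hypothetical helper along these lines can do the job: it trains nothing itself, it simply shades the regions that an already-fitted model predicts over a grid of two features.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_regions_2d(model, X2, y, title):
    """Shade the decision regions of a model fitted on exactly two features."""
    # Map (possibly string) class labels to integers so they can be coloured.
    classes = np.unique(y)
    x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
    y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    # Predict a class for every grid point and shade the resulting regions.
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = np.searchsorted(classes, Z).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
    plt.scatter(X2[:, 0], X2[:, 1], c=np.searchsorted(classes, y),
                cmap="coolwarm", edgecolor="k", s=20)
    plt.title(title)
    plt.show()
```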

Now we build our Logistic Regression model and draw our comparison chart.
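A sketch of that comparison, under the same assumptions as above (the two features used here, the first two columns of the dataset, are an arbitrary illustrative choice):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Use only two (arbitrarily chosen) features so the decision regions can be drawn.
X2_train = X_train.iloc[:, :2].values
X2_test = X_test.iloc[:, :2].values

logreg = LogisticRegression(max_iter=10_000, random_state=1).fit(X2_train, y_train)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X2_train, y_train)

# A straight line vs. axis-aligned rectangles: that is the visual difference
# between the two models' decision boundaries.
for model, name in [(logreg, "Logistic Regression"), (tree, "Decision Tree")]:
    print(f"{name} test accuracy: {model.score(X2_test, y_test):.3f}")
    plot_decision_regions_2d(model, X2_test, y_test, name)
```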

Since our data set is quite small, the number of misclassified samples is also quite small. However, note that decision trees draw better boundaries when it comes to capturing non-linear relationships.

How Do Decision Trees Learn to Classify?

Let’s start by defining some terms. A decision tree is a data structure consisting of a hierarchy of individual units called nodes. A node is a point that contains a question or prediction.

The root is the node where the decision tree begins to grow. It has no parent node and contains a question that leads to two child nodes via two branches. An internal node is a node that has a parent. It also contains a question that leads to two child nodes. Finally, a node that has no child nodes is called a leaf. A leaf has one parent node and no questions. This is where a prediction is made. Recall that when a classification tree is trained on a labeled dataset, the tree learns patterns from the features in a way that produces the purest leaves. In other words, the tree is trained so that one class label dominates each leaf.

Sample representation of a Decision Tree.

Visually, we can see the parts of a decision tree this way. Root, branch, internal node and leaf form the parts of a decision tree.

So, what kind of mathematics does our decision tree follow when creating these leaves?

It is important to note here that when a regression tree is trained on a dataset, the purity of a node is measured using the mean squared error of the targets at that node.

This means that the regression tree tries to find splits that produce leaves where the target values in each leaf are as close as possible to the mean target value of that leaf.

Mean Square Error (MSE) for a Node
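In standard notation (matching the definitions below), the MSE of a node is:

\mathrm{MSE}(\text{node}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2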

In this equation:

  • N is the number of instances in the node.
  • y_i is the actual target value of the i-th sample.
  • ȳ is the average target value of all samples in the node.

Average of Target Values ​​in a Node
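In the same notation, the node average is simply:

\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i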

As the regression tree is trained, it tries to reduce the MSE at each split. In other words, at each split it aims to reduce the total MSE by making the target values in the leaves as close as possible to the average target value of their leaf.

Combining these concepts, at each split the regression tree tries to minimize the weighted sum of the MSEs of the left and right child nodes, which is equivalent to maximizing the total impurity reduction.

Total Impurity Reduction
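Written out with the quantities defined below, this is:

\Delta \mathrm{MSE} = \mathrm{MSE}_{\text{parent}} - \left( \frac{N_{\text{left}}}{N_{\text{total}}} \, \mathrm{MSE}_{\text{left}} + \frac{N_{\text{right}}}{N_{\text{total}}} \, \mathrm{MSE}_{\text{right}} \right)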

In this equation:

  • MSE_parent is the MSE of the parent node.
  • MSE_left and MSE_right are the MSEs of the left and right child nodes.
  • N_total is the total number of instances in the parent node.
  • N_left and N_right are the numbers of samples in the left and right child nodes.

The regression tree algorithm tries to maximize this impurity reduction (ΔMSE) at each split. More generally, we use information gain to measure this reduction in impurity.

What is Information Gain?

Information Gain measures the degree to which uncertainty or impurity is reduced by splitting a data set based on a particular feature. This measurement is used when creating decision trees and choosing which feature to use at each division step. The goal is to split the dataset in a way that maximizes information gain.

Entropy and Information Gain

Information Gain is based on the concept of entropy. Entropy is a measure of the uncertainty or randomness in a data set. High entropy indicates that the dataset is more complex or random, while low entropy indicates that it is more orderly.

Entropy Calculation

Entropy is used to measure uncertainty in a data set. The entropy equation is as follows:
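With the base-2 logarithm that is conventional for decision trees, it reads:

H(D) = -\sum_{i=1}^{k} p_i \log_2 p_i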

In this equation:

  • H(D) is the entropy of the dataset D.
  • k is the number of classes.
  • p_i is the probability (proportion) of the i-th class in D.

Information Gain Calculation

Information gain is the amount of reduction in entropy when we split a data set based on a particular property:
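In standard notation, consistent with the definitions below, this is:

IG(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \, H(D_v)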

In this equation:

  • IG(D, A) is the information gain of feature A for dataset D.
  • H(D) is the entropy of the dataset before splitting.
  • Values(A) is the set of different values that feature A can take.
  • D_v is the subset of the dataset in which feature A takes the value v.
  • |D| is the total number of samples in the dataset.
  • |D_v| is the number of samples in the subset with the value v.
  • H(D_v) is the entropy of that subset.

Summary:

  • Entropy measures the uncertainty in a data set.
  • Information Gain measures the reduction in entropy when we split a dataset based on a particular feature.
  • Splits with high Information Gain make the dataset purer (less ambiguous) and are therefore preferred when creating decision trees.
  • Information Gain is used to choose the best split at each node of decision trees, thus creating more accurate and effective models.

Either the Gini index or entropy can be used as the impurity measure behind this information gain criterion. Which criterion should be used may vary from project to project, depending on the size of the data set or the hyperparameters.
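For reference, the Gini index of a node measures how often a randomly chosen sample would be mislabeled if it were labeled according to the class distribution in that node:

G(D) = 1 - \sum_{i=1}^{k} p_i^2

In scikit-learn, this choice is exposed through the criterion hyperparameter of DecisionTreeClassifier: 'gini' is the default and 'entropy' is the usual alternative.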

Decision Trees for Regression

Recall that in regression the target variable is continuous; in other words, the output of your model is a real number. Here, too, using a decision tree can enable our model to fit the data better than linear regression.

For this study, we use the Auto MPG dataset from the UC Irvine Machine Learning Repository.
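A minimal sketch of that comparison, assuming the ucimlrepo id for Auto MPG is 9 and that the fetched target column is named mpg:

```python
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Fetch Auto MPG (assumed repository id: 9) and drop rows with missing values
# (a handful of horsepower entries are empty).
auto = fetch_ucirepo(id=9)
df = auto.data.features.join(auto.data.targets).dropna()

X = df.drop(columns=["mpg"]).select_dtypes("number")  # numeric features only
y = df["mpg"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "Linear Regression": LinearRegression(),
    "Regression Tree": DecisionTreeRegressor(max_depth=4, random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.2f}")
```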

We have shown that the regression tree provides a slight advantage over linear regression (to see the results, scroll to the end of the cell above). Note that these values may vary depending on the size of the data set and the hyperparameters.

Finally, let’s visualize this comparison (to see the plots, scroll to the end of the cell below).
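One simple way to do this, reusing the models and the train/test split from the sketch above, is a predicted-versus-actual plot for each model:

```python
import matplotlib.pyplot as plt

# Points close to the dashed diagonal are accurate predictions.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    ax.scatter(y_test, model.predict(X_test), alpha=0.6)
    lims = [y_test.min(), y_test.max()]
    ax.plot(lims, lims, color="red", linestyle="--")
    ax.set_title(name)
    ax.set_xlabel("Actual mpg")
axes[0].set_ylabel("Predicted mpg")
plt.tight_layout()
plt.show()
```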

Thus, we have come to the end of our first article examining decision trees in depth. In the next part we will talk about the Bias-Variance Tradeoff, Bagging, Random Forests, Boosting, and Model Tuning.

See you in the next articles.
