
How to evaluate the performance of a decision tree

Roberto MURA

In this article, the third in a series of MLC AI tips articles, we explain how to evaluate the performance of a trained decision tree model, in order to build the best possible model for the Machine Learning Core embedded in the latest generation of ST MEMS sensors.

1. Introduction 

How can we know whether the decision tree model we have trained is good enough? There are multiple methods available to measure model performance. The most common Key Performance Indicator (KPI) used to judge an ML model is the accuracy, calculated as the percentage of correct predictions over the total number of predictions. Other metrics can be derived from different combinations of the basic elements: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
With these, it is possible to compute:

  • Precision: the ratio of correctly predicted positive observations (TP) to the total predicted positive observations (TP + FP)

  • Recall (or sensitivity): the ratio of correctly predicted positive observations (TP) to all observations in the actual positive class (TP + FN)

  • F1-score: the harmonic mean of precision and recall, often used as an alternative to accuracy.

[Figure: 1095.png]
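For a binary problem, these metrics follow directly from the four counts. As a minimal sketch in Python (the counts below are made-up example values, not results from this article):

```python
# Computing the KPIs above from the four basic counts of a binary confusion matrix.
# The counts are illustrative values only.
tp, tn, fp, fn = 90, 85, 10, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)            # correct predictions / all predictions
precision = tp / (tp + fp)                            # correct positives / predicted positives
recall = tp / (tp + fn)                               # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```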


While the discussed KPIs summarize the performance results, as an alternative we can also directly visualize the confusion matrix obtained on the testing dataset, as in the example shown below. The numbers on the diagonal represent correct predictions and should be maximized, while all the numbers outside the diagonal are prediction errors.
If a specific class tends to be often confused with another one, it may be necessary to re-examine the training dataset for possible errors, or to revise the feature selection.
 

[Figure: 1096.png]
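Outside Unico-GUI, a quick way to obtain such a confusion matrix during prototyping is a generic Python/scikit-learn workflow. The sketch below uses synthetic data as a stand-in for the extracted MLC features:

```python
# Sketch: train a decision tree and plot its confusion matrix with scikit-learn.
# Synthetic data is used here as a placeholder for real feature logs.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))   # rows: true class, columns: predicted class
ConfusionMatrixDisplay(cm).plot()                    # correct predictions lie on the diagonal
plt.show()
```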

2. Cross-validation techniques

We often randomly split the data into training and testing sets to train and test the model. The training data is assumed to capture enough of the data diversity to classify the testing data well.
To select a stable model, a cross-validation method can be implemented. Cross-validation, also known as an out-of-sample technique, is useful for the following reasons:

  • For some specific problems, a random split may place all the data of a particular use case in the testing set and none in the training set. In this case, the model would perform poorly, since it has not been trained for that use case.

  • Since we are performing a random split, the accuracy of the model could change after every new split.
 

2.1. What are the cross-validation techniques available?

Various cross-validation techniques are available, each with its advantages and disadvantages. Among them, we can mention Leave-one-out, Leave-p-out, K-fold, and Holdout.
The most commonly used method is K-fold cross-validation (typically 10-fold). With this method, the training data is divided into k subsets (folds) and the training is repeated k times.
 

[Figure: 1097.png (5-fold cross-validation)]


At each iteration, one of the subsets is used as the testing set and the other k-1 subsets are used as the training set. In this way, each subset is used k-1 times for training and once for testing. The benefit of this approach is that the error estimation is averaged over all k trials to obtain the overall effectiveness of the model. At the end of the K-fold validation, the final model is built using all the training data.
K-fold cross-validation summarizes the errors across the folds through statistical measures such as the mean squared error, the root mean squared error, or the median absolute deviation. These measures are useful for judging the quality of the model and selecting the most appropriate one.
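As a sketch of how a K-fold evaluation looks in a generic Python/scikit-learn prototyping flow (synthetic data stands in for the real sensor features):

```python
# Sketch: 5-fold cross-validation of a decision tree with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores)          # one accuracy value per fold
print("mean accuracy  :", scores.mean())   # error estimate averaged over all folds
```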

 

3. Underfitting and Overfitting

To guarantee a defined level of quality for the trained model, we try to avoid the two most common causes of poor model performance: underfitting and overfitting.

  • Underfitting happens when a machine learning model is not complex enough, or does not have sufficient information in the selected features, to capture the relationships contained in the dataset and thus distinguish the classes. The concepts of underfitting, good fitting (robust model), and overfitting are illustrated below.

  • Overfitting is a modeling error that occurs when the model performs well on the data points it was trained on but poorly on unseen data: it shows good accuracy on the training dataset but low accuracy during testing or on a newly collected dataset.

[Figure: 1098.png]
 

3.1. How to determine if the model is underfitting?

When the model is underfitting, it fails to capture the data patterns contained in the training data. In other words, if the trained model performs poorly on both the training data and the test data, underfitting is a likely cause. In this scenario, the selected features may not be sufficient to capture the information the model needs to fit the data.
Underfitting is easy to handle: the standard solution is to add extra features and train the model again. Another option is to reduce the number of classes; in certain cases, if the data of two classes look similar, the model may not be able to do a decent job of separating them. Adding more diverse data may also help address the underfitting issue, because it allows the model to increase its complexity and learn the patterns that were missed in the previous training.
 

3.2. How to determine if the model is overfitting?


Overfitting generally takes the form of an overly complex model built to capture every detail of the training data. This situation arises when the model has so much flexibility and freedom that it fits as many points as possible, interpreting noise as an actual pattern. Cross-validation generally helps in identifying overfitting: when there is a large gap between the accuracy obtained on the training dataset (say 95%) and the accuracy obtained on the testing dataset (say 50-70%), an overfitting problem is likely. The RMSE (Root Mean Square Error) obtained in K-fold validation is a cumulative error over the various validation folds. If the final model accuracy is good but the RMSE is high (> 0.2), this is also a strong sign of overfitting.
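During prototyping, a minimal check of this train/test gap can be sketched with scikit-learn on synthetic data:

```python
# Sketch: a fully grown decision tree usually scores much higher on the training
# set than on held-out data, which is the typical signature of overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))   # often close to 1.0
print("testing accuracy :", clf.score(X_test, y_test))     # a large gap suggests overfitting
```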

To address and reduce overfitting the following techniques can be adopted.


Collect more logs
This is usually a straightforward solution to an overfitting issue. New logs should be collected in different situations and with greater diversity. The collected data should capture the various use cases that the model will encounter in a real-world application.
Data augmentation
Sometimes we do not have multiple users to collect logs, or it is not easy to reproduce diverse environments and conditions. In such cases we can consider adding noise to the raw data or rotating it at different angles. For example, let us consider the head gesture detection problem. The objective is to detect the classes “nod”, “shake”, “stationary”, “walk”, and “swing” with the sensor attached to a headset. We expect the y-axis to point towards the Earth (same orientation as gravity) and the x-axis to be the user’s heading direction. However, if the collected data covers only a limited number of users and a limited set of orientations, the model could suffer from two issues.
First, the decision tree trained on this data would work perfectly for the people who collected the logs, but could show poor performance on other, unseen users.
Second, the model would be tuned for a specific orientation, so the obtained performance could be affected by the user’s preferred orientation of the headset. To overcome these issues, we can:

  • Inject noise, as mentioned earlier, to handle small variations in the data.

  • Rotate the existing data at different angles that are possible during actual usage and add the rotated data to the training dataset. The trained model will then cover most of the users’ preferred orientations, and the solution will be robust against sensor orientation.
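A minimal sketch of these two augmentations for a 3-axis accelerometer log (the noise level and rotation angles are illustrative values, not tuned recommendations):

```python
# Sketch: noise injection and rotation of 3-axis accelerometer data, assuming an
# (N, 3) array of x/y/z samples with the y-axis aligned with gravity.
import numpy as np

def add_noise(acc, sigma=0.02):
    """Inject small Gaussian noise (in g) to mimic sensor and user variability."""
    return acc + np.random.normal(0.0, sigma, acc.shape)

def rotate_about_y(acc, angle_deg):
    """Rotate around the gravity (y) axis to mimic a different headset heading;
    rotations around the other axes can be built the same way."""
    a = np.radians(angle_deg)
    rot = np.array([[ np.cos(a), 0.0, np.sin(a)],
                    [ 0.0,       1.0, 0.0      ],
                    [-np.sin(a), 0.0, np.cos(a)]])
    return acc @ rot.T

acc = np.random.randn(200, 3) * 0.1    # placeholder log; use real recorded logs in practice
augmented_logs = [add_noise(acc)] + [rotate_about_y(acc, a) for a in (15, -15, 30)]
```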
Reduce number of features
Selecting a large number of features during the training phase gives the model more flexibility and freedom to fit all the training data, but it can cause overfitting. Removing the features that are not very important helps lower the complexity of the model and can solve the problem. As mentioned when discussing feature visualization in the previous article, plotting and visualizing the features helps in understanding the relationship between features and classes.
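One simple way to identify candidate features to remove, sketched here with scikit-learn's feature importances on synthetic data (the feature names are hypothetical):

```python
# Sketch: rank features by importance; features with importance near zero are
# candidates for removal to reduce model complexity.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, importance in sorted(zip(names, clf.feature_importances_),
                               key=lambda p: p[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```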
 

Pruning
Pruning means simplifying the decision tree by removing the sections of the tree that are not critical. These typically correspond to nodes that classify only a few data points. There are two types of pruning:

  • Pre-pruning: the growth of the tree is stopped early, before it fully classifies the training subset.

  • Post-pruning: after the full decision tree has been built, a pruning logic decides whether to trim a certain node or subtree.

Most of the time, post-pruning is preferred, since it is not easy to know in advance the right moment to stop growing the tree. In post-pruning, the logic first grows the complete decision tree and then, using specific criteria (such as information gain or class imbalance), starts removing the non-critical parts of the tree. For instance, if a node classifies only a few samples of a particular class while the large majority of its samples belong to another class, we can replace that node with a leaf assigned to the majority class. The performance of the resulting decision tree will not be significantly affected by this operation.
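For reference, in a scikit-learn prototyping flow (not the Unico-GUI tool itself) the two pruning strategies map roughly onto the following options:

```python
# Sketch: pre-pruning vs. post-pruning options in scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing the tree early by limiting its depth and leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=6, min_samples_leaf=10, random_state=0)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
```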
Pruning removes the leaves of the decision tree that may cause overfitting. Let us take the nodes shown in the figure below as an example.
 

[Figure: 1099.png]

The left-hand node (Feature 4 ≤ -1.71) of the above decision tree classifies a single exception sample, while the other 13 samples belong to another class. In this case, it is better to turn this node into a leaf.
The decision tree algorithm inside Unico-GUI offers a pruning option: we can set a maximum number of nodes and a confidence factor, and the tree is pruned until the conditions are met.
 

[Figure: 1101.png]

The way to determine the optimum level of pruning is to monitor the testing and training accuracy. If we plot the accuracy (or error) of both the testing and training datasets against the level of pruning, we tend to obtain a plot like the one shown below.
 

[Figure: 1103.png]

Without pruning, the model error on the training data is lowest, but the error on the testing set can be high due to overfitting. As we increase the level of pruning, the error on the training set increases, while the error on the validation set starts to decrease, because some of the removed nodes were fitted to noise. As we increase the pruning further, the error on both the validation and the training set increases, because pruning starts removing nodes that classify correctly. This is the point at which to stop the pruning process.
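A similar curve can be reproduced during prototyping by sweeping a pruning parameter and tracking both accuracies; a sketch with scikit-learn's cost-complexity pruning (on synthetic data) follows:

```python
# Sketch: sweep the pruning strength (ccp_alpha) and plot training vs. testing
# accuracy to locate the point where further pruning starts to hurt both.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
         for a in path.ccp_alphas]

plt.plot(path.ccp_alphas, [t.score(X_train, y_train) for t in trees], label="training accuracy")
plt.plot(path.ccp_alphas, [t.score(X_test, y_test) for t in trees], label="testing accuracy")
plt.xlabel("pruning strength (ccp_alpha)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```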


An overview of underfitting and overfitting issues and solutions is summarized below.
 

[Figure: 1106.png]

 

 