After the training dataset has been pre-processed (in turn derived from a proper sensor configuration and an organized data collection phase), it is time to train the decision tree model: in particular, to define the window length and start the feature selection process. This step is important because computing all the available features is neither feasible nor advisable: among other drawbacks, the use of too many features leads to large ML models and a high risk of overfitting.
Window size is defined as the number of samples (the window) over which features are computed. It is strongly application-dependent and should be chosen to capture all the information needed to separate the different classes. A short window makes the response quicker, but it can also reduce model accuracy if it is too small to capture enough information.
If the motion is repetitive at some frequency, the ideal window size should be long enough to capture the entire motion pattern. Also consider the slowest-varying signal: for example, to distinguish class A, which has a periodicity of ~0.9 Hz, from class B, which has a periodicity of ~1.9 Hz, the window size should be based on class A and set to (1/0.9 Hz) ≈ 1.1 s or longer.
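The rule above can be sketched as a small helper that converts the slowest class periodicity into a window length in samples. The 26 Hz sensor output data rate used here is an assumed example value, not one taken from the original text.

```python
# Sketch: choosing a window length from the slowest class periodicity.
# The 26 Hz output data rate (ODR) is an assumed example value.

def window_samples(slowest_freq_hz: float, odr_hz: float, margin: float = 1.0) -> int:
    """Number of samples needed to cover at least `margin` full periods
    of the slowest-varying class signal."""
    period_s = 1.0 / slowest_freq_hz          # one full motion cycle
    return int(round(margin * period_s * odr_hz))

# Class A repeats at ~0.9 Hz -> the window must span at least 1/0.9 ≈ 1.1 s.
n = window_samples(slowest_freq_hz=0.9, odr_hz=26)
print(n)  # 29 samples at 26 Hz
```

A `margin` greater than 1 can be used to guarantee that at least one complete cycle always falls inside the window regardless of where it starts.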
Selection of informative features starts by observing the motion pattern in the collected data. Consider the example of recognizing head gestures such as a “nod”, with the sensor mounted on a headset so that the y-axis points toward Earth (gravity direction) and the x-axis is aligned with the user's heading direction. Here we might choose the variance, mean, minimum, and maximum of the x-axis and y-axis acceleration, because these features change when the user nods.
Take another example. If we mount sensors on a drilling machine, we do not want to limit the use of the machine to specific orientations. In this case we should NOT apply orientation-dependent features such as the mean, minimum, and maximum of the x, y, and z axes. We might instead choose variance, zero-crossing, peak detection, and energy of the norm (the magnitude of the x, y, z axes). These features are independent of the orientation of the drilling machine and are therefore well suited to detecting vibrations.
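As a rough illustration of why the norm is orientation-independent, the sketch below computes a few such features on the magnitude of a 3-axis window. The window shape (52 samples, i.e. 2 s at an assumed 26 Hz ODR) and the exact feature definitions are illustrative, not the MLC's internal ones.

```python
import numpy as np

# Sketch: orientation-independent features computed on the acceleration
# norm; `window` is an (N, 3) array of x, y, z samples.

def norm_features(window: np.ndarray) -> dict:
    mag = np.linalg.norm(window, axis=1)      # |a| per sample: rotation-invariant
    centered = mag - mag.mean()
    zero_crossings = int(np.sum(np.diff(np.sign(centered)) != 0))
    return {
        "variance": float(mag.var()),
        "energy": float(np.sum(mag ** 2) / len(mag)),
        "peak_to_peak": float(mag.max() - mag.min()),
        "zero_crossings": zero_crossings,     # crossings of the mean level
    }

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=(52, 3))        # one example window of 3-axis data
print(norm_features(w))
```

Rotating `w` by any fixed rotation matrix leaves `mag`, and hence all four features, unchanged, which is exactly the property wanted for the drilling-machine case.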
One approach to derive a first pool of candidate features is to select all the available features and train the model; in our case, the available features are pre-defined in the MLC. During training, the decision tree learning algorithm will select the features that best discriminate the classes. Of course, we can also try to select features which intuitively fit the problem.
This approach has drawbacks, such as the possibility of obtaining an overfitted model due to the use of too many features, or of features that are not actually relevant to the detection of the specific classes. It is therefore recommended not to rely on all features in the first place: a better approach is to identify a set of potentially good features and then train the model to see which ones are mainly used.
Visualization of the computed features may help in selecting and dropping features to optimize model performance. The information contributed by a single feature can be evaluated by measuring how well it separates the different classes; the analysis can be performed visually with 1D/2D representations.
The figure below on the left shows the histogram of a single feature applied to a 3-class problem. In this case the classes are well separated, so the feature is a good candidate for model training: the distributions of the feature value for the different classes are so distinct that even simple thresholds could separate the classes.
It may also be useful, when possible, to plot a Cartesian diagram of two features, as in the picture above on the right, which shows a 2D plot for a 2-class classification problem and a selection of two different features. Here the separation is evident, demonstrating the information gained by combining the two selected features. This kind of plot is also helpful for evaluating the relationship between different features and classes.
The same approach is supported by ML tools such as WEKA. The 2D plots shown below represent the relationship between different features and classes: the feature combinations in the green block show good separability, while the ones in the red block do not.
Additional considerations for feature inclusion are correlation and variability: highly correlated features carry largely redundant information, while a feature whose value barely varies across classes carries little discriminative information.
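The correlation check can be sketched as follows: compute the pairwise Pearson correlation of the candidate features and flag pairs above a threshold as redundant. The 0.95 threshold and the feature names are illustrative choices, not values from the original text.

```python
import numpy as np

# Sketch: flag feature pairs whose absolute Pearson correlation exceeds
# a threshold; one feature of each flagged pair adds little new information.

def redundant_pairs(X: np.ndarray, names: list, threshold: float = 0.95):
    corr = np.corrcoef(X, rowvar=False)       # features are columns
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j]))
    return pairs

rng = np.random.default_rng(2)
mean = rng.normal(size=500)
X = np.column_stack([mean,
                     2 * mean + 0.01 * rng.normal(size=500),  # near-duplicate
                     rng.normal(size=500)])                   # independent
print(redundant_pairs(X, ["mean", "mean_scaled", "variance"]))
# -> [('mean', 'mean_scaled')]
```

When a pair is flagged, keeping only one of the two features usually costs no accuracy while shrinking the model.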
Finally, ML tools provide built-in options to identify the best subset of informative features. An example of feature ranking in WEKA is shown below: the problem has 30 features, and by using the feature ranking logic WEKA selects only 11 important features for the current dataset. Even then, it may be a good idea not to use them all, but to choose only the highest-ranked ones.
Basic features such as mean, variance, and energy contain direct information from the raw signal samples, but they might not be able to separate classes with different dominating frequencies. It is always possible, and in many cases advisable, to apply filters to the raw signals to isolate specific information and separate classes more efficiently. As an example, consider a human activity detection problem: we can precisely distinguish whether the user is walking (typically 1-2 Hz) or jogging (2.5-4 Hz) using appropriate filters. A Fourier analysis provides insight into which frequency region dominates for each class.
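A minimal Fourier analysis of this kind can be sketched with numpy's real FFT. The 26 Hz sampling rate and the two synthetic sinusoids standing in for walking and jogging signals are assumptions for the example.

```python
import numpy as np

# Sketch: locate the dominant frequency of a signal with an FFT, to decide
# which band separates walking (~1-2 Hz) from jogging (~2.5-4 Hz).

def dominant_freq(signal: np.ndarray, fs: float) -> float:
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))  # remove DC first
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return float(freqs[np.argmax(spectrum)])

fs = 26.0                                      # assumed accelerometer ODR
t = np.arange(0, 10, 1 / fs)
walking = np.sin(2 * np.pi * 1.5 * t)          # synthetic ~1.5 Hz pattern
jogging = np.sin(2 * np.pi * 3.0 * t)          # synthetic ~3.0 Hz pattern
print(dominant_freq(walking, fs), dominant_freq(jogging, fs))  # ≈ 1.5 and 3.0
```

With the dominant regions identified, a cutoff can be placed between them (here, somewhere around 2-2.5 Hz) before computing the features.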
This process involves the selection of a proper filter type (low-pass, high-pass, band-pass) and the definition of the cutoff frequencies. Consider the case of energy as the sole feature for the classification problem. The filter and frequency region of interest are chosen by finding, with an FFT, where the significant energy of the classes does not overlap. The plot below illustrates the overlapping region and the frequency region of interest for classes S1 and S2.
If we compute the energy over the entire span of frequencies (0-250 Hz), we risk collapsing the information of both classes into similar values: energy alone may not be sufficient to separate two classes that are similar in magnitude but differently distributed in frequency. The best option is to apply an appropriate filter that removes a specified frequency region from the signal before computing the feature.
Taking the example in the figure above and applying a low-pass filter at 60 Hz, the signal S2 is completely filtered out, while the part of S1 below 55 Hz passes through. If we now compute the energy of both signals after filtering, S1 will have higher energy than S2, because the S2 signal has been completely removed.
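The filter-then-compute step can be sketched with scipy. The 500 Hz sampling rate (giving the 0-250 Hz span), the filter order, and the two single-tone signals standing in for S1 and S2 are assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Sketch: apply a low-pass filter before computing energy, so two classes
# with similar total energy but different frequency content become
# separable. s1/s2 are synthetic stand-ins for the figure's signals.

fs = 500.0                                    # sampling rate -> 0-250 Hz span
t = np.arange(0, 2, 1 / fs)
s1 = np.sin(2 * np.pi * 40 * t)               # class S1: content below ~55 Hz
s2 = np.sin(2 * np.pi * 120 * t)              # class S2: content above 60 Hz

b, a = butter(4, 60, btype="low", fs=fs)      # 4th-order low-pass at 60 Hz

def energy(x: np.ndarray) -> float:
    return float(np.sum(x ** 2) / len(x))

e1 = energy(filtfilt(b, a, s1))               # S1 passes almost untouched
e2 = energy(filtfilt(b, a, s2))               # S2 is strongly attenuated
print(e1, e2)
```

Before filtering, both sinusoids have the same energy; after the 60 Hz low-pass, the energy feature alone separates the two classes cleanly.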
Once we have defined the window length, the informative features, and suitable filters, it is time to train the model. For this purpose, the dataset must be divided into training, testing, and (if needed) validation sets.
For every machine learning model it is important to split the data into training, testing, and (if needed) validation sets. The training data is used to train the model, the testing data to evaluate the quality/accuracy of the trained model, and the validation data is used when there is more than one model to evaluate. Comparing the accuracy on training and testing data helps decide whether the model is underfitting, overfitting, or well balanced. For example, if the accuracy on the training data is high (> 80%) while on the test data it is low (< 60%), the model is likely overfitting and may require some pruning or more diverse data. On the other hand, if the accuracy on both training and testing data is low (< 60%), the model is likely underfitting and may require more features or less pruning.
Usually, the ratio between training and testing samples is about 80:20 or 70:30 when evaluating a single model. When evaluating multiple models (from different training algorithms such as J48 and CART, or using different feature sets), the data should instead be split into training, validation, and testing sets in a 60:20:20 ratio. The validation data should only be used to select the best model, and the test data to evaluate the final accuracy of the selected model. The split ratio is problem-specific and can be adjusted according to the application requirements.
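A 60:20:20 split of the kind described can be sketched in a few lines. The shuffling, the fixed seed, and the placeholder `X`/`y` arrays are choices made for the example.

```python
import numpy as np

# Sketch: a shuffled 60/20/20 train/validation/test split, as used when
# comparing multiple candidate models. X and y are placeholder arrays.

def split_dataset(X, y, ratios=(0.6, 0.2, 0.2), seed=0):
    assert abs(sum(ratios) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # shuffle before splitting
    n_train = int(ratios[0] * len(X))
    n_val = int(ratios[1] * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X = np.arange(100).reshape(100, 1)            # 100 dummy samples
y = np.arange(100) % 2                        # 2 dummy classes
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 60 20 20
```

For a single model, passing `ratios=(0.8, 0.0, 0.2)` or `(0.7, 0.0, 0.3)` reproduces the simpler two-way splits mentioned above.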