
How to collect and process data to train a decision tree?

Roberto MURA
In this article, the second of the MLC AI tips articles, we explain how to conduct a robust and meaningful data collection campaign to train decision trees that can be embedded in the Machine Learning Core of our ST MEMS sensors.

1. Introduction

The usual flow for developing a machine learning classifier, and in particular a decision tree algorithm, is shown in the picture below. Data collection and data labeling (for supervised learning) are the first two steps. The model is then trained using the training data (a subset of the whole dataset). Finally, the model is transferred to and implemented on the device with the Machine Learning Core to perform real-time validation testing on the embedded hardware.

[Figure: machine learning classifier development flow]

2. Collecting data

Sensor data can be collected and labeled using various applications, such as Unico-GUI on PC, the ST BLE Sensor app, AlgoBuilder, and FP-SNS-DATALOG1, together with different hardware options such as the ProfiMEMS board, the SensorTile.Box, Nucleo boards, and the STWIN.

[Figure: data collection tools and hardware options]

A proper data collection campaign is based on the following general points:

  • Class definition. The classification problem must be defined in terms of the classes to be recognized. It should also be clarified what happens when an input does not belong to any of the defined classes; in this case, consider adding an “other/idle” class to the class list.

  • Sensor. Some classification problems can be solved using an accelerometer signal, some with a gyroscope signal, and some require both sensor inputs. For 6-axis inertial module products, it is recommended to enable logging from both sensors to investigate which configuration provides the best accuracy. A 3-axis magnetometer can also be accessed through the sensor hub capability in the 6-axis MLC family of sensors.

  • Sensor configuration. The various tools/systems used for data recording (e.g., Unico-GUI with the ProfiMEMS board) allow setting the sensor output data rates (ODR) and full scales (FS):

    • Make sure the sensor data is collected at an ODR and bandwidth sufficient to capture the needed information. The bandwidth, which represents the maximum signal frequency captured by the system, is generally half of the ODR. If the exact ODR cannot be selected in advance, it is preferable to set a high ODR (e.g., 104 Hz) to maximize bandwidth and then apply decimation/resampling methods (see the resampling sketch right after this list). Keep in mind that the chosen configuration must be maintained for the entire process, and that the ML algorithm will process the field data in the same manner.

    • The full scale (FS) should be defined in accordance with the application requirements. For example, the gyroscope FS can be very different for a vehicle application and a wrist application.

[Figure: sensor ODR and full-scale configuration]

  • Data consistency and balancing. Collect sufficient data to meet the application requirements and to represent the actual use case: if the application is based on a wrist device, do not include data recorded on smartphones. It is also important to collect a similar amount of data for each class, since the training algorithm tries to maximize total accuracy; classes with little data receive a lower weight, and the model may be biased toward classes with a large amount of data. As an extreme example, if class ‘A’ has 90% of the data and class ‘B’ has 10%, the model could assign all outputs to class ‘A’ and still reach 90% accuracy, which is of course wrong (a quick balance check is sketched below).

  • Data logging format. If data logging is done with a tool other than Unico-GUI, the data must be logged in a text file (.txt) following the same convention defined in Unico-GUI (for details, refer to the application note AN5259).
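
Following up on the ODR point above, the minimal Python sketch below shows one way to decimate a 104 Hz log down to 26 Hz. The rates, array shape, and synthetic data are assumptions for illustration only.

    import numpy as np

    # Hypothetical rates: log recorded at 104 Hz, decimated to 26 Hz
    ODR_IN = 104                    # Hz, logging rate
    ODR_OUT = 26                    # Hz, rate expected by the MLC configuration
    factor = ODR_IN // ODR_OUT      # integer decimation factor (4)

    # acc is an (N, 3) array of X/Y/Z samples in [mg]; synthetic data here
    acc = np.random.randn(1040, 3) * 50.0

    # Naive decimation: keep every 4th sample. If the signal has energy
    # above the new Nyquist frequency (13 Hz), low-pass filter it first
    # (e.g., with scipy.signal.decimate) to avoid aliasing.
    acc_26hz = acc[::factor]
    print(acc.shape, "->", acc_26hz.shape)   # (1040, 3) -> (260, 3)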

After completing this first data collection phase, a series of operations is needed to set up the classification problem. Data should be validated and cleaned to remove outliers and poor labeling. Visualizing the data is a good practice for data validation, feature selection, labeling, and more.

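As for the class-balancing point above, a quick check is to count the samples collected per class. The Python sketch below assumes a hypothetical folder layout with one directory per class, each holding the .txt logs for that class; this layout is an assumption, not an ST convention.

    from pathlib import Path

    # Hypothetical layout: dataset/<class_name>/*.txt, one folder per class
    dataset = Path("dataset")
    counts = {}
    for class_dir in sorted(dataset.iterdir()):
        if class_dir.is_dir():
            # total number of samples (lines) across all logs of this class
            counts[class_dir.name] = sum(
                sum(1 for _ in log.open()) for log in class_dir.glob("*.txt")
            )

    total = sum(counts.values())
    for name, n in counts.items():
        print(f"{name:15s} {n:8d} samples ({100 * n / total:.1f}%)")
    # A heavily skewed split (e.g., 90%/10%) suggests collecting more data
    # for the under-represented classes before training.
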
3. Visualizing data

Data visualization can be used for data pre-processing and analysis, and for understanding the relationships between different features and their variations across classes. This process can be divided into the time domain and the frequency domain. The main differences are shown in the next picture.

[Figure: time-domain vs. frequency-domain visualization]

3.1 Time Domain Visualization 

There are several options for visualizing the collected data. If we collect accelerometer data in [mg] and gyroscope data in [dps], we can use MATLAB/Python/Excel/R to generate plots like the ones in the following example.

[Figure: time-domain plots of accelerometer and gyroscope signals]
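
As a minimal Python/matplotlib sketch of such a plot, assuming a plain-text log with three whitespace-separated columns (X, Y, Z in [mg]) sampled at 104 Hz (the file name and format are illustrative assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    ODR = 104                             # Hz, sampling rate of the log
    acc = np.loadtxt("walking_log.txt")   # (N, 3) array, hypothetical file
    t = np.arange(acc.shape[0]) / ODR     # time axis in seconds

    for axis, name in enumerate(("X", "Y", "Z")):
        plt.plot(t, acc[:, axis], label=f"{name} [mg]")
    plt.xlabel("Time [s]")
    plt.ylabel("Acceleration [mg]")
    plt.legend()
    plt.show()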

The analysis of the signal in the time domain makes it possible to:

  • Visually inspect for poor-quality data or outliers.
  • Understand whether a pattern exists between different classes.
  • Select informative features (mean, peak-to-peak, peak, etc.).

3.2 Frequency Domain Visualization

Frequency-domain analysis can be helpful to complement the time-domain analysis. The same tools (MATLAB/Python/Excel/R) can be used to compute and plot FFTs or spectrograms. Such plots can provide insights about filters to apply to the signal before it is processed to compute features, and they are also useful for evaluating whether a specific recorded log is correctly labeled. For example, consider a data log for a “walking” use case: an FFT plot makes it possible to verify the dominant frequency of the processed signal. If this frequency is too high (e.g., 2.5 Hz), we can assume that the log should be labeled “running” instead of “walking”.
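
A minimal Python sketch of this dominant-frequency check is shown below; the file name, ODR value, and use of the accelerometer norm are illustrative assumptions.

    import numpy as np

    ODR = 104                               # Hz, sampling rate of the log
    acc = np.loadtxt("walking_log.txt")     # (N, 3) array, hypothetical file
    norm = np.linalg.norm(acc, axis=1)      # accelerometer norm, 1-D signal
    norm -= norm.mean()                     # remove the DC component

    spectrum = np.abs(np.fft.rfft(norm))
    freqs = np.fft.rfftfreq(norm.size, d=1.0 / ODR)
    dominant = freqs[np.argmax(spectrum)]
    print(f"Dominant frequency: {dominant:.2f} Hz")
    # A value well above ~2 Hz would hint at running rather than walking.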

The next picture shows the frequency spectrum for different human activities (walking, fast walking, jogging, biking) considered in a classification problem.

[Figure: frequency spectra of walking, fast walking, jogging, and biking]

From the plot, each activity has a distinct dominant frequency component, so appropriate filters can be applied to extract informative signals for each class. For example, applying a low-pass filter at 1.8 Hz separates the walking and fast-walking classes from the jogging class, which is characterized by higher energy at higher frequencies.
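
Such a filter could be sketched in Python with SciPy as follows, under the same assumptions as above (104 Hz log, 1.8 Hz cutoff; the filter order is an illustrative choice):

    import numpy as np
    from scipy.signal import butter, filtfilt

    ODR = 104       # Hz, sampling rate of the log
    CUTOFF = 1.8    # Hz, cutoff frequency from the example above

    # 4th-order Butterworth low-pass filter
    b, a = butter(N=4, Wn=CUTOFF, btype="low", fs=ODR)
    acc = np.loadtxt("activity_log.txt")     # (N, 3) array, hypothetical file
    acc_lp = filtfilt(b, a, acc, axis=0)     # zero-phase filtering per axis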
 

4. Cleaning and labeling data

Labeling sensor data is a crucial point in the development of decision tree models. The best practice is to conduct sanity checks to validate the collected data before starting the ML training exercise. Data cleaning is the process of fixing or dropping corrupted, incorrectly formatted, or incomplete data. Users can easily visualize and check, for instance, the mean, variance, minimum, and maximum values. If a computed feature is significantly different from the other logs in the same labeled class (when they are supposed to be similar in that application), the labeled data may not be clean.
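
A minimal Python sketch of such a sanity check is shown below; the folder layout, file names, and the 3-sigma outlier threshold are illustrative assumptions.

    import numpy as np
    from pathlib import Path

    # Compute basic statistics for every log of a (hypothetical) class folder
    logs = sorted(Path("dataset/walking").glob("*.txt"))
    means = []
    for log in logs:
        acc = np.loadtxt(log)                   # (N, 3) samples in [mg]
        norm = np.linalg.norm(acc, axis=1)
        means.append(norm.mean())
        print(f"{log.name}: mean={norm.mean():.1f} var={norm.var():.1f} "
              f"min={norm.min():.1f} max={norm.max():.1f}")

    # Flag logs whose mean deviates strongly from the class average:
    # they may be mislabeled or contain corrupted segments.
    means = np.array(means)
    for log, m in zip(logs, means):
        if abs(m - means.mean()) > 3 * means.std():
            print(f"Check the labeling of {log.name}")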

In most projects, the ground truth is known and recorded while collecting the data, and it is usually possible to start and stop each log so that only a single class is registered. However, for various reasons a single recorded log file may end up containing different classes. In such cases, the data should be truncated to one single class per file before the log is considered ready for ML training.
Consider again the activity recognition example with a wrist device. The next plot shows the acceleration signal of a motion pattern. Ideally, the goal was to collect separate data for the stationary, walking, and fast-walking classes; unfortunately, it is clear that the recorded log mixes all the activities. The user should truncate the log and label each sub-part correctly so that the model receives the right information as input (a truncation sketch follows the figure).

[Figure: mixed log containing stationary, walking, and fast-walking segments]
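
A minimal Python sketch of this truncation step follows; the sample ranges and file names are purely illustrative and would normally come from visual inspection of the plot.

    import numpy as np

    acc = np.loadtxt("mixed_log.txt")        # (N, 3) array, hypothetical file

    # Sample ranges identified by inspecting the plotted signal (illustrative)
    segments = {
        "stationary": (0, 500),
        "walking": (500, 1800),
        "fast_walking": (1800, 3000),
    }
    for label, (start, stop) in segments.items():
        # One single class per file, ready for ML training
        np.savetxt(f"{label}_segment.txt", acc[start:stop], fmt="%.3f")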

Another issue is shown below: a walking pattern was recorded, but stationary samples are present at the start and at the end of the log. In addition, in the middle of the recording there is a spike that does not belong to any class. Annotating recorded logs is recommended to avoid incorrect training data.

[Figure: walking log with stationary segments at the start and end and an unclassified spike]

All these issues can affect the generated decision tree and the performance of the MLC result. To summarize, here are some basic suggestions for a robust data collection:

  • The units of labeled data should be consistent across all recorded files; for example, use [mg] for acceleration and [dps] for angular velocity.

  • Separate the data by class using any available tool (Python, MATLAB, R, ...), or directly define a recording procedure that produces data already cleaned and truncated.

  • Remove data that is not part of any class, or use it for an “other/idle” class.
