Baby detector combining MEMS sensors and STM32 MCUs through Artificial Intelligence

Eleon BORLINI
ST Employee
Baby Detector is an application developed to sense the presence of a baby in a small environment. The idea is to tackle the problem of babies forgotten inside cars, which can lead to their death, especially on hot summer days. The core elements of the system are: an MP23ABS1 analog MEMS microphone, able to detect the sounds made by these babies; an STM32 microcontroller, able to process the audio data; and several motion and environmental sensors from the ST portfolio that monitor the environmental conditions, which can increase the danger for the baby if too harsh.

1.      Reason of the research
1.1        Available solutions
2.      ST solution
2.1       Artificial intelligence for audio
2.1.1       Dataset
2.1.2       Features
2.1.3       Training and analysis
2.2       Machine learning for motion
2.3       SensorTile.box: the board for application development
3.      Next steps

1. Reason of the research


Searching for new ways to innovate often passes through our daily experience, and this is the case for the application described here.


In the last 20 years, there were over 600 deaths of children abandoned inside cars worldwide. These were mainly caused by heatstroke, a type of severe heat illness that results in a very high body temperature (over 40.0 °C) and confusion, which can progress, if not treated, to seizures, muscle breakdown and kidney failure, eventually leading to death. A car left under the sun is affected by the greenhouse effect, which can heat up the inside of the car very quickly (80% of the temperature increase happens in the first 10 minutes), and cracking the windows does little to slow down the heating process. Babies are much more affected than adults by the heat inside the car, both practically and biologically: for a young child it is difficult, if not impossible, to get out of a locked car without help, and a child's body overheats 3-5 times faster than an adult's. No wonder 90% of recorded incidents involved children under 3 years old (55% in the 0-to-1-year span).
 

1.1 Available solutions


There are various solutions available on the market nowadays, both with and without certification, that can help parents remember their baby on board.
Many of these devices are based on a pressure sensor used as a weight scale: for example, MyMi and tata pad are small pads to be placed on the baby seat. The problem is that they can be fooled by the presence of another object on the seat, giving a false alarm (false positive) to the user, or, even worse, they can slip out of place and fail to raise the alarm even when the baby is there (false negative).
Other solutions rely too much on the user to be totally reliable: for example, BebèCare is a device attached to the seatbelt of the baby seat that must be activated by the user when the baby is in the car. If the adult walks away from the car without disabling it, thus forgetting the baby, an alarm is triggered, but this solution is not foolproof.
Lastly, there are apps or devices that simply remind the user that a baby may be on board when they exit the car, by linking to the vehicle's engine or to the navigation app. These can be turned off by users who find them annoying, or may simply not be activated when needed.
There is a need for a different, more reliable system, to which a parent can entrust the life of their baby, and ST has the resources to provide a device of this type.
 

2. ST solution


While the other devices focus only on detecting the presence of the baby, it is also useful to have information about the status of the environment the baby is in. Knowing whether the baby is in danger is almost as important as knowing whether the baby is inside the car. To classify the level of danger for a baby inside a car, two parameters can be taken into consideration: the presence of an adult and the temperature of the environment.
 
  • For the main task of the baby detector, an MP23ABS1 analog microphone was chosen to check whether a baby is present. Since a baby in danger would cry out loud, an Artificial Intelligence (AI)-powered baby crying detection system was developed using a Deep Neural Network (DNN), running on an STM32 microcontroller.
  • If an adult is inside the car with the baby, the danger level is low, but sensing an adult inside a car is not an easy task. This feature is implemented with the Machine Learning Core of the LSM6DSOX 6-axis motion sensor: it can detect whether the car is moving, implying that an adult is driving and has the situation under control.
  • Temperature plays an important role: if it is within an acceptable range, the parent should be informed of the danger to the baby, but when the temperature rises, a small delay can become fatal, and if the parent does not respond immediately, other recipients should be notified.
The concept of "sensor fusion" is based on using one sensor's strong points to make up for another sensor's shortcomings. In this case the microphone can detect the presence of a crying baby but cannot tell whether a dangerous situation is ongoing. The context recognition provided by the 6-axis device, with the help of the temperature sensor, can improve the accuracy of the system. The microphone has a higher power consumption (up to several hundred µA) than accelerometers or pressure sensors, but this can be reduced with the help of context recognition: there is no point in keeping the microphone running while the car is moving, so the microphone should run only when the accelerometer detects that the car is still, and should be kept off while the adult is driving. All the sensors involved in this system are managed by the STM32 microcontroller. For this application, the L4 family was chosen to show that the computational load of the developed algorithm is within a mid-tier microcontroller's reach. This also enables lower current consumption and a smaller footprint for the system.
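The duty-cycling policy described above can be sketched in a few lines; the function names and the state encoding below are illustrative placeholders, not part of any ST driver API:

```python
# Hedged sketch of the sensor-fusion gating policy: the power-hungry
# microphone runs only when the accelerometer reports the car as still,
# and the alert fires only when a cry is detected while the car is still.

CAR_STILL = "still"
CAR_MOVING = "moving"

def microphone_should_run(car_state: str) -> bool:
    """The microphone draws far more current than the accelerometer,
    so it is powered only when the car is parked (no adult driving)."""
    return car_state == CAR_STILL

def alert(baby_crying: bool, car_state: str) -> bool:
    """Raise the alarm only when a cry is detected AND the car is still."""
    return baby_crying and car_state == CAR_STILL
```

In a real firmware the car state would come from the LSM6DSOX Machine Learning Core and the cry flag from the neural network, with the MCU orchestrating both.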
The SensorTile.box is the perfect platform for developing this kind of system: it includes all the aforementioned components and adds the possibility of displaying the results on the ST BLE Sensor app, which can also send the data to a database or to other phones.

2.1 Artificial intelligence applied to audio systems


Artificial Intelligence (in short, AI) is the theory and development of computer systems able to perform tasks normally requiring human intelligence. Usually, large AI algorithms available online run on servers belonging to their developers; the additional challenge here is fitting a deep neural network capable of classifying a crying baby into the smaller memory of an STM32L4.
When building a Neural Network (NN), the usual starting phase is the creation or search of a suitable dataset, i.e. a collection of samples from which the NN can learn. This phase involves the acquisition and the proper labeling of homogeneous samples.
The next step is the extraction of "features" from the dataset samples: these are specific characteristics (mean, variance, or more sophisticated elaborations of the data) that the algorithm can recognize more easily.
Finally, the type of NN is chosen and the actual learning phase, called training, starts. The first results are usually not very good, so the next step is studying the algorithm's problems and starting over from the dataset phase to improve the overall accuracy of the system.

 
2.1.1 Dataset

The dataset for a NN classifying babies' cries should obviously be composed of audio samples of crying babies. The network must be provided both with examples that are babies and with ones that are not, to make it able to detect the difference. These samples must be labeled beforehand according to their content (for example "1" for a baby's cry, "2" for adult speech, and so on). It is not easy to find an open database of this kind; the most fitting set was found to be "donate a cry", composed of homemade recordings of crying babies. After analyzing this dataset, it was immediately clear that many samples were not usable for the training phase: many were too noisy, or full of people talking mixed with baby crying, or even clearly grown adults pretending to cry like babies. Being the only suitable base found, the dataset was manually checked and refined: the bad samples were deleted, or cut where possible, to keep only true crying samples.
Because the neural network must be able to discern which sample is a cry and which is not, a wider set of sounds was searched for, and two useful datasets were found: "Urban Sound" and "Human voice" by Kaggle.
  • The first dataset is composed of audio samples of street events, downloaded from www.freesound.org, collected by the Music and Audio Research Laboratory of New York University and then made public for research. It is divided into 10 sound categories: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. This dataset was used so that the network could also classify sounds coming from outside the car, and the original labels were kept, instead of putting a "no baby" label on all of them, to better analyze the training results.
  • The second dataset comes from an online competition series on www.kaggle.com, designed to divide human voices into male and female. The goal here is not to discern between them, but to be able to analyze the higher-pitched female voices. The labels were kept for this set, too.
Exploring the first results, two mistakes became clear: the amount of data on cries was far too low compared with the other audio samples, and the time window taken into consideration should have been smaller for long audio samples and constant across all data. So all data was divided into 1-second clips, to have the same amount of information for every sample. Moreover, for the baby crying data, a 90% overlapping window was chosen. This means that a 2-second sample was divided into 11 clips: one from 0.0 s to 1.0 s, the second from 0.1 s to 1.1 s, and so on until the last clip, from 1.0 s to 2.0 s. This way a much larger number of cry examples was obtained, comparable with the other data.
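The windowing step above can be sketched in a few lines of numpy; the function name and parameters are illustrative, not the authors' actual preprocessing script:

```python
import numpy as np

def split_into_clips(samples, sr, clip_s=1.0, overlap=0.9):
    """Split an audio array into fixed-length clips.

    overlap=0.9 reproduces the 90% overlapping window used to augment
    the cry data; overlap=0.0 gives a plain non-overlapping split.
    """
    clip_len = int(clip_s * sr)
    hop = max(int(clip_len * (1.0 - overlap)), 1)  # 0.1 s at 90% overlap
    clips = []
    start = 0
    while start + clip_len <= len(samples):
        clips.append(samples[start:start + clip_len])
        start += hop
    return clips

# A 2-second recording at 16 kHz with 90% overlap yields 11 one-second clips.
cry_clips = split_into_clips(np.zeros(32000), sr=16000)
```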
 
2.1.2 Features

After proper dataset preparation, the samples could in principle be fed directly to a machine learning program. It is however good practice to select and extract certain features from the data, to help the network learn. Features are measurable characteristics of the phenomenon being observed: for example, the Fast Fourier Transform (FFT) of an audio signal (for audio recognition), the brightness of an image (for day/night image classification), or the standard deviation of a motion signal (for activity recognition). More specifically, when working with audio classification NNs, the state of the art is to use the MFCCgram of the audio signal. The MFCCs (Mel Frequency Cepstral Coefficients) are a mathematical transformation of the FFT of the signal, obtained after a "mel filter" used to emphasize the frequencies audible to the human ear, while the MFCCgram is a visual representation of the changes of the MFCCs over time, like the spectrogram is for the FFT.
Considering that the NN has to run on a microcontroller and not on a server mainframe, methods were chosen to lower the computational work performed by the CPU. The idea was to use only the MFCCs of one second of audio, because the cry of a baby is not something that comes and goes in a matter of milliseconds, in conjunction with a few other features. In the first attempts many different features were tried (MFCC, mel spectrum, chroma, contrast, tonnetz, mean energy, centroid and roll-off frequency), but as the tests went on it turned out that increasing the MFCC and mel spectrum accuracy (using a higher frequency resolution) and removing the other features worked better. In the end, a set of 158 features, composed of 128 mel spectrum samples and 30 MFCC samples, gave the best results for this application, proving that sometimes computational power is not as important as the work behind the development.
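A self-contained numpy sketch of the 158-value feature vector (128 log-mel energies plus 30 MFCCs) is shown below. The filterbank construction, FFT size and sample rate are assumptions for illustration; the article does not state the exact parameters used:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def dct2(x):
    """Unnormalized DCT-II, enough for illustrative MFCCs."""
    n = x.shape[0]
    idx = np.arange(n)
    return np.cos(np.pi / n * (idx[None, :] + 0.5) * idx[:, None]) @ x

def extract_features(clip, sr=16000, n_mels=128, n_mfcc=30):
    """128 log-mel energies + 30 MFCCs = 158 features per 1-second clip."""
    n_fft = len(clip)
    power = np.abs(np.fft.rfft(clip)) ** 2          # power spectrum
    mel = mel_filterbank(n_mels, n_fft, sr) @ power  # mel-filtered energies
    log_mel = np.log(mel + 1e-10)
    mfcc = dct2(log_mel)[:n_mfcc]                    # cepstral coefficients
    return np.concatenate([log_mel, mfcc])
```

In practice a library such as librosa would compute the mel spectrum and MFCCs in two calls; the point here is only to show where the 158 numbers come from.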
 
766.png
 
2.1.3 Training and analysis

The network development and its training were performed with the Keras tool (based on Python), and they underwent many changes during the process: every time a training run ended, the results were analyzed with the help of the "confusion matrix", a means of knowing which categories were mistaken for which, i.e. a measure of the "cross-talk" (false positives and negatives) among categories. In this application, the confusion matrix shows the number of times, and the percentage, a baby cry was mistaken for a dog barking, or a man talking was mistaken for a baby.
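A confusion matrix is simple to build by hand; the toy labels below are illustrative, not drawn from the actual evaluation:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j.
    Off-diagonal entries are the 'cross-talk' between categories."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes: 0 = baby cry, 1 = dog bark, 2 = adult speech
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, 3)
# cm[0, 1] counts cries mistaken for barks (false negatives for the cry class);
# cm[2, 0] counts speech mistaken for a cry (false positives).
```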

As lessons learned, this matrix led to the recognition of several errors: the above-mentioned excess of "not baby" data; the use of wrong parameters in the training phase; the fact that a higher number of nodes does not help certain categories; and, most importantly, that different features help recognize different categories, so, for example, while using tonnetz improved the classification accuracy for street music, the added complexity hurt the recognition of baby cries. A linear model was initially tested alongside a neural network to compare the results, but its accuracy was not sufficient, so the focus shifted to a deep feed-forward neural network. Several tests were then performed with different structures for the network: the number of hidden layers and their neurons, along with the activation functions, were varied to obtain the highest possible accuracy. In the end, the best trade-off between accuracy and complexity was found to be a neural network with 2 hidden layers of 100 neurons each.
The overall accuracy of the final network is 85% over the whole dataset, and almost 90% on the baby crying category for both training and cross-validation. This result is good in theory, but when the network analyzes real data a whole new set of variables comes into play, and the resulting accuracy can be slightly lower.
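For reference, the final topology (158 input features, two hidden layers of 100 neurons) can be sketched as a plain feed-forward pass. The activation functions, the class count and the random weights below are assumptions for illustration; the actual model was built and trained in Keras:

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_HIDDEN, N_CLASSES = 158, 100, 12  # class count is illustrative

# Randomly initialised weights stand in for the trained Keras parameters.
W1 = rng.normal(0, 0.1, (N_FEATURES, N_HIDDEN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN));   b2 = np.zeros(N_HIDDEN)
W3 = rng.normal(0, 0.1, (N_HIDDEN, N_CLASSES));  b3 = np.zeros(N_CLASSES)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(features):
    """158 features -> 100 -> 100 -> class probabilities."""
    h = relu(features @ W1 + b1)
    h = relu(h @ W2 + b2)
    return softmax(h @ W3 + b3)
```

A network this small (around 27 000 parameters) is what makes the later porting to an STM32L4 straightforward.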
 

767.png


 

2.2 Machine Learning applied to motion context recognition


Since the application is also intended to recognize the context, and not only to monitor the baby's condition, the 6-axis (accelerometer and gyroscope) motion sensor LSM6DSOX has been used, taking advantage of its embedded Machine Learning capability, able to provide a classification without using microcontroller resources. Its Machine Learning Core (MLC) is in fact designed to process the data inside the sensor itself. The MLC implements a relatively simple decision-tree logic: a decision tree is a mathematical tool composed of a series of configurable nodes, each characterized by one "if-then-else" condition (an input signal compared against a threshold). You can find a broad literature at THIS LINK .
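Conceptually, each MLC node compares a feature computed from the sensor signal against a threshold. A two-node tree for the still/moving classification could look like the sketch below; the feature names and thresholds are illustrative, not the actual MLC configuration:

```python
def classify_motion(acc_variance, acc_peak):
    """Toy decision tree mirroring the MLC's if-then-else node logic.

    acc_variance and acc_peak stand in for features the MLC can compute
    on accelerometer data; thresholds are illustrative placeholders.
    """
    if acc_variance < 0.01:      # node 1: almost no vibration -> parked
        return "still"
    if acc_peak < 0.5:           # node 2: mild vibration, e.g. engine idling
        return "still"
    return "moving"
```

In the real device, the tree is generated offline by the MLC configuration tool and loaded into the sensor's registers, so the classification runs without waking the MCU.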
 
768.png

You can also find many examples of the use of the MLC on GitHub (LINK).
 

2.3 Application development board

 
When porting a neural network to a microcontroller application, the STM32Cube.AI pack is warmly suggested for converting the network. The AI tool embedded in this software can convert most neural network structure files into C code, and import into the firmware project all the libraries needed to use it on the microcontroller. STM32Cube.AI can also check whether the neural network in question fits the capabilities of the chosen microcontroller. In this case, the network structure was very small with respect to the STM32L4R9 memory size.
After converting the feature extraction algorithm as well, the application was finally ready to be tested. The chosen platform is the SensorTile.box, a readily available board packing together a MEMS microphone, an LSM6DSOX and a temperature sensor connected to an MCU (STM32L4R9). This demo board can also connect to a phone app (ST BLE Sensor) to give graphical feedback to the user. This allows displaying both the NN and the MLC outputs on the same page, with a third icon that alerts the user only if the baby is crying and the vehicle is still. The temperature threshold above which the situation becomes critical has not been decided yet, so it is not implemented, but the system is ready to accommodate that feature too.
 
769.jpg      770.jpg

 

3. Next steps

 
The first next step is testing the performance of the application on real-world data. The tests so far were performed on a small set of babies (without resorting to dangerous methods to make them cry), and changing the test subjects may reveal some problems. Even a slight decrease in the system's accuracy must be investigated and corrected if the Baby Crying Detector is to be used as a safety device.
Research should be carried out to determine the temperature range in which the baby is in danger, and so decide which threshold to implement in the system. This research should also involve medical staff, to make sure the safety levels are kept to the highest possible standard.
Also, the demo application is ready to be used with the ST BLE Sensor app, but this app is made to showcase the capabilities of the ST ecosystem to everyone: it cannot run in the background and send notifications to the user. For this reason, a custom app must be developed, to communicate the results quickly to the parent, or even to designated authorities, with a system based on the e-911 used in the USA.
Version history
Last update:
‎2021-10-21 08:24 AM