DMS on STM32MP2

20DeViL00
Associate II

Hi everyone,

I’m working on a Driver Monitoring System (DMS) pipeline on the STM32MP25 and trying to understand both NPU utilization and overall performance limitations.

Setup details:
Platform: STM32MP25

Framework: stai_mpu_network (using .nb models with use_hw_acceleration=True)

Camera pipeline: GStreamer (appsink + cairooverlay)

Current performance: ~15 FPS

GPU load: ~25%

Models in use:
Face Detection (.nb) → runs every frame

Face Landmark (.nb) → runs every 3 frames

Iris Landmark (.nb) → runs every 3 frames (same schedule as landmarks)

YOLOv8n (Smoking/Calling, .nb) → runs every 3 frames (offset scheduling)

Scheduling logic:
Face detection: every frame

Landmark + eye models: frame_count % 3 == 0

YOLO model: frame_count % 3 == 1

So not all models are executed every frame, but FPS is still limited to ~15.
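For reference, the scheduling described above can be sketched as a small helper (function and stage names here are illustrative, not from the actual application):

```python
def models_for_frame(frame_count):
    """Return which model stages run on a given frame index."""
    stages = ["face_detection"]  # face detection runs on every frame
    if frame_count % 3 == 0:
        # landmark + eye models share the same schedule
        stages += ["face_landmark", "iris_landmark"]
    elif frame_count % 3 == 1:
        # YOLO runs on an offset schedule so it never coincides with landmarks
        stages.append("yolov8n_smk_call")
    return stages
```

With this scheme, at most two heavy stages ever run in the same frame, yet the pipeline is still capped at ~15 FPS.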

Questions:
1. NPU verification
Even though .nb models are used with HW acceleration enabled, I don’t have confirmation that inference is actually running on the NPU.

How can I verify that the NPU is being used (logs, tools, counters)?

Is there a way to monitor NPU utilization in real time?

2. Performance / FPS optimization
Given that:

GPU usage is only ~25%

Models are not all running every frame

Still only ~15 FPS

What are the typical bottlenecks on STM32MP25 in this kind of pipeline?

3. Face recognition pipeline
We are planning to extend this DMS pipeline with face recognition (driver identification).

Considering STM32MP25 constraints:

What would be a suitable lightweight face recognition pipeline?

Any suggested models (quantized / NPU-friendly) for face embedding (e.g., MobileFaceNet or similar)?

Best practices for integrating recognition without significantly impacting FPS?


Any guidance on confirming NPU usage, improving FPS, and selecting a suitable face recognition pipeline would be very helpful.

Thanks.

3 REPLIES
ABRIS.1
ST Employee

Hello 20DeViL00,

Thank you for your message and clear description. We will help you debug your issue.

Because the GPU load is not high (approximately 25%), I suspect that the bottleneck is not related to NPU execution time. The GPU and NPU are integrated in the same hardware IP, so the GPU load tracks both GPU and NPU load. It is not possible to track only the NPU load.

The first step is to benchmark each model separately using the x-linux-ai-benchmark tool, which should already be installed on your STM32MP25. This benchmark tool provides information such as inference time, percentage of operations mapped to the GPU, and percentage of operations mapped to the NPU.

Can you provide the benchmark information for each model? This will help us understand the expected performance by adding each inference time, without considering post-processing and display time.

Also, can you provide the CPU load while running your application?

Hello @ABRIS.1 

Thank you for the guidance. I ran the x-linux-ai-benchmark tool for each model. Here are the results:

Model              Inference Time (ms)   CPU %   GPU %   NPU %   Peak RAM (MB)
face_detection     2.88                  0.0     13.74   86.26   21.57
face_landmark      4.13                  0.0     28.17   71.83   22.20
iris_landmark      3.11                  0.0     35.76   64.24   22.38
yolov8n_smk_call   22.79                 0.0     5.86    94.14   24.35

All models show 0% CPU usage during inference. Additionally, while running the full pipeline, the CPU load observed via top is approximately 85–90%.

Please let us know if you need any additional information.

Thank you.

 

Hello @20DeViL00,

Thank you for the information you shared. This information is important for analyzing your issue.

Summary of the current situation:

  • The models run in hardware, mainly on the neural processing unit (NPU), so they are well optimized for our hardware.
  • Based on your previous message, you run the following:
    • Face detection: every frame. Inference time is 2.88 ms, which equals 347 frames per second (FPS).
    • Landmark and eye models: every third frame (frame_count % 3 == 0). Inference time is 2.88 ms + 4.13 ms + 3.11 ms = 10.12 ms, which equals 98 FPS.
    • YOLO model: every third frame (frame_count % 3 == 1). Inference time is 2.88 ms + 22.79 ms = 25.67 ms, which equals 39 FPS.
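The per-frame FPS figures above follow directly from the benchmark inference times (a quick check of the arithmetic, using the millisecond values from the benchmark table):

```python
# Benchmark inference times in milliseconds
face_det, face_lm, iris_lm, yolo = 2.88, 4.13, 3.11, 22.79

def fps(total_ms):
    """Convert a total per-frame inference time in ms to frames per second."""
    return 1000.0 / total_ms

every_frame_fps    = fps(face_det)                      # ~347 FPS
landmark_frame_fps = fps(face_det + face_lm + iris_lm)  # ~98 FPS
yolo_frame_fps     = fps(face_det + yolo)               # ~39 FPS
```

Even the worst-case frame (face detection + YOLO) would support roughly 39 FPS on inference alone, far above the observed ~15 FPS.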

Based on this information, the GPU/NPU execution of the models is not the bottleneck. On top of the inference time, you must also account for the pre-processing time and the post-processing time.

The next inference starts only after post-processing finishes and the information is displayed.

As your CPU load is very high, we suspect that post-processing is the bottleneck in your application. Can you place a timer at the beginning and at the end of your post-processing function to determine the time spent in post-processing?
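As a minimal sketch of that measurement (the name `postprocess` stands in for your actual post-processing function), you can wrap the call with a high-resolution timer:

```python
import time

def timed_postprocess(postprocess, *args, **kwargs):
    """Run a post-processing callable and print how long it took in ms."""
    start = time.perf_counter()
    result = postprocess(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"post-processing took {elapsed_ms:.2f} ms")
    return result
```

If the measured time per frame is large compared to the ~10-26 ms of inference, post-processing on the CPU is the likely cause of the 15 FPS cap.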

Thank you.

Regards,
Alexis