2025-10-26 7:22 PM
I am currently running the head_landmarks model from the official STM32 Model Zoo:
https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/pose_estimation/head_landmarks
The ONNX model I am using is face_landmarks_v1_192_int8_pc.onnx (downloaded directly from the GitHub repository).
The model runs successfully on the STM32N6570-DK board using the NPU, and the output results are correct.
However, the inference speed is much slower than expected:
Actual inference time on N6570-DK: ~3.6 seconds per frame
Reported time in the Model Zoo README: ~20.52 milliseconds per frame
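For reference, this is roughly how I obtain the per-frame number above (a minimal sketch; run_network_inference() is a placeholder for the call into the generated network / NPU runtime code in my project):

#include <stdio.h>
#include <stdint.h>
#include "main.h"   /* CubeMX project header, pulls in the HAL (HAL_GetTick) */

/* Placeholder for the call into the generated network code / NPU runtime. */
extern void run_network_inference(void);

void measure_frame_time(void)
{
    uint32_t t0 = HAL_GetTick();   /* millisecond tick before inference */
    run_network_inference();       /* one complete forward pass */
    uint32_t t1 = HAL_GetTick();   /* millisecond tick after inference */

    printf("inference time: %lu ms\r\n", (unsigned long)(t1 - t0));
}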
I would like to confirm:
Is there any specific optimization or configuration (e.g., memory placement, quantization format, build options, or runtime parameters) required to achieve the published 20 ms performance?
Could this large gap indicate that part of the model is running on the CPU instead of the NPU?
Is there a way to check, from the generated ai_network_report or logs, which layers are accelerated by the NPU and which ones fall back to the CPU?
Any guidance or clarification on how to reproduce the official benchmark performance would be highly appreciated.
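In case it helps the diagnosis, I can also break the 3.6 s down into preprocessing, inference, and postprocessing (sketch only; preprocess_frame(), run_network_inference() and postprocess_output() are placeholders for my application code and the generated network call):

#include <stdio.h>
#include <stdint.h>
#include "main.h"   /* CubeMX project header, pulls in the HAL (HAL_GetTick) */

/* Placeholders for my application code and the generated network call. */
extern void preprocess_frame(void);
extern void run_network_inference(void);
extern void postprocess_output(void);

void profile_pipeline(void)
{
    uint32_t t0 = HAL_GetTick();
    preprocess_frame();            /* resize / normalize the input image (CPU) */
    uint32_t t1 = HAL_GetTick();
    run_network_inference();       /* forward pass (expected to run on the NPU) */
    uint32_t t2 = HAL_GetTick();
    postprocess_output();          /* decode landmarks from the output tensor (CPU) */
    uint32_t t3 = HAL_GetTick();

    printf("pre: %lu ms, inference: %lu ms, post: %lu ms\r\n",
           (unsigned long)(t1 - t0),
           (unsigned long)(t2 - t1),
           (unsigned long)(t3 - t2));
}

If the time turns out to be dominated by the inference call itself, I would suspect a configuration or memory-placement issue rather than my application code.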
Best regards, Tony