Model inference on STM32N6570-DK much slower than reported (3.6 s vs 20 ms)

tonyzzzzz
Associate

I am currently running the head_landmarks model from the official STM32 Model Zoo:
https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/pose_estimation/head_landmarks

The ONNX model I used is face_landmarks_v1_192_int8_pc.onnx (downloaded directly from the GitHub repository).
The model can be successfully executed on the STM32N6570-DK board using the NPU, and the output results are correct.

However, the inference speed is much slower than expected:

  • Actual inference time on N6570-DK: ~3.6 seconds per frame

  • Reported time in the Model Zoo README: ~20.52 milliseconds per frame

I would like to confirm:

  1. Is there any specific optimization or configuration (e.g., memory placement, quantization format, build options, or runtime parameters) required to achieve the published 20 ms performance?

  2. Could this large gap indicate that part of the model is running on the CPU instead of the NPU?

  3. Is there a way to check, from the generated ai_network_report or logs, which layers are accelerated by the NPU and which ones fall back to the CPU?

Any guidance or clarification on how to reproduce the official benchmark performance would be highly appreciated.

Best regards, Tony

Julian E.
ST Employee

Hi @tonyzzzzz,

 

Could you describe a bit what you did exactly? What application are you using?

With the dev cloud, I get a 6 ms inference time.

[Screenshot: ST Edge AI Developer Cloud benchmark result]

 

To use the NPU, you need to have "--st-neural-art" in your generate command, for example:

stedgeai generate --model model --target stm32n6 --st-neural-art

 

In the report, you will see which epochs run in software (MCU) and which in hardware (NPU).

In the case of this model, I got almost only EC epochs (epoch controller epochs: multiple HW blocks grouped together for optimization):

====================================================================================
Epochs details
   ---------------------------------------------------------------------------------
Total number of epochs: 106 of which 2 implemented in software
epoch ID   HW/SW/EC Operation (SW only)
epoch 1       EC
epoch 2       EC
epoch 3       EC
epoch 4       EC
epoch 5       EC
epoch 6       EC
epoch 7       EC
epoch 8       EC
epoch 9       EC
....
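As an illustration, software fallbacks can be spotted quickly by filtering the report for the SW marker. This is a hedged sketch: the report filename and the tiny stand-in report it creates are assumptions for the sake of a runnable example, not actual tool output — point the commands at the report stedgeai generated for you.

```shell
# Hypothetical sketch: count epochs that fell back to software (SW).
# The report name below is an assumption; use your real report path.
report=network_generate_report.txt

# Tiny fabricated stand-in so the commands are runnable as-is.
cat > "$report" <<'EOF'
epoch ID   HW/SW/EC Operation (SW only)
epoch 1       EC
epoch 2       SW    Softmax
epoch 3       EC
EOF

# Epochs marked SW run on the MCU instead of the Neural-ART NPU.
grep -c ' SW ' "$report"
```

If the count is more than a handful, those layers are a likely source of the slowdown.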

 

Making sure the HW is used is important, and another important point is to make sure that the activations fit in internal memory. Again, in the generate report you can see something like this:

====================================================================================
Memory usage information  (input/output buffers are included in activations)
   ---------------------------------------------------------------------------------
	flexMEM    [0x34000000 - 0x34000000]:          0  B /          0  B  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	cpuRAM1    [0x34064000 - 0x34064000]:          0  B /          0  B  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	cpuRAM2    [0x34100000 - 0x34200000]:          0  B /      1.000 MB  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	npuRAM3    [0x34200000 - 0x34270000]:          0  B /    448.000 kB  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	npuRAM4    [0x34270000 - 0x342E0000]:    288.000 kB /    448.000 kB  ( 64.29 % used) -- weights:          0  B (  0.00 % used)  activations:    288.000 kB ( 64.29 % used)
	npuRAM5    [0x342E0000 - 0x34350000]:    432.000 kB /    448.000 kB  ( 96.43 % used) -- weights:          0  B (  0.00 % used)  activations:    432.000 kB ( 96.43 % used)
	npuRAM6    [0x34350000 - 0x343C0000]:          0  B /    448.000 kB  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	octoFlash  [0x71000000 - 0x78000000]:    470.355 kB /    112.000 MB  (  0.41 % used) -- weights:    470.355 kB (  0.41 % used)  activations:          0  B (  0.00 % used)
	hyperRAM   [0x90000000 - 0x92000000]:          0  B /     32.000 MB  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
Total:                                             1.162 MB                                  -- weights:    470.355 kB                  activations:    720.000 kB
====================================================================================
Used memory ranges
   ---------------------------------------------------------------------------------
	npuRAM4    [0x34270000 - 0x342E0000]: 0x34270000-0x342B8000
	npuRAM5    [0x342E0000 - 0x34350000]: 0x342E0000-0x3434C000
	octoFlash  [0x71000000 - 0x78000000]: 0x71000000-0x71075970
====================================================================================

In this case, you can see that the activations are all in internal memory and weights are in external memory.

As weights are only read once per inference, it is fine to keep them in external flash. Activations, however, are read and written throughout the inference, so you will see a big degradation if they do not fit in internal memory.
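That check can also be scripted. The sketch below is a hedged illustration, not tool output: it fabricates a shortened report excerpt (the real lines carry more columns) and warns if any external-memory row (octoFlash or hyperRAM) holds a non-zero amount of activations.

```shell
# Hedged sketch: warn if activations landed in external memory.
# The report name and the excerpt below are assumptions for illustration.
report=network_generate_report.txt

cat > "$report" <<'EOF'
npuRAM5    activations:    432.000 kB ( 96.43 % used)
octoFlash  activations:          0  B (  0.00 % used)
hyperRAM   activations:          0  B (  0.00 % used)
EOF

# External rows whose activation size is not 0 B indicate a spill.
if grep -E '^(octoFlash|hyperRAM)' "$report" \
     | grep -qv 'activations:[[:space:]]*0[[:space:]]*B'; then
  echo "activations spilled to external memory"
else
  echo "all activations in internal RAM"
fi
```

With the excerpt above, the script prints "all activations in internal RAM", which is the situation you want to reproduce on your own report.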

 

Doc: Session - ST Edge AI Developer Cloud

 

Have a good day,

Julian


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.