STM32N6 FP32 model on CPU+NPU flow gives higher latency than CPU-only flow.

RanjithRemasan
Associate II

Is FP32 forcing software kernels and preventing NPU acceleration?

 

Hi,

I am testing a CIFAR-10 ResNet on the STM32N6570-DK and need help understanding a latency gap between two STEdgeAI generation flows.

I used the same FP32 model as input:

  1. CPU-only generation flow
  2. CPU+NPU generation flow

Both flows generated C files and hex successfully, and both run correctly on target.
However, the CPU+NPU build is significantly slower than CPU-only.

What I would like ST to confirm:

  1. For STM32N6, can full FP32 ResNet graphs be truly accelerated on Neural-ART, or are they generally expected to run as software float kernels?
  2. Is it normal that CPU+NPU generation for FP32 can be slower than CPU-only generation?
  3. For real NPU acceleration, is calibrated INT8 mandatory in practice?
  4. Which exact generator settings should I use to ensure that most layers map to NPU kernels (and how to verify mapping from generated artifacts)?
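Regarding question 3: Neural-ART operates on quantized tensors, and calibration is what fixes the scale and zero-point of each tensor from representative data. This is not ST's tooling, just a minimal Python sketch of the asymmetric affine INT8 scheme that calibration establishes (the scheme TFLite uses for activations); the function names are my own:

```python
import numpy as np

def calibrate_affine_int8(samples):
    """Derive a per-tensor scale and zero-point from calibration data
    (asymmetric affine quantization over the int8 range [-128, 127])."""
    lo, hi = float(samples.min()), float(samples.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must include zero
    scale = (hi - lo) / 255.0             # int8 spans 256 levels
    zero_point = int(round(-128 - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    q = np.round(x / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Example: post-ReLU6 activations, roughly in [0, 6]
acts = np.random.default_rng(0).uniform(0.0, 6.0, 1024).astype(np.float32)
scale, zp = calibrate_affine_int8(acts)
q = quantize(acts, scale, zp)
err = np.abs(dequantize(q, scale, zp) - acts).max()
print(scale, zp, err)  # worst-case error is bounded by scale / 2
```

This is why a representative calibration set matters: a range that is too wide inflates `scale` (and hence the rounding error), while one that is too narrow clips real activations.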

Observed latency:

  1. CPU-only flow: about 211 ms
  2. CPU+NPU flow: about 439 ms

My main question:
Is this expected for FP32 models on STM32N6? If the operations are mapped to software float kernels (CPU path), does the CPU+NPU flow simply add runtime overhead without any real NPU acceleration?

My target and setup:

  1. Board: STM32N6570-DK
  2. Model: ResNet FP32 TFLite
  3. Tool: STEdgeAI-generated C and hex
  4. Measurement: on-board DWT-based latency, single-sample inference loop
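On the DWT measurement: `CYCCNT` is a free-running 32-bit cycle counter, so the delta should be taken with unsigned wraparound arithmetic and then divided by the core clock. A minimal sketch of that arithmetic, assuming the 800 MHz Cortex-M55 clock of the STM32N657 (substitute your actual `SystemCoreClock`):

```python
def cyccnt_delta(start: int, end: int) -> int:
    """DWT CYCCNT is a free-running 32-bit counter; unsigned
    subtraction stays correct across a single wrap."""
    return (end - start) & 0xFFFF_FFFF

def cycles_to_ms(cycles: int, core_hz: int = 800_000_000) -> float:
    """Convert a CYCCNT delta to milliseconds."""
    return cycles / core_hz * 1e3

# ~211 ms at 800 MHz is about 169 M cycles; a wrap mid-measurement
# is handled transparently by the masked subtraction.
start = 0xFFFF_0000
end = (start + 168_800_000) & 0xFFFF_FFFF
print(cycles_to_ms(cyccnt_delta(start, end)))  # -> 211.0
```

Note that at 800 MHz the counter wraps roughly every 5.4 s, so single-inference measurements in the hundreds of milliseconds can cross at most one wrap.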
Julian E.
ST Employee

Hi @RanjithRemasan,

 

The NPU only supports INT8, so nothing was accelerated in your case.

But the latency gap you report is still strange. Could you please share the model?

 

Have a good day,

Julian

 


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.