2026-04-02 10:21 PM - last edited on 2026-04-03 3:14 AM by Andrew Neil
Is FP32 forcing software kernels and preventing NPU acceleration?
Hi,
I am testing a CIFAR-10 ResNet on the STM32N6570-DK and need help understanding a latency gap between two STEdgeAI generation flows.
I used the same FP32 model as input:
Both flows generated C files and hex successfully, and both run correctly on target.
However, the CPU+NPU build is significantly slower than CPU-only.
What I would like ST to confirm:
Observed latency:
My main question:
Is this expected for FP32 models on STM32N6 because operations are mapped to software float kernels (CPU path), so the CPU+NPU flow adds runtime overhead without real NPU acceleration?
My target and setup:
2026-04-03 5:40 AM
Hi @RanjithRemasan,
The NPU only supports INT8, so it did not accelerate anything in your case.
But what you report is still strange... Could you share the model please?
Have a good day,
Julian
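
For readers unfamiliar with what "INT8" means in the reply above, here is a minimal sketch of affine 8-bit quantization in plain Python. It only illustrates the general arithmetic (a scale and zero point mapping floats to signed 8-bit integers). It is not the STEdgeAI flow or an ST API, and the function names and sample weights are hypothetical. In practice the FP32 model would likely be quantized with the training framework's own post-training quantization tools before being passed to the generator.

```python
# Illustrative only: generic affine int8 quantization, not ST's implementation.
# All names and the sample values below are hypothetical.

def quantize_params(values, qmin=-128, qmax=127):
    """Return (scale, zero_point) for an affine int8 mapping of `values`."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    # Keep the zero point inside the int8 range.
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes, clamping to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]   # made-up FP32 values
scale, zero_point = quantize_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)

print("scale:", scale, "zero_point:", zero_point)
print("int8 codes:", codes)
print("restored:", restored)
```

Each restored value differs from its original by at most about one quantization step (`scale`), which is the precision traded away so that layers can run on integer-only hardware.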