STM32N6 FP32 model on CPU+NPU flow gives higher latency than CPU-only flow.

RanjithRemasan
Associate II

Is FP32 forcing software kernels and preventing NPU acceleration?

 

Hi,

I am testing a CIFAR-10 ResNet on the STM32N6570-DK and need help understanding a latency gap between two STEdgeAI generation flows.

I used the same FP32 model as input:

  1. CPU-only generation flow
  2. CPU+NPU generation flow

Both flows generated C files and hex successfully, and both run correctly on target.
However, the CPU+NPU build is significantly slower than CPU-only.

What I would like ST to confirm:

  1. For STM32N6, can full FP32 ResNet graphs be truly accelerated on Neural-ART, or are they generally expected to run as software float kernels?
  2. Is it normal that CPU+NPU generation for FP32 can be slower than CPU-only generation?
  3. For real NPU acceleration, is calibrated INT8 mandatory in practice?
  4. Which exact generator settings should I use to ensure that most layers map to NPU kernels (and how to verify mapping from generated artifacts)?
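Regarding question 3: Neural-ART operates on quantized tensors, and calibration is what fixes the scale and zero-point of each tensor from representative data. This is not ST's tooling, just a minimal Python sketch of the asymmetric affine INT8 scheme that calibration establishes (the scheme TFLite uses for activations); the function names are my own:

```python
import numpy as np

def calibrate_affine_int8(samples):
    """Derive a per-tensor scale and zero-point from calibration data
    (asymmetric affine quantization over the int8 range [-128, 127])."""
    lo, hi = float(samples.min()), float(samples.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must include zero
    scale = (hi - lo) / 255.0             # int8 spans 256 levels
    zero_point = int(round(-128 - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    q = np.round(x / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Example: post-ReLU6 activations, roughly in [0, 6]
acts = np.random.default_rng(0).uniform(0.0, 6.0, 1024).astype(np.float32)
scale, zp = calibrate_affine_int8(acts)
q = quantize(acts, scale, zp)
err = np.abs(dequantize(q, scale, zp) - acts).max()
print(scale, zp, err)  # worst-case error is bounded by scale / 2
```

This is why a representative calibration set matters: a range that is too wide inflates `scale` (and hence the rounding error), while one that is too narrow clips real activations.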

Observed latency:

  1. CPU-only flow: about 211 ms
  2. CPU+NPU flow: about 439 ms

My main question:
Is this expected for FP32 models on STM32N6? If the operations are mapped to software float kernels (CPU path), does the CPU+NPU flow simply add runtime overhead without any real NPU acceleration?

My target and setup:

  1. Board: STM32N6570-DK
  2. Model: ResNet FP32 TFLite
  3. Tool: STEdgeAI-generated C and hex
  4. Measurement: on-board DWT-based latency, single-sample inference loop
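On the DWT measurement: `CYCCNT` is a free-running 32-bit cycle counter, so the delta should be taken with unsigned wraparound arithmetic and then divided by the core clock. A minimal sketch of that arithmetic, assuming the 800 MHz Cortex-M55 clock of the STM32N657 (substitute your actual `SystemCoreClock`):

```python
def cyccnt_delta(start: int, end: int) -> int:
    """DWT CYCCNT is a free-running 32-bit counter; unsigned
    subtraction stays correct across a single wrap."""
    return (end - start) & 0xFFFF_FFFF

def cycles_to_ms(cycles: int, core_hz: int = 800_000_000) -> float:
    """Convert a CYCCNT delta to milliseconds."""
    return cycles / core_hz * 1e3

# ~211 ms at 800 MHz is about 169 M cycles; a wrap mid-measurement
# is handled transparently by the masked subtraction.
start = 0xFFFF_0000
end = (start + 168_800_000) & 0xFFFF_FFFF
print(cycles_to_ms(cyccnt_delta(start, end)))  # -> 211.0
```

Note that at 800 MHz the counter wraps roughly every 5.4 s, so single-inference measurements in the hundreds of milliseconds can cross at most one wrap.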
Julian E.
ST Employee

Hi @RanjithRemasan,

 

The NPU only supports INT8, so nothing was accelerated in your case.

But the latency gap you report is still strange. Could you please share the model?

 

Have a good day,

Julian

 


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.