2025-06-09 11:08 AM
Hi,
I'm trying to use the NPU of the STM32N6 to run the ONNX model attached to this post.
The issue I'm trying to fix is that stedgeai does not map all operations to the NPU: some of them, like Conv and Gemm, are run in SW instead of HW, and I don't understand what is preventing the acceleration.
Here is the stedgeai output:
/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/stedgeai generate --target stm32n6 --name network -m denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx --st-neural-art "n6-noextmem@/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/user_neuralart.json" --verbosity 1
ST Edge AI Core v2.1.0-20194 329b0e98d
WARNING: Unsupported keys in the current profile n6-noextmem are ignored: memory_desc
> memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/atonn -i "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0.onnx" --json-quant-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0_Q.json" -g "network.c" --load-mdesc "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/configs/stm32n6.mdesc" --load-mpool "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/my_mpools/stm32n6__noextmem.mpool" --save-mpool-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/stm32n6__noextmem.mpool" --out-dir-prefix "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/" --optimization 3 --all-buffers-info --mvei --no-hw-sw-parallelism --cache-maintenance --Oalt-sched --native-float --enable-virtual-mem-pools --Omax-ca-pipe 4 --Oshuffle-dma --Ocache-opt --Os --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER
Exec/report summary (generate)
---------------------------------------------------------------------------------------------------------------------------------
model file : /media/doc/USB5_EXT4/Projects/IA/TestAI/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws
output dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output
model_fmt : ss/sa per channel
model_name : denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q
model_hash : 0x53769866cbc27812c941e6bf8eee7d23
params # : 1,250,317 items (4.77 MiB)
---------------------------------------------------------------------------------------------------------------------------------
input 1/5 : 'Input_18_out_0', int8(1x257x1), 257 Bytes, QLinear(0.024541926,2,int8), activations
input 2/5 : 'Input_13_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 3/5 : 'Input_9_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 4/5 : 'Input_4_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 5/5 : 'Input_0_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
inputs (total) : 0 Bytes
output 1/5 : 'Quantize_84_out_0', int8(1x257x1), 257 Bytes, QLinear(0.003748817,-128,int8), activations
output 2/5 : 'Quantize_47_out_0', int8(1x256), 256 Bytes, QLinear(0.007790764,0,int8), activations
output 3/5 : 'Quantize_49_out_0', int8(1x256), 256 Bytes, QLinear(0.007674512,0,int8), activations
output 4/5 : 'Quantize_70_out_0', int8(1x256), 256 Bytes, QLinear(0.007619916,-2,int8), activations
output 5/5 : 'Quantize_72_out_0', int8(1x256), 256 Bytes, QLinear(0.007183936,-1,int8), activations
outputs (total) : 0 Bytes
macc : 0
weights (ro) : 1,287,937 B (1.23 MiB) (4 segments) / -3,713,331(-74.2%) vs float model
activations (rw) : 435,585 B (425.38 KiB) (1 segment) *
ram (total) : 435,585 B (425.38 KiB) = 435,585 + 0 + 0
---------------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer
[...]
Total number of epochs: 23 of which 3 implemented in software
epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( Conv )
epoch 3 -SW- ( Conv )
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 -SW- ( Conv )
epoch 22 HW
epoch 23 HW
[...]
The model is quantized using this script:
import numpy
from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, preprocess, CalibrationMethod
from onnxruntime.quantization import CalibrationDataReader

example_inputs = numpy.random.randn(1, 257, 1).astype(numpy.float32)
example_hidden = numpy.random.randn(1, 256).astype(numpy.float32)

class XXXDataReader(CalibrationDataReader):
    def __init__(self):
        self.enum_data = None

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter(
                [{"input": example_inputs,
                  "lstm_hidden_input_h_0": example_hidden,
                  "lstm_hidden_input_c_0": example_hidden,
                  "lstm_hidden_input_h_1": example_hidden,
                  "lstm_hidden_input_c_1": example_hidden}]
            )
        return next(self.enum_data, None)

    def rewind(self):
        pass

dr = XXXDataReader()
conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    # op_types_to_quantize=["Conv", "Slice"],
    extra_options={
        "ForceQuantizeNoInputCheck": True,
    },
    # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
    # nodes_to_quantize=['/conv1/Conv'],
    per_channel=True)

preprocess.quant_pre_process("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified.onnx",
                             "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx")
quantize("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx",
         "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx", conf)
When I run atonn with "--d-lower 50", I see logs like this, but I don't know what "scale-offset format" means, as the issue was the same with symmetric input and weights (see the sketch above):
Lowering Conv2D_23 id=80 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_23 because output tensor (id=522) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Conv2D_23 id=80 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_23
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257
[...]
Lowering Gemm_28_conv_16 id=94 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Gemm_28_conv_16 because output tensor (id=611) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Gemm_28_conv_16 id=94 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Gemm_28_conv_16
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 1024
[...]
Lowering Conv2D_79 id=218 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_79 because output tensor (id=1458) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Conv2D_79 id=218 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_79
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257
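Assuming "scale-offset format" refers to a non-zero zero-point (offset) on the tensor, the candidate tensors can be listed by dumping every Q/DQ zero-point initializer in the QDQ model, e.g. with a quick script like this (a sketch using the standard onnx Python API):

import onnx
from onnx import numpy_helper

m = onnx.load("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx")
init = {t.name: numpy_helper.to_array(t) for t in m.graph.initializer}

# Report every QuantizeLinear/DequantizeLinear whose zero-point is not all-zero
for node in m.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear") and len(node.input) >= 3:
        zp = init.get(node.input[2])
        if zp is not None and zp.any():
            print(node.op_type, node.name, "zero_point:", zp.ravel()[:8])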
What is preventing the above 3 operations from running on the NPU?
Especially the Gemm_28_conv_16 node, since other Gemm operations are accelerated on the NPU.
The quantized model is attached.
Thanks for your help,
Alexis Murzeau