2025-06-09 11:08 AM
Hi,
I'm trying to use the NPU of the STM32N6 to run the ONNX model attached to this post.
The issue I'm trying to fix is that stedgeai does not use the NPU for all operations: some of them, like Conv and Gemm, run in SW instead of HW, and I don't understand what is preventing the acceleration.
Here is the stedgeai output:
/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/stedgeai generate --target stm32n6 --name network -m denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx --st-neural-art "n6-noextmem@/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/user_neuralart.json" --verbosity 1
ST Edge AI Core v2.1.0-20194 329b0e98d
WARNING: Unsupported keys in the current profile n6-noextmem are ignored: memory_desc
> memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/atonn -i "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0.onnx" --json-quant-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0_Q.json" -g "network.c" --load-mdesc "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/configs/stm32n6.mdesc" --load-mpool "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/my_mpools/stm32n6__noextmem.mpool" --save-mpool-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/stm32n6__noextmem.mpool" --out-dir-prefix "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/" --optimization 3 --all-buffers-info --mvei --no-hw-sw-parallelism --cache-maintenance --Oalt-sched --native-float --enable-virtual-mem-pools --Omax-ca-pipe 4 --Oshuffle-dma --Ocache-opt --Os --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER
Exec/report summary (generate)
---------------------------------------------------------------------------------------------------------------------------------
model file : /media/doc/USB5_EXT4/Projects/IA/TestAI/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws
output dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output
model_fmt : ss/sa per channel
model_name : denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q
model_hash : 0x53769866cbc27812c941e6bf8eee7d23
params # : 1,250,317 items (4.77 MiB)
---------------------------------------------------------------------------------------------------------------------------------
input 1/5 : 'Input_18_out_0', int8(1x257x1), 257 Bytes, QLinear(0.024541926,2,int8), activations
input 2/5 : 'Input_13_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 3/5 : 'Input_9_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 4/5 : 'Input_4_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 5/5 : 'Input_0_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
inputs (total) : 0 Bytes
output 1/5 : 'Quantize_84_out_0', int8(1x257x1), 257 Bytes, QLinear(0.003748817,-128,int8), activations
output 2/5 : 'Quantize_47_out_0', int8(1x256), 256 Bytes, QLinear(0.007790764,0,int8), activations
output 3/5 : 'Quantize_49_out_0', int8(1x256), 256 Bytes, QLinear(0.007674512,0,int8), activations
output 4/5 : 'Quantize_70_out_0', int8(1x256), 256 Bytes, QLinear(0.007619916,-2,int8), activations
output 5/5 : 'Quantize_72_out_0', int8(1x256), 256 Bytes, QLinear(0.007183936,-1,int8), activations
outputs (total) : 0 Bytes
macc : 0
weights (ro) : 1,287,937 B (1.23 MiB) (4 segments) / -3,713,331(-74.2%) vs float model
activations (rw) : 435,585 B (425.38 KiB) (1 segment) *
ram (total) : 435,585 B (425.38 KiB) = 435,585 + 0 + 0
---------------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer
[...]
Total number of epochs: 23 of which 3 implemented in software
epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( Conv )
epoch 3 -SW- ( Conv )
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 -SW- ( Conv )
epoch 22 HW
epoch 23 HW
[...]
The model is quantized using this script:
import numpy
from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, preprocess, CalibrationMethod
from onnxruntime.quantization import CalibrationDataReader

example_inputs = numpy.random.randn(1, 257, 1).astype(numpy.float32)
example_hidden = numpy.random.randn(1, 256).astype(numpy.float32)

class XXXDataReader(CalibrationDataReader):
    def __init__(self):
        self.enum_data = None

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter(
                [{"input": example_inputs,
                  "lstm_hidden_input_h_0": example_hidden,
                  "lstm_hidden_input_c_0": example_hidden,
                  "lstm_hidden_input_h_1": example_hidden,
                  "lstm_hidden_input_c_1": example_hidden}]
            )
        return next(self.enum_data, None)

    def rewind(self):
        pass

dr = XXXDataReader()
conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    #op_types_to_quantize=["Conv","Slice"],
    extra_options={
        "ForceQuantizeNoInputCheck": True,
    },
    # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
    #nodes_to_quantize=['/conv1/Conv'],
    per_channel=True)

preprocess.quant_pre_process("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx")
quantize("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx", conf)
When I run atonn with "--d-lower 50", I see logs like the ones below, but I don't know what "scale-offset format" means, as the issue was the same with symmetric inputs and weights:
Lowering Conv2D_23 id=80 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_23 because output tensor (id=522) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Conv2D_23 id=80 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_23
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257
[...]
Lowering Gemm_28_conv_16 id=94 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Gemm_28_conv_16 because output tensor (id=611) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Gemm_28_conv_16 id=94 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Gemm_28_conv_16
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 1024
[...]
Lowering Conv2D_79 id=218 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_79 because output tensor (id=1458) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Conv2D_79 id=218 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_79
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257
What is preventing the above 3 operations from running on the NPU?
Especially the Gemm_28_conv_16 node, as other Gemm operations are accelerated on the NPU.
The quantized model is attached.
Thanks for your help,
Alexis Murzeau
2025-06-10 5:13 AM
Hello @AMurz.1,
It is because of a bug that fails to split layers whose channel count equals certain prime numbers. We are aware of it and it has been escalated to the dev team. We are waiting for a fix.
In your case, 257 is one of these problematic numbers.
As a workaround, please change the number of channels in the affected layer to 256 and it should avoid the bug.
Have a good day,
Julian
2025-06-11 2:51 PM
Hi,
Thanks, splitting the Gemm operation into two (144 + 113) makes both of them run accelerated on the NPU.
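For reference, a rough sketch of one way to do this kind of split (assuming the original model is a PyTorch module; module and attribute names here are only examples):

import torch
import torch.nn as nn

class SplitLinear(nn.Module):
    # Replaces one Linear(in_features, 257) by two Linears of 144 and 113
    # outputs whose results are concatenated, so each exported Gemm avoids
    # the problematic prime output size.
    def __init__(self, full: nn.Linear, split: int = 144):
        super().__init__()
        has_bias = full.bias is not None
        self.a = nn.Linear(full.in_features, split, bias=has_bias)
        self.b = nn.Linear(full.in_features, full.out_features - split, bias=has_bias)
        with torch.no_grad():
            self.a.weight.copy_(full.weight[:split])
            self.b.weight.copy_(full.weight[split:])
            if has_bias:
                self.a.bias.copy_(full.bias[:split])
                self.b.bias.copy_(full.bias[split:])

    def forward(self, x):
        return torch.cat([self.a(x), self.b(x)], dim=-1)

# hypothetical usage before ONNX export:
# model.out_proj = SplitLinear(model.out_proj, split=144)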
But I have a question about the MAC/cycle of power-of-two-sized Gemm operations doing 1024*256 MACs.
They do not seem to use the full CONVACC throughput: according to the reference manual, one CONVACC is able to do at most 36 16x8 MACs per cycle:
But the 1024*256 Gemm operation seems to reach only 2 MAC/cycle, according to the tmp.dot graph output of atonn and the real timing measured on STM32N657 hardware:
Is there a way to improve the MAC/cycle ratio to speed up inference?
I see this paragraph in the ST Edge AI documentation, which may be related:
What does "run at the speed" mean? Would the inference speed be the same?
As I have ICH=256 and OCH=1024 in my case, am I limited by OCH = N*16 (with N = 64)?
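To put the numbers in perspective, my own rough arithmetic from the figures above:

macs = 1024 * 256            # 262,144 MACs for this Gemm
cycles_at_2 = macs / 2       # ~131,072 cycles at the 2 MAC/cycle reported by atonn
cycles_at_36 = macs / 36     # ~7,282 cycles if one CONVACC ran at its 36 MAC/cycle peak
print(cycles_at_2 / cycles_at_36)   # ~18x gap between reported and peak throughput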
I'm attaching the quantized model generated by stedgeai in the st_ai_output folder and the tmp.dot graph converted to SVG format.
2025-06-12 4:49 AM
Hello @AMurz.1,
Here is the answer from our experts:
The 2 MAC/cycle there is the maximum theoretical value that we can obtain here, given all the surrounding conditions, including memory access, etc.
In the screenshot of the SVG graph, you can see a property of the conv node saying "choked ports = (weights)". This is because there is no data reuse with a GEMM node, since each weight value is read and used only once. That's why we don't get better MAC/cycle figures, like the maximum ones quoted by the user:
Furthermore, the 16x8 SIMD mode is enabled (chosen since we have an FSUB != 0), and thus we can't enable the Deep1x1 mode.
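A tiny sketch of that data-reuse argument (my own illustrative numbers, not ST's): each Conv weight is reused once per output position, while a Gemm weight is read exactly once, so the weight fetch port becomes the bottleneck.

gemm_macs    = 1024 * 256           # every MAC needs a fresh weight
gemm_weights = 1024 * 256           # each int8 weight read exactly once
conv_macs    = 257 * 257 * 100      # hypothetical conv with 100 output positions
conv_weights = 257 * 257            # each weight reused 100 times
print(gemm_macs / gemm_weights)     # 1 MAC per weight fetched (no reuse)
print(conv_macs / conv_weights)     # 100 MACs per weight fetched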
Have a good day,
Julian