2025-06-09 11:08 AM
Hi,
I'm trying to use the NPU of the STM32N6 to run the ONNX model attached to this post.
The issue I'm trying to fix is that stedgeai does not use the NPU for all operations: some of them, like Conv and Gemm, run in SW instead of HW, and I don't understand what is preventing the acceleration.
Here is the stedgeai output:
/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/stedgeai generate --target stm32n6 --name network -m denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx --st-neural-art "n6-noextmem@/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/user_neuralart.json" --verbosity 1
ST Edge AI Core v2.1.0-20194 329b0e98d
WARNING: Unsupported keys in the current profile n6-noextmem are ignored: memory_desc
> memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/atonn -i "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0.onnx" --json-quant-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0_Q.json" -g "network.c" --load-mdesc "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/configs/stm32n6.mdesc" --load-mpool "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/my_mpools/stm32n6__noextmem.mpool" --save-mpool-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/stm32n6__noextmem.mpool" --out-dir-prefix "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/" --optimization 3 --all-buffers-info --mvei --no-hw-sw-parallelism --cache-maintenance --Oalt-sched --native-float --enable-virtual-mem-pools --Omax-ca-pipe 4 --Oshuffle-dma --Ocache-opt --Os --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER
Exec/report summary (generate)
---------------------------------------------------------------------------------------------------------------------------------
model file : /media/doc/USB5_EXT4/Projects/IA/TestAI/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws
output dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output
model_fmt : ss/sa per channel
model_name : denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q
model_hash : 0x53769866cbc27812c941e6bf8eee7d23
params # : 1,250,317 items (4.77 MiB)
---------------------------------------------------------------------------------------------------------------------------------
input 1/5 : 'Input_18_out_0', int8(1x257x1), 257 Bytes, QLinear(0.024541926,2,int8), activations
input 2/5 : 'Input_13_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 3/5 : 'Input_9_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 4/5 : 'Input_4_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
input 5/5 : 'Input_0_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations
inputs (total) : 0 Bytes
output 1/5 : 'Quantize_84_out_0', int8(1x257x1), 257 Bytes, QLinear(0.003748817,-128,int8), activations
output 2/5 : 'Quantize_47_out_0', int8(1x256), 256 Bytes, QLinear(0.007790764,0,int8), activations
output 3/5 : 'Quantize_49_out_0', int8(1x256), 256 Bytes, QLinear(0.007674512,0,int8), activations
output 4/5 : 'Quantize_70_out_0', int8(1x256), 256 Bytes, QLinear(0.007619916,-2,int8), activations
output 5/5 : 'Quantize_72_out_0', int8(1x256), 256 Bytes, QLinear(0.007183936,-1,int8), activations
outputs (total) : 0 Bytes
macc : 0
weights (ro) : 1,287,937 B (1.23 MiB) (4 segments) / -3,713,331(-74.2%) vs float model
activations (rw) : 435,585 B (425.38 KiB) (1 segment) *
ram (total) : 435,585 B (425.38 KiB) = 435,585 + 0 + 0
---------------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer
[...]
Total number of epochs: 23 of which 3 implemented in software
epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( Conv )
epoch 3 -SW- ( Conv )
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 -SW- ( Conv )
epoch 22 HW
epoch 23 HW
[...]
The model is quantized using this script:
import numpy
from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, preprocess, CalibrationMethod
from onnxruntime.quantization import CalibrationDataReader

example_inputs = numpy.random.randn(1, 257, 1).astype(numpy.float32)
example_hidden = numpy.random.randn(1, 256).astype(numpy.float32)

class XXXDataReader(CalibrationDataReader):
    def __init__(self):
        self.enum_data = None

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter(
                [{"input": example_inputs,
                  "lstm_hidden_input_h_0": example_hidden,
                  "lstm_hidden_input_c_0": example_hidden,
                  "lstm_hidden_input_h_1": example_hidden,
                  "lstm_hidden_input_c_1": example_hidden}]
            )
        return next(self.enum_data, None)

    def rewind(self):
        pass

dr = XXXDataReader()
conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    #op_types_to_quantize=["Conv","Slice"],
    extra_options={
        "ForceQuantizeNoInputCheck": True,
    },
    # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
    #nodes_to_quantize=['/conv1/Conv'],
    per_channel=True)

preprocess.quant_pre_process("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx")
quantize("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx", conf)
When I run atonn with "--d-lower 50", I see logs like the ones below, but I don't know what "scale-offset format" means, as the issue was the same with symmetric inputs and weights:
Lowering Conv2D_23 id=80 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_23 because output tensor (id=522) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Conv2D_23 id=80 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_23
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257
[...]
Lowering Gemm_28_conv_16 id=94 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Gemm_28_conv_16 because output tensor (id=611) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Gemm_28_conv_16 id=94 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Gemm_28_conv_16
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 1024
[...]
Lowering Conv2D_79 id=218 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_79 because output tensor (id=1458) has scale-offset format
Let's try with the next lowerer
Let's try with the next lowerer
Lowering Conv2D_79 id=218 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_79
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257
What is preventing the above 3 operations from running on the NPU?
Especially the Gemm_28_conv_16 node, as other Gemm operations are accelerated on the NPU.
The quantized model is attached.
Thanks for your help,
Alexis Murzeau
2025-06-10 5:13 AM
Hello @AMurz.1,
It is because of a bug that fails to split layers whose channel count equals certain prime numbers. We are aware of it and it has been escalated to the dev team. We are waiting for a fix.
In your case, 257 is one of these problematic numbers.
As a workaround, please change the number of channels in the affected layer to 256 and it should avoid the bug.
Have a good day,
Julian
2025-06-11 2:51 PM
Hi,
Thanks, splitting the Gemm operation into two (144 + 113) makes both of them run accelerated on the NPU.
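For reference, a rough sketch of one way to do this kind of split (assuming the original model is a PyTorch module; module and attribute names here are only examples):

import torch
import torch.nn as nn

class SplitLinear(nn.Module):
    # Replaces one Linear(in_features, 257) by two Linears of 144 and 113
    # outputs whose results are concatenated, so each exported Gemm avoids
    # the problematic prime output size.
    def __init__(self, full: nn.Linear, split: int = 144):
        super().__init__()
        has_bias = full.bias is not None
        self.a = nn.Linear(full.in_features, split, bias=has_bias)
        self.b = nn.Linear(full.in_features, full.out_features - split, bias=has_bias)
        with torch.no_grad():
            self.a.weight.copy_(full.weight[:split])
            self.b.weight.copy_(full.weight[split:])
            if has_bias:
                self.a.bias.copy_(full.bias[:split])
                self.b.bias.copy_(full.bias[split:])

    def forward(self, x):
        return torch.cat([self.a(x), self.b(x)], dim=-1)

# hypothetical usage before ONNX export:
# model.out_proj = SplitLinear(model.out_proj, split=144)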
But I have a question about the MAC/cycle of power-of-two-sized Gemm operations doing 1024*256 MACs.
They do not seem to use the full CONVACC throughput: according to the reference manual, one CONVACC is able to do at most 36 16x8 MACs per cycle:
But the 1024*256 Gemm operation seems to reach only 2 MAC/cycle, according to the tmp.dot graph output of atonn and the real timing measured on STM32N657 hardware:
Is there a way to improve the MAC/cycle ratio to speed up inference?
I see this paragraph in the ST Edge AI documentation, which may be related:
What does "run at the speed" mean? Would the inference speed be the same?
As I have ICH=256 and OCH=1024 in my case, am I limited by OCH = N*16 (with N = 64)?
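To put the numbers in perspective, my own rough arithmetic from the figures above:

macs = 1024 * 256            # 262,144 MACs for this Gemm
cycles_at_2 = macs / 2       # ~131,072 cycles at the 2 MAC/cycle reported by atonn
cycles_at_36 = macs / 36     # ~7,282 cycles if one CONVACC ran at its 36 MAC/cycle peak
print(cycles_at_2 / cycles_at_36)   # ~18x gap between reported and peak throughput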
I'm attaching the quantized model generated by stedgeai in the st_ai_output folder and the tmp.dot graph converted to SVG format.
2025-06-12 4:49 AM
Hello @AMurz.1,
Here is the answer from our experts:
The 2 MAC/cycle there is the maximum theoretical value that we can obtain here, given all the surrounding conditions, including memory access, etc.
In the screenshot of the SVG graph, you can see a property of the conv node saying "choked ports = (weights)". This is because there is no data reuse with a GEMM node, since each weight value is read and used only once. That's why we don't get better MAC/cycle figures, like the maximum ones quoted by the user:
Furthermore, the 16x8 SIMD mode is enabled (chosen since we have an FSUB != 0), and thus we can't enable the Deep1x1 mode.
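A tiny sketch of that data-reuse argument (my own illustrative numbers, not ST's): each Conv weight is reused once per output position, while a Gemm weight is read exactly once, so the weight fetch port becomes the bottleneck.

gemm_macs    = 1024 * 256           # every MAC needs a fresh weight
gemm_weights = 1024 * 256           # each int8 weight read exactly once
conv_macs    = 257 * 257 * 100      # hypothetical conv with 100 output positions
conv_weights = 257 * 257            # each weight reused 100 times
print(gemm_macs / gemm_weights)     # 1 MAC per weight fetched (no reuse)
print(conv_macs / conv_weights)     # 100 MACs per weight fetched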
Have a good day,
Julian