STM32N6 NPU acceleration sometimes not used for 1x1 Conv or Gemm operations

AMurz.1
Associate II

Hi,

I'm trying to use the NPU of the STM32N6 to run the ONNX model attached to this post.

The issue I'm trying to fix is that stedgeai does not use the NPU for all operations: some of them, such as Conv and Gemm, are run in SW instead of HW, and I don't understand what is preventing the acceleration.

Here is the stedgeai output:

/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/stedgeai generate --target stm32n6 --name network -m denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx --st-neural-art "n6-noextmem@/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/user_neuralart.json" --verbosity 1
ST Edge AI Core v2.1.0-20194 329b0e98d
WARNING: Unsupported keys in the current profile n6-noextmem are ignored: memory_desc                                                                                                                   
        > memory_desc is not a valid key anymore, use machine_desc instead                                                                                                                              
 >>>> EXECUTING NEURAL ART COMPILER                                                                                                                                                                     
   /home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/atonn -i "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0.onnx" --json-quant-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0_Q.json" -g "network.c" --load-mdesc "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/configs/stm32n6.mdesc" --load-mpool "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/my_mpools/stm32n6__noextmem.mpool" --save-mpool-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/stm32n6__noextmem.mpool" --out-dir-prefix "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/" --optimization 3 --all-buffers-info --mvei --no-hw-sw-parallelism --cache-maintenance --Oalt-sched --native-float --enable-virtual-mem-pools --Omax-ca-pipe 4 --Oshuffle-dma --Ocache-opt --Os --output-info-file "c_info.json"
 <<<< DONE EXECUTING NEURAL ART COMPILER                                                                                                                                                                
                                                                                                                                                                                                        
 Exec/report summary (generate)
 ---------------------------------------------------------------------------------------------------------------------------------
 model file         :   /media/doc/USB5_EXT4/Projects/IA/TestAI/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx   
 type               :   onnx                                                                                                      
 c_name             :   network                                                                                                   
 options            :   allocate-inputs, allocate-outputs                                                                         
 optimization       :   balanced                                                                                                  
 target/series      :   stm32n6npu                                                                                                
 workspace dir      :   /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws                                                          
 output dir         :   /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output                                                      
 model_fmt          :   ss/sa per channel                                                                                         
 model_name         :   denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q                                                
 model_hash         :   0x53769866cbc27812c941e6bf8eee7d23                                                                        
 params #           :   1,250,317 items (4.77 MiB)                                                                                
 ---------------------------------------------------------------------------------------------------------------------------------
 input 1/5          :   'Input_18_out_0', int8(1x257x1), 257 Bytes, QLinear(0.024541926,2,int8), activations                      
 input 2/5          :   'Input_13_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations                        
 input 3/5          :   'Input_9_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations                         
 input 4/5          :   'Input_4_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations                         
 input 5/5          :   'Input_0_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations                         
 inputs (total)     :   0 Bytes                                                                                                   
 output 1/5         :   'Quantize_84_out_0', int8(1x257x1), 257 Bytes, QLinear(0.003748817,-128,int8), activations                
 output 2/5         :   'Quantize_47_out_0', int8(1x256), 256 Bytes, QLinear(0.007790764,0,int8), activations                     
 output 3/5         :   'Quantize_49_out_0', int8(1x256), 256 Bytes, QLinear(0.007674512,0,int8), activations                     
 output 4/5         :   'Quantize_70_out_0', int8(1x256), 256 Bytes, QLinear(0.007619916,-2,int8), activations                    
 output 5/5         :   'Quantize_72_out_0', int8(1x256), 256 Bytes, QLinear(0.007183936,-1,int8), activations                    
 outputs (total)    :   0 Bytes                                                                                                   
 macc               :   0                                                                                                         
 weights (ro)       :   1,287,937 B (1.23 MiB) (4 segments) / -3,713,331(-74.2%) vs float model                                   
 activations (rw)   :   435,585 B (425.38 KiB) (1 segment) *                                                                      
 ram (total)        :   435,585 B (425.38 KiB) = 435,585 + 0 + 0                                                                  
 ---------------------------------------------------------------------------------------------------------------------------------
 (*) 'input'/'output' buffers can be used from the activations buffer
                                                                                                                                                                                                        
[...]
                                                                                                                   
Total number of epochs: 23 of which 3 implemented in software                                                                                                                                           
                                                                                                                                                                                                        
epoch ID   HW/SW/EC Operation (SW only)                                                                                                                                                                 
epoch 1       HW                                                                                                                                                                                        
epoch 2      -SW-   (        Conv        )                                                                                                                                                              
epoch 3      -SW-   (        Conv        )                                                                                                                                                              
epoch 4       HW                                                                                                                                                                                        
epoch 5       HW                                                                                                                                                                                        
epoch 6       HW                                                                                                                                                                                        
epoch 7       HW                                                                                                                                                                                        
epoch 8       HW                                                                                                                                                                                        
epoch 9       HW                                                                                                                                                                                        
epoch 10      HW                                                                                                                                                                                        
epoch 11      HW                                                                                                                                                                                        
epoch 12      HW                                                                                                                                                                                        
epoch 13      HW                                                                                                                                                                                        
epoch 14      HW                                                                                                                                                                                        
epoch 15      HW                                                                                                                                                                                        
epoch 16      HW                                                                                                                                                                                        
epoch 17      HW                                                                                                                                                                                        
epoch 18      HW                                                                                                                                                                                        
epoch 19      HW                                                                                                                                                                                        
epoch 20      HW                                                                                                                                                                                        
epoch 21     -SW-   (        Conv        )                                                                                                                                                              
epoch 22      HW                                                                                                                                                                                        
epoch 23      HW                                                                                                                                                                                        

[...]

The model is quantized using this script:

import numpy

from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, preprocess, CalibrationMethod
from onnxruntime.quantization import CalibrationDataReader


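# One random calibration sample matching the 5 model inputs
# (main spectrum input + the LSTM hidden/cell state inputs)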
example_inputs = numpy.random.randn(1, 257, 1).astype(numpy.float32)
example_hidden = numpy.random.randn(1, 256).astype(numpy.float32)

class XXXDataReader(CalibrationDataReader):
    def __init__(self):
        self.enum_data = None
        pass

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter(
                [{"input": example_inputs,
                  "lstm_hidden_input_h_0": example_hidden,
                  "lstm_hidden_input_c_0": example_hidden,
                  "lstm_hidden_input_h_1": example_hidden,
                  "lstm_hidden_input_c_1": example_hidden}]
            )
        return next(self.enum_data, None)

    def rewind(self):
        pass

dr = XXXDataReader()

conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    #op_types_to_quantize=["Conv","Slice"],
    extra_options={
        "ForceQuantizeNoInputCheck":True,
    },
    # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
    #nodes_to_quantize=['/conv1/Conv'],
    per_channel=True)

preprocess.quant_pre_process("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx")
quantize("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx", conf)

 
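For reference, symmetric activation/weight quantization (zero-point forced to 0) can be requested in onnxruntime through extra_options; a minimal variant of the config above would look like this (conf_sym is just an illustrative name, everything else is unchanged):

conf_sym = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    extra_options={
        "ForceQuantizeNoInputCheck": True,
        "ActivationSymmetric": True,  # activation zero-points forced to 0
        "WeightSymmetric": True,      # default for signed int8 weights, made explicit
    },
    per_channel=True)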

When I run atonn with "--d-lower 50", I see logs like the following, but I don't know what "scale-offset format" means, as the issue was the same with symmetric inputs and weights:

 Lowering  Conv2D_23 id=80 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_23 because output tensor (id=522) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering  Conv2D_23 id=80 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_23
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257

[...]

 Lowering  Gemm_28_conv_16 id=94 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Gemm_28_conv_16 because output tensor (id=611) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering  Gemm_28_conv_16 id=94 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Gemm_28_conv_16
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 1024

[...]

 Lowering  Conv2D_79 id=218 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_79 because output tensor (id=1458) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering  Conv2D_79 id=218 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_79
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257

 

What is preventing these 3 operations from running on the NPU?

I'm especially wondering about the Gemm_28_conv_16 node, as other Gemm operations are accelerated on the NPU.

The quantized model is attached.
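If it helps with the analysis, the scale/zero-point pairs of the quantized tensors can be dumped from the attached model with a small script like the one below (a quick sketch using the onnx Python API; non-zero zero-points would correspond to the "scale-offset" tensors reported by atonn):

import onnx
from onnx import numpy_helper

model = onnx.load("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx")
init = {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}

# List every Quantize/DequantizeLinear whose zero-point is not 0,
# i.e. tensors quantized asymmetrically ("scale-offset") rather than symmetrically.
for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear") and len(node.input) >= 3:
        scale, zp = init.get(node.input[1]), init.get(node.input[2])
        if zp is not None and (zp != 0).any():
            print(f"{node.name}: {node.input[0]} scale={scale} zero_point={zp}")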

 

Thanks for your help,

Alexis Murzeau
