Trying to get 1 - x accelerated on NPU but Neg gives wrong result and Sub is SW only

AMurz.1 · ‎2025-06-20

Hi,

I'm trying to implement the computation "1 - x" as part of a GRU implementation in the ONNX model format.

I tried Sub(1, x), but the Sub operator is not accelerated on the NPU and runs in software instead.

I've attached the model "single_sub_rewritten_q.onnx" for this:

(const_1 is a vector of all ones here, but using a single scalar 1.0 gives the same result.)

And this is the output of stedgeai which shows that Sub is running outside the NPU:

Epochs details
---------------------------------------------------------------------------------
Total number of epochs: 1 of which 1 implemented in software

epoch ID   HW/SW/EC Operation (SW only)
epoch 1      -SW-   (        Sub         )
====================================================================================

Then I tried Add(1, Neg(x)) instead. The resulting model is accelerated on the NPU, but according to the validation output, the NPU output is wrong: it is always the constant 127.

Running the validation gives these results:

$STEDGEAI_CORE_DIR/Utilities/linux/stedgeai analyze --target stm32n6 --name network -m single_add_neg_rewritten_q.onnx --st-neural-art "n6-noextmem-noec@user_neuralart.json" --verbosity 3
[...]

$STEDGEAI_CORE_DIR/Utilities/linux/python $STEDGEAI_CORE_DIR/scripts/N6_scripts/n6_loader.py --build-config N6-DK  --skip-flash
[...]

$STEDGEAI_CORE_DIR/Utilities/linux/stedgeai validate --target stm32n6 --name network -m single_add_neg_rewritten_q.onnx --st-neural-art "n6-noextmem@user_neuralart.json" --verbosity 3 --mode target -d serial:921600

   Statistic per tensor
   -------------------------------------------------------------------------------------
   tensor   #    type[shape]:size      min   max      mean      std  name
   -------------------------------------------------------------------------------------
   I.0      10   i8[1,1,1,32]:32      -127   124    -2.125   74.938  Input_1_out_0
   O.0      10   i8[1,1,1,32]:32       127   127   127.000    0.000  Quantize_7_out_0
   -------------------------------------------------------------------------------------

 Evaluation report (summary)
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Output       acc    rmse            mae             l2r             mean          std         nse         cos         tensor
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 X-cross #1   n.a.   146.493072510   125.875000000   146.493057251   -125.875000   75.055359   -2.809524   -0.857454   'output_QuantizeLinear_Input', 10 x int8(1x1x1x32), m_id=[7]
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

According to the generated .csv and the per-tensor statistics, the output is a vector with all values equal to 127, which leads to a bad l2r value in the evaluation report.

Finally, I tried Add(1, Mul(x, -1)) and it works: it is accelerated on the NPU and gives the correct result, but it uses 2 ARITH HW units to do so.
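For reference, the three formulations tried here (Sub(1, x), Add(1, Neg(x)) and Add(1, Mul(x, -1))) should all give the same values. The small host-side Python sketch below checks that, and also computes the int8 output one would expect for "1 - x". It is only an illustration: the scales and zero points are hypothetical (they are not taken from the attached models), the helper names are made up for this example, and it does not model how the NPU actually computes anything.

```python
# Host-side reference for "1 - x" with simulated int8 quantization.
# ASSUMPTION: the scales and zero points below are hypothetical, chosen as
# powers of two so the arithmetic is exact; they are NOT read from the models.

IN_SCALE = 1.0 / 128.0   # assumed input scale
OUT_SCALE = 1.0 / 64.0   # assumed output scale (covers the range [0, 2))
ZERO_POINT = 0           # assumed symmetric quantization for both tensors


def quantize(value: float, scale: float) -> int:
    """Round to nearest (ties to even) and saturate to the int8 range."""
    code = round(value / scale) + ZERO_POINT
    return max(-128, min(127, code))


def dequantize(code: int, scale: float) -> float:
    return (code - ZERO_POINT) * scale


# The three graph formulations from the post, written as float functions.
def sub_form(x: float) -> float:
    return 1.0 - x                # Sub(1, x)


def add_neg_form(x: float) -> float:
    return 1.0 + (-x)             # Add(1, Neg(x))


def add_mul_form(x: float) -> float:
    return 1.0 + (x * -1.0)       # Add(1, Mul(x, -1))


def quantized_reference(x_codes):
    """Expected int8 output codes for 1 - x, given int8 input codes."""
    return [quantize(sub_form(dequantize(c, IN_SCALE)), OUT_SCALE)
            for c in x_codes]


if __name__ == "__main__":
    for x in (-1.0, -0.25, 0.0, 0.5, 0.984375):
        assert sub_form(x) == add_neg_form(x) == add_mul_form(x)
    print(quantized_reference([0, 64, 126]))
```

With these assumed parameters, a correct result varies with the input, so comparing such a reference against the validation .csv makes a constant output like the all-127 vector above easy to spot.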

I see that "Sub" should be HW accelerated according to https://stedgeai-dc.st.com/assets/embedded-docs/stneuralart_operator_support.html.

I realize my post reads more like a bug report, but I'm just trying to understand what the right way to compute "1 - x" is.

Is this caused by a bug in the handling of the Sub operator (which may be fixed in the future)? Or is it a limitation of the NPU?

Thanks for your help. Being able to validate on hardware without writing a single line of C code is amazing!

Regards,

Alexis Murzeau

Julian E. · ‎2025-06-26

Hello @AMurz.1,

Thank you for the information.

I was able to reproduce everything you describe on my side.

I have opened internal tickets to investigate.

I will keep you updated.

Julian
