2025-06-20 4:32 PM
Hi,
I'm trying to implement the "1 - x" computation as part of a GRU implementation in the ONNX model format.
I tried Sub(1, x), but the Sub operator is not accelerated on the NPU and runs in software instead.
I've attached the model "single_sub_rewritten_q.onnx" for this:
(const_1 is a vector of all ones here, but using a single scalar 1.0 gives the same result).
And this is the output of stedgeai, which shows that Sub is running outside the NPU:
Epochs details
---------------------------------------------------------------------------------
Total number of epochs: 1 of which 1 implemented in software
epoch ID HW/SW/EC Operation (SW only)
epoch 1 -SW- ( Sub )
====================================================================================
Then I tried Add(1, Neg(x)) instead. The resulting model is accelerated on the NPU, but according to the validation output, the NPU output is wrong: it is always the constant 127.
Running the validation gives these results:
$STEDGEAI_CORE_DIR/Utilities/linux/stedgeai analyze --target stm32n6 --name network -m single_add_neg_rewritten_q.onnx --st-neural-art "n6-noextmem-noec@user_neuralart.json" --verbosity 3
[...]
$STEDGEAI_CORE_DIR/Utilities/linux/python $STEDGEAI_CORE_DIR/scripts/N6_scripts/n6_loader.py --build-config N6-DK --skip-flash
[...]
$STEDGEAI_CORE_DIR/Utilities/linux/stedgeai validate --target stm32n6 --name network -m single_add_neg_rewritten_q.onnx --st-neural-art "n6-noextmem@user_neuralart.json" --verbosity 3 --mode target -d serial:921600
Statistic per tensor
-------------------------------------------------------------------------------------
tensor # type[shape]:size min max mean std name
-------------------------------------------------------------------------------------
I.0 10 i8[1,1,1,32]:32 -127 124 -2.125 74.938 Input_1_out_0
O.0 10 i8[1,1,1,32]:32 127 127 127.000 0.000 Quantize_7_out_0
-------------------------------------------------------------------------------------
Evaluation report (summary)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Output acc rmse mae l2r mean std nse cos tensor
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
X-cross #1 n.a. 146.493072510 125.875000000 146.493057251 -125.875000 75.055359 -2.809524 -0.857454 'output_QuantizeLinear_Input', 10 x int8(1x1x1x32), m_id=[7]
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
According to the generated .csv and the per-tensor statistics, the output is a vector with all values equal to 127, which leads to a bad l2r value in the evaluation report.
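For comparison, here is a minimal sketch of what a correct int8 output for "1 - x" should look like, using plain per-tensor affine quantization. The scales and zero points below are hypothetical (chosen as powers of two so the arithmetic is exact); they are not the values stored in the attached model. The point is only that a correct output varies with the input instead of sticking at 127.

```python
# Reference int8 computation of 1 - x with per-tensor affine quantization.
# All scale / zero-point values here are illustrative assumptions,
# not the parameters of single_add_neg_rewritten_q.onnx.

def quantize(value, scale, zero_point):
    """Map a real value to int8, saturating to [-128, 127]."""
    q = round(value / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Map an int8 value back to a real value."""
    return (q - zero_point) * scale

def reference_one_minus_x(q_in, in_scale, in_zp, out_scale, out_zp):
    """Dequantize, compute 1 - x in floating point, then requantize."""
    return [
        quantize(1.0 - dequantize(q, in_scale, in_zp), out_scale, out_zp)
        for q in q_in
    ]

# Hypothetical parameters: input covers [-1, 1), output covers [-1, 3).
IN_SCALE, IN_ZP = 1.0 / 128, 0
OUT_SCALE, OUT_ZP = 1.0 / 64, -64

q_input = [-128, 0, 64, 124]
q_output = reference_one_minus_x(q_input, IN_SCALE, IN_ZP, OUT_SCALE, OUT_ZP)
print(q_output)
```

With the real scales and zero points read from the QuantizeLinear/DequantizeLinear nodes of the model, the same function could be compared element by element against the values in the generated .csv.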
Finally, I tried Add(1, Mul(x, -1)) and it works: it is accelerated on the NPU and gives the correct result, but it uses 2 ARITH HW units to do so.
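To make the comparison explicit, the three graph formulations tried above can be written as plain floating-point functions. This is only a sketch of the math, not of the ONNX graphs or of the NPU behavior; the function names are made up for illustration. In floating point all three give identical results, so any difference seen on the target comes from how each operator is mapped and quantized, not from the math itself.

```python
# The three ways of expressing 1 - x, as element-wise float functions.
# Function names are hypothetical; they mirror Sub(1, x), Add(1, Neg(x))
# and Add(1, Mul(x, -1)) respectively.

def one_minus_x_sub(x):
    """Sub(1, x)"""
    return [1.0 - v for v in x]

def one_minus_x_add_neg(x):
    """Add(1, Neg(x))"""
    return [1.0 + (-v) for v in x]

def one_minus_x_add_mul(x):
    """Add(1, Mul(x, -1))"""
    return [1.0 + (v * -1.0) for v in x]

samples = [-1.0, -0.25, 0.0, 0.5, 0.96875]
results = [f(samples) for f in (one_minus_x_sub,
                                one_minus_x_add_neg,
                                one_minus_x_add_mul)]
# All three formulations agree on every sample.
print(results[0] == results[1] == results[2])
```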
I see that "Sub" should be HW accelerated according to https://stedgeai-dc.st.com/assets/embedded-docs/stneuralart_operator_support.html.
I realize my post resembles a bug report, but I'm just trying to understand the right way to do "1 - x".
Is this caused by a bug in the handling of the Sub operator (which may be fixed in the future)? Or is it a limitation of the NPU?
Thanks for your help. The validation on hardware without writing a single line of C code is amazing!
Regards,
Alexis Murzeau