2026-02-15 3:44 AM
Hi,
I am compiling a MiniFASNet-based liveness ONNX model for STM32N6 using ST Edge AI Core v2.2.0.
The model behaves correctly in Python, but when deployed on STM32N6 the performance degrades noticeably.
What is confusing is the following:
Original ONNX model → good results in Python
ST-generated optimized model (*_OE_3_3_0.onnx) → also good results in Python (its outputs were not compared numerically against the original model, though)
Same compiled model running on STM32N6 → significantly worse liveness performance
Compilation summary (excerpt; the exact command used appears in the full log below):
Input: f32 (1x128x128x3)
Output: f32 (1x2)
Model format: ss/sa per tensor
119 epochs (2 implemented in software: QuantizeLinear, DequantizeLinear)
Native float enabled
Activations allocated in NPU RAM regions
(Full log pasted below)
Preprocessing on STM32 matches Python exactly:
Resize size identical (128x128)
Same interpolation
Same normalization
Same channel order
Postprocessing matches Python:
Same logit difference logic
Same threshold
Same decision rule
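For concreteness, the parity points above can be captured in a small Python sketch (the normalization constants, class order, and threshold below are placeholders, not the actual pipeline values):

```python
import numpy as np

def normalize(rgb_uint8):
    """Normalization step applied after the 128x128 resize.
    The /255 scaling is a placeholder; use the same constants as the
    Python pipeline, bit-for-bit, on both PC and STM32."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return x[np.newaxis, ...]  # NHWC: (1, 128, 128, 3), matching the chlast f32 input

def decide(logits, threshold=0.0):
    """Logit-difference decision rule; the class order and threshold
    here are assumptions standing in for the real ones."""
    diff = float(logits[0, 1] - logits[0, 0])  # e.g. live minus spoof
    return diff > threshold
```

Running both decision paths (PC and target) through the exact same functions on the same captured frames makes any remaining difference attributable to the inference step itself.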
ST optimized ONNX (*_OE_3_3_0.onnx) produces correct results in Python.
The discrepancy only appears when executing on STM32N6.
Outputs are not random
Inference runs successfully
Logits are reasonable values
However, classification confidence is consistently lower than in Python
Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution?
Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model?
I want to isolate whether this is:
A runtime configuration issue
A memory/cache issue
Or something specific to the STM32N6 execution environment
Any guidance on how to systematically debug numeric differences between PC and STM32N6 execution would be appreciated.
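One systematic approach is to dump the output logits from each stage (original ONNX on PC, ST-optimized ONNX on PC, compiled model on the STM32N6) for the same real input frames, then compare the dumps pairwise with cosine similarity to localize where the drift is introduced. A minimal sketch (the `.npy` file names are hypothetical; on the board the logits would be captured e.g. over UART):

```python
import numpy as np

def cosine_similarity(ref, test):
    """Cosine similarity between two flattened output tensors.
    Values very close to 1.0 mean the two stages are numerically equivalent."""
    r = ref.ravel().astype(np.float64)
    t = test.ravel().astype(np.float64)
    return float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t)))

# Hypothetical workflow, comparing the same real frames at each stage:
# pc_ref = np.load("logits_original_onnx.npy")   # original model, ONNX Runtime
# pc_opt = np.load("logits_optimized_onnx.npy")  # *_OE_3_3_0.onnx, ONNX Runtime
# target = np.load("logits_stm32n6.npy")         # captured from the board
# print(cosine_similarity(pc_ref, pc_opt))  # drift from the ST optimizer, if any
# print(cosine_similarity(pc_opt, target))  # drift from the target runtime, if any
```

If the first comparison is ~1.0 but the second is not, the discrepancy comes from on-target execution rather than from the model transformation.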
PS C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows> ./stedgeai generate --model .\best_model_quantized_calib.onnx --target stm32n6 --st-neural-art default@user_neuralart.json --input-data-type float32 --output-data-type float32 --inputs-ch-position chlast --no-onnx-optimizer --verbosity 3
ST Edge AI Core v2.2.0-20266 2adc00962
WARNING: Unsupported keys in the current profile default are ignored: memory_desc
> memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/atonn.exe -i "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0.onnx" --json-quant-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0_Q.json" -g "network.c" --load-mdesc "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/configs/stm32n6.mdesc" --load-mpool "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/my_mpools/stm32n6-app2.mpool" --save-mpool-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/stm32n6-app2.mpool" --out-dir-prefix "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/" --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER
Exec/report summary (generate)
--------------------------------------------------------------------------------------------------------------
model file : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\best_model_quantized_calib.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws
output dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output
model_fmt : ss/sa per tensor
model_name : best_model_quantized_calib
model_hash : 0x72a0c3e8b907f5eb00804c4e2a91e8d1
params # : 468,048 items (1.79 MiB)
--------------------------------------------------------------------------------------------------------------
input 1/1 : 'Input_0_out_0', f32(1x128x128x3), 192.00 KBytes, activations
output 1/1 : 'Dequantize_273_out_0', f32(1x2), 8 Bytes, activations
macc : 0
weights (ro) : 513,105 B (501.08 KiB) (1 segment) / -1,359,087(-72.6%) vs float model
activations (rw) : 1,476,608 B (1.41 MiB) (4 segments) *
ram (total) : 1,476,608 B (1.41 MiB) = 1,476,608 + 0 + 0
--------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers are allocated in the activations buffer
Computing AI RT data/code size (target=stm32n6npu)..
-> compiler "gcc:arm-none-eabi-gcc" is not in the PATH
Compilation details
---------------------------------------------------------------------------------
Compiler version: 1.1.1-14
Compiler arguments: -i C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx --json-quant-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json -g network.c --load-mdesc C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\configs\stm32n6.mdesc --load-mpool C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\my_mpools\stm32n6-app2.mpool --save-mpool-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network\stm32n6-app2.mpool --out-dir-prefix C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network/ --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file c_info.json
====================================================================================
Memory usage information (input/output buffers are included in activations)
---------------------------------------------------------------------------------
npuRAM3 [0x34200000 - 0x34270000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
npuRAM4 [0x34270000 - 0x342E0000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
npuRAM5 [0x342E0000 - 0x34350000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
npuRAM6 [0x34350000 - 0x343C0000]: 98.000 kB / 448.000 kB ( 21.88 % used) -- weights: 0 B ( 0.00 % used) activations: 98.000 kB ( 21.88 % used)
octoFlash [0x72880000 - 0x72C80000]: 501.079 kB / 4.000 MB ( 12.23 % used) -- weights: 501.079 kB ( 12.23 % used) activations: 0 B ( 0.00 % used)
hyperRAM [0x90000000 - 0x91000000]: 0 B / 16.000 MB ( 0.00 % used) -- weights: 0 B ( 0.00 % used) activations: 0 B ( 0.00 % used)
Total: 1.898 MB -- weights: 501.079 kB activations: 1.408 MB
====================================================================================
Used memory ranges
---------------------------------------------------------------------------------
npuRAM3 [0x34200000 - 0x34270000]: 0x34200000-0x34270000
npuRAM4 [0x34270000 - 0x342E0000]: 0x34270000-0x342E0000
npuRAM5 [0x342E0000 - 0x34350000]: 0x342E0000-0x34350000
npuRAM6 [0x34350000 - 0x343C0000]: 0x34350000-0x34368800
octoFlash [0x72880000 - 0x72C80000]: 0x72880000-0x728FD460
====================================================================================
Epochs details
---------------------------------------------------------------------------------
Total number of epochs: 119 of which 2 implemented in software
epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( QuantizeLinear )
epoch 3 HW
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 HW
epoch 22 HW
epoch 23 HW
epoch 24 HW
epoch 25 HW
epoch 26 HW
epoch 27 HW
epoch 28 HW
epoch 29 HW
epoch 30 HW
epoch 31 HW
epoch 32 HW
epoch 33 HW
epoch 34 HW
epoch 35 HW
epoch 36 HW
epoch 37 HW
epoch 38 HW
epoch 39 HW
epoch 40 HW
epoch 41 HW
epoch 42 HW
epoch 43 HW
epoch 44 HW
epoch 45 HW
epoch 46 HW
epoch 47 HW
epoch 48 HW
epoch 49 HW
epoch 50 HW
epoch 51 HW
epoch 52 HW
epoch 53 HW
epoch 54 HW
epoch 55 HW
epoch 56 HW
epoch 57 HW
epoch 58 HW
epoch 59 HW
epoch 60 HW
epoch 61 HW
epoch 62 HW
epoch 63 HW
epoch 64 HW
epoch 65 HW
epoch 66 HW
epoch 67 HW
epoch 68 HW
epoch 69 HW
epoch 70 HW
epoch 71 HW
epoch 72 HW
epoch 73 HW
epoch 74 HW
epoch 75 HW
epoch 76 HW
epoch 77 HW
epoch 78 HW
epoch 79 HW
epoch 80 HW
epoch 81 HW
epoch 82 HW
epoch 83 HW
epoch 84 HW
epoch 85 HW
epoch 86 HW
epoch 87 HW
epoch 88 HW
epoch 89 HW
epoch 90 HW
epoch 91 HW
epoch 92 HW
epoch 93 HW
epoch 94 HW
epoch 95 HW
epoch 96 HW
epoch 97 HW
epoch 98 HW
epoch 99 HW
epoch 100 HW
epoch 101 HW
epoch 102 HW
epoch 103 HW
epoch 104 HW
epoch 105 HW
epoch 106 HW
epoch 107 HW
epoch 108 HW
epoch 109 HW
epoch 110 HW
epoch 111 HW
epoch 112 HW
epoch 113 HW
epoch 114 HW
epoch 115 HW
epoch 116 HW
epoch 117 HW
epoch 118 HW
epoch 119 -SW- ( DequantizeLinear )
====================================================================================
Generated files (5)
--------------------------------------------------------------------------------------------------------------
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.c
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_atonbuf.xSPI2.raw
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.h
Creating txt report file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_generate_report.txt
elapsed time (generate): 271.131s
2026-02-19 7:22 AM - edited 2026-02-19 7:24 AM
Hi @Afreen,
First, to answer your questions:
Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution? Yes
Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model? I don't think so
I would suggest updating ST Edge AI Core to version 3.0 and installing the new tool that replaces X-Cube-AI, so that you can validate your model on 3.0. More info here: Introducing STM32CubeAI Studio - STMicroelectronics Community
I suggest validating your model on target both with and without the NPU and checking the "COS" (cosine similarity) metric in the report. It should be very close to 1; if it is not, the output of the compiled model differs from that of the original model, which could indicate a bug.
Validating with and without the NPU tells you whether the problem lies in the STM32 software libraries or in the Neural-ART (NPU) library.
Note that it is better to validate with real data rather than random data.
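One way to prepare real validation data is to dump a small batch of genuinely preprocessed frames to disk instead of letting the tool generate random inputs. A sketch (the exact file format expected by the validation step should be confirmed in the ST Edge AI Core documentation, and the commented-out `real_frames`/`preprocess` names are hypothetical stand-ins for your own capture and preprocessing code):

```python
import numpy as np

# Placeholder batch; replace the zeros with real preprocessed camera frames, e.g.:
#   for i, frame in enumerate(real_frames):
#       batch[i] = preprocess(frame)          # hypothetical helpers
batch = np.zeros((8, 128, 128, 3), dtype=np.float32)  # NHWC, matches the f32(1x128x128x3) input
np.save("val_inputs.npy", batch)  # feed this file to the validation step
```

Using frames from the actual liveness dataset exercises the same activation ranges the quantization calibration saw, which random data does not.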
Have a good day,
Julian