2026-02-15 3:44 AM
Hi,
I am compiling a MiniFASNet-based liveness ONNX model for STM32N6 using ST Edge AI Core v2.2.0.
The model behaves correctly in Python, but when deployed on STM32N6 the performance degrades noticeably.
What is confusing is the following:
Original ONNX model → good results in Python
ST-generated optimized model (*_OE_3_3_0.onnx) → also good results in Python (its outputs were not compared numerically against the original model, though)
Same compiled model running on STM32N6 → significantly worse liveness performance
Compilation summary (excerpt; the exact command used appears in the full log below):
Input: f32 (1x128x128x3)
Output: f32 (1x2)
Model format: ss/sa per tensor
119 epochs (2 implemented in software: QuantizeLinear, DequantizeLinear)
Native float enabled
Activations allocated in NPU RAM regions
(Full log pasted below)
Preprocessing on STM32 matches Python exactly:
Resize size identical (128x128)
Same interpolation
Same normalization
Same channel order
Postprocessing matches Python:
Same logit difference logic
Same threshold
Same decision rule
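For concreteness, the parity points above can be captured in a small Python sketch (the normalization constants, class order, and threshold below are placeholders, not the actual pipeline values):

```python
import numpy as np

def normalize(rgb_uint8):
    """Normalization step applied after the 128x128 resize.
    The /255 scaling is a placeholder; use the same constants as the
    Python pipeline, bit-for-bit, on both PC and STM32."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return x[np.newaxis, ...]  # NHWC: (1, 128, 128, 3), matching the chlast f32 input

def decide(logits, threshold=0.0):
    """Logit-difference decision rule; the class order and threshold
    here are assumptions standing in for the real ones."""
    diff = float(logits[0, 1] - logits[0, 0])  # e.g. live minus spoof
    return diff > threshold
```

Running both decision paths (PC and target) through the exact same functions on the same captured frames makes any remaining difference attributable to the inference step itself.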
ST optimized ONNX (*_OE_3_3_0.onnx) produces correct results in Python.
The discrepancy only appears when executing on STM32N6.
Outputs are not random
Inference runs successfully
Logits are reasonable values
However, classification confidence is consistently lower than in Python
Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution?
Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model?
I want to isolate whether this is:
A runtime configuration issue
A memory/cache issue
Or something specific to the STM32N6 execution environment
Any guidance on how to systematically debug numeric differences between PC and STM32N6 execution would be appreciated.
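One systematic approach is to dump the output logits from each stage (original ONNX on PC, ST-optimized ONNX on PC, compiled model on the STM32N6) for the same real input frames, then compare the dumps pairwise with cosine similarity to localize where the drift is introduced. A minimal sketch (the `.npy` file names are hypothetical; on the board the logits would be captured e.g. over UART):

```python
import numpy as np

def cosine_similarity(ref, test):
    """Cosine similarity between two flattened output tensors.
    Values very close to 1.0 mean the two stages are numerically equivalent."""
    r = ref.ravel().astype(np.float64)
    t = test.ravel().astype(np.float64)
    return float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t)))

# Hypothetical workflow, comparing the same real frames at each stage:
# pc_ref = np.load("logits_original_onnx.npy")   # original model, ONNX Runtime
# pc_opt = np.load("logits_optimized_onnx.npy")  # *_OE_3_3_0.onnx, ONNX Runtime
# target = np.load("logits_stm32n6.npy")         # captured from the board
# print(cosine_similarity(pc_ref, pc_opt))  # drift from the ST optimizer, if any
# print(cosine_similarity(pc_opt, target))  # drift from the target runtime, if any
```

If the first comparison is ~1.0 but the second is not, the discrepancy comes from on-target execution rather than from the model transformation.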
PS C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows> ./stedgeai generate --model .\best_model_quantized_calib.onnx --target stm32n6 --st-neural-art default@user_neuralart.json --input-data-type float32 --output-data-type float32 --inputs-ch-position chlast --no-onnx-optimizer --verbosity 3
ST Edge AI Core v2.2.0-20266 2adc00962
WARNING: Unsupported keys in the current profile default are ignored: memory_desc
> memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/atonn.exe -i "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0.onnx" --json-quant-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0_Q.json" -g "network.c" --load-mdesc "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/configs/stm32n6.mdesc" --load-mpool "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/my_mpools/stm32n6-app2.mpool" --save-mpool-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/stm32n6-app2.mpool" --out-dir-prefix "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/" --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER
Exec/report summary (generate)
--------------------------------------------------------------------------------------------------------------
model file : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\best_model_quantized_calib.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws
output dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output
model_fmt : ss/sa per tensor
model_name : best_model_quantized_calib
model_hash : 0x72a0c3e8b907f5eb00804c4e2a91e8d1
params # : 468,048 items (1.79 MiB)
--------------------------------------------------------------------------------------------------------------
input 1/1 : 'Input_0_out_0', f32(1x128x128x3), 192.00 KBytes, activations
output 1/1 : 'Dequantize_273_out_0', f32(1x2), 8 Bytes, activations
macc : 0
weights (ro) : 513,105 B (501.08 KiB) (1 segment) / -1,359,087(-72.6%) vs float model
activations (rw) : 1,476,608 B (1.41 MiB) (4 segments) *
ram (total) : 1,476,608 B (1.41 MiB) = 1,476,608 + 0 + 0
--------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers are allocated in the activations buffer
Computing AI RT data/code size (target=stm32n6npu)..
-> compiler "gcc:arm-none-eabi-gcc" is not in the PATH
Compilation details
---------------------------------------------------------------------------------
Compiler version: 1.1.1-14
Compiler arguments: -i C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx --json-quant-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json -g network.c --load-mdesc C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\configs\stm32n6.mdesc --load-mpool C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\my_mpools\stm32n6-app2.mpool --save-mpool-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network\stm32n6-app2.mpool --out-dir-prefix C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network/ --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file c_info.json
====================================================================================
Memory usage information (input/output buffers are included in activations)
---------------------------------------------------------------------------------
npuRAM3 [0x34200000 - 0x34270000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
npuRAM4 [0x34270000 - 0x342E0000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
npuRAM5 [0x342E0000 - 0x34350000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
npuRAM6 [0x34350000 - 0x343C0000]: 98.000 kB / 448.000 kB ( 21.88 % used) -- weights: 0 B ( 0.00 % used) activations: 98.000 kB ( 21.88 % used)
octoFlash [0x72880000 - 0x72C80000]: 501.079 kB / 4.000 MB ( 12.23 % used) -- weights: 501.079 kB ( 12.23 % used) activations: 0 B ( 0.00 % used)
hyperRAM [0x90000000 - 0x91000000]: 0 B / 16.000 MB ( 0.00 % used) -- weights: 0 B ( 0.00 % used) activations: 0 B ( 0.00 % used)
Total: 1.898 MB -- weights: 501.079 kB activations: 1.408 MB
====================================================================================
Used memory ranges
---------------------------------------------------------------------------------
npuRAM3 [0x34200000 - 0x34270000]: 0x34200000-0x34270000
npuRAM4 [0x34270000 - 0x342E0000]: 0x34270000-0x342E0000
npuRAM5 [0x342E0000 - 0x34350000]: 0x342E0000-0x34350000
npuRAM6 [0x34350000 - 0x343C0000]: 0x34350000-0x34368800
octoFlash [0x72880000 - 0x72C80000]: 0x72880000-0x728FD460
====================================================================================
Epochs details
---------------------------------------------------------------------------------
Total number of epochs: 119 of which 2 implemented in software
epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( QuantizeLinear )
epoch 3 HW
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 HW
epoch 22 HW
epoch 23 HW
epoch 24 HW
epoch 25 HW
epoch 26 HW
epoch 27 HW
epoch 28 HW
epoch 29 HW
epoch 30 HW
epoch 31 HW
epoch 32 HW
epoch 33 HW
epoch 34 HW
epoch 35 HW
epoch 36 HW
epoch 37 HW
epoch 38 HW
epoch 39 HW
epoch 40 HW
epoch 41 HW
epoch 42 HW
epoch 43 HW
epoch 44 HW
epoch 45 HW
epoch 46 HW
epoch 47 HW
epoch 48 HW
epoch 49 HW
epoch 50 HW
epoch 51 HW
epoch 52 HW
epoch 53 HW
epoch 54 HW
epoch 55 HW
epoch 56 HW
epoch 57 HW
epoch 58 HW
epoch 59 HW
epoch 60 HW
epoch 61 HW
epoch 62 HW
epoch 63 HW
epoch 64 HW
epoch 65 HW
epoch 66 HW
epoch 67 HW
epoch 68 HW
epoch 69 HW
epoch 70 HW
epoch 71 HW
epoch 72 HW
epoch 73 HW
epoch 74 HW
epoch 75 HW
epoch 76 HW
epoch 77 HW
epoch 78 HW
epoch 79 HW
epoch 80 HW
epoch 81 HW
epoch 82 HW
epoch 83 HW
epoch 84 HW
epoch 85 HW
epoch 86 HW
epoch 87 HW
epoch 88 HW
epoch 89 HW
epoch 90 HW
epoch 91 HW
epoch 92 HW
epoch 93 HW
epoch 94 HW
epoch 95 HW
epoch 96 HW
epoch 97 HW
epoch 98 HW
epoch 99 HW
epoch 100 HW
epoch 101 HW
epoch 102 HW
epoch 103 HW
epoch 104 HW
epoch 105 HW
epoch 106 HW
epoch 107 HW
epoch 108 HW
epoch 109 HW
epoch 110 HW
epoch 111 HW
epoch 112 HW
epoch 113 HW
epoch 114 HW
epoch 115 HW
epoch 116 HW
epoch 117 HW
epoch 118 HW
epoch 119 -SW- ( DequantizeLinear )
====================================================================================
Generated files (5)
--------------------------------------------------------------------------------------------------------------
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.c
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_atonbuf.xSPI2.raw
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.h
Creating txt report file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_generate_report.txt
elapsed time (generate): 271.131s
2026-02-19 7:22 AM - edited 2026-02-19 7:24 AM
Hi @Afreen,
First, to answer your questions:
Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution? Yes
Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model? I don't think so
I would suggest updating ST Edge AI Core to version 3.0 and installing the new tool that replaces X-Cube-AI, so that you can validate your model on 3.0. More info here: Introducing STM32CubeAI Studio - STMicroelectronics Community
I suggest validating your model on target both with and without the NPU and checking the "COS" (cosine similarity) metric in the report. It should be very close to 1; if it is not, the output of the compiled model differs from that of the original model, which could indicate a bug.
Validating with and without the NPU tells you whether the problem lies in the STM32 software libraries or in the Neural-ART (NPU) library.
Note that it is better to validate with real data rather than random data.
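One way to prepare real validation data is to dump a small batch of genuinely preprocessed frames to disk instead of letting the tool generate random inputs. A sketch (the exact file format expected by the validation step should be confirmed in the ST Edge AI Core documentation, and the commented-out `real_frames`/`preprocess` names are hypothetical stand-ins for your own capture and preprocessing code):

```python
import numpy as np

# Placeholder batch; replace the zeros with real preprocessed camera frames, e.g.:
#   for i, frame in enumerate(real_frames):
#       batch[i] = preprocess(frame)          # hypothetical helpers
batch = np.zeros((8, 128, 128, 3), dtype=np.float32)  # NHWC, matches the f32(1x128x128x3) input
np.save("val_inputs.npy", batch)  # feed this file to the validation step
```

Using frames from the actual liveness dataset exercises the same activation ranges the quantization calibration saw, which random data does not.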
Have a good day,
Julian