STM32N657 Neural-ART: int8 residual Add computes a wrong (low-biased) result on target - both operands verified correct

Question

Board: NUCLEO-N657X0-Q (MB1940), STM32N657X0H (Cortex-M55 + Neural-ART NPU), Device ID 0x486 Rev Z.

Toolchain: ST Edge AI Core v4.0.1-20581 (STM32Cube.AI 12.0.1-RC2), atonn 1.1.3-275, NetworkRuntime1201_CM55_GCC, arm-none-eabi-gcc 14.3.1 (STM32CubeCLT 1.21.0).

SUMMARY
A small int8 QDQ model (TCN: Conv2D x7 + Gemm x2, macc ~2.37M, int8 weights ~20 KB) generated with "stedgeai generate --target stm32n6 --st-neural-art" runs to completion on the NPU but produces wrong outputs: high-input forecasts collapse to baseline. Independent cross-check: onnxruntime on the same int8 ONNX gives 9.664 (ppm, log-domain output dequantized) on our golden window; the NPU returns ~0.26. Confirmed on hardware with ST's own "stedgeai validate --mode target" (regression head rmse 8.78, cos 0.589 vs the reference).
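For readers comparing their own dumps, here is a minimal sketch of the two figures quoted above, assuming "cos" is plain cosine similarity and "rmse" is root-mean-square error over the flattened tensors. These definitions are my assumption; I have not checked them against the stedgeai report internals. The sample values are made up for illustration.

```python
import math

def cosine_similarity(ref, test):
    """Cosine of the angle between two flattened tensors."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_test = math.sqrt(sum(t * t for t in test))
    if norm_ref == 0.0 or norm_test == 0.0:
        return 0.0  # convention chosen here for all-zero tensors
    return dot / (norm_ref * norm_test)

def rmse(ref, test):
    """Root-mean-square error between two flattened tensors."""
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref))

# Hypothetical data: a reference tensor and a copy with a uniform 0.5x gain.
reference = [1.0, 2.0, 3.0, 4.0]
biased_low = [0.5 * v for v in reference]

print(cosine_similarity(reference, biased_low))
print(rmse(reference, biased_low))
```

Cosine similarity is insensitive to a uniform gain, so a pure magnitude bias leaves it at 1 and only shows up in rmse. That is why both numbers are worth reading together.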

LOCALIZED ON-TARGET TO A SINGLE OP

Per-layer on-target validation (backbone stage outputs exposed as ONNX outputs):
- input projection Conv: cos 0.999 (correct)
- block 0 (dilated Conv + Slice + residual Add + Relu): cos 0.606 (first divergence; error accumulates through blocks 1-5)

Isolating block-0 internals on target:
- dilated Conv output: cos 0.9996 (correct)
- causal Slice: cos 0.99999 (correct)
- residual Add + Relu: cos 0.606 (WRONG, biased low)
Both Add operands are computed correctly on the NPU; the int8 eltwise combine of two tensors with DIFFERENT quant scales is what diverges. Operand scales 0.052 / 0.099 -> output scale 0.049 (only ~2x ratio, non-pathological).
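To make the expected behaviour concrete, this is a minimal host-side sketch of what a different-scale int8 Add + Relu should produce: dequantize each operand with its own scale, add in real numbers, apply Relu, then requantize with the output scale. The scales are the ones quoted above. The zero points are assumed to be 0 because they are not listed here, and the function names are illustrative. It does not describe how the NPU implements the op.

```python
def quantize(x, scale, zero_point=0):
    """Real value -> int8, saturating to [-128, 127].
    Python's round() rounds halves to even."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point=0):
    """int8 -> real value."""
    return (q - zero_point) * scale

def reference_qadd_relu(qa, qb, sa, sb, so, za=0, zb=0, zo=0):
    """Reference Add + Relu on two int8 operands with different scales."""
    real = dequantize(qa, sa, za) + dequantize(qb, sb, zb)
    real = max(real, 0.0)  # Relu
    return quantize(real, so, zo)

# Scales from the report; zero points assumed 0 (not stated above).
S_A, S_B, S_OUT = 0.052, 0.099, 0.049

# 20 * 0.052 + 10 * 0.099 = 2.03, and 2.03 / 0.049 = 41.43, so 41.
print(reference_qadd_relu(20, 10, S_A, S_B, S_OUT))
```

Running the on-target operand dumps through a reference like this, element by element, shows directly where the NPU output drops below the expected value.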

RULED OUT (output byte-identical across all of these): weight location (external flash / npuRAM3 / npuRAM5), M55 D-cache on/off, NPU CACHEAXI on/off/invalidate, RISAF12, NPU-master RIF secure vs non-secure, clocks/sleep settings, --Ocache-opt, EpochController encryption, input layout. Weights and input verified correct in NPU memory at run entry (SWD reads).

MODEL-SIDE WORKAROUNDS TRIED ON HARDWARE (all failed):
1. Concat + 1x1 Conv replacing the Add (numerically exact in float, max delta = 0): no help (cos 0.610) -> the defect is the different-scale int8 combine, not the Add operator itself (Concat requantizes the same way).
2. int16 activations: partial improvement (cos 0.589 -> 0.887) but a magnitude bias remains.
3. Shared-scale Add operands: not cleanly viable (the residual stream is a chain, forcing one GLOBAL scale ~7x coarser than the relu scales).
4. Residual folded into the conv (x + conv(x) = (conv+I)(x), exact on host, max delta 9.5e-07): fails on target (the int8 conv appears to lose the small delta next to the ~1.0 identity weight).
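Workaround 4 can be illustrated with a toy example. The sketch below uses a 2-channel 1x1 conv (a plain matrix), made-up small weights, and symmetric per-tensor int8 weight quantization. All of these are assumptions chosen for illustration, not the real model or the scheme the tool actually applies. It shows that the fold is exact in float, and that once the ~1.0 identity weight sets the quantization range, the small conv weights can round to zero.

```python
def matvec(w, x):
    """Apply a 1x1 conv, written as a channel-mixing matrix, to one pixel."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def add_identity(w):
    """Fold the residual path into the weights: W -> W + I."""
    return [[wij + (1.0 if i == j else 0.0) for j, wij in enumerate(row)]
            for i, row in enumerate(w)]

def quantize_per_tensor(w, qmax=127):
    """Symmetric per-tensor int8 quantization (assumed scheme)."""
    scale = max(abs(v) for row in w for v in row) / qmax
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in w]
    return q, scale

# Hypothetical small conv weights and input.
W = [[0.003, -0.002], [0.001, 0.0035]]
x = [1.0, 2.0]

residual = [xi + yi for xi, yi in zip(x, matvec(W, x))]  # x + conv(x)
folded = matvec(add_identity(W), x)                      # (conv + I)(x)
q_folded, scale = quantize_per_tensor(add_identity(W))

print(residual, folded)  # match to float precision
print(q_folded, scale)   # off-diagonal weights have rounded to 0
```

In this toy case the quantized folded weight is just a scaled identity, so the conv contribution is lost entirely. That is consistent with the behaviour described in item 4, although it does not prove this is what happens on the target.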

CURRENT STATUS: we ship the same int8 model on the Cortex-M55 CPU instead (bit-exact vs onnxruntime there, so the model itself is sound) at ~15.7 ms/inference vs the NPU's measured 1.124 ms. A fix would let us return to the NPU with no retrain.

QUESTION
Is the int8 eltwise Add of two different-scale operands a known Neural-ART/ATON issue in v4.0.1? Is there a generate option, or a recommended residual-connection pattern, that the NPU computes correctly?

A full evidence package exists (per-layer validate reports, ONNX cuts isolating the op, network_val_io.npz with on-target vs reference tensors) - happy to attach or share on request.
