STM32N6 / CubeAI: Incorrect inference on device despite matching log-mel and correct PC results

sakasita
Associate

 

Hello ST Community,

I am currently working with STM32N6 and CubeAI, and I would appreciate your guidance on an issue related to incorrect inference on the device.

I am integrating a custom KWS model into the STM32N6-GettingStarted-Audio project, but inference results on the device are incorrect.

I have carefully verified the preprocessing and model behavior as described below.

--------------------------------------------------
Model
--------------------------------------------------

Architecture: DS-CNN (PyTorch)

Input tensor:
[B, C, H, W] = [1, 1, 40, 49]
H: mel bins
W: time frames
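
For reference, the ONNX export of this model follows the usual PyTorch pattern (a minimal, self-contained sketch; the stand-in module below only reproduces the I/O shapes of my DS-CNN, and the file name is a placeholder):

```python
# Minimal sketch of the ONNX export step (stand-in module, placeholder file name).
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(40 * 49, 9)).eval()

dummy = torch.zeros(1, 1, 40, 49)   # [B, C, H, W] = [1, 1, mel bins, time frames]
torch.onnx.export(
    model, dummy, "dscnn_kws.onnx",
    input_names=["log_mel"],
    output_names=["logits"],
    opset_version=13,
)
```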

--------------------------------------------------
Verified points
--------------------------------------------------

1. Log-mel preprocessing alignment

- Log-mel spectrogram generated on STM32 was dumped using GDB
- Compared with the Python implementation (see the verification sketch after these points)

Result:
Nearly identical (very small numerical differences)

2. PC inference with int8 model

- Used dumped MCU log-mel as input
- Ran inference with quantized (QDQ) ONNX model

Result:
Correct classification

3. CubeAI Studio inference

- Input: float32 log-mel (.npy format)
- Shape: (256, 1, 40, 49)

Result:
Correct classification
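
For reference, points 1 and 2 are checked on the PC roughly as follows (a sketch; the file names are placeholders for my actual dumps and models):

```python
# Sketch of the PC-side verification (file names are placeholders).
import numpy as np
import onnxruntime as ort

# 1) Compare the log-mel dumped from the MCU via GDB against the PC reference
mcu_logmel = np.fromfile("mcu_logmel_dump.bin", dtype=np.float32).reshape(40, 49)
ref_logmel = np.load("pc_logmel_reference.npy")            # same PCM, same parameters
print("max abs diff:", np.abs(mcu_logmel - ref_logmel).max())

# 2) Run the quantized (QDQ) ONNX model on the MCU dump
sess = ort.InferenceSession("dscnn_kws_int8_qdq.onnx")
x = mcu_logmel[np.newaxis, np.newaxis, :, :].astype(np.float32)   # [1, 1, 40, 49]
logits = sess.run(None, {sess.get_inputs()[0].name: x})[0]
print("predicted class:", int(np.argmax(logits)))
```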

--------------------------------------------------
Input data consistency
--------------------------------------------------

- The input PCM data was pre-recorded using the MEMS microphone on the STM32N6 evaluation board
- The same PCM data is used consistently across:
- MCU preprocessing
- PC-side log-mel generation
- ONNX Runtime inference

Therefore, differences due to microphone characteristics or recording conditions are excluded.

--------------------------------------------------
Additional implementation details
--------------------------------------------------

- The C preprocessing code was generated using the official GenHeaders.py script as a baseline
- I updated the preprocessing parameters to match my training pipeline by replacing:
- Mel filter bank coefficients (from PyTorch; regeneration sketched at the end of this section)

- I also modified:
- hop_length = 320
- fmin = 50
- fmax = 7500

- These parameters are identical to those used during training in Python
- The resulting log-mel spectrogram on STM32 closely matches the PC implementation
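
For reference, the replacement filter bank is generated roughly like this (a sketch; sample_rate and n_fft are assumptions here, the real values come from my training config):

```python
# Sketch of regenerating the mel filter bank for the C header
# (sample_rate and n_fft are assumptions; hop_length/fmin/fmax match training).
import torchaudio

sample_rate = 16000            # assumption: 16 kHz KWS audio
n_fft = 640                    # assumption: actual value taken from the training config
n_mels, fmin, fmax = 40, 50.0, 7500.0

fbank = torchaudio.functional.melscale_fbanks(
    n_freqs=n_fft // 2 + 1,
    f_min=fmin,
    f_max=fmax,
    n_mels=n_mels,
    sample_rate=sample_rate,
)                              # shape: (n_freqs, n_mels)

# Flatten in the layout the C code expects and print as C initializers
flat = fbank.T.contiguous().numpy().astype("float32").ravel()
print(", ".join(f"{v:.8f}f" for v in flat[:8]), "...")
```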

--------------------------------------------------
Current understanding
--------------------------------------------------

The following components appear to be correct:

- Preprocessing (log-mel generation)
- Model itself (both float and int8)
- CubeAI Studio inference

--------------------------------------------------
Problem
--------------------------------------------------

Only inference on the STM32 device is incorrect (mostly wrong predictions)

PC (float logits):
[-1.037, -1.535, -1.369, 3.610, -1.203, -1.784, -1.369, -1.494, -1.079]
PC (quantized output using STAI scale/offset):
[-14, -26, -22, 99, -18, -32, -22, -25, -15]
MCU (int8 raw output):
[-63, -25, -25, -4, -42, -2, -61, -69, -43]
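
For concreteness, I compare the three outputs in float terms like this (the scale/zero-point below are placeholders; the real values come from the generated network report):

```python
# Sketch of dequantizing both int8 outputs with the same scale/zero-point
# (OUT_SCALE / OUT_ZERO_POINT are placeholders for the values in the report).
import numpy as np

OUT_SCALE, OUT_ZERO_POINT = 0.04, 11     # placeholders

pc_float = np.array([-1.037, -1.535, -1.369, 3.610, -1.203, -1.784, -1.369, -1.494, -1.079])
pc_int8  = np.array([-14, -26, -22,  99, -18, -32, -22, -25, -15], dtype=np.int8)
mcu_int8 = np.array([-63, -25, -25,  -4, -42,  -2, -61, -69, -43], dtype=np.int8)

dequant = lambda q: OUT_SCALE * (q.astype(np.float32) - OUT_ZERO_POINT)
print("PC  dequantized:", dequant(pc_int8))    # tracks pc_float if scale/zp are right
print("MCU dequantized:", dequant(mcu_int8))   # clearly a different distribution
print("argmax PC vs MCU:", int(pc_int8.argmax()), "vs", int(mcu_int8.argmax()))
```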


--------------------------------------------------
Questions
--------------------------------------------------

1. Could this be caused by an input tensor layout mismatch? (A quick PC-side check is sketched after these questions.)
- NCHW vs NHWC
- Or an implicit transpose inside the runtime

2. In CubeAI conversion:
- Is it possible that input quantization (QuantizeLinear) is externalized?
- Could this lead to double quantization on the device?

3. In ST AI runtime:
- Is it possible that the reported tensor shape differs from the actual memory layout?

4. Are there any known typical causes for this kind of issue?
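
To make question 1 testable on the PC, I can feed the same buffer to the QDQ model under different layout interpretations and see which one reproduces the on-device behavior (a sketch; file names as above):

```python
# Sketch of a layout experiment: same data, different layout interpretations.
import numpy as np
import onnxruntime as ort

x = np.load("pc_logmel_reference.npy").reshape(1, 1, 40, 49).astype(np.float32)
sess = ort.InferenceSession("dscnn_kws_int8_qdq.onnx")
name = sess.get_inputs()[0].name

def run(a, label):
    logits = sess.run(None, {name: a})[0].ravel()
    print(f"{label:20s} argmax={int(logits.argmax())}")

run(x, "NCHW as trained")
# mel/time axes swapped in memory, then read back as (40, 49)
run(np.ascontiguousarray(x.transpose(0, 1, 3, 2)).reshape(1, 1, 40, 49), "H/W transposed")
# NHWC buffer reinterpreted as NCHW: with C = 1 this is the same memory layout,
# so it should give the same prediction; included as a sanity check
run(np.ascontiguousarray(x.transpose(0, 2, 3, 1)).reshape(1, 1, 40, 49), "NHWC reinterpreted")
```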

--------------------------------------------------
Additional note
--------------------------------------------------

Since preprocessing and inference using the same data are correct on PC,
I suspect the issue may be related to:

- Input layout
- Quantization handling
- Runtime-specific data interpretation

Any advice or suggestions would be greatly appreciated.

Thank you in advance.

1 REPLY
Julian E.
ST Employee

Hi @sakasita,

 

Regarding your questions:

1. Could this be caused by input tensor layout mismatch?
- NCHW vs NHWC
- Or implicit transpose inside the runtime

=> It is possible.

 

2. In CubeAI conversion:
- Is it possible that input quantization (QuantizeLinear) is externalized?
- Could this lead to double quantization on the device?

=> I don't think so.

 

3. In ST AI runtime:
- Is it possible that the reported tensor shape differs from the actual memory layout?

=> It should not be the case.

 

4. Are there any known typical causes for this kind of issue?

=> I think an inconsistent input shape is a good candidate.

 

Have a good day,

Julian

