2026-03-30 4:34 AM
Hello ST Community,
I am currently working with the STM32N6 and CubeAI, and I would appreciate your guidance on an on-device inference issue.
I am integrating a custom KWS (keyword spotting) model into the STM32N6-GettingStarted-Audio project, but the inference results on the device are incorrect.
I have carefully verified the preprocessing and model behavior as described below.
--------------------------------------------------
Model
--------------------------------------------------
Architecture: DS-CNN (PyTorch)
Input tensor:
[B, C, H, W] = [1, 1, 40, 49]
H: mel bins
W: time frames
--------------------------------------------------
Verified points
--------------------------------------------------
1. Log-mel preprocessing alignment
- Log-mel spectrogram generated on STM32 was dumped using GDB
- Compared with Python implementation
Result:
Nearly identical (very small numerical differences)
2. PC inference with int8 model
- Used dumped MCU log-mel as input
- Ran inference with quantized (QDQ) ONNX model
Result:
Correct classification
3. CubeAI Studio inference
- Input: float32 log-mel (.npy format)
- Shape: (256, 1, 40, 49)
Result:
Correct classification
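For reference, the PC-side check in points 1 and 2 can be sketched as below. This is a minimal illustration only: the synthetic arrays stand in for the GDB-dumped MCU log-mel and the Python reference, and the tolerance is a placeholder, not the actual measured difference.

```python
import numpy as np

# Sketch of the log-mel comparison from point 1. In practice, mcu would be
# the tensor dumped from the STM32 via GDB and ref the Python-side log-mel;
# here synthetic data stands in for both (an assumption for illustration).
def max_abs_diff(mcu: np.ndarray, ref: np.ndarray) -> float:
    assert mcu.shape == ref.shape, f"shape mismatch: {mcu.shape} vs {ref.shape}"
    return float(np.max(np.abs(mcu - ref)))

rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 1, 40, 49)).astype(np.float32)
mcu = ref + np.float32(1e-4) * rng.standard_normal((1, 1, 40, 49)).astype(np.float32)

diff = max_abs_diff(mcu, ref)
close = diff <= 1e-2  # "nearly identical" threshold is a placeholder
```

The dumped tensor, verified this way, can then be fed directly to the quantized ONNX model with ONNX Runtime, as in point 2.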
--------------------------------------------------
Input data consistency
--------------------------------------------------
- The input PCM data was pre-recorded using the MEMS microphone on the STM32N6 evaluation board
- The same PCM data is used consistently across:
- MCU preprocessing
- PC-side log-mel generation
- ONNX Runtime inference
Therefore, differences due to microphone characteristics or recording conditions are excluded.
--------------------------------------------------
Additional implementation details
--------------------------------------------------
- The C preprocessing code was generated using the official GenHeaders.py scripts as a baseline
- I updated the preprocessing parameters to match my training pipeline:
- Replaced the mel filter bank coefficients (exported from PyTorch)
- Set hop_length = 320
- Set fmin = 50
- Set fmax = 7500
- These parameters are identical to those used during training in Python
- The resulting log-mel spectrogram on STM32 closely matches the PC implementation
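As a sanity check, the frame geometry implied by these parameters can be verified against the model's W = 49. In the sketch below, hop_length/fmin/fmax come from the post; the sample rate, FFT size, and clip length are assumptions (not stated in the post), chosen as a typical 16 kHz KWS front end:

```python
# Sanity check of the STFT frame geometry implied by the parameters above.
sample_rate = 16000      # assumption: not stated in the post
n_fft = 640              # assumption: 40 ms window at 16 kHz
hop_length = 320         # from the training pipeline (20 ms hop)
fmin, fmax = 50, 7500    # from the training pipeline
n_mels = 40              # model input H
n_samples = sample_rate  # assumption: 1 s clip

win_ms = 1000 * n_fft / sample_rate
# Without centering, the STFT frame count is 1 + (n_samples - n_fft) // hop,
# which should equal the model's time dimension W = 49:
n_frames = 1 + (n_samples - n_fft) // hop_length
```

If the MCU framing uses centering (padding) while the training pipeline does not, or vice versa, the frame count and alignment shift even when all other parameters match.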
--------------------------------------------------
Current understanding
--------------------------------------------------
The following components appear to be correct:
- Preprocessing (log-mel generation)
- Model itself (both float and int8)
- CubeAI Studio inference
--------------------------------------------------
Problem
--------------------------------------------------
Only inference on the STM32 device is incorrect (mostly wrong predictions)
PC (float logits)
[-1.037, -1.535, -1.369, 3.610, -1.203, -1.784, -1.369, -1.494, -1.079]
PC (quantized output using STAI scale/offset)
[-14, -26, -22, 99, -18, -32, -22, -25, -15]
MCU (int8 raw output)
[-63, -25, -25, -4, -42, -2, -61, -69, -43]
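These three rows are instructive on their own. Back-solving the affine int8 mapping q = round(x / scale) + zero_point from the float and PC-quantized rows gives roughly scale ≈ 0.041 and zero_point ≈ 11 (an inference from the numbers above, not a value read from CubeAI), which reproduces the PC row exactly. The MCU row, by contrast, cannot be any positive-scale affine remap of the float logits, because its argmax differs:

```python
import numpy as np

# Output rows copied from the post.
float_logits = np.array([-1.037, -1.535, -1.369, 3.610, -1.203,
                         -1.784, -1.369, -1.494, -1.079])
pc_int8  = np.array([-14, -26, -22, 99, -18, -32, -22, -25, -15])
mcu_int8 = np.array([-63, -25, -25, -4, -42, -2, -61, -69, -43])

def quantize(x, scale, zero_point):
    """Affine int8 quantization: q = clip(round(x / scale) + zp, -128, 127)."""
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

# scale / zero_point are back-solved from the rows above (an assumption);
# replace them with the values CubeAI actually reports for the output tensor.
scale, zero_point = 0.0411, 11

reproduces_pc = bool(np.array_equal(quantize(float_logits, scale, zero_point), pc_int8))
# A positive-scale affine map preserves the argmax, so the MCU row (argmax 5)
# cannot be a mere re-quantization of the float logits (argmax 3).
argmax_differs = bool(np.argmax(mcu_int8) != np.argmax(float_logits))
```

So the output quantization parameters are consistent between PC and the model; the divergence happens before or inside the on-device inference, not in the output interpretation.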
--------------------------------------------------
Questions
--------------------------------------------------
1. Could this be caused by input tensor layout mismatch?
- NCHW vs NHWC
- Or implicit transpose inside the runtime
2. In CubeAI conversion:
- Is it possible that input quantization (QuantizeLinear) is externalized?
- Could this lead to double quantization on the device?
3. In ST AI runtime:
- Is it possible that the reported tensor shape differs from the actual memory layout?
4. Are there any known typical causes for this kind of issue?
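On question 1, one detail can be checked with a quick experiment: because the model has C = 1, the flat memory layouts of NCHW (1, 1, 40, 49) and NHWC (1, 40, 49, 1) are byte-identical, so a pure NCHW/NHWC flip alone cannot scramble this particular input; a mel/time (H/W) swap, however, does. A small numpy sketch with synthetic data (an illustration of the layout arithmetic, not of the ST runtime's actual behavior):

```python
import numpy as np

# Synthetic stand-in for the (1, 1, 40, 49) log-mel input tensor.
x_nchw = np.arange(1 * 1 * 40 * 49, dtype=np.float32).reshape(1, 1, 40, 49)

# Same bytes reinterpreted as NHWC: with C == 1 this is a no-op, so an
# NCHW/NHWC mismatch by itself cannot explain scrambled data here.
x_as_nhwc = x_nchw.reshape(1, 40, 49, 1)
nchw_nhwc_identical = bool(np.array_equal(x_as_nhwc.transpose(0, 3, 1, 2), x_nchw))

# Reading the same buffer with H and W (mel and time) swapped, however,
# rearranges almost every element:
x_hw_swapped = x_nchw.reshape(1, 1, 49, 40)
hw_swap_scrambles = not np.array_equal(x_hw_swapped.transpose(0, 1, 3, 2), x_nchw)
```

So if a layout issue is the cause, it is more likely a 40-vs-49 (mel-vs-time) stride mismatch between the C preprocessing output and the runtime's expected input than an N/C/H/W reordering.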
--------------------------------------------------
Additional note
--------------------------------------------------
Since preprocessing and inference using the same data are correct on PC,
I suspect the issue may be related to:
- Input layout
- Quantization handling
- Runtime-specific data interpretation
Any advice or suggestions would be greatly appreciated.
Thank you in advance.
2026-04-07 5:46 AM
Hi @sakasita,
regarding your questions:
1. Could this be caused by input tensor layout mismatch?
- NCHW vs NHWC
- Or implicit transpose inside the runtime
=> It is possible.
2. In CubeAI conversion:
- Is it possible that input quantization (QuantizeLinear) is externalized?
- Could this lead to double quantization on the device?
=> I don't think so
3. In ST AI runtime:
- Is it possible that the reported tensor shape differs from the actual memory layout?
=> It should not be
4. Are there any known typical causes for this kind of issue?
=> I think an inconsistent input shape is a good candidate.
Have a good day,
Julian