2025-06-02 6:31 AM
Hello everyone,
I have followed the steps in the forum post “Solved: Deploying a deep learning model on NUCLEO-H753ZI” (Page 2, STMicroelectronics Community) to try X-CUBE-AI’s validate feature on the STM32N6570-DK board. Below is my workflow and the problem I encountered:
Development Board & Firmware Versions
Board: STM32N6570-DK
X-CUBE-AI Version: 10.1.0
ST Edge AI Core Version: 2.1.0
Models That Have Validated Successfully
I first used the simple example model provided in that forum post, and validation completed successfully and reported results.
I then tested one of ST’s official STM32N6 AI-demo models; validation completed successfully and reported the expected accuracy metrics.
New Model Under Test
My colleague provided a different, more complex model from the STM32 Model Zoo (the model file is attached at the end of this post). All earlier steps (generate, compile, and download to the board) worked fine. However, the validation step (validate --mode target) always times out with LOAD ERROR: STM32 - read timeout, yielding no validation output.
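For reference, the validate step is launched from the ST Edge AI Core CLI roughly as follows; the model file name is a placeholder and any additional options (e.g., the serial-port descriptor) are omitted here:

    stedgeai validate --model my_model.onnx --target stm32n6 --mode target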
Initially, I suspected this model’s weight/activation size was too large for RAM, causing the NPU to fail loading. But I later tested an even larger model (with greater weight + activation usage), and it validated successfully. Thus, “memory overflow” seems unlikely.
When comparing the network_generate_report files of different models, I noticed that in the “failing” model almost every epoch is marked as SW (software), whereas in the “successful” models nearly all epochs are marked as HW (hardware offload). Below is an excerpt from the failing model’s epoch report (92 total epochs, 80 executed in software):
Epochs details
------------------------------------------------------------------------------------
Total number of epochs: 92 of which 80 implemented in software

epoch ID    HW/SW/EC    Operation (SW only)
epoch 1     HW
epoch 2     -SW-        ( DequantizeLinear )
epoch 3     -SW-        ( Conv )
epoch 4     -SW-        ( Conv )
epoch 5     -SW-        ( Conv )
epoch 6     -SW-        ( Conv )
epoch 7     -SW-        ( Conv )
epoch 8     -SW-        ( Conv )
epoch 9     -SW-        ( Conv )
epoch 10    -SW-        ( Conv )
epoch 11    -SW-        ( Conv )
epoch 12    -SW-        ( QuantizeLinear )
epoch 13    -SW-        ( QuantizeLinear )
epoch 14    HW
epoch 15    -SW-        ( DequantizeLinear )
epoch 16    -SW-        ( Conv )
epoch 17    -SW-        ( Conv )
epoch 18    -SW-        ( Conv )
epoch 19    -SW-        ( Conv )
epoch 20    -SW-        ( Conv )
epoch 21    -SW-        ( Conv )
epoch 22    -SW-        ( QuantizeLinear )
epoch 23    -SW-        ( QuantizeLinear )
epoch 24    HW
epoch 25    -SW-        ( DequantizeLinear )
epoch 26    -SW-        ( Conv )
epoch 27    -SW-        ( Conv )
epoch 28    -SW-        ( Conv )
epoch 29    -SW-        ( QuantizeLinear )
epoch 30    HW
epoch 31    -SW-        ( DequantizeLinear )
epoch 32    -SW-        ( Conv )
epoch 33    -SW-        ( Conv )
epoch 34    -SW-        ( Conv )
epoch 35    -SW-        ( Conv )
epoch 36    -SW-        ( Conv )
epoch 37    -SW-        ( Conv )
epoch 38    -SW-        ( QuantizeLinear )
epoch 39    -SW-        ( QuantizeLinear )
epoch 40    HW
epoch 41    -SW-        ( DequantizeLinear )
epoch 42    -SW-        ( Conv )
epoch 43    -SW-        ( Conv )
epoch 44    -SW-        ( Conv )
epoch 45    -SW-        ( QuantizeLinear )
epoch 46    HW
epoch 47    -SW-        ( DequantizeLinear )
epoch 48    -SW-        ( Conv )
epoch 49    -SW-        ( Conv )
epoch 50    -SW-        ( Conv )
epoch 51    -SW-        ( QuantizeLinear )
epoch 52    HW
epoch 53    -SW-        ( DequantizeLinear )
epoch 54    -SW-        ( Conv )
epoch 55    -SW-        ( Conv )
epoch 56    -SW-        ( Conv )
epoch 57    -SW-        ( Conv )
epoch 58    -SW-        ( Conv )
epoch 59    -SW-        ( Conv )
epoch 60    -SW-        ( QuantizeLinear )
epoch 61    -SW-        ( QuantizeLinear )
epoch 62    HW
epoch 63    -SW-        ( DequantizeLinear )
epoch 64    -SW-        ( Conv )
epoch 65    -SW-        ( Conv )
epoch 66    -SW-        ( Conv )
epoch 67    -SW-        ( QuantizeLinear )
epoch 68    HW
epoch 69    -SW-        ( DequantizeLinear )
epoch 70    -SW-        ( Conv )
epoch 71    -SW-        ( Conv )
epoch 72    -SW-        ( Conv )
epoch 73    -SW-        ( Conv )
epoch 74    -SW-        ( Conv )
epoch 75    -SW-        ( Conv )
epoch 76    -SW-        ( Conv )
epoch 77    -SW-        ( Conv )
epoch 78    -SW-        ( QuantizeLinear )
epoch 79    HW
epoch 80    -SW-        ( DequantizeLinear )
epoch 81    -SW-        ( Conv )
epoch 82    -SW-        ( Add )
epoch 83    -SW-        ( QuantizeLinear )
epoch 84    -SW-        ( Resize )
epoch 85    -SW-        ( DequantizeLinear )
epoch 86    HW
epoch 87    -SW-        ( Conv )
epoch 88    -SW-        ( Add )
epoch 89    -SW-        ( Conv )
epoch 90    -SW-        ( QuantizeLinear )
epoch 91    -SW-        ( Resize )
epoch 92    HW
In contrast, for a model that validates successfully, almost all epochs are labeled HW, with only a handful of layers falling back to SW.
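To quantify the difference, I use a small helper like the one below (plain Python, no ST tooling; the report file name is a placeholder) that tallies HW vs. SW epochs and groups the SW epochs by operator, assuming the epoch table has been copied into a text file:

    import re
    from collections import Counter

    # Placeholder file name: a text file containing the "Epochs details" table
    # copied from the network_generate_report output.
    REPORT = "epochs_details.txt"

    hw_epochs = 0
    sw_ops = Counter()

    with open(REPORT, encoding="utf-8") as f:
        for line in f:
            # Lines look like "epoch 2   -SW-   ( DequantizeLinear )" or "epoch 1   HW".
            m = re.match(r"\s*epoch\s+(\d+)\s+(HW|-SW-)(?:\s*\(\s*(\w+)\s*\))?", line)
            if not m:
                continue
            if m.group(2) == "HW":
                hw_epochs += 1
            else:
                sw_ops[m.group(3) or "unknown"] += 1

    print(f"HW epochs: {hw_epochs}, SW epochs: {sum(sw_ops.values())}")
    for op, count in sw_ops.most_common():
        print(f"  {op}: {count}")

For the failing model this reports 12 HW epochs and 80 SW epochs, dominated by Conv, QuantizeLinear, and DequantizeLinear.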
Could the excessive number of SW-executed layers be causing the timeout during validation?
My understanding is that if a layer is marked SW, X-CUBE-AI will run that layer on the CPU, then copy intermediate activations to the NPU for subsequent HW processing. If too many layers run on SW, the pipelining is disrupted and data transfers delay the flow, leading to timeouts because the NPU never receives data in time. Is this reasoning correct?
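As a rough, purely hypothetical sanity check on this idea: if each SW epoch took on the order of 100 ms on the Cortex-M55 (an assumed figure, not a measurement), the ~80 SW epochs alone would add about 80 × 100 ms = 8 s per inference, which could easily exceed the host-side read timeout, whereas a mostly HW-mapped model should finish far faster. I have not measured per-epoch times, so this is only to illustrate why I suspect the SW epochs.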
How can I push more layers to the NPU (HW) instead of SW?
Do I need to modify the model architecture itself (e.g., fuse certain layers, adjust quantization parameters, reorder operations) so that X-CUBE-AI’s compiler can map more operations to HW? (See the requantization sketch below.)
Or does ST provide any tool/configuration (e.g., in the .mpool or user_neuralart.json files) to explicitly force certain operators onto the NPU?
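To make the “adjust quantization parameters” option above concrete, below is the kind of requantization experiment I have in mind, expressed with onnxruntime’s static quantization (QDQ format, int8 weights and activations, per-channel weights). The file names, input name/shape, and the calibration reader are placeholders, and it is only my assumption (not confirmed) that this scheme maps better to the Neural-ART NPU:

    # Sketch only: requantize the float ONNX model to an int8 QDQ graph.
    # All file names, the input name/shape, and the calibration data are placeholders.
    import numpy as np
    from onnxruntime.quantization import (
        CalibrationDataReader,
        QuantFormat,
        QuantType,
        quantize_static,
    )

    class DummyCalibrationReader(CalibrationDataReader):
        """Feeds a few random tensors; replace with real preprocessed samples."""
        def __init__(self, input_name, shape, num_samples=16):
            self._samples = iter(
                [{input_name: np.random.rand(*shape).astype(np.float32)}
                 for _ in range(num_samples)]
            )

        def get_next(self):
            return next(self._samples, None)

    quantize_static(
        model_input="model_fp32.onnx",                      # placeholder
        model_output="model_int8_qdq.onnx",                 # placeholder
        calibration_data_reader=DummyCalibrationReader("input", (1, 3, 256, 256)),
        quant_format=QuantFormat.QDQ,                       # explicit Q/DQ pairs
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        per_channel=True,                                   # per-channel weight scales
    )

If a model requantized this way still produces mostly SW epochs, that would suggest the fallback is caused by something other than the quantization scheme.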
Can I use the validate result as the definitive indicator of whether a model can run on-board?
If a model successfully completes validate --mode target and the reported “model output vs reference output” metrics (RMSE, MAE, accuracy, etc.) are acceptable, does that guarantee the final firmware will run correctly on the MCU + NPU with similar accuracy/performance?
Conversely, if validate fails or times out, does it mean the model is fundamentally un-runnable on this board/firmware combination, or are there “workarounds” to fix it?
How should I edit the .mpool file—especially RAM bank assignments (AXISRAM1, AXISRAM2, npuRAM3–npuRAM6)?
In ST’s official example projects, some do not use AXISRAM1 or AXISRAM2 at all (all activations go to NPU SRAM banks). Others use AXISRAM2 but not AXISRAM1. Yet others distribute activations across cpuRAM2 (CPU SRAM) and npuRAMx.
What criteria does ST use to decide which RAM bank holds which activation buffers? If I want to manually adjust .mpool, should I base it on each layer’s memory footprint or concurrency? Is there a reference document or guidelines for this assignment?
Why does having so many SW-executed layers lead to validation failure?
How can I confirm that “too many SW epochs” is indeed the root cause? What else might cause a timeout?
How to force more layers onto NPU hardware?
What modifications are typically required in the model (layer fusion, quantization tweaks, etc.) to improve HW offloading?
Does X-CUBE-AI provide any graph-optimization or hardware-mapping presets?
Validity of validate as a proxy for deployability
If validate passes with high similarity metrics, is it safe to assume the final deployed model will work?
If validate fails, must I abandon that model, or is there guidance on “how to fix it”?
Guidelines for .mpool assignments (RAM bank usage)
How does ST decide in example projects which CPU or NPU RAM banks to use?
If a given model’s activations overflow certain NPU RAM banks but CPU RAM has free space, how can I direct some activations to CPU RAM in .mpool?
Thank you in advance for your insights and recommendations! I look forward to learning from your experiences.