2025-08-24 6:37 AM
Dear all,
I'm converting a quantized ONNX model using the STEdgeAI Core CLI, for use on the STM32N6 NPU. However, I cannot explain the very high RAM usage for the activations, as reported by stedgeai.
The network is a simple CNN with 10 convolution layers each followed by a ReLU. I have attached the original (not quantized) model to this post (mvs_model.onnx, see also image below) as well as the quantized version (mvs_model.quant.onnx). Quantization was performed using the ONNXRuntime tools.
When generating code for the N6 NPU (using "stedgeai generate --target stm32n6 --st-neural-art O3@neural_art.json --model mvs_model.quant.onnx --input-data-type uint8 --name mvs" with neural_art.json and stm32n6.mpool as attached) the CLI reports crazy high RAM usage for the activations (full output attached as stedgeai_output.txt):
Requested memory size by section - "stm32n6npu" target
------------------------------- ------- ------------ ------ ------------
module                             text       rodata   data          bss
------------------------------- ------- ------------ ------ ------------
mvs.o                                54       83,203      0            0
NetworkRuntime1020_CM55_GCC.a         0            0      0            0
lib (toolchain)*                      0            0      0            0
ll atonn runtime                  3,044        2,277      0           13
------------------------------- ------- ------------ ------ ------------
RT total**                        3,098       85,480      0           13
------------------------------- ------- ------------ ------ ------------
weights                               0   10,060,864      0            0
activations                           0            0      0   22,883,328
io                                    0            0      0            0
------------------------------- ------- ------------ ------ ------------
TOTAL                             3,098   10,146,344      0   22,883,341
------------------------------- ------- ------------ ------ ------------
However, manually calculating the activation memory usage, I would estimate it to be only around 2.5 MB (instead of nearly 23 MB!). Here, I assume the activation buffers of previous layers are reused once they are no longer needed by subsequent layers, and also that each activation takes 1 byte (since the model is 8-bit quantized). I know that some NPUs (not sure about the Neural-ART) need some additional scratch memory, and also that STEdgeAI inserts padding nodes (see intermediate model graph: mvs_model.quant_OE_3_3_0.onnx) that increase activation memory usage slightly, but none of this can explain this huge memory consumption.
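For reference, here is roughly the kind of estimate I mean, as a minimal Python sketch (this is just my assumption about buffer reuse, not how STEdgeAI actually allocates): it runs ONNX shape inference on the attached mvs_model.quant.onnx, counts 1 byte per activation element, assumes only the input/output tensors of the node currently executing must stay resident, and ignores NPU scratch buffers and the inserted padding nodes:

    import numpy as np
    import onnx
    from onnx import shape_inference

    # Infer shapes of all intermediate tensors of the quantized model
    model = shape_inference.infer_shapes(onnx.load("mvs_model.quant.onnx"))
    inits = {init.name for init in model.graph.initializer}  # weights, scales, zero points

    def numel(vi):
        # Number of elements of a tensor, treating unknown/dynamic dims as 1
        dims = [d.dim_value for d in vi.type.tensor_type.shape.dim]
        return int(np.prod([d if d > 0 else 1 for d in dims]))

    # Size in bytes of every activation tensor (1 byte per element, 8-bit quantized)
    tensors = list(model.graph.value_info) + list(model.graph.input) + list(model.graph.output)
    sizes = {vi.name: numel(vi) for vi in tensors if vi.name not in inits}

    # Peak = largest sum of activations that must be alive while one node executes
    peak = 0
    for node in model.graph.node:
        live = sum(sizes.get(t, 0) for t in list(node.input) + list(node.output) if t not in inits)
        peak = max(peak, live)

    print(f"Estimated peak activation memory: {peak / 1e6:.2f} MB")

For a plain chain of Conv+ReLU layers this boils down to the largest input+output pair of consecutive tensors, which is where my ~2.5 MB comes from.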
Does anybody have insights into how STEdgeAI-Core allocates the activations, and how one can debug such inexplicably high memory usage? I guess I'm probably just missing some configuration option here, since this is a pretty standard, small-ish CNN...
Thanks a lot, Michael
2025-08-26 7:52 AM
Hello @asdfasdf,
Your reasoning makes sense.
Predicting the activation memory is challenging, both in SW and in HW, since there are a lot of optimizations that affect it.
On the other hand, I found something strange.
The activations of your quantized model are twice as big as those of your non-quantized model. I don't see why that would be the case.
I opened a bug report internally, as this may well be one.
I'll update you when I get news from the dev team.
Have a good day,
Julian
2025-08-27 6:48 AM
Thanks @Julian E. for your reply - I was already starting to question my own sanity :D As you say, I also find it very difficult and frustrating to optimize the memory usage of models for the Neural-ART. I opened another question here to collect tips and tricks regarding this, and I'd love to get your input there as well!
Best, Michael
2025-09-08 6:42 AM
Hello @asdfasdf,
Sorry for the delay, but I am still looking for an answer internally.
Basically, here is the comment from our team:
It would be interesting to know how the theoretical value has been computed. If it is actually correct, then there are probably some errors/bugs in how the tool is invoked. Otherwise, if the model is really too large for the system, there is not much we can do, i.e., a model cannot be shrunk indefinitely to fit in the memory.
Could you explain how you computed the 2.5 MB?
For your other post, I'll answer when I get more info.
Have a good day,
Julian