Dear all,
I'm in the process of converting a convolutional neural network from quantized ONNX format to run on the STM32N6 Neural ART NPU, using the STEdgeAI tool. Our custom hardware has no external RAM, so everything has to fit in the NPU's internal RAM (AXISRAM3 to AXISRAM6). However, I have found it frustratingly difficult to achieve this with the STEdgeAI CLI: with just the default options, the generated model takes roughly twice as much RAM as would theoretically be needed to store the activations (and in some extreme cases even 10x as much).
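For reference, this is roughly how I invoke the tool at the moment (the model file, profile name, and JSON file name are placeholders for our actual project files, and I may be misremembering minor details of the option syntax):

```
# Sketch of my current invocation -- file names and the profile name are
# placeholders, not our real project files.
stedgeai generate \
    --model my_model_int8.onnx \
    --target stm32n6 \
    --st-neural-art my_profile@neural_art_profile.json \
    --output ./generated
```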
So my question is: are there any tips, tricks, tools, or other command-line options that are useful when trying to squeeze a model into the small internal RAM? The out-of-the-box STEdgeAI CLI options seem to prioritize execution speed over RAM usage, which is of course unfortunate if it means the model cannot run at all due to RAM restrictions.
In particular:
- Is there a mode where one gives STEdgeAI (or the Neural ART compiler) an upper limit on memory usage, and it then finds the optimal configuration while obeying that limit? I know there is an auto mode, but it happily increases RAM usage, sometimes by orders of magnitude. I have already figured out that using "--Omax-ca-pipe 1 --optimizations 0" generally results in less RAM usage (see the profile sketch after this list).
- Sometimes even decreasing the number of features in the CNN actually increases memory usage, probably because some heuristic in the compiler kicks in and decides to split the work up differently. Is there a way to turn all these optimization passes off in order to get predictable behavior? With the current behavior it is very difficult to tune our CNN architecture for memory usage.
- Looking at the memory graph in X-CUBE-AI, it seems that the allocation of memory buffers is not globally optimal, i.e. smaller intermediate buffers could be placed differently, which would sometimes yield quite significant memory savings. Is it correct that the STEdgeAI/atonn compiler does not perform globally optimal buffer allocation? From the graph in X-CUBE-AI it appears to follow a simple greedy heuristic, which obviously leads to suboptimal results. If this is indeed the case, can I manually specify buffer locations to make this work?
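For completeness, this is the kind of Neural ART profile I pass via --st-neural-art, with the compiler options from the first bullet above. I'm reconstructing the field names from memory, so please treat this as a sketch rather than a verbatim copy of the documented format; the mpool file name is just a placeholder for our own pool description restricted to AXISRAM3-AXISRAM6:

```
# Sketch only: the JSON field names are reconstructed from memory and the
# mpool path is a placeholder -- check the STEdgeAI documentation for the
# exact schema.
cat > neural_art_profile.json <<'EOF'
{
  "Profiles": {
    "my_profile": {
      "memory_pool": "./mpools/stm32n6_internal_only.mpool",
      "options": "--Omax-ca-pipe 1 --optimizations 0"
    }
  }
}
EOF
```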
I'm grateful for any useful advice. The N6 NPU is really a great piece of work, but the process of fitting a model into RAM has turned out to be pretty painful so far (compared to other NPUs/TPUs I've used previously).
Thanks, Michael