Dear all,
I'm in the process of converting a convolutional neural network from quantized ONNX format to run on the STM32N6 Neural ART NPU, using the STEdgeAI tool. Our custom hardware has no external RAM, so everything has to fit in the NPU's internal RAM (AXISRAM3 to AXISRAM6). However, I have found it frustratingly difficult to achieve this with the STEdgeAI CLI: with just the default options, the generated model takes roughly twice as much RAM as would theoretically be needed to store the activations (and in some extreme cases even 10x as much).
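For reference, this is roughly how I invoke the tool at the moment (the model file, profile name, and JSON file name are placeholders for our actual project files, and I may be misremembering minor details of the option syntax):

```
# Sketch of my current invocation -- file names and the profile name are
# placeholders, not our real project files.
stedgeai generate \
    --model my_model_int8.onnx \
    --target stm32n6 \
    --st-neural-art my_profile@neural_art_profile.json \
    --output ./generated
```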
So my question is: are there any tips, tricks, tools, or other command-line options that are useful when trying to squeeze a model into the small internal RAM? The out-of-the-box STEdgeAI CLI options seem to prioritize execution speed over RAM usage, which is of course unfortunate if it means the model cannot run at all due to RAM restrictions.
In particular:
- Is there a mode where one gives STEdgeAI (or the Neural ART compiler) an upper limit on memory usage, and it then finds the optimal configuration while obeying that limit? I know there is an auto mode, but it happily increases RAM usage, sometimes by orders of magnitude. I have already figured out that using "--Omax-ca-pipe 1 --optimizations 0" generally results in less RAM usage (see the profile sketch after this list).
- Sometimes even decreasing the number of features in the CNN actually increases memory usage, probably because some heuristic in the compiler kicks in and decides to split the work up differently. Is there a way to turn all these optimization passes off in order to get predictable behavior? With the current behavior it is very difficult to tune our CNN architecture for memory usage.
- Looking at the memory graph in X-CUBE-AI, it seems that the allocation of memory buffers is not globally optimal, i.e. smaller intermediate buffers could be placed differently, which would sometimes yield quite significant memory savings. Is it correct that the STEdgeAI/atonn compiler does not perform globally optimal buffer allocation? From the graph in X-CUBE-AI it appears to follow a simple greedy heuristic, which obviously leads to suboptimal results. If this is indeed the case, can I manually specify buffer locations to make this work?
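For completeness, this is the kind of Neural ART profile I pass via --st-neural-art, with the compiler options from the first bullet above. I'm reconstructing the field names from memory, so please treat this as a sketch rather than a verbatim copy of the documented format; the mpool file name is just a placeholder for our own pool description restricted to AXISRAM3-AXISRAM6:

```
# Sketch only: the JSON field names are reconstructed from memory and the
# mpool path is a placeholder -- check the STEdgeAI documentation for the
# exact schema.
cat > neural_art_profile.json <<'EOF'
{
  "Profiles": {
    "my_profile": {
      "memory_pool": "./mpools/stm32n6_internal_only.mpool",
      "options": "--Omax-ca-pipe 1 --optimizations 0"
    }
  }
}
EOF
```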
I'm grateful for any useful advice. The N6 NPU is really a great piece of work, but the process of fitting a model into RAM has turned out to be pretty painful so far (compared to other NPUs/TPUs I've used previously).
Thanks, Michael