
STM32N6x7 Memory Allocation For ISP, NPU and VENC

Hi,

We're working on integrating the STM32N6 Getting Started Object Detection and the VENC SDCard examples and were wondering if we could ask some advice on memory allocation?

The Object Detection example defines AXISRAM1_S and I think uses this for the camera interface and ISP - if the RAM required by the VENC is added to this then AXISRAM1_S seems to be insufficient and the linker fails.

In the examples, the amount of SRAM available for the application seems to be limited by the position of SRAM2 at 0x3418 0000, which is used by the FSBL and then later for NN activations:

https://github.com/STMicroelectronics/STM32N6-GettingStarted-ObjectDetection/blob/main/Doc/Boot-Overview.md#boot-from-flash-with-first-stage-boot-loader

this states "STM32N6570-DK: 1MB of SRAM1 is reserved for the User App (see STM32N657xx.ld) and 1MB of SRAM2 is reserved for the network activations (see stm32n6-app2_STM32N6570-DK.mpool)."

More detail of the higher sections of memory is given in:

https://github.com/STMicroelectronics/STM32N6-GettingStarted-ObjectDetection/blob/main/Model/my_mpools/stm32n6-app2_STM32N6570-DK.mpool

which sets out:

AXISRAM2 cpuRAM2 1024 K,

AXISRAM3 npuRAM3 448 K,

AXISRAM4 npuRAM4 448 K,

AXISRAM5 npuRAM5 448 K,

AXISRAM6 npuRAM6 448K

giving a total of 1024 K of CPU SRAM and 1792 K of NPU SRAM defined in the .mpool file.

I thought I'd ask advice on whether it's possible to use the VENC and have sufficient memory for meaningful NN inference and - if so - how SRAM should be allocated to make enough SRAM available for the VENC and NPU within the 4.2 MB of internal SRAM? This is the first time that I've done this so I thought that it was best to ask advice rather than guessing the best way forward.

https://github.com/svogl/STM32N6-GettingStarted-ObjectDetection/blob/main/Application/STM32N6570-DK/STM32CubeIDE/STM32N657xx.ld defines a region AXISRAM1_S which contains 1.023 MB of the STM32N6x7's 4.2 MB of internal RAM, which I tried increasing to 1.536 MB:

/* Memories definition */
MEMORY
{
AXISRAM1_S (xrw) : ORIGIN = 0x34000400, LENGTH = 1536K
PSRAM (xrw) : ORIGIN = 0x91000000, LENGTH = 16M
}
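A quick way to sanity-check a change like this is to compute where the enlarged region ends and compare it with the next region in use. The sketch below does that arithmetic for the values quoted in this post. It assumes the activation pool starts at 0x34180000 (the address mentioned above) and is 1024 K long (the cpuRAM2 size from the .mpool); the type and function names are made up for illustration and the real memory map should be checked against the reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A memory region described the way a linker script does: ORIGIN and LENGTH. */
typedef struct {
    uint32_t origin;
    uint32_t length;
} region_t;

/* First address past the end of the region (exclusive end). */
static uint32_t region_end(region_t r)
{
    assert(r.length <= UINT32_MAX - r.origin); /* no 32-bit wrap-around */
    return r.origin + r.length;
}

/* True when the half-open ranges [origin, end) share at least one byte. */
static bool regions_overlap(region_t a, region_t b)
{
    return a.origin < region_end(b) && b.origin < region_end(a);
}

/* From the post: the enlarged AXISRAM1_S region in the linker script. */
static const region_t axisram1_s = { 0x34000400u, 1536u * 1024u };

/* Assumption: start address quoted in the post, length taken from cpuRAM2. */
static const region_t activations = { 0x34180000u, 1024u * 1024u };
```

With these numbers the enlarged region ends at 0x34180400, so under the stated assumptions it would run 0x400 bytes into the area starting at 0x34180000; a LENGTH of 1535K would end exactly at that boundary.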

The two examples are:

https://github.com/STMicroelectronics/STM32N6-GettingStarted-ObjectDetection

https://github.com/STMicroelectronics/STM32CubeN6/tree/main/Projects/STM32N6570-DK/Applications/VENC/VENC_SDCard

and the repository where we're integrating them is:

https://github.com/svogl/STM32N6-GettingStarted-ObjectDetection - the integration is in a submodule on the following branch:

git clone https://github.com/svogl/STM32N6-GettingStarted-ObjectDetection
cd STM32N6-GettingStarted-ObjectDetection/
git checkout feature/video-enc
git submodule update --init

It's for an open source wildlife camera for conservation research and we'd like to both write encoded video to SD storage and analyse it via a CNN running on the NPU if possible:

https://new-homes-for-old-friends.cairnwater.com/quick-summary-new-homes-for-old-friends-belgium/

Wishing you all a great Christmas!

Many thanks,

Will

exarian
Associate III

Hi @Will_Robertson ,

Season's greetings! Thanks for sharing the video; those creatures are super cute and look like mice!

Glad to see you are enjoying working with the video encoder and the NPU. I know it's a juggle getting everything working together!

My understanding of the .mpool file is that you are basically telling the `stedgeai` tool "this is my available memory that I am letting the NPU use". If you want to increase the application's LENGTH, of course you can; just adjust the .mpool accordingly so that it does not use that region.

AXISRAM1_S (xrw) : ORIGIN = 0x34000400, LENGTH = 1536K

More than likely they left it at 1536K as there is 16M of external RAM

Or make sure the CPU doesn't use that RAM region while an inference is running.

The NPU memory could be used for other things, as long as an inference is not happening.
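One way to enforce that kind of time-sharing is a small ownership flag that both the inference code and the other user of the buffer must take before touching the shared region. This is only a sketch: the names are hypothetical, it is not part of any ST library, and on real hardware it would also need cache maintenance and a check that any DMA or NPU activity has actually finished before the region changes hands.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Who currently owns the shared RAM region (hypothetical names). */
typedef enum { OWNER_NONE = 0, OWNER_CPU = 1, OWNER_NPU = 2 } ram_owner_t;

static atomic_int shared_ram_owner = OWNER_NONE;

/* Try to claim the region; succeeds only if nobody owns it right now. */
static bool shared_ram_try_acquire(ram_owner_t who)
{
    int expected = OWNER_NONE;
    return atomic_compare_exchange_strong(&shared_ram_owner, &expected, (int)who);
}

/* Give the region back; only the current owner is allowed to release it. */
static void shared_ram_release(ram_owner_t who)
{
    int expected = (int)who;
    bool released = atomic_compare_exchange_strong(&shared_ram_owner,
                                                   &expected, OWNER_NONE);
    assert(released); /* releasing a region we do not own is a bug */
    (void)released;
}
```

The inference task would call shared_ram_try_acquire(OWNER_NPU) before starting a run and release afterwards; any CPU code reusing the same RAM would do the same with OWNER_CPU and skip or defer its work when the acquire fails.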

Try adjusting the .mpool and see what happens. Either `stedgeai` will complain that it can't do its magic to use (and reuse) that RAM buffer, or it will make do with what it's been provided. More than likely this will just impact the inference time, as less memory means more operations and steps, or perhaps it will use memory which has a slower interface.

It all depends what model you want to end up using. For example with the object detection model, here are two examples of .mpool edits and the outputs:

exarian_0-1767199424785.png

Notice above it used more of the hyperRAM as some RAM regions were set to zero.

exarian_1-1767199524791.png

Here the hyperRAM was set to zero, and the RAM3 region was set to zero. The linker script could then be adjusted to use that RAM3 region and allocate it to CPU buffers, i.e. remap the .noncacheable region or assign a new region.

exarian_2-1767203325158.png

I think this should do the trick?

Hi @exarian 

Thank you very much! It's great having your input - the STM32N6x7 is very advanced and has a lot of new functionality compared to earlier MCUs so it's taking us a lot of time and thought to fully understand it.

I've been thinking about what you said about how internal SRAM is allocated to the NPU - since we're still working on training the NNs and we don't know yet how much memory will be required by the NNs it might be an option to leave the internal SRAM regions allocated to NPU and CPU as they are at the moment and put the buffers for the VENC in external PSRAM?

It looks like the external PSRAM should be fast enough to handle the video buffers (the most difficult task seems to be storing the reference frame for the VENC - the luma data from that seem to be read once, the chroma data read twice and the whole frame written once for each frame encoded):

https://github.com/William-Robert-Robertson/WildCamera/blob/main/Candidate%20Technologies/STM32N6x7_MCU/VENC_Video_Encoder.md#st-measured-memory-footprint-for-venc
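The access pattern described above can be turned into a rough bandwidth estimate. The sketch below assumes a YUV 4:2:0 reference frame (luma is width × height bytes, chroma is half of that), which is an assumption not stated in the thread, and the function names are invented for illustration. It counts only reference-frame traffic, not input frames or the output bitstream.

```c
#include <assert.h>
#include <stdint.h>

/* Reference-frame traffic for one encoded frame, in bytes.
 * Assumes YUV 4:2:0: luma = w*h bytes, chroma = w*h/2 bytes. */
static uint64_t ref_frame_bytes_per_frame(uint32_t width, uint32_t height)
{
    assert(width % 2 == 0 && height % 2 == 0); /* 4:2:0 needs even sizes */
    uint64_t luma   = (uint64_t)width * height;
    uint64_t chroma = luma / 2;
    uint64_t reads  = luma + 2 * chroma; /* luma read once, chroma twice */
    uint64_t writes = luma + chroma;     /* whole frame written once */
    return reads + writes;
}

/* Sustained reference-frame traffic in bytes per second. */
static uint64_t ref_frame_bytes_per_second(uint32_t width, uint32_t height,
                                           uint32_t fps)
{
    return ref_frame_bytes_per_frame(width, height) * fps;
}
```

As an example with hypothetical settings, 1280 × 720 at 30 fps works out to 3,225,600 bytes per frame, or about 97 MB/s of reference-frame traffic, which is the figure to compare against the measured PSRAM throughput.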

It looks like the amount of RAM needed by the VENC buffers and reference frame would leave very little SRAM for the NPU if they were placed in SRAM?

I'd looked at the possibility of keeping everything in SRAM but the power consumption of the SRAM in self-refresh seems to be much higher than of a PSRAM chip (for example, the PSRAM chip used in the DK board) in self-refresh so it seems well worth having the PSRAM there as a form of low-power pseudostatic storage.

Thank you for mentioning hyperRAM - that let me find the definition in the network.c file:

/* global pool 7 is ? */

/* index=7 file postfix=xSPI1 name=hyperRAM offset=0x90000000 absolute_mode size=16777208 READ_WRITE 

There seems to be the possibility of having 2 rather than 1 reference frame so that the VENC can handle loss of a frame and comply with HRD (Hypothetical Reference Decoder) - that seems relevant when encoded video is to be passed over a network where a frame might get lost or delayed and so have to be discarded but not for our application where encoded frames would only be written to local storage - not over a potentially lossy network.

It looks like this example doesn't have a .ioc file, so the memory and clock structure weren't planned using CubeMX and can't be adjusted using CubeMX. I'm not sure whether to adjust them manually or to try to re-do them entirely in CubeMX and then amend them using X-CUBE-AI - that might make future editing easier.

Will

A quick update: It looks like we have to implement EWL_USER_MM to place the reference frame and frame buffers in external PSRAM instead - I've tried to document it fully here:

https://github.com/William-Robert-Robertson/WildCamera/blob/main/Candidate%20Technologies/STM32N6x7_MCU/Thanks_For_The_Memory.md

Is there any guidance on implementing EWL_USER_MM?

It looks like ST's bare metal VENC example just uses malloc for the reference frame and the buffers for the VENC. That will use up a lot of the two 1024 KB banks of internal SRAM that are available to the MPU on the 400 MHz internal bus.

The solution seems to be to implement EWL_USER_MM to put the reference frame and buffers for the VENC on external PSRAM instead. External PSRAM should be fast enough for this.
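A minimal sketch of the kind of allocator that could sit behind such a hook is a bump allocator over a fixed memory range. Everything here is an assumption for illustration: the names are hypothetical, the exact functions that EWL_USER_MM expects are not shown in this thread, and on the target the pool base would be an address inside the PSRAM region from the linker script rather than an ordinary array. Alignment should match whatever the encoder and cache require.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A fixed pool handed out front to back; nothing is freed individually. */
typedef struct {
    uint8_t *base; /* start of the pool (PSRAM on the target) */
    size_t   size; /* total bytes in the pool */
    size_t   used; /* bytes handed out so far, including padding */
} bump_pool_t;

static void bump_pool_init(bump_pool_t *p, void *base, size_t size)
{
    p->base = (uint8_t *)base;
    p->size = size;
    p->used = 0;
}

/* Return an aligned block of `size` bytes, or NULL if the pool is full. */
static void *bump_pool_alloc(bump_pool_t *p, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    uintptr_t start   = (uintptr_t)p->base + p->used;
    uintptr_t aligned = (start + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t    padding = (size_t)(aligned - start);
    size_t    left    = p->size - p->used;

    if (padding > left || size > left - padding) {
        return NULL;
    }
    p->used += padding + size;
    return (void *)aligned;
}

/* Release everything at once, e.g. when the encoder instance is closed. */
static void bump_pool_reset(bump_pool_t *p)
{
    p->used = 0;
}
```

A bump allocator fits this case because the reference frame and encoder buffers are typically allocated once at start-up and live for the whole encoding session, so per-block free is not needed.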

Thank you very much for your help!