Slower access to SRAM2 and CCM SRAM than SRAM1 on STM32G473

treguy · ‎2026-02-16

Hi,

I have an STM32G473 with 120 KB of RAM, running at 169.75 MHz.
I have some image data stored starting at 0x2000_0000, which consists of a palette of 235 colors in 00rrggbb format (1 word per color), followed by 304 lines of 370 pixels (one byte each, used as an index into the palette to get the RGB value).
The data are copied to RAM to avoid wait states when loading them from flash, which means most of it is in SRAM1, but the last 85 lines are in SRAM2 and CCM SRAM. The code is in flash.
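For reference, the layout arithmetic behind that split can be sketched as below. This is only an illustration: the 80 KB SRAM1 size is an assumption taken from the RM0440 memory map for this device category, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SRAM1 size for the STM32G473 (check RM0440); SRAM1 starts at 0x2000_0000. */
#define SRAM1_SIZE      (80u * 1024u)

/* Image layout as described in the post. */
#define PALETTE_ENTRIES 235u
#define LINE_COUNT      304u
#define LINE_BYTES      370u

/* Byte offset of the first pixel, right after the palette (one word per color). */
static uint32_t pixel_offset(void)
{
    return PALETTE_ENTRIES * (uint32_t)sizeof(uint32_t);
}

/* Total footprint of palette plus pixels, in bytes. */
static uint32_t image_bytes(void)
{
    return pixel_offset() + LINE_COUNT * LINE_BYTES;
}

/* Number of pixel lines that lie entirely inside SRAM1. */
static uint32_t lines_fully_in_sram1(void)
{
    assert(pixel_offset() < SRAM1_SIZE);
    return (SRAM1_SIZE - pixel_offset()) / LINE_BYTES;
}
```

Under these assumptions, 218 lines fit entirely in SRAM1, one line straddles the SRAM1 boundary, and the remaining 85 lines lie entirely beyond it.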

The first lines of data display correctly, but the last 85 lines get one extra cycle per pixel (most probably due to reading the data).
I could not find anything in RM0440 that explains where this one-cycle penalty comes from. The only peripherals I use in addition to the SRAMs and flash (for the code) are the DACs and one GPIO (the GPIO is not accessed during the loop that sends the pixels and gets the penalty).

Can someone help me understand this?

treguy · ‎2026-02-16

OK, I think I partly understood:

- SRAM1 is accessible via DCode to load data, which does not have 1 cycle penalty as accessing data through System Bus does
- On the bus matrix schematic, CCM SRAM is shown as accessible via DCode, but this is only true if it is accessed at its addresses below 0x2000_0000 (starting at 0x1000_0000), which is not my case, hence it is as slow as SRAM2
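As a side note on the two address ranges involved, here is a minimal sketch of translating a CCM SRAM address between them. The constants are assumptions read from the RM0440 memory map for this device category (CCM SRAM at 0x1000_0000, with an alias at 0x2001_8000 following SRAM2) and should be checked against the manual; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for the STM32G473; verify against RM0440. */
#define CCM_SBUS_ALIAS_BASE 0x20018000u   /* alias following SRAM2 */
#define CCM_CODE_BASE       0x10000000u   /* address inside the Code region */
#define CCM_SIZE            (32u * 1024u)

/* True if addr falls inside the CCM SRAM alias in the SRAM region. */
static int is_ccm_sbus_alias(uint32_t addr)
{
    return addr >= CCM_SBUS_ALIAS_BASE &&
           (addr - CCM_SBUS_ALIAS_BASE) < CCM_SIZE;
}

/* Map an alias address to the same byte seen through the Code region. */
static uint32_t ccm_alias_to_code_region(uint32_t addr)
{
    assert(is_ccm_sbus_alias(addr));
    return addr - CCM_SBUS_ALIAS_BASE + CCM_CODE_BASE;
}
```

Whether reading through the Code-region address actually removes the extra cycle would have to be measured on the target.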

If this is the correct explanation, is there a way to force loading data from SRAM1 through the System Bus so that all data get the same access time?

treguy · ‎2026-02-16

But the Cortex-M4 reference manual says that the Code region, which can be accessed via ICode and DCode, ends at 0x1000_0000... I still don't get it.

waclawek.jan · ‎2026-02-17

> But the Cortex-M4 reference manual says that the Code region, which can be accessed via ICode and DCode, ends at 0x1000_0000

My copy of that manual says that Code ends at 0x1FFF'FFFF (and the ARM manuals agree with this).

> SRAM1 is accessible via DCode to load data, which does not have 1 cycle penalty as accessing data through System Bus does

Only if you access it in the area starting at 0x0000'0000, and only if SRAM1 is mapped there through SYSCFG_MEMRMP.MEM_MODE.

As you access both SRAM1 and SRAM2/CCMSRAM at addresses >0x2000'0000, both are accessed through the S-bus of the processor. However, the bus matrix may use a different arbitration scheme for each of these memories (e.g. "don't release until another master requests, allowing faster sequential access for one master but imposing higher latencies when masters interleave", vs. "release immediately, resulting in a uniform 1-cycle penalty for each access of each master"), or these memories may impose different wait states of their own. ST does not publicly document these details at all.
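The address split described above can be written down as a tiny classifier. This is only a sketch of the Cortex-M4 memory map rule (data accesses to the Code region go out on DCode, everything from 0x2000'0000 up goes out on the System bus); the type and function names are made up for illustration, and the Private Peripheral Bus range is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define CODE_REGION_LAST  0x1FFFFFFFu   /* last address of the Code region */
#define SRAM_REGION_FIRST 0x20000000u   /* first address of the SRAM region */

static_assert(CODE_REGION_LAST + 1u == SRAM_REGION_FIRST,
              "Code and SRAM regions are contiguous");

typedef enum { BUS_DCODE, BUS_SYSTEM } cm4_data_bus_t;

/* Which processor bus a data access to addr uses (PPB range not modelled). */
static cm4_data_bus_t cm4_data_bus_for(uint32_t addr)
{
    return (addr <= CODE_REGION_LAST) ? BUS_DCODE : BUS_SYSTEM;
}
```

By this rule, an access at 0x2000'0000 or 0x2001'8000 is a System-bus access, while 0x1000'0000 is a DCode access.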

You can try to speed things up by reading multiple pixels at once (as a word, or maybe even as a double-word), or perhaps placing the palette data into the CCMRAM accessed through its 0x1000'0000 address (i.e. through D-bus).
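A minimal sketch of the "multiple pixels at once" idea, assuming a little-endian core (as on the Cortex-M4) and the one-byte-per-pixel, one-word-per-color layout from the first post. `lookup4` is a hypothetical helper, not code from this thread, and its effect on the per-pixel timing would have to be measured on the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fetch four 8-bit palette indices with a single 32-bit load, then look up
 * each one. One pixel-memory access now serves four pixels instead of one. */
static void lookup4(const uint32_t *palette, const uint8_t *pixels,
                    uint32_t rgb_out[4])
{
    uint32_t packed;

    assert(palette != NULL && pixels != NULL && rgb_out != NULL);
    memcpy(&packed, pixels, sizeof packed);   /* safe for any alignment */

    for (unsigned i = 0; i < 4; i++) {
        /* Little-endian: the lowest byte is the first pixel in memory. */
        rgb_out[i] = palette[(packed >> (8u * i)) & 0xFFu];
    }
}
```

A caller would advance `pixels` by 4 per call; since 370 is not a multiple of 4, each line would need its last two pixels handled separately.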

JW

treguy · ‎2026-02-17

Thank you, I did not know about the different arbitration policies.

My goal here was not to get faster, but to get constant execution time so I can fulfill the requirement of the video signal.

I just finished a big debugging session, and what I found is that:
- the pixel bytes did not all fit in SRAM1; the pixels in SRAM2 and CCM SRAM all take one extra cycle
- moving the palette from before the pixels (in SRAM1) to after the pixels (in CCM SRAM) does not change the timing for the CCM SRAM or SRAM2 pixels, but slows the pixels in SRAM1 so they take the same time as the others, so that solves my problem (I just removed a NOP in the waiting loop)

Indeed, I could find nothing in the datasheet or in the DWT registers that explains what happens, so arbitration may be the issue. The LDR for the RGB value takes 3 cycles, versus 1 when both the color index and the RGB values are in the same memory block.

waclawek.jan · ‎2026-02-17

> Thank you, I did not know about the different arbitration policies.

Don't get me wrong: I am not an ST insider and don't know anything about the design of the bus matrix or the memories. I am just trying to interpret what I see. I was playing with the STM32F407 back when it was new, and I saw different access times to different SRAMs there, and this was my (possibly incorrect) conclusion.

These chips are simply not intended for cycle-precision timing.

JW