
STM32H755 CM4: Why is SRAM2 access faster than SRAM1 in benchmarks?

ziad22
Visitor

Product: STM32H755ZI (Nucleo-H755ZI-Q, nucleo_h755zi_q/stm32h755xx/m4)
Core under test: Cortex-M4 (CM4, CPU2), running at 240 MHz (D2 domain clock)
RTOS: Zephyr RTOS (west build system, Zephyr SDK)
Toolchain: arm-zephyr-eabi-gcc (Zephyr SDK)

I am benchmarking different SRAMs on the CM4 core of the STM32H755 and measuring memory access latency across the available SRAM regions using k_cycle_get_64() (Zephyr cycle counter, backed by DWT CYCCNT). My benchmark allocates identically-sized and identically-aligned buffers in SRAM1 (0x30004000), SRAM2 (0x30020000), and SRAM3 (0x30040000) then performs the same read/write pattern against each.

Observation: SRAM2 read access from the CM4 is measurably and reproducibly faster than SRAM1 read access, by 1 cycle per 32-bit word. SRAM1 and SRAM3 show the same read latency as each other. Write cycles are uniform across all three SRAMs. This behavior is consistent across multiple runs with the CM7 core held in stop mode, which rules out cross-domain contention as a variable.

Here are my benchmark results:

        Read (cycles/word)    Write (cycles/word)
SRAM1   7                     5
SRAM2   6                     5
SRAM3   7                     5

According to RM0399 (STM32H745/755 Reference Manual, Rev 4), Figure 1 (System Architecture), SRAM1, SRAM2, and SRAM3 all reside in the D2 domain and are all connected to the CM4 through the same D2 AHB bus matrix. The documentation lists no difference in access latency, bus width, or wait states between SRAM1, SRAM2, and SRAM3. All three are described as AHB SRAM accessible by the same set of masters.

 

My question is: What is the architectural reason behind SRAM2 having lower read latency than SRAM1 and SRAM3 from the CM4's perspective, given that all three reside in the D2 domain and are nominally connected through the same D2 AHB bus matrix?

3 REPLIES
mƎALLEm
ST Employee

Hello @ziad22 and welcome to the ST community,

I don't think there is a difference from an architectural standpoint.

The CM4 accesses SRAM1, SRAM2 and SRAM3 in the same way, with the same performance.

What address ranges are you using for SRAM1, SRAM2 and SRAM3?

To give better visibility on the answered topics, please click on "Accept as Solution" on the reply which solved your issue or answered your question.

If that's the case, why do I keep seeing that 1-cycle difference per word between SRAM2 and SRAM1/SRAM3? I understand that, according to the hardware documents, they should all perform identically, but that is not what I am seeing, which came as a surprise. Could you please investigate further?

I don't know. It could be due to your implementation.

Remove all the Zephyr code and benchmark using simple memory accesses.
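As a rough illustration of what such a stripped-down benchmark could look like, here is a minimal sketch that reads the DWT cycle counter directly instead of going through k_cycle_get_64(). The register addresses are the architecturally defined Cortex-M debug registers; the buffer addresses are the ones quoted in the question. The function names (cyccnt_enable, time_reads, cycles_per_word) are hypothetical, and the measured figure includes loop overhead, so only differences between regions are meaningful, not the absolute numbers.

```c
#include <stdint.h>

/* Cortex-M core debug registers (architecturally defined addresses). */
#define DEMCR      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

/* Buffer addresses as quoted in the question (assumed free for this test). */
#define SRAM1_BUF ((volatile const uint32_t *)(uintptr_t)0x30004000u)
#define SRAM2_BUF ((volatile const uint32_t *)(uintptr_t)0x30020000u)
#define SRAM3_BUF ((volatile const uint32_t *)(uintptr_t)0x30040000u)

/* Enable the DWT cycle counter; call once on the target before measuring. */
void cyccnt_enable(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;       /* reset the counter */
    DWT_CTRL |= 1u;        /* CYCCNTENA: start counting */
}

/*
 * Read `words` consecutive 32-bit words and return the elapsed cycles.
 * The same function is used for every region so that the generated
 * code is identical and only the target address differs.
 */
uint32_t time_reads(volatile const uint32_t *buf, uint32_t words)
{
    uint32_t start = DWT_CYCCNT;
    for (uint32_t i = 0u; i < words; i++) {
        (void)buf[i];      /* volatile access forces a real load */
    }
    /* Unsigned subtraction stays correct across one counter wrap. */
    return DWT_CYCCNT - start;
}

/* Average cycles per word, truncated to an integer; 0 if words is 0. */
uint32_t cycles_per_word(uint32_t cycles, uint32_t words)
{
    return (words == 0u) ? 0u : cycles / words;
}
```

On the target, one would call cyccnt_enable() once, then compare cycles_per_word(time_reads(SRAM1_BUF, n), n) against the same expression for SRAM2_BUF and SRAM3_BUF, with interrupts disabled and the same n for each region. If the 1-cycle gap persists here, the Zephyr layer can be ruled out as the cause.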

 

 
