How to properly initialize shared arrays on STM32H745?

Hansel · 2022-06-09

In my application I use initialized data in shared memory for a user interface. The data lives in SRAM2 at 0x30020000. Both cores need to access it, so I use code of the following form:

__attribute__((section(".sram2"))) uiSelDef uiSelItem[4] = {
	[0] = {"abc", 0.0f, 50, CENTER_CENTER},
	[1] = {"def", 0.0f, 100, CENTER_CENTER},
	[2] = {"ghi", 0.0f, 200, CENTER_CENTER},
	[3] = {"jkl", 0.0f, 500, CENTER_CENTER},
};

This code lives in the Common directory, so it is compiled separately into both the CM4 image and the CM7 image. Note that the array is intentionally not declared static.
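The post doesn't show how `uiSelDef` or `CENTER_CENTER` are defined. As a purely illustrative sketch, assuming an entry made of a label, a float, a delay and an alignment code (the field names, the enum and the `SRAM2_SHARED` macro are all hypothetical), a header shared by both cores could look like this. The label is stored inline as a `char` array, so each entry is self-contained in SRAM2 instead of pointing at a string literal in one core's flash, and a compile-time size check catches the case where the two builds disagree about the layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical alignment codes; the real CENTER_CENTER is not shown in the thread. */
enum { TOP_LEFT = 0, CENTER_CENTER = 1 };

/* Hypothetical layout for uiSelDef. The label is stored inline, so the entry
 * holds no pointer into either core's flash. */
typedef struct {
    char     label[8];
    float    value;
    uint16_t delay_ms;
    uint8_t  align;
} uiSelDef;

/* Both images compile this separately: fail the build if the layout differs
 * from what the other core expects (8 + 4 + 2 + 1 bytes, padded to 16). */
static_assert(sizeof(uiSelDef) == 16, "uiSelDef layout changed");

/* Hypothetical macro: place the data in .sram2 only on the Cortex-M7/M4 target. */
#if defined(__ARM_ARCH_7EM__)
#define SRAM2_SHARED __attribute__((section(".sram2")))
#else
#define SRAM2_SHARED /* host build: default data section */
#endif

#define UI_SEL_COUNT 4

/* Non-static, as in the original post. */
SRAM2_SHARED uiSelDef uiSelItem[UI_SEL_COUNT] = {
    [0] = {"abc", 0.0f,  50, CENTER_CENTER},
    [1] = {"def", 0.0f, 100, CENTER_CENTER},
    [2] = {"ghi", 0.0f, 200, CENTER_CENTER},
    [3] = {"jkl", 0.0f, 500, CENTER_CENTER},
};
```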

Why am I doing this? Roughly speaking, the CM7 reacts to user input via interrupts (from encoders and switches) and updates the screen according to the user's selection, using data from the array shown above. The CM7 then uses semaphores to ask the CM4 to act on the user input. The CM4 accesses that very same array.
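The thread doesn't show how the selected index is handed to the CM4. A minimal sketch, assuming a small mailbox in SRAM2 next to the array (the `uiSelMailbox` type, `UI_SEL_COUNT` and both functions are hypothetical): the CM7 writes the index and then bumps a sequence counter, and the CM4 checks the counter, for example after the hardware-semaphore notification arrives. `__sync_synchronize()` is a GCC/Clang builtin full memory barrier. The sketch assumes SRAM2 is not cached by the CM7 (e.g. marked non-cacheable in the MPU); if the D-cache covers that region, barriers alone are not enough and cache maintenance is needed as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro: place the mailbox in .sram2 only on the Cortex-M7/M4 target. */
#if defined(__ARM_ARCH_7EM__)
#define SRAM2_SHARED __attribute__((section(".sram2")))
#else
#define SRAM2_SHARED
#endif

#define UI_SEL_COUNT 4u  /* hypothetical: number of uiSelItem entries */

/* Hypothetical mailbox shared between the cores. */
typedef struct {
    volatile uint32_t index;  /* which uiSelItem entry the CM4 should use */
    volatile uint32_t seq;    /* incremented by the CM7 after 'index' is written */
} uiSelMailbox;

SRAM2_SHARED uiSelMailbox uiMailbox;

/* CM7 side: write the payload first, then publish it by bumping 'seq'. */
void ui_publish_selection(uint32_t index)
{
    assert(index < UI_SEL_COUNT);
    uiMailbox.index = index;
    __sync_synchronize();             /* index must be visible before seq */
    uiMailbox.seq = uiMailbox.seq + 1u;
}

/* CM4 side: returns 1 and stores the index if a new selection arrived since
 * *last_seq, otherwise returns 0. */
int ui_poll_selection(uint32_t *last_seq, uint32_t *index_out)
{
    uint32_t seq = uiMailbox.seq;
    if (seq == *last_seq) {
        return 0;
    }
    __sync_synchronize();             /* read index only after seeing the new seq */
    *index_out = uiMailbox.index;
    *last_seq = seq;
    return 1;
}
```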

So far so good. This works very well in Debug mode. However, when I compile the code in Release mode, the MCU gets stuck. I have no way to debug this because, as stated above, the issue does not occur in Debug mode. So I can only surmise that there is some sort of clash or race during the execution of the startup code "startup_stm32h745ihx.s" on each of the two cores.

What ideas do you have to make this work? Why would the two startup codes cause a clash (if my theory is correct)?

For completeness' sake, here are the relevant entries of my two linker scripts:

CM7:

MEMORY
{
  ITCMRAM (xrw) : ORIGIN = 0x00000000, LENGTH =   64K
  DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH =  128K
  AXISRAM (xrw) : ORIGIN = 0x24000000, LENGTH =  256K
  SRAM1   (xrw) : ORIGIN = 0x30000000, LENGTH =  128K
  SRAM2   (xrw) : ORIGIN = 0x30020000, LENGTH =  128K
  SRAM3   (xrw) : ORIGIN = 0x30040000, LENGTH =   32K
  SRAM4   (xrw) : ORIGIN = 0x38000000, LENGTH =   64K
  FLASH   (rx)  : ORIGIN = 0x08000000, LENGTH = 1024K
}

CM4:

MEMORY
{
  AXISRAM (xrw) : ORIGIN = 0x24040000, LENGTH =  256K
  SRAM1   (xrw) : ORIGIN = 0x30000000, LENGTH =  128K
  SRAM2   (xrw) : ORIGIN = 0x30020000, LENGTH =  128K
  SRAM3   (xrw) : ORIGIN = 0x30040000, LENGTH =   32K
  SRAM4   (xrw) : ORIGIN = 0x38000000, LENGTH =   64K
  FLASH   (rx)  : ORIGIN = 0x08100000, LENGTH = 1024K
}

And to ensure proper initialization, both linker scripts contain the following section definition (identical in both scripts):

/* Configure SRAM2 */
  .SRAM2 :
  {
    . = ALIGN(4);
    __SRAM2_START__ = .;
    *(.sram2)
    *(.sram2*)
    . = ALIGN(4);
    __SRAM2_END__ = .;
  } >SRAM2

I realize that this leads to both startup routines writing the initialized array to the same location, but I have no better idea how to ensure that both cores access the data at the same memory address.
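One alternative, offered as an untested sketch rather than what the thread ended up doing, is to let only one core initialize the shared data and have the other wait until it is published. Below, the CM7 copies defaults from its own flash into SRAM2 and then writes a hypothetical magic value to a `ready` word; the CM4 checks that word before touching the array. This only works if the CM4 image does not also initialize the section at startup (for instance by marking its `.SRAM2` output section `(NOLOAD)` in the CM4 linker script and leaving the SRAM2 copy out of its startup code); otherwise the CM4's copy could overwrite data, including `ready`, that the CM7 has already written. All type, function and macro names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_ARCH_7EM__)
#define SRAM2_SHARED __attribute__((section(".sram2")))
#else
#define SRAM2_SHARED
#endif

#define UI_SEL_COUNT  4u
#define SHARED_READY  0xC0FFEE01u   /* hypothetical magic value */

/* Hypothetical entry layout (the real uiSelDef is not shown in the thread). */
typedef struct {
    char     label[8];
    float    value;
    uint16_t delay_ms;
    uint8_t  align;
} uiSelDef;

/* Hypothetical shared block: the data plus a 'ready' word written last by the CM7. */
typedef struct {
    uiSelDef          items[UI_SEL_COUNT];
    volatile uint32_t ready;
} uiShared;

SRAM2_SHARED uiShared uiShm;

/* Defaults live in the CM7 image's flash as ordinary const data. */
static const uiSelDef uiSelDefaults[UI_SEL_COUNT] = {
    {"abc", 0.0f,  50, 1},
    {"def", 0.0f, 100, 1},
    {"ghi", 0.0f, 200, 1},
    {"jkl", 0.0f, 500, 1},
};
static_assert(sizeof uiSelDefaults == sizeof uiShm.items, "size mismatch");

/* CM7 only: invalidate any stale flag, fill the shared array, then publish it. */
void ui_shared_init_cm7(void)
{
    uiShm.ready = 0u;
    __sync_synchronize();
    memcpy(uiShm.items, uiSelDefaults, sizeof uiSelDefaults);
    __sync_synchronize();            /* data must be visible before 'ready' */
    uiShm.ready = SHARED_READY;
}

/* CM4 only: returns nonzero once the CM7 has published the data. */
int ui_shared_is_ready_cm4(void)
{
    if (uiShm.ready != SHARED_READY) {
        return 0;
    }
    __sync_synchronize();            /* read 'items' only after seeing 'ready' */
    return 1;
}
```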


Tesla DeLorean · 2022-06-09

OK, but the M7 is the generator and the M4 is the consumer? And the M4 only looks at the (new) content when explicitly told to.

It's not clear to me why the M4 would need to initialize it; the array could be built on the fly.

It would make more sense to me to use structures and pointers, so that the object carries size/count fields and doesn't depend on compile-time sizes, or on the addresses one linker assigned to the static strings versus the other.
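One reading of this suggestion, sketched under assumptions (the `uiSelTable` type, the `UI_SEL_MAX` capacity and the `ui_table_add()` helper are hypothetical): the shared object carries its own `count`, the labels are stored inline rather than as pointers into either image's flash, and the CM7 fills the table at run time instead of relying on startup-code initialization.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UI_SEL_MAX 8u   /* hypothetical capacity */

/* Hypothetical entry layout (the real uiSelDef is not shown in the thread). */
typedef struct {
    char     label[8];
    float    value;
    uint16_t delay_ms;
    uint8_t  align;
} uiSelDef;

/* Self-describing table: the consumer reads 'count' instead of relying on a
 * compile-time array size baked into its own image. */
typedef struct {
    uint32_t count;
    uiSelDef items[UI_SEL_MAX];
} uiSelTable;

/* Append one entry at run time (e.g. on the CM7 during boot).
 * Returns 0 on success, -1 if the table is full. Long labels are truncated. */
int ui_table_add(uiSelTable *t, const char *label, float value,
                 uint16_t delay_ms, uint8_t align)
{
    assert(t != NULL && label != NULL);
    if (t->count >= UI_SEL_MAX) {
        return -1;
    }
    uiSelDef *e = &t->items[t->count];
    memset(e, 0, sizeof *e);
    strncpy(e->label, label, sizeof e->label - 1);  /* last byte stays NUL */
    e->value = value;
    e->delay_ms = delay_ms;
    e->align = align;
    t->count++;
    return 0;
}
```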


Hansel · 2022-06-09

@Community member, yes, for now the M7 is the generator and the M4 is the consumer. But I'd like to keep my options open in case some data gets "reported back" to the M7. The M4 needs to initialize it simply because my consumer needs the same variable names at exactly the same memory addresses. How else would the M4 read the data at, say, index 2 when the M7 reports that it used the data at that position?

To be clearer, sticking with index 2 as an example: the M7 uses some data from the array at index 2 to place some text on my LCD. Then it tells the M4 to use the other data from the array at index 2.

What do you mean by "creating the array on the fly"? Do you mean some routine that copies the array data on the CM7 and uses inter-processor communication to transfer it to the other core? That seems cumbersome and less elegant to me, but maybe that's not what you meant.

I should also note that the M7 manipulates the content of the array for the M4 to pick up. That's why the array is not defined static.

Andrew Neil · 2022-06-10

"I am not running into this issue during debug mode."

So try adjusting the optimisation level used in the Debug configuration.

If that reproduces the problem, it most likely indicates a flaw in your code, e.g. not having volatile where needed.
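As an illustration of the kind of flaw meant here: a flag set in an interrupt handler and polled in the main loop must be `volatile`, otherwise an optimizing build may read it once, keep it in a register and never notice the update, a classic way for code to work in Debug but hang in Release. The sketch assumes a hypothetical encoder interrupt; the handler body, the variable names and the `IRQ_DISABLE`/`IRQ_ENABLE` helpers are illustrative, not from the thread, and the helpers assume interrupts are enabled when the function is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: mask interrupts on Cortex-M4/M7, no-op on a host build. */
#if defined(__ARM_ARCH_7EM__)
#define IRQ_DISABLE() __asm volatile ("cpsid i" ::: "memory")
#define IRQ_ENABLE()  __asm volatile ("cpsie i" ::: "memory")
#else
#define IRQ_DISABLE() ((void)0)
#define IRQ_ENABLE()  ((void)0)
#endif

/* Written by the ISR, read by the main loop: both must be volatile. */
static volatile bool    encoder_event;
static volatile int32_t encoder_delta;

/* Hypothetical body of the encoder interrupt handler. */
void encoder_isr(int32_t step)
{
    encoder_delta += step;
    encoder_event = true;
}

/* Main-loop side: returns true and the accumulated delta if an event occurred. */
bool encoder_take_event(int32_t *delta_out)
{
    assert(delta_out != NULL);
    if (!encoder_event) {
        return false;
    }
    IRQ_DISABLE();                /* keep the ISR out while reading and clearing */
    *delta_out = encoder_delta;
    encoder_delta = 0;
    encoder_event = false;
    IRQ_ENABLE();
    return true;
}
```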


Hansel · 2022-06-10

@Andrew Neil, thanks for the hint. After I added volatile in the right places, the code started to fail at later points. I noticed that the code knew where the variables were stored, but their contents were wrong. Then it dawned on me that something must be wrong with the startup assembly code. In fact, the code that copies the pre-initialized data from flash to SRAM2 was missing entirely. It's not clear to me why the code worked in Debug mode at all, since the explicit data copy never existed.

Thanks guys for your posts.

For completeness' sake, here are the relevant code snippets.

In the linker script:

  _sram2data = LOADADDR(.SRAM2);
 
  /* Configure SRAM2 */
  .SRAM2 :
  {
    . = ALIGN(4);
    __SRAM2_START__ = .;
    *(.sram2)
    *(.sram2*)
    . = ALIGN(4);
    __SRAM2_END__ = .;
  } >SRAM2 AT> FLASH

In the startup code startup_stm32h745xihx.s:

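/* Load address (flash) and run-address bounds (SRAM2) from the linker script */
.word _sram2data
.word __SRAM2_START__
.word __SRAM2_END__

/* Copy the data segment initializers from flash to SRAM2 (in Reset_Handler, before main is called) */
  ldr r0, =__SRAM2_START__    /* r0: destination start in SRAM2 */
  ldr r1, =__SRAM2_END__      /* r1: destination end */
  ldr r2, =_sram2data         /* r2: source, the load address in flash */
  movs r3, #0                 /* r3: byte offset */
  b LoopCopySRAM2Init

CopySRAM2Init:
  ldr r4, [r2, r3]            /* read one word from flash */
  str r4, [r0, r3]            /* write it to SRAM2 */
  adds r3, r3, #4

LoopCopySRAM2Init:
  adds r4, r0, r3             /* r4 = current destination address */
  cmp r4, r1
  bcc CopySRAM2Init           /* loop while below __SRAM2_END__ */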
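For readers more comfortable in C, the assembly loop above does the same job as the following word-by-word copy from the load address in flash to the run address in SRAM2. This is a sketch: `copy_section_init()` and `init_sram2()` are hypothetical names, and calling C this early in Reset_Handler assumes the function itself does not depend on initialized globals. The linker symbols are the ones defined in the script above.

```c
#include <assert.h>
#include <stdint.h>

/* The linker script aligns the section to 4 bytes, so it is copied in 32-bit words. */
static_assert(sizeof(uint32_t) == 4, "copy assumes 4-byte words");

/* Copy words from src (load address) to [dst, dst_end) (run address). */
void copy_section_init(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

#if defined(__ARM_ARCH_7EM__)
/* Symbols from the linker script in the post; only their addresses are used. */
extern uint32_t _sram2data;
extern uint32_t __SRAM2_START__;
extern uint32_t __SRAM2_END__;

/* Hypothetical wrapper: call before any code reads the SRAM2 variables. */
void init_sram2(void)
{
    copy_section_init(&__SRAM2_START__, &__SRAM2_END__, &_sram2data);
}
#endif
```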

Piranha · 2022-06-10

To ensure the structure is laid out the same under all optimizations and configurations, make it packed:

https://stackoverflow.com/questions/4306186/structure-padding-and-packing
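A sketch of what packing could look like, using a hypothetical `uiSelDefPacked` layout (field names assumed, since the real `uiSelDef` isn't shown). The `static_assert` lines make each core's build fail if its idea of the layout ever differs from the expected offsets. One caveat: with packing, members can end up misaligned (in an array, entry 1's `value` starts at byte 23), so access them through the struct rather than through pointers to the members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed variant of uiSelDef: no padding between or after members. */
typedef struct __attribute__((packed)) {
    char     label[8];
    float    value;
    uint16_t delay_ms;
    uint8_t  align;
} uiSelDefPacked;

/* Compile-time layout checks, evaluated separately in the CM4 and CM7 builds. */
static_assert(offsetof(uiSelDefPacked, value)    ==  8, "value offset");
static_assert(offsetof(uiSelDefPacked, delay_ms) == 12, "delay_ms offset");
static_assert(offsetof(uiSelDefPacked, align)    == 14, "align offset");
static_assert(sizeof(uiSelDefPacked)             == 15, "total size");
```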

Also take a look at this:

https://github.com/MaJerle/stm32h7-dual-core-inter-cpu-async-communication

Consider designing the communication as a message/command queue.
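A minimal single-producer/single-consumer queue in shared SRAM2 might look like the sketch below. It is not taken from the linked repository, and the message format (`uiMsg`), queue length and function names are hypothetical. The CM7 only advances `head`, the CM4 only advances `tail`, and `__sync_synchronize()` (a GCC/Clang builtin full barrier) orders the slot accesses against the index updates. As with any shared buffer, it assumes SRAM2 is not cached by the CM7, or that cache maintenance is done.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_ARCH_7EM__)
#define SRAM2_SHARED __attribute__((section(".sram2")))
#else
#define SRAM2_SHARED
#endif

#define UI_QUEUE_LEN 8u   /* hypothetical length; must be a power of two */
static_assert((UI_QUEUE_LEN & (UI_QUEUE_LEN - 1u)) == 0, "UI_QUEUE_LEN must be a power of two");

/* Hypothetical command sent from the CM7 to the CM4. */
typedef struct {
    uint16_t cmd;     /* e.g. "apply selection" */
    uint16_t index;   /* uiSelItem index the command refers to */
} uiMsg;

/* Free-running indices: head - tail is the fill level, even after wrap-around. */
typedef struct {
    volatile uint32_t head;   /* written only by the producer (CM7) */
    volatile uint32_t tail;   /* written only by the consumer (CM4) */
    uiMsg             slots[UI_QUEUE_LEN];
} uiQueue;

SRAM2_SHARED uiQueue uiCmdQueue;

/* Producer side: returns 0 on success, -1 if the queue is full. */
int ui_queue_push(uiQueue *q, uiMsg msg)
{
    uint32_t head = q->head;
    if (head - q->tail == UI_QUEUE_LEN) {
        return -1;
    }
    q->slots[head % UI_QUEUE_LEN] = msg;
    __sync_synchronize();          /* slot contents before the new head */
    q->head = head + 1u;
    return 0;
}

/* Consumer side: returns 0 and fills *out, or -1 if the queue is empty. */
int ui_queue_pop(uiQueue *q, uiMsg *out)
{
    uint32_t tail = q->tail;
    if (tail == q->head) {
        return -1;
    }
    __sync_synchronize();          /* read the slot only after seeing the head */
    *out = q->slots[tail % UI_QUEUE_LEN];
    __sync_synchronize();          /* finish reading before releasing the slot */
    q->tail = tail + 1u;
    return 0;
}
```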

Hansel · 2022-06-11

@Piranha, thanks for the two links. In my quest to get variables in shared memory working, I've run into another problem, which is described here. Following your suggestion to make the structure layout identical under all conditions, I've tried packing my typedefs, but the physical address reported at runtime still differs from the VMA shown in the ELF. I've run out of ideas.