
Why use DMA when you've got shared RAM?

cbcooper
Associate II

I've inherited some code; the original owner is long gone, and I'm trying to understand his code, which has ZERO comments. (I'm sure I'm not the only person who's ever been in this position.)

It's using the STM32H747IGT6, so it's dual core: one M4 and one M7. The cores use SRAM2 to communicate, but he's set it up so the M4 (which generates the data) writes a single uint32 of data into one location in SRAM2 and then triggers the MDMA to move it to another location in SRAM2, where the M7 can pick it up.

I haven't worked with a dual-core STM32 before, but this feels like overkill.  If both cores can read & write SRAM2, why bother with the MDMA?  Why not just have the M4 write the data directly to where the M7 will be looking for it and skip the MDMA completely?
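For concreteness, the direct approach I have in mind would be something like this (just a sketch of my idea, not code from the project; the ready-flag slot and function name are made up):

   #include <stdint.h>
   #include "stm32h7xx.h"                     /* for __DMB() */

   #define SHMEM_SAMPLE_0_INDEX    256
   #define SHMEM_READY_FLAG_INDEX  257        /* hypothetical "data ready" slot */

   extern volatile uint32_t shared_mem[];     /* lives in SRAM2 */

   /* M4 side: write the sample straight to where the M7 will look, no MDMA.
    * The M4 has no data cache, so the write lands in SRAM2 immediately. */
   void m4_publish_sample(uint32_t sample)
   {
       shared_mem[SHMEM_SAMPLE_0_INDEX] = sample;
       __DMB();                                /* make sure data lands before the flag */
       shared_mem[SHMEM_READY_FLAG_INDEX] = 1;
   }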

And why is he setting the destination bus to AXI (and leaving the source bus at its default of AXI) when the system architecture diagram shows SRAM2 as being in the AHB bus matrix?  I'm thinking that since SRAM2 is mapped at the same address in both bus systems, it doesn't actually matter which bus is selected?

Here is his MDMA initialization code; I've started writing comments to explain each step:

   /*
    * Set MDMA first channel BNDT (in MDMA_CxBNDTR) to transfer sizeof(uint32_t) * SHMEM_NUM_SAMPLES bytes
    * "Number of bytes to be transferred (0 up to 65536) in the current block. When the channel is
    *  enabled, this register is read-only, indicating the remaining data items to be transmitted.
    *  During the channel activity, this register decrements, indicating the number of data items
    *  remaining in the current block.
    *  Once the block transfer has completed, this register can either stay at zero or be reloaded
    *  automatically with the previously programmed value if the channel is configured in block
    *  repeat mode."
    * #define SHMEM_NUM_SAMPLES    4096
    */
   LL_MDMA_SetBlkDataLength(MDMA, LL_MDMA_CHANNEL_0, sizeof(uint32_t) * SHMEM_NUM_SAMPLES);

   /*
    * Set MDMA first channel BRC (in MDMA_CxBNDTR) to zero
    * "This field contains the number of repetitions of the current block (0 to 4095). When the
    *  channel is enabled, this register is read-only, indicating the remaining number of blocks,
    *  excluding the current one. This register decrements after each complete block transfer.
    *  Once the last block transfer has completed, this register can either stay at zero or be
    *  reloaded automatically from memory (in linked-list mode, meaning link address valid)."
    * I think this is set to "not repeat" so that the buffer will wait until the CM7 can grab
    * it before it starts a new pass.  See LCSCM7BSPReleaseSharedData().
    */
   LL_MDMA_SetBlkRepeatCount(MDMA, LL_MDMA_CHANNEL_0, 0);

   /*
    * Set MDMA first channel BRDUM (in MDMA_CxBNDTR) to zero (destination address increment)
    * "At the end of a block transfer, the MDMA_DAR register is updated by adding the DUV to
    *  the current DAR value (current destination address)."
    */
   LL_MDMA_SetBlkRepeatDestAddrUpdate(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_BLK_RPT_DEST_ADDR_INCREMENT);

   /*
    * Set MDMA first channel BRSUM (in MDMA_CxBNDTR) to zero (source address increment)
    * "At the end of a block transfer, the MDMA_SAR register is updated by adding the SUV to
    *  the current SAR value (current source address)."
    */
   LL_MDMA_SetBlkRepeatSrcAddrUpdate(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_BLK_RPT_SRC_ADDR_INCREMENT);

   /*
    * Set MDMA first channel DUV (Destination address update value) (in MDMA_CxBRUR) to 4
    * "This value is used to update (by addition or subtraction) the current destination address
    *  at the end of a block transfer."
    */
   LL_MDMA_SetBlkRptDestAddrUpdateValue(MDMA, LL_MDMA_CHANNEL_0, 4);

   /*
    * Set MDMA first channel SUV (Source address update value) (in MDMA_CxBRUR) to 0,
    * i.e. leave the source address unchanged at the end of a block transfer.
    */
   LL_MDMA_SetBlkRptSrcAddrUpdateValue(MDMA, LL_MDMA_CHANNEL_0, 0);

   /*
    * Set MDMA first channel TLEN (buffer transfer length, in MDMA_CxTCR) to 3
    * "TLEN + 1 value represents the number of bytes to be transferred in a single transfer."
    */
   LL_MDMA_SetBufferTransferLength(MDMA, LL_MDMA_CHANNEL_0, 3);

   /* Don't swap byte order during the transfer */
   LL_MDMA_SetByteEndianness(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_BYTE_ENDIANNESS_PRESERVE);
   
   /* Set channel priority (PL in MDMA_CxCR) to high */
   LL_MDMA_SetChannelPriorityLevel(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_PRIORITY_HIGH);
   
   /*
    * Set MDMA first channel DBUS (Destination bus select, in MDMA_CxTBR) to zero
    * "The system/AXI bus is used as destination (write operation)"
    */
   LL_MDMA_SetDestBusSelection(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_DEST_BUS_SYSTEM_AXI);

   /*
    * Set MDMA first channel CDAR (channel destination address register, in MDMA_CxDAR)
    * to &shared_mem[SHMEM_SAMPLE_0_INDEX]
    * #define SHMEM_SAMPLE_0_INDEX 256
    */
   LL_MDMA_SetDestinationAddress(MDMA, LL_MDMA_CHANNEL_0, (uint32_t)&shared_mem[SHMEM_SAMPLE_0_INDEX]);

   /* Destination: single (non-burst) transfers of 32-bit words, with the
    * destination pointer incremented by one word after each transfer */
   LL_MDMA_SetDestinationBurstSize(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_DEST_BURST_SINGLE);
   LL_MDMA_SetDestinationDataSize(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_DEST_DATA_SIZE_WORD);
   LL_MDMA_SetDestinationIncMode(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_DEST_INCREMENT);
   LL_MDMA_SetDestinationIncSize(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_DEST_INC_OFFSET_WORD);

   /*
    * Set MDMA first channel CSAR (source address, in MDMA_CxSAR) to &shared_mem[SHMEM_PIPE_LAUNCH_INDEX]
    * #define SHMEM_PIPE_LAUNCH_INDEX (LCS_SHARED_MEM_LEN - 1)
    */
   LL_MDMA_SetSourceAddress(MDMA, 
                            LL_MDMA_CHANNEL_0, 
                            (uint32_t)&shared_mem[SHMEM_PIPE_LAUNCH_INDEX]);

   /* Source: single (non-burst) transfers of 32-bit words */
   LL_MDMA_SetSourceBurstSize(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_SRC_BURST_SINGLE);
   LL_MDMA_SetSourceDataSize(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_SRC_DATA_SIZE_WORD);

   /*
    * Set MDMA first channel SINC (source increment mode) (in MDMA_CxTCR) to zero
    * "Source address pointer is fixed."
    */
   LL_MDMA_SetSourceIncMode(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_SRC_FIXED);

   /*
    * Set the SWRM (software request mode) (in MDMA_CxTCR) to 1
    * "Hardware request are ignored. Transfer is triggered by software writing 1 to the SWRQ bit."
    */
   LL_MDMA_SetRequestMode(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_REQUEST_MODE_SW);

   /*
    * Set the TRGM (trigger mode) (in MDMA_CxTCR) to zero
    * "Each MDMA request (software or hardware) triggers a buffer transfer."
    * Above, we set TLEN to 3 so 4 bytes are sent each time a request is triggered
    */
   LL_MDMA_SetTriggerMode(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_BUFFER_TRANSFER);

   // LL_MDMA_SetHWTrigger(MDMA, LL_MDMA_CHANNEL_0, LL_MDMA_REQ_DMA1_STREAM0_TC);

   /* Enable the channel-transfer-complete interrupt (CTCIE in MDMA_CxCR),
    * then enable the channel itself (EN in MDMA_CxCR) */
   LL_MDMA_EnableIT_CTC(MDMA, LL_MDMA_CHANNEL_0);
   LL_MDMA_EnableChannel(MDMA, LL_MDMA_CHANNEL_0);
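I haven't yet found where he actually kicks off a transfer, but since SWRM is set, I assume each transfer gets started somewhere else by setting the SWRQ bit in MDMA_CxCR, roughly like this (my guess, not his code):

   /* Raise a software request on channel 0; with TLEN = 3 and TRGM = buffer
    * transfer, each request should move TLEN + 1 = 4 bytes (one uint32). */
   MDMA_Channel0->CCR |= MDMA_CCR_SWRQ;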

 

6 REPLIES
TDK
Super User

One reason would be so the receiver can modify the data, or wait to process it, while the sender updates the buffer with new data.
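In other words, the copy gives the receiver a stable snapshot while the sender keeps writing. A rough sketch of the idea (hypothetical names; the MDMA does this copy in hardware without tying up either CPU):

   #include <stdint.h>
   #include <string.h>

   #define NUM_SAMPLES 4096                   /* hypothetical */

   uint32_t staging[NUM_SAMPLES];             /* sender keeps writing here */
   uint32_t snapshot[NUM_SAMPLES];            /* receiver reads a stable copy here */

   void handoff(void)
   {
       memcpy(snapshot, staging, sizeof(snapshot));
       /* receiver can now process 'snapshot' while 'staging' is overwritten */
   }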

I would spend more time thinking about what you want and how to make it happen than trying to reason out every decision that may or may not have been made deliberately in the current code. A lot of times, you're just trying to get something to work. The choice of bus doesn't really matter if it works. Maybe it's suboptimal, but who cares, it's functioning.


You raise good points, I should have explained a little more in my original post.

You're right that there's a strong case to be made for "if it ain't broke, don't fix it," but in this case I've been asked to make some substantial changes to the data being sent between the two cores, and I need to figure out whether to stick with the MDMA solution or whether a different approach would better suit the data I need to send.

As I thought about this problem more over the weekend, I realized my questions are really about the hardware.

1) In the System Architecture diagram in the Reference Manual, SRAM2 is in the D2 domain and the manual later says "AHB SRAM2 is mapped at address 0x3002 0000 and accessible by all system masters except BDMA through D2 domain AHB matrix. AHB SRAM2 can be used as DMA buffers to store peripheral input/output data in D2 domain, or as read-write segment for application running on Cortex®-M4 CPU." 

It's concerning that this passage mentions the M4 but never the M7.  In contrast, the description of SRAM3 says "AHB SRAM3 can be used as ... as shared memory between the two cores."

Is there any reason (at the hardware level) that the M7 shouldn't access SRAM2 directly?  Are SRAM2 and SRAM3 implemented differently such that SRAM3 is a better choice for inter-core shared memory?

2) There's got to be some kind of arbitration at the SRAM so the two cores don't make simultaneous conflicting accesses (right?). Is that anything I need to be aware of while writing code?  For example, if the M4 writing to the SRAM locked the M7 out of all reads for some period of time, that could be a problem.

 

It appears that at least part of the answer involves the cache on the M7.  The code occasionally calls SCB_InvalidateDCache_by_Addr(), which only applies to the M7 and not the M4 (apparently).  If the M4 writes data that the M7 needs to see, the M7's cached copy has to be invalidated.  But it seems like the opposite is not true: if the M7 writes data that the M4 needs to see, it just writes it.

TDK
Super User

You need to manage the cache correctly whether or not you use DMA, for both reading and writing. Caches are per-processor: only the M7 has a data cache, and the M4 can't do anything about the M7's cache.

Level 1 cache on STM32F7 Series and STM32H7 Series
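The two maintenance directions look roughly like this (a minimal sketch; the buffer should be 32-byte aligned and a multiple of 32 bytes, the M7 cache line size, so the clean/invalidate operations don't touch neighboring data):

   #include "stm32h7xx.h"   /* CMSIS; provides SCB_*DCache_by_Addr on the M7 */

   __attribute__((aligned(32))) static uint32_t buf[4096];

   /* M7 wrote 'buf' and another master (M4, MDMA) must see it: push it out */
   void m7_publish(void)
   {
       SCB_CleanDCache_by_Addr((uint32_t *)buf, sizeof(buf));
   }

   /* Another master wrote 'buf' and the M7 is about to read it: drop stale lines */
   void m7_refresh(void)
   {
       SCB_InvalidateDCache_by_Addr((uint32_t *)buf, sizeof(buf));
   }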


What I'm seeing in the existing code is that when data is being copied from the M4 to the M7, the M4 writes the data into SRAM2 and then triggers the MDMA to copy that data elsewhere in SRAM2, where the M7 is expecting it.  There aren't any calls to manage any caches.  So since the M4 has no cache, its writes land immediately in SRAM2, and the MDMA obviously doesn't have a cache either, so those values show up immediately in SRAM2.  And then I'm guessing that the M7's cache is smart enough to notice this and drop any cached copies of the bytes the MDMA updated?

But when the M7 needs to send data to the M4, it grabs a hardware semaphore, copies the data into SRAM2, calls SCB_CleanDCache_by_Addr(), and then releases the hardware semaphore.  So if I'm understanding correctly, if the M7 didn't call SCB_CleanDCache_by_Addr(), the values it wrote to SRAM2 might just be sitting in the M7's cache where the M4 obviously can't see them.
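If I'm reading it right, the M7-to-M4 path is essentially this (my sketch, not his actual code; the semaphore ID and function name are made up):

   #include "stm32h7xx_hal.h"

   #define HSEM_ID_SHMEM  0U                  /* hypothetical semaphore ID */

   extern volatile uint32_t shared_mem[];

   void m7_send(const uint32_t *src, uint32_t nwords)
   {
       while (HAL_HSEM_FastTake(HSEM_ID_SHMEM) != HAL_OK) { }   /* lock the region */

       for (uint32_t i = 0; i < nwords; i++)
           shared_mem[i] = src[i];            /* these writes may sit in the D-cache */

       /* Force the dirty lines out to SRAM2 so the (cacheless) M4 sees them */
       SCB_CleanDCache_by_Addr((uint32_t *)shared_mem, (int32_t)(nwords * sizeof(uint32_t)));

       HAL_HSEM_Release(HSEM_ID_SHMEM, 0U);   /* unlock */
   }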

Does that sound about right?

Update: I just found a call to SCB_InvalidateDCache_by_Addr() that happens after the MDMA has copied the data but before the M7 looks at it.  So no, the M7's cache does not "notice" that SRAM2 has been updated; it's up to the M7's code to manually invalidate its own cache.
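So the receive side presumably boils down to this (my reconstruction, not his code; process_samples() is a made-up stand-in for whatever the M7 does with the data):

   void m7_on_mdma_complete(void)
   {
       /* Drop any stale cached copies before reading what the MDMA wrote */
       SCB_InvalidateDCache_by_Addr((uint32_t *)&shared_mem[SHMEM_SAMPLE_0_INDEX],
                                    SHMEM_NUM_SAMPLES * sizeof(uint32_t));

       process_samples((const uint32_t *)&shared_mem[SHMEM_SAMPLE_0_INDEX]);
   }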