2025-10-20 9:30 AM - last edited on 2025-10-21 3:41 AM by mƎALLEm
Hello,
I'm trying to transfer a buffer of 4000 uint16_t from SRAM4 to DTCM. This works with a basic configuration of the MDMA module:
- Buffer size = 32
- Block length = 4000
- No burst mode
- In/out data size = half-word
The MDMA needs 135 µs to transfer the whole buffer, which looks quite slow. I'm looking into the different ways to improve the transfer, using burst mode and/or packed mode.
When enabling burst mode, I see strange behavior during the MDMA initialization:
hmdma_mdma_channel0_sw_0.Instance = MDMA_Channel0;
hmdma_mdma_channel0_sw_0.Init.Request = MDMA_REQUEST_SW;                 /* software trigger */
hmdma_mdma_channel0_sw_0.Init.TransferTriggerMode = MDMA_BLOCK_TRANSFER; /* one trigger moves a whole block */
hmdma_mdma_channel0_sw_0.Init.Priority = MDMA_PRIORITY_VERY_HIGH;
hmdma_mdma_channel0_sw_0.Init.Endianness = MDMA_LITTLE_ENDIANNESS_PRESERVE;
hmdma_mdma_channel0_sw_0.Init.SourceInc = MDMA_SRC_INC_HALFWORD;
hmdma_mdma_channel0_sw_0.Init.DestinationInc = MDMA_DEST_INC_HALFWORD;
hmdma_mdma_channel0_sw_0.Init.SourceDataSize = MDMA_SRC_DATASIZE_HALFWORD;
hmdma_mdma_channel0_sw_0.Init.DestDataSize = MDMA_DEST_DATASIZE_HALFWORD;
hmdma_mdma_channel0_sw_0.Init.DataAlignment = MDMA_DATAALIGN_PACKENABLE;
hmdma_mdma_channel0_sw_0.Init.BufferTransferLength = 32;                 /* TLEN + 1, in bytes */
hmdma_mdma_channel0_sw_0.Init.SourceBurst = MDMA_SOURCE_BURST_128BEATS;
hmdma_mdma_channel0_sw_0.Init.DestBurst = MDMA_DEST_BURST_128BEATS;
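For reference, the transfer itself is then started by software trigger, along these lines (src/dst are placeholder names, not my real buffers):

/* src in SRAM4, dst in DTCM; block length in bytes, single block */
HAL_MDMA_Start_IT(&hmdma_mdma_channel0_sw_0,
                  (uint32_t)src, (uint32_t)dst, 4000, 1);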
When the program runs freely in debug mode, the live watch of MDMA_C0TCR.SBURST/DBURST shows 0 and the transfer is slow.
If I step through the HAL MDMA init function, MDMA_C0TCR is correctly initialized. After that, letting the program run works fine and the transfer time is much reduced (85 µs).
I have similar init issues when trying to use packed mode, and in all cases with the TRGM field: the register init is OK when debugging step-by-step, and wrong when the program is free-running.
Is there a real issue, or is it another dirty trick of the complex superscalar Cortex-M7 architecture?
I'm not watching the register immediately after the write, but a long time after, to rule out the propagation time through the different buses.
Any help appreciated :)
Thanks
2025-10-20 10:44 AM
When the program is running, MDMA_C0TCR may be getting changed before you see it; the live watch doesn't update immediately.
> The MDMA needs 135 µs
How are you timing this exactly? Be specific.
8 kB in 135 µs is about 475 Mbit/s (8000 bytes × 8 / 135 µs). Reasonable. ST doesn't publish exact throughput figures here because it depends on too many factors.
2025-10-21 3:37 AM
> When the program is running, MDMA_C0TCR may be getting changed before you see it; the live watch doesn't update immediately.
I am watching it a long time after the init. Printing it over UART shows the same values.
>> The MDMA needs 135 µs
> How are you timing this exactly? Be specific.
Using the SEV instruction and the EVENTOUT pin, at the start of the MDMA transfer (software trigger) and at the start of the ISR (highest priority, vector table and code in ITCM RAM; the background task is a while(1) in ITCM RAM). A sketch of this instrumentation is shown below.
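A minimal sketch of that instrumentation, assuming PA8 as the EVENTOUT pin (the pin and buffer names are mine, not from my project):

GPIO_InitTypeDef gpio = {0};
__HAL_RCC_GPIOA_CLK_ENABLE();
gpio.Pin = GPIO_PIN_8;
gpio.Mode = GPIO_MODE_AF_PP;
gpio.Pull = GPIO_NOPULL;
gpio.Speed = GPIO_SPEED_FREQ_VERY_HIGH;
gpio.Alternate = GPIO_AF15_EVENTOUT;   /* route the core TXEV event to the pin */
HAL_GPIO_Init(GPIOA, &gpio);

__SEV();                               /* first pulse on EVENTOUT: scope trigger */
HAL_MDMA_Start_IT(&hmdma_mdma_channel0_sw_0,
                  (uint32_t)src, (uint32_t)dst, 4000, 1);
/* a second __SEV() is the first statement of the MDMA transfer-complete ISR */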
> 8 kB in 135 µs is about 475 Mbit/s. Reasonable.
Erratum in my original post: I have 2000 uint16_t to Tx, so 4000 bytes.
Not sure Mbit/s is relevant here. The bus is 32 bits wide and clocked at 80 MHz; unpacked data is 16 bits, so each bus transfer carries a single half-word. The idealistic maximum with 100% bus efficiency would then be 80 MT/s (one transfer per bus clock). 4000 half-words in 135 µs is 30 MT/s.
I measured the MDMA transfer time with different configurations (rates given as effective half-word throughput):
- half-word mode, burst 1: 135 µs (30 MT/s)
- half-word mode, burst 128: 85 µs (47 MT/s)
- word mode, burst 1: 73 µs (55 MT/s)
- word mode, burst 128: 46 µs (87 MT/s)
Since TLEN = 31 (buffer transfer length = 32 bytes), I have 125 buffers to transfer (1 block of 125 × 32-byte buffers = 4000 bytes). In this case, I'm not sure I am allowed to use the 128-beat burst mode, although it seems to work. If not, an easy workaround would be to use 4096-byte in/out buffers and transmit 4000 useful + 96 dummy bytes.
Anyway, my main issue concerns the MDMA init. I stopped running the program under the debugger. In release mode, dumping the registers over UART:
- MDMA init is fine if optimization = 0 or 1
- MDMA init is wrong if optimization = speed
I mean that MX_MDMA_Init() does not work when the compiler optimizes for speed.
I continue to investigate.
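To narrow it down, a readback check along these lines (a hypothetical sketch, names are mine) should tell whether the write itself is misordered or whether the register gets overwritten afterwards:

MX_MDMA_Init();
__DSB();                                   /* drain the M7 write buffer */
uint32_t tcr_now = MDMA_Channel0->CTCR;    /* immediately after init */
HAL_Delay(10);
uint32_t tcr_later = MDMA_Channel0->CTCR;  /* long after init */
printf("CTCR now=0x%08lX later=0x%08lX\r\n",
       (unsigned long)tcr_now, (unsigned long)tcr_later);
/* "now" correct but "later" different => something else rewrites the register */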
2025-10-21 6:27 AM
I added some debug code to the HAL MDMA init function:
volatile uint32_t mdma_tcr_dbg;   /* file scope, declared volatile */

static void MDMA_Init(MDMA_HandleTypeDef *hmdma)
{
  uint32_t blockoffset;
  volatile uint32_t tmp;
  /* Prepare the MDMA Channel configuration */
  hmdma->Instance->CCR = hmdma->Init.Priority | hmdma->Init.Endianness;
  /* Write new CTCR Register value */
  /*hmdma->Instance->CTCR = hmdma->Init.SourceInc | hmdma->Init.DestinationInc | \
                          hmdma->Init.SourceDataSize | hmdma->Init.DestDataSize | \
                          hmdma->Init.DataAlignment | hmdma->Init.SourceBurst | \
                          hmdma->Init.DestBurst | \
                          ((hmdma->Init.BufferTransferLength - 1U) << MDMA_CTCR_TLEN_Pos) | \
                          hmdma->Init.TransferTriggerMode;*/
  asm volatile("" ::: "memory");
  tmp = hmdma->Init.SourceInc | hmdma->Init.DestinationInc | \
        hmdma->Init.SourceDataSize | hmdma->Init.DestDataSize | \
        hmdma->Init.DataAlignment | hmdma->Init.SourceBurst | \
        hmdma->Init.DestBurst | \
        ((hmdma->Init.BufferTransferLength - 1U) << MDMA_CTCR_TLEN_Pos) | \
        hmdma->Init.TransferTriggerMode;
  asm volatile("" ::: "memory");
  mdma_tcr_dbg = tmp;                /* copy for the later UART dump */
  hmdma->Instance->CTCR = tmp;
  asm volatile("" ::: "memory");
  /* ... rest of MDMA_Init unchanged ... */

mdma_tcr_dbg is declared as volatile.
When printing the registers over UART (program free-running):
- mdma_tcr_dbg = 0x127FFAAA
- TCR = 0xFA7C055A !!!
When printing the registers over UART (breakpoint on the CTCR write, free-running before and after), TCR is OK.
2025-10-21 8:10 AM
Solved. I didn't see that the MDMA init was also called by the M4 core, with wrong parameters. The M4 was modifying the MDMA registers just after the M7 initialized them.
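For anyone else hitting this on a dual-core H7: one way to make sure only one core runs a shared-peripheral init is a hardware semaphore guard, something like this sketch (the semaphore ID is an arbitrary free one, my assumption, not CubeMX output):

#define MDMA_INIT_HSEM_ID  5u        /* any free HSEM, assumption */
__HAL_RCC_HSEM_CLK_ENABLE();
if (HAL_HSEM_FastTake(MDMA_INIT_HSEM_ID) == HAL_OK)
{
  MX_MDMA_Init();                    /* only the first core gets here */
  /* deliberately not released: the other core's FastTake fails, so it skips the init */
}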
Only one question persists:
> Since TLEN = 31 (buffer transfer length = 32 bytes), I have 125 buffers to transfer (1 block of 125 × 32-byte buffers = 4000 bytes). In this case, I'm not sure I am allowed to use the 128-beat burst mode, although it seems to work. If not, an easy workaround would be to use 4096-byte in/out buffers and transmit 4000 useful + 96 dummy bytes.
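If the 128-beat burst turns out to be illegal with TLEN = 31, the padding workaround I mentioned would look something like this (sizes and names are mine; 4096 is the next multiple of the 256-byte half-word burst above 4000):

#define XFER_BYTES  4000u
#define PAD_BYTES   4096u                    /* 16 bursts of 128 beats x 2 bytes */
static uint16_t src[PAD_BYTES / 2u];         /* placed in SRAM4 via the linker script */
static uint16_t dst[PAD_BYTES / 2u];         /* placed in DTCM via the linker script */
HAL_MDMA_Start_IT(&hmdma_mdma_channel0_sw_0,
                  (uint32_t)src, (uint32_t)dst, PAD_BYTES, 1);
/* only the first XFER_BYTES are meaningful; the last 96 bytes are dummies */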