FAQ: DMA is not working on STM32H7 devices

Document created by Adam Berlinger (Employee) on Apr 4, 2018. Last modified by Adam Berlinger (Employee) on Jun 12, 2018.

Q:

DMA is not working on STM32H7 devices, or the transmitted/received data are corrupted. Polling and interrupt based methods for the same peripheral configuration are working.

 

A:

The problem is related to two things: the memory layout of the STM32H7 and the internal data cache (D-Cache) of the Cortex-M7 core.

In summary these can be the possible issues:

  • Memory placed in DTCM RAM for D1/D2 peripherals. Unfortunately this memory is used as default in some projects including examples.
  • Memory not placed in D3 SRAM4 for D3 peripherals.
  • D-Cache enabled for DMA buffers, different content in cache and in SRAM memory.
  • Starting the DMA just after writing the data to the TX buffer, without a __DSB() instruction in between.

For Ethernet related problems, please see separate FAQ: FAQ: Ethernet not working on STM32H7x3 

Explanation: memory layout

The STM32H7 device consists of three bus matrix domains (D1, D2 and D3), as shown in the picture below. D1 and D2 are connected through bus bridges, and both can also access data in the D3 domain. However, there is no connection from the D3 domain to the D1 or D2 domain.

The DMA1 and DMA2 controllers are located in the D2 domain and can access almost all memories, with the exception of ITCM and DTCM RAM (DTCM is located at 0x20000000). These controllers are used in most cases.

The BDMA controller is located in the D3 domain and can access only SRAM4 and backup SRAM in the D3 domain.

The MDMA controller is located in the D1 domain and can access all memories, including ITCM/DTCM. This controller is mainly used for handling D1 peripherals and memory-to-memory transfers.

Bus matrix connections inside the STM32H7 device

From a performance point of view, it is better to put DMA buffers for DMA1/DMA2 inside the D2 domain (SRAM1, SRAM2 and SRAM3), since the D2-to-D1 bridge can add delay.
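The access rules above can be sketched as a small check that tells whether a buffer address is reachable by a given controller. This is only an illustration: the region sizes and the SRAM4/backup SRAM base addresses are assumptions based on an STM32H743-style memory map and should be verified against the reference manual of your device. All names (`mem_region_t`, `dma_can_access`, and so on) are hypothetical helpers, not part of any ST library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed STM32H743-style memory map; verify against your reference manual. */
typedef struct {
    uint32_t base;
    uint32_t size;
} mem_region_t;

static const mem_region_t REGION_ITCM    = { 0x00000000u, 64u * 1024u };
static const mem_region_t REGION_DTCM    = { 0x20000000u, 128u * 1024u };
static const mem_region_t REGION_SRAM4   = { 0x38000000u, 64u * 1024u };
static const mem_region_t REGION_BKPSRAM = { 0x38800000u, 4u * 1024u };

typedef enum { CTRL_DMA1_DMA2, CTRL_BDMA, CTRL_MDMA } dma_ctrl_t;

/* True when [addr, addr + len) lies entirely inside region r. */
static bool in_region(const mem_region_t *r, uint32_t addr, uint32_t len)
{
    uint64_t end = (uint64_t)addr + len;
    return addr >= r->base && end <= (uint64_t)r->base + r->size;
}

/* True when any byte of [addr, addr + len) falls inside region r (len > 0). */
static bool overlaps(const mem_region_t *r, uint32_t addr, uint32_t len)
{
    uint64_t end = (uint64_t)addr + len;
    return addr < (uint64_t)r->base + r->size && end > r->base;
}

/* Applies only the rules from the text; it does not check that the
   address is backed by real memory. */
static bool dma_can_access(dma_ctrl_t ctrl, uint32_t addr, uint32_t len)
{
    switch (ctrl) {
    case CTRL_DMA1_DMA2: /* everything except ITCM and DTCM */
        return !overlaps(&REGION_ITCM, addr, len) &&
               !overlaps(&REGION_DTCM, addr, len);
    case CTRL_BDMA:      /* only SRAM4 and backup SRAM in D3 */
        return in_region(&REGION_SRAM4, addr, len) ||
               in_region(&REGION_BKPSRAM, addr, len);
    case CTRL_MDMA:      /* all memories, including ITCM/DTCM */
        return true;
    default:
        return false;
    }
}
```

A check like this can be useful as a debug assertion before starting a transfer, for example to catch a buffer that the linker placed in DTCM by default.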

Explanation: handling DMA buffers with D-Cache enabled

The Cortex-M7 contains two internal caches: I-Cache for instructions and D-Cache for data. The D-Cache can break DMA transfers, because it holds newly written data in the internal cache and does not write it to SRAM right away. The DMA controller, however, reads the data from SRAM and not from the D-Cache. The reverse applies on reception: the DMA writes to SRAM, while the CPU may still read stale data from the D-Cache.

If the DMA transfer is started just after the code writes the data to tx_buffer, the data may still be in the write buffer inside the CPU when the DMA starts. A solution is to configure the tx_buffer memory as Device type, which forces the CPU to order the memory operations, or to add a __DSB() instruction before starting the DMA.
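A minimal sketch of the barrier placement follows. It assumes tx_buffer lives in memory that is not cached (otherwise a cache clean is also needed). `TARGET_STM32H7` is a hypothetical project define, `send_with_barrier` and `dma_start_fn` are illustrative names, and `fake_start_dma` is only a host-side stand-in for the real call that starts the transfer (for example a wrapper around HAL_UART_Transmit_DMA). Off-target, the barrier falls back to a compiler builtin so the sketch still compiles.

```c
#include <stddef.h>
#include <stdint.h>

/* On target, use the CMSIS __DSB() intrinsic from the device header.
   TARGET_STM32H7 is a hypothetical project define; the fallback is a
   host stand-in only. */
#ifdef TARGET_STM32H7
  #include "stm32h7xx.h"
  #define TX_BARRIER() __DSB()
#else
  #define TX_BARRIER() __sync_synchronize()
#endif

/* Hypothetical callback that starts the DMA transfer. */
typedef void (*dma_start_fn)(const uint8_t *buf, size_t len);

/* Copy the payload into the TX buffer, drain pending writes, then start DMA. */
static void send_with_barrier(uint8_t *tx_buffer, const uint8_t *payload,
                              size_t len, dma_start_fn start_dma)
{
    for (size_t i = 0; i < len; i++) {
        tx_buffer[i] = payload[i];
    }
    TX_BARRIER();               /* stores above complete before DMA is started */
    start_dma(tx_buffer, len);
}

/* Host-side stand-in for the DMA start call: records its arguments. */
static const uint8_t *last_buf;
static size_t last_len;

static void fake_start_dma(const uint8_t *buf, size_t len)
{
    last_buf = buf;
    last_len = len;
}
```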

There are several ways to manage DMA buffers with D-Cache:

  • Disable D-Cache globally. This is the simplest solution, but not an efficient one, since you can lose a large part of the performance. It can be useful for debugging, to check whether the problem is related to D-Cache.
  • Disable D-Cache for part of the memory. This can be done by configuring the memory protection unit (MPU). The downside is that MPU regions have certain alignment restrictions, and you need to place the DMA buffers in specific parts of memory. Each toolchain (GCC, IAR, KEIL) needs to be configured in a different way.
    • Note that MPU regions can overlap and the higher region number has priority. Together with the subregion disable bits, this can be used to soften the alignment and size restrictions.
    • Note that the Device and Strongly Ordered memory types do not allow unaligned access to the memory.
  • Configure part of the memory as write-through. This can be used only for TX DMA. Otherwise it is similar to the previous option.
  • Use cache maintenance operations. It is possible to write data stored in the cache back to memory ("clean" operation) for a specific address range, and also to discard data stored in the cache ("invalidate" operation).
    • The downside is that these operations work with the cache-line size, which is 32 bytes, so you can't clean or invalidate a single byte of the cache. This can lead to errors when an RX buffer "shares" a cache line with other data or with a TX buffer (please see the picture below).
    • Beware that with an uninitialized D-Cache, the maintenance operations "clean" or "clean and invalidate" can lead to a BusFault exception. This is caused by uninitialized ECC (error correction code) after power-on reset. If you have a project with a lot of maintenance operations and want to disable D-Cache temporarily, you can use the SCB_InvalidateDCache function, which invalidates the whole cache and sets the correct ECC without enabling the cache.
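Because maintenance operations act on whole 32-byte lines, it helps to compute exactly which lines a buffer touches and whether it shares any of them with other data. The following is a rough sketch of that arithmetic; `cache_range_t`, `cache_range_for` and `owns_its_cache_lines` are hypothetical helper names, not CMSIS functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-Cache line size in bytes */

typedef struct {
    uint32_t start;   /* address rounded down to a cache-line boundary */
    uint32_t length;  /* multiple of the line size covering the buffer */
} cache_range_t;

/* Smallest line-aligned range that covers [addr, addr + len). */
static cache_range_t cache_range_for(uint32_t addr, uint32_t len)
{
    uint32_t mask  = ~(DCACHE_LINE_SIZE - 1u);
    uint32_t start = addr & mask;
    uint32_t end   = (addr + len + DCACHE_LINE_SIZE - 1u) & mask;
    cache_range_t r = { start, end - start };
    return r;
}

/* True when the buffer starts and ends on line boundaries, so that
   invalidating it cannot discard unrelated data. */
static bool owns_its_cache_lines(uint32_t addr, uint32_t len)
{
    return (addr % DCACHE_LINE_SIZE == 0u) && (len % DCACHE_LINE_SIZE == 0u);
}
```

The resulting start and length are the values you would then pass to the CMSIS clean or invalidate "by address" functions. A buffer for which `owns_its_cache_lines` is false is safe to clean, but invalidating it may discard neighbouring data.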

Below are the possible MPU configurations. The Strongly Ordered, Device and Normal Non-cacheable configurations are suitable for DMA buffers. The write-through configuration is suitable only for TX-only DMA buffers. Reserved and Undefined encodings must not be used. The remaining (write-back) configurations are not suitable for DMA buffers and require cache maintenance operations:

TEX  Cacheable  Bufferable  Memory Type       Description                          Shareable
000  0          0           Strongly Ordered  Strongly Ordered                     Yes
000  0          1           Device            Shared Device                        Yes
000  1          0           Normal            Write through, no write allocate     S bit
000  1          1           Normal            Write-back, no write allocate        S bit
001  0          0           Normal            Non-cacheable                        S bit
001  0          1           Reserved          Reserved                             Reserved
001  1          0           Undefined         Undefined                            Undefined
001  1          1           Normal            Write-back, write and read allocate  S bit
010  0          0           Device            Non-shareable device                 No
010  0          1           Reserved          Reserved                             Reserved
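To show how the TEX, Cacheable, Bufferable and Shareable columns end up in hardware, here is a sketch that packs them into an MPU region attribute word. The bit positions follow the ARMv7-M MPU_RASR register layout, which is not described in this FAQ; full access permission and execute-never are assumptions chosen for a data buffer. `rasr_size_field` and `rasr_value` are hypothetical helpers. On target you would normally let HAL or CMSIS write the register instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the ARMv7-M MPU_RASR register. */
#define RASR_ENABLE    (1u << 0)
#define RASR_SIZE_POS  1u
#define RASR_B         (1u << 16)
#define RASR_C         (1u << 17)
#define RASR_S         (1u << 18)
#define RASR_TEX_POS   19u
#define RASR_AP_POS    24u
#define RASR_XN        (1u << 28)
#define RASR_AP_FULL   3u          /* assumed: full access for all modes */

/* SIZE field encoding: region size is 2^(SIZE + 1) bytes, minimum 32 bytes. */
static uint32_t rasr_size_field(uint32_t size_bytes)
{
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);
    uint32_t log2 = 0u;
    while ((1u << log2) < size_bytes) {
        log2++;
    }
    return log2 - 1u;
}

/* Build the attribute word for one row of the table above. */
static uint32_t rasr_value(uint32_t tex, bool cacheable, bool bufferable,
                           bool shareable, uint32_t size_bytes)
{
    return ((tex & 7u) << RASR_TEX_POS)
         | (cacheable  ? RASR_C : 0u)
         | (bufferable ? RASR_B : 0u)
         | (shareable  ? RASR_S : 0u)
         | (RASR_AP_FULL << RASR_AP_POS)
         | RASR_XN                       /* assumed: buffers are never executed */
         | (rasr_size_field(size_bytes) << RASR_SIZE_POS)
         | RASR_ENABLE;
}
```

For example, `rasr_value(1u, false, false, false, 32u * 1024u)` describes a 32 KB Normal Non-cacheable region (TEX=001, C=0, B=0). The region base address must additionally be aligned to the region size.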

Solution example 1: Simple placement of all memory to D1 domain

GCC (Atollic TrueStudio/System Workbench for STM32/Eclipse)

Replace DTCMRAM with RAM_D1 for section placement in linkerscript (.ld file extension). E.g. like this:

.data : 
{
  ... /* Keep same */
} >RAM_D1 AT> FLASH

The same should be done for the .bss and ._user_heap_stack sections.

In some linkerscripts, the initial stack is defined separately. So you either need to update it with the section, or define it inside the section like:

._user_heap_stack :
{
  . = ALIGN(8);
  PROVIDE ( end = . );
  PROVIDE ( _end = . );
  . = . + _Min_Heap_Size;
  . = . + _Min_Stack_Size;
  _estack = .; /* <<<< line added */
  . = ALIGN(8);
} >RAM_D1

and remove the original _estack definition.

IAR (in project settings):

For Keil:

Solution example 2: Placing buffers in separated memory part

NOTE: IAR compiler and Keil compiler version <= 5 allow placing variables at absolute address in code using compiler specific extensions.

C code:

Define placement macro:

#if defined( __ICCARM__ )
  #define DMA_BUFFER \
      _Pragma("location=\".dma_buffer\"")

#else
  #define DMA_BUFFER \
      __attribute__((section(".dma_buffer")))

#endif

Specify DMA buffers in code:

DMA_BUFFER uint8_t rx_buffer[256];

GCC linkerscript (*.ld file)

Place section to D2 RAM (you can also specify your own memory regions in linkerscript file):

.dma_buffer : /* Space before ':' is critical */
{
  *(.dma_buffer)
} >RAM_D2

Buffers placed this way are not initialized with default values at startup. If you need initial values, you have to place special symbols in the linkerscript and add your own initialization code.
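As a rough idea of what such initialization code could look like, the sketch below copies a section image word by word from its load address to its run address. It assumes the linkerscript has been extended to store the section image in flash and to export three symbols, here hypothetically named `_sidma_buffer` (load address), `_sdma_buffer` (start in RAM) and `_edma_buffer` (end in RAM), all 4-byte aligned. None of these symbols exist in the default linkerscript.

```c
#include <stdint.h>

/* Copy an initialized section from its load address (flash) to its run
   address range [start, end) in RAM, one 32-bit word at a time. */
static void copy_section(const uint32_t *load, uint32_t *start,
                         const uint32_t *end)
{
    while (start < end) {
        *start++ = *load++;
    }
}

/* On target, with the hypothetical linker symbols described above, this
   would be called early in startup, before any DMA buffer is used:
 *
 *   extern uint32_t _sidma_buffer, _sdma_buffer, _edma_buffer;
 *   copy_section(&_sidma_buffer, &_sdma_buffer, &_edma_buffer);
 */
```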

IAR linker file (*.icf file)

define region D2_SRAM2_region   = mem:[from 0x30020000 to 0x3003FFFF];
place in D2_SRAM2_region { section .dma_buffer};
initialize by copy { section .dma_buffer}; /* optional initialization of default values */

Keil scatter file (*.sct file)

LR_IROM1 0x08000000 0x00200000  {    ; load region size_region
  ER_IROM1 0x08000000 0x00200000  {  ; load address = execution address
   *.o (RESET, +First)
   *(InRoot$$Sections)
   .ANY (+RO)
  }
  RW_IRAM2 0x24000000 0x00080000  {  ; RW data
   .ANY (+RW +ZI)
  }
  ; Added new region
  DMA_BUFFER 0x30040000 0x200 {
  *(.dma_buffer)
  }
}

Generation of scatter file should be disabled in Keil:

Solution example 3: Use Cache maintenance functions

Transmitting data:

#define TX_LENGTH  (16)
uint8_t tx_buffer[TX_LENGTH];

/* Write data */
tx_buffer[0] = 0x0;
tx_buffer[1] = 0x1;

/* Clean D-cache */
/* Make sure the address is 32-byte aligned and add 32-bytes to length, in case it overlaps cacheline */
SCB_CleanDCache_by_Addr((uint32_t*)(((uint32_t)tx_buffer) & ~(uint32_t)0x1F), TX_LENGTH+32);

/* Start DMA transfer */
HAL_UART_Transmit_DMA(&huart1, tx_buffer, TX_LENGTH);

Receiving data:

#define RX_LENGTH  (16)
uint8_t rx_buffer[RX_LENGTH];

/* Invalidate D-cache before reception */
/* Make sure the address is 32-byte aligned and add 32-bytes to length, in case it overlaps cacheline */
SCB_InvalidateDCache_by_Addr((uint32_t*)(((uint32_t)rx_buffer) & ~(uint32_t)0x1F), RX_LENGTH+32);

/* Start DMA transfer */
HAL_UART_Receive_DMA(&huart1, rx_buffer, RX_LENGTH);
/* No access to rx_buffer should be made before DMA transfer is completed */

Please note that in the case of reception there can be a problem if rx_buffer is not aligned to the cache-line size (32 bytes) or does not fill whole cache lines: the invalidate operation can then discard other data that shares the same cache line(s) with rx_buffer.
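One way to avoid the shared cache line is to align the RX buffer to 32 bytes and round its size up to a multiple of 32, so that no other variable can land in the same lines. The sketch below assumes a C11 compiler for `_Alignas` and `_Static_assert`; with older compilers a toolchain-specific alignment attribute or pragma would be needed instead. `CACHE_ROUND_UP` is a hypothetical helper macro.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u
#define RX_LENGTH        16u

/* Round n up to the next multiple of the cache-line size. */
#define CACHE_ROUND_UP(n) \
    (((n) + DCACHE_LINE_SIZE - 1u) & ~(DCACHE_LINE_SIZE - 1u))

/* Aligned and padded so the buffer occupies whole cache lines by itself. */
static _Alignas(32) uint8_t rx_buffer[CACHE_ROUND_UP(RX_LENGTH)];

_Static_assert(sizeof(rx_buffer) % DCACHE_LINE_SIZE == 0u,
               "rx_buffer must cover whole cache lines");
```

With this declaration the invalidate call can use rx_buffer and sizeof(rx_buffer) directly, without masking the address or adding 32 bytes to the length. The DMA transfer itself still uses RX_LENGTH.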
