Senior
May 6, 2023
Question

"Pain" with D-cache cleaning&invalidating on STM32F7 family

I've often read here in the forum about strange problems with the D-cache enabled on the M7 core, and how they can be solved.

But the STM32F7xx family implements a fast DTCM RAM that is accessible from both the CPU core and the DMA peripherals.

This DTCM memory should be as fast as the D-cache, so there is no performance loss and no extra cleaning/invalidating is necessary.

So shouldn't all buffers with DMA access simply be placed in this non-cacheable DTCM memory area, with no extra cache handling?

I can see only advantages of using DTCM memory for fixed peripheral DMA buffers (Ethernet, SPI, UART, ADC, ...).

Only the linker file has to be split into DTCM memory and other cacheable memory, and all DMA buffers or other buffers with external access have to be placed in the DTCM memory region.

Can anyone tell me if there are disadvantages or problems in speed/performance with DTCM RAM and DMA on the F7 family?

And why doesn't ST use DTCM explicitly in the F7 HAL functions? Theoretically it should also increase performance by avoiding unnecessary cache functions.

(Please don't confuse the possibilities of the F7 family with those of the H7 family, because MDMA access to DTCM RAM unfortunately doesn't seem to be possible there.)

Tesla DeLorean
Guru
May 6, 2023

Well I've historically used DTCM for DMA on the F7, mainly for SDMMC (so SD Cards and eMMC)

I wouldn't want to use it for video, or hard continuous data with long lasting contention.

DCache Invalidate is very dangerous, I'd only use the ByAddr form, and I'd be very wary of buffer alignment, and data that might abut.
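To make the alignment concern concrete, here is a small host-runnable sketch. The Cortex-M7 D-cache works in 32-byte lines, and invalidating a line throws away whatever dirty data sits in it, including a neighbouring variable that merely shares the line with the buffer. The helper names `cache_span_for` and `buffer_owns_its_cache_lines` are made up for illustration; they are not part of CMSIS or the HAL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cortex-M7 D-cache line size in bytes. */
#define DCACHE_LINE_SIZE 32u

/* Hypothetical helper type: the run of whole cache lines covering a buffer. */
typedef struct {
    uintptr_t start;   /* buffer address rounded down to a line boundary */
    size_t    length;  /* byte count rounded up to whole cache lines */
} cache_span_t;

/* Smallest run of whole cache lines that covers [addr, addr + size). */
cache_span_t cache_span_for(uintptr_t addr, size_t size)
{
    const uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    cache_span_t span;

    span.start  = addr & ~mask;
    span.length = (size_t)(((addr + size + mask) & ~mask) - span.start);
    return span;
}

/* True when the buffer starts on a line boundary and fills whole lines,
   so that no unrelated data can share a cache line with it. */
bool buffer_owns_its_cache_lines(uintptr_t addr, size_t size)
{
    return size != 0u
        && (addr % DCACHE_LINE_SIZE) == 0u
        && (size % DCACHE_LINE_SIZE) == 0u;
}
```

For example, a 40-byte buffer at 0x20000010 touches the lines from 0x20000000 up to 0x20000040, so the span is 64 bytes and the check fails: anything else living in those 64 bytes would be hit by an invalidate. On the target, a buffer that passes the check is one you can hand to the CMSIS `SCB_InvalidateDCache_by_Addr` call without worrying about abutting data.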

The ST team are often not sophisticated, and things that could be done in examples often seem to be ignored, so the code that replicates endlessly is the source of headaches. Perhaps less dangerous than broken code.

The startup.s code could manage memory initialization via tables, not global symbols.

Hard Fault and Error Handlers could generate useful information by default, not mindless while(1) loops.

The linker could use symbols to fix things up, like the vector table, and NOT use defines.

And the heap and stack could operate in entirely different memory spaces.
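The table idea for the startup code could look roughly like the sketch below. It is written so that it runs on a host: the entry types, the function name and the stand-in arrays are all invented for illustration. A real startup file would fill the tables from linker-script symbols and might not be able to rely on `memcpy`/`memset` that early.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: copy `size` bytes from the load address (flash)
   to the run address (RAM), as is done for initialized data. */
typedef struct {
    const void *load;
    void       *run;
    size_t      size;
} copy_entry_t;

/* Hypothetical table entry: zero `size` bytes, as is done for .bss-style regions. */
typedef struct {
    void  *start;
    size_t size;
} zero_entry_t;

/* Walk both tables once. Supporting another memory region means adding a
   table row, not another hand-written loop with its own global symbols. */
void init_memory_from_tables(const copy_entry_t *copy, size_t copy_count,
                             const zero_entry_t *zero, size_t zero_count)
{
    for (size_t i = 0; i < copy_count; ++i) {
        memcpy(copy[i].run, copy[i].load, copy[i].size);
    }
    for (size_t i = 0; i < zero_count; ++i) {
        memset(zero[i].start, 0, zero[i].size);
    }
}

/* Host-side stand-ins for regions that the linker would normally describe. */
const uint8_t flash_image[4] = { 1, 2, 3, 4 };
uint8_t sram_data[4];
uint8_t dtcm_bss[8] = { 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA };

const copy_entry_t copy_table[] = {
    { flash_image, sram_data, sizeof sram_data },
};
const zero_entry_t zero_table[] = {
    { dtcm_bss, sizeof dtcm_bss },
};
```

Calling `init_memory_from_tables(copy_table, 1, zero_table, 1)` fills `sram_data` from the flash stand-in and clears `dtcm_bss`. A DTCM section then becomes just one more row in the table.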

SStor (Author)
Senior
May 6, 2023

Thank you for sharing your experiences!

My first experiences with M7 cache on STM32F7xx were also very strange:

Because the CubeMX-generated linker file doesn't differentiate between DTCM memory and other memory, a DMA buffer can end up anywhere in the whole memory.

So a small change like adding a new variable can cause very strange side effects: the linker moves a DMA buffer from inside DTCM to outside, and nothing works anymore.
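One way to catch such a silent move early is a range check on the buffer address at init time. The sketch below is only an illustration: `range_in_dtcm` is a made-up helper, and the DTCM window is an assumption (base 0x20000000 and 64 KB here; some F7 parts have 128 KB, so check the reference manual for your device).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DTCM window; adjust DTCM_SIZE to the actual part. */
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE (64u * 1024u)

/* True when [addr, addr + size) lies completely inside the DTCM window.
   Written with subtractions so the arithmetic cannot wrap around. */
bool range_in_dtcm(uint32_t addr, uint32_t size)
{
    if (size == 0u || size > DTCM_SIZE) {
        return false;
    }
    if (addr < DTCM_BASE) {
        return false;
    }
    return (addr - DTCM_BASE) <= (DTCM_SIZE - size);
}
```

On the target you would call it once per DMA buffer during startup, for example with `(uint32_t)Rx_Buff` and `sizeof Rx_Buff`, and stop in an error handler if it returns false. That way a relocated buffer shows up right after the build that moved it, instead of as a random DMA failure.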

I think many other users have similar experiences and don't know why. Completely disabling the D-cache can help to find out whether the problem has anything to do with the cache, which then leads to using clean & invalidate cache functions before and/or after DMA access.

Therefore I've begun to place the buffers with external access by DMA controller in DTCM explicitly, and most of the F7 HAL functions worked without problems; even the often problematic Ethernet driver worked perfectly:

/* USER CODE BEGIN 1 */
/* IAR-specific: place all following variable definitions in the "DTCMRAM" section */
#pragma default_variable_attributes = @ "DTCMRAM"
/* USER CODE END 1 */
 
/* Private variables ---------------------------------------------------------*/
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
 #pragma data_alignment=4
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef DMARxDscrTab[ETH_RXBUFNB] __ALIGN_END; /* Ethernet Rx DMA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
 #pragma data_alignment=4
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef DMATxDscrTab[ETH_TXBUFNB] __ALIGN_END;/* Ethernet Tx DMA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
 #pragma data_alignment=4
#endif
__ALIGN_BEGIN uint8_t Rx_Buff[ETH_RXBUFNB][ETH_RX_BUF_SIZE] __ALIGN_END; /* Ethernet Receive Buffer */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
 #pragma data_alignment=4
#endif
__ALIGN_BEGIN uint8_t Tx_Buff[ETH_TXBUFNB][ETH_TX_BUF_SIZE] __ALIGN_END; /* Ethernet Transmit Buffer */
 
/* USER CODE BEGIN 2 */
/* Restore the default placement for variables defined after this point */
#pragma default_variable_attributes =
/* USER CODE END 2 */
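The pragmas above are IAR-specific. For GCC-based toolchains a section attribute can do a similar job; the following is only a sketch under assumptions. The macro name `DTCM_BUFFER` and the section name `.dtcmram` are hypothetical, the section has to exist in the linker script and map to the DTCM address range, and the two size defines are placeholder values for illustration.

```c
#include <stdint.h>

/* Placeholder sizes for illustration; the real values come from the
   project's Ethernet configuration. */
#define ETH_RXBUFNB     4u
#define ETH_RX_BUF_SIZE 1524u

/* ".dtcmram" is a hypothetical output section name that the linker script
   must define and locate in DTCM. */
#if defined(__GNUC__) && defined(__arm__)
#define DTCM_BUFFER __attribute__((section(".dtcmram"), aligned(4)))
#else
#define DTCM_BUFFER _Alignas(4)  /* host build: keep only the alignment */
#endif

/* Same receive buffer as above, tagged for placement in DTCM. */
DTCM_BUFFER uint8_t Rx_Buff[ETH_RXBUFNB][ETH_RX_BUF_SIZE];
```

Keep in mind that the startup code only zeroes or copies the sections it knows about, so a custom section like this one may contain random data after reset unless the startup file is extended to handle it.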

I suppose using DTCM RAM should be even faster than the extra, time-consuming cache handling, because reading from and writing back to the cache also takes time.

I think CubeMX has a very good intention:

For hardware design, the pins for peripherals can be configured very quickly, and a starter project with working peripherals can be generated right away.

The HAL driver functions allow simple porting from one family to another and are fast enough in most cases, especially for configuration, startup, calibration, etc., without having to learn all the registers in detail.

It's a pity that there are many bugs in it. The STM HAL driver project would probably improve as an open-source project...

Pavel A.
Super User
May 7, 2023

> Therefore I've begun to place the buffers with external access by DMA controller in DTCM explicitly

A question.... is the DTCM memory of F7 shareable? In order to work correctly with two bus masters (CPU and DMA) it should be shareable? But then what is the point of making it core-coupled?

IIRC, CM7 core-coupled memory is not cacheable and cache management functions are no-op on it?