"Pain" with D-cache cleaning&invalidating on STM32F7 family

SStor · ‎2023-05-06

I've often read here in the forum about strange problems with the D-cache enabled on the M7 core, and how they can be solved.

But the STM32F7xx family has a fast DTCM RAM that is accessible from both the CPU core and the DMA peripherals.

This DTCM memory should also be as fast as the D-cache, so no performance is lost and no extra cleaning/invalidating is necessary.

So shouldn't all buffers with DMA access simply be placed in this non-cacheable DTCM area, with no extra cache handling?

I can see only advantages of using DTCM memory for fixed peripheral DMA buffers (Ethernet, SPI, UART, ADC, ...).

Only the linker file has to be split into a DTCM region and the other, cacheable memory, and all DMA buffers (or other buffers with external access) have to be placed in the DTCM region.

Can anyone tell me if there are disadvantages or problems in speed/performance with DTCM RAM and DMA on F7 family?

And why is DTCM not used explicitly in the F7 HAL functions by ST? In theory it should also improve performance by avoiding unnecessary cache functions.

(Please don't confuse the possibilities of the F7 family with the H7 family: there, unfortunately, DMA1/DMA2 access to DTCM RAM seems not to be possible; as far as I know only the MDMA can reach it.)

Tesla DeLorean · ‎2023-05-06

Well I've historically used DTCM for DMA on the F7, mainly for SDMMC (so SD Cards and eMMC)

I wouldn't want to use it for video, or hard continuous data with long lasting contention.

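DCache Invalidate is very dangerous, because it throws away any dirty data still sitting in the affected cache lines. I'd only use the ByAddr form, and I'd be very wary of buffer alignment, and of data that might abut the buffer within the same 32-byte cache line.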
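
To make the alignment concern concrete, here is a minimal sketch of the arithmetic involved. The Cortex-M7 D-cache works on 32-byte lines, so a by-address operation always affects whole lines. The type and function names below are hypothetical helpers, not part of CMSIS or the HAL; on target, the computed span is what a call such as CMSIS `SCB_InvalidateDCache_by_Addr()` would effectively touch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cortex-M7 D-cache line size in bytes. */
#define DCACHE_LINE_SIZE 32u

/* Hypothetical helper type: the cache-line-aligned span that a
 * by-address maintenance operation would actually affect. */
typedef struct {
    uintptr_t start; /* rounded down to a line boundary  */
    size_t    size;  /* a multiple of DCACHE_LINE_SIZE   */
} cache_span_t;

/* Expand [addr, addr + len) outwards to whole cache lines. */
static cache_span_t cache_span_for(uintptr_t addr, size_t len)
{
    const uintptr_t mask  = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    const uintptr_t start = addr & ~mask;
    const uintptr_t end   = (addr + len + mask) & ~mask;
    cache_span_t span = { start, (size_t)(end - start) };
    return span;
}

/* True only if the buffer starts on a line boundary and fills whole
 * lines, so no unrelated variable can share a line with it. */
static bool buffer_owns_its_lines(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    return len != 0u && (addr & mask) == 0u && (len & mask) == 0u;
}
```

A 100-byte buffer at 0x20010004, for example, expands to the 128-byte span starting at 0x20010000. Whatever else the linker placed in those extra 28 bytes would be hit by an invalidate as well, which is why aligning DMA buffers to 32 bytes and padding them to a multiple of 32 is the usual precaution.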

The ST team are often not sophisticated, and things that could be done in examples often seem to be ignored, so the code that replicates endlessly is the source of headaches. Perhaps less dangerous than broken code.

The startup.s code could manage memory initialization via tables, not global symbols.

Hard Fault and Error Handlers could generate useful information by default, not mindless while(1) loops.

The linker could use symbols to fix things up, like the vector table, and NOT use defines.

The heap and stack could operate in entirely different memory spaces.

SStor · ‎2023-05-06

Thank you for sharing your experiences!

My first experiences with M7 cache on STM32F7xx were also very strange:

Because the CubeMX-generated linker file doesn't differentiate between DTCM and the other memory, a DMA buffer can end up anywhere in RAM.

So a small change, like adding a new variable, can cause very strange side effects: the linker moves a DMA buffer from inside DTCM to outside it, and nothing works anymore.
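
One way to catch this kind of silent relocation early is a range check on the buffer address, for example in a debug assertion at startup. The sketch below is only an illustration: `buffer_in_dtcm` is a hypothetical helper, the DTCM base of 0x20000000 applies to the F7 family, and the 64 KB size is an assumption that matches F74x/F75x parts (F76x/F77x have 128 KB), so check the reference manual for the actual device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* DTCM base address on STM32F7. */
#define DTCM_BASE ((uintptr_t)0x20000000u)
/* Assumption: 64 KB as on F74x/F75x; adjust for the actual part. */
#define DTCM_SIZE ((uintptr_t)(64u * 1024u))

/* Hypothetical helper: true if [addr, addr + len) lies fully in DTCM. */
static bool buffer_in_dtcm(uintptr_t addr, size_t len)
{
    if (len == 0u || addr < DTCM_BASE) {
        return false;
    }
    const uintptr_t offset = addr - DTCM_BASE;
    /* Written this way to avoid overflow in addr + len. */
    return offset < DTCM_SIZE && len <= (size_t)(DTCM_SIZE - offset);
}
```

On target this could be used as `assert(buffer_in_dtcm((uintptr_t)Rx_Buff, sizeof(Rx_Buff)));`, so that a linker change that pushes the buffer out of DTCM fails loudly instead of producing corrupted DMA data.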

I think many other users have similar experiences and don't know why. Completely disabling the D-cache can help to find out whether the problem has anything to do with the cache, which then leads to using the clean & invalidate cache functions before and/or after DMA access.
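
For readers who wonder why clean and invalidate are needed at all, the following is a deliberately simplified toy model of a single write-back cache line, runnable on a PC. It is not how the hardware is programmed; all names are made up for illustration. The CPU sees `cache` while the line is valid, the DMA controller only ever sees `ram`.

```c
#include <stdint.h>
#include <string.h>

#define TOY_LINE 32u

/* Toy model of one write-back cache line (illustration only). */
typedef struct {
    uint8_t ram[TOY_LINE];   /* what the DMA controller sees       */
    uint8_t cache[TOY_LINE]; /* what the CPU sees while valid      */
    int valid;
    int dirty;
} toy_line_t;

/* Load the line from RAM on a miss. */
static void toy_fill(toy_line_t *l)
{
    if (!l->valid) {
        memcpy(l->cache, l->ram, TOY_LINE);
        l->valid = 1;
    }
}

static void cpu_write(toy_line_t *l, unsigned i, uint8_t v)
{
    toy_fill(l);          /* write-allocate */
    l->cache[i] = v;
    l->dirty = 1;         /* RAM is now stale */
}

static uint8_t cpu_read(toy_line_t *l, unsigned i)
{
    toy_fill(l);
    return l->cache[i];
}

static void dma_write(toy_line_t *l, unsigned i, uint8_t v) { l->ram[i] = v; }
static uint8_t dma_read(const toy_line_t *l, unsigned i) { return l->ram[i]; }

/* Clean: write dirty data back to RAM (needed before DMA reads). */
static void toy_clean(toy_line_t *l)
{
    if (l->valid && l->dirty) {
        memcpy(l->ram, l->cache, TOY_LINE);
        l->dirty = 0;
    }
}

/* Invalidate: drop the line, dirty or not (needed after DMA writes). */
static void toy_invalidate(toy_line_t *l)
{
    l->valid = 0;
    l->dirty = 0;
}
```

In this model a CPU write is invisible to `dma_read` until `toy_clean` runs, and a `dma_write` is invisible to `cpu_read` until `toy_invalidate` runs. It also shows the hazard of invalidating a line that still holds unrelated dirty data: that data is simply lost.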

Therefore I've begun to place the buffers with external access by the DMA controller in DTCM explicitly, and most of the F7 HAL functions worked without problems; even the often problematic Ethernet driver worked perfectly:

/* USER CODE BEGIN 1 */
#pragma default_variable_attributes = @ "DTCMRAM"
/* USER CODE END 1 */
 
/* Private variables ---------------------------------------------------------*/
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef  DMARxDscrTab[ETH_RXBUFNB] __ALIGN_END;/* Ethernet Rx DMA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef  DMATxDscrTab[ETH_TXBUFNB] __ALIGN_END;/* Ethernet Tx DMA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN uint8_t Rx_Buff[ETH_RXBUFNB][ETH_RX_BUF_SIZE] __ALIGN_END; /* Ethernet Receive Buffer */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN uint8_t Tx_Buff[ETH_TXBUFNB][ETH_TX_BUF_SIZE] __ALIGN_END; /* Ethernet Transmit Buffer */
 
/* USER CODE BEGIN 2 */
#pragma default_variable_attributes =
/* USER CODE END 2 */
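
The pragmas above are IAR-specific. For GCC-based toolchains, a rough equivalent is a section attribute, sketched below under clearly stated assumptions: `.dtcm_buffers` is a hypothetical section name that would have to be mapped to the DTCM region in the project's own linker script, and the `DEMO_` sizes and `demo_rx_buff` are placeholders rather than the HAL's real definitions.

```c
#include <stdint.h>

/* Placeholder sizes for illustration only; in a real project the
 * buffer count and size come from the HAL configuration header. */
#define DEMO_RXBUFNB     4u
#define DEMO_RX_BUF_SIZE 1524u

/* ".dtcm_buffers" is a hypothetical section name. The linker script
 * must place it in the DTCM region, and the startup code may need to
 * zero it explicitly, since it is not part of the regular .bss. */
#if defined(__GNUC__) && defined(__arm__)
#define IN_DTCM __attribute__((section(".dtcm_buffers"), aligned(4)))
#else
#define IN_DTCM /* host build or other compiler: default placement */
#endif

/* Receive buffers pinned to DTCM when built with GCC for ARM. */
uint8_t demo_rx_buff[DEMO_RXBUFNB][DEMO_RX_BUF_SIZE] IN_DTCM;
```

Tagging each buffer individually is more verbose than a file-wide pragma, but it keeps the placement visible at the declaration and does not depend on where in the file a variable happens to be defined.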

I suppose using DTCM RAM should be even faster than the extra, time-consuming cache handling, because reading into the cache and writing back from it also takes time.

I think CubeMX has a very good intention:

For hardware design, the pins for the peripherals can be configured very quickly, and a starter project with working peripherals can be generated right away.

The HAL driver functions allow simple porting from one family to another and are fast enough in most cases, especially for configuration, startup, calibration, etc., without having to learn all the registers in detail.

It's a pity that there are so many bugs in it. The ST HAL drivers would probably improve as an open-source project...

Pavel A. · 2023-05-07

> Therefore I've begun to place the buffers with external access by the DMA controller in DTCM explicitly

A question.... is the DTCM memory of F7 shareable? In order to work correctly with two bus masters (CPU and DMA) it should be shareable? But then what is the point of making it core-coupled?

IIRC, CM7 core-coupled memory is not cacheable and cache management functions are no-op on it?

SStor · ‎2023-05-07

I've left the MPU in its default configuration (no special MPU settings for DTCM). So the whole RAM should be "non-shareable" by default, if I'm not wrong.

As I understand it, "shareable" is meant to resolve write-write and read-write conflicts between multiple masters (CPU core and DMA controller) accessing memory in parallel, so that a 32-bit access is always atomic (not split into old and new data parts).

WW conflicts shouldn't occur between the core and DMA, because the data transfer always goes in one direction: either the CPU writes data and the DMA controller reads it, or vice versa (CPU reads and DMA writes).

To avoid the RW conflicts that really do exist, the CPU should either wait until the DMA operation is complete or take the current DMA pointer into account, because otherwise it can't be sure whether a given address has already been processed, e.g. in circular buffers with continuous DMA access.
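
Taking the DMA pointer into account for a circular receive buffer could look roughly like the sketch below. It assumes byte-wide transfers and that `remaining` is the stream's current transfer counter (the NDTR register, which the HAL exposes through `__HAL_DMA_GET_COUNTER()`), counting down from the buffer size and reloading on wrap. The `dma_ring_t` type and both functions are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state for a circular DMA receive buffer. */
typedef struct {
    const uint8_t *buf; /* buffer the DMA writes into (size > 0) */
    size_t size;        /* buffer length in bytes                */
    size_t tail;        /* next index the CPU will read          */
} dma_ring_t;

/* Index the DMA will write next, derived from the remaining count.
 * remaining == size means the DMA has just wrapped to index 0. */
static size_t dma_ring_head(const dma_ring_t *r, size_t remaining)
{
    return (r->size - remaining) % r->size;
}

/* Copy only bytes the DMA has already written, i.e. [tail, head).
 * Returns the number of bytes copied into out. */
static size_t dma_ring_read(dma_ring_t *r, size_t remaining,
                            uint8_t *out, size_t out_max)
{
    const size_t head = dma_ring_head(r, remaining);
    size_t n = 0;

    while (r->tail != head && n < out_max) {
        out[n++] = r->buf[r->tail];
        r->tail = (r->tail + 1u) % r->size;
    }
    return n;
}
```

The CPU never reads at or beyond the position the DMA is about to write, so it only sees completed data. The sketch cannot detect an overrun where the DMA laps the reader by a full buffer; that needs a large enough buffer or the half/full transfer interrupts in addition.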

But D-cache clean/invalidate would have a similar problem: it takes a snapshot of memory to or from the cache while the DMA is still active and its pointer is moving forward, with unexpected data as a result...