Hint: DMA and cache coherency

Torsten Jaekel · ‎2016-02-18

Posted on February 19, 2016 at 00:36

To share experience with all:

The STM32F7 has DMA controllers and caches (I have the DCACHE in mind here). You can use a DMA for Peripheral-to-Memory or even Memory-to-Memory transfers (I use it as a HW-based 'background' memcpy()).

But bear in mind: a DMA transfer does not go through the MCU's DCACHE. It writes directly to memory. If the DCACHE is enabled and the same memory location is already held in the cache, any update to memory done by the DMA is not 'visible' to the MCU. The MCU still sees the 'old' content because it reads from the cache.

In other words: the DMAs are not coherent. They do not force an update of the DCACHE (there is no Cache Coherent Interconnect, CCI, in the system).

There are some conclusions:

1) Before you send something from memory via DMA, do a Cache Clean maintenance operation: it forces memory to be updated with the cache content.

2) After something was received into memory via DMA, do a Cache Invalidate maintenance operation: it forces the cache to be refilled from memory so that the MCU sees the changes.
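To make the two rules above concrete, here is a minimal host-side sketch. It is a toy model, not STM32 code: one 32-byte block of RAM with a single write-back cache line in front of it. All names (toy_cache_t, cpu_write, dma_read and so on) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u /* Cortex-M7 D-cache line size in bytes */

/* Toy model (assumption): one block of "physical" RAM plus one cache line. */
typedef struct {
    uint8_t ram[LINE_SIZE];  /* what the DMA sees */
    uint8_t line[LINE_SIZE]; /* what the CPU sees while 'valid' is set */
    bool valid;
    bool dirty;
} toy_cache_t;

static void toy_init(toy_cache_t *c) { memset(c, 0, sizeof *c); }

/* Fill the cache line from RAM on a miss. */
static void toy_fill(toy_cache_t *c)
{
    if (!c->valid) {
        memcpy(c->line, c->ram, LINE_SIZE);
        c->valid = true;
        c->dirty = false;
    }
}

/* CPU accesses always go through the cache (write-back, write-allocate). */
static uint8_t cpu_read(toy_cache_t *c, uint32_t off)
{
    assert(off < LINE_SIZE);
    toy_fill(c);
    return c->line[off];
}

static void cpu_write(toy_cache_t *c, uint32_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    toy_fill(c);
    c->line[off] = v;
    c->dirty = true;
}

/* DMA accesses bypass the cache and touch RAM directly. */
static uint8_t dma_read(const toy_cache_t *c, uint32_t off)
{
    assert(off < LINE_SIZE);
    return c->ram[off];
}

static void dma_write(toy_cache_t *c, uint32_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    c->ram[off] = v;
}

/* Clean: write a dirty line back to RAM (rule 1, before a DMA send). */
static void cache_clean(toy_cache_t *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->ram, c->line, LINE_SIZE);
        c->dirty = false;
    }
}

/* Invalidate: drop the line without write-back (rule 2, after a DMA receive). */
static void cache_invalidate(toy_cache_t *c)
{
    c->valid = false;
    c->dirty = false;
}
```

In this model a value written with cpu_write() is invisible to dma_read() until cache_clean() runs, and a value written with dma_write() is invisible to cpu_read() until cache_invalidate() runs. On the real part the corresponding CMSIS calls are SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr().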

But I think there is a faster (and easier) way: use the DTCM memory region.

If you manage to place the DMA buffers in DTCM, you should be fine: the DCACHE is not involved, the memory is tightly coupled to the MCU, and the DMA has a dedicated path to it as well.

On DTCM you get 'coherency' with no need for cache maintenance.

Regular memories with the DCACHE 'in between' can look as if 'some data is missing' (not coherent).

BTW: even the C keyword 'volatile' can be 'tricky'. It tells the compiler not to optimize the access away and to really read and update the variable every time (in order to see the 'side effect'). But it has nothing to do with caches; it is not a cache maintenance operation.

If such a volatile variable is updated by a non-coherent master (a DMA is one), the MCU might still not see the new value, even though volatile is used and the variable is really read again (but from the cache, not from memory).

A DCACHE in the system needs careful consideration of what it means for specific features such as DMAs.

#dcache #dma #stm32f7

Amel NASRI · ‎2016-12-26

Posted on December 26, 2016 at 11:28

Hi,

Thanks for your helpful hint.

More hints on the same topic, with further details, can be found in

http://www.st.com/content/ccc/resource/technical/document/application_note/group0/08/dd/25/9c/4d/83/43/12/DM00272913/files/DM00272913.pdf/jcr:content/translations/en.DM00272913.pdf

(Level 1 cache on STM32F7 Series).

-Amel-


GreenGuy · ‎2018-05-04

Posted on May 04, 2018 at 19:22

BTW, I have a project on the H743i-Eval board, and I am using the RAM_D1 area for .data and .bss. I found that it did not matter whether the cache was enabled or not. When I set up a buffer to send data out of the UART via the DMA HAL call, the only way to make sure the data on the UART matched the data in the buffer was to call Invalidate Cache just before the call that starts the DMA. Clean Cache did not fix the incoherence. Before that I thought everything was working, because my initial tests used const data, i.e. a literal string placed in the call. Interestingly, that data comes from Flash, and it worked 100%. Thanks for pointing to AN4839. Just be aware that the sentence claiming Clean Cache is a solution is not always true.

Torsten Jaekel · ‎2018-05-07

Posted on May 07, 2018 at 20:51

I am also working on an STM32H7 Nucleo project: SPI with DMA. Yes, the DMA is not coherent (no CCI on the bus matrix), so DMA transfers from/to memory that the MCU accesses through caches need careful cache maintenance.

I have realized:

a) Using DTCM is not possible: the DMA cannot access it, and a buffer in DTCM results in a DMA error (obvious).

b) Not enabling the cache works fine for me: the DMA transfers properly in both directions (MCU and DMA both access the physical memory directly).

c) With caches enabled we have to use Clean and/or Invalidate. It works for me, using SCB_CleanDCache_by_Addr etc.

BTW: when using SCB_InvalidateDCache_by_Addr (or, similarly, SCB_CleanDCache_by_Addr), make sure the buffer is aligned to the Cache Line Size, i.e. starts on a 32-byte boundary:

__ALIGNED(32)

is needed for the buffer definition.
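As an illustration of such a buffer definition, here is a small sketch that also pads the buffer size to a whole number of cache lines. It uses the C11 keyword _Alignas as a portable stand-in for the CMSIS macro __ALIGNED(32), so it compiles off-target as well. The names CACHE_ROUND_UP, RX_PAYLOAD_SIZE and dma_rx_buf are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u /* Cortex-M7 D-cache line size in bytes */

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_ROUND_UP(n) \
    ((((n) + CACHE_LINE_SIZE - 1u) / CACHE_LINE_SIZE) * CACHE_LINE_SIZE)

#define RX_PAYLOAD_SIZE 50u /* hypothetical payload size */

/* Starts on a 32-byte boundary and occupies whole cache lines only,
 * so no unrelated variable can share a cache line with the buffer. */
static _Alignas(CACHE_LINE_SIZE) uint8_t dma_rx_buf[CACHE_ROUND_UP(RX_PAYLOAD_SIZE)];

/* Compile-time check (C11): the buffer spans whole cache lines. */
static_assert(sizeof dma_rx_buf % CACHE_LINE_SIZE == 0u,
              "DMA buffer must span whole cache lines");
```

Padding the size matters for Invalidate in particular: if another variable shares the last cache line of the buffer, invalidating that line can discard a CPU write to that variable that has not yet reached memory.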

Which one to use depends on the direction. If the MCU generates or updates a buffer that is then transferred via DMA (Mem-to-Peri), we have to Clean the cache (write the modified cache content back so that memory is updated). If the DMA receives and writes into memory (Peri-to-Mem), the MCU (cache) must be 'informed' about the 'out-of-sync' state (the cache no longer matches memory and must be refilled). Then an Invalidate is needed ('forget' the current cache content and refill the cache).

It works fine for me.

Example code from my project:

//align buffer with cache line size
uint8_t uartRxBuf[UART_RX_STR_SIZE] __ALIGNED(32) __attribute__((section(".ram1")));

//...

if (xSemaphoreTake(xSemaphoreSPI1Tx, portMAX_DELAY /*1000*/) == pdTRUE)
{
    //clean the buffer so that the DMA sees what the MCU has written
    SCB_CleanDCache_by_Addr((uint32_t *)txBuf, ((len + 31) / 32) * 32);

    if (HAL_SPI_Receive_DMA(&hspi4, rxBuf, len) != HAL_OK)
    {
        Error_Handler();
    }

    if (HAL_SPI_Transmit_DMA(&hspi1, txBuf, len) != HAL_OK)
    {
        Error_Handler();
    }

    //wait for Rx complete
    if (xSemaphoreTake(xSemaphoreSPI4Rx, portMAX_DELAY /*1000*/) == pdTRUE)
    {
        //invalidate so that the MCU sees the DMA results
        SCB_InvalidateDCache_by_Addr((uint32_t *)rxBuf, ((len + 31) / 32) * 32);

        return HAL_OK;
    }
}

BTW:

We could also initialize and use the MPU: we could configure one RAM region with caching disabled, or as 'write-through' (for the MCU -> DMA -> Peripheral direction).

Just bear in mind: with caches enabled, the MCU uses the cache content, but a DMA uses physical memory. They can be 'out-of-sync', and we need cache maintenance operations to make the DMA coherent with the MCU (cache).

Remark: if you use a lot of other data memory together with the DCache, it can look as if it works (the cache is updated often, some cache lines are evicted, and memory updated by a DMA is reloaded because it is no longer in the cache). But it will fail if all the data memory we use fits into the DCache. In that case the MCU runs entirely from cache, and the cache is no longer in sync with memory. Therefore: when using DMAs, do not forget to handle the cache (Clean to update memory before a DMA is launched, Invalidate to update the cache from memory after a DMA has finished).

GreenGuy · ‎2018-05-08

Posted on May 08, 2018 at 18:29

Thanks Jaekel, I have confirmed the same behavior using the UART DMA. However, for me, not enabling the cache (not calling SCB_EnableDCache()) does not work. There is also SCB_DisableDCache(), and calling it does not fix the data corruption either. The only thing that works for me is to enable the cache and judiciously use Clean or Invalidate, depending on the direction, just before or after the DMA call.

GreenGuy · ‎2018-05-08

Posted on May 08, 2018 at 19:52

Just for completeness, and to make sure I was not missing something about the MPU operation, I went back to see whether I could get the SRAM_D1 region at 0x24000000 to exhibit write-through without buffering. I went as far as setting a breakpoint in the HAL code after the MPU is set up for the region and double-checking that C, B, S and TEX were set to the values given in AN4838 for Normal, Write-back, no write allocate, with the Share bit on. I also did it for Strongly Ordered operation. Regardless of the setup, the UART data does not match the expected data. The only solution that works is to not place SRAM_D1 under MPU control, enable the cache, and use the Clean and Invalidate Cache calls at the right time.

Manish Sharma · ‎2018-05-31

Posted on May 31, 2018 at 12:49

Hi All,

I was doing an SPI DMA transmit operation and made some observations that confused me. Please help.

Observation 1:

I used a global buffer (uint8_t txBuf[5];) and enabled the D-Cache (using STM32CubeMx) inside main(). To perform the DMA I then need to call SCB_CleanDCache_by_Addr((uint32_t*)&txBuf[0], 5) before HAL_SPI_Transmit_DMA(&hspi4, txBuf, 5); otherwise the DMA doesn't work, or I need to configure the MPU with MPU_Config() for the DMA to work.

Observation 2:

I used a global buffer (uint8_t txBuf[10];) and did not enable the D-Cache (using STM32CubeMx) inside main(). To perform the DMA I then don't need to call SCB_CleanDCache_by_Addr((uint32_t*)&txBuf[0], 5) before HAL_SPI_Transmit_DMA(&hspi4, txBuf, 5), and the DMA works fine.

I am confused by the results, and I checked them 10-20 times. I am unable to reach a conclusion as I am new to this.

Regards

Manish

Torsten Jaekel · ‎2018-05-31

Posted on May 31, 2018 at 18:09

Actually, your observations seem to be correct. Just bear in mind: a DMA is a master, like the MCU (in terms of memory access). But the DMAs in such MCUs are 'NOT COHERENT'. It means: there is no CCI (Cache Coherent Interconnect) on the bus fabric. A DMA accesses memory directly, w/o any caches involved. But the MCU accesses memory through the caches, not 'really' from/to memory.

So, with caches enabled, the MCU quite likely reads from the cache, whereas the DMA reads and writes physical memory directly. Two cases need the Cache Maintenance functions (if the cache is enabled or it is not configured as 'write back'): (A) the DMA writes new memory content but the MCU still reads from the cache; (B) the MCU writes new content, but it still 'hangs' in the cache and the DMA does not yet see it in physical memory.

You, as the software engineer, have to keep this in mind and make it coherent between the MCU (caches) and the DMA (memories).

Comment:

When you do cache maintenance via CleanDCache (used to write the MCU caches to memory before a DMA is kicked off, see (B)) or InvalidateDCache (used after a DMA has finished to let the MCU caches refill, see (A)), you have to bear in mind the ALIGNMENT with the Cache Line Size (here 32 bytes).

These _by_Addr functions clean or invalidate Cache Lines! So the start address of your buffer should be 32-byte aligned, e.g. use __ALIGNED(32) on the definition, or you should take the address of your buffer and round it down to the next lower 32-byte boundary when you call the function (with the length parameter rounded up to a multiple of 32!).

If you don't, and the address parameter for the _by_Addr cache maintenance call is not aligned or the length is not a multiple of 32, then either a) nothing is done (because the address is not Cache Line aligned), or b) part of your buffer, e.g. the first bytes or the first Cache Line, is not updated.

==> Align with the Cache Line Size and make sure the length covers all needed Cache Lines (multiples of 32 bytes).
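The rounding described above can be sketched as a small helper. This is only an illustration: cache_range_t and cache_range_for are hypothetical names, and the helper is plain C that computes the range without touching any hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uintptr_t start; /* rounded down to a cache line boundary */
    size_t    size;  /* multiple of the cache line size */
} cache_range_t;

/* Compute the cache-line-aligned range that fully covers
 * the bytes [addr, addr + len). */
static cache_range_t cache_range_for(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)CACHE_LINE_SIZE - 1u;
    cache_range_t r;

    r.start = addr & ~mask; /* round the start address down */
    if (len == 0u) {
        r.size = 0u;        /* nothing to maintain */
        return r;
    }
    /* round the end address up, then take the distance */
    r.size = (size_t)(((addr + len + mask) & ~mask) - r.start);
    return r;
}
```

On the target the result would be passed on, e.g. SCB_CleanDCache_by_Addr((uint32_t *)r.start, (int32_t)r.size). Note that rounding only the length, as in ((len+31)/32)*32, covers the whole buffer only when the start address is already aligned: a 5-byte buffer starting 2 bytes before a line boundary spans two cache lines, i.e. 64 bytes.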

Torsten Jaekel · ‎2018-05-31

Posted on May 31, 2018 at 18:20

Sorry, I think the cache should be configured as 'write-through'. Then, if the MCU writes, the DMA should see the updated memory even w/o a cache maintenance call (any MCU write goes directly to memory).

But an Invalidate is still needed for the other direction (the MCU reads from cache but the DMA wrote to memory: still not coherent).

So, an MPU configuration seems to be needed if caches are enabled (and DMAs are used). Check the manual for the default w/o the MPU enabled: different SRAM regions can have different default cache modes.

I suggest, if caches are enabled and DMAs are used:

a) set up and check the MPU configuration (regions)

b) check which DMA can access which memory (esp. which memory is NOT accessible by DMA, e.g. DTCM).

(BTW: DTCM does not have caches involved, so it could be 'coherent'. But not all DMA engines can access DTCM.)
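Check b) can be turned into a cheap runtime guard. The sketch below assumes the STM32H743 memory map, with 128 KB of DTCM starting at 0x20000000; this range is an assumption to verify against the reference manual of the part in use, and buffer_overlaps_dtcm is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed DTCM range (STM32H743); check the reference manual for your part. */
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE 0x00020000u /* 128 KB */

/* Return true if any byte of [addr, addr + len) lies inside DTCM. */
static bool buffer_overlaps_dtcm(uint32_t addr, size_t len)
{
    /* 64-bit arithmetic so that addr + len cannot wrap around */
    const uint64_t start = addr;
    const uint64_t end   = start + (uint64_t)len; /* one past the last byte */
    const uint64_t lo    = DTCM_BASE;
    const uint64_t hi    = (uint64_t)DTCM_BASE + DTCM_SIZE;

    if (len == 0u) {
        return false; /* an empty buffer overlaps nothing */
    }
    return (start < hi) && (end > lo);
}
```

A driver could call this on the buffer pointer before starting a transfer on a DMA engine that has no path to DTCM, and report an error early instead of running into a DMA transfer error.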

Manish Sharma · ‎2018-05-31

Posted on June 01, 2018 at 05:55

Great Explanation !!

What is the use case for MPU_Config() if I am able to do the DMA transmit using SCB_CleanDCache_by_Addr()? I did one more test:

Observation:

I used a global buffer (uint8_t txBuf[5]) and enabled the D-Cache (using STM32CubeMx) inside main(). Then, to perform the DMA transmit, I enabled the MPU and configured it (snippet below). My question: whether I configure the region as MPU_ACCESS_NOT_CACHEABLE or MPU_ACCESS_CACHEABLE, MPU_ACCESS_SHAREABLE or MPU_ACCESS_NOT_SHAREABLE, MPU_ACCESS_BUFFERABLE or MPU_ACCESS_NOT_BUFFERABLE, I am able to transmit my buffer with every combination of attributes. So what is the significance of these MPU attributes, given the recommendations that the 'transmit buffer' must be configured as 'MPU_ACCESS_NOT_CACHEABLE' for DMA to work? I referred to AN4839 and AN4838, which discuss this.

void MPU_Config(void)
{
    MPU_Region_InitTypeDef MPU_InitStruct;

    /* Disable the MPU */
    HAL_MPU_Disable();

    /* Initialize and configure the region and the memory to be protected */
    MPU_InitStruct.Enable = MPU_REGION_ENABLE;
    MPU_InitStruct.Number = MPU_REGION_NUMBER0;
    MPU_InitStruct.BaseAddress = 0x30040000;
    MPU_InitStruct.Size = MPU_REGION_SIZE_256B;
    MPU_InitStruct.SubRegionDisable = 0x0;
    MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
    MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
    MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_ENABLE;
    MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
    MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
    MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;

    HAL_MPU_ConfigRegion(&MPU_InitStruct);

    /* Enable the MPU */
    HAL_MPU_Enable(MPU_HFNMI_PRIVDEF);
}

From AN4839 (linked earlier in this thread):

The data coherency between the core and the DMA is ensured by:

1. Either making the SRAM1 buffers not cacheable. (I didn't do this, but I am still able to perform the DMA transmit.)

2. Or modifying the SRAM1 region in the MPU attribute to a shared region. (I didn't do this, but I am still able to perform the DMA transmit.)

3. Or making the SRAM1 buffer cache enabled with write-through policy. (How can we do this using STM32CubeMx?)
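For reference, the cache policy of an MPU region follows from its TEX, C and B bits. The sketch below decodes the common combinations according to the ARMv7-M encoding and states which maintenance each resulting type needs around a DMA transfer. It is a host-side illustration with hypothetical names (mem_type_t, mpu_mem_type and the two needs_* helpers). It does not model the S (shareable) bit, and it folds both write-back variants into one value.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE,
    MEM_NORMAL_WRITE_THROUGH,
    MEM_NORMAL_WRITE_BACK,
    MEM_NORMAL_NON_CACHEABLE,
    MEM_OTHER /* reserved, implementation defined, or TEX >= 4 */
} mem_type_t;

/* Decode the TEX/C/B attribute bits of an ARMv7-M MPU region. */
static mem_type_t mpu_mem_type(unsigned tex, bool c, bool b)
{
    assert(tex <= 7u); /* TEX is a 3-bit field */

    if (tex == 0u) {
        if (!c) {
            return b ? MEM_DEVICE : MEM_STRONGLY_ORDERED;
        }
        /* C=1: B=0 is write-through, B=1 is write-back (no write allocate) */
        return b ? MEM_NORMAL_WRITE_BACK : MEM_NORMAL_WRITE_THROUGH;
    }
    if (tex == 1u) {
        if (!c && !b) {
            return MEM_NORMAL_NON_CACHEABLE;
        }
        if (c && b) {
            return MEM_NORMAL_WRITE_BACK; /* write and read allocate */
        }
    }
    if (tex == 2u && !c && !b) {
        return MEM_DEVICE; /* non-shared device */
    }
    return MEM_OTHER;
}

/* Clean is needed before the DMA reads memory only if CPU writes can
 * linger in the cache, i.e. for write-back regions. */
static bool needs_clean_before_tx(mem_type_t t)
{
    return t == MEM_NORMAL_WRITE_BACK;
}

/* Invalidate is needed after the DMA writes memory for any cacheable type. */
static bool needs_invalidate_after_rx(mem_type_t t)
{
    return t == MEM_NORMAL_WRITE_BACK || t == MEM_NORMAL_WRITE_THROUGH;
}
```

By this encoding, the combination in the MPU_Config() snippet above (MPU_TEX_LEVEL0, MPU_ACCESS_CACHEABLE, MPU_ACCESS_NOT_BUFFERABLE) decodes as write-through. That may be one reason why a transmit from a buffer inside that region works without a Clean, provided the buffer really lies within the configured 256 bytes.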

I also referred to this thread:

https://community.st.com/thread/30147

and I have the same confusion about the points made there. These are the points in question:

In the particular case when DMA is used, we have the following recommendations:

If the software is using cacheable memory regions for the DMA source and/or destination buffers, the software must trigger a cache clean before starting a DMA operation to ensure that all the data are committed to the subsystem memory. After the DMA transfer completes, when reading the data from the peripheral, the software must perform a cache invalidate before reading the DMA-updated memory region. (I am able to transmit without cache maintenance such as a D-Cache clean.)

It is always better to use non-cacheable regions for DMA buffers. The software can use the MPU to set up a non-cacheable memory block to use as shared memory between the CPU and the DMA. (I didn't make it non-cacheable but am still able to transmit, so what is a good case for doing it?)

I am pretty confused, as my experiments say something different from what is written in the references and said by the experts.