Hint: DMA and cache coherency

Torsten Jaekel · ‎2016-02-18

Posted on February 19, 2016 at 00:36

To share experience with all:

The STM32F7 has DMA controllers and caches (I have the DCACHE in mind here). You can use a DMA for peripheral-to-memory or even memory-to-memory transfers (I use it as a hardware-based 'background' memcpy()).

But you should bear in mind: a DMA transfer does not go through the MCU's DCACHE. It writes directly to memory. If the DCACHE is enabled and the same memory location is already held in the cache, any update to memory (done by the DMA) is not 'visible' to the MCU. The MCU will still see the 'old' content because it reads from the cache.

It means: the DMA is not coherent with the DCACHE; its writes do not force a cache update (there is no Cache Coherency Interconnect, CCI, in the system).
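The stale-read effect described above can be reproduced with a toy model on a PC. This is only a sketch under simplifying assumptions: one memory word, one cache line, and hypothetical names (toy_system_t, cpu_read, dma_write, cache_invalidate) that do not exist in any STM32 or CMSIS library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, NOT real hardware: one memory word and one cache line. */
typedef struct {
    uint32_t memory;      /* what is really stored in SRAM       */
    uint32_t cached;      /* copy held in the data cache         */
    bool     line_valid;  /* true while the cache holds a copy   */
} toy_system_t;

/* CPU read: served from the cache on a hit, from memory on a miss. */
uint32_t cpu_read(toy_system_t *s)
{
    assert(s != NULL);
    if (!s->line_valid) {        /* miss: fill the line from memory */
        s->cached = s->memory;
        s->line_valid = true;
    }
    return s->cached;
}

/* DMA write: goes straight to memory and never touches the cache. */
void dma_write(toy_system_t *s, uint32_t value)
{
    assert(s != NULL);
    s->memory = value;
}

/* Cache invalidate: drop the cached copy so the next read refills. */
void cache_invalidate(toy_system_t *s)
{
    assert(s != NULL);
    s->line_valid = false;
}
```

In this model, a cpu_read() issued after dma_write() keeps returning the old value until cache_invalidate() is called, which is the behaviour the post warns about.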

Two conclusions follow:

1) Before you send something from memory via DMA, you need a Cache Clean maintenance operation: it forces memory to be updated with the cache content.

2) After something was received into memory via DMA, you need a Cache Invalidate maintenance operation: it discards the cached copy so that the next read fetches the changed content from memory.
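A minimal sketch of how the two rules might look in practice. Cache maintenance by address works on whole cache lines (32 bytes on the Cortex-M7), so the helper below expands a buffer to line boundaries. The helper name cache_align_range and the buffer names tx_buf/tx_len are hypothetical; the CMSIS-Core calls SCB_CleanDCache_by_Addr and SCB_InvalidateDCache_by_Addr are shown only in a comment because they need the target headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uintptr_t start;   /* rounded down to a cache-line boundary */
    size_t    length;  /* multiple of the cache-line size       */
} cache_range_t;

/* Expand [addr, addr + len) to whole cache lines. */
cache_range_t cache_align_range(uintptr_t addr, size_t len)
{
    assert(len > 0u);
    uintptr_t mask  = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + DCACHE_LINE_SIZE - 1u) & mask;
    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}

/* On target (assumption: CMSIS-Core for Cortex-M7 is included):
 *
 *   cache_range_t r = cache_align_range((uintptr_t)tx_buf, tx_len);
 *
 *   // 1) before DMA reads the buffer (TX): write dirty lines to memory
 *   SCB_CleanDCache_by_Addr((uint32_t *)r.start, (int32_t)r.length);
 *
 *   // 2) after DMA wrote the buffer (RX): drop stale cached lines
 *   SCB_InvalidateDCache_by_Addr((uint32_t *)r.start, (int32_t)r.length);
 */
```

Note that expanding the range also touches whatever else shares those cache lines, which is one reason to give DMA buffers their own 32-byte-aligned storage.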

But I think there is a faster (and easier) way: use the DTCM memory region.

If you manage to place the DMA buffers in DTCM, you should be fine: the DCACHE is not involved, the memory is tightly coupled to the MCU, and the DMA has a dedicated path to it as well.

With DTCM you get 'coherency' and do not need to deal with cache maintenance.

Regular memories with the DCACHE 'in between' might look as if 'some data is missing' (not coherent).
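A small sketch of a sanity check that a DMA buffer really ended up in DTCM. Assumptions: DTCM starts at 0x20000000 (as on STM32F7 and STM32H7), the size constant is a placeholder to be taken from the reference manual of the actual device, and buffer_in_dtcm is a hypothetical helper. Placing the buffer there normally needs a section attribute plus a matching output section in the linker script, and whether a given DMA controller can reach DTCM depends on the device's bus matrix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions: base as on STM32F7/H7; size is a placeholder value. */
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE (64u * 1024u)

/* True if [addr, addr + len) lies completely inside DTCM. */
bool buffer_in_dtcm(uint32_t addr, size_t len)
{
    assert(len > 0u);
    if (addr < DTCM_BASE || len > DTCM_SIZE) {
        return false;
    }
    /* Compare offsets so that addr + len cannot overflow. */
    return (size_t)(addr - DTCM_BASE) <= (size_t)DTCM_SIZE - len;
}
```

Such a check could be called once at start-up with the address of each DMA buffer, so that a linker script change that moves a buffer into cached SRAM is noticed early.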

BTW: even the C keyword 'volatile' can be 'tricky': it tells the compiler not to optimize accesses away, so the variable is really read and written every time (in order to see the 'side effect'). But it is not related to caches; it is not a cache maintenance operation:

if such a volatile variable is updated by a non-coherent master (a DMA is one), the MCU might still not see the new value, even though volatile is used and the variable really is read again (but from the cache, not from memory).

A DCACHE in the system needs careful consideration of what it means for specific features such as DMA.

#dcache #dma #stm32f7

Torsten Jaekel · ‎2018-06-01

Posted on June 01, 2018 at 19:59

Hi Manish,

Hard to say without knowing your memory layout (linker script).

What I see:

You configure the MPU for a very small region, 256 bytes, somewhere at the end of SRAM D2 (0x30040000 is 256 KB after the start of D2). If you use D1 (AXI SRAM, starting at 0x24000000) for all of your data, buffers etc., this MPU config does not have any effect. (Comment: this is based on an STM32H7 MCU, not an STM32F7!)

You have to know your memory layout and linker script: where is your txBuf located after linking all files?

I would assume your MPU config does not match the memory layout (different memories and regions are used).

I also guess that with caches enabled but the MPU not configured, the SRAMs use a default cache policy. If that default is write-through, everything looks OK and works. (I no longer remember where I found the default memory cache policies, sorry.)

Try to figure out which memory you really use in the end (check the MAP file created by the linker or compare with the linker script). Configure additional MPU regions (I guess up to 16 are possible) and make sure that the SRAM actually used, where txBuf is located after linking, is configured via the MPU (see my code piece below).

CubeMX cannot generate any MPU code: it generates source code, but for the MPU config the final memory layout must be known, so the linker script matters. CubeMX does not have any clue about your linker script. Bear in mind that only after compiling and linking (when the real BIN/ELF is generated) are the address regions mapped and the final physical addresses known. At source code level you cannot figure out how the MPU config should look. It is heavily tied to the linker script.

After building your project, check the MAP file generated by the linker and find where your txBuf, stack, DATA, BSS sections etc. are located. Or check your linker script to see where data, bss, stack etc. are placed. Then compare and check whether your MPU configuration matches and makes sense.

You cannot automate this, e.g. have the MPU config source code take its info from the linker and linker script: the MPU config is done at source code level, but the final addresses of the memory regions are known only after the last linker step. So you have to know how your memory layout will look in the end, or you configure all potential memory regions, large enough (check the .Size parameter), so that even D2, D3 etc. are configured though maybe never used.
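The comparison between buffer address and MPU region can at least be checked at run time. The sketch below is an illustration only: mpu_region_is_valid and mpu_region_covers are hypothetical helpers, and the buffer address used in the usage note is made up. It relies on the ARMv7-M MPU rules that a region size is a power of two of at least 32 bytes and that the base address is aligned to the size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU rule: size is a power of two (>= 32 bytes) and the
 * base address is aligned to that size. */
bool mpu_region_is_valid(uint32_t base, uint32_t size)
{
    bool pow2 = (size >= 32u) && ((size & (size - 1u)) == 0u);
    return pow2 && ((base & (size - 1u)) == 0u);
}

/* True if the buffer [addr, addr + len) lies fully inside the region. */
bool mpu_region_covers(uint32_t base, uint32_t size,
                       uint32_t addr, uint32_t len)
{
    assert(mpu_region_is_valid(base, size));
    if (len == 0u || len > size || addr < base) {
        return false;
    }
    /* Compare offsets so that addr + len cannot overflow. */
    return (addr - base) <= (size - len);
}
```

With the numbers from this thread, a 256-byte region at 0x30040000 does not cover a 1000-byte buffer placed somewhere in D1 (for example at the made-up address 0x24001000), while a 512 KB region at 0x24000000 does.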

Also make sure the region size (you have just 256 B) really covers the SRAM region where your txBuf is located in the end (BTW: adding any code can change the physical address).

Assuming you use D1 (AXI SRAM) as I do via my linker script, my MPU config (part of it) looks like this:

/* Configure the MPU attributes as Cacheable write through
   for our main memory */
MPU_InitStruct.Enable = MPU_REGION_ENABLE;
MPU_InitStruct.BaseAddress = 0x24000000;
MPU_InitStruct.Size = MPU_REGION_SIZE_512KB;
MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
MPU_InitStruct.Number = MPU_REGION_NUMBER2;
MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
MPU_InitStruct.SubRegionDisable = 0x00;
MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_ENABLE;
HAL_MPU_ConfigRegion(&MPU_InitStruct);

My linker script looks like this (D1 is used, starting at 0x24000000). My MPU config covers the entire 512 KB of D1.

/* Initialized data sections go into RAM, load LMA copy after code */
.data :
{
  . = ALIGN(4);
  _sdata = .;   /* create a global symbol at data start */
  *(.data)      /* .data sections */
  *(.data*)     /* .data* sections */
  . = ALIGN(4);
  _edata = .;   /* define a global symbol at data end */
} >RAM_D1 AT> FLASH

BTW: MPU_ACCESS_NOT_SHAREABLE should not really have a meaning here. As I understand it, it is useful if you have a multi-core system (we don't). This MPU attribute is for a multi-master system where different cores can access the same memory in parallel. But this is not the case here on our single-core MCUs.

Manish Sharma · ‎2018-06-03

Posted on June 04, 2018 at 06:32

The original post was too long to process during our migration. Please click on the provided URL to read the original post. https://st--c.eu10.content.force.com/sfc/dist/version/download/?oid=00Db0000000YtG6&ids=0680X000006I6sp&d=%2Fa%2F0X0000000bxE%2FtYTp5jOR.l.PJbYiY9ZPprfsfaGxp5Z4HPpbkaungEw&asPdf=false

GreenGuy · ‎2018-06-04

Posted on June 04, 2018 at 18:59

Observation 1:

Conclusion: Does it mean that the default cache policy is write-through?

I would concur with this based on my observations.

As well, garsi.khouloud has confirmed that on the H7 devices:

So then using a statement like:

MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;

will override:

MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;

in the same region making that region NOT cacheable.

In the same vein:

MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;

MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;

in the same region is redundant.

See thread:

https://community.st.com/message/197787-an4838-s-field-equivalent-to-non-cacheable

Manish Sharma · ‎2018-06-18

Posted on June 19, 2018 at 06:46

Thanks for your reply.

What if I do it like this:

MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;

MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;

Is it cacheable?

I would love it if you could address the above in the context of DMA:

My Observation:

1) I allocated a global buffer: uint8_t __attribute__((section(".RAM_D2"))) txBuf[1000];

2) I enabled the D-Cache.

3) I enabled the MPU (Cacheable, Not Shareable).

For DMA to work, we need to configure the MPU region as 'not cacheable', but in my observation I configured it as 'cacheable' and DMA (write operation) still works.

Regards,

Manish

Torsten Jaekel · ‎2018-06-19

Posted on June 19, 2018 at 23:10

Hi Manish,

a) shareable vs. non-shareable:

The sentence 'The STM32F7 Series and STM32H7 Series do not support hardware coherency. the S field is equivalent to non-cacheable mem' in

http://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf

means, as I understand it:

The MPU is an ARM IP core: https://developer.arm.com/docs/ddi0439/latest/memory-protection-unit/about-the-mpu

It is capable of supporting a multi-core system (or a system with several masters which can have their own L1 caches). In this case shareable or non-shareable matters. These are HW signals which can be used with a CCI (Cache Coherency Interconnect). If there were a CCI (and several cores/masters), the caches would be informed to 'synchronize' so that 'coherency' between the different cores/masters and (their) L1 caches is guaranteed.

But STM has not integrated such a CCI (too complex for a small MCU, CM4/CM7 system). So the shareable attribute (S bit) does not have any function or meaning.

'is equivalent' means here, as I understand it: 'if you have shareable memory, e.g. shared between MCU and DMA, you have to configure it as non-cacheable. Then it is a shared memory, coherent for both cores/masters. Shared without a CCI means non-cacheable.'

So I assume this S bit does not have any effect (because it is not wired from the MPU). For shareable memory you have to use non-cacheable; this is the 'equivalent'.

b) DMA still working even with cacheable:

It is not enough just to look at MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;. You also have to check the other bits, e.g. the TEX, C and B fields!

Cacheable has different 'flavors' (policies), e.g. Write-Back, Write-Through (WBWA, WTNWA etc.). See page 7 here:

http://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf

If you enable the cache but the policy is 'write-through', the DMA will still work (in one direction): if you write data from the MCU, with 'write-through' it ends up immediately in memory (writing behaves like non-cached; MCU writes are 'coherent' with DMA and memory).

So it is not enough just to say 'cacheable' and the cache is enabled: the policy matters much more.
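To make the 'flavors' concrete, here is a hedged sketch of a decoder for the TEX, C, B and S fields, following the ARMv7-M attribute encoding for TEX = 0 and TEX = 1 only. The enum and the function mpu_decode_policy are hypothetical names. Treating S = 1 on normal memory as non-cacheable follows the application-note sentence quoted earlier in this thread and should be checked against the documentation of the actual device.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    POLICY_STRONGLY_ORDERED,
    POLICY_DEVICE,
    POLICY_NON_CACHEABLE,
    POLICY_WT_NO_WA,   /* write-through, no write-allocate */
    POLICY_WB_NO_WA,   /* write-back, no write-allocate    */
    POLICY_WB_WA,      /* write-back, write-allocate       */
    POLICY_OTHER       /* not decoded by this sketch       */
} mpu_policy_t;

/* Decode the MPU attribute bits for TEX = 0 and TEX = 1. */
mpu_policy_t mpu_decode_policy(uint8_t tex, uint8_t c, uint8_t b, uint8_t s)
{
    assert(tex <= 7u && c <= 1u && b <= 1u && s <= 1u);
    if (tex == 0u) {
        if (!c && !b) return POLICY_STRONGLY_ORDERED;
        if (!c &&  b) return POLICY_DEVICE;
        /* Normal memory: S = 1 assumed to behave as non-cacheable. */
        if (s) return POLICY_NON_CACHEABLE;
        return b ? POLICY_WB_NO_WA : POLICY_WT_NO_WA;
    }
    if (tex == 1u) {
        if (!c && !b) return POLICY_NON_CACHEABLE;
        if (c && b)   return s ? POLICY_NON_CACHEABLE : POLICY_WB_WA;
    }
    return POLICY_OTHER;
}
```

For example, the configuration posted earlier in the thread (TEX level 0, cacheable, not bufferable, not shareable) decodes to write-through without write-allocate, which is the case where DMA Tx works without a cache clean.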

I would suggest:

- try to understand how the MPU can be configured (all the modes)

- what all these bits mean, esp. the difference between 'write-back' and 'write-through', 'write-allocate' vs. 'non-write-allocate'

- use 'write-through' for DMA (maybe you already did, hence 'still working'): if the MCU places data into memory and the DMA should grab it (DMA Tx), 'write-through' works with caches enabled and without calling cache maintenance

- or use the cache maintenance functions (my recommendation anyway: it might not hurt to always do cache maintenance, Clean and Invalidate, even if the cache is maybe not enabled; if you enable the cache later, it will still work)

- check also the original ARM TRM:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0646a/BIHJJABA.html

Manish Sharma · ‎2018-06-19

Posted on June 20, 2018 at 05:26

Awesome Explanation !!

I will surely look into the above points. Thanks for devoting your time and addressing the issue.

Regards,

Manish

faharintisar · ‎2024-07-25

Very good explanation here. I was invalidating the cache right after entering the UART RX callback (DMA with idle interrupt) and cleaning it before leaving. It worked well in the debug build, but in the release build the cache was not being invalidated... Resolved by simply aligning the UART DMA buffer: uint8_t uart_reception_dma[1024] __attribute__((aligned(32)));

Fahar
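A sketch of the alignment fix from the post above, with one extra precaution. By-address cache maintenance works on whole 32-byte lines, so a buffer that shares a line with other variables can be affected by their placement; a plausible explanation (an assumption, the post does not state the cause) is that the release build placed the buffer differently. The names UART_RX_LEN, CACHE_ROUND_UP, uart_rx_dma_buf and buffer_owns_its_cache_lines are hypothetical; the attribute syntax is the GCC/Clang one used in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u     /* Cortex-M7 D-cache line size */
#define UART_RX_LEN      1000u   /* hypothetical payload size   */

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_ROUND_UP(n) \
    (((n) + DCACHE_LINE_SIZE - 1u) & ~(DCACHE_LINE_SIZE - 1u))

/* Start on a line boundary and occupy whole lines only. */
uint8_t uart_rx_dma_buf[CACHE_ROUND_UP(UART_RX_LEN)]
    __attribute__((aligned(32)));

/* Returns 1 if the buffer starts and ends on cache-line boundaries,
 * i.e. no other variable can share one of its lines. */
int buffer_owns_its_cache_lines(const void *buf, size_t size)
{
    assert(buf != NULL);
    return ((uintptr_t)buf % DCACHE_LINE_SIZE == 0u)
        && (size % DCACHE_LINE_SIZE == 0u);
}
```

Rounding the size up as well as aligning the start means that cleaning or invalidating the buffer never touches a neighbouring variable's data.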