Maintaining CPU data cache coherence for DMA buffers

Piranha
Chief II

This topic is inspired by discussions on the ST forum and the ARM forum, where proper cache maintenance was worked out and a real-life example of a speculative read was detected. There is also another discussion, where a real-life example of cache eviction was detected.

For Tx (from memory to peripheral) transfers the maintenance is rather simple:

// Application code.
GenerateDataToTransmit(pbData, nbData);
// Prepare and start the DMA Tx transfer.
SCB_CleanDCache_by_Addr(pbData, nbData);
DMA_TxStart(pbData, nbData);

For Rx (from peripheral to memory) transfers the maintenance is a bit more complex:

#define ALIGN_BASE2_CEIL(nSize, nAlign)  ( ((nSize) + ((nAlign) - 1)) & ~((nAlign) - 1) )
 
uint8_t abBuffer[ALIGN_BASE2_CEIL(67, __SCB_DCACHE_LINE_SIZE)] __ALIGNED(__SCB_DCACHE_LINE_SIZE);
 
// Prepare and start the DMA Rx transfer.
SCB_InvalidateDCache_by_Addr(abBuffer, sizeof(abBuffer));
DMA_RxStart(abBuffer, sizeof(abBuffer));
 
// Later, when the DMA has completed the transfer.
size_t nbReceived = DMA_RxGetReceivedDataSize();
SCB_InvalidateDCache_by_Addr(abBuffer, nbReceived);
// Application code.
ProcessReceivedData(abBuffer, nbReceived);

The first cache invalidation, done before the DMA transfer is started, ensures that during the transfer the cache holds no dirty lines associated with the buffer, which cache eviction could otherwise write back to memory. The second cache invalidation, done after the DMA transfer has completed, discards any cache lines that speculative reads could have fetched from memory during the transfer. Therefore cache invalidation for Rx buffers must be done both before and after the DMA transfer, and skipping either of them will lead to Rx buffer corruption.

Doing cache invalidation on an arbitrary buffer can corrupt adjacent memory before and after that buffer. To ensure that this does not happen, the buffer has to exactly fill an integer number of cache lines, which means that both the buffer address and size must be aligned to the cache line size. The CMSIS-defined constant for the data cache line size is __SCB_DCACHE_LINE_SIZE and it is 32 bytes for the Cortex-M7 processor. __ALIGNED() is a CMSIS-defined macro for aligning the address of a variable, and ALIGN_BASE2_CEIL() is a custom macro, which rounds an arbitrary number up to the nearest multiple of a power of two. In this example 67 is rounded up to a multiple of 32 and the buffer size accordingly becomes 96 bytes.
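
These requirements can also be checked in code. Here is a minimal sketch, reusing the buffer and the placeholder DMA_RxStart() from the example above; the C11 _Static_assert() catches a wrong buffer size at compile time, and the run-time assert() documents the address alignment that __ALIGNED() already guarantees:

#include <assert.h>
#include <stdint.h>
 
// Compile-time check: the buffer size must fill whole cache lines.
_Static_assert((sizeof(abBuffer) % __SCB_DCACHE_LINE_SIZE) == 0,
               "Rx buffer size must be a multiple of the cache line size");
 
static void StartRxTransferChecked(void)
{
    // Run-time check: the buffer address must sit on a cache line boundary.
    assert(((uintptr_t)abBuffer % __SCB_DCACHE_LINE_SIZE) == 0U);
    SCB_InvalidateDCache_by_Addr(abBuffer, sizeof(abBuffer));
    DMA_RxStart(abBuffer, sizeof(abBuffer));
}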

Unfortunately, for Cortex-M processors ARM does not provide a clear explanation or example, but they do provide a short explanation for the Cortex-A and Cortex-R series processors.

13 REPLIES

Thanks for summing things up cleanly for us.

To me "speculative access" sounds like there's absolutely no guarantee the processor won't overwrite the physical memory once it's cached. In other words, to me it sounds like the fact that you explicitly evict a properly aligned DMA Rx buffer before starting the Rx won't guarantee that the processor won't re-read and re-evict any cache line related to given buffer during the DMA Rx process.

I'd like to ask you to discuss the merits of having DMA buffers cached at all. Additionally, do you think it is a bad idea to switch DMA buffers' "cachedness" dynamically in the MPU?

JW

Danish1
Lead II

My approach on the STM32F7 is deliberately not to buffer or cache any of the buffers that I DMA into or out of.

I set up a section of memory for all the buffers, then program an MPU region for that with:

MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
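
A fuller sketch of that setup with the STM32 HAL might look like the following; the region number, base address and size are illustrative assumptions, and TEX level 1 with C=0, B=0 marks the region as normal, non-cacheable memory:

MPU_Region_InitTypeDef MPU_InitStruct = {0};

HAL_MPU_Disable();

MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
MPU_InitStruct.Number           = MPU_REGION_NUMBER0;
MPU_InitStruct.BaseAddress      = 0x20010000;          // start of the DMA buffer pool (assumed)
MPU_InitStruct.Size             = MPU_REGION_SIZE_16KB;
MPU_InitStruct.SubRegionDisable = 0x00;
MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;      // normal memory, non-cacheable
MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
MPU_InitStruct.IsShareable      = MPU_ACCESS_SHAREABLE;
MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
HAL_MPU_ConfigRegion(&MPU_InitStruct);

HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);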

I don't know how that compares, in terms of overall processor performance, with invalidating/flushing caches as appropriate prior to and on completion of DMA operations. I assume (rightly or wrongly) that the processor doesn't need to make many reads/writes to each location in a DMA buffer, so the benefit of caching such locations is relatively small.

Regards, Danish

My understanding of "speculative access" is that the access is a read of memory. As long as the memory served by that cache line is only written to by the DMA, there is no danger of the speculatively cached data being marked as "dirty" and flushed back to RAM (which would overwrite the DMA data). If that cache line needs to be evicted, it is simply discarded.

> If that cache line needs to be evicted, it would simply be discarded.

Oh, yes. Silly me. Thanks.

JW

Piranha
Chief II

For buffers significantly smaller than the cache line size (32 B on the Cortex-M7), cache maintenance can be inefficient because of two factors. First, the CPU time spent on the SCB_***() calls can exceed the gain from using the cache. Second, on the Rx side the buffers still have to fill whole cache lines and therefore waste a relatively large amount of memory compared to the actually used part of the buffer. In such scenarios disabling the cache on those buffers can indeed be the better choice. But for buffers larger than the cache line size, and especially in zero-copy solutions, the cache has a major positive impact on performance - much larger than the loss from the SCB_***() calls. For example, Ethernet DMA descriptors with a size of 32 B or 16 B perform better when put in non-cacheable memory, but Rx data buffers with a typical required size of 1522 B gain tremendously from the D-cache.
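
One possible arrangement for that split - a hypothetical sketch, with the section name, descriptor layout and counts invented for illustration - is to let the linker place only the descriptor rings in a non-cacheable region (set up, for example, like the MPU example earlier in this thread), while the data buffers stay in normal cached RAM:

// Simplified descriptor layout; the real one is dictated by the Ethernet MAC.
typedef struct { uint32_t DESC[4]; } EthDmaDesc;

#define ETH_RX_DESC_CNT  4U

// Descriptors: small and constantly accessed by both CPU and DMA - non-cacheable.
// The .dma_nocache section must be located by the linker script inside the
// memory covered by the non-cacheable MPU region.
__attribute__((section(".dma_nocache"), aligned(32)))
static EthDmaDesc RxDescriptors[ETH_RX_DESC_CNT];

// Data buffers: large and scanned by the CPU - cached, maintained with the
// invalidate-before/invalidate-after sequence from the head post.
// 1536 is 1522 rounded up to a multiple of the 32 B cache line size.
__attribute__((aligned(32)))
static uint8_t RxBuffers[ETH_RX_DESC_CNT][1536];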

It is also a misconception that the cache improves performance only for repetitive accesses. For example, imagine code that reads an array byte by byte and does some simple processing on each byte. After the first byte has been read, the other 31 bytes of that cache line are already in the cache. Without the cache the CPU would have to wait on 31 additional separate reads going through the buses to the memory, which could also be in use by other bus masters at that moment. Advanced CPUs also have a data prefetch feature, which detects data access patterns and speculatively reads the next/previous cache lines before the code accesses them. At least the Cortex-A55, Cortex-A9 and even the Cortex-A5 do it. Such a feature seems to be absent from the Cortex-M7, but it is definitely coming in the Cortex-M55 and Cortex-M85. Anyway, in a project similar to my demo firmware on the STM32F7 I am receiving Ethernet frames and doing just a single full linear forward scan on every frame, but enabling the D-cache still approximately halves the CPU time.

Reconfiguring the MPU dynamically should be possible, but my guess is that it will still require cache cleaning and invalidation at least after disabling the cacheability of the buffer (see the sketch below). And, if that is the case, then it results in lower performance and more code, which is irrational.
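
A hypothetical sketch of what such a dynamic switch would involve; ConfigureMpuRegionNonCacheable() is an invented placeholder for the application-specific MPU setup, and the barriers plus the clean-and-invalidate are exactly the extra code referred to above:

void MakeBufferNonCacheable(void *pBuffer, size_t nSize)
{
    __DMB();  // Complete outstanding memory accesses before changing attributes.
    ConfigureMpuRegionNonCacheable(pBuffer, nSize);  // Application-specific MPU setup.
    __DSB();  // Ensure the new MPU attributes are in effect...
    __ISB();  // ...before any further instructions execute.
    // Evict whatever the cache may still hold for this range, so that no
    // stale or dirty lines can surface after the attribute change.
    SCB_CleanInvalidateDCache_by_Addr(pBuffer, (int32_t)nSize);
}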

Piranha
Chief II

Let's fix the documentation related to these issues. For ST it's the AN4839 rev 2 that must be corrected and updated.

Section 3.2, page 7:

"Another case is when the DMA is writing to the SRAM1 and the CPU is going to read data from the SRAM1. To ensure the data coherency between the cache and the SRAM1, the software must perform a cache invalidate before reading the updated data from the SRAM1."

Section 4, page 8:

"After the DMA transfer complete, when reading the data from the peripheral, the software must perform a cache invalidate before reading the DMA updated memory region."

Those explanations are wrong and should be updated according to the explanation presented in the head post of this topic. For Rx transfers the cache invalidation must be done both before and after the DMA transfer, and the address and size of Rx buffers must be aligned to the cache line size.

It is pretty strange that section 3.2 explains how to manage cache coherence for Tx with different options and suggestions, but for Rx there is only a single poor sentence. It recommends the SCB_CleanDCache() function for Tx cache maintenance, but nothing for Rx. It seems that the person who wrote it already saw that something was wrong here and deliberately did not recommend anything specific. And there is a good reason for it - the seemingly opposite function SCB_InvalidateDCache() unavoidably corrupts unrelated memory and is not practically usable. SCB_CleanInvalidateDCache() can be used, but it still requires the Rx buffers to be properly aligned. Anyway, operations on the whole cache are terribly inefficient, which is why the SCB_***_by_Addr() functions were introduced by ARM and should be used as presented. Those functions were already there in 2015 and the application note was written in 2016, but, as always, ST used old code. Therefore table 1 "CMSIS cache functions" should also be updated.

@Imen DAHMEN, @Amel NASRI, let's make ST the first of the MCU manufacturers to correct this mess of global scale! 😉

Piranha
Chief II

Here is also a review of the incorrect and incomplete related documents from other manufacturers.

  • NXP (Freescale) AN12042 rev 1. Sections 4.3.1 and 6.
  • Atmel AN-15679. Slides 25, 27 and code example on slide 42.

Neither document mentions the necessity to invalidate the cache before the DMA Rx transfer, nor the necessity for the Rx buffer size to be aligned to the cache line size.

  • Microchip (Atmel) TB3195. Section 4, page 7 and code examples on pages 9 and 10.

Does not mention the necessity to invalidate the cache before the DMA Rx transfer. It also states alignment requirements for the addr and dsize parameters of the SCB_***_by_Addr() functions, which are not required since this improvement.

  • Infineon (Cypress) AN224432 rev E. Section 6.4.2.1 "Solution 2: Use cache maintenance APIs" on page 44 and "Code Listing 31" on page 38.
  • Infineon AN234254 rev A. Section 5.4.2.1 "Solution 2: Use cache maintenance APIs" on page 33 and "Code Listing 3" on page 27.

Neither document mentions the necessity to invalidate the cache before the DMA Rx transfer; instead they show an example of a cache clean done on the destination (Rx) buffer, which is suboptimal. They also state an alignment requirement for the addr parameter of the SCB_***_by_Addr() functions, which is not required since this improvement. The code shows an address alignment for the source (Tx) buffer, which is not necessary. The text does not mention the necessity for the Rx buffer size to be aligned to the cache line size.

So, of the seven companies that have been involved in making MCUs with the Cortex-M7, all seven have failed at this. And since 2014, when the Cortex-M7 was announced, ARM also hasn't stepped up and helped to fix the situation by giving a clear explanation and examples.

Thanks for the comments.

Sounds much like this might be a case for benchmarking (as you yourself did).

JW

Hi Danish,

I'm interested in this approach! Can you share a more complete example?

Thanks, Andrea