SCB_InvalidateDCache_by_Addr not operating correctly

FrankNatoli
Associate III

Using twin STM32H7B3I EVAL boards to develop master and slave firmware.

Found that slave board was experiencing overruns on SPI slave receive.

Revised SPI slave implementation to use DMA.

Found incredibly bizarre data corruption, after some number of good packets moved from EVAL master to EVAL slave.

SCB_InvalidateDCache_by_Addr was called after each DMA receive completion to invalidate the data cache for the receive buffer.

However, once data cache was completely disabled, data corruption of receive DMA packets ceased.

Are there any timing or special considerations for the use of SCB_InvalidateDCache_by_Addr?


> #define DMA_BUFFER _Pragma("location=\".dma_buffer\"")

What compiler do you use? In GNU C (CubeIDE) this would be __attribute__((section(".dma_buffer")))

IAR C/C++ Compiler for ARM

8.50.9.278
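For reference, the two placement syntaxes mentioned above can be combined into one macro. This is only a sketch: the section name `.dma_buffer` is taken from the quote above and must have a matching placement block in the linker configuration (`.icf` for IAR, `.ld` for GNU).

```c
#include <stdint.h>

/* Illustrative: one macro covering both compilers discussed above. */
#if defined(__ICCARM__)                               /* IAR */
  #define DMA_BUFFER _Pragma("location=\".dma_buffer\"")
#elif defined(__GNUC__)                               /* GCC / CubeIDE */
  #define DMA_BUFFER __attribute__((section(".dma_buffer")))
#else
  #define DMA_BUFFER
#endif

DMA_BUFFER uint8_t rx_buf[512];                       /* lands in .dma_buffer */
```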

FrankNatoli
Associate III

Have resolved the issue.

The thread's main loop was zeroing two buffers that DMA would write into, then starting the DMA, waiting for DMA completion, calling SCB_InvalidateDCache_by_Addr for one or the other buffer, and then examining the data.

What was apparently happening was that the buffer zeroing at the top of the thread's main loop was not being flushed to SDRAM immediately, but was lingering in the data cache.

By some race condition, sometimes the CPU beat the DMA to writing the buffers, sometimes the DMA beat the CPU.

Once I put SCB_CleanInvalidateDCache_by_Addr immediately after the two memset-zeros of the buffers, everything started working perfectly.

The moral of the story is: if the CPU has written to a memory region, call SCB_CleanInvalidateDCache_by_Addr on it before starting any DMA operation that writes to the same memory.
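That rule can be sketched in a few lines. The two cache helpers below are local stand-ins so the sketch is self-contained; on the target they would be the CMSIS calls SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() from core_cm7.h. Buffer name and size are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for the CMSIS cache-maintenance calls (core_cm7.h). */
static void clean_dcache_by_addr(void *addr, int32_t size)      { (void)addr; (void)size; }
static void invalidate_dcache_by_addr(void *addr, int32_t size) { (void)addr; (void)size; }

#define CACHE_LINE 32u
/* Cache-line aligned and a multiple of 32 B, so maintenance never
   touches a line shared with unrelated variables. */
static uint8_t dma_buf[64] __attribute__((aligned(CACHE_LINE)));

void start_rx_dma(void)
{
    memset(dma_buf, 0, sizeof dma_buf);               /* CPU write -> dirty lines    */
    clean_dcache_by_addr(dma_buf, sizeof dma_buf);    /* write them back BEFORE DMA  */
    /* ...start the DMA transfer here... */
}

void on_rx_dma_complete(void)
{
    invalidate_dcache_by_addr(dma_buf, sizeof dma_buf); /* drop stale lines AFTER DMA */
    /* ...now it is safe for the CPU to read dma_buf... */
}
```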

(answering to myself)

The definitive answer can be found in the Cortex-M7 TRM, revision r1p2, section 5.2:

* "Speculative data reads can be initiated to any Normal, read/write, or read-only memory address. In some rare cases, this can occur regardless of whether there is any instruction that causes the data read."

* "Speculative cache linefills are never made to Non-cacheable memory addresses"

My conclusion from this:

Cache invalidation before a DMA read cannot prevent speculative linefills during the DMA read if the buffer is in Normal cacheable memory.

If the DMA read buffer is defined as non-cacheable, a speculative read can still occur, but it won't pollute the D-cache, so it is harmless. And speculative linefills from non-cacheable memory are forbidden.

So defining a DMA buffer as non-cacheable, or as non-Normal (Strongly-Ordered, Device), should prevent D-cache pollution.
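A hedged sketch of that option using the ST HAL MPU API: mark a region as Normal, non-cacheable memory (TEX level 1 with cacheable/bufferable off). The region number, base address and size are placeholders for your project, not values from this thread.

```c
/* Placeholder values: adjust region number, base address and size. */
MPU_Region_InitTypeDef mpu = {0};

HAL_MPU_Disable();
mpu.Enable           = MPU_REGION_ENABLE;
mpu.Number           = MPU_REGION_NUMBER0;
mpu.BaseAddress      = 0x24000000;                 /* e.g. start of AXI SRAM */
mpu.Size             = MPU_REGION_SIZE_512B;
mpu.AccessPermission = MPU_REGION_FULL_ACCESS;
mpu.TypeExtField     = MPU_TEX_LEVEL1;             /* Normal memory...       */
mpu.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;   /* ...non-cacheable       */
mpu.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
mpu.IsShareable      = MPU_ACCESS_SHAREABLE;
mpu.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
HAL_MPU_ConfigRegion(&mpu);
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
```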

Cache invalidation *after* the DMA read should work too, provided that any CPU writes to the buffer have been flushed or invalidated before the DMA starts (double invalidation: before and after).
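One practical detail for either invalidation: SCB_InvalidateDCache_by_Addr operates on whole 32-byte cache lines, so an arbitrary (address, size) range should be rounded out to line boundaries first. A minimal sketch with illustrative helper names:

```c
#include <stdint.h>

#define CACHE_LINE 32u   /* Cortex-M7 D-cache line size */

static inline uintptr_t line_floor(uintptr_t a) { return a & ~(uintptr_t)(CACHE_LINE - 1u); }
static inline uintptr_t line_ceil(uintptr_t a)  { return (a + CACHE_LINE - 1u) & ~(uintptr_t)(CACHE_LINE - 1u); }

/* Expand [addr, addr+size) to whole cache lines; updates *addr and
   returns the new size. Note that invalidating a partial line would
   also discard unrelated data sharing that line. */
static uint32_t cache_align_range(uintptr_t *addr, uint32_t size)
{
    uintptr_t start = line_floor(*addr);
    uintptr_t end   = line_ceil(*addr + size);
    *addr = start;
    return (uint32_t)(end - start);
}
```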

Comments are welcome.

No need to apologize! This is a very important topic, which has no clearly explained correct examples - it's the ARM and MCU manufacturers who have to apologize here...

I guess you meant this link from the CM7 TRM. This issue was in the "to do" list in my mind for some time, but you forced me to research it now. Searching for more information, I found your topic on the ARM forum. The links given by the user "a.surati" are interesting. The link [1] is an application note from Atmel/Microchip and is just another broken example, but the link [2] in slides 28-33 actually provides the explanation and solution. I can add that at least in my mind I knew but ignored the fact that cache eviction writes back to memory only the dirty lines, while clean lines are just invalidated. Because of that it seemed that there was no solution, but that would be absurd. Disabling the cache with the MPU for the receive buffers, while being technically correct, is not really a solution, because it would have a major impact on performance.

So indeed it turns out that the correct solution is doing invalidation before and after the DMA read. We can start reporting to ST, Microchip and everyone that their application notes, examples and other code are broken... "Nice"!

Zeroing and cleaning is not required. If it is required specifically in your implementation, then I suggest zeroing only the part of the buffer that was not written by DMA, after the read operation.

Also read my discussion with Pavel. It turns out that invalidating the buffer before the DMA read is also not enough...

Pavel A.
Evangelist III

@Piranha

> So indeed it turns out that the correct solution is doing invalidation before and after the DMA read.

Yes, I'm still pondering the answer on the ARM forum. Still reading.

Double invalidation is too much trouble. It has overhead, after all, if you look at the source.

Maybe only invalidation after the DMA read is enough, if the buffer (cache) was clean before starting the DMA.

Making the buffer non-cacheable seems better (especially small, so overhead of SCB_Invalidate... is same as non-cached read) - but managing the MPU is not too easy on CM7. I've read that MPU management will be much easier in the new CM-85...

By the way: SCB_Invalidate... can hard-fault when the cache is disabled (at least I'm seeing this on H7), and other developers can leave the cache disabled. Because of that I wrap the SCB... calls in an "if (D-cache enabled) { ... }" check.
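A sketch of such a guard. On the target the check is `(SCB->CCR & SCB_CCR_DC_Msk)` from core_cm7.h; here a plain variable stands in for the SCB->CCR register so the shape is visible, and the function name is illustrative.

```c
#include <stdint.h>

#define CCR_DC_Msk (1ul << 16)      /* DC bit: D-cache enable (SCB->CCR) */

static uint32_t ccr_shadow;         /* stand-in for SCB->CCR */

/* Returns 1 if maintenance was performed, 0 if the cache is off. */
static int invalidate_if_dcache_on(void *addr, int32_t size)
{
    if (ccr_shadow & CCR_DC_Msk) {
        /* SCB_InvalidateDCache_by_Addr(addr, size); */
        (void)addr; (void)size;
        return 1;
    }
    return 0;                       /* cache disabled: nothing to do */
}
```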

Did a test on my demo firmware and compared the CPU load in the invalidation "before" vs "before and after" scenarios. The CPU load on the Rx test is 14.65% vs 16.00% respectively. So, in relative terms, the correct code gives up an illegitimate 8.44% saving, but that is only 1.35% of the total CPU time. And that's at the maximum 94.9 Mbps while receiving 8127 frames per second on F7. In most real-life scenarios the loss will be proportionally smaller with smaller Rx traffic, and on H7 it will also be proportionally smaller because of the higher CPU clock speed.

For the cache to be clean, it requires cleaning, but cleaning writes back dirty lines, which takes additional time, loads buses and is generally useless. Invalidation just marks all of the relevant lines invalid regardless of their previous state (invalid, clean, dirty) and doesn't involve any memory access at all. Also invalidation frees those cache lines instead of keeping them allocated, which could be an overall advantage.

For buffers significantly smaller than the cache line size (32 B on Cortex-M7), cache management is probably an inefficient choice, also because one has to spend a whole 32 B of memory on each buffer anyway. In such scenarios disabling the cache for those buffers could indeed be a better choice. But for buffers larger than the cache line size, and especially in zero-copy solutions, the cache will have a major positive impact on performance - much larger than the loss from an additional SCB_InvalidateDCache_by_Addr() call.
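The 32 B cost can be made explicit in the type: a small DMA buffer still has to own a whole cache line, so it is padded and aligned to 32 B. Payload size and type name here are illustrative.

```c
#include <stdint.h>

#define CACHE_LINE 32u   /* Cortex-M7 D-cache line size */

/* A 10-byte payload still occupies a full cache line. */
typedef struct {
    uint8_t data[10];                 /* useful payload          */
    uint8_t pad[CACHE_LINE - 10u];    /* wasted: 22 B of padding */
} __attribute__((aligned(CACHE_LINE))) small_dma_buf_t;
```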

When the D-cache is disabled, I remember having issues with older SCB_***() function implementations, but the current ones seem to be harmless. I can even enable/disable the whole D-cache at runtime repeatedly without any issue. On H7 even ST have been shipping up-to-date versions for some time. Unlike on F7, where those turtles are still shipping the 4-year-old broken ones... Are you using up-to-date implementations, or still the old broken ones wrapped in an additional code layer?

I'm using CMSIS from CubeH7 lib v. 1.8...1.10. This is ver. V5.6.0 (Core-M ver. 5.3.0 ... whatever this means)

Piranha
Chief II

Here is another interesting fact. I have a project on F7 with FreeRTOS, where the MCU is continuously receiving a stream of at least 100 data packets per second through Ethernet with lwIP. The same project also has a web server based on the "httpd" app provided by lwIP, and the web page does AJAX requests back to the MCU once per second. Almost everything was working fine, except for one thing. When the web page was open simultaneously with the data transfer, the AJAX requests were causing missed data packets in the data transfer. Keeping the web page open for a minute (60 requests) caused approximately 55 missed data packets. Reloading the web page repeatedly caused even more missed packets. So there was more than a 90% chance of a data packet loss for any web request made simultaneously with the data transfer.

Because those requests are processed in a callback from the thread, on which the whole network processing depends, and those requests involve some non-trivial processing, my initial blind guess was that the request processing just takes too long. And, as the web page is only for configuration and not necessary for day-to-day use, it is not a significant issue for the specific usage scenario.

Until now my Ethernet Rx code was doing D-cache invalidation only before the DMA reception. Now I added the second invalidation after the DMA reception. I set up the data stream reception and a browser with the open web page making AJAX requests once per second, running simultaneously, and left it overnight. The result - not a single packet missed! As I can repeat this test with absolute reliability, it basically shows that speculative reads are not just a theoretical concept but actually do happen. And, at least in some scenarios, they are not rare occurrences either.