SD card memory corruption due to overly aggressive cache maintenance

Rick Sladkey
Associate II
Posted on March 08, 2018 at 21:27

I just spent two harrowing days debugging random memory corruption in my application and the problem turned out to be overly aggressive cache invalidation.

In sd_diskio.c, this section of code (STM32Cube_FW_F7_V1.9.0):

#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
 /*
 the SCB_InvalidateDCache_by_Addr() requires a 32-Byte aligned address,
 adjust the address and the D-Cache size to invalidate accordingly.
 */
 alignedAddr = (uint32_t)buff & ~0x1F;
 SCB_InvalidateDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));
#endif

This invalidates more memory than the actual size of the buffer by up to 31 bytes in each direction. In particular, if the cache is dirty in the memory surrounding the buffer, and the cache is invalidated, then valid memory contents will be discarded. The symptom of this occurring is that memory surrounding the SD card read buffer may suddenly appear to revert to the contents it had earlier in the execution of the program.
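
To make the arithmetic concrete, here is a sketch (hypothetical buffer address; assuming 32-byte cache lines and 512-byte blocks) of the range that actually gets invalidated:

 uint8_t *buff = (uint8_t *)0x20001010; /* 4-byte aligned, not 32-byte aligned */
 uint32_t count = 1;                    /* one 512-byte block */
 uint32_t alignedAddr = (uint32_t)buff & ~0x1F;                    /* 0x20001000 */
 uint32_t size = count*BLOCKSIZE + ((uint32_t)buff - alignedAddr); /* 512 + 16 */
 /* SCB_InvalidateDCache_by_Addr rounds the length up to whole cache lines,
    so 0x20001000-0x2000121F is invalidated: the 16 bytes before the buffer
    and the 16 bytes after it are discarded along with the buffer itself. */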

There are several ways to fix this problem, but one is to clean the cache surrounding the read buffer before the DMA operation is performed:

#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
 uint32_t alignedAddr;
#if 1
 // Clean cache surrounding buffer so we don't accidentally invalidate valid cached memory.
 alignedAddr = (uint32_t)buff & ~0x1F;
 SCB_CleanDCache_by_Addr((uint32_t*)alignedAddr, 32);
 SCB_CleanDCache_by_Addr((uint32_t*)(alignedAddr + count*BLOCKSIZE), 32);
#endif
#endif

I hope this is useful to others experiencing the same problem, or to users trying to understand the subtleties of the interaction between the use of DMA and the presence of a data cache.

Posted on March 08, 2018 at 22:24

The 32-byte alignment has been mentioned peripherally by ST, but it is one of those bear traps in the design, mainly around caching and write-through settings. The polled versions don't have the coherency issues, but in the DSI cases they are susceptible to underrun errors due to the higher bus loading/contention.

There have been some MPU configurations shared to change the cacheability and shareability of assorted memory regions.
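
As a sketch of that approach (the base address and size are placeholders for your memory map; this assumes the STM32F7 HAL MPU API), a region can be configured non-cacheable so DMA buffers placed in it need no cache maintenance at all:

 MPU_Region_InitTypeDef MPU_InitStruct = {0};

 HAL_MPU_Disable();
 /* Sketch: mark a RAM region as non-cacheable for DMA buffers. */
 MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
 MPU_InitStruct.BaseAddress      = 0x20020000;             /* placeholder */
 MPU_InitStruct.Size             = MPU_REGION_SIZE_32KB;   /* placeholder */
 MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
 MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
 MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
 MPU_InitStruct.IsShareable      = MPU_ACCESS_SHAREABLE;
 MPU_InitStruct.Number           = MPU_REGION_NUMBER0;
 MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;
 HAL_MPU_ConfigRegion(&MPU_InitStruct);
 HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);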

Personally, for the H7 and F7, I make sure to put the DMA buffers into DTCM RAM, which is single-cycle and not cached.
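
A minimal sketch of that placement (the section name '.dtcm' is a placeholder and must match a DTCM output section in your linker script):

 /* Sketch: put the SD DMA buffer in DTCM RAM (0x20000000 on the F7),
    which the data cache does not cover, so no maintenance is needed. */
 static uint8_t sd_dma_buf[512] __attribute__((section(".dtcm"), aligned(4)));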

Rick Sladkey
Associate II
Posted on March 09, 2018 at 03:16

Thanks, Clive. The DMA requires the buffer to be aligned on a 4-byte boundary, but because the read buffer for FatFs is contained inside the FIL structure, using CubeMX to generate code for FatFs + SD card provides no way to align the buffer to a coarser boundary. The DMA cache maintenance code above attempts to handle read buffers that are not on a 32-byte boundary. Let's say the buffer is 512 bytes (as with FatFs) and is at an address that ends with 0x10. In that case the 16 bytes before the buffer (which are critical fields in the FatFs FIL structure) and the 16 bytes after the buffer (potentially anything, perhaps heap data structure fields) will be corrupted if either cache line was dirty before the SD card read operation.

Naturally, this problem was intermittent and very hard to debug, but gdb hardware watchpoints came to the rescue to reveal that the FIL fields were being corrupted by the cache invalidation code above!

The symptoms are intermittent failure of FatFs API calls like f_open, f_close, f_read, or f_lseek, subsequent heap failure and/or asserts, or hard faults when corrupted memory addresses are dereferenced.

In my mind, this is a very serious bug in the F7 V1.9.0 firmware and will hopefully be fixed in the next release, perhaps using the solution I provided.

stm322399
Senior
Posted on March 09, 2018 at 09:41

Rick, thank you for sharing this very good catch. The FW definitely requires a fix for this.

Unfortunately the proposed solution still has a flaw. A safer solution could be to allocate DMA buffers so that they do not share their cache lines with anything else; in other words, align the buffer head and tail on 32-byte boundaries.

Cleaning the cache before the DMA read operation prevents corruption of the data sharing the head of the first cache line used by the buffer, but it has a drawback. If that data is accessed after the clean and before the DMA writes into the buffer, the cache line is refilled, and at some point later the DMA updates the main memory. At that moment you hit a condition where valid data for the same 32-byte line is spread between the cache and main memory, with no way to merge the two.

Of course this condition might seem unlikely, because we know the FIL structure is unlikely to be accessed between the cache clean and the DMA read ... unless there is an interrupt, and a context switch, and some other thread trying to read that data, or the cache anticipating a line fill because the line before is also accessed ... This may sound a bit paranoid, but paranoia has often saved me from big headaches, in particular when dealing with caches.
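
A sketch of that safer allocation (names are illustrative): give the DMA buffer its own cache lines by aligning its start on a 32-byte boundary and keeping its size a whole multiple of the line size, so clean/invalidate operations never touch neighboring data:

 #define CACHE_LINE 32u

 /* Sketch: a read buffer that shares no cache line with any other data. */
 static uint8_t dma_buf[512] __attribute__((aligned(CACHE_LINE)));

 _Static_assert(sizeof(dma_buf) % CACHE_LINE == 0,
                "buffer size must be a whole number of cache lines");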

Posted on March 12, 2018 at 04:39

Laurent,

I see your point, and the only solution under my control is to ensure all read buffers for FatFs are 32-byte aligned.

This can be accomplished by ensuring three buffers are aligned: the 'win' (window) field in FATFS, the 'buf' field in FIL, and the 'buff' argument to the f_read API call.

The 'win' field can be aligned by dynamically allocating the memory as follows (assuming working malloc):

 // Align the win field within the FATFS struct onto a 32-byte boundary.
 void *p = malloc(sizeof(FATFS) + 32);
 size_t offset = (32 - (uintptr_t)((FATFS *)p)->win % 32) % 32;
 FATFS *fatfs = (FATFS *)((char *)p + offset); // char * avoids non-standard void * arithmetic

The 'buf' field can be similarly aligned:

 // Align the buf field within the FIL struct onto a 32-byte boundary.
 void *p = malloc(sizeof(FIL) + 32);
 size_t offset = (32 - (uintptr_t)((FIL *)p)->buf % 32) % 32;
 FIL *fp = (FIL *)((char *)p + offset); // char * avoids non-standard void * arithmetic

and the 'buff' argument to the API call can use __attribute__((aligned(32))) or similar pointer arithmetic.
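
For example (a sketch; the file handle and buffer names are illustrative):

 /* Sketch: a 32-byte-aligned read buffer passed as the 'buff' argument. */
 static uint8_t read_buf[512] __attribute__((aligned(32)));

 UINT br;
 FIL fil; /* assumed already opened with f_open */
 FRESULT res = f_read(&fil, read_buf, sizeof(read_buf), &br);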

With those changes, SD_read can assert proper alignment:

 if ((uint32_t)buff % 32 != 0)
 {
     Error_Handler();
 }

Posted on March 30, 2018 at 07:57

Laurent,

A variation on the race condition you described occurred in my code and it is just boggling my mind how fragile and broken the so-called 'cache maintenance' is. And this case occurs even with perfectly aligned buffers!

Here is the scenario (stay with me):

  • The 512-byte read buffer is aligned to 32 bytes and is in cacheable RAM
  • The read buffer is zeroed
  • All 16 32-byte lines in the read buffer are now dirty, but won't be written back unless displaced by caching elsewhere
  • The sector read using DMA occurs and, let's say, all 512 bytes of backing RAM are now non-zero
  • Other code executes, which can be:
    • The finishing steps of the read operation itself leading back to the completion of SD_read
    • Any interrupt handlers for interrupts that occur
    • Arbitrary code from another task if a task switch occurs
  • The 16 cache lines are invalidated

If at any time during the execution of 'other code' one of the 16 dirty cache lines is displaced, those 32 zeros will be written to RAM, overwriting the non-zero bytes read by the DMA operation. Invalidating the cache afterwards doesn't correct the problem, because both the cache and the RAM now contain zeros.

If this sounds like a hypothetical problem, observe that FatFs f_open for append (when the file size is not on a sector boundary) will do exactly this:

#if !_FS_TINY
    mem_set(fp->buf, 0, _MAX_SS);    /* Clear sector buffer */
#endif
    if ((mode & FA_SEEKEND) && fp->obj.objsize > 0) {    /* Seek to end of file if FA_OPEN_APPEND is specified */
        ...
        if (res == FR_OK && ofs % SS(fs)) {    /* Fill sector buffer if not on the sector boundary */
            if ((sc = clust2sect(fs, clst)) == 0) {
                res = FR_INT_ERR;
            } else {
                fp->sect = sc + (DWORD)(ofs / SS(fs));
#if !_FS_TINY
                if (disk_read(fs->drv, fp->buf, fp->sect, 1) != RES_OK) res = FR_DISK_ERR;
#endif
            }

The net effect is that there can be no dirty cache lines even in an aligned DMA buffer before the read operation! In all likelihood, the lines will either stay dirty until the clean operation or be displaced before the DMA operation, in which case no corruption occurs. But if they are displaced (by the normal operation of the cache) in the short window between the DMA read and the cache invalidation, boom: corrupted data.

The ST documentation suggests simply 'invalidate after read, clean before write' and provides no hint whatsoever that dirty cache lines before a read are a ticking time bomb.

stm322399
Senior
Posted on March 30, 2018 at 10:45

Rick, welcome to the wonderful world of cache management. Spectre and Meltdown are just the latest flaws induced by caches; others will follow.

You are right to say that a dirty read buffer must be invalidated *before* launching the DMA, for the reasons you clearly described. Still, when your design makes sure that read buffers are only read, never written, the suggestion to invalidate after the DMA writes and before the CPU reads is valid. But do not make too many assumptions; the best practice is to invalidate *before* starting the DMA.
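
A sketch of that best practice, adapted from the SD_read path above (variable names follow sd_diskio.c; this is not the shipped firmware code, and it assumes a 32-byte-aligned buffer whose size is a multiple of the cache line):

#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
 /* Discard any dirty lines covering the buffer *before* starting the DMA,
    so an eviction during the transfer cannot overwrite freshly read data. */
 SCB_InvalidateDCache_by_Addr((uint32_t*)buff, count*BLOCKSIZE);
#endif

 /* ... start the DMA read and wait for completion ... */

#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
 /* Invalidate again so the CPU re-reads the DMA'd data from RAM. */
 SCB_InvalidateDCache_by_Addr((uint32_t*)buff, count*BLOCKSIZE);
#endif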

Did I already mention that one must be paranoid in cache management?