Missing D-Cache clean in "sd_diskio.c" SD_Read() causes SD card FATFS libraries to fail

mantisrobot · ‎2022-01-13

I've been working with an F7 processor, using an RTOS with FATFS and an SD card in 4-bit mode on SDMMC1.

STM32CubeIDE V1.8

STM32CubeMX V6.4.0

Within the sd_diskio.c file there is the following option to enable D-Cache maintenance when D-Cache is used:

/*
 * when using cacheable memory region, it may be needed to maintain the cache
 * validity. Enable the define below to activate a cache maintenance at each
 * read and write operation.
 * Notice: This is applicable only for cortex M7 based platform.
 */
/* USER CODE BEGIN enableSDDmaCacheMaintenance */
#define ENABLE_SD_DMA_CACHE_MAINTENANCE  1
/* USER CODE END enableSDDmaCacheMaintenance */

However, there seems to be a D-Cache clean missing within SD_read(). This is the generated code:

/* USER CODE BEGIN beforeReadSection */
/* can be used to modify previous code / undefine following code / add new code */
/* USER CODE END beforeReadSection */
/**
  * @brief  Reads Sector(s)
  * @param  lun : not used
  * @param  *buff: Data buffer to store read data
  * @param  sector: Sector address (LBA)
  * @param  count: Number of sectors to read (1..128)
  * @retval DRESULT: Operation result
  */
 
DRESULT SD_read(BYTE lun, BYTE *buff, DWORD sector, UINT count)
{
  uint8_t ret;
  DRESULT res = RES_ERROR;
  uint32_t timer;
#if (osCMSIS < 0x20000U)
  osEvent event;
#else
  uint16_t event;
  osStatus_t status;
#endif
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
  uint32_t alignedAddr;
#endif
  /*
  * ensure the SDCard is ready for a new operation
  */
 
  if (SD_CheckStatusWithTimeout(SD_TIMEOUT) < 0)
  {
    return res;
  }
 
#if defined(ENABLE_SCRATCH_BUFFER)
  if (!((uint32_t)buff & 0x3))
  {
#endif
    /* Fast path cause destination buffer is correctly aligned */
    ret = BSP_SD_ReadBlocks_DMA((uint32_t*)buff, (uint32_t)(sector), count);
 
    if (ret == MSD_OK) {
#if (osCMSIS < 0x20000U)
    /* wait for a message from the queue or a timeout */
    event = osMessageGet(SDQueueID, SD_TIMEOUT);
 
    if (event.status == osEventMessage)
    {
      if (event.value.v == READ_CPLT_MSG)
      {
        timer = osKernelSysTick();
        /* block until SDIO IP is ready or a timeout occur */
        while(osKernelSysTick() - timer <SD_TIMEOUT)
#else
          status = osMessageQueueGet(SDQueueID, (void *)&event, NULL, SD_TIMEOUT);
          if ((status == osOK) && (event == READ_CPLT_MSG))
          {
            timer = osKernelGetTickCount();
            /* block until SDIO IP is ready or a timeout occur */
            while(osKernelGetTickCount() - timer <SD_TIMEOUT)
#endif
            {
              if (BSP_SD_GetCardState() == SD_TRANSFER_OK)
              {
                res = RES_OK;
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
                /*
                the SCB_InvalidateDCache_by_Addr() requires a 32-Byte aligned address,
                adjust the address and the D-Cache size to invalidate accordingly.
                */
                alignedAddr = (uint32_t)buff & ~0x1F;
                SCB_InvalidateDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));
#endif
                break;
              }
            }
#if (osCMSIS < 0x20000U)
          }
        }
#else
      }
#endif
    }
 
#if defined(ENABLE_SCRATCH_BUFFER)
    }
    else
    {
      /* Slow path, fetch each sector a part and memcpy to destination buffer */
      int i;
 
      for (i = 0; i < count; i++)
      {
        ret = BSP_SD_ReadBlocks_DMA((uint32_t*)scratch, (uint32_t)sector++, 1);
        if (ret == MSD_OK )
        {
          /* wait until the read is successful or a timeout occurs */
#if (osCMSIS < 0x20000U)
          /* wait for a message from the queue or a timeout */
          event = osMessageGet(SDQueueID, SD_TIMEOUT);
 
          if (event.status == osEventMessage)
          {
            if (event.value.v == READ_CPLT_MSG)
            {
              timer = osKernelSysTick();
              /* block until SDIO IP is ready or a timeout occur */
              while(osKernelSysTick() - timer <SD_TIMEOUT)
#else
                status = osMessageQueueGet(SDQueueID, (void *)&event, NULL, SD_TIMEOUT);
              if ((status == osOK) && (event == READ_CPLT_MSG))
              {
                timer = osKernelGetTickCount();
                /* block until SDIO IP is ready or a timeout occur */
                ret = MSD_ERROR;
                while(osKernelGetTickCount() - timer < SD_TIMEOUT)
#endif
                {
                  ret = BSP_SD_GetCardState();
 
                  if (ret == MSD_OK)
                  {
                    break;
                  }
                }
 
                if (ret != MSD_OK)
                {
                  break;
                }
#if (osCMSIS < 0x20000U)
              }
            }
#else
          }
#endif
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
          /*
          *
          * invalidate the scratch buffer before the next read to get the actual data instead of the cached one
          */
          SCB_InvalidateDCache_by_Addr((uint32_t*)scratch, BLOCKSIZE);
#endif
          memcpy(buff, scratch, BLOCKSIZE);
          buff += BLOCKSIZE;
        }
        else
        {
          break;
        }
      }
 
      if ((i == count) && (ret == MSD_OK ))
        res = RES_OK;
    }
#endif
  return res;
}

I have added a cache clean prior to the BSP_SD_ReadBlocks_DMA call which resolved my connection issues:

#if defined(ENABLE_SCRATCH_BUFFER)
  if (!((uint32_t)buff & 0x3))
  {
#endif
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
    alignedAddr = (uint32_t)buff & ~0x1F;
    // Clean whole aligned buffer from data cache
    SCB_CleanDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));
#endif
    /* Fast path cause destination buffer is correctly aligned */
    ret = BSP_SD_ReadBlocks_DMA((uint32_t*)buff, (uint32_t)(sector), count);
 
    if (ret == MSD_OK) {
#if (osCMSIS < 0x20000U)
    /* wait for a message from the queue or a timeout */
    event = osMessageGet(SDQueueID, SD_TIMEOUT);
 
    if (event.status == osEventMessage)
    {
      if (event.value.v == READ_CPLT_MSG)
      {
        timer = osKernelSysTick();
        /* block until SDIO IP is ready or a timeout occur */
        while(osKernelSysTick() - timer <SD_TIMEOUT)
#else

This is working for me; without it I can't use the SD library with D-Cache enabled.
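For reference, the range arithmetic used in the clean call above can be isolated into a small helper. This is only a sketch: `dcache_range_t` and `dcache_range_for` are hypothetical names, not part of the ST or CMSIS code, and unlike the snippet above it also rounds the size up to whole 32-byte lines. It runs on a PC; the target-side call is shown in the trailing comment.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-Cache line size in bytes */

/* Hypothetical helper type: a line-aligned range for cache maintenance. */
typedef struct {
  uint32_t addr;  /* start address rounded down to a cache line */
  uint32_t size;  /* byte count covering whole cache lines */
} dcache_range_t;

/* Expand [buf, buf + len) to the cache lines that fully contain it. */
static dcache_range_t dcache_range_for(uint32_t buf, uint32_t len)
{
  dcache_range_t r;
  uint32_t end = (buf + len + DCACHE_LINE_SIZE - 1u) & ~(DCACHE_LINE_SIZE - 1u);

  r.addr = buf & ~(DCACHE_LINE_SIZE - 1u);
  r.size = end - r.addr;
  return r;
}

/* On target (not compiled here) the range would be used like this:
 *   dcache_range_t r = dcache_range_for((uint32_t)buff, count * BLOCKSIZE);
 *   SCB_CleanDCache_by_Addr((uint32_t *)r.addr, (int32_t)r.size);
 */
```

For a buffer that is already 32-byte aligned, the range is just the buffer itself.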

mantisrobot · ‎2022-01-18

Thanks, I'll take a look! :)

mantisrobot · ‎2022-01-18

Well I can't thank you enough! Your version of sd_diskio.c has done the trick!

I've not looked into what it is doing differently, but I have tested most of the compilation switches as follows:

I have it working with DCache enabled and disabled. Obviously, with DCache enabled I used the compile switch ENABLE_SD_DMA_CACHE_MAINTENANCE.

I also need the ENABLE_SCRATCH_BUFFER which was expected on this processor, in fact without this enabled I get the same data corruption as I was seeing with the ST version of the code.

I tested with FORCE_SCRATCH_BUFFER both enabled and disabled; it works either way for me, as my write buffers are 32-byte aligned.

I can't tell you how much happier I am knowing I can now continue with some kind of sanity! I owe you a beer! And ST need to sort their libraries out!

Matt.

mantisrobot · ‎2022-01-23

I've just started to look at adding the DTCMRAM section to my project. I'm using an STM32F765IIK processor, which according to the datasheet has 128 KB of TCM RAM. However, when I create my project I only have two memory regions: 512 KB of RAM and 2 MB of FLASH.

Am I misunderstanding something? Am I supposed to add the TCM section to the linker script manually?

I'm using STM32CubeIDE, this is the linker script:

/*
******************************************************************************
**
** @file        : LinkerScript.ld (debug in RAM dedicated)
**
** @author      : Auto-generated by STM32CubeIDE
**
** @brief       : Linker script for STM32F765IIKx Device from STM32F7 series
**                      2048Kbytes FLASH
**                      512Kbytes RAM
**
**                Set heap size, stack size and stack location according
**                to application requirements.
**
**                Set memory bank area and size if external memory is used
**
**  Target      : STMicroelectronics STM32
**
**  Distribution: The file is distributed as is, without any warranty
**                of any kind.
**
******************************************************************************
** @attention
**
** Copyright (c) 2022 STMicroelectronics.
** All rights reserved.
**
** This software is licensed under terms that can be found in the LICENSE file
** in the root directory of this software component.
** If no LICENSE file comes with this software, it is provided AS-IS.
**
******************************************************************************
*/
 
/* Entry Point */
ENTRY(Reset_Handler)
 
/* Highest address of the user mode stack */
_estack = ORIGIN(RAM) + LENGTH(RAM); /* end of "RAM" Ram type memory */
 
_Min_Heap_Size = 0x200; /* required amount of heap */
_Min_Stack_Size = 0x400; /* required amount of stack */
 
/* Memories definition */
MEMORY
{
  RAM    (xrw)    : ORIGIN = 0x20000000,   LENGTH = 512K
  FLASH    (rx)    : ORIGIN = 0x8000000,   LENGTH = 2048K
}
 
/* Sections */
SECTIONS
{
  /* The startup code into "RAM" Ram type memory */
  .isr_vector :
  {
    . = ALIGN(4);
    KEEP(*(.isr_vector)) /* Startup code */
    . = ALIGN(4);
  } >RAM
 
  /* The program code and other data into "RAM" Ram type memory */
  .text :
  {
    . = ALIGN(4);
    *(.text)           /* .text sections (code) */
    *(.text*)          /* .text* sections (code) */
    *(.glue_7)         /* glue arm to thumb code */
    *(.glue_7t)        /* glue thumb to arm code */
    *(.eh_frame)
    *(.RamFunc)        /* .RamFunc sections */
    *(.RamFunc*)       /* .RamFunc* sections */
 
    KEEP (*(.init))
    KEEP (*(.fini))
 
    . = ALIGN(4);
    _etext = .;        /* define a global symbols at end of code */
  } >RAM
 
  /* Constant data into "RAM" Ram type memory */
  .rodata :
  {
    . = ALIGN(4);
    *(.rodata)         /* .rodata sections (constants, strings, etc.) */
    *(.rodata*)        /* .rodata* sections (constants, strings, etc.) */
    . = ALIGN(4);
  } >RAM
 
  .ARM.extab   : {
    . = ALIGN(4);
    *(.ARM.extab* .gnu.linkonce.armextab.*)
    . = ALIGN(4);
  } >RAM
 
  .ARM : {
    . = ALIGN(4);
    __exidx_start = .;
    *(.ARM.exidx*)
    __exidx_end = .;
    . = ALIGN(4);
  } >RAM
 
  .preinit_array     :
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array*))
    PROVIDE_HIDDEN (__preinit_array_end = .);
    . = ALIGN(4);
  } >RAM
 
  .init_array :
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT(.init_array.*)))
    KEEP (*(.init_array*))
    PROVIDE_HIDDEN (__init_array_end = .);
    . = ALIGN(4);
  } >RAM
 
  .fini_array :
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP (*(SORT(.fini_array.*)))
    KEEP (*(.fini_array*))
    PROVIDE_HIDDEN (__fini_array_end = .);
    . = ALIGN(4);
  } >RAM
 
  /* Used by the startup to initialize data */
  _sidata = LOADADDR(.data);
 
  /* Initialized data sections into "RAM" Ram type memory */
  .data :
  {
    . = ALIGN(4);
    _sdata = .;        /* create a global symbol at data start */
    *(.data)           /* .data sections */
    *(.data*)          /* .data* sections */
 
    . = ALIGN(4);
    _edata = .;        /* define a global symbol at data end */
 
  } >RAM
 
  /* Uninitialized data section into "RAM" Ram type memory */
  . = ALIGN(4);
  .bss :
  {
    /* This is used by the startup in order to initialize the .bss section */
    _sbss = .;         /* define a global symbol at bss start */
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss*)
    *(COMMON)
 
    . = ALIGN(4);
    _ebss = .;         /* define a global symbol at bss end */
    __bss_end__ = _ebss;
  } >RAM
 
  /* User_heap_stack section, used to check that there is enough "RAM" Ram  type memory left */
  ._user_heap_stack :
  {
    . = ALIGN(8);
    PROVIDE ( end = . );
    PROVIDE ( _end = . );
    . = . + _Min_Heap_Size;
    . = . + _Min_Stack_Size;
    . = ALIGN(8);
  } >RAM
 
  /* Remove information from the compiler libraries */
  /DISCARD/ :
  {
    libc.a ( * )
    libm.a ( * )
    libgcc.a ( * )
  }
 
  .ARM.attributes 0 : { *(.ARM.attributes) }
}

SStor · ‎2022-01-24

It's the simple standard linker file, which makes no distinction between TCM and non-TCM RAM.

I use IAR IDE and not CubeIDE and the linker script formats are different (see my IAR linker file attached in last post above). Therefore I cannot help with CubeIDE unfortunately.

Look for a linker script template for any STM32F7/STM32H7 with ITCM/DTCM definitions in your CubeIDE path and adapt it for your processor manually.

It was also discussed in other threads here in the forum, e.g. ITCM and DTCM with STM32F7 Discovery

See also AN4839 to understand internal memory structure in STM32F7/STM32H7 series: Level 1 cache on STM32F7 Series and STM32H7 Series
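As a rough illustration of why the distinction matters: TCM accesses bypass the D-Cache, so a buffer that lies entirely in DTCM needs no cache maintenance. The sketch below checks whether a buffer falls inside DTCM. The base address and size are assumptions for an STM32F76x part (DTCM as the first 128 KB of the 512K region starting at 0x20000000) and should be verified against the reference manual; `in_dtcm` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed STM32F76x memory map; verify against the reference manual. */
#define DTCM_BASE  0x20000000u
#define DTCM_SIZE  (128u * 1024u)

/* True when [addr, addr + len) lies entirely inside DTCM, which the
 * D-Cache does not cover, so no clean/invalidate is needed there. */
static bool in_dtcm(uint32_t addr, uint32_t len)
{
  return addr >= DTCM_BASE &&
         len <= DTCM_SIZE &&
         (addr - DTCM_BASE) <= (DTCM_SIZE - len);
}
```

If that assumption holds, the default script above places variables in DTCM only for as long as they happen to fall within the first 128 KB of RAM.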

AMiro.1 · ‎2022-02-16

The problem is simple. FatFS reads the first two sectors of the FAT partition by calling SD_read twice, like this:

SD_read(sector=8192, count=1)

SD_read(sector=8193, count=1)

In both calls it uses the win field of its internal FATFS structure as the buffer:

typedef struct {
   ...
  DWORD volbase;   /* Volume base sector */
  DWORD fatbase;   /* FAT base sector */
  DWORD dirbase;   /* Root directory base sector/cluster */
  DWORD database;  /* Data base sector */
  DWORD winsect;   /* Current sector appearing in the win[] */
  BYTE win[_MAX_SS];
} FATFS;

After some project growth, the win field moves out of the DTCM region and happens not to be aligned on a 32-byte boundary, so this code starts to have an effect:

alignedAddr = (uint32_t)buff & ~0x1F;
 SCB_InvalidateDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));

This invalidation discards the cached data between alignedAddr and buff.

In fact, after reading the first sector FatFS fills some fields in the FATFS structure and then reads the second sector. That second SD_read invalidates up to 28 bytes of cache in front of the win field, so some newly filled fields of the FATFS structure (like fatbase!) can be lost, since at that moment they exist only in the (now invalidated) cache. After that, every FatFS operation will surely fail.

So cleaning only the 32 bytes of cache between alignedAddr and buff should be enough; there is no need for a full-scale SCB_CleanDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));
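The failure described above can be reproduced with a toy model of a single write-back cache line. This is only an illustration that runs on a PC: `toy_line_t` and the helper functions are hypothetical and greatly simplified (no eviction, one line only), not the real Cortex-M7 cache behaviour in full.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE 32u

/* Toy model of one write-back cache line over 32 bytes of RAM. */
typedef struct {
  uint8_t ram[LINE];    /* backing memory, what the DMA sees */
  uint8_t cache[LINE];  /* cached copy, what the CPU sees */
  bool valid;
  bool dirty;
} toy_line_t;

/* Load the line from RAM if it is not currently cached. */
static void fill(toy_line_t *l)
{
  if (!l->valid) {
    memcpy(l->cache, l->ram, LINE);
    l->valid = true;
    l->dirty = false;
  }
}

/* CPU store: lands in the cache only and marks the line dirty. */
static void cpu_write(toy_line_t *l, uint32_t off, uint8_t v)
{
  fill(l);
  l->cache[off] = v;
  l->dirty = true;
}

/* CPU load: served from the cache. */
static uint8_t cpu_read(toy_line_t *l, uint32_t off)
{
  fill(l);
  return l->cache[off];
}

/* Clean: write a dirty line back to RAM. */
static void clean(toy_line_t *l)
{
  if (l->valid && l->dirty) {
    memcpy(l->ram, l->cache, LINE);
    l->dirty = false;
  }
}

/* Invalidate: drop the line, losing any dirty data. */
static void invalidate(toy_line_t *l)
{
  l->valid = false;
  l->dirty = false;
}

/* DMA store: goes straight to RAM, bypassing the cache. */
static void dma_write(toy_line_t *l, uint32_t off, uint8_t v)
{
  l->ram[off] = v;
}
```

With offset 0 standing in for fatbase and offset 4 for the start of win, the sequence cpu_write, dma_write, invalidate loses the CPU's value at offset 0, while cpu_write, clean, dma_write, invalidate keeps both values.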

SStor · ‎2022-02-16

Don't forget to also clean the memory at the end of the buffer.

The problem of losing cached data exists both BEFORE and BEHIND a buffer that is not 32-byte aligned.

If you don't want to clean the whole buffer, you have to clean the beginning and the end separately:

(I don't know which method of cleaning is faster?)

    /* Clean data cache to write additional aligned data BEFORE DMA buffer */
    SCB_CleanDCache_by_Addr((uint32_t*)alignedAddr, 32);
    /* Clean data cache to write additional aligned data BEHIND DMA buffer */
    SCB_CleanDCache_by_Addr((uint32_t*)(((uint32_t)buff + count*BLOCKSIZE) & ~0x1F), 32);
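A possible variation on the two addresses above, as a sketch with hypothetical helper names: the tail expression above uses the address one past the buffer, so when the buffer end is already line-aligned it cleans the following line as well (harmless, but unnecessary). Computing the line of the last byte avoids that.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-Cache line size in bytes */

/* Address of the cache line holding the first byte of the buffer. */
static uint32_t first_line(uint32_t buf)
{
  return buf & ~(DCACHE_LINE_SIZE - 1u);
}

/* Address of the cache line holding the last byte of the buffer.
 * Assumes len > 0. */
static uint32_t last_line(uint32_t buf, uint32_t len)
{
  return (buf + len - 1u) & ~(DCACHE_LINE_SIZE - 1u);
}

/* On target (not compiled here):
 *   SCB_CleanDCache_by_Addr((uint32_t *)first_line((uint32_t)buff), 32);
 *   SCB_CleanDCache_by_Addr(
 *       (uint32_t *)last_line((uint32_t)buff, count * BLOCKSIZE), 32);
 */
```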

AMiro.1 · ‎2022-02-16

You are right, I have missed this point.

>I don't know which method of cleaning is faster?

    int32_t op_size = dsize;
    uint32_t op_addr = (uint32_t) addr;
    int32_t linesize = 32;                /* in Cortex-M7 size of cache line is fixed to 8 words (32 bytes) */
    __DSB();
    while (op_size > 0) {
      SCB->DCCMVAC = op_addr;
      op_addr += (uint32_t)linesize;
      op_size -=           linesize;
    }
    __DSB();
    __ISB();

The algorithm inside SCB_CleanDCache_by_Addr appears to be linear in the buffer length, so two cleans of 32 bytes should be faster than one clean of 512 bytes.
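To put rough numbers on that, the loop above issues one DCCMVAC write per 32-byte line, so the cost can be estimated by counting the lines a range touches. A small host-side sketch (`dcache_lines_touched` is a hypothetical helper, and line count is only a proxy for time):

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-Cache line size in bytes */

/* Number of cache lines covered by [addr, addr + len). */
static uint32_t dcache_lines_touched(uint32_t addr, uint32_t len)
{
  uint32_t first;
  uint32_t last;

  if (len == 0u) {
    return 0u;
  }
  first = addr / DCACHE_LINE_SIZE;
  last = (addr + len - 1u) / DCACHE_LINE_SIZE;
  return last - first + 1u;
}
```

An unaligned 512-byte sector spans 17 lines, against 2 lines for the two edge cleans.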

And, if we are talking about speed, it might be worth wrapping the cleaning in a conditional statement like this:

if (alignedAddr < (uint32_t)buff) { 
   /* Clean data cache to write additional aligned data BEFORE DMA buffer */
    SCB_CleanDCache_by_Addr((uint32_t*)alignedAddr, 32);
    /* Clean data cache to write additional aligned data BEHIND DMA buffer */
    SCB_CleanDCache_by_Addr((uint32_t*)(((uint32_t)buff + count*BLOCKSIZE) & ~0x1F), 32);
}

Piranha · ‎2022-02-18

Good point about cleaning both ends! Unfortunately, guys, this is still not a complete workaround. ST's code invalidates the buffer memory after the DMA read operation, but that is a flaw - the buffer memory must be invalidated before passing it to DMA. Read more in my answer to this topic:

https://community.st.com/s/question/0D53W00000oXSzySAG/different-cache-behavior-between-stm32h7-and-stm32f7

So as a minimum after those 2 clean operations you should also do invalidation on the whole buffer before starting the DMA. As the invalidation call has to loop through the whole buffer anyway, it's probably easier to replace those 3 calls with a single SCB_CleanInvalidateDCache_by_Addr() call on the whole buffer. Anyway, after this modification ST's invalidation code after the DMA operation can be removed as useless (and flawed anyway).
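The ordering suggested here can be sketched as follows. This is not ST's code: `sd_hooks_t` and `sd_read_with_maintenance` are hypothetical names, and the two `fake_` functions are host-side stand-ins that only record the call order so the sketch runs on a PC. On target the hooks would wrap SCB_CleanInvalidateDCache_by_Addr() and the blocking DMA read.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-Cache line size in bytes */

/* Hypothetical hooks, so the ordering can be shown without HAL headers. */
typedef struct {
  void (*clean_invalidate)(uint32_t addr, uint32_t size);
  int  (*dma_read)(uint32_t addr, uint32_t size);  /* 0 on success */
} sd_hooks_t;

static int sd_read_with_maintenance(const sd_hooks_t *h,
                                    uint32_t buf, uint32_t len)
{
  uint32_t start = buf & ~(DCACHE_LINE_SIZE - 1u);
  uint32_t end = (buf + len + DCACHE_LINE_SIZE - 1u) & ~(DCACHE_LINE_SIZE - 1u);

  /* 1. Write back dirty neighbours and drop every line covering the
   *    buffer BEFORE the DMA is started. */
  h->clean_invalidate(start, end - start);

  /* 2. Only now let the DMA fill the buffer. */
  return h->dma_read(buf, len);
}

/* Host-side stand-ins that record what was called, and in which order. */
static int seq;
static int clean_seq;
static int dma_seq;
static uint32_t clean_addr;
static uint32_t clean_size;

static void fake_clean_invalidate(uint32_t addr, uint32_t size)
{
  clean_addr = addr;
  clean_size = size;
  clean_seq = ++seq;
}

static int fake_dma_read(uint32_t addr, uint32_t size)
{
  (void)addr;
  (void)size;
  dma_seq = ++seq;
  return 0;
}
```

The sketch assumes the CPU does not touch the affected cache lines while the DMA transfer is in flight.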

SStor · ‎2022-02-19

Yes, cache handling with DMA is complex and error-prone, especially with unaligned DMA buffers.

Therefore, also consider the second option: using a DMA scratch buffer in DTCM RAM to avoid cache handling completely (see posts below in this thread). It requires a CPU memcpy between memory areas (sector buffer and DMA scratch buffer) instead of cleaning or invalidating the cache, but it works safely for both aligned and unaligned data, for reads and writes.

Anyway it requires a scratch buffer for RW data that is not DWORD aligned, because DMA can handle only DWORD aligned data.
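A minimal host-side sketch of the scratch-buffer idea, with no cache maintenance at all. The names `read_sector_fn`, `read_via_scratch` and `fake_read_sector` are hypothetical. On target, `scratch` would have to be placed in DTCM through the linker script (not shown here) and the hook would wrap the DMA block read.

```c
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 512u

/* Hypothetical hook: read one sector into a word-aligned buffer,
 * returning 0 on success. */
typedef int (*read_sector_fn)(uint32_t *dst, uint32_t sector);

/* Word-aligned scratch buffer; assumed to live in DTCM on target. */
static uint32_t scratch[BLOCKSIZE / 4u];

/* Read count sectors into buff (any alignment), one sector at a time. */
static int read_via_scratch(read_sector_fn read_sector, uint8_t *buff,
                            uint32_t sector, uint32_t count)
{
  uint32_t i;

  for (i = 0u; i < count; i++) {
    if (read_sector(scratch, sector + i) != 0) {
      return -1;
    }
    /* The CPU copy replaces clean/invalidate on the destination buffer. */
    memcpy(buff, scratch, BLOCKSIZE);
    buff += BLOCKSIZE;
  }
  return 0;
}

/* Host stand-in: fills each sector with the low byte of its number. */
static int fake_read_sector(uint32_t *dst, uint32_t sector)
{
  memset(dst, (int)(sector & 0xFFu), BLOCKSIZE);
  return 0;
}
```

The trade-off is one memcpy per sector in exchange for not having to reason about cache lines shared with neighbouring data.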

mantisrobot · ‎2022-02-22

This is what I need to do. I do have the DMA behaving now using cache cleaning, but I'm not sure how efficient this is. Ultimately I would like to get rid of the cache clean headache, as it has caused nothing but trouble, but this is my first STM32 project and as yet I've not figured out how to set up the DTCM RAM in the linker. I will investigate this further when I have time to spend on this.. or when my cache clean approach fails again :D