Missing D-Cache clean in "sd_diskio.c" SD_Read() causes SD card FATFS libraries to fail

mantisrobot
Associate III

I've been working with an F7 processor, using an RTOS with FatFs and an SD card in 4-bit mode on SDMMC1.

STM32CubeIDE V1.8

STM32CubeMX V6.4.0

Within the sd_diskio.c file there is the following option to enable D-Cache maintenance when D-Cache is used:

/*
 * when using cacheable memory region, it may be needed to maintain the cache
 * validity. Enable the define below to activate a cache maintenance at each
 * read and write operation.
 * Notice: This is applicable only for cortex M7 based platform.
 */
/* USER CODE BEGIN enableSDDmaCacheMaintenance */
#define ENABLE_SD_DMA_CACHE_MAINTENANCE  1
/* USER CODE END enableSDDmaCacheMaintenance */

However, a D-Cache clean seems to be missing within SD_read(). This is the generated code:

/* USER CODE BEGIN beforeReadSection */
/* can be used to modify previous code / undefine following code / add new code */
/* USER CODE END beforeReadSection */
/**
  * @brief  Reads Sector(s)
  * @param  lun : not used
  * @param  *buff: Data buffer to store read data
  * @param  sector: Sector address (LBA)
  * @param  count: Number of sectors to read (1..128)
  * @retval DRESULT: Operation result
  */
 
DRESULT SD_read(BYTE lun, BYTE *buff, DWORD sector, UINT count)
{
  uint8_t ret;
  DRESULT res = RES_ERROR;
  uint32_t timer;
#if (osCMSIS < 0x20000U)
  osEvent event;
#else
  uint16_t event;
  osStatus_t status;
#endif
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
  uint32_t alignedAddr;
#endif
  /*
  * ensure the SDCard is ready for a new operation
  */
 
  if (SD_CheckStatusWithTimeout(SD_TIMEOUT) < 0)
  {
    return res;
  }
 
#if defined(ENABLE_SCRATCH_BUFFER)
  if (!((uint32_t)buff & 0x3))
  {
#endif
    /* Fast path cause destination buffer is correctly aligned */
    ret = BSP_SD_ReadBlocks_DMA((uint32_t*)buff, (uint32_t)(sector), count);
 
    if (ret == MSD_OK) {
#if (osCMSIS < 0x20000U)
    /* wait for a message from the queue or a timeout */
    event = osMessageGet(SDQueueID, SD_TIMEOUT);
 
    if (event.status == osEventMessage)
    {
      if (event.value.v == READ_CPLT_MSG)
      {
        timer = osKernelSysTick();
        /* block until SDIO IP is ready or a timeout occur */
        while(osKernelSysTick() - timer <SD_TIMEOUT)
#else
          status = osMessageQueueGet(SDQueueID, (void *)&event, NULL, SD_TIMEOUT);
          if ((status == osOK) && (event == READ_CPLT_MSG))
          {
            timer = osKernelGetTickCount();
            /* block until SDIO IP is ready or a timeout occur */
            while(osKernelGetTickCount() - timer <SD_TIMEOUT)
#endif
            {
              if (BSP_SD_GetCardState() == SD_TRANSFER_OK)
              {
                res = RES_OK;
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
                /*
                the SCB_InvalidateDCache_by_Addr() requires a 32-Byte aligned address,
                adjust the address and the D-Cache size to invalidate accordingly.
                */
                alignedAddr = (uint32_t)buff & ~0x1F;
                SCB_InvalidateDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));
#endif
                break;
              }
            }
#if (osCMSIS < 0x20000U)
          }
        }
#else
      }
#endif
    }
 
#if defined(ENABLE_SCRATCH_BUFFER)
    }
    else
    {
      /* Slow path, fetch each sector a part and memcpy to destination buffer */
      int i;
 
      for (i = 0; i < count; i++)
      {
        ret = BSP_SD_ReadBlocks_DMA((uint32_t*)scratch, (uint32_t)sector++, 1);
        if (ret == MSD_OK )
        {
          /* wait until the read is successful or a timeout occurs */
#if (osCMSIS < 0x20000U)
          /* wait for a message from the queue or a timeout */
          event = osMessageGet(SDQueueID, SD_TIMEOUT);
 
          if (event.status == osEventMessage)
          {
            if (event.value.v == READ_CPLT_MSG)
            {
              timer = osKernelSysTick();
              /* block until SDIO IP is ready or a timeout occur */
              while(osKernelSysTick() - timer <SD_TIMEOUT)
#else
                status = osMessageQueueGet(SDQueueID, (void *)&event, NULL, SD_TIMEOUT);
              if ((status == osOK) && (event == READ_CPLT_MSG))
              {
                timer = osKernelGetTickCount();
                /* block until SDIO IP is ready or a timeout occur */
                ret = MSD_ERROR;
                while(osKernelGetTickCount() - timer < SD_TIMEOUT)
#endif
                {
                  ret = BSP_SD_GetCardState();
 
                  if (ret == MSD_OK)
                  {
                    break;
                  }
                }
 
                if (ret != MSD_OK)
                {
                  break;
                }
#if (osCMSIS < 0x20000U)
              }
            }
#else
          }
#endif
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
          /*
          *
          * invalidate the scratch buffer before the next read to get the actual data instead of the cached one
          */
          SCB_InvalidateDCache_by_Addr((uint32_t*)scratch, BLOCKSIZE);
#endif
          memcpy(buff, scratch, BLOCKSIZE);
          buff += BLOCKSIZE;
        }
        else
        {
          break;
        }
      }
 
      if ((i == count) && (ret == MSD_OK ))
        res = RES_OK;
    }
#endif
  return res;
}

I have added a cache clean prior to the BSP_SD_ReadBlocks_DMA call which resolved my connection issues:

#if defined(ENABLE_SCRATCH_BUFFER)
  if (!((uint32_t)buff & 0x3))
  {
#endif
#if (ENABLE_SD_DMA_CACHE_MAINTENANCE == 1)
    alignedAddr = (uint32_t)buff & ~0x1F;
    // Clean whole aligned buffer from data cache
    SCB_CleanDCache_by_Addr((uint32_t*)alignedAddr, count*BLOCKSIZE + ((uint32_t)buff - alignedAddr));
#endif
    /* Fast path cause destination buffer is correctly aligned */
    ret = BSP_SD_ReadBlocks_DMA((uint32_t*)buff, (uint32_t)(sector), count);
 
    if (ret == MSD_OK) {
#if (osCMSIS < 0x20000U)
    /* wait for a message from the queue or a timeout */
    event = osMessageGet(SDQueueID, SD_TIMEOUT);
 
    if (event.status == osEventMessage)
    {
      if (event.value.v == READ_CPLT_MSG)
      {
        timer = osKernelSysTick();
        /* block until SDIO IP is ready or a timeout occur */
        while(osKernelSysTick() - timer <SD_TIMEOUT)
#else

This is working for me. Without it, I can't use the SD library with D-Cache enabled.
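To make the address arithmetic in the snippet above concrete, here is a small host-side sketch in C. The helper names cache_range_start and cache_range_size are hypothetical and not part of sd_diskio.c, and BLOCKSIZE is assumed to be 512 bytes. It only reproduces the formula used for alignedAddr and the length argument; it performs no cache maintenance itself.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32U   /* Cortex-M7 D-Cache line size in bytes */
#define BLOCKSIZE       512U  /* assumption: SD block size in bytes */

/* Hypothetical helper: start of the 32-byte cache line that contains addr.
 * Same computation as "alignedAddr = (uint32_t)buff & ~0x1F". */
static uint32_t cache_range_start(uint32_t addr)
{
  return addr & ~(CACHE_LINE_SIZE - 1U);
}

/* Hypothetical helper: length passed to the maintenance call, i.e. the
 * payload plus the bytes between the aligned start and the buffer. */
static uint32_t cache_range_size(uint32_t addr, uint32_t count)
{
  assert(count >= 1U); /* SD_read() documents count as 1..128 */
  return count * BLOCKSIZE + (addr - cache_range_start(addr));
}
```

For example, a buffer at 0x20001234 with count = 2 gives a start of 0x20001220 and a length of 1044 bytes (1024 + 20). Because the cache operations act on whole 32-byte lines, the first line also covers the 20 bytes in front of buff, so any unrelated data sharing that line is affected as well.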

28 REPLIES

you saved my life :smiling_face_with_heart_eyes:

I’m glad it helped! This drove me mad for weeks! :\

Piranha
Chief II

Unfortunately the cache maintenance examples in previous posts of this topic are broken or incomplete. My previous posts are also incomplete. The correct solution is shown and explained here:

https://community.st.com/s/question/0D53W00001Z9K9TSAV/maintaining-cpu-data-cache-coherence-for-dma-buffers

I second that this was enough to solve my issues. Thank you.

As explained several times, it's still incomplete and broken. The read operation needs a D-cache invalidation both before and after, not a single cleaning. And a write operation needs a cleaning, which is completely missing in this example. The correct D-cache maintenance principles are explained and shown as an example in a link provided in my latest post down there.
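The ordering described in this reply can be sketched as follows. This is a host-side illustration only, not the code from the linked article: cache_invalidate, cache_clean, dma_read_and_wait and dma_write_and_wait are hypothetical stand-ins that just record the order of calls. On the target they would presumably map to SCB_InvalidateDCache_by_Addr(), SCB_CleanDCache_by_Addr() and the BSP_SD_ReadBlocks_DMA() / BSP_SD_WriteBlocks_DMA() calls followed by the wait for completion. The sketch also assumes the buffer is 32-byte aligned and a multiple of 32 bytes long, so that no unrelated data shares its cache lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Call log for the hypothetical stand-ins below:
 * 'I' = invalidate, 'C' = clean, 'R' = DMA read, 'W' = DMA write. */
static char op_log[64];

static void log_op(char c)
{
  size_t n = strlen(op_log);
  if (n + 1U < sizeof(op_log)) { op_log[n] = c; op_log[n + 1U] = '\0'; }
}

/* Hypothetical stand-ins; they only log, they do not touch any cache. */
static void cache_invalidate(void *addr, size_t len) { (void)addr; (void)len; log_op('I'); }
static void cache_clean(const void *addr, size_t len) { (void)addr; (void)len; log_op('C'); }
static int dma_read_and_wait(void *buf, size_t len) { (void)buf; (void)len; log_op('R'); return 0; }
static int dma_write_and_wait(const void *buf, size_t len) { (void)buf; (void)len; log_op('W'); return 0; }

/* Read: invalidate before the transfer so no dirty line can be written back
 * over the data the DMA stores, and again afterwards to drop any line that
 * was fetched into the cache while the transfer was running. */
static int sd_read_cached(void *buf, size_t len)
{
  assert(((uintptr_t)buf % 32U) == 0U && (len % 32U) == 0U);
  cache_invalidate(buf, len);
  int ret = dma_read_and_wait(buf, len);
  cache_invalidate(buf, len);
  return ret;
}

/* Write: clean before the transfer so the DMA reads the data the CPU
 * actually wrote, not stale memory contents. */
static int sd_write_cached(const void *buf, size_t len)
{
  cache_clean(buf, len);
  return dma_write_and_wait(buf, len);
}
```

With these stand-ins a read logs "IRI" and a write logs "CW", which is the order of operations the reply asks for.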

Yes, you are right, there have been some serious bugs in the HAL functions for many years, especially with the cache on the M7 core.

But what do you think about using DTCM RAM instead of cached areas as suggested in my other post below?

As I understand it, the DTCM RAM area has the same speed as the D-cache itself but needs no cache handling such as cleaning and invalidating.

So DTCM seems to be intended and predestined for DMA buffers, which can be placed in DTCM RAM explicitly.

DMA buffers shouldn't be placed in cached areas, so that the time-consuming clean and invalidate operations are avoided as far as possible...
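As a rough sketch of what placing a DMA buffer in DTCM explicitly could look like with GCC: the section name ".dtcm_buffers" is hypothetical and has to exist in the linker script and be mapped to the DTCM RAM region, and sd_dma_buffer and sd_get_dma_buffer are illustrative names only. On a host build the section attribute is left out, so the snippet still compiles there.

```c
#include <assert.h>
#include <stdint.h>

#define SD_BLOCK_SIZE 512U  /* assumption: one SD block */

#if defined(__arm__) && defined(__GNUC__)
/* Hypothetical section name; the linker script must place it in DTCM RAM. */
#define DTCM_BUFFER __attribute__((section(".dtcm_buffers"), aligned(4)))
#else
/* Host build: keep only the 4-byte alignment checked by SD_read()'s fast path. */
#define DTCM_BUFFER __attribute__((aligned(4)))
#endif

/* Bounce buffer for SD DMA transfers. If it really ends up in DTCM,
 * no D-Cache maintenance is needed for it. */
static uint8_t sd_dma_buffer[SD_BLOCK_SIZE] DTCM_BUFFER;

/* Illustrative accessor so other modules do not depend on the placement. */
uint8_t *sd_get_dma_buffer(void)
{
  return sd_dma_buffer;
}
```

Note that FatFs passes its own buffers to SD_read() and SD_write(), so either those buffers also have to live in DTCM, or the data has to be copied through a buffer like this one, much like the scratch-buffer path in the generated code.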

As I recommended, read my article and the comments there, which also explain the performance impact of non-cacheable memories, cache maintenance functions etc. Also note that, unlike on F7, on H7 only the CPU and MDMA have access to DTCM.

I've read your comments in the other thread, but I wrote here in this thread about the F7 family because the thread owner is using an STM32F765IIK.

And for the F7 family, DTCM should be a better choice than all the complex D-cache handling.

DTCM is not cacheable because it seems to have the same speed as the D-cache memory itself. So there should also be no performance impact at all. On the contrary, using DTCM should be even faster than the D-cache with extra cleaning and invalidating before and/or after DMA access. Or do you have any other information about the speed of the different memories?

Maybe the H7 family is a completely different thing with MDMA (I don't use H7). So the HAL functions should have a different structure for F7 and H7, and not simply be copied from H7 to F7 at the cost of F7 performance...

Hello Piranha!

I read your article, but I have to say I didn't fully understand the modifications needed in BSP_SD_ReadBlocks_DMA and BSP_SD_WriteBlocks_DMA to make sure the fix is complete. Is there any chance you could provide an example?

thank you