STM32H7 with FatFS on eMMC sometimes failing

TVare.1 · ‎2024-01-26

Hello everyone,

I have a project based on the STM32H7, running FreeRTOS and FatFS (ChaN) on a 4GB eMMC (Kingston) connected to the SDMMC1 peripheral.

Normally everything works fine. SDMMC1 and FatFS (FAT32) are initialized in main, before the OS starts, and are used later on for many things such as creating, writing and reading files.

However, we found some devices that had been working fine and then suddenly (after a couple of thousand writes to the eMMC, which I'd assume is not too intense) started showing a constant and reproducible error. FatFS initializes OK, but as soon as we want to write something (still in main, before the OS starts), it fails. Basically, opening a file (that already exists) fails with FR_DISK_ERR and we cannot use FatFS anymore. Debugging a bit, I saw that at the lowest level HAL_MMC_ReadBlocks (from stm32h7xx_hal_mmc.c) is failing with HAL_TIMEOUT.

However, if I send a "reset" command to our device (it basically just executes __NVIC_SystemReset from CMSIS on the STM32), everything works fine on the following boot. Both the issue and the "fix" are consistent: once they have happened once, they happen every time. FYI, the eMMC is also powered off/on when booting (we control this with dedicated circuits driven by the main uC).

I'm aware there are many layers where the failure could be:
1) HW? But then why do most devices work fine, and why are the failing ones consistent once they start failing?
2) ST HAL: could some corner case be triggering a bug?
3) FatFS: does something related to the FAT get corrupted inside the eMMC? We can mount and access the eMMC via USB, so I tried formatting it (from a Windows PC) and it didn't make a difference...
4) A combination of electrical/eMMC/firmware conditions?

I suspect that the conditions are somehow also related to using Standby low-power mode, since I cannot reproduce the issue if I only use Stop mode. I don't see why this would matter, since as far as I understand the STM32 reboots when waking up from Standby mode, and therefore everything should be initialized the same way.

Since the issue stops happening after a SW reset but comes back on the next boot after a power off/on, I figured someone might recognize another potential cause... Any ideas about what it could be or what I should check?

Thanks in advance!

TVare.1 · ‎2024-06-05

Hi!

As an update: I managed to debug further and identify the exact moment when the issue happens. Basically, what I saw is:

The first write operation on the eMMC after powering it up takes a lot of time, and this time seems to increase in proportion to how much the eMMC has been used.
Once the first operation finishes, the following write attempts are normal. This is why my reset was "fixing" the issue: during the reset the uC boots so fast that the eMMC stays powered ON (thanks to some capacitors), and therefore the "first write operation" has already happened.
The failure happens when this first write operation takes more than 100ms, because of this piece of code in the HAL (HAL_MMC_WriteBlocks()):

while (!__HAL_MMC_GET_FLAG(hmmc,
                               SDMMC_FLAG_TXUNDERR | SDMMC_FLAG_DCRCFAIL | SDMMC_FLAG_DTIMEOUT | SDMMC_FLAG_DATAEND))
    {
      if (__HAL_MMC_GET_FLAG(hmmc, SDMMC_FLAG_TXFIFOHE) && (dataremaining >= 32U))
      {
        /* Write data to SDMMC Tx FIFO */
        for (count = 0U; count < 8U; count++)
        {
          data = (uint32_t)(*tempbuff);
          tempbuff++;
          data |= ((uint32_t)(*tempbuff) << 8U);
          tempbuff++;
          data |= ((uint32_t)(*tempbuff) << 16U);
          tempbuff++;
          data |= ((uint32_t)(*tempbuff) << 24U);
          tempbuff++;
          (void)SDMMC_WriteFIFO(hmmc->Instance, &data);
        }
        dataremaining -= 32U;
      }

      if (((HAL_GetTick() - tickstart) >=  Timeout) || (Timeout == 0U))
      {
        /* Clear all the static flags */
        __HAL_MMC_CLEAR_FLAG(hmmc, SDMMC_STATIC_FLAGS);
        hmmc->ErrorCode |= errorstate;
        hmmc->State = HAL_MMC_STATE_READY;
        return HAL_TIMEOUT;
      }
    }

The TXFIFOHE flag takes too long to be set, so the code is stuck in this loop until it eventually fails because Timeout is configured as 100ms. I guess this delay is somehow related to the hardware flow control and how the eMMC and the SDMMC controller communicate. Again, this only happens during the first write operation on the eMMC.

My guess is that the eMMC internal controller has some kind of wear leveling/garbage collection/whatever algorithm that runs on the first operation after powering ON. After that, performance gets "normal". Could someone confirm this? Or could there be another reason?

I'll try to measure these times with other eMMCs, to see if this is due to the quality of the part itself, or if it's something that will always happen and I'll have to live with (e.g. by increasing the timeout value).
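If the slow first write turns out to be inherent to the part, one option is to give only the first write after power-up a larger timeout. Below is a minimal host-testable sketch of that idea. The callback type `block_write_fn`, the `emmc_writer_t` struct, the simulated device and the 1000 ms budget are all assumptions made for illustration; on target the callback would wrap a call such as HAL_MMC_WriteBlocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback standing in for the real block write.
 * Returns 0 on success, non-zero on timeout/error. */
typedef int (*block_write_fn)(const uint8_t *data, uint32_t block,
                              uint32_t count, uint32_t timeout_ms);

#define NORMAL_TIMEOUT_MS      100U   /* value mentioned in this thread */
#define FIRST_WRITE_TIMEOUT_MS 1000U  /* assumed budget, tune by measurement */

typedef struct {
    block_write_fn write;
    bool first_write_done;  /* clear this whenever the eMMC is power-cycled */
} emmc_writer_t;

/* Use the long timeout until one write has completed since power-up. */
static int emmc_write(emmc_writer_t *w, const uint8_t *data,
                      uint32_t block, uint32_t count)
{
    assert(w != NULL && w->write != NULL);

    uint32_t timeout = w->first_write_done ? NORMAL_TIMEOUT_MS
                                           : FIRST_WRITE_TIMEOUT_MS;
    int rc = w->write(data, block, count, timeout);
    if (rc == 0) {
        w->first_write_done = true;
    }
    return rc;
}

/* Simulated device for host testing only: the first write needs 200 ms,
 * later writes need 5 ms. */
static uint32_t sim_latency_ms = 200U;
static uint32_t sim_last_timeout_ms;

static int sim_write(const uint8_t *data, uint32_t block,
                     uint32_t count, uint32_t timeout_ms)
{
    (void)data; (void)block; (void)count;
    sim_last_timeout_ms = timeout_ms;
    if (sim_latency_ms >= timeout_ms) {
        return -1;              /* would hit the timeout check */
    }
    sim_latency_ms = 5U;        /* subsequent writes are fast */
    return 0;
}
```

The flag has to follow the eMMC power state rather than the MCU reset state, since a SW reset with the eMMC still powered does not bring the slow first write back.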

@Tesla DeLorean I'm not sure if implementing DMA will benefit me. When you say that polling mode is fragile, are you talking about the timeout issue that I'm experiencing? Or is your recommendation about keeping the CPU free for other operations while a long eMMC write is in progress? If it's the latter, it may not be that important for my use case. But please let me know if there are other reasons why you recommend DMA.

Thanks!

tjaekel · ‎2024-06-06

I think the first write can be delayed a lot by the File System.
Before you write new data into a file (a sector), the File System has to scan the entire FAT in order to know where to write this sector. The File System will cache (read) the MBR and the FAT, and find the cluster that contains the sector to write. I'm not sure if it would also read the entire cluster (I think that's not necessary, since SD card sectors can be written in 512-byte chunks).

But something else happens as well: if you write less than a sector (512 bytes), this sector must be read first, modified (overwritten with your data) and then written back. It is also possible that this write-back is delayed, with the written data still sitting in a sector cache. If you write more data into the same sector, it goes into the cache memory. If a flush is not forced, the cached sector may only be written after a timeout (the elapsed time before the cache is written back to the SD card sector).
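The read-modify-write step for a partial sector can be sketched as follows. This is only an illustration on a RAM-backed stand-in for the storage device; the names `disk`, `sector_read`, `sector_write` and `write_within_sector` are hypothetical and this is not how FatFS itself is implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512U
#define NUM_SECTORS 8U

/* RAM-backed stand-in for the storage device (assumption for this sketch). */
static uint8_t disk[NUM_SECTORS][SECTOR_SIZE];
static unsigned sector_reads, sector_writes;

static void sector_read(uint32_t lba, uint8_t *buf)
{
    memcpy(buf, disk[lba], SECTOR_SIZE);
    sector_reads++;
}

static void sector_write(uint32_t lba, const uint8_t *buf)
{
    memcpy(disk[lba], buf, SECTOR_SIZE);
    sector_writes++;
}

/* Write len bytes at byte offset off inside one sector.
 * A partial update needs a read first; a full-sector update does not. */
static void write_within_sector(uint32_t lba, uint32_t off,
                                const uint8_t *src, uint32_t len)
{
    uint8_t buf[SECTOR_SIZE];

    assert(lba < NUM_SECTORS);
    assert(off <= SECTOR_SIZE && len <= SECTOR_SIZE - off);

    if (len == SECTOR_SIZE) {        /* off is necessarily 0 here */
        sector_write(lba, src);
        return;
    }
    sector_read(lba, buf);           /* read */
    memcpy(buf + off, src, len);     /* modify */
    sector_write(lba, buf);          /* write back */
}
```

A write of a few bytes therefore costs one sector read plus one sector write at the device level, while a full 512-byte write costs a single sector write.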

So, also consider the overhead in the File System and what happens in particular on a write. You will potentially see a lot of reads before the first write.
After the first write is done, continuous writes into the same file can be way faster: the FAT and all the needed information, e.g. cluster number and next sector number, are known. Only when you cross a cluster boundary might it be delayed a bit again. It depends on how much of the FAT and how many sectors are cached by the File System. Also, a write will be done on a cached sector in memory before it is written back.

I would assume this behavior is related to the overhead in the File System and how it figures out which sector to write (several iterations over the FAT, which must be read first).

TVare.1 · ‎2024-06-06

Just to be sure I understand you: when you talk about the File System, are you referring to the FatFS library running on the STM32? If so, I should clarify that the delays I saw are not there but at the lowest level, once the FatFS library has already called disk_write.

In my case, this happens when closing the first file I write (I write less than 512 bytes), which is the moment when f_sync happens and the actual communication with the eMMC takes place. This is why I'm pretty sure that the delay comes from the communication with the eMMC itself: I measured the time spent in the lowest layer, in HAL_MMC_WriteBlocks (once the file system has already decided which cluster to write and so on).

After this first direct write to the eMMC, the following writes, either to the same file or to different files, take way less time.

I must say that in some cases I did see a slight delay that I think was caused by the FatFS library looking for a cluster to write, but if I'm not wrong this delay was never significant and did not break the normal behaviour of the device and libraries.

tjaekel · ‎2024-06-06

OK, so you measured the duration of the first write itself. If this first write takes longer than usual, there is something else to bear in mind:

When the SD card gets a write command for one sector, this sector has to be erased first. The SD card might be "smart" enough to realize that before it can write, the sector has to be erased (all bytes set to 0xFF). This could block the actual write command until the erase has finished (inside the SD card).

To be honest, it does not explain why all the following writes are faster. All writes to a sector (with full sector size) should have this delay.

What I mean is: do not blame the MCU and FW first, maybe it is the SD card itself.
Is there a difference between a freshly formatted, empty SD card and one where you want to write (overwrite) an existing file?
Or the SD card needs time on the first write to prepare for erase and write, e.g. to ramp up the internal voltages needed for them. If the SD card has internal power management, I could imagine that a first write takes longer in order to prepare the SD card for writing (which also always needs an erase cycle).

Personally, I would accept that the very first write is slow, as long as all following writes are within the spec. And writes will always be slower than reads. Maybe this feature is there so SD cards can be used as fast read-only memory, e.g. for booting from them, before doing any write.

I think it could be the SD card: do you have a chance to trace the signals and see the write command? There is potentially a status read in the FW function doing the write. The FW might poll the status for "write done", and the first time it takes longer for that bit to be set.
Or: check if your FW write command does such status polling, and if so, how often the status is read and how long it takes to get a "write done" status. If this varies between the first and the following writes, it is the SD card which slows it down (not your FW).
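The "count the status polls" check could be instrumented roughly like this. It is a sketch under assumptions: `tick_fn` and `ready_fn` are hypothetical hooks (on target they might wrap HAL_GetTick() and a card-status query), and the simulated device exists only so the loop can be exercised on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks, passed in as plain function pointers. */
typedef uint32_t (*tick_fn)(void);
typedef bool (*ready_fn)(void);

typedef struct {
    bool     ready;       /* true if the device reported ready in time */
    uint32_t polls;       /* number of status reads performed */
    uint32_t elapsed_ms;  /* ticks spent waiting */
} poll_result_t;

static poll_result_t wait_until_ready(ready_fn is_ready, tick_fn now,
                                      uint32_t timeout_ms)
{
    poll_result_t r = { false, 0U, 0U };

    assert(is_ready != NULL && now != NULL);

    uint32_t start = now();
    for (;;) {
        r.polls++;
        r.ready = is_ready();
        /* Unsigned subtraction stays correct across tick wraparound. */
        r.elapsed_ms = now() - start;
        if (r.ready || r.elapsed_ms >= timeout_ms) {
            return r;
        }
    }
}

/* Simulated device for host testing only: every tick read advances time
 * by 1 ms, and the device turns ready busy_ms after sim_start_write(). */
static uint32_t sim_time = 0xFFFFFFF0U;  /* start near wraparound on purpose */
static uint32_t sim_ready_at;

static uint32_t sim_now(void) { return sim_time++; }
static bool sim_is_ready(void) { return (int32_t)(sim_time - sim_ready_at) >= 0; }
static void sim_start_write(uint32_t busy_ms) { sim_ready_at = sim_time + busy_ms; }
```

Logging `polls` and `elapsed_ms` for the first write and for a few later writes shows directly whether the extra time is spent waiting on the device.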

tjaekel · ‎2024-06-06

I found this: a random read/write is slower than a sequential read/write.
In your case this means:
the first write is random (it goes to a different sector, after all the File System FAT reads are done). But a file is written in a fairly sequential way: usually, files on media are managed by a File System using clusters. A cluster is larger than a sector (e.g. 4K, or even larger on big media) and consists of N consecutive sectors. Writing more than one cluster (N sectors) will potentially also find a consecutive (free) cluster, and therefore consecutive sectors. See also "fragmentation" of drives, which slows things down when all clusters are spread randomly; for best speed, all clusters (sectors) should be in sequence on the media (and accessed in sequence).

So, the first write is a random write, and all the following writes (within the same cluster) are consecutive. This could also explain why the first write is "slower" than the following writes.

And as I understand it, the specs of SD cards often provide just the speed for consecutive writes. So random write speed is not specified (no figure is provided), but I'd assume it is slower.

So, I think it is the SD card and how it works that makes the first write slower than the others (potentially, all following writes are consecutive).
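To make the cluster/sector relationship concrete, here is a small sketch of the usual FAT mapping from a cluster number to its sectors. The struct and function names are hypothetical, and the geometry values in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512U

typedef struct {
    uint32_t data_start_lba;       /* first sector of the data area */
    uint32_t sectors_per_cluster;  /* e.g. 8 for 4 KiB clusters */
} fat_geometry_t;

/* FAT numbers data clusters from 2, so cluster 2 is the first cluster
 * of the data area. */
static uint32_t cluster_first_sector(const fat_geometry_t *g, uint32_t cluster)
{
    assert(g != NULL && cluster >= 2U);
    return g->data_start_lba + (cluster - 2U) * g->sectors_per_cluster;
}

/* Sector that holds a given byte offset inside a cluster. */
static uint32_t sector_for_offset(const fat_geometry_t *g, uint32_t cluster,
                                  uint32_t offset_in_cluster)
{
    uint32_t index = offset_in_cluster / SECTOR_SIZE;

    assert(g != NULL && index < g->sectors_per_cluster);
    return cluster_first_sector(g, cluster) + index;
}
```

With a data area starting at sector 2048 and 8 sectors per cluster, cluster 3 starts at sector 2056, right after the 8 sectors of cluster 2, so sequential writes within a cluster (and into the next cluster number) land on consecutive sectors.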

AScha.3 · ‎2024-06-06

And see the speed test of a "good" SD card: writing 1MB blocks is shown in red; reading, in blue, is good;

and the access time is 0.7ms, but one access is 10x slower (green dots).

So afair the medium is called "ok" if there is no access time > 300ms!

From my tests I can say: even with the same type of card from the same supplier, you can get one card with "perfect" timing, and the next one looks like the pic.

Did you try media from another manufacturer and supplier?

TVare.1 · ‎2024-06-07

Exactly, everything seems to point to the eMMC itself. There is indeed a difference between a fresh eMMC and a used one. As far as I have observed, the more used the eMMC, the longer this "first write" takes. So far I got around 200ms in the worst cases, which would be acceptable for my use case. However, if this first-write delay keeps rising too much (idk, >500ms) it could affect the user experience of my device. At the moment I'm running stress tests with many writes, and I'll check whether this time keeps increasing or eventually reaches a constant maximum.
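For the stress test, a small tracker like the one below could record the first-write time per power cycle and report whether the worst case is still growing. This is only a sketch; the struct, the function names and the idea of a "window" of samples without a new maximum are assumptions, not something taken from the HAL or FatFS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t samples;                /* number of first writes measured */
    uint32_t last_ms;                /* most recent measurement */
    uint32_t max_ms;                 /* worst case seen so far */
    uint32_t samples_since_new_max;  /* measurements since max_ms last grew */
} latency_stats_t;

/* Record one first-write duration (one sample per eMMC power cycle). */
static void latency_record(latency_stats_t *s, uint32_t ms)
{
    assert(s != NULL);
    s->samples++;
    s->last_ms = ms;
    if (ms > s->max_ms) {
        s->max_ms = ms;
        s->samples_since_new_max = 0U;
    } else {
        s->samples_since_new_max++;
    }
}

/* True if the worst case has not grown during the last `window` samples. */
static bool latency_plateaued(const latency_stats_t *s, uint32_t window)
{
    assert(s != NULL);
    return s->samples_since_new_max >= window;
}

/* True if the worst case exceeds what the product can tolerate. */
static bool latency_over_budget(const latency_stats_t *s, uint32_t budget_ms)
{
    assert(s != NULL);
    return s->max_ms > budget_ms;
}
```

A plateau in this sense only says the maximum has not moved recently; a long enough window is needed before treating it as a real upper bound.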

I also suspected something electrical, but I've added delays between powering ON the eMMC and the first write operation, to give the voltages some time to stabilize and so on, and I didn't see any change.

It's interesting what you mention about the eMMC doing something different (random write) on the first operation. That's actually my guess as well, but I cannot find actual information to confirm it.

So far I have not been able to trace the signals; I can only tell you that the delay comes from inside HAL_MMC_WriteBlocks, specifically this loop:

while (!__HAL_MMC_GET_FLAG(hmmc,
                               SDMMC_FLAG_TXUNDERR | SDMMC_FLAG_DCRCFAIL | SDMMC_FLAG_DTIMEOUT | SDMMC_FLAG_DATAEND))
    {
      if (__HAL_MMC_GET_FLAG(hmmc, SDMMC_FLAG_TXFIFOHE) && (dataremaining >= 32U))
      {
        /* Write data to SDMMC Tx FIFO */
        for (count = 0U; count < 8U; count++)
        {
          data = (uint32_t)(*tempbuff);
          tempbuff++;
          data |= ((uint32_t)(*tempbuff) << 8U);
          tempbuff++;
          data |= ((uint32_t)(*tempbuff) << 16U);
          tempbuff++;
          data |= ((uint32_t)(*tempbuff) << 24U);
          tempbuff++;
          (void)SDMMC_WriteFIFO(hmmc->Instance, &data);
        }
        dataremaining -= 32U;
      }

      if (((HAL_GetTick() - tickstart) >= Timeout) || (Timeout == 0U))
      {
        /* Clear all the static flags */
        __HAL_MMC_CLEAR_FLAG(hmmc, SDMMC_STATIC_FLAGS);
        hmmc->ErrorCode |= errorstate;
        hmmc->State = HAL_MMC_STATE_READY;
        return HAL_TIMEOUT;
      }
    }

I'm also researching the eMMC standard, to see if there is some configuration/cmd or approach I can use to mitigate this first operation delay.

@AScha.3 at the moment I'm trying to get some PCBs with eMMCs from different suppliers (or at least one with a Sandisk part). Then I'll run the same tests and write operations on them and share the results. Hopefully it gets better just by changing the component...

Tesla DeLorean · ‎2024-06-07

The Data Phase is intolerant of any wandering off task, the FIFO can provide some protection but can't handle persistent slowness in responsiveness.

ST's solution to misalignment is the worst possible one. It does it always, not in the small subset of situations where you've passed in a misaligned 32-bit buffer. Ideally FatFS and application level code will use aligned buffers always.

DMA is preferable because it's not slow, and can't be distracted. There might be some variability due to contention issues, but these will be brief and absorbed by the FIFO. It will require a 32-bit aligned buffer or, for cache coherency, a 32-byte aligned one.
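As an illustration of the alignment requirement, a sector buffer intended for DMA can be declared on a 32-byte boundary and checked before use. This is a host-testable sketch with hypothetical names; on target the cache maintenance itself would be done with the CMSIS routines (for example SCB_CleanDCache_by_Addr before a DMA write), which are not shown here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32U   /* Cortex-M7 data cache line, in bytes */
#define SECTOR_SIZE     512U

/* Sector buffer aligned to a cache line. Its size is a whole number of
 * lines, so cache maintenance on it cannot touch neighbouring variables. */
static alignas(CACHE_LINE_SIZE) uint8_t dma_sector_buf[SECTOR_SIZE];

static bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0U;
}

/* Round a byte count up to a whole number of cache lines. */
static size_t round_up_to_cache_line(size_t n)
{
    return (n + CACHE_LINE_SIZE - 1U) & ~((size_t)CACHE_LINE_SIZE - 1U);
}

/* Cheap precondition check to run before handing a buffer to DMA. */
static bool buffer_ok_for_dma(const void *p, size_t len)
{
    return p != NULL &&
           is_aligned(p, CACHE_LINE_SIZE) &&
           (len % CACHE_LINE_SIZE) == 0U;
}
```

Buffers handed down by FatFS or the application may not satisfy this check, in which case a common approach is to bounce the data through an aligned buffer like `dma_sector_buf`.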

Failure cascades: earlier failures manifest as ignored or failing subsequent commands, and thus as timeouts rather than over/under runs.
