STM32H7 with FatFS on eMMC sometimes failing

TVare.1 · ‎2024-01-26

Hello everyone,

I have a project using the STM32H7 with FreeRTOS and FatFS (ChaN) running on a 4GB eMMC (Kingston), using the SDMMC1 peripheral.

Normally everything works fine. SDMMC1 and FatFS (FAT32) are initialized in main, before the OS starts, and are used later on for many things such as creating, writing and reading files.

However, we found some devices that were working fine and then suddenly (after a couple of thousand writes to the eMMC, which I'd assume is not too intense) started showing a constant and reproducible error. FatFS initializes OK, but as soon as we want to write something (still in main, before the OS starts), it fails. Basically, opening a file (that already exists) fails with FR_DISK_ERR and we cannot use FatFS anymore. Debugging a bit, I saw that at the lowest level HAL_MMC_ReadBlocks (from stm32h7xx_hal_mmc.c) is failing with HAL_TIMEOUT.

However, if I send a "reset" command to our device (it basically just executes the CMSIS __NVIC_SystemReset on the STM32), on the following boot everything works fine. Both the issue and the "fix" are consistent: once they appear, they happen every time. FYI, the eMMC is also powered off/on when booting (we control this with dedicated circuits driven by the main uC).

I'm aware there are many layers where the failure could be:
1) HW? But then why do most devices work fine, and why do the failing ones fail consistently once they start?
2) ST HAL: could some corner case be triggering a bug?
3) FatFS: does something related to the FAT get corrupted inside the eMMC? We do mount and access the eMMC via USB, so I tried formatting it (from a Windows PC) and it didn't make a difference...
4) A combination of electrical/eMMC/firmware conditions?

I suspect that the conditions are somehow also related to using Standby low-power mode, since I cannot reproduce the issue if I only use Stop mode. I don't see why this would matter, since as far as I understand the STM32 reboots when waking up from Standby mode, and therefore everything should be initialized the same way.

Since the issue stops happening after a SW reset but comes back on the next boot after a power cycle, I figured someone might be able to identify another potential cause... Any ideas of what it could be or what I should check?

Thanks in advance!

Tesla DeLorean · ‎2024-01-26

The FatFs that ST ships is quite old and has some issues.

If the file system gets trashed that tends to be persistent.

Use a USB MSC to allow connection to PC and run CHKDSK against the problem volume.

Error propagation in FatFs isn't particularly good. You'd want to instrument DISKIO to understand if the error was occurring at the peripheral or card/chip level, and what error specifically. Use DMA, polled mode is too fragile.
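
One way to instrument the disk I/O layer, as a minimal sketch: decode the error word reported by the low-level driver into readable names, so that a log line tells a command timeout apart from a data CRC failure. The bit values below are assumed to mirror the SDMMC_ERROR_* definitions in stm32h7xx_ll_sdmmc.h and should be checked against the header of the HAL version in use. diag_describe_error and the DIAG_ERR_* names are hypothetical helpers, not part of FatFs or the HAL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed error bits; verify against SDMMC_ERROR_* in your HAL headers. */
#define DIAG_ERR_CMD_CRC_FAIL     0x00000001u
#define DIAG_ERR_DATA_CRC_FAIL    0x00000002u
#define DIAG_ERR_CMD_RSP_TIMEOUT  0x00000004u
#define DIAG_ERR_DATA_TIMEOUT     0x00000008u
#define DIAG_ERR_TX_UNDERRUN      0x00000010u
#define DIAG_ERR_RX_OVERRUN       0x00000020u

typedef struct {
    uint32_t    bit;
    const char *name;
} diag_err_name_t;

static const diag_err_name_t diag_err_names[] = {
    { DIAG_ERR_CMD_CRC_FAIL,    "CMD_CRC_FAIL"    },
    { DIAG_ERR_DATA_CRC_FAIL,   "DATA_CRC_FAIL"   },
    { DIAG_ERR_CMD_RSP_TIMEOUT, "CMD_RSP_TIMEOUT" },
    { DIAG_ERR_DATA_TIMEOUT,    "DATA_TIMEOUT"    },
    { DIAG_ERR_TX_UNDERRUN,     "TX_UNDERRUN"     },
    { DIAG_ERR_RX_OVERRUN,      "RX_OVERRUN"      },
};

/* Writes a '|'-separated list of the error names set in 'code' into 'out'.
 * Bits without a known name are appended as one hex value.
 * Output is silently truncated if 'out' is too small. Returns 'out'. */
static const char *diag_describe_error(uint32_t code, char *out, size_t out_len)
{
    size_t used = 0;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';

    if (code == 0u) {
        snprintf(out, out_len, "NONE");
        return out;
    }

    for (size_t i = 0; i < sizeof diag_err_names / sizeof diag_err_names[0]; i++) {
        if ((code & diag_err_names[i].bit) != 0u) {
            int n = snprintf(out + used, out_len - used, "%s%s",
                             used ? "|" : "", diag_err_names[i].name);
            if (n < 0 || (size_t)n >= out_len - used) {
                return out; /* buffer full: keep what fits */
            }
            used += (size_t)n;
            code &= ~diag_err_names[i].bit;
        }
    }

    if (code != 0u) {
        /* Remaining bits have no name in the table above. */
        snprintf(out + used, out_len - used, "%s0x%08lX",
                 used ? "|" : "", (unsigned long)code);
    }
    return out;
}
```

In the disk_read/disk_write glue, the ErrorCode field of the MMC handle could be passed through such a helper and logged whenever the HAL call does not return HAL_OK, so the failing layer is visible instead of a bare FR_DISK_ERR.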


AScha.3 · ‎2024-01-26

Hi,

I cannot say anything about eMMC, because at work we only use SD cards, but as far as I know the memory is also flash, and we may have a similar problem.

Some cards fail with FR_DISK_ERR after months or more than a year of use. I collected the "bad" cards and found:

One was "dead": nothing can be done with it any more in a (Windows) PC card adapter; on my Linux PC the drive tool shows one unknown partition, with no read, mount, format or test possible, and no access at all. (The cards are labelled Kingston and Transcend.)

Another "bad" one cannot be read or mounted, but formatting is still possible. After a new format it works again.

So I put some music (WAV files) on it to see how it performs in my music player (on an H743, SDIO 4-bit mode).

Interesting symptom: looking with the scope at the read time (I set a pin when f_read starts and reset it when it returns), always with 8 KB blocks, the read time jitters a lot, anywhere from 1 ms to more than 50 ms.

A new card (SanDisk) shows some jitter, 1 ms to 1.5 ms, not more.

A new Kingston "Canvas Select" 64GB shows 1 to 40 ms! Most of the time it is around 1 ms, but with some "dropouts" of up to 40 ms.

So my assumption for now: this time is some indication of the real quality of the accessed flash area. Long and very long read times indicate that the card controller has to do repeated reads and a lot of error correction until it has pieced the data together. Here the (more expensive) SanDisk cards are obviously better, so for now we use only SanDisk, and next year I can tell whether they really last longer or not.

Maybe you could do the same test: write some big files to the card, read constant-size blocks in a loop, and look at a pin that you set and reset around each read.


tjaekel · ‎2024-01-30

Try to lower the eMMC clock a bit (or is it an SD card? SDMMC1 for SD cards?). Also check the external wiring (if it is an external SD card adapter).

A few days ago I tried an SD card adapter with flying wires on a NUCLEO-U575ZI-Q board, and it did not work (I got CRC errors on commands). Only with a proper main board and the SD card adapter connected directly to the PCB did it work.

Step through the code where it starts to fail: in my case I saw a response with a CRC error code on SD card commands. Later, higher up in the file system, it would just say "failed". Trace it down to the original point where it fails.

SD cards often switch from a slow clock (for init, on a single lane) to a faster clock (and 4-bit mode). Try to find out whether you can lower one of these frequencies.

It is also possible that a specific SD card does not work: in this case, steps during initialization, e.g. setting a voltage or enabling features such as 4-bit mode, can fail. It is important to know where it really starts to fail (use breakpoints to trace).

TVare.1 · ‎2024-02-05

Thanks everyone for your answers!

@Tesla DeLorean
Actually I don't think I have the version that ST ships, since I remember updating it recently. I'm using R0.15 w/patch1.
We are able to expose the eMMC as USB MSC, so I tried CHKDSK on Windows. It said that the volume had no issues. I ran it anyway and it seemed to actually do something, as the device worked with no errors after that. However, after a couple of boots the issue was back again. So this gave me the idea that the issue might indeed be related to FatFS corruption, but that CHKDSK is not enough to totally fix it.
Regarding the failure, I'll try to debug it again to be sure of the exact reason. We'll also try to implement DMA for write/read.
Since you mentioned that once the file system gets trashed it is hard to fix, and considering that this FatFS implementation doesn't have any repair tool, would you recommend any other measure to keep the FatFS from getting corrupted in the first place?

@AScha.3
Thanks for your insight. In the near future I'll actually have access to some other memories, including some from SanDisk. So this idea of measuring access times seems like a good approach to evaluate the "quality" of the memory, as I guess poor quality could increase the chances of the FatFS eventually getting corrupted.

@tjaekel
It is a 4GB NAND flash memory with eMMC interface. It is located on the same board, just 2cm away. The hardware is supposedly already validated (although I'm not ruling anything out at the moment). I did try lowering the speed on devices that already had the failure and it didn't make a difference. It might be useful to lower the speed even on devices that are still OK (as long as this doesn't affect performance) to keep the FatFS from getting corrupted. In any case, we are already using the slowest modes that the memory supports (32MHz at the moment).
As I told Tesla, I'll try to narrow down the failure reason and will let you know.

Thanks again!

TVare.1 · ‎2024-02-07

I got one of the devices that was failing and debugged it a bit more. What I've seen is that FatFS initializes OK, mounts the volume as usual, sends CMDs to the eMMC, etc., and everything seems to work fine. However, further along in the code, as soon as I want to write any file, it fails. Sometimes when opening the file, sometimes when writing to it, but always within the first 2 or 3 FatFS operations after initializing.

The failure happens inside the disk_read function, which, going down the stack, eventually reaches the SDMMC_GetCmdResp1 function in stm32h7xx_ll_sdmmc.c. There the SDMMC_FLAG_CTIMEOUT flag is set, so the call fails with SDMMC_ERROR_CMD_RSP_TIMEOUT.

Here is a screenshot of the call stack at that moment (image not preserved):

At this point, if I do a soft reset in the error handler, everything works fine on the next boot. However, this workaround is not ideal for me (due to the use of Standby mode and other specifics), so I need to find the root cause... Especially since this doesn't happen on most of our devices, but once it starts happening on a device, it always does.

Any ideas of what the reason could be, or in which direction I should look deeper? (FatFS lib, HAL, HW, something else?)

Thanks!

tjaekel · ‎2024-02-07

OK, an eMMC chip on your own board. Fine.

No further clues, just some ideas (what I would check):

The SDIO config can enable and use the DLYB function (delay line).
Check whether this is enabled and used:
if the ST HAL drivers assume an SD card adapter with an SD card in it, maybe this DLYB is set to "compensate a large delay". But you might not have such a large delay (own PCB, short traces, fast chip...).

I am still guessing it is a timing issue. Especially if changing the clock speed does not make it better and you do not get a CRC error, it sounds to me like the signals (wires, traces) look fine and just the timing is (a bit) wrong.

You have to look at and play with some of the configs done in the HAL drivers. Maybe there is a DLYB config which is not appropriate for your board (the config setting or its use).

A timeout sounds a bit like a missing bit in a response (still waiting for a bit to complete a transaction when reading a response).

Hard to say: if it works after a reset, sometimes, it sounds to me more like a "timing issue", "running right at the edge in terms of timing".

BTW:

you could also check this:

Is a command sent that sets a voltage on the eMMC? (SD cards have commands to lower the voltage.)
Are you using the HAL drivers in a "mode" for SD cards or for eMMC chips? (These can differ.)
Are you reusing existing code (a demo) or CubeMX-generated code that was intended for an SD card?
Your chip can differ in its configuration: which commands to send, in which order to send them, and any specific timing where you have to wait (or check, e.g., that a status bit is set before continuing with other commands).
Check that all the config code and all the commands that are sent make sense for your chip (according to your chip's datasheet).
Is there a need to let some time elapse, or to check a status bit, before continuing?
Check carefully whether the code you have is for an SD card or an eMMC chip (or SDIO device).

Good luck.

TVare.1 · ‎2024-02-08

Thanks for your answer. I can confirm that our stack looks like this:

FatFS elm-chan http://elm-chan.org/fsw/ff/00index_e.html at ff.c
ST diskio driver: MMC Disk I/O FatFs driver mmc_diskio.c and eMMC driver mounted on STM32H745I-DISCOVERY board at bsp_driver_mmc.c (AKA stm32h745i_discovery_mmc.c)
ST HAL MMC at stm32h7xx_hal_mmc.c
ST LL driver for SDMMC peripheral at stm32h7xx_ll_sdmmc.c

So basically our FatFS implementation with the eMMC is based on the example for the STM32H745 discovery board.

In terms of components, the difference is that our uC is the STM32H753 (which shouldn't be a problem, as they share the same peripherals) and our eMMC is the Kingston EMMC04G-M627 instead of the THGBMTG5D1LBAIL used in the example. However, I see that both memories are compliant with e•MMC™ 5.0 JEDEC, so I imagine they should be fully compatible.

Moreover, I checked DLYB and it is not enabled for the eMMC, and I also have HAL_MMC_MODULE_ENABLED defined, so everything seems to be as it should.

Anyway, I'll continue running tests to see if I can figure out why the eMMC/FatFS starts failing on these couple of devices.

Thanks again for your help and of course let me know if you have any additional ideas.

TVare.1 · ‎2024-06-05

Hi!

As an update: I managed to debug further and identify the exact moment when the issue happens. Basically what I saw is:

The first write operation on the eMMC after powering it takes a long time, and this time seems to increase in proportion to how much the eMMC has been used.
Once the first operation finishes, the following write attempts are normal. This is why my reset was "fixing" the issue: during the reset, the boot of my uC was so fast that the eMMC stayed powered ON (thanks to some capacitors), and therefore the "first write operation" had already happened.
The failure happens when this first write operation takes more than 100ms, because of this piece of code in the HAL (HAL_MMC_WriteBlocks()):

while (!__HAL_MMC_GET_FLAG(hmmc,
                           SDMMC_FLAG_TXUNDERR | SDMMC_FLAG_DCRCFAIL | SDMMC_FLAG_DTIMEOUT | SDMMC_FLAG_DATAEND))
{
  if (__HAL_MMC_GET_FLAG(hmmc, SDMMC_FLAG_TXFIFOHE) && (dataremaining >= 32U))
  {
    /* Write data to SDMMC Tx FIFO */
    for (count = 0U; count < 8U; count++)
    {
      data = (uint32_t)(*tempbuff);
      tempbuff++;
      data |= ((uint32_t)(*tempbuff) << 8U);
      tempbuff++;
      data |= ((uint32_t)(*tempbuff) << 16U);
      tempbuff++;
      data |= ((uint32_t)(*tempbuff) << 24U);
      tempbuff++;
      (void)SDMMC_WriteFIFO(hmmc->Instance, &data);
    }
    dataremaining -= 32U;
  }

  if (((HAL_GetTick() - tickstart) >=  Timeout) || (Timeout == 0U))
  {
    /* Clear all the static flags */
    __HAL_MMC_CLEAR_FLAG(hmmc, SDMMC_STATIC_FLAGS);
    hmmc->ErrorCode |= errorstate;
    hmmc->State = HAL_MMC_STATE_READY;
    return HAL_TIMEOUT;
  }
}

The TXFIFOHE flag takes too long to be set, so the code is stuck in this loop until it eventually fails because Timeout is configured as 100ms. I guess this delay is somehow related to the hardware flow control and how the eMMC and the SDMMC controller communicate. Again, this only happens during the first write operation on the eMMC.

My guess is that the eMMC internal controller has some kind of wear leveling/garbage collection/whatever algorithm that runs on the first operation after powering ON. After that, performance gets back to "normal". Could someone confirm this, or could there be another reason?

I'll try to measure these times using other eMMCs, to see whether this is due to the quality of the memory itself or whether it's something that will always happen and that I'll have to live with (i.e. by increasing the timeout value).

@Tesla DeLorean I'm not sure whether implementing DMA will benefit me. When you say that polling mode is fragile, are you talking about this timeout issue that I'm experiencing? Or is your recommendation about keeping the CPU free for other operations while a long eMMC write is in progress? If it is the latter, it may not be that important for my use case. But please let me know if there are other reasons why you recommend DMA.

Thanks!
