[SOLVED] Flash Write/Erase mechanisms behaviors, Cube examples reliability

Félix TREFOU · ‎2020-11-23

Hi,

We are currently working on a USB composite library, but some old FLASH ghosts have reappeared.

When sending data through a CDC interface, some flash erase/program operations started to fail.

We decided to upgrade our Wireless stack to latest (1.10.0) from our current 1.6.0 (FUS 1.0.2).

From there, we discovered that AN5289 got some big updates for clocks and FLASH sharing (good news!).

Now, four big questions/remarks remain:

Two questions concerning our code:

1) The first device we updated (FUS 1.0.2 -> 1.1.0; WS 1.6 -> 1.10) never gets the CPU2 notification for WS ready, and all flash operations fail. Is it bricked, or is CPU2 faulty?

2) On some flash operations, the FLASH_SR register is not all zeros and the FLASH_CR_LOCK bit is set to 1 without us doing anything. Trying to write 1 again results in a hard fault on our device (so our current patch is to call HAL_FLASH_Lock only when FLASH_CR_LOCK is 0).

Two others concerning BLE_RfWithFlash/flash_driver.c:

3) When single_flash_operation_status indicates a failure, some actions do not follow AN5289 p. 36, Figure 10:
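For illustration, the workaround described in 2) could look like the minimal sketch below. It runs against a simulated register so it can be tried on a PC; `sim_flash_cr`, `SIM_CR_LOCK`, `sim_write_lock_bit` and `lock_if_unlocked` are hypothetical names, not HAL symbols, and the bit position is an arbitrary choice for the simulation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the FLASH_CR_LOCK bit mask. */
#define SIM_CR_LOCK (1u << 31)

/* Simulated control register; the thread states LOCK is set at reset. */
static uint32_t sim_flash_cr = SIM_CR_LOCK;

/* Counts how often the LOCK bit was actually written. */
static unsigned sim_lock_writes;

/* Stands in for the register write a lock call would perform. */
static void sim_write_lock_bit(void)
{
    sim_flash_cr |= SIM_CR_LOCK;
    sim_lock_writes++;
}

/*
 * Lock only when the flash is currently unlocked.
 * Returns true if a lock write was issued, false if it was skipped.
 */
bool lock_if_unlocked(void)
{
    if ((sim_flash_cr & SIM_CR_LOCK) != 0u) {
        return false; /* already locked: do not write the bit again */
    }
    sim_write_lock_bit();
    return true;
}
```

On target, the check would read the real FLASH control register and the write would be the existing lock call; whether skipping the second write is the right fix or only hides another problem is left open here.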

  if(single_flash_operation_status != SINGLE_FLASH_OPERATION_DONE)
  {
    return_value = NbrOfSectors - loop_flash + 1;
  }
  else
  {
    /**
     *  Notify the CPU2 there will be no request anymore to erase the flash
     *  On reception of this command, the CPU2 disables the BLE timing protection versus flash erase processing
     */
    SHCI_C2_FLASH_EraseActivity(ERASE_ACTIVITY_OFF);
 
    HAL_FLASH_Lock();
 
    /**
     *  Release the ownership of the Flash IP
     */
    LL_HSEM_ReleaseLock(HSEM, CFG_HW_FLASH_SEMID, 0);
 
    return_value = 0;
  }

In the failure branch, the flash is not re-locked (which makes sense given question 2). Stranger still, the erase activity OFF command is not sent and semaphore 2 is not released!

4) Deadlock:

    UTILS_ENTER_CRITICAL_SECTION();
 
    /**
     *  Depending on the application implementation, in case a multitasking is possible with an OS,
     *  it should be checked here if another task in the application disallowed flash processing to protect
     *  some latency in critical code execution
     *  When flash processing is ongoing, the CPU cannot access the flash anymore.
     *  Trying to access the flash during that time stalls the CPU.
     *  The only way for CPU1 to disallow flash processing is to take CFG_HW_BLOCK_FLASH_REQ_BY_CPU1_SEMID.
     */
    cpu1_sem_status = (SemStatus_t)LL_HSEM_GetStatus(HSEM, CFG_HW_BLOCK_FLASH_REQ_BY_CPU1_SEMID);
    if(cpu1_sem_status == SEM_LOCK_SUCCESSFUL)
    {
      /**
       *  Check now if the CPU2 disallows flash processing to protect its timing.
       *  If the semaphore is locked, the CPU2 does not allow flash processing
       *
       *  Note: By default, the CPU2 uses the PESD mechanism to protect its timing,
       *  therefore, it is useless to get/release the semaphore.
       *
       *  However, keeping that code make it compatible with the two mechanisms.
       *  The protection by semaphore is enabled on CPU2 side with the command SHCI_C2_SetFlashActivityControl()
       *
       */
      cpu2_sem_status = (SemStatus_t)LL_HSEM_1StepLock(HSEM, CFG_HW_BLOCK_FLASH_REQ_BY_CPU2_SEMID);
      if(cpu2_sem_status == SEM_LOCK_SUCCESSFUL)
      {
        /**
         * When CFG_HW_BLOCK_FLASH_REQ_BY_CPU2_SEMID is taken, it is allowed to only erase one sector or
         * write one single 64bits data
         * When either several sectors need to be erased or several 64bits data need to be written,
         * the application shall first exit from the critical section and try again.
         */
        if(FlashOperationType == FLASH_ERASE)
        {
          HAL_FLASHEx_Erase(&p_erase_init, &page_error);
        }
        else
        {
          HAL_FLASH_Program(FLASH_TYPEPROGRAM_DOUBLEWORD, SectorNumberOrDestAddress, Data);
        }
        /**
         *  Release the semaphore to give the opportunity to CPU2 to protect its timing versus the next flash operation
         *  by taking this semaphore.
         *  Note that the CPU2 is polling on this semaphore so CPU1 shall release it as fast as possible.
         *  This is why this code is protected by a critical section.
         */
        LL_HSEM_ReleaseLock(HSEM, CFG_HW_BLOCK_FLASH_REQ_BY_CPU2_SEMID, 0);
      }
    }
 
    UTILS_EXIT_CRITICAL_SECTION();

Here, the critical section is defined as follows:

#define UTILS_ENTER_CRITICAL_SECTION( )   uint32_t primask_bit = __get_PRIMASK( );\
                                          __disable_irq( )

Looking into HAL_FLASHEx_Erase or HAL_FLASH_Program reveals that both functions call FLASH_WaitForLastOperation, which in turn calls HAL_GetTick() for timeout protection.

But in most examples, HAL_GetTick returns a variable updated by a SysTick interrupt (for us, HAL ticks are incremented by TIM1).

We have already been stuck in an endless timeout measurement, waiting for a tick that never increments because interrupts are disabled.
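To make the failure mode concrete, here is a minimal sketch, simulated on a PC rather than taken from the HAL. A tick-based timeout cannot expire while the tick source is frozen, so the loop only ends if something else bounds it. All `sim_` names, `wait_while_busy` and the `max_spins` guard are hypothetical; the real FLASH_WaitForLastOperation has no such spin limit.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { WAIT_DONE, WAIT_TICK_TIMEOUT, WAIT_SPIN_LIMIT } wait_result_t;

/* Simulated environment (hypothetical, stands in for the hardware). */
static uint32_t sim_tick;        /* value a HAL_GetTick()-style call returns */
static bool     sim_irq_enabled; /* the tick only advances with IRQs enabled */
static bool     sim_busy;        /* stands in for a flash busy flag */

static uint32_t sim_get_tick(void)
{
    if (sim_irq_enabled) {
        sim_tick++; /* model: one tick elapses per poll */
    }
    return sim_tick;
}

/*
 * Poll the busy flag with a tick-based timeout, plus an iteration cap
 * that still terminates the loop when the tick is frozen.
 */
wait_result_t wait_while_busy(uint32_t timeout_ticks, uint32_t max_spins)
{
    uint32_t start = sim_get_tick();

    for (uint32_t spins = 0; spins < max_spins; spins++) {
        if (!sim_busy) {
            return WAIT_DONE;
        }
        if ((sim_get_tick() - start) >= timeout_ticks) {
            return WAIT_TICK_TIMEOUT; /* unreachable with a frozen tick */
        }
    }
    return WAIT_SPIN_LIMIT; /* only exit left inside a critical section */
}
```

With `sim_irq_enabled` false and `sim_busy` true, the tick comparison never becomes true and only the spin cap ends the wait; without that cap the loop would spin until a watchdog reset.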

Sorry for the long post, feel free to answer any question, just specify which.

NOTE: It's a detail, but our LSE is generated by an external oscillator and configured in bypass mode.

Christophe Arnal · ‎2020-11-27

Hello,

Here is some feedback on your questions:

1/ Could you please provide the values of the Security Option Bytes? (A dump from CubeProgrammer would be perfect.)

2/ FLASH_CR_LOCK set to 1 is the reset value. It is cleared when the unlock sequence is applied (see the reference manual RM0434 for more information). On this point, is it just a question to understand why this bit is set, or is there something you can't do even though you apply the unlock sequence?

3/ In the current STM32WB package, most comments regarding the API can be found in the header file.

When there is a failure, most of the time the user will retry erasing the sectors. Therefore, it is better to act as if we are not done yet (the relock is not done, the semaphore is not released, and erase activity OFF is not sent).

If for any reason the application decides not to retry the erase, or to do it much later, then it makes sense at user level to release these resources. This gives CPU2 the opportunity to work with the flash and saves the power spent on the erase activity protection that has been enabled (it wakes up CPU2 25 ms before each BLE event to protect the timing).
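As a rough sketch of that user-level cleanup, assuming the application has decided not to retry: the state is simulated here so the snippet compiles on a PC, and `flash_session_t`, `session_after_failed_erase` and `abandon_erase` are hypothetical names. The comment lists the real calls from the driver excerpt that each step would map to.

```c
#include <stdbool.h>

/* Simulated resource state; hypothetical stand-in for CPU2 and hardware. */
typedef struct {
    bool erase_activity_on; /* CPU2 erase timing protection requested */
    bool flash_unlocked;    /* flash control register unlocked */
    bool flash_sem_taken;   /* CPU1 holds the flash ownership semaphore */
} flash_session_t;

/* State left behind by the driver's failure branch: all still claimed. */
static flash_session_t session_after_failed_erase(void)
{
    flash_session_t s = { true, true, true };
    return s;
}

/*
 * Give everything back when no retry is planned. On target the three
 * steps would map to SHCI_C2_FLASH_EraseActivity(ERASE_ACTIVITY_OFF),
 * HAL_FLASH_Lock() and LL_HSEM_ReleaseLock(HSEM, CFG_HW_FLASH_SEMID, 0),
 * in the same order as the driver's success branch.
 */
void abandon_erase(flash_session_t *s)
{
    s->erase_activity_on = false; /* CPU2 may drop its timing protection */
    s->flash_unlocked    = false; /* relock against stray programming */
    s->flash_sem_taken   = false; /* release flash ownership */
}
```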

4/ Your analysis is correct. However, you may have noticed that the return code of the HAL is not tested anyway.

The reason is that the process shall not fail. FLASH_WaitForLastOperation() shall never time out with the way the algorithm is defined. The operation shall always be successful.

Did you experience a deadlock there, or is the deadlock seen in a bigger loop?

If you are getting stuck in FLASH_WaitForLastOperation(), at which step is it stuck (FLASH_FLAG_BSY or FLASH_FLAG_CFGBSY)?

To anticipate your answer: if you are stuck on FLASH_FLAG_CFGBSY (which should never happen), then every time we have worked on such an issue, the root cause was a tricky bug in the software.

Most of the time, this is due to a null pointer. For some reason, you are using a pointer that should hold the address of a buffer (or a variable) in SRAM to write to. However, as the pointer is null, the software writes the data at 0x00000000, which is in flash. As soon as you write to flash, CFGBSY is set (as this is expected to be the start of a programming sequence). Of course, as this is not the case, the operation never completes and the CFGBSY bit stays set. This makes all subsequent attempts to actually write or erase the flash fail.
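A defensive check of the kind that would catch this is sketched below. It is a generic example, not code from any USB or HAL library: `rx_ctx_t`, `rx_status_t` and `rx_store` are hypothetical names for a receive path that refuses to copy when no buffer has been registered.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *rx_buffer;   /* NULL until the application registers a buffer */
    size_t   rx_capacity; /* size of rx_buffer in bytes */
} rx_ctx_t;

typedef enum { RX_OK, RX_NO_BUFFER, RX_TOO_LONG } rx_status_t;

/*
 * Copy received bytes into the registered buffer. A NULL destination
 * is rejected instead of being written through, which on this device
 * would land at address 0 in flash.
 */
rx_status_t rx_store(rx_ctx_t *ctx, const uint8_t *data, size_t len)
{
    if (ctx == NULL || ctx->rx_buffer == NULL) {
        return RX_NO_BUFFER;
    }
    if (len > ctx->rx_capacity) {
        return RX_TOO_LONG;
    }
    memcpy(ctx->rx_buffer, data, len);
    return RX_OK;
}
```

Returning an error code instead of silently dropping the data lets the caller notice that the buffer registration step was forgotten.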

Regards.

Félix TREFOU · ‎2020-12-01

Hello,

First of all, thanks for these very accurate responses!

1) Unfortunately, I replaced the chip on the board. However, I kept it, so I will be able to connect some wires to access USB.

What error were you hoping to identify with this check?

2) It was more a question about "what could set this bit", apart from CPU2 and HAL_FLASH. The answer is more or less: nothing!

3) It makes sense, but calling the function again will send another flash erase activity ON. I suppose this is not a problem though!

4) Yes, I have observed a deadlock in this function many times. With IRQs disabled (and therefore no timeout measurement), the device should at least be reset by the watchdog.

Knowing this is caused by bad user code, I agree that saying it shall never happen is enough!

Your conclusion was completely right! I found that I had forgotten to call "SetRxBuffer" in the CDC class, which doesn't check for a null pointer before writing.

For some reason, Linux sends some characters at connect time and Windows does not, which is how I was able to understand what was happening.

I find it very strange that a write to 0x00 doesn't cause a hard fault. Perhaps the buffer pointer wasn't 0x00 but some other address in flash...

I'm thinking of migrating the library to https://github.com/IntergatedCircuits/USBDevice. I hope buffer management will be more explicit!

Regards.

Christophe Arnal · ‎2020-12-01

Hello,

1/ Maybe something went wrong during the firmware update. I wanted to check if the Secure Option Bytes were as expected.

2/ OK. This bit is set on Reset and each time HAL_FLASH_Lock() is called.

Each time the flash needs to be written, it shall be unlocked beforehand. Once the flash has been written, the firmware should set the lock again to avoid any unexpected programming.

On CPU2, whenever we need to write some data to NVM, the lock is removed before writing the NVM and set back once done.

So, this bit is expected to always be set except when the flash needs to be written.

You cannot decide to unlock the flash at startup and keep it unlocked, because CPU2 will lock it again after NVM processing. You will therefore need to unlock the flash before any flash programming anyway.
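The point can be shown with a small simulation (a sketch only; every `sim_` name and `write_doubleword` are hypothetical, and on target the unlock, program and lock steps would be the HAL flash calls). Unlocking once at startup breaks as soon as CPU2 relocks, whereas unlocking right before each write does not depend on what CPU2 did in between.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated flash state (hypothetical): locked at reset. */
static bool     sim_locked = true;
static uint64_t sim_cell; /* one simulated 64-bit flash location */

static void sim_unlock(void) { sim_locked = false; }
static void sim_lock(void)   { sim_locked = true; }

/* Programming is refused while the lock is set. */
static bool sim_program(uint64_t value)
{
    if (sim_locked) {
        return false;
    }
    sim_cell = value;
    return true;
}

/* Models CPU2 finishing an NVM write and setting the lock again. */
static void sim_cpu2_nvm_activity(void) { sim_locked = true; }

/* Unlock immediately before the write and relock right after it. */
bool write_doubleword(uint64_t value)
{
    sim_unlock();
    bool ok = sim_program(value);
    sim_lock();
    return ok;
}
```

For example, calling `sim_unlock()` once, then `sim_cpu2_nvm_activity()`, makes a bare `sim_program()` fail, while `write_doubleword()` still succeeds and leaves the flash locked afterwards.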

3/ You are correct in both statements. The flash driver is packaging everything needed to make sure it is working in all cases. It is better to send it twice than to forget to send it in some corner cases. On CPU2, as long as the request is identical, it is just ignored.

4/ Good to read that you found your issue, so I assume you are no longer facing the deadlock.

Generally speaking, it is good practice in a real product to implement a watchdog anyway.

One last comment: make sure to use the WWDG and not the IWDG. The IWDG is clocked whenever CPU2 is running, so if CPU1 stays in deep sleep for a long time (which is not a bug), the IWDG may fire because of CPU2 activity.

For example, if you are only advertising, CPU1 stays in deep sleep and does not need to wake up; only CPU2 wakes up to schedule the advertising packets.

In that case, the IWDG will fire due to CPU2 activity.

Regards.

Félix TREFOU · ‎2020-12-02

Hello,

Actually, we don't use sleep modes because power is managed by an external IC.

I removed the WWDG some time ago for debugging, but I'm planning to activate it again.

Regarding the life cycle of the product, only stop mode for FreeRTOS will be useful. I'll make sure not to mess with the watchdogs, as you described!

Thanks for this great help!

Félix TREFOU · ‎2020-12-02

Here is the chip's Option Bytes dump:

OPTION BYTES BANK: 0
 
   Read Out Protection:
 
     RDP          : 0xAA (Level 0, no protection) 
 
   BOR Level:
 
     BOR_LEV      : 0x0 (BOR Level 0 reset level threshold is around 1.7 V) 
 
   User Configuration:
 
     nBOOT0       : 0x1 (nBOOT0=1 Boot from main Flash) 
     nBOOT1       : 0x1 (Boot from code area if BOOT0=0 otherwise system Flash) 
     nSWBOOT0     : 0x1 (BOOT0 taken from PH3/BOOT0 pin) 
     SRAM2RST     : 0x1 (SRAM2 is not erased when a system reset occurs) 
     SRAM2PE      : 0x1 (SRAM2 parity check disable) 
     nRST_STOP    : 0x1 (No reset generated when entering the Stop mode) 
     nRST_STDBY   : 0x1 (No reset generated when entering the Standby mode) 
     nRSTSHDW     : 0x1 (No reset generated when entering the Shutdown mode) 
     WWDGSW       : 0x1 (Software window watchdog) 
     IWGDSTDBY    : 0x1 (Independent watchdog counter running in Standby mode) 
     IWDGSTOP     : 0x1 (Independent watchdog counter running in Stop mode) 
     IWDGSW       : 0x1 (Software independent watchdog) 
     IPCCDBA      : 0x0  (0x0) 
 
   Security Configuration Option bytes - 1:
 
     ESE          : 0x1 (Security enabled) 
 
   PCROP Protection:
 
     PCROP1A_STRT : 0x1FF  (0x80FF800) 
     PCROP1A_END  : 0x0  (0x8000800) 
     PCROP_RDP    : 0x1 (PCROP zone is erased when RDP is decreased) 
     PCROP1B_STRT : 0x1FF  (0x80FF800) 
     PCROP1B_END  : 0x0  (0x8000800) 
 
   Write Protection:
 
     WRP1A_STRT   : 0xFF  (0x80FF000) 
     WRP1A_END    : 0x0  (0x8000000) 
     WRP1B_STRT   : 0xFF  (0x80FF000) 
     WRP1B_END    : 0x0  (0x8000000) 
OPTION BYTES BANK: 1
 
   Security Configuration Option bytes - 2:
 
     SFSA         : 0xCB  (0xCB) 
     FSD          : 0x0 (System and Flash secure) 
     DDS          : 0x1 (CPU2 debug access disabled) 
     C2OPT        : 0x1 (SBRV will address Flash) 
     NBRSD        : 0x0 (SRAM2b is secure) 
     SNBRSA       : 0xF  (0xF) 
     BRSD         : 0x0 (SRAM2a is secure) 
     SBRSA        : 0xA  (0xA) 
     SBRV         : 0x32C00  (0x32C00)

Christophe Arnal · ‎2020-12-02

Hello,

The option bytes all seem fine. I see no reason why CPU2 would not boot.

It could be that the CPU2 notification for WS ready is not received because of something wrong in the CM4 firmware.

Did you try a firmware from the Cube package with no modification (like transparent mode)?

It might be good to check at the IPCC interrupt level. At the very least, the IPCC Rx interrupt handler shall trigger.

Make sure first to try a binary as delivered in the package.

If there is no success with the transparent mode application from our firmware package, I am afraid the device might be dead.

Regards.

Félix TREFOU · ‎2020-12-04

Hi,

I've tried the BLE Heart Rate example, but it never shows up while scanning for devices (the same binary works on other boards).

I was able to update the stack, so FUS should be OK. It can only be a broken CPU2, I guess!

Christophe Arnal · ‎2020-12-04

Hello,

That's extremely strange. If you are able to connect to the FUS, it means the CPU2 subsystem seems to be working fine.

If you are able to load a new wireless stack, I see no reason why it does not run.

The only explanation would be that the device has been damaged at the hardware level in some way.

There should be no reason to be able to run the FUS and not the wireless stack.

Regards.