Wireless Stack recovery for STM32WB

Tim.N · 2021-03-29

I've gotten my STM32WB55 into a state where CPU2 is booting into a "wireless firmware" area, but I think it's crashing when it starts. I can no longer talk to FUS inside CPU2:

      -------------------------------------------------------------------
                        STM32CubeProgrammer v2.5.0                  
      -------------------------------------------------------------------
 
 
 
USB speed   : Full Speed (12MBit/s)
Manuf. ID   : STMicroelectronics
Product ID  : DFU in FS Mode
SN          : 205434853236
FW version  : 0x011a
Device ID   : 0x0495
Device name : STM32WBxx
Flash size  : 1 MBytes
Device type : MCU
Device CPU  : Cortex-M0+/M4
 
 
FUS state is FUS_ERROR
 
FUS status is FUS_NOT_RUNNING
getFUSstate command execution finished

stm32_programmer_cli -c port=usb1 -ob displ
      -------------------------------------------------------------------
                        STM32CubeProgrammer v2.5.0                  
      -------------------------------------------------------------------
 
 
 
USB speed   : Full Speed (12MBit/s)
Manuf. ID   : STMicroelectronics
Product ID  : DFU in FS Mode
SN          : 205434853236
FW version  : 0x011a
Device ID   : 0x0495
Device name : STM32WBxx
Flash size  : 1 MBytes
Device type : MCU
Device CPU  : Cortex-M0+/M4
 
 
UPLOADING OPTION BYTES DATA ...
 
  Bank          : 0x00
  Address       : 0x1fff8000
  Size          : 128 Bytes
 
[==================================================] 100% 
 
 
OPTION BYTES BANK: 0
 
   Read Out Protection:
 
     RDP          : 0xAA (Level 0, no protection) 
 
   BOR Level:
 
     BOR_LEV      : 0x0 (BOR Level 0 reset level threshold is around 1.7 V) 
 
   User Configuration:
 
     nBOOT0       : 0x0 (nBOOT0=0 Boot selected based on nBOOT1) 
     nBOOT1       : 0x1 (Boot from Flash if nBoot0=0 otherwise system memory) 
     nSWBOOT0     : 0x1 (BOOT0 taken from PH3/BOOT0 pin) 
     SRAM2RST     : 0x0 (SRAM2 erased when a system reset occurs) 
     SRAM2PE      : 0x1 (SRAM2 parity check disable) 
     nRST_STOP    : 0x1 (No reset generated when entering the Stop mode) 
     nRST_STDBY   : 0x1 (No reset generated when entering the Standby mode) 
     nRSTSHDW     : 0x1 (No reset generated when entering the Shutdown mode) 
     WWDGSW       : 0x1 (Software window watchdog) 
     IWGDSTDBY    : 0x1 (Independent watchdog counter running in Standby mode) 
     IWDGSTOP     : 0x1 (Independent watchdog counter running in Stop mode) 
     IWDGSW       : 0x1 (Software independent watchdog) 
     IPCCDBA      : 0x0  (0x0) 
 
   Security Configuration Option bytes:
 
     ESE          : 0x1 (Security enabled) 
     SFSA         : 0xB4  (0xB4) 
     FSD          : 0x0 (System and Flash secure) 
     DDS          : 0x1 (CPU2 debug access disabled) 
     C2OPT        : 0x1 (SBRV will address Flash) 
     NBRSD        : 0x0 (SRAM2b is secure) 
     SNBRSA       : 0xF  (0xF) 
     BRSD         : 0x0 (SRAM2a is secure) 
     SBRSA        : 0xA  (0xA) 
     SBRV         : 0x32800  (0x32800) 
 
   PCROP Protection:
 
     PCROP1A_STRT : 0x1FF  (0x80FF800) 
     PCROP1A_END  : 0x0  (0x8000800) 
     PCROP_RDP    : 0x0 (PCROP zone is kept when RDP is decreased) 
     PCROP1B_STRT : 0x1FF  (0x80FF800) 
     PCROP1B_END  : 0x0  (0x8000800) 
 
   Write Protection:
     WRP1A_STRT   : 0xFF  (0x80FF000) 
     WRP1A_END    : 0x0  (0x8000000) 
     WRP1B_STRT   : 0xFF  (0x80FF000) 
     WRP1B_END    : 0x0  (0x8000000)

I was modifying code in my application to use FUS to update the wireless firmware, and was playing with creating an update path that didn't require erasing the existing stack. In this case, I started with FUS v1.1.0, wireless stack v1.11.0 (stm32wb5x_BLE_HCILayer_fw.bin).

Sequence of events (run automatically inside my firmware):

1. I uploaded a new firmware image into the internal filesystem of my firmware
2. CPU1 switched CPU2 into FUS; the wireless stack rebooted the part (via options application)
3. Programmed the new image to 0xb4000 since SFSA was 0xde, and gave the update process 2 extra pages of space to work with
4. CPU1 asked CPU2's FUS to install it
5. FUS went into the "state=16, err=0" mode for ~7 seconds
6. FUS rebooted the part (via options application)
7. FUS went into the "state=0, err=0" mode
8. CPU1 asked CPU2's FUS to start the wireless stack
9. FUS rebooted the part (via options application)
10. CPU1 tried to start CPU2, but it didn't boot up correctly
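
The "state=16" and "state=0" values in the steps above are raw FUS state bytes. Below is a minimal sketch of how they can be classified, assuming the state ranges listed in AN5185 (idle = 0x00, wireless firmware upgrade ongoing = 0x10 to 0x1F, FUS upgrade ongoing = 0x20 to 0x2F, service ongoing = 0x30 to 0x3F, error = 0xFF). The enum and function names are hypothetical and are not part of the ST middleware.

```c
#include <stdint.h>

/* Coarse FUS state categories. The value ranges are an assumption taken
 * from AN5185, not from anything printed in this thread. */
typedef enum {
    FUS_CAT_IDLE,
    FUS_CAT_FW_UPGRADE_ONGOING,
    FUS_CAT_FUS_UPGRADE_ONGOING,
    FUS_CAT_SERVICE_ONGOING,
    FUS_CAT_ERROR,
    FUS_CAT_UNKNOWN
} fus_category_t;

/* Hypothetical helper: classify the raw byte that is printed as "state=N". */
static fus_category_t fus_classify_state(uint8_t state)
{
    if (state == 0x00u) {
        return FUS_CAT_IDLE;
    }
    if (state >= 0x10u && state <= 0x1Fu) {
        return FUS_CAT_FW_UPGRADE_ONGOING;
    }
    if (state >= 0x20u && state <= 0x2Fu) {
        return FUS_CAT_FUS_UPGRADE_ONGOING;
    }
    if (state >= 0x30u && state <= 0x3Fu) {
        return FUS_CAT_SERVICE_ONGOING;
    }
    if (state == 0xFFu) {
        return FUS_CAT_ERROR;
    }
    return FUS_CAT_UNKNOWN;
}
```

Read this way, "state=16" (0x10) falls in the wireless firmware upgrade range and "state=0" is idle, which matches the sequence described above.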

Is there a way to recover this unit's CPU2 firmware state? CPU2 is no longer responding to FUS commands, like FUS_GetState(...).

Christophe Arnal · 2021-04-06

Hello,

According to what I can read from your option bytes (SBRV), CPU2 should be running the wireless stack. It should switch back to the FUS if you send the FUS_GetState(...) command twice.

With your current status, I would first check whether the BLE application on CPU1 is running properly.

In particular, I would check the system ready event reported by CPU2 at boot. It will confirm whether the wireless stack or the FUS is running on CPU2.

Regards.

Tim.N · 2021-04-07

Hi Christophe -

I understand the way it's supposed to behave (e.g. sending the FUS_GetState(...) command twice to go back), but the secondary CPU has stopped responding to all things mailbox related. Since CPU2 was no longer processing mailbox related commands (maybe the installation didn't succeed and it set the CPU2 start vector to something incorrect) it doesn't process the request to get back into FUS anymore. Since you don't pass through FUS every time CPU2 boots up, if the main application wasn't installed correctly and the execution vector is switched to it, you're kind of out of luck resolving it since the CPU2 flash area is so locked down on the STM32WB part. I couldn't even get the built-in bootloader to get CPU2 back to the FUS (e.g. by running stm32_programmer_cli twice with the -fusgetstate parameter set) after it got into this state.

If you look at the option byte dump, it's also unusual that the secure flash region doesn't start at the same point as the CPU2 execution vector. It's as if FUS did something strange and left the option bytes in a weird state.

E.g. SBRV of 0x32800 = start vector of 0xCA000 (SBRV * 4), and SFSA of 0xB4 = 0xB4000 (SFSA * 4 KB pages)

Normally I've seen SBRV align with SFSA unless you're booting inside FUS while a wireless stack is still installed, so something goofy happened with this board during the wireless firmware install process. I'll avoid doing a wireless firmware install without erasure from now on, since so far I have only observed this when a wireless stack was already installed during an upgrade.
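
The arithmetic behind that comparison can be sketched in C. This assumes that SBRV is a word (4-byte) offset from the flash base when C2OPT = 1 and that SFSA counts 4 KB pages, which is how the values in the option byte dump decode. The function and macro names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLASH_BASE_ADDR  0x08000000u  /* start of main flash on the STM32WB55 */
#define FLASH_PAGE_BYTES 4096u        /* SFSA is expressed in 4 KB pages */

/* SFSA (page index) -> absolute start address of the secure flash area. */
static uint32_t sfsa_to_address(uint32_t sfsa)
{
    return FLASH_BASE_ADDR + sfsa * FLASH_PAGE_BYTES;
}

/* SBRV (word offset) -> absolute CPU2 boot address.
 * Assumes C2OPT = 1, i.e. the offset is relative to flash, not SRAM. */
static uint32_t sbrv_to_address(uint32_t sbrv)
{
    return FLASH_BASE_ADDR + sbrv * 4u;
}

/* True when CPU2 would start executing at the first byte of the secure area. */
static bool sbrv_matches_sfsa(uint32_t sbrv, uint32_t sfsa)
{
    return sbrv_to_address(sbrv) == sfsa_to_address(sfsa);
}
```

With the values from the dump, sbrv_to_address(0x32800) gives 0x080CA000 and sfsa_to_address(0xB4) gives 0x080B4000, so sbrv_matches_sfsa(0x32800, 0xB4) is false.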

I read in the STM32WB manual that there is a "safe boot" for CPU2, but I couldn't figure out how to purposefully get into it to restore the state. Is there a way, or is the only way to "corrupt" the option bytes, as described in the reference manual? Unfortunately, I managed to brick this particular board entirely while trying to elicit safe mode by purposefully attempting to corrupt the option bytes, so I can no longer debug this specific situation.

Thanks,

Tim

Christophe Arnal · 2021-04-08

Hi Tim,

I apologize, as I was too fast reading the option byte dump. You are perfectly correct. As long as CPU2 is running the wireless firmware, SBRV shall match SFSA in all our deliveries so far. The only exception is when CPU2 is running the FUS while a wireless stack is still present in the CPU2 secure flash area.

Unfortunately, there is nothing that can be done to recover the device. I will forward the issue to the FUS team, but I can only confirm that you can throw the device away (unless you want to use CPU1 only, without booting CPU2).

Is there any chance that you remember which FUS version was in the device and which flow you followed?

Which wireless firmware version did you want to upload?
Did you first delete an existing wireless firmware?

Regards

Tim.N · 2021-04-08

Hi Christophe -

No worries; I completely understand what it's like to support folks (you often check for the common issues first and it's easy to miss something!).

Good to know there's no way to recover this unit; it's too bad the CPU2 firmware isn't designed to always pass through the bootloader so there is always a chance of recovery when coming out of reset as that would have allowed a recovery path for this situation. (E.g. wait for a mailbox command to start the wireless firmware and do it in a way that didn't require CPU1 to reboot.)

I think the exact setup must've been:

Installed firmware: FUS v1.1.0, and v1.11.0 stm32wb5x_BLE_HCILayer_fw.bin
Attempting to "upgrade" to (without erasing the existing firmware): v1.11.0 stm32wb5x_BLE_Stack_full_fw.bin, flashed into offset 0x0b4000
Resulted in a SFSA pointing to where I flashed the image, and a start vector of where the image should've ended up (offset 0xc4000)

I've been hesitant to attempt to reproduce it since I don't want another bricked board (especially since this is one of our production boards with a debug header on it). Additionally, I was using my own firmware to interact with FUS, and not the internal bootloader of the STM32WB55 part.

(It looks like the SFSA statement in the original post was offset by 2 pages before it was printed out to my console:

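/* SFSA is a page index in 4 KB units */
intptr_t sfsa_offset = (FLASH->SFR & FLASH_SFR_SFSA_Msk) >> FLASH_SFR_SFSA_Pos;
sfsa_offset -= 2;    /* leave 2 spare pages below the secure area */
sfsa_offset *= 4096; /* pages -> byte offset from the flash base */
dbg_crit("SFSA: %x", (unsigned)sfsa_offset); /* so the printed value is 2 pages low */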

)

Hopefully it's reproducible. If it is, maybe I'll be the cause for yet another FUS version. ;) (You can blame me for the "security enhancements" listed for FUS v1.1.2 - I had reported that back in November, but not through the forums of course due to the nature of the issue.)

Thanks,

Tim

Tim.N · 2021-04-21

Hi Christophe -

I looked again at the board that had this issue, and I discovered that I actually had succeeded in purposefully corrupting the option bytes to try to get the board into safe mode. However, having the option bytes corrupted appeared to prevent CPU1 from booting. I discovered that if I then manually booted CPU2 via the debugger (by writing 0x8000 to 0x5800040c), it was definitely in "safe mode", since it recovered the option bytes so as to:

- Delete the wireless firmware stack
- Allow CPU1 to boot again by reducing the secure area
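
For reference, 0x5800040c is PWR_CR4 on this part (PWR base 0x58000400 plus offset 0x0C), and 0x8000 is the C2BOOT bit (bit 15), going by RM0434. Below is a small sketch of the same write in C. The helper name is hypothetical, and it takes the register as a pointer so that it can be exercised off-target.

```c
#include <stdint.h>

#define PWR_BASE_ADDR   0x58000400u                     /* PWR peripheral base */
#define PWR_CR4_OFFSET  0x0Cu                           /* offset of PWR_CR4 */
#define PWR_CR4_ADDR    (PWR_BASE_ADDR + PWR_CR4_OFFSET)
#define PWR_CR4_C2BOOT  (1u << 15)                      /* CPU2 boot request bit */

/* Hypothetical helper: set C2BOOT in the PWR_CR4 register that is passed in.
 * The other bits are preserved with a read-modify-write. */
static void request_cpu2_boot(volatile uint32_t *pwr_cr4)
{
    *pwr_cr4 |= PWR_CR4_C2BOOT;
}
```

On the target this would be called as request_cpu2_boot((volatile uint32_t *)PWR_CR4_ADDR). Firmware built on ST's LL drivers would more likely call LL_PWR_EnableBootC2(), which, as far as I know, sets the same bit.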

Is this the expected outcome of corrupted option bytes, that CPU1 can no longer boot and thus CPU2 cannot boot either since it depends on CPU1 (or a debugger) to start it? If so, it means that there is always some risk in creating a "brick" with updating the wireless stack on CPU2 remotely in the field since that involves programming the option bytes, and if that got interrupted/corrupted, the whole board wouldn't be able to start without outside intervention. In this particular case, it seems like CPU2 should automatically start at boot (instead of waiting for CPU1, which can't run) when the option bytes are corrupted. This would allow a device in the field to recover itself in a bad install case.

By the way, I think the safeboot left portions (if not the entire stack) of the wireless stack unencrypted at the install address above (at 0x080b4000) instead of erasing it. Here's (presumably) the start of the vector table for v1.11.0 of stm32wb5x_BLE_Stack_full_fw.bin left at the install address I tried above:

0x080b4000: 2003f220 00028df9 00000a25 00000add 00000000 00000000 00000000 00000000

Thanks,

Tim

Christophe Arnal · 2021-04-21

Hi Tim,

Chapter 2.4 (CPU2 boot) of RM0434 rev 6 clearly states that on option byte corruption, CPU2 shall boot into the safe boot with no need for any help from CPU1.

1/ It is expected that CPU1 does not run: all the flash is secured and CPU1 cannot fetch code from a secure flash area, so it is not able to fetch a single instruction.

2/ It is clearly not expected, as confirmed by the Reference Manual, that you need an external action to get CPU2 running in the Safe Boot.

Regarding device recovery in the field, the picture is mixed. If you remotely update a wireless stack, or if anything happens on the device (e.g. a hardware fault) that corrupts the option bytes, the Safe Boot will restore the default configuration, which means:

- it restores the default option bytes
- it deletes the wireless stack
- it deletes the user flash

So the device can be recovered, but you will of course not be able to do it remotely. You will need a wired connection with the Boot Loader, a JTAG programmer, or any other known way to program an STM32 device.

I will report the post to our development team regarding the Safe Boot behavior.

Regards.

Tim.N · 2021-04-21

Hi Christophe -

I just read the safe boot description, and apparently it was supposed to do a full factory reset. That didn't occur, but I think I know why. I sometimes have production devices with RDP set, and I recognize how the debugger behaves in this situation. OpenOCD showed:

target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0xfffffffe msp: 0xfffffffc

When this occurs, I automatically try to unlock RDP by using the debugger to unlock the flash/option registers, modify OPTR to set RDP back to level 0, and then apply the option bytes. I believe the normal order is:

1. Option byte corruption: RDP set to level 1, SFR set to 0, CPU2 boot vector set to last flash sector
2. CPU2 booted
3. CPU2 sets RDP to level 0, SFR to X and applies the options. Transition from RDP level 1 to 0 causes all flash from sectors 0 to SFR to be erased.

This would clear everything from sector 0 to SFR. In my case I think I interrupted this order:

1. Option byte corruption: RDP set to level 1, SFR set to 0, CPU2 boot vector set to last flash sector
2. Debugger modified the option bytes to RDP level 0 and applied them. No sectors were erased since SFR was set to 0
3. CPU2 booted
4. CPU2 sets RDP to level 0 (no-op), SFR to X, and applies the options

which ended up with no flash being erased by an RDP transition from level 1 to 0, since SFR was set to 0 at the time this transition occurred.

I think FUS needs to check the RDP level before adjusting SFR when in safeboot. If it's level 0 already, it probably needs to do an erasure of flash manually instead of relying on the RDP level transition from 1 to 0 to take care of this for you. If that isn't done, it could lead to a hole where your code (if RDP is set to 1) could be revealed to someone with a debugger and the means to purposefully corrupt the option bytes.
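
To make the argument concrete, here is a toy model of the two orderings and of the suggested check. It only models the hypothesis in this post (an RDP 1 to 0 transition erases pages 0 up to the SFR value being applied) and is not a description of documented FUS or hardware behavior. All names, and the restored SFR value of 0xF4 used in the usage note, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PAGES 256u  /* 1 MB of flash in 4 KB pages */

typedef struct {
    uint8_t  rdp;                /* RDP level: 0 or 1 */
    uint32_t sfsa;               /* first secure page (the "SFR" value) */
    bool     erased[NUM_PAGES];  /* true once a page has been wiped */
} flash_model_t;

static void erase_user_pages(flash_model_t *f, uint32_t end_page)
{
    for (uint32_t p = 0; p < end_page && p < NUM_PAGES; ++p) {
        f->erased[p] = true;
    }
}

/* Hypothesis: applying options with RDP going 1 -> 0 erases pages [0, new_sfsa). */
static void apply_options(flash_model_t *f, uint8_t new_rdp, uint32_t new_sfsa)
{
    if (f->rdp == 1u && new_rdp == 0u) {
        erase_user_pages(f, new_sfsa);
    }
    f->rdp = new_rdp;
    f->sfsa = new_sfsa;
}

/* Safe boot as guessed above. With check_rdp_first set, it erases explicitly
 * when RDP is already 0 instead of relying on the level transition. */
static void safe_boot(flash_model_t *f, uint32_t restored_sfsa, bool check_rdp_first)
{
    if (check_rdp_first && f->rdp == 0u) {
        erase_user_pages(f, restored_sfsa);
    }
    apply_options(f, 0u, restored_sfsa);
}

/* State assumed after option byte corruption: RDP level 1, SFR = 0. */
static flash_model_t corrupted_state(void)
{
    flash_model_t f = { .rdp = 1u, .sfsa = 0u };
    return f;
}
```

Starting from corrupted_state(), calling safe_boot(&f, 0xF4, false) directly marks pages 0 to 0xF3 as erased. Calling apply_options(&f, 0, f.sfsa) first (the debugger step) and then safe_boot(&f, 0xF4, false) erases nothing, while the same sequence with check_rdp_first set to true erases the user pages again.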

Thanks,

Tim

Tim.N · 2021-04-21

I'll add my OpenOCD log here for reference. It shows that initially the debugger saw CPU1 at the PC it shows up at when it can't read flash, even after attempting to reset RDP to level 0. Then I booted CPU2 and it "fixed" what I could read. I didn't explicitly read back SFR (or RDP before attempting to unlock, since I thought what I was seeing was RDP level 1), but I'm assuming it was at 0, since that's the only other reason I can think of that CPU1 would have behaved this way.

I'm also making the assumption that it was in safe mode; I didn't read back the correct flash registers before I realized what transitions I believe had occurred.

Log:

> reset halt
target halted due to debug-request, current mode: Thread 
xPSR: 0x01000000 pc: 0xfffffffe msp: 0xfffffffc
> mdw 0x58004010
0x58004010: 00008000 
 
> mww 0x58004010 0x8000
> mdw 0x58004010
0x58004010: 00000000 
 
> mww 0x58004008 0x45670123
> mww 0x58004008 0xCDEF89AB
> 
> # Unlock option registers
> mww 0x5800400c 0x08192A3B
> mww 0x5800400c 0x4C5D6E7F
> mmw 0x58004020 0xaa 0xff
> mww 0x5800402c 0xff
> mww 0x58004030 0xff
> mdw 0x58004020
0x58004020: 7ffff1aa 
 
> mdw 0x5800402c 2
0x5800402c: 000000ff 000000ff 
 
> mmw 0x58004014 0x00020000 0
> mmw 0x58004014 0x08000000 0
Polling target stm32wbx.cpu failed, trying to reexamine
SWD DPIDR 0x6ba02477
stm32wbx.cpu: hardware has 6 breakpoints, 4 watchpoints
stm32wbx.cpu -- clearing lockup after double fault
Polling target stm32wbx.cpu failed, trying to reexamine
stm32wbx.cpu: hardware has 6 breakpoints, 4 watchpoints
> halt
> step
SWD DPIDR 0x6ba02477
Failed to read memory at 0xfffff000
can't add breakpoint: unknown reason
target halted due to single-step, current mode: Thread 
xPSR: 0x01000000 pc: 0xfffffffe msp: 0xfffffffc
Polling target stm32wbx.cpu failed, trying to reexamine
Could not find MEM-AP to control the core
Examination failed, GDB will be halted. Polling again in 100ms
Polling target stm32wbx.cpu failed, trying to reexamine
Could not find MEM-AP to control the core
Examination failed, GDB will be halted. Polling again in 300ms
Polling target stm32wbx.cpu failed, trying to reexamine
Could not find MEM-AP to control the core
Examination failed, GDB will be halted. Polling again in 700ms
SWD DPIDR 0x6ba02477
stm32wbx.cpu -- clearing lockup after double fault
Polling target stm32wbx.cpu failed, trying to reexamine
stm32wbx.cpu: hardware has 6 breakpoints, 4 watchpoints
> halt
> step
SWD DPIDR 0x6ba02477
Failed to read memory at 0xfffff000
can't add breakpoint: unknown reason
target halted due to single-step, current mode: Thread 
xPSR: 0x01000000 pc: 0xfffffffe msp: 0xfffffffc
> mdw 0x58004020
0x58004020: 7ffff1aa 
 
> mdw 0x08000000
SWD DPIDR 0x6ba02477
Failed to read memory at 0x08000004
 
> mdw 0x08000000 
SWD DPIDR 0x6ba02477
Failed to read memory at 0x08000004
 
> mdw 0x5800402c
0x5800402c: 000000ff 
 
> mdw 0x5800402c 2
0x5800402c: 000000ff 000000ff 
 
> mdw 0x58004000 2
0x58004000: 00000600 00000000 
 
> mdw 0x58004010 2
0x58004010: 00008000 c0000000 
 
accepting 'telnet' connection on tcp/4446
dropped 'telnet' connection
> mdw 0x58000400
0x58000400: 00000200 
 
> mdw 0x5800000c
0x5800000c: 22040100 
 
> mdw 0x5800040c
0x5800040c: 00000000 
 
> mww 0x5800040c 0x8000
Polling target stm32wbx.cpu failed, trying to reexamine
SWD DPIDR 0x6ba02477
stm32wbx.cpu: hardware has 6 breakpoints, 4 watchpoints
> mdw 0x58000400       
0x58000400: 00000300 
 
stm32wbx.cpu: external reset detected
stm32wbx.cpu: external reset detected
stm32wbx.cpu: external reset detected
accepting 'gdb' connection on tcp/3336
target halted due to debug-request, current mode: Thread 
xPSR: 0x61000000 pc: 0x2000afb0 msp: 0x20013d44
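
For anyone reading the log without the reference manual open, the raw addresses can be mapped to register names as sketched below. The names, offsets and bit positions are my reading of RM0434 for the STM32WB55 and are worth double-checking there. The table and helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bits that appear in the log (assumed positions, per RM0434). */
#define FLASH_SR_OPTVERR     (1u << 15)  /* option validity error */
#define FLASH_CR_OPTSTRT     (1u << 17)  /* start option byte programming */
#define FLASH_CR_OBL_LAUNCH  (1u << 27)  /* reload the option bytes */

typedef struct {
    uint32_t    addr;
    const char *name;
} reg_entry_t;

/* Addresses touched in the log above. */
static const reg_entry_t LOG_REGS[] = {
    { 0x58004000u, "FLASH_ACR"     },
    { 0x58004008u, "FLASH_KEYR"    },
    { 0x5800400Cu, "FLASH_OPTKEYR" },
    { 0x58004010u, "FLASH_SR"      },
    { 0x58004014u, "FLASH_CR"      },
    { 0x58004020u, "FLASH_OPTR"    },
    { 0x5800402Cu, "FLASH_WRP1AR"  },
    { 0x58004030u, "FLASH_WRP1BR"  },
    { 0x58000400u, "PWR_CR1"       },
    { 0x5800040Cu, "PWR_CR4"       },
    { 0x5800000Cu, "RCC_PLLCFGR"   },
};

/* Returns the register name for an address in the log, or NULL if unknown. */
static const char *log_reg_name(uint32_t addr)
{
    for (size_t i = 0; i < sizeof LOG_REGS / sizeof LOG_REGS[0]; ++i) {
        if (LOG_REGS[i].addr == addr) {
            return LOG_REGS[i].name;
        }
    }
    return NULL;
}
```

Under these assumptions, the first read of 0x58004010 (0x00008000) is FLASH_SR with OPTVERR set, and the two mmw writes to 0x58004014 set OPTSTRT and then OBL_LAUNCH in FLASH_CR.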

Anyways; this particular board's CPU1/CPU2 are both operable again, so I don't really need any further assistance for now. Maybe someone with more insight into the safe boot operation can explain the logs above for ST internally (if it matters to ST).

Thanks!

Tim

TwisteR · 2022-07-28

Dear Tim.N,

I have a question about your post:

>> Anyways; this particular board's CPU1/CPU2 are both operable again

Does this mean that you somehow managed to remove the FUS and boot CPU2 from a vector in the unlocked flash region?

Here is my post, and since I'm trying to use CPU2 for general-purpose non-radio tasks, I would be very grateful to know the outcome of your experiments.