STM32H743IIT6 hard fault error on USB CDC enumeration

TFranke
Associate II

Hi guys,

I'm experiencing a hard fault error during USB CDC enumeration and can't track its source.

The failure occurs in the file usbd_desc.c, when the function Get_SerialNum() calls the subroutine IntToUnicode(), which converts a uint32 into its string representation.

I traced that the calling function Get_SerialNum() passes a valid memory address, 0x24000164, as the destination for the string result. In the failing case, however, the called function IntToUnicode() receives 0xFFFFFFFF as the destination address, which ultimately causes the hard fault. Screenshots are attached.

0693W000003PPj3QAG.png

0693W000003PPj8QAG.png
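For readers without the project at hand, the conversion step described above can be sketched roughly as follows. This is a minimal sketch, not ST's actual code: the name int_to_unicode_sketch, the parameter order (value, destination, length) and the UTF-16LE output layout are assumptions. With that parameter order, the ARM procedure call standard passes the destination pointer in r1, so a corrupted r1 means every store goes to a bad address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for IntToUnicode(): writes the top `len` hex digits
 * of `value` to `pbuf` as UTF-16LE characters (digit byte, then a zero byte).
 * Assumed argument registers under the AAPCS: value in r0, pbuf in r1,
 * len in r2. */
void int_to_unicode_sketch(uint32_t value, uint8_t *pbuf, uint8_t len)
{
    assert(pbuf != NULL);
    for (uint8_t idx = 0; idx < len; idx++) {
        uint8_t nibble = (uint8_t)(value >> 28);   /* most significant digit */
        pbuf[2 * idx] = (uint8_t)(nibble < 0xA ? nibble + '0'
                                               : nibble + 'A' - 10);
        pbuf[2 * idx + 1] = 0;                     /* high byte of UTF-16LE */
        value <<= 4;                               /* move to the next digit */
    }
}
```

Every iteration stores through pbuf, so with pbuf equal to 0xFFFFFFFF the very first store already faults.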

To reproduce the issue, download the provided STM32CubeIDE project to the STM32H743 and place a breakpoint at line 74 of stm32h7xx_it.c (the hard fault handler code).

Connect a USB cable from your host computer to the STM board and run the program. If the error occurs, execution hits the hard fault handler right during enumeration of the USB device.

DOs and DON'Ts to reproduce this issue with the provided example project:

  1. I've modified the linker script so that the functions mentioned above are placed at the specific flash location where I first encountered the error. It is position 0x2E844 (see line 106 of the .ld file). If that position is moved by 4 bytes up or down, the error does not occur. Later on I found out that the flash region can start at any position that is divisible by 8 without causing the error. The question remains: why does this error occur, given that the STM32H743 should be able to execute code aligned to 32 bits even though the AXI bus is 64 bits wide?
  2. You can place a breakpoint anywhere in the subroutine IntToUnicode without affecting the error condition. If you place the breakpoint at the very first assembly instruction, you can see the wrong destination address 0xFFFFFFFF in register r1. But DON'T place a breakpoint in the calling function Get_SerialNum and single-step from there, because then the error does not occur. Again, I have no clue why!
  3. I've run the identical project on 3 other boards with the identical STM32H743IIT6 chip and the error does not occur. In addition I've tested it on my NUCLEO-H743ZI2 board, which carries an STM32H743ZIT6U (same chip, different package), also without the hard fault issue.
  4. VCC voltage is 3.3V on all boards. Looks ok.
  5. The HSE input clock is 8.0 MHz on all boards (checked with oscilloscope). USB and CAN are functional if the hard fault is prevented e.g. by shifting the code in flash memory. Looks ok, too.
  6. Flash latency and power settings look good according to documentation. AXI peripheral clock is @ 100 MHz, VOS at scale 1 and flash latency set to 1 WS.
  7. I have lowered the AXI clock by half (flash latency is set to 0 automatically so seems to adapt as needed) but without any effect.
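The alignment observation in item 1 can be checked with a few lines of C. This is only an illustration of the arithmetic; the helper name is_aligned is made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `addr` is a multiple of `alignment`.
 * `alignment` must be a non-zero power of two. */
bool is_aligned(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0u && (alignment & (alignment - 1u)) == 0u);
    return (addr & (alignment - 1u)) == 0u;
}
```

For the failing position, is_aligned(0x2E844, 4) is true while is_aligned(0x2E844, 8) is false. Moving the code by 4 bytes in either direction (0x2E840 or 0x2E848) gives an 8-byte-aligned start, which matches the positions where the error disappears.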

So, currently it seems to be related to my single CPU.

It could be damaged, but how and why?

Are there any other explanations I'm currently not aware of?

I'm just afraid that the issue is not specific to my CPU and may occur sooner or later on the other boards, too.

Any thoughts or recommendations on that issue?

Best regards,

Tobias

7 REPLIES
TFranke
Associate II

Project attachment ...

Incorrect/faulty VCAP, bad solder joint on some of the ground and/or voltage pins?

JW

TFranke
Associate II

Hi Jan,

thanks for your fast reply.

Both VCAP pins are decoupled with 2.2uF each as recommended by the datasheet (using X5R ceramic capacitors in 0603 footprint).

All solder joints look good; there are no broken joints and no solder bridges between pins (just checked again using a microscope).

The only thing I have changed on that board is that I added a VBAT capacitor of 10uF many months ago to make use of the backup RAM, but this should have no effect on the USB or flash peripherals.

BTW: Do you think that bad solder connections could make the program work as intended when it is placed at specific flash locations, but fail when the code is shifted by 4 bytes?

Best regards,

Tobias

Tobias,

I don't know. Your findings appear to point to a FLASH failure. I'm not sure about the source of 0xFFFFFFFF. In the 'H7, FLASH has ECC. What is the behaviour if an ECC failure occurs?

It sounds like this is a piece heavily used in development. Could the specified FLASH endurance have been reached?

But maybe there's some other explanation of the observed behaviour. Two things come to mind: r1 being changed by some erratic interrupt (including context switching if you use multitasking aka RTOS); or the function is not called from the point you assume (that can be checked easily by checking the content of the lr register).

JW

TFranke
Associate II

Hi Jan.

(1) Well, you guessed right: the chip has been used mainly for development during the last months. My first reaction was to cry out that millions of write cycles can't be reached within that time, but the STM32H743 guarantees only 10k write cycles. Oops. So, on the one hand, flash endurance may be at its limit. On the other hand, the flash seems to be OK, because if e.g. a breakpoint is set in the calling function, register r1 is loaded with exactly the value that is expected in flash memory. So the value is there and writing it during programming is not the issue; the problem is reading it at execution time, which I cannot explain.

(2) Task context switching (including its stack space) was the next path I followed. My original project uses FreeRTOS, but the error still exists in the test project, which runs without an RTOS. It's stripped down to a simple main loop printing a "Hello World" message over USB VCP and a "." every HAL_Delay(1000). That's all. The lr register points to 0x802e875 in both modes (with and without the breakpoint in Get_SerialNum and, as a result, with or without the hard fault issue). I was expecting 0x802e874 in lr because the call to the subroutine was made from 0x802e870, but maybe it's ARM-specific to set the lowest address bit in the return address...

(3) The screenshot provided shows the call stack, which confirms what the lr register shows. Basically, the IntToUnicode function is called from Get_SerialNum as expected. I think it's worth mentioning that all these functions run in the USB IRQ handler context, called by the HAL and LL USB layers. As ARM doesn't support re-entrant IRQs it should not be possible to have another IRQ influence here. Or is this assumption wrong?

(4) Related to (1) I checked the flash peripheral registers, but no error has been detected there. The SR1 & SR2 registers as well as the ECC_FA1R & ECC_FA2R are all zero.
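Regarding the lr value in point (2): on Cortex-M, a direct subroutine call is a 32-bit Thumb BL instruction, and the return address stored in lr has bit 0 set to mark Thumb state. A minimal sketch of that arithmetic, assuming the call at 0x802e870 is such a BL (the helper name thumb_bl_return_lr is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Value placed in lr by a 32-bit Thumb BL located at `bl_address`:
 * the address of the next instruction with the Thumb bit (bit 0) set. */
uint32_t thumb_bl_return_lr(uint32_t bl_address)
{
    assert((bl_address & 1u) == 0u);   /* instructions are halfword aligned */
    const uint32_t bl_size = 4u;       /* BL is a 4-byte encoding in Thumb */
    return (bl_address + bl_size) | 1u;
}
```

With bl_address = 0x0802e870 this gives 0x0802e875, so the observed lr is consistent with a normal call from Get_SerialNum.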

Screenshot: Call-Tree and registers shortly before the hard fault:

0693W000003PaImQAK.png

Open questions are:

- Why does shifting the flash memory location for the functions in question by +/-4 bytes eliminate the hard fault error?

- Why and how can a breakpoint placed before the jump to the subroutine help so that register r1 is not changed to 0xFFFFFFFF?

- Is it maybe the 64-bit AXI bus interface not working properly when reading 32-bit literals from flash that are placed at 32-bit-aligned addresses?

I'm still puzzled, but I really appreciate your help.

Best regards,

Tobias

TFranke
Associate II

Hi all,

I have probed the VDD and VCAP voltages with my oscilloscope in AC mode to see if there are any spikes on my board's power supply, but there is only a small amount of noise in the 50mV range, even during chip programming, program stop mode and run mode.

Any other ideas?

If not, I'll have to put the board onto my shelf of "reproducible but unexplained issues" until a second board shows the same behavior. For now I'm running out of ideas for what to investigate further.

Best regards,

Tobias

TFranke
Associate II

Hi to everyone who is still interested.

I guess I found the underlying issue which caused the strange phenomenon where a CPU register was loaded with a wrong pattern read from flash memory.

It has nothing to do with VCC/ICC noise, unstable or misconfigured clocks, soldering issues or the flash wearing out as suspected - and it occurred on the second board right after putting the first one out of sight...

Revision "Y" of the STM32H743 is known to have a faulty implementation of the flash bank swapping feature, which I use to upload new code from a USB stick. It's documented rather briefly in the errata sheet (https://www.st.com/resource/en/errata_sheet/dm00368411-stm32h742xig-and-stm32h743xig-device-limitations-stmicroelectronics.pdf, chapter 2.2.8).

Although I knew about that issue, I kept using the bank swap functionality during the last months. I thought the issue was only that the bank swap did not take effect at the right point in time (I had noticed that during my reset attempts). I was not expecting it to be faulty in general, as all my code seemed to run properly.

But: If the SWAP_BANK bit in the register FLASH_OPTCR is set, after reset of the chip - no matter if done by software or hard power off and on - one piece of code may run without showing any issues while another may not.
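A firmware can check at startup whether it is running with the banks swapped. The following is a minimal sketch, assuming SWAP_BANK is bit 31 of FLASH_OPTCR on the STM32H743 (please verify the bit position against the reference manual or the device header). The function name swap_bank_is_set is made up, and the register is passed in as a pointer so the logic can also be tried on a host PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: SWAP_BANK sits at bit 31 of FLASH_OPTCR (check the manual). */
#define SWAP_BANK_BIT_POS 31u

/* Returns true if the SWAP_BANK bit is set in the register pointed to.
 * On target, `optcr` would point at the FLASH_OPTCR register. */
bool swap_bank_is_set(const volatile uint32_t *optcr)
{
    assert(optcr != NULL);
    return ((*optcr >> SWAP_BANK_BIT_POS) & 1u) != 0u;
}
```

On the device this would typically be called with the address of the option control register from the vendor header, for example to log the bank state together with a fault report.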

I don't know what exactly is going wrong there, but I suspect problems when literals that are loaded from flash memory are placed at positions that are only 4-byte-aligned, and no problems when they are placed at 8-byte-aligned positions, as explained above. I've tested this on my first CPU, which was the first to show this kind of error, and on a second one which has been flashed only a few times.

The test was quite simple. I prepared 3 USB sticks with different firmware versions:

1st stick:    firmware 001 without known issues
2nd stick:    firmware 002 with observed issues (deadlock due to hard fault handler when the PC USB cable was inserted)
3rd stick:    firmware 003, again without known issues

The sticks were used to update the CPU's code via a self-implemented loader, which loads the required data from a USB stick into the second flash memory bank, then swaps the banks and resets the chip to boot this code.

After each upload the stick was removed and a PC USB cable was connected to switch the firmware from USB host mode to USB serial device mode and to check whether the hard fault error occurs.

Results are:

Test 001    1st stick    SWAP_BANK = '1' after reset    Normal functionality
Test 002    2nd stick    SWAP_BANK = '0' after reset    Normal functionality
Test 003    3rd stick    SWAP_BANK = '1' after reset    Normal functionality
Test 004    1st stick    SWAP_BANK = '0' after reset    Normal functionality
Test 005    2nd stick    SWAP_BANK = '1' after reset    HARD FAULT after USB cable insertion
Test 006    3rd stick    SWAP_BANK = '0' after reset    Normal functionality

The pattern of 6 tests could be repeated over and over again on both CPUs which I have tested (both revision Y).

Tests 002 and 005 used exactly the same firmware (002) from the same USB stick. The only difference is in which flash bank the code actually landed.
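The repeating pattern follows from three sticks being cycled while SWAP_BANK toggles on every update, so the combination "firmware 002 with SWAP_BANK = 1" comes around once every six tests. A small model of the table, purely illustrative (the struct and function names are made up, and the fault condition is simply the observed one, not an explanation of the silicon behaviour):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned stick;       /* 1, 2 or 3 (firmware 001, 002, 003) */
    unsigned swap_bank;   /* SWAP_BANK value after reset: 0 or 1 */
    bool     hard_fault;  /* observed outcome according to the table */
} test_row;

/* Models test number `n` (1-based) of the sequence described above. */
test_row model_test(unsigned n)
{
    assert(n >= 1u);
    test_row row;
    row.stick      = (n - 1u) % 3u + 1u;  /* sticks cycle 1, 2, 3, 1, ... */
    row.swap_bank  = n % 2u;              /* Test 001 ended with SWAP_BANK = 1 */
    row.hard_fault = (row.stick == 2u) && (row.swap_bank == 1u);
    return row;
}
```

In this model, tests 5, 11, 17 and so on are the only ones that report a hard fault, which is why the issue looked intermittent across firmware updates.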

So, from my point of view it's now clear what caused my headaches for the last months - the combination of a vulnerable firmware with the flash bank swap bit set in the option control register.

The issue was so hard to reproduce as it only occurred in a piece of code which is called in IRQ context when the firmware goes into USB serial device mode, and only if that piece of code is memory-aligned in a specific way. I have no clue why no other function like ADC sampling, data analysis, USB host mode, LED control, the whole RTOS, etc... was affected.

Conclusion:

For my revision "Y" CPUs I'll change my upload code to avoid flash bank swapping. For the newer revision "V" CPUs the bank swap issue seems to be solved, as my NUCLEO-H743ZI2 board does not show this kind of error.

Best regards to all who have read and thought about my findings,

Tobias