HardFault after several hours with CDC USB_HOST_M7, USB_OTG_HS

roblahnst · ‎2025-03-23

Product: RIVERDI RVT70HSSNWC00-B with STM32H757XIH6

STM32Cube FW_H7 V1.11.1
TouchGFX 4.25.0
CubeIDE 1.16.1
FreeRTOS 10.3.1, CMSIS_V2

I am using the code attached in usb_host.c to read data from a USB Virtual COM barcode reader. The .ioc file is also attached.

The code runs fine, but after some hours I get a hard fault. This happens only when the USB scanner is plugged in; with nothing plugged into USB it runs fine. I don't do anything with the barcode reader. It is just plugged in and the hard fault occurs.

I already tried the following:

Changed USBH_CDC_BUFFER_SIZE from 1024 to 2048
Changed #define USBH_PROCESS_PRIO from osPriorityLow to osPriorityNormal
Tried values between 512 and 8192 for #define USBH_PROCESS_STACK_SIZE ((uint16_t)4*2048)
Updated to STM32Cube FW_H7 V1.11.2

Any help would be very much appreciated.


TDK · ‎2025-03-23

Where does the hard fault occur? What is the stack trace?

What is the nature of the hard fault? The Fault Analyzer in STM32CubeIDE can help here.


roblahnst · ‎2025-03-23

The hard fault occurs at:

Program Counter (PC):  0x080A70CE

Disassembly:

0x080A70C4:   bl      0x0809675E <lwip_htonl>
0x080A70C8:   mov     r4, r0
0x080A70CA:   ldr     r3, [r7, #32]
0x080A70CC:   ldr     r3, [r3, #12]
0x080A70CE:   ldr     r3, [r3, #4]    <-- HARD FAULT occurs here
0x080A70D0:   mov     r0, r3

At the time of the crash:

stacked_pc  = 0x080A70CE  (crash instruction)
stacked_lr  = 0x080A70C9
stacked_r3  = 0x50E76AFE  <-- invalid pointer causing the fault

The faulting instruction attempts to dereference r3, which points to an invalid memory address.

Fault Status Registers:

CFSR  = 0x01000000 (UsageFault: UNALIGNED access)
HFSR  = 0x40000000 (Forced Hard Fault)
MMFAR = 0x00000000 (No valid memory fault address)
BFAR  = 0x00000000 (No valid bus fault address)

Interpretation:

• UNALIGNED bit set in CFSR → Indicates an unaligned memory access, here most likely the result of dereferencing an invalid pointer.

• The forced hard fault is triggered due to an unrecoverable error.

• No memory management or bus fault addresses were captured.

The fault occurs inside the LwIP stack, most likely during a pbuf (packet buffer) operation while handling network traffic. The pointer r3 = 0x50E76AFE is invalid and outside of the valid memory regions (SRAM, QSPI, etc.).
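
The register values above can be checked mechanically. The following is a minimal sketch, assuming the standard Armv7-M bit positions for SCB->CFSR and SCB->HFSR and the architecture's default memory map; the helper names are made up for illustration and are not part of CMSIS or the HAL.

```c
#include <stdint.h>

/* Bit masks following the Armv7-M SCB register layout. */
#define CFSR_MMARVALID (1UL << 7)   /* MMFAR holds a valid address */
#define CFSR_BFARVALID (1UL << 15)  /* BFAR holds a valid address */
#define CFSR_UNALIGNED (1UL << 24)  /* UsageFault: unaligned access */
#define HFSR_FORCED    (1UL << 30)  /* configurable fault escalated to HardFault */

/* Hypothetical helpers (names are illustrative only). */
static int fault_is_unaligned(uint32_t cfsr) {
  return (cfsr & CFSR_UNALIGNED) != 0;
}

static int fault_is_forced(uint32_t hfsr) {
  return (hfsr & HFSR_FORCED) != 0;
}

static int fault_addr_valid(uint32_t cfsr) {
  return (cfsr & (CFSR_MMARVALID | CFSR_BFARVALID)) != 0;
}

/* Armv7-M default memory map: 0x40000000..0x5FFFFFFF is the Peripheral region. */
static int addr_in_peripheral_region(uint32_t addr) {
  return addr >= 0x40000000UL && addr <= 0x5FFFFFFFUL;
}

static int addr_is_word_aligned(uint32_t addr) {
  return (addr & 3u) == 0;
}
```

Applied to this crash, the load `ldr r3, [r3, #4]` targets 0x50E76AFE + 4 = 0x50E76B02, which lies in the Peripheral region and is not word-aligned. As far as I know, the Cortex-M7 faults on unaligned accesses to Device memory, which would explain why UNALIGNED is reported while MMFAR and BFAR stay empty.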

But how are LwIP and CDC connected?

Tesla DeLorean · ‎2025-03-23

Chasing a pointer in a structure. Look at source line.

ptr->ptr->thing

How is that value placed in structure?

Perhaps a memory or pool allocation that fails.

CDC may be sharing resources with LwIP. Perhaps a memory/resource leak on the CDC side.


roblahnst · ‎2025-03-23

It seems it is here:

080a70ca: ldr r3, [r7, #32]   ; load useg
080a70cc: ldr r3, [r3, #12]   ; useg->tcphdr
080a70ce: ldr r3, [r3, #4]    ; useg->tcphdr->seqno --> HARDFAULT!

in tcp_out.c:

        /* In the case of fast retransmit, the packet should not go to the tail
         * of the unacked queue, but rather somewhere before it. We need to check for
         * this case. -STJ Jul 27, 2004 */
        if (TCP_SEQ_LT(lwip_ntohl(seg->tcphdr->seqno), lwip_ntohl(useg->tcphdr->seqno))) {
          /* add segment to before tail of unacked list, keeping the list sorted */
          struct tcp_seg **cur_seg = &(pcb->unacked);
          while (*cur_seg &&
                 TCP_SEQ_LT(lwip_ntohl((*cur_seg)->tcphdr->seqno), lwip_ntohl(seg->tcphdr->seqno))) {
            cur_seg = &((*cur_seg)->next );
          }
          seg->next = (*cur_seg);
          (*cur_seg) = seg;
        } else {
          /* add segment to tail of unacked list */
          useg->next = seg;
          useg = useg->next;
        }
      }

roblahnst · ‎2025-03-23

Apparently this is a known issue => https://savannah.nongnu.org/bugs/?59831

Tesla DeLorean · ‎2025-03-23

So an issue with tcphdr

It's walking a list, so I'd perhaps suspect unanticipated concurrent behaviour due to RTOS, either switching tasks, interrupt, or something manipulating the list outside of normal flow.

Perhaps needs to be done within a critical section?


FBL · ‎2025-03-25

Hi @roblahnst

To prevent concurrent access issues, you can protect the critical sections of your code where the useg pointer is accessed and modified by using mutexes.
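
As a rough illustration of that suggestion, here is a host-side sketch of a sorted list insert that holds a lock for the whole walk and link update. Everything in it is an assumption made for illustration: `struct seg` is a simplified stand-in for lwIP's `struct tcp_seg`, `seq_lt` mimics the intent of `TCP_SEQ_LT`, and `list_lock`/`list_unlock` are placeholder hooks that on the target would wrap an RTOS mutex (for example `osMutexAcquire`/`osMutexRelease` in CMSIS-RTOS v2).

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder lock hooks (hypothetical). On the target they would take and
 * release an RTOS mutex; here they only track nesting so the sketch runs
 * on a host. */
static int lock_depth = 0;
static void list_lock(void)   { lock_depth++; }
static void list_unlock(void) { lock_depth--; }

/* Simplified stand-in for lwIP's struct tcp_seg (not the real layout). */
struct seg {
  struct seg *next;
  uint32_t seqno;
};

/* Wraparound-safe "a < b" for 32-bit sequence numbers. */
static int seq_lt(uint32_t a, uint32_t b) {
  return ((uint32_t)(a - b) & 0x80000000u) != 0;
}

/* Insert seg into the list sorted by seqno. The lock is held while the list
 * is walked and relinked, so no other task can change it halfway through. */
static void seg_insert_sorted(struct seg **head, struct seg *seg) {
  struct seg **cur = head;

  list_lock();
  while (*cur != NULL && seq_lt((*cur)->seqno, seg->seqno)) {
    cur = &(*cur)->next;
  }
  seg->next = *cur;
  *cur = seg;
  list_unlock();
}
```

A mutex only helps if every code path that touches the list takes the same lock, and a mutex cannot be taken from an ISR, so interrupt handlers would have to hand their work to a task instead.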


roblahnst · ‎2025-03-25

Thanks. I applied the patch from https://savannah.nongnu.org/bugs/?59831 and since then it has been running fine.

Guillaume K · ‎2025-04-09

Hello

What network interface is used with LwIP? Is it the USB CDC? Where does the adaptation code between LwIP and the network interface come from?

It's this adaptation layer that must take care not to call LwIP "raw" APIs directly from multiple threads (or from an ISR).
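
One common way to meet that requirement is to let other threads post work to a queue that only one owner thread drains, so the raw API is only ever called from that one thread. The sketch below is a single-threaded host model of the idea under stated assumptions: `work_post`, `work_drain`, and `add_one` are hypothetical names, and on the target the array would be replaced by an RTOS message queue. lwIP provides `tcpip_callback()` for the same purpose when its tcpip thread is in use.

```c
#include <stddef.h>

/* Hypothetical deferred-call queue: producers post work, one owner runs it. */
typedef void (*work_fn)(void *ctx);

struct work_item {
  work_fn fn;
  void *ctx;
};

#define WORK_QUEUE_LEN 8u

static struct work_item work_queue[WORK_QUEUE_LEN];
static unsigned work_head = 0;   /* index of the next item to run */
static unsigned work_count = 0;  /* number of items waiting */

/* Called by a producer. Returns 0 on success, -1 if fn is NULL or the queue
 * is full. This host model has no locking of its own. */
static int work_post(work_fn fn, void *ctx) {
  unsigned tail;

  if (fn == NULL || work_count == WORK_QUEUE_LEN) {
    return -1;
  }
  tail = (work_head + work_count) % WORK_QUEUE_LEN;
  work_queue[tail].fn = fn;
  work_queue[tail].ctx = ctx;
  work_count++;
  return 0;
}

/* Called only by the owner thread: runs all queued items in posting order
 * and returns how many were executed. */
static unsigned work_drain(void) {
  unsigned ran = 0;

  while (work_count > 0) {
    struct work_item item = work_queue[work_head];
    work_head = (work_head + 1u) % WORK_QUEUE_LEN;
    work_count--;
    item.fn(item.ctx);
    ran++;
  }
  return ran;
}

/* Example work function for illustration: increments an int counter. */
static void add_one(void *ctx) {
  (*(int *)ctx)++;
}
```

Posting `add_one` twice and then calling `work_drain()` from the owner runs both calls in order; nothing executes in the context of the poster, which is the property the adaptation layer needs.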