what is the best way to recover from a driver function that hangs

rtursen · ‎2020-06-10

Hello,

STM32F429 drives a FT813 based LCD.

The LCD driver code is generated by the Eve Designer software.

Sometimes the driver code hangs (I suspect an EMI issue).

If I were using an RTOS, I would put it in a separate task and restart it if it hangs.

Without using RTOS, what is the best way to recover from a function that hangs?

I would prefer to not use a Watchdog timer, as the rest of the system is doing critical stuff.

Best Regards

TDK · ‎2020-06-10

I would roll your own watchdog, run it in the trusted part of the code, and reset the peripheral as needed. The exact details will depend on the peripheral and how your code is organized.
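
A minimal sketch of such a home-grown watchdog, assuming a 1 ms tick interrupt is the "trusted part of the code". All names here (`swdg_*`, `lcd_abort_requested`, `lcd_recover_stub`) are hypothetical and not part of any ST or FTDI library. The driver arms the watchdog before a transfer and kicks it on progress; the tick handler counts down and calls a recovery hook when the deadline passes.

```c
#include <stdbool.h>
#include <stdint.h>

typedef void (*swdg_recover_fn)(void);

typedef struct {
    volatile uint32_t remaining_ms; /* counts down while armed */
    uint32_t          timeout_ms;   /* reload value */
    volatile bool     armed;
    volatile uint32_t trips;        /* how many times recovery ran */
    swdg_recover_fn   recover;      /* called from the tick context */
} swdg_t;

void swdg_init(swdg_t *w, uint32_t timeout_ms, swdg_recover_fn recover)
{
    w->remaining_ms = 0;
    w->timeout_ms   = timeout_ms;
    w->armed        = false;
    w->trips        = 0;
    w->recover      = recover;
}

/* Driver side: arm before a risky operation, kick on progress, disarm after. */
void swdg_arm(swdg_t *w)    { w->remaining_ms = w->timeout_ms; w->armed = true; }
void swdg_kick(swdg_t *w)   { w->remaining_ms = w->timeout_ms; }
void swdg_disarm(swdg_t *w) { w->armed = false; }

/* Trusted side: call once per millisecond, e.g. from the tick interrupt. */
void swdg_tick_1ms(swdg_t *w)
{
    if (!w->armed) return;
    if (w->remaining_ms > 0) w->remaining_ms--;
    if (w->remaining_ms == 0) {
        w->armed = false;           /* fire once per arming */
        w->trips++;
        if (w->recover) w->recover();
    }
}

/* Hypothetical recovery hook. On target this is where the SPI peripheral
 * would be reset; here it only raises a flag that the driver's wait loops
 * are assumed to poll so they can bail out. */
volatile bool lcd_abort_requested = false;
void lcd_recover_stub(void) { lcd_abort_requested = true; }
```

The hook only helps if resetting the peripheral or raising the flag actually lets the stuck loop exit, so the driver's wait loops need to check the flag.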


Tesla DeLorean · ‎2020-06-10

Add instrumentation and telemetry to the driver code so you can understand exactly how it is failing, output to a serial terminal so you can do this without debugger intrusion.

Have your HardFault_Handler() and Error_Handler(), and other while(1) loops, output actionable data so you can tell if you end up there, and from where.

Then fix the driver code, and understand how to unwind error/fault conditions in a recoverable fashion.


Pavel A. · ‎2020-06-10

When it "hangs", can you break into debugger?

Do you understand why the program hangs? For example, does it read some memory or register and wait for some bit to be set?

Or as Clive suggested, does it spin in the hardfault handler?

-- pa

alister · ‎2020-06-10

Especially if the system is critical, your product should be watchdogged.

Don't give up isolating its cause. Common practice is to instrument the code and stick the details in ring buffer(s), execute under debug, view the details after it's hung, then refine what you're instrumenting and re-test until the cause is isolated.
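
A minimal sketch of such a trace ring buffer, assuming a power-of-two depth and a caller-supplied tick value. The `trace_*` names and the event/argument fields are hypothetical. After a hang, the buffer can be inspected in the debugger to see the last events before the code stopped making progress.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 16u   /* must be a power of two */

typedef struct {
    uint32_t tick;    /* timestamp supplied by the caller */
    uint16_t event;   /* application-defined event code */
    uint16_t arg;     /* application-defined detail */
} trace_entry_t;

typedef struct {
    trace_entry_t entries[TRACE_DEPTH];
    uint32_t      head;   /* total number of records written so far */
} trace_buf_t;

/* Record one event, overwriting the oldest entry when the buffer is full.
 * Not interrupt-safe as written: if both the main loop and an ISR log,
 * use one buffer per context or mask interrupts around the call. */
void trace_put(trace_buf_t *t, uint32_t tick, uint16_t event, uint16_t arg)
{
    trace_entry_t *e = &t->entries[t->head & (TRACE_DEPTH - 1u)];
    e->tick  = tick;
    e->event = event;
    e->arg   = arg;
    t->head++;
}

/* Return the n-th most recent entry (0 = newest), or NULL if not stored. */
const trace_entry_t *trace_recent(const trace_buf_t *t, uint32_t n)
{
    uint32_t stored = (t->head < TRACE_DEPTH) ? t->head : TRACE_DEPTH;
    if (n >= stored) return NULL;
    return &t->entries[(t->head - 1u - n) & (TRACE_DEPTH - 1u)];
}
```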

Some things to check:

The driver interfaces are called single-threaded? You're not calling the same driver from main loop and interrupt?
Interrupts are enabled, especially system tick?
System tick is incrementing? HAL drivers use the tick a lot for timeouts.
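
To illustrate why the tick matters, here is a sketch of a tick-bounded wait loop. The `wait_ready` helper and the `sim_*` stand-ins are hypothetical; on target, `HAL_GetTick` could be passed as the tick source. The unsigned subtraction keeps the comparison correct across tick wraparound, but if the tick stops incrementing the loop never times out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t (*tick_fn)(void);   /* millisecond tick source */
typedef bool     (*ready_fn)(void);  /* polls the condition being awaited */

/* Busy-wait until ready() is true or timeout_ms ticks have elapsed.
 * Returns true on success, false on timeout. */
bool wait_ready(ready_fn ready, tick_fn now, uint32_t timeout_ms)
{
    const uint32_t start = now();
    while (!ready()) {
        /* Unsigned difference is wraparound-safe. */
        if ((uint32_t)(now() - start) >= timeout_ms) return false;
    }
    return true;
}

/* Host-side stand-ins for trying the helper off-target: a tick that
 * advances on every read and starts just before the 32-bit wrap. */
uint32_t sim_tick = 0xFFFFFFFEu;
uint32_t sim_now(void)          { return sim_tick++; }
bool     sim_never_ready(void)  { return false; }
bool     sim_always_ready(void) { return true; }
```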

rtursen · ‎2020-06-11

Hello folks,

Thanks a lot for your answers.

1) Interrupts are working when the code hangs

2) We can break into the debugger when the program hangs, but stepping through the code doesn't help; after a couple of steps, system gets a reset

3) When the program hangs and we break into debugger, the code is usually doing SPI transfer

The problem happens quite rarely so it's a little difficult to debug.

We will try to fix the driver code as you suggest by adding telemetry. I will try to post our progress.

alister · ‎2020-06-11

>after a couple of steps, system gets a reset

Is the watchdog enabled, and is it left running while the core is halted, i.e. not frozen by __HAL_DBGMCU_FREEZE_IWDG() or similar?

>usually doing SPI transfer

Doing what part of SPI? How are you doing SPI? Does it always hang the same place?

The answers to this post are general advice because we don't know the specifics.

berendi · ‎2020-06-11

Check board temperature and supply voltages. Last time this happened to me, the power supply was not able to keep up with the power requirements of the display at brightness > 70%. I was able to adjust the time between resets with the brightness PWM 🙂

> If I were using RTOS then I would put it in a separate task and restart it if it hangs.

> Without using RTOS, what is the best way to recover from a function that hangs?

Implementing your own task management without the rest of RTOS features, and doing the same.
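
One way this could look, as a rough sketch: the driver work is split into a non-blocking step function polled from the main loop, and a small supervisor re-initialises it when it stops making progress. The `coop_task_*` and `demo_lcd_*` names are hypothetical. This assumes every step returns promptly; a step that truly blocks cannot be preempted this way.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    void   (*init)(void *ctx);   /* (re)initialise the job and its hardware */
    bool   (*step)(void *ctx);   /* do a little work; true if progress made */
    void    *ctx;
    uint32_t last_progress_ms;
    uint32_t timeout_ms;         /* max time allowed without progress */
    uint32_t restarts;
} coop_task_t;

void coop_task_start(coop_task_t *t, uint32_t now_ms)
{
    t->init(t->ctx);
    t->last_progress_ms = now_ms;
}

/* Call from the main loop. Restarts the job if it has stalled too long. */
void coop_task_poll(coop_task_t *t, uint32_t now_ms)
{
    if (t->step(t->ctx)) {
        t->last_progress_ms = now_ms;
        return;
    }
    if ((uint32_t)(now_ms - t->last_progress_ms) >= t->timeout_ms) {
        t->restarts++;
        t->init(t->ctx);
        t->last_progress_ms = now_ms;
    }
}

/* Simulated LCD job for trying the supervisor on a host machine. */
typedef struct { bool bus_stuck; uint32_t steps; } demo_lcd_t;

void demo_lcd_init(void *p)
{
    demo_lcd_t *l = (demo_lcd_t *)p;
    l->bus_stuck = false;        /* a re-init is assumed to clear the fault */
    l->steps = 0;
}

bool demo_lcd_step(void *p)
{
    demo_lcd_t *l = (demo_lcd_t *)p;
    if (l->bus_stuck) return false;
    l->steps++;
    return true;
}
```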

Piranha · ‎2020-06-11

First, the nature of the hang must be understood. If it's because, for example, the code is waiting for a specific number of bytes on a USART interface, but some bytes were actually lost, then that is a situation from which the code must be able to recover normally by using headers, checksums, and timeouts. If it happens because the code is flawed, then it must be fixed. Typical reasons are interrupt/thread-unsafe code, deadlocks, missing volatiles, missing compiler and memory barriers, not clearing interrupt flags, and other wrong interrupt processing. HAL is hopelessly broken in this regard...
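
As an illustration of the headers-and-checksums point, here is a sketch of a byte-wise frame receiver. The frame layout (start byte 0xA5, length, payload, 8-bit additive checksum) and all `frame_rx_*` names are assumptions made up for this example. A timeout handler is assumed to call `frame_rx_reset` when bytes stop arriving mid-frame, so a lost byte costs one frame instead of hanging the receiver.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAME_SOF     0xA5u   /* hypothetical start-of-frame marker */
#define FRAME_MAX_LEN 32u

typedef enum { RX_WAIT_SOF, RX_WAIT_LEN, RX_PAYLOAD, RX_CHECKSUM } rx_state_t;

typedef struct {
    rx_state_t state;
    uint8_t    len;                    /* payload length of current frame */
    uint8_t    count;                  /* payload bytes received so far */
    uint8_t    sum;                    /* running checksum over len+payload */
    uint8_t    payload[FRAME_MAX_LEN];
} frame_rx_t;

/* Also the recovery action for an inter-byte timeout. */
void frame_rx_reset(frame_rx_t *rx)
{
    rx->state = RX_WAIT_SOF;
    rx->len = 0;
    rx->count = 0;
    rx->sum = 0;
}

/* Feed one received byte. Returns true when a complete frame with a
 * matching checksum is available in rx->payload (rx->len bytes). */
bool frame_rx_feed(frame_rx_t *rx, uint8_t byte)
{
    switch (rx->state) {
    case RX_WAIT_SOF:
        if (byte == FRAME_SOF) rx->state = RX_WAIT_LEN;
        return false;
    case RX_WAIT_LEN:
        if (byte == 0u || byte > FRAME_MAX_LEN) {
            frame_rx_reset(rx);        /* implausible length: resync */
            return false;
        }
        rx->len = byte;
        rx->count = 0;
        rx->sum = byte;
        rx->state = RX_PAYLOAD;
        return false;
    case RX_PAYLOAD:
        rx->payload[rx->count++] = byte;
        rx->sum = (uint8_t)(rx->sum + byte);
        if (rx->count == rx->len) rx->state = RX_CHECKSUM;
        return false;
    case RX_CHECKSUM:
        rx->state = RX_WAIT_SOF;       /* ready for the next frame either way */
        return byte == rx->sum;
    }
    return false;
}
```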