HAL SPI nested interrupt race condition with overrun in IT mode transfer

Zign Innovations - Tim · ‎2021-07-12

Issue details

We have observed a situation where a full duplex HAL SPI transfer in IT mode would hang in permanent BUSY state after many transfers in a system with relatively long ISRs at higher priorities than the SPI.

Debugging showed that the lockup had the following premise:

Transfer is started with HAL_SPI_TransmitReceive_IT
A couple of bytes are transmitted
HAL_SPI_IRQHandler executes with RXNE=1
RxISR function pointer is called
The SPI interrupt is pre-empted by a higher priority IRQ which takes long enough to result in RX FIFO overrun on the SPI peripheral (OVR=1)
RxISR function continues and reads DR register
HAL_SPI_IRQHandler returns (early return)
HAL_SPI_IRQHandler executes with TXE=1
HAL_SPI_IRQHandler reads SR register, clearing the OVR bit as a side effect (Reference manual: "Clearing the OVR bit is done by a read access to the SPI_DR register followed by a read access to the SPI_SR register")
TxISR function pointer is called
HAL_SPI_IRQHandler returns (early return)
HAL_SPI_IRQHandler executes with RXNE=1
HAL_SPI_IRQHandler reads SR but it does not contain the OVR error anymore
Since the overrun went undetected, the HAL waits forever on the last RX interrupt that will never come

Product details

Tested with: STM32F0xx HAL Drivers v1.7.5

Also affects: all hal_spi implementations for SPI peripherals that clear OVR by sequentially reading DR and SR, including: STM32F4xx, STM32F7xx, STM32G0xx, STM32L4xx

Proposed fix

Removing the early returns in HAL_SPI_IRQHandler (in particular the one after TxISR()) prevents this issue, because then SR is checked for errors after TxISR is called without re-reading SR in between. Untested alternative: change the order from RX check -> TX check -> error check to RX check -> error check -> TX check.

Amira · ‎2021-07-12

Hello @Zign Innovations - Tim

Can you precise how long time the others interrupts services routines take (with the priority higher than the SPI) ? Because, it is not recommended to stay long in interrupts, like doing blocking calls. Interrupt service routines (ISR) should be short and only perform highly time critical tasks.

To execute the SPI handler with a high priority you should ensure that the others interrupts don't block it, at least not for a long time.

This can be done by those interrupt handlers, just setting a flag that indicates (notify) that the device needs attention and there is a new data and then handling it in the main loop of your application. That will normally be a much safer solution.

Hope this would help you!

Regards,

Amira

Tesla DeLorean · ‎2021-07-12

I think the underlying complaint here is that the SR/DR read sequence is prone to a race-condition if an OVR flags between these actions.

Some STM32 peripheral designs are more robust, requiring error states to be explicitly cleared rather than simply side-effects of potentially normal interactions with the registers.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

waclawek.jan · ‎2021-07-12

> I think the underlying complaint here is that the SR/DR read sequence is prone to a race-condition

Ok but given vulnerable hardware, isn't it the task of Cube as the only example provided by ST to give an universally good solution?

JW

Zign Innovations - Tim · ‎2021-07-13

We already have reasonably short ISRs and this situation actually happened not as a result of a single pre-emption, but rather a worst-case situation where multiple USART interrupts would all happen at the same time, and while testing with higher data rates than we use in production. But that is besides the point. Encountering overruns as a result of ISR congestion is to be expected, but an overrun condition going undetected and resulting in a broken HAL state is simply a bug. In this case, a bug that was fairly difficult to track down but is fairly easy to fix.

Piranha · ‎2021-07-17

All of the HAL library is just full of race conditions. Check the one impacting the drivers for almost all peripherals:

https://community.st.com/s/question/0D50X0000C5Tns8SQC/bug-stm32-hal-driver-lock-mechanism-is-not-interrupt-safe

Essentially it's impossible to make a reliable firmware using HAL/Cube code...

AVI-crak · ‎2021-07-17

SPI implementation on cmsis - search engines find hundreds of solutions for all MK lines from ST. But that's not even the point.

You need to use DMA instead of interrupts, as it was originally intended.

S.Ma · ‎2022-02-26

Which STM33? If SPI slave mode with NSS as EXTI inyerrupt, and DMA in ram cycling mode, most interrupts are gone. Remains the time delay between NSS falling edgr and the SCK starts, if you want to update the data to transmit out. When NSS goes up, ISR needs time to process the data, so a minimum NSS high time is needed too. After that, change the interrupt priorities between UART and SPI if needed.