[BUG] STM32 lwIP Ethernet driver Rx deadlock

Piranha · ‎2019-09-28

This bug is present in F1, F2, F4 and F7 series examples and CubeMX generated code with RTOS and is one of the biggest flaws in ST's lwIP integration.

Problem

https://github.com/STMicroelectronics/STM32CubeF7/blob/3600603267ebc7da619f50542e99bbdfd7e35f4a/Projects/STM32F767ZI-Nucleo/Applications/LwIP/LwIP_HTTP_Server_Netconn_RTOS/Src/ethernetif.c#L376

lwIP core (also known as the "tcpip_thread") calls low_level_output(), which calls HAL_ETH_TransmitFrame(). While the latter is processing, Ethernet input thread (ethernetif_input() function) can call low_level_input(), which calls HAL_ETH_GetReceivedFrame_IT(). Because both of those HAL functions use HAL's ingeniously stupid "lock" mechanism, HAL_ETH_GetReceivedFrame_IT() returns HAL_BUSY. Subsequently that makes low_level_input() to return NULL and ethernetif_input() to go back on waiting for a semaphore.

Consequences

Received frames are not processed and Rx buffers are not released to DMA.
When the next frames are received, code will iterate and try to process all previous frames up to the current frame. But again, if at some iteration HAL_BUSY is returned, the rest of the frames will be left unprocessed.
If no more frames do come from network, then the frames waiting in Rx buffers will never be processed.
If all Rx descriptors have been used (OWN bit cleared) when semaphore is acquired and HAL_BUSY is returned on processing the first frame, no more Rx complete interrupts will be generated. Consequently semaphore will not be released and Ethernet input thread will be stuck forever on waiting for that semaphore.

Solution

This code:

if(HAL_ETH_GetReceivedFrame_IT(&EthHandle) != HAL_OK)
	return NULL;

Must be replaced with this code:

HAL_StatusTypeDef status;
 
LOCK_TCPIP_CORE();
status = HAL_ETH_GetReceivedFrame_IT(&EthHandle);
UNLOCK_TCPIP_CORE();
 
if (status != HAL_OK) {
	return NULL;
}

Piranha · ‎2022-05-24

You don't need those macros at all, because the lwIP core locking is only necessary with RTOS.

MHama.3 · ‎2022-05-24

do you have any Idea, what to do in my case. Because i tried every thing but the board is still not responding, busy with all packets in the network

alister · ‎2022-05-25

> i tried every thing but the board is still not responding, busy with all packets in the network

This sounds remarkably like the original problem this community thread describes, except Piranha's proposed solution applies apps using an RTOS.

The original problem's root cause is HAL's locks would not permit HAL_ETH_GetReceivedFrame_IT and HAL_ETH_TransmitFrame* to execute at the same time. The proposed workaround for RTOS apps was to use LOCK_TCPIP_CORE to prevent two threads executing HAL_ETH_GetReceivedFrame_IT and HAL_ETH_TransmitFrame* at the same time.

Its correct fix is to remove the locking/lock-checking in HAL_ETH_GetReceivedFrame_IT and HAL_ETH_TransmitFrame* plus whatever vulnerability in those functions is needing those locks.

Perhaps the original problem is not fixed, or its fix has been accidentally reverted, and RTOS apps must still use the LOCK_TCPIP_CORE workaround.

>i tried every thing

Treat the code like you own it. Read it, instrument it, etc, if it's wrong, fix it.

Does your code execute either of those HAL_ETH functions from interrupt? Inspect them to determine whether the locks bug is fixed or not, and if not, fix it. Inspect them to ascertain of the locks were actually necessary or a mistake, and if they were necessary, fix its reason. Take care you don't call lwIP from interrupt.

Something else? Inspect/instrument/etc the code and isolate where.

PHolt.1 · ‎2022-07-23

I posted this before but it got no response.

I think the mutex macro LOCK_TCPIP_CORE() is not right, because if you then try to make LWIP thread-safe, by setting LWIP_TCPIP_CORE_LOCKING=1, you activate the same set of mutexes at the LWIP API, which obviously bombs the whole thing because your API call (say to open a socket) sets the API mutex and then the same mutex is attempted to be set again by the above low level code.

Unless you can somehow be sure that your two low level functions cannot be re-entered (and I can't see how you can be sure if you are calling the LWIP API from multiple threads) then you have to mutex them yourself.

But you have to use a different set of mutexes. That's what I did. I just pulled two more out of FreeRTOS and used them.

The trouble comes if you are calling either function under an interrupt. Then you cannot mutex them.

Transmission does not need interrupts because a call to LWIP to send some data to XYZ eventually results in packets being sent out on ETH, so using the ISR for that adds no value that I can see.

Reception does benefit from interrupts, compared to polling, greatly. I am not sure how to handle this though if you have a multi threaded LWIP. I posted what I did here

https://www.eevblog.com/forum/microcontrollers/anyone-here-familiar-with-lwip/

but I am still doing polled RX.

Piranha · ‎2022-07-23

The code this topic is talking about is located in low_level_input() function, which is called only from the Ethernet input thread (ethernetif_input() function). And the low_level_output() function is called only either from the core thread ("tcpip_thread") or from other threads with the core (thread) locked. Therefore, if the other code is not broken, neither of these two functions can be re-entered.

Take a note that this whole "solution" was only a sub-optimal workaround to make ST's lwIP implementation working at all, while not changing the HAL ETH driver. The real solution is to remove the idiotic and broken HAL "lock" mechanism from the driver. When that is done, there is no reason to use the workaround this topic is about. And the new rewritten zero-copy driver doesn't have HAL "lock" in it anymore. If you still want to waste your time on fixing a code created by incompetent fools, I recommend taking their new rewritten driver and lwIP integration as the base and fixing it's flaws. It will require less work and it will be more compatible with their future code.

In zero-copy drivers the transmission benefits from interrupt on completion of the frame transmission - at that point in time the driver can free the buffer. And, if SYS_LIGHTWEIGHT_PROT=1 and LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT=1, pbuf_free() can even be called directly from an ISR, improving the overall performance. Reception is easy - signal the event from the ISR and process the actual frames in the Ethernet input processing thread. Now with the rewritten driver ST's ethernetif_input() implementation is correct. It's just recommended to use thread events/flags/notifications instead of semaphore for better performance and less resource usage.

I'll read your topics on EEVblog and comment later... :smiling_face_with_smiling_eyes:

ktrofimo · ‎2022-09-16

Working with the latest H7 HAL code still found some mutex lockup in low_level_output() after HAL_ETH_Transmit_IT() :

while( osSemaphoreAcquire( TxPktSemaphore, TIME_WAITING_FOR_INPUT) != osOK )
{
}

Looks like your solution with locking low_level_input() is working in this case too...

Piranha · ‎2022-09-17

I looked into this and here is the result:

https://community.st.com/s/question/0D53W00001oImr4SAC/bug-stm32-lwip-ethernet-driver-tx-deadlock

P.S. The workaround for Rx deadlock, described in this topic, cannot avoid that Tx deadlock bug.

ktrofimo · ‎2022-09-17

Thanks! I will fix this part of the code too. Nevertheless even with fix from this thread firmware is quite stable (15Mb/s STM -> PC for 26 hours and no stop). Without this fix ethernet would definitely stop within 1st hour.