[BUG] STM32 lwIP Ethernet driver Tx deadlock

Piranha · ‎2022-09-17

This bug is present in ethernetif.c files for lwIP generated by CubeMX for the newer reworked ETH drivers.

Problem

When using CMSIS-RTOSv2, CubeMX generates the following code in the function low_level_output():

pbuf_ref(p);
HAL_ETH_Transmit_IT(&heth, &TxConfig);
while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
{
}
HAL_ETH_ReleaseTxPacket(&heth);

The TxPktSemaphore is initialized with both maximum and (wrongly) initial count of 1. Because the semaphore count is already 1, the osSemaphoreAcquire() call returns immediately and the HAL_ETH_ReleaseTxPacket() can release the packet before the transmission has actually been completed. While the Tx complete interrupt happens after the osSemaphoreAcquire() call, the code continues running in this crippled mode, corrupting packets and driver/stack state.

When the Tx complete interrupt happens before the osSemaphoreAcquire() call, the osSemaphoreRelease() call in the Tx complete interrupt tries to increase the semaphore count, but fails to do it because the count already is at the maximum. At this moment the code actually corrects the wrongly initialized semaphore count and the osSemaphoreAcquire() call starts to actually wait for the Tx complete interrupt. That means the packets will not be corrupted further, but the driver/stack state can potentially already be damaged and is not guaranteed to recover.

After the semaphore count is corrected or if CMSIS-RTOSv1 is used, another issue can happen. The HAL_ETH_Transmit_IT() function can return an error if, for example, there are not enough descriptors available at the moment. In this case the transmission will not be started at all and the osSemaphoreAcquire() call will wait forever. This means that the function low_level_output() and therefore the whole lwIP core is deadlocked.

In addition, if the low_level_output() function is designed to block until the packet transmission is complete, there is absolutely no point in using the pbuf_ref() and the respective pbuf_free() in HAL_ETH_TxFreeCallback() function. Incrementing the PBUF reference count is necessary for a design, where the low_level_output() function doesn't block and the PBUF is released asynchronously.

The RxPktSemaphore has exactly the same wrong semaphore initial count, but, apart from a single additional useless loop iteration in the ethernetif_input() function, the bug has no critical consequences.

Solution

First, initialize the semaphores with a count of 0 in function low_level_init(). Therefore this CubeMX script in ethernetif_***.ftl files:

[#if cmsis_version = "v2"]
  RxPktSemaphore = osSemaphoreNew(1, 1, NULL);
  TxPktSemaphore = osSemaphoreNew(1, 1, NULL);
[#else][#-- else cmsis_version --]
  RxPktSemaphore = xSemaphoreCreateBinary();
  TxPktSemaphore = xSemaphoreCreateBinary();
[/#if][#-- endif cmsis_version --]

Must be changed to this:

[#if cmsis_version = "v1"]
  RxPktSemaphore = osSemaphoreCreate(osSemaphore(RxSem), 1);
  TxPktSemaphore = osSemaphoreCreate(osSemaphore(TxSem), 1);
  /* Decrease the semaphores' initial count from 1 to 0 */
  osSemaphoreWait(RxPktSemaphore, 0);
  osSemaphoreWait(TxPktSemaphore, 0);
[#else][#-- else cmsis_version --]
  RxPktSemaphore = osSemaphoreNew(1, 0, NULL);
  TxPktSemaphore = osSemaphoreNew(1, 0, NULL);
[/#if][#-- endif cmsis_version --]

In addition to fixing the bugs, I've made other improvements. The CMSIS-RTOSv1 code was using FreeRTOS API, which was functionally correct but inconsistent. The code examples are also using the same inconsistent mix of those two APIs. Also I swapped the order of [if-else] blocks for the CubeMX script to a more logical one and consistent with other parts of the script. I would recommend ST to keep the same order everywhere, which is not the case currently.

Second, the return value of the HAL_ETH_Transmit_IT() function must be checked and, if something other than HAL_OK was returned, the osSemaphoreAcquire() and HAL_ETH_ReleaseTxPacket() calls must be skipped and instead the pbuf_free() must be called to revert the increment done by the earlier pbuf_ref() call, if it is used.

pbuf_ref(p);
if (HAL_ETH_Transmit_IT(&heth, &TxConfig) == HAL_OK) {
	while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
	{
	}
	HAL_ETH_ReleaseTxPacket(&heth);
} else {
	pbuf_free(p);
}

Additional information

Although with incomplete description of consequences and their severity, this bug and a correct solution has also been reported there:

https://github.com/STMicroelectronics/STM32CubeF4/issues/127

And the fact of a deadlock happening is also reported there:

https://github.com/STMicroelectronics/STM32CubeH7/issues/224

Pavel A. · ‎2022-10-05

In the new ST "sandbox" there is yet another collection of ETH & LwIP examples with a nice readme.

Native CubeIDE projects, no more hassles with import and conversion.

https://github.com/stm32-hotspot/STM32H7-LwIP-Examples

It is not clear how to inform the maintainers of a bug. See CONTRIBUTING.md.

ktrofimo · ‎2022-10-06

Thank you for your efforts!

I think I did implemented all improvements you mentioned. My code was running for weeks under constant load (17Mb/s: vnc, http, https, vpn, modbus, ping). But I noticed that a simple code could break the whole thing.

For example simple netconn UDP client works like a charm:

struct netconn *sendconn = netconn_new( NETCONN_UDP );
netconn_bind(sendconn, &WG_LOCAL, 55155 );
netconn_connect(sendconn, &WG_SERVER, 55155);
struct netbuf* sendbuf = netbuf_new();
uint8_t *data = netbuf_alloc(sendbuf, strlen(message) );
memcpy(data, message, strlen(message) );
netconn_send(sendconn, sendbuf);
netbuf_free(sendbuf);
netbuf_delete(sendbuf);
netconn_disconnect(sendconn);
netconn_delete(sendconn);

But the same RAW client breaks the whole thing:

/* THIS CODE PRETENDS TO WORK, BUT CAUSES LWIP TO FAIL */
struct udp_pcb* my_udp = udp_new();
udp_connect(my_udp, &PC_IPADDR, 55155 );
struct pbuf* udp_buffer = pbuf_alloc(PBUF_TRANSPORT, strlen(message), PBUF_RAM);
if (udp_buffer != NULL)
{
	memcpy(udp_buffer->payload, message, strlen(message));
	udp_send(my_udp, udp_buffer);
	pbuf_free(udp_buffer);
}

From "lwIP Multithreading" (https://www.nongnu.org/lwip/2_1_x/multithreading.html) :

"Netconn or Socket API functions are thread safe against the core thread but they are not reentrant at the control block granularity level. That is, a UDP or TCP control block must not be shared among multiple threads without proper locking."

Piranha · ‎2022-10-30

Of course the second code is broken because like ST and almost everyone you also ignore the rules described in the link you presented. Core API cannot be called from other threads without locking the lwIP core thread! Also I gave two links which show how to do it.

I reported the same type of issues in ST code with many details in this topic:

https://community.st.com/s/question/0D53W00001sGiChSAK/cubemx-lwip-ethernetlinkthread-bug

If necessary, let's continue this discussion there... 🙂

Piranha · ‎2022-10-30

Added a section "Another problem" to the main topic, which was reported here:

https://community.st.com/s/question/0D53W00001pYnd6SAC/problems-after-migrating-stm32f777-to-new-ethernet-driver-in-cubemx-661

MSG_ST · ‎2023-01-10

Hi,

The proposed fix was tested and it negatively impacts the performance of the driver.

Without the solution proposed :

RxPktSemaphore = osSemaphoreNew(1, 1, NULL);

TxPktSemaphore = osSemaphoreNew(1, 1, NULL);

--> Client : 92.9 Mbits/s

--> Server : 95 Mbits/s

With the solution proposed :

RxPktSemaphore = osSemaphoreNew(1, 0, NULL);

TxPktSemaphore = osSemaphoreNew(1, 0, NULL);

--> Client : 75.7 Mbits/s

--> Server : 94.9 Mbits/s

So du to the fact that the performance is paramount and the problem reported is extremely rare, actually this fix will not be implemented.

But we will save it to be implemented on a global enhancement solution.

Thank you for your cooperation and for sharing the encountered issues.

Regards

Mahdy

jacekowski · ‎2023-01-10

I really hope this is a bad joke.

It is not extremely rare, it is extremely frequent.

PHolt.1 · ‎2023-01-10

Exactly. The perf reduction is absolutely negligible. How many embedded systems are a) needing and b) actually achieving (via a real application and a real tcp/ip stack) 75mbps?

ktrofimo · ‎2023-01-10

Agree. I can sacrifice bandwidth but absolutely can't sacrifice stability.

el_d · ‎2023-05-02

Want to share my way to handle this problem. Recommended solutions here and in other places did not worked for my setup(407zg+dp83848) but finally making ETH semaphores control blocks static and moving those to CCMRAM helped

ktrofimo · ‎2023-05-02

Can you share your changes?