cancel
Showing results for 
Search instead for 
Did you mean: 

[BUG] STM32 lwIP Ethernet driver Tx deadlock

Piranha
Chief II

This bug is present in ethernetif.c files for lwIP generated by CubeMX for the newer reworked ETH drivers.

Problem

When using CMSIS-RTOSv2, CubeMX generates the following code in the function low_level_output():

 

pbuf_ref(p);
HAL_ETH_Transmit_IT(&heth, &TxConfig);
while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
{
}
HAL_ETH_ReleaseTxPacket(&heth);

 

The TxPktSemaphore is initialized with both maximum and (wrongly) initial count of 1. Because the semaphore count is already 1, the osSemaphoreAcquire() call returns immediately and the HAL_ETH_ReleaseTxPacket() can release the packet before the transmission has actually been completed. While the Tx complete interrupt happens after the osSemaphoreAcquire() call, the code continues running in this crippled mode, corrupting packets and driver/stack state.

When the Tx complete interrupt happens before the osSemaphoreAcquire() call, the osSemaphoreRelease() call in the Tx complete interrupt tries to increase the semaphore count, but fails to do it because the count already is at the maximum. At this moment the code actually corrects the wrongly initialized semaphore count and the osSemaphoreAcquire() call starts to actually wait for the Tx complete interrupt. That means the packets will not be corrupted further, but the driver/stack state can potentially already be damaged and is not guaranteed to recover.

After the semaphore count is corrected or if CMSIS-RTOSv1 is used, another issue can happen. The HAL_ETH_Transmit_IT() function can return an error if, for example, there are not enough descriptors available at the moment. In this case the transmission will not be started at all and the osSemaphoreAcquire() call will wait forever. This means that the function low_level_output() and therefore the whole lwIP core is deadlocked.

In addition, if the low_level_output() function is designed to block until the packet transmission is complete, there is absolutely no point in using the pbuf_ref() and the respective pbuf_free() in HAL_ETH_TxFreeCallback() function. Incrementing the PBUF reference count is necessary for a design, where the low_level_output() function doesn't block and the PBUF is released asynchronously.

The RxPktSemaphore has exactly the same wrong semaphore initial count, but, apart from a single additional useless loop iteration in the ethernetif_input() function, the bug has no critical consequences.

Solution

First, initialize the semaphores with a count of 0 in function low_level_init(). Therefore this CubeMX script in ethernetif_***.ftl files:

 

[#if cmsis_version = "v2"]
  RxPktSemaphore = osSemaphoreNew(1, 1, NULL);
  TxPktSemaphore = osSemaphoreNew(1, 1, NULL);
[#else][#-- else cmsis_version --]
  RxPktSemaphore = xSemaphoreCreateBinary();
  TxPktSemaphore = xSemaphoreCreateBinary();
[/#if][#-- endif cmsis_version --]

 

Must be changed to this:

 

[#if cmsis_version = "v1"]
  RxPktSemaphore = osSemaphoreCreate(osSemaphore(RxSem), 1);
  TxPktSemaphore = osSemaphoreCreate(osSemaphore(TxSem), 1);
  /* Decrease the semaphores' initial count from 1 to 0 */
  osSemaphoreWait(RxPktSemaphore, 0);
  osSemaphoreWait(TxPktSemaphore, 0);
[#else][#-- else cmsis_version --]
  RxPktSemaphore = osSemaphoreNew(1, 0, NULL);
  TxPktSemaphore = osSemaphoreNew(1, 0, NULL);
[/#if][#-- endif cmsis_version --]

 

In addition to fixing the bugs, I've made other improvements. The CMSIS-RTOSv1 code was using FreeRTOS API, which was functionally correct but inconsistent. The code examples are also using the same inconsistent mix of those two APIs. Also I swapped the order of [if-else] blocks for the CubeMX script to a more logical one and consistent with other parts of the script. I would recommend ST to keep the same order everywhere, which is not the case currently.

Second, the return value of the HAL_ETH_Transmit_IT() function must be checked and, if something other than HAL_OK was returned, the osSemaphoreAcquire() and HAL_ETH_ReleaseTxPacket() calls must be skipped and instead the pbuf_free() must be called to revert the increment done by the earlier pbuf_ref() call, if it is used.

 

pbuf_ref(p);
if (HAL_ETH_Transmit_IT(&heth, &TxConfig) == HAL_OK) {
	while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
	{
	}
	HAL_ETH_ReleaseTxPacket(&heth);
} else {
	pbuf_free(p);
}

 

Additional information

Although with incomplete description of consequences and their severity, this bug and a correct solution has also been reported there:

https://github.com/STMicroelectronics/STM32CubeF4/issues/127

And the fact of a deadlock happening is also reported there:

https://github.com/STMicroelectronics/STM32CubeH7/issues/224

26 REPLIES 26
PHolt.1
Senior II

Why are the semaphores needed?

I have been told repeatedly that LWIP does not call TX in multiple threads, even if multiple FreeRTOS tasks are outputting data concurrently.

Piranha
Chief II

As those three lines of code shows, the TxPktSemaphore is used for waiting on hardware Tx completion, not some mutual exclusion. Of course, a decent design should just add the packet to the DMA Tx queue, return immediately and then release the sent packets on a Tx complete interrupt. But this bug report is for ST's code, not something decent!

By the way, when this bug will be fixed, most likely ST's stated 92 Mbps Tx throughput will decrease significantly.

PHolt.1
Senior II

OK thanks; so this is not relevant to my code which was ST's (adapted) one from ~ 2 years ago.

jacekowski
Associate II

There is still an issue with the semaphore that is not fixed by the proposed fix,

If interrupt happens between

hal_status = HAL_ETH_Transmit_IT(&heth, &TxConfig);

and

while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)

Then there is a chance that HAL_ETH_TxCpltCallback will be called between those two and you will still have the code stopping at this line.

No, that is not how semaphores work. The ISR calls the osSemaphoreRelease(), which increases the semaphore count to 1. When the count is non-zero, the osSemaphoreAcquire() decreases the count and returns immediately with a return value of osOK. The order of those calls doesn't matter - the logic will still be correct. If that would not be the case, the semaphore would be useless. And this behavior is not specific to CMSIS-RTOS or FreeRTOS - that is the way semaphores work in general.

Issue in my case was due to eiuther sending things from multiple tasks or from within interrupt context (i didn't really check which of those two things was actually breaking it, just fixed both potential issues) with fairly high interrupt load, so in my case i would end up with 2x release first (because of semaphore max count being 1 it leaves semaphore at 1) then acquire that works, then 2nd acquire that locked.

ktrofimo
Senior III

0693W00000UnKuvQAF.png

  /* create a binary semaphore used for informing ethernetif of frame transmission */
  TxPktSemaphore = osSemaphoreNew(1, 0, NULL);
void HAL_ETH_RxCpltCallback(ETH_HandleTypeDef *handlerEth)
{
  osSemaphoreRelease(RxPktSemaphore);
}

Still locks :(

ktrofimo
Senior III

Update:

Both TCP/IP task and Eth input task both are blocked at the same event object 0x200090ac0693W00000UnKyYQAV.png

https://github.com/STMicroelectronics/STM32CubeH7/issues/224#issuecomment-1199200853

It seems that you are still using the ST's broken link detection code. I've pointed it out and explained what is wrong with it in my main Ethernet & lwIP issue topic since the beginning and I am repeating it again and again, but nobody is listening. ST, forum users and almost everyone just keeps silent and pretends that those issues doesn't exist. It's a mass insanity!

To fix it, you can get inspired from these links:

https://community.st.com/s/question/0D53W00000zHH9PSAW/changing-ip-address-mac-address-during-runtime

https://github.com/stm32-hotspot/STM32H7-LwIP-Examples/blob/b4bcc6cffa109ff44b875165b5cd5aec04f5df22/STM32H743_Nucleo_ETH/LWIP/Target/ethernetif.c#L798

P.S. I cannot know whether your other threads, which are using lwIP API, respect the rules.