cancel
Showing results for 
Search instead for 
Did you mean: 

[BUG] STM32 lwIP Ethernet driver Tx deadlock

Piranha
Chief II

This bug is present in ethernetif.c files for lwIP generated by CubeMX for the newer reworked ETH drivers.

Problem

When using CMSIS-RTOSv2, CubeMX generates the following code in the function low_level_output():

 

pbuf_ref(p);
HAL_ETH_Transmit_IT(&heth, &TxConfig);
while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
{
}
HAL_ETH_ReleaseTxPacket(&heth);

 

The TxPktSemaphore is initialized with both maximum and (wrongly) initial count of 1. Because the semaphore count is already 1, the osSemaphoreAcquire() call returns immediately and the HAL_ETH_ReleaseTxPacket() can release the packet before the transmission has actually been completed. While the Tx complete interrupt happens after the osSemaphoreAcquire() call, the code continues running in this crippled mode, corrupting packets and driver/stack state.

When the Tx complete interrupt happens before the osSemaphoreAcquire() call, the osSemaphoreRelease() call in the Tx complete interrupt tries to increase the semaphore count, but fails to do it because the count already is at the maximum. At this moment the code actually corrects the wrongly initialized semaphore count and the osSemaphoreAcquire() call starts to actually wait for the Tx complete interrupt. That means the packets will not be corrupted further, but the driver/stack state can potentially already be damaged and is not guaranteed to recover.

After the semaphore count is corrected or if CMSIS-RTOSv1 is used, another issue can happen. The HAL_ETH_Transmit_IT() function can return an error if, for example, there are not enough descriptors available at the moment. In this case the transmission will not be started at all and the osSemaphoreAcquire() call will wait forever. This means that the function low_level_output() and therefore the whole lwIP core is deadlocked.

In addition, if the low_level_output() function is designed to block until the packet transmission is complete, there is absolutely no point in using the pbuf_ref() and the respective pbuf_free() in HAL_ETH_TxFreeCallback() function. Incrementing the PBUF reference count is necessary for a design, where the low_level_output() function doesn't block and the PBUF is released asynchronously.

The RxPktSemaphore has exactly the same wrong semaphore initial count, but, apart from a single additional useless loop iteration in the ethernetif_input() function, the bug has no critical consequences.

Solution

First, initialize the semaphores with a count of 0 in function low_level_init(). Therefore this CubeMX script in ethernetif_***.ftl files:

 

[#if cmsis_version = "v2"]
  RxPktSemaphore = osSemaphoreNew(1, 1, NULL);
  TxPktSemaphore = osSemaphoreNew(1, 1, NULL);
[#else][#-- else cmsis_version --]
  RxPktSemaphore = xSemaphoreCreateBinary();
  TxPktSemaphore = xSemaphoreCreateBinary();
[/#if][#-- endif cmsis_version --]

 

Must be changed to this:

 

[#if cmsis_version = "v1"]
  RxPktSemaphore = osSemaphoreCreate(osSemaphore(RxSem), 1);
  TxPktSemaphore = osSemaphoreCreate(osSemaphore(TxSem), 1);
  /* Decrease the semaphores' initial count from 1 to 0 */
  osSemaphoreWait(RxPktSemaphore, 0);
  osSemaphoreWait(TxPktSemaphore, 0);
[#else][#-- else cmsis_version --]
  RxPktSemaphore = osSemaphoreNew(1, 0, NULL);
  TxPktSemaphore = osSemaphoreNew(1, 0, NULL);
[/#if][#-- endif cmsis_version --]

 

In addition to fixing the bugs, I've made other improvements. The CMSIS-RTOSv1 code was using FreeRTOS API, which was functionally correct but inconsistent. The code examples are also using the same inconsistent mix of those two APIs. Also I swapped the order of [if-else] blocks for the CubeMX script to a more logical one and consistent with other parts of the script. I would recommend ST to keep the same order everywhere, which is not the case currently.

Second, the return value of the HAL_ETH_Transmit_IT() function must be checked and, if something other than HAL_OK was returned, the osSemaphoreAcquire() and HAL_ETH_ReleaseTxPacket() calls must be skipped and instead the pbuf_free() must be called to revert the increment done by the earlier pbuf_ref() call, if it is used.

 

pbuf_ref(p);
if (HAL_ETH_Transmit_IT(&heth, &TxConfig) == HAL_OK) {
	while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
	{
	}
	HAL_ETH_ReleaseTxPacket(&heth);
} else {
	pbuf_free(p);
}

 

Additional information

Although with incomplete description of consequences and their severity, this bug and a correct solution has also been reported there:

https://github.com/STMicroelectronics/STM32CubeF4/issues/127

And the fact of a deadlock happening is also reported there:

https://github.com/STMicroelectronics/STM32CubeH7/issues/224

26 REPLIES 26

Here is a patch file for ethernetif.c

diff --git "a/LWIP/Target/ethernetif.c" "b/LWIP/Target/ethernetif.c"
index 88f5e6e..c689082 100644
--- "a/LWIP/Target/ethernetif.c"
+++ "b/LWIP/Target/ethernetif.c"
@@ -3,11 +3,11 @@
   ******************************************************************************
   * File Name          : ethernetif.c
   * Description        : This file provides code for the configuration
-  *                      of the Target/ethernetif.c MiddleWare.
+  *                      of the ethernetif.c MiddleWare.
   ******************************************************************************
   * @attention
   *
-  * Copyright (c) 2022 STMicroelectronics.
+  * Copyright (c) 2023 STMicroelectronics.
   * All rights reserved.
   *
   * This software is licensed under terms that can be found in the LICENSE file
@@ -33,7 +33,7 @@
 
 /* Within 'USER CODE' section, code will be kept by default at each generation */
 /* USER CODE BEGIN 0 */
-
+#include "App.h"
 /* USER CODE END 0 */
 
 /* Private define ------------------------------------------------------------*/
@@ -102,7 +102,18 @@ ETH_DMADescTypeDef  DMARxDscrTab[ETH_RX_DESC_CNT]; /* Ethernet Rx DMA Descriptor
 ETH_DMADescTypeDef  DMATxDscrTab[ETH_TX_DESC_CNT]; /* Ethernet Tx DMA Descriptors */
 
 /* USER CODE BEGIN 2 */
-
+CCMRAM StaticSemaphore_t RxSemCB;
+osSemaphoreAttr_t RxAttr = {
+		.name = "RxSemaphore",
+		.cb_mem = &RxSemCB,
+		.cb_size = sizeof(RxSemCB),
+};
+CCMRAM StaticSemaphore_t TxSemCB;
+osSemaphoreAttr_t TxAttr = {
+		.name = "TxSemaphore",
+		.cb_mem = &TxSemCB,
+		.cb_size = sizeof(TxSemCB),
+};
 /* USER CODE END 2 */
 
 osSemaphoreId RxPktSemaphore = NULL;   /* Semaphore to signal incoming packets */
@@ -244,10 +255,10 @@ static void low_level_init(struct netif *netif)
   #endif /* LWIP_ARP */
 
   /* create a binary semaphore used for informing ethernetif of frame reception */
-  RxPktSemaphore = osSemaphoreNew(1, 0, NULL);
+  RxPktSemaphore = osSemaphoreNew(1, 0, &RxAttr);//StaticSemaphore_t
 
   /* create a binary semaphore used for informing ethernetif of frame transmission */
-  TxPktSemaphore = osSemaphoreNew(1, 0, NULL);
+  TxPktSemaphore = osSemaphoreNew(1, 0, &TxAttr);
 
   /* create the task that handles the ETH_MAC */
 /* USER CODE BEGIN OS_THREAD_NEW_CMSIS_RTOS_V2 */
@@ -382,7 +393,10 @@ static err_t low_level_output(struct netif *netif, struct pbuf *p)
 
   pbuf_ref(p);
 
+  __DSB();__ISB();
   HAL_ETH_Transmit_IT(&heth, &TxConfig);
+  __DSB();__ISB();
+
   while(osSemaphoreAcquire(TxPktSemaphore, TIME_WAITING_FOR_INPUT)!=osOK)
 
   {

Moving some data to TCM/CCM changes the speed of execution and therefore can indirectly hide or reveal some bugs, but it cannot fix those. The lines 65 and 67 are unnecessary. Instead the code must check the return status of the HAL_ETH_Transmit_IT() function as reported in the main post of this topic.

Also, as reported in another topic, the rewritten HAL ETH driver for F series has a bunch of flaws, including a broken internal ETH_Prepare_Tx_Descriptors() function, which is an absolutely critical issue.

PHolt.1
Senior III

OK, let me summarise this topic, for those with an IQ below 200.

New driver released.

Full of bugs.

Lots of bugs listed by description (always leaving out key details)

No actual fixed source published anywhere.

If I am wrong, for the 32F417, I will eat my board...

mmetzger
Associate II

I'm not sure if I'm correct here.

I use STM32F407 with freertos and lwip. Try to setup an UDP server example. Init is ok but with the first answer from the server code blocks at mem_free called from HAL_ETH_ReleaseTxPacket. The client receives the correct message but afterwards the function HAL_ETH_ReleaseTxPacket is called a second time and then I got a "LWIP_MEM_ILLEGAL_FREE("mem_free: illegal memory: double free");" because mem->used is 0.

It is the same if I use CMSYS V1 or V2 and also with initializing the semaphores to 0

idefix2000
Associate II

Package: H7, V1.11.1

uC: STM32H725IGKX

I'm running into the same bug as described by this topic using Freemodbus: the transmission is getting stuck after a sending data for a few minutes.

After implementing the suggested fix the microcontroller it is crashing on some internal FreeRTOS queue configASSERT(...)  - again after running a few minutes.

So this bug still persists and isn't fixed :(

Stacktrace:

 

 

xQueueSemaphoreTake() at queue.c:1,433 0x80114d6	
osSemaphoreAcquire() at cmsis_os2.c:1,609 0x80104da	
low_level_output() at ethernetif.c:412 0x800f884	
ethernet_output() at ethernet.c:312 0x801e8d0	
etharp_output_to_arp_index() at etharp.c:769 0x801cb94	
etharp_output() at etharp.c:868 0x801ccf2	
ip4_output_if_src() at ip4.c:1,007 0x801da22	
ip4_output_if() at ip4.c:818 0x801d8de	
tcp_output_segment() at tcp_out.c:1,607 0x801b37c	
tcp_output() at tcp_out.c:1,360 0x801afda	
xMBTCPPortSendResponse() at porttcp.c:313 0x801fa4a	
eMBTCPSend() at mbtcp.c:150 0x801f56c	
eMBPoll() at mb.c:402 0x801f42c	
modbus_task() at modbus.c:59 0x80018f4	
pxPortInitialiseStack() at port.c:214 0x80132dc	

 

 

 

 

 

xQueueSemaphoreTake() at queue.c:1,433 0x80114d6:

BaseType_t xQueueSemaphoreTake( QueueHandle_t xQueue, TickType_t xTicksToWait )
{
BaseType_t xEntryTimeSet = pdFALSE;
TimeOut_t xTimeOut;
Queue_t * const pxQueue = xQueue;

#if( configUSE_MUTEXES == 1 )
	BaseType_t xInheritanceOccurred = pdFALSE;
#endif

	/* Check the queue pointer is not NULL. */
	configASSERT( ( pxQueue ) );    

	/* Check this really is a semaphore, in which case the item size will be
	0. */
	configASSERT( pxQueue->uxItemSize == 0 );    <<-- !!!!! This is line 433

 

 

Last stuck position before going into FreeRTOS calls:

idefix2000_0-1695618693997.png

 

 

lealoureiro
Associate II

I'm trying to implement this fix/improvement even though I don't face any problems but I'm having some compilation errors.

I'm compiling the code using F4 1.27.1 HAL Firmware Package, using CMSISv1 but I'm getting the following error:

 

src/ethernetif.c:242:40: note: in expansion of macro 'osSemaphore'
  242 |     RxPktSemaphore = osSemaphoreCreate(osSemaphore(RxSem), 1);
      |                                        ^~~~~~~~~~~
lib/FreeRTOS/include/cmsis_os.h:665:2: error: 'os_semaphore_def_TxSem' undeclared (first use in this function); did you mean 'os_semaphore_def'?
  665 | &os_semaphore_def_##name

I replace directly in the code because I'm using VSC+PlatformIO instead. The code I have is:

    RxPktSemaphore = osSemaphoreCreate(osSemaphore(RxSem), 1);
    TxPktSemaphore = osSemaphoreCreate(osSemaphore(TxSem), 1);

    osSemaphoreWait(RxPktSemaphore, 0);
    osSemaphoreWait(TxPktSemaphore, 0);

Where the RxSem and TxSem objects come from? 

Thanks!

I believe I forgot to put definitions:

osSemaphoreDef(RxSem);
osSemaphoreDef(TxSem);
    
RxPktSemaphore = osSemaphoreCreate(osSemaphore(RxSem), 1);
TxPktSemaphore = osSemaphoreCreate(osSemaphore(TxSem), 1);

osSemaphoreWait(RxPktSemaphore, 0);
osSemaphoreWait(TxPktSemaphore, 0);