[BUG] STM32 lwIP Ethernet driver Rx deadlock

Piranha
Chief II

This bug is present in F1, F2, F4 and F7 series examples and CubeMX generated code with RTOS and is one of the biggest flaws in ST's lwIP integration.

Problem

https://github.com/STMicroelectronics/STM32CubeF7/blob/3600603267ebc7da619f50542e99bbdfd7e35f4a/Projects/STM32F767ZI-Nucleo/Applications/LwIP/LwIP_HTTP_Server_Netconn_RTOS/Src/ethernetif.c#L376

lwIP core (also known as the "tcpip_thread") calls low_level_output(), which calls HAL_ETH_TransmitFrame(). While the latter is executing, the Ethernet input thread (the ethernetif_input() function) can call low_level_input(), which calls HAL_ETH_GetReceivedFrame_IT(). Because both of those HAL functions use HAL's ingeniously stupid "lock" mechanism, HAL_ETH_GetReceivedFrame_IT() returns HAL_BUSY. That in turn makes low_level_input() return NULL and ethernetif_input() go back to waiting for a semaphore.

Consequences

  • Received frames are not processed and Rx buffers are not released to DMA.
  • When the next frames are received, the code will iterate and try to process all previous frames up to the current frame. But again, if HAL_BUSY is returned at some iteration, the rest of the frames will be left unprocessed.
  • If no more frames arrive from the network, the frames waiting in Rx buffers will never be processed.
  • If all Rx descriptors have been used (OWN bit cleared) when the semaphore is acquired and HAL_BUSY is returned on processing the first frame, no more Rx complete interrupts will be generated. Consequently the semaphore will not be released and the Ethernet input thread will be stuck forever waiting for that semaphore.
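
The failure sequence above can be reproduced in a small host-side model. This is only a sketch under stated assumptions: every sim_* name below is hypothetical and stands in for the HAL handle, the HAL lock macro, HAL_ETH_TransmitFrame() and HAL_ETH_GetReceivedFrame_IT(); it is not ST's code. The only property it borrows from the HAL is that taking an already held lock fails immediately with a "busy" status instead of waiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for HAL_OK / HAL_BUSY / HAL_ERROR. */
typedef enum { SIM_OK, SIM_BUSY, SIM_ERROR } sim_status_t;

/* Hypothetical model of the Ethernet handle. */
typedef struct {
    bool     locked;      /* single lock flag shared by Tx and Rx paths */
    unsigned rx_pending;  /* frames waiting in Rx descriptors */
} sim_eth_t;

/* Models the HAL lock: fails at once instead of blocking. */
static sim_status_t sim_try_lock(sim_eth_t *h)
{
    if (h->locked) {
        return SIM_BUSY;
    }
    h->locked = true;
    return SIM_OK;
}

static void sim_unlock(sim_eth_t *h)
{
    assert(h->locked);  /* unlocking a free handle would be a logic error */
    h->locked = false;
}

/* Model the start and end of a transmit call that holds the lock. */
static sim_status_t sim_transmit_begin(sim_eth_t *h) { return sim_try_lock(h); }
static void sim_transmit_end(sim_eth_t *h) { sim_unlock(h); }

/* Models the receive call: it needs the same lock as transmit. */
static sim_status_t sim_get_received_frame(sim_eth_t *h)
{
    if (sim_try_lock(h) != SIM_OK) {
        return SIM_BUSY;            /* frame stays in its descriptor */
    }
    sim_status_t st = SIM_ERROR;    /* "no frame available" */
    if (h->rx_pending > 0) {
        h->rx_pending--;
        st = SIM_OK;
    }
    sim_unlock(h);
    return st;
}

/* Models one wake-up of the input thread: drain until a call fails.
   Returns the number of frames processed. */
static unsigned sim_input_pass(sim_eth_t *h)
{
    unsigned n = 0;
    while (sim_get_received_frame(h) == SIM_OK) {
        n++;
    }
    return n;
}
```

If sim_input_pass() runs between sim_transmit_begin() and sim_transmit_end(), it processes nothing and the pending frames stay where they are, which is the starting point for each consequence listed above.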

Solution

This code:

if(HAL_ETH_GetReceivedFrame_IT(&EthHandle) != HAL_OK)
	return NULL;

Must be replaced with this code:

HAL_StatusTypeDef status;
 
LOCK_TCPIP_CORE();
status = HAL_ETH_GetReceivedFrame_IT(&EthHandle);
UNLOCK_TCPIP_CORE();
 
if (status != HAL_OK) {
	return NULL;
}

ACCEPTED SOLUTION
Piranha
Chief II

The good news is that ST has fixed this bug in CubeMX v5.5.0. The bad news is that it's done in the same sub-optimal way the H7 series already had. They've not followed my suggestion, but have put the lwIP core locking in ethernetif_input() function:

do
{
	LOCK_TCPIP_CORE();
	p = low_level_input(netif);
	if (p != NULL)
	{
		if (netif->input(p, netif) != ERR_OK)
		{
			pbuf_free(p);
		}
	}
	UNLOCK_TCPIP_CORE();
} while (p != NULL);

As ST do not read documentation (Multithreading) and do not analyse code, they don't know that both of these..

  1. pbuf_free()
  2. netif->input() => tcpip_input() => tcpip_inpkt()

..are thread safe. Therefore the code locks lwIP core for a longer time than it's necessary and is sub-optimal. Poor fix, but at least it fixes the deadlock issue!

P.S. The examples are still flawed regarding this!
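
For comparison, here is a minimal sketch of the narrower locking argued for above, where only low_level_input() runs under the core lock and netif->input() and pbuf_free() are called after it is released. It relies on the claim in this post that both are thread safe when netif->input is tcpip_input(). Everything defined in the block (the struct stubs, the counting LOCK_TCPIP_CORE()/UNLOCK_TCPIP_CORE() macros, the stub low_level_input(), pbuf_free() and stub_input(), and the name ethernetif_drain()) is a hypothetical stand-in so the sketch compiles on a host; in a real project those names come from lwIP and ethernetif.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lwIP types used by the loop. */
struct pbuf { int unused; };
struct netif;
typedef int err_t;
#define ERR_OK 0
struct netif { err_t (*input)(struct pbuf *p, struct netif *n); };

/* Stand-in lock that only counts nesting depth, so the test can check
   what runs while the "core" is held. */
static int core_lock_depth;
#define LOCK_TCPIP_CORE()   ((void)core_lock_depth++)
#define UNLOCK_TCPIP_CORE() ((void)core_lock_depth--)

static struct pbuf frames[2];
static int frames_left = 2;
static int input_calls_while_locked;
static int frames_freed;

/* Stub driver read: hands out the queued frames, then NULL. */
static struct pbuf *low_level_input(struct netif *n)
{
    (void)n;
    return frames_left > 0 ? &frames[--frames_left] : NULL;
}

/* Stub free: only counts calls. */
static void pbuf_free(struct pbuf *p)
{
    assert(p != NULL);
    frames_freed++;
}

/* Stub input hook: records whether it was called under the lock and
   rejects every frame so that the pbuf_free() path is exercised. */
static err_t stub_input(struct pbuf *p, struct netif *n)
{
    (void)p;
    (void)n;
    if (core_lock_depth > 0) {
        input_calls_while_locked++;
    }
    return -1;
}

/* The receive loop with the lock held only around the driver call. */
static void ethernetif_drain(struct netif *netif)
{
    struct pbuf *p;
    do {
        LOCK_TCPIP_CORE();
        p = low_level_input(netif);
        UNLOCK_TCPIP_CORE();

        if (p != NULL && netif->input(p, netif) != ERR_OK) {
            pbuf_free(p);
        }
    } while (p != NULL);
}
```

Calling ethernetif_drain() on a netif whose input member is stub_input drains both stub frames without the input hook ever running under the lock.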

17 REPLIES
Amel NASRI
ST Employee

Hi @Piranha​ ,

Thanks for this detailed explanation and the suggested solution.

This is reported internally for further review by development teams.

-Amel

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

Hi, @Amel NASRI​ !

That's good to hear!

https://community.st.com/s/question/0D50X0000BOtfhnSQB

Please reopen that topic. The particular topic's purpose is not for ST employees, but for users of this forum and everyone on the internet. It's not a bug report in itself, but a collection of all Ethernet/lwIP related critical problems in one place. Therefore it's like a how-to, a single URL that can be given to most of the hundreds of "Ethernet/lwIP not working on STM32" voices out there. Two items are just the beginning... I'm already preparing at least 5 more bug reports on this and probably there will be even more!

ST has failed to provide a working Ethernet/lwIP solution since releasing STM32F107 in year 2008. Since that time the internet has been filled with ST's flawed code, which people are using and will keep using even if ST fixes current drivers, examples and CubeMX generated code. If all of these network related problems are not collected in a single place, 98% of people (which by the way applies to ST developers also) will never be able to fix those problems. So, please, at least don't suppress others in doing ST's job.

OK, I'll re-open it.

Thanks! :)

Hi...

It could be that this particular bug has been fixed, but as far as we can see, CubeMX is still doing weird things on an STM32F767 Nucleo board (e.g. setting PHY_ADDRESS=1 in some circumstances) and even a simple ICMP (ping) still hangs after some time (and we are talking about minutes, not days or weeks)...

I wonder how it can be that ST is not paying attention to this and is not focusing on such a crucial stack as lwIP...

Best Regards,

Giampaolo

Indeed, I was already using the new CubeMX version with the fix, and ICMP (ping) still hangs after some time.

I made a couple of changes, which I outline in a comment in this thread: https://community.st.com/s/question/0D50X0000BMErkFSQT/stm32f407-lwip-ip-stack-stop-working-during-tcp-port-scanning-from-vulnerabilities-test-tool

They improved things, but I left a test running overnight and it hung again after a couple of hours.

>lwIP core (also known as the "tcpip_thread") calls low_level_output(), which calls HAL_ETH_TransmitFrame().

>While the latter is processing, Ethernet input thread (ethernetif_input() function) can call low_level_input(), which calls HAL_ETH_GetReceivedFrame_IT().

>Because both of those HAL functions use HAL's ingeniously stupid "lock" mechanism, HAL_ETH_GetReceivedFrame_IT() returns HAL_BUSY. 

@Piranha​ and @Amel NASRI​, the problem with the ETH driver's incorrect LOCK implementation is identified correctly.

But the proposed fix of surrounding HAL_ETH_GetReceivedFrame_IT with LOCK_TCPIP_CORE() and UNLOCK_TCPIP_CORE() to make the ETH driver's receive and transmit mutually exclusive is NOT the most optimal fix.

>Therefore the code locks lwIP core for a longer time than it's necessary and is sub-optimal. 

That's correct. And the lwIP core would not need to be locked at all for receive if the incorrect locking in the ETH driver was fixed.

It is a simple fix in the ETH driver... Either REMOVE THE LOCK from the HAL_ETH_GetReceivedFrame_IT OR define DIFFERENT LOCKS for the ETH driver's receive and transmit.
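
The second option (different locks for receive and transmit) can be illustrated with a small host-side model. This is a sketch only: all demo_* names are hypothetical and do not exist in the HAL, the plain bool flags are not a real synchronization primitive, and the model assumes that the receive and transmit paths touch disjoint driver state, which would have to be verified against the actual driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for HAL_OK / HAL_BUSY / HAL_ERROR. */
typedef enum { DEMO_OK, DEMO_BUSY, DEMO_ERROR } demo_status_t;

/* Hypothetical handle with one lock flag per direction. */
typedef struct {
    bool     tx_locked;   /* guards transmit state only */
    bool     rx_locked;   /* guards receive state only */
    unsigned rx_pending;  /* frames waiting in Rx descriptors */
} demo_eth_t;

/* Models the start of a transmit call: takes only the Tx lock. */
static demo_status_t demo_tx_begin(demo_eth_t *h)
{
    if (h->tx_locked) {
        return DEMO_BUSY;
    }
    h->tx_locked = true;
    return DEMO_OK;
}

/* Models the end of a transmit call. */
static void demo_tx_end(demo_eth_t *h)
{
    assert(h->tx_locked);
    h->tx_locked = false;
}

/* Models the receive call: takes only the Rx lock, so a transmit in
   progress can no longer make it return "busy". */
static demo_status_t demo_get_received_frame(demo_eth_t *h)
{
    if (h->rx_locked) {
        return DEMO_BUSY;
    }
    h->rx_locked = true;

    demo_status_t st = DEMO_ERROR;  /* "no frame available" */
    if (h->rx_pending > 0) {
        h->rx_pending--;
        st = DEMO_OK;
    }

    h->rx_locked = false;
    return st;
}
```

With a transmit in progress (demo_tx_begin() called, demo_tx_end() not yet), demo_get_received_frame() still returns frames, while a second demo_tx_begin() is still rejected as busy.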

MHama.3
Associate II

@Piranha​ first of all, thanks for your efforts and your proposed solution.

I tried to do what you did, but LOCK_TCPIP_CORE() is not defined in my program:

HAL_StatusTypeDef status;
 
LOCK_TCPIP_CORE();
status = HAL_ETH_GetReceivedFrame_IT(&EthHandle);
UNLOCK_TCPIP_CORE();
 
if (status != HAL_OK) {
	return NULL;
}

I use lwIP without an RTOS and I am facing the same problem you had: I can ping my board if there is no traffic.

Where can I find the definition of LOCK_TCPIP_CORE()?

Best regards

Mosaab