CubeMX LWIP ethernet_link_thread bug

el_d · ‎2022-10-15

I am using stm32f407zg board with DP83848 ethernet PHY chip. CubeMX generates code with LWIP for FreeRTOS. App used to answer ping successfully when it starts with connected eth cable only and stops after its reconnecting. After some investigation i have noticed some suspicious place in code(see below). After adding "_IT" postfix it started to work, so looks like cube used wrong handler to stop/start ETH periphery(see lines 7 and 30). So my question is: is it a CubeMX bug or i just missed some delails?

Here is a cycle from ethernet_link_thread entry in ethernetif.c file:

for(;;)
  {
  PHYLinkState = DP83848_GetLinkState(&DP83848);
 
  if(netif_is_link_up(netif) && (PHYLinkState <= DP83848_STATUS_LINK_DOWN))
  {
    HAL_ETH_Stop_IT(&heth); //Here we stop ETH 
    netif_set_down(netif);
    netif_set_link_down(netif);
  }
  else if(!netif_is_link_up(netif) && (PHYLinkState > DP83848_STATUS_LINK_DOWN))
  {
    switch (PHYLinkState)
    {
    case DP83848_STATUS_100MBITS_FULLDUPLEX:
      duplex = ETH_FULLDUPLEX_MODE;
      speed = ETH_SPEED_100M;
      linkchanged = 1;
      break;
   ...
    }
 
    if(linkchanged)
    {
      /* Get MAC Config MAC */
      HAL_ETH_GetMACConfig(&heth, &MACConf);
      MACConf.DuplexMode = duplex;
      MACConf.Speed = speed;
      HAL_ETH_SetMACConfig(&heth, &MACConf);
      HAL_ETH_Start(&heth); // HERE IS A POSSIBLE BUG by using wrong handle to start ETH
      netif_set_up(netif);
      netif_set_link_up(netif);
    }
  }

Piranha · ‎2022-10-30

The reported CubeMX bug is present for all series with rewritten ETH drivers - F4, F7, H7. But unfortunately that bug is just a tip of the iceberg of bugs in that code made by incompetent fools.

The ethernet_link_thread() thread calls HAL_ETH_Stop_IT()/HAL_ETH_Start_IT() functions, which modify the descriptors and related variables, and HAL_ETH_SetMACConfig() function, which modifies the peripheral configuration. Now remember that the ethernetif_input() thread calls the low_level_input() and consequentially the HAL_ETH_ReadData(), which also reads and modifies descriptors and related variables. Obviously such calls cannot be done in parallel from multiple threads. The simplest solution is to get rid of the ethernet_link_thread() thread at all and do the link state management in ethernetif_input() thread on RxPktSemaphore semaphore timeout. Therefore the semaphore timeout becomes a link checking "period". This is not only a fix, but also a performance improvement - while the packets are quickly coming in, the link state will not be checked at all. Another improvement already suggested million times is replacing a semaphore with a thread notification/signal/flag.

Of course, as all ST code, this code also doesn't respect lwIP multi-threading rules. It's clearly explained in Common pitfalls and Multithreading that it is not allowed to call the core API without locking the lwIP core thread. And yet they call netif_is_link_up(), netif_set_link_up()/netif_set_link_down, netif_set_up()/netif_set_down() completely ignoring the rules. And how many million times it has to be repeated that netif_set_up()/netif_set_down() are not related to link state and must not be called in this code at all?

The same story goes on for DHCP_Thread() thread. That one also calls dhcp_start(), dhcp_supplied_address(), netif_get_client_data(), netif_set_addr(), netif_ip4_addr() and reads NETIF internal variables without locking the lwIP core thread. And this thread is even more useless because the DHCP state is managed by lwIP internally. Monitoring DHCP state and reporting IP address can easily be done by a callback set up with netif_set_status_callback() without wasting a thread on it.

The example code in the ST's article "How to create project for STM32H7 with Ethernet and LwIP stack working" also breaks lwIP multi-threading rules.

View solution in original post

Pavel A. · ‎2022-10-16

Welcome to the forum @Community member

This is a CubeMX bug (more correctly, bug in its template for generating ethernetif.c).

Piranha · ‎2022-10-30

The reported CubeMX bug is present for all series with rewritten ETH drivers - F4, F7, H7. But unfortunately that bug is just a tip of the iceberg of bugs in that code made by incompetent fools.

The ethernet_link_thread() thread calls HAL_ETH_Stop_IT()/HAL_ETH_Start_IT() functions, which modify the descriptors and related variables, and HAL_ETH_SetMACConfig() function, which modifies the peripheral configuration. Now remember that the ethernetif_input() thread calls the low_level_input() and consequentially the HAL_ETH_ReadData(), which also reads and modifies descriptors and related variables. Obviously such calls cannot be done in parallel from multiple threads. The simplest solution is to get rid of the ethernet_link_thread() thread at all and do the link state management in ethernetif_input() thread on RxPktSemaphore semaphore timeout. Therefore the semaphore timeout becomes a link checking "period". This is not only a fix, but also a performance improvement - while the packets are quickly coming in, the link state will not be checked at all. Another improvement already suggested million times is replacing a semaphore with a thread notification/signal/flag.

Of course, as all ST code, this code also doesn't respect lwIP multi-threading rules. It's clearly explained in Common pitfalls and Multithreading that it is not allowed to call the core API without locking the lwIP core thread. And yet they call netif_is_link_up(), netif_set_link_up()/netif_set_link_down, netif_set_up()/netif_set_down() completely ignoring the rules. And how many million times it has to be repeated that netif_set_up()/netif_set_down() are not related to link state and must not be called in this code at all?

The same story goes on for DHCP_Thread() thread. That one also calls dhcp_start(), dhcp_supplied_address(), netif_get_client_data(), netif_set_addr(), netif_ip4_addr() and reads NETIF internal variables without locking the lwIP core thread. And this thread is even more useless because the DHCP state is managed by lwIP internally. Monitoring DHCP state and reporting IP address can easily be done by a callback set up with netif_set_status_callback() without wasting a thread on it.

The example code in the ST's article "How to create project for STM32H7 with Ethernet and LwIP stack working" also breaks lwIP multi-threading rules.

el_d · ‎2022-10-30

Now i'm fighting with probable bug when TxPktSemaphore->uxItemSize becomes corrupted (not zero) somehow few minutes after start-up, it suddenly goes to while(1) during uxItemSize assertion as zero in semaphore getting call...

... so, your answer showed me a lot about it, it's a huge area to think about

@Piranha thank you so much!

Piranha · ‎2022-10-30

Could be related to this:

https://community.st.com/s/question/0D53W00001oImr4SAC/bug-stm32-lwip-ethernet-driver-tx-deadlock

And just in case you haven't found:

https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32

Piranha · ‎2022-10-30

Here are few examples how to do the lwIP core thread locking with LOCK_TCPIP_CORE()/UNLOCK_TCPIP_CORE() macros:

MSG_ST · ‎2022-12-06

Hi @Community member ,

Thank you for your feedback and sorry for the delayed answer.

The issue reported from your side concerning the "_IT" postfix and other bugs are already fixed in the last CubeMx version (6.7.0).

Regards

Mahdy

Piranha · ‎2022-12-06

None of the other bugs is fixed. Thread safety issues for HAL, thread safety issues for lwIP, wrong NETIF usage for lwIP and not even Tx deadlock have been fixed. So stop spreading lies about "other bugs" being fixed!

MSG_ST · ‎2022-12-06

You could refer to the MX 6.7.0 release note to check the other fixed bugs.

Piranha · ‎2022-12-07

Of course, there are fixes for other bugs unrelated to this topic, but the text "concerning the "_IT" postfix and other bugs" makes one think that it fixes other bugs reported in this topic, which is not the case. Can you comment on how many decades will people have to wait for your team to understand and fix all the other bugs reported here and elsewhere?

P.S. The nonsense, which you wrote in the other topic, is still an absolute nonsense...