STM32F429 + NetxDUO - Heavy traffic problems

Davide Dalfra · ‎2025-07-07

Hello Folks

I'm looking for suggestions or hints on the best way to troubleshoot a strange problem we're experiencing.
We're running AzureRTOS/ThreadX + NetXDuo (version 6.1.0) with an MQTT client that subscribes to a broker, waits for a message and then replies.
As PHY, we're using the LAN8742A.

The problem we're experiencing is that while the application (based on pc) is flooding of messages the STM32F4 board, then it suddenly stuck somewhere on NetX side.

The rest of the application keeps running correctly. To troubleshoot the issue (which is highly reproducible) we have:

- Increased the IP packet pool
- Increased the number of RX descriptors
- Disabled all other tasks, leaving only the NetX IP instance and the task running the MQTT client (it waits for a message and then replies with a static message saying "Hello")
- Forced the link speed from 100 Mbit down to 10 Mbit

None of these attempts gave us a clue about what's happening. The only thing we've noticed is that, once this happens, the ETH ISR no longer triggers.

Just for your information, we're sending bursts of 10 messages, one every 10 ms. After 4 or 5 bursts, the IP stack gets stuck and we also lose the ping (the PC is pinging the board).

Any suggestion?

Regards

Davide

Ozone · ‎2025-07-07

> The problem we're experiencing is that while the application (based on pc) is flooding of messages the STM32F4 board, then it suddenly stuck somewhere on NetX side.

This sentence is not fully comprehensible.
Do you mean the PC floods the F4-based board, which suddenly stops responding?

I would recommend instrumenting the code in question, including the ETH interrupts. Perhaps use GPIO toggles to keep the additional load low, and use a scope.
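A minimal sketch of such instrumentation, not taken from the ST or NetX sources: the hypothetical helpers `probe_enter()` and `probe_exit()` raise and lower one probe line around a code section and count how often it was entered. `PROBE_WRITE` is an assumed macro; on the target it would wrap a real GPIO write (for example `HAL_GPIO_WritePin`), while the default below only records the level in a variable so the sketch builds on a PC.

```c
#include <stdint.h>

/* Hypothetical probe numbering: one scope channel per code section. */
enum { PROBE_ETH_ISR = 0, PROBE_RX_DEFERRED = 1, PROBE_COUNT = 2 };

#ifndef PROBE_WRITE
/* Host-side stand-in for a GPIO port: bit n mirrors probe line n.
   On target, define PROBE_WRITE(bit, level) as a real GPIO write instead. */
static volatile uint32_t probe_pins;
#define PROBE_WRITE(bit, level)                        \
    do {                                               \
        if (level) probe_pins |= (1u << (bit));        \
        else       probe_pins &= ~(1u << (bit));       \
    } while (0)
#endif

/* Entry counters, readable from a debugger after the stack stalls. */
static volatile uint32_t probe_hits[PROBE_COUNT];

/* Call at the start of the section to observe (e.g. first line of the ISR). */
static inline void probe_enter(unsigned probe)
{
    PROBE_WRITE(probe, 1);
    probe_hits[probe]++;
}

/* Call right before leaving the section. */
static inline void probe_exit(unsigned probe)
{
    PROBE_WRITE(probe, 0);
}
```

With one probe around the ETH interrupt handler and one around the deferred RX processing, the scope shows whether interrupts stop arriving or whether the thread stops servicing them.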

And review your error handling.
Perhaps overflow (loss of packets due to overrun) is not handled well, or at all.

And given the ETH buffer capacities and core performance of your setup, you might need to lower your expectations.
If I remember correctly, most application processors (Cortex-A, x86) and their associated network interface ICs have internal buffer capacity for at least two jumbo frames.

mbarg.1 · ‎2025-07-07

I had a similar problem with the H7, and my workaround involved rewriting the interface between HAL and NetX.

I tested almost all the examples, and with the original code none could withstand a SYN flood without crashing.

The STM code is based on multiplexing the Ethernet interrupts, passing them to a thread, demultiplexing them there, and then doing some processing on STM global variables with no mutexes.

With low traffic it works fine, but not under heavy load.

After rewriting the interrupt interface we achieved a stable solution: we found no way to crash it with any Ethernet traffic, bad packets or similar.

Obviously, CPU load is a limit: reduce the time and resources available to the Ethernet thread and you can lose some packets, but without any crash. This is why we use only the H7; we have heavy HTTP traffic over IPv6 and IPv4 in parallel.

Davide Dalfra · ‎2025-07-07

@Ozone wrote:
>> The problem we're experiencing is that while the application (based on pc) is flooding of messages the STM32F4 board, then it suddenly stuck somewhere on NetX side.
> This sentence is not fully comprehensible.
> Do you mean the PC floods the F4-based board, which suddenly stops responding?

Yes, I mean the PC floods the F4 board and the board suddenly stops responding.

> I would recommend instrumenting the code in question, including the ETH interrupts. Perhaps use GPIO toggles to keep the additional load low, and use a scope.
>
> And review your error handling.
> Perhaps overflow (loss of packets due to overrun) is not handled well, or at all.
I had a look with a profiling tool (Tracealyzer) and I see no problem with the interrupts there.
By the way, I could try enabling some of the debugging defines NetXDuo offers, to see if something interesting shows up.

The fact is that after I send the first burst (8 messages @ 10ms each) everything works fine. The second burst works too; after that, the third burst may either work or get stuck.

I could lower my expectations about performance; I can even accept losing packets, that's fine.
What I can't accept is that, after losing packets, communication can't be restored.
As the low-level interface between the LAN8742 and NetX, I'm using the integration provided in the ST examples.

Thanks
Davide

Davide Dalfra · ‎2025-07-07

@mbarg.1 wrote:
> I had a similar problem with the H7, and my workaround involved rewriting the interface between HAL and NetX.
> I tested almost all the examples, and with the original code none could withstand a SYN flood without crashing.
> The STM code is based on multiplexing the Ethernet interrupts, passing them to a thread, demultiplexing them there, and then doing some processing on STM global variables with no mutexes.
> With low traffic it works fine, but not under heavy load.
> After rewriting the interrupt interface we achieved a stable solution: we found no way to crash it with any Ethernet traffic, bad packets or similar.
> Obviously, CPU load is a limit: reduce the time and resources available to the Ethernet thread and you can lose some packets, but without any crash. This is why we use only the H7; we have heavy HTTP traffic over IPv6 and IPv4 in parallel.

I think you got the point. Something happens only under heavy load, due to something broken between the NetXDuo low-level interface and the LAN8742 integration.

Were you using NetX too, or LwIP?

Regards

Davide

Pavel A. · ‎2025-07-07

> first burst (8 messages @ 10ms each) everything work fine. Second burst works too, after that it could happen that Third burst work or stuck.

People familiar with testing network equipment know that proper tests run for thousands of hours, with all combinations of packet sizes and data patterns. ST does not provide any low-level examples or tests.

mbarg.1 · ‎2025-07-08

@Davide Dalfra : actually I do use ThreadX + NetXDuo.

Two years ago STM stopped supporting LwIP, even though they have now decided that on new processors they will move back to FreeRTOS+.

ThreadX so far (at least in my experience) has proven to be 100% reliable. One caveat: always check that you don't overflow the thread stacks; there is no warning, and it is very difficult to forecast the stack size. FreeRTOS, by contrast, had many more instabilities in several C functions.
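A hedged sketch of one way to check stack headroom after the fact. It assumes ThreadX's default behaviour of filling each thread stack with the byte 0xEF at creation (stack filling not disabled in the build) and a stack that grows downward, as on Cortex-M. The helper name `stack_unused_bytes()` is hypothetical; on target its arguments would come from the thread's stack start pointer and stack size.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: ThreadX default fill byte (TX_STACK_FILL is 0xEFEFEFEF). */
#define STACK_FILL_BYTE 0xEFu

/* Count how many bytes at the low end of a downward-growing stack still
   hold the fill pattern, i.e. were never written by the thread. */
size_t stack_unused_bytes(const uint8_t *stack_start, size_t stack_size)
{
    size_t unused = 0;
    while (unused < stack_size && stack_start[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

A result close to zero means the thread has used nearly its whole stack at some point. ThreadX can also be built with TX_ENABLE_STACK_CHECKING, in which case a handler registered through tx_thread_stack_error_notify() is called when stack corruption is detected.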

Even if the final design runs on custom boards, I always develop on STM Nucleo boards, to be able to share possible bugs and limitations with others and to exclude hardware-related problems.

As said above, look at the Ethernet interrupt routines and you can easily see that the proposed interface is nonsense.

Mike

mbarg.1 · ‎2025-07-08

@Pavel A. : to crash any demo, just use nc to generate a SYN flood; you do not need hours or days.

Mike

Davide Dalfra · ‎2025-07-11

@mbarg.1 I don't understand why they chose to move back to FreeRTOS. I found ThreadX (and all the USB + network stacks) more reliable than FreeRTOS.

Coming back to my problem, I was able to reproduce the issue on a NUCLEO-F429ZI as well, with the same code that runs on the custom board. This of course rules out a hardware problem on the custom board.

I had a look at how the interrupts on the ETH side work. Apart from one thing (where I did not understand why it is done that way), I see that when an RX interrupt arrives, its callback sets an event (via _nx_ip_driver_deferred_processing). In the event handling I did not see anything wrong except what I'm showing below.

Look at the "nx_driver_information" struct and how it's managed.

void HAL_ETH_RxCpltCallback(ETH_HandleTypeDef *heth)
{

  ULONG deffered_events;
  deffered_events = nx_driver_information.nx_driver_information_deferred_events;

  nx_driver_information.nx_driver_information_deferred_events |= NX_DRIVER_DEFERRED_PACKET_RECEIVED;

  if (!deffered_events)
  {
    /* Call NetX deferred driver processing.  */
    _nx_ip_driver_deferred_processing(nx_driver_information.nx_driver_information_ip_ptr);
  }
}

VOID  _nx_ip_driver_deferred_processing(NX_IP *ip_ptr)
{

    /* Set event flags to wake the IP helper thread, which will in turn
       call the driver with the NX_LINK_DEFERRED_PROCESSING command.  */
    tx_event_flags_set(&(ip_ptr -> nx_ip_events), NX_IP_DRIVER_DEFERRED_EVENT, TX_OR);
}

Then this is picked up in the ip_thread task, which processes the packet by issuing a driver request.

What I find a bit tricky are these two parts (but I might be wrong):

1st, nx_driver_information.nx_driver_information_deferred_events is set to 1 in the callback (as per the code snippet above) without any protection against the 2nd point.

2nd, "nx_driver_information.nx_driver_information_deferred_events = 0;" is executed here, with interrupts disabled:

static VOID  _nx_driver_deferred_processing(NX_IP_DRIVER *driver_req_ptr)
{
  TX_INTERRUPT_SAVE_AREA

  ULONG deferred_events;

  /* Disable interrupts.  */
  TX_DISABLE

  /* Pickup deferred events.  */
  deferred_events = nx_driver_information.nx_driver_information_deferred_events;
  nx_driver_information.nx_driver_information_deferred_events = 0;

  /* Restore interrupts.  */
  TX_RESTORE

  /* Check for a transmit complete event.  */
  if (deferred_events & NX_DRIVER_DEFERRED_PACKET_TRANSMITTED)
  {
    /* Process transmitted packet(s).  */
    HAL_ETH_ReleaseTxPacket(&eth_handle);
  }

  /* Check for received packet.  */
  if (deferred_events & NX_DRIVER_DEFERRED_PACKET_RECEIVED)
  {
    /* Process received packet(s).  */
    _nx_driver_hardware_packet_received();
  }

  /* Mark request as successful.  */
  driver_req_ptr->nx_ip_driver_status = NX_SUCCESS;
}

Am I on the right track?
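One way to make this handoff independent of interrupt masking, sketched here with C11 atomics rather than taken from the ST driver: the interrupt side posts a flag with an atomic OR and reports whether the thread needs waking, and the thread side takes and clears all pending flags in a single atomic exchange. The names `deferred_post()` and `deferred_take()` and the flag values are hypothetical, and nothing in this thread confirms that this handoff is the actual cause of the stall.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define EVT_PACKET_RECEIVED    0x1u
#define EVT_PACKET_TRANSMITTED 0x4u

static atomic_uint_fast32_t deferred_events;

/* Interrupt side: publish an event.
   Returns true when no event was pending before, i.e. the worker thread
   is idle and must be woken (on target: _nx_ip_driver_deferred_processing). */
bool deferred_post(uint_fast32_t event)
{
    uint_fast32_t previous = atomic_fetch_or(&deferred_events, event);
    return previous == 0;
}

/* Thread side: fetch every pending event and clear them in one step,
   so an event posted concurrently is either returned now or left pending. */
uint_fast32_t deferred_take(void)
{
    return atomic_exchange(&deferred_events, 0);
}
```

Because the thread empties the word in one exchange, the next `deferred_post()` after it always sees zero and asks for the thread to be woken again.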

D.

Davide Dalfra · ‎2025-07-13

Weekend news:

I found some more to work with, and on the interrupt side I stabilized the situation. I now get quite good ETH performance and I'm also able to recover the system after flooding.
I also tested with hping3, and it almost works.

I now have a question: sometimes I end up in HAL_ETH_ErrorCallback. How should the ETH peripheral be restored?
By setting a new event and picking it up in the worker thread to issue a restart? Any good suggestions?
I end up there with:

heth->gState = 35
heth->ErrorCode = 8
heth->DMAErrorCode = 33920
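
These values can be decoded by hand. The sketch below assumes that `DMAErrorCode` is a copy of bits from the F4 `ETH_DMASR` status register; the masks are redefined locally under that assumption and should be checked against the reference manual and the HAL headers in use. The helper `classify_dma_error()` and its enum are hypothetical, and the mapping to recovery steps is only a coarse suggestion.

```c
#include <stdint.h>

/* Assumed ETH_DMASR bit positions (STM32F4 reference manual layout). */
#define DMASR_RBUS (1u << 7)   /* receive buffer unavailable  */
#define DMASR_ETS  (1u << 10)  /* early transmit status       */
#define DMASR_FBES (1u << 13)  /* fatal bus error             */
#define DMASR_AIS  (1u << 15)  /* abnormal interrupt summary  */

typedef enum {
    DMA_ACTION_NONE,        /* nothing to recover */
    DMA_ACTION_RESUME_RX,   /* give descriptors back to the DMA, resume reception */
    DMA_ACTION_RESTART_ETH  /* stop and restart the peripheral */
} dma_action_t;

/* Map a DMA error word to a coarse recovery step. */
dma_action_t classify_dma_error(uint32_t dma_error)
{
    if (dma_error & DMASR_FBES) {
        return DMA_ACTION_RESTART_ETH;
    }
    if (dma_error & DMASR_RBUS) {
        return DMA_ACTION_RESUME_RX;
    }
    return DMA_ACTION_NONE;
}
```

Under these assumptions 33920 is 0x8480, that is AIS, ETS and RBUS set: the receive DMA ran out of free descriptors, which is consistent with a flood, and `classify_dma_error(33920)` returns `DMA_ACTION_RESUME_RX`. If the HAL headers match, `ErrorCode` 8 would be the HAL's DMA error flag and `gState` 35 (0x23) the started/busy state rather than the error state.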

Regards

Davide