STM32H743 Nucleo - Ethernet stops working

Torsten Jaekel · ‎2018-07-31

My project is based on the CubeMX RTOS LwIP web server demo. It works fine, except that after a long time, often hours, the Ethernet peripheral stops working. The failure is random: sometimes recv_err = netconn_recv(conn, &inbuf); returns something other than ERR_OK, but more often the ETH simply stops receiving. The ETH interrupt is no longer generated and the MCU has to be reset. The RTOS is still alive; only the ETH peripheral seems to hang.

When this happens, I cannot read any ETH I/O registers in the debugger. The block looks disabled or hung and no longer responds, as if the bus interface to the ETH were locked up.

It happens when I leave the dynamic Task View open in a web browser with HTTP auto-refresh (every second) for a very long time.

The ETH MAC peripheral seems to hang. It always happens eventually; only the time until failure is random. It dies after thousands of requests and packets and hours of running well, so it should not be an 'out of memory' issue.

Shortly before it stops working, the web page auto-refresh takes much longer (a huge delay) before the request is answered. Nothing else is running, and the FW should be doing the same thing all the time.

ETH on the STM32H7 is too unreliable when running for a long time (endlessly) with periodic HTTP requests. I have no clue how to debug this or what to look for once ETH has stopped working (it is no longer accessible via the debugger).

Torsten Jaekel · ‎2018-07-31

Update: it happened again, after more than 3 hours of running:

recv_err = netconn_recv(conn, &inbuf); returned -15, ERR_CLSD (Connection Closed). Why?
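For anyone logging these return values: a small lookup table makes the raw numbers readable. This is only a sketch. The table is a local copy of the err_enum_t values as I understand them for lwIP 2.0.x (the -15 and -13 entries match the codes reported in this thread), and lwip_err_name is a hypothetical helper, not part of lwIP. Check the values against lwip/err.h in the tree you actually build.

```c
/* Local copy of the lwIP 2.0.x error codes (assumption: verify against
   src/include/lwip/err.h). Index i holds the name of error code -i. */
static const char *lwip_err_name(int err)
{
    static const char *const names[] = {
        "ERR_OK", "ERR_MEM", "ERR_BUF", "ERR_TIMEOUT", "ERR_RTE",
        "ERR_INPROGRESS", "ERR_VAL", "ERR_WOULDBLOCK", "ERR_USE",
        "ERR_ALREADY", "ERR_ISCONN", "ERR_CONN", "ERR_IF",
        "ERR_ABRT", "ERR_RST", "ERR_CLSD", "ERR_ARG"
    };
    const int count = (int)(sizeof names / sizeof names[0]);

    /* lwIP error codes are zero or negative; anything else is unknown. */
    if (err > 0 || err <= -count) {
        return "unknown";
    }
    return names[-err];
}
```

In the receive loop this could be used as printf("recv: %d (%s)\n", recv_err, lwip_err_name(recv_err)); so that a log captured after hours of running shows which error came last.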

Torsten Jaekel · ‎2018-08-01

Update: it looks like the dynamic memory management of LwIP v2.0.3 has an issue. It still stops working after a long time, and the memory looks corrupted: the mailbox pointers are destroyed (many pointers to mboxes are NULL once it has stopped). My current assumption is that LwIP v2.0.3 has a problem in its dynamic buffer management (its own malloc).

After changing the memory layout I got error code -13 (ERR_ABRT) once, but now no such errors appear: it just stops (still at random, so I assume memory is corrupted).

Torsten Jaekel · ‎2018-08-01

Update: I only increased the RTOS stack size (configMINIMAL_STACK_SIZE) to 1024. There are no compile errors or warnings, but now I get an IP address from DHCP and HTTPD does not respond to any request. If I lower the RTOS task stack size again (e.g. to 256), it works again.

It is sooooo strange, especially since there are no compile or linker errors, but it seems to be sensitive to the memory layout (RTOS + LwIP RAM usage, size and location of the RAMs).

(I never had such trouble with STM32F7xx and LwIP v1.x.x, but STM32H7 with LwIP v2.x.x seems very odd and unreliable.)
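One possible explanation for the stack-size effect, not confirmed anywhere in this thread: FreeRTOS counts task stack depth in words, not bytes, and dynamically created task stacks come out of the fixed FreeRTOS heap (configTOTAL_HEAP_SIZE). If that heap is too small, task creation fails at run time, so the compiler and linker have nothing to complain about. The sketch below just does the arithmetic. The task count and heap size in the usage note are made-up numbers, and stacks_bytes and stacks_fit are hypothetical helpers.

```c
#include <stddef.h>

/* One stack word on a 32-bit Cortex-M FreeRTOS port is 4 bytes. */
#define STACK_WORD_BYTES 4u

/* Bytes of heap taken by the stacks of `tasks` tasks, each created with
   `depth_words` words of stack. Per-task control blocks and allocator
   overhead are ignored, so the real figure is somewhat higher. */
static size_t stacks_bytes(size_t tasks, size_t depth_words)
{
    return tasks * depth_words * STACK_WORD_BYTES;
}

/* Returns 1 if the stacks alone fit into a heap of `heap_bytes`, else 0. */
static int stacks_fit(size_t tasks, size_t depth_words, size_t heap_bytes)
{
    return stacks_bytes(tasks, depth_words) <= heap_bytes;
}
```

With an assumed 5 tasks and an assumed 16 KiB heap, 256 words per task needs 5120 bytes and fits, while 1024 words per task needs 20480 bytes and does not. In a real project, checking the return value of the task-creation call (or calling xPortGetFreeHeapSize() after start-up) would show whether this is what happens.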

Torsten Jaekel · ‎2018-08-03

Update: it turns out that the ETH input (receiver) is triggered (via mailboxes), but the mailbox content is 0x0. So instead of carrying the address of the received ETH/TCP/IP frame, the pointer is zero, and the LwIP stack closes the TCP/IP socket.

BTW: closing it also seems to disable the ETH MAC in general (for all further traffic).

The piece of code where it starts failing is in file 'sys_arch.c', in function 'sys_arch_mbox_fetch':

    else
    {
      event = osMessageGet (*mbox, osWaitForever);
      *msg = (void *)event.value.v;
      return (osKernelSysTick() - starttime);
    }

The event.value.v is 0x0, which leads to this code in file 'api_lib.c', function 'netconn_recv_data':

    /* If we are closed, we indicate that we no longer wish to use the socket */
    if (buf == NULL) {
      API_EVENT(conn, NETCONN_EVT_RCVMINUS, 0);
      if (conn->pcb.ip == NULL) {
        /* race condition: RST during recv */
        return conn->last_err == ERR_OK ? ERR_RST : conn->last_err;
      }
      /* RX side is closed, so deallocate the recvmbox */
      netconn_close_shutdown(conn, NETCONN_SHUT_RD);
      /* Don't store ERR_CLSD as conn->err since we are only half-closed */
      return ERR_CLSD;
    }

which closes the TCP/IP socket. Unfortunately, everything is dead afterwards: no other or new network connection can be created. The RTOS is still alive, but the FW has to be restarted to get the network back.

Why does this happen? No clue. It does not look like memory corruption (destroyed memory); it looks more like an unknown ETH frame, e.g. an empty ETH frame, or an issue with the STM32H7 MAC filter (forwarding a corrupted or unhandled packet to LwIP).
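To make the NULL-message path easier to reason about, here is a toy model of it. It is a sketch only: toy_mbox and everything around it are hypothetical names, single-threaded, and unrelated to the CMSIS-RTOS queue that sys_arch.c wraps. It relies on my understanding that the netconn layer deliberately posts a NULL message to the receive mailbox when the peer closes the connection, which would make a fetched 0x0 a "closed" marker rather than proof of corruption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MBOX_SLOTS 4u

/* Toy mailbox of pointers, implemented as a small ring buffer. */
typedef struct {
    void  *slot[TOY_MBOX_SLOTS];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of queued messages */
} toy_mbox;

/* Queue one message; NULL is a legal message. Returns 0, or -1 if full. */
static int toy_mbox_post(toy_mbox *mb, void *msg)
{
    assert(mb != NULL);
    if (mb->count == TOY_MBOX_SLOTS) {
        return -1;
    }
    mb->slot[(mb->head + mb->count) % TOY_MBOX_SLOTS] = msg;
    mb->count++;
    return 0;
}

/* Fetch the oldest message into *msg. Returns 0, or -1 if empty. */
static int toy_mbox_fetch(toy_mbox *mb, void **msg)
{
    assert(mb != NULL && msg != NULL);
    if (mb->count == 0) {
        return -1;
    }
    *msg = mb->slot[mb->head];
    mb->head = (mb->head + 1) % TOY_MBOX_SLOTS;
    mb->count--;
    return 0;
}

typedef enum { RX_EMPTY, RX_DATA, RX_CLOSED } rx_result;

/* Same test as the quoted lwIP code: a fetched NULL means "peer closed". */
static rx_result toy_recv(toy_mbox *mb, void **buf)
{
    if (toy_mbox_fetch(mb, buf) != 0) {
        return RX_EMPTY;
    }
    return (*buf == NULL) ? RX_CLOSED : RX_DATA;
}
```

Posting a frame pointer and then NULL yields RX_DATA followed by RX_CLOSED. If the real stack works the same way, ERR_CLSD on one connection is expected behaviour, and the open question is why no new connection can be accepted afterwards.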

JongOk Baek · ‎2018-10-04

Hi,

I have the same problem with UDP receive on an STM32H743ZI.

The same application works fine on an STM32F746T6.

(with FreeRTOS v9.0, LwIP 2.0.3)

When the problem occurs, all tasks keep running fine; only Ethernet has stopped.

So I would like to know how to resolve it.

If you have resolved it, please let me know.

(I will do the same.)

Thanks.

JongOk Baek · ‎2018-12-11

Hi Torsten,

Did you solve the problem?

I have been testing another TCP/IP stack, CycloneTCP, with the STM32H7.

It works well.

I have not seen the LwIP bug with it.

I will test more conditions with the STM32H7.

Best regards,

Torsten Jaekel · ‎2018-12-11

Hi JongOk,

no, I have not solved it (I have not looked at or debugged it for a while).

Good to know.

My guess is:

there might be some IP packets or messages on my network (e.g. routing messages, SNMP) which the LwIP stack does not decode and handle properly (SNMP is not enabled on my system).

So it might depend on the network I use (at home, with PowerLine modems, switches and some other devices on the network, e.g. Apple devices). The issue might not show up on other networks.

Also, changing to another TCP/IP stack can "solve" the problem.

I am sure it is not a HW issue on the MCU but rather an issue with the TCP/IP stack, its state machine etc.

Let's see. I have to test the network stack on my MCU again anyway.

KR1 · ‎2018-12-12

I've posted an answer to this issue on a different thread:

https://community.st.com/s/question/0D50X0000A4nCOmSQM/need-help-to-run-ethernet-communication-correctly-with-stm32h7-nucleo

but I can reiterate. Essentially, I have only managed to get Ethernet running on an older version of STM32Cube (STM32Cube_FW_H7_V1.1.0). The example code from the newer version (STM32Cube_FW_H7_V1.3.0), or the code generated directly by STM32CubeMX, simply does not work. Everything seems to crash after several minutes of operation, as most have indicated in this forum. So my temporary suggestion would be to get hold of STM32Cube_FW_H7_V1.1.0 and to modify the LwIP_HTTP_Server_Netconn_RTOS example. It's not the most elegant solution, but it should work until the firmware bugs are fixed.

JongOk Baek · ‎2018-12-13

Dear KR,

Thanks for your reply.

I tested my application with STM32Cube_H7_V1.1.0 (with LwIP).

It works well.

It does not have the problem seen with V1.3.0.

I will try downgrading only LwIP.

Thanks again.

Best regards,