2022-11-30 12:48 AM
Hello,
ST's and all Ethernet experts, please.
STM32F767,
Nucleo-144, and custom board
no OS
lwIP 2.1.3, IPv4 only
STM32CubeIDE
no ETH interrupts used
Application:
Industrial frontend,
streaming "audio" data from SAIs via ethernet,
at high data rates, for long periods of time (weeks)
TCP is a must, losing packets is not an option.
Audio streaming mostly uses UDP, with interpolation
to mask lost packets - we are not allowed to do that.
Problem:
ETH transmission: the TCP header checksum is sent as zero (0x0000).
Depending on settings below (SRAM usage, CPU clock),
this happens after a few MB, or many GB of data,
at 25.6 Mbps it's running sometimes for hours,
sometimes it stops after a few minutes.
IP4 header checksum is okay (at least not 0).
All checked with Wireshark.
Then the PC side stops ACKnowledging,
then lwIP shuts down TCP.
Checked:
Findings:
Settings so far with an impact on the "zero-checksum-failure":
Best "setup" until now:
-> Much better, but still not perfect, failed on one board after 1 hour or so.
Having used FPGAs for years, I had the hope of leaving these
"assumed race conditions" behind (naive me...). At least in the
FPGA you can get control of these problems.
For the final product, right now it seems the STM32F7 is not an option.
Which is sad, after having spent a lot of time on that one, having the
firmware at about 99% finished.
So, what am I doing wrong?
Or is there a known issue?
Source code of ethernetif.c etc. attached.
P.S.: I spammed the code with lots of __DSB()... I think I can remove many of these. But as it seems to be memory related, I had some hope.
2022-11-30 03:53 AM
BTW, I am not the only one who encountered that problem:
And I wanted to add:
instruction cache and data cache are disabled (check via SCB->CCR)
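A minimal sketch of that cache check, kept hardware-independent so it can be tried on a PC. The bit positions are the Cortex-M7 CCR.DC and CCR.IC bits (the same ones CMSIS exposes as SCB_CCR_DC_Msk and SCB_CCR_IC_Msk); the helper names are my own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cortex-M7 SCB->CCR cache enable bits. */
#define CCR_DC (1UL << 16) /* data cache enable */
#define CCR_IC (1UL << 17) /* instruction cache enable */

/* Hypothetical helpers: take the raw CCR value so they also run off-target. */
static bool dcache_enabled(uint32_t ccr) { return (ccr & CCR_DC) != 0u; }
static bool icache_enabled(uint32_t ccr) { return (ccr & CCR_IC) != 0u; }
```

On the target this would be called as dcache_enabled(SCB->CCR) and icache_enabled(SCB->CCR); both should return false if the caches are really off.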
2022-11-30 04:50 AM
If you are willing to test wild hypotheses, try reverting to the state with many erroneous transmissions (so that you are confident that the error will occur very quickly), and after that switch DMAOMR.OSF off.
JW
2022-11-30 04:59 AM
I just posted the documentation error as a separate thread.
JW
2022-11-30 06:57 AM
Hi @Community member ,
thanks for the idea, right now I am so desperate that I welcome every idea, no matter how wild. And playing with that bit doesn't sound too wild.
The problem is that I don't know how the firmware can detect that error.
So first I have to find a way to detect that error, because I don't get any info back from the TX descriptors, and the MAC cannot write the TCP checksum back into its source buffer.
So I see that error only when it's too late: in my PC application (scope / analyser), in wireshark, and in my UART output that lwIP closed the connection due to not getting the ACKs for too long.
Right now I can only reset the OSF bit manually - when it's much too late.
Any ideas?
Just tried without the OSF bit:
nice, the zero-checksum error now comes after a few packets!
Need to think about that...
Right now the only solution that comes to my mind:
make 1 pbuf out of the 2 that lwIP builds for header and data. Ouch...
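Roughly what I have in mind for merging the two buffers, as a sketch only. The struct below is a hypothetical stand-in for the chain fields of lwIP's struct pbuf (next, payload, len), and flatten_chain is a name I made up, not an lwIP function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the chain fields of lwIP's struct pbuf. */
struct seg {
    struct seg *next;    /* next segment in the chain, or NULL */
    const void *payload; /* start of this segment's data */
    uint16_t len;        /* bytes in this segment */
};

/* Copy a chained frame into one contiguous buffer.
 * Returns the number of bytes copied, or 0 if the frame does not fit. */
static size_t flatten_chain(const struct seg *s, uint8_t *dst, size_t dst_size)
{
    size_t used = 0;

    for (; s != NULL; s = s->next) {
        if (s->len > dst_size - used) {
            return 0; /* destination too small for the whole frame */
        }
        memcpy(dst + used, s->payload, s->len);
        used += s->len;
    }
    return used;
}
```

In the real driver, lwIP's pbuf_copy_partial() should do the same job on an actual pbuf chain. The price is one extra copy of every frame, which is exactly the "ouch" at these data rates.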
2022-11-30 07:42 AM
It's such a stupid error, on my 2nd PC it's running flawlessly for almost 5 hours now.
EDIT: 5.5 hours, and zero checksum again.
That's really a pain in... everywhere.
2022-12-01 01:10 AM
I've been running Iperf2 TCP full-duplex at ~190 Mbps on F76x for weeks non-stop moving terabytes of data and running over 2^32 packets for both Rx and Tx without any issues. Therefore the issue should be a software issue and at least you can stop worrying about the MCU not being usable.
Even if the device sends a corrupted packet or several, it's almost impossible to accidentally break a TCP connection. That just doesn't sound right. But what exactly stops working? A single TCP connection, TCP subsystem, IP stack, driver, hardware? When it stops... Can the device create a new TCP connection? Does ICMP ping work? Does the driver still receive and can send packets?
I'll look into code and report here.
2022-12-01 03:13 AM
Hello @Piranha ,
thanks for chiming in!
And thanks for the good news about your long-term testing!
So there's hope, and it's "only" something stupid in my software.
This morning I had some hope: in case of an SAI buffer overflow, the SAI RX complete ISR might have grabbed another buffer that had already been handed to the ETH DMA. I removed that, but the problem remains.
I went through all ISRs again, even put in some IRQ blocking in some functions (preparing ETH TX descriptors, because these are pointing to the SAI buffers).
Here is some more info:
a) As soon as TCP header checksum = 0, the PC side stops ACK'ing, which seems to lead to a TCP timeout, the streaming connection's error callback is called and tells me via UART:
eErrIn = -13 = Connection aborted
b) Even after that streaming server stop, LwIP stats show that there are no lwIP errors, and 2 TCP servers / listening PCBs active:
c) Now when I send another TCP/Http command, in wireshark I see that the SYN from PC gets through, and even that the MCU sends a SYN / ACK, but again with checksum = 0, and again this packet is not ACK'd. The MCU throws out TCP retransmissions, same problem, checksum = 0.
So TCP is alive, but never gets any ACK due to sending checksum = 0.
d) UDP is still working, as I have seen from PTP still running in the background, and so does a UDP echo server which I can start manually from UART.
But: UDP header checksum = 0! Which is not the case before the "checksum breakdown".
e) All MCU register settings after "checksum breakdown" are still the same, I just compared these again with a still running version.
f) The TX descriptors show no errors either, although IHE / IPE might have been expected.
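For reference, this is roughly how I look at the descriptor status word, as a sketch. The bit positions are the TDES0 fields as I read them in the STM32F7 reference manual (OWN, CIC, IHE, ES, IPE); the struct and function names are my own. As far as I understand, the status bits are only meaningful once the DMA has cleared OWN.

```c
#include <stdbool.h>
#include <stdint.h>

/* TDES0 fields (normal TX descriptor), per my reading of the reference manual. */
#define TDES0_OWN       (1UL << 31) /* descriptor still owned by the DMA */
#define TDES0_CIC_SHIFT 22          /* checksum insertion control, 2 bits */
#define TDES0_CIC_MASK  (3UL << TDES0_CIC_SHIFT)
#define TDES0_IHE       (1UL << 16) /* IP header error */
#define TDES0_ES        (1UL << 15) /* error summary */
#define TDES0_IPE       (1UL << 12) /* IP payload (checksum) error */

/* Hypothetical container for the decoded fields. */
struct tx_status {
    bool owned_by_dma;
    bool error_summary;
    bool ip_header_error;
    bool ip_payload_error;
    unsigned cic; /* 0 = insertion off ... 3 = header + payload + pseudo-header */
};

static struct tx_status decode_tdes0(uint32_t tdes0)
{
    struct tx_status s;

    s.owned_by_dma     = (tdes0 & TDES0_OWN) != 0u;
    s.error_summary    = (tdes0 & TDES0_ES) != 0u;
    s.ip_header_error  = (tdes0 & TDES0_IHE) != 0u;
    s.ip_payload_error = (tdes0 & TDES0_IPE) != 0u;
    s.cic = (unsigned)((tdes0 & TDES0_CIC_MASK) >> TDES0_CIC_SHIFT);
    return s;
}
```

Besides the error flags, the cic field shows whether checksum insertion was actually requested for that particular descriptor.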
2022-12-01 11:21 PM
Until now I had not found the Wireshark feature that validates the checksum.
Now that I have turned it on, it gave me a good laugh because of the comment in brackets:
Checksum: 0x0000 incorrect, should be 0x242e (maybe caused by "TCP checksum offload"?)
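For anyone who wants to reproduce the "should be" value in software: a minimal sketch of the standard Internet checksum (RFC 1071) over the IPv4 pseudo-header plus the TCP segment. The function names are my own, and the caller is assumed to have zeroed the checksum field in the segment before the call.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the big-endian 16-bit words of data to a running ones' complement sum. */
static uint32_t sum16(uint32_t sum, const uint8_t *data, size_t len)
{
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1) {
        sum += (uint32_t)data[0] << 8; /* pad an odd trailing byte with zero */
    }
    return sum;
}

/* Expected TCP checksum for an IPv4 segment.
 * src/dst: 4-byte addresses in network byte order.
 * seg: TCP header + payload, with the checksum field (bytes 16..17) set to 0. */
static uint16_t tcp_checksum_ipv4(const uint8_t src[4], const uint8_t dst[4],
                                  const uint8_t *seg, uint16_t seg_len)
{
    /* Rest of the pseudo-header: zero byte, protocol 6 (TCP), TCP length. */
    const uint8_t pseudo[4] = {0, 6, (uint8_t)(seg_len >> 8), (uint8_t)seg_len};
    uint32_t sum = 0;

    sum = sum16(sum, src, 4);
    sum = sum16(sum, dst, 4);
    sum = sum16(sum, pseudo, 4);
    sum = sum16(sum, seg, seg_len);

    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16); /* fold the carries back in */
    }
    return (uint16_t)~sum;
}
```

Running the same function over a segment that already carries a correct checksum returns 0, which makes it usable as a quick validity test as well.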
2022-12-01 11:25 PM
BTW, I forgot to show the MAC and DMA register setup.
Maybe there's something wrong?
ETH->
MACCR = 0000CE0C
speed: 100M
duplex: FULL
transmitter: ON
receiver: ON
Interframe Gap = 96 bit times
IPv4 checksum offload: ON
Retry OFF (half-duplex)
MACSR = 00000000
MACFFR = 00000051
MACHTHR = 00000000
MACHTLR = 00000000
MACMIIAR = 00000050
MACMIIDR = 0000782D
MACFCR = 00000000
DMASR = 00660404
DMAIER = 0001A041
DMAOMR = 02202006
RSF
TSF
ST
OSF
SR
DMABMR = 02C12080
DMARDLAR = 2007C000
DMATDLAR = 2007D800
DMACHTDR = 2007E1D8
I have to admit that I don't really understand the MACFCR flow control register. Have to read more...
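To double-check the DMAOMR flags listed above, here is a small decoder sketch. The bit positions are the single-bit ETH_DMAOMR flags as I read them in the STM32F7 reference manual; the multi-bit threshold fields (RTC, TTC) are left out, and the table and function names are my own.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct bit_name {
    uint32_t mask;
    const char *name;
};

/* Single-bit ETH_DMAOMR flags, highest bit first. */
static const struct bit_name dmaomr_bits[] = {
    {1UL << 26, "DTCEFD"}, /* dropping of TCP/IP checksum error frames disable */
    {1UL << 25, "RSF"},    /* receive store and forward */
    {1UL << 24, "DFRF"},   /* disable flushing of received frames */
    {1UL << 21, "TSF"},    /* transmit store and forward */
    {1UL << 20, "FTF"},    /* flush transmit FIFO */
    {1UL << 13, "ST"},     /* start/stop transmission */
    {1UL << 7,  "FEF"},    /* forward error frames */
    {1UL << 6,  "FUGF"},   /* forward undersized good frames */
    {1UL << 2,  "OSF"},    /* operate on second frame */
    {1UL << 1,  "SR"},     /* start/stop receive */
};

/* Write the space-separated names of all set flags into out.
 * Returns the number of flags that are set. */
static int decode_dmaomr(uint32_t value, char *out, size_t out_size)
{
    int count = 0;
    size_t used = 0;

    if (out_size > 0) {
        out[0] = '\0';
    }
    for (size_t i = 0; i < sizeof dmaomr_bits / sizeof dmaomr_bits[0]; i++) {
        if ((value & dmaomr_bits[i].mask) == 0u) {
            continue;
        }
        count++;
        if (used < out_size) {
            int n = snprintf(out + used, out_size - used, "%s%s",
                             used ? " " : "", dmaomr_bits[i].name);
            if (n > 0) {
                used += (size_t)n;
            }
        }
    }
    return count;
}
```

Feeding it 0x02202006 gives "RSF TSF ST OSF SR", which matches the list in the dump.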