
STM32F7: ETH TCP checksum offload fails

LCE
Principal

Hello,

Calling all ST and Ethernet experts, please.

Setup:

  • STM32F767, Nucleo-144 and custom board
  • no OS
  • lwIP 2.1.3, IPv4 only
  • STM32CubeIDE
  • no ETH interrupts used

Application:

  • industrial frontend, streaming "audio" data from SAIs via Ethernet
  • high data rates, for long periods of time (weeks)
  • TCP is a must; losing packets is not an option
  • audio streaming mostly uses UDP with interpolation to mask lost packets; we are not allowed to do that

Problem:

ETH transmission: the TCP header checksum is ZERO (0).

Depending on the settings below (SRAM usage, CPU clock), this happens after a few MB or after many GB of data; at 25.6 Mbps it sometimes runs for hours, sometimes it stops after a few minutes.

IP4 header checksum is okay (at least not 0).

All checked with Wireshark.

Then the PC side stops ACKnowledging, 

then lwIP shuts down TCP.

Checked:

  • Same behaviour on Nucleo and custom board.
  • Checked on 2 different PCs.
  • lwIP stats don't show any errors.
  • "Transmit Store and Forward" is set in DMAOMR.
  • Transmit FIFO is deep enough for the packets (1514 B, no bigger packets).
  • Checksum offload (CIC = ETH_DMATXDESC_CIC_TCPUDPICMP_FULL) is activated in all TX descriptor Status words; see the sketch after this list. (BTW, there's a documentation error in RM0410, page 1785, which says the CIC bits are 28:27 in TDES1; they actually live in TDES0.)
  • Header checksums from lwIP are definitely 0 before the frames are handed to the ETH DMA (as required for offload).
  • The payload checksum error status bit in the Transmit Status vector is NEVER set.
  • Memory barriers are used as recommended (DMB, DSB).
  • The RM says reasons for checksum failure might be:
    • no end of frame written to the FIFO
    • incorrect length
  • Neither of these happens; I checked all descriptors.
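A hedged sketch of where those CIC bits live, using the F7 legacy HAL definitions: CIC is part of TDES0 (the descriptor's Status word), not TDES1 as the RM0410 text suggests. The function name is illustrative.

#include "stm32f7xx_hal.h"

/* Select "full" checksum insertion for one TX descriptor:
 * CIC = 0b11 inserts the IP header checksum plus the TCP/UDP/ICMP
 * payload checksum including the pseudo-header (TDES0 bits 23:22). */
void set_full_checksum_offload(ETH_DMADescTypeDef *desc)
{
    desc->Status &= ~ETH_DMATXDESC_CIC;
    desc->Status |=  ETH_DMATXDESC_CIC_TCPUDPICMP_FULL;
}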

Findings:

Settings so far with an impact on the "zero-checksum-failure":

  • CPU clock -> lower = better
  • which internal SRAM memory areas are used (DTCM / SRAM1)

Best "setup" until now:

  • CPU clock reduced to 192 MHz (216 is max for F767)
  • no use of DTCM, which sacrifices 1/4 of the internal SRAM

-> Much better, but still not perfect, failed on one board after 1 hour or so.

Having used FPGAs for years, I had hoped to leave these "assumed race conditions" behind (naive me...). At least in an FPGA you can get such problems under control.

For the final product, right now it seems the STM32F7 is not an option. Which is sad, after having spent a lot of time on it, with the firmware about 99% finished.

So, what am I doing wrong?

Or is there a known issue?

Source code of ethernetif.c etc. attached.

P.S.: I spammed the code with lots of __DSB() calls... I think I can remove many of them. But since the problem seems to be memory related, I had some hope.

29 REPLIES
Piranha
Chief II

What about the simplest test, an ICMP ping? Does that continue working, and what happens with its checksums?

You can try broadcasting some simple constant UDP packet and capture it with Wireshark. Then compare the before and after versions and check whether the checksum field is the only one that differs or some other bytes also differ.
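A hedged sketch of that test with the lwIP raw API (NO_SYS build, called from the main-loop context only). The port number and payload are arbitrary; SOF_BROADCAST filtering only applies if IP_SOF_BROADCAST is enabled in lwipopts.h.

#include <string.h>
#include "lwip/ip.h"
#include "lwip/udp.h"

static struct udp_pcb *test_pcb;
static const char test_data[] = "0123456789";     /* constant payload */

void udp_bcast_test_init(void)
{
    test_pcb = udp_new();
    if (test_pcb == NULL) {
        return;
    }
    ip_set_option(test_pcb, SOF_BROADCAST);       /* allow 255.255.255.255 */
    udp_bind(test_pcb, IP_ADDR_ANY, 7777);
}

void udp_bcast_test_send(void)                    /* e.g. once per second */
{
    struct pbuf *p = pbuf_alloc(PBUF_TRANSPORT, sizeof(test_data), PBUF_RAM);
    if (p == NULL) {
        return;
    }
    memcpy(p->payload, test_data, sizeof(test_data));
    udp_sendto(test_pcb, p, IP_ADDR_BROADCAST, 7777);
    pbuf_free(p);
}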

Maybe you are calling some lwIP function from an ISR? Maybe Newlib's crappy printf functions are called from ISRs?

If the PC side stops ACK'ing, then the connection is indeed aborted, not just stuck because of corrupted frames. It seems that some memory corruption is happening in the IP stack. Is the stack size big enough? How reliable is the rest of the firmware code at not f*cking up some unrelated memory?

LCE
Principal

Thanks for your further ideas!

UDP:

before failure: valid checksums

after failure: checksum = 0, but still working (checked with "Packet Sender")

PING:

before failure: valid checksums

after failure: checksum = 0 => ping fails

lwIP:

Even after the checksum failure, I get these okay stats from lwIP:

MEM HEAP
used 156 bytes of 15 kB (3044 max used)
# ERR: 0
 
lwIP internal memory pools (11):
 
RAW_PCB         used 0 of 4     max 0
UDP_PCB         used 4 of 6     max 4
TCP_PCB         used 0 of 12    max 7
TCP_PCB_LISTEN  used 2 of 4     max 2
TCP_SEG         used 1 of 128   max 32
REASSDATA       used 0 of 8     max 0
FRAG_PBUF       used 0 of 32    max 0
IGMP_GROUP      used 3 of 4     max 3
SYS_TIMEOUT     used 6 of 8     max 7
PBUF_REF/ROM    used 0 of 240   max 32
PBUF_POOL       used 28 of 36   max 30

and as I just saw again: the TCP stack still seems to be working; messages are simply not accepted by the other side due to the wrong checksum.

printf:

Holy cow, you might have found it!

When I had the TCP streaming running and then started the UDP echo test in parallel, the streaming immediately failed. And a reason for that might be that I'm using printf to UART for the data received from the echo test. But not from an ISR.

And another hint that printf, and my stupid usage of it, might be the culprit:

In the SAI RX DMA interrupts, when the buffers overflow because of network traffic / PC side too slow / whatever, I throw a short debug message out via printf / UART (TX DMA).

And the thing I observed:

The more buffer overflows I had, the higher the chance of the checksum failure.

So on another PC with less network traffic, these failures were much less common.

I'll go and check and remove...

Why's "Newlib's" printf so bad?

LCE
Principal

In my "sai.c" file, where the SAI_RX_(half)complete ISRs live, I have noted:

/* no printf in ISR */

A function that is called from that ISR uses printf if an error occurs... Ouch!
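One common way out, as a minimal sketch: the ISR only increments a counter, and the main loop does the printing. Names are illustrative.

#include <stdio.h>
#include <stdint.h>

static volatile uint32_t sai_overflow_count;

/* called from the SAI RX DMA ISR instead of printf() - ISR-safe */
void sai_note_overflow_from_isr(void)
{
    sai_overflow_count++;
}

/* called from the main loop, where printf() can do no harm */
void sai_report_overflows(void)
{
    static uint32_t reported;
    uint32_t now = sai_overflow_count;
    if (now != reported) {
        printf("SAI RX overflow, total %lu\r\n", (unsigned long)now);
        reported = now;
    }
}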

I just removed all printf calls that might happen while TCP streaming runs; now it's been running smoothly for half an hour on the "bad PC" (more network traffic, about 100 anti-virus programs), with what was until recently the worst configuration: using all SRAM incl. DTCM, and a CPU clock of 216 MHz.

... and just as I am writing this, BAM, it stops again, checksums = 0

Is it safe to assume that printf causes this problem only when called from an ISR?

If not, I have some other problem, because the control interface via HTTP / REST uses trillions of scanf and sprintf calls.

Which is so comfortable... I've been using 8-bit controllers for 20 years now and never used printf, not even for debugging with text output; and now, after 1 year of working with the STM32 and its (feels like) huge amounts of flash and SRAM, I've really gotten used to all kinds of printf.

LCE
Principal

Just found out:

TX FIFO flushing restores correct checksum.

Still, after a while checksum = 0.
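For reference, the flush itself is just the FTF bit in ETH_DMAOMR, which self-clears when the flush is done. A minimal sketch:

#include "stm32f7xx.h"

void eth_flush_tx_fifo(void)
{
    ETH->DMAOMR |= ETH_DMAOMR_FTF;             /* request the TX FIFO flush */
    while (ETH->DMAOMR & ETH_DMAOMR_FTF) { }   /* wait until it completes   */
}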

EDIT:

But at least I can get everything else back to work, like the HTTP interface.

So when the streaming PCB gets the connection-aborted error, I immediately flush the TX FIFO, and the checksums of everything else are good again.

Except for the data stream...

And I know a little bit more:

  • so the data and the length given to the FIFO somehow do not correspond; but I wonder why the descriptor's error bit is not set
  • I can't remember any register giving info about the FIFO status, so I have to read through the register descriptions again (a first attempt is sketched below)
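For what it's worth, RM0410 does describe a MAC debug register (ETH_MACDBGR) that exposes some TX FIFO state, which may help here. A hedged sketch; the bit positions are read from the RM's register map and should be double-checked.

#include <stdio.h>
#include "stm32f7xx.h"

void eth_log_tx_fifo_state(void)
{
    uint32_t dbg = ETH->MACDBGR;
    /* Bit positions per RM0410 (verify against the current RM revision):
     * TFF = Tx FIFO full, TFNE = Tx FIFO not empty, TFWA = write active */
    printf("MACDBGR=0x%08lX full=%lu not_empty=%lu write_active=%lu\r\n",
           (unsigned long)dbg,
           (unsigned long)((dbg >> 25) & 1U),
           (unsigned long)((dbg >> 24) & 1U),
           (unsigned long)((dbg >> 22) & 1U));
}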

LCE
Principal

The problem is still there...

I have thrown out all the HAL setup and init code and gone through all MAC & DMA registers; maybe it's a little bit better now. Sometimes it runs for hours, sometimes checksum = 0 after a few seconds.

On my private laptop, which uses a USB/Ethernet bridge with a Microchip LAN9512 (100M), runs only "normal" anti-virus programs, and is also newer with more CPU power, the checksum error happens less often.

On my work laptop, which runs (feels like) 100 anti-virus programs in the background, the problem occurs more often. So I checked the Intel I219LM Ethernet adapter settings and played with them (incl. TCP checksum offload); things only got worse.

But I can NOT see any external reason in the network stream that might provoke the checksum error. That wouldn't make sense anyway, because it's definitely the STM32 taking care of the TCP checksum; the PHY and the outside world are not involved.

RM0410, page 1786, about TX checksums:

"The result of this operation is indicated by the payload checksum error status bit in the Transmit Status vector (bit 12). The payload checksum error status bit is set when either of the following is detected:

– the frame has been forwarded to the MAC transmitter in Store-and-forward mode without the end of frame being written to the FIFO

– the packet ends before the number of bytes indicated by the payload length field in the IP header is received. 

When the packet is longer than the indicated payload length, the bytes are ignored as stuff bytes, and no error is reported. 

When the first type of error is detected, the TCP, UDP or ICMP header is not modified. 

For the second error type, still, the calculated checksum is inserted into the corresponding header field." - quote end

As no error bit is ever set, it could only be the payload length problem quoted above (the second error type). But I checked the descriptors, and all the length info is as it should be.

Very frustrating...

Piranha
Chief II

> You can try broadcasting some simple constant UDP packet and capture it with Wireshark. Then compare the before and after versions and check whether the checksum field is the only one that differs or some other bytes also differ.

Do this and report which bytes exactly differ.

> Why's "Newlib's" printf so bad?

Because it is written and optimized for speed on PCs with relatively huge virtual memory. That implementation should not be used in an MCU environment. You can read about its problems in this, this and this topic. And there is a good discussion about the topic on the EEVblog forum. Decent solutions are: eyalroz/printf, LwPRINTF, nanoprintf.
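A minimal sketch of wiring up one of those replacements, assuming the eyalroz/printf convention of a user-supplied putchar_() sink (check the library's README for the exact hook name and header path; the UART handle is illustrative):

#include "printf.h"                 /* eyalroz/printf, replaces <stdio.h> */
#include "stm32f7xx_hal.h"

extern UART_HandleTypeDef huart3;   /* illustrative debug UART */

/* All printf_()/sprintf_() output is funneled through this hook:
 * no heap, no Newlib reentrancy structures involved. */
void putchar_(char c)
{
    HAL_UART_Transmit(&huart3, (uint8_t *)&c, 1U, HAL_MAX_DELAY);
}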

@Piranha thanks again for having a look at this.

And indeed, I overlooked that the IP4 header checksum also fails; otherwise the UDP packets before and after failure are identical (except for the IP4 ID, but that must differ).

So I'll go and check whether the IP4 header checksum is always set to 0 before the frame is given to the MAC (the relevant lwIP settings are sketched below).
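For reference, these are the lwipopts.h settings that make lwIP leave the checksum fields at 0 for the MAC to fill in. A sketch of the usual ST/CubeMX-style configuration: CHECKSUM_BY_HARDWARE is ST's own convention, the CHECKSUM_GEN_*/CHECKSUM_CHECK_* macros are standard lwIP options.

#define CHECKSUM_BY_HARDWARE    1
#define CHECKSUM_GEN_IP         0   /* lwIP writes 0, MAC inserts the checksum */
#define CHECKSUM_GEN_UDP        0
#define CHECKSUM_GEN_TCP        0
#define CHECKSUM_GEN_ICMP       0
#define CHECKSUM_CHECK_IP       0   /* RX verification done by the MAC as well */
#define CHECKSUM_CHECK_UDP      0
#define CHECKSUM_CHECK_TCP      0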

UDP echo test 1
 
"0000"
 
before failure:
0000   d4 81 d7 86 d8 45 ce 22 29 43 04 20 08 00 45 00  
0010   00 20 00 7e 00 00 ff 11 d5 8f c0 a8 b2 46 c0 a8  
0020   b2 27 00 07 ed 91 00 0c cc 1d 30 30 30 30 00 00  
0030   00 00 00 00 00 00 00 00 00 00 00 00              
 
after failure:
0000   d4 81 d7 86 d8 45 ce 22 29 43 04 20 08 00 45 00  
0010   00 20 45 2d 00 00 ff 11 00 00 c0 a8 b2 46 c0 a8  
0020   b2 27 00 07 ed 91 00 0c 00 00 30 30 30 30 00 00  
0030   00 00 00 00 00 00 00 00 00 00 00 00              
 
=> IP4 and UDP checksums = 0
different IP4 ID (okay)
otherwise identical
 
--------------------------------------------------------
UDP echo test 2
 
"01234567890123456789"
 
before failure:
0000   d4 81 d7 86 d8 45 ce 22 29 43 04 20 08 00 45 00  
0010   00 30 00 81 00 00 ff 11 d5 7c c0 a8 b2 46 c0 a8  
0020   b2 27 00 07 ed 91 00 1c 22 4a 30 31 32 33 34 35  
0030   36 37 38 39 30 31 32 33 34 35 36 37 38 39        
 
after failure:
0000   d4 81 d7 86 d8 45 ce 22 29 43 04 20 08 00 45 00  
0010   00 30 45 2c 00 00 ff 11 00 00 c0 a8 b2 46 c0 a8  
0020   b2 27 00 07 ed 91 00 1c 00 00 30 31 32 33 34 35  
0030   36 37 38 39 30 31 32 33 34 35 36 37 38 39        
 
=> IP4 and UDP checksums = 0
different IP4 ID (okay)
otherwise identical
 

printf:

I'll have a look at this, as I really need this for the HTTP interface.

But for now with the standard printf:

As long as it is not called while streaming, it can't be the reason for that failure, right?

Also, if the heap is big enough, this shouldn't be an issue either?

General (mostly thinking out loud):

The fact that flushing the TX FIFO cures the problem (checksums are okay again) should hint at some problem with the packet length: the size the FIFO is told vs. the real size.

With lwIP's TCP, each packet comes with (at least) 2 chained packet buffers (pbufs), which are built in tcp_write():

  • the header: pbuf->payload = IP + TCP header
  • the data: pbuf->payload = "user data"

These are put on the TCP PCB's unsent queue, which is in turn emptied by tcp_output() calling tcp_output_segment(), which finalizes the header and calls the hardware output function.

There the descriptors are set up; for the checksum offload, most importantly the First / Last Segment bits and the length info.

In store-and-forward mode, the frame should only be forwarded from the FIFO to the MAC when it is complete (Last Segment bit set), as long as the packet is smaller than the FIFO.

Which should be the case, with packets of max 1514 bytes vs. the 2 kB FIFO.
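A simplified sketch (not the attached ethernetif.c) of that pbuf walk: one descriptor per pbuf, First Segment on the first, Last Segment on the last, and OWN handed to the DMA last of all, so the DMA never sees a half-built chain. next_free_desc is an illustrative name; OWN polling, wrap-around, cache/MPU maintenance and error handling are omitted.

#include "lwip/pbuf.h"
#include "lwip/err.h"
#include "stm32f7xx_hal.h"

extern ETH_DMADescTypeDef *next_free_desc;   /* illustrative */

err_t low_level_output_sketch(struct pbuf *p)
{
    ETH_DMADescTypeDef *first = next_free_desc;
    ETH_DMADescTypeDef *desc = first;

    for (struct pbuf *q = p; q != NULL; q = q->next) {
        desc->Buffer1Addr = (uint32_t)q->payload;
        desc->ControlBufferSize = q->len & ETH_DMATXDESC_TBS1;
        desc->Status = ETH_DMATXDESC_TCH | ETH_DMATXDESC_CIC_TCPUDPICMP_FULL;
        if (q == p)          desc->Status |= ETH_DMATXDESC_FS;
        if (q->next == NULL) desc->Status |= ETH_DMATXDESC_LS;
        if (q != p)          desc->Status |= ETH_DMATXDESC_OWN; /* later descs first */
        desc = (ETH_DMADescTypeDef *)desc->Buffer2NextDescAddr;
    }

    __DSB();                            /* all descriptor writes visible ... */
    first->Status |= ETH_DMATXDESC_OWN; /* ... before the first OWN is set  */
    __DSB();
    ETH->DMATPDR = 0;                   /* wake the TX DMA if it was suspended */

    next_free_desc = desc;
    return ERR_OK;
}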

IPv4 checksum:

now, that's interesting:

the 1st TCP packet whose TCP checksum fails (= 0) still has a valid IPv4 checksum.

After this packet, the IPv4 checksum also fails with 0.

Does that tell us anything, except that I have some TX FIFO / size / length / whatever problem?

This thing drives me crazy...

> the 1st TCP packet whose TCP checksum fails (= 0) still has a valid IPv4 checksum.

Is there anything particularly interesting/special/unusual in that packet? Can you perhaps post it?

JW

Right now it's stable as hell, even when I spam it with extra HTTP requests like crazy...

I "only" reduced TCP_MSS from 1460 (maximum) to 1220 and accordingly the SAI buffer size.

I'll build that back to 1460 and show the latest packets.