STM32F7: ETH TCP checksum offload fails

LCE · ‎2022-11-30

Hello,

ST's and all Ethernet experts, please.

STM32F767,

Nucleo-144, and custom board

no OS

lwIP 2.1.3, IPv4 only

STM32CubeIDE

no ETH interrupts used

Application:

Industrial frontend,

streaming "audio" data from SAIs via ethernet,

at high data rates, for long periods of time (weeks)

TCP is a must, losing packets is not an option.

Audio streaming mostly uses UDP, and they use interpolation

to mask lost packets - we are not allowed to do that.

Problem:

ETH transmission: TCP header checksum is = ZERO = 0.

Depending on settings below (SRAM usage, CPU clock),

this happens after a few MB, or many GB of data,

at 25.6 Mbps it's running sometimes for hours,

sometimes it stops after a few minutes.

IPv4 header checksum is okay (at least not 0).

All checked with Wireshark.

Then the PC side stops ACKnowledging,

then lwIP shuts down TCP.

Checked:

Same behaviour on Nucleo and custom board.
Checked on 2 different PCs.
LwIP stats don't show any errors.
"Transmit Store and Forward" is set in DMAOMR.
Transmit FIFO is deep enough for packets (1514 B, no bigger packets).
Checksum offload (CIC = ETH_DMATXDESC_CIC_TCPUDPICMP_FULL) is activated in all TX descriptor Status registers (BTW, there's a documentation error in RM0410 which says CIC bits are 28..27 in DESC1, page 1785).
The checksum fields written by lwIP are definitely = 0 before the frames are handed to the ETH DMA.
Payload checksum error status bit in the Transmit Status vector is NEVER set.
Memory barriers are used as recommended (DMB, DSB).
RM says reasons for checksum failure might be:
- no end of frame written to FIFO
- incorrect length
Neither of these happens; I checked all descriptors.
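
For readers following the checklist above, here is a minimal sketch of the descriptor handover it describes: write buffer address, length and flags (including the checksum insertion control), then set OWN last, with barriers around that write. The struct, the function name tx_desc_submit and the bit values are assumptions for illustration; the values are meant to mirror the HAL's ETH_DMATXDESC_* definitions for TDES0 and should be checked against your headers. The fence is a host-side stand-in for __DMB() / __DSB().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed TDES0/TDES1 bit values; verify against stm32f7xx_hal_eth.h. */
#define TDES0_OWN        0x80000000u  /* descriptor is owned by the DMA   */
#define TDES0_LS         0x20000000u  /* last segment of the frame        */
#define TDES0_FS         0x10000000u  /* first segment of the frame       */
#define TDES0_CIC_MASK   0x00C00000u  /* checksum insertion control field */
#define TDES0_CIC_FULL   0x00C00000u  /* IP header + TCP/UDP/ICMP, full   */
#define TDES1_TBS1_MASK  0x00001FFFu  /* buffer 1 size                    */

/* Hypothetical descriptor layout (first four words only). */
typedef struct {
    volatile uint32_t status;     /* TDES0 */
    volatile uint32_t ctrl_size;  /* TDES1 */
    volatile uint32_t buf1_addr;  /* TDES2 */
    volatile uint32_t next_desc;  /* TDES3 */
} tx_desc_t;

/* Host stand-in; on the Cortex-M7 this would be __DMB() or __DSB(). */
static inline void barrier(void) { atomic_thread_fence(memory_order_seq_cst); }

/* Hand a single-buffer frame to the DMA. Returns 0 on success,
   -1 if the DMA still owns the descriptor, -2 on a bad length. */
int tx_desc_submit(tx_desc_t *d, uint32_t buf_addr, uint32_t len)
{
    if (d->status & TDES0_OWN)
        return -1;
    if (len == 0u || len > TDES1_TBS1_MASK)
        return -2;

    d->buf1_addr = buf_addr;
    d->ctrl_size = len;

    /* Keep unrelated configuration bits (e.g. chaining), set frame flags. */
    uint32_t s = d->status & ~(TDES0_FS | TDES0_LS | TDES0_CIC_MASK);
    s |= TDES0_FS | TDES0_LS | TDES0_CIC_FULL;
    d->status = s;

    barrier();                    /* everything above visible before OWN */
    d->status = s | TDES0_OWN;    /* ownership passes to the DMA here    */
    barrier();                    /* complete before the poll demand     */
    return 0;
}
```

On the target, the write to the transmit poll demand register would follow the second barrier. The sketch only shows the ordering; it does not explain the zero checksum by itself.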

Findings:

Settings so far with an impact on the "zero-checksum-failure":

CPU clock -> lower = better
usage of internal SRAM memory areas, use of DTCM / SRAM1

Best "setup" until now:

CPU clock reduced to 192 MHz (216 is max for F767)
no use of DTCM - which makes it lose 1/4 of internal SRAM

-> Much better, but still not perfect, failed on one board after 1 hour or so.

Having used FPGAs for years, I had the hope of leaving these

"assumed race conditions" behind (naive me...). At least in the

FPGA you can get control of these problems.

For the final product, right now it seems the STM32F7 is not an option.

Which is sad, after having spent a lot of time on that one, having the

firmware at about 99% finished.

So, what am I doing wrong?

Or is there a known issue?

Source code of ethernetif.c etc. attached.

P.S.: I spammed the code with lots of __DSB()... I think I can remove many of these. But as it seems to be memory related, I had some hope.

LCE · ‎2022-12-06

@Community member Thanks again for having another look at it.

Here are the 2 last good packets (Wireshark's "header analysis" for better readability), then the 1st bad packet (TCP cs = 0), and the 2nd bad packet (IPv4 cs = 0).

I have checked the payload, that's okay, starts with a 20 byte header incl. timestamp, packet number, then comes audio data.

Summary:

All is well and as it should be. Except for that "§%!* checksums. ;)

2nd last good packet:
 
Frame 40177: 1514 bytes on wire (12112 bits), 1514 bytes captured (12112 bits)
Ethernet II, Src: ce:22:29:43:04:20 (ce:22:29:43:04:20), Dst: Dell_Laptop
Internet Protocol Version 4, Src: 192.168.178.70, Dst: 192.168.178.39
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0xb8 (DSCP: EF PHB, ECN: Not-ECT)
    Total Length: 1500
    Identification: 0x68d7 (26839)
    000. .... = Flags: 0x0
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 255
    Protocol: TCP (6)
    Header Checksum: 0x66cd [correct]
    [Header checksum status: Good]
    [Calculated Checksum: 0x66cd]
    Source Address: 192.168.178.70
    Destination Address: 192.168.178.39
Transmission Control Protocol, Src Port: 9603, Dst Port: 52127, Seq: 38951341, Ack: 1, Len: 1460
    Source Port: 9603
    Destination Port: 52127
    [Stream index: 4]
    [Conversation completeness: Incomplete (12)]
    [TCP Segment Len: 1460]
    Sequence Number: 38951341    (relative sequence number)
    Sequence Number (raw): 38957876
    [Next Sequence Number: 38952801    (relative sequence number)]
    Acknowledgment Number: 1    (relative ack number)
    Acknowledgment number (raw): 285078846
    0101 .... = Header Length: 20 bytes (5)
    Flags: 0x018 (PSH, ACK)
    Window: 5840
    [Calculated window size: 5840]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x4506 [correct]
    [Checksum Status: Good]
    [Calculated Checksum: 0x4506]
    Urgent Pointer: 0
    [Timestamps]
    [SEQ/ACK analysis]
    TCP payload (1460 bytes)
Data (1460 bytes) looks good (easily identified by own header before audio data)
_______________________________________________________________________________
 
last good packet:
 
Frame 40178: 1514 bytes on wire (12112 bits), 1514 bytes captured (12112 bits) 
Ethernet II, Src: ce:22:29:43:04:20 (ce:22:29:43:04:20), Dst: Dell_Laptop
Internet Protocol Version 4, Src: 192.168.178.70, Dst: 192.168.178.39
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0xb8 (DSCP: EF PHB, ECN: Not-ECT)
    Total Length: 1500
    Identification: 0x68d8 (26840)
    000. .... = Flags: 0x0
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 255
    Protocol: TCP (6)
    Header Checksum: 0x66cc [correct]
    [Header checksum status: Good]
    [Calculated Checksum: 0x66cc]
    Source Address: 192.168.178.70
    Destination Address: 192.168.178.39
Transmission Control Protocol, Src Port: 9603, Dst Port: 52127, Seq: 38952801, Ack: 1, Len: 1460
    Source Port: 9603
    Destination Port: 52127
    [Stream index: 4]
    [Conversation completeness: Incomplete (12)]
    [TCP Segment Len: 1460]
    Sequence Number: 38952801    (relative sequence number)
    Sequence Number (raw): 38959336
    [Next Sequence Number: 38954261    (relative sequence number)]
    Acknowledgment Number: 1    (relative ack number)
    Acknowledgment number (raw): 285078846
    0101 .... = Header Length: 20 bytes (5)
    Flags: 0x018 (PSH, ACK)
    Window: 5840
    [Calculated window size: 5840]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x838e [correct]
    [Checksum Status: Good]
    [Calculated Checksum: 0x838e]
    Urgent Pointer: 0
    [Timestamps]
    [SEQ/ACK analysis]
    TCP payload (1460 bytes)
Data (1460 bytes) looks good (easily identified by own header before audio data)
_______________________________________________________________________________
 
1st bad packet: TCP checksum = 0
 
Frame 40179: 1514 bytes on wire (12112 bits), 1514 bytes captured (12112 bits) 
Ethernet II, Src: ce:22:29:43:04:20 (ce:22:29:43:04:20), Dst: Dell_Laptop
Internet Protocol Version 4, Src: 192.168.178.70, Dst: 192.168.178.39
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0xb8 (DSCP: EF PHB, ECN: Not-ECT)
    Total Length: 1500
    Identification: 0x68d9 (26841)
    000. .... = Flags: 0x0
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 255
    Protocol: TCP (6)
    Header Checksum: 0x66cb [correct]
    [Header checksum status: Good]
    [Calculated Checksum: 0x66cb]
    Source Address: 192.168.178.70
    Destination Address: 192.168.178.39
Transmission Control Protocol, Src Port: 9603, Dst Port: 52127, Seq: 38954261, Ack: 1, Len: 1460
    Source Port: 9603
    Destination Port: 52127
    [Stream index: 4]
    [Conversation completeness: Incomplete (12)]
    [TCP Segment Len: 1460]
    Sequence Number: 38954261    (relative sequence number)
    Sequence Number (raw): 38960796
    [Next Sequence Number: 38955721    (relative sequence number)]
    Acknowledgment Number: 1    (relative ack number)
    Acknowledgment number (raw): 285078846
    0101 .... = Header Length: 20 bytes (5)
    Flags: 0x018 (PSH, ACK)
    Window: 5840
    [Calculated window size: 5840]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x0000 incorrect, should be 0x9619(maybe caused by "TCP checksum offload"?)
    [Checksum Status: Bad]
    [Calculated Checksum: 0x9619]
    Urgent Pointer: 0
    [Timestamps]
    [SEQ/ACK analysis]
    TCP payload (1460 bytes)
Data (1460 bytes) looks good (easily identified by own header before audio data)
_______________________________________________________________________________
 
2nd bad packet: TCP and IPv4 checksum = 0
 
Frame 40180: 1514 bytes on wire (12112 bits), 1514 bytes captured (12112 bits)
Ethernet II, Src: ce:22:29:43:04:20 (ce:22:29:43:04:20), Dst: Dell_Laptop
Internet Protocol Version 4, Src: 192.168.178.70, Dst: 192.168.178.39
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0xb8 (DSCP: EF PHB, ECN: Not-ECT)
    Total Length: 1500
    Identification: 0x68da (26842)
    000. .... = Flags: 0x0
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 255
    Protocol: TCP (6)
    Header Checksum: 0x0000 incorrect, should be 0x66ca(may be caused by "IP checksum offload"?)
    [Header checksum status: Bad]
    [Calculated Checksum: 0x66ca]
    Source Address: 192.168.178.70
    Destination Address: 192.168.178.39
Transmission Control Protocol, Src Port: 9603, Dst Port: 52127, Seq: 38955721, Ack: 1, Len: 1460
    Source Port: 9603
    Destination Port: 52127
    [Stream index: 4]
    [Conversation completeness: Incomplete (12)]
    [TCP Segment Len: 1460]
    Sequence Number: 38955721    (relative sequence number)
    Sequence Number (raw): 38962256
    [Next Sequence Number: 38957181    (relative sequence number)]
    Acknowledgment Number: 1    (relative ack number)
    Acknowledgment number (raw): 285078846
    0101 .... = Header Length: 20 bytes (5)
    Flags: 0x018 (PSH, ACK)
    Window: 5840
    [Calculated window size: 5840]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x0000 incorrect, should be 0x1801(maybe caused by "TCP checksum offload"?)
    [Checksum Status: Bad]
    [Calculated Checksum: 0x1801]
    Urgent Pointer: 0
    [Timestamps]
    [SEQ/ACK analysis]
    TCP payload (1460 bytes)
Data (1460 bytes) looks good (easily identified by own header before audio data)
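
The values Wireshark reports as "should be" can be reproduced in software, which is handy for checking a frame buffer on the target before it goes to the DMA. Below is a minimal sketch of the Internet checksum (RFC 1071) applied to the IPv4 header of frame 40177 from the capture above, with the checksum field set to zero. The function and array names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 one's-complement checksum over big-endian 16-bit words. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                       /* odd trailing byte, padded with 0 */
        sum += (uint32_t)data[i] << 8;

    while (sum >> 16)                  /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}

/* IPv4 header of frame 40177, checksum bytes (offset 10..11) zeroed. */
const uint8_t frame_40177_ip_hdr[20] = {
    0x45, 0xb8,             /* version 4, IHL 5, DSCP EF         */
    0x05, 0xdc,             /* total length 1500                 */
    0x68, 0xd7,             /* identification                    */
    0x00, 0x00,             /* flags, fragment offset            */
    0xff, 0x06,             /* TTL 255, protocol TCP             */
    0x00, 0x00,             /* header checksum (zero for calc)   */
    192, 168, 178, 70,      /* source address                    */
    192, 168, 178, 39       /* destination address               */
};
```

inet_checksum(frame_40177_ip_hdr, 20) yields 0x66cd, the value in the capture. With the identification changed to 0x68da (frame 40180) it yields 0x66ca, the value Wireshark says the header should have carried. The TCP checksum uses the same sum over the pseudo-header, TCP header and payload.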

LCE · ‎2022-12-06

What I don't get:

It's been perfectly stable now with reduced TCP_MSS (and reduced SAI buffer size).

waclawek.jan · ‎2022-12-06

I don't see anything suspicious in those packets, and I have no more ideas.

> It's been perfectly stable now with reduced TCP_MSS (and reduced SAI buffer size).

That of course may or may not be coincidental...

JW

LCE · ‎2022-12-06

@Community member Thanks again!

> That of course may or may not be coincidental...

That's the problem!

I'm going through the lwIP output functions again, where TCP_MSS might have an impact.

LCE · ‎2022-12-06

Same error, also with reduced MSS, but it ran much longer considering the circumstances (lots of parallel http and network traffic).

I checked all interrupts again and actually found that the IGMP timer might have called the output function from interrupt context. I changed it to set a flag so that igmp_tmr() is called from main; that didn't help.

But now I'm quite sure there are:

no calls to lwIP functions from interrupts
no (s)printf when data streaming
interrupts are disabled while the TX descriptors are prepared, until after writing to the poll register DMATPDR to resume transmission
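
The IGMP timer change described above (the interrupt only sets a flag, the main loop does the lwIP call) can be sketched like this. It is a host-runnable illustration, not the code from the attachment: igmp_tmr_stub() stands in for lwIP's igmp_tmr(), and the other names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

static volatile bool igmp_tick_pending;  /* set in the ISR, cleared in main */
static uint32_t igmp_tmr_calls;          /* counter for this sketch only    */

/* Stand-in for lwIP's igmp_tmr(); on the target, call the real one. */
static void igmp_tmr_stub(void) { igmp_tmr_calls++; }

/* Timer interrupt: record that work is due, touch nothing in lwIP. */
void timer_isr(void) { igmp_tick_pending = true; }

/* Called once per main loop iteration, outside interrupt context. */
void poll_timers(void)
{
    if (igmp_tick_pending) {
        igmp_tick_pending = false;
        igmp_tmr_stub();
    }
}

uint32_t igmp_tmr_call_count(void) { return igmp_tmr_calls; }
```

Several ticks that arrive before the main loop gets around to polling collapse into one call. That is usually acceptable for a coarse protocol timer, but it is a design choice worth knowing about.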

So far it happens less when - but I'm not so sure...:

DTCM is not used
CPU clock is below max 216 MHz
TCP_MSS < max 1460
more parallel http access (no interrupts used)
more network traffic on PC side

There must be something stupid somewhere...

Again, current ethernetif.c & co attached.

LCE · ‎2022-12-08

I think I found it....

After turning off all the stuff not needed for TCP transfers, including the source SAIs and sending only static buffers, nothing changed, still zero checksum.

Then I went a few weeks back when I did not yet have good software for testing on the PC side, at that time I built a packet queue between SAI DMA buffers and tcp_write().

And there's some "hole" which somehow might access the buffers given to ETH DMA.

Anyway, without the queue, transfers are whacky, but never fail due to checksum.
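
One way to rule out such a "hole" is to track buffer ownership explicitly. With zero-copy tcp_write() (no copy flag), the data has to stay untouched until the peer has ACKed it, which lwIP reports through the tcp_sent callback. The sketch below is an assumption about how a queue could enforce that, not the code from the attachment. All names are hypothetical, every slot is assumed to carry exactly SLOT_BYTES, and only streaming data is assumed to be ACKed on this connection.

```c
#include <stdint.h>

#define SLOT_COUNT 4u      /* hypothetical; the post uses 256 buffers */
#define SLOT_BYTES 1460u   /* one full TCP segment per slot           */

typedef enum { SLOT_FREE = 0, SLOT_FILLING, SLOT_READY, SLOT_IN_FLIGHT } slot_state_t;

typedef struct {
    slot_state_t state[SLOT_COUNT];
    uint32_t fill_idx;     /* next slot the producer (SAI DMA) may claim */
    uint32_t send_idx;     /* next filled slot to hand to tcp_write()    */
    uint32_t ack_idx;      /* oldest slot not yet fully ACKed            */
    uint32_t acked_bytes;  /* ACKed bytes not yet credited to a slot     */
} slot_ring_t;

/* Producer: claim a slot, or -1 if it is still in use (never overwrite). */
int ring_claim(slot_ring_t *r)
{
    uint32_t i = r->fill_idx;
    if (r->state[i] != SLOT_FREE)
        return -1;
    r->state[i] = SLOT_FILLING;
    r->fill_idx = (i + 1u) % SLOT_COUNT;
    return (int)i;
}

/* Producer: DMA has finished writing this slot. */
void ring_filled(slot_ring_t *r, int idx) { r->state[idx] = SLOT_READY; }

/* Consumer: next slot to pass to the TCP stack, or -1 if none is ready. */
int ring_next_to_send(slot_ring_t *r)
{
    uint32_t i = r->send_idx;
    if (r->state[i] != SLOT_READY)
        return -1;
    r->state[i] = SLOT_IN_FLIGHT;
    r->send_idx = (i + 1u) % SLOT_COUNT;
    return (int)i;
}

/* To be called from the sent callback with the number of bytes ACKed. */
void ring_on_acked(slot_ring_t *r, uint32_t len)
{
    r->acked_bytes += len;
    while (r->acked_bytes >= SLOT_BYTES && r->state[r->ack_idx] == SLOT_IN_FLIGHT) {
        r->state[r->ack_idx] = SLOT_FREE;
        r->ack_idx = (r->ack_idx + 1u) % SLOT_COUNT;
        r->acked_bytes -= SLOT_BYTES;
    }
}
```

A slot only returns to the producer after its last byte has been ACKed, so a retransmission can never read half-overwritten data. When ring_claim() returns -1 the producer has to stall or drop, which makes an overrun visible instead of silent.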

Although this needs some more testing, didn't have enough time this morning and had to get on the road - company's christmas party...

If that's it I'll come back, report, and mark topic as solved.

@Piranha & @Community member big thanks again for your help!

LCE · ‎2022-12-14

Again, coincidental...

I still have the checksum problem.

And I changed a lot over the last few days.

Let me start with a confession, and I really feel stupid about how I approached the STM32 in the first place:

I have worked with FPGAs and "slow" 8-bit controllers for years, together with some specialized interface ICs (e.g., high-speed USB).

So any time-critical stuff happened in the FPGA and between that and the "interface IC".

When working with FPGAs in VHDL, I basically had control over each bit, at each clock edge.

Being fed up with too many ICs (FPGA, SRAM, uC, IF-IC, ...) and too expensive FPGAs, we decided to try an all-in-one solution, and because I had some previous (good) experience with the STM32, we chose that one.

So far, so good.

Problem: I somehow approached the STM32 as if it were some kind of magic device: throw in some register values, then let it run...

Which means I did not worry about things like SRAM and DMA usage - where's what, how is it connected.

AND even worse, being used to the slow non-OS 8-bitters, I just threw everything in the main loop as I was used to. With the result that some things were constantly checked at full clock speed - absolutely not necessary, and surely blocking the internal busses.

An example:

For the SAI data buffers, which should go zero-copy to ethernet, I had some control variables "in front" of the data, which I checked very often. These being in the same RAM area, I definitely took some time from the DMAs working on the data buffers.

So again I changed a lot:

internal SRAM:
- DTCM:
  - SAI buffer control variables
  - ethernet descriptors
  - ALL other variables
- SRAM1:
  - SAI buffers that go zero copy to ethernet, 256 x 1460 bytes, 99% of the 368kB
- SRAM2:
  - only some no-init stuff used at hard fault, basically wasting 16 kB
main loop:
- go through some functions concerning SAI to ETH data transfer and PTP
- then alternating between "low priority" stuff
- then go to SLEEP, woken by any interrupt, but the DMAs should have a little more time in the background now
internal ADC:
- is used for monitoring supply voltages
- it had a MHz sampling clock... not anymore

So, did that help? No. :(

I still get zero checksums, which is at least "fixable" by flushing the TX FIFO (but that takes looong! Why?).

What I found so far, these things are making things worse -> higher chances for the checksum failures:

lwIP's http webserver with SSI tags is ... terrible:
- it's working, but it sends each tag response as a single pbuf to ETH, so there can be like 12 pbufs, some with only a few bytes, but FIRST and LAST segments look to be set correctly
- the more it is used in parallel with data streaming, the more often the failure occurs
when the PC side takes too long to ACK / get the data (starting the compiler is a good way of occupying the PC)

Could there be any kind of race condition between DTCM / SRAM1 / CPU / DMA2 / ETH-DMA, on the interface busses ?

(Again: I have lots of memory barriers in place, esp. before/after using the descriptors' OWN bit)

When the SAI DMA complete interrupt occurs, is it really done writing the buffers from the SAI FIFOs into internal SRAM?

What could "confuse" the ETH TX FIFO when filling it with many small buffers, like it's done with http / SSI ?

waclawek.jan · ‎2022-12-15

I understand your frustration, but it's very unlikely you will be able to get help here, as we are mere users with inevitably limited experience and zero access to inside information.

The Synopsys modules (both ETH and OTG_USB) used in STM32 are not only very complex, but also quirky and laden with historical layers. The documentation is lacking and it's only partially ST's fault, as they mostly copypaste what they purchased. To make things worse, the modules tend to change in time, i.e. they are around in various versions, and those have various quirks.

Look, for example, at the "Successive write operations to the same register might not be fully taken into account" erratum in the STM32F407 errata...

JW

LCE · ‎2022-12-15

Ah, that's where stuff like that in the HAL driver comes from:

/* Wait until the write operation will be taken into account :
  at least four TX_CLK/RX_CLK clock cycles */
tmpreg = (heth->Instance)->MACCR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->MACCR = tmpreg;

So maybe F4 & F7 ETH MAC are very much the same?

Will check...

Thanks again!

LCE · ‎2022-12-15

I changed register writing to this method, didn't change anything.

BUT... as I use the 128 kB DTCM as "main" RAM, I changed some variables' alignment to "8", and so I did with the lwIP stuff.
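
Which variables were realigned is not stated here, so the following is only a sketch of what forcing 8-byte alignment looks like in C11 and of one side effect. The type and function names are hypothetical. Note that 1460 is a multiple of 4 but not of 8, so aligning each packet slot to 8 pads the stride to 1464 bytes.

```c
#include <stdalign.h>
#include <stdint.h>

#define PKT_BYTES 1460u
#define PKT_COUNT 4u      /* hypothetical; kept small for the sketch */

/* Each slot starts on an 8-byte boundary; the compiler pads the struct. */
typedef struct {
    alignas(8) uint8_t data[PKT_BYTES];
} pkt_slot_t;

static pkt_slot_t pkt_slots[PKT_COUNT];

/* Fails at compile time if the stride would break the alignment. */
_Static_assert(sizeof(pkt_slot_t) % 8 == 0, "slot stride must be a multiple of 8");

int is_aligned8(const void *p) { return ((uintptr_t)p & 7u) == 0; }

const void *pkt_slot_addr(unsigned i) { return pkt_slots[i].data; }
```

With GCC on the target, __attribute__((aligned(8))) does the same job. The padding costs 4 bytes per slot, which matters when 256 slots already fill 99% of SRAM1.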

It has been running very stably for 3 hours now; only when I access the webserver at the same time with a page using SSI tags does the checksum error come quickly.

Spamming the device with GET requests to non-SSI pages (either html from flash/RAM or JSON replies) is no problem at all.

And in general, even on my work PC I have fewer buffer overflows.

Okay, let's dig into lwIP's SSI stuff...