2014-08-28 06:57 AM
Hello,
I'm currently porting an application from an STM32F103ZG with Wiznet W5300 Ethernet and the SPL to an STM32F429ZG with STM32CubeMX and LwIP. Both versions use FreeRTOS, and in both cases it's a custom board.

I'm struggling with the quite low performance of the Ethernet part. The application transmits a large amount of data from the SD card to a remote PC using the TCP raw API of LwIP. The transmit rate is about 20 kB/s.

Currently, I guess the biggest performance gain would come from optimizing the low-level Ethernet device driver. low_level_output() needs about 25 µs for 60 bytes and 200 µs for 1524 bytes. This indicates a limit of ~45 MBit/s just for writing data to the output, and then the device would have no time to do anything else.

low_level_output() uses memcpy() to copy the data. Is this an actual DMA transfer or CPU-bound work? In my case, the copy wouldn't be necessary at all: with LwIP, all the data stays statically in RAM as long as the packet hasn't been acknowledged.

Is there a zero-copy driver implementation around? There is a big thread in the ChibiOS forums on that topic. Has anyone already tried to use their findings together with the CubeMX driver and FreeRTOS? Unfortunately, I'm unable to download the Dropbox stuff from work.

Any other tips to increase the performance?

Versions used:
STM32F429ZG, 180 MHz clock
IM2516SDBATG 16 MB SDRAM holding the buffered data
STM32CubeMX V4.3.0 together with FW lib 1.3.0
FreeRTOS 7.60
LwIP V4.1.? (provided by Cube)

#tcp/ip #stm32 #lwip #stm32f4 #cubemx
2014-08-28 09:00 AM
A naive analysis suggests that your memcpy moves less than 9 bytes per µs. That is not good at all on a machine that can drive the SDRAM bus at 90 MHz, even for a naive analysis.
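For anyone who wants to redo that arithmetic, here is a minimal sketch in C that turns a measured copy time into bytes per µs and into the line rate the copy alone could sustain. The function names are made up for illustration; the inputs are the two timings quoted in the first post.

```c
#include <stdint.h>

/* Bytes copied per microsecond, scaled by 100 to stay in integer math
 * (for example, 762 means 7.62 bytes/us). */
static uint32_t bytes_per_us_x100(uint32_t bytes, uint32_t micros)
{
    return (bytes * 100u) / micros;
}

/* Line rate in kbit/s that a copy of `bytes` taking `micros` could sustain
 * if the CPU did nothing else: bits per microsecond equals Mbit/s, and the
 * factor 1000 converts that to kbit/s. */
static uint32_t copy_rate_kbit_s(uint32_t bytes, uint32_t micros)
{
    return (uint32_t)(((uint64_t)bytes * 8u * 1000u) / micros);
}
```

With the reported 1524 bytes in 200 µs this gives 7.62 bytes/µs (about 61 Mbit/s), and with 60 bytes in 25 µs it gives 2.4 bytes/µs (about 19 Mbit/s), so small frames are dominated by per-call overhead rather than by the copy itself.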
First, I would try to characterize the target's memory performance by running various memory access tests of my own. You might then conclude that the implementation of memcpy is poor (which toolchain do you use? Did you turn on all the compiler's optimization flags?). But low performance could also come from poorly optimized SDRAM controller settings, or any other cause...
I am running software on an F429 (ST eval board) with SD card, FTP server, FreeRTOS, LwIP, GNU toolchain and a MAC driver of my own with zero copy, and I reach over 10 Mb/s throughput. It is not very impressive (I did not take the time to benchmark for the highest figures) but certainly better than the numbers you reported. This device is capable; you have a software problem.
2014-08-29 02:00 AM
Hi,
indeed, that seems quite slow. I turned on -Ofast optimizations. This decreased the time needed for a 1520-byte packet to 100 µs, which is still way too slow. Interestingly, the overall throughput didn't increase. I'm using Atollic with a GNU toolchain. The instruction set is Thumb-2 and the C standard is gnu90. I will do some tests with writes from the internal RAM and then look for configuration issues. On the other hand, are you able/willing to share your MAC driver?
2014-08-29 03:01 AM
I published the MAC part of the driver at opensource.gezedo.com
The whole project will eventually be entirely open sourced there ... later ... Note that zero-copy has only been implemented on the RX side.
2014-09-01 01:17 AM
I've changed the ST implementation to start DMA transfers to the PHY directly from the passed pbuf chains. That improved the performance by a factor of 20. Unfortunately, that is still too slow for my application.
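The modified driver was not posted at this point, so the following is only a rough sketch of the general idea under stated assumptions, not the poster's code. `demo_pbuf` and `demo_tx_desc` are hypothetical stand-ins for lwIP's `struct pbuf` and the ETH DMA transmit descriptors, reduced to the fields needed here. Instead of copying each segment into a driver buffer, one descriptor is pointed at each segment's payload.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lwIP's struct pbuf: only the fields used here. */
struct demo_pbuf {
    struct demo_pbuf *next;    /* next segment of the same frame, or NULL */
    void             *payload; /* start of this segment's data */
    uint16_t          len;     /* length of this segment in bytes */
};

/* Hypothetical, simplified TX descriptor (real ETH DMA descriptors differ). */
struct demo_tx_desc {
    const void *buf;          /* address the DMA reads from */
    uint16_t    len;
    uint8_t     first;        /* first segment of the frame */
    uint8_t     last;         /* last segment of the frame */
    uint8_t     owned_by_dma; /* 1 while the DMA may still read the buffer */
};

/*
 * Point one descriptor at each segment of the chain instead of copying.
 * Returns the number of descriptors used, or -1 if the chain is empty,
 * too long for the ring, or a needed descriptor is still owned by the DMA.
 * Simplification: always starts at ring[0], with no wrap-around.
 */
static int demo_queue_chain(const struct demo_pbuf *p,
                            struct demo_tx_desc *ring, size_t ring_len)
{
    const struct demo_pbuf *q;
    size_t n = 0;
    size_t i;

    /* First pass: check that every needed descriptor is free. */
    for (q = p; q != NULL; q = q->next) {
        if (n >= ring_len || ring[n].owned_by_dma) {
            return -1;
        }
        n++;
    }
    if (n == 0) {
        return -1;
    }

    /* Second pass: reference the payloads; no memcpy() involved. */
    for (q = p, i = 0; q != NULL; q = q->next, i++) {
        ring[i].buf   = q->payload;
        ring[i].len   = q->len;
        ring[i].first = (uint8_t)(i == 0);
        ring[i].last  = (uint8_t)(q->next == NULL);
    }

    /* Hand over last-to-first so the DMA never sees a half-built frame. */
    for (i = n; i > 0; i--) {
        ring[i - 1].owned_by_dma = 1;
    }
    return (int)n;
}
```

In a real driver the pbuf would also have to stay alive until the DMA has finished with it, for example by taking a reference with pbuf_ref() and freeing it in the transmit-complete path. That bookkeeping is left out of this sketch.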
2014-09-01 01:47 AM
So you reached about 400 KB/s? What about memory bandwidth (I think it is a legitimate question on custom hardware)?
It is also possible that the SD card became the new bottleneck. Can you run tests on the SD card side only? On the Ethernet side only? Are you sure of the internal clock settings? You certainly noticed that the MAC driver has zero copy only on the RX side; there is some work to do to extend the feature to the TX side. I ran some tests again on my product: I can download (from network to SD) at around 100 KB/s and upload (from SD to network) at 550 KB/s. Note that in my case the SD card limits the performance, as it is connected using a single wire (D0) for data. Before running short of I/O, I ran the SD card with 4 data wires; write performance was still 100 KB/s, and read reached around 2 MB/s (with FTP/TCP/IP overhead, that is close to 20 Mb/s on the wire).
2014-09-18 07:38 AM
So, I finally got the thing running. The main issue was two Ethernet bugs in the STM32CubeMX driver V1.3.0. Refer to http://lists.nongnu.org/archive/html/lwip-users/2014-03/msg00033.html
@ ST: fix this! Currently I'm able to transmit around 22 MBit/s. I guess this is currently limited by the SD card, but it is by far fast enough. I also did some slight memory optimizations, but there was no big gain (about 1.5% reduction in the benchmarks). If anyone is interested, I can share my zero-copy ethernetif.c. It needs tight integration with the LwIP memory pools.
2014-09-18 09:52 AM
Hi steinecke.michael,
We’ve notified our development team about your inputs. We take all feedback and suggestions into consideration and are working hard to improve the STM32Cube experience for everyone.
Cheers,
Heisenberg.
2014-09-19 07:26 AM
Hi Michael,
Thanks for your feedback :)
About the bugs you referred to: we already fixed the HAL_ETH_GetReceivedFrame_IT() and low_level_input() functions in STM32CubeMX V1.3.0. Please ensure that you are using the latest firmware package in your project settings. We are working on the receive task issue and it will be fixed as soon as possible.
We developed a zero-copy technique in the past and found no big difference in terms of performance, because the time spent copying a frame to/from the stack is insignificant compared to the processing time of the stack.
Regards.
2014-09-25 01:37 AM
Hello LEO,
that's correct. I'm already using the FW Library 1.3.0. I also thought the bug was fixed and therefore looked in other directions. However, a crucial part of the fix is still missing. I would propose this fix:
#include <stdbool.h>

/* Set by the Rx ISR, cleared by the input task; guards s_xSemaphore. */
static __IO bool pendingRxFrames = false;

/**
  * @brief  Ethernet Rx Transfer completed callback
  * @param  heth: ETH handle
  * @retval None
  */
void HAL_ETH_RxCpltCallback(ETH_HandleTypeDef *heth)
{
  // Use pendingRxFrames as a guard around s_xSemaphore to prevent the OS reporting semaphore full errors.
  // This ISR is an atomic operation.
  if (!pendingRxFrames)
  {
    pendingRxFrames = true;
    osSemaphoreRelease(s_xSemaphore);
  }
}

/**
 * This function should be called when a packet is ready to be read
 * from the interface. It uses the function low_level_input() that
 * should handle the actual reception of bytes from the network
 * interface. Then the type of the received packet is determined and
 * the appropriate input function is called.
 *
 * @param netif the lwip network interface structure for this ethernetif
 */
void ethernetif_input(void const *argument)
{
  struct pbuf *p;
  struct netif *netif = (struct netif *) argument;

  for ( ;; )
  {
    if (osSemaphoreWait(s_xSemaphore, TIME_WAITING_FOR_INPUT) == osOK)
    {
      // Reset pendingRxFrames: this allows new signaling of the semaphore.
      // If there was another IRQ between the reset of s_xSemaphore and the reset of pendingRxFrames,
      // the packet will be read by the while loop anyway.
      pendingRxFrames = false;

      // If there is an IRQ during the execution of the while loop, it will signal the semaphore and set the flag.
      // The loop will read that newly signaled packet as well. As soon as no more packets are available,
      // the semaphore will be reset by the next for(;;) iteration, low_level_input() will find no packet,
      // and we fall back to the blocking state until the next packet is received.
      // Sometimes an Ethernet frame arrives while the IRQ handler is executing. This won't increase the semaphore count.
      // On each released semaphore, we fetch as many Ethernet frames as possible.
      while ((p = low_level_input(netif)) != NULL)
      {
        if (netif->input(p, netif) != ERR_OK)
        {
          pbuf_free(p);
          p = NULL;
        }
      }
    }
  }
}