Testing STM32 FS USB Library Performance

ashley23 · ‎2010-08-23

Posted on August 23, 2010 at 09:46

paturu · ‎2011-05-17

Posted on May 17, 2011 at 14:03

I am also looking to improve STM32 USB performance. I read that you were able to achieve better transfer rates by using the largest buffer in the Windows API, and that 64 Kbytes is the best fit. Could you please provide some example code? It would help me.

Thanks in advance.

Regards,

santhosh

ashley23 · ‎2011-05-17

Posted on May 17, 2011 at 14:03

You need to set up memory for double buffering:

// EP0 - Control IN / OUT

#define ENDP0_RXADDR (0x18) // 64 bytes for RX

#define ENDP0_TXADDR (0x58) // 64 bytes for TX

// EP1 - Bulk IN (device tx)

#define ENDP1_TXADDR0 (0x98) // 64 bytes for TX

#define ENDP1_TXADDR1 (0xD8) // 64 bytes for TX

// EP2 - Bulk OUT (device rx)

#define ENDP2_RXADDR0 (0x118) // 64 bytes for RX buf 0

#define ENDP2_RXADDR1 (0x158) // 64 bytes for RX buf 1

// EP3 - Interrupt IN (device tx)

#define ENDP3_TXADDR (0x198) // 64 bytes for TX
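
As a sanity check on a layout like this, the offsets can be derived rather than typed by hand. The sketch below is not part of the ST library: `pma_first_free` and `pma_buf_addr` are hypothetical helpers, and it assumes a 512-byte packet memory with the buffer descriptor table at offset 0 taking 8 bytes per endpoint register.

```c
#include <assert.h>
#include <stdint.h>

#define PMA_SIZE_BYTES     512u /* assumption: 512-byte packet memory */
#define BTABLE_ENTRY_BYTES 8u   /* 4 x 16-bit words per endpoint register */
#define EP_BUF_BYTES       64u  /* one full-size full-speed packet */

/* First packet-memory offset that is free after the buffer descriptor
 * table, assuming the table starts at offset 0. */
uint16_t pma_first_free(unsigned num_ep_regs)
{
    return (uint16_t)(num_ep_regs * BTABLE_ENTRY_BYTES);
}

/* Offset of the index-th 64-byte buffer placed back to back after base. */
uint16_t pma_buf_addr(uint16_t base, unsigned index)
{
    uint32_t addr = (uint32_t)base + (uint32_t)index * EP_BUF_BYTES;
    assert(addr + EP_BUF_BYTES <= PMA_SIZE_BYTES); /* buffer must fit in PMA */
    return (uint16_t)addr;
}
```

With three endpoint registers the first free offset is 0x18, which matches the defines above, and the seventh buffer lands at 0x198. If ENDP3 also uses its own descriptor entry, the table would extend to 0x20 under these assumptions, so it may be worth checking that ENDP0_RXADDR does not overlap it.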

Then you need to configure the endpoint for double buffering:

// Initialize Endpoint 1 double buffered bulk IN (device tx)

SetEPType(ENDP1, EP_BULK);

SetEPDoubleBuff(ENDP1);

SetEPDblBuffAddr(ENDP1, ENDP1_TXADDR0, ENDP1_TXADDR1);

SetEPDblBuffCount(ENDP1, EP_DBUF_IN, Device_Property.MaxPacketSize);

ClearDTOG_TX(ENDP1); // Clear DTOG

ClearDTOG_RX(ENDP1); // Clear SW_BUF - Sets buf 0 as software buffer and no buffer for USB peripheral

SetEPTxStatus(ENDP1, EP_TX_NAK);

SetEPRxStatus(ENDP1, EP_RX_DIS);

// Initialize Endpoint 2 double buffered bulk OUT (device rx)

SetEPType(ENDP2, EP_BULK);

SetEPDoubleBuff(ENDP2);

SetEPDblBuffAddr(ENDP2, ENDP2_RXADDR0, ENDP2_RXADDR1);

SetEPDblBuffCount(ENDP2, EP_DBUF_OUT, Device_Property.MaxPacketSize);

ClearDTOG_RX(ENDP2); // Clear DTOG

ClearDTOG_TX(ENDP2); // Clear SW_BUF

ToggleDTOG_TX(ENDP2); // Toggle SW_BUF - Sets buf 1 as software buffer (buf 0 for first rx)

SetEPRxStatus(ENDP2, EP_RX_VALID);

SetEPTxStatus(ENDP2, EP_TX_DIS);

Then you need to handle the buffer swapping in the EP callback:

if (GetENDPOINT(ENDP2) & EP_DTOG_TX) // EP_DTOG_TX is used here because it is actually the SW_BUF flag for an OUT endpoint. Refer to the reference manual, pages 589-590

{

// Read from ENDP2_BUF0Addr buffer

dataLength = GetEPDblBuf0Count(ENDP2);

PMAToUserBufferCopy(myBuffer, ENDP2_RXADDR0, dataLength);

}

else

{

// Read from ENDP2_BUF1Addr buffer

dataLength = GetEPDblBuf1Count(ENDP2);

PMAToUserBufferCopy(myBuffer, ENDP2_RXADDR1, dataLength);

}

FreeUserBuffer(ENDP2, EP_DBUF_OUT); // Hands the buffer just read back to the USB peripheral

Note that the mass storage example does the data writes in the EP callback, effectively stalling the EP until the flash write is complete. Ideally you should buffer the data then flash it in your own code. Note that the USB rate will still be limited by the rate that you can write to flash (or disk)...
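
A minimal sketch of that buffering, assuming the EP callback is the only producer and the main loop is the only consumer. The names `pkt_queue_push` and `pkt_queue_pop` and the queue depth are my own and not part of the ST library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PKT_SIZE  64u /* full-size bulk packet */
#define QUEUE_LEN 8u  /* must be a power of two so the counters can wrap */

typedef struct {
    uint8_t  data[PKT_SIZE];
    uint16_t len;
} packet_t;

static packet_t queue[QUEUE_LEN];
static volatile uint32_t q_head; /* advanced only by the EP callback */
static volatile uint32_t q_tail; /* advanced only by the main loop */

/* Called from the EP callback. Returns 1 on success, 0 if the queue is full. */
int pkt_queue_push(const uint8_t *src, uint16_t len)
{
    if (len > PKT_SIZE || (q_head - q_tail) == QUEUE_LEN)
        return 0;
    packet_t *p = &queue[q_head % QUEUE_LEN];
    memcpy(p->data, src, len);
    p->len = len;
    q_head++; /* publish the packet only after it is completely written */
    return 1;
}

/* Called from the main loop. Returns 1 if a packet was copied out, 0 if empty. */
int pkt_queue_pop(uint8_t *dst, uint16_t *len)
{
    assert(dst != NULL && len != NULL);
    if (q_head == q_tail)
        return 0;
    const packet_t *p = &queue[q_tail % QUEUE_LEN];
    memcpy(dst, p->data, p->len);
    *len = p->len;
    q_tail++; /* release the slot back to the producer */
    return 1;
}
```

The callback would push each packet after PMAToUserBufferCopy() and return straight away, while the main loop pops packets and does the slow flash write. If the push fails the queue is full, and the device has to hold off the host until the main loop catches up.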

tsuneo · ‎2011-05-17

Posted on May 17, 2011 at 14:03

> With your test as it stands, you are testing the latency of Windows USB transfers, not the performance of the library.

Neither "the performance of the library" nor "the latency of Windows USB transfers" is being tested.

WinUSB reaches more than 20 Mbytes/s with high-speed devices, so the PC device driver is not the bottleneck here.

You are examining bus scheduling on the host controller (HC) hardware. The result in duncan's post above is a typical case that shows how bus scheduling works on a full-speed HC. (Because of the loopback, the speed shown is half of the typical case.)

// Tested on USB 1 and USB 3

// write size (bytes) = data rate

// 16384 = 547 kb/s

// 8192 = 512 kb/s

// 4096 = 456 kb/s

// 2048 = 409 kb/s

// 1024 = 340 kb/s - < 500 writes per second

// 512 = 256 kb/s - 500 writes per second

// 256 = 127 kb/s

// 128 = 64 kb/s

// 64 = 32 kb/s

// 32 = 16 kb/s

// 16 = 8 kb/s

// 8 = 4 kb/s

// 4 = 2 kb/s

// 2 = 1 kb/s

// 1 = 0.5 kb/s

A full-speed HC (UHCI, OHCI) schedules the next transfer at the end of each USB frame. Therefore, for small transfer sizes, the transfer speed increases in proportion to the transfer size.

For example, a 1-byte transfer results in 1 byte per frame (1 ms). A 512-byte transfer gives 512 bytes per frame: one transfer per frame.

Theoretically, a full-speed frame saturates at 19 full-size transactions (19 x 64 bytes). Because of this saturation, a larger transfer size does not increase the transfer speed much.
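
The scheduling rule above can be turned into a rough upper-bound estimate. This is only an idealized model of my description, not a measurement: it assumes at most one transfer starts per 1 ms frame and that a frame carries at most 19 x 64 bytes. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define FS_MAX_PACKET        64u
#define FS_PACKETS_PER_FRAME 19u   /* theoretical limit quoted above */
#define FS_FRAMES_PER_SECOND 1000u /* one frame per millisecond */

/* Frames occupied by one transfer when at most one transfer starts per frame. */
uint32_t fs_frames_per_transfer(uint32_t transfer_bytes)
{
    const uint32_t frame_capacity = FS_MAX_PACKET * FS_PACKETS_PER_FRAME; /* 1216 bytes */
    assert(transfer_bytes > 0);
    return (transfer_bytes + frame_capacity - 1u) / frame_capacity; /* round up */
}

/* Idealized one-way upper bound in bytes per second. */
uint32_t fs_ideal_bytes_per_second(uint32_t transfer_bytes)
{
    return (uint32_t)(((uint64_t)transfer_bytes * FS_FRAMES_PER_SECOND)
                      / fs_frames_per_transfer(transfer_bytes));
}
```

For a 512-byte transfer this gives 512,000 bytes/s one way, or 256 kB/s once halved for the loopback, which lines up with the 512-byte row in the table. For larger sizes it is only an upper bound; the measured figures are lower, presumably because a real HC does not fit all 19 transactions into every frame.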

[Tips 1]

When a full-speed device connects to a PC through a USB 2.0 hub, you'll see higher speeds at the smaller transfer sizes. A USB 2.0 hub converts full-speed transactions into high-speed ones. A high-speed HC (EHCI) schedules per micro-frame (125 us) instead of per frame (1 ms).

[Tips 2]

The WinUSB RAW_IO policy breaks this transfer-per-frame restriction for bulk IN transfers. To make this feature work, multiple OVERLAPPED WinUsb_ReadPipe() calls are issued in advance, without waiting for any single call to finish.

Once we understand this HC scheduling, the principles for getting better transfer speed are simple.

On the device side,

- Always exchange full-size (64 bytes) transactions

- Double buffering increases performance

A short transaction (less than 64 bytes) terminates the transfer. Once a transfer terminates, we have to wait for the next frame to start the next transfer. Therefore, keep the transactions full-size to maximize the transfer speed.

On the PC applications,

- Request as large a transfer size as possible from the device driver.

When the entire data transfer is split into shorter ones, the gaps between transfers reduce the speed. WinUSB accepts a transfer of some 10 Mbytes in a single WinUsb_ReadPipe/WritePipe call.

In real applications, however, we may need to modify these simple principles to satisfy other requirements.

Here are three typical examples.

a) ROM writer

On the device side, the ROM can be read or written quickly at any time, as required. We don't need any buffer larger than a full-size packet (64 bytes): just two of them for double buffering.

In the PC application, when the target ROM size is 1 Mbyte or so, a single read/write call does the job. For larger ROMs, we may split the call into shorter chunks, just to show a progress bar.
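
A sketch of that chunking on the PC side. The `xfer_fn` hook is hypothetical and stands for whatever driver call is used (for example a WinUSB write); `counting_xfer` is only a stand-in so the loop can be tried without hardware. Choosing a chunk size that is a multiple of 64 bytes keeps every transaction full-size except possibly the last one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer hook: returns 0 on success. */
typedef int (*xfer_fn)(const uint8_t *data, size_t len, void *ctx);
/* Optional progress hook, e.g. to update a progress bar. */
typedef void (*progress_fn)(size_t done, size_t total, void *ctx);

/* Sends total bytes in pieces of at most chunk bytes. Returns 0 on success. */
int write_in_chunks(const uint8_t *data, size_t total, size_t chunk,
                    xfer_fn xfer, progress_fn progress, void *ctx)
{
    size_t done = 0;
    assert(chunk > 0 && xfer != NULL);
    while (done < total) {
        size_t n = total - done;
        if (n > chunk)
            n = chunk;
        if (xfer(data + done, n, ctx) != 0)
            return -1; /* stop at the first failed transfer */
        done += n;
        if (progress != NULL)
            progress(done, total, ctx);
    }
    return 0;
}

/* Stand-in for the real transfer: only counts calls and bytes. */
typedef struct {
    size_t calls;
    size_t bytes;
} xfer_stats;

int counting_xfer(const uint8_t *data, size_t len, void *ctx)
{
    xfer_stats *s = (xfer_stats *)ctx;
    (void)data;
    s->calls++;
    s->bytes += len;
    return 0;
}
```

With a 200-byte image and 64-byte chunks the stand-in is called four times, the last time with 8 bytes. In a real tool the chunk would be much larger (tens of Kbytes), since each extra call adds a gap between transfers.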

b) Data streaming of ADC / DAC

An ADC generates data at a regular sampling interval, and a DAC has to be fed regularly without any gap. But bulk transfer speed may fluctuate, affected by the activity of other devices on the bus. On the device side, buffers of sufficient size are provided for the IN (ADC) and OUT (DAC) endpoints to absorb the speed fluctuations. To keep the transactions full-size, the buffer size is set to a multiple of 64 bytes (64 x N bytes).

On the PC side, suppose that a PC data logger displays the ADC result in real time. In this application, the PC app requests shorter chunks of data to refresh the display continuously. The chunk size is tuned to fit the display refresh rate.

c) USB-serial converters (USB-UART, USB-I2C, USB-SPI, USB-CAN, etc)

The traffic of these serial links sometimes comes in bursts and sometimes sporadically. We can't assume a regular interval.

To prevent data loss in burst traffic, buffers of sufficient size are assigned to IN and OUT, as in case b) above. UART RX data does not always arrive in 64-byte chunks. It may arrive at sporadic intervals and in short chunks. If the device always waited for 64 bytes in the buffer, no response could pass from the device to the PC for a long time. To guarantee a response deadline, a latency timer forces a packet transfer even when the buffer holds less than 64 bytes.
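
The latency-timer decision can be sketched as a single function that the device polls, for example from a 1 ms tick. The name `bytes_to_send` and its parameters are illustrative, not taken from any library, and the timeout value is up to the application.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PACKET 64u /* full-size full-speed bulk packet */

/* Returns the number of bytes to send on the IN endpoint now,
 * or 0 to keep waiting for more data. */
uint32_t bytes_to_send(uint32_t buffered, uint32_t ms_since_last_send,
                       uint32_t latency_ms)
{
    assert(latency_ms > 0);
    if (buffered >= MAX_PACKET)
        return MAX_PACKET; /* a full-size packet is ready: send it at once */
    if (buffered > 0 && ms_since_last_send >= latency_ms)
        return buffered;   /* timer expired: flush what we have as a short packet */
    return 0;              /* nothing to send, or still within the latency window */
}
```

With a 16 ms timeout (an arbitrary value for illustration), 10 buffered bytes are held back at 5 ms and sent at 16 ms, while 100 buffered bytes always go out as a 64-byte packet immediately.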

An FTDI application note explains this configuration well.

http://www.ftdichip.com/Support/Knowledgebase/index.html?an232b_04smalldataend.htm

Tsuneo