STM32 USB FS Bulk In speed tuning

cpmh1 · ‎2014-09-09

Posted on September 10, 2014 at 00:55

Using STM32Cube_FW_F4_V1.1.0 on STM32F401

I am trying to maximize throughput from the host to the STM32 on a full-speed bulk endpoint. Using a 64-byte Rx buffer, I can only get about 5 Mbps, because the interrupt, the setup of the next Rx buffer, and the restart of Rx take a large amount of time.

If I use a much larger buffer, I can get 9 Mbps, but I don't get an Rx-complete interrupt unless the host sends a short packet. I need the bigger buffer for speed, but I can't count on the host sending non-64-byte packets, and if the host sends less than a full Rx buffer, the app doesn't get an interrupt.

Is there some way to stop the Rx in the middle and recover the existing data (and count) in the Rx operation as if it had filled, without dropping any in-process data (NAKing it is OK)? Or is there some other solution?

#usb-fs-bulk-rx-speed-stm32cube

tsuneo · ‎2014-09-10

Posted on September 10, 2014 at 17:38

You have two problems:

a) How to implement variable-length transfer.

b) How to move large data quickly over the bulk OUT endpoint (host --> device).

a) Variable-length transfer

In practice, there are two popular methods to send a variable-length transfer over a bulk/interrupt OUT endpoint.

1) The device is told the transfer length beforehand, or in the first packet (transaction) of the transfer.

2) The transfer is always terminated by a short packet, including ZLP (Zero-Length Packet).

The second method works only with a limited set of PC drivers, such as WinUSB (**1), because

- ZLP termination has to be requested from the driver by the PC application,

- but most PC in-box drivers don't provide any way to send a ZLP.

(**1) WinUSB sends a ZLP over the OUT endpoint when the SHORT_PACKET_TERMINATE policy is enabled.

http://msdn.microsoft.com/en-us/library/windows/hardware/ff728833%28v=vs.85%29.aspx

For other drivers, like CDC (Virtual COM port), you have to apply the first method.

b) Transfer speed

The USB device engine on the STM32F2/F4 is transfer-oriented: we have to set up the number of packets (transactions) and the entire transfer length in the endpoint register. To make the endpoint run quickly, we have to assign the real transfer length (or more) to the endpoint. The engine then works quickly, without much firmware intervention.
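As a rough illustration of the "number of packets and entire transfer length" setup, the arithmetic can be sketched as below. The struct and function names are hypothetical (they are not HAL names), and the sketch assumes that an OUT transfer is programmed as a whole number of MPS-sized packets; check the reference manual for the exact register fields and their limits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two values programmed per OUT transfer. */
typedef struct {
    uint32_t pkt_count;  /* number of packets (transactions) to accept */
    uint32_t xfer_size;  /* total transfer size in bytes               */
} out_xfer_cfg_t;

/* Compute the packet count and transfer size for an expected length.
 * Assumption: the OUT transfer size is rounded up to a multiple of MPS. */
static out_xfer_cfg_t out_xfer_config(uint32_t expected_len, uint32_t mps)
{
    assert(mps > 0u);

    out_xfer_cfg_t cfg;
    cfg.pkt_count = (expected_len + mps - 1u) / mps;  /* round up */
    if (cfg.pkt_count == 0u) {
        cfg.pkt_count = 1u;  /* still need one packet to receive a ZLP */
    }
    cfg.xfer_size = cfg.pkt_count * mps;
    return cfg;
}
```

For example, an expected length of 1000 bytes with a 64-byte MPS gives 16 packets and a programmed size of 1024 bytes.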

1) The first packet holds the transfer length

The firmware first claims a transfer of one MPS (MaxPacketSize, i.e. 64 bytes for full speed) for the first packet. Once it receives the first packet, the exact length of the transfer is known, and the firmware claims another transfer for exactly the remaining length.
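A minimal sketch of this first method follows. The framing is purely an assumption for illustration: the first 4 bytes of the first packet hold the payload length as a little-endian 32-bit value, not counting the header itself. Your protocol may carry the length differently, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BULK_MPS 64u  /* full-speed bulk MaxPacketSize */
#define HDR_SIZE 4u   /* assumed header: 32-bit little-endian payload length */

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Given the first packet, return how many payload bytes are still to come.
 * The firmware would claim a second transfer of exactly this length. */
static uint32_t remaining_after_first(const uint8_t *pkt, uint32_t pkt_len)
{
    assert(pkt_len >= HDR_SIZE && pkt_len <= BULK_MPS);

    uint32_t payload  = read_le32(pkt);
    uint32_t in_first = pkt_len - HDR_SIZE;  /* payload bytes already received */
    return (payload > in_first) ? (payload - in_first) : 0u;
}
```

With this framing, a 256-byte payload whose first packet is a full 64 bytes leaves 196 bytes for the second transfer.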

2) Short packet termination

In this method, your firmware assigns a transfer length large enough to cover the maximum possible transfer length in your protocol. When a short packet (including a ZLP) is received, the transfer terminates and the transfer-complete interrupt is generated.

On the host side, the PC application has to add a ZLP explicitly, or set up the driver so that a ZLP is appended when the transfer length is a multiple of MPS.
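The host-side rule for when a ZLP is needed can be sketched as a small check. The function name is hypothetical; how the ZLP is actually sent depends on the host driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true when the host must append a ZLP to terminate the transfer.
 * A length that is not a multiple of MPS already ends in a short packet,
 * and a zero-length transfer is itself a single ZLP. */
static bool needs_zlp(uint32_t transfer_len, uint32_t mps)
{
    assert(mps > 0u);
    return (transfer_len > 0u) && ((transfer_len % mps) == 0u);
}
```

So a 128-byte transfer on a 64-byte endpoint needs a trailing ZLP, while a 100-byte transfer does not.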

If it is available to you, the second method is the simpler one on both sides.

Tsuneo

cpmh1 · ‎2014-09-10

Posted on September 11, 2014 at 01:39

Thanks Tsuneo,

The short/ZLP packet is a great, easy solution, and it is what I was using before my client informed me that this code has to work with older hosts (embedded Linux boards) whose USB driver code can't be updated.

I did manage a decent solution however, and I'll explain it here in case anyone else runs across this:

There are two interrupts that occur in the USB h/w IRQ handler. The interrupt that signals Rx-complete (buffer count full) is the one that eventually calls the class DataOut routine. That IRQ fires when the endpoint Rx count reaches the buffer count you set, or when a short packet is read. The other interrupt is handled internally and happens on each packet; it is used to read the h/w Rx FIFO into your buffer.

I modified the hal_pcd USB interrupt code so that the per-packet interrupt calls a function in my app with the packet length. My function returns a pointer to the head position in a long linear (not ring) buffer, which I hand to USBreadPacket, which fills it in from the h/w FIFO. My app reads data from the tail of the long buffer. When the head gets to the end of the buffer (and the normal Rx-complete IRQ fires) and the app has read all the data, I restart the Rx for the whole big buffer. Since the app reads faster than the USB FS transfer, the restart happens very fast, and with the linear buffer there are no vars shared between main-loop and ISR contexts, so no locking, etc. I am now getting close to 10 Mbps, which is close to the h/w max of 12, and the client is happy.
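For anyone who wants a starting point, the buffer bookkeeping described above might look roughly like the sketch below. All names and the buffer size are hypothetical, the hook into the HAL interrupt code is not shown, and it only covers the buffer-full case (a transfer ended early by a short packet would need its own handling). The head index is written only from the ISR and the tail only from the main loop, which is why the sketch uses no locking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 4096u  /* assumed size, a multiple of the 64-byte MPS */

typedef struct {
    uint8_t buf[RX_BUF_SIZE];
    volatile uint32_t head;  /* written only by the USB ISR   */
    uint32_t tail;           /* written only by the main loop */
} rx_linear_t;

/* ISR side: called once per received packet. Returns the address the
 * packet should be copied to from the Rx FIFO and advances the head.
 * The copy happens in the same ISR, before the main loop can run again. */
static uint8_t *rx_on_packet(rx_linear_t *rx, uint32_t pkt_len)
{
    assert(rx->head + pkt_len <= RX_BUF_SIZE);
    uint8_t *dst = &rx->buf[rx->head];
    rx->head += pkt_len;
    return dst;
}

/* Main-loop side: copy out up to max bytes from the tail. */
static uint32_t rx_read(rx_linear_t *rx, uint8_t *out, uint32_t max)
{
    uint32_t head  = rx->head;  /* take one snapshot of the ISR's index */
    uint32_t avail = head - rx->tail;
    uint32_t n     = (avail < max) ? avail : max;
    memcpy(out, &rx->buf[rx->tail], n);
    rx->tail += n;
    return n;
}

/* True when the buffer has been filled and fully drained, i.e. the point
 * at which the whole-buffer Rx would be restarted. */
static int rx_should_rearm(const rx_linear_t *rx)
{
    return (rx->head == RX_BUF_SIZE) && (rx->tail == RX_BUF_SIZE);
}

/* Reset both indices just before re-arming the endpoint. */
static void rx_reset(rx_linear_t *rx)
{
    rx->head = 0u;
    rx->tail = 0u;
}
```

In the main loop, the idea would be to call rx_read() whenever data is wanted and, once rx_should_rearm() returns true, call rx_reset() and restart the receive for the full buffer.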