
Notes on STM32 (G431) USB-CDC throughput (libusb) and pitfalls of using dd to transfer data (I2S)

BarryWhit
Lead II

Hi All,

I recently set up a USB CDC application for the first time, using an STM32G431 device.
While writing the software (on both the host and device sides), I ran into some problems: first
with achieving the expected throughput, then with what looked like corrupted data.

 

If you have more information about the reason behind the low OUT throughput with libusb
when there are no IN transfers pending, please let me know. Is it a USB spec requirement?
A bug in libusb? A bug in the USB middleware? A bug in silicon?

 

In order to save others the trouble, I'm going to share my results and what I learned here.
Since I only needed Host->Device in my application, that's all I tested.

Here's a quick summary:

1. Once working, I achieved 960 KB/s (960,000 bytes/s) H(ost)->D(evice), sustained, without any data processing on the device side.

2. Initially, I could only get about half of that, around 530 KB/s H->D (more on that below).

3. ST has a YouTube video discussing exactly this situation, titled "STM32 USB training - 14 USB CDC throughput" (https://www.youtube.com/watch?v=YaHvPeemnvQ).

4. The ST video suggests using the Unix tool dd to measure throughput. I'm on Linux, so I did.
With dd, I immediately got the expected throughput (around 1 MB/s, which is roughly what you can get on Full-Speed USB). An example command appears after this list.

5. The low throughput turned out to be a software issue on the host application side, in the libusb
(specifically libusb-1.0) code I was using. Using libusb's synchronous API, I could not get to 1 MB/s. The libusb docs suggest the synchronous API is slower, but I find it hard to believe it can't even saturate a Full-Speed USB link.
I never did get it to max throughput with the synchronous API.

6. When I switched to libusb's async API (using some code I grabbed off GitHub and modified)
and made sure to have several transfers in flight, I initially only got slightly better
throughput, which puzzled me. Through trial and error I found that throughput drops by half
if I don't initiate any IN transfers in my host application, even though my device
sends no data to the host. By simply submitting a single IN transfer at the start of my application
(which never completes) and keeping just 2 OUT transfers in flight, I was able to get good OUT
throughput. In fact, even keeping just one OUT transfer in flight in async mode got me
to 900 KB/s, so the missing IN transfer was really the problem. (A minimal sketch of this setup appears after this list.)

7. Since dd achieves good OUT throughput, I looked at what's happening on the USB bus using Wireshark
(which can analyze USB traffic at the URB level). What I saw is that here, too, a few pending
IN transfers are initiated at the start of the application and never complete, even though dd
transfers data in strictly one direction. So there must be something about this that's required
by either the device or the USB specification. It makes sense, then, that the synchronous libusb
API can't achieve max OUT throughput, since it can't have a pending IN transfer while we
process an OUT transfer. But it's hard to believe this monster deficiency is by design; there
must be more to it. Maybe I'll ask on the libusb mailing list/forum.

8. As a piece of advice, I highly recommend measuring the throughput on both the device and the host (by simply
keeping track of time and bytes transferred) and making sure both measurements are in good agreement.
My initial attempts became confusing due to the low confidence I had in my device-side-only
measurements, which at first relied on HAL_Delay instead of a proper timer.
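
Regarding point 4, the dd invocation I use for a quick throughput check is along these lines (the device node, block size, and count are arbitrary examples); dd prints the average rate when it finishes:

dd if=/dev/zero of=/dev/ttyACM0 bs=4096 count=2500

To make point 6 concrete, here is a minimal sketch of the host-side async setup, assuming libusb-1.0 with the default context. The endpoint addresses (0x01 OUT, 0x81 IN), buffer sizes, and the stream() helper are assumptions for illustration, not my exact code:

```
#include <libusb-1.0/libusb.h>
#include <string.h>

#define EP_OUT    0x01      /* assumed bulk OUT endpoint of the CDC data interface */
#define EP_IN     0x81      /* assumed bulk IN endpoint */
#define XFER_SIZE 4096
#define N_OUT     2         /* OUT transfers kept in flight */

static unsigned char out_buf[N_OUT][XFER_SIZE];
static unsigned char in_buf[64];

/* OUT completion: resubmit immediately so the pipe stays full */
static void LIBUSB_CALL out_cb(struct libusb_transfer *xfer)
{
    if (xfer->status == LIBUSB_TRANSFER_COMPLETED)
        libusb_submit_transfer(xfer);   /* a real app would refill the buffer first */
}

/* IN completion: never expected to fire, since the device sends nothing */
static void LIBUSB_CALL in_cb(struct libusb_transfer *xfer) { (void)xfer; }

/* assumes libusb_init(NULL) was called and h is an opened handle
   with the CDC data interface claimed */
static void stream(libusb_device_handle *h)
{
    /* the "phantom" IN transfer: submitted once, never completes */
    struct libusb_transfer *in = libusb_alloc_transfer(0);
    libusb_fill_bulk_transfer(in, h, EP_IN, in_buf, sizeof in_buf, in_cb, NULL, 0);
    libusb_submit_transfer(in);

    /* keep several OUT transfers in flight at all times */
    for (int i = 0; i < N_OUT; i++) {
        struct libusb_transfer *out = libusb_alloc_transfer(0);
        memset(out_buf[i], 0xAA, XFER_SIZE);    /* dummy payload */
        libusb_fill_bulk_transfer(out, h, EP_OUT, out_buf[i], XFER_SIZE, out_cb, NULL, 0);
        libusb_submit_transfer(out);
    }

    for (;;)
        libusb_handle_events(NULL);     /* drive completions on the default context */
}
```

Commenting out the IN-transfer block at the top of stream() is exactly the change that halves my OUT throughput.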

That's the first part, having to do with the low throughput problems. While everything works reliably now,
I obviously only have a partial explanation. In particular, the requirement to have an IN transfer in flight
to achieve good OUT throughput is strange. That's either a libusb bug, something wrong with the
example code or the modifications I made to it, or some requirement of the USB specification that
I am not aware of. Maybe an expert on the forum can offer a more complete explanation.

The second part is about the problems I had using dd for data transfer.

Having seen the ST video on measuring CDC throughput with dd, I really liked the idea. It's
actually very natural: since on Linux the device shows up as /dev/ttyACMx (the Linux equivalent of a VCP),
dd seems like the perfect tool not only to measure throughput, but to actually send data to the device.
In my case, I simply wanted to send audio samples to the device, which would forward them to a
codec over I2S. The fact that dd immediately gave me the maximum achievable throughput meant
I didn't have to write or debug any host-side code, which seemed ideal.

However, when I reached the stage where my device would forward the received data as samples to the
codec, all I got was (loud, very loud) noise. I suspected wrong endianness, or inefficient code
not sending samples quickly enough, but when I hard-coded some of the same data into a memory buffer,
it worked fine.

So, I used CubeIDE's debugger to compare the samples the device received over CDC against a hex dump
of the file. They were similar, but did not match: the corruption had to do with the mysterious
appearance of carriage returns or line feeds (\x0d and \x0a, I can't remember which) in the data stream. I realized that this was an issue with the terminal configuration
on the host side. When you use a terminal program, it takes care of all that for you, or lets you
configure it. But dd just uses the device as-is, so you have to configure the terminal characteristics
yourself, and it turns out that, by default, the terminal does not pass characters through unmodified
(there's a whole bunch of arcana around serial terminals, since they descend from hardware of the mid last century). The program to do this on Linux is stty(1). There's a setting called "opost"
(output post-processing) which seems to be the culprit, but I just turned on "raw" mode,
which turns OFF a bunch of these translations, and the data corruption issue was resolved. Now I can
stream any audio file to my STM32G431 from the command line by piping ffmpeg output into dd,
which sends it to the device.
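
For reference, here is roughly what my host-side command sequence looks like (the device node, input file, audio format, and block size below are examples; adjust them to your setup):

stty -F /dev/ttyACM0 raw

ffmpeg -i song.mp3 -f s16le -ar 48000 -ac 2 - | dd of=/dev/ttyACM0 bs=4096

The first command disables opost and the other line-discipline translations; the second decodes the file to raw 16-bit PCM and streams it to the CDC device.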

Hope this helps someone and, if you know more about the software issue I described,
where OUT throughput drops if you don't have a (seemingly redundant) IN transfer pending,
please share.

FBL
ST Employee

Hi @BarryWhit 

 

Could you share a USB traffic trace so we can check?

BarryWhit
Lead II

Hello @FBL,

 

See attached 7z file. Your forum for some reason believes that zip files are a security risk, but rar and 7zip are not. Crazy.

 

I can't tell you what's happening on the wire, but the two Wireshark capture files therein show what's happening at the URB level (with timestamps). There's not much to see. One of them (the fast one) has an IN URB submitted at the start of communications, and the other doesn't. This small difference causes throughput to be about 950 KB/s for the one case (with the IN request) but only about 545 KB/s for the other (no IN request).

 

Both captures are of a small libusb-based host-side application I adapted from some online code, which runs in libusb async mode (if you're familiar with it). Essentially, I have an API call which submits an IN request before starting the continuous HOST->DEVICE transmission. I can comment that line out or not, and that causes the difference.

 

The "fast" version overall corresponds to what wireshark captures when I use dd to send data to the device. for some reason, dd also submits a phantom IN request before starting. I've searched high and low, and asked on the libusb mailing list, and found no evidence that the USB spec requires this odd behavior.

 

I did these tests on an old Linux box which has an ICH9 controller. The STM32 code is just the basic generated code (as far as I can remember); main is just an empty loop. Here is the callback code. As you can see, the receive callback does nothing but count the bytes and acknowledge the packet. If I recall correctly, the transmit callback never gets called anyway (since my firmware never sends anything).

```
static int8_t CDC_Receive_FS(uint8_t* Buf, uint32_t *Len)
{
  /* USER CODE BEGIN 6 */
  mycount += *Len;                                /* count received bytes for throughput measurement */

  USBD_CDC_SetRxBuffer(&hUsbDeviceFS, &Buf[0]);   /* re-arm the same receive buffer */
  USBD_CDC_ReceivePacket(&hUsbDeviceFS);          /* accept the next packet */

  return (USBD_OK);
  /* USER CODE END 6 */
}

uint8_t CDC_Transmit_FS(uint8_t* Buf, uint16_t Len)
{
  uint8_t result = USBD_OK;
  /* USER CODE BEGIN 7 */
  /* unused: the firmware never transmits to the host */
  /* USER CODE END 7 */
  return result;
}
```
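
To make the device-side measurement (point 8 in my original post) concrete, here is a minimal sketch of how the mycount counter can be turned into a throughput figure. report_throughput is a hypothetical helper polled from the main loop, and the one-second window is just an example:

```
extern uint32_t mycount;   /* byte counter incremented in CDC_Receive_FS (ideally declared volatile) */

void report_throughput(void)
{
    static uint32_t last_tick, last_count;
    uint32_t now = HAL_GetTick();                    /* milliseconds since boot */

    if (now - last_tick >= 1000) {                   /* once per second */
        uint32_t bytes = mycount - last_count;       /* bytes received in the last window */
        uint32_t kBps  = bytes / (now - last_tick);  /* bytes per ms, numerically ~ kB/s */
        last_tick  = now;
        last_count = mycount;
        (void)kBps;  /* log over UART/SWO or watch it in the debugger */
    }
}
```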

 

My host application measures throughput, but this corresponds well to what the device sees (I did other tests to verify this). The throughput measured by my application also corresponds closely to what `dd` reports as its average throughput (it just sends to the device as quickly as possible).

 

Looking forward to your insights,

Barry

FBL
ST Employee

Hi @BarryWhit 

> when there are no IN transfers pending, please let me know. Is it a USB spec requirement?

IN transfers (device to host) should not impact OUT transfers (host to device) in terms of functionality.

If you are not transmitting data or status information back to the host, you might consider disabling the IN direction (disable TX).

 

For the OUT direction (host to device), using cat to send a file to /dev/ttyACM0 is a straightforward method that avoids the overhead of more complex transfer mechanisms.

bash -c 'cat filename >/dev/ttyACM0'

This simply streams the contents of filename to the USB CDC device represented by /dev/ttyACM0. This method is efficient because it does not involve any additional processing of the data; it is sent as-is to the device.


> IN transfers (device to host) should not impact OUT transfers (host to device)

But apparently, with the STM32 USB stack, they do.

 

> For the OUT direction (host to device), you can use cat to send a file to /dev/ttyACM0

That's not helpful. I already mentioned the tests performed with dd. I know that dd gets better throughput, and I've achieved the same throughput by sending a "useless" IN request before starting the transfer. My problem is not how to achieve better throughput, but to figure out why the STM32 USB behaves so strangely in this respect. You say yourself that IN requests should have no impact on OUT transfers - but they do, and I've given you the means to reproduce this yourself.

 

> This method is efficient because it does not involve any additional processing of the data;

Irrelevant. As I stated, I get ~1 MB/s with my host application if I send an "unnecessary" IN request at the start of comms, so it's clear this is not an efficiency/overhead issue (on the host side). This 3 GHz CPU can easily keep up with 1 MB/s, and so can libusb.

 

I was hoping for a more informative answer than "It shouldn't happen. But since it does, just have your Windows users use cat". Come on.

 

FBL
ST Employee

Hi @BarryWhit 

Sorry if I couldn't help. I have forwarded your request to experts for further investigation.

FBL
ST Employee

Hi @BarryWhit 

Would you remove the hub and see if the issue disappears? 

(attached screenshot: FBL_0-1719560883005.png)


@FBL, thanks but I tested that months ago, before I even opened this thread.

 

I repeated the test today to make sure. I connected the device directly to a port on my computer and it makes no difference: with the IN transfer disabled at the start of my program, the OUT throughput still drops from ~950 KB/s to ~500 KB/s.

FBL
ST Employee

But @BarryWhit, USB is half duplex: it achieves bidirectional communication over the single differential pair by switching between sending and receiving data, but it cannot do both simultaneously.

What you are describing, needing a pending IN transfer to achieve good OUT throughput, is confusing to me. Maybe there is something else I'm missing.

 


@FBL wrote:

But @BarryWhit, USB is half duplex: it achieves bidirectional communication over the single differential pair by switching between sending and receiving data, but it cannot do both simultaneously.

 

I don't understand what this has to do with it, except possibly that it's been a while since you looked at this and you've simply forgotten what the issue I described was. Please refresh your memory.

 

What you are describing, needing a pending IN transfer to achieve good OUT throughput, is confusing to me. Maybe there is something else I'm missing.

 

But that is the whole point, which I also explained to you in a detailed private message a month ago. The difference in throughput depends on whether or not I issue a single "unwarranted" IN request at the start of the program, before I begin the continuous OUT transmission.

 

I sent you the source code via PM a month ago. I gave you sniffer captures which show this exactly. I explained this in writing several times. What more could I possibly do??

 

I feel like we're going around in circles.

 
