
Notes on STM32 (G431) USB-CDC throughput (libusb) and pitfalls of using dd to transfer data (I2S)

BarryWhit
Lead II

Hi All,

I recently set up a USB CDC application for the first time, using an STM32G431 device.
While writing the software (on both the host and device sides), I ran into some problems: first
with achieving the expected throughput, then with data that appeared to be corrupted.

 

If you have more information about the reason behind the low OUT throughput with libusb when there are no IN transfers pending, please let me know. Is it a USB spec requirement? A bug in libusb? A bug in the USB middleware? A bug in silicon?

 

In order to save others the trouble, I'm going to share my results and what I learned here.
Since I only needed Host->Device in my application, that's all I tested.

Here's a quick summary:

1. Once working, I achieved 960 KB/s (960,000 bytes/sec) H(ost)->D(evice), sustained, without any data processing on the device side.

2. Initially, I could only get about half of that, around 530 KB/s H->D (more on that below).

3. ST has a YouTube video discussing exactly this situation, titled "STM32 USB training - 14 USB CDC throughput" (https://www.youtube.com/watch?v=YaHvPeemnvQ).

4. The ST video suggests using the unix tool dd to measure throughput. I'm on Linux, so I did.
With dd, I immediately got the expected throughput (around 1 MB/s, which is what you get on Full-Speed USB).

5. The low throughput turned out to be a software issue on the host application side, due to the libusb
(specifically libusb-1.0) code I was using. With libusb's Synchronous API, I could not get to 1 MB/s. The libusb docs suggest it's slower, but I find it hard to believe that it can't even saturate a Full-Speed USB link.
I never did get it to max throughput.

6. I then switched to libusb's async API (using some code I grabbed off GitHub and modified)
and made sure to have several transfers in flight. Initially, I only got slightly better
throughput, which puzzled me. Through trial and error I found that throughput drops by half
if I don't initiate any IN transfers in my host application, even though my device
sends no data to the host. Simply by submitting a single IN transfer at the start of my application,
which never completes, and keeping just 2 OUT transfers in flight, I was able to get good OUT
throughput. In fact, even keeping just one OUT transfer in flight in async mode got me
to 900 KB/s, so the missing IN transfers were really the problem.

7. Since dd achieves good OUT throughput, I looked at what's going on on the USB bus using Wireshark
(which can analyze USB traffic at the URB level). What I saw is that here, too, a few pending
IN transfers are initiated at the start of the application and never complete, even though dd
transfers data in strictly one direction. So there must be something about this that's required
by either the device or the USB specification. It makes sense, then, that the synchronous libusb
API can't achieve max OUT throughput, since it can't have a pending IN transfer while we
process an OUT transfer. But it's hard to believe this monster deficiency is by design; there
must be more to it. Maybe I'll ask on the libusb mailing list/forum.

8. As a piece of advice, I highly recommend measuring the throughput on both the device and the host (by just
keeping track of time and bytes transferred) and making sure both measurements are in good agreement.
My initial attempts became confusing due to the low confidence I had in my device-side-only
measurements, which at first relied on HAL_Delay instead of using a timer.

That's the first part, having to do with the low throughput problems. While everything works reliably now,
this is obviously only a partial explanation. The apparent requirement to have an IN transfer in flight
to achieve good OUT throughput is strange. That's either a libusb bug, something wrong with the
example code or the modifications I made to it, or some requirement of the USB specification that
I am not aware of. Maybe some expert on the forum can offer a more complete explanation.

The second part is about the problems I had using dd for data transfer.

Having seen the ST video on measuring CDC throughput with dd, I really liked the idea. It's
actually very natural: since on Linux the device shows up as /dev/ttyACM (the Linux equivalent of a VCP),
dd seems like the perfect tool not only to measure throughput, but to actually send data to the device.
In my case, I simply wanted to send audio samples to the device, which would send them to a
codec over I2S. The fact that dd immediately gave me the maximum achievable throughput meant
I didn't have to write or debug any host-side code, which seemed ideal.

However, when I reached the stage where my device would forward the data as samples to the
codec, all I got was (loud, very loud) noise. I suspected wrong endianness, or inefficient code
not sending samples quickly enough, but when I hard-coded some of the same data into a memory buffer,
it worked fine.

So, I used CubeIDE's debugger to compare the samples the device received over CDC against
a hex dump of the file. They were similar, but did not match. The data corruption had
to do with the mysterious appearance of either carriage returns or line feeds (\x0d and \x0a) in the data stream; I can't remember which. I realized that this was an issue with the terminal configuration
on the host side. When you use a terminal program, it takes care of all that for you, or lets you
configure it. But dd just uses the device as it is, so you have to configure the "terminal characteristics"
yourself, and it turns out that, by default, the terminal does not pass characters through as-is
(there's a whole bunch of arcana around serial terminals, since they descend from the hardware of the mid last century).

The program to do this on Linux is stty(1). There's a setting called "opost"
("postprocess output") which seems to be the culprit, but I just turned on "raw" mode,
which turns OFF a bunch of things, and the data corruption issue was resolved. Now I can
stream any audio file to my STM32G431 from the command line by piping ffmpeg output into dd,
which sends it to the device.

Hope this helps someone and, if you know more about the software issue I described,
where OUT throughput drops if you don't have a (seemingly redundant) IN transfer pending,
please share.

- If someone's post helped resolve your issue, please thank them by clicking "Accept as Solution".
- Please post an update with details once you've solved your issue. Your experience may help others.

A pending IN transfer will cause the host controller to poll the device almost continuously (or once every N ms if the IN endpoint is an interrupt endpoint). This can obviously cause a difference in the behavior of the firmware; you'd need to use an analyzer to understand the details. It is not clear how dd can activate an IN request, as it is basically a wrapper for the "write" syscall and is not aware of USB. There are production-quality USB libraries and drivers for STM32; for best results and decent support you may want to use them. The best feature of the Cube "middleware" is the price.

Pending IN transfer will cause the host controller to poll the device almost continuously

First useful response I've gotten. But then again, by design there is no IN traffic. How could a higher polling rate affect OUT traffic? This isn't sane behavior.

It is not clear how the dd can activate IN request,

But as verified by a Wireshark capture, it does.

The best feature of the Cube "middleware" is the price.

Yeah, and the USBPD lib is being phased out. I've been meaning to do the same test with the new USBX libraries. But it's not a high priority.
