2024-07-20 08:58 PM
Today I decided to improve the performance of my USB driver by transferring multiple IN packets at a time, instead of one at a time. This in theory would be relatively easy, as I can just configure the IN endpoint's TX-FIFO to have a larger allocation and simply push more data into it whenever I'm sending multiple packets, setting PKTCNT and XFRSIZ as needed.
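As a rough sketch of the "setting PKTCNT and XFRSIZ as needed" part: the helper names below are hypothetical, and the field positions (XFRSIZ in bits 18:0, PKTCNT in bits 28:19 of DIEPTSIZx) are an assumption taken from the usual OTG_FS register layout, so check them against the RM for your part.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed DIEPTSIZx layout (verify against the RM):
 * XFRSIZ in bits 18:0, PKTCNT (10 bits) in bits 28:19. */
#define XFRSIZ_POS 0u
#define XFRSIZ_MSK 0x7FFFFu
#define PKTCNT_POS 19u
#define PKTCNT_MSK 0x3FFu

/* Hypothetical helper: packets needed to send `len` bytes.
 * A zero-length transfer is assumed to still take one (empty) packet. */
uint32_t packets_for(uint32_t len, uint32_t max_packet_size)
{
    assert(max_packet_size != 0u);
    if (len == 0u) {
        return 1u;
    }
    return (len + max_packet_size - 1u) / max_packet_size; /* round up */
}

/* Hypothetical helper: value to write to DIEPTSIZx for one IN transfer. */
uint32_t dieptsiz_for(uint32_t len, uint32_t max_packet_size)
{
    uint32_t pktcnt = packets_for(len, max_packet_size);
    return ((pktcnt & PKTCNT_MSK) << PKTCNT_POS)
         | ((len & XFRSIZ_MSK) << XFRSIZ_POS);
}
```

With a 64-byte endpoint and 512 bytes to send, this gives PKTCNT = 8 and XFRSIZ = 512, which is exactly the case that misbehaves below.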
Well, that's the theory, but in practice? It worked for the most part. I could push up to 7 IN packets in one go with no problems, but what happens at 8, you may ask? Well, it led to the baffling bug of the "IN Transfer Completed" interrupt not being asserted anymore.
I double-checked the endpoint's FIFO address field, the size, PKTCNT, and XFRSIZ... no obvious issue anywhere. I checked the USB bus traffic and could verify that a single IN packet does indeed get sent to the host after I push 8 packets' worth of data, but that's it.
Okay... what if I reduce my Bulk IN endpoint from 64 bytes to 32 bytes? No difference. What if I allocate space for eight packets in the TX-FIFO but only ever push at most 7 packets? That happens to be just fine, apparently. So it's something about trying to schedule 8 packets that breaks things, which is peculiar: PKTCNT of the DIEPTSIZx register is a 10-bit field, so it's definitely not due to some bit-masking error. Could this be an erratum? I checked the errata document and the only related entry is the one where accessing the FIFO in combination with another USB core register corrupts the next write to the FIFO, but I don't think that applies here.
So I sat in silence and pondered.
What if I slow down the writes to the TX-FIFO ever so slightly? Perhaps I'm writing data to the FIFO "too fast" or something, so I decided to insert a no-op loop after each push of a word, and lo and behold, it started working again! From there I started to narrow things down, and came to the realization that the DTXFSTSx register -- which indicates how much free space is in the IN endpoint's TX-FIFO -- is not what it seems...
See, I assert at the beginning of my routine that the DTXFSTSx register should be at its maximum whenever I'm starting a new IN transfer. From there on out I assumed that I could push data into the TX-FIFO without having to worry about overflows. I mean, the RM literally says "This read-only register contains the free space information for the device IN endpoint Tx FIFO". What could possibly be misleading about that?
Well, as it turns out, once seven IN packets are buffered up, this DTXFSTSx field immediately gets zeroed, even if there is plenty of space for more data. For instance, if my Bulk IN endpoint is 64 bytes, I'm sending 512 bytes' worth of data, and I've allocated 512 bytes for the IN endpoint's TX-FIFO, then this would be DTXFSTSx's value over time:
64-byte IN packets pushed so far | DTXFSTSx (units of 4-byte words)
0                                | 128
1                                | 112
2                                | 96
3                                | 80
4                                | 64
5                                | 48
6                                | 32
7                                | 0
8                                | 0
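The readings in the table can be reproduced with a tiny model. This is only a guess that matches the observations, not documented behaviour: the function name and the limit of seven queued packets are assumptions, and it only covers the case where nothing has been sent to the host yet.

```c
#include <assert.h>
#include <stdint.h>

/* Observed limit from the table, not a documented constant. */
#define MAX_QUEUED_PACKETS 7u

/* Hypothetical model of the DTXFSTSx reading (in 4-byte words) after
 * `packets_pushed` full packets have been written and none sent yet. */
uint32_t modeled_dtxfsts_words(uint32_t fifo_bytes,
                               uint32_t packet_bytes,
                               uint32_t packets_pushed)
{
    assert(fifo_bytes % 4u == 0u && packet_bytes % 4u == 0u);

    uint32_t used_bytes = packets_pushed * packet_bytes;

    /* Reads as "full" once the packet queue is saturated, even though
     * the FIFO memory itself still has room. */
    if (packets_pushed >= MAX_QUEUED_PACKETS || used_bytes >= fifo_bytes) {
        return 0u;
    }
    return (fifo_bytes - used_bytes) / 4u;
}
```

Calling it with a 512-byte FIFO and 64-byte packets gives 128, 112, 96, 80, 64, 48, 32 for zero through six packets, then 0 from the seventh packet on.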
So as it turns out, I was indeed "overflowing" the TX-FIFO buffer, as the "amount of free space in the FIFO" became zero, but I kept on pushing more data.
The fix is relatively simple: check DTXFSTSx at each packet boundary to make sure there's space available before pushing the next packet. It's important that the check is done at packet boundaries, because of the aforementioned erratum where FIFO data can get corrupted. Although the errata document doesn't make it very clear, I believe it's fine to access other USB core registers so long as it's not in the middle of writing a packet.
Now the question is: why? Well, after some deep digging through forum posts, I came across a post from almost a decade ago by someone who ran into the same issue. The original post had the same findings as mine, but ST horribly mangled it like a matted dog, and the sprinkles of bad grammar and spelling didn't exactly make it a pleasant read either, so I decided to make this post to help clarify things a bit.
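A minimal sketch of that fix, under stated assumptions: the `tx_fifo_ops` hooks, the little-endian word packing, and the whole `mock_fifo` model are hypothetical and exist only so the routine can be exercised off-target. On the real part, `free_words` would read DTXFSTSx and `push_word` would write one word to the endpoint's TX-FIFO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware-access hooks. */
typedef struct {
    uint32_t (*free_words)(void *ctx);           /* reads DTXFSTSx */
    void (*push_word)(void *ctx, uint32_t word); /* writes the TX-FIFO */
    void *ctx;
} tx_fifo_ops;

/* Push `len` bytes as packets of at most `mps` bytes. The free-space
 * register is polled only between packets, never in the middle of one. */
void push_in_transfer(const tx_fifo_ops *ops, const uint8_t *data,
                      size_t len, size_t mps)
{
    assert(mps != 0u);
    size_t off = 0u;
    while (off < len) {
        size_t pkt = (len - off < mps) ? (len - off) : mps;
        size_t words = (pkt + 3u) / 4u;

        /* Packet boundary: wait until the whole packet fits. */
        while (ops->free_words(ops->ctx) < words) { }

        for (size_t i = 0u; i < words; i++) {
            uint32_t w = 0u;
            for (size_t b = 0u; b < 4u && i * 4u + b < pkt; b++) {
                w |= (uint32_t)data[off + i * 4u + b] << (8u * b);
            }
            ops->push_word(ops->ctx, w);
        }
        off += pkt;
    }
}

/* Off-target mock of the observed behaviour (an assumption, not the RM):
 * free space reads 0 once 7 packets are queued, and each poll that sees
 * "full" pretends the host collects one packet before the next poll. */
typedef struct {
    uint32_t fifo_words, packet_words;
    uint32_t used_words, queued_packets, words_in_packet;
    uint32_t total_words, overflows;
} mock_fifo;

uint32_t mock_free_words(void *ctx)
{
    mock_fifo *f = (mock_fifo *)ctx;
    if (f->queued_packets >= 7u) {
        f->queued_packets--;
        f->used_words -= f->packet_words;
        return 0u;
    }
    return f->fifo_words - f->used_words;
}

void mock_push_word(void *ctx, uint32_t word)
{
    mock_fifo *f = (mock_fifo *)ctx;
    (void)word;
    /* Count writes that arrive while the FIFO reports no free space. */
    if (f->queued_packets >= 7u || f->used_words >= f->fifo_words) {
        f->overflows++;
    }
    f->used_words++;
    f->total_words++;
    if (++f->words_in_packet == f->packet_words) {
        f->words_in_packet = 0u;
        f->queued_packets++;
    }
}
```

Running a 512-byte transfer with 64-byte packets through the mock (128-word FIFO, 16-word packets) pushes all 128 words without a single write landing while the FIFO reports zero free space. The busy-wait is the simplest option; a real driver might instead return and resume from the TX-FIFO-empty interrupt.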
Apparently the RM did indeed mention this limitation, but in a very innocuous manner:
> Every time a packet is written into the transmit FIFO by the application, the transfer size for that endpoint is decremented by the packet size. The data is fetched from the memory by the application, until the transfer size for the endpoint becomes 0. After writing the data into the FIFO, the “number of packets in FIFO” count is incremented (this is a 3-bit count, internally maintained by the core for each IN endpoint transmit FIFO. The maximum number of packets maintained by the core at any time in an IN endpoint FIFO is eight). For zero-length packets, a separate flag is set for each FIFO, without any data in the FIFO.
So my theory is that -- once seven packets' worth of data gets pushed into the TX-FIFO -- the internal 3-bit counter becomes 0b111, and since this is its maximum, the USB core promptly snaps DTXFSTSx to 0 to stop the application from pushing more data. Of course, if this were the case, then the maximum number of packets maintained by the core in an IN endpoint FIFO is actually seven, not eight as the RM states.
Because of this limitation though, what exactly is the point of being able to allocate more to the TX-FIFO if you can only schedule at most 7 IN packets? The only reason I can think of is that this is only actually useful in host mode where the endpoints become channels, but I haven't used host mode, so I wouldn't know. It's even more annoying considering that it's a 3-bit counter. If it were a 4-bit counter, sending a 512-byte sector's worth of data in one go would become possible, and as a bonus: 4 is a power of two!
It is what it is, so I guess I'll just have to deal with it.
2024-07-21 12:47 AM
Thanks for the next chapter of your OTG-USB saga. Most enlightening.
> what exactly is the point of being able to allocate more to the TX-FIFO if you can only schedule at most 7 IN packets
Again, the reason is the DMA mode (which you can't use in the 'L4's incarnation of OTG). So you just fill in PKTCNT and XFRSIZ, enable DMA and lean back. The internal DMA machine will then take care of pumping data in a manner not overwhelming the packet-queue you've just described.
This also partially exposes the fragility of the apparently ad-hoc-glued-up-and-patched nature of the FIFO machine. This, and the "don't mix FIFO and registers accesses" erratum. And the "don't mix different FIFOs accesses" error ST/Synopsys is yet to discover.
JW