STM32L4 - USB - Sporadic Stall (-EPIPE) when reading control-endpoint.

stst9180 · ‎2022-04-07

Dear list,

I'm currently developing a USB audio composite device based on an STM32L4.

As composite devices are not really supported by the Cube firmware, I have adapted the stack to support them in the meantime.

The device consists of a virtual COM port and 2 audio devices.

Everything is recognized correctly and basically works most of the time.

Unfortunately there is one single problem left which seems to be a timing issue somehow:

Sometimes, when doing an EP0 IN transfer for a descriptor read or an audio control read (e.g. reading the frequency of the clock unit), I see an endpoint stall status (-EPIPE) in Wireshark although the corresponding data is appended to the packet. Yes: there is a packet with a -EPIPE status that nevertheless contains the requested data.

I don't have any clue where this could originate from, as none of the breakpoints I set to catch the STALL (e.g. where it is set) are ever hit.

So might this be something in the core?

Does anyone have a further idea where or what to examine to find out where this single stall status comes from, even though "USBD_CtlError" is not hit?

I see that a stall condition is also set at the end of a successful transfer on the data IN endpoint and is cleared on a new SETUP packet. But as the whole stack seems to be interrupt-based (from the same interrupt source), I'm wondering what could interfere here.

I have attached one dissected trace which shows the problem:

Here you can see 2 attempts to get the manufacturer string descriptor.

The 1st one fails; the 2nd one succeeds, although it is exactly the same request.

The same happens for other control IN transfers.

Maybe someone has some ideas on how to debug this further. If anyone needs further information (code parts, descriptors, Wireshark dissections), please contact me. I just didn't want to flood the board.

Thanks in advance.

Best Regards

Pascal

waclawek.jan · ‎2022-04-08

Which 'L4? Within the 'L4 family, there are two entirely different USB modules - the device-only one in the lower-end models, and the Synopsys OTG in the higher-end models.

Wireshark records are not that easy to interpret, as they "merely" visualize what the Windows drivers deemed worthy of recording (i.e. primarily stuff which is important for Windows, not necessarily what is needed for low-level communication debugging).

Broken pipe, or halt (terminology is very badly defined and used in the USB world), at the host side is not necessarily the result of a STALL, although of course it depends on the interpretation of the particular host. It may well be any other error, e.g. response timeout, CRC error, unexpectedly short packet, unexpected packet type. But those should be rare.

Do you actually have problems with delivering descriptors to the host? If not, you can most probably ignore this, as the host obviously retried and succeeded.

If you suspect genuine problems on the bus, or on the device side, it may be better to observe the bus itself. There are relatively inexpensive bus analyzers out there for FS, and some LAs/oscilloscopes also provide protocol decoding.

JW

stst9180 · ‎2022-04-08

Hi Jan,

it's an STM32L476, sorry for missing that information.

I already saw erratum 2.25.4 and first thought it could be related. But as all register writes "should" originate from the USB interrupt, there should not be any parallel processing here, so this doesn't seem to be the case.

I also commented out the "CDC_Write()" calls, which would be the only USB method called from normal thread context. All other methods currently only write to buffers, and the buffers are then examined on the next ISOC transfer, so I seem to respect the erratum.

It's not only happening on descriptor reads.

Because Windows does not support feedback endpoints on USB1.0 devices, I changed to a (self-written) UAC2 audio interface class implementation. With UAC2 it's mandatory to answer the clock_source frequency request via a control transfer. The host does not retry this request, so if it fails the device is not detected. No way to ignore the problem here.
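
For readers unfamiliar with that request: as far as I recall the UAC2 spec, the CUR value of the clock source sampling frequency control is returned as a 4-byte little-endian field in the data stage of the control IN transfer. The following is only a minimal sketch of building that payload; the helper name `uac2_encode_sam_freq_cur` and the length macro are hypothetical and not taken from the poster's code or from ST's library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed length of the CUR parameter block: one 32-bit frequency field. */
#define UAC2_SAM_FREQ_CUR_LEN 4u

/*
 * Hypothetical helper: serialize the current sampling frequency (in Hz)
 * into buf as a little-endian 32-bit value.
 * Returns the number of bytes written, or 0 if buf is NULL or too small.
 */
size_t uac2_encode_sam_freq_cur(uint32_t freq_hz, uint8_t *buf, size_t buf_len)
{
    if (buf == NULL || buf_len < UAC2_SAM_FREQ_CUR_LEN) {
        return 0u;
    }
    buf[0] = (uint8_t)(freq_hz & 0xFFu);
    buf[1] = (uint8_t)((freq_hz >> 8) & 0xFFu);
    buf[2] = (uint8_t)((freq_hz >> 16) & 0xFFu);
    buf[3] = (uint8_t)((freq_hz >> 24) & 0xFFu);
    return UAC2_SAM_FREQ_CUR_LEN;
}
```

In a Cube-based stack the resulting buffer would presumably be handed to the EP0 data stage (e.g. via USBD_CtlSendData()), followed by the zero-length OUT status stage.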

The problem is also observed with a Linux host, so it doesn't seem to depend on the host OS or core here.

Exactly the same thing happens on GET_DESCRIPTOR and CONTROL_IN transfers, so I also don't think it's a bug in my implementation.

I'll try to get hold of a USB analyzer, but in the meantime I would be grateful for any other comment that helps me investigate this issue further.

Regards Pascal

waclawek.jan · ‎2022-04-10

You can try some simple things - check connections/pins soldering, check USB cable, add or remove USB hub, use a different USB port or a completely different PC, review power supply/make sure PC and target grounds are properly connected and there's no ground loop or noise current flowing through ground.

You can also try to cut down the software to bare minimum (or start a new minimal test project), i.e. not a composite but "single" device, no RTOS if you were using any, no other "processes". Review clocks routing and stability.

JW

stst9180 · ‎2022-04-12

Well.. I tried a few of those simple things.

What I have not done so far is create a completely new project without an RTOS, but I removed all other threads so that USB is "on its own". I also removed the CDC from the composite device. All without success.

The pin soldering seems fine, another cable has been tested, and signal integrity on the scope seems fine.

I have now managed to hook up an LA with USB decoding and found the following:

I recorded a device connect and put a trigger on the (sometimes) failing control request.

In the attachment you'll see the recording of a successful control request (SCR) and one of an unsuccessful control request (UCR), where the host complains about not getting the data.

In packet #30 of the UCR you'll see a STALL sent in response to the ZLP OUT packet, which always follows that IN transaction.

In packet #29 of the SCR you'll see that the device "should" answer those OUT transactions with a "NAK" status.

So I have found the cause on the wire, but not yet in the software.

Best Regards Pascal

stst9180 · ‎2022-04-12

I dug in a bit more and found out that I reach HAL_PCD_SetupStageCallback with the STALL bit set on the EP0 OUT endpoint.

IMHO this is illegal, as a SETUP token should always clear this STALL bit in hardware (it can't be cleared in software).

I'll double-check whether this is due to deferred interrupt handling (as IRQs seem to be handled one after another).
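
A check like the one described above could be sketched roughly as follows. This is not the poster's code: the function and type names are hypothetical, and the bit positions (STALL = bit 21, NAKSTS = bit 17, EPENA = bit 31 of the EP0 OUT control register DOEPCTL) are my reading of the Synopsys OTG register layout and should be verified against the RM or the CMSIS header (USB_OTG_DOEPCTL_STALL etc.). The decoder takes a plain register snapshot so it can also be run off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in OTG DOEPCTL0; verify against the reference manual. */
#define DOEPCTL_NAKSTS_BIT (1u << 17)
#define DOEPCTL_STALL_BIT  (1u << 21)
#define DOEPCTL_EPENA_BIT  (1u << 31)

/* Hypothetical decoded view of one DOEPCTL snapshot. */
typedef struct {
    bool stall;   /* STALL handshake armed on the OUT endpoint */
    bool naksts;  /* core currently answers OUT tokens with NAK */
    bool epena;   /* endpoint enabled for reception */
} ep0_out_state_t;

/* Decode a raw DOEPCTL value captured at the start of the SETUP callback. */
ep0_out_state_t ep0_out_decode(uint32_t doepctl)
{
    ep0_out_state_t s;
    s.stall  = (doepctl & DOEPCTL_STALL_BIT)  != 0u;
    s.naksts = (doepctl & DOEPCTL_NAKSTS_BIT) != 0u;
    s.epena  = (doepctl & DOEPCTL_EPENA_BIT)  != 0u;
    return s;
}
```

On target, the snapshot would be taken as the very first statement of HAL_PCD_SetupStageCallback (in the Cube LL driver the register is typically reachable as USBx_OUTEP(0)->DOEPCTL), and a breakpoint or counter would fire whenever `stall` is true.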

Regards Pascal

waclawek.jan · ‎2022-04-12

The STALL occurred in the STATUS stage of the IN control transfer. That stage is started by USBD_CtlReceiveStatus(), which goes down the rabbit hole of Cube's USB implementation all the way to the bottom, where it writes to EP0's DOEPCTL. If it's a bug, though, DOEPCTL might have been incorrectly written from anywhere else. The least you can do is track each value written to DOEPCTL. An expensive debugger with trace capability would help here, but you can also e.g. write a copy of each value to an array in memory, together with a suitable timestamp and/or some other context-determining data.
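
The "copy to an array in memory" idea could look something like the sketch below. All names here (`doepctl_trace_write`, the entry layout, the `site` tag) are hypothetical; the timestamp is passed in by the caller, who might for example use the Cortex-M4 cycle counter. Each write to DOEPCTL in the driver would be replaced by a call to this wrapper.

```c
#include <stdint.h>

#define DOEPCTL_TRACE_DEPTH 64u /* must be a power of two */

/* One logged register write. */
typedef struct {
    uint32_t timestamp; /* caller-supplied, e.g. a cycle counter value */
    uint32_t value;     /* value written to DOEPCTL */
    uint16_t site;      /* caller-chosen tag identifying the call site */
} doepctl_trace_entry_t;

volatile doepctl_trace_entry_t doepctl_trace[DOEPCTL_TRACE_DEPTH];
volatile uint32_t doepctl_trace_count; /* total number of writes logged */

/*
 * Log the write into a ring buffer, then perform it.
 * Not re-entrant: assumes all callers run in the same interrupt context.
 */
void doepctl_trace_write(volatile uint32_t *reg, uint32_t value,
                         uint32_t timestamp, uint16_t site)
{
    uint32_t idx = doepctl_trace_count & (DOEPCTL_TRACE_DEPTH - 1u);

    doepctl_trace[idx].timestamp = timestamp;
    doepctl_trace[idx].value     = value;
    doepctl_trace[idx].site      = site;
    doepctl_trace_count          = doepctl_trace_count + 1u;

    *reg = value;
}
```

After a failing transfer, the array can be inspected in the debugger; the last entries before the bad SETUP show which call site wrote the STALL bit and when. Read-modify-write accesses would need the same treatment, logging the final value that is written back.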

[EDIT] You can also start your debugging based on content of SETUP packet preceding the STALL. It appears to be a CLASS-specific request. [/EDIT]

JW

stst9180 · ‎2022-04-12

Yes, I know which request is issued, but the STALL on the OUT endpoint is already there DIRECTLY in the interrupt of the corresponding SETUP packet (#18). I read DOEPCTL of EP0 in HAL_PCD_SetupStageCallback to verify this. In my opinion this is "NOT ALLOWED", as the SETUP token itself should clear all STALLs (so the RM states). Could there be some race condition where DOEPCTL is set to STALL just before this IRQ is handled (e.g. by deferred handling of a previous token)?! I don't know how the core internally resets the STALL bit of the endpoint, as it's not resettable by software, but if that happens on receiving the SETUP from the host, the uC might still be in the IRQ of a previous packet and write the STALL bit afterwards.

But I don't really believe that, as there is a rather big time gap between packet #17 (ACK) and #18 (SETUP).

I should mention that I'm using a rather old version of ST's firmware (1.13.0). Maybe this was an issue that has already been fixed. Unfortunately it's a lot of work in this software structure to move to the newest version, but it could nevertheless be worth a try.

waclawek.jan · ‎2022-04-12

The previous packet was the same ZLP In in STATUS stage, so you still need to look at the same places.

It perhaps all boils down to trying to observe all writes to EP0's DOEPCTL.

Reviewing the RM, I stumbled upon an interesting, obscure bit: OTG_FS_DCFG.NZLSOHSK. It should not be set, and even if it were, in this particular case it should do no harm, as the incoming STATUS packet was indeed a ZLP. But why not have a look at OTG_FS_DCFG?
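
For that look at OTG_FS_DCFG, a small decoder for a register snapshot might help. This is a sketch under assumptions: the field positions (DSPD = bits 1:0, NZLSOHSK = bit 2, DAD = bits 10:4) are as I remember them from the RM and should be checked there or in the CMSIS header (USB_OTG_DCFG_NZLSOHSK etc.), and the struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed OTG_FS_DCFG field layout; verify against the reference manual. */
#define DCFG_DSPD_MASK    0x3u
#define DCFG_NZLSOHSK_BIT (1u << 2)
#define DCFG_DAD_SHIFT    4u
#define DCFG_DAD_MASK     0x7Fu

/* Hypothetical decoded view of a DCFG snapshot. */
typedef struct {
    uint8_t dspd;     /* device speed selection */
    bool    nzlsohsk; /* non-zero-length status OUT handshake bit */
    uint8_t dad;      /* device address assigned by the host */
} dcfg_fields_t;

/* Split a raw DCFG value into the fields of interest. */
dcfg_fields_t dcfg_decode(uint32_t dcfg)
{
    dcfg_fields_t f;
    f.dspd     = (uint8_t)(dcfg & DCFG_DSPD_MASK);
    f.nzlsohsk = (dcfg & DCFG_NZLSOHSK_BIT) != 0u;
    f.dad      = (uint8_t)((dcfg >> DCFG_DAD_SHIFT) & DCFG_DAD_MASK);
    return f;
}
```

If `nzlsohsk` ever reads true in a snapshot taken around the failing transfer, that would at least be worth following up, even though it should not matter for a genuine ZLP status packet.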

JW