STM32F745 USB endpoint EPENA stuck CDC/MSC

James Murray · 2020-11-09

I have an intermittent issue with a USB composite device on STM32F745 where an endpoint is failing to send data back to the PC and then the PC resets the connection.

I've applied the interrupt disabling from https://community.st.com/s/question/0D50X00009XkhXw/usb-cdc-device-receive-fails-on-transmit and that improved things somewhat, but the problem isn't fixed.

My code is derived from ST's example code.

In normal operation, the MSC is queried by the PC periodically with a "Test Unit Ready" command. The CDC is in full flow at about 150 useful data packets per second.

After a random time period (between a few minutes and multiple hours) the USB send function is unable to add data to the IN FIFO.

This line fails:

if ( ((USBx_INEP(ep)->DIEPCTL & USB_OTG_DIEPCTL_EPENA) == 0)
  && ((USBx_INEP(ep)->DTXFSTS & USB_OTG_DTXFSTS_INEPTFSAV) >= len32b) ) {

because EPENA is set.
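
For readers following along, here is a minimal sketch of what that condition tests, written against plain register values so it can be run on a PC. The macro and function names are made up for this illustration (they are not ST's), and the bit positions (EPENA as bit 31 of DIEPCTL, INEPTFSAV as the low 16 bits of DTXFSTS) should be checked against the reference manual for your part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only; on target you would use ST's USB_OTG_* macros. */
#define SKETCH_DIEPCTL_EPENA      (UINT32_C(1) << 31) /* endpoint enable */
#define SKETCH_DTXFSTS_INEPTFSAV  UINT32_C(0xFFFF)    /* free TxFIFO space, in 32-bit words */

/* The FIFO is accessed in 32-bit words, so round the byte count up. */
static uint32_t bytes_to_words(uint32_t len_bytes)
{
    return (len_bytes + 3U) / 4U;
}

/* Same test as the quoted line: the endpoint must not already be enabled,
 * and the TxFIFO must have room for the whole packet. */
static bool in_fifo_write_allowed(uint32_t diepctl, uint32_t dtxfsts,
                                  uint32_t len_bytes)
{
    bool ep_idle  = (diepctl & SKETCH_DIEPCTL_EPENA) == 0U;
    bool has_room = (dtxfsts & SKETCH_DTXFSTS_INEPTFSAV) >= bytes_to_words(len_bytes);
    return ep_idle && has_room;
}
```

With EPENA still set from an earlier transfer, the function returns false no matter how much FIFO space is free, which is the failure described here.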

My code was previously just ignoring this and hoping that the host asked for the data again. However, when this fails on the Test Unit Ready command, the host resets the USB connection after about 30 seconds which disrupts my CDC data.

Also, once that has happened, it continues to reset every 30 seconds because the MSC never works again - the endpoint seems broken. A device reset in the debugger clears it up.

I tried adding code for a short wait and retry (no effect) and also tried to set EPDIS and then check for the DIEPINT_EPDISD but that doesn't appear to ever get set.

In another post ( https://community.st.com/s/question/0D50X00009Xki6G/stm32f429-usb-bug-on-updating-diepctl-register-while-epena-set ) a customer has code that clears the EPENA bit directly - can this work? The datasheet only mentions the core clearing the bit.

What are the causes for the EPENA bit getting "stuck" on? Why isn't setting the DIEPCTL_EPDIS bit clearing it?

Any other ideas?

James

James Murray · 2020-11-17

Since this last post, I've been working continuously on my code without any real success.

I've removed everything except MSC/CDC - no change.

Undone my streamlining and flattening to use the HAL as it was intended - no change.

Upgraded/merged the HAL code to use the relevant portions that come in STM32Cube V1.16.

With these latest changes, I haven't yet seen the MSC problem, because CDC gets locked up far sooner...

I appear to be having the same issue as:

https://community.st.com/s/question/0D50X00009XkfhNSAR/stm32cube-usb-device-driver-interrupt-handler-fails-to-clear-txstate-leaving-cdc-unable-to-send-in-transfers

and

https://community.st.com/s/question/0D50X00009XkfgL/hi-facing-usb-write-issue-when-usb-traffic-is-more

I tried the suggestion from the first post of moving the IRQ sections around, but it didn't appear to make any difference to me.

After about 30 minutes at 300 frames per second (13 data bytes PC -> STM32, 112 data bytes STM32 -> PC), TxState is stuck at 1. The O/S is asking the MSC for a Test Unit Ready and then a Request Sense every couple of seconds.

I'm processing the CDC data in the mainloop, so the TxState should be cleared by the interrupt handler, but it isn't. I already added a delay and retry for testing purposes, but no luck.

My thoughts on potential fault areas :

1. My code (most likely)

2. ST's code

3. Synopsys USB IP.

4. PC operating system

I'm mostly testing in Linux, but did test a little in Windows 10 and got the earlier MSC problem, so that would seem to rule out the PC O/S.

The HAL upgrade looked promising as there was mention of changing the USB IRQ handler and some bugfix in MSC SCSI handling.

Can anyone else confirm whether they've seen similar issues or, conversely, that they have developed rock-solid CDC/MSC HS composite devices?

Any suggestions on what else to try?

This feels like an interrupt race or FIFO handling issue. I'm considering removing the PCD and LL-USB code and writing those low-level portions afresh using the datasheet.

James

James Murray · 2020-11-18

Testing again this morning and the CDC stopped responding to requests after 1h15.

I have PC Linux kernel debugging enabled and see this:

Nov 18 13:03:13 jsm3 kernel: [1770659.790280] ehci-pci 0000:00:1a.7: detected XactErr len 0/1024 retry 1

Nov 18 13:03:13 jsm3 kernel: [1770659.790405] ehci-pci 0000:00:1a.7: detected XactErr len 0/1024 retry 2

..snip...

Nov 18 13:03:13 jsm3 kernel: [1770659.793905] ehci-pci 0000:00:1a.7: detected XactErr len 0/1024 retry 30

Nov 18 13:03:13 jsm3 kernel: [1770659.794030] ehci-pci 0000:00:1a.7: detected XactErr len 0/1024 retry 31

Nov 18 13:03:13 jsm3 kernel: [1770659.794156] ehci-pci 0000:00:1a.7: devpath 2 ep2in 3strikes

Nov 18 13:03:13 jsm3 kernel: [1770659.794164] cdc_acm 1-2:1.1: acm_read_bulk_callback - non-zero urb status: -71

This matches up with a Wireshark trace that shows a packet from EP2 with "URB status: Protocol error (-EPROTO) (-71)"

I don't know what the protocol error is however as I don't have a USB bus analyser (yet.)

What could cause this?

James

waclawek.jan · 2020-11-18

I feel your pain. The Synopsys OTG is a beast with historical layers, extremely picky about the sequence of events; and ST plays dead when it comes to details.

(Plus USB being a pile of *** in itself, and the hosts adding issues of their own (Linux being no exception)).

As an illustration, https://community.st.com/s/question/0D50X00009nKK7w/synopsys-otg-fifo-clash . I don't say this is your problem, as I experienced automatic recovery, but then it was an 'F4, and there are different versions of the OTG module in different STM32 (something ST also kindly omits to mention, denying even the existence of the version/setup read-only registers, so that users don't pester with more questions).

Start perhaps with carefully observing all the relevant registers when the problem occurs (beware, observing registers with a debugger while the target is running may be intrusive). Also try increasing the occurrence of the problem: try to overload/stress the MCU.
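
One less intrusive way to observe the registers is to have the firmware copy them into a RAM struct at the moment the fault is detected, and decode the copy afterwards. The sketch below shows only the decoding step, on plain values so it runs on a PC. All type and function names are hypothetical, and the bit positions (EPENA bit 31, EPDIS bit 30, NAKSTS bit 17 in DIEPCTL; EPDISD bit 1, XFRC bit 0 in DIEPINT; INEPTFSAV in DTXFSTS bits 15:0) should be verified against the reference manual for the OTG version on your part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw copy of one IN endpoint's registers, taken by firmware at fault time. */
typedef struct {
    uint32_t diepctl;
    uint32_t diepint;
    uint32_t dieptsiz;  /* kept raw for later inspection */
    uint32_t dtxfsts;
} in_ep_snapshot_t;

/* Human-readable view of the fields discussed in this thread. */
typedef struct {
    bool     epena;            /* DIEPCTL bit 31: endpoint enabled */
    bool     epdis;            /* DIEPCTL bit 30: disable requested */
    bool     naksts;           /* DIEPCTL bit 17: core is sending NAK */
    bool     epdisd;           /* DIEPINT bit 1: endpoint disabled interrupt */
    bool     xfrc;             /* DIEPINT bit 0: transfer completed */
    uint32_t fifo_free_words;  /* DTXFSTS bits 15:0 */
} in_ep_decoded_t;

static in_ep_decoded_t decode_in_ep(const in_ep_snapshot_t *s)
{
    in_ep_decoded_t d;
    d.epena  = (s->diepctl >> 31) & 1U;
    d.epdis  = (s->diepctl >> 30) & 1U;
    d.naksts = (s->diepctl >> 17) & 1U;
    d.epdisd = (s->diepint >> 1) & 1U;
    d.xfrc   = (s->diepint >> 0) & 1U;
    d.fifo_free_words = s->dtxfsts & 0xFFFFU;
    return d;
}
```

On target, the snapshot would be filled from the endpoint's registers in the code path that detects the stuck state, so nothing has to be read through the debugger while USB traffic is running.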

JW

James Murray · 2020-11-19

Thanks for replying!

I _think_ I've actually resolved the problem now. My FIFOs were set wrong (and have been since 2018). My minimal test code has been running for 20h now.

The HAL_PCDEx_SetTxFiFo code doesn't do any error checking or present a return code if you set FIFOs that exceed the USB RAM zone. I was trying to set them in bytes instead of words. The example code I started with didn't explain the numbers. When I'm 100% confident that I've resolved it I'll post a sample of fixed code with notes that may help anyone else stumbling on this post in the future.
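
Until fixed code is posted, the kind of check that is missing can be sketched as below: add up the Rx and Tx FIFO sizes in 32-bit words and compare against the FIFO RAM available. This is an illustration, not James's code. The helper name and the example sizes are hypothetical, and the 1024-word total assumes 4 KB of dedicated FIFO RAM on the OTG_HS core, which should be confirmed in the reference manual (the OTG_FS core has less).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 4 KB of FIFO RAM on OTG_HS = 1024 32-bit words. Check your part. */
#define SKETCH_OTG_HS_FIFO_WORDS 1024U

/* Returns true if the Rx FIFO plus all Tx FIFOs fit in the FIFO RAM.
 * Every size is in 32-bit words, the unit the FIFO size registers use. */
static bool fifo_layout_fits(uint32_t rx_words, const uint32_t *tx_words,
                             size_t n_tx, uint32_t total_words)
{
    if (rx_words > total_words) {
        return false;
    }
    uint32_t used = rx_words;
    for (size_t i = 0; i < n_tx; i++) {
        /* Compare against the remaining space so the sum cannot overflow. */
        if (tx_words[i] > total_words - used) {
            return false;
        }
        used += tx_words[i];
    }
    return true;
}
```

As an example with made-up numbers, an Rx FIFO of 512 words with Tx FIFOs of 128, 256 and 64 words uses 960 of 1024 words and passes. The same layout entered as byte counts (2048, 512, 1024, 256) is four times too large and fails, which is the mistake described above. Running a check like this before the FIFO setup calls would flag it at start-up.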

James

waclawek.jan · 2020-11-19

Thanks for sharing the solution.

Another tile to the puzzle...

JW

James Murray · 2021-03-12

I'm not so sure about my "solution" any more. I've had the same problem recur sometimes, but have not dedicated any more time to investigating.

James