USB Host HS with DMA corrupts memory after the receive buffer

Rick Sladkey · ‎2018-04-04

Posted on April 05, 2018 at 05:38

After many forum searches, patching and considerable effort, I finally got good results with my CDC/ACM USB Host HS with DMA, using STM32Cube_FW_F7_V1.11.0. I thought I was past the worst of it, but long-term testing has revealed infrequent hard crashes or mysterious USB stack hangs. With the help of the debugger, I discovered that the memory AFTER my USB receive buffer was being corrupted. With the help of the MPU, I ruled out that any code running on the processor was even accessing the memory that was being corrupted. The only remaining conclusion is that the USB DMA operation was somehow overrunning the buffer and size passed in, in my case, to USBH_CDC_Receive.

After reverse engineering how the USB HS DMA host library is supposed to work, and referring to the rather vague and incomplete OTG_HS section of the reference manual, my guess is that the HCDMA register, which is set by the host library in USB_HC_StartXfer and is auto-incremented as the DMA is being performed, is somehow being re-used for a second transfer without being rewound, and without a transfer complete occurring in the meantime.

After patiently waiting many hours for it to crash, I did finally catch it in a state where I could observe from the USB data structure handles:

A 'data toggle error' had occurred on the input channel
Memory after the receive buffer had been overwritten
No receive callback had occurred, indicating the receive was still in progress
And yet the input channel was not enabled

Basically, the stack was hung. However, in this state, simply issuing a new USBH_CDC_Receive kick-started the receive process and everything resumed working normally, except that memory after the receive buffer was corrupted. Fortunately, I had left a large surplus after the buffer to avoid corrupting other critical data structures in my program.
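
For anyone wanting to catch this kind of overrun without waiting for a crash, the surplus after the buffer can double as a detector. This is only a minimal sketch: the names guarded_rx_buffer, guard_init and guard_overrun_extent are hypothetical, the sizes and fill byte are arbitrary, and it ignores cache maintenance and alignment concerns that a real DMA buffer on this part would need to consider.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_SIZE    512u   /* one high-speed bulk packet */
#define GUARD_SIZE 1024u  /* hypothetical surplus placed after the buffer */
#define GUARD_BYTE 0xA5u  /* arbitrary fill pattern */

/* Receive buffer followed directly by a guard region (no padding,
 * since both members are byte arrays). */
typedef struct {
    uint8_t data[RX_SIZE];
    uint8_t guard[GUARD_SIZE];
} guarded_rx_buffer;

/* Fill the guard region with the known pattern before starting a receive. */
static void guard_init(guarded_rx_buffer *b)
{
    memset(b->guard, GUARD_BYTE, sizeof b->guard);
}

/* Return how far into the guard region the pattern has been disturbed:
 * the offset just past the last modified byte, or 0 if the guard is intact.
 * Overrun bytes that happen to equal GUARD_BYTE cannot be detected. */
static size_t guard_overrun_extent(const guarded_rx_buffer *b)
{
    for (size_t i = sizeof b->guard; i > 0; --i) {
        if (b->guard[i - 1] != GUARD_BYTE) {
            return i;
        }
    }
    return 0;
}
```

The idea would be to call guard_init once, pass the data member to USBH_CDC_Receive, and check guard_overrun_extent periodically or in the receive callback.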

#usb-hs-dma-phy-host-cdc

Rick Sladkey · ‎2018-04-05

Posted on April 05, 2018 at 17:44

I found a new part of the puzzle, and this time it is related to the CDC host class driver. When an URB status change notification for URB_DONE occurs, the CDC driver does this:

 /* Check the status done for reception */
 if (URB_Status == USBH_URB_DONE)
 {
   length = USBH_LL_GetLastXferSize(phost, CDC_Handle->DataItf.InPipe);

   if (((CDC_Handle->RxDataLength - length) > 0) && (length > CDC_Handle->DataItf.InEpSize))
   {
     CDC_Handle->RxDataLength -= length;
     CDC_Handle->pRxData += length;
     CDC_Handle->data_rx_state = CDC_RECEIVE_DATA;
   }
   else
   {
     CDC_Handle->data_rx_state = CDC_IDLE;
     USBH_CDC_ReceiveCallback(phost);
   }
 #if (USBH_USE_OS == 1)
   osMessagePut(phost->os_event, USBH_CLASS_EVENT, 0);
 #endif
 }

but since the RxDataLength and length values are both unsigned, the result of the subtraction can never be negative, so the > 0 test passes whenever the two values differ, even when length is larger than RxDataLength. Since this is the check that is intended to protect against overrunning the user's buffer, it seems certain that this is the cause of the corrupted memory. The question is, if RxDataLength is chosen to equal DataItf.InEpSize (in my case 512 for high speed USB), why is USBH_LL_GetLastXferSize ever returning a value larger than that? My guess is that it is a multi-packet transfer, but then why does it occur so rarely?

Here are the contents of CDC_Handle to prove it:

(gdb) p *$CDC_Handle

$3 = {CommItf = {NotifPipe = 2 '\002', NotifEp = 133, buff = '...', NotifEpSize = 10}, DataItf = {InPipe = 4, OutPipe = 3, OutEp = 3, InEp = 132, buff = '...', OutEpSize = 512, InEpSize = 512}, pTxData = 0x200014e0 <vcp_transmit_buffer> '', pRxData = 0x200012d6 <vcp_receive_buffer+534> '...', TxDataLength = 0, RxDataLength = 4294967274,

You can see that pRxData, which started out pointing to my 512-byte receive buffer, has been incremented by 534 bytes, and RxDataLength, which started out at 512, has been decremented 'past zero' and is now a huge positive number.
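
The numbers in that dump can be reproduced in isolation. The following is a minimal sketch rather than the library code: remaining_after, original_check and guarded_check are hypothetical helper names, and guarded_check is just one possible way to express the same intent without a subtraction that can wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same unsigned subtraction as the CDC driver: when length > rx_len the
 * result wraps modulo 2^32 instead of going negative. */
static uint32_t remaining_after(uint32_t rx_len, uint32_t length)
{
    return rx_len - length;
}

/* The condition as written in the driver excerpt. */
static bool original_check(uint32_t rx_len, uint32_t length, uint32_t ep_size)
{
    return ((rx_len - length) > 0) && (length > ep_size);
}

/* Hypothetical wrap-free variant: continue receiving only if this chunk
 * leaves room in the buffer and was larger than one packet. */
static bool guarded_check(uint32_t rx_len, uint32_t length, uint32_t ep_size)
{
    return (length < rx_len) && (length > ep_size);
}
```

With rx_len = 512 and length = 534, remaining_after gives 4294967274, exactly the RxDataLength shown in the dump; original_check returns true (so the driver advances pRxData by 534), while guarded_check returns false.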

Rick Sladkey · ‎2018-04-06

Posted on April 06, 2018 at 22:38

After fixing the above problem, I encountered corruption again after 24 hours of continuous operation. This time I can see that the xfer_count in the host channel is huge, clearly due to an underflow:

(gdb) set $phost = &hUsbHostHS

(gdb) set $CDC_Handle = (CDC_HandleTypeDef*) $phost->pActiveClass->pData
(gdb) set $ch_num = $CDC_Handle->DataItf.InPipe
(gdb) set $hhcd = &hhcd_USB_OTG_HS
(gdb) set $hc = &$hhcd->hc[$ch_num]
(gdb) p *$hc
$3 = {dev_addr = 1 '\001', ch_num = 4 '\004', ep_num = 4 '\004', ep_is_in = 1 '\001', speed = 0 '\000', do_ping = 0 '\000', process_ping = 0 '\000', ep_type = 2 '\002', max_packet = 512, data_pid = 0 '\000', xfer_buff = 0x200010c0 <vcp_receive_buffer> '...'..., xfer_len = 512, xfer_count = 4294444032, toggle_in = 0 '\000', toggle_out = 0 '\000', dma_addr = 0, ErrCnt = 0, urb_state = URB_DONE, state = HC_XFRC}

From looking at the code, xfer_count should ALWAYS be less than or equal to the xfer_len. Since xfer_count is only set twice by the code, and the first time sets it to zero, it must be the other time in the transfer complete interrupt handler:

 else if ((USBx_HC(chnum)->HCINT) & USB_OTG_HCINT_XFRC)
 {
   if (hhcd->Init.dma_enable)
   {
     hhcd->hc[chnum].xfer_count = hhcd->hc[chnum].xfer_len - \
                                  (USBx_HC(chnum)->HCTSIZ & USB_OTG_HCTSIZ_XFRSIZ);
   }

Frankly, I have no idea what is going on in the host controller to cause the XFRSIZ field of the HCTSIZ register to behave this way. Any ideas?

I need the host stack to behave reliably for days at a time.
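
Working backward from the dump at least pins down what XFRSIZ must have read. This is a sketch under one assumption: that XFRSIZ is the low 19 bits of HCTSIZ (a mask of 0x7FFFF, which is what I believe USB_OTG_HCTSIZ_XFRSIZ is defined as). The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: XFRSIZ is the low 19 bits of HCTSIZ. */
#define XFRSIZ_MASK 0x7FFFFu

/* Same expression as the XFRC handler: wraps modulo 2^32 whenever the
 * XFRSIZ field is larger than xfer_len. */
static uint32_t xfer_count_from(uint32_t xfer_len, uint32_t xfrsiz)
{
    return xfer_len - (xfrsiz & XFRSIZ_MASK);
}

/* Invert the handler's expression to recover XFRSIZ from a dumped
 * xfer_count. */
static uint32_t xfrsiz_from(uint32_t xfer_len, uint32_t xfer_count)
{
    return (xfer_len - xfer_count) & XFRSIZ_MASK;
}

/* Interpret the 19-bit field as a two's complement value, to see how far
 * "past zero" it would be if the field itself had wrapped. */
static int32_t xfrsiz_as_signed19(uint32_t xfrsiz)
{
    xfrsiz &= XFRSIZ_MASK;
    return (xfrsiz & 0x40000u) ? (int32_t)xfrsiz - 0x80000 : (int32_t)xfrsiz;
}
```

With xfer_len = 512 and xfer_count = 4294444032, xfrsiz_from gives 0x7FE00, and read as a signed 19-bit value that is -512. That would be consistent with the field having been decremented by 1024 bytes from a starting value of 512, i.e. one packet more than was requested, though I cannot confirm that this is what the hardware actually did.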