STM32H7 USB FS Host Core how to recover from "Data toggle error"

Microman
Associate III

Hi all,

Preamble:

I have an STM32H743 application that needs to constantly read a data stream from a USB memory stick. Basically, everything works. I needed to apply the NAK fix to keep FreeRTOS from being blocked by too many NAK IRQs. That is done and no longer an issue. Customers run the device every day without any problems. But there is one thing: one customer replaced the USB stick we ship with the product with a different USB stick. This stick works 100% fine on PCs and iMacs and also on other embedded devices, but not on the STM32H7 host.

Problem:

This 8 GB stick causes "Data toggle errors", which set an MCU flag called USB_OTG_HCINT_DTERR. The current host stack from STM32Cube H7 V1.7.0 is unable to recover from this. It loops forever in MSC_read. No error is thrown. Nothing else happens until the stick is pulled out (disconnect event). I tried fixing this by re-activating the host channel with the usual sequence, clearing USB_OTG_HCCHAR_CHDIS and setting USB_OTG_HCCHAR_CHENA in HCCHAR, but that does not help.

Has anyone seen this before, and are there any ideas how to get back to transferring without completely unmounting and remounting the stick?

7 REPLIES

IMO this should be handled like any other generic communication error, by the 3x retry mechanism per 10.2.6 Transmission Error Handling. I'm not sure if and how this is implemented in the Cube USB Host stack; I don't use it. Of course, after 3 errors the stack should return a comprehensible error, and then there's nothing else to do but restart.
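The retry policy described above could be sketched roughly as follows. This is a minimal host-independent model, not code from the Cube USB Host stack: `transfer_with_retry`, `attempt_fn`, `attempt_fail_n` and `MAX_ATTEMPTS` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ATTEMPTS 3U   /* assumed limit: give up after three failed attempts */

typedef enum { XFER_OK = 0, XFER_FAILED = -1 } xfer_status_t;

/* Hypothetical callback: performs one transaction attempt, true on success */
typedef bool (*attempt_fn)(void *ctx);

/* Tries the transaction up to MAX_ATTEMPTS times, then reports a clear error */
xfer_status_t transfer_with_retry(attempt_fn attempt, void *ctx)
{
    assert(attempt != NULL);
    for (unsigned tries = 0U; tries < MAX_ATTEMPTS; tries++) {
        if (attempt(ctx)) {
            return XFER_OK;
        }
    }
    return XFER_FAILED;   /* upper layer must handle this, e.g. by restarting */
}

/* Example stub: fails while *ctx > 0 (decrementing it), then succeeds */
bool attempt_fail_n(void *ctx)
{
    unsigned *remaining = (unsigned *)ctx;
    if (*remaining > 0U) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

With the stub, two failures followed by a success return `XFER_OK`, while three failures in a row return `XFER_FAILED`.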

I'd also suspect that this may be a consequence of an excessive raw data error rate (which may be hidden by hardware and other parts of the stack), so you may want to look at the integrity of the DP/DM signals, too.

What stick is this, exactly?

JW

Thanks for sharing your thoughts. What I do not see inside the host stack is any kind of transmission retry, or even any attempt to recover. It only does this in HCD_HC_IN_IRQHandler:

   __HAL_HCD_UNMASK_HALT_HC_INT(ch_num);
   (void)USB_HC_Halt(hhcd->Instance, (uint8_t)ch_num);
   __HAL_HCD_CLEAR_HC_INT(ch_num, USB_OTG_HCINT_NAK);
   __HAL_HCD_CLEAR_HC_INT(ch_num, USB_OTG_HCINT_DTERR);

and if it had happened in the OUT handler (which is never reached, because this only triggers in the IN handler):

   hhcd->hc[ch_num].state = HC_DATATGLERR;

There it would set a "Glitch Error".

The signals are OK, I would say. The lines between the STM and the USB jack have the same length (designed with AD19), they have 90 Ohm impedance, and the total length is a bit more than 2 cm. At any point along the line, the signals (DP and DM) look like this:

0693W000001q8jRQAQ.png

BTW it's officially upgraded to 3024 200MHz BW.

The stick is some Chinese "no name".

 0693W000001q8qvQAA.png

I looked in the 'F4 Cube, an older v1.21 version I have here, and it appears to be handled correctly. In HCD_RXQLVL_IRQHandler(), after reading from the FIFO, it does nothing and leaves the rest to the interrupt handler. Upon USB_OTG_HCINT_DTERR, that handler calls USB_HC_Halt() and sets hhcd->hc[chnum].state = HC_DATATGLERR. When the CHH interrupt then occurs, it performs the 3x retry and afterwards sets urb_state = URB_ERROR, which should signal to the upper level that the transfer failed.

JW

OK, thanks. Yes, I might have deleted the state = HC_DATATGLERR inside the IN handler.

I don't get how this works...

In stm32h7xx_hal_hcd.c, line 1275, HCD_HC_IN_IRQHandler() handles USB_OTG_HCINT_DTERR:

  else if ((USBx_HC(ch_num)->HCINT & USB_OTG_HCINT_DTERR) == USB_OTG_HCINT_DTERR)
  {
    __HAL_HCD_UNMASK_HALT_HC_INT(ch_num);
    (void)USB_HC_Halt(hhcd->Instance, (uint8_t)ch_num);
    __HAL_HCD_CLEAR_HC_INT(ch_num, USB_OTG_HCINT_NAK);
    hhcd->hc[ch_num].state = HC_DATATGLERR;
    __HAL_HCD_CLEAR_HC_INT(ch_num, USB_OTG_HCINT_DTERR);
  }

After that, it leaves HCD_HC_IN_IRQHandler(). Then the handler is called again and sees a USB_OTG_HCINT_CHH:

  else if ((hhcd->hc[ch_num].state == HC_XACTERR) ||
           (hhcd->hc[ch_num].state == HC_DATATGLERR)) /* <-- this is the current state */
  {
    hhcd->hc[ch_num].ErrCnt++; /* <-- increases ErrCnt from 0 to 1 */
    if (hhcd->hc[ch_num].ErrCnt > 3U)
    {
      hhcd->hc[ch_num].ErrCnt = 0U; /* <-- BUT it NEVER gets here because ErrCnt never reaches 3! */
      hhcd->hc[ch_num].urb_state = URB_ERROR;
    }
    else
    {
      hhcd->hc[ch_num].urb_state = URB_NOTREADY;
    }

Because ErrCnt is 1, it sets urb_state = URB_NOTREADY and tries the reactivation:

    /* re-activate the channel */
    tmpreg = USBx_HC(ch_num)->HCCHAR;
    tmpreg &= ~USB_OTG_HCCHAR_CHDIS;
    tmpreg |= USB_OTG_HCCHAR_CHENA;
    USBx_HC(ch_num)->HCCHAR = tmpreg;

On the next call of HCD_HC_IN_IRQHandler(), a USB_OTG_HCINT_XFRC has occurred, which resets ErrCnt to 0 again:

  else if ((USBx_HC(ch_num)->HCINT & USB_OTG_HCINT_XFRC) == USB_OTG_HCINT_XFRC)
  {
    if (hhcd->Init.dma_enable != 0U)
    {
      hhcd->hc[ch_num].xfer_count = hhcd->hc[ch_num].xfer_len - \
                                    (USBx_HC(ch_num)->HCTSIZ & USB_OTG_HCTSIZ_XFRSIZ);
    }
    hhcd->hc[ch_num].state = HC_XFRC;
    hhcd->hc[ch_num].ErrCnt = 0U;
    __HAL_HCD_CLEAR_HC_INT(ch_num, USB_OTG_HCINT_XFRC);

And this is how it loops forever, at least as far as I can tell with simple JTAG code/data breakpoints and no trace facility.
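The sequence walked through above can be replayed in a small host-side model. This is only a sketch of the counter logic as described in this post, not the HAL code itself: all `sim_` names are hypothetical, and `dterr_total` is an assumed extra diagnostic counter that the Cube stack does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SIM_EV_DTERR, SIM_EV_CHH, SIM_EV_XFRC } sim_event_t;
typedef enum { SIM_HC_IDLE, SIM_HC_DATATGLERR, SIM_HC_XFRC } sim_hc_state_t;
typedef enum { SIM_URB_IDLE, SIM_URB_NOTREADY, SIM_URB_ERROR } sim_urb_state_t;

typedef struct {
    sim_hc_state_t  state;
    sim_urb_state_t urb_state;
    uint32_t        err_cnt;      /* mirrors ErrCnt: cleared on XFRC */
    uint32_t        dterr_total;  /* hypothetical: never cleared by XFRC */
} sim_channel_t;

/* Applies one interrupt event to the modelled channel */
void sim_handle(sim_channel_t *ch, sim_event_t ev)
{
    assert(ch != NULL);
    switch (ev) {
    case SIM_EV_DTERR:            /* data toggle error: mark state, halt follows */
        ch->state = SIM_HC_DATATGLERR;
        ch->dterr_total++;
        break;
    case SIM_EV_CHH:              /* channel halted: count the error, maybe give up */
        if (ch->state == SIM_HC_DATATGLERR) {
            ch->err_cnt++;
            if (ch->err_cnt > 3U) {
                ch->err_cnt = 0U;
                ch->urb_state = SIM_URB_ERROR;
            } else {
                ch->urb_state = SIM_URB_NOTREADY;  /* channel gets re-activated */
            }
        }
        break;
    case SIM_EV_XFRC:             /* transfer complete: error count is cleared */
        ch->state = SIM_HC_XFRC;
        ch->err_cnt = 0U;
        break;
    }
}

/* Replays DTERR -> CHH -> XFRC and returns the highest err_cnt observed */
uint32_t sim_replay(sim_channel_t *ch, unsigned rounds)
{
    uint32_t max_err = 0U;
    for (unsigned i = 0U; i < rounds; i++) {
        sim_handle(ch, SIM_EV_DTERR);
        sim_handle(ch, SIM_EV_CHH);
        if (ch->err_cnt > max_err) {
            max_err = ch->err_cnt;
        }
        sim_handle(ch, SIM_EV_XFRC);
    }
    return max_err;
}
```

In this model, replaying DTERR, CHH, XFRC for any number of rounds never lets `err_cnt` exceed 1, so `SIM_URB_ERROR` is never reached. A counter that XFRC does not clear keeps growing, and an upper layer could conceivably use something like it to give up and report the problem; whether that is appropriate on the real stack is untested here.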

XFRC means that transfer finished correctly, so it's OK to reset the error count there.

I was under the impression that you were talking about a randomly occurring phenomenon; if it's regular, then maybe you are dealing with a devious device. At this point I recommend using a bus analyser, and to the usual reply "I don't have that" I say "get one; USB can't be seriously dealt with without one, FS ones are not prohibitively expensive, and many oscilloscopes/LAs these days come with a USB analyzing option".

You also mentioned some fix for the interrupt storm; even without analyzing much what it does, you can experimentally remove it.

JW

Yes, it's quite reproducible. The first few sectors (somewhere between 3 and 10) can be read, then the toggle error occurs. Might it be some clock drift problem? My Agilent scope can trigger on SOP and EOP at full speed/low speed, that's it. There is no protocol analysis feature for USB. Any suggestions for a USB analyzer? I only need FS.

I use a cheap LA, an Asix Sigma2 with the USB plugin; not the greatest, but workable. I am by no means a USB expert and try to avoid it as much as possible.

I have heard praise on the Beagle USB analyzer.

Signals look good. PID 0x5678 and VID 0xffff, that tells a lot. Unfortunately, it's hard to explain to the customer.
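For what it's worth, the VID/PID could also be checked in firmware so that a suspicious stick is at least logged. A minimal sketch, assuming the raw 18-byte device descriptor is available as a byte buffer; the function and type names are hypothetical and not part of the Cube host stack, and treating 0x0000 and 0xFFFF as placeholder vendor IDs is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEV_DESC_LEN 18U   /* length of a standard USB device descriptor */

typedef struct {
    uint16_t vid;
    uint16_t pid;
} usb_ids_t;

/* Reads idVendor (offset 8) and idProduct (offset 10), both little-endian */
usb_ids_t usb_ids_from_descriptor(const uint8_t *desc, size_t len)
{
    assert(desc != NULL && len >= DEV_DESC_LEN);
    usb_ids_t ids;
    ids.vid = (uint16_t)(desc[8]  | ((uint16_t)desc[9]  << 8));
    ids.pid = (uint16_t)(desc[10] | ((uint16_t)desc[11] << 8));
    return ids;
}

/* Assumption: these two values are treated as "no real vendor ID" */
bool vid_looks_like_placeholder(uint16_t vid)
{
    return (vid == 0x0000U) || (vid == 0xFFFFU);
}
```

A flagged stick is not necessarily faulty; the check would only give a hint worth logging or showing to the user.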

Even with tools, this is no joy. Personally, I would try the widest selection of flash disks you can get your hands on (including whatever family and friends can offer), and if this turns out to be specific to this one particular disk, I would try to talk to the customer.

JW