Byte is lost from UART packet after being put into memory (DMA)

KVIZK.1
Associate II

We have run into a problem where a new run of boards with STM32F429ZIT MCUs has been losing a byte (always a single byte, which seems odd) from the UART communications that are transferred to memory by DMA.

We send packets of UART data to the MCU, and the data is then moved into memory by DMA. When we read the data back from the DMA buffer, we occasionally read back a packet with a missing byte. The failure rate is about 1 in 10 megabytes, or 1 in 50,000 packets.

I have scoped the UART communications at the point of failure, and the failed data packet looks perfect at the pin (so the UART signal is ruled out), and there are no transients on VDD or 3V3. Several people have looked at the design and have not found anything obviously wrong (though of course something is).

We have solved this issue by using retry logic in software, but management is demanding an answer as to why the new board run is having this issue.

So I have two questions:

1: Has anyone encountered an issue like this, and do you have any advice on what the root cause could be?

2: From your experience, is such a failure rate of 1 in 10 Megabytes or 1 in 50,000 packets considered a normal or expected problem with these kinds of systems? In other words, did we just get lucky in the past with not needing retry logic for this kind of system?

Thanks for the help!

30 REPLIES

0693W00000FBQPdQAP.png

KVIZK.1
Associate II

*Update

I mis-specified the clock speed. We have an external 8MHz oscillator that is for the real-time clock, and we use the internal clock to set the CPU clock at 168MHz. Sorry for the confusion.

I set a breakpoint at the point of communication failure, and sure enough both of those bits are flipped:

Breakpoint set during normal operation

0693W00000FCGrPQAX.png

Breakpoint set at point of communication failure

0693W00000FCGuEQAX.png

I'm a little baffled by this, since the UART data at the pin of the MCU looks perfectly fine when we get the failure. Is something going wrong inside of the MCU?

Ty, see my reply below...

As convenient as it would be to blame it on the hardware, likely the problem is in your setup. Probably the clock is off a tiny amount. Perhaps your clock source is not as accurate as you think.
If you feel a post has answered your question, please click "Accept as Solution".

> Probably the clock is off a tiny amount. Perhaps your clock source is not as accurate as you think.

+1

Show us your clocks setup (relevant RCC registers content).

Also the clock frequency of the data source may be a little bit off.

JW

Clock setup is as follows:

0693W00000FCHPgQAP.png 

There is a ton of data in the RCC registers, and I don't know what is relevant.

0693W00000FCHX1QAP.png 

On the logic analyzer I see an occasional pulse-width deviation on UART TX and RX of 0.8% (496-504ns for the fastest bit). That's the most I've seen.

RCC_CFGR.SW (and SWS) to confirm that the system clock runs off the PLL; all fields of RCC_PLLCFGR to confirm the PLL settings, mainly that it runs off HSE (and perhaps RCC_CR.HSEON/HSERDY to confirm HSE is up and running).
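For anyone reading raw register values off a debugger screenshot, here is a minimal sketch of how the fields listed above could be decoded. The bit positions follow the STM32F4 RCC register layout; the names `rcc_summary_t`, `rcc_decode` and `rcc_pll_out_hz` are made up for this illustration, and the 8MHz HSE figure in the usage note is an assumption, not something confirmed by the register dumps in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the clock-tree fields discussed above. */
typedef struct {
    unsigned sw;           /* RCC_CFGR[1:0]: requested SYSCLK source, 0=HSI 1=HSE 2=PLL */
    unsigned sws;          /* RCC_CFGR[3:2]: SYSCLK source actually in use */
    unsigned apb2_div;     /* APB2 prescaler decoded from RCC_CFGR.PPRE2[15:13] */
    bool     pll_from_hse; /* RCC_PLLCFGR.PLLSRC (bit 22): 1 = HSE, 0 = HSI */
    unsigned pllm, plln, pllp;
    bool     hse_on;       /* RCC_CR.HSEON  (bit 16) */
    bool     hse_ready;    /* RCC_CR.HSERDY (bit 17) */
} rcc_summary_t;

/* PPREx encoding: 0xx = /1, 100 = /2, 101 = /4, 110 = /8, 111 = /16 */
static unsigned apb_div_from_ppre(unsigned ppre)
{
    return (ppre & 0x4u) ? (2u << (ppre & 0x3u)) : 1u;
}

/* Decode raw register values, e.g. copied from a debugger view. */
static rcc_summary_t rcc_decode(uint32_t cr, uint32_t cfgr, uint32_t pllcfgr)
{
    rcc_summary_t s;
    s.sw           = cfgr & 0x3u;
    s.sws          = (cfgr >> 2) & 0x3u;
    s.apb2_div     = apb_div_from_ppre((cfgr >> 13) & 0x7u);
    s.pll_from_hse = ((pllcfgr >> 22) & 0x1u) != 0u;
    s.pllm         = pllcfgr & 0x3Fu;
    s.plln         = (pllcfgr >> 6) & 0x1FFu;
    s.pllp         = (((pllcfgr >> 16) & 0x3u) + 1u) * 2u; /* 00=/2 01=/4 10=/6 11=/8 */
    s.hse_on       = ((cr >> 16) & 0x1u) != 0u;
    s.hse_ready    = ((cr >> 17) & 0x1u) != 0u;
    return s;
}

/* Main PLL output in Hz for a given PLL input frequency; 0 if PLLM is invalid. */
static uint32_t rcc_pll_out_hz(const rcc_summary_t *s, uint32_t pll_in_hz)
{
    assert(s != 0);
    if (s->pllm == 0u) {
        return 0u;
    }
    return (uint32_t)((uint64_t)pll_in_hz / s->pllm * s->plln / s->pllp);
}
```

As a usage example under those assumptions, `rcc_decode(0x00030000, 0x0000B40A, 0x07405408)` describes a part with HSE on and ready, SYSCLK on the PLL, PLLM=8, PLLN=336, PLLP=2 and APB2 = AHB/4, and `rcc_pll_out_hz(&s, 8000000)` then gives 168000000.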

RCC_CFGR.PPRE2 to confirm that you've indeed set APB2 = AHB/4 = 42MHz. That's unusual (are you trying to save power?). This setting also makes the baudrate divider "strongly fractional" at 1.3125 (not that it would be any better with an 84MHz APB clock; and yes, I should've noticed this on the USART registers' screenshot). With non-integer baudrate divisors, the requirements for precise baudrate matching become tighter. These things add up.
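To make the 1.3125 figure concrete, here is a small sketch of the divider arithmetic. It assumes 16x oversampling (OVER8 = 0) and a 2 Mbaud link; the baud rate is inferred from the roughly 500ns bit time reported earlier in the thread and is not stated outright. The function names are hypothetical helpers, not part of any ST library.

```c
#include <assert.h>
#include <stdint.h>

/* With OVER8 = 0, baud = f_CK / (16 * USARTDIV), so the BRR register
 * (12-bit mantissa, 4-bit fraction) holds USARTDIV * 16 = f_CK / baud.
 * This rounds to the nearest integer. */
static uint32_t usart_brr_over16(uint32_t pclk_hz, uint32_t baud)
{
    assert(baud != 0u);
    return (pclk_hz + baud / 2u) / baud;
}

/* Integer part of USARTDIV (BRR[15:4]). */
static uint32_t usart_brr_mantissa(uint32_t brr)
{
    return brr >> 4;
}

/* Fractional part of USARTDIV in sixteenths (BRR[3:0]). */
static uint32_t usart_brr_fraction(uint32_t brr)
{
    return brr & 0xFu;
}

/* USARTDIV as a real number, for comparison with the value quoted above. */
static double usart_div(uint32_t brr)
{
    return (double)brr / 16.0;
}

/* Baud rate the peripheral actually generates for a given BRR value. */
static uint32_t usart_actual_baud(uint32_t pclk_hz, uint32_t brr)
{
    assert(brr != 0u);
    return pclk_hz / brr;
}
```

Under those assumptions, a 42MHz APB2 clock gives BRR = 0x15 (mantissa 1, fraction 5/16, so USARTDIV = 1.3125), and an 84MHz clock gives BRR = 42 (USARTDIV = 2.625), which is still fractional. With so few peripheral clock cycles per bit, there is little margin left for any disturbance of the clock.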

USART_SR.NE may (although not necessarily) also indicate that the edges are not as clean as they ought to be; you should perhaps look at them using an oscilloscope, and you should perhaps also review return/ground arrangements.

JW

KVIZK.1
Associate II

*Update

The source of the UART failure looks to be a transient on SYSCLK. I routed SYSCLK out on MCO2, put a scope on the pin, and have been monitoring the signal. This is SYSCLK (168MHz) divided by 4 to 42MHz. At the point of the communication failure, this is what we see:
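For readers who want to repeat the measurement, here is a minimal sketch of the RCC_CFGR change involved, written as a pure function so it can be checked off-target. It assumes the STM32F4 layout (MCO2 source select in bits 31:30, MCO2PRE in bits 29:27); `cfgr_select_mco2_sysclk` is a hypothetical name, and the GPIO setup for the MCO2 pin is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define CFGR_MCO2PRE_POS 27u
#define CFGR_MCO2PRE_MSK (0x7u << CFGR_MCO2PRE_POS)
#define CFGR_MCO2_POS    30u
#define CFGR_MCO2_MSK    (0x3u << CFGR_MCO2_POS)

/* Return a copy of RCC_CFGR with MCO2 set to output SYSCLK divided by
 * div (1..5). All other fields are left untouched. */
static uint32_t cfgr_select_mco2_sysclk(uint32_t cfgr, unsigned div)
{
    assert(div >= 1u && div <= 5u);

    /* MCO2PRE encoding: 0xx = /1, 100 = /2, 101 = /3, 110 = /4, 111 = /5 */
    uint32_t pre = (div == 1u) ? 0u : (uint32_t)(div + 2u);

    cfgr &= ~(CFGR_MCO2_MSK | CFGR_MCO2PRE_MSK); /* MCO2 = 00 selects SYSCLK */
    cfgr |= pre << CFGR_MCO2PRE_POS;
    return cfgr;
}
```

On real hardware the result would be written back to RCC->CFGR, with the MCO2 pin configured as a high-speed alternate-function output.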

0693W00000GVvizQAD.png 

During this transient the frequency of the clock will dip down from 42MHz to as low as 20 MHz with quite a bit of noise. These transients occur anywhere from several times a minute to a few times a day.

Note that there is no device connected to MCO2 during this test; the MCO2 pin is open.

Here is SYSCLK normally as seen on pin MCO2

0693W00000GVvdqQAD.png 

As the glitch occurs, the SYSCLK signal begins to break down and form this shape. It appears that there is another sine wave at about 1/3 of the frequency riding on the SYSCLK signal. The amplitude and frequency of SYSCLK also change.

0693W00000GVvk2QAD.png 

I’ve monitored the HSE clock on the crystal and outputted HSE on MCO1 and do not see any transients on HSE during the UART failure or this clock glitch. I also do not see any transients on the 3.3V bus during these failures. So again I am scratching my head on the cause!

Well, that would certainly explain the results you're seeing. Thanks for posting those. Your clock settings above look fine to me. I haven't seen anything like this. I can't think of an explanation that is consistent with HSE not being interrupted.

Please post if you solve it. Does it happen on all boards or just some?


Can you set up a simple continuous PWM output of, say, 1MHz on any of the timers and measure its frequency change during the event?

JW
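A rough sketch of the timer numbers for such a test output, assuming an 84MHz timer clock (on the STM32F4 with default settings, timers on a bus whose APB prescaler is not 1 are clocked at twice the APB frequency, so 2 x 42MHz here) and PWM mode 1. `pwm_cfg_t` and `pwm_50pct` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three timer register values. */
typedef struct {
    uint32_t psc; /* TIMx_PSC: prescaler, counter clock = tim_clk / (psc + 1) */
    uint32_t arr; /* TIMx_ARR: auto-reload, period = arr + 1 counter ticks */
    uint32_t ccr; /* TIMx_CCRy: compare value, output high while CNT < ccr in PWM mode 1 */
} pwm_cfg_t;

/* Register values for a 50% duty PWM of pwm_hz, with no prescaling.
 * Requires tim_clk_hz to be an exact multiple of pwm_hz and the period
 * to fit a 16-bit counter. */
static pwm_cfg_t pwm_50pct(uint32_t tim_clk_hz, uint32_t pwm_hz)
{
    assert(pwm_hz != 0u);
    assert(tim_clk_hz % pwm_hz == 0u);

    uint32_t ticks = tim_clk_hz / pwm_hz;
    assert(ticks >= 2u && ticks <= 65536u);

    pwm_cfg_t cfg;
    cfg.psc = 0u;
    cfg.arr = ticks - 1u;
    cfg.ccr = ticks / 2u;
    return cfg;
}
```

With those assumptions, `pwm_50pct(84000000, 1000000)` yields PSC = 0, ARR = 83 and CCR = 42. Because the output is derived from the same clock tree as the USART, any dip in the system clock should show up as a proportional drop in the measured PWM frequency.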