2021-10-04 07:34 AM
We have run into a problem where a new run of boards with STM32F429ZIT MCUs has been losing a byte (it is always exactly one byte, which seems odd) from the UART communications that are sent to DMA.
We send packets of UART data to the MCU and the data is then sent to DMA. When we read the data back from DMA we will occasionally read back a packet with a missing byte. The failure rate is about 1 in 10 Megabytes or 1 in 50,000 packets.
I have scoped the UART communications at the point of failure and the failed data packet looks perfect at the pin (so the UART signal is ruled out), and there are no transients on VDD or 3V3. Several people have looked at the design and have not found anything obviously wrong (though of course something is).
We have solved this issue by using retry logic in software, but management is demanding an answer as to why the new board run is having this issue.
So I have two questions:
1: Has anyone encountered an issue like this and have any advice on what could be the root cause?
2: From your experience, is such a failure rate of 1 in 10 Megabytes or 1 in 50,000 packets considered a normal or expected problem with these kinds of systems? In other words, did we just get lucky in the past with not needing retry logic for this kind of system?
Thanks for the help!
2021-10-04 01:40 PM
2021-10-12 10:03 AM
*Update
I mis-specified the clock speed. We have an external 8MHz oscillator that is for the real-time-clock and we use the internal clock to set CPU clock at 168MHz, sorry for the confusion.
I break-pointed at the point of communication failure and sure enough both of those bits are flipped:
Breakpoint set during normal operation
Breakpoint set at point of communication failure
I'm a little baffled by this since the UART data at the pin of the MCU looks perfectly fine when we get the failure. Is something going wrong inside of the MCU?
2021-10-12 10:04 AM
Ty, see my reply below...
2021-10-12 10:12 AM
2021-10-12 11:08 AM
> Probably the clock is off a tiny amount. Perhaps your clock source is not as accurate as you think.
+1
Show us your clocks setup (relevant RCC registers content).
Also the clock frequency of the data source may be a little bit off.
JW
2021-10-12 12:03 PM
Clock setup is as follows:
There is a ton of data in the RCC registers; I don't know what is relevant.
From the logic analyzer there is an occasional pulse width deviation on UART TX and RX of 0.8% (496-504ns for fastest bit). That's the most I've seen.
2021-10-12 12:58 PM
RCC_CFGR.SW (and SWS) to confirm that system clock runs off PLL; all fields of RCC_PLLCFGR to confirm PLL settings, mainly that it runs off HSE (and perhaps RCC_CR.HSEON/HSERDY to confirm HSE is up and running).
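The fields listed above can be pulled out of raw register dumps with a few shifts and masks. The following is only a sketch: the bit positions are taken from the STM32F4 reference manual, while the names `rcc_summary_t`, `rcc_decode` and `pll_sysclk_hz` are made up for illustration, and the register values have to be copied in by hand from the debugger.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types/functions; bit positions per the STM32F4
 * reference manual (RCC_CR, RCC_PLLCFGR, RCC_CFGR). */
typedef struct {
    unsigned sw, sws;        /* 0 = HSI, 1 = HSE, 2 = PLL */
    unsigned apb2_div;       /* APB2 prescaler as a divisor: 1, 2, 4, 8, 16 */
    unsigned pllm, plln, pllp;
    unsigned pll_src_hse;    /* 1 if the PLL input is HSE, 0 if HSI */
    unsigned hse_on, hse_ready;
} rcc_summary_t;

/* PPREx encoding: 0xx = not divided, 100 = /2, 101 = /4, 110 = /8, 111 = /16 */
static unsigned ppre_to_div(unsigned ppre)
{
    return (ppre & 4u) ? (2u << (ppre & 3u)) : 1u;
}

rcc_summary_t rcc_decode(uint32_t cr, uint32_t pllcfgr, uint32_t cfgr)
{
    rcc_summary_t s;
    s.sw          = cfgr & 3u;                         /* CFGR[1:0]   */
    s.sws         = (cfgr >> 2) & 3u;                  /* CFGR[3:2]   */
    s.apb2_div    = ppre_to_div((cfgr >> 13) & 7u);    /* CFGR[15:13] */
    s.pllm        = pllcfgr & 0x3Fu;                   /* PLLCFGR[5:0]  */
    s.plln        = (pllcfgr >> 6) & 0x1FFu;           /* PLLCFGR[14:6] */
    s.pllp        = 2u * (((pllcfgr >> 16) & 3u) + 1u);/* 00 = /2 ... 11 = /8 */
    s.pll_src_hse = (pllcfgr >> 22) & 1u;              /* PLLSRC */
    s.hse_on      = (cr >> 16) & 1u;                   /* HSEON  */
    s.hse_ready   = (cr >> 17) & 1u;                   /* HSERDY */
    return s;
}

/* SYSCLK in Hz when SWS reports PLL: (src / PLLM) * PLLN / PLLP */
uint32_t pll_sysclk_hz(const rcc_summary_t *s, uint32_t src_hz)
{
    assert(s->pllm != 0u);
    return (uint32_t)((uint64_t)src_hz / s->pllm * s->plln / s->pllp);
}
```

If `sws` comes back as 2 and `pll_src_hse` as 1, the core is running from the PLL fed by HSE; `pll_sysclk_hz` then gives the frequency to compare against the intended 168MHz.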
RCC_CFGR.PPRE2 to confirm that you've indeed set APB2 = AHB/4 = 42MHz. That's unusual (are you trying to spare power?). This setting also puts the baudrate divider to "strongly fractional" (not that it would be any better with 84MHz APB clock) 1.3125 (yes I should've noticed this on the USART registers' screenshot). With non-integer baudrate divisors, the requirements for precise baudrate matching become tighter. These things add up.
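To make the divider arithmetic concrete, here is a small sketch of how the 1.3125 figure falls out. It assumes 16x oversampling (OVER8 = 0) and a baud rate of 2 Mbaud, which is only inferred from the roughly 500ns bit time reported earlier in the thread; the name `usart_div16` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for illustration only. */
typedef struct {
    uint32_t brr;         /* USART_BRR value: mantissa << 4 | fraction */
    uint32_t mantissa;    /* integer part of USARTDIV */
    uint32_t fraction;    /* fractional part in 1/16 steps */
    double   actual_baud; /* baud rate actually produced */
    double   error_pct;   /* (actual - requested) / requested * 100 */
} usart_div_t;

/* With OVER8 = 0, baud = f_pclk / (16 * USARTDIV), so USARTDIV * 16 is
 * simply f_pclk / baud rounded to the nearest integer. */
usart_div_t usart_div16(uint32_t pclk_hz, uint32_t baud)
{
    usart_div_t d;
    uint32_t div16;

    assert(baud != 0u);
    div16 = (pclk_hz + baud / 2u) / baud;
    assert(div16 != 0u);

    d.brr         = div16;
    d.mantissa    = div16 >> 4;
    d.fraction    = div16 & 0xFu;
    d.actual_baud = (double)pclk_hz / (double)div16;
    d.error_pct   = (d.actual_baud - (double)baud) / (double)baud * 100.0;
    return d;
}
```

For the assumed numbers, `usart_div16(42000000, 2000000)` gives mantissa 1 and fraction 5, i.e. 1 + 5/16 = 1.3125, and an 84MHz APB clock gives 2 + 10/16 = 2.625. With a divider this small there are only 21 peripheral clock periods per bit, so any disturbance of the clock feeding the USART eats directly into the sampling margin.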
USART_SR.NE may (although not necessarily) also indicate that the edges are not as clean as they ought to be; you should perhaps look at them using an oscilloscope, and you should perhaps also review return/ground arrangements.
JW
2021-10-29 07:55 AM
*Update
The source of the UART failure looks to be a transient on SYSCLK. I output SYSCLK on MCO2 and have been monitoring the pin with a scope. This is SYSCLK (168MHz) divided by 4 to 42MHz. At the point of the communication failure this is what we see:
During this transient the frequency of the clock will dip down from 42MHz to as low as 20 MHz with quite a bit of noise. These transients occur anywhere from several times a minute to a few times a day.
Note that there is no device connected to MCO2 during this test, MCO2 pin is open.
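For anyone wanting to reproduce this measurement, the MCO2 routing described above comes down to two fields in RCC_CFGR. This is a sketch only: the field positions are from the STM32F4 reference manual, `mco2_apply` and the shift macros are names invented here, and the GPIO alternate-function setup for the MCO2 pin is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Local, illustrative names (not the CMSIS macros). */
#define MCO2_SEL_SHIFT 30u  /* MCO2[1:0]: 0 = SYSCLK, 1 = PLLI2S, 2 = HSE, 3 = PLL */
#define MCO2_PRE_SHIFT 27u  /* MCO2PRE[2:0]: 0xx = /1, 100 = /2, 101 = /3, 110 = /4, 111 = /5 */

/* Returns a new RCC_CFGR value with MCO2 source and divider set and all
 * other bits left unchanged. */
uint32_t mco2_apply(uint32_t cfgr, unsigned source, unsigned divider)
{
    uint32_t pre;

    assert(source <= 3u);
    assert(divider >= 1u && divider <= 5u);

    /* Divider 1 -> 0b000; dividers 2..5 -> 0b100..0b111 */
    pre = (divider == 1u) ? 0u : (uint32_t)(divider + 2u);

    cfgr &= ~((3u << MCO2_SEL_SHIFT) | (7u << MCO2_PRE_SHIFT));
    cfgr |= ((uint32_t)source << MCO2_SEL_SHIFT) | (pre << MCO2_PRE_SHIFT);
    return cfgr;
}
```

On target this would be used roughly as `RCC->CFGR = mco2_apply(RCC->CFGR, 0, 4);` to get SYSCLK/4 (42MHz from 168MHz) on the pin.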
Here is SYSCLK normally as seen on pin MCO2
As the glitch occurs, the SYSCLK signal begins to break down and form this shape. It appears that there is another sine wave at about 1/3 of the frequency riding on the SYSCLK signal. The amplitude and frequency of SYSCLK also change.
I’ve monitored the HSE clock on the crystal and outputted HSE on MCO1 and do not see any transients on HSE during the UART failure or this clock glitch. I also do not see any transients on the 3.3V bus during these failures. So again I am scratching my head on the cause!
2021-10-29 09:33 AM
Well, that would certainly explain the results you're seeing. Thanks for posting those. Your clock settings above look fine to me. I haven't seen anything like this. I can't think of an explanation that is consistent with HSE not being interrupted.
Please post if you solve it. Does it happen on all boards or just some?
2021-10-29 12:35 PM
Can you set up a simple continuous PWM output of, say, 1MHz on any of the timers and measure its frequency change during the event?
JW
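A rough sketch of the timer numbers for such a 1MHz PWM follows. It assumes a timer on APB2 with the APB2 = AHB/4 setting discussed above, a 16-bit timer, and the default timer clock rule (timer clock equals PCLK when the APB prescaler is 1, otherwise 2 x PCLK, i.e. TIMPRE left at 0). `timer_clk_hz` and `pwm_for` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three timer register values. */
typedef struct {
    uint32_t psc;  /* TIMx_PSC: prescaler, counter clock = tim_clk / (psc + 1) */
    uint32_t arr;  /* TIMx_ARR: period is arr + 1 counter ticks */
    uint32_t ccr;  /* TIMx_CCRx: compare value, here about 50% duty */
} pwm_cfg_t;

/* Timer kernel clock for a given AHB clock and APB divisor (1, 2, 4, 8, 16),
 * assuming the default TIMPRE = 0 behaviour. */
uint32_t timer_clk_hz(uint32_t hclk_hz, unsigned apb_div)
{
    uint32_t pclk;

    assert(apb_div != 0u);
    pclk = hclk_hz / apb_div;
    return (apb_div == 1u) ? pclk : 2u * pclk;
}

/* Register values for a PWM of pwm_hz with no prescaling. */
pwm_cfg_t pwm_for(uint32_t tim_clk_hz, uint32_t pwm_hz)
{
    pwm_cfg_t cfg;
    uint32_t ticks;

    assert(pwm_hz != 0u);
    ticks = tim_clk_hz / pwm_hz;           /* counter ticks per PWM period */
    assert(ticks >= 2u && ticks <= 65536u); /* must fit a 16-bit ARR */

    cfg.psc = 0u;
    cfg.arr = ticks - 1u;
    cfg.ccr = ticks / 2u;
    return cfg;
}
```

With HCLK at 168MHz and APB2 divided by 4, the timer clock works out to 84MHz, so a 1MHz output needs ARR = 83 and CCR = 42. Because the timer is clocked from the same tree as SYSCLK, its output frequency should presumably shift during the event if the disturbance is real and internal rather than an artifact of the MCO2 pin.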