2021-10-04 07:34 AM
We have run into a problem where a new run of boards with STM32F429ZIT MCUs has been losing a byte from the UART communications that are sent to DMA. Oddly, it is always exactly one byte.
We send packets of UART data to the MCU and the data is then sent to DMA. When we read the data back from DMA we will occasionally read back a packet with a missing byte. The failure rate is about 1 in 10 Megabytes or 1 in 50,000 packets.
I have scoped the UART communications at the point of failure and the failed data packet looks perfect at the pin (so the UART signal is ruled out), and there are no transients on VDD or 3V3. Several people have looked at the design and have not found anything obviously wrong (though of course something is).
We have solved this issue by using retry logic in software, but management is demanding an answer as to why the new board run is having this issue.
So I have two questions:
1: Has anyone encountered an issue like this and have any advice on what could be the root cause?
2: From your experience, is such a failure rate of 1 in 10 Megabytes or 1 in 50,000 packets considered a normal or expected problem with these kinds of systems? In other words, did we just get lucky in the past with not needing retry logic for this kind of system?
Thanks for the help!
2021-11-02 02:00 PM
Manipulating the PLL settings affects the rate of failure. I will post my test results soon.
It happens on all boards from our new run, but I know of at least one "old" board that does it too. The old boards have the same MCU/Memory/Crystal and a slightly different layout. For the most part the old boards do not show the issue, though. The only thing that jumps out is that the external oscillator on the older boards is placed a little more optimally. However, when I monitor HSE it looks almost identical between a known-good and a known-bad board.
2021-11-06 06:53 PM
> We have an external 8MHz oscillator
It seems that HSE is fed with an external oscillator, but screenshot shows that BYPASS mode is not enabled.
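For anyone wanting to check this from a register dump rather than a screenshot, here is a minimal sketch that decodes the HSE mode from a raw RCC_CR value. The bit positions are the ones documented for the STM32F4 family; the enum and function names are made up for illustration and are not part of any ST library.

```c
#include <stdint.h>

/* RCC_CR bit positions as documented for the STM32F4 family (RM0090).
 * The _BIT suffix is only there to avoid clashing with CMSIS macro names. */
#define RCC_CR_HSEON_BIT   (1u << 16)  /* HSE oscillator enabled */
#define RCC_CR_HSERDY_BIT  (1u << 17)  /* HSE reported ready by hardware */
#define RCC_CR_HSEBYP_BIT  (1u << 18)  /* oscillator bypassed, external clock on OSC_IN */

/* Hypothetical helper type, not from any vendor header. */
typedef enum { HSE_OFF, HSE_CRYSTAL, HSE_BYPASS } hse_mode_t;

/* Classify how the HSE is configured, given a raw RCC_CR value. */
hse_mode_t hse_mode_from_rcc_cr(uint32_t rcc_cr)
{
    if ((rcc_cr & RCC_CR_HSEON_BIT) == 0u) {
        return HSE_OFF;
    }
    return (rcc_cr & RCC_CR_HSEBYP_BIT) ? HSE_BYPASS : HSE_CRYSTAL;
}
```

On the target, the value to pass in would be read from `RCC->CR`. A crystal with load caps should decode as `HSE_CRYSTAL`; a canned oscillator driving OSC_IN should decode as `HSE_BYPASS`.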
2021-11-16 01:08 PM
I have not done this test but I've confirmed that switching from HSE to HSI fixes the issue. So I've identified the HSE as the root cause.
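When repeating the HSE-to-HSI swap, the PLL input divider usually has to change as well, since the HSI on the F4 is 16MHz while the crystal here is 8MHz. The sketch below computes the main PLL output from a raw RCC_PLLCFGR value so both configurations can be compared. The field layout follows the STM32F4 reference manual; the function names and the example PLLM/PLLN/PLLP values are assumptions for illustration, not the settings used on these boards.

```c
#include <stdint.h>

/* RCC_PLLCFGR field layout as documented for the STM32F4 family (RM0090). */
#define PLLCFGR_PLLM(v)    ((v) & 0x3Fu)                     /* bits 5:0 */
#define PLLCFGR_PLLN(v)    (((v) >> 6) & 0x1FFu)             /* bits 14:6 */
#define PLLCFGR_PLLP(v)    ((((v) >> 16) & 0x3u) * 2u + 2u)  /* bits 17:16 -> 2, 4, 6, 8 */
#define PLLCFGR_SRC_HSE(v) (((v) >> 22) & 0x1u)              /* bit 22: 1 = HSE, 0 = HSI */

#define HSI_HZ 16000000u  /* internal RC oscillator on the STM32F4 */

/* Main PLL output in Hz for a given RCC_PLLCFGR value.
 * hse_hz is the board's external clock (8 MHz in this thread).
 * Returns 0 if PLLM is 0, which is not a valid divider. */
uint32_t pll_output_hz(uint32_t pllcfgr, uint32_t hse_hz)
{
    uint32_t pllm = PLLCFGR_PLLM(pllcfgr);
    if (pllm == 0u) {
        return 0u;
    }
    uint64_t src = PLLCFGR_SRC_HSE(pllcfgr) ? hse_hz : HSI_HZ;
    uint64_t vco = src * PLLCFGR_PLLN(pllcfgr) / pllm;  /* VCO output */
    return (uint32_t)(vco / PLLCFGR_PLLP(pllcfgr));
}

/* Hypothetical helper that packs illustrative field values into a register
 * image; pllp must be 2, 4, 6 or 8. */
uint32_t make_pllcfgr(uint32_t pllm, uint32_t plln, uint32_t pllp, int use_hse)
{
    return (pllm & 0x3Fu)
         | ((plln & 0x1FFu) << 6)
         | ((((pllp / 2u) - 1u) & 0x3u) << 16)
         | ((use_hse ? 1u : 0u) << 22);
}
```

With the illustrative values PLLM=8, PLLN=336, PLLP=2 on the 8MHz HSE, and PLLM=16 with the same PLLN and PLLP on the HSI, both calls to `pll_output_hz` give the same output frequency, so the comparison between sources is like for like.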
2021-11-16 01:09 PM
It appears that the HSE is in fact the problem, details below....
2021-11-16 02:17 PM
*Update #2
I switched the PLL clock source from HSE to HSI and the SYSCLK transient vanished.
Since this test I turned my attention back to the HSE since there is no other explanation that I can think of. It does appear that there is some measurable difference between a board with and without the transient.
Good board - No SYSCLK or PLL transient
Bad board - SYSCLK and PLL transients can occur several times a minute
We can see that the "bad board" has more ringing on the HSE than the good board and I believe this may have something to do with the issue.
Here are some snapshots of the crystal circuits on good vs bad board. For reference my crystal is ATS08ASM-1E with 18pF load caps C7 and C8.
Good board - No SYSCLK or PLL transient
Bad board - SYSCLK and PLL transients can occur several times a minute. This varies from board to board
It is also obvious (I think) that the xtal circuit layout on the "bad" board is not optimal and could explain why HSE on these new boards is not working as well.
This all makes sense and I can say that I found the root cause, but there is one annoying thing about this problem that I can't seem to resolve:
If I monitor SYSCLK and HSE at the same time I do not see anything significant happening on HSE during the SYSCLK transient
HSE (yellow) vs SYSCLK (blue) during transient
Notice that the HSE is getting very noisy when the SYSCLK transient occurs and this is due to crosstalk from the SYSCLK signal affecting the HSE. When I disconnect the SYSCLK (blue) signal from the scope, the noise on HSE goes away. What I'm left with is basically nothing observable on the HSE while the board is having SYSCLK transients constantly.
Here's another capture with less crosstalk, this time HSE in Blue and SYSCLK in Yellow
It's hard to make out the HSE at this zoom level but there is nothing really obvious here. I can take more captures if anyone wants to see something in particular.
Here are some other observations:
My assumption based on all of this is that the HSE is unstable and that the PLL control loop is 'desyncing' briefly, causing the SYSCLK transient. It also appears that there are certain levels of instability that allow the PLL to operate without issue? That's the part that is really annoying me haha....
I think it may be time to re-design the HSE circuit.
Anyone have any thoughts on this?
Thanks again for all the help!
2021-11-16 02:27 PM
Show ground.
Does any power or HF current flow through that ground from the caps to the GND pin closest to OSC pins (probably VSSA)? Can this current be correlated to occurrence of instability?
Or, the other way round - lift the capacitors and hang on them a wire connected to the GND pin closest to the OSC pins; and observe. PLL/oscillator can still be thrown off by impure VDDA/VSSA, though (I don't know exactly the internal routing of the chip).
JW
2021-11-16 08:46 PM
I had been experiencing a similar problem with an STM32F429IGT running at 180MHz from an external 25MHz crystal.
I was using USART3 with DMA Rx at 921600bps to send and receive data from an LTE terminal. The MCU had to receive about 100k bytes of text data in 1k packets, with a 200ms interval between packets. At 921600 and 460800, I got some non-readable chars in the middle of the data. All the data transmitted and received is designed to be base64 and text, so any non-readable char means the communication has errors.
Later I lowered the baud rate to 115200 and it works fine now.
2021-11-18 07:28 AM
There's a ground plane underneath everything.
I removed the load caps and wired them in from a breadboard with ground going to C14/C15 and that did not help. Having the load caps wired in did make the HSE less stable which is not surprising.
There is some minor noise when I measure between the load cap grounds and the MCU grounds but it looks to be the same on all the boards.
I removed the xtal and load caps from the board and am now injecting an 8MHz waveform that matches the xtal signal from one of the good boards. When I do this the HSE as measured from MCO on my scope looks the most stable I've seen it. The transient is still present in this configuration though which is very surprising. Next test is to experiment with some series resistors on the xtal and then try this test on a good board. If the good board does not show the transient with an external generator that is wired in the same way then it seems to me that the HSE is not really to blame.
Of course the simplest and most likely explanation is that the HSE is to blame and I'm just missing something here....
2021-11-18 07:48 AM
> There's a ground plane underneath everything.
OK. It's not necessarily the best method. You may want to separate analog and digital grounds. It's not simple to "see" how currents flow on a plane with holes in it, or where a heavy digital load generates a possibly high-frequency voltage drop.
The 144-pin package is there for a reason. External memories, perhaps? Any other heavy digital circuitry?
> I removed the load caps and wired them in from a breadboard
Breadboard? I meant, put them "standing", one pad directly on the crystal's pad, the other pad connected to flying lead going as short as possible to the mcu's analog ground.
> The transient is still present in this configuration
Are you talking about signal ultimately derived from PLL? PLL may be affected by ground currents, too.
Try switching off everything possible, maybe running a minimal program, and compare the HSE stability and ground differences.
I know this sounds like I'm fixated on one particular theme - and I may be, given past experience, but I am aware that your case may be different, so please take these just as items of discussion.
JW
2021-11-18 08:12 AM
It makes sense that slowing down the data rate would resolve the issue since a 'wider' bit is less susceptible to noise or a clock error. Unfortunately we cannot slow our Baud rate on our system.
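To put rough numbers on the 'wider bit' argument, here is a small sketch that compares how much of one bit period a fixed-length clock disturbance covers at different baud rates. The function names are hypothetical, and the disturbance length in the usage note is an arbitrary illustrative figure, not something measured on these boards.

```c
#include <stdint.h>

/* Duration of one UART bit in nanoseconds (truncated); 0 for an invalid baud of 0. */
uint32_t uart_bit_time_ns(uint32_t baud)
{
    return baud ? (uint32_t)(1000000000ull / baud) : 0u;
}

/* Share of one bit period covered by a disturbance of the given length,
 * in parts per thousand (truncated). Illustrative arithmetic only: it says
 * nothing about how the USART sampling logic reacts to the disturbance. */
uint32_t disturbance_permille(uint32_t disturbance_ns, uint32_t baud)
{
    uint32_t bit_ns = uart_bit_time_ns(baud);
    if (bit_ns == 0u) {
        return 0u;
    }
    return (uint32_t)((uint64_t)disturbance_ns * 1000u / bit_ns);
}
```

For the two rates mentioned earlier in the thread, `uart_bit_time_ns` gives roughly 1.1us per bit at 921600 and 8.7us at 115200, so the same hypothetical 200ns disturbance covers about eight times more of a bit at the higher rate.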