STM32F407 Half bit error on SPI3 MISO on multi SPI solution

sm5wpx · ‎2014-09-22

Posted on September 22, 2014 at 14:38

Hi all!

I am running a three-MCU solution where all MCUs communicate with each other over 21 MHz full-duplex SPI, as follows:

MCU1 SPI3 Master <-> MCU2 SPI3 Slave

MCU1 SPI2 Master <-> MCU3 SPI2 Slave

MCU2 SPI2 Master <-> MCU3 SPI1 Slave.

The SPI data transfers are with DMA and most of the time it works very well. Communication between MCU1 <-> MCU2 and MCU1 <-> MCU3 starts almost at the same time, and MCU2 <-> MCU3 starts when half the MCU1 <-> MCU2 traffic is finished (half buffer interrupt) and the data between MCU2 <-> MCU3 is approx. half the size so all traffic will end approx. at the same time. The traffic cycle repeats every 5 ms.

But on random occasions, a spike about half a bit time (half a CLK period) wide appears on SPI3 MISO on the MCU1 <-> MCU2 link. This pulse is in phase with the CLK signal on the MCU1 <-> MCU3 SPI. Its amplitude is the same as a normal signal. It is as if the CLK signal from SPI2 is fed out on the SPI3 MISO pin for a very short time, but I cannot see how the PC11 and PB13 signals could be cross-connected.
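
For scale, the width of such a pulse follows directly from the 21 MHz clock stated above. A minimal sketch in C of that arithmetic; the function names are hypothetical and exist only for illustration:

```c
#include <assert.h>

/* Duration of one SPI bit in nanoseconds for a given SCK frequency in Hz. */
double spi_bit_time_ns(double sck_hz)
{
    assert(sck_hz > 0.0);
    return 1e9 / sck_hz;
}

/* Duration of half a bit (half a clock period) in nanoseconds. */
double spi_half_bit_time_ns(double sck_hz)
{
    return spi_bit_time_ns(sck_hz) / 2.0;
}
```

At 21 MHz this gives a bit time of about 47.6 ns, so a half-bit pulse is roughly 23.8 ns wide.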

The upper signal in the attached photo is the SPI3 MISO signal, and the lower one is the SPI2 CLK signal. The error pulse is second from the end in the upper trace; it looks different from the other pulses there. The last pulse in the upper trace is the last bits of the SPI3 MISO transaction. The SPI2 transfer (CLK signal in the lower trace) ends a couple of bytes later...

How is this possible? The error pulse in the upper MISO trace matches the lower CLK trace so closely in amplitude and phase that I can almost rule out other disturbance sources on the board.

Does anyone know of any sensitive parts of the code to watch out for, such as clearing ISR flags?

Kind regards

#stm32f407-spi-dma

stm322399 · ‎2014-09-22

Posted on September 22, 2014 at 16:02

Stop searching for a software issue; you have a serious electrical problem. No SPI clock or data signal should look like the ones in your picture.

At 21 MHz, it is easy to get MCUs to communicate when they are on the same PCB. In your case, I guess that you use separate PCBs with long wires, probably not very well grounded, connectors, etc. No wonder it does not work well.

Tell us more about your test conditions.

Is it possible to run a test at a lower speed? Say 2 Mbit/s max.

stm322399 · ‎2014-09-23

Posted on September 23, 2014 at 09:18

Suddenly I realized that your HW problem may simply be that you *over-zoomed* into your scope trace, leaving so few acquired points to draw the trace that the scope interpolated them into those strange pseudo-sinusoidal curves.

In that case, try to capture a trace using a delayed trigger or some other method. Then we can draw conclusions about what is wrong, HW or SW.

sm5wpx · ‎2014-09-23

Posted on September 23, 2014 at 11:37

Hi and thank you for the answers.

First, the headline is wrong. It should say SPI3, not SPI8 =).

Yes, sorry for the poor trace. I will try to get another one.

I just managed to trigger on an error, and I can now see that this time the error pulse appeared after the SPI2 CLK had ended! So it seems the error pulse has less to do with the CLK signal on the other SPI interface. It is very strange, though: most of the bytes are 0x00 (buffer size 1500 bytes) except at the beginning, in the middle and at the end of the buffer, and the error appears inside or close to those chunks of data. I have not seen the error while sending only zeros.

The MCUs are located ~40 mm from each other in a line, with MCU1 in the middle and MCU2 below MCU1. All are on the same PCB.

I have tested running at 10 MHz and have not seen this problem at that speed.

BR

sm5wpx · ‎2014-09-24

Posted on September 24, 2014 at 09:28

I have a 100 µs timer, TMR2, that generates interrupts periodically. Its interrupt priority is the lowest of all IRQs. The ISR increments a few simple tick variables and, of course, clears the timer's IRQ flag.

Further tests show that disabling the TMR2 IRQ on MCU2 makes the error pulse disappear. I have not yet done an overnight test, but so far it runs without hiccups.

How can the TMR2 IRQ influence the SPI3 MISO signal? No registers are written in the ISR except to clear the IRQ flag, of course.

BR
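
To put the numbers in this thread side by side, here is a rough sketch in C of how many 100 µs timer interrupts fall inside one SPI transfer. It assumes the 1500-byte buffer mentioned earlier is clocked out continuously at 21 MHz with no gaps between bytes, which the thread does not confirm; the function names are hypothetical:

```c
#include <assert.h>

/* Duration in microseconds of a transfer clocked out back to back
   (assumption: no inter-byte gaps). */
double spi_transfer_time_us(unsigned bytes, double sck_hz)
{
    assert(sck_hz > 0.0);
    return (double)bytes * 8.0 * 1e6 / sck_hz;
}

/* Number of whole timer periods that fit into the transfer. Depending on
   the phase between timer and transfer, one more interrupt may land in it. */
unsigned timer_periods_in_transfer(double transfer_us, double timer_period_us)
{
    assert(timer_period_us > 0.0);
    return (unsigned)(transfer_us / timer_period_us);
}
```

Under these assumptions a 1500-byte transfer lasts about 571 µs, so five or six timer interrupts occur during every transfer.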

stm322399 · ‎2014-09-24

Posted on September 24, 2014 at 09:48

It is really hard to give any useful answer without seeing any proper signal trace or code.

Generally speaking, a fast interrupt source increases the chance that consecutive statements get spaced apart in time. Normally this increases the occurrence of bugs, in particular for (quasi-)atomic operations that must perform several tasks within a short time frame. In your case it is the opposite. Surely there is an explanation, but we do not have enough details to answer precisely.

sm5wpx · ‎2014-09-24

Posted on September 24, 2014 at 13:18

I have attached a new scope image taken with 20ns/div. I.e. not much zoom. The error pulse is the first positive pulse in the image.

I did not get any more errors when disabling the TMR2 IRQ. I also tried enabling the IRQ again and removing all code from the ISR except clearing the flag. Then the error came back.

My current test is to switch the timer from TMR2 to TMR10 and keep all other code as is. Just a timer change. I selected TMR10 because it uses a different clock source than SPI3. I will let it run overnight and check the result tomorrow. So far it looks good, to be continued...

All the above information is related to MCU2 only.

________________

Attachments :

IMAG0194.jpg : https://st--c.eu10.content.force.com/sfc/dist/version/download/?oid=00Db0000000YtG6&ids=0680X000006I0WF&d=%2Fa%2F0X0000000bdH%2FPMrLJeWrh5.g4CgoNarkTyKcSVM61R0J8echmxJpZZo&asPdf=false

stm322399 · ‎2014-09-24

Posted on September 24, 2014 at 14:26

The good news is that I misread your test with the timer, and it now appears clearer that your handling of SPI is wrong. When a fast interrupt increases the occurrence of this half bit, it means that somewhere you rely on the assumption that two, three or more operations are done very quickly in a row. When the interrupt inserts itself into this flow, the bug appears.

A half bit can be caused by a lot of things. It could be an electrical conflict between master and slave. It could be a stop operation that actually takes effect a bit too late. Etc...
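
One way to test this hypothesis is to keep interrupts masked for the duration of the suspect sequence and see whether the half bit disappears. The sketch below only shows the save/disable/restore pattern and is not taken from the poster's code. `irq_save_and_disable()`, `irq_restore()`, `fake_primask`, `steps_done` and `run_uninterrupted_sequence()` are hypothetical stand-ins so that it builds on a PC; on a Cortex-M target the two helpers would typically wrap the CMSIS calls `__get_PRIMASK()`, `__disable_irq()` and `__set_PRIMASK()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the PRIMASK register: 0 = interrupts enabled. */
static uint32_t fake_primask = 0;

/* Remember the current mask state, then mask interrupts. */
static uint32_t irq_save_and_disable(void)
{
    uint32_t previous = fake_primask;
    fake_primask = 1;
    return previous;
}

/* Restore the mask state saved earlier (safe when calls are nested). */
static void irq_restore(uint32_t previous)
{
    fake_primask = previous;
}

/* Counts the placeholder steps executed; stands in for real register writes. */
static int steps_done = 0;

/* Hypothetical sequence whose steps must run back to back. */
void run_uninterrupted_sequence(void)
{
    uint32_t state = irq_save_and_disable();
    assert(fake_primask == 1);  /* no ISR can slip in from here on */
    steps_done++;               /* placeholder for the first operation */
    steps_done++;               /* placeholder for the second operation */
    irq_restore(state);
}
```

Saving and restoring the previous state, instead of unconditionally re-enabling interrupts, keeps the pattern correct when it is called with interrupts already masked. The masked region should be as short as possible, since it delays every other interrupt in the system.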

sm5wpx · ‎2014-09-24

Posted on September 24, 2014 at 21:29

OK, I see... I will investigate the workflow.

Before I left the office today, I concluded that using TMR10 instead of TMR2 solved the issue. I do not know whether it is because of the change of timer clock source or not. I will confirm the solution tomorrow.

Perhaps I can post my SPI+DMA init code as well as timer init code here later...

sm5wpx · ‎2014-09-24

Posted on September 25, 2014 at 07:00

I now confirm that switching from TMR2 to TMR10 solves the issue =)