STM32 as I2C slave - sporadic errors - how to diagnose?

SKled.1 · ‎2020-11-04

EDIT: Note the change of direction in the 3rd post.

I have an STM32F091 running a program that is basically one main loop, which triggers ADC measurements and polls values from I2C chips using the I2C2 peripheral in master mode. The ADC uses one DMA IRQ per mux sequence.

There is the systick interrupt that does some stuff.

The STM32 itself acts as an I2C slave on the I2C1 peripheral.

When I have a Linux system connected that periodically polls the STM32 I2C slave for a bunch of "registers" that it defines, this may run for 30 to 60 minutes or so without problems. Then an error occurs which the logic of my I2C module detects; one possibility is that it is stuck with the BUSY flag set.

I don't know exactly yet, as I am also having trouble with the debugger connection being lost ("target is not responding, ...") when this happens, but also on other occasions.

That's on custom hardware.

I have also tried this with a Nucleo board and a Raspberry Pi, which is not much help, as apparently the Pi's I2C master has a buggy clock stretching implementation (there are threads on the Pi forums about it), so I get an I2C error far more quickly and consistently with that setup.

Debugging there works without problems. But if the error there is due to clock stretching not being handled properly by the pi i2c master, it is a different scenario than what I am facing now.

My custom HW does not seem to have power problems, though: the PuTTY terminal connection to the Linux system that also runs on the board stayed up when the MCU I2C error happened.

What is weird: as a test, I set the IRQ priority of the SysTick to 1 and that of the I2C slave to 0, to check whether the SysTick ISR might be taking too long. (This distorts the timebase, but it's only a test for now.)

All other interrupts in the system (other than hardfault and such) are at 2 or 3.

So there should, hypothetically, be nothing that can get in the way of the i2c ISR doing its thing.

Also, the register polling that the Linux test program does is a loop that does exactly the same for every iteration, and the reaction on the MCU i2c handler should also be doing the same thing every time.

Still I get the error, and then I can't step in with the debugger: when I hit the Pause button in CubeIDE, I get "Target is not responding, retrying..." a bunch of times before it gives up.

During that, the status LED still shows a blinking pattern done by the systick ISR - so there is something happening.

While the I2C ISR code could certainly be made shorter, it does work for quite a while and is triggered by the highest IRQ priority there is.

I don't get how there can still be that error - if it is related to the MCU sometimes not being responsive enough. Which seemed to be the case initially.

Any ideas on how to debug this?

SKled.1 · ‎2020-11-05

Ok I now caught this happening when a breakpoint was there and debugging worked.

I have a timeout of 5 ms, and the main loop checks whether the last time the I2C slave state was not busy is longer ago than that. If the state machine is set to error, a snapshot of the I2C ISR register and some other things is taken, to be looked at later.

When the problem occurred, the BUSY flag in the ISR register was indeed set. This time it took over 3.5 hours of constant polling by the Linux system as I2C master (last time it took only a bit over an hour).

If I understand the datasheet correctly, that basically means that the master I2C hardware did not send a STOP condition (within my timeout of 5 ms), or that it was somehow not recognized electrically?

I would assume that there can be no such long delay, assuming the hardware does this, and no timing of Linux can spoil the party here.

I'll test it with bigger timeout anyway.

I have no idea how to look at this with a logic analyzer that can only store a few milliseconds' worth of data, when this needs to run for hours to occur.

It seems something is needed that stores logic analyzer data in a FIFO of a certain size, discards the oldest samples, and stops when a signal comes (like when my firmware detects the error), so that I can look at the logic analyzer data from the time before it happened.

Is there something like that, or something else that helps?
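
As a firmware-side stand-in for that kind of pre-trigger capture, something like the following ring buffer could work: it keeps only the newest samples and stops recording once the error is detected. This is only a sketch. TraceBuffer, trace_push and the other names are made up for illustration, and what gets recorded (e.g. the I2C status register on each interrupt) is an assumption. It also assumes a single writer, so pushing from several interrupt priorities would need extra care.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; a power of two keeps indexing consistent
 * even when the 32-bit write counter wraps around. */
#define TRACE_CAPACITY 8u

typedef struct {
    uint32_t timeMs;  /* timestamp, e.g. the SysTick millisecond counter */
    uint32_t value;   /* whatever is worth recording, e.g. a status register */
} TraceSample;

typedef struct {
    TraceSample samples[TRACE_CAPACITY];
    uint32_t writeCount;  /* total number of samples ever pushed */
    bool frozen;          /* set by the trigger; later pushes are ignored */
} TraceBuffer;

/* Store one sample, overwriting the oldest one when the buffer is full. */
void trace_push(TraceBuffer *t, uint32_t timeMs, uint32_t value)
{
    if (t->frozen) {
        return;
    }
    TraceSample *s = &t->samples[t->writeCount % TRACE_CAPACITY];
    s->timeMs = timeMs;
    s->value = value;
    t->writeCount++;
}

/* Trigger: keep what is in the buffer, i.e. the history before the error. */
void trace_freeze(TraceBuffer *t)
{
    t->frozen = true;
}

/* Number of samples currently retained. */
size_t trace_count(const TraceBuffer *t)
{
    return t->writeCount < TRACE_CAPACITY ? t->writeCount : TRACE_CAPACITY;
}

/* Read a retained sample; index 0 is the oldest one still in the buffer. */
TraceSample trace_get(const TraceBuffer *t, size_t index)
{
    const uint32_t first = t->writeCount - (uint32_t)trace_count(t);
    return t->samples[(first + (uint32_t)index) % TRACE_CAPACITY];
}
```

The idea would be to call trace_push() from the I2C interrupt handler and trace_freeze() where the firmware detects the error, then read the buffer with the debugger.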

SKled.1 · ‎2020-11-10

(too bad this site has no post history, I accidentally trashed my lengthy post that was here when I just wanted to paste a reformatted code piece)

I investigated this further. Lacking the means to make long-term observations at the electrical/logic level, I dug deeper into the code.

Basically the topic now changes: it's nothing external, there is no real I2C problem here.

It's about interrupts and what the compiler or processor does.

Should I make a new thread for this, since it is a subject of its own?

So there is the function below. It is meant to determine whether a weird condition occurred, e.g. when the Linux system, as I2C master, made bad requests and my I2C slave module in the firmware got into a weird state.

So this function was supposed to enable me to identify and react to the condition.

Running a Linux app that polls I2C registers of my stm32-as-slave repeatedly, within about 0.5 ... 3 hours this would report the error.

But as you can see in the comments, the calculated time diff would often be like 2400 ms - i.e. it would mean the function would have to have been interrupted by one or more interrupts for 2.4 seconds.

The remainder of my firmware and how it runs offers no further evidence of such a huge delay.

Also, when I set a breakpoint at the __NOP and read the lastTimeSetToBusyState variable with the debugger, it has the same value as "now", so "diff" should be zero, but it is shown as e.g. 2400.

The code in the #else...#endif block (below the first one) now immediately does a "retry", and that "fixes" it: I can poll my STM32 continuously for a full day, even at higher polling speeds than before, and this error does not occur anymore.

So those 2.4 seconds or other delay was not real.

Things that are touched by interrupts are declared volatile. I tried to insert a memory barrier to prevent one certain glitch scenario.

But I don't really understand what's going on exactly, and why this retry hack works.

Does anyone have an idea?

(alas I was not able to replicate this in a small, self-contained example that uses timer interrupts instead of requiring I2C connections)

Some explanations:

I2cSlave is a struct, with member function pointer sysTimeGetMilsecsFunc which calls a function that just returns a static volatile unsigned counter value which gets incremented in the systick ISR (prio 0).
The member timeoutUntilStuckBusyError is e.g. 50 (ms) and set only once. It remains constant, I tested that.
Member lastTimeSetToBusyState is set in the I2C interrupt handler (prio 1).
The function is_stuck_in_busy_state() is, indirectly, called by the main loop. No RTOS is used.

So we have the main loop and 2 interrupts at play. The system has more interrupts active, all at priorities 2 or 3.
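
The function below reads lastTimeSetToBusyState before fetching the current time because the difference is computed in unsigned arithmetic. Here is a small host-side sketch of that arithmetic, assuming a 32-bit millisecond counter. elapsed_ms and timed_out are hypothetical helper names, not part of the firmware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Elapsed milliseconds using modular unsigned arithmetic.
 * This stays correct across a counter wraparound, but only as long as
 * `now` was sampled at or after `earlier`. */
uint32_t elapsed_ms(uint32_t now, uint32_t earlier)
{
    return now - earlier;
}

/* Timeout test in the same form as in the function below. */
bool timed_out(uint32_t now, uint32_t earlier, uint32_t timeoutMs)
{
    return elapsed_ms(now, earlier) > timeoutMs;
}
```

With now = 1050 and earlier = 1000 the difference is 50, and it also comes out right when the counter wrapped in between (now = 5, earlier = 0xFFFFFFFB gives 10). If the timestamp is one tick newer than "now" (now = 1000, earlier = 1001), the difference wraps to 0xFFFFFFFF and any timeout appears exceeded.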

static bool is_stuck_in_busy_state(volatile I2cSlave* m)
{
#if 1
    const unsigned lastSetBusy = m->lastTimeSetToBusyState;
	
    __DMB();

    // Fetch the current time (set in SysTick ISR) AFTER lastTimeSetToBusyState (set in I2C1 ISR).
    // Otherwise there may be an off-by-one situation where now is smaller than lastTimeSetToBusyState,
    // resulting in an overflow when subtracting.
    const unsigned now = m->sysTimeGetMilsecsFunc();
    const unsigned diff = now - lastSetBusy;
 
    const bool bad = diff > m->timeoutUntilStuckBusyError && (I2cSlaveState_Idle != m->state);
	
    if (bad)
    {
        // FIXME: how can it be that, when we break here, m->lastTimeSetToBusyState is the same as "now",
        // while lastSetBusy is over two SECONDS back in time?
        // Is there something stuck somewhere that gets loosened by the breakpoint mechanism?
        __NOP(); // breakpoint
    }
    return bad;
 
#else
    // With RETRY: runs for >= 32 hours of constantly being polled with I2C register read requests
    // from the master without problem (i.e. this always returns false).
    // Without it, it runs for 0.5 .. 3 hours before it returns true.
    bool bad = true;
    for (unsigned try=0;  try<4 && bad;    ++try)
    {
        const unsigned lastSetBusy = m->lastTimeSetToBusyState;
        __DMB();
        // Fetch the current time (set in SysTick ISR) AFTER lastTimeSetToBusyState (set in I2C1 ISR!).
        // Otherwise there may be an off-by-one situation where now is smaller than lastTimeSetToBusyState,
        // resulting in an overflow when subtracting.
        const unsigned now = m->sysTimeGetMilsecsFunc();
        const unsigned diff = now - lastSetBusy;
        bad = diff > m->timeoutUntilStuckBusyError && (I2cSlaveState_Idle != m->state); // it's not stuck busy when it's idle
        if (bad)
        {
            __NOP();
        }
    }
    if (bad)
    {
        __NOP();
    }
    return bad;
#endif
}
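
One reading that would fit these symptoms (an assumption, not something confirmed here): the I2C ISR fires after lastTimeSetToBusyState has been copied but before m->state is read, so a stale timestamp gets paired with a fresh non-idle state. The host-side sketch below models that interleaving with a hypothetical hook standing in for the ISR, and shows a variant that re-reads the timestamp and retries when it changed. SlaveModel, fake_i2c_isr, stuck_naive and stuck_snapshot are illustrative names only, and "now" is passed in as a parameter to keep the model small.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_NOW_MS 2400u  /* simulated current time, chosen to match the 2400 ms symptom */

typedef enum { STATE_IDLE, STATE_BUSY } SlaveState;

/* Reduced, hypothetical model of the two fields written by the I2C ISR. */
typedef struct {
    volatile uint32_t lastTimeSetToBusyState;
    volatile SlaveState state;
} SlaveModel;

/* Stands in for the I2C ISR firing in the middle of the check. */
typedef void (*InterleaveHook)(SlaveModel *m);

void fake_i2c_isr(SlaveModel *m)
{
    m->lastTimeSetToBusyState = SIM_NOW_MS;
    m->state = STATE_BUSY;
}

/* Same read order as the first variant above: timestamp first, state last. */
bool stuck_naive(SlaveModel *m, uint32_t now, uint32_t timeoutMs, InterleaveHook isr)
{
    const uint32_t lastSetBusy = m->lastTimeSetToBusyState;
    if (isr != NULL) {
        isr(m);  /* the "interrupt" lands between the two reads */
    }
    const uint32_t diff = now - lastSetBusy;
    return diff > timeoutMs && m->state != STATE_IDLE;
}

/* Read timestamp and state, then re-read the timestamp. If it changed,
 * the ISR ran in between and the pair is inconsistent, so read again. */
bool stuck_snapshot(SlaveModel *m, uint32_t now, uint32_t timeoutMs, InterleaveHook isr)
{
    uint32_t lastSetBusy;
    SlaveState state;
    do {
        lastSetBusy = m->lastTimeSetToBusyState;
        if (isr != NULL) {
            isr(m);      /* simulate the interrupt once */
            isr = NULL;
        }
        state = m->state;
    } while (lastSetBusy != m->lastTimeSetToBusyState);
    return (now - lastSetBusy) > timeoutMs && state != STATE_IDLE;
}
```

In this model stuck_naive() reports a stuck state although the slave only just became busy, while stuck_snapshot() does not, which would also explain why a plain retry hides the problem. On the target, copying both fields inside a short __disable_irq() / __enable_irq() section might be an alternative, at the cost of briefly delaying the I2C interrupt.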