Probability of peripheral failure on an STM32H7

Gpeti · ‎2023-12-07

In the context of a functional safety application, we are wondering about the probability of a status bit failure on an STM32H753.

Our low-level software frequently polls the status bits of peripherals. If a peripheral encounters a failure, this is obviously an issue. Even though there are solutions to this issue (a watchdog, for example), we'd like to assess the probability of such failures.

Are there any studies on the probability of a peripheral failure on STM32 (due to cosmic rays, a bit flipped or stuck in a register, whatever)?

waclawek.jan · ‎2023-12-07

Why would such an assessment be constrained to peripheral status bits?

Surely cosmic rays etc. affect all other registers in the STM32 in a similar way. Many of those are in the processor itself, the bus matrix and the surrounding support circuitry, and possibly the largest pool of registers is the RAM.

Generally, what you need, and can probably get on request from ST (directly, through an FAE or via the web support form), is a FIT (failures in time) number.
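
For reference, a FIT figure is conventionally expressed as failures per 10^9 device-hours. Below is a minimal sketch of how such a number could be turned into a mission-level estimate, assuming a constant failure rate. The function names are illustrative, and the 100 FIT used in the note afterwards is a made-up placeholder, not an ST figure.

```c
#include <stdio.h>

/* 1 FIT = 1 failure per 1e9 device-hours. */
#define HOURS_PER_FIT 1e9

/* Convert a FIT figure to a failure rate in failures per hour. */
double fit_to_rate_per_hour(double fit)
{
    return fit / HOURS_PER_FIT;
}

/* Expected number of failures over a mission time, assuming a constant
 * failure rate. When the result is much smaller than 1 it also
 * approximates the probability of at least one failure. */
double expected_failures(double fit, double mission_hours)
{
    return fit_to_rate_per_hour(fit) * mission_hours;
}

/* Mean time to failure in hours for a constant failure rate. */
double mttf_hours(double fit)
{
    return HOURS_PER_FIT / fit;
}

/* Print a short summary for a given (hypothetical) FIT value. */
void print_fit_summary(double fit, double mission_hours)
{
    printf("rate: %.3e failures/h, MTTF: %.3e h, expected failures: %.3e\n",
           fit_to_rate_per_hour(fit), mttf_hours(fit),
           expected_failures(fit, mission_hours));
}
```

With a hypothetical 100 FIT and ten years of continuous operation (87600 hours), expected_failures(100.0, 87600.0) comes out at roughly 0.0088, i.e. a bit under 1%. Note that a FIT number covers the whole device, not a single status bit.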

JW

Tesla DeLorean · ‎2023-12-07

Not sure it is rated for space operation.

But generally this is why you don't loop infinitely on things, and instead have timeouts so you can respond / recover in an orderly fashion.


mƎALLEm · ‎2023-12-07

Hello,

I don't know if this helps:

https://www.opensourcesatellite.org/downloads/KS-DOC-01251_STM32H7_Radiation_Test_Report.pdf


Pavel A. · ‎2023-12-07

Why space operation? Just a usual freakin' laser anti-drone gun ;)

Gpeti · ‎2023-12-07

Adding a loop control mechanism (like a timeout) can decrease performance in some cases (like full-speed SPI communication in polling mode).

Danish1 · ‎2023-12-08

Failures will occur, most likely due to programmer error, including errors by ST or other professional software vendors (HAL libraries). There are also mistakes in the implementation of peripherals where the protocol is complicated (for example I2C, which supports multi-master operation). And then there are cases where poor circuit design exposes electrical signals to interference and/or poor rise times.

Gpeti wrote: "Adding a loop control mechanism (like a timeout) can decrease performance in some cases (like full-speed SPI communication in polling mode)."

Are you trolling us? Polling is not efficient. Where speed is important, use DMA. And if you insist on polling, you can arrange program-flow so that a "time-wasteful" check-for-timeout only happens when the peripheral says it is not ready so your amazing high-performance CPU is just twiddling its thumbs.
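
A minimal sketch of that last arrangement, where the timeout bookkeeping is only touched once the peripheral has reported "not ready". Everything here is a hypothetical stand-in so the example runs on a PC: sim_periph_t, read_status() and get_tick() simulate a status register and a tick counter, and are not ST HAL names. On real hardware they would be replaced by a volatile register read and whatever time base the project uses.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_READY (1u << 0)

/* Hypothetical simulated peripheral (not a real STM32 register layout). */
typedef struct {
    volatile uint32_t status; /* simulated status register */
    uint32_t ready_after;     /* simulation only: polls before READY sets;
                                 UINT32_MAX models a stuck status bit */
    uint32_t polls;           /* simulation only: number of reads so far */
} sim_periph_t;

/* Simulated tick counter: advances by one on every read. */
static uint32_t sim_ticks;

uint32_t get_tick(void)
{
    return sim_ticks++;
}

/* Simulated status read: sets READY once enough polls have happened. */
uint32_t read_status(sim_periph_t *p)
{
    if (p->polls >= p->ready_after) {
        p->status |= STATUS_READY;
    }
    p->polls++;
    return p->status;
}

/* Returns true when the peripheral is ready, false on timeout. */
bool wait_ready(sim_periph_t *p, uint32_t timeout_ticks)
{
    /* Fast path: already ready, so no timeout bookkeeping at all. */
    if (read_status(p) & STATUS_READY) {
        return true;
    }

    /* Slow path: the CPU would be idle anyway, so the timeout check
     * costs nothing that matters. */
    uint32_t start = get_tick();
    while (!(read_status(p) & STATUS_READY)) {
        /* Unsigned subtraction stays correct across counter wrap. */
        if ((uint32_t)(get_tick() - start) >= timeout_ticks) {
            return false;
        }
    }
    return true;
}
```

When the peripheral keeps up, only the first status read is executed per call, so throughput is the same as for a bare polling loop. The caller still has to decide what an orderly recovery looks like when wait_ready() returns false.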

waclawek.jan · ‎2023-12-08

Are you interested in (A) real safety, or (B) in fulfilling some external requirements e.g. for some sort of certification?

In case of (A), the probability of one flip-flop (the status register's bit) failing is several orders of magnitude lower than the probability of a failure in the thousands of gates needed to execute the "control mechanism".

And that is several orders of magnitude lower than the probability of having a software bug.

In other words, the timeout mechanism may be not only redundant, it may even be harmful. I don't really have a universal solution for (A), as that has to be judged case by case (yes, it's very, very, very, very expensive, and the result of the analysis may turn out to be that, given the circumstances, there's nothing that would make it better).

Now in case of (B), you have no choice, no decision, and no options to contemplate. You simply have to fulfill the requirements given externally, not questioning their rationale, as they may be purely legal or administrative i.e. non-technical, non-rational. Plus you have the extra burden of not making things much worse in that process.

Sorry, but it is what it is.

JW

Gpeti · ‎2023-12-08

In some contexts, such as functional safety, you need to reduce the use of interrupts. Also, DMA is used for memory transfers, not (for example) to wait for the end of a communication. Things are more complicated than you think, so avoid being rude.

Pavel A. · ‎2023-12-08

One possible answer to this question is Cortex-R. Basically it duplicates all MCU functions, which should help against "random glitches". But this is exotic and expensive, and its effectiveness is not obvious. When possible, double or triple redundancy is done at the level of larger modules that each include a normal Cortex MCU (not R) together with sensors, memories etc., integrated by an external "arbiter" circuit. The "arbiter" decides which instance will drive the output signals and which one is faulty. Of course this is complex and costly, but it can be done with widely available parts.
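
To illustrate the decision such an "arbiter" makes in the triple-redundant case, here is a minimal sketch of a 2-out-of-3 majority vote. It is written in C only for readability; a real arbiter would typically be an independent circuit, and the names vote_2oo3 and vote_result_t are made up for this example.

```c
#include <stdint.h>

#define NO_FAULT        (-1) /* all three channels agree */
#define MULTIPLE_FAULTS (-2) /* no two channels agree */

typedef struct {
    uint32_t value; /* voted output word */
    int faulty;     /* index 0..2 of the disagreeing channel,
                       NO_FAULT, or MULTIPLE_FAULTS */
} vote_result_t;

/* 2-out-of-3 vote over the outputs of three redundant channels. */
vote_result_t vote_2oo3(uint32_t a, uint32_t b, uint32_t c)
{
    vote_result_t r;

    /* Bitwise majority: each output bit takes the value that at
     * least two of the three inputs hold. */
    r.value = (a & b) | (a & c) | (b & c);

    if (a == b && b == c) {
        r.faulty = NO_FAULT;
    } else if (a == b) {
        r.faulty = 2;
    } else if (a == c) {
        r.faulty = 1;
    } else if (b == c) {
        r.faulty = 0;
    } else {
        /* No majority on the whole word: the voted value should not
         * be trusted and the system should go to its safe state. */
        r.faulty = MULTIPLE_FAULTS;
    }
    return r;
}
```

With only two channels (double redundancy) a mismatch can be detected but not attributed to one instance, which is why identifying the faulty module needs three channels or additional self-diagnostics.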