2019-11-07 11:29 AM
A hardware parity generation and checking mechanism is described in the datasheets of several STM32 controllers (e.g. STM32F0/3x/L4x).
I have a few questions and concerns about this feature and would be grateful if they could be addressed:
1) Use of the feature
Based on the reference manual of the related devices (e.g. STM32F0xx), the HW parity check can only detect faults, not correct them (e.g. in case of single event upsets).
Furthermore, detecting here means signalling that data have been modified "externally", without identifying the cause or location (e.g. the affected address, as an MPU fault would).
I used the following procedure to configure and test the feature (the firmware was launched via an onboard debugger over SWD):
a) Enable HW parity check in option bytes (OB) of the flash (SRAM2_PE/RAM_PARITY_CHECK flag in FLASH OB).
b) (optional) Clean up the SRAM2 area (in this case, SRAM2 was protected).
c) Enable interrupts.
d) Pause debugging.
e) Inject a fault in SRAM2 (by manipulating memory with debugger over SWD).
f) Resume debugging.
As expected, the controller triggers an NMI (NonMaskableInt_IRQn).
However, I also expected the SPF/PEF flag in the SYSCFG_CFGR2 register to be set (in addition to the NMI), and it was not.
To make sure that the NMI is coming from the HW parity check mechanism and not from some other source, I repeated the sequence with the HW parity check disabled. The behavior was as expected, i.e. no NMI in this case.
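One way to take the debugger view out of the picture is to snapshot SYSCFG_CFGR2 as the very first action in the NMI handler and inspect the snapshot afterwards. The following is only a minimal sketch: the log structure and the function name are hypothetical, and the flag position (bit 8) is an assumption that should be checked against the reference manual of the device in use. The register is passed in as a pointer so the logic can also be exercised on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 8 of SYSCFG_CFGR2 is the parity error flag
 * (named SRAM_PEF or SPF depending on the family). Verify in the RM. */
#define PARITY_FLAG_BIT (1u << 8)

/* Hypothetical log record. On target it should live outside the
 * parity-protected region that is being tested. */
typedef struct {
    uint32_t nmi_count;      /* number of NMIs recorded so far */
    uint32_t cfgr2_snapshot; /* raw register value at handler entry */
    bool     flag_seen;      /* true if the parity flag was set at entry */
} parity_nmi_log_t;

/* Read the register once, before anything else, and record what was seen. */
static void parity_nmi_record(const volatile uint32_t *cfgr2,
                              volatile parity_nmi_log_t *log)
{
    uint32_t value = *cfgr2;

    log->nmi_count++;
    log->cfgr2_snapshot = value;
    log->flag_seen = (value & PARITY_FLAG_BIT) != 0u;
}
```

On target this would be called from NMI_Handler with the address of SYSCFG->CFGR2 (from the CMSIS device header). If the snapshot never shows the flag, it may also be worth confirming that the SYSCFG peripheral clock is enabled before the register is read.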
My questions:
* Has anyone experienced the same issue with the SPF/PEF flag (or could there be a problem with the sequence above)?
* Is there a "better" way to verify/test the feature?
* Is there a way to identify the faulty memory address?
Thanks!
Zike
(The remaining points are just general points/concerns and are provided in the attachment.)
2019-11-07 11:42 AM
I suspect it is there to identify a failure that renders the system untrustworthy (systemic failure), not something that can be remediated, so that it can fail "safely", whatever that means in your context, i.e. turn off the spinning knife blades, illuminate a big red light.
In much the same way as a BIST would give a GO, NO-GO indication. You print some message, and stop.
Something I think you need to go over with your FAE, or ask as an awkward question when they highlight "safety" in the slide deck for the N-teenth time at the seminars.
2019-11-07 11:45 AM
>>* Is there a way to identify the faulty memory address?
I think it affords you an opportunity to scan the entire memory array, and quantify just how bad the situation has gotten.
Isn't one of the "safety" things that you periodically scan memory, and CRC the ROM/FLASH to determine system integrity?
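As an illustration of the periodic integrity check, here is a minimal sketch of a software CRC-32 over a memory region, compared against a reference value. It uses the common reflected CRC-32 (polynomial 0xEDB88320, as in zlib), which is an assumption made for the example; the on-chip CRC unit may use different settings by default. The names crc32_compute and region_is_intact are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (init 0xFFFFFFFF, final XOR 0xFFFFFFFF). */
static uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            /* If the low bit is set, shift and apply the polynomial. */
            uint32_t mask = (uint32_t)0 - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Compare a region against a reference CRC stored at build time. */
static bool region_is_intact(const uint8_t *start, size_t len,
                             uint32_t expected_crc)
{
    return crc32_compute(start, len) == expected_crc;
}
```

In a real system the reference CRC would typically be computed at build time and stored alongside the image, and the check would run from a low-priority task or in slices so that it does not disturb timing.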
2019-11-07 03:46 PM
Many thanks for taking the time, I appreciate it! The question was actually aimed at the ST implementation, not at the method as such.
Maybe I was not specific enough. Sorry!
A short analysis (to argue the statement):
a) Systematic failures would not be detected here.
E.g. stuck-at high/low: this would require destructive BIST such as galpat, march, ram pattern test, etc.
b) Random hardware failures -- maybe, in part.
b1) Transient events caused by soft errors, yes.
E.g. EMI, radiation (strikes by ionizing particles/alpha particles, cosmic neutrons).
b2) Other hw failures (e.g. SRAM controller or bus failure), no.
c) Acc. to the ref. manual, the mechanism could be used to meet normative requirements (safety applications).
Now, conclusions from the points above:
1) From a) to b) it follows that the mechanism has been proposed to deal with soft errors.
2) If used to detect faults and meet specific safety regulations (e.g. IEC 61508 requires soft errors to be dealt with from SIL3), a mechanism with 1 parity bit per byte of raw data would not have sufficient detection quality (i.e. 1-bit and some multiple-bit failures can be detected, but not all).
Normally, one would deploy typical BISTs to solve this (e.g. data redundancy and comparator).
So, this would bring us to the conclusion that the mechanism is indeed made to support recovery.
But, as pointed out previously, we lack some relevant data for recovery (faulty address, readout of parity bits).
If these assumptions hold, I do not see a reasonable use case/application field for this STM mechanism.
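The detection limit of one parity bit per byte mentioned in point 2 can be illustrated with a small sketch. It assumes even parity computed in software purely for demonstration; the polarity and layout of the hardware parity bits are not stated here, and the function names are hypothetical. Any odd number of flipped bits changes the parity and is detected, while any even number leaves it unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Parity of one byte: returns 1 if the number of set bits is odd. */
static unsigned parity_bit(uint8_t byte)
{
    byte ^= (uint8_t)(byte >> 4);
    byte ^= (uint8_t)(byte >> 2);
    byte ^= (uint8_t)(byte >> 1);
    return byte & 1u;
}

/* True if flipping the bits in error_mask would be caught by comparing
 * the recomputed parity with the parity stored at write time. */
static bool parity_detects(uint8_t stored, uint8_t error_mask)
{
    unsigned saved = parity_bit(stored);
    uint8_t corrupted = (uint8_t)(stored ^ error_mask);

    return parity_bit(corrupted) != saved;
}
```

For example, parity_detects(0xA5, 0x01) is true (single-bit flip), while parity_detects(0xA5, 0x03) is false (two bits flipped in the same byte).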
2019-11-07 06:09 PM
I don't believe the parity error checking implemented on the STM32F0 series is meant to give you the ability to recover gracefully from an error. It's simply meant as a self-check bit. Once you see an error, the device can't be trusted and needs to be reset and/or verified to be performing correctly.
That's not useless. The absence of errors gives you an excellent probability that the device does not experience any memory errors. I'd imagine it would be cost-prohibitive to implement SRAM ECC on this series.
You can of course roll your own ECC solution. Triple buffer each array, or something else.
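A minimal sketch of the triple-buffer idea, assuming a bitwise majority vote over three copies of each value (strictly speaking this is redundancy with voting rather than an ECC code). The type and function names are hypothetical. A read votes bit by bit and writes the voted value back, so a flip in any single copy is both masked and repaired.

```c
#include <stdint.h>

/* Three copies of one 32-bit value. On target, the copies would ideally
 * be placed in physically separate memory areas. */
typedef struct {
    volatile uint32_t copy[3];
} tmr_u32_t;

/* Store the same value in all three copies. */
static void tmr_write(tmr_u32_t *v, uint32_t value)
{
    v->copy[0] = value;
    v->copy[1] = value;
    v->copy[2] = value;
}

/* Bitwise 2-out-of-3 vote, followed by a scrub of all copies. */
static uint32_t tmr_read(tmr_u32_t *v)
{
    uint32_t a = v->copy[0];
    uint32_t b = v->copy[1];
    uint32_t c = v->copy[2];
    uint32_t voted = (a & b) | (a & c) | (b & c);

    tmr_write(v, voted); /* repair any copy that disagreed */
    return voted;
}
```

This costs three times the RAM and extra cycles per access, so it is usually reserved for a small set of critical variables.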
Many of the points you mentioned are also given in ST's own docs:
2019-11-08 08:13 AM
Thanks for the update, this summary has some very interesting hints.
I would like to agree that I can deploy it for fault detection, but it is quite difficult to find supporting arguments for it.
For example, according to the document provided via the link, the (highest quality) targets are industrial controllers with safety integrity up to IEC 61508/SIL3.
Now, when just taking a look here:
Ch 5.4, p.14:
"As not all the multiple bit errors are covered, the method is efficient when distributed design of bits collected at single word is applied".
I.e.:
"... physical distance of two columns carrying two logically neighbored bits - should be kept greater than 4".
We would need to make a major reorganization of our memory to meet the quality requirements.
I would definitely not use it for fault detection :)
(Maybe the other standard referenced in the document (IEC 60335, household appliances) is less restrictive here and may leave some room for maneuver. But, on the other hand, for lower quality/integrity/diagnostic coverage, there is usually no need to deal with soft errors at all.)
2019-11-08 08:23 AM
I would guess that ECC is there to detect radiation flipping bits in the RAM, which becomes more susceptible as the process technology shrinks. Put the chip in space and plenty of ECC events will probably trigger. I guess the event should run a function from flash, do something and reset. As the stack is in RAM... Maybe just wait for a watchdog reset...