STM32H7 Hard Fault

NSR · ‎2021-06-22

I have a program that reads data from external flash memory over SPI into the dataStore buffer of a double-buffered system. An interrupt fires roughly every 700 ms; the only ISR sets a flag, and when the main loop sees the flag it clears it and processes the block. During processing, it queues a read of new data into the opposing buffer, to be carried out at an appropriate time once the current processing has finished, and then it all starts again.
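For readers following along, here is a minimal host-side sketch of the flag handoff described above. It is not my actual firmware: `onUpdateInterrupt`, `readBlockInto`, `pollAndProcess` and the summing "processing" step are hypothetical names made up for illustration, and the refill happens immediately instead of being queued for later.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BUFFERS 2u
#define BLOCK_WORDS 128u

/* Written by the ISR, read and cleared by the main loop, hence volatile. */
static volatile bool blockReady = false;

static uint32_t store[NUM_BUFFERS][BLOCK_WORDS];
static uint32_t activeBuffer = 0u; /* buffer the main loop processes next */

/* Hypothetical ISR body: it only raises the flag. */
void onUpdateInterrupt(void)
{
    blockReady = true;
}

/* Hypothetical stand-in for the blocking SPI flash read into one buffer. */
static void readBlockInto(uint32_t *dst, uint32_t seed)
{
    for (uint32_t i = 0u; i < BLOCK_WORDS; i++) {
        dst[i] = seed + i;
    }
}

/* One pass of the main loop. Returns true if a block was processed. */
bool pollAndProcess(uint32_t *sumOut)
{
    if (!blockReady) {
        return false;
    }
    blockReady = false;

    /* "Process" the active buffer (here: just sum it). */
    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < BLOCK_WORDS; i++) {
        sum += store[activeBuffer][i];
    }
    *sumOut = sum;

    /* Refill the opposing buffer, then swap roles for the next update. */
    uint32_t other = activeBuffer ^ 1u;
    readBlockInto(store[other], 1000u);
    activeBuffer = other;
    return true;
}
```

The point of the sketch is only that the ISR never touches the buffers; everything else runs in thread mode from the main loop.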

There is plenty of time between updates. Varying the flash clock (between 7.5 and 60 MHz) can make the problem either better or worse; the SPI read itself takes anywhere from under 100 us to about 5 ms, depending on the speed the interface is configured for. At 7.5 MHz it trips over frequently; at 60 MHz it is relatively stable. I know the obvious answer is to run it at 60 MHz, but I need to get to the bottom of this instability issue.

The Fault Analyzer of STM32CubeIDE is indicating a Hard Fault from Bus, memory or usage fault (FORCED). The Bus Fault Details indicate Imprecise data access violation (IMPRECISERR). The Register Content During Fault Exception has the PC pointing at the following line:

myData = dataStore[ buff[object] ][object][position];

The variables in this line were previously defined as:

uint32_t dataStore[2][16][128];
uint32_t object;
uint32_t buff[16];
uint32_t position;
uint32_t myData;

and these variables have the following state upon error:

object == 0
buff[0] == 0
position == 6
dataStore[0][0][] == { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 126, 127 };
myData == 5

So far so good. Accessing the data within dataStore appears to have triggered the fault somehow, as the new value hasn't yet been transferred to the myData variable. The question is, how has this happened?

Unfortunately, I don't have a JTAG interface on the custom board (only SWD), nor do I have an external debugger, except an external ST-Link V2 device to connect to the SWD. However, I have managed to get some information from the HardFault Analyzer of the STM32CubeProgrammer:

13:17:47:533 : -------------------------------------------------------------------

13:17:47:535 : STM32CubeProgrammer HardFault Analyzer

13:17:47:536 : -------------------------------------------------------------------

13:17:47:536 : halt ap 0

13:17:47:542 : Execution Mode : Handler

13:17:47:542 : r ap 0 @0x2407FEE8 0x00000020 bytes

13:17:47:542 : r ap 0 @0xE000ED2A 0x00000002 bytes

13:17:47:542 : No Usage Fault detected

13:17:47:542 : r ap 0 @0xE000ED29 0x00000001 bytes

13:17:47:543 : Bus Fault detected in instruction located at 0x08001B44

13:17:47:579 : IMPRECISERR : a data bus error has occurred but the return address

13:17:47:580 : in the stack frame is not related to the instruction that caused the error.

13:17:47:580 : r ap 0 @0xE000ED28 0x00000001 bytes

13:17:47:580 : No MemManage Fault detected

13:17:47:580 : HardFault detected :

13:17:47:628 : Faulty function called at this location 0x080017D1

13:17:47:628 : r ap 0 @0xE000ED2C 0x00000004 bytes

13:17:47:629 : HardFault State Register information :

13:17:47:629 : FORCED : forced HardFault.

13:17:47:629 : Exception return information :

13:17:47:630 : Return to Thread mode, exception return uses non-floating-point

13:17:47:630 : state from MSP and execution uses MSP after return.

This again points to the 'rogue' instruction identified above - for some reason.
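As a cross-check on what the analyzer reports, the register reads in the log (0xE000ED28 to 0xE000ED2C) are the CFSR and HFSR. Below is a small sketch of decoding them by hand, using the ARMv7-M bit positions as I understand them; `FaultSummary` and `decodeFault` are hypothetical names, and the example values in the note are only what I would expect for this fault, not values I have dumped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions per ARMv7-M: CFSR at 0xE000ED28, HFSR at 0xE000ED2C. */
#define CFSR_MMFSR_MASK   0x000000FFu  /* MemManage status byte */
#define CFSR_PRECISERR    (1u << 9)    /* BFSR bit 1 */
#define CFSR_IMPRECISERR  (1u << 10)   /* BFSR bit 2 */
#define CFSR_BFARVALID    (1u << 15)   /* BFSR bit 7 */
#define CFSR_UFSR_MASK    0xFFFF0000u  /* UsageFault status half-word */
#define HFSR_FORCED       (1u << 30)

typedef struct {
    bool forced;     /* HardFault escalated from a configurable fault */
    bool imprecise;  /* stacked PC is not the faulting instruction */
    bool precise;    /* stacked PC is the faulting instruction */
    bool bfarValid;  /* BFAR holds the faulting address */
    bool memManage;  /* any MemManage status bit set */
    bool usage;      /* any UsageFault status bit set */
} FaultSummary;

/* Hypothetical helper: summarise raw CFSR/HFSR values. */
FaultSummary decodeFault(uint32_t cfsr, uint32_t hfsr)
{
    FaultSummary s;
    s.forced    = (hfsr & HFSR_FORCED) != 0u;
    s.imprecise = (cfsr & CFSR_IMPRECISERR) != 0u;
    s.precise   = (cfsr & CFSR_PRECISERR) != 0u;
    s.bfarValid = (cfsr & CFSR_BFARVALID) != 0u;
    s.memManage = (cfsr & CFSR_MMFSR_MASK) != 0u;
    s.usage     = (cfsr & CFSR_UFSR_MASK) != 0u;
    return s;
}
```

On the target this would be called as `decodeFault(SCB->CFSR, SCB->HFSR)` from the HardFault handler. A CFSR of 0x00000400 with an HFSR of 0x40000000 would correspond to the FORCED plus IMPRECISERR combination shown in the log, with no valid BFAR to point at the offending address.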

Of further relevance, all variables in question have been defined within the same D1 memory domain as the rest of the program and I'm not using the cache at all. Both the minimum stack and heap sizes were initially increased to 0x2000 and then to 0x10000. The hardware is a custom board with the STM32H750; I have also experienced this with the Nucleo H743ZI2 boards, where I had put the problem down to noise owing to the leads required to connect the external flash memory.

NSR · ‎2021-06-24

I've been tracing through this problem the best I can and I have a question regarding the LR register and the stack. When a function is called, it appears that the return address is the next instruction + 1 byte, resulting in an odd address being stored in both the LR register and the stack; is this normal behaviour? I'm kind of confused with this and wondering if this could be (part of) the problem?

SP: 0x2407ff50, LR: 0x800180F, PC: 0x8001F64
0x2407FF40  2407FF70 40000000 2407FF50 2407FF50 566210FE 38000000 00000000 00000000
		startRead();
 8001f64:	f001 fbd2 	bl	800370c <startRead>
 		if(systemRunning == 2)
 8001f68:	4b68      	ldr	r3, [pc, #416]	; (800210c <main+0xd38>)
 
SP: 0x2407ff48, LR: 0x8001F69, PC: 0x8003710
0x2407FF40  2407FF70 40000000 2407FF50 08001F69 566210FE 38000000 00000000 00000000
void startRead(void)
{
 800370c:	b580      	push	{r7, lr}
 800370e:	af00      	add	r7, sp, #0
	while(currentIndex < lastIndex)
 8003710:	e0c0      	b.n	8003894 <startRead+0x188>

The above snippet was cross-referenced from the resultant .list file that was generated during the compilation.

Tesla DeLorean · ‎2021-06-24

In ARM lore an ODD instruction address indicates a THUMB (16-bit) instruction, rather than an ARM (32-bit) instruction. The Cortex-Mx can only run THUMB code.
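To make that concrete, a tiny sketch (the helper names `isThumbAddress` and `instructionAddress` are made up here): bit 0 of the value in LR is the Thumb state flag, and masking it off gives the address of the instruction that will actually execute on return.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of a branch/return target selects Thumb state; it is not part
 * of the instruction address itself. */
bool isThumbAddress(uint32_t addr)
{
    return (addr & 1u) != 0u;
}

/* Strip the Thumb bit to get the real (half-word aligned) address. */
uint32_t instructionAddress(uint32_t addr)
{
    return addr & ~1u;
}
```

Applied to the LR value posted above, 0x08001F69 becomes 0x08001F68, which is the `ldr` immediately after the 4-byte `bl` at 0x08001F64 in the listing, so the stacked value is what you'd expect.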

The LR the Hard Fault Handler gets is a call-gate that handles the context switching, and unstacking the context. ie 0xFFFFFFFD or 0xFFFFFFF9 as I recall depending on which stack is involved.
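A rough sketch of picking those magic values apart, going by the ARMv7-M EXC_RETURN encoding as I remember it (bit 2 selects the stack, bit 3 the return mode, bit 4 whether a floating-point frame was stacked). `ExcReturnInfo` and `decodeExcReturn` are names invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool toThreadMode; /* bit 3: 1 = return to Thread, 0 = Handler */
    bool usesPsp;      /* bit 2: 1 = frame on PSP, 0 = frame on MSP */
    bool basicFrame;   /* bit 4: 1 = no floating-point state stacked */
} ExcReturnInfo;

/* Hypothetical helper: decode the LR value seen inside a handler. */
ExcReturnInfo decodeExcReturn(uint32_t lr)
{
    ExcReturnInfo info;
    info.toThreadMode = (lr & (1u << 3)) != 0u;
    info.usesPsp      = (lr & (1u << 2)) != 0u;
    info.basicFrame   = (lr & (1u << 4)) != 0u;
    return info;
}
```

By that reading 0xFFFFFFF9 is "Thread mode, MSP, no floating-point state", which is what the analyzer's exception return line describes, and 0xFFFFFFFD is the same but with the frame on PSP.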

I'd be looking at the code immediately prior to the imprecise location, it suggests an earlier write into the write buffers, at an address that the MCU isn't decoding. ie disabled memory, or peripheral, or some entirely errant pointer. For example an IRQ Handler depending on a peripheral context structure that hasn't been initialized.


NSR · ‎2021-06-28

Thanks for taking the time to respond.

I saw the odd instruction address, which did indeed look strange given that instructions are supposed to be at least a half-word in size and should therefore sit on an even boundary. Reading up on it, I came to the same conclusion as you described.

Looking into this further, I've discovered some weird clock-stretching going on; yes I know it's an I2C convention but I don't know how else to describe it within an implementation using SPI. I've attached an image to show this. Now the MCU is the master and therefore in charge of the clocks (via the HAL); the question is, what would be causing the clock to stall some 3us when it's running at a cycle rate of 133.33ns or 7.5MHz? This is a huge delay, especially considering that I'm running the internal clock at 480MHz.
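To put the numbers above in perspective, here is the arithmetic as a small sketch (the function names are mine, purely for illustration): the stall is expressed in SPI clock periods and in core cycles.

```c
/* Period of a clock in nanoseconds. */
double clockPeriodNs(double frequencyHz)
{
    return 1.0e9 / frequencyHz;
}

/* How many periods of a given clock fit into a stall of stallUs microseconds. */
double stallInClocks(double stallUs, double frequencyHz)
{
    return stallUs * 1.0e-6 * frequencyHz;
}
```

With a 3 us stall, `stallInClocks(3.0, 7.5e6)` comes to about 22.5 SPI clock periods and `stallInClocks(3.0, 480e6)` to about 1440 core cycles, so whatever is holding the clock off is running for well over a thousand instructions' worth of time.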

NSR · ‎2021-07-02

It appears that I have an erroneous interrupt occurring. At the beginning of my blocking read from SPI function, I've inserted:

	__disable_irq();

and at the end:

	__enable_irq();

The problem appears to have gone away. Now I don't use interrupts for the SPI and while I do use a periodic update interrupt, the ISR simply sets a flag that is picked up in the main loop and therefore kept well out of the way of the SPI reading functionality; it's also long enough to not interfere. The way I see it, I have the following choices:

find out which of the currently enabled interrupts is performing a service update and address it
re-write my SPI function to use DMA.

I think for now, the best way forward would be to find out which of the currently enabled interrupts is being triggered whilst the SPI read is being performed. Looking through the NVIC table I have the following configuration:

NVIC Table					Preemption-Priority	Sub-Priority
 
Non-maskable interrupt					0		0
Hard fault interrupt					0		0
Memory management fault					0		0
Pre-fetch fault, memory access fault			0		0
Undefined instruction or illegal state			0		0
System service call via SWI instruction			0		0
Debug monitor						0		0
Pendable request for system service			0		0
Time base: System tick timer				0		0
TIM2 global interrupt					0		0
USB OTG FS global interrupt				0		0

Any thoughts and suggestions would be greatly appreciated.

NSR · ‎2021-07-05

Ok, I've tried identifying the IRQ that's interfering with my SPI reads by going against the grain and implementing a long loop within an ISR, so that I can periodically monitor the NVIC registers at 0xE000E100 to 0xE000E4EF, to no avail. I'm guessing that an interrupt or exception with a higher (less-than-zero) priority is in play, one that pre-empts my ISR rather than tail-chaining after it.
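One alternative I have not tried yet, noted here as a sketch only: the ICSR register (0xE000ED04, `SCB->ICSR` in CMSIS) reports the active and the highest-priority pending exception numbers, which also covers the system exceptions that the NVIC registers don't show. As far as I understand the ARMv7-M description, VECTPENDING is not masked by PRIMASK, so it should still be readable while interrupts are disabled. The helper names and the `NO_EXCEPTION` sentinel below are invented for the example.

```c
#include <stdint.h>

#define NO_EXCEPTION (-100) /* hypothetical sentinel: field reads as zero */

/* Convert an exception number to CMSIS IRQn numbering (IRQn = number - 16),
 * so system exceptions come out negative and device interrupts start at 0. */
static int32_t toIrqn(uint32_t exceptionNumber)
{
    if (exceptionNumber == 0u) {
        return NO_EXCEPTION;
    }
    return (int32_t)exceptionNumber - 16;
}

/* VECTACTIVE, ICSR bits [8:0]: the exception currently being serviced. */
int32_t activeIrqFromIcsr(uint32_t icsr)
{
    return toIrqn(icsr & 0x1FFu);
}

/* VECTPENDING, ICSR bits [20:12]: highest-priority pending enabled exception. */
int32_t pendingIrqFromIcsr(uint32_t icsr)
{
    return toIrqn((icsr >> 12) & 0x1FFu);
}
```

Sampling `pendingIrqFromIcsr(SCB->ICSR)` just before re-enabling interrupts at the end of the SPI read should, in principle, name whatever was waiting to run; a result of -1 would be SysTick (exception 15), and non-negative results can be looked up in the IRQn enum of the device header.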

In this situation, I think the best solution is to just wrap my blocking SPI read in a disable / enable IRQ block; I figure the max delay of 16us should hopefully be benign overall.

However, this raises the question of how the HAL_SPI_Receive_IT(hspi, pData, Size) function can be expected to work reliably.