2024-05-14 04:35 AM
Heyho,
I recently finished - or so I thought - the ethernet bootloader for STM32H73x.
The bootloader can update the application and vice versa, updating, jumping back and forth, all working fine.
Until yesterday all of a sudden (when I was working on some other http POST stuff unrelated to the internal flash) the bootloader somehow got stuck when verifying the freshly written application.
Verification goes like:
- read (Octo-) SPI flash image page (256 bytes)
- compare SPI flash buffer to internal flash via pointer
The hard fault info that I got was not really useful, I was looking for all kinds of stuff, but could not solve the problem.
Until I added some memory barriers, as you can see below - combined with loading each flash byte into a variable, and then comparing (before that I used directly used the flash byte pointer sFlashIntCtl.pu8ChkAddr[i] in the for loop).
Now it's working again.
BUT... does that make sense?
Or was that just coincidence, and I have another problem?
/* ++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
/* read from SPI flash
* BLOCKING &
* ### outside of state machine ###
*/
sFlashIntCtl.u8ChkError = OspiFlashRdPage();
__DSB();
if( sFlashIntCtl.u8ChkError != HAL_OK )
{
#if DEBUG_FLASH_INT
uart_printf("\n\r# ERR: FLINT_STATE_CHECK OspiFlashRdPage()\n\r");
#endif /* DEBUG_FLASH_INT */
sFlashIntCtl.u32ChkErrors++;
u8FlashIntState = FLINT_STATE_ERROR;
}
else
{
/* compare */
uint32_t u32ErrOld = sFlashIntCtl.u32ChkErrors;
uint8_t u8FlashByte = 0;
/* compare internal flash to SPI flash page buffer */
for( uint32_t i = 0; i < (uint32_t)OSPI_FLASH_PAGE_SIZE; i++ )
{
u8FlashByte = sFlashIntCtl.pu8ChkAddr[i];
__DSB();
if( u8OspiFlashPageBuf[i] != u8FlashByte )
{
sFlashIntCtl.u32ChkErrors++;
}
if( (sFlashIntCtl.u32ChkBtDone + i) >= (sSpiFileInfo.u32Size - 1) ) break;
}
sFlashIntCtl.u32ChkBtDone += sFlashIntCtl.u32ChkLen;
PS: why is that very useful </> code insert button back in line 2? This should be the first one to appear, it's surely more often used than other stuff that appears first.
2024-05-16 02:36 AM
@waclawek.jan thanks!
I put breakpoint at start of hard fault handler, now I can check the registers when debugging.
What I see is what I saw earlier, when sometimes the saving of the registers in the hard fault handler worked:
CFSR:
BFARVALID and PRECISERR are set
BFAR and MMFAR hold the same value, this time it's 0x6B6B6B6B
This "address" seems to be a recurring pattern: it has always been this kind of four times alternating hex numbers (like 0x1D1D1D1D, or 0x23232323, or as above the 0x6B6B6B6B)
Does this make sense to anyone?
I thought with BFARVALID the address in BFAR would make sense
But where can I find what actually triggered the hard fault if the BFAR content does not make sense?
2024-05-16 02:46 AM
Addition to the post above:
I find the BFAR value 0x6B6B6B6B also on the CPU register R1:
2024-05-16 03:07 AM
Checking list / map files, some of the registers contain addresses pointing to some lwIP / http / SSI / CGI functions.
This could actually make some sense, because the client (PC browser) gets a self-refreshing html page to display update progress.
Getting closer...
2024-05-16 04:35 AM
As I've said, you have to look at the stack for the PC before the fault handler is called (actual content of PC points to the fault handler). Then look at the code where this reconstructed PC points, in particular look before that point (in case of precise error it should be the previous instruction, in case of imprecise error it may be a couple of instructions before that. You will probably see some access - load/store - to the offending address (i.e. here through [r1]. To find out what and why is that access, you have to read a bit the asm, together with sources - that's why mixed source/asm is the best way to look at these things.
IDEs may help to decipher this for you, i.e. show the reconstructed registers, point to the source where the reconstructed PC points, show stack backtrace - I don't use IDE.
JW
2024-05-16 08:29 AM
I hope and think I got it:
lwIP's httpd calls http_parse_request() calls http_find_file() which also checks the URI for CGI commands.
CGI application info is stored in a table containing 2 pointers for each entry, one to a constant string (like "/fileup.cgi") and the other to the respective function (called something like CGI_Handler_FileUp()).
These are constants stored in flash.
I check the contents and addresses of these at http server init and get these results, which make sense:
CGI HttpServerInit():
CgiTable[0].pszCGIName = /fileup.cgi @ 080239BC
CgiTable[1].pszCGIName = /admin.cgi @ 080239B1
When not working with the internal flash, these pointers seem to work well.
But when there's flash writing going on, these constants cannot be accessed, and somehow some weird value is assigned, see picture below.
And these are the values found in BFAR and MMAR (0x53535353).
2024-05-16 08:48 AM - edited 2024-05-16 09:03 AM
Further info:
I trigger the flash programming, then let the STM32 do other stuff - which makes absolutely no sense, because this H7 canNOT do anything while flash is erased or programmed.
So I changed that to "blocking".
Now even with that, the SysTick seems to be crazy, which I use for a timeout check... SysTick okay, stoopid me...
BTW, in debug mode I was looking for the stack and stuff in the right top window, until I found it very comdortable on the left...
2024-05-16 08:57 AM
> somehow some weird value is assigned
Place data breakpoint (a.k.a. watchpoint) at the variables which are overwritten, and observe what overwrites them.
JW
2024-05-16 11:09 PM
> Place data breakpoint (a.k.a. watchpoint) at the variables which are overwritten, and observe what overwrites them.
I cannot find the setting for that in STM32CubeIDE.
I have these variables in the "Expressions" window, but that doesn't show me when they are changed.
BUT again my UART debugging helps:
Right before the very first FlashIntWrPage() (programming 1 flash page in application area):
httpd_cgis[0].pszCGIName = /fileup.cgi @ 08023B27
httpd_cgis[0].pfnCGIHandler is @ 080025E1
httpd_cgis[1].pszCGIName = /admin.cgi @ 08023B1C
httpd_cgis[1].pfnCGIHandler is @ 080026B5
Right after the very first FlashIntWrPage() these pointers are reset:
httpd_cgis[0].pszCGIName = @ 00000000
httpd_cgis[0].pfnCGIHandler is @ 00000000
httpd_cgis[1].pszCGIName = @ 00000000
httpd_cgis[1].pfnCGIHandler is @ 00000000
This is really crazy...
Here's the FlashIntWrPage function - it checks the address to be written to make sure that only addresses in the application area are written (FLASH_INT_ADDR_START_APP = 0x08040000).
It's basically like HAL_FLASH_Program(), without the extra bank option.
/* +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
/* FLASH write page
*/
/**
* @brief write page / word of 256 bits = 8x 32bit words
* waiting for programming page done in state machine
* only writing APPLICATION flash from bootloader
* @PAram u8WrBlock wait here for finish
* write settings elsewhere,
* for now fixed page buffer is used
* @retval 0 = ok
*/
uint8_t FlashIntWrPage(uint8_t u8WrBlock)
{
uint8_t u8RetVal = 0;
/* check source pointer */
if( NULL == sFlashIntCtl.pu8WrBuf )
{
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): pu8WrBuf = NULL\n\r");
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
/* check flash destination address */
if( sFlashIntCtl.u32WrAddr < FLASH_INT_ADDR_START_APP ||
sFlashIntCtl.u32WrAddr >= FLASH_INT_ADDR_END_APP )
{
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): u32WrAddr out of range\n\r");
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
/* check LOCK bit */
if( FLASH->CR1 & FLASH_CR_LOCK )
{
if( FlashIntUnlock() != HAL_OK )
{
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): FlashIntUnlock()\n\r");
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
}
/* check if last operation done */
if( FlashIntWaitLastOp(FLASH_INT_TO_BUSY_MS) != HAL_OK )
{
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): QW\n\r");
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
/* wait for BUSY release to write registers */
if( FlashIntCheckBusyTo(FLASH_INT_TO_BUSY_MS) != HAL_OK )
{
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): BUSY\n\r");
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
/* reset some error flags */
/* get error flags */
uint32_t u32RegVal = (FLASH->SR1 & FLASH_INT_SR_ERR_ALL);
/* in case of error reported in Flash SR1 */
if( u32RegVal != 0 )
{
/* save error code*/
sFlashIntCtl.u32SrErrors |= u32RegVal;
/* clear error programming flags */
FLASH->CCR1 = u32RegVal;
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): SR errors = %08lX\n\r", sFlashIntCtl.u32SrErrors);
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
/* check SR1 once more */
u32RegVal = FLASH->SR1;
if( u32RegVal != 0 )
{
#if DEBUG_FLASH_INT
uart_printf("# ERR: FlashIntWrPage(): SR = %08lX\n\r", u32RegVal);
#endif /* DEBUG_FLASH_INT */
return HAL_ERROR;
}
/* +++++++++++++++++++++++++++++++++++++++++++ */
/* start programming */
__disable_irq();
__IO uint32_t *pu32SrcAddr = (__IO uint32_t *)sFlashIntCtl.pu8WrBuf;
__IO uint32_t *pu32DstAddr = (__IO uint32_t *)sFlashIntCtl.u32WrAddr;
uint8_t u8RowIndex = FLASH_NB_32BITWORD_IN_FLASHWORD; /* = 8 -> * 32= 256 */
/* set PG program enable bit */
FLASH->CR1 |= FLASH_CR_PG;
__ISB();
__DSB();
/* program the flash word */
do
{
*pu32DstAddr = *pu32SrcAddr;
pu32DstAddr++;
pu32SrcAddr++;
u8RowIndex--;
} while( u8RowIndex != 0 );
__ISB();
__DSB();
__enable_irq();
/* wait in state machine ? */
#if( 1 )
if( FLASH_INT_WRITE_BLOCK == u8WrBlock )
{
/* wait for last operation to be completed */
//u8RetVal = FlashIntWaitLastOp(FLASH_INT_TO_PROG_PAGE);
uint32_t u32TickWrStr = HAL_GetTick();
do
{
u8RetVal = FlashIntCheckLastOp();
if( (HAL_GetTick() - u32TickWrStr) > FLASH_INT_TO_PROG_PAGE )
{
u8RetVal = HAL_TIMEOUT;
break;
}
} while( u8RetVal != HAL_OK );
/* if the program operation is completed, disable PG bit */
FLASH->CR1 &= ~FLASH_CR_PG;
}
#endif
return u8RetVal;
}
2024-05-17 12:58 AM
2024-05-17 04:58 AM
All of a sudden the compiler moved the variable CGI table to the start of AXI SRAM D1 - no more problems.
No matter what I tried, the compiler didn't move it anywhere else in further compilations.
But I still have the "application", which showed the same behavior when updating the bootloader.
And yes, the current app does the same:
It looks like the variable holding the CGI table pointers above gets reset while reading from OSPI flash, when the "bytes received counter" is incremented.
It looks like the pointer is moved to somewhere else, in this case to one of the http constants for pdfs...
Now I'd like to understand: how can this happen?
Picture 1: CGI table as it should be
Picture 2: CGI table changed by OSPI flash reading