HardFault on unaligned access, after enabling MPU

patrislav · ‎2018-10-17

Hi

I want to start using the MPU on STM32F7, however when I enable the MPU then unaligned accesses lead to a HardFault immediately.

Code to reproduce the HardFault:

static unsigned short foo[5] = { 0 };
memcpy(&foo[1], &foo[3], 4);

I wrote a minimal HardFault handler to output some exception information. It also points to an alignment-related issue (SCB->HFSR = FORCED, SCB->CFSR = UNALIGNED):

SCR   : 0x00000000
CCR   : 0x00070200
CFSR  : 0x01000000
HFSR  : 0x40000000
DFSR  : 0x0000000b
MMFAR : 0x00000000
BFAR  : 0x00000000
AFSR  : 0x00000000

The above dump also shows that CCR->UNALIGN_TRP is not set.
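The register values in the dump can be decoded mechanically. Here is a minimal sketch, assuming the ARMv7-M bit positions (HFSR.FORCED is bit 30, CFSR.UNALIGNED is bit 24, CCR.UNALIGN_TRP is bit 3). The macro and function names are made up for illustration and are not CMSIS identifiers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions per the ARMv7-M architecture; the names are illustrative only. */
#define HFSR_FORCED_BIT      (UINT32_C(1) << 30)  /* configurable fault escalated to HardFault */
#define CFSR_UNALIGNED_BIT   (UINT32_C(1) << 24)  /* UFSR.UNALIGNED (bit 8 of the UFSR half)   */
#define CCR_UNALIGN_TRP_BIT  (UINT32_C(1) << 3)   /* trap on every unaligned word/halfword access */

static inline bool is_escalated_fault(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED_BIT) != 0;
}

static inline bool is_unaligned_fault(uint32_t cfsr)
{
    return (cfsr & CFSR_UNALIGNED_BIT) != 0;
}

static inline bool unaligned_trap_enabled(uint32_t ccr)
{
    return (ccr & CCR_UNALIGN_TRP_BIT) != 0;
}
```

Fed with the values from the dump (HFSR 0x40000000, CFSR 0x01000000, CCR 0x00070200), the first two checks are true and the third is false, which matches the reading given in the post.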

I don't get a HardFault and everything runs smoothly if the MPU is not activated.

The program also runs smoothly with MPU activated if no unaligned memory accesses are happening.

I'm kind of puzzled, as I searched all the MPU documentation and found no hint of any side effects regarding unaligned memory accesses. Some sources indicate that unaligned exceptions may be raised if the memory region is marked as a device region (bufferable, shareable, not cacheable). However, I observe this error with all possible combinations of MPU region attributes.

Thanks in advance, Patrick

MPU initialization:

void configureMPU(void)
{
    LL_MPU_Disable();
 
    // start addresses of RAM, FLASH, SRAM1 sections (from linker script)
    extern int _sdata, _flash_start, _ssram1;
 
    unsigned int sram = (unsigned int)&_sdata;          // 0x20000000
    unsigned int sflash = (unsigned int)&_flash_start;  // 0x08000000
    unsigned int ssram1 = (unsigned int)&_ssram1;       // 0x20020000
 
    /* Configure DTCM RAM region */
    LL_MPU_ConfigRegion(LL_MPU_REGION_NUMBER0, 0x00, sram, LL_MPU_REGION_SIZE_128KB
        | LL_MPU_REGION_FULL_ACCESS
        | LL_MPU_ACCESS_BUFFERABLE
        | LL_MPU_ACCESS_CACHEABLE
        | LL_MPU_ACCESS_NOT_SHAREABLE
        | LL_MPU_TEX_LEVEL1
        | LL_MPU_INSTRUCTION_ACCESS_DISABLE);
 
    /* Configure FLASH region */
    LL_MPU_ConfigRegion(LL_MPU_REGION_NUMBER1, 0x00, sflash, LL_MPU_REGION_SIZE_2MB
        | LL_MPU_REGION_FULL_ACCESS
        | LL_MPU_ACCESS_NOT_BUFFERABLE
        | LL_MPU_ACCESS_CACHEABLE
        | LL_MPU_ACCESS_NOT_SHAREABLE
        | LL_MPU_TEX_LEVEL0
        | LL_MPU_INSTRUCTION_ACCESS_ENABLE);
 
    /* Configure SRAM1 region */
    LL_MPU_ConfigRegion(LL_MPU_REGION_NUMBER2, 0x00, ssram1, LL_MPU_REGION_SIZE_256KB
        | LL_MPU_REGION_FULL_ACCESS
        | LL_MPU_ACCESS_BUFFERABLE
        | LL_MPU_ACCESS_NOT_CACHEABLE
        | LL_MPU_ACCESS_SHAREABLE
        | LL_MPU_TEX_LEVEL0
        | LL_MPU_INSTRUCTION_ACCESS_ENABLE);
 
    LL_MPU_Enable(LL_MPU_CTRL_HFNMI_PRIVDEF);
}

Bob S · ‎2018-10-17

It's been a while since I've dealt with MPU stuff, but at least on the F4 series I recall that some memory regions had MPU attributes that were fixed in hardware and could not be altered with settings in the MPU. I don't know if the F7 family has a similar issue. Could it be that the RAM where you are storing your array is permanently flagged with some attribute that implies "aligned access only"?

This may or may not be on purpose, but you are reading data past the end of your array.

And to ask the obvious question - are you sure it is code in the memcpy() call that generates the fault?

Tesla DeLorean · ‎2018-10-17

What toolchain is involved here?

patrislav · ‎2018-10-17

@Bob S, I'm not reading past the end of the array. It's an array of unsigned shorts, so 4 bytes means 2 array elements, and foo[3] and foo[4] are copied. Anyway, that code is just for illustration.

Yes, in the meantime I extended my fault handler to reveal the PC and it shows that the PC is located where the unaligned access is taking place.
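For readers who want to do the same: on exception entry the core pushes eight words (r0-r3, r12, lr, pc, xPSR) onto the active stack, so the faulting PC is the seventh word of that frame. Here is a minimal sketch, assuming a basic frame without floating-point state. The names `stacked_pc` and `FRAME_*` are hypothetical. Obtaining the frame pointer itself (MSP or PSP, depending on bit 2 of EXC_RETURN) needs a small assembly trampoline in the handler, which is not shown here:

```c
#include <stdint.h>

/* Word offsets inside the basic exception stack frame. */
enum {
    FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR
};

/* Return the program counter that was stacked when the exception was taken. */
static inline uint32_t stacked_pc(const uint32_t *frame)
{
    return frame[FRAME_PC];
}
```

In a real handler, `frame` would be the stack pointer value captured on entry to HardFault_Handler.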

If there was any MPU flag that results in "aligned access only", that would of course be interesting - but I didn't find anything like that in the MPU documentation.

@Community member I'm using GNU ARM embedded toolchain from developer.arm.com

$ /opt/gcc-arm-none-eabi-7-2017-q4-major/bin/arm-none-eabi-gcc --version
arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]

Bob S · ‎2018-10-17

Yeah - I missed the "short". Sorry.

And the PC pointed somewhere in the memcpy() function?

In the Cortex M4 (L4/F3/F4) programmers manual there is a table that shows the MPU TEX, C, B and S bit encodings, and which combinations make the memory region a "device" region, which requires aligned access (in a section titled "MPU access permission attributes"). Later under "Recommended MPU configuration" there is a paragraph that states that the C and S bits do not affect functionality (on the L4/F3/F4). I would presume that at least the C bit *does* function on the F7 series.

Anyway, see if the F7 programmers manual has a similar table, then figure out how your settings map to the C/B/S bits (TEX is pretty obvious) and see if the sram section is somehow configured as a "device" memory type. Of course I am presuming that your array is being placed in the "sram" section.

Or - I just stumbled across this: from the same L4/F3/F4 Cortex programmers manual, in the description of the CCR "UNALIGN_TRP" bit, the LDM, STM, LDRD and STRD will *always* fault on unaligned access regardless of the UNALIGN_TRP setting. So maybe see if the memcpy() code is using one of those instructions.

Tesla DeLorean · ‎2018-10-17

The LDRD frequently bites with Keil, when it tries to pull a double out of a byte pointer.

memcpy() is supposed to be safe, watch if it is optimized based on the length, or cast as a (void *) to give the compiler pause.

Still, not sure I understand the role of the MPU here.

What instructions is the compiler generating for the code snippet?
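On the LDRD point: one common way to avoid a multi-word load against a possibly unaligned byte pointer is to copy through memcpy() into an aligned local variable instead of casting the pointer. Here is a minimal sketch. `load_u64` is a hypothetical helper name, and which instructions the compiler actually emits depends on the toolchain and optimization level, so the disassembly is still worth checking:

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a byte buffer at any alignment.
 * The pointer is never cast to uint64_t *, so the compiler may not
 * assume 8-byte (or even 4-byte) alignment of the source. */
static inline uint64_t load_u64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

This only covers alignment assumptions made by the compiler. It does not help when the memory type itself forbids unaligned accesses.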

patrislav · ‎2018-10-18

Hi @Bob S and @Community member , thank you for your comments.

In an effort to further narrow this down and isolate it from any other stuff that might or might not have side effects, I created a new test, starting with an empty CubeMX project, using the default ST supplied .ioc for the standard Nucleo-767ZI board.

So this time, the only "custom" code in this test is the MPU initialization, the alignment test code and the hardfault handler - EVERYTHING else is standard CubeMX boilerplate.

Bottom line, the behavior stays the same: a misaligned access raises a HardFault, if the MPU is activated. If MPU is inactive, the very same code doesn't raise the HardFault.

To keep memcpy() out of this, I also made a new test just using standard C operators.

See below for the C and assembler code (objdump -S output). I also commented the assembler code as far as I understood it and marked the location where the exception is raised ("str r4, [r2, r3]").

The bottom line stays the same: I understand there's an unaligned access happening, but I still don't understand how the resulting fault is related to whether the MPU is activated or not.

Test code C:

static struct teststruct_t {
  short val[3];
} teststruct_array[10] __attribute__ ((used));
 
static void save_teststruct(struct teststruct_t *val, unsigned int index)
{
  teststruct_array[index] = *val;
}
 
// somewhere in main():
struct teststruct_t tmp = { { 1, 2, 3} };
save_teststruct(&tmp, 1);

Test code assembler:

/*
08000238 <save_teststruct>:
static void save_teststruct(struct teststruct_t *val, unsigned int index)
{
 8000238:   b410        push    {r4}                ; function entry. r0 = val, r1 = index. save r4
  teststruct_array[index] = *val;
 800023a:   4a06        ldr r2, [pc, #24]           ; 8000254: Get address of teststruct_array
 800023c:   eb01 0141   add.w   r1, r1, r1, lsl #1  ; r1 = index * 3
 8000240:   004b        lsls    r3, r1, #1          ; r3 = r1 * 2   ==> actual byte offset of target element
 8000242:   18d1        adds    r1, r2, r3          ; r1 = base address of teststruct_array + target element offset
 8000244:   6804        ldr r4, [r0, #0]            ; r4 = first 32 bit of source value
 8000246:   50d4        str r4, [r2, r3]            ; store first 32 bit to target element <-- PC is here, when HardFault is raised
 8000248:   8883        ldrh    r3, [r0, #4]        ; load the last 16 bit of source value
 800024a:   808b        strh    r3, [r1, #4]        ; store the last 16 bit to target element
}
 800024c:   f85d 4b04   ldr.w   r4, [sp], #4        ; restore r4
 8000250:   4770        bx  lr                      ; return
 8000252:   bf00        nop
 8000254:   20000020    .word   0x20000020          ; <-- RAM location of teststruct_array[] in BSS
 
    struct teststruct_t tmp = { { 1, 2, 3} };
 80005d0:   4b0d        ldr r3, [pc, #52]           ; (8000608 <main+0x50>)
 80005d2:   e893 0003   ldmia.w r3, {r0, r1}        ; Get init values { 1, 2, 3 } from rodata
 80005d6:   9000        str r0, [sp, #0]            ; Initialize local variable "tmp" on the stack
 80005d8:   f8ad 1004   strh.w  r1, [sp, #4]        ; ...
    save_teststruct(&tmp, 1);
 80005dc:   2101        movs    r1, #1              ; Pass index parameter (=1) to save_teststruct()
 80005de:   4668        mov r0, sp                  ; Pass address of local variable (=sp) to save_teststruct()
 80005e0:   f7ff fe2a   bl  8000238 <save_teststruct>   ; call save_teststruct()
 
 8000608:   08001204    .word   0x08001204          ; address in rodata where init values { 1, 2, 3 } for teststruct are stored
 
*/
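The faulting store address follows directly from the struct layout. Here is a minimal sketch, assuming `short` is 2 bytes (as on this target) and taking the array base 0x20000020 from the listing. `element_address` and `is_word_aligned` are illustrative names:

```c
#include <stdint.h>

struct teststruct_t {
    short val[3];               /* 6 bytes in total, 2-byte alignment */
};

/* Address of element `index` in an array of teststruct_t starting at `base`. */
static inline uint32_t element_address(uint32_t base, unsigned int index)
{
    return base + index * (uint32_t)sizeof(struct teststruct_t);
}

/* True if a 32-bit access at `addr` would be naturally aligned. */
static inline int is_word_aligned(uint32_t addr)
{
    return (addr & 3u) == 0;
}
```

For index 1 this gives 0x20000026, which is halfword-aligned but not word-aligned, so the 32-bit `str` at 0x8000246 is an unaligned access.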

AvaTar · ‎2018-10-18

If I remember correctly, MPU faults are escalated to HardFaults if no appropriate fault handler is enabled.

patrislav · ‎2018-10-18

Ok, after some further digging I got it!

What I'm experiencing is the superposition of 2 problems.

Problem 1)

As soon as a memory region is set to bufferable and non-cacheable (with TEX level 0, as in my SRAM1 region), it becomes a "device" region.

And this means that every unaligned access raises an exception.

If caching needs to be disabled on actual memory, then buffering has to be disabled as well!
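The mapping from TEX/C/B to memory type can be written down as a small lookup. This is a sketch based on my reading of the ARMv7-M attribute encoding table and should be checked against the F7 programming manual. The enum and function names are hypothetical:

```c
typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE,
    MEM_NORMAL,
    MEM_RESERVED            /* reserved or implementation defined */
} mem_type_t;

/* Classify an MPU region from its TEX (3 bits), C and B attribute bits. */
static mem_type_t classify_region(unsigned tex, unsigned c, unsigned b)
{
    if (tex & 4u) {
        return MEM_NORMAL;                    /* TEX=1xx: cached normal memory */
    }
    switch ((tex << 2) | (c << 1) | b) {
    case 0x0: return MEM_STRONGLY_ORDERED;    /* TEX=000 C=0 B=0 */
    case 0x1: return MEM_DEVICE;              /* TEX=000 C=0 B=1: shared device */
    case 0x2:                                 /* TEX=000 C=1 B=0: write-through */
    case 0x3: return MEM_NORMAL;              /* TEX=000 C=1 B=1: write-back    */
    case 0x4: return MEM_NORMAL;              /* TEX=001 C=0 B=0: non-cacheable */
    case 0x7: return MEM_NORMAL;              /* TEX=001 C=1 B=1: write-back, write-allocate */
    case 0x8: return MEM_DEVICE;              /* TEX=010 C=0 B=0: non-shared device */
    default:  return MEM_RESERVED;
    }
}
```

With the settings from the original configuration, the SRAM1 region (TEX level 0, not cacheable, bufferable) classifies as device memory, while the DTCM region (TEX level 1, cacheable, bufferable) is normal memory. Note that in this table TEX level 0 with both C and B cleared comes out as strongly-ordered rather than normal. As far as I know, that type is also subject to alignment restrictions, so a plain non-cacheable RAM region would probably want TEX level 1 with C and B cleared.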

Problem 2)

The region's base address used for MPU configuration has to be aligned to the size of the region!

So in the above example, where I try to set a 256KB region at 0x20020000 (start of SRAM1), the MPU masks out the lower 18 bits of the start address, which then magically becomes 0x20000000 (start of DTCM).

Since the higher numbered regions take precedence over the lower ones, the first region, which was supposed to configure DTCM, is simply overridden by the last one. Because that one uses a device-ish configuration, the code accessing the DTCM region throws unaligned exceptions.
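The masking can be reproduced on paper or in a few lines of C. Here is a minimal sketch, assuming the region size is a power of two. The function names are hypothetical and not part of the LL driver:

```c
#include <stdint.h>

/* Base address the MPU effectively uses for a region: the low log2(size)
 * bits of the programmed base are ignored. size_bytes must be a power of two. */
static inline uint32_t effective_region_base(uint32_t base, uint32_t size_bytes)
{
    return base & ~(size_bytes - 1u);
}

/* True if `base` can be used unchanged for a region of `size_bytes`. */
static inline int region_base_is_aligned(uint32_t base, uint32_t size_bytes)
{
    return effective_region_base(base, size_bytes) == base;
}
```

A 256KB region programmed at 0x20020000 therefore covers 0x20000000 to 0x2003FFFF, including all of the DTCM. A check like `region_base_is_aligned()` in a debug assertion before each region is configured would catch this. One possible fix, not verified on hardware, is to cover SRAM1 with smaller regions whose bases are size-aligned, for example a 128KB region at 0x20020000.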

Glad to finally have that sorted. Thanks to everyone who helped out!

Best regards, Patrick