STM32MP157C: How to increase the M4 image size

rhaberkorn · 2025-01-12

Hello,

I am running a Linux image based on meta-st-stm32mp (hardknott branch).

I currently have the following in my host device tree:

reserved-memory {
                #address-cells = <1>;
                #size-cells = <1>;
                ranges;

                mcuram2: mcuram2@10000000 {
                        compatible = "shared-dma-pool";
                        reg = <0x10000000 0x40000>;
                        no-map;
                };

                vdev0vring0: vdev0vring0@10040000 {
                        compatible = "shared-dma-pool";
                        reg = <0x10040000 0x1000>;
                        no-map;
                };

                vdev0vring1: vdev0vring1@10041000 {
                        compatible = "shared-dma-pool";
                        reg = <0x10041000 0x1000>;
                        no-map;
                };

                vdev0buffer: vdev0buffer@10042000 {
                        compatible = "shared-dma-pool";
                        reg = <0x10042000 0x4000>;
                        no-map;
                };

                mcuram: mcuram@30000000 {
                        compatible = "shared-dma-pool";
                        reg = <0x30000000 0x40000>;
                        no-map;
                };

                retram: retram@38000000 {
                        compatible = "shared-dma-pool";
                        reg = <0x38000000 0x10000>;
                        no-map;
                };

                gpu_reserved: gpu@d4000000 {
                        reg = <0xd4000000 0x4000000>;
                        no-map;
                };
        };

        mlahb: ahb {
                compatible = "st,mlahb", "simple-bus";
                #address-cells = <1>;
                #size-cells = <1>;
                ranges;
                dma-ranges = <0x00000000 0x38000000 0x10000>,
                             <0x10000000 0x10000000 0x60000>,
                             <0x30000000 0x30000000 0x60000>;

                m4_rproc: m4@10000000 {
                        compatible = "st,stm32mp1-m4";
                        reg = <0x10000000 0x40000>,
                              <0x30000000 0x40000>,
                              <0x38000000 0x10000>;
                        /* ... */
                };
        };

The mlahb node comes from stm32mp151.dtsi.
Say we would like to increase the allowed image size to 128 KB by changing the RETRAM size from 0x10000 to 0x20000:

&retram {
	reg = <0x38000000 0x20000>;
};

&mlahb {
	dma-ranges = <0x00000000 0x38000000 0x20000>,
	             <0x10000000 0x10000000 0x60000>,
	             <0x30000000 0x30000000 0x60000>;

	m4_rproc {
		reg = <0x10000000 0x40000>,
		      <0x30000000 0x40000>,
		      <0x38000000 0x20000>;
	};
};
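
For orientation, each dma-ranges entry is a (device address, CPU address, size) triple: the M4 sees RETRAM at 0x00000000 while Linux sees it at 0x38000000. The following is a minimal C sketch of that translation, with the table copied from the overlay above. The struct and function names are made up for illustration and are not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One dma-ranges entry: <device-address cpu-address size>. */
struct dma_range {
    uint32_t dev_addr;  /* address as seen by the Cortex-M4 */
    uint32_t cpu_addr;  /* physical address as seen by Linux */
    uint32_t size;
};

/* Values copied from the modified &mlahb node above. */
static const struct dma_range mlahb_ranges[] = {
    { 0x00000000u, 0x38000000u, 0x20000u },
    { 0x10000000u, 0x10000000u, 0x60000u },
    { 0x30000000u, 0x30000000u, 0x60000u },
};

/*
 * Translate the M4 address range [da, da + len) to a CPU physical address.
 * Returns false if the range is not fully covered by a single entry.
 * (Hypothetical helper, for illustration only.)
 */
static bool m4_to_cpu_addr(uint32_t da, uint32_t len, uint32_t *pa)
{
    assert(pa != NULL);
    for (size_t i = 0; i < sizeof(mlahb_ranges) / sizeof(mlahb_ranges[0]); i++) {
        const struct dma_range *r = &mlahb_ranges[i];
        if (da >= r->dev_addr && len <= r->size &&
            da - r->dev_addr <= r->size - len) {
            *pa = r->cpu_addr + (da - r->dev_addr);
            return true;
        }
    }
    return false;
}
```

A successful translation only means that the device tree describes the range. It does not by itself guarantee that physical memory exists behind every address in it.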

Unfortunately, when I try to start the firmware (echo start > /sys/class/remoteproc/remoteproc0/state),
I still get a kernel crash:

[  376.591779] 8<--- cut here ---
[  376.593848] Unhandled fault: imprecise external abort (0x1c06) at 0x004627a4
[  376.600885] pgd = fc518049
[  376.603571] [004627a4] *pgd=c5ef0835, *pte=00000000, *ppte=00000000
[  376.609836] Internal error: : 1c06 [#1] PREEMPT SMP ARM
[  376.615042] Modules linked in: rpmsg_tty rpmsg_core etnaviv gpu_sched spi_stm32 stm32_rproc sch_fq_codel ipv6
[  376.624955] CPU: 1 PID: 298 Comm: sh Not tainted 5.10.10 #1
[  376.630507] Hardware name: STM32 (Device Tree Support)
[  376.635648] PC is at memcpy+0x54/0x330
[  376.639374] LR is at 0x6263000a
[  376.642499] pc : [<c060d914>]    lr : [<6263000a>]    psr: 20000013
[  376.648753] sp : c5cfbe2c  ip : 2172656c  fp : e0d20000
[  376.653965] r10: 00000000  r9 : 00000000  r8 : 646e6168
[  376.659180] r7 : 206b6361  r6 : 626c6c61  r5 : 63206f4e  r4 : 09007265
[  376.665696] r3 : 6c646e61  r2 : 000011b4  r1 : e0bc9220  r0 : e0d30120
[  376.672213] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  376.679336] Control: 10c5387d  Table: c5ea006a  DAC: 00000051
[  376.685072] Process sh (pid: 298, stack limit = 0x86cbec92)
[  376.690629] Stack: (0xc5cfbe2c to 0xc5cfc000)
[  376.694979] be20:                            00011354 00011354 00000001 00000000 e0d20000
[  376.703147] be40: e0bb9054 c0a0a584 00011354 10041000 00000000 00000000 00000000 c01175f4
[  376.711314] be60: 00000100 00000000 00000006 00000001 00000020 c34cd800 c5ea7cc0 c121f270
[  376.719482] be80: e0bb9000 c34cd820 c0a06c88 c34cd800 c34cd800 c34cd820 c5ea7cc0 c35c2840
[  376.727649] bea0: 00000000 00000000 00000000 c0be8e54 00000000 c34cd800 c5ea7cc0 c34cd820
[  376.735817] bec0: c35c2840 c0be9380 c34cd800 c34cd9e8 00000000 c34cd9f4 c34cd820 c0a07f78
[  376.743984] bee0: c5ea7cc0 3a12e441 00000cc0 c34cd820 c5ea7a80 00000006 c34cd800 c5ea6610
[  376.752151] bf00: 00000051 c0a097bc 00000006 c5ea6600 c5ea7a80 c5cfbf80 c5ea6610 c03a3c48
[  376.760317] bf20: 00000000 00000000 0000000b c3337c00 00000006 c03a3b50 00534fb0 c5cfbf80
[  376.768484] bf40: c2de3c00 00000004 0052f2f0 c0302348 c316d900 3a12e441 c316d900 00000000
[  376.776650] bf60: c1894900 c3337c00 c3337c00 00000000 00000000 c0100264 c5cfa000 c03026e8
[  376.784816] bf80: 00000000 00000000 c5cfa000 3a12e441 00000003 00000006 00534fb0 b6f6c1e0
[  376.792984] bfa0: 00000004 c0100060 00000006 00534fb0 00000001 00534fb0 00000006 00000000
[  376.801152] bfc0: 00000006 00534fb0 b6f6c1e0 00000004 00000000 bee1b7f0 00534fb0 0052f2f0
[  376.809318] bfe0: 00000004 bee1b7a0 b6e7135f b6dfd386 60000030 00000001 00000000 00000000
[  376.817517] [<c060d914>] (memcpy) from [<c0a0a584>] (rproc_elf_load_segments+0x1a0/0x288)
[  376.825669] [<c0a0a584>] (rproc_elf_load_segments) from [<c0be8e54>] (rproc_start+0x24/0x154)
[  376.834171] [<c0be8e54>] (rproc_start) from [<c0be9380>] (rproc_fw_boot+0x170/0x1a4)
[  376.841898] [<c0be9380>] (rproc_fw_boot) from [<c0a07f78>] (rproc_boot+0x150/0x1a4)
[  376.849541] [<c0a07f78>] (rproc_boot) from [<c0a097bc>] (state_store+0x40/0xc8)
[  376.856841] [<c0a097bc>] (state_store) from [<c03a3c48>] (kernfs_fop_write+0xf8/0x21c)
[  376.864750] [<c03a3c48>] (kernfs_fop_write) from [<c0302348>] (vfs_write+0xc0/0x318)
[  376.872481] [<c0302348>] (vfs_write) from [<c03026e8>] (ksys_write+0x60/0xe4)
[  376.879603] [<c03026e8>] (ksys_write) from [<c0100060>] (ret_fast_syscall+0x0/0x54)
[  376.887237] Exception stack(0xc5cfbfa8 to 0xc5cfbff0)
[  376.892282] bfa0:                   00000006 00534fb0 00000001 00534fb0 00000006 00000000
[  376.900451] bfc0: 00000006 00534fb0 b6f6c1e0 00000004 00000000 bee1b7f0 00534fb0 0052f2f0
[  376.908613] bfe0: 00000004 bee1b7a0 b6e7135f b6dfd386
[  376.913659] Code: f5d1f07c e8b151f8 e2522020 e8a051f8 (aafffffa)
[  376.919737] ---[ end trace e8ef0d82ecc3eec4 ]---
[  376.925593] 8<--- cut here ---
[  376.927384] Unhandled fault: imprecise external abort (0x1c06) at 0x004627a4
[  376.934422] pgd = 6e154e4d
[  376.937107] [004627a4] *pgd=c5c11835, *pte=00000000, *ppte=00000000

This only happens if the firmware image has a code+data section larger than 64 KB.
Does anybody have any clue what I am missing in the device tree?
I should also try to get a proper backtrace of that crash...

What is the hard limit for the M4 firmware size?
In this comment, @PatrickF said that it is at most 448 KB:
https://community.st.com/t5/stm32-mpus-products/possible-advisable-to-increase-the-size-of-m4-coprocessor-text/m-p/51879/highlight/true#M52

What is colliding with the RETRAM section? Or is it that the M4 core does not execute directly from
RETRAM, but copies code into one of the mcuram regions?
See https://wiki.st.com/stm32mpu/wiki/STM32MP15_MCU_SRAM_internal_memory
Is it the mcuram region that I have to increase?
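
As a rough cross-check of the 448 KB figure, here is a small C sketch that sums the memory banks and tests whether a region is backed by them. The bank addresses and sizes are assumptions (SRAM1 and SRAM2 at 128 KB each, SRAM3, SRAM4 and RETRAM at 64 KB each) that should be verified against the wiki page above and RM0436; all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_bank {
    const char *name;
    uint32_t base;   /* address as seen from Linux */
    uint32_t size;
};

/* ASSUMED bank layout; verify against RM0436 before relying on it. */
static const struct mem_bank m4_banks[] = {
    { "SRAM1",  0x10000000u, 128u * 1024u },
    { "SRAM2",  0x10020000u, 128u * 1024u },
    { "SRAM3",  0x10040000u,  64u * 1024u },
    { "SRAM4",  0x10050000u,  64u * 1024u },
    { "RETRAM", 0x38000000u,  64u * 1024u },
};
#define M4_BANK_COUNT (sizeof(m4_banks) / sizeof(m4_banks[0]))

/* Total number of bytes in all banks listed above. */
static uint32_t m4_total_bytes(void)
{
    uint32_t total = 0;
    for (size_t i = 0; i < M4_BANK_COUNT; i++)
        total += m4_banks[i].size;
    return total;
}

/* True if every byte of [base, base + size) falls inside one of the banks. */
static bool region_is_backed(uint32_t base, uint32_t size)
{
    uint32_t addr = base, remaining = size;
    while (remaining > 0) {
        const struct mem_bank *hit = NULL;
        for (size_t i = 0; i < M4_BANK_COUNT; i++) {
            if (addr >= m4_banks[i].base &&
                addr - m4_banks[i].base < m4_banks[i].size) {
                hit = &m4_banks[i];
                break;
            }
        }
        if (hit == NULL)
            return false;
        /* Consume as much of the region as this bank can hold. */
        uint32_t avail = hit->size - (addr - hit->base);
        assert(avail > 0); /* guarantees the loop makes progress */
        uint32_t step = remaining < avail ? remaining : avail;
        addr += step;
        remaining -= step;
    }
    return true;
}
```

Under these assumptions, m4_total_bytes() comes to 448 KB, and a 0x20000-byte region at 0x38000000 extends past the end of RETRAM. If that holds, enlarging retram in the device tree would not create more memory there, and a larger image would have to spill into the SRAM banks instead.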

Yours sincerely,
Robin Haberkorn

ArnaudP · 2025-01-27

Do you also update the "vdev0XXXX" memory regions accordingly in the Linux Device tree?
https://elixir.bootlin.com/linux/v6.13-rc3/source/arch/arm/boot/dts/st/stm32mp15xx-dkx.dtsi#L22

rhaberkorn · 2025-01-29

@ArnaudP wrote:
Do you also update the "vdev0XXXX" memory regions accordingly in the Linux Device tree?
https://elixir.bootlin.com/linux/v6.13-rc3/source/arch/arm/boot/dts/st/stm32mp15xx-dkx.dtsi#L22

Yes, of course. For instance, I tried these adaptations on top of the default memory layout (which I summarized at the very beginning of this thread):

	reserved-memory {
		mcuram3: mcuram3@10020000 {
			compatible = "shared-dma-pool";
			reg = <0x10020000 0x3A000>;
			no-map;
		};

		/delete-node/vdev0vring0;
		vdev0vring0@1005A000 {
			compatible = "shared-dma-pool";
			reg = <0x1005A000 0x1000>;
			no-map;
		};

		/delete-node/vdev0vring1;
		vdev0vring1@1005B000 {
			compatible = "shared-dma-pool";
			reg = <0x1005B000 0x1000>;
			no-map;
		};

		/delete-node/vdev0buffer;
		vdev0buffer@1005C000 {
			compatible = "shared-dma-pool";
			reg = <0x1005C000 0x4000>;
			no-map;
		};
	};
};

&mcuram2 {
	reg = <0x10000000 0x20000>;
};

&m4_rproc {
	memory-region = <&retram>, <&mcuram>, <&mcuram2>, <&mcuram3>,
	                <&vdev0vring0>, <&vdev0vring1>, <&vdev0buffer>;

	/* Why are the vdev0vringX regions missing in this node? */
	reg = <0x10000000 0x20000>, /* mcuram2 */
	      <0x10020000 0x3A000>, /* mcuram3 */
	      <0x30000000 0x40000>, /* mcuram (???) */
	      <0x38000000 0x10000>; /* retram */

	m4_system_resources {
		status = "okay";
	};
};

I had analogous declarations in the Zephyr device tree:

        zephyr,ipc_shm = &mcuram4;
        /* ... */

    mcuram4: memory4@1005A000 {
        compatible = "mmio-sram";
        reg = <0x1005A000 0x6000>;
    };
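
The layout above can be sanity-checked mechanically. The sketch below copies the region addresses from the Linux overlay and verifies that they are contiguous and end exactly at the top of the 0x60000-byte window, which is also where the 0x6000-byte Zephyr mcuram4 region ends. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
    const char *name;
    uint32_t base;
    uint32_t size;
};

/* Regions as declared in the Linux device tree overlay above. */
static const struct region layout[] = {
    { "mcuram2",     0x10000000u, 0x20000u },
    { "mcuram3",     0x10020000u, 0x3A000u },
    { "vdev0vring0", 0x1005A000u, 0x01000u },
    { "vdev0vring1", 0x1005B000u, 0x01000u },
    { "vdev0buffer", 0x1005C000u, 0x04000u },
};
#define LAYOUT_COUNT (sizeof(layout) / sizeof(layout[0]))

/*
 * Return the index of the first region that does not start exactly where
 * its predecessor ends (a gap or an overlap), or -1 if all are contiguous.
 */
static int first_gap_or_overlap(const struct region *r, size_t n)
{
    assert(r != NULL);
    for (size_t i = 1; i < n; i++) {
        if (r[i].base != r[i - 1].base + r[i - 1].size)
            return (int)i;
    }
    return -1;
}

/* First address past the last region. */
static uint32_t layout_end(const struct region *r, size_t n)
{
    assert(r != NULL && n > 0);
    return r[n - 1].base + r[n - 1].size;
}
```

With these numbers, first_gap_or_overlap(layout, LAYOUT_COUNT) finds no gap, and the three vdev0 regions together span the same 0x6000 bytes as mcuram4 on the Zephyr side. That suggests the address arithmetic itself is consistent and the rpmsg_create_ept() problem lies elsewhere.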

Unfortunately, I could not get rpmsg_create_ept() to succeed with these settings. I didn't do any further debugging, as this reshuffling of memory sections wasn't strictly necessary for what I was trying to achieve. Instead, I just declared a third code section at 0x10046000 without moving the IPC buffers from their default location.

rhaberkorn · 2025-02-21

@rhaberkorn wrote:
I did get it to work with Zephyr's code relocation feature.

Except that it didn't actually work: global variables aren't correctly initialized. I have XIP enabled and use NOCOPY, since there should be no reason to copy anything around after the memory has been initialized by the Linux kernel. Say I relocate a file like this:

zephyr_code_relocate(FILES src/main.c LOCATION MCUSRAM1 NOCOPY NOKEEP)

I would expect both code (text) and (ro)data to be mapped into the MCUSRAM1 section. The linker should output ELF program header entries that map the ELF file contents (offset) to the appropriate addresses in MCUSRAM1 (vaddr; in my case beginning at 0x10000000). For the data section I get something like the following ELF entry (cf. elfdump):

p_type: PT_LOAD
p_offset: 58204
p_vaddr: 0x10010000
p_paddr: 0xe23c
p_filesz: 648
p_memsz: 648
p_flags: PF_W|PF_R
p_align: 4

I can confirm that the initializer of a certain global variable is in the file after the given p_offset. From that follows a certain vaddr for that symbol, and I can confirm that it is the same as the address assigned by the linker (i.e. the address C thinks the symbol is at). If the Linux kernel's drivers/remoteproc/remoteproc_elf_loader.c loaded the file contents at the correct virtual address (beginning at 0x10010000), everything should be fine. But that is obviously not what happens.

I thought it might be because the loader actually uses paddr (see rproc_elf_load_segments()) and tried to patch that to vaddr, but it didn't help either. I do not yet understand where the ELF header's paddr fields actually come from and why they sometimes differ from the vaddr fields. Perhaps the memory is also overwritten by something else, since it appears to contain garbage data. It doesn't overlap with any ELF entry, though.

I am certain this can be fixed somehow and I might just be overlooking something, but currently this breaks code relocation for me.
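
On the paddr question: with GNU ld, p_vaddr normally reflects the address a segment runs at (VMA) and p_paddr the address it is meant to be loaded at (LMA), so the two differ whenever the linker script loads a section in one memory region and runs it from another. A minimal sketch for spotting such segments in an ELF32 image follows. It assumes a little-endian host and uses the <elf.h> header found on Linux systems; the function name is made up for illustration.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count the PT_LOAD segments of an in-memory ELF32 image whose load
 * address (p_paddr) differs from their run-time address (p_vaddr).
 * Assumes host and image share the same (little-endian) byte order.
 * Returns -1 if the buffer does not look like a usable ELF32 image.
 */
static int count_relocated_segments(const uint8_t *image, size_t len)
{
    Elf32_Ehdr eh;

    assert(image != NULL);
    if (len < sizeof(eh))
        return -1;
    memcpy(&eh, image, sizeof(eh));
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS32)
        return -1;
    if (eh.e_phentsize != sizeof(Elf32_Phdr))
        return -1;
    /* Make sure the whole program header table lies inside the buffer. */
    if (eh.e_phoff > len ||
        (size_t)eh.e_phnum * sizeof(Elf32_Phdr) > len - eh.e_phoff)
        return -1;

    int count = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf32_Phdr ph;
        memcpy(&ph, image + eh.e_phoff + (size_t)i * sizeof(ph), sizeof(ph));
        if (ph.p_type == PT_LOAD && ph.p_paddr != ph.p_vaddr)
            count++;
    }
    return count;
}
```

The data segment shown above (p_vaddr 0x10010000, p_paddr 0xe23c) would be counted by this check. A loader that places segments by p_paddr would put those initializers at 0xe23c rather than at 0x10010000.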

ArnaudP · 2025-02-26

Please could you provide a full picture of your elf file using the readelf -l command?

rhaberkorn · 2025-02-26

@ArnaudP wrote:
Please could you provide a full picture of your elf file using the readelf -l command?

I have since found a workaround for the issue. See here for details.

It turned out that gen_relocate_app.py generated linker scripts which place the global initializers (data section) into "ROM", obviously expecting them to be copied on startup into their "RAM" addresses, even when compiling for execute-in-place (CONFIG_XIP). This makes sense for most MCUs even with XIP, since you usually cannot write into flash memory with random access. I commented that out and got everything to work. I will prepare a Zephyr upstream patch once I completely understand why they are doing it the way they are doing it right now. (The data symbols are directly linked into the target memory areas if XIP is disabled - it's this bit which I do not understand as with non-executable flash, you would usually still have to copy the data initializers into RAM. Or perhaps they are copied into RAM along with text/code on those systems?)
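
The copy-on-startup convention mentioned above, where initializers are stored at a load address and copied to their run-time address before main(), can be modeled roughly as follows. This is a generic sketch with hypothetical names, not Zephyr's actual startup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical descriptor of one section that is loaded at one address
 * but used by the compiled code at another.
 */
struct section_copy {
    const void *load_addr; /* where the loader placed the initializers */
    void *run_addr;        /* where the code expects the variables */
    size_t size;           /* number of bytes to copy */
};

/* Copy every listed section from its load address to its run-time address. */
static void copy_sections(const struct section_copy *tab, size_t n)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        memcpy(tab[i].run_addr, tab[i].load_addr, tab[i].size);
}
```

If the linker script places initializers at a separate load address but this copy step is skipped (as one would expect with NOCOPY), the variables at the run-time address keep whatever the memory happened to contain. That would be consistent with the garbage values described earlier in this thread.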

However, I have another important question related to my solution. As you have seen, I rearranged the RAM areas starting at 0x10000000 rather freely, even across memory bank boundaries. SRAM1-4 are still separate memory banks. According to this article, "Unaligned accesses that cross memory map boundaries are architecturally Unpredictable." The question is whether the SRAM1-4 boundaries are actually "memory map boundaries" (cf. RM0436, Figure 4, p.158). All 4 SRAM banks are apparently separate entities on the MLAHB (see STM32MP157C datasheet, p.37).

So is it safe to access memory across bank boundaries? As far as I understand, the only potential "unaligned" access would be reading a double word exactly on the bank boundaries. But I might be mistaken. I will try to test this in code as well.

rhaberkorn · 2025-02-27

@rhaberkorn wrote:
So is it safe to access memory across bank boundaries? As far as I understand, the only potential "unaligned" access would be reading a double word exactly on the bank boundaries. But I might be mistaken. I will try to test this in code as well.

I did some tests to answer this question experimentally. First, here are two unaligned reads across the SRAM1-2 boundary:

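uint32_t v32_low, v32_high;
/* Place known patterns on both sides of the SRAM1/SRAM2 boundary (0x10020000). */
*(volatile uint32_t *)(0x10000000+128*1024-4) = 0xDEADBEEF;
*(volatile uint32_t *)(0x10000000+128*1024)   = 0x015DE5E1;
/* 32-bit load starting one byte below the boundary (unaligned). */
__asm__ volatile ("ldr %0, [%1]" : "=r" (v32_low) : "r" (0x10000000+128*1024-1) : "memory");
LOG_INF("Unaligned 32-bit read across SRAM1-2: %08X", v32_low);
/* 64-bit load starting four bytes below the boundary (word-aligned, not doubleword-aligned). */
__asm__ volatile ("ldrd %0, %1, [%2]" : "=r" (v32_low), "=r" (v32_high) : "r" (0x10000000+128*1024-4) : "memory");
LOG_INF("Unaligned 64-bit read across SRAM1-2: %08X %08X", v32_low, v32_high);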

The first read returns 5DE5E1DE and the second returns DEADBEEF 015DE5E1, which suggests that this kind of access is safe.
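
To see why 5DE5E1DE is the value a correct little-endian read should produce here, the eight bytes around the boundary can be modeled on any host. This is a portable sketch with illustrative helper names; it only checks the expected result and does not exercise the MLAHB itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value into buf in little-endian byte order. */
static void put_le32(uint8_t *buf, uint32_t v)
{
    for (size_t i = 0; i < 4; i++)
        buf[i] = (uint8_t)(v >> (8 * i));
}

/* Read a 32-bit little-endian value starting at any byte address. */
static uint32_t get_le32(const uint8_t *buf)
{
    uint32_t v = 0;
    for (size_t i = 0; i < 4; i++)
        v |= (uint32_t)buf[i] << (8 * i);
    return v;
}

/*
 * Model the 8 bytes around a bank boundary: mem[0..3] hold the word just
 * below it, mem[4..7] the word just above it. Returns the 32-bit value a
 * little-endian load would see at the given byte offset (0..4).
 */
static uint32_t model_unaligned_read(uint32_t below, uint32_t above, size_t offset)
{
    uint8_t mem[8];

    assert(offset <= 4);
    put_le32(&mem[0], below);
    put_le32(&mem[4], above);
    return get_le32(&mem[offset]);
}
```

Calling model_unaligned_read(0xDEADBEEF, 0x015DE5E1, 3) corresponds to the ldr one byte below the boundary: it combines the top byte of the lower word (DE) with the three low bytes of the upper word (E1, E5, 5D).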

Also, I managed to place a 32-bit opcode directly on the SRAM3-4 boundary (the only boundary where I actually store code), and the code executes successfully.