Very bad performance on the STM32N657

Franzi.Edo · 2025-08-18

Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.

I ran these benchmarks on
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON

For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz

The compiler used is gcc-15.2.0

Bench for the Nucleo_H753
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
          the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =     17 [us]
          X projection                                 t =     41 [us]
          Y projection                                 t =     18 [us]
          Histogram                                    t =     30 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
          in the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =    171 [us]
          X projection                                 t =    672 [us]
          Y projection                                 t =    288 [us]
          Histogram                                    t =    451 [us]

Bench 02: Fill a small 1D array (1000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =   1000 [-]
          Min / Max                                    t =   1110 [us]

Bench 03: Fill a big 1D array (50000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =    100 [-]
          Min / Max                                    t =    107 [us]

Bench 04: Compute the integer atan2 using the CORDIC
          algorithm
          Number of tests                              n =   1000 [-]
          1000 x atan2(y, x)                           t =   1088 [us]


Bench for the Nucleo_N657
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
          the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =     29 [us]
          X projection                                 t =    173 [us]
          Y projection                                 t =    167 [us]
          Histogram                                    t =    330 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
          in the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =    400 [us]
          X projection                                 t =   2766 [us]
          Y projection                                 t =   2640 [us]
          Histogram                                    t =   5369 [us]

Bench 02: Fill a small 1D array (1000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =   1000 [-]
          Min / Max                                    t =   3127 [us]

Bench 03: Fill a big 1D array (50000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =    100 [-]
          Min / Max                                    t =    323 [us]

Bench 04: Compute the integer atan2 using the CORDIC
          algorithm
          Number of tests                              n =   1000 [-]
          1000 x atan2(y, x)                           t =   2726 [us]

As shown, the N6 performance is not acceptable!
The clocks measured on MCO2 match the PLL settings, so some other, not yet identified, factor must be slowing down code execution.

Any clue on how to get more decent results from the N6?
Kind regards,
Edo

Franzi.Edo · 2025-08-18

Hi TDK,

You are right; let me investigate this inconsistency.

Btw, here is the benched routine:

/*
 * \brief local_minMax
 *
 * - Compute the min / max of an array
 *
 */
static	void	local_minMax(uint32_t *array, uint64_t *time, uint32_t *min, uint32_t *max) {
	uint64_t	tStamp[2];
	uint32_t	i;

	kern_getTickCount(&tStamp[0]);
	*min = 0xFFFFFFFF; *max = 0x00000000;
	for (i = 0; i < KNB_ELEMENTS; i++) {
		if (*(array + i) < *min) { *min = *(array + i); }
		if (*(array + i) > *max) { *max = *(array + i); }
	}
	kern_getTickCount(&tStamp[1]);

	*time = tStamp[1] - tStamp[0];
}

Franzi.Edo · 2025-08-19

Dear all,

To investigate my speed problem, I created a very simple test.
I run an infinite loop (with interrupts off) where I execute a simple NOP loop 1,000,000 times.
At the end of each loop execution, I toggle a GPIO pin.

Here is the C code:

#include	"uKOS.h"

#define	KNB_TESTS			1000000

// CLI tool specific
// =================

static	void	 local_loop(uint32_t nb);

/*
 * \brief bench_05
 *
 * - loop
 *
 */
bool	bench_05(void) {

	dprintf(KSYST, "Bench 05: For scope tests!\n");

	kern_suspendProcess(1000);

	INTERRUPTION_OFF_HARD
	while (true) {

		ANALYSER_TOGGLE;
		local_loop(KNB_TESTS);
	}

	return (true);
}

// Local routines
// ==============

/*
 * \brief local_loop
 *
 * - Execute the nop
 *
 */
static	void local_loop(uint32_t nb) {
	volatile	uint32_t	i;

	for (i = 0; i < nb; i++) {
		NOP;
	}
}

Here is the H7 assembly

08016278 <bench_05>:
 8016278:	b507      	push	{r0, r1, r2, lr}
 801627a:	490e      	ldr	r1, [pc, #56]	@ (80162b4 <bench_05+0x3c>)
 801627c:	480e      	ldr	r0, [pc, #56]	@ (80162b8 <bench_05+0x40>)
 801627e:	f01f ff21 	bl	80360c4 <dprintf>
 8016282:	f44f 707a 	mov.w	r0, #1000	@ 0x3e8
 8016286:	f7ed fa4b 	bl	8003720 <kern_suspendProcess>
 801628a:	b672      	cpsid	i
 801628c:	f3bf 8f6f 	isb	sy
 8016290:	4b0a      	ldr	r3, [pc, #40]	@ (80162bc <bench_05+0x44>)
 8016292:	2100      	movs	r1, #0
 8016294:	480a      	ldr	r0, [pc, #40]	@ (80162c0 <bench_05+0x48>)
 8016296:	695a      	ldr	r2, [r3, #20]
 8016298:	f082 0201 	eor.w	r2, r2, #1
 801629c:	615a      	str	r2, [r3, #20]
 801629e:	695a      	ldr	r2, [r3, #20]
 80162a0:	9101      	str	r1, [sp, #4]
 80162a2:	9a01      	ldr	r2, [sp, #4]
 80162a4:	4282      	cmp	r2, r0
 80162a6:	d8f6      	bhi.n	8016296 <bench_05+0x1e>
 80162a8:	bf00      	nop
 80162aa:	9a01      	ldr	r2, [sp, #4]
 80162ac:	3201      	adds	r2, #1
 80162ae:	9201      	str	r2, [sp, #4]
 80162b0:	e7f7      	b.n	80162a2 <bench_05+0x2a>
 80162b2:	bf00      	nop
 80162b4:	08043ee1 			@ <UNDEFINED> instruction: 08043ee1
 80162b8:	73797374 			@ <UNDEFINED> instruction: 73797374
 80162bc:	58020400 			@ <UNDEFINED> instruction: 58020400
 80162c0:	000f423f 			@ <UNDEFINED> instruction: 000f423f
 80162c4:	00000000 			@ <UNDEFINED> instruction: 00000000

The inner loop, executed 1,000,000 times, is the following (the instructions from 8016296 to 80162a0 toggle the GPIO and reset the counter once per pass):
 80162a2:	9a01      	ldr	r2, [sp, #4]
 80162a4:	4282      	cmp	r2, r0
 80162a6:	d8f6      	bhi.n	8016296 <bench_05+0x1e>
 80162a8:	bf00      	nop
 80162aa:	9a01      	ldr	r2, [sp, #4]
 80162ac:	3201      	adds	r2, #1
 80162ae:	9201      	str	r2, [sp, #4]
 80162b0:	e7f7      	b.n	80162a2 <bench_05+0x2a>

Here is the N6 assembly

34014bd4 <bench_05>:
34014bd4:	b507      	push	{r0, r1, r2, lr}
34014bd6:	490e      	ldr	r1, [pc, #56]	@ (34014c10 <bench_05+0x3c>)
34014bd8:	480e      	ldr	r0, [pc, #56]	@ (34014c14 <bench_05+0x40>)
34014bda:	f017 fd91 	bl	3402c700 <dprintf>
34014bde:	f44f 707a 	mov.w	r0, #1000	@ 0x3e8
34014be2:	f7ef f80d 	bl	34003c00 <kern_suspendProcess>
34014be6:	b672      	cpsid	i
34014be8:	f3bf 8f6f 	isb	sy
34014bec:	2100      	movs	r1, #0
34014bee:	4b0a      	ldr	r3, [pc, #40]	@ (34014c18 <bench_05+0x44>)
34014bf0:	480a      	ldr	r0, [pc, #40]	@ (34014c1c <bench_05+0x48>)
34014bf2:	695a      	ldr	r2, [r3, #20]
34014bf4:	f082 0202 	eor.w	r2, r2, #2
34014bf8:	615a      	str	r2, [r3, #20]
34014bfa:	695a      	ldr	r2, [r3, #20]
34014bfc:	9101      	str	r1, [sp, #4]
34014bfe:	9a01      	ldr	r2, [sp, #4]
34014c00:	4282      	cmp	r2, r0
34014c02:	d8f6      	bhi.n	34014bf2 <bench_05+0x1e>
34014c04:	bf00      	nop
34014c06:	9a01      	ldr	r2, [sp, #4]
34014c08:	3201      	adds	r2, #1
34014c0a:	9201      	str	r2, [sp, #4]
34014c0c:	e7f7      	b.n	34014bfe <bench_05+0x2a>
34014c0e:	bf00      	nop
34014c10:	9fdc      	ldr	r7, [sp, #880]	@ 0x370
34014c12:	3403      	adds	r4, #3
34014c14:	7374      	strb	r4, [r6, #13]
34014c16:	7379      	strb	r1, [r7, #13]
34014c18:	1800      	adds	r0, r0, r0
34014c1a:	5602      	ldrsb	r2, [r0, r0]
34014c1c:	423f      	tst	r7, r7
34014c1e:	000f      	movs	r7, r1

The inner loop, executed 1,000,000 times, is the following (the instructions from 34014bf2 to 34014bfc toggle the GPIO and reset the counter once per pass):
34014bfe:	9a01      	ldr	r2, [sp, #4]
34014c00:	4282      	cmp	r2, r0
34014c02:	d8f6      	bhi.n	34014bf2 <bench_05+0x1e>
34014c04:	bf00      	nop
34014c06:	9a01      	ldr	r2, [sp, #4]
34014c08:	3201      	adds	r2, #1
34014c0a:	9201      	str	r2, [sp, #4]
34014c0c:	e7f7      	b.n	34014bfe <bench_05+0x2a>

As you can see, the two inner loops are identical.
However, the logic analyzer on the GPIO shows:
H7: 10.4 ms
N6: 21.59 ms

Execution time ratio = 2.75

The measured frequency on MCO2 is:
H7 = 480 MHz
N6 = 588 MHz
Clock ratio = 1.22

So, the H7 @ 480 MHz is effectively 3.35× faster than the N6 @ 588 MHz.
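The ratios can be recomputed from the raw figures as a cross-check. The sketch below is only an illustration: it assumes that each logic-analyzer value (10.4 ms, 21.59 ms) is the duration of one pass of 1,000,000 iterations and that the MCO2 readings (480 MHz, 588 MHz) are the actual core clocks. The function names are made up for this example. With these inputs the time ratio comes out near 2.08 and the per-cycle ratio near 2.54, which is lower than the 2.75 and 3.35 quoted above; the inputs behind those two figures are not shown in the post.

```c
#include <stdio.h>

/* Ratio of two execution times (slower / faster). */
static double time_ratio(double t_slow_ms, double t_fast_ms) {
    return t_slow_ms / t_fast_ms;
}

/* CPU cycles per loop iteration.
 * t [ms] * f [MHz] gives thousands of cycles, hence the factor 1000. */
static double cycles_per_iteration(double t_ms, double f_mhz, double iterations) {
    return (t_ms * f_mhz * 1000.0) / iterations;
}

/* Print the cross-check for the figures quoted in the post (assumed inputs). */
static void print_cross_check(void) {
    const double n = 1000000.0;                              /* KNB_TESTS */
    const double h7 = cycles_per_iteration(10.4, 480.0, n);  /* H7 figures */
    const double n6 = cycles_per_iteration(21.59, 588.0, n); /* N6 figures */

    printf("time ratio        : %.3f\n", time_ratio(21.59, 10.4));
    printf("H7 cycles/iter    : %.3f\n", h7);
    printf("N6 cycles/iter    : %.3f\n", n6);
    printf("cycle-count ratio : %.3f\n", n6 / h7);
}
```

Expressing the result as cycles per iteration (about 5 on the H7 and about 12.7 on the N6 under these assumptions) makes it easier to compare against the instruction sequence in the disassembly.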
Clearly something is wrong somewhere in the chain, but at the moment the only concrete measurement I have is the MCO2 frequency output values and the GPIO measurement.
Question: Where am I losing this factor of 3?
Best regards,

Pavel A. · 2025-08-19

Does the N6 test run in the RAM or external flash?

Franzi.Edo · 2025-08-19

Hi Pavel, the N6 code runs from the internal AXI SRAM1.

BR, Edo

RomainR. · 2025-08-19

Hello @Franzi.Edo

On the N6 side, can you read and share the contents of the MSCR register?
Refer to PM0273 - Rev 3 section 6.8.2 Memory System Control Register, MSCR
Please verify that bits 12 (DCACTIVE) and 13 (ICACTIVE), which control the L1 data and instruction cache memory interfaces, are set.
This can be done using the macros in core_cm55.h, by adding the line below at the beginning of your code:

MEMSYSCTL->MSCR |= MEMSYSCTL_MSCR_DCACTIVE_Msk|MEMSYSCTL_MSCR_ICACTIVE_Msk;

Regarding measuring execution times, I suggest using DWT_CYCCNT (also available on H7) instead of a hardware timer. Count CPU cycles instead of microseconds, and compare with your assembler code. Once your CPU cycles reach what you expect, convert to time from the actual H7 and N6 CPU frequencies.
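As a rough illustration of the cycle-counting approach (not the content of the attached main.c), the sketch below times one call of a function from a free-running 32-bit cycle counter and converts the result to microseconds. The helper names are hypothetical. On target, the counter pointer would be `&DWT->CYCCNT` from CMSIS; the enable sequence shown in the comment is the usual CMSIS one and should be checked against the core header in use.

```c
#include <stdint.h>

/* Typical CMSIS enable sequence (assumption, verify against your core header):
 *   Cortex-M55: DCB->DEMCR       |= DCB_DEMCR_TRCENA_Msk;
 *   Cortex-M7 : CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
 *   then      : DWT->CYCCNT = 0; DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
 */

/* Elapsed cycles between two samples of a 32-bit counter.
 * Unsigned subtraction stays correct across a single counter wrap. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end) {
    return end - start;
}

/* Convert a cycle count to microseconds for a core clock given in Hz. */
static double cycles_to_us(uint32_t cycles, double f_cpu_hz) {
    return (double)cycles * 1.0e6 / f_cpu_hz;
}

/* Time one call of fn() using the counter at *cyccnt. */
static uint32_t measure_cycles(const volatile uint32_t *cyccnt, void (*fn)(void)) {
    uint32_t start = *cyccnt;
    fn();
    return elapsed_cycles(start, *cyccnt);
}

/* Hypothetical workload, similar in spirit to the NOP loop in this thread. */
static void demo_workload(void) {
    for (volatile uint32_t i = 0; i < 1000u; i++) {
        /* empty body: the volatile counter keeps the loop from being removed */
    }
}
```

A call such as `measure_cycles(&DWT->CYCCNT, demo_workload)` would then return a cycle count that can be compared directly between the H7 and the N6, independent of any timer prescaler.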

The attached main.c shows how to configure and use DWT_CYCCNT on the N6.

Let me know if it helps.
Best regards,

Romain,

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

Franzi.Edo · 2025-08-19

Hi RomainR,
Thank you for your suggestion.
I just printed the content of MEMSYSCTL->MSCR, and its value is 0x300A. Unfortunately, the bits you suggested to set are already enabled.
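For reference, the check on the printed value can be written out explicitly. This is a minimal sketch with hypothetical helper names: the MSCR bit positions (12 and 13) are the ones given in the reply above, while the second helper assumes the standard Arm `SCB->CCR` layout (DC at bit 16, IC at bit 17), which is where the cache enable bits themselves live and which may be worth printing as well.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSCR_DCACTIVE_BIT 12u /* L1 data cache interface active (per the reply above) */
#define MSCR_ICACTIVE_BIT 13u /* L1 instruction cache interface active */
#define CCR_DC_BIT        16u /* SCB->CCR data cache enable (assumed standard layout) */
#define CCR_IC_BIT        17u /* SCB->CCR instruction cache enable (assumed standard layout) */

/* True when both DCACTIVE and ICACTIVE are set in an MSCR value. */
static bool mscr_caches_active(uint32_t mscr) {
    const uint32_t mask = (1u << MSCR_DCACTIVE_BIT) | (1u << MSCR_ICACTIVE_BIT);
    return (mscr & mask) == mask;
}

/* True when both cache enable bits are set in a CCR value. */
static bool ccr_caches_enabled(uint32_t ccr) {
    const uint32_t mask = (1u << CCR_DC_BIT) | (1u << CCR_IC_BIT);
    return (ccr & mask) == mask;
}
```

With the value reported here, `mscr_caches_active(0x300A)` is true, which matches the observation that both bits are already set.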

Do you have any other suggestions I could try? It feels like there is some kind of divider between the clock observed on MCO2 and the actual CPU clock.

Regarding the DWT_CYCCNT, you are absolutely right. The issue is that these benchmarks are exactly the same across all the architectures supported by my OS (Cortex, RISC-V). So the simplest solution was to rely on a timer value provided by the OS, in order to avoid multiple code implementations.
Best regards,
Edo

TDK · 2025-08-19

Only 588 MHz measured? Why not 600? Could the oscillator be unstable?

If you feel a post has answered your question, please click "Accept as Solution".

Franzi.Edo · 2025-08-19

Hi TDK, this probably comes from the chain of dividers (IC15 is PLL1 / 2, MCO2 is IC15 / 10) and, finally, from the precision of my digital analyser.
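The back-calculation through the divider chain can be sketched as follows. The divider values (/2 and /10) are the ones mentioned above; the pin frequency of 29.4 MHz used in the usage note is a hypothetical reading chosen only so that the result matches the 588 MHz figure, and the function names are invented for this example.

```c
/* Frequency at the source of a two-stage divider chain, given the
 * frequency observed on the MCO pin and the two division factors. */
static double source_hz_from_mco(double f_pin_hz, unsigned ic_div, unsigned mco_div) {
    return f_pin_hz * (double)ic_div * (double)mco_div;
}

/* Relative deviation of a measured frequency from its nominal value. */
static double relative_error(double measured_hz, double nominal_hz) {
    return (measured_hz - nominal_hz) / nominal_hz;
}
```

For instance, `source_hz_from_mco(29.4e6, 2, 10)` gives 588 MHz, and `relative_error(588e6, 600e6)` is about -2 %, so a small reading error on the pin is multiplied by 20 when scaled back to the source clock.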

RomainR. · 2025-08-19

Hi @Franzi.Edo

My suggestions are:
1) Check your entire clock tree configuration and the RCC registers. I also suggest using STM32CubeMX to help you.
At the sysa_ck and sysb_ck levels (see figure below from RM0486), make sure that they receive the correct clocks, ic1_ck = 600 MHz and ic2_ck = 400 MHz, from PLL1, and that the prescalers ic1 to ic11 are correct.
Regarding the timer you are using, also check the TIMPRE values and confirm that your timer is clocked at 1 MHz.

2) I assume that your application is running in SRAM2 or SRAM1. Try configuring an MPU region that matches execution from AXISRAM2 or AXISRAM1. Refer to the discussion and knowledge article below:

https://community.st.com/t5/stm32-mcus-products/stm32n6-mpu-config-code-execution-why-not-at-0x70000000/td-p/758122

https://community.st.com/t5/stm32-mcus/how-to-configure-the-mpu-of-an-stm32-using-stm32cubemx/ta-p/49825

3) Try moving the execution to TCM memory. Make sure to initialize the TCM before using it (because of the ECC) and to update the linker file accordingly.

https://community.st.com/t5/stm32-mcus-products/stm32n6-hard-fault-when-accessing-tcm/td-p/757557
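The "timer clocked at 1 MHz" check in point 1 comes down to simple prescaler arithmetic. The sketch below assumes the usual STM32 timer behaviour, where the counter clock is the timer kernel clock divided by (PSC + 1) and PSC is a 16-bit register. The helper name and the 400 MHz kernel clock in the usage note are assumptions for illustration, not values taken from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute the PSC value such that f_tim_hz / (PSC + 1) == f_tick_hz.
 * Returns false when the division is not exact or PSC does not fit in 16 bits. */
static bool timer_prescaler(uint32_t f_tim_hz, uint32_t f_tick_hz, uint16_t *psc) {
    if (f_tick_hz == 0u || f_tim_hz < f_tick_hz || (f_tim_hz % f_tick_hz) != 0u) {
        return false;
    }
    uint32_t div = f_tim_hz / f_tick_hz;
    if (div > 65536u) {
        return false; /* PSC is a 16-bit register */
    }
    *psc = (uint16_t)(div - 1u);
    return true;
}
```

With a hypothetical 400 MHz timer kernel clock, a 1 MHz tick needs PSC = 399. If the programmed value differs from what this arithmetic gives for the real kernel clock (including the effect of TIMPRE), every microsecond figure in the benchmark is scaled by the same factor.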

Best regards,

Romain,


LCE · 2025-08-19

For any accurate timing measurements, I would:

- use the ARM cycle counter

- turn off all interrupts ( __disable_irq() if possible, otherwise disable all interrupts not related to the tested functions)