2025-08-18 10:19 AM
Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.
I ran these benchmarks on
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON
For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz
The compiler used is gcc-15.2.0
Bench for the Nucleo_H753
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 17 [us]
X projection t = 41 [us]
Y projection t = 18 [us]
Histogram t = 30 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 171 [us]
X projection t = 672 [us]
Y projection t = 288 [us]
Histogram t = 451 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 1110 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 107 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 1088 [us]
Bench for the Nucleo_N657
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 29 [us]
X projection t = 173 [us]
Y projection t = 167 [us]
Histogram t = 330 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 400 [us]
X projection t = 2766 [us]
Y projection t = 2640 [us]
Histogram t = 5369 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 3127 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 323 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 2726 [us]
As shown, the N6 performance is not acceptable!
The clock measurements on MCO2 reflect the PLL values, but maybe some other elements
(not clearly identified) are influencing the code execution.
Any clue to get more decent results for the N6?
Kind regards,
Edo
2025-08-18 2:17 PM
Hi TDK,
You are right; let me investigate this inconsistency.
Btw, here is the benched routine:
/*
 * \brief local_minMax
 *
 * - Compute the min / max of an array
 *
 */
static void local_minMax(uint32_t *array, uint64_t *time, uint32_t *min, uint32_t *max) {
    uint64_t tStamp[2];
    uint32_t i;

    kern_getTickCount(&tStamp[0]);
    *min = 0xFFFFFFFF; *max = 0x00000000;
    for (i = 0; i < KNB_ELEMENTS; i++) {
        if (*(array + i) < *min) { *min = *(array + i); }
        if (*(array + i) > *max) { *max = *(array + i); }
    }
    kern_getTickCount(&tStamp[1]);
    *time = tStamp[1] - tStamp[0];
}
2025-08-19 1:01 AM
Dear all,
To investigate my speed problem, I created a very simple test.
I run an infinite loop (with interrupts off) where I execute a simple NOP loop 1,000,000 times.
At the end of each loop execution, I toggle a GPIO pin.
Here is the C code:
#include "uKOS.h"

#define KNB_TESTS 1000000

// CLI tool specific
// =================

static void local_loop(uint32_t nb);

/*
 * \brief bench_05
 *
 * - loop
 *
 */
bool bench_05(void) {

    dprintf(KSYST, "Bench 05: For scope tests!\n");
    kern_suspendProcess(1000);

    INTERRUPTION_OFF_HARD
    while (true) {
        ANALYSER_TOGGLE;
        local_loop(KNB_TESTS);
    }
    return (true);
}

// Local routines
// ==============

/*
 * \brief local_loop
 *
 * - Execute the nop
 *
 */
static void local_loop(uint32_t nb) {
    volatile uint32_t i;

    for (i = 0; i < nb; i++) {
        NOP;
    }
}
Here is the H7 assembly
08016278 <bench_05>:
8016278: b507 push {r0, r1, r2, lr}
801627a: 490e ldr r1, [pc, #56] @ (80162b4 <bench_05+0x3c>)
801627c: 480e ldr r0, [pc, #56] @ (80162b8 <bench_05+0x40>)
801627e: f01f ff21 bl 80360c4 <dprintf>
8016282: f44f 707a mov.w r0, #1000 @ 0x3e8
8016286: f7ed fa4b bl 8003720 <kern_suspendProcess>
801628a: b672 cpsid i
801628c: f3bf 8f6f isb sy
8016290: 4b0a ldr r3, [pc, #40] @ (80162bc <bench_05+0x44>)
8016292: 2100 movs r1, #0
8016294: 480a ldr r0, [pc, #40] @ (80162c0 <bench_05+0x48>)
8016296: 695a ldr r2, [r3, #20]
8016298: f082 0201 eor.w r2, r2, #1
801629c: 615a str r2, [r3, #20]
801629e: 695a ldr r2, [r3, #20]
80162a0: 9101 str r1, [sp, #4]
80162a2: 9a01 ldr r2, [sp, #4]
80162a4: 4282 cmp r2, r0
80162a6: d8f6 bhi.n 8016296 <bench_05+0x1e>
80162a8: bf00 nop
80162aa: 9a01 ldr r2, [sp, #4]
80162ac: 3201 adds r2, #1
80162ae: 9201 str r2, [sp, #4]
80162b0: e7f7 b.n 80162a2 <bench_05+0x2a>
80162b2: bf00 nop
80162b4: 08043ee1 @ <UNDEFINED> instruction: 08043ee1
80162b8: 73797374 @ <UNDEFINED> instruction: 73797374
80162bc: 58020400 @ <UNDEFINED> instruction: 58020400
80162c0: 000f423f @ <UNDEFINED> instruction: 000f423f
80162c4: 00000000 @ <UNDEFINED> instruction: 00000000
The inner loop, executed 1,000,000 times, is (the instructions from 8016296 to 80162a0 only toggle the GPIO and reset the counter, once per pass):
80162a2: 9a01 ldr r2, [sp, #4]
80162a4: 4282 cmp r2, r0
80162a6: d8f6 bhi.n 8016296 <bench_05+0x1e>
80162a8: bf00 nop
80162aa: 9a01 ldr r2, [sp, #4]
80162ac: 3201 adds r2, #1
80162ae: 9201 str r2, [sp, #4]
80162b0: e7f7 b.n 80162a2 <bench_05+0x2a>
Here is the N6 assembly
34014bd4 <bench_05>:
34014bd4: b507 push {r0, r1, r2, lr}
34014bd6: 490e ldr r1, [pc, #56] @ (34014c10 <bench_05+0x3c>)
34014bd8: 480e ldr r0, [pc, #56] @ (34014c14 <bench_05+0x40>)
34014bda: f017 fd91 bl 3402c700 <dprintf>
34014bde: f44f 707a mov.w r0, #1000 @ 0x3e8
34014be2: f7ef f80d bl 34003c00 <kern_suspendProcess>
34014be6: b672 cpsid i
34014be8: f3bf 8f6f isb sy
34014bec: 2100 movs r1, #0
34014bee: 4b0a ldr r3, [pc, #40] @ (34014c18 <bench_05+0x44>)
34014bf0: 480a ldr r0, [pc, #40] @ (34014c1c <bench_05+0x48>)
34014bf2: 695a ldr r2, [r3, #20]
34014bf4: f082 0202 eor.w r2, r2, #2
34014bf8: 615a str r2, [r3, #20]
34014bfa: 695a ldr r2, [r3, #20]
34014bfc: 9101 str r1, [sp, #4]
34014bfe: 9a01 ldr r2, [sp, #4]
34014c00: 4282 cmp r2, r0
34014c02: d8f6 bhi.n 34014bf2 <bench_05+0x1e>
34014c04: bf00 nop
34014c06: 9a01 ldr r2, [sp, #4]
34014c08: 3201 adds r2, #1
34014c0a: 9201 str r2, [sp, #4]
34014c0c: e7f7 b.n 34014bfe <bench_05+0x2a>
34014c0e: bf00 nop
34014c10: 9fdc ldr r7, [sp, #880] @ 0x370
34014c12: 3403 adds r4, #3
34014c14: 7374 strb r4, [r6, #13]
34014c16: 7379 strb r1, [r7, #13]
34014c18: 1800 adds r0, r0, r0
34014c1a: 5602 ldrsb r2, [r0, r0]
34014c1c: 423f tst r7, r7
34014c1e: 000f movs r7, r1
The inner loop, executed 1,000,000 times, is (the instructions from 34014bf2 to 34014bfc only toggle the GPIO and reset the counter, once per pass):
34014bfe: 9a01 ldr r2, [sp, #4]
34014c00: 4282 cmp r2, r0
34014c02: d8f6 bhi.n 34014bf2 <bench_05+0x1e>
34014c04: bf00 nop
34014c06: 9a01 ldr r2, [sp, #4]
34014c08: 3201 adds r2, #1
34014c0a: 9201 str r2, [sp, #4]
34014c0c: e7f7 b.n 34014bfe <bench_05+0x2a>
As you can see, the two inner loops are identical.
However, the logic analyzer on the GPIO shows:
H7: 10.4 ms
N6: 21.59 ms
Execution time ratio = 2.75
The measured frequency on MCO2 is:
H7 = 480 MHz
N6 = 588 MHz
Clock ratio = 1.22
So, the H7 @ 480 MHz is effectively 3.35× faster than the N6 @ 588 MHz.
Clearly something is wrong somewhere in the chain, but at the moment the only concrete measurement I have is the MCO2 frequency output values and the GPIO measurement.
Question: Where am I losing this factor of 3?
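One way to make the comparison independent of the clock frequency is to turn each measured period into CPU cycles per loop iteration. The sketch below assumes that one GPIO half-period corresponds to one call of local_loop() with 1,000,000 iterations, and it uses the periods and MCO2 frequencies exactly as quoted above; the function name is hypothetical. Note that the quoted periods give a time ratio of 21.59 / 10.4, which is about 2.08, so the 2.75 figure may come from a different capture.

```c
#include <stdint.h>

/*
 * Cycles spent per loop iteration, from a measured period.
 *   period_s   : time between two GPIO toggles, in seconds
 *   f_cpu_hz   : CPU clock in Hz (here taken from the MCO2 measurement)
 *   iterations : loop count between two toggles (KNB_TESTS in the post)
 */
static inline double cycles_per_iteration(double period_s,
                                          double f_cpu_hz,
                                          uint32_t iterations)
{
    return (period_s * f_cpu_hz) / (double)iterations;
}
```

With the quoted values this gives roughly 5.0 cycles per iteration on the H7 (10.4 ms at 480 MHz) and roughly 12.7 cycles per iteration on the N6 (21.59 ms at 588 MHz), which is the per-iteration gap to explain.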
Best regards,
2025-08-19 4:24 AM
Does the N6 test run in the RAM or external flash?
2025-08-19 4:36 AM
Hi Pavel, the N6 test runs in the internal AXI SRAM1.
BR, Edo
2025-08-19 7:37 AM
Hello @Franzi.Edo
On the N6 side, can you read and share the contents of the MSCR register?
Refer to PM0273 - Rev 3 section 6.8.2 Memory System Control Register, MSCR
You should verify that bit 12 (DCACTIVE) and bit 13 (ICACTIVE), which control the L1 data and instruction cache memory interfaces, are set.
This can be done using the macros in core_cm55.h and implementing the line below at the beginning of your code:
MEMSYSCTL->MSCR |= MEMSYSCTL_MSCR_DCACTIVE_Msk|MEMSYSCTL_MSCR_ICACTIVE_Msk;
Regarding measuring execution times, I suggest using DWT_CYCCNT (also available on H7) instead of a hardware timer. Count CPU cycles instead of microseconds, and compare with your assembler code. Once your CPU cycles reach what you expect, convert to time from the actual H7 and N6 CPU frequencies.
The attached main.c shows how to configure and use DWT_CYCCNT on the N6.
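The attachment is not reproduced in this thread. As a rough sketch of the idea only, the code below enables the cycle counter through the architectural DEMCR and DWT registers and measures an interval. The helper names are hypothetical, and the registers are passed as pointers so that the logic can also be checked on a host; on the target a CMSIS project would normally use DWT->CTRL and DWT->CYCCNT directly. Some devices may need extra steps (for example unlocking debug access), which are not covered here.

```c
#include <stdint.h>

/* Architectural Cortex-M debug register addresses (Armv7-M / Armv8-M). */
#define DEMCR_ADDR          0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR       0xE0001000u  /* DWT control register */
#define DWT_CYCCNT_ADDR     0xE0001004u  /* DWT cycle counter */

#define DEMCR_TRCENA        (1u << 24)   /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* starts the cycle counter */

/* Enable and reset the cycle counter (hypothetical helper). */
static inline void cyccnt_enable(volatile uint32_t *demcr,
                                 volatile uint32_t *dwt_ctrl,
                                 volatile uint32_t *dwt_cyccnt)
{
    *demcr |= DEMCR_TRCENA;
    *dwt_cyccnt = 0u;
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Elapsed cycles; unsigned subtraction stays correct across one wrap. */
static inline uint32_t cyccnt_elapsed(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Convert a cycle count to microseconds for a given CPU clock in Hz. */
static inline double cycles_to_us(uint32_t cycles, double f_cpu_hz)
{
    return (double)cycles * 1.0e6 / f_cpu_hz;
}
```

On the target, the three pointers would be (volatile uint32_t *)DEMCR_ADDR, (volatile uint32_t *)DWT_CTRL_ADDR and (volatile uint32_t *)DWT_CYCCNT_ADDR; read the counter before and after the code under test and pass both values to cyccnt_elapsed().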
Let me know if it helps.
Best regards,
Romain,
To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.
2025-08-19 9:18 AM
Hi RomainR,
Thank you for your suggestion.
I just printed the content of MEMSYSCTL->MSCR, and its value is 0x300A. Unfortunately, the bits you suggested to set are already enabled.
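For reference, this is how the value decodes against the two bits named above; a minimal sketch with a hypothetical helper name, using only the bit positions quoted from PM0273 (bit 12 DCACTIVE, bit 13 ICACTIVE):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as quoted in this thread (PM0273, MSCR). */
#define MSCR_DCACTIVE  (1u << 12)
#define MSCR_ICACTIVE  (1u << 13)

/* True when both L1 cache memory interfaces are reported active. */
static inline bool mscr_caches_active(uint32_t mscr)
{
    const uint32_t both = MSCR_DCACTIVE | MSCR_ICACTIVE;
    return (mscr & both) == both;
}
```

For 0x300A, the upper nibble 0x3 covers bits 13 and 12, so both bits are set and the helper returns true.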
Do you have any other suggestions I could try? It feels like there is some kind of divider between the clock observed on MCO2 and the actual CPU clock.
Regarding the DWT_CYCCNT, you are absolutely right. The issue is that these benchmarks are exactly the same across all the architectures supported by my OS (Cortex, RISC-V). So the simplest solution was to rely on a timer value provided by the OS, in order to avoid multiple code implementations.
Best regards,
Edo
2025-08-19 9:30 AM
Only 588 MHz measured? Why not 600? Could the oscillator be unstable?
2025-08-19 9:34 AM
Hi TDK, this probably comes from the chain of dividers: IC15 is PLL1 / 2, MCO2 is IC15 / 10, and finally there is the limited precision of my digital analyser.
2025-08-19 10:45 AM
Hi @Franzi.Edo
My suggestions are:
1) Check your entire clock tree configuration and the RCC registers. I also suggest using STM32CubeMX to help you.
At the sysa_ck and sysb_ck levels (see the figure below from RM0486), make sure that they receive the correct clocks from PLL1 (ic1_ck = 600 MHz and ic2_ck = 400 MHz) and that the prescalers ic1 to ic11 are set correctly.
Regarding the timer you are using, also check the TIMPRE values. Check that your timer is clocked at 1 MHz.
2) I assume that your application is running in SRAM2 or SRAM1. Try configuring an MPU region suited to execution from AXISRAM2 or AXISRAM1. Refer to the discussions and knowledge below:
3) Try moving the execution to TCM memory. Because of the ECC, make sure to initialize the TCM before using it, and update the linker file accordingly.
https://community.st.com/t5/stm32-mcus-products/stm32n6-hard-fault-when-accessing-tcm/td-p/757557
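For the timer check in item 1, the relation to verify is that an STM32 timer counter runs at its kernel clock divided by (PSC + 1). The sketch below uses hypothetical helper names, and the 400 MHz timer clock mentioned afterwards is only an illustrative assumption, not a value confirmed in this thread.

```c
#include <stdint.h>

/* Counter tick rate produced by a given prescaler register value. */
static inline uint32_t tim_tick_hz(uint32_t f_tim_hz, uint16_t psc)
{
    return f_tim_hz / ((uint32_t)psc + 1u);
}

/*
 * Prescaler register value for an exact tick rate.
 * Returns 0 on success, -1 if the rate is not exactly reachable
 * or does not fit the 16-bit prescaler.
 */
static inline int tim_psc_for_tick(uint32_t f_tim_hz, uint32_t tick_hz,
                                   uint16_t *psc)
{
    if (tick_hz == 0u || f_tim_hz < tick_hz || (f_tim_hz % tick_hz) != 0u) {
        return -1;
    }
    uint32_t div = f_tim_hz / tick_hz;
    if (div > 65536u) {
        return -1;
    }
    *psc = (uint16_t)(div - 1u);
    return 0;
}
```

For example, an assumed 400 MHz timer clock needs PSC = 399 for a 1 MHz tick. If the timer kernel clock differs from what the OS assumes, every reported time in microseconds is off by the same factor, so this is worth ruling out before comparing benchmark results.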
Best regards,
Romain,
2025-08-19 11:09 PM
For any accurate timing measurements, I would:
- use the ARM cycle counter
- turn off all interrupts ( __disable_irq() if possible, otherwise disable all interrupts not related to the tested functions)