2025-08-18 10:19 AM
Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.
I ran these benchmarks on
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON
For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz
The compiler used is gcc-15.2.0
Bench for the Nucleo_H753
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 17 [us]
X projection t = 41 [us]
Y projection t = 18 [us]
Histogram t = 30 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 171 [us]
X projection t = 672 [us]
Y projection t = 288 [us]
Histogram t = 451 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 1110 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 107 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 1088 [us]
Bench for the Nucleo_N657
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 29 [us]
X projection t = 173 [us]
Y projection t = 167 [us]
Histogram t = 330 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 400 [us]
X projection t = 2766 [us]
Y projection t = 2640 [us]
Histogram t = 5369 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 3127 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 323 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 2726 [us]
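For readers unfamiliar with the terms, the kernels named in Bench 00 and Bench 01 could look roughly like the sketch below. This is not the actual uKOS-X benchmark code: the element type (`uint8_t`), the function names, and the loop order are all assumptions, and it is not known which of the two sums the bench calls the X projection and which the Y projection. The point is only that one projection walks memory sequentially while the other uses strided accesses, so both are sensitive to how the RAM is cached.

```c
#include <stdint.h>
#include <string.h>

#define N 50  /* assumed edge length, matching the "50 x 50" bench */

/* Hypothetical kernels: types, names and loop order are assumptions. */

/* Sum each row: consecutive addresses, cache friendly. */
void sum_rows(uint8_t img[N][N], uint32_t proj[N]) {
    for (int y = 0; y < N; y++) {
        uint32_t sum = 0;
        for (int x = 0; x < N; x++) sum += img[y][x];
        proj[y] = sum;
    }
}

/* Sum each column: accesses are N bytes apart (strided). */
void sum_cols(uint8_t img[N][N], uint32_t proj[N]) {
    for (int x = 0; x < N; x++) {
        uint32_t sum = 0;
        for (int y = 0; y < N; y++) sum += img[y][x];
        proj[x] = sum;
    }
}

/* Count how often each 8-bit value occurs. */
void histogram(uint8_t img[N][N], uint32_t hist[256]) {
    memset(hist, 0, 256 * sizeof hist[0]);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) hist[img[y][x]]++;
}
```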
As shown, the N6 performance is not acceptable!
The clock measurements on MCO2 reflect the PLL values, but maybe some other elements
(not clearly identified) are influencing the code execution.
Any clue to get more decent results for the N6?
Kind regards,
Edo
2025-08-20 12:24 AM
Hi LCE,
Thank you for your advice.
Regarding the DWT_CYCCNT, you are absolutely right. The issue is that these benchmarks are exactly the same across all the architectures supported by my OS (Cortex, RISC-V). So the simplest solution was to rely on a timer value provided by the OS, in order to avoid multiple code implementations.
For the moment I do not need sub-µs resolution. The problem here is the effective speed of the CPU. Just check my previous test (a simple loop of NOPs). It turns out that the H7 is 3.3 x faster than the N6, and I cannot believe that.
Best regards,
Edo
2025-08-20 12:47 AM
2025-08-20 1:26 AM
CYCCNT:
I prefer this not only because of the sub-microsecond accuracy, but also because it is not an STM32 peripheral, and thus does not depend on the bus clock or on peripheral settings.
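A minimal sketch of using the cycle counter, assuming a Cortex-M core that implements the DWT unit. The register addresses and bit positions are the architectural ones; the function names are made up for illustration, and the hardware part is compiled only for Arm targets so the helpers can also be checked on a host. In a real project the CMSIS core header would normally provide these registers.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural addresses of the debug/DWT registers on Cortex-M. */
#define DEMCR_ADDR       0xE000EDFCu
#define DWT_CTRL_ADDR    0xE0001000u
#define DWT_CYCCNT_ADDR  0xE0001004u
#define DEMCR_TRCENA     (1u << 24)  /* enables the DWT unit */
#define DWT_CYCCNTENA    (1u << 0)   /* enables the cycle counter */

#if defined(__arm__)
/* Only meaningful on the target; excluded from host builds. */
void cyccnt_enable(void) {
    *(volatile uint32_t *)DEMCR_ADDR |= DEMCR_TRCENA;
    *(volatile uint32_t *)DWT_CYCCNT_ADDR = 0;            /* reset counter */
    *(volatile uint32_t *)DWT_CTRL_ADDR |= DWT_CYCCNTENA; /* start counting */
}

uint32_t cyccnt_read(void) {
    return *(volatile uint32_t *)DWT_CYCCNT_ADDR;
}
#endif

/* Elapsed cycles; unsigned subtraction tolerates a single wrap-around. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end) {
    return end - start;
}

/* Convert a cycle count to microseconds for a core clock given in Hz. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz) {
    assert(core_hz != 0);
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```

On the target, one would read `cyccnt_read()` before and after the code under test and pass both values to `cycles_elapsed()`. Comparing raw cycle counts between two boards also removes the clock frequency from the comparison.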
2025-08-20 6:05 AM
Note also that the M7 core is faster than the M55 on a per-MHz level.
You might also be running into bus contention issues, with code and data being transferred over the same bus. Some sources on the internet report a 2-3x speed difference, which is what you're seeing.
> So, the H7 @ 480 MHz is effectively 3.35× faster than the N6 @ 588 MHz.
I calculate a 2.53x difference, not 3.35x.
The N6 has a lot of NPU-specific computational power which is not being exercised here at all. That's what it was built for, not single-thread execution.
2025-08-20 6:41 AM - edited 2025-08-20 6:43 AM
Arm's own numbers do not differ that much :)
The M7 is just 20% faster than the M55 (on CoreMark).
https://documentation-service.arm.com/static/6267de1c7e121f01fd22d677?token=
https://documentation-service.arm.com/static/61bb37962183326f2176f8cc
2025-08-20 8:53 AM
Hi TDK,
You are right: the M7 is faster than the M55 per MHz.
In my test, the N6 has a clock-speed advantage of a factor 1.22, which should place it roughly at the same level as the H7.
Looking at my NOP test:
H7: 10.4 ms
N6: 21.59 ms
Execution time ratio = 2.75
Measured frequency on MCO2:
H7 = 480 MHz
N6 = 588 MHz
Clock ratio = 1.22
If I multiply the execution time penalty (2.75) by the clock advantage (1.22), I get about 3.35.
I know this isn’t rigorous, since clock speed is already factored into the execution time ratio, but I’m just treating it as an order of magnitude estimate.
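The estimate above amounts to comparing cycle counts rather than times, since cycles = time x clock. A small sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>

/* How many times longer the slow run takes than the fast one. */
double time_ratio(double t_slow_ms, double t_fast_ms) {
    assert(t_fast_ms > 0.0);
    return t_slow_ms / t_fast_ms;
}

/* Ratio of consumed clock cycles: (t_slow * f_slow) / (t_fast * f_fast). */
double cycle_ratio(double t_slow_ms, double f_slow_mhz,
                   double t_fast_ms, double f_fast_mhz) {
    assert(t_fast_ms > 0.0 && f_fast_mhz > 0.0);
    return (t_slow_ms * f_slow_mhz) / (t_fast_ms * f_fast_mhz);
}
```

Fed with the timings listed in this post (21.59 ms at 588 MHz against 10.4 ms at 480 MHz), `time_ratio` gives about 2.08 and `cycle_ratio` about 2.54. That is close to the 2.53x quoted earlier in the thread; the 2.75 and 3.35 figures may stem from a different measurement.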
In other tests, I even see an execution time ratio of >4.
But I’ve identified the problem — I’ll explain in a moment.
Thanks.
2025-08-20 8:58 AM
Hi AScha.3
You are right: the M7 is faster than the M55 per MHz.
In my test, the N6 has a clock-speed advantage of a factor 1.22, which should place it roughly at the same level as the H7. So, the 588-MHz N6 CoreMark is 4.40 x 588 = 2'587.2 and the 480-MHz H7 CoreMark is 5.29 x 480 = 2'539.2. So, both machines should give very similar results on my tests.
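The same comparison written out as code, using the CoreMark/MHz figures quoted above (4.40 for the M55, 5.29 for the M7). The function name is hypothetical, and the result is only a first-order estimate that ignores memory and bus effects:

```c
/* Expected CoreMark score: CoreMark/MHz figure times the core clock in MHz. */
double coremark_estimate(double coremark_per_mhz, double f_mhz) {
    return coremark_per_mhz * f_mhz;
}

/* Ratio of two expected scores; above 1.0 means the first core is ahead. */
double coremark_advantage(double score_a, double score_b) {
    return score_a / score_b;
}
```

With these inputs the N6 comes out roughly 2% ahead of the H7, which is why a measured gap of 2x or more points at the memory configuration rather than at the core.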
In some tests, I even see an execution time ratio of >4.
But I’ve identified the problem — I’ll explain in a moment.
Thanks.
2025-08-20 9:11 AM
Dear all,
I have identified the main cause of my problem.
The poor N6 performance compared to the H7 was due to the MPU configuration.
I had initially specified the RAM as Shareable, which, for reasons that are not entirely clear, degraded memory performance.
After changing the RAM to Non-shareable, the results are now much more consistent and explainable.
With this adjustment, the N6 @ 588 MHz and the H7 @ 480 MHz deliver very close performance.
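To make the change concrete, here is a sketch of how the shareability field enters an Armv8-M MPU region descriptor. The bit layouts of MPU_RBAR and MPU_RLAR are the architectural ones; the helper names and the example base address 0x34000000 are assumptions for illustration, not taken from my actual configuration. CMSIS offers comparable macros in its MPU header, which would normally be used instead.

```c
#include <assert.h>
#include <stdint.h>

/* MPU_RBAR.SH field encodings (Armv8-M). */
enum { SH_NON = 0u, SH_OUTER = 2u, SH_INNER = 3u };
/* MPU_RBAR.AP field encodings (subset): read/write access. */
enum { AP_RW_PRIV = 0u, AP_RW_ANY = 1u };

/* MAIR attribute byte: Normal memory, write-back, read/write-allocate. */
#define MAIR_NORMAL_WB_RWA 0xFFu

/* MPU_RBAR: BASE[31:5] | SH[4:3] | AP[2:1] | XN[0]. */
uint32_t mpu_rbar(uint32_t base, uint32_t sh, uint32_t ap, uint32_t xn) {
    assert((base & 0x1Fu) == 0);  /* regions are 32-byte aligned */
    return base | ((sh & 3u) << 3) | ((ap & 3u) << 1) | (xn & 1u);
}

/* MPU_RLAR: LIMIT[31:5] | AttrIndx[3:1] | EN[0]. */
uint32_t mpu_rlar(uint32_t limit, uint32_t attr_idx) {
    return (limit & ~0x1Fu) | ((attr_idx & 7u) << 1) | 1u;
}
```

For example, `mpu_rbar(0x34000000u, SH_NON, AP_RW_ANY, 1u)` describes a non-shareable, read/write, execute-never region, while passing `SH_INNER` instead sets bits [4:3] and gives the shareable variant that caused the slowdown in my case.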
I consider this issue mostly resolved, although a few open points remain:
Why does Shareable RAM perform so poorly?
In my NOP test, the H7 is still about 30% faster than the N6, despite the N6 running at a higher clock speed. My assumption is that the H7’s memory scheme (code executed in FLASH with the ART accelerator) is more efficient than the N6’s cache-based approach.
Anyway, I’d like to thank you all for your great support.
Best regards,
Edo
2025-08-20 10:05 AM - edited 2025-08-20 10:08 AM
Hi @Franzi.Edo
Thank you for sharing your tests and results. It was almost certainly a matter of MPU configuration.
A memory area on the ST NOC AXI (SRAM1 and 2 of the N6) configured with shareable cacheable attributes will be translated by the CM55 processor as Normal Shareable Non-cacheable. This could penalize processor accesses and explains the degraded performance.
Here is a note on this subject in the Arm Cortex-M55 Processor Technical Reference Manual.
Section Memory system/Manager-AXI interface then Memory attribute conversion on M-AXI:
It is also possible that, as on Cortex-M7, a shareable and cacheable area may not have the data cache enabled, only the instruction cache is used:
Best regards,
Romain,
2025-08-21 8:12 AM
> A memory area on the ST NOC AXI (SRAM1 and 2 of the N6) configured with shareable cacheable attributes will be translated by the CM55 processor as Normal Shareable Non-cacheable.
@RomainR. Is this the case even if the area is defined via the MPU as cacheable? Or does it apply only when the area falls in the "background region" (without MPU)?