Very bad performance on the STM32N657

Franzi.Edo · 2025-08-18

Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.

I ran these benchmarks on:
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON

For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz

The compiler used is gcc-15.2.0

Bench for the Nucleo_H753
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
          the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =     17 [us]
          X projection                                 t =     41 [us]
          Y projection                                 t =     18 [us]
          Histogram                                    t =     30 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
          in the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =    171 [us]
          X projection                                 t =    672 [us]
          Y projection                                 t =    288 [us]
          Histogram                                    t =    451 [us]

Bench 02: Fill a small 1D array (1000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =   1000 [-]
          Min / Max                                    t =   1110 [us]

Bench 03: Fill a big 1D array (50000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =    100 [-]
          Min / Max                                    t =    107 [us]

Bench 04: Compute the integer atan2 using the CORDIC
          algorithm
          Number of tests                              n =   1000 [-]
          1000 x atan2(y, x)                           t =   1088 [us]
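
For readers who want to reproduce the kind of kernels named in Bench 00 and 01, here is a minimal sketch. It is not the uKOS-X code: the element type (`uint8_t`), the fill pattern, and all function names are assumptions, since the thread does not show the real implementation. With a row-major array, one projection reads consecutive addresses while the other strides by a full row, which is the kind of access-pattern difference that makes such tests sensitive to cache and memory settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a rows x cols array (row-major) with a simple 8-bit pattern.
   The pattern is hypothetical; any deterministic data would do. */
static void fill(uint8_t *a, size_t rows, size_t cols) {
    assert(a != NULL);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            a[r * cols + c] = (uint8_t)(r + c);
}

/* Sum each row: the inner loop walks consecutive addresses. */
static void project_rows(const uint8_t *a, size_t rows, size_t cols, uint32_t *out) {
    assert(a != NULL && out != NULL);
    for (size_t r = 0; r < rows; r++) {
        uint32_t sum = 0;
        for (size_t c = 0; c < cols; c++)
            sum += a[r * cols + c];
        out[r] = sum;
    }
}

/* Sum each column: the inner loop strides by one full row per step. */
static void project_cols(const uint8_t *a, size_t rows, size_t cols, uint32_t *out) {
    assert(a != NULL && out != NULL);
    for (size_t c = 0; c < cols; c++) {
        uint32_t sum = 0;
        for (size_t r = 0; r < rows; r++)
            sum += a[r * cols + c];
        out[c] = sum;
    }
}

/* 256-bin histogram: data-dependent read-modify-write accesses. */
static void histogram(const uint8_t *a, size_t n, uint32_t hist[256]) {
    assert(a != NULL && hist != NULL);
    memset(hist, 0, 256 * sizeof hist[0]);
    for (size_t i = 0; i < n; i++)
        hist[a[i]]++;
}
```

Timing each function separately for 50 x 50 and 200 x 200 arrays would give four figures per bench, in the same layout as the output above.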


Bench for the Nucleo_N657
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
          the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =     29 [us]
          X projection                                 t =    173 [us]
          Y projection                                 t =    167 [us]
          Histogram                                    t =    330 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
          in the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =    400 [us]
          X projection                                 t =   2766 [us]
          Y projection                                 t =   2640 [us]
          Histogram                                    t =   5369 [us]

Bench 02: Fill a small 1D array (1000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =   1000 [-]
          Min / Max                                    t =   3127 [us]

Bench 03: Fill a big 1D array (50000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =    100 [-]
          Min / Max                                    t =    323 [us]

Bench 04: Compute the integer atan2 using the CORDIC
          algorithm
          Number of tests                              n =   1000 [-]
          1000 x atan2(y, x)                           t =   2726 [us]

As shown, the N6 performance is not acceptable!
The clock measurements on MCO2 reflect the PLL values, but maybe some other elements
(not clearly identified) are influencing the code execution.
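
To put numbers on the gap, the per-row slowdown (N6 time divided by H7 time) can be computed directly from the two listings above. The sketch below only re-uses the printed timings; the array and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Timings in microseconds, copied row by row from the two listings above
   (Bench 00: 4 rows, Bench 01: 4 rows, Bench 02, Bench 03, Bench 04). */
static const uint32_t h7_us[] = {17, 41, 18, 30, 171, 672, 288, 451, 1110, 107, 1088};
static const uint32_t n6_us[] = {29, 173, 167, 330, 400, 2766, 2640, 5369, 3127, 323, 2726};

#define N_ROWS (sizeof h7_us / sizeof h7_us[0])

/* Slowdown of the N6 relative to the H7 for one row (> 1 means N6 slower). */
static double slowdown(size_t i) {
    assert(i < N_ROWS);
    return (double)n6_us[i] / (double)h7_us[i];
}

/* Index of the row with the largest slowdown. */
static size_t worst_row(void) {
    size_t worst = 0;
    for (size_t i = 1; i < N_ROWS; i++)
        if (slowdown(i) > slowdown(worst))
            worst = i;
    return worst;
}
```

With these figures the slowdown ranges from about 1.7x (Bench 00, fill) to about 11.9x (Bench 01, histogram). Such a wide spread would be hard to explain by a clock error alone and points more towards something that depends on the memory-access pattern.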

Any clue on how to get better results from the N6?
Kind regards,
Edo

Franzi.Edo · 2025-08-20

Hi LCE,

Thank you for your advice.

Regarding the DWT_CYCCNT, you are absolutely right. The issue is that these benchmarks are exactly the same across all the architectures supported by my OS (Cortex, RISC-V). So the simplest solution was to rely on a timer value provided by the OS, in order to avoid multiple code implementations.

For the moment I do not need sub-microsecond resolution. The problem here is the effective speed of the CPU. Just check my previous test (a simple loop with NOPs): it turns out that the H7 is 3.3x faster than the N6, and I cannot believe that.
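
The "one benchmark source for every architecture" idea can be sketched as a harness that takes the time source as a function pointer. This is only an illustration under assumptions: `tick_us_fn`, `time_us`, `spin` and `fake_now_us` are hypothetical names, and the fake tick merely stands in for the microsecond timer an OS would provide.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*tick_us_fn)(void); /* hypothetical: free-running microsecond counter */
typedef void (*bench_fn)(void);       /* code under test */

/* Run `work` `repeats` times and return the elapsed time in microseconds. */
static uint64_t time_us(bench_fn work, tick_us_fn now, uint32_t repeats) {
    assert(work != 0 && now != 0 && repeats > 0u);
    uint64_t t0 = now();
    for (uint32_t i = 0; i < repeats; i++)
        work();
    return now() - t0;
}

/* Stand-in workload: an empty loop kept alive by a volatile counter. */
static void spin(void) {
    for (volatile uint32_t i = 0; i < 1000u; i++) {
    }
}

/* Stand-in time source for a host build: advances 10 us per call.
   On a target this would be replaced by the OS timer read. */
static uint64_t fake_now_us(void) {
    static uint64_t t = 0;
    t += 10u;
    return t;
}
```

Only the function passed as `now` changes between ports; the benchmark bodies stay identical.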
Best regards,
Edo

Franzi.Edo · 2025-08-20

Hi RomainR,

Thank you for your suggestions and the links you shared.

I will go through them and also double-check the RCC configuration.

That said, I have the impression that my RCC config is correct, since I can observe the expected clock on MCO2.

Here are my current RCC settings:

PLLs:

PLL1, powered by HSI, N=190, M=5, f(VCO)=64*190/5=2432-MHz, Out=f(VCO)/4=608-MHz

PLL2, powered by HSI, N=245, M=7, f(VCO)=64*245/7=2240-MHz, Out=f(VCO)/3=746.7-MHz

PLL3, powered by HSI, N=125, M=5, f(VCO)=64*125/5=1600-MHz, Out=f(VCO)/4=400-MHz

PLL4, powered by HSI, N=125, M=5, f(VCO)=64*125/5=1600-MHz, Out=f(VCO)/4=400-MHz

Interconnects:

IC1, powered by PLL1, out / 1 = 608-MHz

IC2, powered by PLL4, out / 1 = 400-MHz

IC9, powered by PLL4, out / 1 = 400-MHz (not used)

IC15, powered by PLL3, out / 2 = 304-MHz (for MCO2)

IC20, powered by PLL3, out / 4 = 100-MHz (for MCO2)

Bus clocks:

CPU -> 608-MHz (sysa_ck)

SYS -> 400-MHz (sysb_ck)

PERCK -> HSI

Timers -> 100-MHz

HPRE -> 100-MHz

APBx (x = 1, 2, 4, 5) -> 100-MHz
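
The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below assumes exactly the formula used above, f(VCO) = HSI x N / M with HSI = 64 MHz, followed by the output divider; the function names are illustrative only.

```c
#include <assert.h>

#define HSI_MHZ 64.0 /* nominal HSI frequency used in the list above */

/* PLL output in MHz: HSI * N / M, then the post divider. */
static double pll_out_mhz(unsigned n, unsigned m, unsigned post_div) {
    assert(m > 0u && post_div > 0u);
    return HSI_MHZ * n / m / post_div;
}

/* Interconnect (IC) output in MHz for a given source and divider. */
static double ic_out_mhz(double src_mhz, unsigned div) {
    assert(div > 0u);
    return src_mhz / div;
}
```

PLL1, PLL3 and PLL4 come out at 608, 400 and 400 MHz as listed, and PLL2 at about 746.7 MHz. One entry does not add up: IC15 is listed as PLL3 / 2 = 304 MHz, but PLL3 / 2 would be 200 MHz, whereas PLL1 / 2 gives 304 MHz. Either the source or the value of IC15 may be a typo in the list.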

Based on my previous NOP loop test, it looks like the CPU is indeed running close to 600 MHz. However, something seems wrong with the memory bus.

I observed better results when I completely disabled the MPU, so I will investigate further in that direction. In parallel, I’ll also continue looking into your suggestions.

Thank you again for your help.

Kind regards,

Edo

LCE · 2025-08-20

CYCCNT:

I prefer this not only because of the sub-microsecond accuracy, but also because it is not an STM32 peripheral and thus does not depend on bus clocks or peripheral settings.
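
A minimal sketch of the arithmetic around a 32-bit cycle counter such as DWT CYCCNT follows. Reading and enabling the counter on the target is not shown; the two helpers and their names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter samples. Unsigned subtraction
   stays correct if the counter wrapped once between the two reads. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end) {
    return end - start;
}

/* Convert cycles to whole microseconds for a core clock given in MHz
   (one MHz equals one cycle per microsecond). */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_mhz) {
    assert(core_mhz > 0u);
    return cycles / core_mhz;
}
```

At 588 MHz a 32-bit counter wraps after roughly 7.3 s, so a single measurement has to stay below that.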

TDK · 2025-08-20

Note also that the M7 core is faster than the M55 on a per-MHz level.

You might also be running into bus contention issues, with code and data being transferred over the same bus. Some sources on the internet mention a 2-3x speed difference, which is what you're seeing.

> So, the H7 @ 480 MHz is effectively 3.35× faster than the N6 @ 588 MHz.

I calculate a 2.53x difference, not 3.35x.

The N6 has a lot of NPU-specific computational power which is not being exercised here at all. That's what it was built for, not single-thread execution.

If you feel a post has answered your question, please click "Accept as Solution".

AScha.3 · 2025-08-20

Arm gives numbers that do not differ much :)

The M7 is just 20% faster than the M55 (on CoreMark).

https://documentation-service.arm.com/static/6267de1c7e121f01fd22d677?token=

https://documentation-service.arm.com/static/61bb37962183326f2176f8cc

Franzi.Edo · 2025-08-20

Hi TDK,

You are right: the M7 is faster than the M55 per MHz.
In my test, the N6 has a clock-speed advantage of a factor of 1.22, which should place it roughly at the same level as the H7.

Looking at my NOP test:
H7: 10.4 ms
N6: 21.59 ms
Execution time ratio = 2.75
Measured frequency on MCO2:
H7 = 480 MHz
N6 = 588 MHz
Clock ratio = 1.22

If I multiply the execution time penalty (2.75) by the clock advantage (1.22), I get about 3.35.
I know this isn’t rigorous, since clock speed is already factored into the execution time ratio, but I’m just treating it as an order of magnitude estimate.
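
One way to make this comparison a bit more rigorous is to convert each time into core cycles (time x clock) and compare the cycle counts. The sketch below uses only the four figures listed above; the function names are illustrative.

```c
#include <assert.h>

/* Core cycles spent: time in ms * 1000 us/ms * cycles per us (MHz). */
static double cycles_spent(double time_ms, double core_mhz) {
    assert(time_ms > 0.0 && core_mhz > 0.0);
    return time_ms * 1000.0 * core_mhz;
}

/* Plain ratio helper. */
static double ratio(double a, double b) {
    assert(b > 0.0);
    return a / b;
}
```

Taking the listed values at face value, 21.59 ms / 10.4 ms is about 2.08, and the N6 spends about 12.69 million cycles against 4.99 million for the H7, a per-cycle ratio of about 2.54. That matches the 2.53x quoted by TDK; the 2.75 figure above does not follow from these two times and may come from a different run.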
In other tests, I even see an execution time ratio of >4.
But I’ve identified the problem — I’ll explain in a moment.
Thanks.

Franzi.Edo · 2025-08-20

Hi AScha.3

You are right: the M7 is faster than the M55 per MHz.
In my test, the N6 has a clock-speed advantage of a factor of 1.22, which should place it roughly at the same level as the H7. With the CoreMark/MHz figures, the 588-MHz N6 scores 4.40 x 588 = 2587.2 and the 480-MHz H7 scores 5.29 x 480 = 2539.2. So both machines should give very similar results in my tests.

In some tests, I even see an execution time ratio of >4.
But I’ve identified the problem — I’ll explain in a moment.
Thanks.

Franzi.Edo · 2025-08-20

Dear all,
I have identified the main cause of my problem.
The poor N6 performance compared to the H7 was due to the MPU configuration.

I had initially configured the RAM as Shareable, which, for reasons that are not entirely clear to me, degraded memory performance.
After changing the RAM to Non-shareable, the results are now much more consistent and explainable.
With this adjustment, the N6 @ 588 MHz and the H7 @ 480 MHz deliver very close performance.
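
For reference, here is how the shareability setting shows up in an Armv8-M MPU region descriptor. This is a sketch under assumptions, not the configuration used here: the base and limit addresses in the usage note are hypothetical, and the thread does not say whether the original setting was inner or outer shareable. The helpers only build the register words; on the target they would be written to MPU_RBAR and MPU_RLAR after selecting the region in MPU_RNR.

```c
#include <assert.h>
#include <stdint.h>

/* MPU_RBAR.SH field encodings (bits [4:3]); the value 1 is reserved. */
#define SH_NON   0u
#define SH_OUTER 2u
#define SH_INNER 3u

/* MPU_RBAR.AP field encodings (bits [2:1]). */
#define AP_RW_PRIV 0u
#define AP_RW_ANY  1u
#define AP_RO_PRIV 2u
#define AP_RO_ANY  3u

/* Build an MPU_RBAR word: BASE[31:5] | SH[4:3] | AP[2:1] | XN[0]. */
static uint32_t mpu_rbar(uint32_t base, uint32_t sh, uint32_t ap, uint32_t xn) {
    assert((base & 0x1Fu) == 0u); /* base must be 32-byte aligned */
    assert(sh == SH_NON || sh == SH_OUTER || sh == SH_INNER);
    assert(ap <= 3u && xn <= 1u);
    return base | (sh << 3) | (ap << 1) | xn;
}

/* Build an MPU_RLAR word: LIMIT[31:5] | AttrIndx[3:1] | EN[0]. */
static uint32_t mpu_rlar(uint32_t limit, uint32_t attr_index) {
    assert(attr_index <= 7u);
    return (limit & ~0x1Fu) | (attr_index << 1) | 1u;
}
```

For a hypothetical RAM region starting at 0x34000000, `mpu_rbar(0x34000000u, SH_INNER, AP_RW_ANY, 1u)` gives 0x3400001B and `mpu_rbar(0x34000000u, SH_NON, AP_RW_ANY, 1u)` gives 0x34000003. The two words differ only in the SH bits.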
I consider this issue mostly resolved, although a few open points remain:
Why does Shareable RAM perform so poorly?

In my NOP test, the H7 is still about 30% faster than the N6, despite the N6 running at a higher clock speed. My assumption is that the H7’s memory scheme (code executed in FLASH with the ART accelerator) is more efficient than the N6’s cache-based approach.

Anyway, I’d like to thank you all for your great support.
Best regards,
Edo

RomainR. · 2025-08-20

Hi @Franzi.Edo

Thank you for sharing your tests and results. It was almost certainly a matter of MPU configuration.

A memory area on the ST NOC AXI (SRAM1 and 2 of the N6) configured with shareable cacheable attributes will be translated by the CM55 processor as Normal Shareable Non-cacheable. This could penalize processor accesses and would explain the degraded performance.

There is a note on this subject in the Arm Cortex-M55 Processor Technical Reference Manual.
See the section "Memory system / Manager-AXI interface", then "Memory attribute conversion on M-AXI":

https://developer.arm.com/documentation/101051/0101/Memory-system/Manager-AXI-interface/Memory-attribute-conversion-on-M-AXI

It is also possible that, as on the Cortex-M7, a shareable and cacheable area does not use the data cache, so that only the instruction cache is effective:

https://www.youtube.com/watch?v=6IUfxSAFhlw&list=PLnMKNibPkDnEQXu4S6QUUHuSKj81MeqCz&ab_channel=STMicroelectronics

Best regards,

Romain,

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

Pavel A. · 2025-08-21

> A memory area on the ST NOC AXI (SRAM1 and 2 of the N6) configured with shareable cacheable attributes will be translated by the CM55 processor as Normal Shareable Non-cacheable.

@RomainR. Is this the case even if the area is defined as cacheable via the MPU? Or only when the area falls in the "background region" (without MPU)?