2025-08-18 10:19 AM
Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.
I ran these benchmarks on
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON
For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz
The compiler used is gcc-15.2.0
Bench for the Nucleo_H753
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 17 [us]
X projection t = 41 [us]
Y projection t = 18 [us]
Histogram t = 30 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 171 [us]
X projection t = 672 [us]
Y projection t = 288 [us]
Histogram t = 451 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 1110 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 107 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 1088 [us]
Bench for the Nucleo_N657
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 29 [us]
X projection t = 173 [us]
Y projection t = 167 [us]
Histogram t = 330 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 400 [us]
X projection t = 2766 [us]
Y projection t = 2640 [us]
Histogram t = 5369 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 3127 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 323 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 2726 [us]
As shown, the N6 performance is not acceptable!
The clock measurements on MCO2 reflect the PLL values, but maybe some other elements
(not clearly identified) are influencing the code execution.
Any clue to get more decent results for the N6?
Kind regards,
Edo
2025-08-20 9:11 AM
Dear all,
I have identified the main cause of my problem.
The poor N6 performance compared to the H7 was due to the MPU configuration.
I had initially specified the RAM as Shareable, which, for reasons that are not entirely clear, degraded memory performance.
After changing the RAM to Non-shareable, the results are now much more consistent and explainable.
With this adjustment, the N6 @ 588 MHz and the H7 @ 480 MHz deliver very close performance.
I consider this issue mostly resolved, although a few open points remain:
Why does Shareable RAM perform so poorly?
In my NOP test, the H7 is still about 30% faster than the N6, despite the N6 running at a higher clock speed. My assumption is that the H7’s memory scheme (code executed in FLASH with the ART accelerator) is more efficient than the N6’s cache-based approach.
Anyway, I’d like to thank you all for your great support.
Best regards,
Edo
2025-08-18 11:03 AM
Hi,
What optimizer setting did you use?
Try -O2, recompile, then check again.
2025-08-18 11:21 AM
Hi AScha.3,
Thank you for the suggestion.
Both targets use the same gcc settings, and the optimisation level is -Os. Below are the N6 results for -O2 and for -O3. Even with these optimisation levels, we are still very far from the -Os results of the H753. More probably something is wrong with the hardware, but I can measure only the PLL clocks!
Here are the new results:
-O2
---
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 24 [us]
X projection t = 169 [us]
Y projection t = 172 [us]
Histogram t = 331 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 334 [us]
X projection t = 2752 [us]
Y projection t = 2632 [us]
Histogram t = 5349 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 2963 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 307 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 2747 [us]
-O3
---
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 24 [us]
X projection t = 169 [us]
Y projection t = 171 [us]
Histogram t = 330 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 334 [us]
X projection t = 2752 [us]
Y projection t = 2633 [us]
Histogram t = 5348 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 2962 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 292 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 2862 [us]
2025-08-18 11:57 AM
Ok,
I asked only because you didn't state the optimizer setting.
(I don't have the N6 with its M55 core, so I'm just guessing...)
For your tests: where is the program located?
Did you try loading it into RAM, or better, into TCM RAM? And is the supply set to VOS high?
Did you check with a scope on MCO that the clock setting is correct?
2025-08-18 12:06 PM
Hi AScha.3,
For the H753 the code is running in Flash. For the N657 the code is running in the internal RAM (AXISRAM1).
Well, I didn't run from the TCM because that alone could not explain the gap between the two machines. Maybe I have to verify the VOS again. The clocks observed with the scope on MCO2 are correct (600 MHz for the CPU and 400 MHz for the AXI). As the gap between the two cores is so large, I really suspect a hardware issue.
Thank you again
2025-08-18 12:42 PM - edited 2025-08-18 12:48 PM
So, just curious: how did you measure the time (in us)?
Just to exclude some basic errors that can happen...
because "histogram" 30 -> 330 us would mean the M55 at 600 MHz is 10 times slower than the M7 at 480 MHz? That cannot be real.
Maybe float/double is used and handled in software rather than in hardware? Just guessing, sorry.
Because the M55 can do float, but not double, in hardware; the H7 has double-precision float in hardware.
So your tests don't make clear whether int, float, or double is used, or which settings are used.
2025-08-18 1:37 PM
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 1110 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 107 [us]
What exactly is being reported on the "Min / Max" line? The two readings aren't consistent with each other as far as I can see. If it's the time per test, the smaller array should be faster. If it's the total time for all tests, the math doesn't add up. 1000*1000 values take 1110 us but 100*50000 values only take 107 us? Nah.
Showing actual code being used here might help and avoid the 20-questions back and forth and get an answer faster.
2025-08-18 1:47 PM
Hi AScha.3,
The time in us is coming from a 32-bit timer clocked at 1-MHz.
All the tests use intx_t types (x: 8, 32, 64). The only bench that uses double-precision floating point is bench_04 (the last one). I verified in the .lst that it uses the FPU and not the soft-float routines. According to the ARM documentation, the FPU is an "FPU with support for half precision (fp16), single precision (fp32) and double precision (fp64) floating-point operations".
I verified VOS, and the VOS bit is set (PWR->VOSCR |= PWR_VOSCR_VOS).
I will try to investigate with a simpler test.
Thank you
Kind regards
2025-08-18 2:00 PM
Hi,
Yes... that is the ARM documentation, which describes what could be in the CPU. But you have an STM32N6xx, so the N657 datasheet says:
So: float, but no double in hardware. It would cost 30 cents more... :)
2025-08-18 2:11 PM
Hi AScha.3,
From the ST data sheet DS14791 Rev 5