2025-08-18 10:19 AM
Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.
I ran these benchmarks on:
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON
For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz
The compiler used is gcc-15.2.0
Bench for the Nucleo_H753
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 17 [us]
X projection t = 41 [us]
Y projection t = 18 [us]
Histogram t = 30 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 171 [us]
X projection t = 672 [us]
Y projection t = 288 [us]
Histogram t = 451 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 1110 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 107 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 1088 [us]
Bench for the Nucleo_N657
-------------------------
uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 29 [us]
X projection t = 173 [us]
Y projection t = 167 [us]
Histogram t = 330 [us]
Bench 01: Fill a small 2D array (200 x 200) elements
in the internal memory. Then, compute the
X-Y projections and the histogram.
Fill the array t = 400 [us]
X projection t = 2766 [us]
Y projection t = 2640 [us]
Histogram t = 5369 [us]
Bench 02: Fill a small 1D array (1000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 1000 [-]
Min / Max t = 3127 [us]
Bench 03: Fill a big 1D array (50000) elements in
the internal memory with a random pattern.
Then, compute the min / max values.
Number of tests n = 100 [-]
Min / Max t = 323 [us]
Bench 04: Compute the integer atan2 using the CORDIC
algorithm
Number of tests n = 1000 [-]
1000 x atan2(y, x) t = 2726 [us]
As shown, the N6 is considerably slower than the H7 in every test despite its higher clock, which is not acceptable!
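To put numbers on the gap, here is a minimal sketch that computes the N6/H7 time ratio for each line of the two logs above. The timings are copied from the bench output; the names `bench_rows`, `slowdown` and `worst_case` are only illustrative and are not part of uKOS-X.

```c
#include <stddef.h>

/* One row per timed step; values in microseconds, copied from the logs above. */
struct bench_row {
    const char *name;
    unsigned h7_us;   /* Nucleo_H753 @ 480 MHz */
    unsigned n6_us;   /* Nucleo_N657 @ 600 MHz */
};

static const struct bench_row bench_rows[] = {
    { "B00 fill",          17,   29 },
    { "B00 X projection",  41,  173 },
    { "B00 Y projection",  18,  167 },
    { "B00 histogram",     30,  330 },
    { "B01 fill",         171,  400 },
    { "B01 X projection", 672, 2766 },
    { "B01 Y projection", 288, 2640 },
    { "B01 histogram",    451, 5369 },
    { "B02 min/max",     1110, 3127 },
    { "B03 min/max",      107,  323 },
    { "B04 atan2",       1088, 2726 },
};

#define BENCH_COUNT (sizeof bench_rows / sizeof bench_rows[0])

/* Ratio > 1.0 means the N6 took longer than the H7 for that step. */
static double slowdown(size_t i)
{
    return (double)bench_rows[i].n6_us / (double)bench_rows[i].h7_us;
}

/* Index of the step with the largest N6/H7 ratio. */
static size_t worst_case(void)
{
    size_t worst = 0;
    for (size_t i = 1; i < BENCH_COUNT; i++) {
        if (slowdown(i) > slowdown(worst)) {
            worst = i;
        }
    }
    return worst;
}
```

With these figures the ratio ranges from about 1.7 (B00 fill) to about 11.9 (B01 histogram), and the read-heavy steps show the largest ratios.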
The clocks measured on MCO2 match the configured PLL values, so something else
(which I have not identified yet) must be slowing down the code execution.
Any clue on how to get better results on the N6?
Kind regards,
Edo
2025-08-21 8:33 AM
@Franzi.Edo wrote:
I had initially specified the RAM as Shareable, which, for reasons that are not entirely clear, degraded memory performance.
Shareable means the region is treated as non-cacheable, even if you configure it as cacheable.
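For illustration, here is a minimal sketch of where the shareability setting lives in an Armv8-M MPU region (the Cortex-M55 in the N6 uses this MPU layout). It only builds the register values in plain C and does not touch the hardware. The helper names `mpu_rbar` and `mpu_rlar`, the enums, and the base/limit addresses in the usage note are assumptions for the example; they are not CMSIS or HAL functions and not the original poster's code.

```c
#include <stdint.h>

/* Armv8-M MPU_RBAR layout: BASE[31:5], SH[4:3], AP[2:1], XN[0]. */
enum mpu_sh {
    MPU_SH_NON   = 0u,  /* non-shareable */
    MPU_SH_OUTER = 2u,  /* outer shareable */
    MPU_SH_INNER = 3u   /* inner shareable */
};

enum mpu_ap {
    MPU_AP_RW_PRIV = 0u,  /* read/write, privileged only */
    MPU_AP_RW_ANY  = 1u,  /* read/write, any privilege level */
    MPU_AP_RO_PRIV = 2u,  /* read-only, privileged only */
    MPU_AP_RO_ANY  = 3u   /* read-only, any privilege level */
};

/* MAIR attribute bytes for Normal memory. */
#define MAIR_NORMAL_WB_RWA 0xFFu  /* write-back, read/write allocate */
#define MAIR_NORMAL_NC     0x44u  /* non-cacheable */

/* Build an MPU_RBAR value; the base address is aligned down to 32 bytes. */
static inline uint32_t mpu_rbar(uint32_t base, enum mpu_sh sh,
                                enum mpu_ap ap, unsigned xn)
{
    return (base & 0xFFFFFFE0u)
         | ((uint32_t)sh << 3)
         | ((uint32_t)ap << 1)
         | (xn & 1u);
}

/* Build an MPU_RLAR value: LIMIT[31:5], AttrIndx[3:1], EN[0].
   Bit 4 (PXN on Armv8.1-M) is left at 0 in this sketch. */
static inline uint32_t mpu_rlar(uint32_t limit, unsigned attr_index,
                                unsigned enable)
{
    return (limit & 0xFFFFFFE0u)
         | ((attr_index & 7u) << 1)
         | (enable & 1u);
}
```

With a placeholder region at 0x34000000, `mpu_rbar(0x34000000u, MPU_SH_NON, MPU_AP_RW_ANY, 1)` gives 0x34000003, while the same call with `MPU_SH_INNER` gives 0x3400001B. The two values differ only in the SH bits, so the MAIR attribute can still say write-back cacheable while the region is marked shareable, which is the situation described above.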