Very bad performance on the STM32N657

Franzi.Edo
Senior

Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.

I ran these benchmarks on:
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON

For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz

The compiler used is gcc-15.2.0.

Bench for the Nucleo_H753
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
          the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =     17 [us]
          X projection                                 t =     41 [us]
          Y projection                                 t =     18 [us]
          Histogram                                    t =     30 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
          in the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =    171 [us]
          X projection                                 t =    672 [us]
          Y projection                                 t =    288 [us]
          Histogram                                    t =    451 [us]

Bench 02: Fill a small 1D array (1000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =   1000 [-]
          Min / Max                                    t =   1110 [us]

Bench 03: Fill a big 1D array (50000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =    100 [-]
          Min / Max                                    t =    107 [us]

Bench 04: Compute the integer atan2 using the CORDIC
          algorithm
          Number of tests                              n =   1000 [-]
          1000 x atan2(y, x)                           t =   1088 [us]


Bench for the Nucleo_N657
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
          the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =     29 [us]
          X projection                                 t =    173 [us]
          Y projection                                 t =    167 [us]
          Histogram                                    t =    330 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
          in the internal memory. Then, compute the
          X-Y projections and the histogram.
          Fill the array                               t =    400 [us]
          X projection                                 t =   2766 [us]
          Y projection                                 t =   2640 [us]
          Histogram                                    t =   5369 [us]

Bench 02: Fill a small 1D array (1000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =   1000 [-]
          Min / Max                                    t =   3127 [us]

Bench 03: Fill a big 1D array (50000) elements in
          the internal memory with a random pattern.
          Then, compute the min / max values.
          Number of tests                              n =    100 [-]
          Min / Max                                    t =    323 [us]

Bench 04: Compute the integer atan2 using the CORDIC
          algorithm
          Number of tests                              n =   1000 [-]
          1000 x atan2(y, x)                           t =   2726 [us]


As shown, the N6 performance is not acceptable!
The clock measurements on MCO2 match the PLL values, so some other factor
(not yet clearly identified) may be slowing down code execution.
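To put the numbers in perspective, a rough cycles-per-element estimate can be derived from the measured times and the CPU clock. The sketch below is only an illustration: the helper name `cycles_per_element` is hypothetical, and it assumes that each step touches every element of the array exactly once and that the reported time covers only that loop.

```c
#include <stdint.h>

/* Rough estimate of CPU cycles spent per array element.
 * time_us     : measured time in microseconds (from the bench output)
 * f_cpu_mhz   : CPU clock in MHz (cycles per microsecond)
 * n_elements  : number of elements processed (assumed: one pass over the array)
 * Hypothetical helper, not part of the benchmark code. */
static inline double cycles_per_element(double time_us,
                                        double f_cpu_mhz,
                                        uint32_t n_elements)
{
    return (time_us * f_cpu_mhz) / (double)n_elements;
}
```

For the Bench 01 histogram (200 x 200 = 40000 elements), this gives about 5.4 cycles per element on the H753 (451 us at 480 MHz) and about 80.5 cycles per element on the N657 (5369 us at 600 MHz). Under the stated assumptions, a per-element cost of that size looks more like a memory access penalty than a clock problem.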

Any suggestions on how to get better results on the N6?
Kind regards,
Edo

 

30 REPLIES

@Franzi.Edo wrote:

I had initially specified the RAM as Shareable, which, for reasons that are not entirely clear, degraded memory performance.


Shareable means the region is treated as non-cacheable, even if you configure it as cacheable.
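As a rough illustration of where this attribute lives, here is a minimal sketch of how the Armv8-M MPU region registers encode shareability. It only builds the register values and does not touch hardware. The helper names (`mpu_rbar`, `mpu_rlar`, `mpu_rbar_shareability`) are hypothetical, and the addresses in the usage note are placeholders, not the actual N657 RAM map.

```c
#include <stdint.h>

/* MPU_RBAR layout (Armv8-M): BASE[31:5], SH[4:3], AP[2:1], XN[0]. */
enum { SH_NON = 0x0u, SH_OUTER = 0x2u, SH_INNER = 0x3u };
enum { AP_RW_PRIV = 0x0u, AP_RW_ANY = 0x1u, AP_RO_PRIV = 0x2u, AP_RO_ANY = 0x3u };

/* MAIR attribute bytes for Normal memory (outer[7:4], inner[3:0]). */
#define MAIR_NORMAL_WB_RWA 0xFFu /* write-back, read/write allocate */
#define MAIR_NORMAL_NC     0x44u /* non-cacheable */

/* Build an MPU_RBAR value; base must be 32-byte aligned. */
static inline uint32_t mpu_rbar(uint32_t base, uint32_t sh,
                                uint32_t ap, uint32_t xn)
{
    return (base & 0xFFFFFFE0u) | ((sh & 3u) << 3) | ((ap & 3u) << 1) | (xn & 1u);
}

/* Build an MPU_RLAR value: LIMIT[31:5], AttrIndx[3:1], EN[0] = 1. */
static inline uint32_t mpu_rlar(uint32_t limit, uint32_t attr_index)
{
    return (limit & 0xFFFFFFE0u) | ((attr_index & 7u) << 1) | 1u;
}

/* Extract the SH field, e.g. to check an existing configuration. */
static inline uint32_t mpu_rbar_shareability(uint32_t rbar)
{
    return (rbar >> 3) & 3u;
}
```

With a placeholder region at 0x20000000, `mpu_rbar(0x20000000u, SH_NON, AP_RW_ANY, 1u)` yields 0x20000003, while passing `SH_INNER` instead yields 0x2000001B. On a real target these values would be written to `MPU->RBAR` and `MPU->RLAR` after selecting the region in `MPU->RNR`, with the matching attribute byte (for example `MAIR_NORMAL_WB_RWA`) placed in `MPU->MAIR0`. Keeping SH at non-shareable is the part that matters for the behaviour described in this reply.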
