STM32H723: How to optimize summation of an array?

yonatan · ‎2025-05-04

Hi folks.

I am trying to optimize the following piece of code for execution time.

	for (uint32_t i = 6 + adc_data_index; i < 35 + adc_data_index; i++)
	{
		raw[0] += (adc_data[i]);
		raw[1] += (adc_data[i + 35]);
		raw[2] += (adc_data[i + 70]);
		raw[3] += (adc_data[i + 115]);
	}

Right now it takes 3.5 microseconds at a 250 MHz clock.

I want to reduce that by at least a factor of 2.

Do you have any ideas?

What I tried:

1. Changing the optimization level to -Ofast

2. Using pointers

3. I also thought about using the FMAC or DFSDM peripherals

How can I achieve that?

Thanks

Yonatan

TDK · ‎2025-05-04

> 3.5 micro-second at 250 MHz clock

So 875 cycles and you're doing 116 (4*29) summations, which is about 7.5 cycles per summation. Probably some improvement to be made.

Storing raw and adc_data in DTCMRAM will help.

Enabling data cache if not already enabled will help a lot.

Executing the function out of ITCMRAM will also help.

Looking at the disassembly will be the most useful here to understand what the compiler is doing and to see what is unnecessary. That can help guide you to the right solution. I imagine using a pointer for access and comparing the loop variable to a pointer constant rather than 35 + X will help a bit.
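
A minimal sketch of the pointer idea above, assuming C99. The loop bound is an end pointer computed once, and the four sums are kept in local variables so the compiler may hold them in registers instead of storing to raw[] on every pass. The name sum_windows and the ch_offset parameter are hypothetical (not from the original code); the channel offsets are passed in rather than hard-coded. Whether this beats the indexed loop depends on what the compiler emits, so it is worth checking the disassembly for both.

```c
#include <stdint.h>

#define SUM_FIRST  6u   /* first sample of each window, from the original loop */
#define SUM_COUNT  29u  /* samples per window: indices 6..34 */

/* Hypothetical helper: adds four SUM_COUNT-sample windows of adc_data
   to raw[0..3]. ch_offset[] is the start of each channel block, in samples. */
void sum_windows(const uint16_t *adc_data, uint32_t adc_data_index,
                 const uint32_t ch_offset[4], uint32_t raw[4])
{
    const uint16_t *p = adc_data + adc_data_index + SUM_FIRST;
    const uint16_t *const end = p + SUM_COUNT;   /* loop bound computed once */
    const uint32_t o0 = ch_offset[0], o1 = ch_offset[1];
    const uint32_t o2 = ch_offset[2], o3 = ch_offset[3];

    /* Local accumulators: no store to raw[] inside the loop. */
    uint32_t s0 = raw[0], s1 = raw[1], s2 = raw[2], s3 = raw[3];

    for (; p != end; ++p) {
        s0 += p[o0];
        s1 += p[o1];
        s2 += p[o2];
        s3 += p[o3];
    }

    raw[0] = s0;
    raw[1] = s1;
    raw[2] = s2;
    raw[3] = s3;
}
```

Called with ch_offset = {0, 35, 70, 115}, this reads the same samples as the loop in the question.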

mbarg.1 · ‎2025-05-04

Suggestion: avoid computations inside the loop, for example by replacing i with an array prepared before running the loop, and run the loop from the start value down to zero to optimize the end-of-loop check.
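
One possible reading of this suggestion, as a hedged sketch: the four window start addresses are put into a small array before the loop, and the loop counter runs down to zero so the exit test is a compare against zero. The function name sum_windows_down and the ch_offset parameter are hypothetical. The samples are added in reverse order, which does not change an integer sum.

```c
#include <stdint.h>

/* Hypothetical helper: same sums as the original loop, counting down. */
void sum_windows_down(const uint16_t *adc_data, uint32_t adc_data_index,
                      const uint32_t ch_offset[4], uint32_t raw[4])
{
    /* Window start addresses, computed once before the loop. */
    const uint16_t *win[4];
    for (uint32_t c = 0; c < 4u; c++) {
        win[c] = adc_data + adc_data_index + 6u + ch_offset[c];
    }

    uint32_t s0 = raw[0], s1 = raw[1], s2 = raw[2], s3 = raw[3];

    /* i takes the values 28, 27, ..., 0 inside the body. */
    uint32_t i = 29u;
    while (i-- != 0u) {
        s0 += win[0][i];
        s1 += win[1][i];
        s2 += win[2][i];
        s3 += win[3][i];
    }

    raw[0] = s0;
    raw[1] = s1;
    raw[2] = s2;
    raw[3] = s3;
}
```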

yonatan · ‎2025-05-04

Thanks @mbarg.1

What do you mean by "replacing i with an array"?

yonatan · ‎2025-05-04

Thanks.

1. Does enabling the I/D cache have any downsides?

2. Should I protect the adc_data buffer with the MPU? Is this mandatory?

3. Does placing adc_data in DTCM eliminate the need to use the MPU? Is DTCM always protected from cache issues?

mbarg.1 · ‎2025-05-05

Cache will speed up execution, BUT you must manage it. It is up to you to decide whether the extra load and complexity are a pro or a con.

Protecting data is application dependent. ADC samples are typically primitives, i.e. uint16_t values that cannot be invalid, but you may need the whole set to be valid before processing. Again, it is up to you to decide.

LCE · ‎2025-05-05

A mix of all of the above might help - although I'm afraid of caches... :D

But you probably use the ADC with DMA, so the ADC buffer cannot be placed in DTCM.

So I would try:

uint16_t *pu16Adat0 = &adc_data[adc_data_index + 6 + 0];    // pointer type must be same as adc_data!
uint16_t *pu16Adat1 = &adc_data[adc_data_index + 6 + 35];
uint16_t *pu16Adat2 = &adc_data[adc_data_index + 6 + 70];
uint16_t *pu16Adat3 = &adc_data[adc_data_index + 6 + 105];   // or is it really "115" ?

for( uint32_t i = 0; i < 29; i++ )
{
  raw[0] += pu16Adat0[i];
  raw[1] += pu16Adat1[i];
  raw[2] += pu16Adat2[i];
  raw[3] += pu16Adat3[i];
}

Interesting to see if using pointers and incrementing these might speed things up, like

raw[0] += *(pu16Adat0++);
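
Spelled out for all four channels, the post-increment form could look like the sketch below. The function name sum_windows_postinc and its parameters are hypothetical. Thumb-2 has post-indexed halfword loads, so each statement can in principle become one load plus one add, but whether the compiler generates that is something to confirm in the disassembly.

```c
#include <stdint.h>

/* Hypothetical helper: p0..p3 point at the first sample of each window. */
void sum_windows_postinc(const uint16_t *p0, const uint16_t *p1,
                         const uint16_t *p2, const uint16_t *p3,
                         uint32_t count, uint32_t raw[4])
{
    while (count-- != 0u) {
        raw[0] += *p0++;   /* read the sample, then advance the pointer */
        raw[1] += *p1++;
        raw[2] += *p2++;
        raw[3] += *p3++;
    }
}
```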

AScha.3 · ‎2025-05-05

Hi,

I-cache can always be ON, no problems. Switch it on and check the speed with -O2 or -Ofast.

But the D-cache needs some thought on your part: what to manage and how.

If you can keep the data in DTCM, you get the same speed as with the D-cache, but a similar problem: how do you get the data there?

(DMA is the problem; if you move the data with the core anyway, there is no problem.)
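
To make the D-cache management concrete: one common approach with a DMA-filled buffer in cacheable RAM is to invalidate the cache lines covering the buffer before the core reads it (the other common approach is an MPU region that makes the buffer non-cacheable). Cache maintenance works on whole 32-byte lines on the Cortex-M7, so the range has to be widened to line boundaries. The sketch below only shows that address arithmetic; cache_range_t and dcache_align_range are hypothetical names, and the CMSIS call in the comment is how the result would typically be used on the target, assuming CMSIS-Core for the Cortex-M7 is available.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u   /* Cortex-M7 data cache line size in bytes */

/* Hypothetical type: a line-aligned address range. */
typedef struct {
    uintptr_t addr;   /* start, rounded down to a cache line */
    size_t    size;   /* length in bytes, a multiple of the line size */
} cache_range_t;

/* Widen [addr, addr + size) to whole cache lines. */
cache_range_t dcache_align_range(uintptr_t addr, size_t size)
{
    const uintptr_t mask  = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    const uintptr_t start = addr & mask;
    const uintptr_t end   = (addr + size + DCACHE_LINE_SIZE - 1u) & mask;

    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}

/* On the target (assumption, not compiled here):
     cache_range_t r = dcache_align_range((uintptr_t)adc_data, sizeof adc_data);
     SCB_InvalidateDCache_by_Addr((void *)r.addr, (int32_t)r.size);
   done after the DMA transfer completes and before the summation loop. */
```

If the widened range covers other variables, invalidating it can discard their pending writes, so it is safer to align the DMA buffer itself to 32 bytes and pad it to a multiple of 32 bytes.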

yonatan · ‎2025-05-05

Thanks!

It saved me ~250 ns.

I am counting every clock.

yonatan · ‎2025-05-05

As you guessed, I use DMA, so I can't put the data in DTCM.