STM32H723: How to optimize summation of an array?

yonatan · ‎2025-05-04

Hi folks.

I am trying to optimize (by time) the following piece of code.

	for (uint32_t i = 6 + adc_data_index; i < 35 + adc_data_index; i++)
	{
		raw[0] += (adc_data[i]);
		raw[1] += (adc_data[i + 35]);
		raw[2] += (adc_data[i + 70]);
		raw[3] += (adc_data[i + 115]);
	}

Right now it takes 3.5 microseconds at a 250 MHz clock.

I want to reduce that by at least a factor of 2.

Do you have any ideas?

What I tried:

1. Change the optimization to be -Ofast

2. Using pointers instead of array indexing

3. Also, thought about FMAC and DFSDM

How can I achieve that?

Thanks

Yonatan

waclawek.jan · ‎2025-05-06

Was explicit loop unrolling already mentioned?

// The original loop runs i = 6..34, i.e. 29 terms per channel: p[0]..p[28]
uint32_t* p = &adc_data[6 + adc_data_index];
raw[0] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] +
         p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] +
         p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] +
         p[24] + p[25] + p[26] + p[27] + p[28];
p += 35;
raw[1] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] +
         p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] +
         p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] +
         p[24] + p[25] + p[26] + p[27] + p[28];
p += 35;
raw[2] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] +
         p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] +
         p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] +
         p[24] + p[25] + p[26] + p[27] + p[28];
p += 35;
raw[3] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] +
         p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] +
         p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] +
         p[24] + p[25] + p[26] + p[27] + p[28];

Look at the disassembly of the resulting code; you should see a repeating pattern of load/add instructions.

JW

PS. You can MDMA into DTCM. The idea is to gather data from peripherals into SRAM using the "normal" DMA, and then the DMA's transfer complete would trigger MDMA which would in turn move all that data to DTCM for the processor to process further.

yonatan · ‎2025-05-06

Hi @waclawek.jan

You are right, but the problem is that '6' and '35' are not known at compile time.

They are initialized at run time to their values.

Regarding the MDMA...

In general it is possible, but the DMA runs in circular mode, so I am afraid of missing some signals (interrupts etc.)

waclawek.jan · ‎2025-05-06

> the problem is that '6' and '35' is not known at compilation time

That makes things more complicated but not hopeless.

If there's a limited number of '6' and '35' variants, you can have a separate function for each combination (i.e. you compile many functions), and then choose whichever is appropriate at run time.
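A minimal sketch of that idea, assuming only a handful of window lengths ever occur. The lengths 16, 29 and 32 and the names DEFINE_SUM, sum_fn and select_sum are made up for illustration; only 29 matches the loop in the question. Each generated function has a compile-time trip count, so the compiler is free to unroll it completely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generates sum_N(): the trip count N is a compile-time constant,
 * so the optimizer can unroll the loop fully. */
#define DEFINE_SUM(N)                              \
    static uint32_t sum_##N(const uint32_t *p)     \
    {                                              \
        uint32_t s = 0;                            \
        for (size_t i = 0; i < (N); i++)           \
            s += p[i];                             \
        return s;                                  \
    }

/* Hypothetical set of window lengths; adjust to the real variants. */
DEFINE_SUM(16)
DEFINE_SUM(29)
DEFINE_SUM(32)

typedef uint32_t (*sum_fn)(const uint32_t *);

/* Returns the specialised function for len, or NULL if there is none
 * (the caller then falls back to a generic loop). */
sum_fn select_sum(size_t len)
{
    switch (len) {
    case 16: return sum_16;
    case 29: return sum_29;
    case 32: return sum_32;
    default: return NULL;
    }
}
```

The selection only has to run when the run-time values change, so the hot path is a single indirect call per channel, e.g. raw[0] = f(&adc_data[start]) with f = select_sum(end - start).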

If there are more variants than can reasonably be managed, you can use "calculated jumps" into the series of additions. A switch/case may accomplish this, but you need to check whether the compiler actually generates reasonable code for it.

nr = var35 - var6;
p = &adc_data[var6];
sum = 0;
switch(nr) {
  case 29: sum += *p++; // note the intentional fallthrough 
  case 28: sum += *p++;
  case 27: sum += *p++;
  [etc.]
}
p += whatever_remains;
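For reference, a complete, compilable version of the fragment above might look like the following. It is cut down to at most 8 terms to keep the listing short (the window in the question would need cases up to 29), and the function name sum_jump is hypothetical. Whether the compiler turns the switch into a table branch or a computed jump still has to be checked in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Sums the first nr elements at p (nr <= 8) by jumping once into a
 * straight-line run of adds; every case falls through to the next. */
uint32_t sum_jump(const uint32_t *p, unsigned nr)
{
    uint32_t sum = 0;

    assert(nr <= 8u);   /* compiled out when NDEBUG is defined */

    switch (nr) {
    case 8: sum += *p++; /* fall through */
    case 7: sum += *p++; /* fall through */
    case 6: sum += *p++; /* fall through */
    case 5: sum += *p++; /* fall through */
    case 4: sum += *p++; /* fall through */
    case 3: sum += *p++; /* fall through */
    case 2: sum += *p++; /* fall through */
    case 1: sum += *p++; /* fall through */
    default: break;      /* nr == 0: nothing to add */
    }
    return sum;
}
```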

You may also want to resort to asm here, inline or not, if C does not give enough control over the resulting code. I'm not sure whether any compiler recognizes the pattern and actually calculates the jump; most of them should at least use a table-branch instruction (TBB/TBH), but some may be stubborn and generate a chain of compare-and-branch instructions, which is useless here.

A partial unroll combined with a calculated jump can be used as a simplified, slightly slower version, too. This combination is known as Duff's device.
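A sketch of Duff's device applied to the summation, unrolled by 4. The unroll factor and the name sum_duff are arbitrary choices for illustration. The switch jumps into the middle of the loop body to handle the remainder, and the do/while then runs the remaining full passes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums count elements starting at p, 4 adds per loop pass. */
uint32_t sum_duff(const uint32_t *p, size_t count)
{
    uint32_t sum = 0;
    size_t passes = (count + 3u) / 4u;   /* loop passes, rounded up */

    assert(p != NULL || count == 0);     /* compiled out with NDEBUG */

    if (count == 0)
        return 0;

    switch (count % 4u) {                /* enter mid-body for the remainder */
    case 0: do { sum += *p++; /* fall through */
    case 3:      sum += *p++; /* fall through */
    case 2:      sum += *p++; /* fall through */
    case 1:      sum += *p++;
            } while (--passes > 0);
    }
    return sum;
}
```

For 29 terms this enters at case 1, does one add, then runs 7 full passes of 4 adds, so the loop overhead is paid 8 times instead of 29.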

Another option is to generate the code into RAM at run time, or to use self-modifying code (which may be as simple as inserting a jump out of the sequence at the appropriate place in a sequence of additions).

>> MDMA
> In general it is possible but the DMA action is in circular buffer so I am afraid of missing some signals (interrupts etc.)

I don't see why anything would get missed here, but I also don't know your whole application.

JW

LCE · ‎2025-05-06

MDMA:

If you're afraid of losing data, you could also use the DMA's half-transfer-complete interrupt and trigger the MDMA for the first half of the buffer.

And/or use the DMA's double-buffer mode (DBM).
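A host-side sketch of the half/full handling, under several assumptions: HALF_LEN, dma_buf, work_buf and the two handler names are hypothetical, the sample type is a guess, and memcpy stands in for the MDMA transfer. On the real part the two handlers would be called from the DMA half-transfer and transfer-complete events.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HALF_LEN 4u                  /* hypothetical: samples per buffer half */

uint16_t dma_buf[2u * HALF_LEN];     /* stand-in for the circular DMA target */
uint16_t work_buf[HALF_LEN];         /* stand-in for the working copy in DTCM */

/* Copies one half out of the circular buffer.
 * On hardware an MDMA transfer could replace this memcpy. */
static void take_half(unsigned half)
{
    assert(half < 2u);               /* compiled out when NDEBUG is defined */
    memcpy(work_buf, &dma_buf[half * HALF_LEN], sizeof work_buf);
}

/* Half-transfer event: the first half is stable while the DMA
 * keeps filling the second half. */
void on_half_complete(void)
{
    take_half(0);
}

/* Transfer-complete event: the second half is stable while the DMA
 * wraps around and refills the first half. */
void on_full_complete(void)
{
    take_half(1);
}
```

Each half has to be consumed before the DMA comes back around to it, which gives half a buffer period of slack.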