
Need help understanding performance issue on STM32H743

fbar
Senior

I'm struggling to understand why I'm seeing a huge performance hit by slightly changing code in a loop. I narrowed down the problem to a manageable code size, and I'm hoping someone here can suggest how to investigate and resolve my issue.

I'm using a NUCLEO-H743ZI2 board, clock set to 480MHz, and my toolchain is STM32CubeIDE 1.0.2 with the latest HAL libraries for the H743. Everything else running on that board works as I would expect.

I am using the CMSIS arm_correlate_q15 function to cross-correlate two signals I capture with the ADC. The CMSIS function takes two Q15 arrays as input and outputs a third array of correlation values, twice as long as the longer input. Even though the library uses q63_t math internally, the output array is scaled down to q15_t. I'm using pretty big arrays (12288 elements), and the returned array has too many values equal to the maximum possible q15 value, making it hard to pinpoint the exact correlation point. Also, due to the big array sizes, having an additional output array twice as big as the input uses too much memory. I plan to store the arrays in DTCM RAM for performance reasons (and, yes, my ADC DMA goes into RAM_D1, then gets moved into DTCM RAM).
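
For context, the RAM_D1 to DTCM hand-off looks roughly like this (a minimal sketch, assuming D-cache is enabled and the DMA buffer is 32-byte aligned and a multiple of 32 bytes long; adcDmaBuf is a hypothetical name, aADC2_CH5 is one of my real buffers):

#include <string.h>
#include "stm32h7xx.h"   /* CMSIS core: SCB_InvalidateDCache_by_Addr() */
#include "arm_math.h"    /* q15_t */

extern q15_t adcDmaBuf[12288];   /* ADC DMA destination, placed in RAM_D1 */
extern q15_t aADC2_CH5[12288];   /* working copy in DTCM RAM              */

void fetch_adc_block(void)
{
    /* D-cache may hold stale lines for the DMA region: invalidate before the CPU reads it */
    SCB_InvalidateDCache_by_Addr((uint32_t *)adcDmaBuf, sizeof(adcDmaBuf));
    memcpy(aADC2_CH5, adcDmaBuf, sizeof(adcDmaBuf));
}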

So I modified the CMSIS function to return the location of the maximum value instead of an array, comparing each intermediate result against a "max" variable and storing the value and location each time I find a new maximum. I expected that to be, if anything, the same speed or faster than first calling arm_correlate_q15 followed by arm_max_q15 to find the maximum.
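
For reference, the two-step version I'm comparing against looks like this (a sketch using the standard CMSIS-DSP calls; corrOut is a hypothetical scratch buffer sized 2 * max(srcALen, srcBLen) - 1 as per the CMSIS documentation):

#include "arm_math.h"   /* CMSIS-DSP: arm_correlate_q15(), arm_max_q15() */

extern q15_t aADC1_CH3[4096];           /* capture buffers from my project */
extern q15_t aADC2_CH5[12288];
static q15_t corrOut[2 * 12288 - 1];    /* hypothetical scratch buffer */

/* Two-step reference: full correlation, then a separate max search. */
static void correlate_then_max(q15_t *pMaxVal, uint32_t *pMaxIdx)
{
    arm_correlate_q15(aADC1_CH3, 4096, aADC2_CH5, 12288, corrOut);
    arm_max_q15(corrOut, 2U * 12288U - 1U, pMaxVal, pMaxIdx);
}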

Instead I get a massive performance hit, roughly an extra 1,500 processor cycles per loop iteration, for just a simple if statement. Clearly there is something I don't understand causing problems.

So I created a test loop using just a small part of the arm_correlate_q15 code, and I can see that just having the multiply-accumulate loop, without storing the q15 result in the destination array, gives me a baseline execution time. Storing the result as q15_t in the array and then comparing the just-stored value to a "max" variable adds ~10 clock cycles to each loop iteration, which makes perfect sense to me. If I instead compare the q63_t result to a q63_t max value (without storing anything), I get a ~1550 clock cycle hit per loop. Now, I would expect a 64-bit if statement to take a bit longer than a 16- or 32-bit one, but not that much of a hit.

I'm enclosing the code in question, with the 3 options commented out.

int looptest( q15_t * pSrcA,
  uint32_t srcALen,
  q15_t * pSrcB,
  uint32_t srcBLen, q15_t * pDst)
{
 
	q15_t *pIn1;                                   /* inputA pointer               */
	q15_t *pIn2;                                   /* inputB pointer               */
	q15_t *pOut = pDst;                            /* output pointer               */
	q63_t sum, acc0, acc1, acc2, acc3;             /* Accumulators                  */
	q15_t *px;                                     /* Intermediate inputA pointer  */
	q15_t *py;                                     /* Intermediate inputB pointer  */
	q15_t *pSrc1;                                  /* Intermediate pointers        */
	q31_t x0, x1, x2, x3, c0;                      /* temporary variables for holding input and coefficient values */
	uint32_t j, k = 0U, count, blkCnt, blockSize2, blockSize3;  /* loop counter                 */
	q63_t max = 0;  			/* value of max value */
	q15_t *maxLoc;			 /* location of max value */
 
	/* Initialization of inputA pointer */
	pIn1 = (pSrcB);
 
	/* Initialization of inputB pointer */
	pIn2 = (pSrcA);
 
	/* srcBLen is always considered as shorter or equal to srcALen */
	j = srcBLen;
	srcBLen = srcALen;
	srcALen = j;
 
	/*  set the destination pointer to point to the last output sample */
	pOut = pDst + srcALen -1U;   
 
	blockSize3 = 4095;
	count = 4095;
 
  pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
  px = pSrc1;
 
  /* Working pointer of inputB */
  py = pIn2;
 
  while (blockSize3 > 0U)
  {
    /* Accumulator is made zero for every iteration */
    sum = 0;
 
    /* Apply loop unrolling and compute 4 MACs simultaneously. */
    k = count >> 2U;
 
    /* First part of the processing with loop unrolling.  Compute 4 MACs at a time.
     ** a second loop below computes MACs for the remaining 1 to 3 samples. */
    while (k > 0U)
    {
      /* Perform the multiply-accumulates */
      /* sum += x[srcALen - srcBLen + 4] * y[3] , sum += x[srcALen - srcBLen + 3] * y[2] */
      sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
      /* sum += x[srcALen - srcBLen + 2] * y[1] , sum += x[srcALen - srcBLen + 1] * y[0] */
      sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
 
      /* Decrement the loop counter */
      k--;
    }
 
    /* If the count is not a multiple of 4, compute any remaining MACs here.
     ** No loop unrolling is used. */
    k = count % 0x4U;
 
    while (k > 0U)
    {
      /* Perform the multiply-accumulates */
      sum = __SMLALD(*px++, *py++, sum);
 
      /* Decrement the loop counter */
      k--;
    }
 
    // nothing in this section, 28,928,560 clock cycles
 
    // using q15_t 28,973,622 clock cycles, ~11 extra cycles per loop
/*	 *pOut = (q15_t) (__SSAT(sum >> 15, 16));
 
		if ( *pOut > max)
		{
			max = *pOut;
			maxLoc = pOut;
		}*/
 
    // using sum directly 35,269,175 clock cycles, ~1540 extra cycles per loop
/*    if (sum > max)
    {
    	max = sum;
    	maxLoc = pOut;
    }*/
 
    /* Destination pointer is updated according to the address modifier, inc */
    pOut -= 1;
 
    /* Update the inputA and inputB pointers for next MAC calculation */
    px = ++pSrc1;
    py = pIn2;
 
    /* Decrement the MAC count */
    count--;
 
    /* Decrement the loop counter */
    blockSize3--;
  }
  return (maxLoc - pDst);
}

I call the function with the following:

DWT->CYCCNT = 0;
int cycleCount = DWT->CYCCNT;
value2 = looptest(aADC1_CH3, 4096, aADC2_CH5, 12288, dout);
value4 = DWT->CYCCNT - cycleCount;
printf("Clock cycles %d\r\n", value4);

With the if statements commented out (as per the code above) I get a baseline of ~28.928M clock cycles. Enabling the block below, I get ~28.973M clock cycles, roughly 11 extra clocks per loop:

*pOut = (q15_t) (__SSAT(sum >> 15, 16));

if (*pOut > max)
{
    max = *pOut;
    maxLoc = pOut;
}

Enabling this part instead, I get an execution time of ~35.269M cycles, a hit of ~1540 clock cycles per loop compared to baseline:

if (sum > max)
{
    max = sum;
    maxLoc = pOut;
}

In all cases, the HAL Tick is disabled to ensure nothing else happens in the background. There are no other interrupts enabled. The cycles above were measured using -ODebug optimizations, but the relative differences happen even when compiled with -OFast (which is how I plan to compile the final code).

I'm at a loss as to how to further investigate this and figure out a way to improve performance... any suggestion is appreciated, including pointing out how much of an idiot I am for having missed the obvious 😀


5 REPLIES

The processor does not execute C code, so you should start with comparing the disasm for the various cases.

IMO the compiler simply hit the point where due to register pressure it started to swap variables in/out of memory; plus maybe the stack is not in the best place.

I don't use the 'H7.

JW

embeddedt
Associate II

As @Community member said, the issue is likely in the generated assembly. If it does turn out to be variables being pushed to memory, try putting your stack in DTCM (if it's not already there), which should be much faster.
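
One quick way to check where the stack currently lives (a sketch; per the H743 memory map, DTCM is 0x2000_0000-0x2001_FFFF and AXI SRAM / RAM_D1 starts at 0x2400_0000):

#include <stdio.h>
#include <stdint.h>

/* Print a local's address: the region it falls in is where the linker put the stack. */
static void where_is_my_stack(void)
{
    uint32_t probe;
    printf("stack is near 0x%08lX\r\n", (unsigned long)(uintptr_t)&probe);
}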

First of all, thanks for taking the time to answer. And, yes, I'm very familiar with how C gets turned into assembly (having written more hand tuned Z80 code than I care to remember when toolchains were simply not optimized enough). It's just that I'm not familiar at all with ARM assembly, and I was hoping for a suggestion that didn't require digging into it, since I'm basically useless at that point.

I also tried compiling my 3 scenarios with -Onone, and the performance (while super slow in general) is predictable: with no if statement it takes 128.16M cycles, with the q15 comparison 128.27M, and with the q63_t comparison 128.12M cycles, even faster since there's no __SSAT operation. The assembly is also much easier to read, but it's far too slow to use for the final build.

I dumped the relevant assembly code for all 3 cases, with -OFast, and here they are.

I'm really not asking anyone to pore over the asm and find the problem; I'm more after suggestions on how to try and work around it. It seems I have hit an optimization issue, and I have no idea how to get past it. Hand-crafting assembly is clearly not an option for me, so I'm at a dead end.

everything commented out
Clock cycles 24189027       50.393 msec
          looptest:
080063a4:   subs    r3, r3, r1
080063a6:   movw    r12, #4095      ; 0xfff
080063aa:   adds    r3, #1
1283      {
080063ac:   stmdb   sp!, {r4, r5, r6, r7, r8, r9, lr}
080063b0:   ldr     r7, [sp, #28]
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063b2:   add.w   r2, r2, r3, lsl #1
1330          while (k > 0U)
080063b6:   movs.w  r3, r12, lsr #2
080063ba:   beq.n   0x800643a <looptest+150>
080063bc:   mov.w   r8, r3, lsl #3
080063c0:   add.w   r1, r2, #8
1323          sum = 0;
080063c4:   movs    r3, #0
080063c6:   add.w   r4, r0, #8
080063ca:   add.w   lr, r1, r8
080063ce:   mov     r9, r3
1334            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063d0:   ldr.w   r5, [r1, #-8]
080063d4:   ldr.w   r6, [r4, #-8]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063d8:   smlald  r3, r9, r5, r6
1336            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063dc:   ldr.w   r5, [r1, #-4]
080063e0:   ldr.w   r6, [r4, #-4]
080063e4:   smlald  r3, r9, r5, r6
080063e8:   adds    r1, #8
080063ea:   adds    r4, #8
080063ec:   cmp     lr, r1
080063ee:   bne.n   0x80063d0 <looptest+44>
080063f0:   add.w   r1, r2, r8
080063f4:   add     r8, r0
1346          while (k > 0U)
080063f6:   ands.w  r4, r12, #3
080063fa:   beq.n   0x8006428 <looptest+132>
1349            sum = __SMLALD(*px++, *py++, sum);
080063fc:   ldrsh.w r5, [r1]
08006400:   ldrsh.w r6, [r8]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006404:   smlald  r3, r9, r5, r6
1346          while (k > 0U)
08006408:   cmp     r4, #1
0800640a:   beq.n   0x8006428 <looptest+132>
1349            sum = __SMLALD(*px++, *py++, sum);
0800640c:   ldrsh.w r5, [r1, #2]
08006410:   ldrsh.w r6, [r8, #2]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006414:   smlald  r3, r9, r5, r6
1346          while (k > 0U)
08006418:   cmp     r4, #2
0800641a:   beq.n   0x8006428 <looptest+132>
1349            sum = __SMLALD(*px++, *py++, sum);
0800641c:   ldrsh.w r1, [r1, #4]
08006420:   ldrsh.w r4, [r8, #4]
08006424:   smlald  r3, r9, r1, r4
1320        while (blockSize3 > 0U)
08006428:   subs.w  r12, r12, #1
1377          px = ++pSrc1;
0800642c:   add.w   r2, r2, #2
1320        while (blockSize3 > 0U)
08006430:   bne.n   0x80063b6 <looptest+18>
1386        return (maxLoc - pDst);
08006432:   negs    r0, r7
1387      }

Here's the version using the q15_t pointer

Clock cycles 24231942       50.483 msec
	 *pOut = (q15_t) (__SSAT(sum >> 15, 16));
 
		if ( *pOut > max)
		{
			max = *pOut;
			maxLoc = pOut;
		}
          looptest:
080063a4:   stmdb   sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
1309      	pOut = pDst + srcALen -1U;   //pOut = srcALen -1U; // start from end of output buffer, work back  //pDst + ((srcALen + srcBLen) - 2U);
080063a8:   mvn.w   r7, #2147483648 ; 0x80000000
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063ac:   subs    r1, r3, r1
1283      {
080063ae:   sub     sp, #12
1294      	q63_t max = 0;  																/* value of max value */
080063b0:   moveq   r4, #0
1309      	pOut = pDst + srcALen -1U;   //pOut = srcALen -1U; // start from end of output buffer, work back  //pDst + ((srcALen + srcBLen) - 2U);
080063b2:   add     r7, r3
1294      	q63_t max = 0;  																/* value of max value */
080063b4:   movs    r3, #0
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063b6:   movw    r6, #4095       ; 0xfff
080063ba:   strd    r3, r4, [sp]
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063be:   adds    r3, r1, #1
1309      	pOut = pDst + srcALen -1U;   //pOut = srcALen -1U; // start from end of output buffer, work back  //pDst + ((srcALen + srcBLen) - 2U);
080063c0:   ldr     r1, [sp, #48]   ; 0x30
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063c2:   add.w   r2, r2, r3, lsl #1
080063c6:   add.w   r7, r1, r7, lsl #1
1330          while (k > 0U)
080063ca:   movs.w  r10, r6, lsr #2
080063ce:   mov     lr, r7
080063d0:   beq.n   0x8006484 <looptest+224>
080063d2:   mov.w   r8, r10, lsl #3
080063d6:   add.w   r3, r2, #8
1323          sum = 0;
080063da:   mov.w   r10, #0
080063de:   addeq.w r1, r0, #8
080063e2:   add.w   r9, r8, r3
080063e6:   mov     r11, r10
1334            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063e8:   ldr.w   r4, [r3, #-8]
080063ec:   ldr.w   r5, [r1, #-8]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063f0:   smlald  r10, r11, r4, r5
1336            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063f4:   ldr.w   r4, [r3, #-4]
080063f8:   ldr.w   r5, [r1, #-4]
080063fc:   smlald  r10, r11, r4, r5
08006400:   adds    r3, #8
08006402:   adds    r1, #8
08006404:   cmp     r9, r3
08006406:   bne.n   0x80063e8 <looptest+68>
08006408:   add.w   r1, r2, r8
0800640c:   add.w   r3, r0, r8
1346          while (k > 0U)
08006410:   ands.w  r4, r6, #3
08006414:   beq.n   0x8006442 <looptest+158>
1349            sum = __SMLALD(*px++, *py++, sum);
08006416:   ldrsh.w r5, [r1]
0800641a:   ldrsh.w r8, [r3]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
0800641e:   smlald  r10, r11, r5, r8
1346          while (k > 0U)
08006422:   cmp     r4, #1
08006424:   beq.n   0x8006442 <looptest+158>
1349            sum = __SMLALD(*px++, *py++, sum);
08006426:   ldrsh.w r5, [r1, #2]
0800642a:   ldrsh.w r8, [r3, #2]
0800642e:   smlald  r10, r11, r5, r8
1346          while (k > 0U)
08006432:   cmp     r4, #2
08006434:   beq.n   0x8006442 <looptest+158>
1349            sum = __SMLALD(*px++, *py++, sum);
08006436:   ldrsh.w r1, [r1, #4]
0800643a:   ldrsh.w r3, [r3, #4]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
0800643e:   smlald  r10, r11, r1, r3
1358      	 *pOut = (q15_t) (__SSAT(sum >> 15, 16));
08006442:   mov.w   r10, r10, lsr #15
08006446:   orr.w   r10, r10, r11, lsl #17
0800644a:   ssat    r10, #16, r10
0800644e:   sxth.w  r10, r10
1360      		if ( *pOut > max)
08006452:   ldrd    r8, r9, [sp]
08006456:   sxth.w  r4, r10
1358      	 *pOut = (q15_t) (__SSAT(sum >> 15, 16));
0800645a:   strh.w  r10, [r7], #-2
1360      		if ( *pOut > max)
0800645e:   asrs    r5, r4, #31
08006460:   cmp     r8, r4
08006462:   sbcs.w  r3, r9, r5
08006466:   bge.n   0x800646e <looptest+202>
08006468:   mov     r12, lr
0800646a:   strd    r4, r5, [sp]
1320        while (blockSize3 > 0U)
0800646e:   subs    r6, #1
1377          px = ++pSrc1;
08006470:   add.w   r2, r2, #2
1320        while (blockSize3 > 0U)
08006474:   bne.n   0x80063ca <looptest+38>
1386        return (maxLoc - pDst);
08006476:   ldr     r3, [sp, #48]   ; 0x30
08006478:   sub.w   r0, r12, r3
1387      }

(to be continued...)

and lastly the sum comparison

Clock cycles 28931727       60.274 msec
    if (sum > max)
    {
    	max = sum;
    	maxLoc = pOut;
    }
looptest:
080063a4:   stmdb   sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063a8:   subs    r1, r3, r1
1283      {
080063aa:   sub     sp, #20
1309      	pOut = pDst + srcALen -1U;   //pOut = srcALen -1U; // start from end of output buffer, work back  //pDst + ((srcALen + srcBLen) - 2U);
080063ac:   mvn.w   r8, #2147483648 ; 0x80000000
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063b0:   movw    r12, #4095      ; 0xfff
1294      	q63_t max = 0;  																/* value of max value */
080063b4:   mov.w   r10, #0
080063b8:   mov.w   r11, #0
1309      	pOut = pDst + srcALen -1U;   //pOut = srcALen -1U; // start from end of output buffer, work back  //pDst + ((srcALen + srcBLen) - 2U);
080063bc:   add     r8, r3
080063be:   adds    r3, r1, #1
080063c0:   ldr     r1, [sp, #56]   ; 0x38
1314        pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063c2:   mov     r9, r0
080063c4:   add.w   r2, r2, r3, lsl #1
080063c8:   add.w   r8, r1, r8, lsl #1
1330          while (k > 0U)
080063cc:   movs.w  r6, r12, lsr #2
080063d0:   beq.n   0x800647c <looptest+216>
080063d2:   lsls    r6, r6, #3
080063d4:   add.w   r4, r2, #8
080063d8:   add.w   r5, r9, #8
1323          sum = 0;
080063dc:   movs    r0, #0
080063de:   movs    r1, #0
080063e0:   adds    r7, r6, r4
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063e2:   mov     r3, r0
080063e4:   mov     lr, r1
080063e6:   ldr.w   r0, [r5, #-8]
080063ea:   ldr.w   r1, [r4, #-8]
080063ee:   smlald  r3, lr, r1, r0
1336            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063f2:   ldr.w   r0, [r4, #-4]
080063f6:   ldr.w   r1, [r5, #-4]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063fa:   smlald  r3, lr, r0, r1
080063fe:   adds    r4, #8
08006400:   mov     r0, r3
08006402:   adds    r5, #8
08006404:   mov     r1, lr
1330          while (k > 0U)
08006406:   cmp     r4, r7
08006408:   bne.n   0x80063e2 <looptest+62>
0800640a:   adds    r5, r2, r6
0800640c:   add     r6, r9
0800640e:   strd    r0, r1, [sp]
1346          while (k > 0U)
08006412:   ands.w  r7, r12, #3
08006416:   beq.n   0x800644c <looptest+168>
1349            sum = __SMLALD(*px++, *py++, sum);
08006418:   ldrsh.w lr, [r5]
0800641c:   ldr     r4, [sp, #0]
0800641e:   ldrsh.w r0, [r6]
08006422:   ldr     r3, [sp, #4]
08006424:   smlald  r4, r3, lr, r0
1346          while (k > 0U)
08006428:   cmp     r7, #1
0800642a:   beq.n   0x8006448 <looptest+164>
1349            sum = __SMLALD(*px++, *py++, sum);
0800642c:   ldrsh.w r1, [r5, #2]
08006430:   ldrsh.w r0, [r6, #2]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006434:   smlald  r4, r3, r1, r0
1346          while (k > 0U)
08006438:   cmp     r7, #2
0800643a:   beq.n   0x8006448 <looptest+164>
1349            sum = __SMLALD(*px++, *py++, sum);
0800643c:   ldrsh.w r1, [r5, #4]
08006440:   ldrsh.w r0, [r6, #4]
1931        __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006444:   smlald  r4, r3, r1, r0
08006448:   strd    r4, r3, [sp]
1367          if (sum > max)
0800644c:   ldrd    r0, r1, [sp]
08006450:   cmp     r10, r0
08006452:   sbcs.w  r3, r11, r1
08006456:   bge.n   0x8006460 <looptest+188>
08006458:   mov     r10, r0
0800645a:   mov     r11, r1
0800645c:   str.w   r8, [sp, #12]
1320        while (blockSize3 > 0U)
08006460:   subs.w  r12, r12, #1
1374          pOut -= 1;
08006464:   sub.w   r8, r8, #2
1377          px = ++pSrc1;
08006468:   add.w   r2, r2, #2
1320        while (blockSize3 > 0U)
0800646c:   bne.n   0x80063cc <looptest+40>
1386        return (maxLoc - pDst);
0800646e:   ldr     r3, [sp, #12]
08006470:   ldr     r2, [sp, #56]   ; 0x38
08006472:   subs    r0, r3, r2
1387      }

fbar
Senior

As a follow-up to my issue:

I could find no way to further optimize the correlation function when an if statement is included in the loop. The compiler optimizes a lot of variables out, and as soon as I try to compare values and return something, multiple variables are reintroduced, adding thousands of cycles per iteration. The perf hit of trying to determine the max value on the fly is simply too high for my case.

So I modified the arm_correlate_q15() function to return an array of q63_t values instead, and stored that array in RAM_D1, of which I have plenty. It turns out that the performance penalty for having that q63_t array in RAM_D1 instead of DTCM RAM is less than 2 clock cycles per iteration, thanks no doubt to the H7 cache and optimized architecture.

So, in my case, it's vastly preferable to use extra RAM_D1 to store a temporary q63_t array instead of trying to perform on-the-fly calculations. All the other arrays are still stored in DTCM RAM.
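
In case it helps anyone else, the placement boils down to something like this sketch (".ram_d1_data" is a hypothetical output-section name that has to exist in, or be added to, the linker script's RAM_D1 region; the size assumes my 12288-sample inputs and the 2 * max - 1 output length from the CMSIS docs):

#include "arm_math.h"   /* q63_t */

/* Temporary q63_t correlation output in RAM_D1 (~192 KB); the inputs stay in DTCM. */
__attribute__((section(".ram_d1_data"), aligned(8)))
static q63_t corrOut64[2U * 12288U - 1U];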

I have to say that I'm surprised by how small an effect DTCM RAM has in my particular case.

Thanks to @Community member and @embeddedt for the pointers that, in the end, helped solve the issue.