Need help understanding performance issue on STM32H743
I'm struggling to understand why I'm seeing a huge performance hit by slightly changing code in a loop. I narrowed down the problem to a manageable code size, and I'm hoping someone here can suggest how to investigate and resolve my issue.
I'm using a NUCLEO-H743ZI2 board with the clock set to 480 MHz, and my toolchain is STM32CubeIDE 1.0.2 with the latest HAL libraries for the H743. Everything else running on that board works as I would expect.
I am using the CMSIS arm_correlate_q15 function to cross-correlate two signals I capture with the ADC. The function takes two Q15 arrays as input and outputs a third array, twice as long as the longer input, holding the correlation values. Even though the library uses q63_t math internally, the output array is scaled down to q15_t. I'm using pretty big arrays (12288 elements), and the returned array has too many values equal to the q15 maximum, making it hard to pinpoint the exact correlation point. Also, with arrays this large, an additional output array twice as big as the input uses too much memory. (I plan to store the arrays in DTCM RAM for performance reasons, and yes, my ADC DMA goes into RAM_D1 and the data then gets moved into DTCM RAM.)
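To illustrate the saturation problem, here is a plain-C sketch of the final scaling step, written in the code further down as (q15_t) __SSAT(sum >> 15, 16). The helper name sat_q63_to_q15 is hypothetical, and the sketch clamps the full 64-bit shifted value, whereas the real intrinsic works on a 32-bit argument, so treat it as a model rather than the CMSIS implementation.

```c
#include <stdint.h>

/* Hypothetical helper: plain-C model of (q15_t)__SSAT(sum >> 15, 16).
 * Assumption: the full 64-bit shifted value is clamped; the real
 * intrinsic takes a 32-bit argument. */
static int16_t sat_q63_to_q15(int64_t sum)
{
    int64_t v = sum >> 15;   /* arithmetic shift on GCC/Clang */

    if (v > INT16_MAX) {
        return INT16_MAX;    /* anything above 32767 collapses to 32767 */
    }
    if (v < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)v;
}
```

Two very different accumulator values, for example 1 << 31 and 1 << 40, both come out as 32767 here, which is why the peak gets lost in the q15 output.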
So I modified the CMSIS function to return the location of the maximum value instead of an array, by comparing each intermediate result with a "max" variable and storing the new value and its location each time I find a new max. I expected that to be, if anything, the same speed or faster than calling arm_correlate_q15 followed by arm_max_q15 to find the max value.
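The idea, stripped of the CMSIS unrolling and SIMD intrinsics, looks roughly like the sketch below. The name correlate_peak_q15 is hypothetical and not part of CMSIS-DSP. For simplicity it assumes bLen <= aLen and only scans lags where the shorter signal fits entirely inside the longer one, which is narrower than what arm_correlate_q15 covers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CMSIS-DSP.
 * Returns the lag with the largest correlation sum and, optionally,
 * the 64-bit sum itself, without writing an output array.
 * Assumption: bLen <= aLen, and only fully overlapping lags are scanned. */
static size_t correlate_peak_q15(const int16_t *a, size_t aLen,
                                 const int16_t *b, size_t bLen,
                                 int64_t *peak)
{
    size_t best_lag = 0;
    int64_t best = INT64_MIN;

    for (size_t lag = 0; lag + bLen <= aLen; lag++) {
        int64_t sum = 0;

        /* Full-precision multiply-accumulate, no scaling to q15. */
        for (size_t i = 0; i < bLen; i++) {
            sum += (int32_t)a[lag + i] * (int32_t)b[i];
        }

        /* Track the maximum directly on the 64-bit accumulator. */
        if (sum > best) {
            best = sum;
            best_lag = lag;
        }
    }

    if (peak != NULL) {
        *peak = best;
    }
    return best_lag;
}
```

Keeping the comparison on the unscaled 64-bit sum is what avoids the saturation issue, at the cost of a 64-bit compare per output sample.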
Instead I get a massive performance hit, roughly an extra 1,500 processor cycles per loop iteration, for just a simple if statement. Clearly there is something I don't understand that is causing problems.
So I created a test loop using just a small part of the arm_correlate_q15 code. With just the multiplication loop, without storing the q15 result in the destination array, I get a baseline execution time. If I store the result as q15_t in the array and then compare the just-stored value to a "max" variable, I add ~10 clock cycles to each loop iteration, which makes perfect sense to me. If I instead compare the q63_t result to a q63_t max value (without storing anything), I get a hit of ~1550 clock cycles per loop. Now, I would expect a 64-bit if statement to take a bit longer than a 16- or 32-bit one, but not that much of a hit.
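For context on that expectation: on a 32-bit core, a signed 64-bit comparison can be thought of as comparing the high words first and the low words only on a tie. The sketch below spells that out in portable C. The name gt64_by_halves is hypothetical, and a real compiler will typically emit a short instruction sequence rather than this exact code, so it only illustrates why a handful of cycles seems like the right order of magnitude.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration: signed 64-bit "a > b" built from 32-bit halves.
 * Assumption: >> on a negative int64_t is an arithmetic shift (GCC/Clang). */
static bool gt64_by_halves(int64_t a, int64_t b)
{
    int32_t  a_hi = (int32_t)(a >> 32);   /* signed high word   */
    int32_t  b_hi = (int32_t)(b >> 32);
    uint32_t a_lo = (uint32_t)a;          /* unsigned low word  */
    uint32_t b_lo = (uint32_t)b;

    if (a_hi != b_hi) {
        return a_hi > b_hi;               /* high words decide  */
    }
    return a_lo > b_lo;                   /* tie: low words decide */
}
```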
I'm enclosing the code in question, with the 3 options commented out.
int looptest(q15_t * pSrcA,
             uint32_t srcALen,
             q15_t * pSrcB,
             uint32_t srcBLen,
             q15_t * pDst)
{
    q15_t *pIn1;                                  /* inputA pointer */
    q15_t *pIn2;                                  /* inputB pointer */
    q15_t *pOut = pDst;                           /* output pointer */
    q63_t sum, acc0, acc1, acc2, acc3;            /* Accumulators */
    q15_t *px;                                    /* Intermediate inputA pointer */
    q15_t *py;                                    /* Intermediate inputB pointer */
    q15_t *pSrc1;                                 /* Intermediate pointers */
    q31_t x0, x1, x2, x3, c0;                     /* temporary variables for holding input and coefficient values */
    uint32_t j, k = 0U, count, blkCnt, blockSize2, blockSize3;  /* loop counters */
    q63_t max = 0;                                /* value of max value */
    q15_t *maxLoc = pDst;                         /* location of max value (initialized so the return value is always defined) */

    /* Initialization of inputA pointer */
    pIn1 = (pSrcB);
    /* Initialization of inputB pointer */
    pIn2 = (pSrcA);
    /* srcBLen is always considered as shorter or equal to srcALen */
    j = srcBLen;
    srcBLen = srcALen;
    srcALen = j;
    /* set the destination pointer to point to the last output sample */
    pOut = pDst + srcALen - 1U;
    blockSize3 = 4095;
    count = 4095;
    pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
    px = pSrc1;
    /* Working pointer of inputB */
    py = pIn2;

    while (blockSize3 > 0U)
    {
        /* Accumulator is made zero for every iteration */
        sum = 0;
        /* Apply loop unrolling and compute 4 MACs simultaneously. */
        k = count >> 2U;
        /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
        ** a second loop below computes MACs for the remaining 1 to 3 samples. */
        while (k > 0U)
        {
            /* Perform the multiply-accumulates */
            /* sum += x[srcALen - srcBLen + 4] * y[3] , sum += x[srcALen - srcBLen + 3] * y[2] */
            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
            /* sum += x[srcALen - srcBLen + 2] * y[1] , sum += x[srcALen - srcBLen + 1] * y[0] */
            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
            /* Decrement the loop counter */
            k--;
        }
        /* If the count is not a multiple of 4, compute any remaining MACs here.
        ** No loop unrolling is used. */
        k = count % 0x4U;
        while (k > 0U)
        {
            /* Perform the multiply-accumulates */
            sum = __SMLALD(*px++, *py++, sum);
            /* Decrement the loop counter */
            k--;
        }

        // nothing in this section, 28,928,560 clock cycles

        // using q15_t 28,973,622 clock cycles, ~11 extra cycles per loop
        /* *pOut = (q15_t) (__SSAT(sum >> 15, 16));
        if ( *pOut > max)
        {
            max = *pOut;
            maxLoc = pOut;
        }*/

        // using sum directly 35,269,175 clock cycles, ~1540 extra cycles per loop
        /* if (sum > max)
        {
            max = sum;
            maxLoc = pOut;
        }*/

        /* Destination pointer is updated according to the address modifier, inc */
        pOut -= 1;
        /* Update the inputA and inputB pointers for next MAC calculation */
        px = ++pSrc1;
        py = pIn2;
        /* Decrement the MAC count */
        count--;
        /* Decrement the loop counter */
        blockSize3--;
    }
    return (maxLoc - pDst);
}
I call the function with the following:
DWT->CYCCNT = 0;
int cycleCount = DWT->CYCCNT;
value2 = looptest(aADC1_CH3, 4096, aADC2_CH5, 12288, dout);
value4 = DWT->CYCCNT - cycleCount;
printf("Clock cycles %d\r\n", value4);
With the if statements commented out (as in the code above), I get a baseline of ~28.928M clock cycles. Enabling the section below, I get ~28.973M clock cycles, roughly 11 extra clocks per loop:
*pOut = (q15_t) (__SSAT(sum >> 15, 16));
if ( *pOut > max)
{
    max = *pOut;
    maxLoc = pOut;
}
Enabling this part instead, I get an execution time of ~35.269M cycles, a hit of ~1540 clock cycles per loop compared to baseline:
if (sum > max)
{
    max = sum;
    maxLoc = pOut;
}
In all cases, the HAL tick is disabled to ensure nothing else happens in the background. There are no other interrupts enabled. The cycles above were measured with the debug optimization level (-Og), but the relative differences remain even when compiled with -Ofast (which is how I plan to compile the final code).
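One way to sanity-check these figures is to redo the arithmetic from the raw counts quoted in the code comments. The sketch below (all helper and macro names are hypothetical) derives the extra cycles per outer-loop iteration and, as an additional way of looking at it, spreads the same difference over the unrolled inner-loop iterations. It assumes the loop structure shown above: count runs from 4095 down to 1 and each outer pass executes count >> 2 unrolled inner iterations.

```c
#include <stdint.h>

/* Raw cycle counts quoted in the post. */
#define CYCLES_BASELINE  28928560u
#define CYCLES_Q15       28973622u
#define CYCLES_Q63       35269175u
#define OUTER_ITERATIONS 4095u

/* Extra cycles per outer-loop iteration relative to the baseline. */
static double extra_per_outer(uint32_t cycles)
{
    return (double)(cycles - CYCLES_BASELINE) / OUTER_ITERATIONS;
}

/* Total passes through the unrolled inner loop: count runs 4095 down
 * to 1 and each outer pass executes count >> 2 inner iterations. */
static uint32_t inner_iterations(void)
{
    uint32_t total = 0;

    for (uint32_t count = OUTER_ITERATIONS; count > 0; count--) {
        total += count >> 2;
    }
    return total;
}

/* The same difference spread over every unrolled inner iteration. */
static double extra_per_inner(uint32_t cycles)
{
    return (double)(cycles - CYCLES_BASELINE) / inner_iterations();
}
```

With these numbers, the q15 variant works out to about 11 extra cycles per outer iteration and the q63 variant to about 1548, which is roughly 3 extra cycles per unrolled inner iteration. That might hint that the cost is spread across the inner loop rather than sitting in the comparison itself, but checking that would need a look at the generated assembly.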
I'm at a loss as to how to further understand my issue and figure out a way to improve performance... any suggestion is appreciated, including pointing out how much of an idiot I am for having missed the obvious :grinning_face: