Need help understanding performance issue on STM32H743

Question

I'm struggling to understand why I'm seeing a huge performance hit after slightly changing the code in a loop. I narrowed the problem down to a manageable code size, and I'm hoping someone here can suggest how to investigate and resolve my issue.

I'm using a NUCLEO-H743ZI2 board with the clock set to 480 MHz, and my toolchain is STM32CubeIDE 1.0.2 with the latest HAL libraries for the H743. Everything else running on that board works as I would expect.

I am using the CMSIS arm_correlate_q15 function to cross-correlate two signals I capture with the ADC. The CMSIS function takes two Q15 arrays as input and outputs a third array of correlation values, twice as long as the longer input. Even though the library uses q63_t math internally, the output array is scaled down to q15_t. I'm using pretty big arrays (12288 elements), and the returned array has too many values equal to the q15_t maximum, making it hard to pinpoint the exact correlation peak. Also, because the arrays are so big, an additional output array twice the size of the input uses too much memory. (I plan to store the arrays in DTCM RAM for performance reasons; and yes, my ADC DMA goes into RAM_D1, and the data then gets moved into DTCM RAM.)

So I modified the CMSIS function to return the location of the max value instead of an array, by comparing each intermediate result with a "max" variable and storing the new value and location each time I find a new max. I expected that to be, if anything, the same speed as or faster than calling arm_correlate_q15 followed by arm_max_q15 to find the max value.

Instead I get a massive perf hit, roughly an extra 1,500 processor cycles per loop, for just a simple if statement. Clearly there is something I don't understand that is causing problems.

So I created a test loop using just a small part of the arm_correlate_q15 code. With just the multiplication loop, without storing the q15 result in the destination array, I get a baseline execution time. Storing the result as q15_t in the array and then comparing the just-stored value to a "max" variable adds ~10 clock cycles to each loop iteration, which makes perfect sense to me. If I instead compare the q63_t result to a q63_t max value (without storing anything), I get a hit of ~1550 clock cycles per loop. Now, I would expect a 64-bit if statement to take a bit longer than a 16- or 32-bit one, but not that much of a hit.

I'm enclosing the code in question, with the 3 options commented out.

int looptest( q15_t * pSrcA,
              uint32_t srcALen,
              q15_t * pSrcB,
              uint32_t srcBLen, q15_t * pDst)
{

    q15_t *pIn1;                        /* inputA pointer */
    q15_t *pIn2;                        /* inputB pointer */
    q15_t *pOut = pDst;                 /* output pointer */
    q63_t sum, acc0, acc1, acc2, acc3;  /* Accumulators */
    q15_t *px;                          /* Intermediate inputA pointer */
    q15_t *py;                          /* Intermediate inputB pointer */
    q15_t *pSrc1;                       /* Intermediate pointers */
    q31_t x0, x1, x2, x3, c0;           /* temporary variables for holding input and coefficient values */
    uint32_t j, k = 0U, count, blkCnt, blockSize2, blockSize3;  /* loop counter */
    q63_t max = 0;                      /* value of max value */
    q15_t *maxLoc;                      /* location of max value */

    /* Initialization of inputA pointer */
    pIn1 = (pSrcB);

    /* Initialization of inputB pointer */
    pIn2 = (pSrcA);

    /* srcBLen is always considered as shorter or equal to srcALen */
    j = srcBLen;
    srcBLen = srcALen;
    srcALen = j;

    /* set the destination pointer to point to the last output sample */
    pOut = pDst + srcALen - 1U;

    blockSize3 = 4095;
    count = 4095;

    pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
    px = pSrc1;

    /* Working pointer of inputB */
    py = pIn2;

    while (blockSize3 > 0U)
    {
        /* Accumulator is made zero for every iteration */
        sum = 0;

        /* Apply loop unrolling and compute 4 MACs simultaneously. */
        k = count >> 2U;

        /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
        ** a second loop below computes MACs for the remaining 1 to 3 samples. */
        while (k > 0U)
        {
            /* Perform the multiply-accumulates */
            /* sum += x[srcALen - srcBLen + 4] * y[3] , sum += x[srcALen - srcBLen + 3] * y[2] */
            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
            /* sum += x[srcALen - srcBLen + 2] * y[1] , sum += x[srcALen - srcBLen + 1] * y[0] */
            sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);

            /* Decrement the loop counter */
            k--;
        }

        /* If the count is not a multiple of 4, compute any remaining MACs here.
        ** No loop unrolling is used. */
        k = count % 0x4U;

        while (k > 0U)
        {
            /* Perform the multiply-accumulates */
            sum = __SMLALD(*px++, *py++, sum);

            /* Decrement the loop counter */
            k--;
        }

        // nothing in this section, 28,928,560 clock cycles

        // using q15_t 28,973,622 clock cycles, ~11 extra cycles per loop
/*      *pOut = (q15_t) (__SSAT(sum >> 15, 16));

        if ( *pOut > max)
        {
            max = *pOut;
            maxLoc = pOut;
        }*/

        // using sum directly 35,269,175 clock cycles, ~1540 extra cycles per loop
/*      if (sum > max)
        {
            max = sum;
            maxLoc = pOut;
        }*/

        /* Destination pointer is updated according to the address modifier, inc */
        pOut -= 1;

        /* Update the inputA and inputB pointers for next MAC calculation */
        px = ++pSrc1;
        py = pIn2;

        /* Decrement the MAC count */
        count--;

        /* Decrement the loop counter */
        blockSize3--;
    }
    return (maxLoc - pDst);
}

I call the function with the following:

DWT->CYCCNT = 0;
int cycleCount = DWT->CYCCNT;
value2 = looptest(aADC1_CH3, 4096, aADC2_CH5, 12288, dout);
value4 = DWT->CYCCNT - cycleCount;
printf("Clock cycles %d\n", value4);

With the if statements commented out (as in the code above), I get a baseline of ~28.928M clock cycles. Enabling the section below, I get ~28.973M clock cycles, roughly 11 extra clocks per loop:

*pOut = (q15_t) (__SSAT(sum >> 15, 16));

if ( *pOut > max)
{
    max = *pOut;
    maxLoc = pOut;
}

Enabling this part instead, I get an execution time of ~35.269M cycles, a hit of ~1540 clock cycles per loop compared to baseline:

if (sum > max)
{
    max = sum;
    maxLoc = pOut;
}

In all cases, the HAL tick is disabled to ensure nothing else happens in the background. There are no other interrupts enabled. The cycles above were measured using -Og (optimize for debug), but the relative differences remain even when compiled with -Ofast (which is how I plan to compile the final code).

I'm at a loss as to how to further understand my issue and figure out a way to improve performance. Any suggestion is appreciated, including pointing out how much of an idiot I am for having missed the obvious 😀
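For readers checking the numbers: the per-loop figures quoted in the question follow from the measured totals divided by the 4095 outer-loop iterations (the value of `blockSize3`). A minimal sketch of that arithmetic, where the helper name `extra_cycles_per_iteration` is hypothetical and not part of the original code:

```c
#include <assert.h>
#include <stdint.h>

/* Extra cycles per outer-loop iteration relative to a baseline run,
 * rounded to the nearest whole cycle. Hypothetical helper for illustration. */
uint32_t extra_cycles_per_iteration(uint32_t measured,
                                    uint32_t baseline,
                                    uint32_t iterations)
{
    assert(iterations != 0u);
    assert(measured >= baseline);

    uint32_t delta = measured - baseline;

    /* Adding half the divisor before dividing rounds to nearest. */
    return (delta + iterations / 2u) / iterations;
}
```

With the totals from the question, this gives about 11 extra cycles per iteration for the q15_t variant (28,973,622 vs. 28,928,560) and about 1548 for the q63_t comparison (35,269,175 vs. 28,928,560).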

fbar · Accepted Answer

As a follow-up to my issue:

I could find no way to further optimize the correlation function when an if statement is included in the loop. The compiler optimizes a lot of variables out, and as soon as I try to compare values and return something, multiple variables are reintroduced, adding thousands of cycles per iteration. The perf hit of trying to determine the max value on the fly is simply too high for my case.

So I modified the arm_correlate_q15() function to return an array of q63_t values instead, and stored that array in RAM_D1, of which I have more than plenty. It turns out that the performance penalty for having that q63_t array in RAM_D1 instead of DTCM RAM is less than 2 clock cycles per iteration, thanks no doubt to the H7 cache and optimized architecture. So, in my case, it's vastly preferable to use extra RAM_D1 to store a temporary q63_t array instead of trying to perform on-the-fly calculations. All the other arrays are still stored in DTCM RAM.

I have to say that I'm surprised by how small an effect DTCM RAM has in my particular case.

Thanks to @Community member and @embeddedt for the pointers that, in the end, helped solve the issue.
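The two-pass idea described in this answer (store every 64-bit correlation sum, then search for the peak afterwards) might look roughly like the portable sketch below. It is not the modified CMSIS function: it uses no SIMD intrinsics, it computes only the lags where the shorter signal fully overlaps the longer one, and the names `correlate_q63` and `argmax_q63` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pass 1: one 64-bit sum per lag. The hot inner loop contains no compare.
 * Assumes xLen >= yLen; 'out' must hold xLen - yLen + 1 entries. */
void correlate_q63(const int16_t *x, size_t xLen,
                   const int16_t *y, size_t yLen,
                   int64_t *out)
{
    for (size_t lag = 0; lag + yLen <= xLen; lag++) {
        int64_t sum = 0;
        for (size_t i = 0; i < yLen; i++) {
            /* 16x16 -> 32-bit product, accumulated in 64 bits */
            sum += (int32_t)x[lag + i] * y[i];
        }
        out[lag] = sum;
    }
}

/* Pass 2: index of the largest value (the first one on ties). */
size_t argmax_q63(const int64_t *v, size_t n)
{
    assert(n > 0);

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (v[i] > v[best]) {
            best = i;
        }
    }
    return best;
}
```

On the target, `out` would be the temporary buffer placed in RAM_D1. How it is placed there depends on the project's linker setup and is not shown here. Because the sums are kept at full 64-bit width, the peak is not flattened by saturation to q15_t.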

waclawek.jan · Answer

The processor does not execute C code, so you should start with comparing the disasm for the various cases.

IMO the compiler simply hit the point where due to register pressure it started to swap variables in/out of memory; plus maybe the stack is not in the best place.

I don't use the 'H7.

JW
