2019-10-06 11:28 AM
I'm struggling to understand why I'm seeing a huge performance hit by slightly changing code in a loop. I narrowed down the problem to a manageable code size, and I'm hoping someone here can suggest how to investigate and resolve my issue.
I'm using a NUCLEO-H743ZI2 board, clock set to 480MHz, and my toolchain is STM32CubeIDE 1.0.2 with the latest HAL libraries for the H743. Everything else running on that board works as I would expect.
I am using the CMSIS arm_correlate_q15 library function to cross-correlate two signals I capture with the ADC. The function takes two Q15 arrays as input and outputs a third array, twice as long as the larger input, holding the correlation values. Even though the library uses q63_t math internally, the output array is scaled down to q15_t. I'm using pretty big arrays (12288 elements), and the return array has too many values equal to the maximum possible q15 value, making it hard to pinpoint the exact correlation point. Also, due to the big array sizes, having an additional output array twice as big as the input uses too much memory (I plan to store the arrays in DTCM RAM for perf reasons; and, yes, my ADC DMA goes into RAM_D1, then gets moved into DTCM RAM).
So I modified the CMSIS function to return the max value location instead of an array, by comparing each intermediate result with a "max" variable and storing new values each time I find a new max. I expected, if anything, for that to be the same speed or faster than first calling arm_correlate_q15 followed by arm_max_q15 to find the max value.
Instead I get a massive perf hit, roughly an extra 1,500 processor cycles per loop, for just a simple if statement. Clearly there is something I don't understand causing problems.
So I created a test loop using just a small part of the arm_correlate_q15 code, and I can see that just by having the multiplication loop, without storing the q15 result in the destination array, I get a baseline execution time. Storing the result as q15_t in the array and then comparing the just-stored value to a "max" variable adds ~10 clock cycles to each loop iteration, which makes perfect sense to me. If I instead compare the q63_t result to a q63_t max value (without storing anything), I get a ~1550 clock cycle hit per loop. Now, I would expect a 64-bit if statement to take a bit longer than a 16- or 32-bit one, but not that much of a hit.
I'm enclosing the code in question, with the 3 options commented out.
int looptest(q15_t *pSrcA,
             uint32_t srcALen,
             q15_t *pSrcB,
             uint32_t srcBLen,
             q15_t *pDst)
{
  q15_t *pIn1;                 /* inputA pointer */
  q15_t *pIn2;                 /* inputB pointer */
  q15_t *pOut = pDst;          /* output pointer */
  q63_t sum, acc0, acc1, acc2, acc3; /* Accumulators */
  q15_t *px;                   /* Intermediate inputA pointer */
  q15_t *py;                   /* Intermediate inputB pointer */
  q15_t *pSrc1;                /* Intermediate pointers */
  q31_t x0, x1, x2, x3, c0;    /* temporary variables for holding input and coefficient values */
  uint32_t j, k = 0U, count, blkCnt, blockSize2, blockSize3; /* loop counter */
  q63_t max = 0;               /* value of max value */
  q15_t *maxLoc = pDst;        /* location of max value (initialized so the
                                  return is defined when the compares below
                                  are commented out) */

  /* Initialization of inputA pointer */
  pIn1 = (pSrcB);
  /* Initialization of inputB pointer */
  pIn2 = (pSrcA);

  /* srcBLen is always considered as shorter or equal to srcALen */
  j = srcBLen;
  srcBLen = srcALen;
  srcALen = j;

  /* Set the destination pointer to point to the last output sample */
  pOut = pDst + srcALen - 1U;

  blockSize3 = 4095;
  count = 4095;
  pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
  px = pSrc1;
  /* Working pointer of inputB */
  py = pIn2;

  while (blockSize3 > 0U)
  {
    /* Accumulator is made zero for every iteration */
    sum = 0;

    /* Apply loop unrolling and compute 4 MACs simultaneously. */
    k = count >> 2U;

    /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
    ** A second loop below computes MACs for the remaining 1 to 3 samples. */
    while (k > 0U)
    {
      /* Perform the multiply-accumulates */
      /* sum += x[srcALen - srcBLen + 4] * y[3] , sum += x[srcALen - srcBLen + 3] * y[2] */
      sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
      /* sum += x[srcALen - srcBLen + 2] * y[1] , sum += x[srcALen - srcBLen + 1] * y[0] */
      sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
      /* Decrement the loop counter */
      k--;
    }

    /* If the count is not a multiple of 4, compute any remaining MACs here.
    ** No loop unrolling is used. */
    k = count % 0x4U;
    while (k > 0U)
    {
      /* Perform the multiply-accumulates */
      sum = __SMLALD(*px++, *py++, sum);
      /* Decrement the loop counter */
      k--;
    }

    // Nothing in this section: 28,928,560 clock cycles
    // Using q15_t: 28,973,622 clock cycles, ~11 extra cycles per loop
    /* *pOut = (q15_t) (__SSAT(sum >> 15, 16));
    if (*pOut > max)
    {
      max = *pOut;
      maxLoc = pOut;
    } */

    // Using sum directly: 35,269,175 clock cycles, ~1540 extra cycles per loop
    /* if (sum > max)
    {
      max = sum;
      maxLoc = pOut;
    } */

    /* Destination pointer is updated according to the address modifier, inc */
    pOut -= 1;
    /* Update the inputA and inputB pointers for next MAC calculation */
    px = ++pSrc1;
    py = pIn2;
    /* Decrement the MAC count */
    count--;
    /* Decrement the loop counter */
    blockSize3--;
  }

  return (maxLoc - pDst);
}
I call the function with the following:
DWT->CYCCNT = 0;
int cycleCount = DWT->CYCCNT;
value2 = looptest(aADC1_CH3, 4096, aADC2_CH5, 12288, dout);
value4 = DWT->CYCCNT - cycleCount;
printf("Clock cycles %d\r\n", value4);
With the if statements commented out (as per the code above) I get a baseline of ~28.928M clock cycles. Enabling the block below, I get ~28.973M clock cycles, roughly 11 extra clocks per loop:
*pOut = (q15_t) (__SSAT(sum >> 15, 16));
if ( *pOut > max)
{
max = *pOut;
maxLoc = pOut;
}
Enabling this part instead, I get an execution time of ~35.269M cycles, a hit of ~1540 clock cycles per loop compared to baseline:
if (sum > max)
{
max = sum;
maxLoc = pOut;
}
In all cases, the HAL Tick is disabled to ensure nothing else happens in the background. There are no other interrupts enabled. The cycles above were measured using -ODebug optimizations, but the relative differences happen even when compiled with -OFast (which is how I plan to compile the final code).
I'm at a loss as to how to further understand my issue and figure out a way to improve performance... any suggestion is appreciated, including pointing out how much of an idiot I am for having missed the obvious 😀
Solved! Go to Solution.
2019-10-07 11:24 AM
As a follow up to my issue:
I could find no way to further optimize the correlation function when an if statement is included in the loop. The compiler optimizes a lot of variables out, and as soon as I try to compare values and return something, multiple variables are reintroduced, adding thousands of cycles per iteration. The perf hit of trying to determine the max value on the fly is simply too high for my case.
So I modified the arm_correlate_q15() function to return an array of q63_t values instead, and stored that array in RAM_D1, of which I have more than plenty. It turns out that the performance penalty for having that q63_t array in RAM_D1 instead of DTCM RAM is less than 2 clock cycles per iteration, thanks no doubt to the H7 cache and optimized architecture.
So, in my case, it's vastly preferable to use extra RAM_D1 to store a temporary q63 array instead of trying to perform on-the-fly calculations. All the other arrays are still stored in DTCMRAM.
I have to say that I'm surprised by how small an effect DTCM RAM has in my particular case.
Thanks to @Community member and @embeddedt for the pointers that, in the end, helped solve the issue
2019-10-06 01:14 PM
The processor does not execute C code, so you should start with comparing the disasm for the various cases.
IMO the compiler simply hit the point where due to register pressure it started to swap variables in/out of memory; plus maybe the stack is not in the best place.
I don't use the 'H7.
JW
2019-10-06 02:29 PM
As @Community member said, the issue is likely in the generated assembly. If it does turn out to be variables being pushed to memory, try putting your stack in DTCM (if it's not already there), which should be much faster.
2019-10-06 07:00 PM
First of all, thanks for taking the time to answer. And, yes, I'm very familiar with how C gets turned into assembly (having written more hand tuned Z80 code than I care to remember when toolchains were simply not optimized enough). It's just that I'm not familiar at all with ARM assembly, and I was hoping for a suggestion that didn't require digging into it, since I'm basically useless at that point.
I also tried compiling my 3 scenarios with -Onone and the performance (while super-slow in general) is predictable: with no if statement it takes 128.16M cycles, with the q15 comparison 128.27M, and with the q63_t comparison 128.12M cycles (even faster, since there's no __SSAT operation). The assembly is also much easier to read, but it's far too slow to use for the final build.
I dumped the relevant assembly code for all 3 cases, with -OFast, and here they are.
I'm really not asking anyone to pore over the asm and find the problem. More like suggestions on how to try and work around it. It seems I have hit an optimizer limitation, and I have no idea how to get past it. Hand-crafting assembly is clearly not an option for me, so I'm at a dead end.
everything commented out
Clock cycles 24189027 50.393 msec
looptest:
080063a4: subs r3, r3, r1
080063a6: movw r12, #4095 ; 0xfff
080063aa: adds r3, #1
1283 {
080063ac: stmdb sp!, {r4, r5, r6, r7, r8, r9, lr}
080063b0: ldr r7, [sp, #28]
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063b2: add.w r2, r2, r3, lsl #1
1330 while (k > 0U)
080063b6: movs.w r3, r12, lsr #2
080063ba: beq.n 0x800643a <looptest+150>
080063bc: mov.w r8, r3, lsl #3
080063c0: add.w r1, r2, #8
1323 sum = 0;
080063c4: movs r3, #0
080063c6: add.w r4, r0, #8
080063ca: add.w lr, r1, r8
080063ce: mov r9, r3
1334 sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063d0: ldr.w r5, [r1, #-8]
080063d4: ldr.w r6, [r4, #-8]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063d8: smlald r3, r9, r5, r6
1336 sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063dc: ldr.w r5, [r1, #-4]
080063e0: ldr.w r6, [r4, #-4]
080063e4: smlald r3, r9, r5, r6
080063e8: adds r1, #8
080063ea: adds r4, #8
080063ec: cmp lr, r1
080063ee: bne.n 0x80063d0 <looptest+44>
080063f0: add.w r1, r2, r8
080063f4: add r8, r0
1346 while (k > 0U)
080063f6: ands.w r4, r12, #3
080063fa: beq.n 0x8006428 <looptest+132>
1349 sum = __SMLALD(*px++, *py++, sum);
080063fc: ldrsh.w r5, [r1]
08006400: ldrsh.w r6, [r8]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006404: smlald r3, r9, r5, r6
1346 while (k > 0U)
08006408: cmp r4, #1
0800640a: beq.n 0x8006428 <looptest+132>
1349 sum = __SMLALD(*px++, *py++, sum);
0800640c: ldrsh.w r5, [r1, #2]
08006410: ldrsh.w r6, [r8, #2]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006414: smlald r3, r9, r5, r6
1346 while (k > 0U)
08006418: cmp r4, #2
0800641a: beq.n 0x8006428 <looptest+132>
1349 sum = __SMLALD(*px++, *py++, sum);
0800641c: ldrsh.w r1, [r1, #4]
08006420: ldrsh.w r4, [r8, #4]
08006424: smlald r3, r9, r1, r4
1320 while (blockSize3 > 0U)
08006428: subs.w r12, r12, #1
1377 px = ++pSrc1;
0800642c: add.w r2, r2, #2
1320 while (blockSize3 > 0U)
08006430: bne.n 0x80063b6 <looptest+18>
1386 return (maxLoc - pDst);
08006432: negs r0, r7
1387 }
Here's the version using the q15_t pointer
Clock cycles 24231942 50.483 msec
*pOut = (q15_t) (__SSAT(sum >> 15, 16));
if ( *pOut > max)
{
max = *pOut;
maxLoc = pOut;
}
looptest:
080063a4: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
1309 pOut = pDst + srcALen -1U; //pOut = srcALen -1U; // start from end of output buffer, work back //pDst + ((srcALen + srcBLen) - 2U);
080063a8: mvn.w r7, #2147483648 ; 0x80000000
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063ac: subs r1, r3, r1
1283 {
080063ae: sub sp, #12
1294 q63_t max = 0; /* value of max value */
080063b0: moveq r4, #0
1309 pOut = pDst + srcALen -1U; //pOut = srcALen -1U; // start from end of output buffer, work back //pDst + ((srcALen + srcBLen) - 2U);
080063b2: add r7, r3
1294 q63_t max = 0; /* value of max value */
080063b4: movs r3, #0
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063b6: movw r6, #4095 ; 0xfff
080063ba: strd r3, r4, [sp]
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063be: adds r3, r1, #1
1309 pOut = pDst + srcALen -1U; //pOut = srcALen -1U; // start from end of output buffer, work back //pDst + ((srcALen + srcBLen) - 2U);
080063c0: ldr r1, [sp, #48] ; 0x30
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063c2: add.w r2, r2, r3, lsl #1
080063c6: add.w r7, r1, r7, lsl #1
1330 while (k > 0U)
080063ca: movs.w r10, r6, lsr #2
080063ce: mov lr, r7
080063d0: beq.n 0x8006484 <looptest+224>
080063d2: mov.w r8, r10, lsl #3
080063d6: add.w r3, r2, #8
1323 sum = 0;
080063da: mov.w r10, #0
080063de: addeq.w r1, r0, #8
080063e2: add.w r9, r8, r3
080063e6: mov r11, r10
1334 sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063e8: ldr.w r4, [r3, #-8]
080063ec: ldr.w r5, [r1, #-8]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063f0: smlald r10, r11, r4, r5
1336 sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063f4: ldr.w r4, [r3, #-4]
080063f8: ldr.w r5, [r1, #-4]
080063fc: smlald r10, r11, r4, r5
08006400: adds r3, #8
08006402: adds r1, #8
08006404: cmp r9, r3
08006406: bne.n 0x80063e8 <looptest+68>
08006408: add.w r1, r2, r8
0800640c: add.w r3, r0, r8
1346 while (k > 0U)
08006410: ands.w r4, r6, #3
08006414: beq.n 0x8006442 <looptest+158>
1349 sum = __SMLALD(*px++, *py++, sum);
08006416: ldrsh.w r5, [r1]
0800641a: ldrsh.w r8, [r3]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
0800641e: smlald r10, r11, r5, r8
1346 while (k > 0U)
08006422: cmp r4, #1
08006424: beq.n 0x8006442 <looptest+158>
1349 sum = __SMLALD(*px++, *py++, sum);
08006426: ldrsh.w r5, [r1, #2]
0800642a: ldrsh.w r8, [r3, #2]
0800642e: smlald r10, r11, r5, r8
1346 while (k > 0U)
08006432: cmp r4, #2
08006434: beq.n 0x8006442 <looptest+158>
1349 sum = __SMLALD(*px++, *py++, sum);
08006436: ldrsh.w r1, [r1, #4]
0800643a: ldrsh.w r3, [r3, #4]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
0800643e: smlald r10, r11, r1, r3
1358 *pOut = (q15_t) (__SSAT(sum >> 15, 16));
08006442: mov.w r10, r10, lsr #15
08006446: orr.w r10, r10, r11, lsl #17
0800644a: ssat r10, #16, r10
0800644e: sxth.w r10, r10
1360 if ( *pOut > max)
08006452: ldrd r8, r9, [sp]
08006456: sxth.w r4, r10
1358 *pOut = (q15_t) (__SSAT(sum >> 15, 16));
0800645a: strh.w r10, [r7], #-2
1360 if ( *pOut > max)
0800645e: asrs r5, r4, #31
08006460: cmp r8, r4
08006462: sbcs.w r3, r9, r5
08006466: bge.n 0x800646e <looptest+202>
08006468: mov r12, lr
0800646a: strd r4, r5, [sp]
1320 while (blockSize3 > 0U)
0800646e: subs r6, #1
1377 px = ++pSrc1;
08006470: add.w r2, r2, #2
1320 while (blockSize3 > 0U)
08006474: bne.n 0x80063ca <looptest+38>
1386 return (maxLoc - pDst);
08006476: ldr r3, [sp, #48] ; 0x30
08006478: sub.w r0, r12, r3
1387 }
(to be continued...)
2019-10-06 07:01 PM
and lastly the sum comparison
Clock cycles 28931727 60.274 msec
if (sum > max)
{
max = sum;
maxLoc = pOut;
}
looptest:
080063a4: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063a8: subs r1, r3, r1
1283 {
080063aa: sub sp, #20
1309 pOut = pDst + srcALen -1U; //pOut = srcALen -1U; // start from end of output buffer, work back //pDst + ((srcALen + srcBLen) - 2U);
080063ac: mvn.w r8, #2147483648 ; 0x80000000
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063b0: movw r12, #4095 ; 0xfff
1294 q63_t max = 0; /* value of max value */
080063b4: mov.w r10, #0
080063b8: mov.w r11, #0
1309 pOut = pDst + srcALen -1U; //pOut = srcALen -1U; // start from end of output buffer, work back //pDst + ((srcALen + srcBLen) - 2U);
080063bc: add r8, r3
080063be: adds r3, r1, #1
080063c0: ldr r1, [sp, #56] ; 0x38
1314 pSrc1 = (pIn1 + srcALen) - (srcBLen - 1U);
080063c2: mov r9, r0
080063c4: add.w r2, r2, r3, lsl #1
080063c8: add.w r8, r1, r8, lsl #1
1330 while (k > 0U)
080063cc: movs.w r6, r12, lsr #2
080063d0: beq.n 0x800647c <looptest+216>
080063d2: lsls r6, r6, #3
080063d4: add.w r4, r2, #8
080063d8: add.w r5, r9, #8
1323 sum = 0;
080063dc: movs r0, #0
080063de: movs r1, #0
080063e0: adds r7, r6, r4
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063e2: mov r3, r0
080063e4: mov lr, r1
080063e6: ldr.w r0, [r5, #-8]
080063ea: ldr.w r1, [r4, #-8]
080063ee: smlald r3, lr, r1, r0
1336 sum = __SMLALD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
080063f2: ldr.w r0, [r4, #-4]
080063f6: ldr.w r1, [r5, #-4]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
080063fa: smlald r3, lr, r0, r1
080063fe: adds r4, #8
08006400: mov r0, r3
08006402: adds r5, #8
08006404: mov r1, lr
1330 while (k > 0U)
08006406: cmp r4, r7
08006408: bne.n 0x80063e2 <looptest+62>
0800640a: adds r5, r2, r6
0800640c: add r6, r9
0800640e: strd r0, r1, [sp]
1346 while (k > 0U)
08006412: ands.w r7, r12, #3
08006416: beq.n 0x800644c <looptest+168>
1349 sum = __SMLALD(*px++, *py++, sum);
08006418: ldrsh.w lr, [r5]
0800641c: ldr r4, [sp, #0]
0800641e: ldrsh.w r0, [r6]
08006422: ldr r3, [sp, #4]
08006424: smlald r4, r3, lr, r0
1346 while (k > 0U)
08006428: cmp r7, #1
0800642a: beq.n 0x8006448 <looptest+164>
1349 sum = __SMLALD(*px++, *py++, sum);
0800642c: ldrsh.w r1, [r5, #2]
08006430: ldrsh.w r0, [r6, #2]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006434: smlald r4, r3, r1, r0
1346 while (k > 0U)
08006438: cmp r7, #2
0800643a: beq.n 0x8006448 <looptest+164>
1349 sum = __SMLALD(*px++, *py++, sum);
0800643c: ldrsh.w r1, [r5, #4]
08006440: ldrsh.w r0, [r6, #4]
1931 __ASM volatile ("smlald %0, %1, %2, %3" : "=r" (llr.w32[0]), "=r" (llr.w32[1]): "r" (op1), "r" (op2) , "0" (llr.w32[0]), "1" (llr.w32[1]) );
08006444: smlald r4, r3, r1, r0
08006448: strd r4, r3, [sp]
1367 if (sum > max)
0800644c: ldrd r0, r1, [sp]
08006450: cmp r10, r0
08006452: sbcs.w r3, r11, r1
08006456: bge.n 0x8006460 <looptest+188>
08006458: mov r10, r0
0800645a: mov r11, r1
0800645c: str.w r8, [sp, #12]
1320 while (blockSize3 > 0U)
08006460: subs.w r12, r12, #1
1374 pOut -= 1;
08006464: sub.w r8, r8, #2
1377 px = ++pSrc1;
08006468: add.w r2, r2, #2
1320 while (blockSize3 > 0U)
0800646c: bne.n 0x80063cc <looptest+40>
1386 return (maxLoc - pDst);
0800646e: ldr r3, [sp, #12]
08006470: ldr r2, [sp, #56] ; 0x38
08006472: subs r0, r3, r2
1387 }