2021-11-08 12:07 AM
Hello again,
I recently checked the time the STM32F7 (192MHz) needs to copy one 32bit buffer to another one, and I'm a little shocked about how long that takes.
(Maybe I expect too much, coming from 8bit µCs with 20MHz, an STM32 still feels like a racing car compared to a horse carriage)
For time measurement the interrupts are disabled and the PTP registers are used.
It takes about 38µs to copy a uint32_t buffer[256] to another one, that is about 28 cycles per word @192MHz. Okay, there's the for-loop comparison and increment, but still, 28 cycles?
Am I overlooking anything, or is that just how it is?
I know it can be a long way from C to assembler, and I haven't checked that yet (never even looked at the STM32 / ARM assembler stuff), but the list excerpt below shows some 17 CPU actions for the for-loop - right?
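For reference, the 28-cycle figure follows from the measured 38µs, the 192MHz clock and the 256 words. A minimal sketch of that arithmetic in integer C; the function names are made up for illustration and are not part of any ST or CMSIS library:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured duration into CPU cycles, rounded to nearest.
 * cpu_hz is the core clock (192 MHz in this post). */
uint64_t ns_to_cycles(uint64_t duration_ns, uint64_t cpu_hz)
{
    return (duration_ns * cpu_hz + 500000000ull) / 1000000000ull;
}

/* Average cycles per copied word, in tenths of a cycle.
 * Integer math only, so no float printf support is needed on the target. */
uint32_t cycles_per_word_x10(uint64_t duration_ns, uint64_t cpu_hz, uint32_t words)
{
    assert(words > 0);
    uint64_t cycles = ns_to_cycles(duration_ns, cpu_hz);
    return (uint32_t)((cycles * 10u + words / 2u) / words);
}
```

With 38000 ns, 192000000 Hz and 256 words this gives 7296 cycles in total, or 28.5 cycles per word.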
Here's the uart/debug output:
*.c:
#define RPM_DMA_BUF_MAX 256
uint32_t u32RpmDmaBuf[2][RPM_DMA_BUF_MAX] = { { 0 } };
uint32_t u32RpmUartBuf[RPM_DMA_BUF_MAX];
uint32_t *pu32SrcBuf;
uint32_t u32StopNanoSec = 0;
uint32_t u32StartNanoSec = 0;
/* copy buffer with snapshot of timestamp before and after */
__disable_irq();
u32Val = ETH->PTPTSLR;
for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
{
u32RpmUartBuf[i16] = *(pu32SrcBuf++);
}
u32Val2 = ETH->PTPTSLR;
__enable_irq();
u32StartNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val);
u32StopNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val2);
u32RpmDebugTime = 0;
if( u32StopNanoSec > u32StartNanoSec ) u32RpmDebugTime = u32StopNanoSec - u32StartNanoSec;
uart_printf("u32StopNanoSec = %lu\n\r", u32StopNanoSec);
uart_printf("u32StartNanoSec = %lu\n\r", u32StartNanoSec);
uart_printf("u32buffer[%d] copy time: %lu ns\n\r", RPM_DMA_BUF_MAX, u32RpmDebugTime);
*.list
__disable_irq();
u32Val = ETH->PTPTSLR;
8006b42: 4bad ldr r3, [pc, #692] ; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
8006b44: f8d3 370c ldr.w r3, [r3, #1804] ; 0x70c
8006b48: f8c7 3498 str.w r3, [r7, #1176] ; 0x498
for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
8006b4c: 2300 movs r3, #0
8006b4e: f8a7 3504 strh.w r3, [r7, #1284] ; 0x504
8006b52: e010 b.n 8006b76 <UART3_RxCmndProcessing+0x1556>
{
//u32RpmUartBuf[i16] = u32RpmDmaBuf[u8RpmBufPtr][i16];
//u32RpmUartBuf[i16] = *(pu32SrcBuf++);
*(pu32DstBuf++) = *(pu32SrcBuf++);
8006b54: f8d7 24ec ldr.w r2, [r7, #1260] ; 0x4ec
8006b58: 1d13 adds r3, r2, #4
8006b5a: f8c7 34ec str.w r3, [r7, #1260] ; 0x4ec
8006b5e: f8d7 34e8 ldr.w r3, [r7, #1256] ; 0x4e8
8006b62: 1d19 adds r1, r3, #4
8006b64: f8c7 14e8 str.w r1, [r7, #1256] ; 0x4e8
8006b68: 6812 ldr r2, [r2, #0]
8006b6a: 601a str r2, [r3, #0]
for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
8006b6c: f8b7 3504 ldrh.w r3, [r7, #1284] ; 0x504
8006b70: 3301 adds r3, #1
8006b72: f8a7 3504 strh.w r3, [r7, #1284] ; 0x504
8006b76: f8b7 3504 ldrh.w r3, [r7, #1284] ; 0x504
8006b7a: 2bff cmp r3, #255 ; 0xff
8006b7c: d9ea bls.n 8006b54 <UART3_RxCmndProcessing+0x1534>
}
u32Val2 = ETH->PTPTSLR;
8006b7e: 4b9e ldr r3, [pc, #632] ; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
8006b80: f8d3 370c ldr.w r3, [r3, #1804] ; 0x70c
8006b84: f8c7 3494 str.w r3, [r7, #1172] ; 0x494
__ASM volatile ("cpsie i" : : : "memory");
8006b88: b662 cpsie i
}
8006b8a: bf00 nop
__enable_irq();
2021-11-08 12:32 AM
What are your compiler & compiler settings/options? A Debug configuration does not generate fast code. Did you try a Release config?
hth
KnarfB
2021-11-08 12:58 AM
Thanks for the reply, and sorry for the missing info!
STM32CubeIDE 1.6 - and yes, debug config.
I'll try the release...
2021-11-08 02:34 AM
Holy Smokes, THAT helped!
Thanks a lot, KnarfB!
Copy time has gone down from 38µs (debug) to 2.8µs (release) !
That seems almost too fast with about 2 cycles per copy...
One more question concerning the *.list file:
Is there a setting where showing the C source in the list file can be turned on?
I've been through many project settings in the IDE, but haven't found anything yet.
2021-11-08 06:26 AM
If using Keil, look for a .txt file - much more useful for analyzing the generated code!
Regards,
Dave
2021-11-08 08:06 AM
The DWT CYCCNT might be less of a circus..
https://www.systutorials.com/generate-a-mixed-source-and-assembly-listing-using-gcc/
2021-11-08 11:19 AM
Thanks for the info! Lots to learn for me about ARMs.
2021-11-08 11:02 PM
Using DWT CYCCNT I get some corresponding numbers:
PTP time check:
u32StopNanoSec = 374706031
u32StartNanoSec = 374703048
u32buffer[256] copy time: 2983 ns
DWT->CYCCNT = 0x48DE03C2
u32DwtCycCnt1 = 1216759486
u32DwtCycCnt2 = 1216760037
DWT->CYCCNT diff = 551 -> 2870 ns @ 192 MHz
And for others who might try this, here's how to enable the CYCCNT (took me some time to find out about the LAR):
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
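To turn two CYCCNT snapshots into the nanosecond figure shown above, something like the following sketch should do. The helper names cyccnt_elapsed and cycles_to_ns are hypothetical (not CMSIS), and the code assumes the counter wrapped at most once between the two reads:

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two CYCCNT snapshots. Unsigned 32-bit
 * subtraction stays correct across a single counter wrap-around. */
uint32_t cyccnt_elapsed(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Convert a cycle count to nanoseconds, rounded to nearest.
 * 64-bit math avoids overflow in the intermediate product. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0);
    return ((uint64_t)cycles * 1000000000ull + cpu_hz / 2u) / cpu_hz;
}
```

On the target, start and stop would simply be two reads of DWT->CYCCNT taken before and after the copy loop. Feeding in the two values printed above yields 551 cycles, i.e. 2870 ns at 192 MHz.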
2021-11-17 10:40 PM
Hello again,
after working on other parts of the code and no longer checking the data copy time,
I'm back at this point again, and guess what: the copy time is much higher again!
I have NOT changed the code above, which measures the time and copies the buffers, and
I have not switched back to the debug configuration, but now I get a copy time of about 33 µs, where it was
about 3µs right after switching from debug to release (still using STM32CubeIDE).
Any ideas, please?
Maybe @KnarfB or @Community member have some idea about that?
2021-11-17 11:20 PM
Hi,
If you have more lines in your code, the compiler may optimize by moving the memory load/store instructions around or even changing their order. This does not change the C semantics of your code, but it does change the timing. E.g., the compiler tries to load as early as possible to hide memory latencies.
You can tell the compiler not to move around the code by placing compile-time memory barriers. Shown here for gcc and the Cortex core cycle counter:
asm volatile ("" : : : "memory"); // see https://blog.regehr.org/archives/28
volatile uint32_t tick = DWT->CYCCNT;
asm volatile ("" : : : "memory");
// code of interest
asm volatile ("" : : : "memory");
volatile uint32_t tock = DWT->CYCCNT;
asm volatile ("" : : : "memory");
Strictly speaking, the volatile is not necessary to suppress the reordering, but it keeps the compiler from optimizing away seemingly unused variables.
If possible, check the assembler code, e.g. by stepping through the loop at instruction level.
One last remark: when you have several pointers around, think of defining restrict pointers https://en.wikipedia.org/wiki/Restrict to guide optimization.
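A minimal sketch of what a restrict-qualified copy could look like in C99; copy_words is a made-up name for illustration, and whether it actually speeds things up depends on the compiler and optimization level:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n 32-bit words. `restrict` promises the compiler that dst and src
 * never overlap, so it is free to load several words before storing them.
 * Calling this with overlapping buffers is undefined behavior. */
void copy_words(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

For the buffers in this thread the call would be along the lines of copy_words(u32RpmUartBuf, pu32SrcBuf, RPM_DMA_BUF_MAX). The standard memcpy already has restrict-qualified parameters and is worth comparing against.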
hth
KnarfB