STM32F7 @ 192MHz: long data copy time?

LCE
Principal

Hello again,

I recently checked how long the STM32F7 (192 MHz) needs to copy one 32-bit buffer to another one, and I'm a little shocked at how long it takes.

(Maybe I expect too much; coming from 8-bit µCs at 20 MHz, an STM32 still feels like a racing car compared to a horse carriage.)

For the time measurement, interrupts are disabled and the Ethernet PTP registers are used.

It takes about 38 µs to copy a uint32_t buffer[256] to another one, that is about 28 cycles per 32-bit word @ 192 MHz. Okay, there's the for-loop comparison and increment, but still, 28 cycles?

Anything I overlooked, or is that just how it is?

I know it can be a long way from C to assembler, and I haven't checked that yet (never even looked at the STM32 / ARM assembler stuff), but the list excerpt below shows some 17 CPU instructions for the for-loop - right?

Here's the relevant code and the list file excerpt:

*.c:

#define RPM_DMA_BUF_MAX		256
 
uint32_t u32RpmDmaBuf[2][RPM_DMA_BUF_MAX] = { { 0 } };
 
uint32_t u32RpmUartBuf[RPM_DMA_BUF_MAX];
uint32_t *pu32SrcBuf;
 
uint32_t u32StopNanoSec = 0;
uint32_t u32StartNanoSec = 0;
uint32_t u32Val, u32Val2, u32RpmDebugTime;
uint16_t i16;
 
/* copy buffer with snapshot of timestamp before and after */
pu32SrcBuf = &u32RpmDmaBuf[0][0];	/* source: first DMA buffer */
__disable_irq();
u32Val = ETH->PTPTSLR;
 
for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
{
    u32RpmUartBuf[i16] = *(pu32SrcBuf++);
}
u32Val2 = ETH->PTPTSLR;
__enable_irq();
 
u32StartNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val);
u32StopNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val2);
 
u32RpmDebugTime = 0;
if( u32StopNanoSec > u32StartNanoSec ) u32RpmDebugTime = u32StopNanoSec - u32StartNanoSec;
uart_printf("u32StopNanoSec  = %ld\n\r", u32StopNanoSec);
uart_printf("u32StartNanoSec = %ld\n\r", u32StartNanoSec);
uart_printf("u32buffer[%d] copy time: %ld ns\n\r", RPM_DMA_BUF_MAX, u32RpmDebugTime);

*.list

__disable_irq();
u32Val = ETH->PTPTSLR;
 8006b42:	4bad      	ldr	r3, [pc, #692]	; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
 8006b44:	f8d3 370c 	ldr.w	r3, [r3, #1804]	; 0x70c
 8006b48:	f8c7 3498 	str.w	r3, [r7, #1176]	; 0x498
						for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
 8006b4c:	2300      	movs	r3, #0
 8006b4e:	f8a7 3504 	strh.w	r3, [r7, #1284]	; 0x504
 8006b52:	e010      	b.n	8006b76 <UART3_RxCmndProcessing+0x1556>
						{
							//u32RpmUartBuf[i16] = u32RpmDmaBuf[u8RpmBufPtr][i16];
							//u32RpmUartBuf[i16] = *(pu32SrcBuf++);
							*(pu32DstBuf++) = *(pu32SrcBuf++);
 8006b54:	f8d7 24ec 	ldr.w	r2, [r7, #1260]	; 0x4ec
 8006b58:	1d13      	adds	r3, r2, #4
 8006b5a:	f8c7 34ec 	str.w	r3, [r7, #1260]	; 0x4ec
 8006b5e:	f8d7 34e8 	ldr.w	r3, [r7, #1256]	; 0x4e8
 8006b62:	1d19      	adds	r1, r3, #4
 8006b64:	f8c7 14e8 	str.w	r1, [r7, #1256]	; 0x4e8
 8006b68:	6812      	ldr	r2, [r2, #0]
 8006b6a:	601a      	str	r2, [r3, #0]
						for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
 8006b6c:	f8b7 3504 	ldrh.w	r3, [r7, #1284]	; 0x504
 8006b70:	3301      	adds	r3, #1
 8006b72:	f8a7 3504 	strh.w	r3, [r7, #1284]	; 0x504
 8006b76:	f8b7 3504 	ldrh.w	r3, [r7, #1284]	; 0x504
 8006b7a:	2bff      	cmp	r3, #255	; 0xff
 8006b7c:	d9ea      	bls.n	8006b54 <UART3_RxCmndProcessing+0x1534>
						}
u32Val2 = ETH->PTPTSLR;
 8006b7e:	4b9e      	ldr	r3, [pc, #632]	; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
 8006b80:	f8d3 370c 	ldr.w	r3, [r3, #1804]	; 0x70c
 8006b84:	f8c7 3494 	str.w	r3, [r7, #1172]	; 0x494
  __ASM volatile ("cpsie i" : : : "memory");
 8006b88:	b662      	cpsie	i
}
 8006b8a:	bf00      	nop
__enable_irq();

KnarfB
Principal III

What are your compiler & compiler settings/options? A Debug configuration does not generate fast code. Did you try a Release config?
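As a side note - only a sketch, assuming the arm-none-eabi-gcc toolchain that STM32CubeIDE uses, and the function name is made up - GCC can also raise the optimization level for a single function, so you can keep a Debug build and still get a fast copy loop:

/* GCC-specific: compile only this function with -O2,
   even when the rest of the project is built at -O0 (Debug) */
__attribute__((optimize("O2")))
void CopyBuf32(uint32_t *dst, const uint32_t *src, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
    {
        dst[i] = src[i];
    }
}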

hth

KnarfB

LCE
Principal

Thanks for the reply, and sorry for the missing info!

STM32CubeIDE 1.6 - and yes, debug config.

I'll try the release...

LCE
Principal

Holy Smokes, THAT helped!

Thanks a lot, KnarfB!

Copy time has gone down from 38 µs (debug) to 2.8 µs (release)!

That seems almost too fast, at about 2 cycles per copy...

One more question concerning the *.list file:

Is there a setting to turn on interleaving the C source in the list file?

I've been through many project settings in the IDE, but haven't found anything yet.

Bassett.David
Associate III

If using Keil, look for a .txt file - much more useful for analyzing the generated code!

Regards,

Dave

The DWT CYCCNT might be less of a circus..

https://www.systutorials.com/generate-a-mixed-source-and-assembly-listing-using-gcc/


Thanks for the info! Lots to learn for me about ARMs.

Using DWT CYCCNT I get some corresponding numbers:

PTP time check:
u32StopNanoSec  = 374706031
u32StartNanoSec = 374703048
u32buffer[256] copy time: 2983 ns
 
DWT->CYCCNT   = 0x48DE03C2
u32DwtCycCnt1 = 1216759486
u32DwtCycCnt2 = 1216760037
DWT->CYCCNT diff = 551 -> 2870 ns @ 192 MHz

And for others who might try this, here's how to enable the CYCCNT (took me some time to find out about the LAR):

CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
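And a small sketch of how I then read it - assuming the enable sequence above has already run and the core clock is 192 MHz:

uint32_t u32Cyc1 = DWT->CYCCNT;
 
/* ... code of interest, e.g. the buffer copy loop ... */
 
uint32_t u32Cyc2 = DWT->CYCCNT;
 
/* unsigned subtraction also handles a single counter wrap-around */
uint32_t u32CycDiff = u32Cyc2 - u32Cyc1;
 
/* cycles to nanoseconds at 192 MHz: 1 cycle = 1000/192 ns */
uint32_t u32TimeNs = (uint32_t)(((uint64_t)u32CycDiff * 1000U) / 192U);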

LCE
Principal

Hello again,

after having programmed some other stuff and not checking the data copy time for a while, I'm back at this point again, and guess what: the copy time is much higher again!

I have NOT changed the code above, which takes the timestamps and copies the buffers, and I have not switched back to debug mode, but now I get a copy time of about 33 µs - which was about 3 µs after switching the compiler from debug to release mode (still using STM32CubeIDE).

Any ideas, please?

Maybe @KnarfB or @Community member have an idea about that?

KnarfB
Principal III

Hi,

If you have more lines in your code, the compiler may optimize by moving the memory load/store instructions around, or even reorder them. This will not affect the C semantics of your code, but it can affect the timing. E.g., the compiler tries to load as early as possible to hide memory latencies.

You can tell the compiler not to move the code around by placing compile-time memory barriers, shown here for gcc and the Cortex core cycle counter:

asm volatile ("" : : : "memory");		// see https://blog.regehr.org/archives/28
volatile uint32_t tick = DWT->CYCCNT;
asm volatile ("" : : : "memory");
 
// code of interest
 
asm volatile ("" : : : "memory");
volatile uint32_t tock = DWT->CYCCNT;
asm volatile ("" : : : "memory");

Strictly speaking, the volatile is not necessary to suppress the reordering, but it keeps the compiler from optimizing away seemingly unused variables.

If possible, check the assembler code, e.g. by stepping through the loop at instruction level.

One last remark: when you have several pointers around, think of defining restrict pointers (https://en.wikipedia.org/wiki/Restrict) to guide optimization - see the sketch below.
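Only a sketch (the function name is made up), combining restrict with the buffers from your code:

/* restrict promises the compiler that dst and src do not overlap,
   so it is free to schedule the loads and stores more aggressively */
void CopyBuf32(uint32_t * restrict dst, const uint32_t * restrict src, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
    {
        dst[i] = src[i];
    }
}
 
/* e.g. called between the two DWT->CYCCNT reads: */
CopyBuf32(u32RpmUartBuf, u32RpmDmaBuf[0], RPM_DMA_BUF_MAX);

With optimization enabled, the copy loop should then come out close to the ~2 cycles per word you measured in the Release build.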

hth

KnarfB