cancel
Showing results for 
Search instead for 
Did you mean: 

STM32F7 @ 192MHz: long data copy time?

LCE
Principal

Hello again,

I recently checked the time the STM32F7 (192MHz) needs to copy one 32bit buffer to another one, and I'm a little shocked about how long that takes.

(Maybe I expect too much, coming from 8bit µCs with 20MHz, an STM32 still feels like a racing car compared to a horse carriage)

For time measurement the interrupts are disabled and the PTP registers are used.

It takes about 38µs to copy a uint32_t buffer[256] to another one, that is about 28 cycles @192MHz. Okay, there's the for-loop comparison and increment, but still, 28 cycles?

Anything I oversee or is that just it?

I know it can be a long way from C to assembler, and I haven't checked that yet (never even looked at the STM32 / ARM assembler stuff), but the list excerpt below shows some 17 CPU actions for the for-loop - right?

Here's the uart/debug output:

*.c:

#define RPM_DMA_BUF_MAX		256
 
uint32_t u32RpmDmaBuf[2][RPM_DMA_BUF_MAX] = { { 0 } };
 
uint32_t u32RpmUartBuf[RPM_DMA_BUF_MAX];
uint32_t *pu32SrcBuf;
 
uint32_t u32StopNanoSec = 0;
uint32_t u32StartNanoSec = 0;
 
/* copy buffer with snapshot of timestamp before and after */ 
__disable_irq();
u32Val = ETH->PTPTSLR;
 
for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
{
    u32RpmUartBuf[i16] = *(pu32SrcBuf++);
}
u32Val2 = ETH->PTPTSLR;
__enable_irq();
 
u32StartNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val);
u32StopNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val2);
 
u32RpmDebugTime = 0;
if( u32StopNanoSec > u32StartNanoSec ) u32RpmDebugTime = u32StopNanoSec - u32StartNanoSec;
uart_printf("u32StopNanoSec  = %ld\n\r", u32StopNanoSec);
uart_printf("u32StartNanoSec = %ld\n\r", u32StartNanoSec);
uart_printf("u32buffer[%d] copy time: %ld ns\n\r", RPM_DMA_BUF_MAX, u32RpmDebugTime);

*.list

__disable_irq();
u32Val = ETH->PTPTSLR;
 8006b42:	4bad      	ldr	r3, [pc, #692]	; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
 8006b44:	f8d3 370c 	ldr.w	r3, [r3, #1804]	; 0x70c
 8006b48:	f8c7 3498 	str.w	r3, [r7, #1176]	; 0x498
						for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
 8006b4c:	2300      	movs	r3, #0
 8006b4e:	f8a7 3504 	strh.w	r3, [r7, #1284]	; 0x504
 8006b52:	e010      	b.n	8006b76 <UART3_RxCmndProcessing+0x1556>
						{
							//u32RpmUartBuf[i16] = u32RpmDmaBuf[u8RpmBufPtr][i16];
							//u32RpmUartBuf[i16] = *(pu32SrcBuf++);
							*(pu32DstBuf++) = *(pu32SrcBuf++);
 8006b54:	f8d7 24ec 	ldr.w	r2, [r7, #1260]	; 0x4ec
 8006b58:	1d13      	adds	r3, r2, #4
 8006b5a:	f8c7 34ec 	str.w	r3, [r7, #1260]	; 0x4ec
 8006b5e:	f8d7 34e8 	ldr.w	r3, [r7, #1256]	; 0x4e8
 8006b62:	1d19      	adds	r1, r3, #4
 8006b64:	f8c7 14e8 	str.w	r1, [r7, #1256]	; 0x4e8
 8006b68:	6812      	ldr	r2, [r2, #0]
 8006b6a:	601a      	str	r2, [r3, #0]
						for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
 8006b6c:	f8b7 3504 	ldrh.w	r3, [r7, #1284]	; 0x504
 8006b70:	3301      	adds	r3, #1
 8006b72:	f8a7 3504 	strh.w	r3, [r7, #1284]	; 0x504
 8006b76:	f8b7 3504 	ldrh.w	r3, [r7, #1284]	; 0x504
 8006b7a:	2bff      	cmp	r3, #255	; 0xff
 8006b7c:	d9ea      	bls.n	8006b54 <UART3_RxCmndProcessing+0x1534>
						}
u32Val2 = ETH->PTPTSLR;
 8006b7e:	4b9e      	ldr	r3, [pc, #632]	; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
 8006b80:	f8d3 370c 	ldr.w	r3, [r3, #1804]	; 0x70c
 8006b84:	f8c7 3494 	str.w	r3, [r7, #1172]	; 0x494
  __ASM volatile ("cpsie i" : : : "memory");
 8006b88:	b662      	cpsie	i
}
 8006b8a:	bf00      	nop
__enable_irq();

10 REPLIES 10

Thanks KnarfB!

Interesting effects, at first the "asm volatile..." lines didn't do anything, now after some more cleans and builds timing is back where it should be.

So I am not so sure if that was "compiler luck" or the lines you mentioned.

Anyway, thanks for the help - and the link about volatile.

Coming from the hardware side, "volatile" is something I somehow always tried to avoid and thus never really understood, just remembering that some volatile declaration troubled me a long time ago when working with some 8-bit controllers / compilers...