2018-06-20 04:46 AM
As I lacked speed when using the STM32F407 Discovery, I ordered an STM32F746 Nucleo-144. What surprised me was that it is slower than the STM32F407, even though the STM32F407 runs at 168 MHz max and the STM32F746 at up to 216 MHz. I configured my code with CubeMX. Simple test:
while (1)
{
    HAL_GPIO_WritePin(GPIOD, GPIO_PIN_12, GPIO_PIN_SET);
    HAL_GPIO_WritePin(GPIOD, GPIO_PIN_12, GPIO_PIN_RESET);
}
gives me 150 ns pulses on the STM32F746, while the same test gives me 100 ns on the STM32F407.
I used an 8 MHz external HSE resonator. Actually, the clock configs were the same for both MCUs, except that the STM32F746 runs at 216 MHz max and the STM32F407 at 168 MHz.
So why is that? STM32F746 should be more powerful. Also it was released later
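One way to compare the two measurements fairly is to convert each pulse width into core clock cycles. A minimal sketch, assuming both cores really run at their maximum 168 MHz and 216 MHz; the helper name pulse_cycles is made up for illustration:

```c
#include <assert.h>

/* Hypothetical helper: convert a measured pulse width into core clock cycles.
 * cycles = time [ns] * frequency [MHz] / 1000 */
double pulse_cycles(double pulse_ns, double core_mhz)
{
    assert(pulse_ns >= 0.0 && core_mhz > 0.0);
    return pulse_ns * core_mhz / 1000.0;
}
```

With the figures above, pulse_cycles(100, 168) is about 17 cycles on the F407 and pulse_cycles(150, 216) is about 32 cycles on the F746, so under these assumptions the F7 spends roughly twice as many core cycles per pulse.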
#stm32f407 #stm32f746 #external-rc
2018-06-20 03:13 PM
Hi,
Performance on an MCU is not just pin I/O toggling, which is limited by many factors: AHB speed (frequency, latency of the CPU core write and its propagation to the physical I/O, and then the external bus load).
On advanced MCUs like the Cortex-M cores, peripherals such as SPI, I/O and timers should work autonomously, using DMA or built-in FIFOs, so that the CPU is free for the important tasks (real MIPS): math computation and algorithms. This is unlike legacy MCUs.
On the STM32F7 we have a Cortex-M7 core providing twice the MIPS of the M4 (STM32F4) when running at the same speed, thanks to its pipeline and dual-issue instruction execution. Particular care should then be taken over where the code is placed, cache activation, etc.
Have a look at this app note:
Happy reading,
Ciao
STOne-32
2018-06-20 05:32 AM
>>
So why is that? STM32F746 should be more powerful. Also it was released later
Perhaps jamming a GPIO up and down, via write buffers, and slower buses is not a good measure of computational work?
2018-06-20 08:00 AM
The same thing happens with SPI.
2018-06-20 08:14 AM
Btw, I don't know what you mean by 'slower bus'. I used the same pins, and they use the same bus, AHB1. I always thought that more MHz = more speed. Isn't it related to the GPIO pins and the MCU's performance?
2018-06-20 08:21 AM
'I always thought, that more mhz = more speed'
Then your expectation is wrong.
That generalization is built on many assumptions, some of which apparently went wrong in your particular case.
2018-06-20 01:53 PM
... and the twice-as-much MHz 'H7
https://community.st.com/0D50X00009XkWanSAF
... ;)
>>
gives me 150ns pulses on STM32F746
when the same test gives me 100ns on STM32F4
There may be other things influencing this sort of 'speed' too, but first, try switching on compiler optimization.
JW
2018-06-20 05:49 PM
Data transactions to buses beyond the TCM are not occurring in a single-cycle, I'd expect something on the AHB to take at least 4 cycles. I'd also expect any decision or arbitration hardware to add additional delays and sequencing. Buses which take more time are by definition slower. Can't find a good diagram of the timing, but you've got a 6 stage pipeline trying to stuff data into a subsystem that needs to sequence through something clocking at least 1/4 the speed/throughput. Observe the behaviour on a multi-lane motorway when everyone has to suddenly funnel into a single lane at half the rated speed limit.
Your code example is also very poor, you want to get this as close to single instructions as possible (repetitive stores), not call subroutines with asserts, comparisons, shifts and loads.
See how something like this works,
{
    volatile uint32_t *p = (volatile uint32_t *)&GPIOD->BSRR;
    uint32_t a = 1 << 12, b = 1 << (12+16);
    while(1)
    {
        *p = a; *p = b; *p = a; *p = b;
        *p = a; *p = b; *p = a; *p = b;
        *p = a; *p = b; *p = a; *p = b;
        *p = a; *p = b; *p = a; *p = b;
    }
}
But also observe that Whetstone and Dhrystone benchmarks don't do IO.
If you want a pin to bang up and down at 54 MHz use a TIM, and zero CPU cycles.
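For the timer suggestion, the output frequency of a PWM channel follows from the timer input clock, the prescaler and the auto-reload value. A rough sketch of that arithmetic only, assuming a timer that is clocked at 216 MHz (which timers actually get that clock depends on the clock tree and is an assumption here); tim_output_hz is an illustrative name, not a HAL function:

```c
#include <assert.h>
#include <stdint.h>

/* Output frequency of an up-counting timer in PWM mode:
 * f_out = f_tim / ((PSC + 1) * (ARR + 1)) */
uint32_t tim_output_hz(uint32_t tim_clk_hz, uint16_t psc, uint32_t arr)
{
    assert(arr != 0u); /* an auto-reload value of 0 would block the counter */
    uint64_t divider = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(tim_clk_hz / divider);
}
```

Under that assumption, tim_output_hz(216000000u, 0, 3) gives 54000000, i.e. prescaler 0 and auto-reload 3, with a compare value of 2 for a 50% duty cycle. Once the timer is running, the pin toggles without any CPU involvement.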
2018-06-21 12:22 AM
I didn't use the BSRR register; I made a simple test with the ODR one:
GPIOA->ODR = 0x40;
GPIOA->ODR = 0x0000;
gives me around 12 ns, while the STM32F407 gives 25 ns. So yes, it's faster when we use registers directly, but I don't understand why HAL is slower on the more powerful MCU. Maybe there is some way to optimize it?
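Part of the gap between a write-pin helper and a direct register store is the extra work done around the store, as mentioned earlier in the thread (checks, a comparison, a shift). The following is a simplified host-side model, not the actual HAL source: fake_gpio_t, fake_bsrr_write and model_write_pin are made-up names, and the BSRR behaviour is emulated in software. It does not by itself explain why the F7 needs more cycles than the F4.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for a GPIO port: only the output data register. */
typedef struct { uint32_t odr; } fake_gpio_t;

/* Emulates a BSRR write: low 16 bits set pins, high 16 bits reset pins.
 * If both are given for the same pin, set wins. */
void fake_bsrr_write(fake_gpio_t *port, uint32_t value)
{
    uint32_t set_mask = value & 0xFFFFu;
    uint32_t reset_mask = value >> 16;
    port->odr = (port->odr & ~reset_mask) | set_mask;
}

/* Simplified model of a write-pin helper: an argument check, a compare,
 * a branch and (for reset) a shift, all before the single store. */
void model_write_pin(fake_gpio_t *port, uint16_t pin, int state)
{
    assert(pin != 0u);
    if (state != 0) {
        fake_bsrr_write(port, pin);
    } else {
        fake_bsrr_write(port, (uint32_t)pin << 16);
    }
}
```

A direct store such as the ODR test above skips all of that, which is consistent with it being several times faster than the helper call on both parts.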
2018-06-21 12:34 AM
I'm not sure what you mean by ''switching on compiler optimization''. I guess it's not about CubeMX clock optimization.
But anyway, this is my clock config for both MCUs:
F4:
F7: