2017-03-20 05:31 AM
Hi,
We are testing code generated from Simulink. The code operates on single-precision operands. When we compile it with Keil, targeting the STM32F767 with the single-precision FPU, the maximum time to complete one cycle of the algorithm is longer than the maximum time on the STM32F429 under the same conditions. In other words, the F429 processes and completes the algorithm faster than the F767.
From the technical manual, we expected the FPU of the F767 to be faster than that of the F429; in contrast, processing the algorithm takes longer than on the F429.
The CPU clock of the F767 is 216 MHz and that of the F429 is 168 MHz. We selected the same optimization levels in Keil before compiling.
Algorithm time is measured with a timer started at the beginning of the algorithm and stopped at the end.
The question is: why, and how, does the F767 run slower than the F429?
Thank you for your answers,
Oğuzhan Demirci
#stm32f767 #stm32f429 #fp
2017-03-23 01:33 AM
You might also want to test a different compiler. It might be that the IAR compiler for example could optimize better for the M7.
2017-03-23 03:00 AM
Any benchmarks to support that?
Not to challenge you, but I have tried neither the Keil nor IAR compiler for the M7.
Using my Crossworks (gcc-based) toolchain, I got the expected results when comparing runtime performance of a STM32F4, a STM32F74x, and a SAME70 (Atmel, 300MHz, DP-FPU).
2017-03-23 03:21 AM
I do not have any direct comparison between Keil/IAR/GCC for the STM32F7, but ST used the IAR compiler to show CoreMark performance:
http://www.st.com/en/microcontrollers/stm32f7-series.html?querycriteria=productId=SS1858
Compare with (look for the STM32F7):
http://eembc.org/coremark/index.php
Then, of course, all code is unique and you never know if one tool or the other is better.
Anyway, if you need more performance I think you should try different options.
2017-03-23 05:38 AM
Since Keil has belonged to ARM for some years now, it is IMHO not presumptuous to expect top performance from this toolchain.
Especially for the latest cores.
Then, of course, all code is unique ...
Perhaps the OP ran into a cache-thrashing effect, an issue the M7 inherited from its larger Cortex-A 'role models'.
I used the cache setup as provided by vendor examples, and 'heavier' algorithms and data sizes - a signal processing code example ported from PC/Linux.
2017-03-24 09:59 AM
With many compilers, operations on single-precision operands produce a double-precision result unless this is specifically overridden. The remaining operations will then be carried out in double precision and will probably be slower.
SP * SP = DP
SP * DP = DP etc
2017-03-25 04:50 PM
I'm not convinced that is the case; most compilers I'm familiar with don't promote types*. And the ARM implementation isn't one where the intermediate results are held at a higher precision. That sort of thing would make saving/switching context very complicated.
The FPU-D might do all work at 64 bits and truncate the results to 32 bits; that would depend, I suppose, on whether it yields a significant reduction in gates. Intel and Motorola FPUs from decades ago did everything at 80 bits and had the hardware truncate automatically when values were exported. It would be nice to see some more technical diagrams for the ARM FPv5 implementation.
The FPU and SW implementations might also provide slightly different results.
printf() frequently works with doubles.
* The various software/hardware implementations might hold higher precision mid-process, but results are likely truncated to the smaller mantissa/exponent before a value is handed back.
2017-03-26 11:41 PM
I'm not convinced that is the case, most I'm familiar with don't promote types*.
I agree.
The IEEE 754 says so as well.
Nor does C perform such a type promotion.
2018-03-22 08:28 AM
I had the same observations and it took me a while to understand why. The STM32F7 is only faster if you enable the ITCM and DTCM and place time-critical sections of the code judiciously. By default the compiler will not do this, so unless you know what you are doing and how to take advantage of these additional features, the STM32F7 will appear slower than the STM32F4.
In other words, you need more fine-tuning on the STM32F7, and the default compiler options can't do this for you because they don't know which sections of your code should be placed into ITCM and DTCM.
One book that explains some of these feature differences is 'The Designer's Guide to the Cortex-M Processor Family, Second Edition' by Trevor Martin. There is also an ST application note (AN4667, 'STM32F7 Series system architecture and performance') that goes into the details.
It would be difficult to offer general advice, but a few suggestions: relocate your stack and time-critical data to the DTCM, and place your critical code (interrupt routines) in the ITCM.
Don't expect miraculous improvements; the STM32F4 is already a fast processor and ST have done a lot to make it as efficient as possible. But as an illustration, I have managed to achieve an improvement of about 20-30% in my single-precision floating-point intensive interrupt routine (at the same clock frequency).
2018-03-22 09:18 AM
I haven't for the moment dug into the ART Accelerator, I-cache, and D-cache, but I'm porting a microframework (TinyCLR) from the F4 to the F7, and a high-level C# benchmark that calculates PI decimals takes 1.8 s on an F429 (180 MHz) and 0.90 s on an F769 (216 MHz), for a total of 20 decimals, so the F7 is twice as fast as the F4. The native code C compiler is ARM GCC.
2018-03-22 09:39 AM
Where it is going to spank the F4, and a poorly optioned CM7, is double-precision math.