2015-12-01 11:20 PM
Hello.
I have a very DSP-heavy program to run, and the STM32F4 is not up to the task, so I switched to the STM32F7 and HAL (it is just asking for trouble). I have a few questions related to code execution from flash and storing DSP coefficients for math.
1) The STMCubeMX CORTEX_M7 configuration allows for AXI and TCM interfaces. I have a small program, around 16-32k. Which should I use to get the best performance? Or is this not an issue?
2) As far as I understand, for best Cortex-M7 performance, DSP coefficients must be held in DTCM. What is the syntax to place an f32 array of 128-256 points in this part of memory?
This is the code I need to get stored inside DTCM:
float sin_coeff[256];
float cos_coeff[256];
uint32_t i;
/* Fill both tables; the constant factors are folded at compile time. */
for (i = 0; i < 256; i++)
{
    sin_coeff[i] = sinf((6.2831853f * 128.0f * (float)i) / 256);
    cos_coeff[i] = cosf((6.2831853f * 128.0f * (float)i) / 256);
}
2015-12-02 02:28 AM
Hi karpavicius.linas,
• “STMCubeMX CORTEX_M7 configuration allows for AXI and TCM interface. I have small program, around 16-32k. What i have to use to get best performance? Or this is not a issue?”
When the code size of the user application fits into the internal Flash memory, the latter would be the best execution region, either:
– Through TCM (Flash-ITCM) by enabling the ART accelerator, or
– Through AXI/AHB by enabling the cache in order to reach 0-wait state at 216 MHz
Note that execution from Flash-ITCM with data in DTCM-RAM and execution from Flash-AXI with data in DTCM-RAM have the same CoreMark score, which is 5 CoreMark/MHz.
• “As far as i understand, for best Cortexm7 performance, DSP coefficients must be held in DTCM.”
You are right. DTCM-RAM is accessible by bytes, half-words (16 bits), words (32 bits) or double words (64 bits), and it is accessible at the maximum CPU clock speed without latency, which enables the Cortex-M7 processor to achieve excellent performance in many control and DSP applications.
• “What is the syntax to generate f32 128-256 points array in this part of memory?”
The syntax depends on the tool that you use. You have to modify your linker file and create a new memory section from 0x2000 0000 to 0x2000 FFFF (64 Kbyte of DTCM-RAM) in order to place your data in this area.
• I'd highly recommend that you have a look at the application note: it provides a demonstration of the performance of the STM32F7 Series devices in various memory partitioning configurations (different code and data locations).
-Syrine-
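As one possible illustration of the linker-section approach described above, the sketch below tags the two tables with a named section. The section name `.dtcm_data` and the macro `DTCM_DATA` are hypothetical: the name has to match whatever section your own linker file maps onto 0x2000 0000 to 0x2000 FFFF. The `_Pragma("location=...")` form is meant for IAR and the `__attribute__((section(...)))` form for GCC-style ARM compilers; on any other compiler the macro expands to nothing, so the file still builds on a PC. The matching linker-file change is not shown because it depends on the toolchain.

```c
#include <assert.h>  /* static_assert (C11) */

/* Hypothetical helper: DTCM_DATA requests placement in a linker section
   called ".dtcm_data". That name is an assumption and must match the
   section your linker file places at 0x20000000-0x2000FFFF. */
#if defined(__ICCARM__)                      /* IAR Embedded Workbench */
#define DTCM_DATA _Pragma("location=\".dtcm_data\"")
#elif defined(__GNUC__) && defined(__arm__)  /* GCC-style ARM compilers */
#define DTCM_DATA __attribute__((section(".dtcm_data")))
#else                                        /* host build: no placement */
#define DTCM_DATA
#endif

#define TABLE_SIZE 256u

DTCM_DATA float sin_coeff[TABLE_SIZE];
DTCM_DATA float cos_coeff[TABLE_SIZE];

/* Build-time check that both tables fit into 64 Kbyte of DTCM-RAM. */
static_assert(sizeof sin_coeff + sizeof cos_coeff <= 0x10000u,
              "tables must fit in the 64 Kbyte DTCM-RAM");
```

Variables in a custom section may not be cleared by the startup code, which does not matter here because both tables are filled before use.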
2015-12-02 04:37 AM
I am using IAR ARM.
It does say that the variable is placed at:
cos_coeff <array> 0x2000086C float[50000]
But can it place an array that long, one that starts in DTCM RAM and ends up outside this memory? Should the linker place it in SRAM?
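The numbers in the map entry above can be checked with a little address arithmetic. The sketch below is only an illustration: it assumes a 4-byte float and a DTCM-RAM window of 64 Kbyte starting at 0x20000000, and the names `DTCM_BASE`, `DTCM_SIZE` and `bytes_outside_dtcm` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DTCM_BASE 0x20000000u  /* assumed start of DTCM-RAM */
#define DTCM_SIZE 0x00010000u  /* assumed size: 64 Kbyte    */

/* Returns how many bytes of an object that starts at 'addr' and is
   'size' bytes long lie beyond the end of the DTCM-RAM window. */
uint32_t bytes_outside_dtcm(uint32_t addr, uint32_t size)
{
    uint32_t dtcm_end = DTCM_BASE + DTCM_SIZE;     /* first address after DTCM   */
    assert(addr >= DTCM_BASE && addr < dtcm_end);  /* object must start inside   */
    uint32_t end = addr + size;                    /* first address after object */
    return (end > dtcm_end) ? (end - dtcm_end) : 0u;
}
```

For the array in the map entry, `bytes_outside_dtcm(0x2000086Cu, 50000u * 4u)` gives 136620, so only the first 63380 bytes (15845 floats) fall inside that window.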
2015-12-02 06:02 AM
Hi karpavicius.linas,
With an array of 50000 elements, it is clear that you exceed the 64 Kbyte of DTCM-RAM, so part of the data is placed in DTCM and the rest is placed in SRAM.
-Syrine-
2015-12-02 06:08 AM
There are surely a number of options to do this.
You could use fixed pointers, pragma/attribute settings with the memory suitably carved up in the linker script/scatter file, or a heap that's sufficiently large for these structures and parked in this area.
The math making the table could probably be done more cleanly:
sin_coeff[i]=(float)sinf((6.2831853f*128.0f*(float)i)/256);
sin_coeff[i]=sinf(6.2831853f*128.0f*(float)i*0.00390625f);
sin_coeff[i]=sinf(3.1415926535897932f*(float)i);
2015-12-02 06:19 AM
sin_coeff[i]=sinf(3.1415926535897932f*(float)i);
The compiler will turn my code into that anyway; at the highest optimization level it should be smart enough to do the constant math beforehand. Also, this is a LUT, and sinf takes ages to calculate; that's why I never use sinf functions, only a LUT.
2015-12-02 08:13 AM
>sin_coeff[i]=sinf(3.1415926535897932f*(float)i);
>
>The compiler will turn my code into that anyway; at the highest optimization level it should be smart enough to do the constant math beforehand. Also, this is a LUT, and sinf takes ages to calculate; that's why I never use sinf functions, only a LUT.
Why not pre-generate the LUT and include it as an array of constants in the source code?
2015-12-02 08:25 AM
Because it is very flexible, and gives the best results in terms of performance. It's all about execution speed with me.
2015-12-02 11:31 AM
It's just an idea.
I used this method - with integer calculations and a respective LUT - for a tight loop on an F100.
>Because it is very flexible, ...
Unless it reads and changes variables that define the resultant LUT at runtime, there is no additional flexibility compared to a constant table. Both methods need a rebuild/reflash cycle. For my application, I wrote a tool to create that LUT as a separate source file.
> and gives the best results in terms of performance. It's all about execution speed with me.
If access to RAM is faster than to code/Flash, that would be an argument. Never tried an F7, though. And honestly, in that case I would prefer an M7 with a double-precision FPU ...