I wanted to know how to speed up the execution by putting the code in RAM.
‎2021-09-24 1:10 AM
I learned from section 2.5.7.2 of the STM32CubeIDE user manual that this can be done by modifying the .ld file and the .s file, and by declaring functions with __attribute__((section(".RAM_Code"))).
But I am still confused about modifying the .s file. How do I add this code to the .s file?
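For reference, the C side of this approach might look like the sketch below. It assumes the linker script defines an output section named .RAM_Code that runs from RAM and is loaded from Flash, and that the startup code copies it before main() runs. The RAM_CODE macro and the function sum_u16_ram are hypothetical names used only for illustration; the attribute is applied only when building for an Arm target, so the same file also compiles on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply the section attribute only on Arm GCC builds; on a host build the
 * macro expands to nothing. ".RAM_Code" must match the section name used in
 * the linker script (assumption: the name quoted in the question). */
#if defined(__arm__) && defined(__GNUC__)
#define RAM_CODE __attribute__((section(".RAM_Code"), noinline))
#else
#define RAM_CODE
#endif

/* Hypothetical function that should execute from RAM on the target. */
RAM_CODE uint32_t sum_u16_ram(const uint16_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

The attribute only selects the section; the function actually runs from RAM only if the linker script and the startup copy loop are set up as well.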
- Labels: STM32CubeIDE, STM32G4 Series
Accepted Solutions
‎2021-09-26 11:54 PM
As @S.Ma pointed out, the G4 is quite good at hiding flash latencies. See the RM0440 reference manual, which states: "Based on CoreMark benchmark, the performance achieved thanks to the ART accelerator is equivalent to 0 wait state program execution from Flash memory at a CPU frequency up to 170 MHz."
Check the generated assembly code and try to optimize it with compiler options, restrict pointers, etc.
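As a rough illustration of the restrict suggestion, the 16-bit add loop discussed in this thread could be wrapped in a function like the one below. The function name add16_restrict and its parameters are assumptions for the example. The restrict qualifier promises the compiler that the three arrays do not overlap, which may let it reorder loads and stores; whether that actually helps depends on the compiler version and optimization level, so compare the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise 16-bit add. The caller must guarantee that res, a and b
 * do not overlap, otherwise the behaviour is undefined. */
void add16_restrict(uint16_t *restrict res,
                    const uint16_t *restrict a,
                    const uint16_t *restrict b,
                    size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The sum is computed as int and truncated back to 16 bits. */
        res[i] = (uint16_t)(a[i] + b[i]);
    }
}
```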
‎2021-09-26 5:38 AM
The .s file in question is the startup*.s file under Startup. It should already contain the first two lines and the last five lines (starting with the comment "Copy the data ...") of the code snippet you posted. Those last lines copy initialized data from Flash to RAM (like string constants in your code). The code before them (starting with the comment "Copy the ram code ...") does the same for the code of your attributed functions. So you have to take all lines of the snippet from "Copy the ram code ..." up to, but not including, "Copy the data ...", and insert them right before "Copy the data ..." in your startup*.s file.
Note: My startup file starts a little differently:
Reset_Handler:
  ldr   sp, =_estack    /* Set stack pointer */
  /* Call the clock system initialization function.*/
  bl  SystemInit
Leave all existing lines in the .s file, just add the code copying loop as said above.
hth
KnarfB
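What such a copy loop does can be sketched in C as below. This is only a model of the logic: the real startup file does it in assembly before any C code runs. The function name copy_section and the linker symbol names in the comment are hypothetical, not taken from the manual or from a generated project.

```c
#include <stdint.h>

/* Copy 32-bit words from the load address (in Flash) to the run address
 * (in RAM). dst_start and dst_end delimit the RAM region; src points to
 * where the linker stored the section's contents in Flash. */
void copy_section(const uint32_t *src, uint32_t *dst_start, uint32_t *dst_end)
{
    uint32_t *dst = dst_start;
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* On a target this would be called with symbols exported by the linker
 * script, for example (hypothetical names):
 *
 *   extern uint32_t _siram_code, _sram_code, _eram_code;
 *   copy_section(&_siram_code, &_sram_code, &_eram_code);
 */
```

The same pattern is used for the initialized data section, which is why the inserted loop looks so similar to the existing "Copy the data ..." lines.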
‎2021-09-26 7:33 AM
How much speed-up are we talking about? Which chip, core frequency and compiler? Do you use DMA, and at which transfer rate? You might not save much in the end; the hardware around the core works wonders for performance.
‎2021-09-26 7:44 PM
Thanks for your answer!
I have successfully copied the code from Flash to RAM for execution on a G4 series chip. However, I find that the code executes slower from RAM than from Flash. This confuses me, because code ran faster from RAM when I used a TI C2000 DSP chip.
The test code is as follows; it's a 16-bit add.
for(i=0;i<100;i++)
{
  uiRes[i]=uiCal1[i]+uiCal2[i];
}
Later, I ran the same test on a G0 series chip and found that the code executes faster from RAM. What is the difference between the two chips when executing code from RAM?
‎2021-09-26 8:37 PM
Many things: the core is different, as are the compiler options, flash wait states versus voltage and frequency ranges, ART accelerator configuration, ...
‎2021-09-27 2:50 AM
Try SIMD intrinsics like:
/* len is the number of 16-bit elements, assumed to be even */
uint32_t *restrict uiRes_32 = (uint32_t*)uiRes;
uint32_t *restrict uiCal1_32 = (uint32_t*)uiCal1;
uint32_t *restrict uiCal2_32 = (uint32_t*)uiCal2;
for (int i = 0; i < len/2; i++)
{
  uiRes_32[i] = __UADD16( uiCal1_32[i], uiCal2_32[i] );
}
This requires some thought about proper alignment, etc.
N.B.: I couldn't get gcc to generate SIMD instructions automatically, hmm. It used to work on Cortex-A, however.
hth
KnarfB
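To make clear what the __UADD16 intrinsic computes, here is a portable C model of it together with a loop that processes two 16-bit elements per step. This is a sketch for understanding and host-side testing, not a replacement for the intrinsic: uadd16_model and add16_pairs are hypothetical names, the GE flags set by the real instruction are not modelled, and memcpy is used instead of pointer casts so the example makes no alignment or aliasing assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of UADD16: the low and high halfwords are added independently.
 * Each 16-bit lane wraps on overflow and no carry crosses into the
 * other lane. */
uint32_t uadd16_model(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;
    return hi | lo;
}

/* Add n 16-bit elements, two at a time, with a scalar step for an odd tail. */
void add16_pairs(uint16_t *res, const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t x, y, r;
        memcpy(&x, &a[i], sizeof x);   /* load two halfwords from a */
        memcpy(&y, &b[i], sizeof y);   /* load two halfwords from b */
        r = uadd16_model(x, y);
        memcpy(&res[i], &r, sizeof r); /* store two results */
    }
    if (i < n) {
        res[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

Because both lanes are treated identically, the result does not depend on byte order. On the target, the call to uadd16_model would be replaced by the CMSIS intrinsic.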
‎2021-09-29 11:33 AM
In many cases code in flash memory can run faster, because instructions and data are then fetched from different memories and the fetches can happen simultaneously. Try splitting code and data between the available memories. For example, put the stack and/or code in CCM RAM and move data to SRAM2, or try some other combination.
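One way to experiment with such a split is to place the buffers in different sections, as in the sketch below. The section names .ccmram and .sram2 and the macros IN_CCMRAM and IN_SRAM2 are assumptions: they only have an effect if the linker script maps sections with these names to the corresponding memories, and variables placed there are not necessarily zeroed or initialized by the default startup code. The array names follow the test code from this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Section attributes are applied only on Arm GCC builds; on a host the
 * macros expand to nothing. The section names are hypothetical and must
 * match output sections defined in the linker script. */
#if defined(__arm__) && defined(__GNUC__)
#define IN_CCMRAM __attribute__((section(".ccmram")))
#define IN_SRAM2  __attribute__((section(".sram2")))
#else
#define IN_CCMRAM
#define IN_SRAM2
#endif

#define N_SAMPLES 100

IN_CCMRAM uint16_t uiCal1[N_SAMPLES]; /* first operand, intended for CCM RAM */
IN_SRAM2  uint16_t uiCal2[N_SAMPLES]; /* second operand, intended for SRAM2 */
uint16_t uiRes[N_SAMPLES];            /* result, default RAM placement */

/* The 16-bit add from the test, operating on the buffers above. */
void add_buffers(void)
{
    for (size_t i = 0; i < N_SAMPLES; i++) {
        uiRes[i] = (uint16_t)(uiCal1[i] + uiCal2[i]);
    }
}
```

Which combination is fastest has to be measured on the actual chip; the map file shows where each buffer really ended up.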
