I wanted to know how to speed up the execution by putting the code in RAM.

fhu.11 · ‎2021-09-24

I learned from section 2.5.7.2 of the STM32CubeIDE user manual that this can be done by modifying the .ld file and the .s file and declaring the functions with __attribute__((section(".RAM_Code"))).
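
For reference, a minimal sketch of what such a declaration might look like. The function name add_u16 and the RAM_CODE macro are made up for illustration, and the noinline attribute is an addition that is not mentioned above; it is there so the compiler does not inline the body back into a caller that lives in Flash. The section name must match whatever the .ld file defines.

```c
#include <stdint.h>

/* Apply the section attribute only when building for the ARM target; on any
   other host the macro expands to nothing so the file still compiles. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_CODE __attribute__((section(".RAM_Code"), noinline))
#else
#define RAM_CODE
#endif

/* Hypothetical example function: element-wise 16-bit add. */
RAM_CODE void add_u16(uint16_t *res, const uint16_t *a, const uint16_t *b, int n)
{
    for (int i = 0; i < n; i++) {
        res[i] = (uint16_t)(a[i] + b[i]);   /* wraps modulo 2^16 */
    }
}
```

The attribute only tells the linker where the function should run from. The code still has to be copied from Flash to RAM at startup, which is what the .ld and .s changes are for.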

But I am still confused about modifying the .s file. How do I add this code to the .s file?

KnarfB · ‎2021-09-26

The .s file in question is the startup*.s file under Startup. It should already contain the first two lines and the last five lines (starting with the comment "Copy the data ...") of the code snippet you posted. Those last lines copy initialized data from Flash to RAM (such as initialized global variables in your code). The code before them (starting with the comment "Copy the ram code ...") does the same for the code of your attributed functions. So take all lines of the snippet from "Copy the ram code ..." up to, but not including, "Copy the data ...", and insert them right before "Copy the data ..." in your startup*.s file.

Note: My startup file starts a little differently:

Reset_Handler:
  ldr   sp, =_estack    /* Set stack pointer */

/* Call the clock system initialization function.*/
  bl    SystemInit

Leave all existing lines in the .s file as they are; just add the code-copying loop as described above.
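
To make clear what that copying loop does, here is a rough C rendering of the idea. This is not the assembly from the manual, only a sketch of the same logic, and the linker symbol names in the comment are hypothetical placeholders; the real names are whatever your .ld file exports.

```c
#include <stdint.h>

/* Copy 32-bit words from a load address (Flash) to a run address (RAM),
   which is what the startup loop does before main() is reached. */
void copy_words(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* On target, the three addresses would come from symbols defined in the
   .ld file. The names below are assumptions for illustration only:

   extern uint32_t _sram_code;   start of the RAM code region (run address)
   extern uint32_t _eram_code;   end of the RAM code region
   extern uint32_t _siram_code;  start of its image in Flash (load address)

   copy_words(&_sram_code, &_eram_code, &_siram_code);
*/
```

Whatever form the copy takes, it has to run from Flash and finish before the first call to any function placed in RAM.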

hth

KnarfB

S.Ma · ‎2021-09-26

How much speed-up are we talking about? Which chip, core frequency and compiler? Do you use DMA, and at what transfer rate? You might not save much in the end... the hardware around the core works wonders for performance.

fhu.11 · ‎2021-09-26

Thanks for your answer!

I have successfully copied the code from Flash to RAM for execution on a G4 series chip. However, I find that the code executes slower in RAM than in Flash. This confuses me, because code ran faster in RAM when I used a TI C2000 DSP chip.

The test code is as follows; it is a 16-bit add.

for(i=0;i<100;i++)
{
    uiRes[i]=uiCal1[i]+uiCal2[i];
}

Later, I ran the same test on a G0 series chip and found that the code executes faster in RAM. What is the difference between the two chips when executing code in RAM?

S.Ma · ‎2021-09-26

Many. The core is different, as are compiler options, flash wait states vs. voltage and frequency ranges, ART accelerator config, ...

KnarfB · ‎2021-09-26

As @S.Ma pointed out: G4 is quite good at hiding flash latencies. Check the RM0440 reference manual, which states "Based on CoreMark benchmark, the performance achieved thanks to the ART accelerator is equivalent to 0 wait state program execution from Flash memory at a CPU frequency up to 170 MHz."

Check the generated assembly code and try to optimize it with compiler options, restrict pointers, etc.
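
As a sketch of the restrict suggestion, the test loop could be wrapped in a function like the one below. The function name and parameter names are made up for illustration. restrict (C99) promises the compiler that the three arrays do not overlap, which gives it more freedom to reorder loads and stores; whether that actually changes the generated code depends on the compiler version and optimization level, so compare the assembly before and after.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise 16-bit add. The restrict qualifiers are a promise by the
   caller that res, a and b never overlap. */
void add_u16_restrict(uint16_t *restrict res,
                      const uint16_t *restrict a,
                      const uint16_t *restrict b,
                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        res[i] = (uint16_t)(a[i] + b[i]);   /* wraps modulo 2^16 */
    }
}
```

Calling this with overlapping arrays is undefined behavior, so only add restrict where the no-overlap promise really holds.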

KnarfB · ‎2021-09-27

Try SIMD intrinsics like:

uint32_t *restrict uiRes_32 = (uint32_t*)uiRes;
uint32_t *restrict uiCal1_32 = (uint32_t*)uiCal1;
uint32_t *restrict uiCal2_32 = (uint32_t*)uiCal2;

for (int i = 0; i < len/2; i++)
{
	uiRes_32[i] = __UADD16( uiCal1_32[i], uiCal2_32[i] );
}

This requires some thought about proper alignment etc.
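
One possible way to deal with alignment and an odd element count is sketched below. uadd16_ref is a portable stand-in written for this example so that it also builds off-target; it only models the two wrapping 16-bit additions of __UADD16, not the GE flags the real instruction sets. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for __UADD16: two independent 16-bit unsigned
   additions, each wrapping modulo 2^16. */
uint32_t uadd16_ref(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;           /* low lane, carry discarded */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;   /* high lane, carry shifted out */
    return hi | lo;
}

void add_u16_packed(uint16_t *res, const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i = 0;

    /* Two elements per iteration. memcpy sidesteps unaligned and
       aliasing-unsafe pointer casts; compilers typically turn these
       fixed-size copies into plain word loads and stores. */
    for (; i + 1 < n; i += 2) {
        uint32_t x, y, r;
        memcpy(&x, &a[i], sizeof x);
        memcpy(&y, &b[i], sizeof y);
        r = uadd16_ref(x, y);   /* on Cortex-M4 this would be __UADD16(x, y) */
        memcpy(&res[i], &r, sizeof r);
    }

    /* Leftover element when n is odd. */
    if (i < n) {
        res[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

Because both inputs are packed the same way, the lane-wise add gives the same result regardless of byte order.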

N.B.: I couldn't get gcc to generate SIMD instructions automagically, hmm. It used to work on Cortex-A, however.

hth

KnarfB

Piranha · ‎2021-09-29

In many cases, code in flash memory can run faster because instructions and data are then fetched from different memories, and the fetches can happen simultaneously. Try splitting code and data between the available memories; for example, put the stack and/or code in CCM RAM and move data to SRAM2, or try some other combination.