
More efficient startup code

ppannuto
Associate
Posted on March 02, 2016 at 22:03

Hi all,

For an application I'm working on, I needed to speed up startup. I looked into the .init (initialized data copy) and .bss routines and saw that they weren't very efficient. I rewrote them and got about a 2.2x speedup in startup time for my application. The revised code does exactly the same work, just in fewer cycles.

Feel free to use this if it's useful to anyone. @ST, feel free to copy it into the library you distribute.

https://gist.github.com/ppannuto/672328eb8184abdb9559

-Pat

Amel NASRI
ST Employee
Posted on March 03, 2016 at 16:17

Hi pannuto.pat,

Thanks for sharing your new startup file for GCC.

I would like to understand in what sense the ".init and .bss routines" are "not very efficient".

Could you also provide more details on the changes you made to reduce the startup time?

-Mayla-


ppannuto
Associate
Posted on March 03, 2016 at 19:45

Sure. If you compare the two loops, the original code executed several more instructions per iteration: on every pass it reloaded the same constant addresses (_sidata, _sdata and _edata) from the literal pool into registers, which it didn't need to do. You can also eliminate the adds that advances the offset by using ldmia/stmia (load/store multiple, increment after) with writeback, which bumps the pointer as part of the transfer; a single-register stmia takes 2 cycles, the same as a plain str. (On more capable cores, Cortex-M3 and up, you would usually use post-indexed addressing, i.e. str r0, [r1], #4, to do this. Post-indexed addressing isn't supported on the M0, but ldmia/stmia with writeback is.)

In the old code, the loop part was

  ldr  r3, =_sidata
  ldr  r3, [r3, r1]
  str  r3, [r0, r1]
  adds  r1, r1, #4
  ldr  r0, =_sdata
  ldr  r3, =_edata
  adds  r2, r0, r1
  cmp  r2, r3
  bcc  CopyDataInit

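  5 memory operations x 2 cycles each  = 10 cycles
+ 3 ALU operations    x 1 cycle  each  =  3 cycles
+ 1 taken branch      x 3 cycles       =  3 cycles

For 16 cycles / loop. (On the M0 a conditional branch costs 1 cycle when not taken and 3 when taken, and the loop branch is taken on every iteration but the last.) In the new code, the loop part is

  ldmia r2!, {r3}
  stmia r0!, {r3}
  cmp   r0, r1
  bcc   CopyDataInitializersLoop

  2 load/store-multiple operations x 2 cycles each = 4 cycles
+ 1 ALU operation                  x 1 cycle       = 1 cycle
+ 1 taken branch                   x 3 cycles      = 3 cycles

For 8 cycles / loop. The ldmia/stmia writeback also does the work of the old adds, so there is no separate pointer update.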

Is this clear?

-Pat

Complete Old Copy Data:

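  movs  r1, #0               /* r1 = byte offset into .data */
  b  LoopCopyDataInit
CopyDataInit:
  ldr  r3, =_sidata          /* reloaded on every iteration */
  ldr  r3, [r3, r1]          /* load word from the flash image */
  str  r3, [r0, r1]          /* store word to RAM */
  adds  r1, r1, #4           /* advance the offset */
LoopCopyDataInit:
  ldr  r0, =_sdata           /* reloaded on every iteration */
  ldr  r3, =_edata           /* reloaded on every iteration */
  adds  r2, r0, r1           /* current destination address */
  cmp  r2, r3
  bcc  CopyDataInit          /* loop while below _edata */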

Complete New Copy Data:

CopyDataInitializersStart:
  ldr   r0, =_sdata   /* write to this addr */
  ldr   r1, =_edata   /* until you get to this addr */
  ldr   r2, =_sidata  /* reading from this addr */
  b     CopyDataInitializersEnterLoop  /* test first: an empty .data copies nothing */
CopyDataInitializersLoop:
  ldmia r2!, {r3}     /* r3 = *r2, then r2 += 4 */
  stmia r0!, {r3}     /* *r0 = r3, then r0 += 4 */
CopyDataInitializersEnterLoop:
  cmp   r0, r1
  bcc   CopyDataInitializersLoop  /* repeat while r0 < _edata */