PMath.4
Senior III
October 24, 2019
Question

Arm assembler - branch by a programmable amount


I want to manipulate the program counter in an assembler function by branching down a list of instructions by a programmable amount:

.syntax unified
.global myasm
myasm: PUSH {R0,R1,r2,r4,R5}
ADD R15,#4
 
 

is rejected by the assembler

.syntax unified
.global myasm
myasm: PUSH {R0,R1,r2,r4,R5}
mov r5,r2,lsl #2
ADD R15,r5
 
 

is accepted but isn't doing what I want. I assume that R15 increments by 4 for each instruction so I'm making sure that I'm always adding multiples of 4.

This is a technique I've used on other processors from the PDP11 onwards, but the ARM crashes as soon as the ADD instruction executes.

Any help appreciated


    waclawek.jan
    Super User
    October 24, 2019

    > I assume that R15 increments by 4 for each instruction

    No. Cortex-M cores don't execute ARM instructions.

    Cortex-M cores execute the Thumb-2 instruction set. It has a 16-bit instruction word, but some instructions occupy one word and some two, i.e. some are 2 bytes and others 4 bytes (the details differ between Cortex-M0/M0+ and Cortex-M3/M4/M7).

    Thumb also severely restricts what you can do with PC/R15; your basic option is to use BX, but you might be pleasantly surprised by the TBB/TBH instructions.

    Read the ARMv7-M Architecture Reference Manual (unless you intend to use Cortex-M0/M0+, in which case it's ARMv6-M).

    JW
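For comparison, the same computed-jump idea can be written in C as a switch whose cases fall through. This is only a sketch: `copy_upto8` is a made-up name, and whether a particular compiler turns the switch into a TBB/TBH table branch is an assumption to check in the disassembly, not something the language guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n (0..8) bytes with no loop.
 * The switch jumps into the middle of an unrolled copy and every
 * case deliberately falls through to the ones below it. */
static void copy_upto8(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n <= 8);
    switch (n) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
}
```

Calling `copy_upto8(dst, src, 3)` enters at `case 3` and executes exactly three byte copies, in source order.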

    PMath.4 (Author)
    Senior III
    October 24, 2019

    Jan

    Thanks for the pointer. TBH can be made to do what I want, albeit in a clunky way compared to just directly adding an offset to the program counter. I've got my code working in a test program; however, in the main program it causes problems. I suspect that, despite pushing and popping all relevant registers, the assembler code is conflicting with pipelining, caching or something similar.

    Tesla DeLorean
    Guru
    October 24, 2019

    Perhaps you can BL to the next instruction, ADD to LR, and then BX LR out of it.

    Tips, Buy me a coffee, or three.. PayPal Venmo (See Profile) Up vote any posts that you find helpful, it shows what's working..
    PMath.4 (Author)
    Senior III
    October 24, 2019

    "Perhaps you can BL to the next instruction, ADD to LR, and the BX LR out of it."

    Wonderful bit of lateral thinking :) Unfortunately LR is like PC - you can't assign to it. It appears you can add a small literal

    add Lr,#n

    but you can't do

    add LR,R2

    Tesla DeLorean
    Guru
    October 24, 2019

    But you can do this

      ADD R2, LR

      BX R2

    What's the target here? A CM0(+)?

    PMath.4 (Author)
    Senior III
    October 25, 2019

    IT WORKS!!

    This code implements what in old PDP11 terminology was called a transfer vector. It allows a variable number of bytes to be copied without any loop or loop testing.

    memcpy() seems to be appallingly slow, and a list of C pointer copies *p++=*q++; is very much faster but can't deal with the situation where the number of elements to copy is variable. This assembler code can. Thanks to Jan and Clive for the help. Tested on STM32H743.

    /*
    Routine to copy bytes without any loop
    R0 - source address
    R1 - destination address
    R2 - number of bytes to copy (not range-checked)
    example code allows 0-11 bytes to be copied, one LDRB/STRB pair per byte
    versions can be built to copy shorts, words etc. all without looping
    */
    mycpy:      PUSH {R4-R5,lr}
                bl next             @ sets LR to the address of next
    next:       mov r4,#104         @ length of the jump if zero bytes to copy
                mov r5,r2,lsl #3    @ each copy takes two words - 8 bytes
                sub r4,r5           @ shorten the jump by 8 bytes per byte to copy
                add r4,lr           @ add in the current return address
                bx r4               @ jump to the computed address
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			LDRb r5,[r0],#1
    			strb	r5,[r1],#1
    			POP {R4-R5,lr}
    			mov r0,#0
    			bx lr
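
The constant 104 in the listing above depends on instruction sizes. Below is a small C model of that arithmetic, assuming the Thumb-2 encodings a unified-syntax assembler would normally pick here: three 4-byte and two 2-byte instructions between `next` and the first LDRB, then 8 bytes per LDRB/STRB pair. The names are hypothetical, and the sizes are an assumption that should be confirmed against a listing or disassembly of the real build.

```c
#include <assert.h>

enum {
    SETUP_BYTES = 4 + 4 + 4 + 2 + 2, /* mov, mov/lsl, sub (32-bit); add, bx (16-bit) */
    PAIR_BYTES  = 4 + 4,             /* post-indexed LDRB + STRB, 32-bit each */
    PAIR_COUNT  = 11                 /* LDRB/STRB pairs in the listing */
};

/* Byte offset from the label `next` to the instruction where
 * execution resumes when n bytes are to be copied. */
static int jump_offset(int n)
{
    assert(n >= 0 && n <= PAIR_COUNT);
    return SETUP_BYTES + PAIR_BYTES * (PAIR_COUNT - n);
}
```

With these sizes, `jump_offset(0)` is 104 (straight to the POP) and each extra byte moves the entry point 8 bytes earlier. Changing any instruction before the first LDRB, or its encoding width, changes the constant.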
     

    waclawek.jan
    Super User
    October 25, 2019

    Couldn't shifting the count by 1 and IT be used to break this down to a byte-halfword-word-two_words read, up to 15 bytes?

    JW
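
One reading of this suggestion, sketched in C rather than with IT blocks: each bit of the count selects one fixed-size transfer, so 0 to 15 bytes need at most four conditional copies. `copy_upto15` is a hypothetical name, and the fixed-size `memcpy` calls stand in for the byte, halfword, word and two-word load/store pairs, on the assumption that the compiler inlines them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n (0..15) bytes using the bits of n.
 * Bit 3 selects an 8-byte move, bit 2 a 4-byte move, bit 1 a
 * 2-byte move and bit 0 a single byte. */
static void copy_upto15(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n <= 15);
    if (n & 8) { memcpy(dst, src, 8); dst += 8; src += 8; }
    if (n & 4) { memcpy(dst, src, 4); dst += 4; src += 4; }
    if (n & 2) { memcpy(dst, src, 2); dst += 2; src += 2; }
    if (n & 1) { *dst = *src; }
}
```

Going through `memcpy` keeps the sketch valid C for any source and destination alignment; whether the generated code is actually faster than bytewise copying on a given core would have to be measured.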

    PMath.4 (Author)
    Senior III
    October 25, 2019

    "Couldn't shifting the count by 1 and IT be used to break this down to a byte-halfword-word-two_words read, up to 15 bytes?"

    Yes, there are lots of games to play now that the concept is working. This version deals with bytes and doesn't care about word boundaries in either the source or destination. By adding extra testing etc. you can transfer bytes up to the word boundary, then words up to the last complete word, then bytes again for the remainder. The challenge is to optimise very short transfers, which are most typical in my application, but at the same time do long ones efficiently as well.
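
The bytes, then words, then bytes scheme described above could look roughly like this in C. It is a sketch under assumptions: `copy_head_words_tail` is a hypothetical name, the head is aligned on the destination only, and 4-byte `memcpy` calls are used for the word moves so the code stays well defined when the source ends up unaligned. It uses plain loops for clarity rather than the loop-free jump technique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte copies up to a word boundary of dst,
 * whole words in the middle, byte copies for the remainder. */
static void copy_head_words_tail(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Head: bytes until dst is 4-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Body: one word at a time via a temporary. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);
        memcpy(dst, &w, 4);
        dst += 4;
        src += 4;
        n -= 4;
    }
    /* Tail: the last 0..3 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

For the very short transfers mentioned above, the head and tail dominate, so the extra alignment tests may cost more than they save; that trade-off is the part that needs measuring.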

    waclawek.jan
    Super User
    October 25, 2019

    > doesn't care about word boundaries

    AFAIK CM7 supports unaligned accesses. They are surely not as efficient as aligned accesses, but they might still be more efficient than bytewise access.

    > The challenge is to optimise very short transfers which are most typical in my application but at the same time do long ones efficiently as well.

    This is what memcpy() does, AFAIK. I doubt you can easily win over a "factory optimized" memcpy() (and I believe ARM does contribute to gcc and kin in this regard) in the general case (both short and long, to any source/destination), so your chance is to go for special cases where memcpy() may have weaknesses. Short transfers may be one of them; transfers to/from particular areas may be another.

    JW

    PMath.4 (Author)
    Senior III
    October 25, 2019

    The memcpy implementation really does seem to be poor, at least on the H7. I saw a >2x performance increase using a string of *p++=*q++ vs memcpy when I knew the elements were word aligned.

    Tesla DeLorean
    Guru
    October 25, 2019

    The CM7 still faults on unaligned LDRD/STRD

    Keil apparently has a better memcpy() than GCC, and inlines short ones. Where the alignment of source and destination permits, it uses load/store multiple. Similar theme with memset(), etc.
