Arm assembler - branch by a programmable amount

PMath.4 · ‎2019-10-24

I want to manipulate the program counter in an assembler function by branching down a list of instructions by a programmable amount

.syntax unified
.global myasm
myasm:   PUSH {R0,R1,r2,r4,R5}
ADD R15,#4

is rejected by the assembler

.syntax unified
.global myasm
myasm:   PUSH {R0,R1,r2,r4,R5}
mov r5,r2,lsl #2
ADD R15,r5

is accepted but isn't doing what I want. I assume that R15 increments by 4 for each instruction so I'm making sure that I'm always adding multiples of 4.

This is a technique I've used on other processors from the PDP11 onwards, but the ARM crashes as soon as the ADD instruction executes.

Any help appreciated

waclawek.jan · ‎2019-10-24

> I assume that R15 increments by 4 for each instruction

No. Cortex-M don't execute ARM instructions.

Cortex-M cores execute the Thumb-2 instruction set. It uses a 16-bit instruction word, but some instructions take one instruction word and others two, i.e. some are 2 bytes long and others 4 bytes (which instructions are available differs between Cortex-M0/M0+ and Cortex-M3/M4/M7).
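To make the mixed instruction sizes concrete, here is a minimal C sketch (not from the thread) of the rule the architecture manual gives for telling the two apart: a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 4-byte instruction, and anything else is a complete 2-byte instruction. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes (2 or 4) of the Thumb instruction whose first halfword is hw.
   Top five bits 0b11101, 0b11110 or 0b11111 mean "first half of a 32-bit
   instruction"; every other pattern is a complete 16-bit instruction. */
static unsigned thumb_insn_size(uint16_t hw)
{
    unsigned top5 = (unsigned)hw >> 11;
    return (top5 >= 0x1Du) ? 4u : 2u;
}

/* Counts the instructions in a stream of n halfwords (hypothetical helper). */
static size_t thumb_count_insns(const uint16_t *hw, size_t n)
{
    size_t i = 0, count = 0;
    while (i < n) {
        i += thumb_insn_size(hw[i]) / 2u;  /* step over 1 or 2 halfwords */
        count++;
    }
    assert(i == n);  /* the stream must not end halfway through an instruction */
    return count;
}
```

Fed the halfwords F000 F800 4472 4710, this counts three instructions in 8 bytes, which is why a PC offset cannot simply be "number of instructions times 4".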

Thumb also severely restricts what you can do with PC/R15, your basic option is to use BX; but you might be pleasantly surprised by the TBB/TBH instructions.

Read the ARMv7-M Architecture Reference Manual (unless you intend to use Cortex-M0/M0+, in which case it's ARMv6-M).

JW

PMath.4 · ‎2019-10-24

Jan

Thanks for the pointer. TBH can be made to do what I want, albeit in a clunky way compared to just directly adding an offset to the program counter. I've got my code working in a test program; however, in the main program it causes problems. I suspect that, despite pushing and popping all relevant registers, the assembler code is conflicting with pipelining, caching or something similar.

Tesla DeLorean · ‎2019-10-24

Perhaps you can BL to the next instruction, ADD to LR, and then BX LR out of it.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

PMath.4 · ‎2019-10-24

"Perhaps you can BL to the next instruction, ADD to LR, and the BX LR out of it."

Wonderful bit of lateral thinking :) Unfortunately LR is like PC - you can't assign to it. It appears you can add a small literal

add Lr,#n

but you can't do

add LR,R2

Tesla DeLorean · ‎2019-10-24

But you can do this

ADD R2, LR

BX R2

What's the target here? A CM0(+)?


Tesla DeLorean · ‎2019-10-24

>>Wonderful bit of lateral thinking =)

I don't even need to think outside the box, I am the box, and the surfaces see in all directions... inside and out

0000001A  F000 F800   BL  .+4
0000001E  4472        ADD R2, LR
00000020  4710        BX  R2

Scaling LR probably isn't going to be helpful


PMath.4 · ‎2019-10-25

IT WORKS!!

This code implements what in old PDP11 terminology was called a transfer vector. It allows a variable number of bytes to be copied without any loop or loop testing.

memcpy() seems to be appallingly slow. A list of C pointer copies *p++=*q++; is very much faster, but can't deal with the situation where the number of elements to copy is variable. This assembler code can. Thanks to Jan and Clive for the help. Tested on STM32H743.
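For comparison, the closest plain-C analogue of a transfer vector is probably a switch whose cases fall through an unrolled list of copies. This is only a sketch: the name copy_upto10 and the limit of 10 are hypothetical, and whether the compiler lowers the switch to a jump table (it may use TBB/TBH on Thumb-2) is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define COPY_MAX 10  /* hypothetical limit, mirroring the 0-10 byte example */

/* Copies n bytes (0..COPY_MAX) with no loop: the switch jumps into the
   unrolled list so that exactly n copies remain before the end. */
static void copy_upto10(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n <= COPY_MAX);
    switch (n) {
    case 10: *dst++ = *src++; /* fall through */
    case 9:  *dst++ = *src++; /* fall through */
    case 8:  *dst++ = *src++; /* fall through */
    case 7:  *dst++ = *src++; /* fall through */
    case 6:  *dst++ = *src++; /* fall through */
    case 5:  *dst++ = *src++; /* fall through */
    case 4:  *dst++ = *src++; /* fall through */
    case 3:  *dst++ = *src++; /* fall through */
    case 2:  *dst++ = *src++; /* fall through */
    case 1:  *dst++ = *src++; /* fall through */
    case 0:  break;
    }
}
```

The entry point is chosen by n and everything after it runs straight through, which is the same shape as the assembler routine.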

/*
Routine to copy bytes without any loop
R0 - source address
R1 - destination address
R2 - number of bytes to copy
example code allows 0-10 bytes to be copied
versions can be built to copy shorts, words etc all without looping
*/
mycpy:  push {r4-r5,lr}
        bl   next
next:   mov  r4,#104         @ length of the jump if zero bytes to copy
        mov  r5,r2,lsl #3    @ each byte copy is two 4-byte instructions - 8 bytes
        sub  r4,r5           @ shorten the jump by 8 bytes per byte to copy
        add  r4,lr           @ add in the current return address (address of next)
        bx   r4              @ branch to the computed address
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        ldrb r5,[r0],#1
        strb r5,[r1],#1
        pop  {r4-r5,lr}
        mov  r0,#0
        bx   lr
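
For anyone adapting this, the constant 104 can be checked with a little arithmetic, modelled here in C. The sizes are my assumption about how the assembler encodes the listing above (mov, lsl and sub without the S suffix as 4-byte instructions, add and bx as 2-byte ones, and each post-incrementing ldrb/strb as 4 bytes); they are not stated in the thread, so check them against your own listing file.

```c
#include <assert.h>

enum {
    PREAMBLE_BYTES = 16, /* assumed: mov + lsl + sub (4 each), add + bx (2 each) */
    PAIR_BYTES     = 8,  /* assumed: ldrb + strb with post-increment, 4 each */
    PAIRS          = 11  /* ldrb/strb pairs present in the listing */
};

/* Byte offset from the label `next` to the instruction that the bx lands on
   when n bytes are to be copied: skip the preamble, then skip every pair
   that is not needed. Equivalent to 104 - 8*n. */
static unsigned jump_offset(unsigned n)
{
    assert(n <= PAIRS);  /* a larger n would land inside the preamble */
    return PREAMBLE_BYTES + PAIR_BYTES * (PAIRS - n);
}
```

With n = 0 the result is 104, which lands on the pop. Because BL leaves bit 0 of LR set (the Thumb bit) and every offset is even, the sum still has bit 0 set, which is what BX needs to stay in Thumb state.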

waclawek.jan · ‎2019-10-25

Couldn't shifting the count by 1 and IT be used to break this down to a byte-halfword-word-two_words read, up to 15 bytes?

JW
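
A rough C sketch of that idea, for illustration only: each bit of the count selects one fixed-size copy, so any length up to 15 needs at most four transfers and no loop. The name copy_upto15 is hypothetical, and the fixed-size memcpy calls stand in for the single byte, halfword, word and two-word accesses; a compiler will often inline them as such, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies n bytes (0..15) by testing the four bits of n in turn. */
static void copy_upto15(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n <= 15);
    if (n & 1) { memcpy(dst, src, 1); dst += 1; src += 1; }  /* byte      */
    if (n & 2) { memcpy(dst, src, 2); dst += 2; src += 2; }  /* halfword  */
    if (n & 4) { memcpy(dst, src, 4); dst += 4; src += 4; }  /* word      */
    if (n & 8) { memcpy(dst, src, 8); }                      /* two words */
}
```

In assembler, each test could become a shift of the count into the carry flag followed by an IT block, as suggested above.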

PMath.4 · ‎2019-10-25

"Couldn't shifting the count by 1 and IT be used to break this down to a byte-halfword-word-two_words read, up to 15 bytes?"

Yes, there are lots of games to play now that the concept is working. This version deals with bytes and doesn't care about word boundaries in either the source or destination. By adding extra testing etc. you can transfer bytes up to the word boundary, then words up to the last complete word, then bytes again for the remainder. The challenge is to optimise very short transfers, which are most typical in my application, while at the same time doing long ones efficiently as well.
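
The three phases described above might be sketched in C like this. It is a simplified illustration, not the assembler version: it uses loops rather than computed jumps, aligns only the destination, and the name copy_head_body_tail is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes: single bytes until dst is word aligned, then whole
   4-byte words, then single bytes for whatever is left over. */
static void copy_head_body_tail(unsigned char *dst, const unsigned char *src,
                                size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* Head: bytes up to the next 4-byte boundary of the destination. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Body: whole words. memcpy through a temporary keeps this valid C
       even when src is not word aligned. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);
        memcpy(dst, &w, 4);
        dst += 4; src += 4; n -= 4;
    }
    /* Tail: the remaining 0..3 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

The head and tail are each at most three bytes, so they are natural candidates for the jump-into-a-list trick, leaving only the word phase to tune for long transfers.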