Senior III

Question

Arm assembler - branch by a programmable amount

October 24, 2019
10 replies
2488 views

I want to manipulate the program counter in an assembler function by branching down a list of instructions by a programmable amount

.syntax unified
.global myasm
myasm: PUSH {R0,R1,r2,r4,R5}
ADD R15,#4

is rejected by the assembler

.syntax unified
.global myasm
myasm: PUSH {R0,R1,r2,r4,R5}
mov r5,r2,lsl #2
ADD R15,r5

is accepted but isn't doing what I want. I assume that R15 increments by 4 for each instruction so I'm making sure that I'm always adding multiples of 4.

This is a technique I've used on other processors from the PDP11 onwards, but the ARM crashes as soon as the ADD instruction executes.

Any help appreciated

This topic has been closed for replies.

waclawek.jan

Super User

> I assume that R15 increments by 4 for each instruction

No. Cortex-M don't execute ARM instructions.

Cortex-M cores execute the Thumb-2 instruction set. It has a 16-bit instruction word, but some instructions are one word long and some are two (i.e. some are 2 bytes, others 4 bytes); this differs between Cortex-M0/M0+ and Cortex-M3/M4/M7.
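To make the mixed 2-byte/4-byte point concrete, here is a minimal C sketch of how a decoder can tell the two apart. It relies on the Thumb encoding rule that a halfword whose bits [15:11] are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction. The function name thumb_insn_size is hypothetical and only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: size in bytes (2 or 4) of the Thumb instruction
 * whose first halfword is hw1. */
unsigned thumb_insn_size(uint16_t hw1)
{
    unsigned top5 = hw1 >> 11;   /* bits [15:11] of the first halfword */

    /* 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit
     * instruction; every other value is a complete 16-bit instruction. */
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}
```

For example, BX LR (0x4770) is a 2-byte instruction, while the first halfword of a BL (0xF000 with a zero offset) announces a 4-byte one, so adding a fixed 4 per instruction to the PC cannot be relied on.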

Thumb also severely restricts what you can do with PC/R15. Your basic option is to use BX, but you might be pleasantly surprised by the TBB/TBH instructions.

Read the ARMv7-M Architecture Reference Manual (unless you intend to use Cortex-M0/M0+, in which case it's ARMv6-M).

JW

PMath.4Author

Senior III

Jan

Thanks for the pointer. TBH can be made to do what I want, albeit in a clunky way compared to just directly adding an offset to the program counter. I've got my code working in a test program; however, in the main program it causes problems. I suspect that despite pushing and popping all relevant registers the assembler code is conflicting with pipelining, caching or something similar.

Tesla DeLorean

Guru

Perhaps you can BL to the next instruction, ADD to LR, and then BX LR out of it.

Tips, Buy me a coffee, or three.. PayPal Venmo (See Profile) Up vote any posts that you find helpful, it shows what's working..

PMath.4Author

Senior III

"Perhaps you can BL to the next instruction, ADD to LR, and the BX LR out of it."

Wonderful bit of lateral thinking :) Unfortunately LR is like PC - you can't assign to it. It appears you can add a small literal

add Lr,#n

but you can't do

add LR,R2

Tesla DeLorean

Guru

But you can do this

ADD R2, LR

BX R2

What's the target here? A CM0(+)?


PMath.4Author

Senior III

IT WORKS!!

This code implements what in old PDP11 terminology was called a transfer vector. It allows a variable number of bytes to be copied without any loop or loop testing.

memcpy() seems to be appallingly slow, and a list of C pointer copies *p++=*q++; is very much faster but can't deal with the situation where the number of elements to copy is variable. This assembler code can. Thanks to Jan and Clive for the help. Tested on STM32H743.

/*
Routine to copy bytes without any loop
R0 - source address
R1 - destination address
R2 - number of bytes to copy
example code allows 0-10 bytes to be copied
versions can be built to copy shorts, words etc all without looping
*/
mycpy:  PUSH {R4-R5,LR}
        BL next                 @ LR = address of next (Thumb bit set)
next:   MOV R4,#104             @ byte distance from next to the POP: length of the jump if zero bytes to copy
        MOV R5,R2,LSL #3        @ each copy takes two words - 8 bytes
        SUB R4,R5               @ shorten the jump by 8 bytes per byte to copy
        ADD R4,LR               @ add in the current return address
        BX R4                   @ branch into the copy list at the computed address
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        LDRB R5,[R0],#1
        STRB R5,[R1],#1
        POP {R4-R5,LR}          @ restore the caller's registers and return address
        MOV R0,#0               @ return 0
        BX LR
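For readers more at home in C, the same idea (jump into the middle of an unrolled run of copies, chosen by the count) can be sketched with a switch that falls through, in the style of Duff's device. This is an illustrative analogue, not the routine above: the names entry_offset and copy_upto10 are hypothetical, and the range check on n is an addition that the assembler version does not have.

```c
#include <stddef.h>

/* Byte offset from the label `next` to the entry point for n bytes, using
 * the constants from the listing: 104 for zero bytes, 8 less per byte. */
unsigned entry_offset(unsigned n)
{
    return 104u - 8u * n;
}

/* Copy n (0..10) bytes with no loop and no per-byte test.
 * Returns 0 on success, -1 if n is out of range (assumed policy). */
int copy_upto10(unsigned char *dst, const unsigned char *src, unsigned n)
{
    if (n > 10u)
        return -1;

    switch (n) {               /* every case falls through to the next */
    case 10: *dst++ = *src++;  /* fall through */
    case 9:  *dst++ = *src++;  /* fall through */
    case 8:  *dst++ = *src++;  /* fall through */
    case 7:  *dst++ = *src++;  /* fall through */
    case 6:  *dst++ = *src++;  /* fall through */
    case 5:  *dst++ = *src++;  /* fall through */
    case 4:  *dst++ = *src++;  /* fall through */
    case 3:  *dst++ = *src++;  /* fall through */
    case 2:  *dst++ = *src++;  /* fall through */
    case 1:  *dst++ = *src++;  /* fall through */
    case 0:  break;
    }
    return 0;
}
```

A compiler will typically turn the switch into a table branch or a computed jump, which is close to what the hand-written version does with BL/ADD/BX. The assembler routine's constant 104 only holds while every instruction between next and the POP keeps its current encoding size, which is the fragile part that the C form avoids.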

waclawek.jan

Super User

Couldn't shifting the count by 1 and IT be used to break this down to a byte-halfword-word-two_words read, up to 15 bytes?
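A rough C sketch of that suggestion, assuming the count is at most 15: each bit of the count selects one fixed-size chunk (byte, halfword, word, two words). The function name copy_upto15 is hypothetical. The fixed-size memcpy calls are only a portable way to write an access of that width; compilers often lower them to single loads and stores when optimising, but that is an assumption about the toolchain, not a guarantee.

```c
#include <stddef.h>
#include <string.h>

/* Copy n (0..15) bytes; bit k of n decides whether a 2^k-byte chunk is copied.
 * Returns 0 on success, -1 if n is out of range (assumed policy). */
int copy_upto15(void *dst, const void *src, unsigned n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n > 15u)
        return -1;

    if (n & 1u) { *d++ = *s++; }                    /* one byte   */
    if (n & 2u) { memcpy(d, s, 2); d += 2; s += 2; } /* halfword   */
    if (n & 4u) { memcpy(d, s, 4); d += 4; s += 4; } /* word       */
    if (n & 8u) { memcpy(d, s, 8); }                 /* two words  */
    return 0;
}
```

At most four copies are issued for any count, instead of up to fifteen byte copies, at the price of accesses that may be unaligned.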

JW

PMath.4Author

Senior III

"Couldn't shifting the count by 1 and IT be used to break this down to a byte-halfword-word-two_words read, up to 15 bytes?"

Yes, there are lots of games to play now that the concept is working. This version deals with bytes and doesn't care about word boundaries in either the source or the destination. By adding extra testing etc. you can transfer bytes up to the word boundary, then words up to the last complete word, then bytes again for the remainder. The challenge is to optimise very short transfers, which are most typical in my application, but at the same time do long ones efficiently as well.
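The bytes, then words, then bytes scheme described here could look roughly like the following in C. This is a sketch under assumptions: the name copy_head_words_tail is hypothetical, alignment is judged on the destination only, and it uses plain loops rather than the loop-free jump technique, purely to show the three phases.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes in three phases: head bytes, whole words, tail bytes. */
void copy_head_words_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: single bytes until the destination sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: whole 32-bit words. The source may still be unaligned here;
     * the fixed-size memcpy keeps that well defined in C. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: the remaining 0..3 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

For the very short transfers mentioned above, the head and tail phases dominate, so the extra alignment tests may cost more than they save; that trade-off would need measuring on the actual target.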

waclawek.jan

Super User

> doesn't care about word boundaries

AFAIK CM7 supports unaligned accesses. They are surely not as efficient as aligned accesses, but they might still be more efficient than bytewise access.

> The challenge is to optimise very short transfers which are most typical in my application but at the same time do long ones efficiently as well.

This is what memcpy() does, AFAIK. I doubt you can easily beat a "factory optimized" memcpy() (and I believe ARM does contribute to gcc and kin in this regard) in the general case (both short and long, to any source/destination), so your chance is to go for special cases where memcpy() may have weaknesses. Short transfers may be one of them; transfers to/from particular areas may be another.

JW

PMath.4Author

Senior III

The memcpy implementation really does seem to be poor - at least on the H7. I saw a >2x performance increase using a string of *p++=*q++ vs memcpy when I knew the elements were word aligned

Tesla DeLorean

Guru

The CM7 still faults on unaligned LDRD/STRD

Keil has a better memcpy() apparently than GCC, and inlines short ones. Where the alignment of source and destination permit it uses load/store multiple. Similar theme with memset(), etc
