2019-10-24 04:42 AM
I want to manipulate the program counter in an assembler function, branching down a list of instructions by a programmable amount.
.syntax unified
.global myasm
myasm: PUSH {R0,R1,r2,r4,R5}
ADD R15,#4
is rejected by the assembler
.syntax unified
.global myasm
myasm: PUSH {R0,R1,r2,r4,R5}
mov r5,r2,lsl #2
ADD R15,r5
is accepted but isn't doing what I want. I assume that R15 increments by 4 for each instruction, so I'm making sure that I'm always adding multiples of 4.
This is a technique I've used on other processors from the PDP11 onwards, but the ARM crashes as soon as the ADD instruction executes.
Any help appreciated
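For comparison, the same idea (entering an unrolled list of instructions at a computed point) can be written in C as a switch that falls through, which leaves the branch arithmetic and instruction sizes to the compiler. This is only a minimal sketch, assuming word-aligned buffers and at most 8 words per call; the name copy_upto8_words is hypothetical and not from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n (0..8) 32-bit words by entering an
 * unrolled list of copies at a point chosen by n.
 * Returns the number of words copied, or 0 if n is out of range. */
static size_t copy_upto8_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n > 8) {
        return 0;               /* out of range: copy nothing */
    }
    switch (n) {                /* each case falls through to the next */
    case 8: *dst++ = *src++;    /* fall through */
    case 7: *dst++ = *src++;    /* fall through */
    case 6: *dst++ = *src++;    /* fall through */
    case 5: *dst++ = *src++;    /* fall through */
    case 4: *dst++ = *src++;    /* fall through */
    case 3: *dst++ = *src++;    /* fall through */
    case 2: *dst++ = *src++;    /* fall through */
    case 1: *dst++ = *src++;    /* fall through */
    case 0: break;
    }
    return n;
}
```

Calling copy_upto8_words(dst, src, 3) enters the list at case 3 and executes exactly three copies. Whether the compiler turns the switch into a table branch or a computed jump depends on the compiler and optimisation level.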
2019-10-25 02:22 AM
> doesn't care about word boundaries
AFAIK CM7 supports unaligned accesses. They are surely less efficient than aligned accesses, but they might still be more efficient than bytewise access.
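One portable way to express an unaligned word access in C is a fixed-size memcpy() into a local variable. A compiler is free to reduce that to a single load or store on a core that allows unaligned access, though whether it does depends on the compiler and its flags. A minimal sketch; the helper names load_u32_unaligned and store_u32_unaligned are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any byte address.
 * The fixed-size memcpy avoids dereferencing a misaligned pointer. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write a 32-bit value to any byte address. */
static void store_u32_unaligned(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

A value stored with store_u32_unaligned(buf + 1, x) reads back unchanged with load_u32_unaligned(buf + 1), regardless of the alignment of buf.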
> The challenge is to optimise very short transfers which are most typical in my application but at the same time do long ones efficiently as well.
This is what memcpy() does, AFAIK. I doubt you can easily beat a "factory optimized" memcpy() (and I believe ARM does contribute to gcc and kin in this regard) in the general case (both short and long, to any source/destination), so your chance is to go for special cases where memcpy() may have weaknesses. Short transfers may be one of them; transfers to/from particular areas may be another.
JW
2019-10-25 02:37 AM
The memcpy implementation really does seem to be poor, at least on the H7. I saw a >2x performance increase using a string of *p++=*q++ vs memcpy when I knew the elements were word aligned.
2019-10-25 03:12 AM
The CM7 still faults on unaligned LDRD/STRD.
Keil apparently has a better memcpy() than GCC, and it inlines short ones. Where the alignment of source and destination permits, it uses load/store multiple. Similar theme with memset(), etc.
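The alignment-dependent choice described here can be sketched in C as follows. This is only an illustration of the idea, not how Keil's or GCC's memcpy() is implemented; the name copy_bytes_dispatch is hypothetical, and the word path assumes the buffers may legally be accessed as uint32_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len bytes, using 32-bit word copies when
 * both pointers are 4-byte aligned and byte copies otherwise. */
static void copy_bytes_dispatch(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* OR the addresses together: the low two bits are zero only if
     * both source and destination are word aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        uint32_t *dw = (uint32_t *)dst;
        const uint32_t *sw = (const uint32_t *)src;
        size_t words = len / 4;

        while (words--) {
            *dw++ = *sw++;      /* the *p++ = *q++ word copy */
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
        len &= 3u;              /* 0..3 trailing bytes remain */
    }
    while (len--) {
        *d++ = *s++;            /* byte copy: unaligned case or tail */
    }
}
```

With this shape the compiler can, at its discretion, merge the aligned loop into load/store multiple instructions, while the byte loop never issues a misaligned word access.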