Odd behavior in assembly busy wait cycle

PB1 · 2020-02-07

I am preparing the labs for the undergraduate embedded systems course at the university where I work (the University of Milano-Bicocca in Italy). This year we are changing the hardware and will start using the STM32 Nucleo F767ZI boards. I am a total beginner, so please bear with me. I am preparing the usual "blinky" example, so I started STM32CubeIDE, created a fresh project, left the default pinout configuration untouched, and proudly wrote my first code ever for a Nucleo board:

/* main.c: */
...
int main(void)
{
    ...
    while (1)
    {
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
        PAUSE_CIRCA_ONE_SECOND;
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
        PAUSE_CIRCA_ONE_SECOND;
    }
}
...

To spice things up a bit, I wrote the pause code in ARM inline assembly:

#define PAUSE_CIRCA_ONE_SECOND \
	asm volatile("mov r0, #0x4400000\n" \
				 "sub r0, r0, #1\n" \
				 "cmp r0, #0\n" \
				 "bne .-6" : : : "cc", "r0");

where I assumed a 216 MHz clock and guessed that the CPU would execute approximately one instruction per clock cycle, hence 0x4400000 iterations of the three-instruction loop. The resulting behavior is almost what I would expect: the LEDs stay on for about one second, which is great, but they stay off much longer, for about 12 seconds. The interesting thing is what happens if I unroll the loop once:

while (1)
{
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
    PAUSE_CIRCA_ONE_SECOND;
}

the LEDs always stay on for about one second, but stay off for about 12 seconds on the odd cycles (counting from 1) and for about 1 second on the even cycles. This behavior remains the same if I unroll the loop more than once. I would like to understand the reason for such strange behavior.
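
As a rough sanity check of the numbers above, the following sketch works out the loop count for a one-second delay and the average cost per iteration implied by the observed delays. It assumes the core really runs at 216 MHz and that the loop body is three instructions at about one cycle each, as stated in the post. The helper names are made up for illustration, and the 1 s and 12 s figures are eyeballed, not measured.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed core clock: 216 MHz, as stated in the post (not measured). */
#define CORE_CLOCK_HZ 216000000ULL

/* Number of loop iterations: the constant loaded into r0 by the macro. */
#define LOOP_ITERATIONS 0x4400000ULL

/* Hypothetical helper: loop count needed for a delay of `seconds`,
 * if every iteration costs `cycles_per_iteration` core cycles. */
static uint64_t iterations_for_delay(uint64_t seconds,
                                     uint64_t cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);
    return seconds * CORE_CLOCK_HZ / cycles_per_iteration;
}

/* Hypothetical helper: average cycles per iteration implied by an
 * observed delay, multiplied by 100 to stay in integer arithmetic. */
static uint64_t implied_cycles_x100(uint64_t observed_seconds)
{
    return observed_seconds * CORE_CLOCK_HZ * 100ULL / LOOP_ITERATIONS;
}
```

Under those assumptions, iterations_for_delay(1, 3) gives 72000000 (0x44AA200), close to the 0x4400000 used in the macro, while implied_cycles_x100(1) and implied_cycles_x100(12) give 302 and 3635, that is, roughly 3 cycles per iteration in the fast case and about 36 in the slow one.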

I did some quick investigation to rule out the most trivial causes. I don't think there is any difference between the instances of the PAUSE_CIRCA_ONE_SECOND code: I disassembled main() and compared them without finding any. Also, since I used a PC-relative offset, I cross-checked the branch target in the disassembled code to be sure that one instance does not jump into another. The slow part is not the call to HAL_GPIO_WritePin: I debugged the code, and setting/resetting the GPIO pins always takes a negligible amount of time compared to PAUSE_CIRCA_ONE_SECOND.

Any explanation, any suggestion on how I can further investigate the causes, or any fix to the above code, is welcome.

Thank you in advance

Pietro Braione

waclawek.jan · 2020-02-08

> I came to the conclusion that caching is most likely the cause.

I don't think so; I believe it's the speculative prefetch/branch prediction/branch address cache, or maybe even the dual issue.

Unfortunately, according to PM0253, branch prediction probably can't be switched off, and it does not mention BTAC at all, so that is probably not implemented in the ST incarnations of Cortex-M7. Contrary to the Cortex-M7 TRM, PM0253 also lists most of the "tweaking" bits of the SCB_ACTLR register as reserved, so these are probably not implemented by ST either. But, again contrary to the TRM, there appears to be a simple way to switch off the instruction cache: clearing SCB_CCR.IC. That's easy to do, and it would be nice if you'd try it.

JW
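
The experiment suggested above amounts to clearing one bit. Here is a minimal sketch, assuming the Cortex-M7 layout in which the instruction cache enable bit IC is bit 17 of SCB->CCR. The helper names and the register-pointer parameter are hypothetical, chosen so the logic can be exercised off-target. On the board itself, the CMSIS function SCB_DisableICache() is probably the better choice, since it also issues the barriers and invalidates the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the instruction cache enable bit in SCB->CCR
 * on Cortex-M7 (same value as the CMSIS SCB_CCR_IC_Pos constant). */
#define CCR_IC_BIT (UINT32_C(1) << 17)

/* Hypothetical helper: clear the IC bit in the register pointed to by
 * `ccr`, leaving every other bit untouched. No barriers and no cache
 * invalidation are performed here. */
static void clear_icache_enable(volatile uint32_t *ccr)
{
    assert(ccr != NULL);
    *ccr &= ~CCR_IC_BIT;
}

/* Hypothetical helper: report whether the IC bit is set in a CCR value. */
static int icache_enabled(uint32_t ccr_value)
{
    return (ccr_value & CCR_IC_BIT) != 0;
}
```

On target this would be called as clear_icache_enable(&SCB->CCR) before the main loop, assuming the CMSIS device header is included.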

n2wx · 2020-02-08

I think they still have their place during power-up, before everything is brought up, but not with HAL and maybe not with the opaque predictive caching this puppy may or may not do. It's frightening that the reference manual doesn't expose a way to turn off branch predictive caching, if such a thing is implemented.

berendi · 2020-02-08

If there were a way to turn off branch prediction, it would belong in the programming manual instead. But there is no way.

Piranha · 2020-02-08

> about the present day trend of using highly nondeterministic superscalar/pipelined/cached processors in embedded - and possibly safety-critical - settings

If a device's ability to function correctly depends on exact CPU clock ticks in a complex MCU with multiple peripherals on multiple buses running simultaneously, then it is not a safety-critical device and IMHO not a reliable device at all. That is true not only for the Cortex-M7 but also for other, more deterministic Cortex-M CPUs, because buses, flash memory accesses, interrupts, interrupt nesting and other features also introduce non-determinism.