Odd behavior in assembly busy wait cycle

PB1
Associate III

I am preparing the labs for the undergraduate embedded systems course at the university where I work (the University of Milano-Bicocca in Italy). This year we will change the hardware and start using the STM32 Nucleo F767ZI boards. I am a total beginner, so please bear with me. I am preparing the usual "blinky" example, so I started STM32CubeIDE, created a fresh project, did not touch the default pinout configuration, and proudly wrote my first code for a Nucleo board ever:

/* main.c: */
...
int main(void)
{
    ...
    while (1)
    {
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
        PAUSE_CIRCA_ONE_SECOND;
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
        PAUSE_CIRCA_ONE_SECOND;
    }
}
...

To spice things up a bit, I wrote the pause code in ARM inline assembly:

#define PAUSE_CIRCA_ONE_SECOND \
	asm volatile("mov r0, #0x4400000\n" \
				 "sub r0, r0, #1\n" \
				 "cmp r0, #0\n" \
				 "bne .-6" : : : "cc", "r0");

where I considered the 216 MHz clock and guessed that the CPU would execute approximately one instruction per clock cycle - hence 0x4400000. It turns out that the resulting behavior is almost what I would expect: the LEDs stay on for about one second, which is great, but they stay off much longer, that is, for about 12 seconds. The interesting thing is that if I unroll the loop once:

while (1)
{
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
    PAUSE_CIRCA_ONE_SECOND;
}

the LEDs always stay on for about one second, and stay off for about 12 seconds on the odd cycles (counting from 1) and for about 1 second on the even cycles. This behavior remains the same if I unroll the loop more than once. I would like to understand the reason for such strange behavior.

I did some small investigations to exclude the most trivial causes. I don't think there is any difference between the many instances of the PAUSE_CIRCA_ONE_SECOND code: I disassembled main() and compared them without finding any difference. Also, since I used PC-relative offsets, I cross-checked the branch targets in the disassembled code to be sure that one instance does not jump into another. The slow part is not the call to HAL_GPIO_WritePin: I debugged the code, and setting/resetting the GPIO pins always takes a negligible amount of time compared to PAUSE_CIRCA_ONE_SECOND.

Any explanation, any suggestion on how I can further investigate the cause, or any fix to the above code is welcome.

Thank you in advance

Pietro Braione

1 ACCEPTED SOLUTION

Nice!

  • get rid of as much of Cube/HAL as possible (as part of the investigation, but also, in a school setting, you IMO should avoid it) - mainly the regular interrupts such as SysTick
  • disconnect the debugger
  • switch off the caches
  • run out of TCM memory, and run at low speeds where the FLASH can run without wait states
  • move the code up/down, e.g. by adding NOPs, and observe the alignment of the jump target to 2^N boundaries
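For the "switch off the caches" step, a minimal sketch using the standard CMSIS Cortex-M7 cache functions (assuming the Cube-generated device header is available; call this early in main(), before the blink loop):

```c
/* Sketch only: requires the STM32F7 CMSIS/device headers. */
#include "stm32f7xx.h"

static void disable_caches(void)
{
    SCB_DisableICache();   /* turn off the L1 instruction cache */
    SCB_DisableDCache();   /* turn off the L1 data cache */
}
```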

I'm not sure it's the best idea to start with a superscalar processor with speculative prefetch, a bunch of caches and a complex bus matrix. Losing control is almost inevitable and hard to analyze, and you've been quite rightly hit by it in your very first attempt. I definitely would not like to have this as the main specimen for an undergraduate MCU course - except as an example of how far modern "embedded" has departed from the notion of "control".

Yes, I'm old fashioned and grumpy.

JW


13 REPLIES
n2wx
Senior

If your part has instruction and data caches what happens if you turn them off and go back to the pre-unrolled version?


RMcCa
Senior II

Dunno, but I'm glad I'm not a student at your uni.

Are you sure you've configured it to actually run at 216 MHz? Perhaps it is running at 16 MHz out of reset on the HSI.

berendi
Principal

It might be an alignment issue, or the branch predictor may be confused, perhaps by the conditional branch being followed by an unconditional one.

What happens if you move the last delay to the top of the loop?
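A sketch of that rearrangement, assuming the same macro and HAL calls as in the original post (hardware-dependent, shown for illustration only):

```c
while (1)
{
    PAUSE_CIRCA_ONE_SECOND;   /* delay moved from the bottom to the top */
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_SET);
    PAUSE_CIRCA_ONE_SECOND;
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14, GPIO_PIN_RESET);
}
```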

In today's lesson we have seen why one should never use CPU loops as a delay. Next week we will learn how to set up a hardware timer to generate proper timings.
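As a preview of that approach, the simplest HAL-based version replaces the busy-wait macro with HAL_Delay(), which counts SysTick interrupts in milliseconds (a sketch, assuming the usual Cube-generated initialization):

```c
while (1)
{
    HAL_GPIO_TogglePin(GPIOB, GPIO_PIN_0 | GPIO_PIN_7 | GPIO_PIN_14);
    HAL_Delay(1000);   /* SysTick-based one-second delay */
}
```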

I did not touch the clock speed settings, and I reckon that if it were running slower, the up-cycle would not last one second either. But it is a possibility.

PB1
Associate III

LOL, sometimes I am also glad I am not.

PB1
Associate III

Thanks to everyone for your feedback. Of course the aim of this experiment is to say "never do this kind of thing, use the fine timers instead". In my not-so-short teaching experience I have met the student who swapped integer variables with XOR to save 32 bits of temporary storage - during a Programming 101 course, on a machine with a gigabyte of memory available, and after being told to write readable, rather than fast or tight, code. I therefore fully expect that some students will come to the conclusion that using a current-day ARM M7 as if it were an 8051 is a good idea, and I want to prove them wrong as soon as possible, before they hit the wall by themselves.

This said, I did some more experiments, disabling the L1 instruction cache and fiddling with the size of the code (e.g., using a toggle and a single wait block in the loop), and I came to the conclusion that caching is most likely the cause. That's enough for me. Thanks to everyone for your suggestions and comments. And about the present-day trend of using highly nondeterministic superscalar/pipelined/cached processors in embedded - and possibly safety-critical - settings: maybe I too am too old fashioned and grumpy, so I will shut up.

> the aim of this experiment is to say "never do this kind of things, use the fine timers instead"

Never say never.

Loopdelay blinky is THE cornerstone when bringing up a new system.

> one should never use cpu loops as a delay

Never say never again...

Loop delays are still useful in the - admittedly rare, but existing - cases where short delays on the order of a few machine cycles are needed.

I'd say, "never" is good to be said in the first course (101 in the anglosaxon educational world), but later the refined truth should be taught, IMO.

JW