Use of "volatile" keyword

MSull.1 · ‎2021-09-16

Hi, I'm working on an STM32H7 project with very fast (800 KHz) periodic interrupt. I've been looking at the generated code as I tweak the C source. I'm confused about the use of "volatile" for the global variables used by the ISR.

I feel like I need to specify volatile on the variables that the ISR uses so the foreground code doesn't assume that the value won't change between instructions and knows that results need to be stored timely. On the other hand, marking a variable volatile causes the generated code in the ISR to have a lot of unnecessary loads from memory. For example, the C sequence

if (myvar != 0)
{
  myvar -= 1;
  if (myvar == 0)
  {
    ...
  }
}

generates three loads of myvar to a register when myvar is volatile. Without volatile, it gets loaded into a register one time.

I kind of suspect the answer is not to worry about the foreground and just not specify volatile at all. But I'm wondering if there is a way to get the compiler to ensure these variables get treated as volatile in the foreground but not generate all of this redundant load/store activity in the ISR.

Piranha · ‎2021-09-26

The compiler barrier does not generate any instruction or interact with the hardware at runtime in any way. The only thing it does, is limiting the compiler to not move any loads and stores beyond this barrier at compile time. Take a note that this includes also the non-volatile storage type variables! This issue is explained very nicely in the following article's problem No. 5:

https://blog.regehr.org/archives/28

Also take a note that the CMSIS defined __DMB() and similar macros include both - compiler barrier and the respective CPU instruction. The idea is that it does not make much of a sense to inline a DMB instruction an still allow compiler to move other loads and stores beyond it.

So the practical advice is:

__COMPILER_BARRIER() - impacts compile time only. Use it to enforce the program order of non-volatile storage type accesses relative to volatile storage type ones.
__DMB() - impacts compile time and runtime. Use it to enforce the execution order of normal memory type accesses relative to device and strongly ordered memory type ones.

MSull.1 · ‎2021-09-26

I see what you're saying but I don't really have a locking problem.

I am using volatile on a variable (rate) because it is written by the foreground and read by the ISR. Neither is a read/modify/write. The foreground writes the variable and the ISR reads it. I do need the foreground to be atomic, but it's a 32-bit integer variable so the write is going to be atomic anyway. The ISR runs at the highest preemptive priority so it's essentially atomic as well.

The alleged ordering problem arose with a different variable (accum) that is being modified in the ISR. This variable is never accessed by foreground. I shouldn't ever have made it volatile. Without thinking, I just made it volatile since it is accessed by the ISR but, at the time, didn't recognize that it is never accessed by the foreground. In this case, I still think my problem was re-ordering in the CPU pipeline. Maybe the code will make more sense.

The point of the ISR is to do something periodically. The foreground controls the period by changing the variable rate. The variable accum is negative most of the time and advances toward zero by rate at every invocation of the ISR. The rate at which accum "overflows" to a non-negative is therefore rate per second.

volatile int rate;
int accum;
 
void ISR()
{
	accum += rate;
	if (rate >= 0)
	{
		accum -= INTERRUPT_FREQUENCY;
		// DO SOMETHING
	}
}

gcc generates (comments added)

ISR:
	ldr	r3, .L103
	ldr	r2, [r3]	// load rate
	ldr	r1, [r3, #4]	// load accum
	adds	r2, r2, r1
	bmi	.L102
	ldr	r0, .L103+4
	add	r0, r0, r2
	// DO SOMETHING
	str	r0, [r3, #4]
	bx	lr
.L102:
	str	r2, [r3, #4]
	bx	lr
.L103:
	.word	.LANCHOR3	// address of the DTCM
	.word	-800000	// negative of interrupt frequency

The problem was that the actual frequency produced was too high. That is, DO SOMETHING happened more than rate times per second. This implied that the IF expression was evaluating as true when it should not be. I thought maybe the result of the subtraction wasn't getting written back to memory before the next interrupt came along. But that implies that the write is being deferred by over a microsecond (on a 450 MHz processor) which seems unlikely, especially since the variables are in the DTCM.

What solved the problem was adding a DMB at the beginning of the ISR. That is, between line 1 and 2 of the above assembly listing

Piranha · ‎2021-09-26

If the variable accum is used only in the ISR, make it static and move into that ISR. For compiled code it will change nothing, but it will limit it's accessibility scope to the ISR.

The presented C code example has a mistake - line 7 should test the variable accum instead of rate.

Other than that I don't see any problems related to compiler or hardware memory types. Even if the store has not been completed, the load from the same address will still read the correct value. Compiler and memory barriers are not necessary here and are not a correct and reliable solution. In this case most likely the DMB instruction just increased the ISR execution time, but the real problem is late interrupt source clear. The correct and reliable solution is to always test the specific interrupt status flag at peripheral and clear it as soon as possible.

MSull.1 · ‎2021-09-26

Well, Piranha, you caught me ;) In an (apparently misguided) effort to simplify my posts and save future readers from wading through unrelated code, I keep writing analogous but simplified code. Then you or Pavel spot the missing parts! This is what it actually looks like:

__attribute__ ((section("TCM_Data")))
struct Drive_t 
{
	s32 Accum;
	volatile s32 Rate; // in steps per second
	u32 toggler; // not volatile because only accessed by the ISR
	volatile s32 StepsToGo; // if non-zero, count down and zero Drive_Rate when zero is reached
} Drive;
 
__attribute__ ((section("TCM_Code"),optimize(3)))
void LPTIM4_IRQHandler()
{
	//LED_RED_GPIO_Port->BSRR = LED_RED_Pin;
	LPTIM4->ICR = LPTIM_FLAG_ARRM;
 
    asm("dmb"); 
	if ((Drive.Accum += Drive.Rate) >= 0)
	{
	  Drive.Accum -= DENOMINATOR;
	  DRV_INT_STEP_GPIO_Port->BSRR = (Drive.toggler ^= (SET_STEP_PIN | CLR_STEP_PIN));
	  {
		  register s32 rstepstogo = Drive.StepsToGo;
		  if (rstepstogo)
		  {
			  if ((rstepstogo -= 1) == 0)
			  {
				  Drive.Rate = 0;
			  }
			  Drive.StepsToGo = rstepstogo;
		  }
	  }
    }
    //LED_RED_GPIO_Port->BSRR = LED_RED_Pin << 16;
}

Before you question it, let me explain that toggler is initialized elsewhere as SET_STEP_PIN so the result of the exclusive-or is to make toggler alternate between SET_STEP_PIN and CLEAR_STEP_PIN. The net effect is that the GPIO pin toggles at the specified rate.

As you can probably guess, uncommenting the writes to LED_GPIO_PORT/PIN lets me examine the timing with an oscilloscope. This pulse takes 120 nS when accum overflows and 80nS when it doesn't. Timing the function this way doesn't include exception entry and exit, but that's about a 10% CPU load. And I do have the FPU set to lazy stacking.

So, yes, I am clearing the interrupt flag as soon as possible. It's the first thing on entry. In the generated code, the write to the SFR occurs on the third instruction in the SFR. The address of LPTIM4_IRQHandler is in the vector table, there's no other indirection. Unless I want to dedicate a global register to the address of LPTIM4, it's hard to see how this could be any faster. (Since the whole bare-metal system is in "C", I actually could use r9 for that purpose but it would be a hair-tearing regression bug waiting to happen.)

	.section	TCM_Code,"ax",%progbits
	.align	1
	.p2align 2,,3
	.global	LPTIM4_IRQHandler
	.syntax unified
	.thumb
	.thumb_func
	.fpu fpv5-d16
	.type	LPTIM4_IRQHandler, %function
LPTIM4_IRQHandler:
.LFB160:
	.loc 1 420 1 is_stmt 1 view -0
	.cfi_startproc
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	.loc 1 422 2 view .LVU1023
	.loc 1 422 72 is_stmt 0 view .LVU1024
	ldr	r3, .L108
	mov	r2, #-1
	str	r2, [r3, #4]
	.loc 1 424 5 is_stmt 1 view .LVU1025
	.syntax unified
@ 424 "../src/Drive.c" 1
	dmb
 
...
 
.L108:
	.word	1476406272	// base address of LPTIM4 SFRs

( See how much cleaner and more readable it was when I simplified it before? )

At this point, I don't actually understand what's wrong but the adding and removing the DMB with no other changes unquestionably turns the problem off and on. Without the DMB, the I/O pin toggles about 20% or so too fast.

I thought of declaring accum (and toggler) in function-local static but pairing it up with rate sets up the compiler to access both of them with a single load of an address to base register r3. Rate can't be local static because the foreground has to be able to change it. Given that there's nothing else in the DTCM, accum and rate would most likely still be close enough for an [r3,#n] access if I made accum a local static, but it would be at the discretion of the linker and someday, if other allocations are added to the DTCM, the timing could change subtly. I've also never put a section attribute on a local static but I don't see any reason one couldn't. Yes, it bothers me that accum and toggler are exported globally but saving a couple of instructions in an 800 KHz ISR makes a material difference.

	ldr	r3, .L103
	ldr	r2, [r3]	// load rate
	ldr	r1, [r3, #4]	// load accum

The construction if ((Drive.Accum += Drive.Rate) >= 0) isn't exactly reader-friendly code but emphasizes that we are re-using the result of the addition without generating another load of Accum. In the generated code, the flags from the add instruction are immediately used by a bmi instruction.

In hindsight, I probably should have simply written the ISR in assembler since I did so much source tweaking to get the compiler to give me the desired result. And I could be exposing myself to some late nights if a future version of the compiler doesn't generate exactly the same assembly.

Piranha · ‎2021-09-27

OK, but apart from that one mistake, your minimized version was effectively the same. :) In this case rewriting the same logic in assembly will not solve the problem. As I said previously, you have to test the interrupt status flag. Just do it like this:

void LPTIM4_IRQHandler(void)
{
	if (LPTIM4->ISR & LPTIM_ISR_ARRM) {
		LPTIM4->ICR = LPTIM_ICR_ARRMCF;
		
		// The current ISR code
	}
}

By the way, do not use asm("dmb"), use __DMB() instead. I explained it in a reply to Pavel. Though in this case you don't need any of those. Also LPTIM_FLAG_ARRM is a HAL library definition, not a CMSIS register definition, but it should be effectively the same.

MSull.1 · ‎2021-09-27

What I really meant was it would have been less effort on my part to write it in assembly instead of cajoling gcc into doing what I wanted. The additional advantage might be that the assembly wouldn't change if rebuilt with some future version of the compiler.

You're right. I instrumented again and the ISR is double hitting. It's not clear to my why I didn't see that before with the oscilloscope but, anyway, I see it now. Adding the test of LPTIM_ISR_ARRM solves the problem. Makes me wonder why DMB solved the problem. Probably because it introduced just enough additional delay.

MSull.1 · ‎2021-09-27

On reflection, I realize that I've seen this same problem before with a UART ISR.