Skip to main content
MSull.1
Associate II
September 16, 2021
Question

Use of "volatile" keyword

  • September 16, 2021
  • 10 replies
  • 10892 views

Hi, I'm working on an STM32H7 project with very fast (800 KHz) periodic interrupt. I've been looking at the generated code as I tweak the C source. I'm confused about the use of "volatile" for the global variables used by the ISR.

I feel like I need to specify volatile on the variables that the ISR uses so the foreground code doesn't assume that the value won't change between instructions and knows that results need to be stored timely. On the other hand, marking a variable volatile causes the generated code in the ISR to have a lot of unnecessary loads from memory. For example, the C sequence

if (myvar != 0)
{
 myvar -= 1;
 if (myvar == 0)
 {
 ...
 }
}

generates three loads of myvar to a register when myvar is volatile. Without volatile, it gets loaded into a register one time.

I kind of suspect the answer is not to worry about the foreground and just not specify volatile at all. But I'm wondering if there is a way to get the compiler to ensure these variables get treated as volatile in the foreground but not generate all of this redundant load/store activity in the ISR.

This topic has been closed for replies.

10 replies

TDK
September 16, 2021
uint32_t myvar_shadow = myvar;
if (myvar_shadow != 0)
{
 myvar_shadow -= 1;
 myvar = myvar_shadow;
 if (myvar_shadow == 0)
 {
 ...
 }
}

Mark the variable as volatile, and if you're manipulating it, use a different variable to manipulate it, then store it. That other variable will be optimized to a register. You'll be left with a single load and a single store instruction to the memory address.

"If you feel a post has answered your question, please click ""Accept as Solution""."
MSull.1
MSull.1Author
Associate II
September 16, 2021

Doh! That makes perfect sense. Thanks.

alister
Senior III
September 17, 2021

Yes you'll need volatile where you need it. You've studied the assembler so you understand what it's doing and why it's required.

Need to consider atomicity too.

800kHz seems hot. Couldn't use a GPIO input capture or some other peripheral instead?

S.Ma
Principal
September 19, 2021

At cost of portability, if your global variable can be read in 1 memory cycle, you can in certain cases remove volatile, like written by interrupt, read by main loop. Sometime you could also use flags like set by interrupt and data is preserved until data is read by main and flag is cleared.

800kHz interrupt has high probability to fall apart once all interrupt sources and atomic or interrupt disabled code section will cause interrupt overrun. Use the available hw to scale down the interrupt rate.

Another use of volatile is to keep a special code in flash which can be activated in debug mode.

MSull.1
MSull.1Author
Associate II
September 19, 2021

I know this is a pretty fast interrupt but I did get it working well. The ISR is implementing a rate generator so there's not really a hardware alternative although I did consider using an FPGA for this project. I chose the STM32H7 because it's fast enough to do this in an ISR yet less expensive than an SoC. The ISR consumes about 10% of the CPU. Which is a lot, but the rest of the application doesn't require high performance so it works out well. This is a 480 MHz CPU and the rest of the functionality worked fine on a previous implementation in a 70 MHz 16-bit PIC. So I can certainly spare 10%.

There are other interrupts happening for timers, UARTS, and the USB peripheral. By setting the 800 KHz interrupt to a high preemptive priority (0) and the others to the minimum preemptive priority (3), I only have two-deep nesting at any given time and the 800 KHz interrupt never gets deferred.

Some of the variables are only accessed in the ISR, so those aren't marked volatile. For the variables that are accessed in the foreground and the ISR, I marked them volatile and used an explicit local register copy as TDK suggested. It probably doesn't matter but I marked those variables with the "register" storage class. I located all of the variables accessed by the ISR in the DTCM.

I had to use a DMB (memory barrier) to get this to work right. Otherwise, weird stuff happened which I interpreted as being some kind of deferred write not completing by the time the next interrupt occurs. I really don't understand this. I thought with the variables in DTCM would basically be as fast as registers.

So, it's one of those cases where it works well and I need to move on with the rest of the project rather than spending more time trying to get rid of the DMB. I am still curious, though.

Piranha
Principal III
September 25, 2021

The volatile storage type is a restriction for compiler and can force a specific order of instructions in compiled code, but that doesn't mean that the CPU will execute those instructions in that same order. Reordering wasn't the case for simpler cores, but Cortex-M7 is dual issue and actually does it. To solve it, one has to deal with the memory types. DTCM is always treated as a normal memory type, but the order of operations is only guaranteed for device and strongly ordered memory types. And yes, the DMB instruction is the correct solution.

Read more about it in AN4838, AN4839 and here:

https://community.st.com/s/question/0D50X0000C4Nk4GSQS/bug-missing-compiler-and-cpu-memory-barriers

Pavel A.
Super User
September 25, 2021

> And yes, the DMB instruction

Isn't the memory order enforced by "compiler barrier", which is named __COMPILER_BARRIER in CMSIS?

This is defined not as DMB , but as asm volatile("":::"memory") ?

MSull.1
MSull.1Author
Associate II
September 25, 2021

Yes, that makes sense. Reordering would cause the symptom I am seeing. Especially since it doesn't happen every time the ISR runs.

Thanks!

Pavel A.
Super User
September 26, 2021

Then you're not after volatiles. Read on atomics (either C++ or plain C, whatever you prefer)

https://preshing.com/20120913/acquire-and-release-semantics/

MSull.1
MSull.1Author
Associate II
September 26, 2021

I see what you're saying but I don't really have a locking problem.

I am using volatile on a variable (rate) because it is written by the foreground and read by the ISR. Neither is a read/modify/write. The foreground writes the variable and the ISR reads it. I do need the foreground to be atomic, but it's a 32-bit integer variable so the write is going to be atomic anyway. The ISR runs at the highest preemptive priority so it's essentially atomic as well.

The alleged ordering problem arose with a different variable (accum) that is being modified in the ISR. This variable is never accessed by foreground. I shouldn't ever have made it volatile. Without thinking, I just made it volatile since it is accessed by the ISR but, at the time, didn't recognize that it is never accessed by the foreground. In this case, I still think my problem was re-ordering in the CPU pipeline. Maybe the code will make more sense.

The point of the ISR is to do something periodically. The foreground controls the period by changing the variable rate. The variable accum is negative most of the time and advances toward zero by rate at every invocation of the ISR. The rate at which accum "overflows" to a non-negative is therefore rate per second.

volatile int rate;
int accum;
 
void ISR()
{
	accum += rate;
	if (rate >= 0)
	{
		accum -= INTERRUPT_FREQUENCY;
		// DO SOMETHING
	}
}

gcc generates (comments added)

ISR:
	ldr	r3, .L103
	ldr	r2, [r3]	// load rate
	ldr	r1, [r3, #4]	// load accum
	adds	r2, r2, r1
	bmi	.L102
	ldr	r0, .L103+4
	add	r0, r0, r2
	// DO SOMETHING
	str	r0, [r3, #4]
	bx	lr
.L102:
	str	r2, [r3, #4]
	bx	lr
.L103:
	.word	.LANCHOR3	// address of the DTCM
	.word	-800000	// negative of interrupt frequency

The problem was that the actual frequency produced was too high. That is, DO SOMETHING happened more than rate times per second. This implied that the IF expression was evaluating as true when it should not be. I thought maybe the result of the subtraction wasn't getting written back to memory before the next interrupt came along. But that implies that the write is being deferred by over a microsecond (on a 450 MHz processor) which seems unlikely, especially since the variables are in the DTCM.

What solved the problem was adding a DMB at the beginning of the ISR. That is, between line 1 and 2 of the above assembly listing

Piranha
Principal III
September 27, 2021

If the variable accum is used only in the ISR, make it static and move into that ISR. For compiled code it will change nothing, but it will limit it's accessibility scope to the ISR.

The presented C code example has a mistake - line 7 should test the variable accum instead of rate.

Other than that I don't see any problems related to compiler or hardware memory types. Even if the store has not been completed, the load from the same address will still read the correct value. Compiler and memory barriers are not necessary here and are not a correct and reliable solution. In this case most likely the DMB instruction just increased the ISR execution time, but the real problem is late interrupt source clear. The correct and reliable solution is to always test the specific interrupt status flag at peripheral and clear it as soon as possible.

MSull.1
MSull.1Author
Associate II
September 27, 2021

Well, Piranha, you caught me ;) In an (apparently misguided) effort to simplify my posts and save future readers from wading through unrelated code, I keep writing analogous but simplified code. Then you or Pavel spot the missing parts! This is what it actually looks like:

__attribute__ ((section("TCM_Data")))
struct Drive_t 
{
	s32 Accum;
	volatile s32 Rate; // in steps per second
	u32 toggler; // not volatile because only accessed by the ISR
	volatile s32 StepsToGo; // if non-zero, count down and zero Drive_Rate when zero is reached
} Drive;
 
__attribute__ ((section("TCM_Code"),optimize(3)))
void LPTIM4_IRQHandler()
{
	//LED_RED_GPIO_Port->BSRR = LED_RED_Pin;
	LPTIM4->ICR = LPTIM_FLAG_ARRM;
 
 asm("dmb"); 
	if ((Drive.Accum += Drive.Rate) >= 0)
	{
	 Drive.Accum -= DENOMINATOR;
	 DRV_INT_STEP_GPIO_Port->BSRR = (Drive.toggler ^= (SET_STEP_PIN | CLR_STEP_PIN));
	 {
		 register s32 rstepstogo = Drive.StepsToGo;
		 if (rstepstogo)
		 {
			 if ((rstepstogo -= 1) == 0)
			 {
				 Drive.Rate = 0;
			 }
			 Drive.StepsToGo = rstepstogo;
		 }
	 }
 }
 //LED_RED_GPIO_Port->BSRR = LED_RED_Pin << 16;
}
 

Before you question it, let me explain that toggler is initialized elsewhere as SET_STEP_PIN so the result of the exclusive-or is to make toggler alternate between SET_STEP_PIN and CLEAR_STEP_PIN. The net effect is that the GPIO pin toggles at the specified rate.

As you can probably guess, uncommenting the writes to LED_GPIO_PORT/PIN lets me examine the timing with an oscilloscope. This pulse takes 120 nS when accum overflows and 80nS when it doesn't. Timing the function this way doesn't include exception entry and exit, but that's about a 10% CPU load. And I do have the FPU set to lazy stacking.

So, yes, I am clearing the interrupt flag as soon as possible. It's the first thing on entry. In the generated code, the write to the SFR occurs on the third instruction in the SFR. The address of LPTIM4_IRQHandler is in the vector table, there's no other indirection. Unless I want to dedicate a global register to the address of LPTIM4, it's hard to see how this could be any faster. (Since the whole bare-metal system is in "C", I actually could use r9 for that purpose but it would be a hair-tearing regression bug waiting to happen.)

	.section	TCM_Code,"ax",%progbits
	.align	1
	.p2align 2,,3
	.global	LPTIM4_IRQHandler
	.syntax unified
	.thumb
	.thumb_func
	.fpu fpv5-d16
	.type	LPTIM4_IRQHandler, %function
LPTIM4_IRQHandler:
.LFB160:
	.loc 1 420 1 is_stmt 1 view -0
	.cfi_startproc
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	.loc 1 422 2 view .LVU1023
	.loc 1 422 72 is_stmt 0 view .LVU1024
	ldr	r3, .L108
	mov	r2, #-1
	str	r2, [r3, #4]
	.loc 1 424 5 is_stmt 1 view .LVU1025
	.syntax unified
@ 424 "../src/Drive.c" 1
	dmb
 
...
 
.L108:
	.word	1476406272	// base address of LPTIM4 SFRs

( See how much cleaner and more readable it was when I simplified it before? )

At this point, I don't actually understand what's wrong but the adding and removing the DMB with no other changes unquestionably turns the problem off and on. Without the DMB, the I/O pin toggles about 20% or so too fast.

I thought of declaring accum (and toggler) in function-local static but pairing it up with rate sets up the compiler to access both of them with a single load of an address to base register r3. Rate can't be local static because the foreground has to be able to change it. Given that there's nothing else in the DTCM, accum and rate would most likely still be close enough for an [r3,#n] access if I made accum a local static, but it would be at the discretion of the linker and someday, if other allocations are added to the DTCM, the timing could change subtly. I've also never put a section attribute on a local static but I don't see any reason one couldn't. Yes, it bothers me that accum and toggler are exported globally but saving a couple of instructions in an 800 KHz ISR makes a material difference.

	ldr	r3, .L103
	ldr	r2, [r3]	// load rate
	ldr	r1, [r3, #4]	// load accum

The construction if ((Drive.Accum += Drive.Rate) >= 0) isn't exactly reader-friendly code but emphasizes that we are re-using the result of the addition without generating another load of Accum. In the generated code, the flags from the add instruction are immediately used by a bmi instruction.

In hindsight, I probably should have simply written the ISR in assembler since I did so much source tweaking to get the compiler to give me the desired result. And I could be exposing myself to some late nights if a future version of the compiler doesn't generate exactly the same assembly.

MSull.1
MSull.1Author
Associate II
September 27, 2021

What I really meant was it would have been less effort on my part to write it in assembly instead of cajoling gcc into doing what I wanted. The additional advantage might be that the assembly wouldn't change if rebuilt with some future version of the compiler.

You're right. I instrumented again and the ISR is double hitting. It's not clear to my why I didn't see that before with the oscilloscope but, anyway, I see it now. Adding the test of LPTIM_ISR_ARRM solves the problem. Makes me wonder why DMB solved the problem. Probably because it introduced just enough additional delay.

MSull.1
MSull.1Author
Associate II
September 27, 2021

On reflection, I realize that I've seen this same problem before with a UART ISR.