Some guidelines on what's appropriate to put in CCMRAM on STM32F4 for optimal performance?

arnold_w
Senior

I'm working with the STM32F405 microcontroller and I recently ran out of RAM, but then I discovered that the 64 kB CCMRAM was completely unused. Given that the CCMRAM was supposed to be fast, I moved the stack and state variables of low-level drivers into CCMRAM, but I found that some code actually executed slower after these changes. Most notably, a software UART (that simply writes to GPIOx->BSRR 10 times uninterruptedly) had its Baud rate changed from 3 MBaud to 2.4 MBaud after the stack was moved. Can someone please give some guidelines on what's appropriate to place in the CCMRAM if I want the fastest performance of my code?

8 REPLIES

> I moved the stack and state variables of low-level drivers into CCMRAM

Correct.

> Most notably, a software UART (that simply writes to GPIOx->BSRR 10 times uninterruptedly) had its Baud rate changed from 3 MBaud to 2.4 MBaud after the stack was moved.

Show code (source and disasm) before and after. 

JW

arnold_w
Senior

So, I was going to reproduce the software UART Baud rate issue, so I moved the stack to CCMRAM, and upon further investigation it turns out the Baud rate depends on how I START my code! If I click "Debug STM Application Debug" (button with the insect symbol) or "Run STM Application Debug" (button with an arrow pointing to the right) in Eclipse (NOT System Workbench or STM32CubeIDE), then my Baud rate is 2.4 MBaud. However, if I click Restart, which doesn't change the firmware, then I get 3 MBaud! Also, if I cycle power on my device and let it run freely without Eclipse, then my Baud rate is 3 MBaud. If I move my stack back into regular RAM, then my Baud rate is always 3 MBaud. What on earth is going on???

If you rely on timing generated by instruction execution time and bus delays, then anything which influences those two will influence the timing, too.

Debugging is intrusive: the debugger is part of the processor core and, for it to work, it uses the same resources (buses) as the core, competing with it. Its exact workings, all the way from the clicks in the IDE through layers of software and hardware down to the intimate details of what happens in the processor core and buses, are documented sparsely or not at all.

In other words, don't rely on timing through instruction execution, and be prepared for it to change whenever you introduce even innocuous changes. Use hardware as appropriate.

JW

 

I think it's strange that the problem goes away when I click "Restart". After that, the Baud rate is correct (3 MBaud) and I can insert breakpoints and do everything else related to debugging, and everything works just great.

arnold_w
Senior

Upon further investigation I found that the stack is only actually located in CCMRAM when I click "Debug STM Application Debug" or "Run STM Application Debug"; at all other times it goes back to regular RAM. I verified that using:

uint32_t MyStackPtr;
__asm volatile("mov %0, sp" : "=r" (MyStackPtr));

In order to move the stack to CCMRAM, I made 2 changes in sections.ld:

 

__stack = ORIGIN(RAM) + LENGTH(RAM);   

->   

__stack = ORIGIN(CCMRAM) + LENGTH(CCMRAM); 

 

._check_stack : ALIGN(4)
{
. = . + _Minimum_Stack_Size ;
} >RAM

-> 

._check_stack : ALIGN(4)
{
. = . + _Minimum_Stack_Size ;
} >CCMRAM
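The stack pointer check described above can be made a bit more convenient with a small helper that tells which RAM region an SP value falls into. This is only a sketch: the function and enum names are hypothetical, and the address ranges assume the STM32F405 memory map (64 KB CCM data RAM at 0x10000000, 128 KB main SRAM at 0x20000000). Because the stack is full-descending, an SP equal to the end address of a region (an empty stack) is counted as belonging to that region.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed STM32F405 memory map: adjust for other parts. */
#define CCMRAM_BASE 0x10000000u
#define CCMRAM_SIZE (64u * 1024u)
#define SRAM_BASE   0x20000000u
#define SRAM_SIZE   (128u * 1024u)

/* Hypothetical names, introduced only for this illustration. */
typedef enum { REGION_OTHER, REGION_CCMRAM, REGION_SRAM } mem_region_t;

/* Classify a stack pointer value. The upper bound is inclusive because
 * an empty full-descending stack points one past the end of its RAM. */
mem_region_t classify_stack_pointer(uint32_t sp)
{
    if (sp >= CCMRAM_BASE && sp - CCMRAM_BASE <= CCMRAM_SIZE) {
        return REGION_CCMRAM;
    }
    if (sp >= SRAM_BASE && sp - SRAM_BASE <= SRAM_SIZE) {
        return REGION_SRAM;
    }
    return REGION_OTHER;
}
```

On the target, the value read with the `mov %0, sp` inline assembly would be passed to `classify_stack_pointer()`, for example right at the start of `main()`, so the result can be compared between a debugger launch, a Restart, and a power cycle.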

So, apparently I'm not moving the stack correctly. However, if code is executing slower with the stack in CCMRAM, I'm not sure I want to move it. This is what my software UART code looks like:

inline static void __attribute__((optimize("O1"))) outputByte(volatile uint32_t* BSRR, uint32_t BSRRvalues[10]) {

08041040: ldr r2, [r1, #0]
08041042: ldr r3, [pc, #40] ; (0x804106c <outputByte+44>)
08041044: str r2, [r3, #24]
111 GPIOB->BSRR = BSRRvalues[1];
08041046: ldr r2, [r1, #4]
08041048: str r2, [r3, #24]
112 GPIOB->BSRR = BSRRvalues[2];
0804104a: ldr r2, [r1, #8]
0804104c: str r2, [r3, #24]
113 GPIOB->BSRR = BSRRvalues[3];
0804104e: ldr r2, [r1, #12]
08041050: str r2, [r3, #24]
114 GPIOB->BSRR = BSRRvalues[4];
08041052: ldr r2, [r1, #16]
08041054: str r2, [r3, #24]
115 GPIOB->BSRR = BSRRvalues[5];
08041056: ldr r2, [r1, #20]
08041058: str r2, [r3, #24]
116 GPIOB->BSRR = BSRRvalues[6];
0804105a: ldr r2, [r1, #24]
0804105c: str r2, [r3, #24]
117 GPIOB->BSRR = BSRRvalues[7];
0804105e: ldr r2, [r1, #28]
08041060: str r2, [r3, #24]
118 GPIOB->BSRR = BSRRvalues[8];
08041062: ldr r2, [r1, #32]
08041064: str r2, [r3, #24]
119 GPIOB->BSRR = BSRRvalues[9];
08041066: ldr r2, [r1, #36] ; 0x24
08041068: str r2, [r3, #24]
132 }

Whenever the BSRRvalues array, and hence r1, points to an address in CCMRAM, my UART runs at 2.4 MBaud. Whenever the BSRRvalues array, and hence r1, points to an address in regular RAM, my UART runs at 3 MBaud. This makes no sense to me.

Interesting.

What's your system clock?

JW

> What's your system clock?

12 MHz

So, 4 cycles per ld/st pair with the array in regular RAM (12 MHz / 3 MBaud) and 5 cycles with it in CCMRAM (12 MHz / 2.4 MBaud).

I wouldn't expect that much, and frankly the fact that with CCMRAM it takes more is surprising to me.

JW
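For reference, the cycle counts follow directly from the numbers given in the thread: each bit is one ldr/str pair, so the cycles per pair are the system clock divided by the observed Baud rate. A minimal sketch of that arithmetic (the function name is hypothetical, and it assumes exactly one ldr/str pair per bit with nothing else in between, as in the disassembly above):

```c
#include <assert.h>
#include <stdint.h>

/* Core clock cycles spent per transmitted bit, rounded to the nearest
 * integer. Hypothetical helper, introduced only for this illustration. */
uint32_t cycles_per_bit(uint32_t sysclk_hz, uint32_t baud)
{
    assert(baud != 0u);                      /* avoid division by zero */
    return (sysclk_hz + baud / 2u) / baud;   /* round to nearest */
}
```

With the 12 MHz system clock reported above, `cycles_per_bit(12000000u, 3000000u)` gives 4 and `cycles_per_bit(12000000u, 2400000u)` gives 5, so the slowdown corresponds to one extra cycle per ldr/str pair when the array is read from CCMRAM.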