Erratum "Delay after an RCC peripheral clock enabling" - suggested workaround fails to work

[This is a reconstruction of a thread I've started on Apr 7, 2017, but got lost in the two forum software transitions since then.]

In the STM32F40x and STM32F41x errata sheet (DocID022183 Rev 8), section 2.1.13, "Delay after an RCC peripheral clock enabling", three workarounds are suggested.

The third one is:
3.  Or simply insert a dummy read operation from the corresponding register just after enabling the peripheral clock.

It is not clear what the "corresponding register" is: a register of RCC, or a register of the peripheral whose clock is being enabled?

Regardless, it does not appear to work in either case. In the following code (with all clock settings at their reset defaults):

 

8000214:    2201          movs    r2, #1 
8000216:    6b18          ldr    r0, [r3, #48]    ; 0x30 
8000218:    f440 5080     orr.w    r0, r0, #4096    ; 0x1000 
800021c:    6318          str    r0, [r3, #48]    ; 0x30 
800021e:    6b18          ldr    r0, [r3, #48]    ; 0x30  <------------------ readback 
8000220:    600a          str    r2, [r1, #0] 
8000222:    2000          movs    r0, #0 
8000224:    bf00          nop 
8000226:    bf00          nop

 

where r3 is preloaded with the RCC base address and r1 with the CRC base address. The readback is performed just after the write to RCC, and then the now presumably enabled CRC data register is written. This should result in the CRC of a single word containing 0x00000001 being calculated, so CRC->DR should read 0xC3C5C0CC. However, placing a breakpoint at the last nop and reading out CRC->DR shows 0xFFFFFFFF, so the preceding write to CRC->DR has been ignored.
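As a cross-check of the expected value, the CRC unit's calculation can be modelled in software. This is a minimal sketch, assuming the 'F4 CRC unit's fixed configuration (polynomial 0x04C11DB7, whole 32-bit words processed MSB first, no bit reversal, no final XOR). The function name stm32_crc_word is made up for illustration.

```c
#include <stdint.h>

/* Software model of one word written to CRC->DR.
 * Assumptions: polynomial 0x04C11DB7, MSB first, no reflection,
 * no final XOR. stm32_crc_word is a hypothetical name, not an ST API. */
uint32_t stm32_crc_word(uint32_t crc, uint32_t data)
{
    crc ^= data;                          /* fold the new word into the state */
    for (int i = 0; i < 32; i++) {        /* one shift per data bit */
        if (crc & 0x80000000u)
            crc = (crc << 1) ^ 0x04C11DB7u;
        else
            crc <<= 1;
    }
    return crc;
}
```

Starting from the reset value 0xFFFFFFFF, stm32_crc_word(0xFFFFFFFFu, 0x00000001u) evaluates to 0xC3C5C0CC. A CRC->DR that still reads 0xFFFFFFFF therefore means the written word never entered the calculation.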

I tried also

 

 800021e:    6808          ldr    r0, [r1, #0]  <------------------ readback

 

with the same result.

The two nops or one dsb at that place (i.e. the two other suggested workarounds) work as expected, CRC->DR contains 0xC3C5C0CC at the breakpoint.
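
For reference, a minimal sketch of how the dsb variant might be wrapped in C so that the compiler cannot move the delay away from the enabling write. The helper name and its pointer parameter are assumptions made for illustration. On a non-ARM host the dsb is replaced by a plain compiler barrier, so the fragment still compiles there but of course demonstrates nothing about timing.

```c
#include <stdint.h>

/* Hypothetical helper: set an enable bit, then delay before returning.
 * enr  - address of an RCC enable register, e.g. &RCC->AHB1ENR
 * mask - enable bit(s) to set, e.g. RCC_AHB1ENR_CRCEN */
void clock_enable_with_delay(volatile uint32_t *enr, uint32_t mask)
{
    *enr |= mask;                          /* read-modify-write of the enable register */
#if defined(__arm__) || defined(__thumb__)
    __asm volatile ("dsb" ::: "memory");   /* alternatively "nop\n\tnop" */
#else
    __asm volatile ("" ::: "memory");      /* host build: compiler barrier only */
#endif
}
```

With such a helper the sequence would read clock_enable_with_delay(&RCC->AHB1ENR, RCC_AHB1ENR_CRCEN); followed by CRC->DR = 1;. Whether the delay is long enough on a given part still has to be checked on that part.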

ST, please comment.

JW

PS. I experimented with this as it appears to have a similar timing relationship as the CRC-reset-to-data issue I reported here {link leads to my own thread titled "CRC data ignored after CRC Reset" which is another lost thread, to which a later thread links, too}{edit - original thread reconstructed}.
---

{Clive Two.Zero} {Apr 7, 2017 5:15 PM}
Perhaps it is not exactly the same errata, but one specific to the logic of the CRC peripheral. The original errata is one where a pipelined back-to-back write via the write buffer (enabling clock, writing peripheral) provided no margin.

Adding the read of the RCC register after the write gave the peripheral additional clocks to recognize it had just come out of a reset state with clocks for the first time.

The CRC peripheral clearly seems to need more cycles of its state machine to get its act together. It speaks to a deficiency in the CRC logic, because it should be possible to do single-cycle computation (in HW it's way faster than a 32-bit full adder).
---

{waclawek.jan} @ {Clive Two.Zero} on {Apr 8, 2017 12:17 PM}
> Perhaps it is not exactly the same errata, but one specific to the logic of the CRC peripheral.

Agree. I will retry on Monday with GPIO.

JW

---

{Clive Two.Zero} @ {waclawek.jan} on {Apr 8, 2017 6:15 PM}
I generally try to push the clock stuff up earlier in the initialization. Have you profiled the power consumption of the CRC peripheral? I.e. clock disabled vs. clock enabled but peripheral unused.

The APB/AHB clock hazard is more systemic, a bit like the Pipeline/Write-Buffer issue getting IRQ clearing quickly enough to propagate to the Tail-Chaining logic.

ARM has generally eschewed putting logic in to hide all hazards. Intel erects a lot of cliff top fencing to save every lemming.

---

{waclawek.jan} @ {Clive Two.Zero} on {Apr 8, 2017 8:53 PM}
> I generally try to push the clock stuff up earlier in the initialization.

Me too.

> Have you profiled the power consumption of the CRC peripheral?

No. As I've said above, I was trying to find a simple, fast but safe solution to the {CRC problem} I discovered earlier (and finding it was lots of fun, as it was the compiler which kept putting the CRC-reset and CRC-data-write instructions a random number of other instructions apart whenever I tried to add debugging code around them). Yes, a number of nops appears to solve the problem, but then I'd have to resort to inline assembler, which is not my forte - gcc is kind enough to reorder NOPs inserted as standalone statements or macros, too.

I looked at the errata as a source of inspiration.

> ARM has generally eschewed putting logic in to hide all hazards. Intel erects a lot of cliff top fencing to save every lemming.

I'd say in this case it's ST's guilt. As in the case of consecutive writes to CRC->DR, the peripherals (including RCC) which require guard times may insert a sufficient number of waitstates, or provide a readback indicator. At the end of the day, I am absolutely content with things being dangerous, as long as they are clearly and concisely documented, possibly with an accompanying simple and clean program example - which unfortunately is far from being the case.

Jan

---

{waclawek.jan} @ {waclawek.jan} on {Apr 10, 2017 10:02 AM}
Confirming the same problem still pertains when writing to GPIO (an APB-related readback might make a difference in case of enabling APB-based peripherals, but I am not interested in experimenting further with what is ST's task to ensure. [EDIT] OK, so I tried with APB1/TIM2, and the problem did not show up at all, i.e. a write to TIM2_CR1 immediately after the write to RCC_APB1ENR was successful, no matter what the AHB/APB divider was.)

Interestingly, when the write buffer on the processor-to-bus interface is disabled, one nop is sufficient, but ldr still is not (IMO this is related to the processor giving up/reacquiring the bus in case of nops, while still keeping the bus in case of successive st/ld - but again, this is ST's task to explain).

As I've said, my inline asm skills are not the top, so please don't laugh too loudly. {indentations, tabulators lost, sorry}

 

 

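// SCnSCB->ACTLR = SCnSCB_ACTLR_DISDEFWBUF_Msk;  // reduces needed nr of nops
{
  register uint32_t tmp1, tmp2;
  __asm volatile(
    "movs  %[t1], #1"            "\n\t"  // value to be written to GPIOC_ODR
    "ldr   %[t2], [%[p2], #48]"  "\n\t"  // read RCC_AHB1ENR (offset 0x30)
    "orr   %[t2], #4"            "\n\t"  // set GPIOCEN (bit 2)
    "str   %[t2], [%[p2], #48]"  "\n\t"  // enable the GPIOC clock
#if(0)
    "nop"                        "\n\t"  // variant: two nops
    "nop"                        "\n\t"
#elif(0)
    "dsb"                        "\n\t"  // variant: dsb
#elif(0)
    "ldr   %[t2], [%[p2], #48]"  "\n\t"  // variant: read back RCC_AHB1ENR
#elif(1)
    "ldr   %[t2], [%[p], #0x14]" "\n\t"  // variant: read GPIOC_ODR (offset 0x14)
#else
    // nothing
#endif
    "str   %[t1], [%[p], #0x14]" "\n\t"  // write GPIOC_ODR
    "nop"                        "\n\t"
    "nop"                        "\n\t"
    "nop"                        "\n\t"
    : [t1] "=&r" (tmp1)
    , [t2] "=&r" (tmp2)
    : [p]  "r" (GPIOC)
    , [p2] "r" (RCC)
    : "cc", "memory");  // movs updates the flags; the stores modify memory
}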

 

 

JW

TDK

@waclawek.jan FWIW, I can't quite duplicate this on an STM32F411 with max clock (the erratum for this particular chip is the same). I tested GPIO->ODR, CRC->DR and TIM2->ARR. Only GPIO and CRC needed a delay, and a delay of 1 NOP or 1 read from either register worked. TIM2 matched your results (no delay needed). I even used maxed AHB/APB prescalers, which made no difference in the results (despite what the errata suggests).

Ticks were the difference in DWT->CYCCNT before/after the stores and loads.

GPIOB->ODR:
[00.008080] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=0, ticks=5 <--
[00.012273] NOP=0, READ_ENABLE=1, READ_PERIPHERAL=0, result=1, ticks=6
[00.016467] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=1, result=1, ticks=6
[00.020661] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=0, ticks=5 <--
[00.024854] NOP=1, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=6
[00.029048] NOP=2, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=7
[00.033250] NOP=3, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=8
[00.037444] NOP=4, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=9
[00.041637] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=0, ticks=5 <--

CRC->DR:
[00.008093] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=4294967295, ticks=5 <--
[00.012312] NOP=0, READ_ENABLE=1, READ_PERIPHERAL=0, result=3284517068, ticks=10
[00.016532] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=1, result=3284517068, ticks=10
[00.020749] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=4294967295, ticks=5 <--
[00.024968] NOP=1, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=10
[00.029188] NOP=2, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=11
[00.033416] NOP=3, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=12
[00.037636] NOP=4, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=13
[00.041853] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=4294967295, ticks=5 <--

TIM2->ARR:
[00.008078] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=5
[00.012276] NOP=0, READ_ENABLE=1, READ_PERIPHERAL=0, result=123, ticks=6
[00.016476] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=1, result=123, ticks=23
[00.020675] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=5
[00.024873] NOP=1, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=6
[00.029071] NOP=2, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=7
[00.033277] NOP=3, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=8
[00.037476] NOP=4, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=9
[00.041674] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=5
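
The tick counts in these logs are differences of two DWT->CYCCNT readings. A small sketch of that subtraction, using a hypothetical helper name: CYCCNT is a free-running 32-bit counter, so unsigned arithmetic gives the right answer even if the counter wraps once between the two readings.

```c
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two DWT->CYCCNT samples.
 * Unsigned subtraction is modulo 2^32, so a single wrap is handled. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For example, elapsed_ticks(0xFFFFFFFEu, 3u) yields 5, just as elapsed_ticks(100u, 105u) does.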

Example code where READ_ENABLE=1:

08002aa8: ldr   r2, [r6, #0]    start = DWT->CYCCNT;
08002aaa: str   r0, [r4, #0]    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN;
08002aac: ldr   r1, [r4, #0]    [r1] = RCC->AHB1ENR
08002aae: str.w r12, [r5]       GPIOB->ODR = 1;
08002ab2: ldr   r3, [r6, #0]    end = DWT->CYCCNT;
08002ab4: ldr   r1, [r5, #0]    result = GPIOB->ODR;

 Registers at start of that code:

[Screenshot of the register values: TDK_0-1691207035293.png]

96 MHz clock, prefetch enabled, data cache enabled, instruction cache enabled.

[00.002678] Clock description:
[00.002839] - System clock selection: PLLCLK using HSE @ assumed 8 MHz
[00.002994] - PLL scalers: M=/4, N=x192, P=/4, Q=/8
[00.003133] - Prescalers: AHB=/1, APB1=/2, APB2=/1
[00.003324] - Clocks: VCO=384 MHz (in range), SYSCLK=96 MHz, HCLK=96 MHz
[00.003508] - Clocks: PCLK1=48 MHz (timers 96 MHz), PCLK2=96 MHz
[00.003612] - Clocks: 48MHz=48 MHz
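
The figures in this clock description follow from the usual 'F4 main PLL relations (VCO = input / M × N, SYSCLK = VCO / P, 48 MHz domain = VCO / Q). A short sketch that reproduces them; the struct and function names are made up for illustration, and all frequencies are in Hz.

```c
#include <stdint.h>

/* Hypothetical container for the main PLL outputs, in Hz. */
struct pll_clocks {
    uint32_t vco;     /* VCO output */
    uint32_t sysclk;  /* PLL P output, used as SYSCLK */
    uint32_t clk48;   /* PLL Q output (48 MHz domain) */
};

/* Compute the PLL outputs from the input clock and the M, N, P, Q factors. */
struct pll_clocks pll_compute(uint32_t in_hz, uint32_t m, uint32_t n,
                              uint32_t p, uint32_t q)
{
    struct pll_clocks c;
    c.vco    = in_hz / m * n;   /* divide first: avoids 32-bit overflow */
    c.sysclk = c.vco / p;
    c.clk48  = c.vco / q;
    return c;
}
```

With the values from the log, pll_compute(8000000u, 4u, 192u, 4u, 8u) gives 384 MHz for the VCO, 96 MHz for SYSCLK and 48 MHz for the Q output.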

 

If you feel a post has answered your question, please click "Accept as Solution".

@TDK,

thanks for your work.

Just to be absolutely sure: READ_ENABLE means read back from RCC, READ_PERIPHERAL means read from GPIO (in this case), correct?

To reproduce, I concentrated only on the GPIO case, i.e.

  • write RCC_AHBxENR.GPIOC to 1
  • perform one of the actions providing the delay
  • write GPIOC_ODR to 1

Timing starts/ends around this sequence; GPIOC_ODR is tested for 1 afterwards, and GPIOC is reset using the respective reset bit in RCC, and the enable bit is cleared; then the next variant is tested.

In attachment  [EDIT] absurdly, this forum does not allow .zip, so here [/EDIT] source, .elf, .hex and disasm (.lss). The 'F4 variant was compiled so that stack is not in CCMRAM, thus the same binary could be run in all 3 models.

I don't think clock is much relevant here so I first tried with all clock/FLASH default, then in the attempt to explain differences between your and my findings I set APB1 divider to 2 and enabled FLASH prefetch/both caches, but that had no impact whatsoever on the results.

I am too lazy to bring up some printf()/semihosting (in fact I never did, so I don't even know how exactly to do that), so I simply put in a breakpoint (__BKPT()) and read out the results using the debugger.

Results:

 

 

works?[yes/no]/ticks
  idx  method      'F411   'F407='F427   'L476
0,3,7  none         no/5       no/6       no/5
    1  read RCC    yes/6       no/7      yes/7
    2  read GPIO   yes/7       no/7       no/6
    4  one NOP     yes/6       no/6       no/6
    5  two NOPs    yes/7      yes/8      yes/7
    6  DSB         yes/7      yes/9      yes/7

 

 

Discussion:

I am confused and have no explanations. I can't explain the difference between your finding and mine for the GPIO read (READ_PERIPHERAL) on the 'F411, and I can't really explain the major difference between 'F407/'F429 and 'F411.

Contrary to the 'F4 where both GPIO and RCC are on the same AHB bus, in the 'L476 RCC and GPIOx are on different AHB buses, that might perhaps explain *some* of the differences, but still I'm much confused.

Note, that 'L476 is vulnerable to this problem, too, although its Errata (ES0250 Rev 10) does not mention it.

JW

 

Yes, your interpretation is correct. It looks like we're doing the exact same thing. I reset the peripheral differently, and use printf for results, but that shouldn't matter.

The clock setup on the F411 and F405/F427/etc is different in that it lets HCLK=PCLK2 at all speeds, I guess that would explain the difference. I have plenty of F429s around, could test one of those and should get the same results as you.

I'm not necessarily defending the HAL implementation, but they do more than just read the register with a single LDR instruction so probably introduce enough of a delay to not cause issues. Feels like a DSB would be a lot cleaner.


> I'm not necessarily defending the HAL implementation, but they do more than just read the register with a single LDR instruction so probably introduce enough of a delay to not cause issues.

Let's see:

 

#define __HAL_RCC_GPIOA_CLK_ENABLE()   do { \
                                        __IO uint32_t tmpreg = 0x00U; \
                                        SET_BIT(RCC->AHB1ENR, RCC_AHB1ENR_GPIOAEN);\
                                        /* Delay after an RCC peripheral clock enabling */ \
                                        tmpreg = READ_BIT(RCC->AHB1ENR, RCC_AHB1ENR_GPIOAEN);\
                                        UNUSED(tmpreg); \
                                          } while(0U)

 

Maybe. As the result of readback+masking goes into a volatile variable, at least the mask (AND) should be performed, even if an aggressive optimizer may place the variable in a register and thus perform nothing more than the mask (note that according to C99, what constitutes a volatile access is implementation-defined). But I am not particularly interested in Cube/HAL.

> Feels like a DSB would be a lot cleaner.

I am not a fan of using DSB for this kind of problem. It has enough of a mystical air to sound like the ultimate solution, but it's not. It does not see beyond the boundary of the processor, and this is exactly the kind of problem which happens completely outside the scope of the processor - in the mcu's fabric. That's why we see different results for different STM32 variants, even if they are all based on the same processor.

I never subscribed to the notion of "hardware abstraction" or "drivers" or any similar high-level concept. So I don't use one ultimate solution, and I also don't feel a need to have one. I feel a need to *understand* the need for solution, and to have a clean avenue towards it.

I use the STM32 as I ever used any other microcontroller, and I've always enabled clocks, set up (most of the) pins etc. at one single point, at the startup (there are of course justified exceptions, where a clock or pin setting has to be modified runtime - but they are exactly that, exceptions). And there are always several other clock registers to set up before I got to setting up the pins or any other peripheral. So this issue simply never had a chance to catch me... :)

Nonetheless, this is just another of the zillions of gotchas one has to be aware of.

My real concern is the erratum, which at least for the 'F407 is incorrect and this is because I consider documentation (i.e. PM+RM+DS+ES+AN) to be the ultimate source of the most accurate and detailed data on the STM32's internals. So, if it's incorrect, it's of utmost importance to correct it ASAP.

I wish ST to adopt this attitude.

JW

The LDR is supposed to stall the pipeline and wait for the pending write buffer(s) to clear, to ensure in-order completion. Prior instructions will impact how many resources are tied up, and the bus speeds and transactions determine the time this takes.

What it won't do is address timing hazards within the peripheral logic itself.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

I can replicate your results with the STM32F429. Changing AHB prescaler changes things slightly.

[00.001752] Clock description:
[00.001872] - System clock selection: PLLCLK using HSE bypass @ assumed 8 MHz
[00.001980] - PLL scalers: M=/4, N=x168, P=/2, Q=/7
[00.002076] - Prescalers: AHB=/1, APB1=/4, APB2=/2
[00.002208] - Clocks: VCO=336 MHz (in range), SYSCLK=168 MHz, HCLK=168 MHz
[00.002365] - Clocks: PCLK1=42 MHz (timers 84 MHz), PCLK2=84 MHz (timers 168 MHz)
[00.002440] - Clocks: 48MHz=48 MHz
[00.002600]
[00.002620]
[00.002667] Results for GPIOB->ODR:
[00.006747] ticks=6, result=0, FAILED <-----
[00.010832] ticks=9, DSB=1, result=1, pass
[00.014938] ticks=7, READ_ENABLE=1, result=0, FAILED <-----
[00.019047] ticks=7, READ_PERIPH=1, result=0, FAILED <-----
[00.023147] ticks=6, NOP=1, result=0, FAILED <-----
[00.027235] ticks=8, NOP=2, result=1, pass


[00.003505] Clock description:
[00.003746] - System clock selection: PLLCLK using HSE bypass @ assumed 8 MHz
[00.003961] - PLL scalers: M=/4, N=x168, P=/2, Q=/7
[00.004152] - Prescalers: AHB=/2, APB1=/4, APB2=/2
[00.004414] - Clocks: VCO=336 MHz (in range), SYSCLK=168 MHz, HCLK=84 MHz
[00.004723] - Clocks: PCLK1=21 MHz (timers 42 MHz), PCLK2=42 MHz (timers 84 MHz)
[00.004874] - Clocks: 48MHz=48 MHz
[00.005195]
[00.005235]
[00.005328] Results for GPIOB->ODR:
[00.009487] ticks=6, result=0, FAILED <-----
[00.013657] ticks=9, DSB=1, result=1, pass
[00.017846] ticks=7, READ_ENABLE=1, result=1, pass
[00.022039] ticks=7, READ_PERIPH=1, result=1, pass
[00.026234] ticks=6, NOP=1, result=0, FAILED <-----
[00.030408] ticks=8, NOP=2, result=1, pass

I could not get the HAL macro to fail under any circumstances, under full optimization. Anyhow, curiosity satisfied. Was interesting to investigate.


 

> What it won't do is address timing hazards within the peripheral logic itself.

Of course. It feels like peripheral-specific logic timing hazards would fall under a different errata, though. I know there was talk of CRC maybe needing some setup time, but I didn't see that in testing.


I've reconstructed the CRC-specific thread.

JW

> Changing AHB prescaler changes things slightly.

OH!

Confirmed on 'F407. No impact on 'L476 (and that's probably expected, sort of, the weirdness of 'F407 is probably given by pushing it further in terms of max. clock).

JW