Erratum "Delay after an RCC peripheral clock enabling" - suggested workaround fails to work

August 4, 2023

[This is a reconstruction of a thread I started on Apr 7, 2017, which got lost in the two forum software transitions since then.]

In STM32F40x and STM32F41x Errata sheet DocID022183 Rev 8, 2.1.13 Delay after an RCC peripheral clock enabling, there are three workarounds suggested.

The third one is:
3. Or simply insert a dummy read operation from the corresponding register just after enabling the peripheral clock.

It is not clear what the "corresponding register" is: is it a register of RCC, or a register of the peripheral whose clock is being enabled?

However, regardless of this, it does not appear to be working in either case. In the following code (with all clocks settings at their reset defaults):

8000214: 2201 movs r2, #1 
8000216: 6b18 ldr r0, [r3, #48] ; 0x30 
8000218: f440 5080 orr.w r0, r0, #4096 ; 0x1000 
800021c: 6318 str r0, [r3, #48] ; 0x30 
800021e: 6b18 ldr r0, [r3, #48] ; 0x30 <------------------ readback 
8000220: 600a str r2, [r1, #0] 
8000222: 2000 movs r0, #0 
8000224: bf00 nop 
8000226: bf00 nop

where r3 is preloaded with RCC address and r1 with CRC address, the readback is performed just after the write to RCC, and subsequently the now-presumably-enabled CRC data register is written, which should result in CRC for one word containing 0x00000001 being calculated, so CRC->DR should read as 0xC3C5C0CC. However, placing a breakpoint at the last nop and reading out CRC->DR shows that it is at 0xFFFFFFFF, so the previous write to CRC->DR has been ignored.

I tried also

 800021e: 6808 ldr r0, [r1, #0] <------------------ readback

with the same result.

The two nops or one dsb at that place (i.e. the two other suggested workarounds) work as expected, CRC->DR contains 0xC3C5C0CC at the breakpoint.

ST, please comment.

PS. I experimented with this as it appears to have a similar timing relationship as the CRC-reset-to-data issue I reported here {link leads to my own thread titled "CRC data ignored after CRC Reset" which is another lost thread, to which a later thread links, too}{edit - original thread reconstructed}.
---

{Clive Two.Zero} {Apr 7, 2017 5:15 PM}
Perhaps it is not exactly the same errata, but one specific to the logic of the CRC peripheral. The original errata is one where a pipelined back-to-back write via the write buffer (enabling clock, writing peripheral) provided no margin.

Adding the read of the RCC register after the write gave the peripheral additional clocks to recognize it had just come out of a reset state with clocks for the first time.

The CRC peripheral clearly seems to need more cycles of its state machine to get its act together. It speaks to a deficiency in the CRC logic, because it should be possible to do single-cycle computation (in HW it's way faster than a 32-bit full-adder).
---

{waclawek.jan} @ {Clive Two.Zero} on {Apr 8, 2017 12:17 PM}
Perhaps it is not exactly the same errata, but one specific to the logic of the CRC peripheral.

Agree. I will retry on Monday with GPIO.

---

{Clive Two.Zero} @ {waclawek.jan} on {Apr 8, 2017 6:15 PM}
I generally try to push the clock stuff up earlier in the initialization. Have you profiled the power consumption of the CRC peripheral? ie Clock disabled, vs clock enabled but peripheral unused.

The APB/AHB clock hazard is more systemic, a bit like the Pipeline/Write-Buffer issue getting IRQ clearing quickly enough to propagate to the Tail-Chaining logic.

ARM has generally eschewed putting logic in to hide all hazards. Intel erects a lot of cliff top fencing to save every lemming.

---

{waclawek.jan} @ {Clive Two.Zero} on {Apr 8, 2017 8:53 PM}
> I generally try to push the clock stuff up earlier in the initialization.

Me too.

> Have you profiled the power consumption of the CRC peripheral?

No. As I've said above, I was trying to find a simple, fast but safe solution to the {CRC problem} I discovered earlier (and it was lots of fun finding, as it was the compiler which kept putting the CRC-reset and CRC-data-write instructions a random number of other instructions apart upon attempts to add debugging code around). Yes a number of nops appear to solve the problem but then I'd have to resort to inline assembler, which is not my forte - gcc is kind enough to reorder NOPs inserted as standalone/macros, too.

I looked at the errata as a source of inspiration.

> ARM has generally eschewed putting logic in to hide all hazards. Intel erects a lot of cliff top fencing to save every lemming.

I'd say in this case it's ST's fault. As in the case of consecutive writes to CRC->DR, the peripherals (including RCC) which require guard times could insert a sufficient number of waitstates, or provide a readback indicator. At the end of the day, I am absolutely content with things being dangerous, as long as they are clearly and concisely documented, possibly with an accompanying simple and clean program example - which unfortunately is far from being the case.

Jan

---

{waclawek.jan} @ {waclawek.jan} on {Apr 10, 2017 10:02 AM}
Confirming that the same problem persists when writing to GPIO (an APB-related readback might make a difference when enabling APB-based peripherals, but I am not interested in experimenting further with what is ST's task to ensure. [EDIT] OK, so I tried with APB1/TIM2, and the problem did not show up at all, i.e. a write to TIM2_CR1 immediately after the write to RCC_APB1ENR was successful, no matter what the AHB/APB divider was.)

Interestingly, when the write buffer on the processor-to-bus interface is disabled, one nop is sufficient, but an ldr still is not (IMO this is related to the processor giving up/reacquiring the bus in the case of nops, while still keeping the bus in the case of successive st/ld - but again this is ST's task to explain).

As I've said, my inline asm skills are not the top, so please don't laugh too loudly. {indentations, tabulators lost, sorry}

// SCnSCB->ACTLR = SCnSCB_ACTLR_DISDEFWBUF_Msk; // reduces needed nr of nops
{
    register uint32_t tmp1, tmp2;
    __asm volatile(
        "movs %[t1], #1"              "\n\t"
        "ldr %[t2], [%[p2], #48]"     "\n\t"  // RCC->AHB1ENR
        "orr %[t2], #4"               "\n\t"  // set GPIOCEN
        "str %[t2], [%[p2], #48]"     "\n\t"
#if(0)
        "nop"                         "\n\t"
        "nop"                         "\n\t"
#elif(0)
        "dsb"                         "\n\t"
#elif(0)
        "ldr %[t2], [%[p2], #48]"     "\n\t"  // readback from RCC
#elif(1)
        "ldr %[t2], [%[p], #0x14]"    "\n\t"  // readback from GPIOC->ODR
#else
        // nothing
#endif
        "str %[t1], [%[p], #0x14]"    "\n\t"  // GPIOC->ODR = 1
        "nop"                         "\n\t"
        "nop"                         "\n\t"
        "nop"                         "\n\t"
        : [t1] "=&r" (tmp1)
        , [t2] "=&r" (tmp2)
        : [p]  "r" (GPIOC)
        , [p2] "r" (RCC)
        : "cc", "memory"  // movs alters the flags; the stores touch memory
    );
}

This topic has been closed for replies.

TDK

@waclawek.jan FWIW, I can't quite duplicate this on a STM32F411 with max clock (this particular chip errata is the same). I tested GPIO->ODR, CRC->DR and TIM2->ARR. Only GPIO and CRC needed a delay, and a delay of 1 NOP or 1 read from either register worked. TIM2 matched your results (no delay needed). I even used maxed AHB/APB prescalers which made no difference in the results (despite what errata suggests).

Ticks were the difference in DWT->CYCCNT before/after the stores and loads.

GPIOB->ODR:
[00.008080] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=0, ticks=5 <--
[00.012273] NOP=0, READ_ENABLE=1, READ_PERIPHERAL=0, result=1, ticks=6
[00.016467] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=1, result=1, ticks=6
[00.020661] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=0, ticks=5 <--
[00.024854] NOP=1, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=6
[00.029048] NOP=2, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=7
[00.033250] NOP=3, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=8
[00.037444] NOP=4, READ_ENABLE=0, READ_PERIPHERAL=0, result=1, ticks=9
[00.041637] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=0, ticks=5 <--

CRC->DR:
[00.008093] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=4294967295, ticks=5 <--
[00.012312] NOP=0, READ_ENABLE=1, READ_PERIPHERAL=0, result=3284517068, ticks=10
[00.016532] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=1, result=3284517068, ticks=10
[00.020749] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=4294967295, ticks=5 <--
[00.024968] NOP=1, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=10
[00.029188] NOP=2, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=11
[00.033416] NOP=3, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=12
[00.037636] NOP=4, READ_ENABLE=0, READ_PERIPHERAL=0, result=3284517068, ticks=13
[00.041853] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=4294967295, ticks=5 <--

TIM2->ARR:
[00.008078] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=5
[00.012276] NOP=0, READ_ENABLE=1, READ_PERIPHERAL=0, result=123, ticks=6
[00.016476] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=1, result=123, ticks=23
[00.020675] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=5
[00.024873] NOP=1, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=6
[00.029071] NOP=2, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=7
[00.033277] NOP=3, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=8
[00.037476] NOP=4, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=9
[00.041674] NOP=0, READ_ENABLE=0, READ_PERIPHERAL=0, result=123, ticks=5

Example code where READ_ENABLE=1:

08002aa8: ldr r2, [r6, #0] start = DWT->CYCCNT;
08002aaa: str r0, [r4, #0] RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN;
08002aac: ldr r1, [r4, #0] [r1] = RCC->AHB1ENR
08002aae: str.w r12, [r5] GPIOB->ODR = 1;
08002ab2: ldr r3, [r6, #0] end = DWT->CYCCNT;
08002ab4: ldr r1, [r5, #0] result = GPIOB->ODR;

Registers at start of that code:

96 MHz clock, prefetch enabled, data cache enabled, instruction cache enabled.

[00.002678] Clock description:
[00.002839] - System clock selection: PLLCLK using HSE @ assumed 8 MHz
[00.002994] - PLL scalers: M=/4, N=x192, P=/4, Q=/8
[00.003133] - Prescalers: AHB=/1, APB1=/2, APB2=/1
[00.003324] - Clocks: VCO=384 MHz (in range), SYSCLK=96 MHz, HCLK=96 MHz
[00.003508] - Clocks: PCLK1=48 MHz (timers 96 MHz), PCLK2=96 MHz
[00.003612] - Clocks: 48MHz=48 MHz

If you feel a post has answered your question, please click "Accept as Solution".

waclawek.jan (Author)

Super User

@TDK,

thanks for your work.

Just to be absolutely sure: READ_ENABLE means read back from RCC, READ_PERIPHERAL means read from GPIO (in this case), correct?

To reproduce, I concentrated only on the GPIO case, i.e.

write RCC_AHBxENR.GPIOC to 1
perform one of the actions providing the delay
write GPIOC_ODR to 1

Timing starts/ends around this sequence; GPIOC_ODR is tested for 1 afterwards, and GPIOC is reset using the respective reset bit in RCC, and the enable bit is cleared; then the next variant is tested.

~~In attachment~~ [EDIT] absurdly, this forum does not allow .zip, so here [/EDIT] source, .elf, .hex and disasm (.lss). The 'F4 variant was compiled so that stack is not in CCMRAM, thus the same binary could be run in all 3 models.

I don't think clock is much relevant here so I first tried with all clock/FLASH default, then in the attempt to explain differences between your and my findings I set APB1 divider to 2 and enabled FLASH prefetch/both caches, but that had no impact whatsoever on the results.

I am too lazy to bring up some printf()/semihosting (in fact I never did, so I don't even know exactly how to do that), so I simply put in a breakpoint (__BKPT()) and read out the results using the debugger.

Results:

works? [yes/no] / ticks

idx    method      'F411    'F407='F427   'L476
0,3,7  none        no/5     no/6          no/5
1      read RCC    yes/6    no/7          yes/7
2      read GPIO   yes/7    no/7          no/6
4      one NOP     yes/6    no/6          no/6
5      two NOPs    yes/7    yes/8         yes/7
6      DSB         yes/7    yes/9         yes/7

Discussion:

I am confused and have no explanations. I can't explain the difference between your findings and mine for the GPIO read (READ_PERIPHERAL) on the 'F411. I can't really explain the major difference between 'F407/'F429 and 'F411 either.

Contrary to the 'F4 where both GPIO and RCC are on the same AHB bus, in the 'L476 RCC and GPIOx are on different AHB buses, that might perhaps explain *some* of the differences, but still I'm much confused.

Note, that 'L476 is vulnerable to this problem, too, although its Errata (ES0250 Rev 10) does not mention it.

TDK

Yes, your interpretation is correct. It looks like we're doing the exact same thing. I reset the peripheral differently, and use printf for results, but that shouldn't matter.

The clock setup on the F411 and F405/F427/etc is different in that it lets HCLK=PCLK2 at all speeds, I guess that would explain the difference. I have plenty of F429s around, could test one of those and should get the same results as you.

I'm not necessarily defending the HAL implementation, but they do more than just read the register with a single LDR instruction so probably introduce enough of a delay to not cause issues. Feels like a DSB would be a lot cleaner.


waclawek.jan (Author)

Super User

> I'm not necessarily defending the HAL implementation, but they do more than just read the register with a single LDR instruction so probably introduce enough of a delay to not cause issues.

Let's see:

#define __HAL_RCC_GPIOA_CLK_ENABLE() do { \
 __IO uint32_t tmpreg = 0x00U; \
 SET_BIT(RCC->AHB1ENR, RCC_AHB1ENR_GPIOAEN);\
 /* Delay after an RCC peripheral clock enabling */ \
 tmpreg = READ_BIT(RCC->AHB1ENR, RCC_AHB1ENR_GPIOAEN);\
 UNUSED(tmpreg); \
 } while(0U)

Maybe. As the result of readback+masking goes into a volatile variable, at least the mask (AND) should be performed, even if an aggressive optimizer may place the variable in a register and thus perform nothing more than the mask (note that according to C99, what constitutes a volatile access is implementation-defined). But I am not particularly interested in Cube/HAL.

> Feels like a DSB would be a lot cleaner.

I am not a fan of using DSB for this kind of problem. It has enough of a mystical air to sound like the ultimate solution, but it's not. It does not see beyond the boundary of the processor, and this is exactly the kind of problem which happens completely outside the scope of the processor - in the mcu's fabric. That's why we see different results for different STM32 variants, even if they are all based on the same processor.

I never subscribed to the notion of "hardware abstraction" or "drivers" or any similar high-level concept. So I don't use one ultimate solution, and I also don't feel a need to have one. I feel a need to *understand* the need for solution, and to have a clean avenue towards it.

I use the STM32 as I ever used any other microcontroller, and I've always enabled clocks, set up (most of the) pins etc. at one single point, at the startup (there are of course justified exceptions, where a clock or pin setting has to be modified runtime - but they are exactly that, exceptions). And there are always several other clock registers to set up before I got to setting up the pins or any other peripheral. So this issue simply never had a chance to catch me... :)

Nonetheless, this is just another of the zillions of gotchas one has to be aware of.

My real concern is the erratum, which at least for the 'F407 is incorrect and this is because I consider documentation (i.e. PM+RM+DS+ES+AN) to be the ultimate source of the most accurate and detailed data on the STM32's internals. So, if it's incorrect, it's of utmost importance to correct it ASAP.

I wish ST to adopt this attitude.

TDK

I can replicate your results with the STM32F429. Changing AHB prescaler changes things slightly.

[00.001752] Clock description:
[00.001872] - System clock selection: PLLCLK using HSE bypass @ assumed 8 MHz
[00.001980] - PLL scalers: M=/4, N=x168, P=/2, Q=/7
[00.002076] - Prescalers: AHB=/1, APB1=/4, APB2=/2
[00.002208] - Clocks: VCO=336 MHz (in range), SYSCLK=168 MHz, HCLK=168 MHz
[00.002365] - Clocks: PCLK1=42 MHz (timers 84 MHz), PCLK2=84 MHz (timers 168 MHz)
[00.002440] - Clocks: 48MHz=48 MHz
[00.002600] 
[00.002620] 
[00.002667] Results for GPIOB->ODR:
[00.006747] ticks=6, result=0, FAILED <-----
[00.010832] ticks=9, DSB=1, result=1, pass
[00.014938] ticks=7, READ_ENABLE=1, result=0, FAILED <-----
[00.019047] ticks=7, READ_PERIPH=1, result=0, FAILED <-----
[00.023147] ticks=6, NOP=1, result=0, FAILED <-----
[00.027235] ticks=8, NOP=2, result=1, pass


[00.003505] Clock description:
[00.003746] - System clock selection: PLLCLK using HSE bypass @ assumed 8 MHz
[00.003961] - PLL scalers: M=/4, N=x168, P=/2, Q=/7
[00.004152] - Prescalers: AHB=/2, APB1=/4, APB2=/2
[00.004414] - Clocks: VCO=336 MHz (in range), SYSCLK=168 MHz, HCLK=84 MHz
[00.004723] - Clocks: PCLK1=21 MHz (timers 42 MHz), PCLK2=42 MHz (timers 84 MHz)
[00.004874] - Clocks: 48MHz=48 MHz
[00.005195] 
[00.005235] 
[00.005328] Results for GPIOB->ODR:
[00.009487] ticks=6, result=0, FAILED <-----
[00.013657] ticks=9, DSB=1, result=1, pass
[00.017846] ticks=7, READ_ENABLE=1, result=1, pass
[00.022039] ticks=7, READ_PERIPH=1, result=1, pass
[00.026234] ticks=6, NOP=1, result=0, FAILED <-----
[00.030408] ticks=8, NOP=2, result=1, pass

I could not get the HAL macro to fail under any circumstances, under full optimization. Anyhow, curiosity satisfied. Was interesting to investigate.


waclawek.jan (Author)

Super User

> Changing AHB prescaler changes things slightly.

OH!

Confirmed on 'F407. No impact on 'L476 (and that's probably expected, sort of, the weirdness of 'F407 is probably given by pushing it further in terms of max. clock).

waclawek.jan (Author)

Super User

I've reconstructed the CRC-specific thread.

Piranha

Principal III

There is another topic about these issues. By the way, the document Jan talks about is now ES0182 - Rev 14 and the section is "2.2.13 Delay after an RCC peripheral clock enabling". And for the L4 and F7 series the issue is not present in the errata, because they documented it in the reference manuals. Actually that is the place where it should be, because it is not a flaw, but a deliberate design feature. Anyway, there are two potential code issues with peripheral clock enabling.

First, the code must ensure that the write to the corresponding RCC register has actually been completed. The DSB instruction introduces a delay of at least 1 cycle and drains the CPU write buffer, but it doesn't know about the buses, peripherals or anything outside of the CPU. For this issue the less restrictive DMB instruction does the same, but ST doesn't mention it because they don't understand the difference between these two instructions. The NOP instruction architecturally doesn't guarantee anything at all and should be used only for padding, though it does provide a delay of 1 cycle on at least the Cortex cores M0/M3/M4. I don't know about M23/M3x and I definitely wouldn't rely on it on M7/M55/M85. The only correct solution, which forces the write to always go completely through down to the RCC, is to read back the corresponding or any other RCC register.

Second, the code must wait for a clock enable logic to synchronize with the peripheral bus and actually enable the peripheral clock. For F4 series ST states that the delay is 2 cycles for AHB and 1+PRESC cycles for APB buses. For AHB they say that it's AHB cycles. For APB they don't say anything, but it should also be AHB (!) cycles. If so, this translates to 2 peripheral clock cycles for AHB and (1 .. 2] peripheral clock cycles for APB. This also becomes clear by reading the F7 reference manuals like RM0410 Rev 4 section "5.2.12 Peripheral clock enable register (RCC_AHBxENR, RCC_APBxENRy)", which simplifies this and explicitly states that the delay must be 2 peripheral clock cycles. The same is stated in the reference manuals for L4 series, just the very important word "peripheral" is missing in those.

Taking it all into account, a correct code looks like this:

RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
(void)RCC->AHB1ENR;
// Wait for 2 peripheral clock cycles.

// Use the peripheral.

So, of the three "workarounds" in the errata, the two DSB and NOP related ones are complete nonsense and don't guarantee anything. The third, read-back related one actually means a read-back from RCC, which they also fail to state; it solves just the first issue but ignores the second one, and this is what the HAL and LL implement. The HAL and LL in addition do a bitwise AND operation, but that is also nonsense, which doesn't guarantee anything. As I keep saying all the time - almost all of the SPL/HAL/LL/Cube code is broken. And not only the code: even when ST's documentation gives some advice on how to write the code, the advice is also often inadequate, wrong or just nonsense.

waclawek.jan (Author)

Super User

> And for L4 and F7 series the issue is not present in errata, because they documented it in the reference manuals.

Oh, I didn't know that, thanks.

> The NOP instruction architecturally doesn't guarantee anything at all and should be used only for padding.

I don't really care for the "architectural" things. That's just another word for "nonportable", but I don't subscribe to the notion of universal portability in microcontrollers, in the same way as I don't subscribe to, and thus don't seek, a universal bulletproof solution independent of context (a.k.a. driver).

So, if a NOP does provide a reliable at-least-one-cycle delay on CM4, and if the problem's reliable solution is at-least-N-cycles delay, then for me N NOPs on CM4 are a perfect solution too. I really don't care that it won't work on CM7 or CM23, as they are attached to a different harness which modifies the problem anyway.

Piranha

Principal III

https://community.st.com/t5/stm32-mcus-embedded-software/stm32-clock-and-delay-helper-functions/td-p/593082

FBL

ST Technical Moderator

Hello @waclawek.jan

Thank you all for this interesting discussion.

Workarounds are a list of applicable measures. So it is possible that some measures are recommended only in specific situations and depend on the use case.

>if a NOP does provide a reliable at-least-one-cycle delay on CM4

NOP instruction does not guarantee one cycle delay. It can increase execution time, leave it unchanged, or even reduce it. However, it is used for instruction alignment, not necessarily time-consuming. Maybe 2 LDR instructions introduce enough delay.

Did HAL macro fail? If not, what do you suggest?

To give better visibility on the answered topics, please click on "Best answer" on the reply which solved your issue or answered your question.

Best regards,
FBL

waclawek.jan (Author)

Super User

Hi @FBL ,

An error is an error. No matter whether any Cube/HAL macro works or not, no matter that the workaround works in other STM32s, no matter that there are specific circumstances in which the workaround works - if there are conditions on the STM32F407 under which the workaround does not work, as proven by the experiment, then it's not a workaround and shall be removed from the STM32F407 errata.

PS. You are quoting from Cortex-A manual, which is irrelevant for Cortex-M4.
