
STM32F4: changing clock speed dynamically (runtime). What do I need to look out for?

fbar
Senior

I'm writing an application for an STM32F407 that requires very high-speed ADC sampling on 6 channels (two per ADC, using regular simultaneous scanned mode) and then, once a series of conditions is met, performs quite a few cross correlation calculations over very long sequences, requiring the fastest performance possible. I'm using the SPL libraries, incidentally, not HAL/CubeMX.

Due to how the clocks work on an F407, the fastest ADC conversion cycle happens with a 144MHz system clock. But the max clock is 168MHz, and that's ~16% faster when performing DSP operations. I do understand how to set the clock for both 144MHz and 168MHz, with all the right PLL_M, N, P and Q values.
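For concreteness, assuming the 8MHz HSE crystal on the Discovery board, these are the two parameter sets I have in mind (a sketch; both keep the 48MHz clock for USB via PLL_Q):

    // VCO input = HSE / PLL_M = 1MHz; VCO output = 1MHz * PLL_N
    // SYSCLK = VCO / PLL_P; USB/SDIO clock = VCO / PLL_Q (must be 48MHz for USB)

    // 168MHz: VCO = 336MHz, SYSCLK = 336/2 = 168MHz, USB = 336/7 = 48MHz
    RCC_PLLConfig(RCC_PLLSource_HSE, 8, 336, 2, 7);

    // 144MHz: VCO = 288MHz, SYSCLK = 288/2 = 144MHz, USB = 288/6 = 48MHz
    RCC_PLLConfig(RCC_PLLSource_HSE, 8, 288, 2, 6);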

I was thinking of setting the board up at boot with 144MHz, sampling with the fastest ADC clock and, once my conditions are met, switching the clock to 168MHz before performing the cross correlations. Whatever time it takes to dynamically change the clock, I'll more than make it up while running the long (>2 second) calculations 16% faster.

This article seems to indicate a way to do it http://embeddedsystemengineering.blogspot.com/2015/06/stm32f4-discovery-tutorial-4-configure.html

What would happen if I boot the system with the 144MHz clock, switch to 168MHz, then back to 144MHz for another cycle, and so on? It seems to work, but I'm not sure if I'm missing any downside not showing up in a short test (or reliability issues over time). I also plan to use USB, so I understand the need to always have a 48MHz clock for that.

turboscrew
Senior III

Correct me if I'm wrong, but I think you need a backup clock source. You first need to switch to the backup clock and disable the PLL before you can change the values. It takes some time for the PLL to stop and start; I think the operation takes something like 1-2 milliseconds. Starting and stopping the backup clock takes some milliseconds more, if you don't let it run on the side.

S.Ma
Principal

Agreed. The choice of clock tree is important: peripherals should keep a constant frequency while the core is dynamically prescaled. The STM32L4, with MSI and a clock selector for most peripherals, follows this vision.

> I'm not sure if I'm missing any downside

I don't think there's any (except for those you already mentioned, stemming from the rigid clocking scheme in the 'F4xx).

Going for 16% sounds desperate. Do you achieve the full 16% performance increase, given the different FLASH latency? Have you performed all the "usual" measures, including putting the stack and the most critical (most often used) variables in CCM RAM, and exhausting any available algorithmic improvements? Have you considered moving to a different 'F4 (namely the 'F427) with a slightly higher maximum clock frequency and a (only slightly, unfortunately) better clock tree? Have you considered moving to a 'F7?

JW

Thanks for the reply. Can you please help me understand what you mean by "backup clock source"? Please note that when changing the clock, I would be OK with having nothing else running (no timers, DMA transfers, etc.), and just looping until the PLL is stable (i.e. while (!RCC_WaitForHSEStartUp()); RCC_PLLConfig(new values);)

Would I still need a backup clock (I assume HSI?) in that case?

Your MCU can't run from the PLL while you change it (it probably won't allow you to change it while running from it).

So you need to switch temporarily to a different source in RCC_CFGR.SW (HSE is fine, since you are running it for the PLL anyway), switch off the PLL, change its settings, switch it back on, and wait until it locks, meanwhile changing the FLASH latency accordingly; then switch back to the PLL as the primary source in RCC_CFGR.SW.
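A minimal SPL sketch of that sequence, assuming HSE is already running and using the 168MHz values from the tutorial you linked (swap in the 144MHz M/N/P/Q for the other direction):

    // Temporarily run the core from HSE while the PLL is reconfigured
    RCC_SYSCLKConfig(RCC_SYSCLKSource_HSE);
    while (RCC_GetSYSCLKSource() != 0x04);                 // wait until HSE is SYSCLK

    RCC_PLLCmd(DISABLE);                                   // PLL must be off to change M/N/P/Q
    RCC_PLLConfig(RCC_PLLSource_HSE, 8, 336, 2, 7);
    RCC_PLLCmd(ENABLE);

    FLASH_SetLatency(FLASH_Latency_5);                     // latency for the target frequency
    while (RCC_GetFlagStatus(RCC_FLAG_PLLRDY) == RESET);   // wait for PLL lock

    RCC_SYSCLKConfig(RCC_SYSCLKSource_PLLCLK);             // back to the PLL
    while (RCC_GetSYSCLKSource() != 0x08);                 // confirm the switch
    SystemCoreClockUpdate();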

JW

Desperate is one way to put it 🙂 And, no, I haven't done a full performance analysis yet, just a quick set of tests, and 168MHz is definitely faster.

As far as I can tell, the flash wait states are 5 for both 144MHz and 168MHz, so there should be a linear improvement there (the F4 reference manual says that below 150MHz it could be only 4, but the system_stm32f4xx file for the STM32F4 Discovery board at 144MHz uses 5). And ST's ART accelerator should further reduce the effective wait states, especially for a short, tightly coupled loop of CM4 DSP instructions like the ones that make up the bulk of my execution time. I also want to try moving that part of the code to SRAM, to see if there's any real difference, but given how effective the ART seems to be, a small loop might be faster in FLASH than SRAM.
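If I do end up trying 4 wait states at 144MHz, the SPL calls would be along these lines (a sketch; the thresholds are from the RM0090 table for a 2.7-3.6V supply, so check your voltage range):

    // 2.7-3.6V: up to 150MHz HCLK needs 4 WS, above that 5 WS (RM0090)
    FLASH_SetLatency(FLASH_Latency_4);   // at 144MHz; FLASH_Latency_5 at 168MHz
    // Make sure prefetch and the ART caches are enabled
    FLASH_PrefetchBufferCmd(ENABLE);
    FLASH_InstructionCacheCmd(ENABLE);
    FLASH_DataCacheCmd(ENABLE);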

And, yes, I do use CCM RAM already: I convert my signal using the ADCs with DMA transfer, preprocess the various signals and move the data into CCM RAM, then process the signals with the highly optimized DSP library cross correlation (which does an amazing job of parallelizing with the M4 DSP instructions, especially when using the Q15 number format).
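For reference, with GCC I pin those buffers into CCM with a section attribute along these lines (a sketch; the .ccmram section name depends on the linker script, and CCM is not reachable by DMA, hence the copy from the ADC buffer):

    #include "arm_math.h"   // q15_t

    // Working buffers in CCM RAM (64K at 0x10000000 on the F407)
    __attribute__((section(".ccmram"))) static q15_t signal_a[8192];
    __attribute__((section(".ccmram"))) static q15_t signal_b[8192];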

And, yes, changing boards is an option. But for hobbyist projects, the STM32F407VG boards at around $12 are hard to beat. There are no Discovery/Nucleo boards with the F417/427, nor much from the Chinese suppliers, so the additional cost would be out of line with the performance gain. The Nucleo F7 is very tempting at twice the performance, and not that expensive (and it has the 3 high-speed ADCs I need, not to mention extra SRAM). But it has no SPL, only the newer LL libraries (I really, really dislike the HAL). For now, though, I would like to get the most out of the board I have. If nothing else, it's a good learning experience.

Thanks for the additional thoughts, really appreciate the opportunity to brainstorm about this

Thanks again.

I used the code linked in my original question (below), made into a function with the right settings for 144MHz and 168MHz, and I tried switching hundreds of times between the two clock speeds, doing ADC conversions (6 channels, triple mode, 8192 samples per channel) with the 144MHz clock (and a 36MHz ADC clock), then switching to 168MHz and running the cross correlation, printing the results via USB VCP. Everything works, as long as there is enough time between clock changes and sending data via USB. USB stability seems to be impacted if lots of data is being sent while switching clock speed. I did not have time to fully investigate it, but it doesn't cause problems in my actual use case, where clock speed changes are infrequent enough (only a test with unrealistically frequent switching showed problems).

I modified SysTick to deal with the different clocks as well (see the sketch after the code below).

    RCC_DeInit();
    // Enable the external crystal (HSE)
    RCC_HSEConfig(RCC_HSE_ON);
    // Wait until HSE is ready (or fails to start)
    ErrorStatus errorStatus = RCC_WaitForHSEStartUp();
    if (errorStatus == SUCCESS)
    {
        // Configure the PLL for 168MHz SYSCLK and 48MHz for USB OTG, SDIO
        RCC_PLLConfig(RCC_PLLSource_HSE, 8, 336, 2, 7);  // values changed for 144MHz
        // Enable the PLL
        RCC_PLLCmd(ENABLE);
        // Wait until the main PLL is ready
        while (RCC_GetFlagStatus(RCC_FLAG_PLLRDY) == RESET);
        // Set flash latency before switching to the faster clock
        FLASH_SetLatency(FLASH_Latency_5);
        // AHB at SYSCLK, APB1 at SYSCLK/4, APB2 at SYSCLK/2
        RCC_HCLKConfig(RCC_SYSCLK_Div1);
        RCC_PCLK1Config(RCC_HCLK_Div4);
        RCC_PCLK2Config(RCC_HCLK_Div2);
        // Select the PLL as the SYSCLK source and wait for the switch
        RCC_SYSCLKConfig(RCC_SYSCLKSource_PLLCLK);
        while (RCC_GetSYSCLKSource() != 0x08);
    }
    else
    {
        // Signal a clock configuration error
        while (1);
    }
    SystemCoreClockUpdate();
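The SysTick change is just a re-initialization after each switch; a minimal sketch, assuming a 1ms tick:

    // Re-arm SysTick for a 1ms tick at the new core clock
    SystemCoreClockUpdate();                 // refresh SystemCoreClock first
    SysTick_Config(SystemCoreClock / 1000);  // returns 0 on success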

S.Ma
Principal

Well, it seems you are ready to go into deep optimization of the code, looking at the assembly generated by the compiler. Also: if using floats, try Q31 (fractional bits) instead; rather than moving data, move it by computing; interpolate instead of calling math functions; precompute and use lookup tables; put the interrupt vectors and ISRs in RAM to keep the Flash ART/core pipeline from jumping around too much (a sketch below). It's going to be fun. What's difficult to know is which optimisation gives the best return, to save development time.
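A sketch of the vector table relocation, assuming an F407 (98 vectors; relocate_vectors_to_ram is just an illustrative name):

    #include "stm32f4xx.h"
    #include <string.h>

    // 128 words = 512 bytes covers the F407's 98 vectors; VTOR requires
    // alignment to the table size rounded up to a power of two
    static uint32_t ram_vectors[128] __attribute__((aligned(512)));

    void relocate_vectors_to_ram(void)
    {
        memcpy(ram_vectors, (const void *)SCB->VTOR, sizeof(ram_vectors));
        SCB->VTOR = (uint32_t)ram_vectors;
        // (the ISR code itself would additionally need its own RAM section)
    }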

Thanks for the additional input. I am actually using Q15, which works incredibly well for 12-bit ADC signals, and the arm_correlate_*** functions from the CMSIS library (plus the -Os compiler flag, which has a rather dramatic effect). Given how optimized those are for a specific input array format (to use the SIMD and SMLALD CM4 DSP instructions), I need to move data from the ADC buffer to a "signal buffer". Since I'm using all 3 ADCs, with 2 channels each and DMA transfer, the resulting ADC buffer is interleaved: CH1_1, CH2_1, CH3_1, CH4_1, CH5_1, CH6_1, then CH1_2, CH2_2, etc.

I had my own highly optimized cross correlation algorithm for an F103, which worked well with an interleaved buffer, but it's no match for using DSP instructions on 4 values at a time... unsurprisingly, the M4 DSP core makes an amazing difference when doing cross correlation, and the CMSIS DSP libraries are already highly optimized (I still managed to shave ~30% off the execution time by limiting the usual cross correlation window, thanks to how my signals are captured and stored in order).

95% of my loop time is a series of cross correlations on multiple signals, so the highest return is to focus on making the cross correlation itself as fast as possible. Pretty much anything else around it (like copying the ADC buffer into signal arrays) has no real effect on the final outcome. And I'm satisfied that the core cross correlation function I'm using (a modified version of the CMSIS code) is as fast as it can be, and fully contained in the ART.
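For completeness, that copy is just a strided de-interleaving loop along these lines (a sketch; adc_buf, signal and N are placeholder names):

    #include <stdint.h>
    #include "arm_math.h"                     // q15_t

    #define N 8192                            // samples per channel
    extern volatile uint16_t adc_buf[N * 6];  // interleaved triple-ADC DMA target
    extern q15_t signal[6][N];                // per-channel buffers (in CCM RAM)

    void deinterleave(void)
    {
        for (uint32_t ch = 0; ch < 6; ch++)
        {
            for (uint32_t i = 0; i < N; i++)
            {
                signal[ch][i] = (q15_t)adc_buf[i * 6 + ch];
            }
        }
    }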

I would not have thought about putting the ISRs in RAM, but that makes a ton of sense and I will remember it for the future. In my case, I actually disable all interrupts while running the DSP analysis, so there isn't much to gain there.
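The lockout itself is just the CMSIS intrinsics (run_correlations below is a placeholder for my series of correlation calls):

    __disable_irq();      // no ISRs during the time-critical DSP run
    run_correlations();   // placeholder for the series of arm_correlate_q15 calls
    __enable_irq();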

Still, no matter what I do, any optimization that works at 144MHz also works ~16% faster at 168MHz, so switching the clock is another helpful trick. Clearly an F7 would make even more of a difference (and I could use the extra RAM to avoid a couple of extra buffer copies needed to juggle the 128K+64K of F407 RAM).