
Clock cycle shift on GPIO output STM32F103

Associate III

Dear Community,

I am porting an old AVR application to STM32, and I am facing a strange timing issue.

In a nutshell, the application reads a sector (512 bytes) from an SD card and outputs the content of the buffer to GPIO with a 4 µs cycle (meaning 3 µs low, 1 µs data signal).

The SD card read works fine, and I have written a small assembly routine to output the GPIO signal with precise MCU cycle counting.

Using DWT in the debugger, it gives a very stable and precise count (288 cycles for a total of 4 µs).

When using a logic analyzer at 24 MHz, I can see the signal shift by 1 or 2 CPU cycles, and hence a delay.

I have tried using ODR directly and BSRR, but with no luck.

Attached:

- Screenshot of the logic analyzer

Screenshot 2024-05-12 at 06.30.59.png
As you can see, I do not get 3 µs but 3.042 µs, and it is not always the same.

Clock configuration

Screenshot 2024-05-12 at 06.32.34.png

Port configuration:


// (Pin, Mode and Speed fields were set just above; output push-pull assumed, not shown in the post)
GPIO_InitStruct.Pull = GPIO_NOPULL;
HAL_GPIO_Init(GPIOC, &GPIO_InitStruct);
Assembly code:

.global wait_1us
wait_1us:
    push {lr}
    nop                         // 1 cycle
    nop                         // 1 cycle
    mov  r2, #20                // loop count tuned for ~1 us at 72 MHz
wait_1us_1:
    subs r2, r2, #1
    bne  wait_1us_1
    pop  {lr}
    bx   lr                     // return from function call

.global wait_3us
wait_3us:                       // loop count is passed in r2 by the caller
    push {lr}
wait_3us_1:
    subs r2, r2, #1
    bne  wait_3us_1
    pop  {lr}
    bx   lr                     // return from function call

process:                        // shift out the next bit of the byte in r3
    and  r5, r3, #0x80000000    // isolate the current MSB into r5
    lsl  r3, r3, #1             // shift r3 left by 1
    subs r4, r4, #1             // decrement the bit counter in r4
    // mov r6, #0               // (debug) reset the DWT cycle counter
    // ldr r6, =DWTCYCNT
    // mov r2, #0
    // str r2, [r6]
    bne  sendBit                // more bits left in this byte
    beq  process                // bit counter reached 0

sendBit:                        // Clk = pin 15, Readpulse = pin 14, Enable = pin 13
    ldr  r6, =PIN_BSRR
    ldr  r2, [r6]
    cmp  r5, #0
    ite  eq
    orreq r2, r2, #0x80000000   // bit is 0: reset pin 15 (BSRR BR15)
    orrne r2, r2, #0x00008000   // bit is 1: set pin 15 (BSRR BS15)
    orr  r2, r2, #0x00004000    // set pin 14 (BSRR BS14, read pulse)
    str  r2, [r6]               // write the GPIO port; the 1 us window starts here
    bl   wait_1us
    orr  r2, r2, #0xC0000000    // reset pins 14 and 15 (BSRR BR14 | BR15)
    str  r2, [r6]
    // Adjust the 3 us delay when it is the first bit (the path through process costs ~10 cycles)
    cmp  r4, #1
    ite  eq
    moveq r2, #56
    movne r2, #62
    bl   wait_3us               // wait for 3 us in total
    b    sendByte               // sendByte (the byte fetch loop) is defined elsewhere


To be honest, I do not know where to look.


Lead II

Even something as "lowly" as the STM32F1 does not guarantee cycle-by-cycle timing accuracy. Where there are things like flash accelerators and multiple clocks in a system, precise timing becomes unpredictable, particularly if you have other things going on such as DMA or interrupts.

(I think stm32f0 data sheets, in contrast, do mention cycle-by-cycle predictability).

You could reduce the number of things "going on" by putting your delay code inline rather than in a subroutine call - this avoids the delay unpredictability associated with the branch and the push/pop of the return address to/from the stack.
How certain are you of the delay error? Is it consistent, or only on some cycles? I see you are using HSE, but is it a crystal or a lower-accuracy ceramic resonator?

You do not go into why timing is so critical for your application. But if it is, you might be better off using DMA driven by a timer to pump your pre-processed pattern to the BSRR register.
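To make the timer-metered DMA suggestion concrete, here is a hedged config sketch in HAL terms: a timer update event at 1 MHz triggers a DMA channel that copies pre-built BSRR words to GPIOC. The TIM2_UP-to-DMA1-Channel2 mapping, the `pattern`/`pattern_len` names and the 72 MHz clock are assumptions to be checked against the F103 reference manual - treat this as a starting point, not a drop-in implementation.

```c
#include "stm32f1xx_hal.h"

extern uint32_t pattern[];     /* pre-built BSRR words, one per microsecond (hypothetical) */
extern uint32_t pattern_len;

DMA_HandleTypeDef hdma;
TIM_HandleTypeDef htim2;

void start_pattern(void)
{
    __HAL_RCC_DMA1_CLK_ENABLE();
    __HAL_RCC_TIM2_CLK_ENABLE();

    hdma.Instance = DMA1_Channel2;             /* TIM2_UP request on the F103 (check RM0008) */
    hdma.Init.Direction = DMA_MEMORY_TO_PERIPH;
    hdma.Init.PeriphInc = DMA_PINC_DISABLE;    /* always write the same register (BSRR) */
    hdma.Init.MemInc = DMA_MINC_ENABLE;        /* walk through the pattern buffer */
    hdma.Init.PeriphDataAlignment = DMA_PDATAALIGN_WORD;
    hdma.Init.MemDataAlignment = DMA_MDATAALIGN_WORD;
    hdma.Init.Mode = DMA_CIRCULAR;
    hdma.Init.Priority = DMA_PRIORITY_VERY_HIGH;
    HAL_DMA_Init(&hdma);

    htim2.Instance = TIM2;
    htim2.Init.Prescaler = 0;
    htim2.Init.Period = 72 - 1;                /* 72 MHz / 72 = 1 MHz update rate (assumed SYSCLK) */
    htim2.Init.CounterMode = TIM_COUNTERMODE_UP;
    HAL_TIM_Base_Init(&htim2);

    HAL_DMA_Start(&hdma, (uint32_t)pattern, (uint32_t)&GPIOC->BSRR, pattern_len);
    __HAL_TIM_ENABLE_DMA(&htim2, TIM_DMA_UPDATE);  /* route update events to the DMA channel */
    HAL_TIM_Base_Start(&htim2);
}
```

Once running, the DMA writes one BSRR word per timer tick with no CPU involvement, so flash wait states and interrupts no longer affect the pin timing.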

Thanks Danish1,

Timing is critical because I am interfacing an old Apple II SDISK with an STM32 to simulate the floppy disk drive. The protocol is a very specific data transfer protocol without a clock pulse: only a sequence of 1 µs data signal, 3 µs pause, and so on, for 402 encoded bytes (256 before encoding). I really need something very accurate. It works like a charm on an AVR ATmega328P.

I am new to STM32, and very surprised by this unpredictable timing behaviour.

I have tried inlining the wait procedure; it helps a bit, but it is not yet really accurate.

I will try the DMA approach (even if I know nothing about ARM DMA with timers). Would you have an example where I can start learning how it works?


I have checked how DMA works. Does it mean that I have to convert each byte into an array of 8 uint32 values to match BSRR?

So for 402 bytes I need an 8x402 uint32 array to feed the DMA, right?

If I use a circular buffer, how do I feed the buffer?

How do I manage the 1 µs data pulse and the 3 µs pause? Another timer? In that case, how do I sync the two timers?

Sorry for all these newbie questions.




If your data pins are all on one port, you could use a 4x402 (zeroed first!) buffer, write your encoded data into every 4th location, and use DMA metered by a 1 µs timer to blow out the entire buffer.  Or maybe I missed a detail...  😉
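That buffer layout can be sketched in plain C. This is a minimal host-testable sketch under the 8-bit-wide assumption: `build_odr_buffer` and the constants are hypothetical names, and it assumes the data byte sits on the port's low 8 bits.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_ENCODED_BYTES 402   /* encoded bytes per sector (from the thread) */
#define SLOTS_PER_BYTE 4           /* at a 1 MHz timer: 1 us data slot + 3 us idle */

/* Expand the encoded sector into a DMA buffer destined for the port's ODR.
 * Every 4th half-word carries a data byte on the low 8 bits; the three
 * half-words in between are zero, producing the 1 us pulse / 3 us pause. */
void build_odr_buffer(const uint8_t *encoded, uint16_t *dma_buf)
{
    memset(dma_buf, 0, SECTOR_ENCODED_BYTES * SLOTS_PER_BYTE * sizeof(uint16_t));
    for (int i = 0; i < SECTOR_ENCODED_BYTES; i++)
        dma_buf[i * SLOTS_PER_BYTE] = encoded[i];  /* data on the port's low byte (assumed) */
}
```

The resulting 1608 half-words, blown out at one per microsecond, reproduce the 4 µs bit cell without any CPU cycle counting.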

Welcome to the STM32.  There are probably some Application Notes that can help with DMA and timer setup.  And there may be something in the CubeMX examples for the F103 that might be helpful - worth a look.

Hello David, 

thanks for your answer, I am discovering the power of the STM32 and I like it 😉

Very good idea to have a data value at every 4th location. I have only 1 data pin, though - does that mean I need a 4x8(bits)x402(bytes) buffer?

Is there a way to recharge the buffer being sent when using circular mode? If that makes sense, how do I detect the DMA position?


You can set the DMA to circular mode over a buffer 2x the size of your data array, then use the half-transfer and transfer-complete callbacks to fill in new data.

That way the DMA writes a continuous data stream without interruption, and you have time to fill the next half while the other is being sent.
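The half/full callback scheme can be sketched on the host with plain functions standing in for the HAL callbacks. Names like `on_half_complete` and `next_word` are hypothetical; in a real build these would be the DMA half-transfer and transfer-complete callbacks (or the TIM period-elapsed equivalents).

```c
#include <stdint.h>

#define HALF_WORDS 16                      /* half-buffer length (toy size) */
static uint16_t dma_buf[2 * HALF_WORDS];   /* DMA runs circularly over both halves */

/* Hypothetical data source: returns the next pattern word to transmit. */
static uint16_t next_word(void) { static uint16_t n; return n++; }

static void fill(uint16_t *half)
{
    for (int i = 0; i < HALF_WORDS; i++) half[i] = next_word();
}

/* While the DMA is reading one half, refill the other half. */
void on_half_complete(void) { fill(&dma_buf[0]); }          /* DMA now in the 2nd half */
void on_full_complete(void) { fill(&dma_buf[HALF_WORDS]); } /* DMA wrapped to the 1st half */
```

The key property is that each callback only ever touches the half the DMA is *not* currently reading, so the output stream never stalls.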

Associate III

I see that there is a half-transfer interrupt that I could use to update the first half of the buffer? Would that work?


The GPIO ports are 16-bit, right?  The DMA writes to the entire port's ODR (all 16 bits in a single write cycle), so your buffer depth is 4x2(bytes)x402 (and it should be at least 16-bit aligned).  The catch with this approach is that the DMA writes to the entire port, so any pin configured as a GPIO output (including any control signals at your hardware interface that are assigned to a pin on that port) will get the corresponding bit values from the buffer.  Is your hardware interface 1 or 8 bits wide (for the data)?  Are there control signals too?  (I assumed it's 8 bits, hence the ODR suggestion.  If it's only 1 bit, the buffer depth increases x8.)  Sorry, I'm not familiar with the details of the interface you're synthesizing, and that directly affects the DMA buffer's content.
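For the 1-bit case, expanding each encoded byte into per-bit BSRR words avoids disturbing the rest of the port, since BSRR only touches the bits you name (low half sets pins, high half resets them). This is a sketch; the data-pin choice and the function name are assumptions.

```c
#include <stdint.h>

#define DATA_PIN_MASK  (1u << 13)   /* assumed data pin; adjust to your wiring */
#define SLOTS_PER_BIT  4            /* at a 1 MHz timer: 1 us data slot + 3 us pause */

/* Expand one encoded byte (MSB first) into 8x4 BSRR words. Slot 0 of each
 * bit sets the pin for a '1' (BSRR low half) or resets it for a '0' (BSRR
 * high half); the next three slots hold the pin low for the 3 us pause. */
void byte_to_bsrr(uint8_t b, uint32_t out[8 * SLOTS_PER_BIT])
{
    for (int bit = 0; bit < 8; bit++) {
        uint32_t *slot = &out[bit * SLOTS_PER_BIT];
        slot[0] = (b & (0x80u >> bit)) ? DATA_PIN_MASK         /* set the pin   */
                                       : DATA_PIN_MASK << 16;  /* reset the pin */
        slot[1] = slot[2] = slot[3] = DATA_PIN_MASK << 16;     /* 3 us low      */
    }
}
```

With 32 words per byte, a full sector becomes a 32x402 uint32 buffer fed to the DMA with a word-sized peripheral write to BSRR.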

(I'm not familiar with the F103's DMA, so my suggestion might be totally inappropriate.)  If I wanted to be fancy, I'd set the DMA up as circular over a 2x buffer and use the half-complete and complete interrupts to know in which half I could scribble while the DMA was blowing out the other half.

Ah, now I see (a little) more!  You're interfacing with this:



If so, maybe instead of bit-banging it with a GPIO and DMA+timer, you could use either a USART in synchronous mode or SPI.  Just a thought... 😉