
Clock cycle shift on GPIO output STM32F103

vbesson
Associate III

Dear Community,

I am porting an old AVR application to the STM32, and I am facing a strange timing issue.

In a nutshell, the application reads a sector (512 bytes) from an SD card and outputs the buffer contents on a GPIO with a 4 µs bit cycle (meaning 3 µs low, 1 µs data pulse).

The SD card read works fine, and I have written a small piece of assembly code to drive the GPIO signal with precise MCU cycle counting.

Using DWT in the debugger, it gives a very stable and precise count (288 cycles for the full 4 µs).
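As a sanity check on those DWT numbers: at a 72 MHz core clock, 4 µs is exactly 288 cycles. A small host-side helper (hypothetical, just for checking the budget) makes the arithmetic explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a pulse duration in nanoseconds to CPU
   cycles at a given core clock, to sanity-check DWT cycle counts. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t hz)
{
    return (uint32_t)(((uint64_t)ns * hz) / 1000000000u);
}
```

At 72 MHz this gives 288 cycles for the 4 µs bit cell, 216 for the 3 µs low phase and 72 for the 1 µs data pulse.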

When using a logic analyzer sampling at 24 MHz, I can see the signal shift by 1 or 2 CPU cycles, and therefore a delay.

I have tried using ODR directly as well as BSRR, but with no luck.
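One thing worth double-checking when switching between ODR and BSRR: BSRR is a write-only register with separate set/reset halves, so read-modify-write sequences on it do not behave like they do on ODR. A host-side model of the BSRR write semantics (per RM0008, where on a conflict the set bit has priority over the reset bit), just to illustrate:

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the STM32 GPIOx_BSRR write semantics:
   bits 0..15 set the corresponding output, bits 16..31 reset it,
   and if both are written for the same pin, the set bit wins. */
uint16_t bsrr_write(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & ~reset) | set);
}
```

In particular, reading BSRR back returns 0, and writing both the BS and the BR bit for the same pin leaves the pin set.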

Attached :

- Screenshot of the logic analyzer

Screenshot 2024-05-12 at 06.30.59.png
As you can see, I measure 3.042 µs instead of 3 µs, and the error is not constant.
 

Clock configuration

Screenshot 2024-05-12 at 06.32.34.png

Port configuration:

 

GPIO_InitStruct.Pin = GPIO_PIN_13 | READ_PULSE_Pin | READ_CLK_Pin;
GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
HAL_GPIO_Init(GPIOC, &GPIO_InitStruct);
 
Assembly code : 
 
.global wait_1us
wait_1us:
.fnstart
push {lr}
nop                // two nops of padding to round out the cycle count
nop
mov r2, #20        // loop count tuned for ~1 us at 72 MHz
wait_1us_1:
subs r2, r2, #1
bne wait_1us_1
pop {lr}
bx lr              // return from function call
.fnend

.global wait_3us
wait_3us:
.fnstart           // the caller preloads r2 with the loop count (see sendByte)
push {lr}
nop
nop
wait_3us_1:
subs r2, r2, #1
bne wait_3us_1
pop {lr}
bx lr              // return from function call
.fnend
 
 
sendByte:

and r5, r3, #0x80000000   // isolate the current MSB of the data word
lsl r3, r3, #1            // shift the data left by one (next bit into the MSB)
subs r4, r4, #1           // decrement the bit counter
// Optional debug: reset the DWT cycle counter here
// ldr r6, =DWTCYCNT
// mov r2, #0
// str r2, [r6]
bne sendBit
beq process
// Pin map on the port: clock = 15, read pulse = 14, enable = 13
sendBit:
ldr r6, =PIN_BSRR
mov r2, #0                // BSRR is write-only and reads as 0, so build the value from scratch
cmp r5, #0
ITE EQ

ORREQ r2, r2, #0x80000000 // data bit clear: BR15, drive pin 15 low
ORRNE r2, r2, #0x00008000 // data bit set:  BS15, drive pin 15 high

ORR r2, r2, #0x00004000   // BS14: raise the read pulse

STR r2, [r6]              // write BSRR -> from this point we need 1us, 72 CPU cycles (to be confirmed)
bl wait_1us
mov r2, #0xC0000000       // BR15 | BR14: use a fresh value, a leftover BS15 would win over BR15
STR r2, [r6]              // bring both pins back down
// Adjust the wait_3us loop count if it is the first bit (the path through process differs by ~10 cycles)
cmp r4, #1
ite eq
moveq r2, #56             // wait_3us expects its loop count in r2
movne r2, #62
bl wait_3us               // pad the bit cell to 3 us in total
b sendByte

 

To be honest, I do not know where to look.

 


Yes, that is my goal. I have done it with the ATmega328P, but the SPI speed does not allow accurate writing.

I had to use a lot of tricks to make it work... now I am trying with the STM32.

My approach is:

- FatFS to select the right file, 

- Direct fat allocation reading to get the cluster / sector match

- Reading is very fast on the STM32, less than 3 ms per sector. Per the specification I have 20 ms available.

- Using assembly to send the buffer bit by bit (not working), so I am now testing the approach with the DMA and the timer.

I do not get your point about USART in synchronous mode, or SPI. Do you mean going SPI to DMA and then DMA to USART?

 

Just a side question: I have the feeling that my Blue Pill does not carry a genuine ST chip (the ID is different). Would that impact cycle-to-cycle predictability?

Vincent 

>Just a side question: I have the feeling that my Blue Pill does not carry a genuine ST chip (the ID is different). Would that impact cycle-to-cycle predictability?

No, it is the same, because it is the same core.

See here for what is inside the chips:

https://www.richis-lab.de/STM32.htm

 

> I have the feeling that my blue pill is not with a genuine st chip (ID change)

What's written on the chip?

What's its ID?

If you feel a post has answered your question, please click "Accept as Solution".

It is written:

STM32

F103C8T6

991KA 93

MYS A11

 

The CPU ID is  0x2ba01477 instead of 0x1ba01477

So it's probably a CS32F103C8T6 by CKS, not by ST. 🙂

see: https://www.eevblog.com/forum/beginners/unexpected-idcode-flashing-bluepill-clone/

But it should also work fine (it is just the same Arm core, made by another company).

What's bad (illegal) is that it is re-labeled, so it is a "fake STM32F103".

+

You may not be able to debug it in the ST IDE, because recent versions check for the "correct" chip ID.

To debug there you need a genuine ST chip.

vbesson
Associate III

Ok, thanks.

I am using OpenOCD and debugging works fine.

I am implementing the DMA approach, but after that I am keen to understand why I get a shift of a cycle or more with the assembly code (maybe I need to disable all interrupts).

Vincent

 

Ah, you cannot expect fixed timing when an interrupt might happen.

Using DMA should be "better", but the DMA and the CPU still access the same internal bus, so any access might incur one or more wait states until it gets the bus, if the bus is busy with another transfer at that moment.

AScha3_0-1715585114898.png

 


You say: I am keen to understand why I have a cycle or more shift with assembly code, (maybe I need to disable all interrupts).

Wherever an interrupt happens, it will cost many more than one or two cycles. The processor has to save several registers onto the stack, jump to the interrupt service routine, do whatever is coded there, then pop those registers off the stack before resuming whatever your code was doing.

I think one approach to improve cycle accuracy would be to have only one clock in the system. Run the processor slowly enough that you don't have any flash wait states (24 MHz?), and have AHB, APB1 and APB2 all at the same clock rate as SYSCLK. That's if you can get all the necessary processing done while running that slowly.
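For reference, the flash wait-state thresholds on the F103 (from the F1 reference manual: 0 WS up to 24 MHz, 1 WS up to 48 MHz, 2 WS up to 72 MHz) can be written down as a tiny lookup, sketched here:

```c
#include <assert.h>
#include <stdint.h>

/* Flash wait states required on the STM32F103 as a function of SYSCLK,
   per the F1 reference manual (FLASH_ACR latency settings). */
uint32_t f103_flash_latency(uint32_t sysclk_hz)
{
    if (sysclk_hz <= 24000000u) return 0; /* zero wait states */
    if (sysclk_hz <= 48000000u) return 1;
    return 2;                             /* up to 72 MHz */
}
```

So at the full 72 MHz the core already pays two wait states on flash fetches, which is one source of the jitter discussed above.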

Thinking about your application - emulating or linking-to a disk drive, I don't currently see why timing is so critical. I would expect even a synchronous motor on a genuine disk drive to suffer from some speed variation. The recorded data - FM, MFM or whatever, should be "self-clocking" in that the time between edges determines whether you have a 1 or 0 of data, and there should be significant latitude in those timings. The only exception is if the receiving end is poorly emulated and samples an entire sector of data based on the timing of the first edge, in which case any clock drift between sender and receiver can easily add up to more than a bit period. I think the fix is to use a different algorithm for reception e.g. make it self-clock as data comes in or sample sufficiently frequently so as not to miss any edges even if the clock drifts.

unsigned_char_array
Senior III

Have you tried running the code from RAM instead of FLASH? This might improve timing.

Kudo posts if you have the same problem and kudo replies if the solution works.
Click "Accept as Solution" if a reply solved your problem. If no solution was posted please answer with your own.
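A sketch of what "running from RAM" can look like with the GNU toolchain: place the routine in a RAM-resident section so instruction fetches bypass the flash wait states. The section name `.ramfunc` and the required linker-script support are assumptions here, and the function body is a stand-in, not the poster's actual routine:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: place a routine in SRAM so instruction fetches no longer go
   through flash. The ".ramfunc" section is an assumption; the linker
   script must allocate it in RAM and the startup code must copy it
   there, as is done for .data in typical STM32 GCC projects. */
__attribute__((section(".ramfunc"), noinline))
uint32_t checksum_from_ram(const uint8_t *buf, uint32_t len)
{
    uint32_t sum = 0;                   /* stand-in body: the real     */
    for (uint32_t i = 0; i < len; i++)  /* routine would be the GPIO   */
        sum += buf[i];                  /* bit-banging loop            */
    return sum;
}
```

On the F1 this can help because SRAM fetches take no wait states, though bus contention with DMA can still add the occasional cycle.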

Forget about bit-banging, whether asm or DMA.

If you want cycle precision, just use a timer. Or SPI, or whatever other hardware is suitable.

JW
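Following the timer idea: a timer-paced DMA can write one precomputed word per bit to GPIOx->BSRR, which removes all CPU jitter from the output. A minimal sketch of building that pattern buffer, assuming the data line is on pin 15 (the pin choice and the macros are hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define DATA_SET   0x00008000u  /* BSRR word: drive pin 15 high (BS15) */
#define DATA_RESET 0x80000000u  /* BSRR word: drive pin 15 low  (BR15) */

/* Expand one data byte, MSB first, into the 8 BSRR words that a
   timer-triggered DMA channel would write, one word per bit cell. */
void byte_to_bsrr(uint8_t byte, uint32_t out[8])
{
    for (size_t i = 0; i < 8; i++) {
        uint8_t bit = (uint8_t)((byte >> (7 - i)) & 1u);
        out[i] = bit ? DATA_SET : DATA_RESET;
    }
}
```

The timer's update event then triggers one DMA transfer per 4 µs bit cell; the CPU only has to refill the buffer (e.g. with half-transfer/transfer-complete interrupts), so its timing no longer affects the waveform.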

Hello Danish1, 

I do not have any interrupt handler implemented; this is why it is very, very strange.

Running at 24 MHz is not fast enough: 24 *** cycle to issue on GPIO will not work (same issue as the ATmega328P, which was overclocked to 27 MHz).

I need to understand (and I will experiment to find) what makes the asm code take longer to execute. What is strange is that the first data pulse does not take 1 µs but systematically 2.5 µs; after that I get 1 µs or 1.04 µs.

Synchronisation and the accuracy of the clock are critical, as this is a single-GPIO interface without any clock line. At the beginning of the transmission a synchronisation pattern of 10 bits (FFFFFFFF00) is sent 5 times, and then the real data transfer starts: 402 bytes of data.

After that, the disk head can move for 20 ms, driven by 4 GPIO interrupts.

To better understand this, there is a small old book, Beneath Apple II DOS, where it is explained in section 3.7.

What does not make sense is why on an AVR I managed to get very precise clock cycles, while on the STM32 this is not the case. I must have made a mistake.

I will try the following:

- Removing all interrupt handlers

- Reducing the clock speed

- Moving the asm to SRAM

- Finishing the DMA implementation with half-buffering