QUAD-SPI too busy!

Albi G. · ‎2020-06-10

Hi guys,

I am using the QuadSPI peripheral to interface with an FPGA. I need raw throughput with alternating write and read cycles of a few bytes each. I am just doing a proof of concept right now.

Since QUADSPI->CCR must not be modified while BUSY==1, the BUSY flag effectively limits the command rate, which is useful since I obviously need to wait for a command to finish before sending the next one.

My configuration is:

4 bit wide Instruction
Skip directly to 4bit wide Data-transfer
Min CS-High time = 0 (1 cycle)
IndirectRead / IndirectWrite mode, no DMA for now.

Now, unfortunately the BUSY flag is rather lame: it takes 6-7 clocks after CS rises before the BUSY flag clears. That is really **** ** the achievable command rate!

By 6-7 clocks, I mean peripheral clocks after the prescaler. (At a 1MHz SPI clock, BUSY stays at 1 for 6µs after CS rises. At 10MHz, it's 0.6µs.)

What I don't understand is that this behavior makes the minimum CS-high setting kind of useless. (I never reach a scenario where CS is high for less than 6 cycles between commands.)

This holds true for consecutive reads, consecutive writes, and alternating read-write.
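
To put numbers on this: the BUSY tail is just the cycle count divided by the SPI clock, and it is added to every command. Below is a minimal sketch of that arithmetic in C. The function names are made up for illustration, and the 4-clock transfer length in the note afterwards is an assumed example, not a measurement from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of the BUSY tail in nanoseconds: the number of SPI clock
 * cycles BUSY stays set after CS rises, divided by the SPI clock. */
static uint64_t busy_tail_ns(uint32_t tail_cycles, uint32_t spi_clock_hz)
{
    assert(spi_clock_hz > 0);
    return ((uint64_t)tail_cycles * 1000000000ull) / spi_clock_hz;
}

/* Upper bound on commands per second when every command keeps CS low
 * for transfer_cycles SPI clocks and is followed by tail_cycles during
 * which BUSY is still set and CCR cannot be written. */
static uint32_t max_commands_per_second(uint32_t transfer_cycles,
                                        uint32_t tail_cycles,
                                        uint32_t spi_clock_hz)
{
    uint32_t total_cycles = transfer_cycles + tail_cycles;
    assert(total_cycles > 0);
    return spi_clock_hz / total_cycles;
}
```

For a hypothetical 4-clock transfer at 10MHz, a 6-clock tail caps the rate at 1,000,000 commands per second, while a single clock of CS-high would allow 2,000,000.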

What could be going wrong?

This is my example code. Not using the library, but my code hopefully reads kind of intuitively...

Measured through my debug GPIO: 16.7µs from command start to BUSY=0

Chip select Low for 11µs

Give me a faster BUSY, please :(

BTW: I also get the same behavior if I configure the QSPI differently:

skip Instruction
Send one byte address
skip directly to Data

== Same timing(-problems)

thanks

berendi · ‎2020-06-12

I've played with it a bit on a STM32H743 (I don't have any STM32G4), the only difference is that BUSY is cleared after 5 cycles instead of 6.

The only way to cut it short is to reset it in the RCC. But then you have to reload all of the QSPI registers.

I have a ~~vague idea~~ working proof of concept below showing how it could be done quickly without software intervention.

berendi · ‎2020-06-15

So the basic idea is to reset QUADSPI in the RCC, then reprogram its registers as their contents are lost at reset. All of it is handled by timers and DMA autonomously.

	static uint32_t qspireset[] = { RCC_AHB3RSTR_QSPIRST, 0 };
	static uint32_t qspi_init_reg[6] = {
			(249 << QUADSPI_CR_PRESCALER_Pos) |
			QUADSPI_CR_SSHIFT |
			QUADSPI_CR_EN,                     // CR
 
			(0x1F << QUADSPI_DCR_FSIZE_Pos) |
			(0 << QUADSPI_DCR_CSHT_Pos),       // DCR
 
			0,                                 // SR
			0,                                 // FCR
			0,                                 // DLR
 
			QUADSPI_CCR_DMODE |
			(0 << QUADSPI_CCR_DCYC_Pos) |
			QUADSPI_CCR_IMODE |
			(1 << QUADSPI_CCR_INSTRUCTION_Pos) // CCR
	};
 
	// Copy a word from a memory variable (make sure it does not end up in DTCM)
	// to a RCC reset register. Circular mode, so the operation is repeated on every DMA request
	DMA1_Stream0->M0AR = (uint32_t)qspireset;
	DMA1_Stream0->PAR = (uint32_t)&RCC->AHB3RSTR;
	DMA1_Stream0->NDTR = 1;
	DMA1_Stream0->CR =
			DMA_SxCR_MSIZE_1 | // 10: 32 bit
			DMA_SxCR_PSIZE_1 | // 10: 32 bit
			DMA_SxCR_CIRC    |
			DMA_SxCR_DIR_0   | // 00: P->M, 01:M->P, 10:M->M
			DMA_SxCR_EN;
 
	// same as above, different value
	DMA1_Stream1->M0AR = (uint32_t)(qspireset + 1);
	DMA1_Stream1->PAR = (uint32_t)&RCC->AHB3RSTR;
	DMA1_Stream1->NDTR = 1;
	DMA1_Stream1->CR =
			DMA_SxCR_MSIZE_1 | // 10: 32 bit
			DMA_SxCR_PSIZE_1 | // 10: 32 bit
			DMA_SxCR_CIRC    |
			DMA_SxCR_DIR_0   | // 00: P->M, 01:M->P, 10:M->M
			DMA_SxCR_EN;
 
	// Copy 6 words from the initialization array to the QUADSPI registers
	DMA1_Stream2->M0AR = (uint32_t)&qspi_init_reg;
	DMA1_Stream2->PAR = (uint32_t)QUADSPI;
	DMA1_Stream2->NDTR = 6;
	DMA1_Stream2->CR =
			DMA_SxCR_MSIZE_1 | // 10: 32 bit
			DMA_SxCR_PSIZE_1 | // 10: 32 bit
			DMA_SxCR_PINC    |
			DMA_SxCR_MINC    |
			DMA_SxCR_CIRC    |
			DMA_SxCR_DIR_0   | // 00: P->M, 01:M->P, 10:M->M
			DMA_SxCR_EN;
 
	DMAMUX1_Channel0->CCR = (DMA_REQUEST_TIM2_CH2 << DMAMUX_CxCR_DMAREQ_ID_Pos);
	DMAMUX1_Channel1->CCR = (DMA_REQUEST_TIM2_CH3 << DMAMUX_CxCR_DMAREQ_ID_Pos);
	DMAMUX1_Channel2->CCR = (DMA_REQUEST_TIM3_CH1 << DMAMUX_CxCR_DMAREQ_ID_Pos);
 
	// TIM3 triggers DMA1_Stream2 writes to the QUADSPI configuration registers.
	// It is gated by the TIM2_CH1 PWM output signal. When I tried reducing the
	// period to 10 cycles (ARR=9), DMA started dropping requests. Perhaps because
	// DMA is on the AHB bus matrix, but both its source and destination must be
	// accessed through the AHB-AXI gateway (and I did not bother with relocating
	// data to AHB SRAM). Should be a non-issue on the STM32G4,
	// so it might work with shorter periods there.
	TIM3->ARR = 19;
	TIM3->CCR1 = 5;
	TIM3->DIER = TIM_DIER_CC1DE;
	TIM3->SMCR =
			TIM_SMCR_TS_0 |   // ITR1 = TIM2_TRGO
			TIM_SMCR_SMS_2|TIM_SMCR_SMS_0; // gated mode
 
	// TIM2 is started in one-pulse mode triggered by a rising edge on its ETR pin
	// which must be externally connected to QUADSPI NCS.
	// It triggers two DMA transfers first (on CH2 and CH3), setting and resetting
	// the QUADSPI reset bit in RCC.
	// Then it generates a pulse internally on CH1 (OC1REF) which gates TIM3.
	// The length of the pulse should be exactly 6 times the period of TIM3, i.e.
	// (TIM2->ARR + 1 - TIM2->CCR1) = 6 * (TIM3->ARR + 1), while TIM2->ARR is
	// roughly the time NCS stays high.
	TIM2->ARR = 199u;
	TIM2->CCR1 = 80u;
	TIM2->CCR2 = 1u;
	TIM2->CCR3 = 2u;
	TIM2->CCMR1 =
			TIM_CCMR1_OC1M_2|TIM_CCMR1_OC1M_1|TIM_CCMR1_OC1M_0; // TIM2_CH1 PWM mode 2
	TIM2->DIER = TIM_DIER_CC2DE | TIM_DIER_CC3DE; // CH2 and CH3 trigger DMA on compare event
	TIM2->SMCR =
			TIM_SMCR_TS_2|TIM_SMCR_TS_1|TIM_SMCR_TS_0 |  // ETRF -> trigger input
			TIM_SMCR_SMS_2|TIM_SMCR_SMS_1; // trigger counter start
	TIM2->CR2 = (0b100u << TIM_CR2_MMS_Pos); // OC1REF -> TRGO
 
	// Enable TIM3, it would still wait for the gate signal to start counting
	TIM3->CR1 = TIM_CR1_CEN;
	// TIM2 will be enabled by ETR
	TIM2->CR1 = TIM_CR1_OPM;
 
	// Set up QUADSPI for the first transfer
	QUADSPI->DCR =
			(0x0F << QUADSPI_DCR_FSIZE_Pos) |
			(0 << QUADSPI_DCR_CSHT_Pos) |
			0;
	QUADSPI->CR =  // 0xa6000011;
			(249 << QUADSPI_CR_PRESCALER_Pos) |
			//QUADSPI_CR_SSHIFT |
			QUADSPI_CR_EN;
	QUADSPI->DLR = 0;
	QUADSPI->CCR = // 0x03000301;
			QUADSPI_CCR_DMODE |
			(0 << QUADSPI_CCR_DCYC_Pos) |
			QUADSPI_CCR_IMODE |
			(1 << QUADSPI_CCR_INSTRUCTION_Pos);
	
	*(uint8_t *)&QUADSPI->DR = 1;
 
	// Wait for the QUADSPI BUSY flag to clear. Shortly after QUADSPI sets NCS high,
	// the reset through RCC clears all status flags.
	while(QUADSPI->SR & QUADSPI_SR_BUSY)
		;
	// Wait for the timers to finish reinitializing the QUADSPI registers.
	while(TIM2->CR1 & TIM_CR1_CEN)
		;
 
	// Now QUADSPI is ready to transmit the next data packet.
	*(uint8_t *)&QUADSPI->DR = 1;

Of course you don't have to busy-wait for the timer to stop: either a timer DMA request on the update event or a QUADSPI DMA request could load the next value into the QUADSPI data register.
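
The timing constraint from the TIM2 comment above, (TIM2->ARR + 1 - TIM2->CCR1) = 6 * (TIM3->ARR + 1), can be turned into a small helper if you want to try other periods. This is only a sketch of the arithmetic: the function name is hypothetical, it runs on a host, and it does not touch any registers.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the TIM2->CCR1 value for which the gate pulse (from CCR1 to the
 * end of the TIM2 period) spans exactly reg_count periods of TIM3, i.e.
 * (tim2_arr + 1 - ccr1) == reg_count * (tim3_arr + 1).
 * Returns -1 if the pulse does not fit inside the TIM2 period. */
static int64_t gate_start_ccr1(uint32_t tim2_arr, uint32_t tim3_arr,
                               uint32_t reg_count)
{
    uint64_t tim2_period = (uint64_t)tim2_arr + 1u;
    uint64_t gate_length = (uint64_t)reg_count * ((uint64_t)tim3_arr + 1u);

    if (gate_length >= tim2_period) {
        return -1; /* no room left before the gate opens */
    }
    return (int64_t)(tim2_period - gate_length);
}
```

With the values used above (TIM2->ARR = 199, TIM3->ARR = 19, 6 registers) this gives 80, matching TIM2->CCR1. The result also has to stay above the CCR2 and CCR3 compare values so that both RCC writes happen before TIM3 starts counting.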

Apparently the STM32H7 DMA can't copy from peripheral to peripheral registers, so I had to use two DMA channels for that. The STM32G4 DMA should be able to do that, so you can do something like this:

DMA1_Stream0->M0AR = (uint32_t)&TIM2->DMAR;
DMA1_Stream0->PAR = (uint32_t)&RCC->AHB3RSTR;
DMA1_Stream0->NDTR = 2;
DMA1_Stream0->CR =
		DMA_SxCR_MSIZE_1 | // 10: 32 bit
		DMA_SxCR_PSIZE_1 | // 10: 32 bit
		DMA_SxCR_CIRC    |
		DMA_SxCR_DIR_0   | // 00: P->M, 01:M->P, 10:M->M
		DMA_SxCR_EN;
TIM2->CCR3 = RCC_AHB3RSTR_QSPIRST;
TIM2->CCR4 = 0;
TIM2->DCR = (1 << TIM_DCR_DBL_Pos) | (offsetof(TIM_TypeDef, CCR3) / 4); // DBA is a register index counted from CR1, not a byte offset
TIM2->DIER = TIM_DIER_CC2DE; // CC3DE is no longer needed

The timer DMA burst would copy the contents of TIM2->CCR3 followed by TIM2->CCR4 to the RCC reset register, using one DMA channel less.
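
One detail worth double-checking when adapting this: as far as I can tell from the reference manuals, the DBA field of TIMx_DCR counts 32-bit registers starting at TIMx_CR1 (0 = CR1, 1 = CR2, ...) rather than bytes, and DBL holds the number of transfers minus one. Here is a host-side sketch of that calculation. The struct is a hypothetical mirror of the timer register layout for illustration only (on the target you would use TIM_TypeDef from the CMSIS device header), and the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the timer register block, illustration only. */
typedef struct {
    uint32_t CR1, CR2, SMCR, DIER, SR, EGR, CCMR1, CCMR2, CCER,
             CNT, PSC, ARR, RCR, CCR1, CCR2, CCR3, CCR4, BDTR, DCR, DMAR;
} tim_layout_t;

#define DCR_DBL_POS 8u /* assumed bit position of DBL, as TIM_DCR_DBL_Pos */

/* Convert a byte offset into the register block to a DBA register index. */
static uint32_t tim_dba_index(size_t byte_offset)
{
    assert(byte_offset % sizeof(uint32_t) == 0);
    size_t index = byte_offset / sizeof(uint32_t);
    assert(index < 32u); /* DBA is a 5-bit field */
    return (uint32_t)index;
}

/* Build a DCR value for a burst of transfer_count registers starting at
 * the register located at first_reg_offset. */
static uint32_t tim_dcr_value(size_t first_reg_offset, uint32_t transfer_count)
{
    assert(transfer_count >= 1u && transfer_count <= 18u);
    return ((transfer_count - 1u) << DCR_DBL_POS) | tim_dba_index(first_reg_offset);
}
```

For a two-word burst starting at CCR3 (byte offset 0x3C), the index is 15 and the DCR value comes out as 0x10F.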

Albi G. · ‎2020-06-15

I am laughing my *** off :D :D :D

Go hire him, ST!

Arnon · ‎2020-07-14

Hi @Albi G.

Were you able to get the QSPI working reliably?

Can you please share what you did?

Resetting the interface after each transfer, I am able to read data from the FPGA, but now the clock starts before CS goes low :face_with_steam_from_nose:

thanks,

--Arnon

Albi G. · ‎2020-07-15

Honestly, I have given up. The QSPI seems to be a complete mess: verified to work under specific circumstances, but not as a general-purpose IO peripheral. I wish they had sacrificed configurability and just provided a true SPI with a 4-bit-wide output on top of the "QSPI-Memory-Interface" (which seems to be the better name).

I completely changed the approach to my solution, no FPGA anymore.

In a way, I should thank ST for this, since it really made me search for alternatives, and I think my concept got better :)

Tesla DeLorean · ‎2020-07-15

There are certainly many aspects of the STM32 design that are frustratingly lacking (naive), limiting usage outside more normative cases. A little more thought, and detail work could have made the peripherals significantly more powerful.

A good External DMA interface for the FSMC/FMC would have opened a lot of potential options with FPGA/CPLD, or additional flexibility with width/usage of DCMI


Arnon · ‎2020-07-15

Thanks @Albi G.

Unfortunately, the FPGA is mandatory in my design. I am going to give it another try before switching to single-lane SPI :\

Arnon · ‎2020-07-15

Couldn't agree more @Community member .

It looks like a sloppy implementation on ST's part.

I am still not sure what the best way to interface with the FPGA will be. I will use the STM32F723 (with the built-in USB PHY).

I hope this will serve as a warning to anyone who considers using the QSPI as a simple high-throughput interface.

--Arnon

S.Ma · ‎2020-07-15

You may consider 4 SPIs, one master and 3 slaves, to get raw 4-bit transmissions, but you will probably need more GPIOs for things like shorting MOSI and MISO and for mode control. I was told the MDIO, if present, is intended to communicate with FPGAs... rumors?

S.Ma · ‎2020-07-15

If a MISO/MOSI swap control bit is available, it might not do what you expect.