Help with configuring DMA (with interrupts) for SPI on STM32L4R5 using assembly code

CHowa · ‎2019-07-28

I'm new to Cortex-M, STM chips, and the STM32L4R5 in particular, and am having problems getting this working. I have used ARM chips (since ARM2 :-), Atmel <x|mega>AVR and the Raspberry Pi Cortex-A chips...

I seem to be the only person who prefers assembly code over C / Cube / HAL etc... but that's a different matter 🙂

I think I have SPI configured correctly, because writing to SPI_DR works (I see SPI_SCK go active and SPI_MOSI shows the correct bits on the oscilloscope). [The STM32L4R5 is the master. The slave is silent. 2 MHz clock. 16-bit transfers. LSB first. Data captured on the leading (rising) clock edge. No NSS line used. Using SPI1 and DMA1_CH1.] But if I set up things for DMA transfer from memory to SPI and enable DMA and SPI, nothing gets transferred.

enable GPIOA, DMA and SPI clocks in RCC.
set up SYSCLK and PLL in RCC to get 64 MHz.
set the GPIO_MODE to AF for PA5 and PA7 and set GPIO_AFRL to AF5 for both pins.
set bits LSBFIRST | BR_DIV32 | MSTR | SSM | SSI in SPI1->CR1
set bit DS16 in SPI1->CR2
enable the DMA1_CH1 interrupt in NVIC->ISER0 (bit 11)
set bits MSIZE_16 | PSIZE_16 | MINC | DIR | TCIE in DMA1->CCR1
store the destination address (SPI1 + SPI_DR) in DMA1->CPAR1
store DMAREQ_ID_SPI_TX (11) in DMAMUX->C0CR
send two 16-bit data chunks by:
1. setting SPE in SPI1->CR1
2. storing a half-word in SPI1->DR
3. immediately storing another half-word in SPI1->DR
4. waiting until FTLVL and then BSY in SPI1->SR are clear
5. clearing SPE in SPI1->CR1
store the data origin (memory) address in DMA1->CMAR1
store the count [ n(half-words) ] in DMA1->CNDTR1
set bit EN in DMA1->CCR1
set bit TXDMAEN in SPI1->CR2
set bit SPE in SPI1->CR1
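
For anyone comparing notes, the register values named in the list above can be sketched in C as below. This is only a sketch: the bit positions are assumptions taken from a reading of the STM32L4+ reference manual, the function names are hypothetical, and the clock arithmetic assumes the SPI1 peripheral clock equals the 64 MHz SYSCLK.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions are assumptions; check them against your own definitions. */

/* SPI_CR1 */
#define SPI_CR1_MSTR      (1u << 2)    /* master mode */
#define SPI_CR1_BR_DIV32  (4u << 3)    /* BR[2:0] = 100: f_PCLK / 32 */
#define SPI_CR1_LSBFIRST  (1u << 7)
#define SPI_CR1_SSI       (1u << 8)
#define SPI_CR1_SSM       (1u << 9)    /* software slave management, no NSS pin */

/* SPI_CR2 */
#define SPI_CR2_DS16      (0xFu << 8)  /* DS[3:0] = 1111: 16-bit frames */

/* DMA_CCRx */
#define DMA_CCR_TCIE      (1u << 1)    /* transfer-complete interrupt enable */
#define DMA_CCR_DIR       (1u << 4)    /* read from memory, write to peripheral */
#define DMA_CCR_MINC      (1u << 7)    /* increment memory address */
#define DMA_CCR_PSIZE_16  (1u << 8)
#define DMA_CCR_MSIZE_16  (1u << 10)

/* Value for SPI1->CR1 as described in the list (SPE is set later). */
uint32_t spi1_cr1_config(void)
{
    return SPI_CR1_LSBFIRST | SPI_CR1_BR_DIV32 | SPI_CR1_MSTR
         | SPI_CR1_SSM | SPI_CR1_SSI;
}

/* Value for SPI1->CR2 as described in the list (TXDMAEN is set later). */
uint32_t spi1_cr2_config(void)
{
    return SPI_CR2_DS16;
}

/* Value for DMA1->CCR1 as described in the list (EN is set later). */
uint32_t dma1_ccr1_config(void)
{
    return DMA_CCR_MSIZE_16 | DMA_CCR_PSIZE_16 | DMA_CCR_MINC
         | DMA_CCR_DIR | DMA_CCR_TCIE;
}

/* SCK frequency for a given BR[2:0] field: f_PCLK / 2^(br + 1). */
uint32_t spi_sck_hz(uint32_t pclk_hz, uint32_t br)
{
    assert(br <= 7u);
    return pclk_hz >> (br + 1u);
}
```

With these assumptions, spi_sck_hz(64000000, 4) gives the 2 MHz clock mentioned above.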

At this point I expect the DMA to send the data from memory to the SPI, and SPI to activate its clock and MOSI lines. But nothing happens.

At the end of the transfer I expect the DMA1_Channel1_handler to get called and do:

clear the interrupt flag by writing a '1' (CTCIF1) to DMA1->IFCR
clear bit EN in DMA1->CCR1
write a new memory address in DMA1->CMAR1
rewrite the count to DMA1->CNDTR1
set bit EN in DMA1->CCR1
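
As a sketch of the handler steps above, here is a C version that runs against a plain struct standing in for the DMA1 register block, so the ordering can be checked on a host. The struct, the function name dma1_ch1_rearm and the bit positions are assumptions of mine; on the target the same stores would go to the real DMA1 registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the first DMA1 registers (assumed layout). */
typedef struct {
    volatile uint32_t ISR;
    volatile uint32_t IFCR;
    volatile uint32_t CCR1;
    volatile uint32_t CNDTR1;
    volatile uint32_t CPAR1;
    volatile uint32_t CMAR1;
} dma_regs_t;

#define DMA_CCR_EN       (1u << 0)   /* assumed bit position */
#define DMA_IFCR_CTCIF1  (1u << 1)   /* assumed bit position */

/* Body of the transfer-complete handler: acknowledge, then re-arm the
   channel with a new memory address and half-word count. */
void dma1_ch1_rearm(dma_regs_t *dma, uint32_t next_addr, uint32_t count)
{
    assert(dma != NULL && count > 0u);

    dma->IFCR   = DMA_IFCR_CTCIF1;   /* write 1 to clear the TC flag */
    dma->CCR1  &= ~DMA_CCR_EN;       /* disable before changing CMAR/CNDTR */
    dma->CMAR1  = next_addr;         /* new memory source address */
    dma->CNDTR1 = count;             /* number of half-words to send */
    dma->CCR1  |= DMA_CCR_EN;        /* re-enable the channel */
}
```

The disable/enable pair matters because, as far as I can tell from the manual, CMAR and CNDTR should only be written while the channel is disabled.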

And finally terminate with:

write a dummy data half-word to SPI1->DR
clear bit EN in DMA1->CCR1
wait until FTLVL and then BSY in SPI1->SR are clear
clear bit SPE in SPI1->CR1
clear bit TXDMAEN in SPI1->CR2
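
The termination steps above could be sketched in C like this, again against mock registers so it runs on a host. The poll limit (max_polls) is my own addition so that the waits cannot hang; the struct, names and bit positions are assumptions, not anything from a vendor header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the SPI registers used here (simplified layout). */
typedef struct {
    volatile uint32_t CR1;
    volatile uint32_t CR2;
    volatile uint32_t SR;
    volatile uint16_t DR;    /* written as a half-word for 16-bit frames */
} spi_regs_t;

/* Assumed bit positions. */
#define SPI_CR1_SPE      (1u << 6)
#define SPI_CR2_TXDMAEN  (1u << 1)
#define SPI_SR_BSY       (1u << 7)
#define SPI_SR_FTLVL     (3u << 11)
#define DMA_CCR_EN       (1u << 0)

/* Poll until all bits in mask read as 0, giving up after max_polls reads. */
static bool wait_clear(const volatile uint32_t *reg, uint32_t mask,
                       uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if ((*reg & mask) == 0u) {
            return true;
        }
    }
    return false;
}

/* Returns true if the SPI drained and was switched off, false on timeout
   (in which case SPE and TXDMAEN are left untouched). */
bool spi_dma_stop(spi_regs_t *spi, volatile uint32_t *dma_ccr,
                  uint16_t dummy, uint32_t max_polls)
{
    assert(spi != NULL && dma_ccr != NULL);

    spi->DR   = dummy;                 /* dummy half-word */
    *dma_ccr &= ~DMA_CCR_EN;           /* stop the DMA channel */
    if (!wait_clear(&spi->SR, SPI_SR_FTLVL, max_polls)) return false;
    if (!wait_clear(&spi->SR, SPI_SR_BSY, max_polls))   return false;
    spi->CR1 &= ~SPI_CR1_SPE;          /* disable SPI */
    spi->CR2 &= ~SPI_CR2_TXDMAEN;      /* stop requesting DMA */
    return true;
}
```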

Assuming all my register base addresses, offsets and bit definitions etc are correct -- is there any obvious step I've left out? If not, any ideas as to how I can proceed with testing/debugging? I can't see what's going wrong, but the dma_req from the SPI (due to TXE bit being set in SPI1->SR) doesn't seem to be arriving at the DMA.

I'm using arm-none-eabi-gdb with openocd 0.10.0.+dev-00921-g263deb38 on a Mac. The chip is on a Nucleo-L4R5ZI board.

Thanks for any pointers...

Tesla DeLorean · ‎2019-07-28

>>I seem to be the only person who prefers assembly code over C / Cube / HAL etc... but that's a different matter

Well I think it is more a case of economics, people paying for work tend to be interested in the speed of development, and completion of functional goals, not that the code is small/fast.

>> I have used ARM chips (since ARM2 :-),

My younger brother got one of the first Archimedes systems, I have the VLSI chip manuals still, and had already mastered 6502, Z80, 68K and assorted 808x assemblers. The flash programming side of my STM32 boot loaders uses assembler, basically to contain dependencies, keeping things fast and small, and easy to copy into RAM.

You'll likely need to compare/contrast what the C libraries are doing, and decompose the sequences.

The DMA is driven by the TXE bit.

Prove that SPI is working in a polled mode, i.e. reading SPI1->SR, writing SPI1->DR

Then check the trigger paths for the DMA, and that it is not flagging errors/faults, and the address advances.

Mostly bit settings and control paths.
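
A minimal sketch of the polled check, in C for brevity and written against a mock register struct so it can be run on a host. The struct layout, the function name spi_send_polled and the poll limit are assumptions for illustration; on the part itself the loop is the same read-SR, write-DR pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the SPI registers used here (simplified layout). */
typedef struct {
    volatile uint32_t CR1;
    volatile uint32_t CR2;
    volatile uint32_t SR;
    volatile uint16_t DR;    /* written as a half-word for 16-bit frames */
} spi_regs_t;

#define SPI_SR_TXE  (1u << 1)   /* assumed bit position */

/* Send n half-words by polling TXE. Returns how many were written to DR;
   stops early if TXE is still clear after max_polls reads of SR. */
size_t spi_send_polled(spi_regs_t *spi, const uint16_t *data, size_t n,
                       uint32_t max_polls)
{
    assert(spi != NULL && data != NULL);

    size_t sent = 0;
    while (sent < n) {
        uint32_t polls = 0;
        while ((spi->SR & SPI_SR_TXE) == 0u) {
            if (++polls >= max_polls) {
                return sent;          /* TXE never came up: give up */
            }
        }
        spi->DR = data[sent++];       /* TXE set: FIFO can take a frame */
    }
    return sent;
}
```

If this moves data but the DMA version does not, the SPI itself is fine and the problem is somewhere on the request path between SPI and DMA.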


waclawek.jan · ‎2019-07-28

> enable GPIOA, DMA and SPI clocks in RCC.

Does this include DMAMUX clock?

> if I set up things for DMA transfer from memory to SPI and enable DMA and SPI, nothing gets transferred.

Then interrupts are secondary. You need the SPI to move first, and for that, the Tx DMA to get triggered and transferred data into SPI1->DR.

You can always read back content of all the registers you've set and check if they are set as you expect them to be. You can also observe the DMA status register for possible fault and check if the relevant NDTR changes as expected.
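
To illustrate why a read-back can catch a missing clock, here is a toy C model, not real hardware access. It assumes that a register behind a disabled peripheral clock ignores writes and reads as zero, which is the behaviour usually seen on STM32 parts but should be treated as an assumption here; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one peripheral register behind an RCC clock gate.
   Assumption: with the clock off, writes are dropped and reads return 0. */
typedef struct {
    bool     clock_enabled;
    uint32_t value;
} gated_reg_t;

void reg_write(gated_reg_t *r, uint32_t v)
{
    assert(r != NULL);
    if (r->clock_enabled) {
        r->value = v;
    }
}

uint32_t reg_read(const gated_reg_t *r)
{
    assert(r != NULL);
    return r->clock_enabled ? r->value : 0u;
}

/* Write, then read back; returns true only if the value actually stuck. */
bool reg_write_verified(gated_reg_t *r, uint32_t v)
{
    reg_write(r, v);
    return reg_read(r) == v;
}
```

In this model, writing the request ID 11 to a channel configuration register reads back as 0 until the clock is enabled, which is the kind of mismatch a read-back is meant to expose.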

JW

CHowa · ‎2019-07-28

"My younger brother got one of the first Archimedes systems"

I had an Archimedes A305 (no hard drive...) and learnt assembler from Peter Cockerell's ARM Assembly Language Programming book.

ARM code (and the whole system) was elegant in those days...

"I have the VLSI chip manuals still."

Me too. They were posted for free on request.

CHowa · ‎2019-07-28

"Does this include DMAMUX clock?"

No, of course not. <sigh>

Well spotted. Thanks. My first DMA transfer goes through now. Nothing after that, but that's just normal debugging ...

waclawek.jan · ‎2019-07-28

> No, of course not. [DMAMUX clock]

That's why I always recommend to read back the registers' content as the first debugging step. Guess why.

> I had an Archimedes A305 (no hard drive...)

Used one at the university, BASIC only. Been charmed. Does that count?

JW

CHowa · ‎2019-07-28

"That's why I always recommend to read back the registers' content as the first debugging step."

Wouldn't have helped in this case since I'd have read back what I was expecting to see. It was about 3 a.m. though... :\ And today I thought "I've done that" (TM)

"Used one at the university, BASIC only. Been charmed. Does that count?"

Of course -- if you've been corrupted and think that a windowed multi-tasking operating system, with anti-aliased text/graphics in a windowing system that was nicer than the Macintosh's etc etc etc, should be blazing fast, boot in a couple of seconds and fit in a 4 Mbyte (?) ROM (including all the fonts and decent vector graphics and text editors etc etc etc). And this was 30 years ago.

... I'd better stop. We minority quasi-dead systems aficionados all sound like lunatics.

But the reason it was so good, was that a) Sophie Wilson and Steve Furber were brilliant and b) a lot of the system was hand-crafted assembly code.

CHowa · ‎2019-07-28

"Well I think it is more a case of economics, people paying for work tend to be interested in the speed of development, and completion of functional goals, not that the code is small/fast."

You're right, of course. And I'm not going to fight windmills.

But...

[OK, the windmill attacked me first!]

No. I am successfully resisting the urge to explain (and prove!) why the world is wrong...

The world can write 5 Gbyte updates to my laptop's operating system (which seemingly only added animated emojis -- but you pay for that by losing your ESC key...) and I, an artist, will craft operating systems that fit in 15 k, boot in a couple of microseconds and contain a LISP interpreter. Or whatever 🙂

thanks4opensource · ‎2019-07-28

I'm following this thread with interest as using SPI with DMA (already have it working without) on several STM32 MCUs is one of my next tasks. Should be easy, but experience has been nothing ever is with STM libraries and documentation. Please post your solution when you find it.

For me the issue is not C versus assembly, but direct/simple/efficient C compared to the indirect, bloated, obfuscated, inefficient HAL examples that are the only ones currently provided by ST. And the reference manuals, which, if they were more complete, better written, and error-free would make example code unnecessary.

BTW, I too started my programming career in assembly, on chips and systems long pre-dating ARM. I've played around with ARM assembly and intend to use it in the limited instances where it can provide performance benefits. "ARM code (and the whole system) was elegant in those days..."? I wish I had been involved with it then. I wrote a partial simulator for the Cortex-M4 Thumb2 instruction set, and the deeper I got into it the more my reaction was, "They call this a reduced instruction set architecture???"

CHowa · ‎2019-07-28

"Should be easy, but experience has been nothing ever is with STM libraries and documentation. Please post your solution when you find it."

Mmmm. STM documentation is perfectly clear -- if you already know precisely what it's trying to say.

JW already found the problem: I simply hadn't enabled the DMAMUX clock. So, the recipe is simply the first list in the first post. Except line 1 should be:

enable GPIOA, DMA1, DMAMUX and SPI1 clocks in RCC.

After that it works properly. (I just had a little copy/paste bug, where I'd omitted to modify the pasted line...)
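
For anyone who wants the corrected first step spelled out, here is a rough C sketch of the clock enables. The bit positions (DMA1EN and DMAMUX1EN in AHB1ENR, GPIOAEN in AHB2ENR, SPI1EN in APB2ENR) are my reading of the STM32L4+ manual and should be checked; the struct is a host-side stand-in holding only the three enable registers, not the real RCC layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions in the RCC enable registers. */
#define RCC_AHB1ENR_DMA1EN     (1u << 0)
#define RCC_AHB1ENR_DMAMUX1EN  (1u << 2)   /* the one that was missing */
#define RCC_AHB2ENR_GPIOAEN    (1u << 0)
#define RCC_APB2ENR_SPI1EN     (1u << 12)

/* Stand-in for the three RCC enable registers touched here. */
typedef struct {
    volatile uint32_t AHB1ENR;
    volatile uint32_t AHB2ENR;
    volatile uint32_t APB2ENR;
} rcc_enables_t;

/* Enable GPIOA, DMA1, DMAMUX and SPI1 clocks, leaving other bits alone. */
void enable_spi1_dma_clocks(rcc_enables_t *rcc)
{
    assert(rcc != NULL);
    rcc->AHB1ENR |= RCC_AHB1ENR_DMA1EN | RCC_AHB1ENR_DMAMUX1EN;
    rcc->AHB2ENR |= RCC_AHB2ENR_GPIOAEN;
    rcc->APB2ENR |= RCC_APB2ENR_SPI1EN;
}
```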

"but direct/simple/efficient C compared to the indirect, bloated, obfuscated, inefficient HAL examples"

If you look at C code as written by Thompson and Ritchie (Lions' Commentary on UNIX 6th Edition -- or something like that) and compare it with autogenerated stuff intended for consumption by a compiler... well. It's different.

As to the reference manuals... I think they could be improved.

' I wish I had been involved with it then. I wrote a partial simulator for the Cortex-M4 Thumb2 instruction set, and the deeper I got into it the more my reaction was, "They call this a reduced instruction set architecture???"'

Have a look at the book I linked in my second post. It's now free and still almost all relevant. It's an elegant, well-written little book.

I would love to know whether Wilson and Furber have any opinions on current ARM architectures and Thumb instruction set etc. Oh. I just see that there are interviews with Sophie Wilson on youtube. I'll have a look in a minute...

For example, on the ARM2 and ARM3, you could put the FIQ (fast interrupt request) handler code at the FIQ location in the vector table, because FIQ was the last entry in the table. So your interrupt handler would start executing immediately. And it has banked versions of r8-r14 so it doesn't need to push any registers on the stack.

Then I read that the Cortex-M (which has its interrupt system, LR and even the bloody vector table messed up -- because it's specialised for fast interrupt handling) has an interrupt latency of 12 (possibly 29) cycles on entry and 10 (27) on exit. Oh, and that's if there are no (flash) wait states -- which there are...! So my FIQ handler could load a register, think about it, flip some bits, write a register and return -- and the Cortex-M would still be stacking registers which I'm not even going to be modifying! Oh well, that's 30 years of progress for you.

But the ARM2 didn't have a 4 MSPS 12 bit ADC built in. So I'm still happy 🙂