Big ball of DMA questions

Code Wrangler · ‎2017-09-23

Posted on September 24, 2017 at 08:01

I'm about to start implementing driving a single GPIO pin with DMA timer-driven bit-banging and I have a bunch of questions. I'm throwing them out here first in case people know the answers to some (or all) of them and so can save time on discovering things the hard way. I don't expect anyone to do my work or research for me - I'm prepared to bash my head against the wall and find out everything the hard way, but would appreciate any insights people have. To give back to the community, I will update this thread with the answers I discover over the next few days as the project progresses to completion.

1) (Not directly relevant to this project, but I am interested in the answer anyway) - Is it better to use a timer event to trigger the peripheral (let's say a DAC) directly to pull data via a DMA request or to use the timer interrupt to generate a DMA request to push data to the peripheral? Is one more temporally correct (I'm assuming triggering the DAC directly)? Is one more bus access contention efficient?

Answer (per Clive's answer below): It is better to use a timer event to trigger the peripheral to do a DMA pull rather than using a timer interrupt to initiate a DMA push to the peripheral.

2) I moved my existing working DMA buffer array variables to CCM to reduce bus contention and everything died (no bus or hard faults - Program kept running fine, just no more DMA). Am I missing something obvious?

Answer (per Jan's answer below): Yes, I was missing that DMA from CCM memory is impossible. From RM0090 (STM32F4xx Reference Manual - Page 61): 'The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU.'
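A guard along the following lines can catch a buffer that has wandered into CCM before a DMA stream is pointed at it. This is only a sketch: the helper name `buffer_outside_ccm` is made up for illustration, and the CCM window (base 0x10000000, 64 KB) is the value recalled for the STM32F40x/F42x memory map, so check it against RM0090 for the exact part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed CCM data RAM window on STM32F40x/F42x (verify in RM0090). */
#define CCM_BASE 0x10000000u
#define CCM_SIZE (64u * 1024u)

/* Hypothetical helper: true if [addr, addr + len) does not overlap CCM.
 * It only rules out CCM; it does not prove the address is valid RAM. */
static bool buffer_outside_ccm(uint32_t addr, size_t len)
{
    uint64_t start   = addr;
    uint64_t end     = start + len;                    /* one past the last byte */
    uint64_t ccm_end = (uint64_t)CCM_BASE + CCM_SIZE;  /* one past the end of CCM */
    return end <= CCM_BASE || start >= ccm_end;
}
```

On target, such a check would typically sit in an assert just before the DMA stream is configured, with the buffer pointer cast to `uint32_t`.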

3) For minimum bus access contention and/or best performance is it better to put only DMA buffers in CCM, or DMA buffers + heap + stack in CCM, or heap + stack in CCM and DMA buffers in SRAM 2? Chip is an STM32F429. In https://community.st.com/0D50X00009XkYmMSAV, Clive and Mr. Peacock come to basically opposite conclusions.

Answer (per Mr. Peacock in the referenced thread):

CCM: Heap and stacks (No bus contention with DMA accesses)

SRAM1: Globals and large arrays (Data caching has benefit here)

SRAM2: DMA buffers (DMA doesn't care about lack of data caching and it avoids some contention with SRAM1)

BKPSRAM: Non-volatile variables

From AN4031 (Using the STM32F2, STM32F4 and STM32F7 Series DMA controller - Page 32):

Best DMA throughput configuration

When using STM32F4xx with reduced AHB frequency while DMA is servicing a high-speed peripheral, it is recommended to put the stack and heap in the CCM (which can be addressed directly by the CPU through D-bus) instead of putting them on the SRAM, which would create an additional concurrency between CPU and DMA accessing the SRAM memory.
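With GCC, individual variables can be steered into a particular RAM bank with section attributes. The sketch below is an assumption-laden illustration, not a drop-in: the section names `.ccmram` and `.sram2` must match output sections that your own linker script maps to CCM and SRAM2, and the macro and variable names are hypothetical. Stack and heap placement is normally done in the linker script itself rather than with attributes.

```c
#include <stdint.h>

/* Section names are assumptions: they must exist in your linker script.
 * On a non-ARM host build the macros expand to nothing so this still compiles. */
#if defined(__arm__) && defined(__GNUC__)
#define IN_CCM   __attribute__((section(".ccmram")))
#define IN_SRAM2 __attribute__((section(".sram2")))
#else
#define IN_CCM
#define IN_SRAM2
#endif

#define PATTERN_WORDS 256u

/* CPU-only scratch data: fine in CCM because DMA never touches it. */
IN_CCM uint8_t cpu_scratch[1024];

/* DMA source buffer: must live in RAM on the bus matrix, here SRAM2. */
IN_SRAM2 uint32_t dma_pattern[PATTERN_WORDS];
```

Variables placed in a custom section may not be zeroed or initialized by the default startup code, so it is worth checking how your startup file and linker script treat those sections.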

4) Can timer-driven memory-to-GPIO DMA be configured in CubeMX?

Answer: No, this can't be done in CubeMX as of version 4.22.1.

5) Can DMA be used with GPIO ODR port bit-banding? If so, is DMA from memory to GPIO ODR port bit-banding considered memory-to-peripheral or memory-to-memory?

Answer (per Clive's answer below): Apparently yes, if you are willing to lose atomicity. It is considered a memory-to-memory transfer.
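For reference, the bit-band alias address that the CPU uses for a given bit follows the standard Cortex-M3/M4 mapping, sketched below. The function name `bitband_alias` is made up for illustration. Note that this only computes the address as seen by the CPU; whether a DMA master gets the same translation is exactly what is disputed in the replies below.

```c
#include <stdint.h>

/* Cortex-M3/M4 bit-band regions and their alias bases. */
#define SRAM_BB_BASE    0x20000000u
#define SRAM_ALIAS_BASE 0x22000000u
#define PERI_BB_BASE    0x40000000u
#define PERI_ALIAS_BASE 0x42000000u
#define BB_REGION_SIZE  0x00100000u  /* each bit-band region covers 1 MB */

/* Returns the alias word address for one bit of the byte/word at addr,
 * or 0 if addr is outside both bit-band regions or bit is out of range.
 * For bit numbers above 7, addr is expected to be word-aligned. */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    if (bit > 31u) {
        return 0u;
    }
    if (addr - SRAM_BB_BASE < BB_REGION_SIZE) {
        return SRAM_ALIAS_BASE + (addr - SRAM_BB_BASE) * 32u + bit * 4u;
    }
    if (addr - PERI_BB_BASE < BB_REGION_SIZE) {
        return PERI_ALIAS_BASE + (addr - PERI_BB_BASE) * 32u + bit * 4u;
    }
    return 0u;
}
```

As a worked case, GPIOA->ODR on the F4 sits at 0x40020014, so bit 5 (PA5) maps to 0x42000000 + 0x20014 * 32 + 5 * 4 = 0x42400294.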

6) Since I am driving a single pin, it is quite wasteful to DMA an entire byte to the GPIO bit-banding register. In my ideal scenario (which quite possibly may not be technically feasible), I'd like to store my output bit pattern as individual bits in 32-bit words, then DMA the 32-bit words of the bit-banded expansion of the source array to the 32-bit word of the GPIO ODR port bit-banded expansion corresponding to my output pin. This would reduce the bit pattern source data array size requirements by a factor of 8. Basically, what I am thinking of doing is this:

Word array in memory -> 32x word-array expansion in bit-banding land -> DMA -> Single word in 32x word-array in GPIO ODR bit-banding land -> GPIO output pin

Is this possible?

Answer: Unknown. I have abandoned this goal as I found that I could shift a few pins around and free up an SPI port which will do the job in a much cleaner fashion than DMA-driven GPIO bit-banging. It seems like this 'double bit-banding DMA' question comes up every few years (see example [link missing]) and never gets definitively answered. Based on [link missing], it seems like it should be possible to do a timer-driven double-bit-banding DMA transfer from memory to a GPIO pin.

Tesla DeLorean · ‎2017-09-24

Posted on September 24, 2017 at 14:11

On the F2/F4 Memory to GPIO is considered a M2M transaction, and needs to be on DMA2, TIM sources thus on APB2

Bit-banding is a CPU level manifestation, and isn't atomic. When driving pattern buffers to select GPIO pins within a bank use 32-bit wide patterns driven into GPIO->BSRR. All changes are applied at a single clock edge, so don't skew. If writing all 16 pins in a bank, clearly a 16-bit write to GPIO->ODR would suffice..
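A minimal sketch of what such a pattern buffer looks like for a single pin: each output sample becomes one 32-bit BSRR word, where bits 0..15 set pins and bits 16..31 reset them. The function names and the LSB-first bit order of the packed source are choices made here for illustration, and the pin number is assumed to be in the range 0..15.

```c
#include <stddef.h>
#include <stdint.h>

/* BSRR layout: writing 1 to bit n sets pin n, writing 1 to bit n + 16 resets it. */
static uint32_t bsrr_word(unsigned pin, int level)
{
    return level ? (1u << pin) : (1u << (pin + 16u));
}

/* Expand nbits of a packed pattern (LSB first within each byte, an arbitrary
 * choice) into one BSRR word per output sample. out must hold nbits words. */
static void expand_to_bsrr(const uint8_t *bits, size_t nbits,
                           unsigned pin, uint32_t *out)
{
    for (size_t i = 0; i < nbits; i++) {
        int level = (bits[i / 8u] >> (i % 8u)) & 1u;
        out[i] = bsrr_word(pin, level);
    }
}
```

The cost is visible in the signature: every source bit turns into four bytes of DMA buffer, which is the 32x expansion discussed later in the thread.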

I would have a TIM trigger ADC/DAC, and have the peripheral trigger the DMA

CCM and FLASH would not be good DMA sources. SDRAM might be viable, but is significantly slower.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

waclawek.jan · ‎2017-09-24

Posted on September 24, 2017 at 12:33

1) [...] Is it better to use a timer event to trigger the peripheral (let's say a DAC) directly or to use the timer event to generate a DMA request to drive the peripheral? Is one more temporally correct (I'm assuming triggering the DAC directly)?

Yes. Indirect DMA-based methods are always inferior to direct hardware connections, not only timing-wise, but also in that they occupy the precious DMA streams/channels.

Is one more bus access contention efficient?

Zero bus access is certainly more 'contention efficient' (whatever that may mean) than one or more.

2) I moved my existing working DMA buffer array variables to CCM to reduce bus contention and everything died (no bus or hard faults - Program kept running fine, just no more DMA). Am I missing something obvious?

Yes - you fail to tell us your target chip 😉 Okay I see in 3) it's the 'classic' 'F4 - look at RM0090 Figure 2 - CCM is accessed only from the processor, not through the bus matrix, that's why the DMA units don't see it at all. (That diagram is to be read as 'on top are masters, on right slaves, masters can go to slaves only from top down to intersection and then to the right').

This voids also question 3). I also don't see contradiction between Clive's and Jack Peacock's answers there - both warn against placing DMA buffers in CCM in 'F4.

4) Can GPIO DMA be configured in Cube?

You mean in CubeMX, or using HAL in Cube, or using LL in Cube? (Cube is the name of the 'library'; the clicky program is called CubeMX). And what is GPIO DMA? You mean a timer-triggered DMA with target address in GPIO? I don't know if that can be clicked in CubeMX as I don't have it installed, but surely it can be used in Cube - as you're already aware, LL is only a thin wrapper (renaming, completely unnecessary in my view 😉 ) on direct register access, and HAL probably allows you to use any address as the target address for DMA. Note that the peripheral port of DMA1 in 'F2/'F4/'F7 can access only APB1 (look again at Figure 2 in RM0090).
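The DMA1 restriction can be turned into a quick sanity check on a target address. The sketch below is hedged: the function and type names are invented, and the bus address ranges are the ones recalled for the 'F4 memory map (APB1 0x40000000-0x40007FFF, APB2 0x40010000-0x40016BFF, AHB1 0x40020000-0x4007FFFF), so verify them against the memory map table in RM0090 before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BUS_APB1, BUS_APB2, BUS_AHB1, BUS_OTHER } periph_bus_t;

/* Assumed 'F4 peripheral address ranges (check RM0090 memory map). */
static periph_bus_t classify_periph(uint32_t addr)
{
    if (addr >= 0x40000000u && addr <= 0x40007FFFu) return BUS_APB1;
    if (addr >= 0x40010000u && addr <= 0x40016BFFu) return BUS_APB2;
    if (addr >= 0x40020000u && addr <= 0x4007FFFFu) return BUS_AHB1;
    return BUS_OTHER;
}

/* Per the note above: the DMA1 peripheral port reaches APB1 only. */
static bool dma1_periph_port_can_reach(uint32_t addr)
{
    return classify_periph(addr) == BUS_APB1;
}
```

GPIO lives on AHB1 (GPIOA->BSRR is at 0x40020018), so this check fails for it, which is consistent with the advice elsewhere in the thread to use DMA2 for memory-to-GPIO transfers.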

5) Can DMA be used with GPIO ODR port bit-banding?

No. Bitbanding is accomplished through an appendix on the processor's S port, so no other masters see the bitwise-accessible area.

This voids question 6.

With spending two extra pins interconnected externally and a timer, you can use SPI for serialization, timed from the timer output.

JW

Code Wrangler · ‎2017-09-24

Posted on September 24, 2017 at 17:04

Thanks much for your answers. I now have a very crude map in my head and can start diving into the AMBA AHB-Lite / APB technical resources without drowning.

Clive One wrote:

Bit-banding is a CPU level manifestation, and isn't atomic. When driving pattern buffers to select GPIO pins within a bank use 32-bit wide patterns driven into GPIO->BSRR. All changes are applied at a single clock edge, so don't skew. If writing all 16 pins in a bank, clearly a 16-bit write to GPIO->ODR would suffice..

This is horrible! Not only do I lose my desired 1 bit in to 1 bit out compression, but the previous 8-bit memory byte in -> 1-bit GPIO pin out fallback (using memory byte array to GPIO ODR bit-band) becomes 32-bit memory word in -> 1-bit GPIO pin out (I am not writing all 16 pins, so I need to do BSRR). I am going to try the bit-banding way first to see if the skew is acceptable.

I would have a TIM trigger ADC/DAC, and have the peripheral trigger the DMA [pull]

Yes, that was my intuition as well and the way I currently do it.

CCM and FLASH would not be good DMA sources. SDRAM might be viable, but is significantly slower.

According to RM0090 (STM32F4xx Reference Manual - Page 61), CCM is not just a 'not-good' DMA source, it is an impossible DMA source: 'The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU.'

In the thread I previously referenced, you said:

It's complicated. As I read the docs I get the sense CCM is a cycle slower, but if you're doing a lot of DMA the accesses to CCM won't have any contention, and thus will be far more predictable.

Which apparently doesn't mean what it seemingly says. Rather, it seems to mean:

It's complicated. As I read the docs I get the sense CCM is a cycle slower, but if you're doing a lot of DMA the [non-DMA] accesses to CCM won't have any contention [with the DMA accesses in other banks], and thus will be far more predictable.

You might want to go back and edit that, as that exact answer was what led me to try to put the DMA buffers in CCM in the first place. Granted, I tend to fly by intuition (and trust in knowledgeable people) and then go back to the datasheets and manuals when things don't work as expected, so maybe it's karmic punishment for not reading the manuals first.

On the F2/F4 Memory to GPIO is considered a M2M transaction, and needs to be on DMA2, TIM sources thus on APB2

Is this a requirement? I thought any event providing timer can trigger any DMA stream. At least that is what CubeMX allows.

Code Wrangler · ‎2017-09-24

Posted on September 24, 2017 at 18:07

Thanks for your answers.

waclawek.jan wrote:

Yes. Indirect DMA-based methods are always inferior to direct hardware connections, not only timing-wise, but also in that they occupy the precious DMA streams/channels.

Sorry, I wasn't clear (I've edited the question) - The question wasn't DMA vs. non-DMA. It was DMA pull (from the timer event triggered peripheral) vs. timer event triggered DMA push (to the peripheral). According to Clive, DMA pull is better than DMA push.

Look at RM0090 Figure 2 - CCM is accessed only from the processor, not through the bus matrix, that's why the DMA units don't see it at all. (That diagram is to be read as 'on top are masters, on right slaves, masters can go to slaves only from top down to intersection and then to the right').

Right, got it. Please see my answer to Clive below about why the confusion started.

renaming, completely unnecessary in my view 😉

I am Luke to your Darth Vader. There is still good [programming style] within you. I can sense it.

With spending two extra pins interconnected externally and a timer, you can use SPI for serialization, timed from the timer output.

I already use this technique but I'm out of SPI ports. Also, the timer is superfluous - SPI is self-clocking (if you can use one of the existing data rates determined by the SPI clock divisor, which I can in my case).
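The available SPI bit rates follow from the BR[2:0] prescaler field: the bit clock is the peripheral clock divided by 2^(BR+1), i.e. by 2, 4, ... 256. A small sketch of picking the fastest rate that does not exceed a target is shown below; the function name and the example clock figures are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the BR[2:0] value (0..7) giving the fastest SPI bit rate that
 * does not exceed max_bit_hz, or -1 if even pclk/256 is too fast.
 * The resulting rate is stored in *actual_hz when that pointer is non-NULL. */
static int spi_br_for(uint32_t pclk_hz, uint32_t max_bit_hz, uint32_t *actual_hz)
{
    for (int br = 0; br <= 7; br++) {
        uint32_t rate = pclk_hz >> (br + 1);   /* pclk / 2^(br + 1) */
        if (rate <= max_bit_hz) {
            if (actual_hz != NULL) {
                *actual_hz = rate;
            }
            return br;
        }
    }
    return -1;
}
```

For example, with an assumed 45 MHz peripheral clock and a target of at most 800 kbit/s, the closest prescaler is /64, giving 703125 bit/s. Whether that gap is acceptable decides if the timer-clocked variant is needed after all.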

Tesla DeLorean · ‎2017-09-24

Posted on September 24, 2017 at 19:04

I'm not going back and edit things, really not looking for someone to pick apart threads and critique my responses, if you misinterpret, take out of context, or distort what I said in your head that's really not my problem. Fitting my words to your own narrative or perception is going to give at least one of us a headache.

I see threads as a conversation, where thoughts and expression have specific context and flow in time, editing them for content damages that, I'll fix my dyslexia if I find it and whatever links ST bollixed up in porting to the new platform.

I would probably decompress a bit vector into a small pattern buffer using HT/TC interrupts to decimate interrupt loading, a balance of size vs loading. The GPIO peripheral provides single write operation of single or multiple pins, using bit-banding seems a highly inefficient use of the bus and the CPU in this case. The CPU is very good at loading bits as immediates, instructions being prefetched and cached. ie an immediate load and store to write buffer, vs a RMW action across a slower bus
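One way to read the decompression suggestion is a circular DMA buffer split in two halves: while the DMA drains one half, the half-transfer (HT) or transfer-complete (TC) handler refills the other from the packed bit vector. The sketch below is hardware-free so it can be exercised on a host; the pin number, buffer size, bit order and all function names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PIN        5u    /* hypothetical output pin, 0..15 */
#define HALF_WORDS 16u   /* output samples per half buffer */

static uint32_t dma_buf[2u * HALF_WORDS];  /* circular DMA source of BSRR words */
static const uint8_t *src_bits;            /* packed pattern, LSB first per byte */
static size_t src_nbits, src_pos;

/* Refill one half (0 or 1) while the DMA is reading the other half. */
static void refill_half(unsigned half)
{
    uint32_t *dst = &dma_buf[half * HALF_WORDS];
    for (size_t i = 0; i < HALF_WORDS; i++) {
        if (src_pos < src_nbits) {
            unsigned level = (src_bits[src_pos / 8u] >> (src_pos % 8u)) & 1u;
            dst[i] = level ? (1u << PIN) : (1u << (PIN + 16u));
            src_pos++;
        } else {
            dst[i] = 0u;  /* a zero BSRR write leaves every pin unchanged */
        }
    }
}

/* To be called from the DMA stream's HT and TC interrupt handlers. */
static void on_half_transfer(void)     { refill_half(0u); }
static void on_transfer_complete(void) { refill_half(1u); }

/* Prime both halves before the DMA stream is enabled. */
static void pattern_start(const uint8_t *bits, size_t nbits)
{
    src_bits  = bits;
    src_nbits = nbits;
    src_pos   = 0u;
    refill_half(0u);
    refill_half(1u);
}
```

The half size sets the trade-off mentioned above: a larger `HALF_WORDS` means fewer interrupts but more RAM spent on expanded BSRR words.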

Tesla DeLorean · ‎2017-09-24

Posted on September 24, 2017 at 19:05

>>Is this a requirement? I thought any event providing timer can trigger any DMA stream. At least that is what CubeMX allows.

Last time I looked at the matrix it was constraining..

Tesla DeLorean · ‎2017-09-24

Posted on September 24, 2017 at 19:22

https://community.st.com/s/feed/0D50X00009bLPmvSAG

Code Wrangler · ‎2017-09-24

Posted on September 24, 2017 at 20:39

Clive One wrote:

>>Is this a requirement? I thought any event providing timer can trigger any DMA stream. At least that is what CubeMX allows.

Last time I looked at the matrix it was constraining..

Sorry, I completely and totally botched the question. I meant to write that any event-producing timer can be used to trigger a DMA-enabled peripheral (such as DAC, see below), not that any timer could directly trigger any DMA stream.