
Seeking information on real-world PSSI performance

BarryWhit
Senior

Hello All,

 

I'd be glad to hear from someone who has used PSSI in a high-bandwidth application. I'm interested in hard numbers on sustained throughput with and without flow control. My intended application is to interface some member of the STM32 product line (H7, L4, U5) with an FPGA. The intended clock rate is about 15 MHz at a width of 16 bits, streaming the data to on-chip SRAM via DMA. In terms of execution profile, the STM32 application would simply wait until the capture of a single long burst of data finishes, then post-process the data "offline". The CPU could therefore be sleeping during capture, and I don't expect any other significant activity on the bus during capture.
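
For concreteness, here is roughly the shape of the firmware I have in mind, as a sketch only: it assumes an H7-class part, the HAL PSSI driver (HAL_PSSI_Receive_DMA and its HAL_PSSI_RxCpltCallback), and CubeMX-generated plumbing. The hpssi handle, the buffer size, and Error_Handler() are placeholders, and nothing here has been run.

/* Sketch of the intended capture flow: arm a PSSI->SRAM DMA reception,
 * sleep during the burst, post-process afterwards. Assumes CubeMX has
 * generated and configured hpssi (16-bit bus) with its DMA link. */
#include "main.h"                       /* CubeMX header: HAL, Error_Handler() */

#define BURST_HALFWORDS  (16u * 1024u)  /* example burst length (placeholder) */

extern PSSI_HandleTypeDef hpssi;        /* CubeMX-generated PSSI handle */

static uint32_t capture_buf[BURST_HALFWORDS / 2u];  /* 16-bit samples packed */
static volatile uint8_t capture_done = 0;

void start_capture(void)
{
    capture_done = 0;

    /* Arm the DMA reception. Note: I have not yet double-checked whether the
     * Size argument counts data items or bytes for the PSSI driver. */
    if (HAL_PSSI_Receive_DMA(&hpssi, capture_buf, BURST_HALFWORDS) != HAL_OK)
    {
        Error_Handler();
    }

    /* The CPU can sleep during the capture; interrupts (DMA/PSSI, SysTick)
     * wake it, and we simply go back to sleep until the burst is complete. */
    while (!capture_done)
    {
        __WFI();
    }

    /* ... post-process capture_buf "offline" here ... */
}

/* Weak HAL callback, invoked from interrupt context when reception is done. */
void HAL_PSSI_RxCpltCallback(PSSI_HandleTypeDef *h)
{
    (void)h;
    capture_done = 1;
}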

 

Specific concerns are:

1. The PSSI peripheral contains a FIFO, so I'm confident the peripheral itself can keep up with a 15 MHz clock with an HCLK of 100 MHz+, but I'm worried about the possibility of a FIFO overrun due to limited bandwidth on the AHB bus, i.e. the FIFO filling up because DMA can't drain it into SRAM quickly enough (see the back-of-envelope numbers after this list).

 

2. Using a smaller device package (<144 pins) generally means that only 8-bit wide PSSI is available, but those parts may be cheaper and take up less board space. On the FPGA side, it would be easy to switch to an 8-bit bus, though of course this would double the necessary clock rate. I'm wondering whether this is feasible. Basically, I'm looking for reassurance that as long as the AHB frequency is at least "M times" the PSSI_CLK frequency, for some known M, DMA to SRAM can keep the FIFO from filling up.

 

3. Finally, can I avoid flow control altogether and just rely on the peripheral (and FIFO) to be ready? It would not be difficult for the FPGA to honor a PSSI_RDY signal from the PSSI peripheral, but it would be easier still not to use flow-control signals at all.
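
To put rough numbers on concerns 1 and 2 (back-of-envelope only, nothing measured yet): a 16-bit PSSI at 15 MHz needs about 15 MHz x 2 bytes = 30 MB/s, and an 8-bit PSSI at 30 MHz needs the same 30 MB/s, just as twice as many narrower transfers unless the FIFO packs them into wider words. A 32-bit AHB bus at 100 MHz moves at most 100 MHz x 4 bytes = 400 MB/s, so the raw demand is under 10% of the theoretical bus bandwidth; the open question is how much of that a single DMA stream can actually claim once arbitration and per-transfer overhead are accounted for.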

 

Your experienced advice and any related war-stories would be much appreciated,

 

1 ACCEPTED SOLUTION

Accepted Solutions
BarryWhit
Senior

To try and answer my own question, I did a test. I only have a G431 eval board which does not have PSSI, but I still made some progress.

 

Since my concern is that actual performance might be lower than expected due to contention on the bus starving the DMA, the question really becomes "can the DMA peripheral on an STM32 saturate the bus?". For example, is there some kind of fairness mechanism which limits the time slots the DMA is allocated? If the DMA traffic is interleaved with traffic from other peripherals, is there an overhead to initiating each DMA transfer chunk which substantially hurts throughput? Or perhaps some other factor I haven't thought of.

 

To test this, I created an example design which uses a timer's update event to pace a DMA transfer of 16 KB from a peripheral to SRAM (it doesn't matter which register we read; I used GPIO), to simulate the kind of bus activity the PSSI would generate.

 

Since the firing rate of the timer can be stepped up to higher and higher frequencies, as long as the DMA completes the transfer without reporting an overrun, the timer frequency gives a clean throughput measurement.

I used HAL_DMA_Start() and polled until the transfer completed, after which a breakpoint triggers. I then used the CubeIDE memory browser to verify that the destination buffer was overwritten, to make sure data really moved across the bus. A sketch of the test code is below.
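
For reference, the test looked roughly like the following. It's a sketch reconstructed from my project, assuming CubeMX-generated handles: htim6 as the pacing timer and hdma_tim6_up as a DMA channel whose DMAMUX request is the TIM6 update; these names and the Error_Handler() are placeholders from my setup.

/* DMA throughput test on a G4-class part: each timer update event requests
 * one byte-wide DMA transfer from a peripheral register (GPIOA->IDR, the
 * actual register doesn't matter) into SRAM, mimicking PSSI+DMA traffic. */
#include "main.h"                 /* CubeMX header: HAL, Error_Handler() */

#define XFER_BYTES  (16u * 1024u)

extern TIM_HandleTypeDef htim6;        /* pacing timer (update -> DMA request) */
extern DMA_HandleTypeDef hdma_tim6_up; /* channel with DMAMUX request TIM6_UP  */

static uint8_t dst_buf[XFER_BYTES];    /* destination buffer in SRAM */

void run_dma_throughput_test(void)
{
    /* GPIOA clock is assumed enabled (CubeMX does this if any PA pin is used). */
    if (HAL_DMA_Start(&hdma_tim6_up, (uint32_t)&GPIOA->IDR,
                      (uint32_t)dst_buf, XFER_BYTES) != HAL_OK)
    {
        Error_Handler();
    }

    __HAL_TIM_ENABLE_DMA(&htim6, TIM_DMA_UPDATE);  /* route update events to DMA */
    HAL_TIM_Base_Start(&htim6);                    /* start pacing the transfers */

    /* Poll until all 16 KB have moved, then stop and break: inspect dst_buf in
     * the memory browser to confirm that data really crossed the bus. */
    HAL_DMA_PollForTransfer(&hdma_tim6_up, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
    HAL_TIM_Base_Stop(&htim6);
    __BKPT(0);
}

Between runs I shortened the timer period (prescaler/ARR) to raise the request rate until the DMA either kept up or reported a transfer error.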

 

The results were very good. I was able to push the DMA request rate all the way up to HCLK/2 (since a timer has a minimum period of 2 cycles). I used a high HCLK of 150 MHz, but I see no reason why the result should depend on the absolute frequency.

Furthermore, I deliberately set up the DMA to use byte-to-byte transfers. The documentation says that DMA can be used with the PSSI's FIFO to transfer 32 bits at a time across the bus. This could (potentially) reduce the number of bus cycles consumed by a factor of 4 (when using an 8-bit PSSI), if the FIFO itself can "pack" and output 32 bits at a time. Since ST recommends this configuration for better performance, this should be the case.
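
For what it's worth, the DMA alignment I would try for that packed configuration looks like the snippet below. It's only a sketch: the field names come from the HAL's DMA_InitTypeDef, hdma_pssi is a placeholder handle name, and the rest of the DMA/PSSI initialization (request routing, mode, priority) is omitted.

/* Drain the PSSI FIFO with 32-bit bus accesses: word-sized reads of the FIFO
 * register and word-sized writes to SRAM, with only the memory side
 * incrementing. The remaining Init fields would normally come from CubeMX. */
#include "main.h"                       /* CubeMX header: HAL, Error_Handler() */

extern DMA_HandleTypeDef hdma_pssi;     /* placeholder handle name */

void configure_pssi_dma_packing(void)
{
    hdma_pssi.Init.PeriphDataAlignment = DMA_PDATAALIGN_WORD;  /* 32-bit FIFO reads  */
    hdma_pssi.Init.MemDataAlignment    = DMA_MDATAALIGN_WORD;  /* 32-bit SRAM writes */
    hdma_pssi.Init.PeriphInc           = DMA_PINC_DISABLE;     /* fixed FIFO address */
    hdma_pssi.Init.MemInc              = DMA_MINC_ENABLE;      /* walk the buffer    */

    if (HAL_DMA_Init(&hdma_pssi) != HAL_OK)
    {
        Error_Handler();
    }
}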

 

Since the PSSI peripheral has a built-in FIFO, and we've demonstrated that the STM32 bus architecture allows the DMA to transfer data at much higher rates than my application requires, I feel confident that PSSI is a good solution to my requirements. Furthermore, the available DMA bandwidth is well above the PSSI's own throughput limit of roughly HCLK/2.5 (due to the need to synchronize the external inputs), which leaves a lot of margin for my application even if I use a smaller, cheaper part which only has an 8-bit wide PSSI rather than 16 bits. This really looks like an extremely useful peripheral to have.
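
Concretely (my own back-of-envelope, not measured): at HCLK = 100 MHz the HCLK/2.5 limit works out to roughly 40 MHz of PSSI_CLK, while my application needs 15 MHz on a 16-bit bus or about 30 MHz on an 8-bit bus, and the test above showed the DMA keeping up with request rates up to about HCLK/2, so both the peripheral limit and the bus side leave comfortable margin.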

 

I've seen some other threads about trouble implementing PSSI, due to surprising HAL behavior and cache issues. Hopefully I can work my way through those. I'll try to post any lessons learned when the time comes.

 

 
