Speeding up complex computation

megahercas6
Senior
Posted on October 28, 2014 at 19:00

For my university I am making a camera-based laser spot position detector.

By simply calculating the centre of mass, I can get the position of the laser spot on the CCD camera. The problem is that the array is large (1024x1024x16b) and the computation takes a bit too long. With maximum optimization I can get 7 FPS; it would be extremely good to get 20-30 calculations per second. For now, the data is inside SRAM, and the DCMI doesn't do anything.
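For reference, the centre-of-mass calculation can be written out as follows. This is only a sketch of the arithmetic: the symbols I, C, R, x_c and y_c are notation introduced here, where I(x, y) is the pixel value, C_x corresponds to x_eilutes[x] (column sums) and R_y corresponds to y_eilutes[y] (row sums) in the code of this post.

```latex
\begin{aligned}
C_x &= \sum_{y=0}^{1023} I(x, y), &
R_y &= \sum_{x=0}^{1023} I(x, y), \\
x_c &= \frac{\sum_{x=0}^{1023} x \, C_x}{\sum_{x=0}^{1023} C_x}, &
y_c &= \frac{\sum_{y=0}^{1023} y \, R_y}{\sum_{y=0}^{1023} R_y}.
\end{aligned}
```

Both denominators equal the total intensity of the frame, so the per-pixel work is just two additions; the multiplications are needed only 1024 times per axis.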

uint32_t y_eilutes[1024],x_eilutes[1024];   /* row sums and column sums */
uint16_t data[1024];                        /* one image line copied from external SRAM */
float pozicija_x = 0.0f,pozicija_y=0.0f;
uint32_t xx=0,yy=0,pointer = 0;
uint32_t sum_x = 0,sum_y=0;
float coef_x = 0,coef_y=0;
/* calc_x and calc_y are used below but not declared in the posted code */
void pozicija(void)
{
    xx=0,yy=0;
    pointer = 0;
    sum_x = 0;
    sum_y = 0;
    coef_x = 0;
    coef_y = 0;
    while(yy<1024)
    {
        /* copy one line from external SRAM into data[] */
        while(xx<1024)
        {
            data[xx]=(*(__IO uint16_t*) (SRAM_BANK_ADDR + pointer));
            pointer+=2;
            xx++;
        }
        xx=0;
        /* accumulate column sums and the sum of the current row */
        while(xx<1024)
        {
            x_eilutes[xx]+=data[xx];
            y_eilutes[yy]+=data[xx];
            xx++;
        }
        xx=0;
        yy++;
    }
    xx=0; yy=0;
    /* total intensity and index-weighted sums for both axes */
    while(xx<1024)
    {
        sum_x+=x_eilutes[xx];
        sum_y+=y_eilutes[xx];
        coef_x+=(float)xx*x_eilutes[xx];
        coef_y+=(float)xx*y_eilutes[xx];
        xx++;
    }
    pozicija_x=(float)(coef_x/sum_x);
    pozicija_y=(float)(coef_y/sum_y);
    calc_x=(uint16_t)pozicija_x;
    calc_y=(uint16_t)pozicija_y;
    /* clear the accumulators for the next frame */
    xx=0;
    while(xx<1024)
    {
        x_eilutes[xx]=0;
        y_eilutes[xx]=0;
        xx++;
    }
}

So this is my starting point, and it does work fine. Now, since I am not a good programmer, maybe someone can spot how I can speed this up? (The last resort will be rewriting parts of the code in assembler.)

1) The first step is obvious: I need to copy the data to internal memory using DMA or DMA2D. Since DMA2D is a bit simpler, I rewrote the code that copies a line of data to internal SRAM.

This part:
while(xx<1024)
{
    data[xx]=(*(__IO uint16_t*) (SRAM_BANK_ADDR + pointer));
    pointer+=2;
    xx++;
}
was changed to:
void DMA2D_Config(uint32_t offset)
{
    DMA2D_InitTypeDef DMA2D_InitStruct;
    DMA2D_FG_InitTypeDef DMA2D_FG_InitStruct;
    /* Enable the DMA2D Clock */
    RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_DMA2D, ENABLE);
    /* DMA2D configuration */
    //DMA2D_DeInit();
    /* Transfer mode */
    DMA2D_InitStruct.DMA2D_Mode = DMA2D_M2M;
    /* Color mode */
    DMA2D_InitStruct.DMA2D_CMode = DMA2D_RGB565;
    /* Output Address */
    DMA2D_InitStruct.DMA2D_OutputMemoryAdd = (uint32_t) &data[0];
    /* Output Offset */
    DMA2D_InitStruct.DMA2D_OutputOffset = 0;
    DMA2D_InitStruct.DMA2D_NumberOfLine = 1;
    DMA2D_InitStruct.DMA2D_PixelPerLine = 1024;
    /* Initialize the alpha and RGB values */
    DMA2D_InitStruct.DMA2D_OutputGreen = 0;
    DMA2D_InitStruct.DMA2D_OutputBlue = 0;
    DMA2D_InitStruct.DMA2D_OutputRed = 0;
    DMA2D_InitStruct.DMA2D_OutputAlpha = 0;
    /* Initialize the output offset */
    DMA2D_InitStruct.DMA2D_OutputOffset = 0;
    /* Initialize DMA2D */
    DMA2D_Init(&DMA2D_InitStruct);
    DMA2D_FG_StructInit(&DMA2D_FG_InitStruct);
    DMA2D_FG_InitStruct.DMA2D_FGCM = DMA2D_RGB565;
    DMA2D_FG_InitStruct.DMA2D_FGMA = SRAM_BANK_ADDR+offset*2;
    DMA2D_FGConfig(&DMA2D_FG_InitStruct);
}
and inside the while loop:
while(yy<1024)
{
    DMA2D_Config(yy*1024);
    DMA2D_StartTransfer();
    while(DMA2D_GetFlagStatus(DMA2D_FLAG_TC) == RESET);
    while(xx<1024)
    {
        x_eilutes[xx]+=data[xx];
        y_eilutes[yy]+=data[xx];
        xx++;
    }
    xx=0;
    yy++;
}

At this point, I don't know how I can speed the code up further using plain C. Any ideas? DMA2D did give a boost from 7 FPS to 12.5 FPS, but I still need to double the calculation speed.

2) I am using an STM32F429; I will replace it with an ARM Cortex-A7 for better performance, but this can only be done next year.
10 REPLIES
Posted on October 28, 2014 at 19:49

Why does the data need to go through the data[xx] array at all?

Access the SDRAM array with a pointer.
Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
Posted on October 28, 2014 at 21:12

coef_x+=(float)xx*x_eilutes[xx];

Watch out for arithmetic operations involving float and integer types - unless you explicitly convert *all* operands to float, they may be implicitly converted to double, and costly double arithmetic performed. Nonetheless, if you are concerned about performance, you should start by examining the generated assembler. Then it should be evident whether the above problem is the case. JW
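Besides reading the generated assembler, the type of an expression can be checked at compile time with sizeof. The sketch below is only an illustration; the function names are hypothetical and not from this thread. Under the usual arithmetic conversions of standard C, a float multiplied by an integer stays float, while an unsuffixed constant such as 0.5 is a double and promotes the whole expression.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; none of these names come from the thread. */

/* Size of the product as written in the original post: (float) times uint32_t.
   The integer operand is converted to float, so the result type is float. */
size_t product_size_float_cast(void)
{
    uint32_t xx = 3, column_sum = 7;
    return sizeof((float)xx * column_sum);
}

/* An unsuffixed floating constant is a double, so this product is a double. */
size_t product_size_double_constant(void)
{
    float value = 1.0f;
    return sizeof(value * 0.5);
}

/* With the f suffix the constant is a float and the product stays float. */
size_t product_size_float_constant(void)
{
    float value = 1.0f;
    return sizeof(value * 0.5f);
}
```

So the line quoted above should stay in single precision as written; the places to look for accidental double arithmetic are unsuffixed constants and calls to double-precision math functions.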
megahercas6
Senior
Posted on October 29, 2014 at 06:42

I am using data[1024] because it should be a faster way of doing the calculations. I tested the SRAM; it can only do 36 MTps, and I believe internal SRAM can do more than that. I would like to use CCM, but DMA doesn't exactly copy data to that part of memory, so for now I will stick with internal SRAM.

Also, with overclocking I can go up to 225 MHz with no errors; higher than that, I start to get errors in the data transferred to the computer. But overclocking is not exactly an option. Also, can I somehow tell the compiler to keep values in registers, like the array counters and the sum_x, sum_y, coef_x, coef_y numbers? register float coef_x = 0,coef_y=0; does not seem to help. Also, I trimmed the DMA2D config so that only the new address is set; that sped up the program by a few FPS.

void DMA2D_Config(uint32_t offset)
{
    DMA2D_FG_InitStruct.DMA2D_FGMA = SRAM_BANK_ADDR+offset*2;
    DMA2D_FGConfig(&DMA2D_FG_InitStruct);
}

and I keep DMA2D_FG_InitStruct global and don't change it.

Update: just checked; with DMA2D I can get a data rate of around 52.6 MTps, while I can only get 36 MTps with software access. I also have an idea to check the DMA2D transfer counter register (if that exists; I will try to read the manual. I know that I did this trick with a SHARC DSP, and it boosted efficiency many times over). This would allow me to do computations while I am still loading data into internal SRAM, and in theory, when loading is done, I could simply set a new address in DMA2D and start it again.
frankmeyer9
Associate II
Posted on October 29, 2014 at 08:54

i tested SRAM, it can only do 36MTps, i believe internal SRAM can do more than that.

 

As a matter of fact, data transfer bandwidth is an upper limit to your throughput rate.

Have you considered reducing the resolution of the images?
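As a rough worked example of that limit, using the two transfer rates reported earlier in the thread: this assumes that "MTps" means millions of 16-bit transfers per second, that each pixel is read from external memory exactly once, and that the computation itself costs nothing. The symbols N and f_max are notation introduced here.

```latex
\begin{aligned}
N &= 1024 \times 1024 = 1\,048\,576 \ \text{pixels per frame}, \\
f_{\max}^{\text{software}} &= \frac{36 \times 10^{6}}{N} \approx 34.3 \ \text{frames/s}, \\
f_{\max}^{\text{DMA2D}} &= \frac{52.6 \times 10^{6}}{N} \approx 50.2 \ \text{frames/s}.
\end{aligned}
```

Under these assumptions the 20-30 FPS target sits below both ceilings, but any second read of the same pixel from external memory would halve them.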

megahercas6
Senior
Posted on October 29, 2014 at 09:23

No, I must use all the data to get the precision I need. (Also, trying to decide which part of the SRAM will yield the same result will take time as well.)

Also, storing line data in internal SRAM for the computations means that I only read external SRAM once.

ivani
Associate II
Posted on October 29, 2014 at 11:21

It is not realistic to expect a speed improvement of 3 - 5 times as you require, but some measures could be taken:

1) Consider unrolling the loops (say, by a factor of 4, or even more).

2) Data that is used in several operations could be assigned to temporary variables. For example, y_eilutes[yy]+=data[xx]; could use a temporary variable for y_eilutes[yy], as the index changes only once per 1024 operations.

3) You could avoid copying to the internal RAM at all: the buffer data[xx] is generally redundant. Instead, you can read each sample from the external RAM only once into a temporary variable and use it in all calculations. Another approach, if you still want to copy the data by DMA, is to use two alternating buffers: while the DMA is copying data into one of them, you can process the data from the other one.

4) Interleave the floating-point operations with integer ones. The VFP inserts an extra cycle for operations that use the result of the previous instruction. Note that casting to float also involves the VFP.
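Points 1) to 3) can be combined in a single inner loop. The following is a minimal sketch under stated assumptions, not a tested STM32 implementation: the function accumulate_sums and its parameters are hypothetical, the frame pointer stands in for the external memory at SRAM_BANK_ADDR, and the width is assumed to be a multiple of 4 (1024 is).

```c
#include <stdint.h>

/* Hypothetical helper, not from the thread. Accumulates column sums and row
   sums in one pass over the frame. Both output arrays must be zeroed by the
   caller, and width must be a multiple of 4. */
void accumulate_sums(const volatile uint16_t *frame,
                     uint32_t width, uint32_t height,
                     uint32_t *column_sums, uint32_t *row_sums)
{
    for (uint32_t row = 0; row < height; row++) {
        const volatile uint16_t *line = frame + row * width;
        uint32_t row_total = 0;  /* temporary instead of row_sums[row] (point 2) */

        /* Loop unrolled by a factor of 4 (point 1). */
        for (uint32_t col = 0; col < width; col += 4) {
            /* Each sample is read from memory exactly once (point 3). */
            uint32_t p0 = line[col];
            uint32_t p1 = line[col + 1];
            uint32_t p2 = line[col + 2];
            uint32_t p3 = line[col + 3];

            column_sums[col]     += p0;
            column_sums[col + 1] += p1;
            column_sums[col + 2] += p2;
            column_sums[col + 3] += p3;
            row_total += p0 + p1 + p2 + p3;
        }
        row_sums[row] += row_total;  /* one store per row instead of one per pixel */
    }
}
```

For the 1024x1024 frame in this thread, a row total is at most 1024 * 65535, which still fits in a uint32_t. Whether the unrolling actually helps depends on the compiler and optimization level, so it is worth checking the generated assembler.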
megahercas6
Senior
Posted on October 29, 2014 at 12:57

Hi, thank you for the answer.

At this point, with no overclocking, I have around 13 FPS (16.5 FPS with overclocking to 225 MHz). If I could reach 20 FPS, that would be great.

1) I will try to move as many variables as possible from arrays to temp variables, but I don't see how I can avoid using the data[] array for computations, since it looks like I would have to read external SRAM more times. This solution looks very elegant.

2) I was trying to manipulate variable types; I can use f32, u64, u32 in some places, but the impact on performance is very small, in the 0.05 FPS range.

3) A dual buffer looks like a good idea, but I will try to do that with a single buffer by polling the data transfer count register and doing the math while still loading data (I will look at this later, when I have time and an oscilloscope for FPS counting). If the calculation is faster than the data transfer from external SRAM, a dual buffer will give only a very minor performance gain (if any).

Danish1
Lead II
Posted on October 29, 2014 at 14:29

Worst-case, how small might your laser spot be? And how large might it be?

You might vastly reduce the necessary number of calculations if you can do the operation in two passes.

The first, with a relatively coarse grid, finds the laser spot. Of course the grid has to be sufficiently fine to guarantee that you find the spot.

The second, with a fine grid but only looking around where you have found the spot, accurately locates the spot.

You could make further optimisations if you know the spot to be reasonably Gaussian, e.g. by having the second scan consist only of horizontal and vertical lines crossing at the approximate location of the spot found in stage 1.

Hope this helps,

Danish
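The two-pass idea described in this post could be sketched roughly as below. This is an illustration under assumptions, not code from the thread: the type SpotPosition, the function locate_spot_two_pass and its parameters step and half_window are all hypothetical. It also assumes that the spot is the brightest feature in the frame and that step is small enough for the coarse grid to hit it.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    float x;
    float y;
} SpotPosition;

/* Hypothetical two-pass search: a coarse grid finds the brightest sample,
   then a centre of mass is computed only in a window around it. */
SpotPosition locate_spot_two_pass(const uint16_t *frame,
                                  uint32_t width, uint32_t height,
                                  uint32_t step, uint32_t half_window)
{
    /* Pass 1: sample every step-th pixel in both directions. */
    uint32_t peak_x = 0, peak_y = 0;
    uint16_t peak = 0;
    for (uint32_t y = 0; y < height; y += step) {
        for (uint32_t x = 0; x < width; x += step) {
            uint16_t value = frame[y * width + x];
            if (value > peak) {
                peak = value;
                peak_x = x;
                peak_y = y;
            }
        }
    }

    /* Pass 2: window around the coarse peak, clamped to the frame. */
    uint32_t x0 = (peak_x > half_window) ? peak_x - half_window : 0;
    uint32_t y0 = (peak_y > half_window) ? peak_y - half_window : 0;
    uint32_t x1 = (peak_x + half_window < width) ? peak_x + half_window : width - 1;
    uint32_t y1 = (peak_y + half_window < height) ? peak_y + half_window : height - 1;

    /* 64-bit accumulators so the weighted sums cannot overflow. */
    uint64_t total = 0, moment_x = 0, moment_y = 0;
    for (uint32_t y = y0; y <= y1; y++) {
        for (uint32_t x = x0; x <= x1; x++) {
            uint64_t value = frame[y * width + x];
            total += value;
            moment_x += (uint64_t)x * value;
            moment_y += (uint64_t)y * value;
        }
    }

    /* Fall back to the coarse peak if the window is completely dark. */
    SpotPosition position = { (float)peak_x, (float)peak_y };
    if (total != 0) {
        position.x = (float)moment_x / (float)total;
        position.y = (float)moment_y / (float)total;
    }
    return position;
}
```

With a step of 8 on a 1024x1024 frame, the coarse pass touches 128 x 128 = 16384 pixels instead of about a million. The precision of the result then depends on the window being large enough to contain the whole spot and on the background being low compared with the spot.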
Posted on October 29, 2014 at 15:42

Again, why does it need to be copied? You can read the stuff once into a local/register variable and do the math directly.

while(yy<1024)
{
    while(xx<1024)
    {
        data[xx]=(*(__IO uint16_t*) (SRAM_BANK_ADDR + pointer));
        pointer+=2;
        xx++;
    }
    xx=0;
    while(xx<1024)
    {
        x_eilutes[xx]+=data[xx];
        y_eilutes[yy]+=data[xx];
        xx++;
    }
    xx=0;
    yy++;
}
xx=0; yy=0;

Becomes

while(yy<1024)
{
    while(xx<1024)
    {
        uint16_t datahold;
        datahold=(*(__IO uint16_t*) (SRAM_BANK_ADDR + pointer));
        x_eilutes[xx]+=datahold;
        y_eilutes[yy]+=datahold;
        pointer+=2;
        xx++;
    }
    xx=0;
    yy++;
}
xx=0; yy=0;
