2008-10-08 08:33 PM
STR910 Instruction and GPIO speed
2011-05-17 12:35 AM
Hi All
I have seen a few discussions about instruction speed and the speed of toggling GPIOs. However, the discussions are not very thorough and leave many open questions, including actual results and their relevance. Since I have to deliver results at a critical stage in a project, I have spent some time doing real measurements and am having difficulty working out whether they represent optimum speed or whether some setting is causing slower results than expected.

Test set up: The testing involves measuring a GPIO output pin. The benchmark reference is the ST presentation, where a 12MHz toggle speed is stated as achievable. As well as determining the possible toggle speed, the instruction speed from FLASH and SRAM was also measured.

1. Setup.
Running on STR912F with PLL set to 48MHz. No other dividers activated as far as I am aware. [verification - speed when running from the 25MHz oscillator was about half that from the PLL, and various additional dividers did decrease the speed accordingly - the dividers were all removed for the measurements below]

2. Test 1.
Raw GPIO toggle speed based on a sequence of assembler instructions optimised for one instruction per output state change:

    str r2,[r0,#0]    ; set '1'
    str r1,[r0,#0]    ; set '0'
    str r2,[r0,#0]    ; set '1'
    str r3,[r0,#0]    ; set '0'
    str r4,[r0,#0]    ; set '1'

The period between '0' and '1' was measured as:
- 185ns when running from FLASH
- 185ns when running from SRAM with no wait states
- 210ns when running from SRAM with wait states

The results were identical with or without buffered peripherals (this is contrary to statements in other postings?). The difference between buffered and unbuffered accesses is understood to be basically whether 0x4800xxxx or 0x5800xxxx addresses are used.

This gives a toggle frequency of 5,4MHz according to the period, or 2,7MHz when measured as the generated square wave frequency (it is not clear how the ST value is defined). Assuming that the speed will be doubled at the max. 96MHz, this gives 10,8MHz (or 5,4MHz), which is either a little less than the stated 12MHz or a little less than half of it.

It seems as though the toggle speed is not identical to the instruction execution speed in this case but is limited to some extent by the port access hardware (see the instruction speed measurement in the next point).

3. Instruction speed
To interpret the speed of instruction execution, a small loop was placed between two of the toggles. A variable was incremented in a register and the resulting loop caused a total of 65 instructions in Thumb mode to be executed (I don't think that the mode (ARM or Thumb) is actually relevant for the instruction speed test). By measuring the time increase between the GPIO changes and dividing it by the total number of instructions, the single instruction execution time was calculated.

- Time for 65 instructions when running in FLASH = 10,2us
- Time for 65 instructions when running in SRAM with wait states = 5,33us
- Time for 65 instructions when running in SRAM without wait states = 3,97us

The instruction times are therefore 157ns / 82ns / 61ns or, expressed in instructions per second, 6.4M / 12M / 16.4M.

The results suggest that the instruction speed from SRAM could be faster than the GPIO toggle speed, so the port accesses are probably slowing things down. Since the PLL speed was 48MHz, it suggests that about 3 or more clocks are required to execute one instruction.

Now these are the measurement results, and everyone knows that measurement results have to be treated with great care because they may not be accurate. This is the main reason why I want to show them here. I was expecting the instruction speed to be equal to the PLL speed, but the results deviate by a factor of about 3 or more (depending on where the code is running). The fact that it can be slower is not the point, because this is clear from the way the FLASH and its queue operate.

Another way of stating the results is: what am I doing wrong to not measure a faster instruction speed? If there is an incorrect chip setting, what is it (or what could it be)? Are the GPIO results accurate (same basic question about settings)? If we assume some measurement inaccuracy and that the actual factor between my clock, instruction and GPIO toggle results and the ST stated maximum is 2, where can this halving of the speed be coming from?

Many thanks for any serious analysis and suggestions!!

Regards
Mark Butcher
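The two ways of quoting the toggle speed above (5,4MHz versus 2,7MHz) differ only in whether one edge or one full square-wave cycle is counted. A minimal sketch of that arithmetic in C; the function names are made up for illustration, and the extrapolation to 96MHz assumes the number of clocks per store stays the same, which the measurements do not confirm:

```c
#include <assert.h>

/* Edges per second, in MHz, for a given edge-to-edge period in ns. */
double edge_rate_mhz(double edge_period_ns)
{
    assert(edge_period_ns > 0.0);
    return 1000.0 / edge_period_ns;
}

/* Square-wave frequency in MHz: one full cycle spans two edge periods. */
double square_wave_mhz(double edge_period_ns)
{
    return edge_rate_mhz(edge_period_ns) / 2.0;
}

/* Linear extrapolation to another core clock.
 * Assumption: clocks per store do not change with the clock setting. */
double scale_rate_mhz(double rate_mhz, double from_clk_mhz, double to_clk_mhz)
{
    assert(from_clk_mhz > 0.0);
    return rate_mhz * (to_clk_mhz / from_clk_mhz);
}
```

For the 185ns measurement this gives roughly 5.4MHz as an edge rate and 2.7MHz as a square wave; scaled from 48MHz to 96MHz these become roughly 10.8MHz and 5.4MHz.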
Going to ask just in case: are you enabling the buffered mode in the CP15 register?

    MRC p15, 0, r0, c1, c0, 0    /* Read CP15 register 1 into r0 */
    ORR r0, r0, #0x8             /* Enable Write Buffer on AHB */
    MCR p15, 0, r0, c1, c0, 0    /* Write CP15 register 1 */

Cheers
sjo
Hi sjo
Thanks for the tip. I looked around for this setting and found it in the start-up assembler file. It is NOT activated, so I will change this and repeat.

I also see that the start-up code is activating the wait states in SRAM. Do you know when and whether this is necessary? If I go to 96MHz, will the wait states then be necessary or are they superfluous?

Also, do you know why one would want to disable the buffered operation by default? Are there risks or power consumption increases that cause ST to default it off in the start-up?

I will update the report once I have re-measured.

Regards
Mark
Hi All
I have an update after testing with buffered mode enabled.

Buffered mode not enabled:

Port toggling:
- 185ns when running from FLASH
- 185ns when running from SRAM with no wait states
- 210ns when running from SRAM with wait states

- Time for 65 instructions when running in FLASH = 10,2us
- Time for 65 instructions when running in SRAM with wait states = 5,33us
- Time for 65 instructions when running in SRAM without wait states = 3,97us

The instruction times are therefore 157ns / 82ns / 61ns or, expressed in instructions per second, 6.4M / 12M / 16.4M.

Buffered mode enabled:

Port toggling:
- 168ns when running from FLASH (GPIO accesses in buffered space)
- 132ns when running from buffered SRAM space with no wait states
- 130ns when running from non-buffered SRAM space with no wait states
- 126ns when running from D-TCM SRAM space

- Time for 65 instructions when running in FLASH = 10,7us
- Time for 65 instructions when running in buffered SRAM space with wait states = 4,79us
- Time for 65 instructions when running in non-buffered SRAM space without wait states = 5,39us
- Time for 65 instructions when running in D-TCM SRAM space without wait states = 3,89us

This gives the best GPIO and instruction performance when running in D-TCM SRAM space and using buffered GPIO access. However, the relationships are still not clear - can anyone shed light on exactly what is going on?

The present best settings are therefore achieving GPIO toggling in 126ns and about 17M instructions per second at 48MHz.

Testing at 96MHz has not worked so far. The PLL locks, but as soon as the PLL is selected as the clock, FLASH memory accesses no longer seem to be accurate and the code crashes. Any ideas?

Regards
Mark
Hi All
Another result which is interesting. The 65 instruction test was a small loop:

    register volatile int x = 0;
    while (x < 10) x++;

Now I have straightened out the loop:

    register volatile int x = 0;
    x++;
    x++;
    x++;
    x++;
    x++;
    etc.

Now I am measuring the time for 58 Thumb instructions.

- From FLASH - 7,26us - 8M instructions per second at 48MHz
- From SRAM - 2,8us - 20M instructions per second at 48MHz (zero wait state)

This again shows quite a large difference between operation from FLASH and SRAM (is the factor of 2,5 expected, or could it indicate a problem with settings somewhere?)

The performance out of SRAM is a bit better now without the loop, but should I not be expecting more? Is there an explanation for this?

Regards
Mark
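The instructions-per-second figures above follow from dividing the instruction count by the measured time. A minimal sketch of that arithmetic in C; the function names are made up for illustration, and the clocks-per-instruction figure assumes the core really is running at the 48MHz PLL clock:

```c
#include <assert.h>

/* Millions of instructions per second for `count` instructions
 * executed in `time_us` microseconds. */
double instr_rate_mips(unsigned count, double time_us)
{
    assert(time_us > 0.0);
    return (double)count / time_us;
}

/* Average core clocks per instruction.
 * Assumption: clk_mhz is the clock the core actually runs at. */
double clocks_per_instr(unsigned count, double time_us, double clk_mhz)
{
    assert(count > 0);
    return time_us * clk_mhz / (double)count;
}
```

With the straightened-out loop, 58 instructions in 2,8us comes out at roughly 20.7M instructions per second, or about 2.3 clocks per instruction; the FLASH figure of 7,26us gives about 8M instructions per second and about 6 clocks per instruction.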
What rev silicon are you using?
What is your setup for 96 MHz? I have not found any problems.

Regards
sjo
Hi
I have an improvement by enabling the PFQBC, which was being disabled in the ST start-up file 91x_init.s:

    ; --- Enable 96K RAM
    LDR R0, =SCRO_AHB_UNB
    LDR R1, =0x0196        ; <--- sets SRAM wait states and disables PFQBC
    STR R1, [R0]

Now the straight-line instruction performance in FLASH has improved to the same as in SRAM - 20 MIPS at 48MHz.

This is better and shows that the problems are probably still set-up related. Since in this case it is the ST standard start-up code that disables it, this must in fact be quite a common problem for beginners(?)

There must be some more secret bits to set and/or clear to get the device to operate as fast as originally expected... where can they be hiding? I wonder how many times I have already re-read the user's manual? Anyone know more?

BTW. Clock / PLL register setups:

    00020000 = SCU_CLKCNTR
    000bc019 = SCU_PLLCONF

(With the value 0xac019 it locks to 96MHz but the program crashes - it cannot read correctly from FLASH? However, I could previously run at 96MHz before playing around with other stuff...)

Chips are marked with 610 - I think that this is Rev. D.

Regards
Mark
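For comparing the two SCU_PLLCONF values quoted above, here is a rough decoder sketch in C. Note the assumptions, which are not taken from this thread and should be checked against the STR91x reference manual: PLL enable in bit 19, P in bits 18:16, N in bits 15:8, M in bits 7:0, and fPLL = 2 * N * fOSC / (M * 2^P). The type and function names are made up for illustration, and the 25MHz oscillator is the one mentioned in the first post.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SCU_PLLCONF field layout (verify against the reference manual). */
typedef struct {
    unsigned enabled; /* bit 19     */
    unsigned p;       /* bits 18:16 */
    unsigned n;       /* bits 15:8  */
    unsigned m;       /* bits 7:0   */
} pllconf_t;

pllconf_t pllconf_decode(uint32_t reg)
{
    pllconf_t f;
    f.enabled = (reg >> 19) & 0x1u;
    f.p       = (reg >> 16) & 0x7u;
    f.n       = (reg >> 8)  & 0xFFu;
    f.m       = reg & 0xFFu;
    return f;
}

/* Assumed formula: fPLL = 2 * N * fOSC / (M * 2^P).
 * 64-bit arithmetic avoids overflow in the numerator. */
uint64_t pll_output_hz(uint32_t reg, uint64_t fosc_hz)
{
    pllconf_t f = pllconf_decode(reg);
    assert(f.m != 0);
    return (2u * (uint64_t)f.n * fosc_hz) / ((uint64_t)f.m << f.p);
}
```

Under these assumptions 0x000bc019 decodes to N = 192, M = 25, P = 3, which gives 48MHz from a 25MHz oscillator, and 0x000ac019 differs only in P = 2, which gives 96MHz. That matches the two clock speeds reported, so the difference between the working and the crashing setup appears to be the P divider alone.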
mjbcswitzerland wrote:
"A variable was incremented in a register and the resulting loop caused a total of 65 instructions in Thumb mode to be executed (I don't think that mode (ARM or Thumb) is actually relevant for the instruction speed test)."

I disagree. My tests showed that at 96 MHz the Thumb code is faster than ARM code by 42%. Hand-crafted assembly code. No compiler magic.

mjbcswitzerland wrote:
"If we assume some measurement inaccuracy and the actual factor between clock and instruction and GPIO toggle to ST stated maximum is a factor of 2, where can this half speed reduction be coming from???"

Those who know aren't telling, and those who don't know resort to guessing. Based on the public domain opinions scattered all over the Internet, plus doing my homework, my guess is that ST's marketing and engineering are disconnected, and you and I are "early adopters" (i.e. volunteer beta testers).

What does your local ST FAE have to say about your tests?
Just FYI, I am in the same boat as Mark at this point.
The fastest I can toggle the GPIO is 124 ns between edges. Here is my loop (toggling P6.0):

    while (1)
    {
        *(U32*)(0x4800C004) = 0x00000000;
        *(U32*)(0x4800C004) = 0x00000001;
        *(U32*)(0x4800C004) = 0x00000000;
        *(U32*)(0x4800C004) = 0x00000001;
    }

The C optimizer translates this into 4 STR and 1 B, so this is well optimized. At 48MHz, 124 ns translates into 6 cycles. I would have expected one STR to consume either 1 cycle or 3 cycles, but not 6.

In the STR91x library, you can modify 91x_init.s and enable this define to get the buffering to work.

    #define BUFFERED_Mode ; Work on Buffered mode, when enabling this define

All other clocks are 1:1 with MCLK.

Also, I can't run at 96MHz. It just crashes when MCLK switches to the PLL.

-Mark 2
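The "124 ns translates into 6 cycles" step is just the edge period multiplied by the clock frequency. A minimal sketch in C, with made-up function names, assuming the scope reading and the 48MHz MCLK are both accurate:

```c
#include <assert.h>

/* Number of core clocks that fit in period_ns at clk_mhz.
 * ns * MHz = 1e-9 s * 1e6 /s, hence the division by 1000. */
double period_to_clocks(double period_ns, double clk_mhz)
{
    assert(period_ns >= 0.0 && clk_mhz > 0.0);
    return period_ns * clk_mhz / 1000.0;
}

/* Same value rounded to the nearest whole clock. */
long nearest_clocks(double period_ns, double clk_mhz)
{
    return (long)(period_to_clocks(period_ns, clk_mhz) + 0.5);
}
```

For 124 ns at 48MHz this gives about 5.95, i.e. 6 clocks per store. The same calculation puts the 185 ns unbuffered result earlier in the thread at about 8.9 clocks.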