Announcement: New logic analyzer firmware for STM32F103, plus ARM assembly questions

thanks4opensource · 2020-11-06

Logic analyzer (and more) firmware:

https://github.com/thanks4opensource/buck50

buck50: Test and measurement firmware for "Blue Pill" STM32F103

buck50 is open-source firmware that turns a "Blue Pill" STM32F103 development board (widely available for approx. US$1.50) into a multi-purpose test and measurement instrument, including:

8 channel, 6+ MHz logic analyzer
Live monitoring and logging of digital, analog, USART (sync/async), SPI (MOSI/MISO), and I2C data
Simple dual-channel approx. 1 MHz digital storage oscilloscope, approx. 5K sample buffer depth (10K if single channel)
3 channel digital pulse train generator with user-defined frequency and per-channel duty cycle and polarity
Bidirectional bridge/converter from USART/UART (async/sync), SPI (master/slave), or I2C ... to USB ... to host terminal, UNIX socket, or UNIX pty device file
8-bit parallel output counter (binary or gray code)
Host terminal ASCII or binary input data to 8-bit parallel output
Firmware written in a combination of C++ and ARM Thumb-2 assembly code, with several non-standard hacks^H^H^H^H^Htechniques of possible general interest to advanced software developers
Python host driver program with comprehensive inline help system and usability features
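
For readers unfamiliar with the gray-code option of the parallel output counter listed above, here is a minimal C++ sketch of the standard binary-reflected Gray code conversion. It is an illustration only and is not taken from the buck50 source; the names `to_gray` and `bits_changed` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from buck50): binary-reflected Gray code
// of an 8-bit binary count.
constexpr std::uint8_t to_gray(std::uint8_t n) {
    return static_cast<std::uint8_t>(n ^ (n >> 1));
}

// Hypothetical helper: number of bit positions in which a and b differ.
constexpr int bits_changed(std::uint8_t a, std::uint8_t b) {
    int count = 0;
    for (unsigned diff = static_cast<unsigned>(a ^ b); diff != 0; diff >>= 1) {
        count += static_cast<int>(diff & 1u);
    }
    return count;
}
```

The point of counting in Gray code is that consecutive values differ in exactly one output bit, which `bits_changed(to_gray(i), to_gray(i + 1))` lets you check.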

Download, flash, use, enjoy, report bugs -- or ignore -- at your pleasure.

ARM assembly questions:

1) Why does Cortex-M3 code on the STM32F103 take more cycles per instruction than is documented in the ARM Cortex-M3 Processor Technical Reference Manual? Before answering, I'd encourage looking at https://github.com/thanks4opensource/buck50#why_so_slow for more details on the subject. Including: Why does the code run faster when executing from flash than from RAM? Yes, I didn't believe it either, but that's what I found. Independent testing and/or results to the contrary also appreciated.

2) Doing a "C" longjmp to exit an interrupt handler. Again, see https://github.com/thanks4opensource/buck50#longjmp_from_interrupt_handler for details. If any potential responses are, "Yes, of course, everyone knows how to do that", please provide links and I'll kick myself for not finding them and wasting much time and pain figuring it out myself.
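
For anyone who hasn't used it, the plain C library mechanism behind question 2 can be sketched on a host PC as below. This is not the buck50 technique: an ordinary function stands in for the interrupt handler, and all names are hypothetical. As far as I understand, a real Cortex-M3 handler that simply called `longjmp()` would leave the core in Handler mode, which is the extra problem the README section linked above deals with.

```cpp
#include <cassert>
#include <csetjmp>

// Hypothetical names throughout; host-side illustration, not buck50 code.
static std::jmp_buf g_exit_point;      // context saved by setjmp()
static volatile int g_iterations = 0;  // volatile so it is not cached across the jump

// Stand-in for an interrupt handler. A real Cortex-M3 handler also needs
// exception-return handling, which this sketch does not attempt.
void fake_interrupt_handler() {
    std::longjmp(g_exit_point, 1);
}

// Runs a loop with no normal exit; the only way out is the longjmp()
// made by the stand-in handler.
int run_until_interrupted(int fire_after) {
    assert(fire_after > 0);
    g_iterations = 0;
    if (setjmp(g_exit_point) != 0) {
        return g_iterations;           // arrived here from the "handler"
    }
    for (;;) {
        g_iterations = g_iterations + 1;
        if (g_iterations == fire_after) {
            fake_interrupt_handler();  // simulates the interrupt firing
        }
    }
}
```

The attraction for a tight sampling loop is that the loop body carries no termination test of its own; the exit is imposed from outside.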

Peter BENSCH · 2020-11-06

Wow, amazing, and very good documentation too!

Thanks for sharing, good job!

Best regards

/Peter

Piranha · 2020-11-06

> Unfortunately, the TRM is work of complete fiction.

Absolutely not! It just documents the CPU core, not the whole MCU, which ARM cannot document because ARM doesn't have such a product.

> One possible explanation I've seen online is that the cycle counts listed in the TRM are for CPU execution only and do not include additional time required for memory accesses and other issues related to the non-ARM logic that ST and other licensees add to their chips.

Look at the performance of the STM32F1 vs. the STM32F2. They have the same core, but the F2 has twice the CoreMark/MHz. That's because of the cache (ART accelerator).

thanks4opensource · 2020-11-06

Thanks, /Peter. Coming from someone inside ST that's quite nice to hear.

And, yes, the documentation took almost as long to write as the code. Well, not really, but it sure felt that way. 😁

thanks4opensource · 2020-11-06

> Absolutely not! It just documents the CPU core, not the whole MCU, which ARM cannot document because ARM doesn't have such a product.

>> One possible explanation I've seen online is that the cycle counts listed in the TRM are for CPU execution only and do not include additional time required for memory accesses and other issues related to the non-ARM logic that ST and other licensees add to their chips.

Yes, as per the line you quoted from my README doc, I eventually came to that conclusion. My apologies if this was obvious to everyone else besides me.

The fact remains that the ARM "core only" timings are interesting from an intellectual standpoint but of little use to an application programmer trying to design/optimize their code. I realize that static code analysis is very difficult on modern CPUs/systems, but believe that a) the STM32F1xx/Cortex-M3 is comparatively simple, and b) regardless of the complexity, the systems are deterministic.

So, given that ARM can't know what silicon their licensees will place around their cores, does ST document the cycles-per-instruction for their products? I can't find anything in the reference manual, datasheet, or errata. Even if too complex to be completely described, any starting point would be welcome. I spent a lot of time writing test code to reverse-engineer this information.

And, yes, there is documentation about flash wait states and the various buses (AHB, APB1, APB2) and their prescalers. Do you feel that these, plus the ARM core timings, suffice for estimating the real-world cycles/clock performance?

waclawek.jan · 2020-11-07

Fascinating work, thanks.

Re waitstates/performance: there's no single concise work, for many reasons as already discussed above (the root reason being that these chips are indeed VERY complex at many levels, and this complexity evolved over a very long time). In the realm of 32-bitters, the universal answer to this question is handwaves and the "good enough" saying. This is the price we pay for trying (successfully) to jump Moore's law, which has already come to a near stop.

"Why runs faster from FLASH than RAM" in the 'F1's case is probably because it's old (see also below). The root cause is probably the one extra cycle when the processor reads through the S port (i.e. accesses mapped at 0x2000'0000 to 0xDFFF'FFFF), which is where RAM on the 'F1 is mapped. This was discussed in connection with the Technical Updates (https://community.st.com/s/question/0D50X00009Xkba4SAB/stm32-technical-updates) back when they appeared; one of them had a benchmark for the then-new 'F4 revolving around exactly this. The S-bus is heavily loaded, which is why there is the extra waitstate; I don't have a pointer to that information at hand. An additional possible cause is the handshake between the processor and the bus arbiter when it starts to access a particular bus in the busmatrix, but this is unfortunately entirely undocumented.

As Piranha said above, any other CM3/CM4-based STM32 would be better than the ancient 'F1. I understand the appeal of the cheap bluepill, but the bad reputation of its producers, who use pulled/damaged/cloned chips, will transfer to your work, as things possibly won't work as intended. I'd recommend having a look at some of the DISCOs: 'F3, 'F4, 'L4. The smaller Nucleos don't have the target's USB brought out, so that is sort of a showstopper.

Thanks again for sharing your work.

JW

PS. You should consider tea attachment, too, for diversity reasons.

PS2. Digging through some older threads (https://community.st.com/s/question/0D50X00009fFJSC/my-results-when-examining-interrupt-latency-of-a-stm32f439; thanks, ST, for insisting on the crappiest javascriptoid stuff instead of proper forum software, resulting in restarts and failed migrations and link rot), I came up with:

12.5.6. Pipelined instruction fetches

Note

To provide a clean timing interface on the System bus, instruction and vector fetch requests to this bus are registered. This results in an additional cycle of latency because instructions fetched from the System bus take two cycles. This also means that back-to-back instruction fetches from the System bus are not possible.

12.5.6. Pipelined instruction fetches

Note

Instruction fetch requests to the ICode bus are not registered. Performance critical code must run from the ICode interface.
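
To make the two quoted notes concrete, here is a rough C++ sketch that classifies an address by the bus an instruction fetch would use and returns the best-case fetch cost the notes imply. The System bus range is the one given earlier in this post; treating everything below 0x2000'0000 as the Code region (ICode for fetches) follows the Cortex-M3 memory map. All names are hypothetical, and flash wait states, prefetch, and bus contention are deliberately ignored, so these are lower bounds rather than real timings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification, for illustration only.
enum class FetchBus { Code, System, Other };

// Map an address to the bus used for instruction fetches.
constexpr FetchBus fetch_bus(std::uint32_t address) {
    if (address < 0x20000000u) return FetchBus::Code;     // flash lives here on the 'F1
    if (address <= 0xDFFFFFFFu) return FetchBus::System;  // SRAM lives here on the 'F1
    return FetchBus::Other;                               // not modelled in this sketch
}

// Best-case cycles per instruction fetch according to the quoted notes:
// System bus fetches are registered and take two cycles, ICode fetches
// are not registered. Wait states and contention are NOT included.
inline int min_fetch_cycles(FetchBus bus) {
    assert(bus != FetchBus::Other);
    return bus == FetchBus::System ? 2 : 1;
}
```

For example, `fetch_bus(0x08000000)` (flash) gives `FetchBus::Code` and `fetch_bus(0x20000000)` (SRAM) gives `FetchBus::System`, which is consistent with RAM execution paying an extra cycle per fetch.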

ARM's documentation is an entirely different level of mess, and besides moving it around on their crappy website ("modern look and feel") they also tend to remove information from the manuals over time, so this info may already be absent from modern incarnations of the CM3 TRM.

thanks4opensource · 2020-11-07

Thanks for the compliments, JW. Coming from you, they mean a lot to me.

I went through several levels of forum and external links starting with the ones you provided (informative, as you always are). The bottom line seems to be that I'm not going to find a definitive table of ST-specific instruction timings, even with all the caveats such a listing would require ("on the APB2 bus", "except if interleaved with non-cached flash accesses", "extra wait state if DMA contention", etc).

Yes, due to the poor excuse for forum software we have here, the Technical Update attachments seem to be permanently lost, and I can't find them elsewhere searching the web. Thanks for the important info that the F1 accesses to RAM in the 0x20000000 bank require an extra cycle. That goes a long way to explaining some of the results I found in my tests. Also the registered system bus fetches.

I know you and others understand the price/performance appeal of the "Blue Pill" boards, and I do know how bad these Chinese clones are. But my "buck50" project is solidly in the free open source arena -- any problems users have with it fall under "caveat emptor". I started the code because I was walking around one day thinking about the 72MHz clock speed and the ARM cycle timings (which I naively misunderstood) and went, "Hey! I could turn this thing into an 8 MHz logic analyzer" -- with the various hacks I described in the README. Kind of an Edmund Hillary "because it is there" moment.
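
The back-of-envelope arithmetic behind that thought can be sketched in C++ as follows. The 72 MHz clock and the 8 MHz target are from the paragraph above, and the 6 MHz figure is from the announcement; the function names are hypothetical, and the result is only the cycle budget per sample, not a claim about how many cycles the actual buck50 sampling loop takes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: whole CPU cycles available per sample (rounded down).
inline std::uint32_t cycles_per_sample(std::uint32_t cpu_hz, std::uint32_t sample_hz) {
    assert(sample_hz != 0);
    return cpu_hz / sample_hz;
}

// Hypothetical helper: best sample rate for a loop costing loop_cycles per sample.
inline std::uint32_t max_sample_hz(std::uint32_t cpu_hz, std::uint32_t loop_cycles) {
    assert(loop_cycles != 0);
    return cpu_hz / loop_cycles;
}
```

At 72 MHz, an 8 MHz sample rate leaves 9 cycles per sample, while 6 MHz leaves 12, so a few unexpected extra cycles per instruction are enough to account for the difference.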

Of course I then got caught between my own predilection for adding more capabilities (again, because the hardware "is there") and the usual disaster of trying to write clean/efficient/register-level code given the state of ST's documentation. Getting I2C slave-to-master TX working was hell (the RM, errata, app notes, optimized examples, and HAL source all conflict with each other), and so was getting ADC DMA fully working in all its modes (ADC->SR.EOC does *not* get set when using DMA in continuous mode -- you have to read DMA->ISR.TCIF1 instead), along with many other such undocumented pitfalls for the unwary. But you've been through all this for longer than I have. ;) ... or maybe :(
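
The ADC pitfall mentioned above can be sketched in C++ roughly as follows. The struct is a host-side stand-in for two DMA1 registers so the logic can be compiled and exercised off-target; it is not the CMSIS header layout. The bit positions (TCIF1 and CTCIF1 both at bit 1) are as I read the STM32F1 reference manual and should be checked against your own headers, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for the two DMA1 registers used here (assumption:
// NOT the real peripheral layout, just enough to exercise the logic).
struct FakeDmaRegs {
    volatile std::uint32_t ISR;   // interrupt status register
    volatile std::uint32_t IFCR;  // interrupt flag clear register
};

// Bit positions as I read the STM32F1 reference manual; verify before use.
constexpr std::uint32_t DMA_ISR_TCIF1   = 1u << 1;  // channel 1 transfer complete
constexpr std::uint32_t DMA_IFCR_CTCIF1 = 1u << 1;  // write 1 to clear TCIF1

// Hypothetical helper: poll for completion of an ADC block transferred by
// DMA channel 1, using the DMA flag rather than ADC->SR.EOC.
inline bool adc_dma_block_done(FakeDmaRegs& dma) {
    if ((dma.ISR & DMA_ISR_TCIF1) == 0) {
        return false;              // transfer still in progress
    }
    dma.IFCR = DMA_IFCR_CTCIF1;    // on real hardware this clears TCIF1
    return true;
}
```

On the stand-in struct the write to `IFCR` does not clear `ISR`; on the real peripheral the hardware does that.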

The problem is that NUCLEO/DISCO boards with USB are all in the US$15-30 range, and 24 MHz+ Cypress FX2-based logic analyzer clones are $10 or less. I think I'd struggle to get that speed with low- to mid-level F4 chips (and F7/H7 pushes way towards the $30 upper end), so a decent analyzer plus bluepill/buck50 for all the other capabilities makes more sense. I'd also have to rewrite my "papoon_usb" library for the OTG peripheral in the newer chips, and you've already warned me that that's going to be another arduous/lengthy task.

I think I'll get some use out of my work on "buck50", and if others do, that's a bonus. I think the analyzer will be useful for low-end tasks, and all the USART/SPI/I2C/GPIO stuff for testing new chips/boards/designs without having to write any software to do so. BTW, in case it's not obvious, the oscilloscope function is a complete toy -- internal and external ADC noise make it only useful for getting a rough idea about the input signals.

Whether it was worth the time I put into it is debatable, but I'll add in the learning experience value as well. Even if that boils down to: "Never trust ST documentation". ;) And I'll put the tea attachment on the already long list of enhancements.

Any thoughts you'd have time to share regarding the "longjmp from interrupt handler" issue?