Do Interrupt Priorities Have To Start At Zero?

Christensen.Tyler · ‎2018-10-26

I've been fighting a subtle, strange QSPI bug for a while now in which data is occasionally clocked out spuriously. Without going into too much detail about that right now, my question is about interrupt priorities. I found what appears to be a fix that makes my spurious transmissions go away. Previously my interrupts were set to priorities 1 through 7, with no interrupt assigned priority level 0. I decreased each priority by 1 so that they are now priorities 0 through 6 (same relative order, just one subtracted from each, no re-ordering). With only that change, the spurious QSPI transmissions go away. I went back and forth a few times because I just couldn't believe the result, but that single change really does make the bug disappear.

Is there any logic behind this? I've never worried about having priorities 0-N before and lots of my programs have fairly random priority numbering schemes that don't necessarily start at 0. I can't find any documentation that this should matter or any discussions on the topic.

waclawek.jan · ‎2018-10-26

My crystal Cube says that there is some interrupt whose priority is not assigned by you, but by some "library".

JW

Christensen.Tyler · ‎2018-10-26

That would seem to be a compelling explanation, but I don't think it's the case. I'm not using libraries; I reject the concept of HAL and equivalent libraries. Why would I want a layer between the hardware and my code that, to my eyes, actually obfuscates the code (and makes situations like the one you describe happen)? All my code is at the register level, and all interrupts are configured in one place in the initial boot-up configuration.

waclawek.jan · ‎2018-10-26

> I don't think it's the case though.

Prove it! :)

Read out and check/post the content of NVIC_IPRx and NVIC_ISERx registers.

It should not matter, but once at it, how is AIRCR.PRIGROUP set?

And I have no explanation for the behaviour you are experiencing, except that the change in the priorities might have caused changes in the code (loading constants into registers might be different, for example) which coincidentally "fixed" your problem.

JW

Christensen.Tyler · ‎2018-10-26

Confirmed... The non-zero entries of NVIC->IP are as follows:

IRQ - Priority in IP registers
8 - 60
18 - 10
20 - 20
27 - 70
29 - 50
35 - 30
38 - 20
43 - 50
67 - 40
71 - 20

corresponds to ISER

0 - 0x28140100
1 - 0x00000848
2 - 0x00000088
3+ - 0x0

All 10 enabled interrupts correspond to non-zero priority and all correctly match to the 10 interrupts I enable in my device init config. Doesn't look like any rogue interrupts are getting enabled. This is testing on the old IRQ priority code 1-7 and confirmed I do again have the spurious transmissions during this debug session.

SCB->AIRCR is 0xFA050000
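As a cross-check of the dump above, a minimal C sketch can decode it. It assumes the posted IP values are hexadecimal and that the part implements 4 priority bits in the top nibble of each IP byte (as STM32F7 devices do). The helper names are hypothetical, not CMSIS functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 implemented priority bits, stored in the top nibble of
 * each 8-bit NVIC->IP entry. */
#define PRIO_BITS 4u

/* Hypothetical helper: list the IRQ numbers enabled in one ISER word.
 * Bit b of ISER[word_index] corresponds to IRQ number 32*word_index + b.
 * Writes the IRQ numbers to irqs[] and returns how many were found. */
static unsigned iser_to_irqs(unsigned word_index, uint32_t iser,
                             unsigned irqs[32])
{
    unsigned count = 0;

    assert(irqs != 0);
    for (unsigned bit = 0; bit < 32u; ++bit) {
        if (iser & (UINT32_C(1) << bit)) {
            irqs[count++] = 32u * word_index + bit;
        }
    }
    return count;
}

/* Hypothetical helper: convert a raw IP byte to the logical priority level
 * by discarding the unimplemented low bits. */
static unsigned ip_to_level(uint8_t ip_byte)
{
    return (unsigned)ip_byte >> (8u - PRIO_BITS);
}

/* Hypothetical helper: extract PRIGROUP (bits [10:8]) from an AIRCR value. */
static unsigned aircr_prigroup(uint32_t aircr)
{
    return (unsigned)((aircr >> 8) & 0x7u);
}
```

Under those assumptions the three ISER words decode to IRQs 8, 18, 20, 27, 29, 35, 38, 43, 67 and 71, the IP bytes to levels 1 through 7, and PRIGROUP to 0, so all four implemented bits act as preemption priority.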

There is a chance the code timing changed, I guess. The problem is extremely finicky: adding a NOP almost anywhere in the code, even in functions that never get called at run time, makes the QSPI problem go away or change, but then a week later it randomly comes back and you have to move the NOP somewhere else by trial and error (obviously not sustainable). At this point the pre-release checklist basically contains a step of "find a random place to put a NOP that makes the QSPI peripheral not spuriously transmit data". So no fix is truly testable, because any change might seem to fix the problem. This particular fix leaves the program size the same (unsurprising), and presumably loading a sequence of fixed numbers that are each one less than before would run identical machine code with the new numbers, but ultimately who knows; I haven't yet diffed the assembly on that.

T J · ‎2018-10-26

I am not an expert like Jan or Clive or Frank...

I guess NMI and Reset will remain at level 0. Are there any other hardware interrupts?

HardFault: how is that handled?

What is the NOP doing?

Shifting code to 32-bit alignment?

Do you fiddle with the instruction set? Are you using 32-bit or 16-bit? Endianness?

To me it seems like you have a rogue table pointer, or a counter being corrupted or bypassed errantly.

waclawek.jan · ‎2018-10-26

> Confirmed...

OK, no more ideas.

Don't you want to discuss the problem itself, maybe in a separate thread?

> guess NMI and Reset will remain at level 0, are there any other hardware interrupts ?

> Hardfault, how is that handled. ?

From ARMv7 ARM:

> The group priorities of Reset, NMI and HardFault are -3, -2, and -1 respectively, regardless of the value of PRIGROUP.
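For the configurable interrupts, unlike those three fixed exceptions, the group (preemption) priority does depend on PRIGROUP. A minimal sketch of the split follows, based on the ARMv7-M rule that PRIGROUP = n puts bits [7:n+1] of the priority byte in the group priority and bits [n:0] in the subpriority. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the two halves of one 8-bit priority field. */
typedef struct {
    unsigned group; /* decides preemption; lower value = more urgent */
    unsigned sub;   /* only orders pending exceptions within one group */
} prio_split_t;

/* Split a raw 8-bit priority byte according to AIRCR.PRIGROUP (0..7). */
static prio_split_t split_priority(uint8_t ip_byte, unsigned prigroup)
{
    prio_split_t result;
    unsigned sub_bits;

    assert(prigroup <= 7u);
    sub_bits = prigroup + 1u; /* subpriority occupies bits [prigroup:0] */

    result.group = (unsigned)ip_byte >> sub_bits;
    result.sub = (unsigned)ip_byte & ((1u << sub_bits) - 1u);
    return result;
}

/* A pending interrupt can preempt an active one only if its group
 * priority value is strictly lower. */
static int can_preempt(uint8_t pending_ip, uint8_t active_ip,
                       unsigned prigroup)
{
    return split_priority(pending_ip, prigroup).group <
           split_priority(active_ip, prigroup).group;
}
```

With PRIGROUP = 0, as reported earlier in the thread, priority bytes 0x10 and 0x20 land in different groups, so the first can preempt the second. With PRIGROUP = 5 both would share group 0 and differ only in subpriority.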

> what is the NOP doing ?

> shifting code to 32bit alignment ?

That too, and that might influence timing as well. (Tyler has not told us his STM32 model yet; once he does, we could chew on this further.)

Christensen.Tyler · ‎2018-10-26

I have been refraining from discussing the QSPI symptoms here because this topic was really about interrupt priorities. It turns out that the fix already broke as I started working on other parts of the code. I guess it really was just a coincidence that temporarily fixed the problem.

Here is discussion on the real problem (yup, been intermittently debugging this one for the last 3 months!). Note the first post is not mine, I'm discussing my very similar problem in follow-ups. Back then it was when I plugged in a USB cable, but that was just a coincidence that part of the code tripped it I think. Now it's just always sending out spurious QSPI transmissions at fairly random and unexplained intervals: https://community.st.com/s/question/0D50X00009XkgwcSAB/stm32f7-quadspi-and-the-data-cache-odd-behaviour

The chip is an STM32F746NG.

I'm honestly not sure precisely what the NOP does. Things like the functions are still odd-aligned (as needed by thumb code references). Certainly things are going to move around but I don't know what relevant factors to look at.

There are many interrupts (I enable 10 at start-up), but they're all ones I've enabled and have handlers for of course. And none of them interact with the QSPI peripheral, only the main loop operates the flash memory system as a background task (making it somewhat unlikely an interrupt is triggering QSPI). Interrupts just set flags instructing the main loop what to do on the next pass through the flash-operation call.
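The flag-passing arrangement described above might look roughly like the following sketch. It is an assumption about the general pattern, not the poster's actual code: the request names and both functions are hypothetical, and C11 atomics stand in for whatever mechanism the real firmware uses.

```c
#include <assert.h>
#include <stdatomic.h>

/* The pattern relies on plain lock-free atomics being available. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_uint must be lock-free");

/* Hypothetical request bits; the names are illustrative only. */
enum {
    REQ_FLASH_WRITE = 1u << 0,
    REQ_FLASH_ERASE = 1u << 1
};

/* Shared between interrupt handlers and the main loop. */
static atomic_uint pending_requests;

/* Called from an interrupt handler: records the request and returns.
 * It never touches the QSPI peripheral itself. */
static void isr_post_request(unsigned req)
{
    atomic_fetch_or(&pending_requests, req);
}

/* Called once per main-loop pass: atomically fetches and clears every
 * request posted since the previous pass. */
static unsigned main_take_requests(void)
{
    return atomic_exchange(&pending_requests, 0u);
}
```

The main loop would call `main_take_requests()` at the top of its flash-operation routine and act on the returned bits, so only one context ever drives the peripheral.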

All fault handlers (NMI, HardFault, etc.) have custom handlers in my code that turn off gatedrive so the hardware doesn't blow up. It's not entering a fault state, though.

waclawek.jan · ‎2018-10-27

I'd still suggest you start a fresh thread, stating the problem as clearly as you can, linking to these two threads and copy/pasting the relevant portions from them. The ultimate work is up to you, of course.

With this sort of problem, hard to isolate and easy to make go away symptomatically, I attempt to do exactly the opposite: find the circumstances under which the problem occurs as often as possible. If there are external stimuli, I try to increase their occurrence to or beyond the maximum (e.g. connecting pushbuttons to a generator), and I try to induce faults to exercise the paths that are normally seldom used (e.g. injecting noise into I2C buses by touching them). Of course I make sure I can observe the problem's symptoms and catch/count/whatever them in the process. (This process often reveals other shortcomings of the code and/or system, too; don't give up though.) The universal key to debugging is perseverance.

> specifically when calling HAL_PCD_EP_Open

Hey, and you've said above you don't use Cube/HAL! =)

No, I am not judging you, and I won't even say there's a direct problem there, as you yourself said in that very post. But after one year of fighting with the Synopsys OTG module, I assert that even with full ownership of the code controlling that IP there's still a significant chance of surprise, thanks to the accretion of historical layers within it and the (to put it mildly) substandard accompanying documentation. For a taste, see https://community.st.com/s/question/0D50X00009nKK7wSAG/synopsys-otg-fifo-clash . So the easiest related question is: do you use the HS-OTG with DMA? Note that you can't data-breakpoint on DMA, as the breakpoint unit watches accesses by the processor alone and is not aware of anything else going on in the bus system.

JW

T J · ‎2018-10-27

Did you play this game?

cache off
-O0
-O1
-O2