I have been simulating errors on the bus by removing the other node (a motor controller). As expected I see a number of failed transmissions. I can see that the FDCAN peripheral signals an Active Error on the bus (presumably as a result of not getting an ACK). I use the IR.PEA flag to get an interrupt. I can see that the ECR register is incrementing on each error as expected. After 16 failures it reaches 0x00100080: ECR.CEL is 16 and ECR.TEC is 128. So far so good. My understanding is that the peripheral now enters the Passive Error state. I get an interrupt for this using the IR.EP flag.
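To make the decoding of that register value explicit, here is a minimal sketch in C. It assumes the ECR field layout of the Bosch M_CAN core that the FDCAN peripheral is based on (TEC in bits 7:0, REC in bits 14:8, RP in bit 15, CEL in bits 23:16). The type `ecr_fields_t` and the function `ecr_decode` are hypothetical names used for illustration only; they are not part of any vendor HAL.

```c
#include <stdint.h>
#include <stdbool.h>

/* Decoded view of the error counter register (hypothetical helper type). */
typedef struct {
    uint8_t tec;  /* transmit error counter, bits 7:0 (assumed layout)  */
    uint8_t rec;  /* receive error counter, bits 14:8 (assumed layout)  */
    bool    rp;   /* receive error passive flag, bit 15 (assumed)       */
    uint8_t cel;  /* CAN error logging counter, bits 23:16 (assumed)    */
} ecr_fields_t;

/* Split a raw 32-bit ECR value into its individual fields. */
static ecr_fields_t ecr_decode(uint32_t ecr)
{
    ecr_fields_t f;
    f.tec = (uint8_t)(ecr & 0xFFu);
    f.rec = (uint8_t)((ecr >> 8) & 0x7Fu);
    f.rp  = ((ecr >> 15) & 0x1u) != 0u;
    f.cel = (uint8_t)((ecr >> 16) & 0xFFu);
    return f;
}
```

Under that layout, `ecr_decode(0x00100080u)` gives `cel == 16` and `tec == 128`, with `rec == 0` and `rp` clear, which matches the values described above.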
The issue is that the device actually appears to behave as if it is in the Bus Off state. I can send one message after the IR.EP interrupt. I get another IR.PEA interrupt and schedule the next message. But I get no more interrupts. I never see the IR.BO flag set. ECR remains at 0x00100080. Are there some other flags I should enable for failed transmissions in Passive Error state?
The only way I have found to recover is to hard reset the FDCAN peripheral in the IR.EP interrupt. From the documentation, I had expected to do this in the IR.BO interrupt. Note that I have CCCR.DAR set to prevent automatic retransmission: I don't know if that is relevant. The note in the Reference Manual about setting and resetting CCCR.INIT and waiting with bus idle for 129*11 bit times had no effect at all.
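For a sense of scale, the 129*11 bit-time wait can be converted to a duration. The following is a minimal sketch in C; the function names are hypothetical, and the bit rate passed in is an assumption, since the actual nominal bit rate of this bus is not stated here.

```c
#include <stdint.h>

/* Number of bit times in the recovery wait: 129 sequences of 11 bits. */
static uint32_t recovery_bit_times(void)
{
    return 129u * 11u;
}

/* Recovery wait in microseconds for a given nominal bit rate in bit/s.
 * Rounds up so the result is never shorter than the true duration.
 * The caller must pass a non-zero bit rate. */
static uint32_t recovery_time_us(uint32_t nominal_bitrate)
{
    uint64_t bits = recovery_bit_times();
    uint64_t us = (bits * 1000000u + nominal_bitrate - 1u) / nominal_bitrate;
    return (uint32_t)us;
}
```

The wait is 1419 bit times in total. At an assumed 500 kbit/s that is 2838 microseconds, so any software delay used while testing the recovery sequence would need to be at least that long.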
Though I am able to recover the bus now, I would like to understand what's going on in case it reveals a deeper problem in my code. Why is the peripheral not reaching the point where IR.BO is set?
I should add that I have seen some quite confusing behaviour if I break on the interrupts. In this case, it seems I can send messages just fine. I see them marked with passive errors in my logic analyser (it shows "NAK" rather than "Error"). When debugging like this, it seems that the ECR does not increment. Is there some timing condition in the peripheral which could cause this? I understand there are two clock domains, so...