2023-02-15 01:03 AM
I've got a design with a STM32F405RE bootloading onboard STM32F722RE via SPI1.
I've implemented an SPI bootloader host (mostly the same as serial, with the addition of an extra 5A frame and dummy reads in a couple of places) and individually verified that commands like GetVersion, ReadMemory, WriteMemory, EraseMemory etc. work.
However, EraseMemory followed by WriteMemory does not work.
This is referencing AN4286 throughout this post.
Two questions regarding this:
1) Get ACK Procedure (Figure 2 in AN) has the following:
send dummy / receive data, then check whether the received byte is ACK or NACK. However, there's a loop back to the beginning, where you keep sending/receiving data while waiting for the correct byte. I am seeing this behavior here as well: it takes about 10-15 ACK reads before ACK is returned. I understand that's what the flowchart says, so I'm assuming this is OK. Right?
2) Here is where it gets weird. I've implemented EraseMemory as mass erase (Figure 18, left-side flow), by sending 0xff 0xff 0x00 and then waiting for ack.
here's the traffic flow over SPI:
SPI 5a->a5 < frame
SPI 44->a5 < cmd + crc
SPI bb->a5 < cmd + crc
SPI 00->a5 ... SPI 00->a5 < wait for ack
SPI 00->79 < ack
SPI 79->a5 < confirm ack
SPI ff->a5 < data frame for mass erase
SPI ff->a5 < data frame for mass erase
SPI 00->a5 < data frame for mass erase (crc)
SPI 00->a5 ... SPI 00->a5 < wait for ack
SPI 00->79 < ack
SPI 79->a5 < confirm ack
the SPI 00->a5 ... SPI 00->a5 bits are repeated transfers of 0x00 -> reads of 0xa5 until ack is finally returned.
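For readers following the trace above, here is a minimal sketch of how the two host frames could be built. It assumes, based on the bytes in the trace, that the command frame is the 0x5A start byte followed by the command and its bitwise complement, and that the mass erase payload ends with the XOR of its bytes. The names `bl_xor`, `bl_build_cmd_frame` and `bl_build_mass_erase_payload` are hypothetical and not taken from AN4286.

```c
#include <stddef.h>
#include <stdint.h>

#define BL_SOF           0x5AU  /* start-of-frame byte, as seen in the trace */
#define BL_CMD_EXT_ERASE 0x44U  /* Extended Erase command, as seen in the trace */

/* XOR of all bytes in buf (hypothetical helper). */
static uint8_t bl_xor(const uint8_t *buf, size_t len)
{
    uint8_t x = 0;
    for (size_t i = 0; i < len; i++) {
        x ^= buf[i];
    }
    return x;
}

/* Command frame: start byte, command, bitwise complement of the command. */
static void bl_build_cmd_frame(uint8_t cmd, uint8_t out[3])
{
    out[0] = BL_SOF;
    out[1] = cmd;
    out[2] = (uint8_t)~cmd;
}

/* Mass erase payload: special code 0xFFFF followed by its XOR checksum. */
static void bl_build_mass_erase_payload(uint8_t out[3])
{
    out[0] = 0xFFU;
    out[1] = 0xFFU;
    out[2] = bl_xor(out, 2);
}
```

With these assumptions, 0x44 yields the 5a 44 bb sequence and the payload comes out as ff ff 00, matching the trace.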
According to this, the mass erase should now commence, right? I've peeked at the datasheet and max mass erase time is 6.9 sec. I tried waiting that duration before using WriteMemory, but that didn't work either. WriteMemory will fail waiting for ack after sending 1st command frame.
What DOES work is simply resetting the STM32F722 between mass erase and writing, like, reset (entering bootloader), erase, reset again, write memory, reset -> works.
2023-04-17 06:08 AM
Hi @timecop1818
I now have the confirmation of the issue root cause and its scope.
Root cause:
The issue root cause is that Flash is stalled due to the mass erase operation (execution of instructions from flash is stopped till end of erase operation), while SPI DMA is still running in circular mode and sending the content of its buffers.
The content of the buffers being inconsistent and not updated by the code which is stalled, it sends incorrect ACKs to the host.
This issue exists on all products older than STM32L4, but visible only when erase operation duration is long enough for the host to get wrong data from SPI DMA buffers. (not visible when erase operation is short).
Workaround:
As explained above, 3 workarounds are possible:
Fixes:
This issue was actually spotted and fixed starting from STM32L4 and on all newer devices.
SPI DMA buffers content is cleared before starting the erase operation, which prevents it from sending inconsistent ACKs to the host while Flash is stalled.
Documentation:
The issue was not documented up to now. It will be described in AN4286 along with full list of impacted STM32 devices (basically: STM32F4, STM32F7, STM32H7) and the workarounds description. Document update is planned shortly.
As per your answer that workaround 2 didn't work in your setup, I think that you might be inserting the delay in the wrong place? If you could share the section of the code with us, we may spot the issue.
I hope this provides you with all the information you need regarding this issue.
Please let me know should you need further details/information.
2023-02-27 09:57 AM
Hi @timecop1818
From first glance your sequence seems to be correct.
First let me confirm three points:
How much time did it take from step 10 to step 11 (from the end of the 0xFF FF 00 frame to receiving the actual ACK)?
Did you try sending another command, like GetCommand or Read Memory, after the end of the mass erase operation? Did it result in the same behavior?
Do you have HW IWDG activated in the option bytes ?
2023-02-27 03:06 PM
Hi,
Thanks for looking into this.
I understand that I shouldn't need to wait (uart bootloader replies AFTER ME is done, for example), and that I shouldn't need to reset/reconnect. Since I can chain any other commands (other than ME) it's obvious that the overall system works.
Answers to your questions below:
1) The ACK after FFFF00 is ~instant, same as all other replies that don't do anything time-consuming. As in, I only needed to send maybe ~10 dummy bytes/reads until I received an ACK.
2) Yes, if I send another command after mass erase (such as write block, for example), it will fail, as the remote state machine seems to be out of sync. However, sending commands before mass erase, in sequence etc., works fine.
3) I do not have IWDG or WWDG activated.
2023-03-26 10:10 PM
Any update on this?
Specifically, you never said if repeated sending of the ACK waiting frame is something required or not. As well as the desync during mass erase request.
2023-03-29 02:24 AM
Hi @timecop1818 ,
Sorry for the delayed answer.
Indeed, I confirm there is an issue with this specific device bootloader.
The device doesn't seem to get the ACK from the Host unless a delay is introduced (not the delay that you added).
Let me try to find a workaround and get back to you.
(by the way, sending ACK waiting frame is required, but the answer is supposed to occur just after the end of the erase operation)
Kind Regards,
2023-03-29 07:09 PM
Thanks, good news.
Yeah if a workaround is available that would be nice. Waiting for update.
2023-04-03 07:23 AM
Hi @timecop1818
The root cause is not fully understood yet (it seems to be related to latency on flash, and tests done from SRAM show the issue disappears).
But the workaround we found for the moment is the following:
Add 650ms delay before requesting the ACK from the slave.
Which means, in your steps below, insert a 650ms delay between steps 3 and 4.
Could you please check on your side and let me know ?
SPI 5a->a5 < frame
SPI 44->a5 < cmd + crc
SPI bb->a5 < cmd + crc
SPI 00->a5 ... SPI 00->a5 < wait for ack
SPI 00->79 < ack
SPI 79->a5 < confirm ack
SPI ff->a5 < data frame for mass erase
SPI ff->a5 < data frame for mass erase
SPI 00->a5 < data frame for mass erase (crc)
SPI 00->a5 ... SPI 00->a5 < wait for ack
SPI 00->79 < ack
SPI 79->a5 < confirm ack
2023-04-03 05:33 PM
Thanks for looking into it.
680ms didn't work, 700-750 did. As in, the desync didn't happen. But, while the protocol is not interrupted, the actual erase didn't happen, and following write_memory also doesn't work.
Am I understanding correctly that the delay is ONLY needed in case of CMD_EE (0x44) for extended erase, and not for EVERY command going through the frame, right?
What about the delay after sending FF FF mass erase data? needed?
2023-04-05 07:51 AM
Hi @timecop1818 ,
Sorry, my bad, I misunderstood the frame. I meant to add the delay between steps 9 and 10 (after sending the full Mass Erase frame and getting the ACK).
I confirm that this workaround of adding a delay is only for Erase command and due to latency caused by Flash erase operation. (for example, we ran the Bootloader from SRAM and the issue was not reproduced).
There is another workaround not based on timing but rather on getting the right ACK, you can find an implementation example below. Would that be interesting for your case ?
#include <stdint.h>

/* CMD_FAIL, OK and spi_driver are not shown in this excerpt. */
#define MASS_ERASE_DONE    0U
#define MASS_ERASE_ONGOING 8U

/**
  * @brief  Check the status of the current Mass Erase command.
  *         If the Data_in buffer contains "0x79" the Mass Erase is still ongoing.
  * @param  Data_in    Buffer containing the received data.
  * @param  BufferSize The size of the buffer.
  * @retval The return value can be one of the following values:
  *   @arg MASS_ERASE_ONGOING: The operation is still ongoing
  *   @arg MASS_ERASE_DONE   : The operation is finished
  */
uint32_t GetMassEraseStatus(const uint8_t *Data_in, uint32_t BufferSize)
{
    for (uint32_t count = 0; count < BufferSize; count++)
    {
        if (Data_in[count] == 0x79U)
        {
            return MASS_ERASE_ONGOING;
        }
    }
    return MASS_ERASE_DONE;
}

/**
  * @brief  Perform Mass Erase command via SPI protocol.
  * @retval The return value can be one of the following values:
  *   @arg CMD_FAIL.
  *   @arg OK.
  */
uint32_t MassErase(void)
{
    uint8_t data_in[20];
    uint8_t data_out[3] = {0xFF, 0xFF, 0x00};

    /* Send Mass Erase OpCode */
    if (spi_driver->SendData(data_out, 3) < 0)
    {
        return CMD_FAIL;
    }

    /* Keep reading until the received buffer no longer contains an ACK (0x79) */
    do
    {
        spi_driver->ReceiveData(data_in, 20);
    } while (MASS_ERASE_ONGOING == GetMassEraseStatus(data_in, 20));

    /* Send ACK to the device */
    data_out[0] = 0x79;
    spi_driver->SendData(data_out, 1);

    return OK;
}
2023-04-05 07:09 PM
I am sorry, I do not understand the provided example of waiting for non-ack. What flowchart in AN4286 is it based on?
I would have to special-case this in my otherwise clean bootloader driver which seamlessly handles serial/CAN/SPI, and I do not understand the benefit.
For now, I've worked around the issue by adding 9 seconds delay after sending the mass erase frame (FF FF 00) and before waiting for ack on SPI targets. However, where did you come up with the 650ms number from?
Isn't it dependent on the flash memory contents? When I fill 512kb on the F7 with rand(), mass-erase takes about 8.1 seconds.
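To make the delay placement described above unambiguous, here is a rough sketch of the ordering: command frame, ACK, mass erase payload, fixed delay, and only then the ACK poll. Everything in it (`bl_spi_ops_t`, `bl_spi_mass_erase`, the callback signatures and the `fake_*` test doubles) is hypothetical glue for illustration, and the delay value is left as a parameter because the thread does not settle on a verified number.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side hooks; none of these names come from AN4286. */
typedef struct {
    int  (*send)(const uint8_t *buf, size_t len, void *ctx);
    int  (*wait_ack)(void *ctx);              /* 0 on ACK, negative otherwise */
    void (*delay_ms)(uint32_t ms, void *ctx);
    void *ctx;
} bl_spi_ops_t;

/* Mass erase with the fixed delay inserted after the FF FF 00 payload. */
static int bl_spi_mass_erase(const bl_spi_ops_t *ops, uint32_t erase_delay_ms)
{
    static const uint8_t cmd[3]     = {0x5A, 0x44, 0xBB};
    static const uint8_t payload[3] = {0xFF, 0xFF, 0x00};

    if (ops->send(cmd, sizeof cmd, ops->ctx) < 0) return -1;
    if (ops->wait_ack(ops->ctx) < 0) return -1;
    if (ops->send(payload, sizeof payload, ops->ctx) < 0) return -1;

    /* Workaround: do not poll for ACK while the erase may still be running. */
    ops->delay_ms(erase_delay_ms, ops->ctx);

    return ops->wait_ack(ops->ctx);
}

/* Test doubles that only record the order of calls: S=send, A=ack, D=delay. */
typedef struct { char log[16]; size_t n; uint32_t delayed_ms; } trace_t;

static void trace_put(trace_t *t, char c)
{
    if (t->n + 1 < sizeof t->log) {
        t->log[t->n++] = c;
        t->log[t->n] = '\0';
    }
}

static int fake_send(const uint8_t *buf, size_t len, void *ctx)
{
    (void)buf; (void)len;
    trace_put(ctx, 'S');
    return 0;
}

static int fake_wait_ack(void *ctx)
{
    trace_put(ctx, 'A');
    return 0;
}

static void fake_delay(uint32_t ms, void *ctx)
{
    trace_t *t = ctx;
    t->delayed_ms += ms;
    trace_put(t, 'D');
}
```

Keeping the delay behind a hook like this would let a shared serial/CAN/SPI driver pass 0 for transports that do not need it.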
Is bootloader code going to keep returning not-ack while erase is in progress?
The flowchart on AN4286 Figure 18, "Wait for ACK or NACK frame".
My "wait for ack" code is exactly like this (AN4286 Figure 2)
write 00 (ignore received data)
while !ack
write 00, receive byte
if byte is ack, write ack, return ack
if nack, write ack, return nack
check if timeout has been reached and if not, continue loop.
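In C, the loop above might look roughly like the sketch below. It is only an illustration of the pseudocode: `bl_xfer_fn`, `bl_wait_ack` and the poll-count timeout are hypothetical, 0x1F is assumed as the NACK byte (the usual STM32 bootloader value, not quoted in this thread), and `fake_xfer` is a test double standing in for a real SPI transfer.

```c
#include <stdint.h>

#define BL_ACK   0x79U
#define BL_NACK  0x1FU  /* assumption: standard STM32 bootloader NACK byte */
#define BL_DUMMY 0x00U

typedef enum { BL_GOT_ACK, BL_GOT_NACK, BL_TIMEOUT } bl_result_t;

/* Hypothetical full-duplex transfer: sends tx, returns the byte clocked in. */
typedef uint8_t (*bl_xfer_fn)(uint8_t tx, void *ctx);

static bl_result_t bl_wait_ack(bl_xfer_fn xfer, void *ctx, uint32_t max_polls)
{
    (void)xfer(BL_DUMMY, ctx);          /* first dummy byte, reply ignored */

    for (uint32_t i = 0; i < max_polls; i++) {
        uint8_t rx = xfer(BL_DUMMY, ctx);
        if (rx == BL_ACK || rx == BL_NACK) {
            (void)xfer(BL_ACK, ctx);    /* host confirms the reply */
            return (rx == BL_ACK) ? BL_GOT_ACK : BL_GOT_NACK;
        }
    }
    return BL_TIMEOUT;                  /* poll budget used up */
}

/* Test double: answers 0xA5 for busy_polls transfers, then the reply byte. */
typedef struct { uint32_t busy_polls; uint8_t reply; uint8_t last_tx; } fake_slave_t;

static uint8_t fake_xfer(uint8_t tx, void *ctx)
{
    fake_slave_t *s = ctx;
    s->last_tx = tx;
    if (s->busy_polls > 0) {
        s->busy_polls--;
        return 0xA5U;
    }
    return s->reply;
}
```

A real driver would probably bound the loop by elapsed time rather than by a poll count; the count is used here only to keep the sketch self-contained.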
If I add 680ms delay before this loop, ACK is properly received, and bootloader protocol is synched but since mass erase is still ongoing, the following write_memory parts fail.
If I understand your wait code correctly, you are actually looping for as long as ACK is received, then stop when something that isn't ACK is received, but it's not clear what it should be. Why are you receiving 20 bytes? How is that even possible, since SPI is synchronous so that means you're also sending 20 bytes? Again, I'd rather understand the issue than add strange workarounds.
P.S. Once you understand the issue will it be documented in AN2606 or AN4286, for the affected devices?