I can guess where the 3195 number comes from, and I can assure you it is wrong under at least some conditions and that you are just lucky it works as written. Here is the code we use to delay for *at least* a requested number of processor cycles (for an absolute delay in seconds, it is left to the caller to know what the current CPU frequency is and pass in the correct number of cycles):
static inline void
__spin_delay_m4(uint32_t cycles)
{
    // Spin for at least cycles MCU clock cycles.  SUBS + BNE is 3 cycles on
    // M4, so we round up to the next multiple of 3 cycles.
    uint32_t iterations = (cycles + 2)/3;
    asm volatile(
        ".balign 4;"
        "1:"
        "   subs %0, 1;"
        "   bne 1b;"
        :"+r"(iterations)
        :
        :"memory", "cc");
}
For a simple delay loop, there is quite a bit going on behind the scenes to make this possible. In this example, we assume that the code can be fetched with 0 wait states. If the loop is in flash and the flash implements a prefetch buffer (look for "ART Accelerator" in your reference manual), then we need to ensure the code is aligned so that it doesn't straddle two prefetch lines, which is why the ".balign 4" directive is there. If you have a full-on instruction cache or are executing from SRAM or ITCM, then this may not be an issue for you. If you don't have any type of acceleration and you are running from flash, then this delay will probably take much longer than requested unless your CPU is running slowly enough that you can use 0 flash wait states.
This also assumes that the requested delay is a minimum and that it's OK to wait longer than requested. If interrupts are enabled, the time spent in any interrupt taken during the loop is not accounted for, and the delay will be longer than strictly necessary. So it isn't suitable for realtime purposes and should only be used to satisfy constraints such as "wait at least 2 ms for the clock to stabilize".
In this case, to delay for 1 ms when the CPU is running at 16 MHz you would pass a delay value of 16000, which gets divided by 3 and rounded up for a count of 5334 loop iterations. My guess is that in the for-loop case of your example, the compiled loop consists of some sequence that takes 5 cycles instead of 3, and the correct value for that would be roughly 3200 iterations (16000/5) - pretty close to your 3195. Of course, a newer compiler or a different optimization level could then ruin your week some time in the future when that compiled loop takes only 3 cycles instead of 5 to execute... and all your delays end up way too short and none of your device drivers work any more.
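To make the arithmetic above concrete, here is a minimal sketch of the conversion from a time to a cycle count and then to a loop iteration count. The helper names us_to_cycles() and cycles_to_iterations() are hypothetical and not part of the code above; the caller is still assumed to know the current CPU frequency.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the original code. */

/* Convert a delay in microseconds to CPU cycles, rounding up.  The 64-bit
 * intermediate avoids overflow in the multiplication; the caller must keep
 * the result within 32 bits. */
static inline uint32_t
us_to_cycles(uint32_t cpu_hz, uint32_t us)
{
    return (uint32_t)(((uint64_t)cpu_hz * us + 999999u) / 1000000u);
}

/* Number of loop iterations needed to cover at least `cycles` cycles when
 * one iteration costs `cycles_per_iter` cycles (3 for SUBS + BNE on an M4).
 * Written as divide-then-adjust so it cannot overflow. */
static inline uint32_t
cycles_to_iterations(uint32_t cycles, uint32_t cycles_per_iter)
{
    return cycles / cycles_per_iter + (cycles % cycles_per_iter != 0u);
}
```

With these, us_to_cycles(16000000, 1000) gives the 16000 cycles for 1 ms at 16 MHz, cycles_to_iterations(16000, 3) gives the 5334 iterations of the 3-cycle loop, and cycles_to_iterations(16000, 5) gives the 3200 iterations a 5-cycle loop would need.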
Finally, as suggested by the name "__spin_delay_m4()", this is only valid for an M4 CPU. The M7 can use the same delay loop body, but with speculation it executes both the subtract and the branch in a single cycle, so the iteration count must be increased to match the desired cycle count (omit the (cycles + 2)/3 math and use cycles directly). I haven't tested on an M0 or an M33, so I cannot say what adjustment may be needed, if any.
As others have said, if you have access to a timer unit of some sort (SysTick, TIM or DWT), then those are all options for a more accurate delay, and with those you could potentially put the CPU into low-power mode with a WFI instruction while waiting for an interrupt to wake you up at the end of the delay period. They also all have the benefit of being CPU-agnostic.
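As a rough sketch of the counter-based alternative, the loop below polls a free-running 32-bit cycle counter instead of counting instructions. Everything here is an assumption for illustration: counter_delay_cycles() and the cycle_counter_fn type are hypothetical names, and the counter is passed in as a function pointer so the logic can be tried off-target. On a core that implements the DWT cycle counter, the reader function would return the DWT CYCCNT register after the counter has been enabled.

```c
#include <stdint.h>

/* Hypothetical type: a function that returns the current value of a
 * free-running 32-bit cycle counter (e.g. DWT CYCCNT on cores that
 * implement it, once the counter has been enabled). */
typedef uint32_t (*cycle_counter_fn)(void);

/* Wait until at least `cycles` counter ticks have elapsed. */
static inline void
counter_delay_cycles(cycle_counter_fn read_counter, uint32_t cycles)
{
    uint32_t start = read_counter();

    /* Unsigned subtraction gives the elapsed count even if the counter
     * wraps past zero once during the delay. */
    while ((uint32_t)(read_counter() - start) < cycles)
        ;
}

/* Host-side stand-in for the hardware counter, for illustration only:
 * starts just below the wrap point and advances 7 "cycles" per read. */
static uint32_t fake_now = 0xFFFFFFF0u;

static uint32_t
fake_counter(void)
{
    fake_now += 7u;
    return fake_now;
}
```

Calling counter_delay_cycles(fake_counter, 100) returns once the stand-in counter has advanced by at least 100, including across the wrap. Like the spin loop, this is still a minimum delay: time spent in interrupts only makes it longer, but the result no longer depends on how many cycles the loop body takes on a given core.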