STM32F4 Flash data cache prefetch possible?

Question asked by taelman.johannes on Sep 20, 2016
Is it possible to prefetch flash data to the ART data cache without stalling execution?

As far as I can see, the only possibility is to use DMA, probably the DMA2D engine. The idea is not to copy data from Flash to SRAM but to trigger the data prefetch, so that the data can then be accessed without latency at its natural address.
I think it would only require two clock cycles in the inner loop to prefetch, say, four 128-bit rows, while transferring only four words by using the DMA2D line offset.
Would it be possible to discard the output of the DMA2D engine, perhaps by using unmapped address space, to further reduce bus occupation?
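
To make the geometry concrete, here is a rough sketch of the transfer parameters I have in mind. It only computes the values and runs on a host; it does not touch any hardware. The assumptions are mine: a 128-bit flash row, a 32-bit source pixel format (ARGB8888) so that one pixel is one word, and the names `prefetch_geometry_t`, `prefetch_geometry` and `prefetch_row_address` are made up for illustration. Whether a DMA read actually fills the ART data cache is exactly the open question, so this shows only which addresses would be read.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not confirmed: 128-bit flash rows and a 32-bit pixel format. */
#define FLASH_ROW_BYTES 16u /* one 128-bit flash row */
#define PIXEL_BYTES      4u /* ARGB8888: one pixel = one 32-bit word */

/* Hypothetical container for the values that would presumably go into the
 * DMA2D foreground address, line offset and number-of-line registers. */
typedef struct {
    uint32_t src_addr;        /* source address, aligned to a flash row */
    uint32_t pixels_per_line; /* pixels read per line */
    uint32_t num_lines;       /* number of lines, one per flash row */
    uint32_t line_offset;     /* pixels skipped after each line */
} prefetch_geometry_t;

/* Read one word per flash row: one pixel per line, then skip the rest of
 * the row through the line offset. */
prefetch_geometry_t prefetch_geometry(uint32_t flash_addr, uint32_t rows)
{
    prefetch_geometry_t g;

    assert(rows > 0u && rows <= 0xFFFFu); /* assumed 16-bit line count */

    g.src_addr        = flash_addr & ~(FLASH_ROW_BYTES - 1u);
    g.pixels_per_line = 1u;
    g.num_lines       = rows;
    g.line_offset     = FLASH_ROW_BYTES / PIXEL_BYTES - g.pixels_per_line;
    return g;
}

/* Address of the single word read in the given row (0-based). */
uint32_t prefetch_row_address(const prefetch_geometry_t *g, uint32_t row)
{
    uint32_t stride = (g->pixels_per_line + g->line_offset) * PIXEL_BYTES;

    assert(row < g->num_lines);
    return g->src_addr + row * stride;
}
```

For four rows this gives one pixel per line, four lines and a line offset of three pixels, so the source address advances by 16 bytes per line and only four words are transferred in total.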

Or is there an easier way to trigger prefetching of flash data?