STM32 F4 rendering performance

I'm working on a project that uses an STM32 F4 MCU clocked at 168 MHz. The resolution of my LCD is 320 x 480. After some testing, I got some figures. Rendering a full screen by having the MCU write directly to the LCD controller takes about 55 ms. With DMA, it takes about 70 ms. I cannot use the MCU approach because it blocks the whole app while the screen is being drawn. Has anyone done a similar benchmark? What are your figures? Is there any way to improve the performance? Any suggestions are appreciated.
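
To put those figures in perspective, here is a rough back-of-the-envelope calculation written as a small C sketch. It only uses the numbers above (168 MHz, 320 x 480, 55 ms and 70 ms); the function names are made up for illustration and are not part of any STM32 library.

```c
#include <stdint.h>

/* Hypothetical helper: average CPU clock cycles spent per pixel
 * for one full-screen update, rounded down.
 * cpu_hz   - core clock in Hz (168000000 here)
 * frame_ms - measured time for one full screen in milliseconds
 * pixels   - number of pixels per frame (320 * 480 here) */
uint32_t cycles_per_pixel(uint32_t cpu_hz, uint32_t frame_ms, uint32_t pixels)
{
    /* Divide by 1000 first so the product stays within 32 bits. */
    uint32_t cycles_per_frame = (cpu_hz / 1000u) * frame_ms;
    return cycles_per_frame / pixels;
}

/* Hypothetical helper: pixels written per second, rounded down. */
uint32_t pixels_per_second(uint32_t frame_ms, uint32_t pixels)
{
    return (pixels * 1000u) / frame_ms;
}
```

Calling `cycles_per_pixel(168000000u, 55u, 320u * 480u)` gives 60 and the same call with 70 ms gives 76, so both paths spend roughly 60 to 80 core clock cycles per pixel. That is about 2.8 million pixels per second for the MCU path and about 2.2 million for DMA.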