As the issue width and the number of functional units of superscalar processors continue to increase, the fetch unit must support a large fetch bandwidth in order to fully utilise the datapath resources. This trend worsens power consumption in the fetch unit, since the traditional instruction fetch mechanism is not optimised for power. This paper explores the extra power that traditional instruction caches consume because of dynamic control flow. Because trace caches capture the dynamic paths of code during execution, they provide a potential framework for power optimisation in the fetch unit. Our study shows that conventional trace caches (CTC) may increase power consumption in the fetch unit because they access the trace cache and the instruction cache simultaneously, whereas sequential trace caches (STC) consume less power but at the cost of a significant performance loss. To address this problem, we perform a detailed study of trace distribution and access locality. Based on this study, we first propose a new model, the selective trace cache (SLTC), which uses both compiler and hardware support to selectively control trace cache lookup and update. Experimental evaluation shows that, on average, SLTC reduces fetch-unit power consumption by up to 42.2% relative to CTC and by up to 21.8% relative to STC, at the cost of a performance loss of no more than 1.8% relative to CTC. Further, we propose a dynamic direction prediction based trace cache (DPTC), which eliminates the compilation and instruction set architecture (ISA) modifications that SLTC requires. Driven by a fetch direction predictor, DPTC achieves competitive power efficiency: on average, it reduces fetch-unit power consumption by up to 40.5% relative to CTC and by up to 17.6% relative to STC, at the cost of a performance loss of less than 2.4% relative to CTC.
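
The access patterns contrasted above can be illustrated with a toy access-count model. This is only a sketch under stated assumptions, not the simulator or power model used in this work: the function `simulate`, the `FetchStats` record, the policy label `"PREDICTED"` (a stand-in for any scheme that chooses one structure per fetch, as SLTC and DPTC do in spirit), and the one-cycle penalty for a serialised instruction cache access are all hypothetical. The model also ignores the fetch bandwidth lost when the instruction cache is chosen although the trace cache would have hit.

```python
from dataclasses import dataclass


@dataclass
class FetchStats:
    tc_accesses: int = 0   # trace cache lookups
    ic_accesses: int = 0   # instruction cache lookups
    extra_cycles: int = 0  # assumed 1-cycle penalty per serialised I-cache access


def simulate(policy, tc_hits, predictions=None):
    """Count cache lookups for a sequence of fetches.

    tc_hits[i] says whether fetch i would hit in the trace cache.
    predictions[i] (used only by "PREDICTED") says whether the trace
    cache is the structure chosen for fetch i.
    """
    stats = FetchStats()
    for i, hit in enumerate(tc_hits):
        if policy == "CTC":
            # Conventional: both structures are probed on every fetch.
            stats.tc_accesses += 1
            stats.ic_accesses += 1
        elif policy == "STC":
            # Sequential: trace cache first, I-cache only after a miss.
            stats.tc_accesses += 1
            if not hit:
                stats.ic_accesses += 1
                stats.extra_cycles += 1
        elif policy == "PREDICTED":
            # Hypothetical one-structure-per-fetch scheme.
            if predictions[i]:
                stats.tc_accesses += 1
                if not hit:
                    # Wrong choice: fall back to the I-cache, paying the delay.
                    stats.ic_accesses += 1
                    stats.extra_cycles += 1
            else:
                stats.ic_accesses += 1
        else:
            raise ValueError(f"unknown policy: {policy}")
    return stats


if __name__ == "__main__":
    hits = [True, True, False, True]
    for name in ("CTC", "STC"):
        print(name, simulate(name, hits))
    # A perfect chooser picks the trace cache exactly when it would hit.
    print("PREDICTED", simulate("PREDICTED", hits, predictions=hits))
```

In this toy setting, the conventional policy performs the most lookups with no serialisation delay, the sequential policy saves lookups but pays a delay on every trace cache miss, and an accurate per-fetch choice avoids both costs. How close a real predictor or compiler hint comes to that ideal is an empirical question that this sketch does not answer.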