The design of embedded hardware/software systems is often subject to strict requirements on aspects such as real-time performance, power consumption, and die area. Especially for data-intensive applications, the number of memory accesses is a dominant factor in each of these. To meet the requirements and design a well-adapted system, the software parts must be optimized and an adequate system and processor architecture must be designed. In this paper, we focus on finding an optimized memory hierarchy for bus-based architectures. We also explore useful instruction set extensions for application-specific processor cores. For complex applications, this design space exploration is difficult and requires in-depth analysis of the application and its implementation alternatives. Tools are needed that aid the designer in the design, optimization, and scheduling of hardware and software. We present a profiling tool for fast and accurate analysis of performance, power, and memory accesses in embedded systems. This paper shows how the tool can be applied for efficient hardware/software co-exploration within the design flow of processor-centric architectures. The concept has been proven in the design of a mixed hardware/software system with multiple processing units for video decoding.