To scale with on-chip wire delay, processors will be further partitioned into a larger number of small banks. As the number of data cache banks grows, load routing latency will account for a considerable portion of program execution time. In a tiled processor such as T-Flex, we observed that the average routing latency of loads that hit in the data cache can be reduced substantially (by 72.1%) when data is placed perfectly at the location where the load issues. To reduce the long routing latency of critical loads, this paper proposes a scheme for localizing load execution at the issuing side. However, this method incurs the overhead of data copies and of the extra communication needed to maintain coherence and memory ordering. We explore the design space for localizing data access, with particular attention to maximizing the benefit at the cost of relatively small overhead. We observe that access frequency and load/store behavior vary widely across copies, with a large number of successive load accesses concentrated on a small number of data blocks. Our experiments show that, with specialized replication and copy-invalidation strategies, copying overhead can be kept under control while retaining considerable performance gains.
Published in:
Computer Science and Information Technology - Spring Conference, 2009. IACSITSC '09. International Association of
Date of Conference: 17-20 April 2009