Abstract:
Cache structures on modern GPUs and CPUs occupy a large area and are frequently accessed, which increases their vulnerability to transient errors. These structures are often protected by ECC or parity checking, at some cost in area and energy. However, given the energy-efficiency and scalability challenges in high-performance computing, it is crucial to minimize unnecessary overhead while maintaining the desired reliability. This paper evaluates the reliability of unprotected tag SRAM structures in modern GPUs and studies a low-overhead tag error mitigation mechanism. The proposed mechanism uses Galois-based hash functions for set-index calculation to mitigate some pathological address strides that cause false hit events. Extensive analysis on a modern GPU indicates that, for write-through data caches, the hash-based mechanism yields a 10x reduction in false hit probability (with a 2% improvement in hit rate) compared to a baseline cache indexing scheme.
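The abstract does not specify the exact hash function, so the following is only a minimal sketch of one common Galois-field construction: the block address is treated as a polynomial over GF(2) and reduced modulo an irreducible polynomial whose degree equals the number of set-index bits. The cache geometry (64 sets), the polynomial x^6 + x + 1, and the function names `baseline_index` and `galois_index` are illustrative assumptions, not the paper's configuration. The sketch shows only how such a hash spreads a pathological stride across sets; it does not model tag bit flips or false hits.

```python
def baseline_index(block_addr: int, index_bits: int) -> int:
    """Conventional indexing: take the low-order bits of the block address."""
    return block_addr & ((1 << index_bits) - 1)


def galois_index(block_addr: int, index_bits: int, poly: int) -> int:
    """Reduce the block address, viewed as a GF(2) polynomial, modulo `poly`.

    `poly` must include its leading term x^index_bits, so its bit length
    is index_bits + 1. The remainder always fits in index_bits bits.
    """
    value = block_addr
    while value.bit_length() > index_bits:
        # Align the modulus with the current top bit and cancel that bit.
        shift = value.bit_length() - (index_bits + 1)
        value ^= poly << shift
    return value


# Assumed geometry: 64 sets (6 index bits), modulus x^6 + x + 1.
INDEX_BITS = 6
POLY = 0b1000011

# Pathological stride: every access is exactly 64 blocks apart.
addresses = [i * 64 for i in range(64)]

baseline_sets = {baseline_index(a, INDEX_BITS) for a in addresses}
galois_sets = {galois_index(a, INDEX_BITS, POLY) for a in addresses}

print("sets used by baseline indexing:", len(baseline_sets))
print("sets used by Galois indexing:  ", len(galois_sets))
```

With low-order-bit indexing, every address in this stride lands in the same set. Because the modulus is irreducible, multiplying by x^6 is a one-to-one mapping on the 64 possible remainders, so the same stride is spread over all 64 sets in this sketch.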
Published in: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Date of Conference: 25-28 June 2018
Date Added to IEEE Xplore: 23 July 2018
ISBN Information:
Electronic ISSN: 2158-3927