PROCEEDINGS OF THE IEEE, VOL. 91, NO. 7, JULY 2003

The Influence of Processor Architecture on the Design and the Results of WCET Tools

REINHOLD HECKMANN, MARC LANGENBACH, STEPHAN THESING, AND REINHARD WILHELM

Invited Paper

    The architecture of tools for the determination of worst case execution times (WCETs) as well as the precision of the results of WCET analyses strongly depend on the architecture of the employed processor. The cache replacement strategy influences the results of cache behavior prediction; out-of-order execution and control speculation introduce interferences between processor components, e.g., caches, pipelines, and branch prediction units. These interferences forbid modular designs of WCET tools, which would execute the subtasks of WCET analysis consecutively. Instead, complex integrated designs are needed, resulting in high demand for memory space and analysis time. We have implemented WCET tools for a series of increasingly complex processors: SuperSPARC, Motorola ColdFire 5307, and Motorola PowerPC 755. In this paper, we describe the designs of these tools, report our results and the lessons learned, and give some advice as to the predictability of processor architectures.
  

    Keywords—Predictability, processor model, real-time, static analysis, worst case execution time.

    Manuscript received August 14, 2002; revised December 17, 2002. This work was supported by the European IST-project Daedalus.
    R. Heckmann is with the AbsInt Angewandte Informatik GmbH, D-66123 Saarbruecken, Germany (e-mail: heckmann@absint.com).
    M. Langenbach, S. Thesing, and R. Wilhelm are with the Fachrichtung Informatik, Saarland University, D-66123 Saarbruecken, Germany (e-mail: mlangen@cs.uni-sb.de; thesing@cs.uni-sb.de; wilhelm@cs.uni-sb.de).
Digital Object Identifier: 10.1109/JPROC.2003.814618

0018-9219/03$17.00 © 2003 IEEE

I.  INTRODUCTION
    A.  Using Abstract Interpretation for WCET Computation
        1)  Applying Abstract Interpretation
    B.  The Architecture of WCET Tools
II.  RELATED WORK
III.  CACHE ANALYSIS
    A.  Cache Memory: General Remarks
    B.  [$A$]-Way Set-Associative Caches
    C.  LRU Caches
        1)  The LRU Strategy
        2)  Concrete Cache States
        3)  Updates of Concrete Cache States
        4)  Full Cache Analysis
        5)  Must Analysis
        6)  May Analysis
        7)  Must and May Analysis Together
    D.  ColdFire MCF 5307: Pseudo-Round-Robin Replacement
        1)  An Example
        2)  Problems
        3)  Must Analysis
    E.  PowerPC 750/755: Pseudo-LRU Replacement
        1)  Pseudo-LRU Replacement Strategy
        2)  Examples
        3)  Analyses
    F.  Data Caches
    G.  Value Analysis
IV.  ANALYSIS OF A SIMPLE PIPELINE
    A.  Dependence of the Caches on the Pipeline
    B.  Analysis Architecture
V.  ANALYSIS OF MORE COMPLEX PIPELINES
    A.  Pipeline Modeling
    B.  An Example: The ColdFire 5307
    C.  A Complex Example: The PowerPC 755
VI.  OBSERVATIONS
VII.  ARCHITECTURAL ADVICE: PREDICTABLE PERFORMANCE
VIII.  FUTURE WORK AND OPEN PROBLEMS
IX.  CONCLUSION
ACKNOWLEDGMENT
REFERENCES

I.  INTRODUCTION

    Hard real-time systems are subject to stringent timing constraints that are dictated by the surrounding physical environment. A schedulability analysis has to be performed in order to guarantee that these timing constraints will be met (“timing validation”). All existing techniques for schedulability analysis require the knowledge of the worst case execution time (WCET) of each task in the system. Since this is not computable in general, estimates of the WCET have to be calculated. These estimates have to be safe, i.e., they must never underestimate the real execution time. Furthermore, they should be tight, i.e., the overestimate should be as small as possible.
    In modern microprocessor architectures, caches, pipelines, and control speculation are key features for improving performance. Caches are used to bridge the gap between processor speed and the access time of main memory. Pipelines enable acceleration by overlapping the executions of different instructions. Control speculation is used to avoid pipeline stalls caused by conditional jumps. The consequence is that the execution behavior of instructions cannot be analyzed in isolation since it depends on the execution history. Processor architectures are optimized for average-case performance and not for predictable performance, as would be required for hard real-time systems. This paper deals with the consequences of processor architectures for the design and the effectiveness of WCET tools.
    
A.  Using Abstract Interpretation for WCET Computation

    The determination of the WCET of a program is composed of several tasks: classification of memory references as cache misses or hits, usually called cache analysis; analysis of the behavior of the program on the processor pipeline, the so-called pipeline analysis; prediction of the results of control speculation and the determination of the worst case execution path of the program, in the following called path analysis. All of these tasks are quite complex for modern microprocessors and digital signal processors (DSPs). They must be executed on the machine-code level, since the semantics of high-level languages does not refer to architectural components such as caches, pipelines, or branch prediction units. Since data cache analysis and pipeline analysis depend on the knowledge of effective addresses—in general only known at run time—another static analysis is needed to try to determine effective addresses statically, in our case called value analysis.
    The identification of these different phases of WCET determination makes it possible to use different methods tailored to the subtasks. In our case, value analysis, cache analysis, pipeline analysis, and branch behavior prediction are done by abstract interpretation, a semantics-based method for static program analysis. Path analysis is done by integer linear programming. Both precision of the results and efficiency of the WCET computation are acceptable in practice, but depend, as will be shown, on the processor architecture.
    1)  Applying Abstract Interpretation:    Abstract interpretation [1], [2] is a well-established method of static program analysis with a host of available theoretical results. Static program analysis is well suited for the approximative establishment of safety properties of programs, i.e., the proof that “something bad does not happen.” It is approximative in the sense that it may not establish all safety properties that actually hold. However, it is sound in the sense that all safety properties it claims to hold do actually hold. Which are the bad things for WCET that we hope to exclude by static program analysis? Of course, cache misses, pipeline stalls, and mispredicted branches. Each excluded cache miss allows us to exclude the costs of a cache miss penalty from the WCET, each excluded pipeline stall eliminates the costs for a pipeline bubble from the WCET, and each excluded branch misprediction precludes expensive damage to the instruction cache and costs for cleanup.
    A static program analysis is considered an abstraction of a standard semantics of the programming language. A standard (operational) semantics of a language is given by a concrete domain of data and a set of functions describing how the statements of the language transform concrete data. An abstract semantics then consists of a (simpler) abstract domain and a set of abstract semantic functions, so-called transfer functions, for the program statements computing over the abstract domain.
    The designer of a program analysis faces the following design tasks.

  • Defining the domain: The abstract domain is obtained from the concrete domain by abstracting from all aspects except those that are the subject of the analysis to be designed. An abstraction function maps concrete domain elements to abstract domain elements.
    Both domains usually are complete partially ordered sets of values. The partial order on the abstract domain corresponds to precision, i.e., quality of information. By agreement, elements higher up in the order are considered to contain less information.1
    The partial order determines the least upper bound operation, [$\sqcup$], on the lattice, which is used to combine information stemming from different sources, e.g., from several possible control flows into one program point.
  • Defining the transfer functions: The transfer functions describe how the statements transform abstract data. They must be monotonic to guarantee termination.

    Abstract interpretation has been profitably applied to cache analysis and pipeline analysis. It is executed on the control flow graph of the program, which can be constructed by analyzing the machine program [3].

    
B.  The Architecture of WCET Tools

    Both the architecture of WCET tools and the precision of the results of WCET analyses strongly depend on the architecture of the employed processor. The cache replacement strategy influences the obtainable precision of cache behavior prediction. Instruction prefetching, out-of-order execution, and control speculation introduce interferences between processor components, e.g., caches, pipelines, prefetch queues, and branch prediction units. As we will see, these interferences forbid modular designs of WCET tools, which would execute WCET analysis in a sequence of subtasks. Let us consider what it means to separate the analysis of component [$A$] from the analysis of component [$B$], where the behavior of [$A$] depends on that of [$B$]. In order to be on the safe side, the “damage” done to the state of [$A$] by activities of [$B$] has to be taken into account. Since nothing is known to the [$A$] analysis about the state of [$B$], an upper bound on the potential damage has to be determined and used. This upper bound can be far away from any real damage if the interference between the processor components is highly dynamic and if the worst case damage rarely occurs.
    Thus, to bound the loss of precision, complex processors need complex integrated tool designs, resulting in high demand for memory space and analysis time.

II.  RELATED WORK

    There exists a vast literature on WCET determination. We only list references that treat complex processors with all features considered in combination, not architectural features in isolation.
    Healy et al. [4], [5] presented an approach to predicting WCETs in the presence of caches and simple pipelines. In a first step of the analysis, a static cache simulator classifies instructions as cache hits or misses. This information is used by a pipeline path analysis that computes the execution time for a sequence of instructions. Loops are handled in a bottom-up manner. Only the simple pipeline of a MicroSPARC is considered, and [4] takes into account only direct-mapped caches and simple pipelines that can be described by resource usage patterns of instructions. For their experimental results, the authors consider only a small direct-mapped cache with small test programs.
    Li et al. suggest a solution using integer linear programming [6]. Both cache and pipeline behavior prediction are formulated as a single linear program. The i960KB is investigated, a 32-bit microprocessor with a 512-byte direct-mapped instruction cache and a fairly simple pipeline. Only structural hazards need to be modeled, thus keeping the complexity of the integer linear program moderate. Variable execution times, branch prediction, and instruction prefetching are not considered at all. Obtaining the ILP modeling for a more complex processor will be difficult. Using this approach for superscalar pipelines does not seem very promising considering the analysis times reported in the article. Nonetheless, the description of the worst case path through the program via ILP is an elegant method and can be efficient if the size of the ILP is kept small. This is the case in our tool.
    Lundqvist and Stenström present an integrated approach for obtaining WCET bounds through simulation of the pipeline in [7], [8]. They extend a pipeline simulator to handle unknown values in inputs. We share conceptual similarities with this approach in that we perform a cycle-wise evolution of a pipeline (model). In contrast to our approach, Lundqvist and Stenström use an integrated method in which value analysis for register/memory contents and execution time computation are parts of the same simulation. If the simulation cannot determine a branch condition exactly due to dependencies on unknown (input) values, both branches have to be simulated. This method does not guarantee termination of the analysis, but offers the advantage of sometimes determining loop bounds and/or recursion bounds “for free.”2 However, we feel that this analysis is very costly due to the huge amount of data that has to be kept for each branch followed. In contrast, our method does not retain information like register or memory contents in the pipeline analysis phase, contents that have already been determined in the value analysis to predict conditional and computed branches, for example. In [8], experiments with a PowerPC-like architecture are conducted for small example programs using an extended PSIM simulator with simple reservation tables for instructions. All in all, it is not clear how well this method scales up to programs of realistic size.
    In contrast to Lundqvist and Stenström's integrated approach, Engblom presents a WCET tool with a clear separation of all the analysis modules in [9]. The modules communicate using interface data structures. One main component is a simulator that estimates the execution time for a given sequence of instructions. These timing estimates are composed to form the execution time of the entire program. The quality of the obtained WCET is greatly influenced by the quality of the simulator used. Cache behavior prediction is not incorporated in the tool as the addressed targets do not have any caches. This eliminates the problem of cache and pipeline interaction, which becomes more difficult with more complex pipelines, prefetching, and branch prediction. The author comes to the conclusion that “[$\ldots\,$]out-of-order processors are definitely too complex to model with current techniques.”
    Colin and Puaut describe a framework for tree-based WCET analysis in [10]. Instruction cache and pipeline behavior as well as branch prediction are taken into account and are analyzed independent of one another, reducing the precision of the obtained WCET estimate.
    The analyses are based on two intermediate representations: the syntax tree, and the control flow graph built from assembly output of the compiler. As the program is not yet translated to object code, it is not clear which machine instruction an assembly instruction is mapped to, and as the program is not linked, information on instruction addresses is not available. The syntax tree is used to compose the WCET from smaller parts. This is not appropriate, as it disregards the execution context, leading to imprecise results.

III.  CACHE ANALYSIS

    A.  Cache Memory: General Remarks

    Caches are used to improve the access times of fast microprocessors to relatively slow main memories. They are an upper part of the storage system hierarchy and fit in between the register set and the main memory. Excluding the register set, caches have the shortest access times of all levels of the storage system. They can reduce the number of cycles a processor is waiting for data by providing faster access to recently referenced regions of memory. Caching is used in almost all general-purpose processors, and with increasing application sizes it is becoming more and more relevant for high-performance microcontrollers and DSPs.
    At any time, a cache memory duplicates a subset of main memory locations. For the purpose of caching, the main memory is partitioned into memory blocks of size [$B$] bytes, numbered consecutively starting with 0. Usually, [$B$] is a power of 2, [$B = 2^b$]. Then, byte addresses can be easily translated into block numbers by omitting the lowest [$b$] bits. By an access to memory block [$i$], we mean a read or write access to a memory location belonging to block [$i$].
    When the processor wants to access a memory block, it first checks whether the cache contains (a copy of) the block. If so (cache hit), the processor can quickly access the block in the cache. If not (cache miss), the block is copied from main memory into the cache, where it is stored for this reference and future ones. Clearly, the handling of cache hits is much faster than that of cache misses since the main memory is not involved.
    A memory access can be the reading of an instruction (a prerequisite of its execution) or the reading or writing of data during the execution of an instruction. The processor may have one unified cache that contains both instructions and data (e.g., ColdFire 5307), or two separated caches, one for instructions (I-cache) and one for data (D-cache). PowerPC 750 and 755 processors contain separate caches, having the same size, structure, and principal behavior.
    
B.  [$A$]-Way Set-Associative Caches

    There are three commonly used cache architectures: direct-mapped caches, fully associative caches, and [$A$]-way set-associative caches (where [$A$] is a natural number).
    An [$A$]-way set-associative cache consists of [$S$] cache sets [11]. Each cache set consists of [$A$] ways or lines, where the number [$A$] denotes the associativity of the cache. Each way can hold the copy of a memory block consisting of [$B$] consecutive bytes. Hence, the total capacity of the cache is [$S \cdot A$] memory blocks, or [$S \cdot A \cdot B$] bytes. Usually, the numbers [$S$], [$A$], and [$B$] are powers of 2: [$S = 2^s$], [$A = 2^a$], and [$B = 2^b$].

  • The Motorola ColdFire MCF 5307 has [$S = 128 = 2^7$], [$A = 4 = 2^2$], and [$B = 16 = 2^4$]. Hence, the total capacity of the cache is [$2^{13}$] bytes [$= 8$] kB.
  • Motorola PowerPC MPC 750 and 755 processors have caches with [$S = 128 = 2^7$], [$A = 8 = 2^3$], and [$B = 32 = 2^5$]. Hence, the total capacity of the caches is [$2^{15}$] bytes [$= 32$] kB each.
The other two cache architectures can be considered as degenerate special cases of [$A$]-way set-associativity: direct-mapped caches correspond to the case [$A = 1$] (each set has only one line), and fully associative caches correspond to the case [$S = 1$] (there is only one cache set).
    Each memory block can only be stored in one specific cache set. The number of this set consists of the lowest [$s$] bits of the block number. Thus, neighboring blocks will be stored in different cache sets.
    A cache line may be either valid, i.e., contain a memory block, or invalid, i.e., be currently free. A valid line containing block [$m$] not only contains the bit pattern forming the contents of [$m$], but also a tag identifying [$m$]. This tag is the block number of [$m$] without the [$s$] bits used as set number.
    In the PowerPC 750/755, addresses have 32 bits. The lowest 5 bits are chopped off to obtain the block number of 27 bits. Of these 27 bits, the 7 lower bits indicate the cache set where the block can be stored, and the 20 upper bits form the tag. Thus, [$2^{20} \approx 10^6$] memory blocks are competing to be stored in each set.
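    The address decomposition and capacity formula described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `CacheGeometry` and its fields are hypothetical and not part of any tool described in this paper; the numeric parameters are those given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper: cache parameters as exponents of 2."""
    s: int  # number of sets      S = 2**s
    a: int  # associativity       A = 2**a
    b: int  # block size in bytes B = 2**b

    @property
    def capacity_bytes(self):
        # Total capacity S * A * B
        return 1 << (self.s + self.a + self.b)

    def split(self, address):
        """Return (block number, set number, tag) of a byte address."""
        block = address >> self.b                 # omit the lowest b bits
        set_number = block & ((1 << self.s) - 1)  # lowest s bits of the block number
        tag = block >> self.s                     # remaining upper bits
        return block, set_number, tag


PPC755 = CacheGeometry(s=7, a=3, b=5)    # 128 sets, 8 ways, 32-byte blocks
COLDFIRE = CacheGeometry(s=7, a=2, b=4)  # 128 sets, 4 ways, 16-byte blocks

print(PPC755.capacity_bytes, COLDFIRE.capacity_bytes)
print(PPC755.split(0xFFFFFFFF))
```

    For a 32-bit address on the PowerPC 750/755, the block number has 27 bits, of which 7 select the set and 20 form the tag, in agreement with the description above. Two addresses that are 32 bytes apart fall into neighboring blocks and therefore into different sets.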
    When a memory block [$m$] is accessed, its number is partitioned into set number [$i$] and tag [$j$]. Then the tags of all valid lines in set [$i$] are compared with [$j$]. If there is a match, [$m$] has been found in the cache (cache hit). Otherwise, [$m$] is copied into the cache. For this, a line [$l$] of set [$i$] is determined where [$m$] is placed. If [$l$] is invalid, it is allocated for [$m$]. If it is valid, the memory block residing there so far is replaced by [$m$].
    The algorithm used to determine [$l$] is the replacement strategy of the cache. Common replacement strategies are least recently used (LRU), first in first out (FIFO), and random. SPARC processors have LRU caches, but ColdFire MCF 5307 and PowerPC 750/755 have special replacement strategies called pseudo-round-robin (ColdFire) and pseudo-LRU (PowerPC 750/755). In the following sections, we sketch the modeling of LRU, pseudo-round-robin, and pseudo-LRU caches. More complete descriptions of LRU cache analysis can be found in [12]–[14]. The important observation will be that a true LRU replacement strategy, in contrast to all kinds of “pseudo” strategies such as pseudo-LRU or pseudo-round-robin, offers the chance for very precise results from a cache analysis.

Table 1
Example for Age Updates (LRU)




Table 2
Age Update Function (LRU)



    
C.  LRU Caches

    In an LRU cache, each cache set has its own replacement logic. Therefore, the cache sets are independent from each other, and it suffices to describe the behavior of a single set. When speaking of “the cache” in the sequel, we actually mean this single set.
    1)  The LRU Strategy:    When a new memory block is copied into the cache and there are invalid lines, the block is written into the first such line. If all lines are valid, the LRU replacement strategy causes replacement of the memory block that has been least recently used. This can be modeled by assigning ages to the blocks in the cache. For an [$A$]-way set-associative cache, the set of ages is [${\cal A} = \{0, \ldots, A-1\}$]. The most recently used block has age 0, and the least recently used block has the maximal age [$A-1$].
    In case of a cache miss, the accessed block is put into the cache with age 0, all blocks in the cache age by 1, and the block with age [$A-1$] (if any) is removed from the cache. When a block is accessed that is currently in the cache with age [$a$], its age is reset to 0, all blocks younger than [$a$] age by 1, while blocks older than [$a$] are not affected.
    Table 1 presents a sample access sequence for a four-way set-associative cache, starting from an empty cache.
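    The LRU aging rule can also be illustrated by a small simulation of a single four-way set. The sketch below is not the sequence of Table 1; the block names and the access sequence are made up for illustration. It assumes that a set is represented as a list ordered by age, so that the list index of a block is its age.

```python
A = 4  # associativity of the simulated set


def lru_access(lines, block):
    """Access `block` in one LRU set.

    `lines` is a list of blocks ordered by age: lines[0] has age 0 (most
    recently used). Returns the new list and whether the access was a hit.
    """
    hit = block in lines
    rest = [m for m in lines if m != block]
    # The accessed block gets age 0. On a hit, only younger blocks shift by one.
    # On a miss, all blocks shift, and the block that had age A-1 drops out.
    return ([block] + rest)[:A], hit


lines, hits = [], []
for block in ["a", "b", "c", "d", "b", "e"]:  # hypothetical access sequence
    lines, hit = lru_access(lines, block)
    hits.append(hit)

print(lines)  # blocks ordered from age 0 to age 3
print(hits)
```

    In this run, the fifth access (to `b`) is the only hit. The final miss on `e` evicts `a`, which had reached the maximal age 3.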

    2)  Concrete Cache States:    The assignment of lines to memory blocks is irrelevant to the question of which blocks are in the cache at present and in the future. One only needs to know which blocks are in the cache and what their ages are. This information is given by a function [$c: {\cal M} \to {\cal A}'$] where [${\cal M}$] is the set of memory blocks and [${\cal A}' = {\cal A} \cup \{\top\}$] is the set of ages plus an additional element [$\top$]. For a block [$m$], [$c(m) = a \neq \top$] means [$m$] is in the cache with age [$a$], while [$c(m) = \top$] means [$m$] is not in the cache. A concrete cache state is such a function, restricted by the property that no two different memory blocks have the same age [$\neq \top$].

    3)  Updates of Concrete Cache States:    When a memory block [$m_0$] is accessed, the current concrete cache state [$c$] is updated into a new concrete cache state [$c' = {\rm up}(m_0)(c)$] defined by

[$$ c'(m) = \begin{cases} 0, & \text{if } m = m_0 \\ {\rm up}_{\cal A}(c(m_0))(c(m)), & \text{otherwise} \end{cases} \tag{1} $$]

using an age update function [${\rm up}_{\cal A}{:} {\cal A}' \to ({\cal A}' \to {\cal A}')$]. The age update function for four-way set-associative caches is shown in Table 2.
    If the accessed block is not in the cache, all other blocks age by one, and the one with age 3 (if any) is removed (last line). Otherwise, all blocks younger than the accessed block age by one, and all older blocks keep their age. Note that [${\rm up}_{\cal A}(a)(a)$] is undefined for [$a \neq \top$]. The reason is that these values are not needed because different memory blocks have different ages [$\neq \top$] in concrete cache states.
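    The age update function and the state update (1) can be written down directly from this verbal description. The following Python sketch is one possible encoding, not the authors' implementation: it assumes that [$\top$] is represented by the integer [$A$] (so that the usual integer order gives [$0 < \cdots < A-1 < \top$]), that a state is a dictionary in which absent blocks have age [$\top$], and that undefined values are represented by `None`. All function names are hypothetical.

```python
A = 4
TOP = A  # assumption: encode the extra element ⊤ as the integer A


def up_age(a0, a):
    """Concrete age update up_A(a0)(a).

    a0 is the old age of the accessed block (TOP on a miss), a is the old age
    of some other block. Returns None where the function is undefined.
    """
    if a == TOP:
        return TOP                 # a block outside the cache stays outside
    if a == a0:
        return None                # two different blocks never share an age != TOP
    if a < a0:                     # younger than the accessed block: age by one
        return a + 1 if a + 1 < A else TOP
    return a                       # older blocks keep their age


def update(state, m0):
    """Equation (1): new concrete state after an access to block m0."""
    a0 = state.get(m0, TOP)
    new_state = {m0: 0}
    for m, a in state.items():
        if m != m0:
            new_age = up_age(a0, a)
            if new_age != TOP:     # blocks with age TOP are simply not stored
                new_state[m] = new_age
    return new_state


def age_table():
    """Tabulate up_A for all pairs of ages; rows are indexed by a0."""
    return {a0: [up_age(a0, a) for a in range(A + 1)] for a0 in range(A + 1)}


state = {"x": 0, "y": 1, "z": 2, "w": 3}
print(update(state, "z"))  # hit on z: x and y age, w keeps age 3
print(update(state, "v"))  # miss: all blocks age, w is removed
```

    The row of `age_table()` for `TOP` corresponds to the miss case, in which every block ages by one and the block with age 3 leaves the cache.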

    4)  Full Cache Analysis:    Full cache analysis tries to compute for each program point (and calling context) the set of all concrete cache states possible at that point. Since this set is not computable in general, the analysis can only produce a safe approximation of the exact set, which in this case means a superset. This approximation should be as precise as possible, i.e., the superset should be close to the exact set.

Table 3
Age Update for Must Analysis (LRU)



    In practice, full cache analysis is intractable since the memory consumption of the analyzer would be prohibitive. Thus, two less ambitious analyses were developed by Ferdinand [12], [13]: must analysis (Section III-C5) telling which memory blocks are certainly (must be) in the cache, and may analysis (Section III-C6) telling which memory blocks may be in the cache.

Table 4
Age Update for May Analysis (LRU)




    5)  Must Analysis:    The basic idea of must analysis is to approximate the set [$C$] of concrete cache states possible at a program point [$\pi$] by one abstract cache state [$C^{u}$] that provides upper bounds for the ages of memory blocks in all states contained in [$C$]. To formalize the idea of an upper bound, the set [${\cal A}' = \{0, \ldots, A-1, \top\}$] of ages is ordered by [$0 < 1 < \cdots < A-1 < \top$]. Hence, [$C^{u} (m) < \top$] implies [$c(m) < \top$] for all [$c$] in [$C$], i.e., all states in [$C$] agree that block [$m$] is in the cache. Thus, one may say that [$m$] must be in the cache at program point [$\pi$], no matter what the concrete cache state at [$\pi$] is.
    Abstract ages are upper bounds of concrete ages: an abstract age [$a$] stands for concrete ages [$0, \ldots, a$]. Hence, the update function [${\rm up}_{\cal A}^{u}$] for abstract ages is derived from the function [${\rm up}_{\cal A}$] for concrete ages by

[$$ {\rm up}_{\cal A}^{u} (a_0)(a) = \max \{{\rm up}_{\cal A}(a_0') (a') \,\vert\, a_0' \leq a_0, a' \leq a \} $$]

where undefined values [${\rm up}_{\cal A}(a_0') (a')$] are neglected, and [$\max \emptyset$] is set to [$0$]. The resulting function for [$A = 4$] is shown in Table 3.
    Like concrete cache states, abstract cache states are functions from [${\cal M}$] to [${\cal A}'$], but they may map different memory blocks to the same age. An abstract state [$C^{u}$] approximates a concrete state [$c$] if [$C^{u}(m) \geq c (m)$] for all memory blocks [$m$]. The update function [${\rm up}^{u}$] for abstract cache states of must analysis has the same form as the one for concrete cache states (1), but uses the abstract age update function [${\rm up}_{\cal A}^{u}$] instead of [${\rm up}_{\cal A}$]. It is correct in the sense that if [$C^{u}$] approximates a concrete state [$c$], then [${\rm up}^{u} (m_0) (C^{u})$] approximates [${\rm up}(m_0) (c)$].
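    The derivation of [${\rm up}_{\cal A}^{u}$] from [${\rm up}_{\cal A}$] can be reproduced mechanically by evaluating the maximum in the defining formula over all admissible pairs of concrete ages. The sketch below does this by brute force for [$A = 4$]. It is an illustration only and rests on the same assumptions as before: [$\top$] is encoded as the integer [$A$], undefined values are `None`, abstract states are dictionaries in which absent blocks have age [$\top$], and all names are hypothetical.

```python
A = 4
TOP = A                  # assumption: ⊤ encoded as the integer A
AGES = range(A + 1)      # 0, ..., A-1, TOP


def up_age(a0, a):
    """Concrete age update up_A(a0)(a); None where undefined."""
    if a == TOP:
        return TOP
    if a == a0:
        return None
    if a < a0:
        return a + 1 if a + 1 < A else TOP
    return a


def up_age_must(a0, a):
    """Abstract age update of must analysis: maximum over all smaller ages."""
    values = [up_age(x0, x) for x0 in AGES if x0 <= a0 for x in AGES if x <= a]
    defined = [v for v in values if v is not None]  # neglect undefined values
    return max(defined, default=0)                  # max of the empty set is 0


def update_must(state, m0):
    """Same form as equation (1), using the abstract age update."""
    a0 = state.get(m0, TOP)
    new_state = {m0: 0}
    for m, a in state.items():
        if m != m0:
            new_age = up_age_must(a0, a)
            if new_age != TOP:
                new_state[m] = new_age
    return new_state


s1 = update_must({"x": 0, "y": 2}, "z")  # z not known to be cached
s2 = update_must(s1, "w")                # y reaches TOP and is dropped
print(s1, s2)
```

    Under this encoding, the computed function ages a block by one exactly when its upper bound is below that of the accessed block, and otherwise leaves the bound unchanged.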

    6)  May Analysis:    May analysis is dual to must analysis. It approximates the set [$C$] of concrete cache states possible at a program point [$\pi$] by one abstract state [$C^{\ell}$] that provides lower bounds for the ages of memory blocks in all states contained in [$C$]. The order on [${\cal A}'$] is the same as in must analysis: [$0 < 1 < \cdots < A-1 < \top$]. Hence, [$C^{\ell} (m) = \top$] implies [$c (m) = \top$] for all [$c$] in [$C$], i.e., all states in [$C$] agree that block [$m$] is not in the cache.
    An abstract age [$a$] now stands for concrete ages [$a, \ldots, A-1, \top$]. Hence, the update function [${\rm up}_{\cal A}^{\ell}$] for abstract ages of may analysis is derived from the concrete function [${\rm up}_{\cal A}$] by

[$$ {\rm up}_{\cal A}^{\ell}(a_0) (a) = \min \{{\rm up}_{\cal A}(a_0') (a') \,\vert \, a_0' \geq a_0, a' \geq a \} $$]

where undefined values [${\rm up}_{\cal A}(a_0') (a')$] are neglected. The resulting function for [$A = 4$] is shown in Table 4.
    Again, abstract cache states are arbitrary functions from [${\cal M}$] to [${\cal A}'$].

Table 5
Example for Combined Must and May Analysis (LRU)


Now, an abstract state [$C^{\ell}$] approximates a concrete state [$c$] if [$C^{\ell} (m) \leq c (m)$] for all memory blocks [$m$]. The update function [${\rm up}^{\ell}$] for abstract cache states, which results from (1) by replacing [${\rm up}_{\cal A}$] by [${\rm up}_{\cal A}^{\ell}$], is correct in the same sense as the update function of must analysis.
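    As for must analysis, the age update of may analysis can be obtained by evaluating its defining formula by brute force, this time taking the minimum over all larger concrete ages. The sketch below does so for [$A = 4$]. It is only an illustration under the following assumptions: [$\top$] is encoded as the integer [$A$], undefined values are `None`, abstract states are dictionaries in which absent blocks have lower bound [$\top$] (certainly not cached), and all names are hypothetical.

```python
A = 4
TOP = A                  # assumption: ⊤ encoded as the integer A
AGES = range(A + 1)      # 0, ..., A-1, TOP


def up_age(a0, a):
    """Concrete age update up_A(a0)(a); None where undefined."""
    if a == TOP:
        return TOP
    if a == a0:
        return None
    if a < a0:
        return a + 1 if a + 1 < A else TOP
    return a


def up_age_may(a0, a):
    """Abstract age update of may analysis: minimum over all larger ages."""
    values = [up_age(x0, x) for x0 in AGES if x0 >= a0 for x in AGES if x >= a]
    # The pair (TOP, TOP) is always admissible, so the minimum always exists.
    return min(v for v in values if v is not None)


def update_may(state, m0):
    """Same form as equation (1), using the may age update."""
    a0 = state.get(m0, TOP)
    new_state = {m0: 0}
    for m, a in state.items():
        if m != m0:
            new_age = up_age_may(a0, a)
            if new_age != TOP:
                new_state[m] = new_age
    return new_state


t1 = update_may({"x": 0, "y": 3}, "z")  # y may have been evicted: bound becomes TOP
t2 = update_may(t1, "x")
print(t1, t2)
```

    Under this encoding, a block ages by one whenever its lower bound does not exceed that of the accessed block, which is the dual of the behavior of must analysis.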

    7)  Must and May Analysis Together:    Must and may analysis performed together yield lower and upper bounds, i.e., intervals. Thus, the combined analysis has abstract states [$C^i: {\cal M} \to {\cal I}({\cal A}')$], where [${\cal I}({\cal A}') = \{ [l,u] \in {\cal A}' \times {\cal A}' \,\vert \, l \leq u \}$] is the set of age intervals. Table 5 shows the evolution of an interval cache state under a sequence of accesses.
    The example starts with the “unknown” abstract cache, which maps all memory blocks to the interval [$[0,\top]$] that provides no information. This is the appropriate state at the entry of a task analyzed separately where the analyzer has no information about the previously executed code. The example shows that this lack of knowledge only matters at the beginning: the first three accesses cannot be classified as hits or misses, but cause the intervals to shrink. From access 5 in this example, all intervals are singletons, i.e., the cache analyzer has exact knowledge about the cache contents. This exact knowledge may be destroyed by control-flow joins where the incoming intervals have to be replaced by their join, i.e., the least interval containing all of them. Accesses whose target address is not exactly known are another source of uncertainty (see Section III-F). Yet, these uncertainties disappear again while straight-line code with exactly known target addresses for accesses is analyzed, as happened in the example of Table 5. So LRU caches admit a quite precise analysis, leading to complete knowledge of the cache contents in some cases.
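    The join of interval states at a control-flow merge, and the classification of an access from an age interval, can be sketched as follows. The example states are made up and do not reproduce Table 5. The sketch assumes that [$\top$] is encoded as the integer [$A$], that an interval is a pair `(lower, upper)`, and that both incoming states are dictionaries over the same finite set of blocks; the function names are hypothetical.

```python
A = 4
TOP = A  # assumption: ⊤ encoded as the integer A


def join(state1, state2):
    """Blockwise least interval containing both incoming intervals."""
    return {
        m: (min(state1[m][0], state2[m][0]), max(state1[m][1], state2[m][1]))
        for m in state1
    }


def classify(interval):
    """Classify an access to a block whose age lies in `interval`."""
    lower, upper = interval
    if upper < TOP:
        return "always hit"      # must information: block is certainly cached
    if lower == TOP:
        return "always miss"     # may information: block is certainly not cached
    return "not classified"


# Hypothetical states reaching a merge point along two different paths.
path1 = {"a": (0, 0), "b": (1, 1), "c": (TOP, TOP)}
path2 = {"a": (1, 1), "b": (0, 0), "c": (2, 2)}
merged = join(path1, path2)

print(merged)
print({m: classify(iv) for m, iv in merged.items()})
```

    On each path the analyzer has exact knowledge, but after the join only blocks `a` and `b` can still be classified. Block `c` is cached on one path and not on the other, so its interval widens to include [$\top$].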

    
D.  ColdFire MCF 5307: Pseudo-Round-Robin Replacement

    The ColdFire cache has a size of 8 kB. It is four-way set-associative with 128 cache sets of four lines each. Each line may store a memory block of 16 bytes. As in all set-associative caches, each memory block [$m$] can only be put into one cache set, whose number is derived from the address of [$m$].
    The ColdFire MCF 5307 employs a so-called pseudo-round-robin replacement strategy. The state of the replacement logic is given by a 2-bit counter. The counter is neither used nor modified in case of a cache hit or if a block is put into a set with empty lines; in the latter case, the block is put in the first such line. If a block is put into a full cache set, the 2-bit counter indicates which of the four lines is replaced. After the replacement, the counter is increased by one (modulo 4). There is only one counter for the whole cache. Hence, a replacement in one cache set influences all other sets.
    1)  An Example:    Assume a program accesses the memory blocks [$0, 1, 2, 3,\ldots,$] and block [$i$] is put into cache set [$i$] mod 128. Such a scenario corresponds to a linear program without data access to memory (all data are in registers or in some noncachable memory area). Assume further the program starts with an empty cache. Then the blocks 0–127 are put into the first line of each set, the blocks 128–255 into the second line, 256–383 into the third line, and 384–511 into the fourth line. The resulting cache state is depicted in Table 6, where the columns represent the cache sets. The next memory block 512 is put into set 0. The counter has not been used so far, and still has value 0; hence, block 512 is put into line 0 and replaces block 0. The counter is set to 1, and so, block 513 is put into line 1 of set 1, replacing block 129. Continuing like this until block 639, the resulting cache state is as shown in Table 7, where the recently added blocks are printed in boldface. Block 640 then replaces 512, 641 replaces 513, etc.

Table 6
Example: ColdFire Cache (After Block 511)




Table 7
Example: ColdFire Cache (After Block 639)



    From this example, one may learn that some blocks (like 1 and 128) may stay in the cache forever although they are never referenced again, while other blocks (like 512 and 513) are removed from the cache when their cache set is referenced for the next time. Although these remarks in their full strength only hold in this regular example, they show that in general, an analysis must take into account that some blocks may survive many cache updates, while others are thrown out immediately.
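    The example can be replayed with a small simulation. The following is only a sketch of the replacement policy as described above, not the hardware or the analyzer; the class and function names are ours, and the mapping of block [$i$] to set [$i$] mod 128 is the one assumed in the example.

```python
NUM_SETS = 128   # cache sets of the MCF 5307 cache
NUM_WAYS = 4     # lines per set


class PseudoRoundRobinCache:
    """Concrete cache with one global 2-bit replacement counter."""

    def __init__(self):
        # sets[s][w] is the block stored in line w of set s, or None if empty
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.counter = 0

    def access(self, block):
        """Access a memory block; return True on a hit, False on a miss."""
        lines = self.sets[block % NUM_SETS]
        if block in lines:
            return True                       # hit: counter is not touched
        if None in lines:
            lines[lines.index(None)] = block  # first empty line, counter not touched
            return False
        lines[self.counter] = block           # full set: the counter selects the victim
        self.counter = (self.counter + 1) % NUM_WAYS
        return False


def run_linear(num_blocks):
    """Access blocks 0, 1, ..., num_blocks - 1, starting from an empty cache."""
    cache = PseudoRoundRobinCache()
    for block in range(num_blocks):
        cache.access(block)
    return cache
```

    Calling `run_linear(640)` leaves blocks 512, 128, 256, 384 in set 0 and blocks 1, 513, 257, 385 in set 1, so blocks 128 and 1 indeed survive while 0 and 129 have been replaced.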

    2)  Problems:    The computation of (an approximation of) the set of all concrete cache states possible at a program point is intractable. An abstraction of this set into one abstract cache state should contain a model of the counter. The counter stays the same or increases by one; in the presence of uncertainties caused by control-flow joins or an initially unknown cache state, the analyzer cannot know what happens to the counter if an access cannot be classified as hit or miss. After three such uncertainties, any of the four counter values is possible, so all counter information is lost and can never be recovered.
    Absolute counter values can be avoided by assigning ages to the lines: The line the counter points to has age 3, the next line age 2 etc. Yet, ages stay the same or increase by one; and sometimes one does not know what happens to them. Thus, there is the same problem as above: after three uncertainties, all age information is lost.
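    The loss of counter information can be illustrated by tracking the set of counter values that are still possible. This is a sketch under our own naming; it models only the counter, not the cache contents.

```python
COUNTER_VALUES = 4  # the replacement counter has 2 bits


def after_miss_in_full_set(possible):
    """A definite replacement advances every possible counter value by one."""
    return frozenset((c + 1) % COUNTER_VALUES for c in possible)


def after_unclassified_access(possible):
    """Access not classified as hit or miss: the counter may stay or advance."""
    return frozenset(possible) | after_miss_in_full_set(possible)


def uncertainties_until_lost(start):
    """Count the unclassified accesses needed until every value is possible."""
    possible = frozenset({start})
    steps = 0
    while len(possible) < COUNTER_VALUES:
        possible = after_unclassified_access(possible)
        steps += 1
    return steps
```

    Once all four values are possible, neither function can shrink the set again, which is why the information cannot be recovered.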
    May analysis tries to determine which blocks may be in the cache at a given program point (or equivalently, which blocks are certainly not in the cache). Without counter or age information, one can never be sure that a block is removed from the cache. Thus, may sets get larger and larger. When starting from an unknown cache state, the initial may set already contains all memory blocks, and this never changes. Therefore, may analysis for the ColdFire cache is completely useless.

    3)  Must Analysis:    In must analysis, we want to compute the set [$M$] of memory blocks that are definitely in the cache (for each program point). Initially, the set [$M$] is empty, no matter whether we start out with an empty cache or a cache with an unknown state, since in the latter case, we do not know of any memory block that is definitely in the cache. When a memory block [$m$] is accessed, it will certainly be in the cache afterwards so that it can be added to [$M$]. If it was not in the cache before, then another block may be thrown out of the cache. Without counter or age information, we do not know which one. Hence, whenever a new element [$m$] is added to [$M$], all elements of [$M$] that are in the cache set where [$m$] is put must be removed from [$M$]. Therefore, [$M$] can contain at most one memory block for each cache set. This property is also preserved at control-flow joins where all incoming sets are replaced by their intersection.
    This kind of must analysis is simple and efficient, but not very precise: for each cache set, it determines at most one memory block that is definitely in the cache, although concretely, a cache set can hold up to four blocks. Thus, one may say that the analysis models only [$1/4$] of the cache, but we do not know of any better analysis.
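    A minimal sketch of this must analysis is given below. The function names are ours, and the set mapping (block number modulo 128) is the simplifying assumption of the example above rather than the real address decoding.

```python
NUM_SETS = 128  # assumption: block b is mapped to cache set b mod 128


def cache_set(block):
    return block % NUM_SETS


def must_update(must, block):
    """Abstract effect of an access to `block` on the must set."""
    if block in must:
        return must                           # guaranteed hit, nothing is evicted
    target = cache_set(block)
    # Some unknown line of the target set may be replaced, so every block
    # of that set loses its guarantee; the accessed block gains one.
    survivors = frozenset(b for b in must if cache_set(b) != target)
    return survivors | {block}


def must_join(left, right):
    """Control-flow join: keep only blocks guaranteed on both paths."""
    return left & right


def is_guaranteed_hit(must, block):
    return block in must
```

    For instance, accessing blocks 5, 133, and 6 in this order from the empty must set yields the set containing 133 and 6 only, since 5 and 133 compete for the same cache set.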

    
E.  PowerPC 750/755: Pseudo-LRU Replacement

    PowerPC 750/755 processors have two separate caches for instructions and data. Each cache has a size of 32 kB and is eight-way set-associative with 128 cache sets of eight lines each. Each line may store a memory block of 32 bytes. The replacement logics of the two caches are of the same kind.
    Each cache set has its own instance of the replacement logic. Therefore, the cache sets are independent from each other, and it suffices to describe the behavior of a single set. When speaking of “the cache” in the sequel, we actually mean a single set of one of the two caches.
    1)  Pseudo-LRU Replacement Strategy:    Older PowerPC models have four-way set-associative caches with LRU replacement. After the upgrade to eight-way set-associative caches, LRU was replaced by a so-called pseudo-LRU (PLRU) strategy to save hardware costs [15].
    In the following description of PLRU, the eight lines of the cache will be called [${\rm L0}, {\rm L1},\ldots, {\rm L7}$]. The PLRU replacement logic for such an eight-line cache has an inner state given by the values of 7 bits [${\rm B0}, {\rm B1},\ldots, {\rm B6}$]. When memory block [$m_0 \in {\cal M}$] is accessed, the following happens.
  1. Determine what to do, and the involved line:
    • If [$m_0$] is already in the cache (hit), let [$l_0$] be the line where it is.
    • If [$m_0$] is not in the cache (miss):
      - If there is an invalid line, let [$l_0$] be the first such line, and put [$m_0$] there (allocate).
      - If all lines are valid, let [$l_0$] be the line the replacement bits point to. This line is calculated from the settings of the replacement bits [${\rm B0},\ldots, {\rm B6}$] as specified in Fig. 1. Put [$m_0$] into line [$l_0$] replacing its previous contents.

  2. Update the replacement bits so that they point away from the involved line [$l_0$]. The update is specified in Table 8. The bits not mentioned in the table are not changed.


Fig. 1. Determination of replacement line (PLRU).



Table 8
PLRU Bit Update Rules




Fig. 2. Effect of repeated misses on the PLRU cache.


    The rule for updating the replacement bits negates the three bit values that lead to the replacement of [$l_0$]. For instance, L5 is selected if [${\rm B0} = 1, {\rm B2} = 0$], and [${\rm B5} = 1$], and for [$l_0 = {\rm L5}$], the bit updates are [${\rm B0} {:}= 0, {\rm B2} {:}= 1$], and [${\rm B5} {:}= 0$].
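    The two steps can be sketched in code. Since Fig. 1 and Table 8 are not reproduced in this text, the decision tree below is an assumption: a binary tree in which B0 selects between B1 and B2, B1 between B3 and B4, B2 between B5 and B6, and B3 to B6 each select between two neighboring lines. This layout is consistent with the L5 example above; the function names are ours.

```python
def victim(bits):
    """Index of the line that the replacement bits (B0, ..., B6) point to."""
    node = 0                              # start at B0
    for _ in range(3):                    # three decision levels
        node = 2 * node + 1 + bits[node]  # bit 0: first child, bit 1: second child
    return node - 7                       # tree leaves 7..14 stand for L0..L7


def touch(bits, line):
    """Update the bits so that they point away from `line`."""
    bits = list(bits)
    node = line + 7
    while node > 0:
        parent = (node - 1) // 2
        # an odd index is a first child (selected by 0), so store 1; else store 0
        bits[parent] = 1 if node % 2 == 1 else 0
        node = parent
    return tuple(bits)


def access(lines, bits, block):
    """Access `block` in one cache set; `lines` holds 8 blocks (None = invalid).

    Returns (hit, new_bits) and updates `lines` in place."""
    if block in lines:
        hit, line = True, lines.index(block)
    else:
        hit = False
        line = lines.index(None) if None in lines else victim(bits)
        lines[line] = block
    return hit, touch(bits, line)


def fmt(bits):
    """Render the bits grouped by decision level, e.g. '1 10 1000'."""
    s = "".join(str(b) for b in bits)
    return f"{s[0]} {s[1:3]} {s[3:]}"
```

    With these definitions, `victim((1, 0, 0, 0, 0, 1, 0))` selects L5, and `touch` applied to L5 sets B0, B2, and B5 to 0, 1, and 0, as stated above.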

    2)  Examples:    In the following examples, the values of [${\rm B0},\ldots, {\rm B6}$] are written in the format [$\hbox{\tt x xx xxxx}$], where the gaps group the decision levels (cf. Fig. 1).

    Example 1:    First, assume the bit setting is [$\hbox{\tt 0 00 0000}$], and all lines are invalid. (This is the situation after cache invalidation.) The first access is a miss since all lines are invalid. The accessed memory block is placed into the first invalid line, [$l_0 ={\rm L0}$]. Coincidentally, this is the line the bit setting [$\hbox{\tt 0 00 0000}$] points to. At the end, the bit setting is updated to [$\hbox{\tt 1 10 1000}$]. Assume the second access is again a miss. Then the accessed memory block is placed into the first invalid line, which is now [$l_0 ={\rm L1}$]. Note that this time, the line the current bit setting [$\hbox{\tt 1 10 1000}$] points to is different, namely L4. At the end, the bit setting is updated to [$\hbox{\tt 1 10 0000}$], which also points to L4.

Fig. 3. Example involving hits (PLRU).


    In all the other examples, we assume that all lines are valid. Hence, a miss replaces the block in the line that the current bit setting points to. This line is indicated after each bit setting, following the arrow.
    Fig. 2 shows the behavior of the cache if only misses occur, for some arbitrarily chosen initial setting of the replacement bits. All eight lines are replaced (in some strange order), and after eight misses, the original bit setting is recovered. These observations are true for all 128 possible initial settings of the 7 replacement bits.
    The cache is less well behaved if hits may occur.
    Fig. 3 shows the effect of alternating between accessing the block in L0 and accessing a block not in the cache (miss). Note that the state in the last line of the example is the same as the state in the second line. Hence, the states will cycle through the ones listed in the example forever unless the regular access pattern changes. The blocks in L4–L7 are replaced by new blocks, while the blocks in L0–L3 stay in the cache forever. For L0, this is natural since the block in L0 is continuously accessed. Yet, the blocks in L1–L3 also survive although they are never accessed. Such a behavior could not happen with a proper LRU strategy.
    Although these remarks in their full strength only hold in this regular example, they show that in general, one must take into account that some blocks may survive many cache updates although they are never accessed, while others are thrown out quickly.
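    Both observations can be checked exhaustively with a few lines of code. The sketch below assumes the same binary decision tree over B0 to B6 as is standard for this kind of PLRU logic (B0 selects between B1 and B2, and so on); the tree layout and all names are our assumptions, since the figures are not reproduced here.

```python
from itertools import product


def victim(bits):
    """Line that the bits (B0, ..., B6) point to, in the assumed tree layout."""
    node = 0
    for _ in range(3):
        node = 2 * node + 1 + bits[node]
    return node - 7


def touch(bits, line):
    """Make the bits point away from `line`."""
    bits = list(bits)
    node = line + 7
    while node > 0:
        parent = (node - 1) // 2
        bits[parent] = 1 if node % 2 == 1 else 0
        node = parent
    return tuple(bits)


def eight_misses(bits):
    """Return the lines replaced by eight consecutive misses and the final bits."""
    replaced = []
    for _ in range(8):
        line = victim(bits)
        replaced.append(line)
        bits = touch(bits, line)
    return replaced, bits


def all_settings_cycle():
    """Check the only-misses observation for every 7-bit setting."""
    for bits in product((0, 1), repeat=7):
        replaced, final = eight_misses(bits)
        if sorted(replaced) != list(range(8)) or final != bits:
            return False
    return True


def alternate_hit_l0_and_miss(bits, rounds):
    """Alternate a hit on L0 with a miss; return the set of replaced lines."""
    replaced = set()
    for _ in range(rounds):
        bits = touch(bits, 0)      # hit on the block in L0
        line = victim(bits)        # miss: the bits select the victim
        replaced.add(line)
        bits = touch(bits, line)
    return replaced
```

    In this model, the alternating pattern only ever replaces lines L4 to L7, because each hit on L0 sets B0 so that it points to the other half of the set.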

    3)  Analyses:    Like LRU, the PLRU strategy admits the introduction of ages. Yet, the age update function is not as regular as the one of the LRU strategy, which hampers both must and may analysis.
    In fact, may analysis does not yield any information at all: starting from an unknown cache, it never determines any memory block that certainly is removed from the cache. The example of Fig. 3 shows that indeed some blocks may reside in the cache forever although they are never accessed.
    Must analysis does yield some information, but not as much as in LRU caches: it finds at most four memory blocks in every cache set (of eight blocks possible in practice). This analysis is more complicated than the ColdFire analysis, but models [$1/2$] of the cache.
    To assess how much information a cache analysis of the PLRU cache strategy loses compared to a cache analysis of an LRU cache of the same size, one can compare the results of the two must analyses. An ad-hoc precision parameter is the number of cache lines guaranteed to be in the cache at a program point. Summing this parameter over all program points and dividing the sum by the number of program points gives a measure for a given program under analysis. The higher this value, the better the analysis is. Since we do not know the number of cache lines to be guaranteed in the concrete execution of the program, this constitutes only a relative precision comparison. Fig. 4 gives the resulting precision value for a larger benchmark (84 kB) of PowerPC code for the instruction cache must analysis. This benchmark contains code pieces typical for avionics software (filters, CRC computation, etc.). The different lines in the figure correspond to different context mappings of the underlying data-flow analysis. Context mappings are used to distinguish different execution histories (e.g., loop iterations or call sequences) of a program in the data-flow analysis.
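    One possible way to write down this ad-hoc measure, in notation that is ours and not taken from the tool: let [$P$] be the set of program points (in a given context mapping) and [$M_p$] the must set computed for [$p \in P$]. Then

[$$ \hbox{precision} = {1 \over |P|} \sum_{p \in P} |M_p|. $$]

    Under this reading, two analyses of the same program are compared by the quotient of their precision values.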

Fig. 4. PLRU versus LRU cache analysis.


    As the results show, the precision depends on the precision of the data-flow analysis itself, i.e., on the mapping used. The cs0 mapping uses the callstring(0) approach. This approach does not, e.g., distinguish different loop iterations in the analysis, so it is not very exact. Therefore, the results for LRU and PLRU are nearly the same. The vivu mapping distinguishes call histories and the first iteration of loops from the remaining iterations. The other [$\hbox{\tt vivu}\hbox{\tt (n)}$] mappings distinguish in addition up to [$n$] loop iterations. As can be seen, the gap between the results for an LRU and a PLRU cache increases for the more precise analyses, up to a factor of 1.609. Taking into account that the benchmark is only about 2.5 times the size of the whole cache, this is already a significant loss of precision for the WCET prediction.

    
F.  Data Caches

    In the description above, we always assumed that the address of a memory access is exactly known. While this is true for instruction access and data access with absolute addressing, the addresses of indirectly accessed data are in general unknown at compile time.
    Assume the concrete cache state is [$c$] when an access happens that may refer to block [$m_1$] or block [$m_2$]. Then the resulting cache state is [${\rm up}(m_1) (c)$] or [${\rm up}(m_2) (c)$]. This is the same situation as at a control-flow join, where two different concrete cache states may arrive. So with abstract cache states, accesses to unknown addresses may be handled by the same merge operation as at control-flow joins.
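    As an illustration, the sketch below handles an access with several candidate targets by joining the results of the individual updates. It uses the simple ColdFire-style must sets of Section III-D as the abstract domain; the names and the set mapping (block number modulo 128) are our assumptions.

```python
from functools import reduce

NUM_SETS = 128  # assumption: block b is mapped to cache set b mod 128


def must_update(must, block):
    """Must-set update for an access to a known block."""
    if block in must:
        return must
    target = block % NUM_SETS
    return frozenset(b for b in must if b % NUM_SETS != target) | {block}


def must_join(left, right):
    """Merge operation used at control-flow joins."""
    return left & right


def access_unknown(must, candidates):
    """Access whose target is one of `candidates` (a non-empty sequence)."""
    return reduce(must_join, (must_update(must, m) for m in candidates))
```

    With the must set containing blocks 1 and 2, an access that may refer to block 1 or block 129 leaves only block 2 guaranteed: neither candidate is known to be in the cache afterwards, and block 1 may have been evicted.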
    The resulting loss of precision is more problematic for unified caches (e.g., ColdFire MCF 5307) than for separate data and instruction caches (e.g., PowerPC 750/755). In the latter case, abstract instruction cache states are not ruined by indirect data accesses.
    To limit the loss of information due to indirect addressing, it is of paramount importance to safely reduce the set of possible target addresses. Two methods can be used: the exploitation of knowledge about memory allocation by the compiler, and a value analysis attempting to determine effective addresses at compile time.
    In [16], methods are described to statically determine the addresses of memory references to procedure parameters or local variables by a static stack level simulation [17]. This method works well for programs that use only scalar variables.
    
G.  Value Analysis

    Value analysis computes for each processor register an interval of possible values as approximations to the values occurring during runtime. To do this, abstract versions of all processor instructions have to be modeled that are based on interval values as operands. This includes not only simple arithmetic operations like add or mul, but also complex addressing modes like register indirect with scaled index to approximate the effective addresses of memory references.
    Since registers and memory cells have a finite precision, the detection of (possible) overflows requires special attention to compute a correct approximation. For example, the add instruction is implemented as follows:

[$$\displaylines{ [l_1, u_1] + [l_2, u_2]\hfill\cr \hfill {:}= \cases{ [l_1 + l_2, u_1 + u_2], \hfill & {if\ no\ overflow\ is\ possible}\hfill\cr {\hbox{ unknown}},\hfill &{otherwise}. \hfill} }$$]

The [$\sqcup$] operator for merging two abstract register or memory cell values at control-flow joins yields the smallest interval enclosing both operands

[$$ [l_1, u_1] \sqcup [l_2, u_2] {:}= [ \min (l_1, l_2), \max (u_1, u_2) ]. $$]


    Sometimes the approximated values indicate that a branch condition always (or never) holds. Then, value analysis has detected an infeasible path. The information about infeasible paths is forwarded to the cache and pipeline analyses to improve analysis quality by reducing the number of combine operations at control-flow joins.
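    The abstract operations can be sketched as follows. The sketch assumes signed 32-bit registers and represents the value "unknown" by the full range; both choices, as well as all names, are ours. The comparison function illustrates how an interval result can decide a branch condition.

```python
WIDTH = 32                                   # assumption: 32-bit signed registers
INT_MIN = -(1 << (WIDTH - 1))
INT_MAX = (1 << (WIDTH - 1)) - 1
UNKNOWN = (INT_MIN, INT_MAX)                 # no information about the value


def add(a, b):
    """Abstract add on intervals (lower, upper)."""
    lower, upper = a[0] + b[0], a[1] + b[1]
    if lower < INT_MIN or upper > INT_MAX:
        return UNKNOWN                       # an overflow is possible
    return (lower, upper)


def join(a, b):
    """Merge two intervals at a control-flow join."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def less_than(a, b):
    """Classify the branch condition a < b as 'always', 'never', or 'unknown'."""
    if a[1] < b[0]:
        return "always"
    if a[0] >= b[1]:
        return "never"
    return "unknown"
```

    A result of "always" or "never" means that one of the two branch successors lies on an infeasible path and need not be considered by the cache and pipeline analyses.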

IV.  ANALYSIS OF A SIMPLE PIPELINE

    The foundations of pipeline analysis and a proposal for a pipeline analysis for the MicroSPARC architecture are described in [12], and a first implementation for the superscalar pipeline of the SuperSPARC I is reported in [18] and [19].
    Here, a short characterization of the SuperSPARC I architecture is given. It has a three-way superscalar pipeline, and groups instructions with at most one memory instruction per group. It has separate first-level data and instruction caches with four instructions per cache line. Loads into the cache work in burst mode, i.e., two lines are loaded together. The cache replacement strategy is LRU.
    The SuperSPARC I performs static branch prediction with conditional branches predicted as taken and has a delay slot for one instruction. At a conditional branch, blocks of four consecutive instructions are prefetched in both directions, namely, instructions in the drop-through direction into a sequential prefetch queue (SPQ) with a capacity of eight instructions and instructions starting at the branch target into a target prefetch queue (TPQ) with a capacity of four instructions.
    
A.  Dependence of the Caches on the Pipeline

    Instruction prefetching across a conditional branch will “damage” the instruction cache, since prefetching in the direction of sequential control flow is performed before the branch is identified, and prefetching in the direction of the target is performed when the branch has been decoded but before the condition is evaluated and folded. The damage consists in the replacement of potentially useful instructions by prefetched but currently useless instructions. The amount of damage depends on the number of prefetched instructions. The question is how to statically limit this damage so that a serialization of cache and pipeline analysis does not lose too much precision.
    We will now discuss this question in the context of the SuperSPARC processor. As described above, the SuperSPARC predicts conditional branches as taken and fetches instructions in both directions into its prefetch queues. The maximal damage done to the (concrete) instruction cache depends on whether the branch indeed is taken and whether the prefetched instructions are in the cache or not.

[$$\begin{array}{l|l|l} \hline \hbox{Max. damage} & \hbox{cache hit} & \hbox{cache miss} \\ \hline \hbox{branch taken} & \hbox{2 lines} & \hbox{3 lines (burst mode)} \\ \hbox{branch not taken} & \hbox{1 line} & \hbox{2 lines (burst mode)} \\ \hline \end{array}$$]


    The damage done to the abstract instruction cache should reflect the respective worst case damage, i.e., three lines on the branch-taken edge and two lines on the fall-through edge. It consists in the removal of this maximal number of lines from the cache without safe information about lines moving in. It is different for cache hits and cache misses. A cache hit for a prefetched line leads to an increase in the ages of younger lines in the same set. A cache miss leads to the aging of all lines in the set and removal of the oldest in case the set is full.
    An integrated cache and pipeline analysis may, in some cases, have enough information about the contents of the pipeline and the prefetch queue to determine sharper bounds on the damage. Hence, the difference in precision between a serial and an integrated implementation is bounded by the damage as described above.
    Superscalarity has no additional effect on the cache behavior because the instruction-cache effects are caused by instruction prefetching and not by the dynamic grouping of instructions.
    The target address is static for all but the JMPL and RETT instructions. These use register-indirect target addresses. The target prefetch queue cannot be used for them, since the addresses are computed late in the pipeline.
    There are no dynamic effects on the data cache since all address computations happen in the integer pipeline, which has the effect of serializing memory accesses.
    
B.  Analysis Architecture

    Cache analysis is performed first. Its results are fed into the pipeline analysis. Predicted and potential cache misses are considered as causing pipeline stalls. However, cache miss penalties and pipeline delays due to long-running instructions may also overlap and thus compensate each other. Again, abstract interpretation is used for pipeline analysis.
    Superscalar pipelines execute instructions not only in an overlapped fashion, but also concurrently. The concurrently executed instructions are dynamically selected by the processor. The instruction-grouping decisions taken by the superscalar processor depend on the availability of instructions, control flow changes, data dependences, and resource conflicts. The concrete pipeline semantics formalizes conditions for pipeline stalls and models the selection of instructions for concurrent execution.
    While the cache state usually contributes a considerable part to the size of the execution state of a program, the pipeline state contributes relatively little. Therefore, sets of concrete pipeline states are taken as elements of the abstract domain for the pipeline analysis.
    The abstract pipeline update function reflects what happens when a new instruction enters the pipeline. It takes into account the current set of pipeline states, in particular the resource occupations, the state of some special resources, e.g., the prefetch queue, the grouping of instructions for concurrent execution, and the classification of memory references as cache hits or misses.
    At control flow merge points, the abstract pipeline states of the joining paths are combined by set union.
    The output of the analyzer is a mapping, called cycles, from instruction/context pairs to pairs of integers representing clock cycles

[$$ {\hbox{ cycles}}{:}\ {\rm IC} \rightarrow {\rm I}\!{\rm N} \times {\rm I}\!{\rm N}. \eqno{\hbox{(2)}} $$]

The first element of a clock-cycle pair is the number of cycles needed by the instruction to enter the pipeline. The second element is either the number of cycles that are needed to flush the pipeline for exit instructions or zero for all other instructions.

V.  ANALYSIS OF MORE COMPLEX PIPELINES

    The analysis of relatively simple pipelines like the one of the SuperSPARC as presented in the previous section can be done by separate cache and pipeline behavior analyses, resulting in a simpler, modular design of the WCET tool and in reduced space and time consumption. For complex pipelines, which use a combination of advanced features to enhance performance, this is no longer possible. The interaction between several architectural features, e.g., branch prediction, other types of speculative execution, superscalarity, out-of-order execution, and (unified) caches, leads to imprecise results if these features are treated in isolation, because the approximations to be made to stay on the correct (safe) side are so conservative that the obtained results are useless in practice.
    Example 2 (Instruction Prefetching and Cache Accesses to a Unified Instruction/Data Cache):    Instruction prefetching alters the cache contents, which again alters the timing of data accesses, if data elements in the cache are replaced by instruction prefetches. Since the amount of prefetching depends on the pipeline state in the presence of a sufficiently large prefetch queue, the amount of this interference cannot be determined precisely if cache analysis is performed before pipeline analysis.

    Example 3 (Branch Prediction):    If the CPU predicts branches early in the fetch stages and redirects fetching before instruction dispatch (as is the case in most modern processors), fetching can be redirected through several levels of branches, causing several areas of instruction memory to be accessed and the cache contents to be altered. Since the amount of change to the instruction cache caused by this depends on the state of the prefetch queue and, thus, on the pipeline state, it can only be crudely approximated by a separate cache analysis.
    To assess the effect of a separation of cache and pipeline analyses, one can look at the effects that the necessary approximations of the behavior of other processor components for the cache analysis have. For the ColdFire 5307, first the possible effects of branch prediction have to be estimated. The number of instructions along the predicted target of a branch can be bounded to be between one and 14 instructions, since prediction is done very early in the fetch pipeline and an instruction buffer of eight entries can be filled with prefetched instructions, plus up to four instructions in the instruction assembly stage (IED) and two instructions in the fetch stages. Since prefetching can be performed across multiple branches, an additional data-flow analysis on the control flow graph has to be performed to collect all possible instruction fetches at each branch. The information obtained in this way contains for each program point a sequence of guaranteed fetches and a sequence of possible fetches. The effect of the latter is especially disastrous for the cache analysis. Since it cannot be guaranteed that these instructions are fetched, they cannot be guaranteed to be in the cache. However, since they may be fetched, they may replace other cache lines from a cache set. The must analysis needs to combine both possibilities, emptying all cache sets that possibly fetched lines map to.
    Performing a cache analysis with the results of these possibly fetched instructions as described above and comparing the result to that of a cache analysis where no branch prediction is taken into account gives an indication of the damage inflicted by the branch prediction approximation needed for a separate cache analysis. Fig. 5 shows the results based on the same measure as that in Fig. 4 for a set of avionics benchmarks. The programs in this benchmark set have around 40 kB of executable code.

Fig. 5. Precision loss due to separate cache analysis.


    Note, however, that this does not include the effects of data accesses to the unified ColdFire cache. Since the ordering of instruction fetches and data fetches depends on the pipeline state, it is not known in advance; data fetches may replace instruction fetches in the cache and vice versa. Again, we would not have any knowledge of the contents of the sets that data accesses go to. Fig. 5 shows for 12 tasks the number of lines known to be in the cache per instruction context. The PURE data is for a cache analysis without taking branch prediction into account. The APPROX data is for the approximation explained above. The factor between the two data sets varies from 1.3 to 1.484. This is comparable to the precision loss between PLRU and LRU caches.
    The approximations that have to be made in these cases, i.e., the upper bounds on the damage, are normally very unrealistic in that they do not occur in practice. Nonetheless, they are needed to stay on the safe side. Other troublesome features include pessimism introduced by approximating branch mispredictions, effects of out-of-order execution, and bus contention between fetch and load/store units.
    In order to obtain sufficiently precise results, one must perform an integrated cache and pipeline analysis, which analyzes all relevant features in combination. Abstract cache states are therefore incorporated into the abstract pipeline states.

    
A.  Pipeline Modeling

    Any analysis of pipeline behavior should be based on an appropriate abstraction of a concrete pipeline model. This abstraction is, in our case, obtained in a sequence of steps. The concrete pipeline model describes the cycle-wise evolution of the pipeline state during instruction execution. Of course, this evolution depends on the state of memory. The pipeline can be modeled as a huge finite-state machine, making transitions between states on every clock cycle. The transitions between states are deterministic up to the dependence on the memory state. They are very complex, since they encode features like branch prediction.
    Every state is structured as a collection of components. A component can be a register or the reservation station of a processor functional unit, etc.
    Given the complexity of modern processors, it seems impossible and, in fact, unnecessary to explicitly give their full definition as finite-state machines, because only few components of the states are relevant for timing. Therefore, the concrete models are designed with the restrictions of an abstract model (to be used in the analysis) in mind. No component is precisely represented in the concrete model if it need not be represented in the abstract model. Hence, in a first step, instead of the full concrete pipeline model, a reduced concrete pipeline model is developed. In order to keep this reduced concrete model deterministic, it suffices to treat the eliminated components as “oracles” with deterministic but unknown values. One example for this is register contents: Since no knowledge of the values of individual registers will be available in the analysis, there is no need to model registers in the concrete model. Whenever the evolution of the concrete pipeline depends on the state of such an “opaque” component, the oracle is asked for the correct decision to take. In the evolution of the abstract pipeline, the oracle is replaced by nondeterminism: The analysis has to consider every possibility at such decision points.
    To obtain the components of pipeline states and the transitions between them is a complex task since the interactions between different components of a state can be subtle and complicated (not to speak of the conditions on the transition rules). Complexity is reduced by a structuring step. The components of a state are partitioned into (disjoint) units (cf. Fig. 6). Not surprisingly, modeling can be easy if these units correspond to processor entities, e.g., fetcher, dispatcher, ALU, etc. Conceptually, units are just containers for a number of components, where the components are closely related to one another, e.g., the instruction queue and the prediction tracking registers. Interactions between elements in different units are communicated by signals. Signals represent abstract events, like “fetch an instruction” or “flush the pipeline.” They depend only on the components in the sending units and their input signals. Signals usually influence the evolution of the receiving units. To model events that take effect only in the next cycle, some signals are delayed. This means that they are received in the cycle following the current one. Delayed signals are thus a combination of a logical event and cross cycle hardware latches. The other signals are received instantaneously in the same cycle. A transition is performed by first applying evolution rules to each unit, which depend only on the components of that unit and its input signals. These rules can change the state of components of the unit and send out signals to other units. When all units have been updated this way, one evolution cycle has been completed.

Fig. 6. Partitioning of a state.


    An evaluation order for units makes this evolution deterministic. Units receiving instantaneous signals are only updated after all corresponding sending units have been updated. A model is only feasible if there is no cycle consisting only of instantaneous signals. This type of modeling bears many similarities to the approaches of some HDLs such as Verilog and VHDL; there, components are encapsulated and communicate only via signals, too.
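    Finding such an evaluation order is a topological sort of the units with respect to their instantaneous signals; delayed signals impose no constraint because they are received in the next cycle. The sketch below uses Python's standard graphlib module; the function name and the unit names in the usage note are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(units, instantaneous_signals):
    """Order the units so that every sender is updated before its receivers.

    units: iterable of unit names
    instantaneous_signals: iterable of (sender, receiver) pairs
    Raises ValueError if the instantaneous signals form a cycle."""
    predecessors = {unit: set() for unit in units}
    for sender, receiver in instantaneous_signals:
        predecessors.setdefault(sender, set())
        predecessors.setdefault(receiver, set()).add(sender)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as err:
        raise ValueError("infeasible model: cycle of instantaneous signals") from err
```

    For three hypothetical units with instantaneous signals from "fetch" to "decode" and from "decode" to "execute", the function returns an order in which "fetch" precedes "decode" and "decode" precedes "execute"; a pair of units signaling each other instantaneously is rejected.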
    Starting from this (structured reduced) concrete model of pipeline evolution, the collecting semantics gathers for each program point the set of pipeline states that may occur during execution of the corresponding instruction at that point. This collecting semantics is abstracted to an appropriate abstract domain, mapping each concrete state to one abstract state. This mapping of states is performed component-wise, i.e., a component of a concrete state is mapped to an abstract component in the abstract state. The opaque components of the pipeline states are mapped to “[$\top$],” i.e., there is no knowledge about them in the abstract world. The caches, which are components in the concrete state, are mapped to their abstractions described in Section III. The remaining components are left as in the concrete domain.

Fig. 7. Pipeline of the MCF 5307.


    The cycle-wise evolution of the abstract states is applied one-to-one to the states in the current set at a program point. As noted earlier, in cases where the evolution depends on an opaque value, several successor states are obtained for one state. This evolution is repeated as long as the corresponding instruction has not left the pipeline. The number of evolutions at one program point is used to compute an upper bound on the WCET for that instruction. This way, WCETs for basic blocks are derived. These are the input for the path analysis, which determines the WCET for the entire program.
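    The iteration over sets of abstract states can be sketched generically. The toy pipeline used in the usage note (a fetch with unknown cache outcome followed by an execution phase, with a hypothetical miss penalty of three cycles) is purely illustrative and not a model of any real processor; termination is assumed.

```python
def evolve(states, step, finished):
    """Evolve a set of abstract states cycle by cycle.

    step(state)     -> set of successor states; more than one successor
                       results where the evolution depends on an opaque value
    finished(state) -> True once the instruction has left the pipeline
    Returns (upper bound on the number of cycles, set of final states)."""
    cycles = 0
    active = set(states)
    final = set()
    while active:
        cycles += 1
        successors = set()
        for state in active:
            successors |= step(state)
        final |= {s for s in successors if finished(s)}
        active = {s for s in successors if not finished(s)}
    return cycles, final


MISS_PENALTY = 3  # hypothetical value for the toy model


def toy_step(state):
    """Toy model: states are (phase, remaining wait cycles)."""
    phase, wait = state
    if phase == "fetch":
        # cache outcome unknown: consider both the hit and the miss successor
        return {("exec", 0), ("exec", MISS_PENALTY)}
    if wait > 0:
        return {("exec", wait - 1)}
    return {("done", 0)}


def toy_finished(state):
    return state[0] == "done"
```

    Here `evolve({("fetch", 0)}, toy_step, toy_finished)` follows both the hit and the miss case and reports five cycles, the bound imposed by the miss case.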
    
B.  An Example: The ColdFire 5307

    We applied this technique to a popular processor in the embedded world, the Motorola ColdFire 5307. The ColdFire family is the successor of the well-known M68k line of processors. It implements most of the M68k instructions, limiting instruction lengths to 2, 4, or 6 bytes to speed up instruction handling. Fig. 7 shows the schematics of the MCF 5307 pipeline. It consists of two separate pipelines, the instruction fetch pipeline (IFP) and the operand execution pipeline (OEP), coupled through an eight-entry instruction buffer. The IFP fetches instructions and performs branch prediction in its four consecutive stages, while the OEP takes completely decoded instructions from the instruction buffer and executes them in up to two iterations through its two stages. The memory access interface is pipelined: fetches are requested in the IC1 stage and instructions received in IC2, data accesses are issued in the AGEX stage, and read data is returned in the DSOC stage.
    The MCF 5307 has a hierarchy of memory busses attached to it. Directly connected to the instruction-fetch and data-access stages is the K-Bus, which runs at the same speed as the processor core. At this K-Bus, the unified instruction/data cache and a 4-kB internal SRAM are connected. Accesses that are uncached or miss in the cache are forwarded through several controllers to the external bus, which runs at one third the speed of the processor core.
    The ColdFire features a simple form of branch prediction in the IED stage of the fetch pipeline: branches going backward are predicted taken and fetching in the IAG/IC1/IC2 stages is redirected to the target address of that branch.
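    The static prediction rule can be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the treatment of forward branches (predicted not taken, so that fetching continues sequentially) is inferred from the rule for backward branches rather than stated explicitly above.

```python
def predicted_taken(branch_address: int, target_address: int) -> bool:
    """A branch is predicted taken exactly when it goes backward."""
    return target_address < branch_address


def next_fetch_address(
    branch_address: int, target_address: int, instruction_length: int
) -> int:
    """Address at which fetching continues after the branch has been seen.

    Assumption: a branch that is not predicted taken falls through to the
    instruction that follows it.
    """
    if predicted_taken(branch_address, target_address):
        return target_address
    return branch_address + instruction_length
```

    Because the rule depends only on the two addresses, the prediction itself is known statically; what is not known in isolation is how far the fetch pipeline has run ahead when the redirection takes effect.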

Fig. 8. Model for the MCF 5307.



Fig. 9. Contents of the IED unit.


    Since the amount of prefetching depends on the state of the OEP and the number of instructions in the instruction buffer and can go through several levels of branches, the exact instruction addresses accessed cannot be determined precisely by a separate analysis. Also, since the cache is unified, instruction fetches and data accesses influence the cache contents and behavior in a way that depends on the contents of the pipeline.
    Using the documentation available from Motorola, the pipeline model shown in Fig. 8 has been constructed. The arrows represent signals; signals with a name in italics are delayed signals. In this model, signals are used that carry additional data arguments with them. Fig. 9 shows the contents of the IED unit in more detail.
    This unit assembles complete instructions from the long words fetched by the IC1/IC2 stages. It contains a small internal buffer, holding up to 8 bytes from the previous stage. Only the length of this buffer is represented in the model and named [${\bf B}$] in Fig. 9. The assembled instructions are inserted into the instruction buffer as soon as a place is free. A queue [${\bf Q}$] holds waiting instructions.
    The unit is also responsible for redirecting fetching after branch prediction. It does so by issuing the [$\hbox{\bf{\sf set}}(a)$] signal to the address generation unit for the new target address [$a$]. The evolution of this unit is described as a sequence of pseudo-code instructions (“if IB is empty then emit the [$\hbox{\sf instr}$] signal,” etc.). For other units, evolution can be coded in the style of transition rules of finite-state machines.
    The evolution sequence for the model goes bottom up through Fig. 8. This way, all instantaneous signals are generated before they are used.
    For example, a complete evolution cycle may happen in the following way, assuming that we received a pipeline flush in an earlier cycle after a branch misprediction and the EX unit is waiting for the instructions from the correct branch target.
  1. The SST unit (holding a write delay timer) is not active, so it does nothing.
  2. The EX unit is empty at the moment, so it requests the next instruction by issuing the [$\hbox{\bf{\sf next}}$] signal.
  3. The IB is also empty, so it does nothing (especially, it does not provide the next instruction to EX).
  4. The IED unit is in the state depicted in Fig. 9. It receives the next fetched longword via a [$\hbox{\bf{\sf put(0x404)}}$] signal and emits an assembled instruction (at address 0x400) via a delayed [$\hbox{\bf{\sf instr}}$] signal. After that, the state of the unit is [$\hbox{\tt Q:[0x406] B:2}$].
  5. The IC2 unit did not receive a [$\hbox{\bf{\sf code(0x408)}}$] signal (for which it is waiting) and, thus, cannot accept anything from IC1; thus, it emits the [$\hbox{\bf{\sf wait}}$] signal.
  6. The IC1 unit receives this [$\hbox{\bf{\sf wait}}$] signal and emits a [$\hbox{\sf wait}$] signal to IAG.
  7. The Bus Unit waits for the long word at address 0x408 and decrements its cycle wait counter. The counter reaches zero and, thus, the unit emits a delayed [$\hbox{\bf{\sf code(0x408)}}$] signal to make the data available for the next cycle.
  8. The IAG receives the [$\hbox{\bf{\sf wait}}$] signal and does nothing.
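
    The distinction between instantaneous and delayed signals in the steps above can be sketched for the fetch-side units only (IC2, IC1, Bus Unit, IAG). This is an illustration under assumptions, not the model of Fig. 8: the dictionary-based state, the tagging of each wait signal with its receiver, the `iag_stalled` flag, and the behavior shown for the second cycle are all hypothetical. Units are evaluated in a fixed order; an instantaneous signal is visible to units evaluated later in the same cycle, whereas a delayed signal becomes visible only in the next cycle.

```python
def ic2(state, now, arriving, delayed):
    """Emit a wait to IC1 unless the awaited code signal has arrived."""
    if state["ic2_waits_for"] is None:
        return
    if ("code", state["ic2_waits_for"]) in arriving:
        state["ic2_waits_for"] = None
    else:
        now.add(("wait", "IC1"))


def ic1(state, now, arriving, delayed):
    """Forward a wait received from IC2 to the IAG."""
    if ("wait", "IC1") in now:
        now.add(("wait", "IAG"))


def bus(state, now, arriving, delayed):
    """Count down; when the counter reaches zero, deliver data next cycle."""
    if state["bus_wait"] > 0:
        state["bus_wait"] -= 1
        if state["bus_wait"] == 0:
            delayed.add(("code", state["bus_address"]))


def iag(state, now, arriving, delayed):
    """Do nothing while a wait signal is present."""
    state["iag_stalled"] = ("wait", "IAG") in now


UNITS = [ic2, ic1, bus, iag]  # fixed evaluation order within one cycle


def run_cycle(units, state, arriving):
    """Run one cycle; return the instantaneous and the newly delayed signals."""
    now, delayed = set(), set()
    for unit in units:
        unit(state, now, frozenset(arriving), delayed)
    return now, delayed


state = {
    "ic2_waits_for": 0x408,
    "bus_wait": 1,
    "bus_address": 0x408,
    "iag_stalled": False,
}
now1, delayed1 = run_cycle(UNITS, state, set())     # the cycle described above
now2, delayed2 = run_cycle(UNITS, state, delayed1)  # hypothetical next cycle
```

    In the first cycle the wait signals propagate from IC2 through IC1 to the IAG, while the Bus Unit's `code` signal is held back. In the second cycle that signal arrives at IC2, so no wait is emitted and the IAG is no longer stalled.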

    
C.  A Complex Example: The PowerPC 755

    A more demanding example is the PowerPC 755. This processor is a 32-bit implementation of the PowerPC architecture and features superscalarity, speculative execution, branch prediction, and out-of-order execution. It has separate 32-kB instruction and data caches, two integer units, a pipelined floating point unit, a pipelined load/store unit, and a system register unit (cf. Fig. 10).

Fig. 10. Pipeline of the PPC 755.


    Dispatching and retirement are done in-order to conform to the PowerPC architecture specification. Execution, however, is done out of order, and the PPC 755 uses a completion queue (CQ) to keep track of up to six instructions executing in parallel. Instruction fetching is tightly coupled with branch prediction in the FU and the BPU. Branches are resolved as soon as they are fetched and entered into the instruction queue (IQ). Known branches are folded away at this stage; unknown branches are predicted and fetching continues at the predicted target of the branch. Such speculatively fetched instructions can be dispatched to the execution units. On misprediction, these instructions must be flushed from the execution units and the CQ. A second level of branch speculation may be fetched into the IQ.
    Modeling these features is a demanding task. Fig. 11 shows our model for this processor (delayed signals go upward in this figure). The model is able to keep track of the state of the internal units, the IQ/CQ, and the attached caches and bus activities. Branch prediction is modeled in the evolution rules of the FBPU unit, which integrates the real FU and BPU units. Here, information is kept about the level of speculation, the branches that introduced the speculation and the instructions that resolve it. The contents of the IQ are modeled in detail, as well as side conditions on instruction dispatch, e.g., that some instructions must not be dispatched before their predecessors are retired to guarantee single-threaded access to some system registers. The integer units IU1/IU2 contain the addresses of the instructions currently executing or being in the reservation stage of that unit. The completion unit (CU), on the other hand, does not have an inner state. Its evolution rules make use of the contents of the completion queue and ([$\hbox{\bf{\sf done}}$]) signals sent out by the execution units upon completion of an instruction.

Fig. 11. Model for the PPC 755.


    By structuring the complex pipeline states into units communicating via signals only, it was possible to model this complicated processor in a reasonable amount of time.

VI.  OBSERVATIONS

    In the course of the DAEDALUS project, we have implemented the models for the ColdFire 5307 and the PowerPC 755 and performed extensive analyses on several benchmarks in cooperation with Airbus Industries. The analyzers were also installed at Airbus Toulouse and evaluated by them on a large benchmark of avionics software, 12 tasks of altogether 1.2 million instructions. Reference [20] reports on the results of this evaluation. The WCET predictions obtained by our tool were between 6.2% and 13.5% lower than those obtained by Airbus with their legacy method.
    Our experience taught us that the need to perform an integrated pipeline analysis increases the complexity of the analysis task in two ways.

  • The design of the analysis is more complicated compared to the design of several separate analyses.
  • The running time and space consumption of the analysis are higher than for several simpler separate analyses run in sequence.

    Nonetheless, the integrated approach is the only one that bears a chance of giving useful results: e.g., when one separates the cache analysis from the pipeline analysis for the ColdFire 5307, one has to make the assumption that up to eight instructions after every branch are prefetched, touching one or two cache lines. In conjunction with the fact that the ColdFire cache is modeled as a direct mapped cache, this throws away any information for those cache lines. In addition, for every possible clash of data accesses with such prefetched lines, nothing is known about the data access at all. For larger programs this effectively means that no information about the cache can be obtained by a separate analysis.
    Our tests on real-life benchmarks have shown that the subtle interactions between processor features are really noticeable in the analysis results and that they can be accurately predicted.
    The main cause of the time and space complexity of the analyses is the need to analyze all possible successor evolutions if an evolution depends on “opaque” components, i.e., in the case of nondeterminism. In theory, one could discard all successors except for the one that represents the worst case, i.e., leads to (an upper bound for) the WCET. In practice, we discovered that deciding which of the successors represents this worst case is not easy. Successors that represent a local worst case (e.g., a cache miss versus a cache hit) may not lead to the WCET globally. This is due to the subtle influences and interdependencies of processor components.
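    A toy example makes the difference between a local and a global worst case concrete. The successor graph and all cycle counts below are invented purely for illustration and do not come from any measured processor. The function `greedy_bound` keeps only the locally most expensive successor, whereas `exhaustive_bound` follows every successor.

```python
from typing import Dict, List, Tuple

# state -> list of (successor state, cycles spent on that transition)
Graph = Dict[str, List[Tuple[str, int]]]

TOY: Graph = {
    "start": [("after_miss", 10), ("after_hit", 1)],  # miss is locally worse
    "after_miss": [("end", 2)],
    "after_hit": [("end", 15)],                       # but the hit path is longer
    "end": [],
}


def exhaustive_bound(graph: Graph, state: str) -> int:
    """Maximum total cost over all successor sequences (acyclic graph)."""
    return max(
        (cost + exhaustive_bound(graph, nxt) for nxt, cost in graph[state]),
        default=0,
    )


def greedy_bound(graph: Graph, state: str) -> int:
    """Total cost when only the locally costliest successor is followed."""
    total = 0
    while graph[state]:
        state, cost = max(graph[state], key=lambda edge: edge[1])
        total += cost
    return total


safe = exhaustive_bound(TOY, "start")
unsafe = greedy_bound(TOY, "start")
```

    Here the greedy choice yields 12 cycles, while the true maximum of 16 cycles lies on the path that starts with the cache hit. A bound obtained by discarding the hit successor would therefore not be a safe upper bound.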

VII.  ARCHITECTURAL ADVICE: PREDICTABLE PERFORMANCE

    The preceding sections have indicated that it is not a particularly good idea to optimize for average performance if one aims at processors with high predictable performance. The loss in precision can be magnified if at the same time a few gates are saved in the cache architecture at the wrong places. Of course, the claims we can make are not absolute claims of the kind, “there is no method by which sufficient precision about the timing behavior of programs can be obtained if a pseudo-round-robin cache replacement strategy is used.” Our claims are dependent on the use of static program analysis and our ways of modeling processor components. However, we do not see alternative methods that are both efficient enough to be used in practice and deliver sufficient precision.
    In the following, we list a number of processor properties whose combination will allow high precision in statements about the timing behavior and a modular design of the timing analyzer.

  • Separate data and instruction caches: separate caches eliminate the interdependencies of instruction prefetching and data accesses. This way, the precision loss of separate cache and pipeline analyses can be reduced. In an integrated analysis, worst case assumptions can be made more easily since these dependencies need not be considered.
  • Cache replacement strategies: these should be immune against “chaos.” This means that when cache contents are not known at one point, subsequent accesses can recover knowledge about the new cache contents. The ColdFire cache with its global replacement counter does not allow knowledge about the counter to be recovered if one has no information on its value at some point. LRU replacement strategies recover from “chaos”: after some cache updates, the ages of the new elements in the cache are known.
    Naturally, the update strategy should be (locally) deterministic; otherwise, little can be statically said about cache contents.
    The cache architecture should allow both must and may analyses for the caches. Neither the ColdFire nor the PowerPC 755 cache makes this possible, although may information, i.e., information about what is guaranteed not to be in the cache, is valuable in restricting the nondeterminism in the pipeline analysis.
  • Branch prediction if any should be static: the modeling of dynamic branch prediction would lead to an even more complex integrated analysis. A static, separate, and precise analysis of dynamic branch prediction is difficult since it also depends on the pipeline state.
  • Out-of-order execution should be limited: with out-of-order execution, one has to consider the effects of all possible interleavings of instructions. Clearly, this is difficult and imprecise to do statically in a separate analysis since there are many possible interleavings, whereas a worst case interleaving is not likely to occur during execution but must be assumed to ensure a correct result. In an integrated approach, all interleavings are considered, but most of them will not be worst cases. The granularity of the pipeline model required for this increases both design complexity and analysis complexity.
  • Shortcuts should be avoided: in general, shortcuts in the hardware design, e.g., special cases to accelerate some operation if certain (dynamic) conditions hold, should not be used. While they definitely improve average performance, they yield little gain in running typical real-time tasks; nonetheless, they must be modeled in quite some detail in the pipeline analysis or they give rise to increased nondeterminism.
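
    The recovery from “chaos” claimed for LRU in the list above can be sketched with a must-style update for a single cache set. This is a simplified illustration in the spirit of the must analysis of Section III, not our implementation: the function name `lru_must_update`, the dictionary representation, and the block names are assumptions. The abstract state maps each block known to be cached to an upper bound on its age; the empty dictionary stands for complete lack of knowledge.

```python
from typing import Dict


def lru_must_update(must: Dict[str, int], block: str, ways: int) -> Dict[str, int]:
    """Abstract LRU update for one set of the given associativity.

    The accessed block gets age 0. Blocks that were younger than it age by
    one; a block whose age bound reaches `ways` is no longer guaranteed to
    be cached and is dropped.
    """
    # A block without information is treated as older than every known block.
    old_age = must.get(block, ways)
    updated = {}
    for other, age in must.items():
        if other == block:
            continue
        new_age = age + 1 if age < old_age else age
        if new_age < ways:
            updated[other] = new_age
    updated[block] = 0
    return updated


state: Dict[str, int] = {}  # "chaos": nothing is known about the set
for accessed in ["a", "b", "c", "d"]:
    state = lru_must_update(state, accessed, ways=4)
after_four = dict(state)

state = lru_must_update(state, "e", ways=4)  # evicts the oldest block
state = lru_must_update(state, "c", ways=4)  # rejuvenates a known block
```

    Starting from no knowledge at all, four accesses to distinct blocks of a four-way set determine the complete contents of the set together with all ages. No comparable recovery is possible for a replacement decision that depends on a counter whose value is unknown.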

VIII.  FUTURE WORK AND OPEN PROBLEMS

    There are three major areas for our future research on pipeline analysis.

  • Modeling processors with other features or new combinations of features. When comparing the analysis results on more processors, the effects of individual features on analysis precision and/or complexity can be evaluated more precisely.
  • At the moment, the implementation of our models is done by handwritten code. To reduce the realization phase of a pipeline analysis, we are developing a framework to generate these implementations from concise specifications of the models.
  • The models themselves need to be specified more formally. With a formal model, analyses on the model itself are possible. With the help of such analyses, the nondeterminism in the pipeline analyses can be reduced by limiting the number of successor evolutions that have to be considered for the worst case by identifying one or a few successors that may lead to the global worst case.

IX.  CONCLUSION

    Modern processors are optimized for average case performance. The features that contribute to this average case performance, like caches, branch prediction, speculation, or out-of-order execution, make it difficult to determine the worst case performance. Modular, separate analyses of the behavior of programs on these features cannot be done with sufficient precision due to the interdependencies between different processor components.
    We presented a methodology to analyze cache and pipeline behavior by abstract interpretation and pipeline modeling. We implemented this methodology for two advanced processors: the ColdFire 5307 and the PowerPC 755. As a consequence of the lessons learned in this process and the complexity of the resulting analyses, we proposed a few guidelines for the design of processors to be used in hard real-time systems.


ACKNOWLEDGMENT

    Many colleagues have collaborated in the design and implementation of the described WCET tools. Thanks go to C. Ferdinand, F. Martin, M. Schmidt, J. Schneider, M. Sicks, and H. Theiling.

REFERENCES

Reinhold Heckmann received the Dr.rer.nat. degree in computer science from Saarland University, Saarbruecken, Germany, in 1991.
    After working as a Lecturing Assistant at the University of the Saarland and a Research Fellow at Imperial College, London, U.K., he is now a Senior Researcher at AbsInt Angewandte Informatik GmbH, Saarbruecken, Germany. His major research areas include programming languages and compiler construction, document processing, semantics of programming languages and domain theory, exact real arithmetic, and static analysis of real-time systems, in particular cache and pipeline analysis.
Marc Langenbach received the Dipl. degree in computer science from Saarland University, Saarbruecken, Germany, in 1997. He is currently working toward the Ph.D. degree in the Compiler Research Group at the Computer Science Department, Saarland University.
    His research interests include static worst-case time prediction for modern hardware, embedded systems, and compiler construction.
Stephan Thesing received the Dipl. degree in computer science from the University of Bielefeld, Bielefeld, Germany, in 1996. He is currently working toward the Ph.D. degree at the Compiler Research Group at the Computer Science Department, Saarland University, Saarbruecken, Germany. His research interests include static worst-case time prediction for modern hardware, embedded systems, and program analysis.
Reinhard Wilhelm received the Dr.rer.nat. degree from the Technical University of Munich, Munich, Germany, in 1977.
    He has been a Professor of Computer Science at the University of the Saarland, Saarbruecken, Germany, since 1978, and Scientific Director of the International Conference and Research Center for Computer Science, Schloss Dagstuhl, Germany, since 1990. His major research areas include compiler construction and compiler generation, in particular attribute grammars, tree pattern matching, tree parsing, code selection, instruction scheduling, static analysis, parallel languages, and their implementation, run-time guarantees for real-time programs. He is coauthor of several textbooks on languages and compilation and on document processing.

  1. Things would work equally well the other way around due to the duality principle of lattice theory.
  2. If they do not depend on unknown input values in a nontrivial way.
  3. Besides the contents of address registers obtained by value analysis, cf. Section III-G.