PROCEEDINGS OF THE IEEE, VOL. 91, NO. 7, JULY 2003
The Influence of Processor Architecture on the Design and the Results of WCET Tools
REINHOLD HECKMANN, MARC LANGENBACH, STEPHAN THESING, AND REINHARD WILHELM
Invited Paper
The architecture of tools for the determination of worst case
execution times (WCETs) as well as the precision of the results of WCET analyses
strongly depend on the architecture of the employed processor. The cache replacement
strategy influences the results of cache behavior prediction; out-of-order
execution and control speculation introduce interferences between processor
components, e.g., caches, pipelines, and branch prediction units. These interferences
forbid modular designs of WCET tools, which would execute the subtasks of
WCET analysis consecutively. Instead, complex integrated designs are needed,
resulting in high demand for memory space and analysis time. We have implemented
WCET tools for a series of increasingly complex processors: SuperSPARC, Motorola
ColdFire 5307, and Motorola PowerPC 755. In this paper, we describe the designs
of these tools, report our results and the lessons learned, and give some
advice as to the predictability of processor architectures.
Keywords: Predictability, processor model, real-time, static analysis, worst case execution time.
Manuscript received August 14, 2002; revised December 17, 2002.
This work was supported by the European IST-project Daedalus.
R. Heckmann is with the AbsInt Angewandte Informatik GmbH, D-66123 Saarbruecken,
Germany (e-mail: heckmann@absint.com).
M. Langenbach, S. Thesing, and R. Wilhelm are with the Fachrichtung Informatik,
Saarland University, D-66123 Saarbruecken, Germany (e-mail: mlangen@cs.uni-sb.de;
thesing@cs.uni-sb.de; wilhelm@cs.uni-sb.de).
Digital Object Identifier:
10.1109/JPROC.2003.814618
0018-9219/03$17.00 © 2003 IEEE
I. INTRODUCTION
Hard real-time systems are subject to stringent timing constraints that
are dictated by the surrounding physical environment. A schedulability analysis
has to be performed in order to guarantee that these timing constraints will
be met (“timing validation”). All existing techniques for schedulability
analysis require the knowledge of the worst case execution time (WCET) of
each task in the system. Since this is not computable in general, estimates
of the WCET have to be calculated. These estimates have to be safe, i.e.,
they must never underestimate the real execution time. Furthermore, they should
be tight, i.e., the overestimate should be as small as possible.
In modern microprocessor architectures, caches, pipelines, and control
speculation are key features for improving performance. Caches are used to
bridge the gap between processor speed and the access time of main memory.
Pipelines enable acceleration by overlapping the executions of different instructions.
Control speculation is used to avoid pipeline stalls caused by conditional
jumps. The consequence is that the execution behavior of instructions cannot
be analyzed in isolation since it depends on the execution history. Processor
architectures are optimized for average-case performance and not for predictable
performance, as would be required for hard real-time systems. This paper deals
with the consequences of processor architectures for the design and the effectiveness
of WCET tools.
A. Using Abstract Interpretation for WCET Computation
The determination of the WCET of a program is composed of several tasks: classification
of memory references as cache misses or hits, usually called cache analysis; analysis of the behavior of the program on the processor
pipeline, the so-called pipeline analysis; prediction
of the results of control speculation and the determination of the worst case
execution path of the program, in the following called path analysis. All of these tasks are quite complex for modern microprocessors
and digital signal processors (DSPs). They must be executed on the machine-code
level, since the semantics of high-level languages does not refer to architectural
components such as caches, pipelines, or branch prediction units. Since data
cache analysis and pipeline analysis depend on the knowledge of effective
addresses—in general only known at run time—another static analysis
is needed to try to determine effective addresses statically, in our case
called value analysis.
The identification of these different phases of WCET determination allows the use of
different methods tailored to the subtasks. In our case, value analysis, cache
analysis, pipeline analysis, and branch behavior prediction are done by abstract interpretation, a semantics-based method for static
program analysis. Path analysis is done by integer linear programming. Both
precision of the results and efficiency of the WCET computation are acceptable
in practice, but depend, as will be shown, on the processor architecture.
1) Applying Abstract Interpretation: Abstract interpretation [1], [2] is
a well-established method of static program analysis with a host of available
theoretical results. Static program analysis is well suited for the approximative
establishment of safety properties of programs, i.e., the proof that “something
bad does not happen.” It is approximative in
the sense that it may not establish all safety properties that actually hold.
However, it is sound in the sense that all safety
properties it claims to hold do actually hold. Which are the bad things for
WCET that we hope to exclude by static program analysis? Of course, cache
misses, pipeline stalls, and mispredicted branches. Each excluded cache miss
allows us to exclude the costs of a cache miss penalty from the WCET, each
excluded pipeline stall eliminates the costs for a pipeline bubble from the
WCET, and each excluded branch misprediction precludes expensive damage to
the instruction cache and costs for cleanup.
A static program analysis is considered an abstraction of a standard semantics
of the programming language. A standard (operational) semantics of a language
is given by a concrete domain of data and a set of
functions describing how the statements of the language transform concrete
data. An abstract semantics then consists of a (simpler)
abstract domain and a set of abstract semantic functions, so-called transfer
functions, for the program statements computing over the abstract domain.
The designer of a program analysis faces the following design tasks.
- Defining the domain: The abstract domain is obtained from the concrete
domain by abstracting from all aspects except those that are the subject of the
analysis to be designed. An abstraction function
maps concrete domain elements to abstract domain elements.
Both domains usually are complete partially ordered sets of values. The
partial order on the abstract domain corresponds to precision, i.e., quality
of information. By agreement, elements higher up in the order are considered
to contain less information.1
The partial order determines the least upper bound operation, $\sqcup$, on the
lattice, which is used to combine information stemming from different sources,
e.g., from several possible control flows into one program point.
- Defining the transfer functions: The transfer functions describe
how the statements transform abstract data. They must be monotonic to guarantee
termination.
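As a minimal sketch of how these two design tasks fit together, the following Python fragment iterates transfer functions over a small control flow graph and combines incoming information with a join. The worklist solver, the name `analyze`, the toy graph, and the toy domain (sets of memory blocks certainly accessed so far, joined by intersection) are assumptions chosen for illustration; they are not taken from the tools described in this paper.

```python
from collections import deque

def analyze(succ, entry, entry_value, transfer, join):
    """Worklist iteration: propagate abstract values until nothing changes."""
    preds = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            preds[t].append(n)
    out = {}                      # a missing entry means "not reached yet"
    work = deque([entry])
    while work:
        n = work.popleft()
        incoming = [out[p] for p in preds[n] if p in out]
        if n == entry:
            incoming.append(entry_value)
        value = incoming[0]
        for v in incoming[1:]:
            value = join(value, v)          # combine several control flows
        new = transfer(n, value)            # effect of the node itself
        if out.get(n) != new:
            out[n] = new
            work.extend(succ[n])            # successors must be revisited
    return out

# Hypothetical diamond-shaped graph; each node accesses one memory block.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
blocks = {"a": 1, "b": 2, "c": 3, "d": 4}

result = analyze(
    succ, "a", frozenset(),
    transfer=lambda n, v: v | {blocks[n]},
    join=lambda x, y: x & y,   # keep only what holds on every incoming path
)
print(sorted(result["d"]))     # [1, 4]
```

Only blocks 1 and 4 are certain at node `d`, because blocks 2 and 3 are accessed on just one of the two incoming paths. The loop terminates here because the transfer function is monotonic and the domain is finite.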
Abstract interpretation has been profitably applied to cache analysis and
pipeline analysis. It is executed on the control flow graph of the program,
which can be constructed by analyzing the machine program [3].
B. The Architecture of WCET Tools
Both the architecture of WCET tools and the precision of the results of
WCET analyses strongly depend on the architecture of the employed processor.
The cache replacement strategy influences the obtainable precision of cache
behavior prediction. Instruction prefetching, out-of-order execution, and
control speculation introduce interferences between processor components,
e.g., caches, pipelines, prefetch queues, and branch prediction units. As
we will see, these interferences forbid modular designs of WCET tools, which
would execute WCET analysis in a sequence of subtasks. Let us consider what
it means to separate the analysis of component $A$ from the analysis of
component $B$, where the behavior of $A$ depends on that of $B$. In order to be
on the safe side, the “damage” done to the state of $A$ by activities of $B$
has to be taken into account. Since nothing is known to the $A$ analysis about
the state of $B$, an upper bound on the potential damage has to be
determined and used. This upper bound can be far away from any real damage
if the interference between the processor components is highly dynamic and
if the worst case damage occurs seldom.
Thus, to bound the loss of precision, complex processors need complex integrated
tool designs resulting in high demand for memory space and analysis time.
II. RELATED WORK
There exists a vast literature on WCET determination. We only list references
that treat complex processors with all features considered in combination,
not architectural features in isolation.
Healy et al. [4], [5] presented an approach
to predicting WCETs in the presence of caches and simple pipelines. In a first
step of the analysis a static cache simulator classifies instructions as cache
hits or misses. This information is used by a pipeline path analysis that
computes the execution time for a sequence of instructions. Loops are handled
in a bottom-up manner. Only the simple pipeline of a MicroSPARC is considered
and in [4] only direct-mapped
caches and simple pipelines are taken into account that can be described by
resource usage patterns of instructions. For their experimental results the
authors only consider a small direct-mapped cache with small test programs.
Li et al. suggest a solution using integer linear programming [6]. Both cache and pipeline
behavior prediction are formulated as a single linear program. The i960KB
is investigated, a 32-bit microprocessor with 512-byte direct mapped instruction
cache and a fairly simple pipeline. Only structural hazards need to be modeled,
thus keeping the complexity of the integer linear program moderate. Variable
execution times, branch prediction, and instruction prefetching are not considered
at all. Obtaining the ILP modeling for a more complex processor will be difficult.
Using this approach for superscalar pipelines does not seem very promising
considering the analysis times reported in the article. Nonetheless, the description
of the worst case path through the program via ILP is an elegant method and
can be efficient if the size of the ILP is kept small. This is the case in
our tool.
Lundqvist and Stenström present an integrated approach for obtaining
WCET bounds through simulation of the pipeline in [7], [8]. They extend a pipeline simulator to handle unknown values in inputs.
We share conceptual similarities with this approach in that we perform a cycle-wise
evolution of a pipeline (model). In contrast to our approach, Lundqvist and
Stenström use an integrated method in which value analysis for register/memory
contents and execution time computation are parts of the same simulation.
If the simulation cannot determine a branch condition exactly due to dependencies
on unknown (input) values, both branches have to be simulated. This method
does not guarantee termination of the analysis, but offers the advantage of
sometimes determining loop bounds and/or recursion bounds “for free.”2 However, we feel that this analysis is very costly due to the
huge amount of data that has to be kept for each branch followed. In contrast,
our method does not retain information like register or memory contents in
the pipeline analysis phase, contents that have already been determined in
the value analysis to predict conditional and computed branches, for example.
In [8], experiments
with a PowerPC-like architecture are conducted for small example programs
using an extended PSIM simulator with simple reservation tables for instructions.
All in all, it is not clear how well this method scales up to programs of
realistic size.
In contrast to Lundqvist and Stenström's integrated approach, Engblom
presents a WCET tool with a clear separation of all the analysis modules in [9]. The modules communicate
using interface data structures. One main component is a simulator that estimates
the execution time for a given sequence of instructions. These timing estimates
are composed to form the execution time of the entire program. The quality
of the obtained WCET is greatly influenced by the quality of the simulator
used. Cache behavior prediction is not incorporated in the tool as the addressed
targets do not have any caches. This eliminates the problem of cache and pipeline
interaction, which becomes more difficult with more complex pipelines, prefetching,
and branch prediction. The author comes to the conclusion that “out-of-order
processors are definitely too complex to model with current techniques.”
Colin and Puaut describe a framework for tree-based
WCET analysis in [10].
Instruction cache and pipeline behavior as well as branch prediction are taken
into account and are analyzed independent of one another, reducing the precision
of the obtained WCET estimate.
The analyses are based on two intermediate representations: the syntax
tree, and the control flow graph built from assembly output of the compiler.
As the program is not yet translated to object code, it is not clear which
machine instruction an assembly instruction is mapped to, and as the program
is not linked, information on instruction addresses is not available. The
syntax tree is used to compose the WCET from smaller parts. This is not appropriate,
as it disregards the execution context, leading to imprecise results.
III. CACHE ANALYSIS
A. Cache Memory: General Remarks
Caches are used to improve the access times of fast microprocessors to
relatively slow main memories. They are an upper part of the storage system
hierarchy and fit in between the register set and the main memory. Excluding
the register set, caches have the shortest access times of all levels of the
storage system. They can reduce the number of cycles a processor is waiting
for data by providing faster access to recently referenced regions of memory.
Caching is more or less used for all general purpose processors, and with
increasing application sizes it becomes more and more relevant and used for
high-performance microcontrollers and DSPs.
At any time, a cache memory duplicates a subset of main memory locations.
For the purpose of caching, the main memory is partitioned into memory blocks of size
$B$ bytes, numbered consecutively starting with 0. Usually, $B$ is a power of 2,
$B = 2^b$. Then, byte addresses can be easily translated into block numbers by
omitting the lowest $b$ bits. By an access to memory block $m$, we mean a read
or write access to a memory location belonging to block $m$.
When the processor wants to access a memory block, it first checks whether
the cache contains (a copy of) the block. If so (cache hit), the processor can quickly access the block in the cache. If not
(cache miss), the block is copied from main memory
into the cache, where it is stored for this reference and future ones. Clearly,
the handling of cache hits is much faster than that of cache misses since
the main memory is not involved.
A memory access can be the reading of an instruction (a prerequisite of
its execution) or the reading or writing of data during the execution of an
instruction. The processor may have one unified cache
that contains both instructions and data (e.g., ColdFire 5307), or two separated
caches, one for instructions (I-cache) and one for data (D-cache). PowerPC
750 and 755 processors contain separate caches, having the same size, structure,
and principal behavior.
B. $A$-Way Set-Associative Caches
There are three commonly used cache architectures: direct-mapped caches,
fully associative caches, and $A$-way set-associative caches (where $A$ is a
natural number).
An $A$-way set-associative cache consists of $S$ cache sets [11]. Each cache
set consists of $A$ ways or lines, where the number $A$ denotes the
associativity of the cache. Each way can hold the copy of a memory block
consisting of $B$ consecutive bytes. Hence, the total capacity of the cache is
$S \cdot A$ memory blocks, or $S \cdot A \cdot B$ bytes. Usually, the numbers
$A$, $S$, and $B$ are powers of 2; in particular, $S = 2^s$ and $B = 2^b$.
- The Motorola ColdFire MCF 5307 has $A = 4$, $S = 128$, and $B = 16$. Hence,
the total capacity of the cache is $4 \cdot 128 \cdot 16 = 8192$ bytes $= 8$ kB.
- Motorola PowerPC MPC 750 and 755 processors have caches with $A = 8$,
$S = 128$, and $B = 32$. Hence, the total capacity of the caches is
$8 \cdot 128 \cdot 32 = 32768$ bytes $= 32$ kB each.
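The capacity arithmetic for the two processors can be checked with a few lines of Python. This is only a sketch: the function name `cache_capacity_bytes` is hypothetical, the ColdFire parameters are those given in Section III-D, and the PowerPC associativity of 8 is taken here as an assumption that is consistent with the 7 set-index bits and 5 offset bits stated for that processor.

```python
def cache_capacity_bytes(assoc, sets, block_size):
    """Total capacity of a set-associative cache: ways * sets * bytes per block."""
    return assoc * sets * block_size

# ColdFire MCF 5307: four-way, 128 sets, 16-byte blocks.
coldfire = cache_capacity_bytes(assoc=4, sets=128, block_size=16)

# PowerPC 750/755 (per cache): 128 sets, 32-byte blocks, assumed eight-way.
powerpc = cache_capacity_bytes(assoc=8, sets=128, block_size=32)

print(coldfire, "bytes =", coldfire // 1024, "kB")   # 8192 bytes = 8 kB
print(powerpc, "bytes =", powerpc // 1024, "kB")     # 32768 bytes = 32 kB
```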
The other two cache architectures can be considered as degenerate special
cases of $A$-way set-associativity: direct-mapped caches correspond to the case
$A = 1$ (each set has only one line), and fully associative caches correspond
to the case $S = 1$ (there is only one cache set).
Each memory block can only be stored in one specific cache set. The number
of this set consists of the lowest $s$ bits of the block number. Thus,
neighboring blocks will be stored in different cache sets.
A cache line may be either valid, i.e., contain a memory block, or invalid,
i.e., be currently free. A valid line containing block $m$ not only contains
the bit pattern forming the contents of $m$, but also a tag identifying $m$.
This tag is the block number of $m$ without the $s$ bits used as set number.
In the PowerPC 750/755, addresses have 32 bits. The lowest 5 bits are chopped
off to obtain the block number of 27 bits. Of these 27 bits, the 7 lower bits
indicate the cache set where the block can be stored, and the 20 upper bits
form the tag. Thus, $2^{20} \approx 10^6$ memory blocks are competing to be
stored in each set.
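The address decomposition just described for the PowerPC 750/755 can be written down directly with shifts and masks. The sketch below is illustrative only; the function name `split_address` and the sample address are hypothetical, while the bit widths (5 offset bits, 7 set bits, 20 tag bits) are the ones stated above.

```python
def split_address(addr, block_bits=5, set_bits=7):
    """Split a 32-bit byte address into block number, set number, and tag."""
    block_number = addr >> block_bits                  # drop the byte offset
    set_number = block_number & ((1 << set_bits) - 1)  # lowest s bits of the block number
    tag = block_number >> set_bits                     # remaining upper bits
    return block_number, set_number, tag

block, cache_set, tag = split_address(0x12345678)
print(hex(block), cache_set, hex(tag))

# The next block in memory (32 bytes further) lands in the neighboring set.
_, next_set, next_tag = split_address(0x12345678 + 32)
print(next_set, hex(next_tag))
```

For the sample address the tag is `0x12345` and the set number is 51; the following block has the same tag but set number 52, which illustrates why neighboring blocks do not compete for the same set.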
When a memory block $m$ is accessed, its number is partitioned into set number
$i$ and tag $j$. Then the tags of all valid lines in set $i$ are compared with
$j$. If there is a match, $m$ has been found in the cache (cache hit).
Otherwise, $m$ is copied into the cache. For this, a line $l$ of set $i$ is
determined where $m$ is placed. If $l$ is invalid, it is allocated for $m$.
If it is valid, the memory block residing there so far is replaced by $m$.
The algorithm used to determine $l$ is the replacement strategy of the cache.
Common replacement strategies are least recently used (LRU), first in first out
(FIFO), and random. SPARC processors have LRU caches, but ColdFire MCF 5307 and
PowerPC 750/755 have special replacement strategies called pseudo-round-robin
(ColdFire) and pseudo-LRU (PowerPC 750/755). In the following sections,
we sketch the modeling of LRU, pseudo-round-robin, and pseudo-LRU caches.
More complete descriptions of LRU cache analysis can be found in [12]–[14].
The important observation will be that a true LRU replacement strategy, in
contrast to all kinds of “pseudo” strategies, e.g., pseudo-LRU or
pseudo-round-robin, offers the chance for very precise results from a cache
analysis.
Table 1: Example for Age Updates (LRU)
Table 2: Age Update Function (LRU)
C. LRU Caches
In an LRU cache, each cache set has its own replacement logic. Therefore,
the cache sets are independent from each other, and it suffices to describe
the behavior of a single set. When speaking of “the cache” in
the sequel, we actually mean this single set.
1) The LRU Strategy: When a new memory block is copied into the cache and
there are invalid lines, the block is written into the first such line. If all
lines are valid, the LRU replacement strategy causes replacement of the memory
block that has been least recently used. This can be modeled by assigning ages
to the blocks in the cache. For an $A$-way set-associative cache, the set of
ages is $\mathcal{A} = \{0, \ldots, A - 1\}$. The most recently used block has
age 0, and the least recently used block has the maximal age $A - 1$.
In case of a cache miss, the accessed block is put into the cache with
age 0, all blocks in the cache age by 1, and the block with age $A - 1$
(if any) is removed from the cache. When a block is accessed that is currently
in the cache with age $a$, its age is reset to 0, all blocks younger than $a$
age by 1, while blocks older than $a$ are not affected.
Table 1 presents a sample access
sequence for a four-way set-associative cache, starting from an empty cache.
2) Concrete Cache States: The assignment of lines to memory blocks is
irrelevant for the question which blocks are in the cache at present and in the
future. One only needs to know what blocks are in the cache, and what their age
is. This information is given by a function $c : \mathcal{M} \to \mathcal{A}'$,
where $\mathcal{M}$ is the set of memory blocks and
$\mathcal{A}' = \mathcal{A} \cup \{\top\}$ is the set of ages plus an additional
element $\top$. For a block $m$, $c(m) = a \in \mathcal{A}$ means $m$ is in the
cache with age $a$, while $c(m) = \top$ means $m$ is not in the cache.
A concrete cache state is such a function, restricted by the property that no
two different memory blocks have the same age $a \neq \top$.
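3) Updates of Concrete Cache States: When a memory block $m_0$ is accessed, the
current concrete cache state $c$ is updated into a new concrete cache state
$\mathrm{up}(m_0)(c)$ defined by
$$\mathrm{up}(m_0)(c)(m) = \begin{cases} 0 & \text{if } m = m_0 \\ \mathrm{up}_{\mathcal{A}}(c(m_0))(c(m)) & \text{otherwise} \end{cases} \qquad (1)$$
using an age update function $\mathrm{up}_{\mathcal{A}}$. The age update
function for four-way set-associative caches is shown in Table 2.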
If the accessed block is not in the cache, all other blocks age by one,
and the one with age 3 (if any) is removed (last line). Otherwise, all blocks
younger than the accessed block age by one, and all older blocks keep their
age. Note that $\mathrm{up}_{\mathcal{A}}(a)(a)$ is undefined for
$a \neq \top$. The reason is that these values are not needed because different
memory blocks have different ages $\neq \top$ in concrete cache states.
4) Full Cache Analysis: Full cache analysis
tries to compute for each program point (and calling context) the set of all
concrete cache states possible at that point. Since this set is not computable
in general, the analysis can only produce a safe approximation of the exact
set, which in this case means a superset. This approximation should be as
precise as possible, i.e., the superset should be close to the exact set.
Table 3: Age Update for Must Analysis (LRU)
In practice, full cache analysis is intractable since the memory consumption
of the analyzer would be prohibitive. Thus, two less ambitious analyses were
developed by Ferdinand [12], [13]: must analysis (Section III-C5) telling which
memory blocks are certainly (must be) in the cache, and may analysis
(Section III-C6) telling which memory blocks may be in the cache.
Table 4: Age Update for May Analysis (LRU)
5) Must Analysis: The basic idea of must analysis is to approximate the set
$C$ of concrete cache states possible at a program point $p$ by one abstract
cache state $C^{u}$ that provides upper bounds for the ages of memory blocks in
all states contained in $C$. To formalize the idea of an upper bound, the set
$\mathcal{A}'$ of ages is ordered by $0 < 1 < \cdots < A - 1 < \top$. Hence,
$C^{u}(m) \neq \top$ implies $c(m) \neq \top$ for all $c$ in $C$, i.e., all
states in $C$ agree that block $m$ is in the cache. Thus, one may say that $m$
must be in the cache at program point $p$, no matter what the concrete cache
state at $p$ is.
Abstract ages are upper bounds of concrete ages: an abstract age $a$ stands
for concrete ages $0, \ldots, a$. Hence, the update function
$\mathrm{up}_{\mathcal{A}}^{u}$ for abstract ages is derived from the function
$\mathrm{up}_{\mathcal{A}}$ for concrete ages by
$$\mathrm{up}_{\mathcal{A}}^{u}(a_0)(a) = \max \{\mathrm{up}_{\mathcal{A}}(a_0')(a') \mid a_0' \leq a_0,\ a' \leq a\}$$
where undefined values are neglected, and $\max \emptyset$ is set to $0$. The
resulting function for $A = 4$ is shown in Table 3.
Like concrete cache states, abstract cache states are functions from
$\mathcal{M}$ to $\mathcal{A}'$, but they may map different memory blocks to
the same age. An abstract state $C^{u}$ approximates a concrete state $c$ if
$C^{u}(m) \geq c(m)$ for all memory blocks $m$. The update function
$\mathrm{up}^{u}$ for abstract cache states of must analysis has the same form
as the one for concrete cache states (1), but uses the abstract age update
function $\mathrm{up}_{\mathcal{A}}^{u}$ instead of $\mathrm{up}_{\mathcal{A}}$.
It is correct in the sense that if $C^{u}$ approximates a concrete state $c$,
then $\mathrm{up}^{u}(m_0)(C^{u})$ approximates $\mathrm{up}(m_0)(c)$.
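One way to make the must update concrete is to derive the abstract age update mechanically from the concrete one. The following sketch assumes that the derivation takes the maximum of the concrete results over all concrete ages covered by the abstract ones, that undefined concrete values are skipped, and that an empty maximum counts as 0; this reading is reconstructed from the prose above and should be treated as an assumption. The names `up_age_must` and `up_must`, the integer encoding of $\top$, and the sample states are hypothetical.

```python
A = 4
TOP = A                    # "not in the cache", the largest element of the order
AGES = range(A + 1)        # 0, ..., A-1, TOP

def up_age(a0, a):
    """Concrete age update; returns None where it is undefined."""
    if a == TOP:
        return TOP
    if a == a0:
        return None
    return a + 1 if a < a0 else a

def up_age_must(a0, a):
    """Largest concrete result over all concrete ages bounded by (a0, a)."""
    results = [up_age(x0, x) for x0 in AGES if x0 <= a0
                             for x in AGES if x <= a]
    defined = [r for r in results if r is not None]
    return max(defined, default=0)   # empty maximum taken as 0 (assumption)

def up_must(m0, cu):
    """Abstract state update; cu maps blocks to upper bounds of their ages,
    and a block missing from cu has the bound TOP (nothing is guaranteed)."""
    a0 = cu.get(m0, TOP)
    new = {m0: 0}
    for m, a in cu.items():
        if m != m0:
            bound = up_age_must(a0, a)
            if bound != TOP:
                new[m] = bound
    return new

cu = {"x": 1, "y": 1}      # both blocks are known to have age at most 1
cu = up_must("z", cu)      # z is not guaranteed to be cached
cu = up_must("x", cu)      # x is guaranteed to be cached
print(cu)
```

Under these assumptions the derived function collapses to a simple rule: a bound grows by one exactly when it is smaller than the bound of the accessed block. The sample run ends with the bounds 0, 1, 2 for `x`, `z`, `y`.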
6) May Analysis: May analysis is dual to must analysis. It approximates the set
$C$ of concrete cache states possible at a program point $p$ by one abstract
state $C^{\ell}$ that provides lower bounds for the ages of memory blocks in all
states contained in $C$. The order on $\mathcal{A}'$ is the same as in must
analysis: $0 < 1 < \cdots < A - 1 < \top$. Hence, $C^{\ell}(m) = \top$ implies
$c(m) = \top$ for all $c$ in $C$, i.e., all states in $C$ agree that block $m$
is not in the cache.
An abstract age $a$ now stands for concrete ages $a, \ldots, A - 1, \top$.
Hence, the update function $\mathrm{up}_{\mathcal{A}}^{\ell}$ for abstract ages
of may analysis is derived from the concrete function
$\mathrm{up}_{\mathcal{A}}$ by
$$\mathrm{up}_{\mathcal{A}}^{\ell}(a_0)(a) = \min \{\mathrm{up}_{\mathcal{A}}(a_0')(a') \mid a_0' \geq a_0,\ a' \geq a\}$$
where undefined values are neglected. The resulting function for $A = 4$ is
shown in Table 4.
Table 5: Example for Combined Must and May Analysis (LRU)
Again, abstract cache states are arbitrary functions from $\mathcal{M}$ to
$\mathcal{A}'$. Now, an abstract state $C^{\ell}$ approximates a concrete state
$c$ if $C^{\ell}(m) \leq c(m)$ for all memory blocks $m$. The update function
$\mathrm{up}^{\ell}$ for abstract cache states, which results from (1) by
replacing $\mathrm{up}_{\mathcal{A}}$ by $\mathrm{up}_{\mathcal{A}}^{\ell}$, is
correct in the same sense as the update function of must analysis.
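The may update can be sketched in the same style, now taking the minimum of the concrete results over all concrete ages that an abstract lower bound stands for. As before, this derivation is a reading of the prose and is labeled an assumption; the names `up_age_may` and `up_may`, the integer encoding of $\top$, and the sample state are hypothetical.

```python
A = 4
TOP = A                    # "not in the cache", the largest element of the order
AGES = range(A + 1)        # 0, ..., A-1, TOP

def up_age(a0, a):
    """Concrete age update; returns None where it is undefined."""
    if a == TOP:
        return TOP
    if a == a0:
        return None
    return a + 1 if a < a0 else a

def up_age_may(a0, a):
    """Smallest concrete result over all concrete ages of at least (a0, a)."""
    results = [up_age(x0, x) for x0 in AGES if x0 >= a0
                             for x in AGES if x >= a]
    # Never empty: the pair (TOP, TOP) is always included and defined.
    return min(r for r in results if r is not None)

def up_may(m0, cl):
    """Abstract state update; cl maps blocks to lower bounds of their ages,
    and a block missing from cl has the bound TOP (certainly not cached)."""
    a0 = cl.get(m0, TOP)
    new = {m0: 0}
    for m, a in cl.items():
        if m != m0:
            bound = up_age_may(a0, a)
            if bound != TOP:
                new[m] = bound
    return new

cl = {"x": 0, "y": 0}      # either block may be the youngest one
for block in ["y", "z", "w", "v"]:
    cl = up_may(block, cl)
print(cl)
```

After four accesses to other blocks, `x` has dropped out of the dictionary, which in this encoding means that `x` is certainly no longer in the cache. Note that a block whose lower bound equals that of the accessed block ages as well, in contrast to must analysis.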
7) Must and May Analysis Together: Must and may analysis performed together yield lower and upper bounds, i.e.,
intervals. Thus, the combined analysis has abstract states of type
$\mathcal{M} \to \mathcal{I}$, where $\mathcal{I}$ is the set of age intervals.
Table 5 shows the evolution of an interval cache state under a sequence
of accesses.
The example starts with the “unknown” abstract cache, which
maps all memory blocks to the interval $[0,\top]$ that provides no information.
This is the appropriate state at the entry of a task analyzed separately, where
the analyzer has no information about the previously executed code. The example
shows that this lack of knowledge only matters at the beginning: the first three
accesses cannot be classified as hits or misses, but cause the intervals to
shrink. From access 5 on in this example, all intervals are singletons, i.e., the
cache analyzer has exact knowledge about the cache contents. This exact
knowledge may be destroyed by control-flow joins where the incoming intervals
have to be replaced by their join, i.e., the least interval containing all of
them. Accesses whose target address is not exactly known are another source of
uncertainty (see Section III-F). Yet, these uncertainties disappear again while
straight-line code with exactly known target addresses for accesses is analyzed,
as happened in the example of Table 5.
So LRU caches admit a quite precise analysis leading to complete knowledge
of the cache contents in some cases.
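The interplay of both bounds can be illustrated with a small interval simulation. The sketch below is hedged in several ways: the access sequence is hypothetical and is not the one of Table 5, the two bound updates are closed forms assumed to agree with the must and may age updates for a four-way cache, and all names (`classify`, `join`, `access`, `interval_of`) are invented for this illustration. A state is a pair of a dictionary for blocks seen so far and one default interval shared by all other blocks.

```python
A = 4
TOP = A                                 # "not in the cache"

def upper_after(a0, a):                 # assumed must update of an upper bound
    return a + 1 if a < a0 else a

def lower_after(a0, a):                 # assumed may update of a lower bound
    if a == TOP:
        return TOP
    return a + 1 if a <= a0 else a

def interval_of(state, block):
    known, default = state
    return known.get(block, default)

def access(state, m0):
    """Update all age intervals for an access to block m0."""
    known, default = state
    lo0, hi0 = known.get(m0, default)
    def update(iv):
        return (lower_after(lo0, iv[0]), upper_after(hi0, iv[1]))
    new_known = {m: update(iv) for m, iv in known.items() if m != m0}
    new_known[m0] = (0, 0)
    return new_known, update(default)

def classify(interval):
    lo, hi = interval
    if hi != TOP:
        return "always hit"             # even the oldest possible age is cached
    if lo == TOP:
        return "always miss"            # even the youngest possible age is outside
    return "not classified"

def join(i1, i2):
    """Least interval containing both arguments (control-flow join)."""
    return (min(i1[0], i2[0]), max(i1[1], i2[1]))

state = ({}, (0, TOP))                  # the "unknown" cache: every block in [0, TOP]
verdicts = []
for block in ["a", "b", "c", "d"]:
    verdicts.append(classify(interval_of(state, block)))
    state = access(state, block)

print(verdicts)
print(classify(interval_of(state, "a")), classify(interval_of(state, "e")))
```

With four distinct blocks, none of the first four accesses can be classified, but afterwards every interval is a singleton: a further access to `a` is an "always hit" and an access to any new block such as `e` is an "always miss". Joining the singletons `(1, 1)` and `(3, 3)` gives `(1, 3)`, which still guarantees a hit, whereas joining `(1, 1)` with `(TOP, TOP)` loses the classification.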
D. ColdFire MCF 5307: Pseudo-Round-Robin Replacement
The ColdFire cache has a size of 8 kB. It is four-way set-associative with
128 cache sets of four lines each. Each line may store a memory block of 16
bytes. As in all set-associative caches, each memory block $m$ can only be put
into one cache set, whose number is derived from the address of $m$.
The ColdFire MCF 5307 employs a so-called
pseudo-round-robin replacement strategy. The state of the replacement
logic is given by a 2-bit counter. The counter is neither used nor modified
in case of a cache hit or if a block is put into a set with empty lines; in
the latter case, the block is put in the first such line. If a block is put
into a full cache set, the 2-bit counter indicates which of the four lines
is replaced. After the replacement, the counter is increased by one (modulo
4).
There is only one counter for the whole cache.
Hence, a replacement in one cache set influences all other sets.
1) An Example: Assume a program accesses
the memory blocks $0, 1, 2, \ldots$ and block $i$
is put into cache set $i \bmod 128$.
Such a scenario corresponds to a linear program without data access
to memory (all data are in registers or in some noncachable memory area).
Assume further the program starts with an empty cache. Then the blocks 0–127
are put into the first line of each set, the blocks 128–255 into the
second line, 256–383 into the third line, and 384–511 into the
fourth line. The resulting cache state is depicted in Table 6, where the columns represent the cache sets. The next memory
block 512 is put into set 0. The counter has not been used so far, and still
has value 0; hence, block 512 is put into line 0 and replaces block 0. The
counter is set to 1, and so, block 513 is put into line 1 of set 1, replacing
block 129. Continuing like this until block 639, the resulting cache state
is as shown in
Table 7, where the recently
added blocks are printed in boldface. Block 640 then replaces 512, 641 replaces
513, etc.
Table 6. Example: ColdFire Cache (After Block 511)
Table 7. Example: ColdFire Cache (After Block 639)
From this example, one may learn that some blocks (like 1 and 128) may
stay in the cache forever although they are never referenced again, while
other blocks (like 512 and 513) are removed from the cache when their cache
set is referenced for the next time. Although these remarks in their full
strength only hold in this regular example, they show that in general, an
analysis must take into account that some blocks may survive many cache updates,
while others are thrown out immediately.
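The scenario of this example can be replayed with a small simulation. The following Python sketch is an illustration only: the class name `PseudoRoundRobinCache` and its data layout are assumptions made here and do not stem from any tool described in this paper. It implements just the replacement rules stated above (one global 2-bit counter, allocation into the first empty line, counter untouched on hits and allocations).

```python
NUM_SETS = 128   # cache sets of the ColdFire MCF 5307 cache
NUM_WAYS = 4     # lines per set


class PseudoRoundRobinCache:
    """Concrete cache with a single global 2-bit replacement counter."""

    def __init__(self):
        # sets[s][w] is the block stored in line w of set s (None = empty)
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.counter = 0

    def access(self, block):
        """Access a memory block; return True on a hit, False on a miss."""
        lines = self.sets[block % NUM_SETS]
        if block in lines:
            return True                        # hit: counter untouched
        if None in lines:
            lines[lines.index(None)] = block   # first empty line, counter untouched
            return False
        lines[self.counter] = block            # full set: counter selects the victim
        self.counter = (self.counter + 1) % NUM_WAYS
        return False


cache = PseudoRoundRobinCache()
for block in range(640):                       # blocks 0, 1, ..., 639
    cache.access(block)
print(cache.sets[0])   # block 128 is still in line 1 of set 0
print(cache.sets[1])   # block 1 is still in line 0 of set 1
```

Because each pass over the 128 sets advances the counter by a multiple of four, every set is always visited with the same counter value in this regular access pattern, which is why blocks such as 1 and 128 are never replaced.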
2) Problems: The computation of (an
approximation of) the set of all concrete cache states possible at a program
point is intractable. An abstraction of this set into one abstract cache state
should contain a model of the counter. The counter stays the same or increases
by one; in the presence of uncertainties caused by control-flow joins or an initially
unknown cache state, the analyzer cannot know what happens to the counter
if an access cannot be classified as hit or miss. After three such uncertainties,
all counter information is lost and can never be recovered.
Absolute counter values can be avoided by assigning ages to the lines:
the line the counter points to has age 3, the next line age 2, etc. Yet, ages
stay the same or increase by one, and sometimes one does not know what happens
to them. Thus, there is the same problem as above: after three uncertainties,
all age information is lost.
May analysis tries to determine which blocks may be in the cache (or equivalently,
which blocks are certainly not in the cache at a given program point). Without
counter or age information, one can never be sure that a block is removed
from the cache. Thus, may sets get larger and larger. When starting from an
unknown cache state, the initial may set already contains all memory blocks,
and this never changes. Therefore, may analysis for the ColdFire cache is
completely useless.
3) Must Analysis: In must analysis,
we want to compute, for each program point, the set of memory blocks
that are definitely in the cache (the must set). Initially, the
must set is empty, no matter whether we
start out with an empty cache or a cache with an unknown state, since in the
latter case we do not know of any memory block that is definitely in the
cache. When a memory block $m$ is accessed,
it will certainly be in the cache afterwards, so it can be added to
the must set. If it has not been in the cache before, then
another block may be thrown out of the cache. Without counter or age information,
we do not know which one. Hence, whenever a new element $m$
is added to the must set, all elements
of the must set that are in the cache set where $m$
is put must be removed from it. Therefore, the must set
can contain at most
one memory block for each cache set. This property is also preserved at control-flow
joins, where all incoming sets are replaced by their intersection.
This kind of must analysis is simple and efficient, but not very precise:
for each cache set, it determines at most one memory block that is definitely
in the cache, although concretely, a cache set can hold up to four blocks.
Thus, one may say that the analysis models only $1/4$
of the cache, but we do not know of any better analysis.
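The update and join of this must analysis can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the function names `must_update` and `must_join` and the representation of the must set as a `frozenset` of block numbers are chosen here for illustration, and block `b` is assumed to map to cache set `b % 128` as in the example of this section.

```python
NUM_SETS = 128  # cache sets of the ColdFire MCF 5307 cache


def must_update(must, block):
    """Return the must set after an access to `block`."""
    if block in must:
        return must  # guaranteed hit: nothing is replaced
    # Possible miss: any other block of the same cache set may be evicted.
    survivors = {b for b in must if b % NUM_SETS != block % NUM_SETS}
    return frozenset(survivors | {block})


def must_join(*incoming):
    """Combine the must sets arriving at a control-flow join."""
    return frozenset.intersection(*incoming)


state = frozenset()                 # nothing is known initially
for block in (5, 6, 133):           # 133 maps to the same cache set as 5
    state = must_update(state, block)
print(sorted(state))                # block 5 is no longer guaranteed
```

Since every insertion clears the other entries of the affected cache set, the invariant of at most one block per cache set holds by construction.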
E. PowerPC 750/755: Pseudo-LRU Replacement
PowerPC 750/755 processors have two separate caches for instructions and
data. Each cache has a size of 32 kB and is eight-way set-associative with
128 cache sets of eight lines each. Each line may store a memory block of
32 bytes. The replacement logics of the two caches are of the same kind.
Each cache set has its own instance of the replacement logic. Therefore,
the cache sets are independent from each other, and it suffices to describe
the behavior of a single set. When speaking of “the cache” in
the sequel, we actually mean a single set of one of the two caches.
1) Pseudo-LRU Replacement Strategy: Older PowerPC models have four-way set-associative caches with LRU replacement.
After the upgrade to eight-way set-associative caches, LRU was replaced by
a so-called pseudo-LRU (PLRU) strategy to save hardware costs [15].
In the following description of PLRU, the eight lines of the cache will
be called ${\rm L0}, {\rm L1},\ldots, {\rm L7}$.
The PLRU replacement logic for such an eight-line cache has an inner state
given by the values of 7 bits ${\rm B0}, {\rm B1},\ldots, {\rm B6}$.
When memory block $m_0 \in {\cal M}$ is accessed, the following happens.
- Determine what to do, and the involved line $l_0$:
  - If $m_0$ is already in the cache (hit), let $l_0$ be the line where it is.
  - If $m_0$ is not in the cache (miss):
    - If there is an invalid line, let $l_0$ be the first such line, and put $m_0$ there (allocate).
    - If all lines are valid, let $l_0$ be the line the replacement bits point to. This line is calculated from the settings of the replacement bits ${\rm B0}, \ldots, {\rm B6}$ as specified in Fig. 1. Put $m_0$ into line $l_0$, replacing its previous contents.
- Update the replacement bits so that they point away from the involved line $l_0$. The update is specified in Table 8. The bits not mentioned in the table are not changed.
Fig. 1. Determination of replacement line (PLRU).
Table 8. PLRU Bit Update Rules
Fig. 2. Effect of repeated misses on the PLRU cache.
The rule for updating the replacement bits negates the 3-bit values that
lead to the replacement of $l_0$. For instance,
L5 is selected if ${\rm B0} = 1$, ${\rm B2} = 0$, and ${\rm B5} = 1$, and for
$l_0 = {\rm L5}$, the bit updates are
${\rm B0} := 0$, ${\rm B2} := 1$, and ${\rm B5} := 0$.
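The selection and update rules can be made concrete with a short Python sketch. The decision tree encoded in `PATHS` is an assumption: it is the usual binary-tree arrangement in which B0 chooses between L0 to L3 and L4 to L7, B1 and B2 choose a pair of lines, and B3 to B6 choose a line within a pair. It is consistent with the L5 example given in the text, but the names `PATHS`, `victim`, and `touch` are introduced here for illustration only.

```python
# ASSUMED decision tree: for each line L0..L7, the (bit index, value) pairs
# of B0..B6 under which the replacement logic selects that line.
PATHS = {
    0: ((0, 0), (1, 0), (3, 0)),
    1: ((0, 0), (1, 0), (3, 1)),
    2: ((0, 0), (1, 1), (4, 0)),
    3: ((0, 0), (1, 1), (4, 1)),
    4: ((0, 1), (2, 0), (5, 0)),
    5: ((0, 1), (2, 0), (5, 1)),
    6: ((0, 1), (2, 1), (6, 0)),
    7: ((0, 1), (2, 1), (6, 1)),
}


def victim(bits):
    """Return the line that the bit setting (B0, ..., B6) points to."""
    return next(line for line, path in PATHS.items()
                if all(bits[i] == v for i, v in path))


def touch(bits, line):
    """Point the bits away from `line` by negating the values selecting it."""
    new = list(bits)
    for i, v in PATHS[line]:
        new[i] = 1 - v
    return tuple(new)


bits = (1, 0, 0, 0, 0, 1, 0)   # B0 = 1, B2 = 0, B5 = 1
print(victim(bits))            # 5, i.e., L5
print(touch(bits, 5))          # (0, 0, 1, 0, 0, 0, 0)
```

Only three of the seven bits are read to find a victim and only three are written on an access, which is what makes the scheme cheaper in hardware than true LRU.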
2) Examples: In the following examples,
the values of ${\rm B0}, \ldots, {\rm B6}$ are
written as a bit string in which gaps group the decision levels (cf. Fig. 1).
Example 1: First, assume the bit setting is the one present after cache
invalidation, and all lines are invalid. The first access is a miss since
all lines are invalid. The accessed memory block is placed into the first
invalid line, L0. Coincidentally,
this is the line the bit setting
points to. At the end, the bit setting is updated so that it points away from L0.
Assume the second access is again
a miss. Then the accessed memory block is placed into the first invalid line,
which is now L1. Note that this
time, the line the current bit setting
points to is different, namely L4. At the end, the bit setting
is updated again; the new setting
also points to L4.
Fig. 3. Example involving hits (PLRU).
In all the other examples, we assume that
all lines are valid. Hence, a miss replaces the block in the line the current
bit setting points to. This line is indicated after each bit setting,
following the arrow.
Fig. 2 shows the behavior of the
cache if only misses occur, for some arbitrarily chosen initial setting of
the replacement bits. All eight lines are replaced (in a seemingly irregular order),
and after eight misses, the original bit setting is recovered. These observations
are true for all 128 possible initial settings of the 7 replacement bits.
The cache is less well behaved if hits may occur.
Fig. 3 shows the effect of alternating
between accessing the block in L0 and accessing a block not in the cache (miss).
Note that the state in the last line of the example is the same as the state
in the second line. Hence, the states will cycle through the ones listed in
the example forever unless the regular access pattern changes. The blocks
in L4–L7 are replaced by new blocks, while the blocks in L0–L3
stay in the cache forever. For L0, this is natural since the block in L0 is
continuously accessed. Yet, the blocks in L1–L3 also survive although
they are never accessed. Such a behavior could not happen with a proper LRU
strategy.
Although these remarks in their full strength only hold in this regular
example, they show that in general, one must take into account that some blocks
may survive many cache updates although they are never accessed, while others
are thrown out quickly.
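The behavior described in these examples can be reproduced with a small simulation of one PLRU cache set. The sketch below is hedged in two ways: the decision tree in `PATHS` is an assumed binary-tree layout (consistent with the L5 and L4 statements in the text, since Fig. 1 is not reproduced here), and the class `PlruSet` as well as the block names `old0`, `new0`, etc. are hypothetical.

```python
from itertools import product

# ASSUMED decision tree: (bit index, value) pairs selecting each line L0..L7.
PATHS = {
    0: ((0, 0), (1, 0), (3, 0)), 1: ((0, 0), (1, 0), (3, 1)),
    2: ((0, 0), (1, 1), (4, 0)), 3: ((0, 0), (1, 1), (4, 1)),
    4: ((0, 1), (2, 0), (5, 0)), 5: ((0, 1), (2, 0), (5, 1)),
    6: ((0, 1), (2, 1), (6, 0)), 7: ((0, 1), (2, 1), (6, 1)),
}


def victim(bits):
    """Line that the bit setting (B0, ..., B6) points to."""
    return next(l for l, p in PATHS.items() if all(bits[i] == v for i, v in p))


def touch(bits, line):
    """Negate the bit values that select `line`."""
    new = list(bits)
    for i, v in PATHS[line]:
        new[i] = 1 - v
    return tuple(new)


class PlruSet:
    """One eight-line cache set; None marks an invalid line."""

    def __init__(self, bits=(0,) * 7, lines=None):
        self.bits = tuple(bits)
        self.lines = list(lines) if lines is not None else [None] * 8

    def access(self, block):
        """Access `block`; return (hit, involved line)."""
        hit = block in self.lines
        if hit:
            line = self.lines.index(block)
        else:
            line = self.lines.index(None) if None in self.lines else victim(self.bits)
            self.lines[line] = block
        self.bits = touch(self.bits, line)
        return hit, line


# Repeated misses: every initial setting replaces all lines and recurs.
for start in product((0, 1), repeat=7):
    bits, replaced = start, []
    for _ in range(8):
        replaced.append(victim(bits))
        bits = touch(bits, replaced[-1])
    assert sorted(replaced) == list(range(8)) and bits == start

# Alternating between a hit in L0 and a miss.
cache = PlruSet(lines=[f"old{i}" for i in range(8)])
for n in range(16):
    cache.access("old0")
    cache.access(f"new{n}")
print(cache.lines)   # L1..L3 still hold their never-accessed old blocks
```

Each hit in L0 sets B0 so that the next miss is directed to L4 to L7, which is why the blocks in L1 to L3 are protected without ever being accessed.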
3) Analyses: Like LRU, the PLRU strategy admits
the introduction of ages. Yet, the age update function is not as regular as
the one of the LRU strategy, which hampers both must and may analysis.
In fact, may analysis does not yield any information at all: starting from
an unknown cache, it never determines any memory block that certainly is removed
from the cache. The example of Fig. 3
shows that indeed some blocks may reside in the cache forever although they
are never accessed.
Must analysis does yield some information, but not as much as in LRU caches:
it finds at most four memory blocks in every cache set (of eight blocks possible
in practice). This analysis is more complicated than the ColdFire analysis,
but models
![[$1/2$]](http://mathfigs.ieeexplore.ieee.org/iel5/5/27343/1215685/1111663.gif)
of the cache.
To assess how much information a cache analysis of the PLRU cache strategy
loses compared to a cache analysis of an LRU cache of the same
size, one can compare the results of the must analyses. An ad hoc precision
parameter is the number of cache lines guaranteed to be in the cache at a
program point. Summing this parameter over all program points and dividing by the number of
program points gives a measure for a given program under analysis. The higher
this value, the better the analysis is. Since we do not know the number of
cache lines that are guaranteed to be in the cache in the concrete execution of the program, this
constitutes only a relative precision comparison.
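Written as a formula, the measure is the average size of the must information per program point. The symbols are introduced here only for illustration and do not appear elsewhere in this paper: $P$ denotes the set of program points considered by the analysis, and $\mathit{must}(p)$ the set of cache lines that the must analysis guarantees to be in the cache at point $p$.

$$\mathit{precision} = \frac{1}{|P|} \sum_{p \in P} \left|\mathit{must}(p)\right|$$

Two analyses of the same program can then be compared by the ratio of their precision values.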
Fig. 4 gives the resulting precision value for a larger benchmark
(84 kB) of PowerPC code for the instruction cache must analysis. This benchmark
contains code pieces typical for avionics software (filters, CRC computation,
etc.). The different lines in the figure correspond to different context mappings
of the underlying data-flow analysis. Context mappings are used to distinguish
different execution histories (e.g., loop iterations or call sequences) of
a program in the data-flow analysis.
Fig. 4. PLRU versus LRU cache analysis.
As the results show, the precision depends on the precision
of the data-flow analysis itself, i.e., the mapping used. The
cs0 mapping uses the callstring(0) approach. This approach does not,
e.g., distinguish different loop iterations in the analysis, so it is not
very exact. Therefore, the results for LRU and PLRU are nearly the same. The
vivu mapping distinguishes call histories and the first
iteration of loops from the remaining iterations. The other
${\tt vivu(n)}$ mappings distinguish in addition
up to $n$ loop iterations. As can be seen, the
gap between the results for an LRU and a PLRU cache increases for the more
precise analyses, up to a factor of 1.609. Taking into account that the benchmark
size is just around 2.5 times the whole cache size, this is already a significant
loss of precision for the WCET prediction.
F. Data Caches
In the description above, we always assumed that the address of a memory
access is exactly known. While this is true for instruction access and data
access with absolute addressing, the
addresses of indirectly accessed data are in general unknown at compile time.
Assume the concrete cache state is $c$ when
an access happens that may refer to block $m_1$ or block $m_2$.
Then the resulting cache state
is ${\rm up}(m_1)(c)$ or ${\rm up}(m_2)(c)$.
This is the same situation as at a control-flow join, where
two different concrete cache states may arrive. So with abstract cache states,
accesses to unknown addresses may be handled by the same merge operation as
at control-flow joins.
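As an illustration of this idea, the following Python sketch treats an access with several possible targets as the join over the updates for each target. It is a sketch under assumptions: it uses an LRU must analysis for a single cache set in the style described earlier in Section III (abstract states map blocks to upper bounds on their age), the associativity `ASSOC` is chosen arbitrarily, and all function names are introduced here for illustration.

```python
ASSOC = 4  # assumed associativity of the modeled cache set


def must_update(state, block):
    """LRU must update: `state` maps blocks to upper bounds on their age."""
    old_age = state.get(block, ASSOC)   # ASSOC: not known to be in the cache
    new = {}
    for b, age in state.items():
        if b == block:
            continue
        aged = age + 1 if age < old_age else age
        if aged < ASSOC:                # blocks that may be evicted are dropped
            new[b] = aged
    new[block] = 0
    return new


def must_join(a, b):
    """Keep blocks present in both states, with their maximal age bound."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}


def must_update_unknown(state, candidates):
    """Access whose target is one of the (non-empty) `candidates`."""
    results = [must_update(state, m) for m in candidates]
    joined = results[0]
    for r in results[1:]:
        joined = must_join(joined, r)
    return joined


state = {"a": 0, "b": 1}
print(must_update_unknown(state, ["m1", "m2"]))
```

In the example, neither `m1` nor `m2` is guaranteed to be cached afterwards, while the known blocks `a` and `b` both age by one. The larger the candidate set, the more information is lost, which motivates reducing the set of possible target addresses.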
The resulting loss of precision is more problematic for unified caches
(e.g., ColdFire MCF 5307) than for separate data and instruction caches (e.g.,
PowerPC 750/755). In the latter case, abstract instruction cache states are
not ruined by indirect data accesses.
To limit the loss of information due to indirect addressing, it is of paramount
importance to safely reduce the set of possible target addresses. Two methods
can be used: the exploitation of knowledge about memory allocation by the
compiler, and a
value analysis attempting to determine
effective addresses at compile time.
In
[16], methods
are described to statically determine the addresses of memory references to
procedure parameters or local variables by a static stack level simulation
[17]. This method works well
for programs that use only scalar variables.
G. Value Analysis
Value analysis computes for each processor register an interval of possible
values as approximations to the values occurring during runtime. To do this,
abstract versions of all processor instructions have to be modeled that are
based on interval values as operands. This includes not only simple arithmetic
operations like
add or
mul,
but also complex addressing modes like
register indirect
with scaled index to approximate the effective addresses of memory
references.
Since registers and memory cells have a finite precision, the detection
of (possible) overflows requires special attention to compute a
correct approximation. For example, the
add
instruction is implemented as follows:
The $\sqcup$ operator for merging
two abstract register or memory cell values at control-flow joins is a simple
union of intervals, i.e., the least interval containing both.
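A minimal Python sketch of such interval operations is given below. It is not the definition used in the analyzer: it assumes 32-bit unsigned registers with wrap-around arithmetic, represents an interval as a pair `(lo, hi)`, and falls back to the uninformative interval `TOP` whenever an overflow is possible but not certain. The names `WIDTH`, `TOP`, `abstract_add`, and `join` are introduced here for illustration.

```python
WIDTH = 32                       # assumed register width in bits
TOP = (0, 2**WIDTH - 1)          # interval carrying no information


def abstract_add(x, y):
    """Abstract `add` on unsigned intervals (lo, hi) with wrap-around."""
    lo, hi = x[0] + y[0], x[1] + y[1]
    if hi < 2**WIDTH:
        return (lo, hi)                          # overflow impossible
    if lo >= 2**WIDTH:
        return (lo - 2**WIDTH, hi - 2**WIDTH)    # overflow certain
    return TOP                                   # overflow possible: give up


def join(x, y):
    """Least interval containing both x and y."""
    return (min(x[0], y[0]), max(x[1], y[1]))


base = (0x1000, 0x1000)          # exactly known base register
index = (0, 12)                  # index register known only as a range
print(abstract_add(base, index)) # range of possible effective addresses
print(join((1, 3), (7, 9)))
```

The resulting address interval directly bounds the set of memory blocks that an indirect access may touch.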
Sometimes the approximated values indicate that a branch condition always
(or never) holds. Then, value analysis has detected an infeasible path. The
information about infeasible paths is forwarded to the cache and pipeline
analyses to improve analysis quality by reducing the number of combine operations
at control-flow joins.
IV. ANALYSIS OF A SIMPLE PIPELINE
The foundations of pipeline analysis and a proposal for a pipeline analysis
for the MicroSPARC architecture are described in [12], and a first implementation for the superscalar
pipeline of the SuperSPARC I is reported in [18] and [19].
Here, a short characterization of the SuperSPARC I architecture is given.
It has a three-way superscalar pipeline and groups instructions with at
most one memory instruction per group. It has separate first-level data and
instruction caches with four instructions per cache line. Loads into the cache
work in burst mode, i.e., two lines are loaded together. The cache replacement
strategy is LRU.
The SuperSPARC I performs static branch prediction with conditional branches
predicted as taken and has a delay slot for one instruction. At a conditional
branch, blocks of four consecutive instructions are prefetched in both directions,
namely, instructions in the drop-through direction into a sequential prefetch queue (SPQ) with a capacity of eight instructions
and instructions starting at the branch target into a target
prefetch queue (TPQ) with a capacity of four instructions.
A. Dependence of the Caches on the Pipeline
Instruction prefetching across a conditional branch will “damage”
the instruction cache, since prefetching is performed in the direction of sequential control
flow before the branch is identified, and in the direction of
the target once the branch has been decoded but before the condition is evaluated
and folded. The damage consists in the replacement of potentially useful instructions
by prefetched but currently useless instructions. The amount of damage depends
on the number of prefetched instructions. The question is how to statically
limit this damage so that a serialization of cache and pipeline analysis does
not lose too much precision.
We will now discuss this question in the context of the SuperSPARC
processor. As described above, the SuperSPARC predicts
conditional branches as taken and fetches instructions in both directions
into its prefetch queues. The maximal damage done to the (concrete) instruction
cache depends on whether the branch indeed is taken and whether the prefetched
instructions are in the cache or not.
| Max. damage | cache hit | cache miss |
|---|---|---|
| branch taken | 2 lines | 3 lines (burst mode) |
| branch not taken | 1 line | 2 lines (burst mode) |
The damage done to the abstract instruction cache should reflect the respective
worst case damage, i.e., three lines on the branch-taken edge and two lines
on the fall-through edge. It consists in the removal of this maximal number
of lines from the cache without safe information about lines moving in. It
is different for cache hits and cache misses. A cache hit for a prefetched
line leads to an increase in the ages of younger lines in the same set. A
cache miss leads to the aging of all lines in the set and removal of the oldest
in case the set is full.
An integrated cache and pipeline analysis may, in some cases, have enough
information about the contents of the pipeline and the prefetch queue to determine
sharper bounds on the damage. Hence, the difference in precision between a
serial and an integrated implementation is bounded by the damage as described
above.
Superscalarity has no additional effect on the cache behavior because the
instruction-cache effects are caused by instruction prefetching and not by
the dynamic grouping of instructions.
The target address is static for all but the
JMPL
and
RETT instructions. These use register-indirect
target addresses. The branch-target queue cannot be used for them, since the
addresses are computed late in the pipeline.
There are no dynamic effects on the
data cache
since all address computations happen in the integer pipeline, which has the
effect of serializing memory accesses.
B. Analysis Architecture
Cache analysis is performed first. Its results are fed into the pipeline
analysis. Predicted and potential cache misses are considered as causing pipeline
stalls. However, mutual compensation of cache miss penalties and pipeline delays
due to long-running instructions is also possible. Again, abstract interpretation
is used for pipeline analysis.
Superscalar pipelines execute instructions not only in an overlapped fashion,
but also concurrently. The concurrently executed instructions are dynamically
selected by the processor. The instruction-grouping decisions taken by the
superscalar processor depend on the availability of instructions, control
flow changes, data dependences, and resource conflicts. The concrete pipeline
semantics formalizes conditions for pipeline stalls and models the selection
of instructions for concurrent execution.
While the cache state usually contributes a considerable part to the size
of the execution state of a program, the pipeline state contributes relatively
little. Therefore, sets of concrete pipeline states are taken as elements
of the abstract domain for the pipeline analysis.
The abstract pipeline update function reflects what happens when a new
instruction enters the pipeline. It takes into account the current set of
pipeline states, in particular the resource occupations, the state of some
special resources, e.g., the prefetch queue, the grouping of instructions
for concurrent execution, and the classification of memory references as cache
hits or misses.
At control flow merge points, the abstract pipeline states of the joining
paths are combined by set union.
The output of the analyzer is a mapping
cycles
from instruction/context pairs to pairs of integers representing clock cycles.
The first element of a clock-cycle pair is the number of cycles
needed by the instruction to enter the pipeline. The second element is either
the number of cycles that are needed to flush the pipeline for
exit instructions or zero for all other instructions.
V. ANALYSIS OF MORE COMPLEX PIPELINES
The analysis of relatively simple pipelines like the one of the SuperSPARC,
as presented in the previous section, can be done by separate cache and pipeline
behavior analyses, resulting in a simpler, modular design of the WCET tool and in reduced space
and time consumption. For complex pipelines, which use a
combination of advanced features to enhance performance, this is no longer
possible. The interaction between several architectural features, e.g., branch
prediction, other types of speculative execution, superscalarity, out-of-order
execution, and (unified) caches, leads to imprecise results if these features
are treated in isolation, because the approximations to be made to stay on
the correct (safe) side are so conservative that the obtained results are
useless in practice.
Example 2 (Instruction Prefetching and Cache
Accesses to a Unified Instruction/Data Cache): Instruction prefetching
alters the cache contents, which again alters the timing of data accesses,
if data elements in the cache are replaced by instruction prefetches. Since
the amount of prefetching depends on the pipeline state in the presence of
a sufficiently large prefetch queue, the amount of this interference cannot
be determined precisely if cache analysis is performed before pipeline analysis.
Example 3 (Branch Prediction): If the CPU predicts branches early in the fetch stages and redirects fetching
before instruction dispatch—as is the case in most modern processors—fetching
can be redirected through several levels of branches, causing several areas
of instruction memory to be accessed and its placement in the cache to be
altered. Since the amount of change to the instruction cache caused by this
depends on the state of the prefetch queue and, thus, on the pipeline state,
it can only be crudely approximated by a separate cache analysis.
To assess the effect of a separation of cache and pipeline analyses, one
can look at the effects that the necessary approximations of the behavior
of other processor components for the cache analysis have. For the ColdFire
5307, first the possible effects of branch prediction have to be estimated.
The number of instructions along the predicted target of a branch can be bounded
to be between one and 14 instructions, since prediction is done very early
in the fetch pipeline and an instruction buffer of eight entries can be filled
with prefetched instructions, plus up to four instructions in the instruction
assembly stage (IED) and two instructions in the fetch stages. Since prefetching
can be performed across multiple branches, an additional data-flow analysis
on the control flow graph has to be performed to collect all possible instruction
fetches at each branch. The information obtained in this way contains for
each program point a sequence of guaranteed fetches and a sequence of possible
fetches. The effect of the latter is especially disastrous for the cache analysis.
Since it cannot be guaranteed that these instructions are fetched, they cannot
be guaranteed to be in the cache. However, since they may be fetched, they
may replace other cache lines from a cache set. The must analysis needs to
combine both possibilities, emptying all cache sets that possibly fetched lines
map to.
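The treatment of guaranteed and possible fetches in the must analysis can be sketched as follows. This is an illustrative Python sketch, not the implementation of the analyzer: it assumes the ColdFire-style must information of Section III-D (at most one known block per cache set, stored here as a dictionary from cache set index to block), assumes that the guaranteed fetches precede the possible ones, and uses the hypothetical function name `apply_fetches`.

```python
NUM_SETS = 128  # cache sets of the ColdFire MCF 5307 cache


def apply_fetches(must, guaranteed, possible):
    """Update ColdFire-style must information for prefetches at a branch.

    must       -- dict: cache set index -> the one block known to be cached
    guaranteed -- blocks that are certainly fetched, in fetch order
    possible   -- blocks that may or may not be fetched afterwards
    """
    new = dict(must)
    for block in guaranteed:
        new[block % NUM_SETS] = block      # certainly cached afterwards
    for block in possible:
        new.pop(block % NUM_SETS, None)    # contents of this set are unknown now
    return new


must = {0: 0, 1: 1}                        # blocks 0 and 1 are known to be cached
print(apply_fetches(must, guaranteed=[128], possible=[129, 130]))
```

A possibly fetched block is never added to the must information, yet it destroys the knowledge about its cache set, which explains why long sequences of possible fetches are so harmful.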
Performing a cache analysis with the results of these possibly fetched
instructions as described above and comparing the result to those of a cache
analysis where no branch prediction is taken into account gives an indication
of the damage inflicted by the branch prediction approximation needed for
a separate cache analysis. Fig. 5
shows the results based on the same measure as that in Fig. 4 for a set of avionics benchmarks. The programs in that
benchmark have around 40 kB of executable code.
Fig. 5. Precision loss due to separate cache analysis.
Note, however, that this does not include the effects of data accesses
to the unified ColdFire cache. Since we do not know in advance the ordering of instruction
fetches and data fetches, which depends on the pipeline state,
data fetches may replace instruction fetches in the cache and vice versa.
Again, we would not have any knowledge of the contents of the sets that data
accesses go to.
Fig. 5 shows for 12
tasks the number of lines known to be in the cache per instruction context.
The
PURE data is for a cache analysis without taking
branch prediction into account. The
APPROX data is
for the approximation explained above. The factor between the two data sets varies
from 1.3 to 1.484. It is in a range comparable to the precision loss between
PLRU and LRU caches.
The approximations, i.e., upper bounds on the damage, that have to be made
in these cases are normally very unrealistic in that they do not occur in
practice. Nonetheless, they are needed to stay on the safe side. Other troublesome
features include pessimism introduced by approximating branch mispredictions,
effects of out-of-order execution, and bus contention between fetch and load/store
units.
In order to obtain sufficiently precise results, one must perform an integrated
cache and pipeline analysis, which analyzes
all relevant features in combination. Abstract cache states are therefore
incorporated into the abstract pipeline states.
A. Pipeline Modeling
Any analysis of pipeline behavior should be based on an appropriate abstraction
of a concrete pipeline model. This abstraction is, in our case, obtained in
a sequence of steps. The
concrete pipeline model
describes the cycle-wise evolution of the pipeline state during instruction
execution. Of course, this evolution depends on the state of memory. The pipeline
can be modeled as a huge finite-state machine, making transitions between
states on every clock cycle. The transitions between states are deterministic
up to the dependence on the memory state. They are very complex, encoding features
like branch prediction.
Every state is structured as a collection
of
components. A component can be a register or the
reservation station of a processor functional unit, etc.
Given the complexity
of modern processors, it seems impossible and, in fact, unnecessary to explicitly
give their full definition as finite-state machines, because only few components
of the states are relevant for timing. Therefore, the concrete models are
designed with the restrictions of an abstract model (to be used in the analysis)
in mind. No component is precisely represented in the concrete model if it
need not be represented in the abstract model. Hence, in a first step, instead
of the full concrete pipeline model, a
reduced concrete
pipeline model is developed. In order to keep this reduced concrete
model deterministic, it suffices to treat the eliminated components as “oracles”
with deterministic but unknown values. One example for this is register contents:
Since no knowledge of the values of individual registers will be available
in the analysis, there is no need to model registers in the concrete model. Whenever
the evolution of the concrete pipeline depends on the state of such an “opaque”
component, the oracle is asked for the correct decision to take. In the evolution
of the abstract pipeline, the oracle is replaced by nondeterminism: The analysis
has to consider every possibility at such decision points.
Obtaining
the components of pipeline states and the transitions between them is a complex
task since the interactions between different components of a state can be
subtle and complicated (not to speak of the conditions on the transition rules).
Complexity is reduced by a
structuring step. The
components of a state are partitioned into (disjoint)
units (cf.
Fig. 6). Not surprisingly,
modeling can be easy if these units correspond to processor entities, e.g.,
fetcher, dispatcher, ALU, etc. Conceptually, units are just containers for
a number of components, where the components are closely related to one another,
e.g., the instruction queue and the prediction tracking registers. Interactions
between elements in different units are communicated by
signals. Signals represent abstract events, like “fetch an instruction”
or “flush the pipeline.” They depend
only
on the components in the sending units and their input signals. Signals usually
influence the evolution of the receiving units. To model events that take
effect only in the
next cycle, some signals are
delayed. This means that they are received in the cycle
following the current one. Delayed signals are thus a combination of a logical
event and cross-cycle hardware latches. The other signals are received
instantaneously in the same cycle. A transition is performed
by first applying evolution rules to each unit, which depend only on the components
of that unit and its input signals. These rules can change the state of components
of the unit and send out signals to other units. When all units have been
updated this way, one evolution cycle has been completed.
Fig. 6. Partitioning of a state.
An evaluation order for units makes this evolution deterministic. Units receiving
instantaneous signals are only updated after all corresponding sending units
have been updated. A model is only feasible if there is no cycle consisting
only of instantaneous signals. This type of modeling bears many similarities
to the approaches of some HDLs such as Verilog and VHDL; there, components
are encapsulated and communicate only via signals, too.
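The evaluation order and the feasibility condition can be illustrated by a small Python sketch. It is not the implementation of the tools described here: the function names, the rule interface (a rule maps a unit state and its received signals to a new state and a list of sent signals), and the two toy units are assumptions made for the example. Units are sorted topologically along their instantaneous signals, and delayed signals are latched for the next cycle.

```python
from graphlib import TopologicalSorter, CycleError


def evaluation_order(units, instantaneous):
    """Order units so that each sender of an instantaneous signal
    is updated before its receiver; reject infeasible models."""
    sorter = TopologicalSorter({u: () for u in units})
    for sender, receiver in instantaneous:
        sorter.add(receiver, sender)  # the receiver depends on the sender
    try:
        return list(sorter.static_order())
    except CycleError as err:
        raise ValueError(f"infeasible model: instantaneous cycle {err.args[1]}") from err


def step(order, rules, state, latched):
    """Perform one evolution cycle and return the delayed signals
    that will be received in the following cycle."""
    inbox = {u: list(latched.get(u, ())) for u in order}
    next_latched = {u: [] for u in order}
    for u in order:
        state[u], sent = rules[u](state[u], inbox[u])
        for dest, signal, delayed in sent:
            (next_latched if delayed else inbox)[dest].append(signal)
    return next_latched


# Hypothetical toy units: EX requests an instruction instantaneously,
# IB answers with a delayed signal that arrives one cycle later.
def ex_rule(current, inbox):
    for name, payload in inbox:
        if name == "instr":
            return payload, []
    if current is None:
        return None, [("IB", ("next", None), False)]
    return current, []


def ib_rule(queue, inbox):
    if queue and any(name == "next" for name, _ in inbox):
        return queue[1:], [("EX", ("instr", queue[0]), True)]
    return queue, []


rules = {"EX": ex_rule, "IB": ib_rule}
order = evaluation_order(["EX", "IB"], [("EX", "IB")])
state = {"EX": None, "IB": [0x400, 0x402]}
latched = step(order, rules, state, {})
state_after_first = dict(state)
latched = step(order, rules, state, latched)
```

In this toy run, EX is still empty after the first cycle because the answer of IB is delayed; it holds the instruction only after the second cycle. A model in which two units send each other instantaneous signals is rejected by `evaluation_order`.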
Starting from this (structured reduced) concrete model of pipeline evolution,
the
collecting semantics gathers for each program
point the set of pipeline states that may occur during execution of the corresponding
instruction at that point. This collecting semantics is abstracted to an appropriate
abstract domain, mapping each concrete state to one abstract state. This mapping
of states is performed component-wise, i.e., a component of a concrete state
is mapped to an abstract component in the abstract state. The opaque components
of the pipeline states are mapped to “$\top$,”
i.e., there is no knowledge about them in the abstract world. The caches,
which are components in the concrete state, are mapped to their abstractions
described in
Section III. The remaining
components are left as in the concrete domain.
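The component-wise mapping can be sketched as follows. This is only an illustration under assumptions: the component names, the dictionary representation of a state, and the placeholder cache abstraction (block to age bound) are hypothetical and stand in for the abstractions of Section III.

```python
TOP = "TOP"  # abstract value meaning "nothing is known"


def abstract_state(concrete, opaque, caches, abstract_cache):
    """Map a concrete state to an abstract state, component by component."""
    abstract = {}
    for name, value in concrete.items():
        if name in opaque:
            abstract[name] = TOP                    # opaque components lose all information
        elif name in caches:
            abstract[name] = abstract_cache(value)  # caches use their own abstraction
        else:
            abstract[name] = value                  # the rest is kept as in the concrete domain
    return abstract


# Placeholder cache abstraction: position in an age-ordered list becomes an age bound.
def ages(lines):
    return {block: age for age, block in enumerate(lines)}


concrete = {
    "registers": {"d0": 17},
    "icache": ["a", "b"],
    "instruction_queue": (0x400, 0x402),
}
abstract = abstract_state(concrete, opaque={"registers"}, caches={"icache"},
                          abstract_cache=ages)
```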
Fig. 7. Pipeline of the MCF 5307.
The cycle-wise evolution of the abstract states is applied
one-to-one to the states in the current set at a program point. As noted earlier,
in cases where the evolution depends on an opaque value, several successor
states are obtained for one state. This evolution is repeated as long as the
corresponding instruction has not left the pipeline. The number of evolutions
at one program point is used to compute an upper bound on the WCET for that
instruction. This way, WCETs for basic blocks are derived. These are the input
for the path analysis, which determines the WCET for the entire program.
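The cycle-wise evolution of a set of abstract states can be sketched on a toy pipeline. The stages, the value of `MISS_PENALTY`, and the function names are assumptions for illustration only. The outcome of the cache lookup is treated as opaque, so the state in the `request` stage has two successors; the evolution is repeated until the instruction has left the pipeline in every state of the set.

```python
DONE = ("done", 0)
MISS_PENALTY = 3  # hypothetical number of extra cycles for a cache miss


def successors(state):
    """Abstract evolution of one state by one cycle."""
    stage, n = state
    if stage == "request":
        # Opaque decision: both the hit and the miss have to be considered.
        return {("execute", 0), ("wait", MISS_PENALTY)}
    if stage == "wait":
        return {("wait", n - 1)} if n > 1 else {("execute", 0)}
    if stage == "execute":
        return {DONE}
    return {state}  # the instruction has already left the pipeline


def cycle_bound(initial_states):
    """Evolve all states until none holds the instruction any more;
    the number of evolutions is an upper bound on its execution time."""
    current, cycles = set(initial_states), 0
    while any(s != DONE for s in current):
        current = {t for s in current for t in successors(s)}
        cycles += 1
    return cycles


bound_unknown_cache = cycle_bound({("request", 0)})
bound_known_hit = cycle_bound({("execute", 0)})
```

In this toy model the bound is five cycles when the cache outcome is unknown, as opposed to two cycles along the hit path alone, which shows how opaque components enlarge both the state set and the resulting bound.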
B. An Example: The ColdFire 5307
We applied this technique to a popular processor in the embedded world, the
Motorola ColdFire 5307. The ColdFire family is the successor of the well known
M68k line of processors. It implements most of the M68k instructions, limiting
instruction lengths to 2, 4, or 6 bytes to speed up instruction handling.
Fig. 7 shows the schematics of the MCF 5307
pipeline. It consists of two separate pipelines—the instruction fetch
pipeline (IFP), and the operand execution pipeline (OEP)—coupled through
an eight entry instruction buffer. The IFP fetches instructions and performs
branch prediction in its four consecutive stages, while the OEP takes completely
decoded instructions from the instruction buffer and executes them in up to
two iterations through its two stages. The memory access interface is pipelined:
fetches are requested in the IC1 stage and instructions are received in IC2; data
accesses are issued in the AGEX stage, and read data are returned in the DSOC
stage.
The MCF 5307 has a hierarchy of memory busses
attached to it. Directly connected to the instruction-fetch and data-access
stages is the K-Bus, which runs at the same speed as the processor core. At
this K-Bus, the unified instruction/data cache and a 4-kB internal SRAM are
connected. Accesses that are uncached or miss in the cache are forwarded through
several controllers to the external bus, which runs at one third the speed
of the processor core.
The ColdFire features a simple form of branch prediction in the IED stage
of the fetch pipeline: branches going backward are predicted taken and fetching
in the IAG/IC1/IC2 stages is redirected to the target address of that branch.
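The static prediction rule can be written down in a few lines. This is a sketch under assumptions: the function name is hypothetical, and the treatment of forward branches (predicted not taken, so that fetching continues at the fall-through address) is assumed here and not stated above.

```python
def next_fetch_address(branch_address, length, target_address):
    """Address at which fetching continues after a branch has been seen."""
    if length not in (2, 4, 6):
        raise ValueError("ColdFire instructions are 2, 4, or 6 bytes long")
    if target_address < branch_address:
        return target_address        # backward branch: predicted taken
    return branch_address + length   # assumption: otherwise fall through


loop_back = next_fetch_address(0x410, 2, 0x400)
forward = next_fetch_address(0x410, 4, 0x420)
```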
Fig. 8. Model for the MCF 5307.
Fig. 9. Contents of the IED unit.
Since the amount of prefetching depends on the state of the OEP and the
number of instructions in the instruction buffer and can go through several
levels of branches, the exact instruction addresses accessed cannot be determined
precisely by a separate analysis. Also, since the cache is unified, instruction
fetches and data accesses influence the cache contents and behavior in a way
that depends on the contents of the pipeline.
Using the documentation available from Motorola, the pipeline model shown
in
Fig. 8 has been constructed. The
arrows represent signals; signals with a name in italics are delayed signals.
In this model, signals are used that carry additional data arguments with
them.
Fig. 9 shows the contents of
the IED unit in more detail.
This unit assembles complete instructions from the long words fetched by
the IC1/IC2 stages. It contains a small internal buffer, holding up to 8 bytes
from the previous stage. Only the length of this buffer is represented in
the model and named $\mathbf{B}$ in Fig. 9. The assembled instructions are inserted into
the instruction buffer as soon as a place is free. A queue $\mathbf{Q}$ holds waiting instructions.
The unit is also responsible for redirecting fetching after branch prediction.
It does so by issuing the $\mathsf{set}(a)$ signal to the address generation unit
for the new target address $a$. The evolution of this unit is described as a sequence
of pseudo-code instructions (“if IB is empty then emit the $\mathsf{instr}$ signal,” etc.).
For other units,
evolution can be coded in the style of transition rules of finite-state machines.
The evolution sequence for the model goes bottom up through
Fig. 8. This way, all instantaneous signals are generated
before they are used.
For example, a complete evolution cycle may happen in the following way,
assuming that we received a pipeline flush in an earlier cycle after a branch
misprediction and the EX unit is waiting for the instructions from the correct
branch target.
- The SST unit (holding a write delay timer) is not active, so it does
nothing.
- The EX unit is empty at the moment, so it requests the next instruction
by issuing the
signal.
- The IB is also empty, so it does nothing (in particular, it does not
provide the next instruction to EX).
- The IED unit is in the state depicted in Fig. 9. It receives the next fetched longword via a
signal
and emits an assembled instruction (at address 0x400) via a delayed
signal. After that, the state
of the unit is
.
- The IC2 unit did not receive a
signal (for which it is waiting) and, thus, cannot accept
anything from IC1; thus, it emits the
signal.
- The IC1 unit receives this
signal and emits a
signal to IAG.
- The Bus Unit waits for the long word at address 0x408 and decrements
its cycle wait counter. The counter reaches zero and, thus, the unit emits
a delayed
signal to make the data available for the next cycle.
- The IAG receives the
signal and does nothing.
C. A Complex Example: The PowerPC 755
A more demanding example is the PowerPC 755. This processor is a 32-bit
implementation of the PowerPC architecture and features superscalarity, speculative
execution, branch prediction, and out-of-order execution. It has separate
32-kB instruction and data caches, two integer units, a pipelined floating
point unit, a pipelined load/store unit, and a system register unit (cf.
Fig. 10).
Fig. 10. Pipeline of the PPC 755.
Dispatching and retirement are done in-order to conform to the PowerPC
architecture specification. Execution, however, is done out of order, and
the PPC 755 uses a completion queue (CQ) to keep track of up to six instructions
executing in parallel. Instruction fetching is tightly coupled with branch
prediction in the FU and the BPU. Branches are resolved as soon as they are
fetched and entered into the instruction queue (IQ). Known branches are folded
away at this stage; unknown branches are predicted and fetching continues
at the predicted target of the branch. Such speculatively fetched instructions
can be dispatched to the execution units. On misprediction, these instructions
must be flushed from the execution units and the CQ. A second level of branch
speculation may be fetched into the IQ.
Modeling these features is a demanding task.
Fig. 11 shows our model for this processor (delayed signals go
upward in this figure). The model is able to keep track of the state of the
internal units, the IQ/CQ, and the attached caches and bus activities. Branch
prediction is modeled in the evolution rules of the FBPU unit, which integrates
the real FU and BPU units. Here, information is kept about the level of speculation,
the branches that introduced the speculation and the instructions that resolve
it. The contents of the IQ are modeled in detail, as well as side conditions
on instruction dispatch, e.g., that some instructions must not be dispatched
before their predecessors are retired to guarantee single-threaded access
to some system registers. The integer units IU1/IU2 contain the addresses
of the instructions currently executing or being in the reservation stage
of that unit. The completion unit (CU), on the other hand, does not have an
inner state. Its evolution rules make use of the contents of the completion
queue and ($\mathsf{done}$) signals sent
out by the execution units upon completion of an instruction.
Fig. 11. Model for the PPC 755.
By structuring the complex pipeline states into units communicating via
signals only, it was possible to model this complicated processor in a reasonable
amount of time.
VI. OBSERVATIONS
In the course of the DAEDALUS project, we have implemented
the models for the ColdFire 5307 and the PowerPC 755 and performed extensive
analyses on several benchmarks in cooperation with Airbus Industries. The
analyzers were also installed at Airbus Toulouse and evaluated by them on
a large benchmark of avionics software, 12 tasks of altogether 1.2 million
instructions. Reference [20] reports on the results of this evaluation. The WCET predictions
obtained by our tool were between 6.2% and 13.5% lower than those obtained
by Airbus with their legacy method.
Our experience taught us that the
need to perform an integrated pipeline analysis increases the complexity of
the analysis task in two ways.
- The design of the analysis is more complicated compared to the design
of several separate analyses.
- The running time and space consumption of the analysis are higher than those of
several simpler separate analyses run in sequence.
Nonetheless, the integrated approach is the only one that bears a chance
of giving useful results: e.g., when one separates the cache analysis from
the pipeline analysis for the ColdFire 5307, one has to make the assumption
that up to eight instructions after every branch are prefetched, touching
one or two cache lines. In conjunction with the fact that the ColdFire cache
is modeled as a direct mapped cache, this throws away any information for
those cache lines. In addition, for every possible clash of data accesses
with such prefetched lines, nothing is known about the data access at all.
For larger programs this effectively means that no information about the cache
can be obtained by a separate analysis.
Our tests on real-life benchmarks have shown that the subtle interactions
between processor features are really noticeable in the analysis results and
that they can be accurately predicted.
The main cause of the time and space complexity of the analyses is the need
to analyze all possible successor evolutions if an evolution depends on “opaque”
components, i.e., in the case of nondeterminism. In theory, one could discard
all successors except for the one that represents the worst case, i.e., leads
to (an upper bound for) the WCET. In practice, we discovered that deciding
which of the successors represents this worst case is not easy. Successors
that represent a
local worst case (e.g., a cache
miss versus a cache hit) may not lead to the WCET
globally. This is due to the subtle influences and interdependencies of processor
components.
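A toy successor graph makes the difference between local and global worst case concrete. All cycle counts and the scenario are invented for illustration and are not measurements: it is assumed that a cache hit leaves the fetcher enough time to start a wrong-path prefetch that later blocks the bus, whereas a miss delays the fetcher until the branch is resolved.

```python
# Hypothetical successor graph: state -> [(cycles, next_state)].
GRAPH = {
    "fetch": [(1, "hit"), (4, "miss")],  # opaque cache lookup
    "hit": [(8, "end")],                 # assumed: wrong-path prefetch blocks the bus
    "miss": [(2, "end")],                # assumed: branch resolves before prefetching
    "end": [],
}


def exhaustive_bound(state):
    """Follow all successors and keep the maximum total."""
    return max((cost + exhaustive_bound(nxt) for cost, nxt in GRAPH[state]), default=0)


def greedy_local_worst(state):
    """Keep only the locally most expensive successor at each step."""
    total = 0
    while GRAPH[state]:
        cost, state = max(GRAPH[state])
        total += cost
    return total


safe = exhaustive_bound("fetch")
pruned = greedy_local_worst("fetch")
```

With these numbers the pruned analysis follows the miss and reports six cycles, while the path through the hit takes nine; discarding the locally cheaper successor would therefore yield an unsafe bound.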
VII. ARCHITECTURAL ADVICE: PREDICTABLE PERFORMANCE
The preceding sections have indicated that it is not a particularly good
idea to optimize for average performance if one aims
at processors with high predictable performance.
The loss in precision can be magnified if at the same time a few gates are
saved in the cache architecture at the wrong places. Of course, the claims
we can make are not absolute claims of the kind, “there is no method
by which sufficient precision about the timing behavior of programs can be
obtained if a pseudo-round-robin cache replacement strategy is used.”
Our claims are dependent on the use of static program analysis and our ways
of modeling processor components. However, we do not see alternative methods
that are both efficient enough to be used in practice and deliver sufficient
precision.
In the following, we list a number of processor properties whose combination
will allow high precision in statements about the timing behavior and a modular
design of the timing analyzer.
- Separate data and instruction caches: separate caches eliminate the
interdependencies of instruction prefetching and data accesses. This way,
the precision loss of separate cache and pipeline analyses can be reduced.
In an integrated analysis, worst case assumptions can be made more easily
since these dependencies need not be considered.
- Cache replacement strategies: these should be immune against “chaos.”
This means that when cache contents are not known at one point, subsequent
accesses can recover knowledge about the new cache contents. The ColdFire
cache with its global replacement counter does not allow one to recover knowledge
about the counter if one has no information on its value at some point. LRU
replacement strategies recover from “chaos”: after some cache
updates, the ages of the new elements in the cache are known.
Naturally, the update strategy should be (locally) deterministic; otherwise,
little can be statically said about cache contents.
The cache architecture should allow both must and may analyses for the
caches. Neither the ColdFire nor the PowerPC 755 cache makes this possible,
although the information about what is guaranteed not
to be in the cache is valuable in restricting the nondeterminism in the pipeline
analysis.
- Branch prediction, if any, should be static: the modeling of dynamic
branch prediction would lead to an even more complex integrated analysis.
A static, separate, and precise analysis of dynamic branch prediction is difficult
since it also depends on the pipeline state.
- Out-of-order execution should be limited: with out-of-order execution,
one has to consider the effects of all possible interleavings of instructions.
Clearly, this is difficult and imprecise to do statically in a separate analysis
since there are many possible interleavings, whereas a worst case interleaving
is not likely to occur during execution but must be assumed to ensure a correct
result. In an integrated approach, all interleavings are considered, but most
of them will not be worst cases. The granularity of the pipeline
model required for this increases both design complexity and analysis complexity.
- Shortcuts should be avoided: in general, shortcuts in the hardware
design, e.g., special cases to accelerate some operation, if certain (dynamic)
conditions hold, should not be used. While they definitely improve average
performance, they yield little gain in running typical real-time tasks; nonetheless,
they must be modeled in quite some detail in the pipeline analysis or they give
rise to increased nondeterminism.
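The recovery from “chaos” mentioned for cache replacement strategies can be illustrated by a small sketch of a must analysis for one LRU cache set. The representation (a dictionary from memory block to an upper bound on its age) and the function names are assumptions for the example; the counter model is a deliberate simplification in which a global counter advances by one, modulo the associativity, on each replacement.

```python
def lru_must_update(must, block, ways):
    """Must-cache update for one LRU set; must maps block -> upper age bound."""
    old_age = must.get(block, ways)  # a block not known to be cached counts as oldest
    updated = {}
    for other, age in must.items():
        if other == block:
            continue
        aged = age + 1 if age < old_age else age  # only younger blocks grow older
        if aged < ways:                           # older blocks may have been evicted
            updated[other] = aged
    updated[block] = 0                            # the accessed block is the youngest
    return updated


def counter_update(possible, ways):
    """Possible values of a round-robin counter after one replacement."""
    return {(c + 1) % ways for c in possible}


WAYS = 4
must = {}  # "chaos": nothing is guaranteed to be in the cache
for block in "abcd":
    must = lru_must_update(must, block, WAYS)

counter = set(range(WAYS))  # "chaos": the counter may have any value
for _ in range(4):
    counter = counter_update(counter, WAYS)
```

After four accesses to distinct blocks, the LRU must cache describes the complete set with exact ages, whereas the set of possible counter values is as large as before, so no knowledge about the counter has been regained.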
VIII. FUTURE WORK AND OPEN PROBLEMS
There are three major areas for our future research on pipeline analysis.
- Modeling processors with other features or new combinations of features. When
comparing the analysis results on more processors, the effects of individual
features on analysis precision and/or complexity can be evaluated more precisely.
- At the moment, the implementation of our models is done by handwritten
code. To reduce the realization phase of a pipeline analysis, we are developing
a framework to generate these implementations from concise specifications
of the models.
- The models themselves need to be specified more formally. With a
formal model, analyses on the model itself are possible. With the help of
such analyses, the nondeterminism in the pipeline analyses can be reduced
by limiting the number of successor evolutions that have to be considered
for the worst case by identifying one or a few successors that may lead to
the global worst case.
IX. CONCLUSION
Modern processors are optimized for average case performance. The features
that contribute to this average case performance, like caches, branch prediction,
speculation, or out-of-order execution, make it difficult to determine the
worst case performance. Modular, separate analyses of the behavior of programs
on these features cannot be done with sufficient precision due to the interdependencies
between different processor components.
We presented a methodology to analyze cache and pipeline behavior by abstract
interpretation and pipeline modeling. We implemented this methodology for
two advanced processors, the ColdFire 5307 and the PowerPC 755. As a consequence
of the lessons learned in this process and of the complexity of the resulting analyses,
we proposed a few guidelines for the design of processors to be used in hard
real-time systems.
ACKNOWLEDGMENT
Many colleagues have collaborated in the design and implementation
of the described WCET tools. Thanks go to C. Ferdinand,
F. Martin, M. Schmidt, J. Schneider, M. Sicks, and H. Theiling.
REFERENCES
Reinhold Heckmann received the Dr.rer.nat. degree
in computer science from Saarland University, Saarbruecken, Germany, in 1991. After working as a Lecturing Assistant at the University of the Saarland
and a Research Fellow at Imperial College, London, U.K., he is now a Senior
Researcher at AbsInt Angewandte Informatik GmbH, Saarbruecken, Germany. His
major research areas include programming languages and compiler construction,
document processing, semantics of programming languages and domain theory,
exact real arithmetic, and static analysis of real-time systems, in particular
cache and pipeline analysis.
Marc Langenbach received the Dipl. degree in computer
science from Saarland University, Saarbruecken, Germany, in 1997. He is currently
working toward the Ph.D. degree in the Compiler Research Group at the Computer
Science Department, Saarland University. His research interests include static worst-case time prediction for modern
hardware, embedded systems, and compiler construction.
Stephan Thesing received the Dipl. degree in computer
science from the University of Bielefeld, Bielefeld, Germany, in 1996. He
is currently working toward the Ph.D. degree at the Compiler Research Group
at the Computer Science Department, Saarland University, Saarbruecken, Germany.
His research interests include static worst-case time prediction for modern
hardware, embedded systems, and program analysis.
Reinhard Wilhelm received the Dr.rer.nat. degree from the Technical University of Munich,
Munich, Germany, in 1977. He has been a Professor of Computer Science at the University of the Saarland,
Saarbruecken, Germany, since 1978, and Scientific Director of the International
Conference and Research Center for Computer Science, Schloss Dagstuhl, Germany,
since 1990. His major research areas include compiler construction and compiler
generation, in particular attribute grammars, tree pattern matching, tree
parsing, code selection, instruction scheduling, static analysis, parallel
languages and their implementation, and run-time guarantees for real-time programs.
He is coauthor of several textbooks on languages and compilation and on document
processing.
1 Things would work equally well the other way around due to the
duality principle of lattice theory.
2 If they do not depend
on unknown input values in a nontrivial way.
3 Besides the contents of address registers obtained by value analysis, cf.
Section III-G.