Analyzing Arm's MPAM From the Perspective of Time Predictability

With heterogeneous multi-core platforms being crucial to execute the highly demanding workloads of modern applications, memory-access predictability remains a key issue for system safety. Many solutions have been proposed over the years, but none has been applied on a large scale. Nowadays, we face an unprecedented opportunity to have an impact on commercial platforms: the Memory System Resource Partitioning and Monitoring (MPAM) specification by Arm, which describes different memory-access regulation mechanisms and presents a valuable industrial attempt to address this issue. However, several points of the specification are described at a high level only, leaving plenty of room for interpretation to hardware manufacturers. This paper takes a close look at the memory-access regulation mechanisms in the MPAM specification and provides some detailed instantiations of such mechanisms. A fine-grained memory contention analysis is presented for each of them to finally enable a comparison of their worst-case performance.


INTRODUCTION
Memory-access predictability is a key challenge for modern heterogeneous platforms. Indeed, with the advent of ever more demanding workloads, such as those required in autonomous driving applications, the usage of powerful computing platforms equipped with multiple processing cores and hardware accelerators is becoming a de-facto requirement for many systems [1]. On the other hand, many of those applications are highly critical, thus calling for a high degree of predictability.
The real-time systems research community has been studying this problem for almost a decade, proposing many clever solutions to improve the memory-access predictability of different types of memories and shared buses, whether accessed by CPU cores [2], [3], [4], I/O devices [5], [6], [7], or hardware accelerators [8], [9], [10]. These solutions include the usage of performance counters to keep track of the number of memory accesses [2], the development of memory-aware execution models [11], [12], and the design and implementation of custom components such as predictable buses [7], [13] and memory controllers [14], [15], [16].
While these efforts greatly helped understand and mitigate many memory-related problems for predictability, they suffer from a substantial issue: the lack of general applicability and built-in support from hardware vendors. Indeed, providing predictable access to shared memories and buses has been largely under-considered by chip vendors, but they are the only ones that can ultimately solve the problem on a large scale by providing predictable hardware support.
Very recently, a unique opportunity has been offered by the chip-vendor landscape: the Memory System Resource Partitioning and Monitoring (MPAM) [17] specification by Arm. After almost a decade in which this problem has been well known and continuously studied in the research community, Arm acted from the industry side by providing a specification with several guidelines to implement memory-access regulation mechanisms, thus giving an unprecedented opportunity to solve the problem on billions of Arm-based platforms, while providing the needed long-term support that is difficult to reach with custom solutions.
Interestingly, to date, the number of platforms implementing the MPAM specification is extremely limited, so there is hopefully still room to provide recommendations to chip manufacturers implementing MPAM to maximize execution predictability. Indeed, the MPAM specification provides quite a high-level description of the regulation mechanisms, leaving plenty of room for interpretation to whoever wants to implement it. Slightly different behaviors of such mechanisms may lead to drastically different worst-case timing performance, and therefore each detail needs to be carefully addressed at design time.
When "measuring" the predictability of a system component, a key metric is the accuracy that we can achieve in deriving analytical bounds. Indeed, other kinds of performance indicators, such as those collected by running the system in simulation or on a prototype, are useful and complementary to the analytical characterization of the worst-case latency that can be experienced in accessing a memory-system component. However, it is the possibility of deriving accurate latency bounds that ultimately makes a system more or less amenable to certification.
Contribution. In light of these facts, this paper proposes an analysis-driven evaluation of a set of MPAM mechanisms. First, we take a close look at the MPAM specification, and in particular at those mechanisms aimed at achieving predictability at the level of the Double Data Rate Synchronous Dynamic Random-Access Memory (DDR SDRAM). While doing this, we highlight each step in which the specification is missing relevant information to unambiguously specify the behavior of a mechanism. We then propose a number of different lower-level specifications for the studied mechanisms, providing all the required elements to enable real-time analysis. This allows building a memory-contention analysis based on an optimization problem. Thanks to the modularity of the approach, mechanism-specific constraints are derived and independently studied while keeping the overall structure of the analysis framework unaltered. Finally, an analysis-driven comparison of the studied mechanisms is presented, reaching conclusions on their effectiveness in enabling time predictability.

ESSENTIAL BACKGROUND ON DDR DRAMS
A DRAM memory device consists of a set of memory chips organized into ranks. Each rank is further divided into multiple banks. A bank consists of a matrix of memory cells, and it is provided with a row buffer, which behaves similarly to a cache for the memory bank: multiple consecutive accesses to the same row result in lower latency. To access a specific memory cell, the content of its row in the matrix needs to be copied into the row buffer. Since the row buffer can store at most one row at a time, this operation may involve copying back the row currently stored in the row buffer. A memory access targeting a row contained in the row buffer is said to be a row-hit; otherwise, it is said to be a row-conflict.
When a row-hit occurs, the memory access (either a read or a write) can be performed by means of the CAS (Column Access Strobe) DRAM command.
In case of a row-conflict, three commands need to be issued in sequence. The first one is the PRE (PREcharge) command, which copies the content currently stored in the row buffer back to the corresponding DRAM row; the second one is the ACT (ACTivate) command, which copies the data from the target DRAM row to the row buffer; the last one is the CAS command, needed to actually perform the operation.
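To make the row-hit/row-conflict distinction concrete, the following minimal sketch (our illustrative notation, not part of any standard) maps the state of a bank's row buffer to the DRAM command sequence needed to serve one access:

```python
def commands_for_access(target_row, open_row):
    """Return the DRAM command sequence for one access.

    open_row is the row currently held in the bank's row buffer,
    or None if the row buffer is empty (e.g., after a precharge).
    """
    if open_row == target_row:
        return ["CAS"]                # row-hit: single column access
    if open_row is None:
        return ["ACT", "CAS"]         # empty buffer: open the row, then access
    return ["PRE", "ACT", "CAS"]      # row-conflict: write back, open, access
```

The shorter sequence for row-hits is precisely why commodity controllers privilege them, as discussed for the FR rule later in the paper.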
The access to the DRAM memory is orchestrated by a memory controller (MC), which collects the requests incoming from the processing elements and routes DRAM commands to the DRAM chips via the buses that connect them. The memory controller and the DRAM chips are interconnected by means of two buses, one for data and one for commands. Memory requests directed to different banks can be handled in parallel if no contention occurs on the two buses. The memory controller is responsible for implementing the scheduling logic for memory requests. The timing constraints to be fulfilled between consecutive transmissions of commands and data on the DRAM buses are regulated by the JEDEC (Joint Electron Device Engineering Council) standard for DRAM memories, and they can be coarsely classified into intra-bank constraints, i.e., related to commands and data targeting the same bank, and inter-bank constraints, i.e., related to different banks. More details on the JEDEC timing constraints can be found in the appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2022.3202720.

SYSTEM MODEL
The system comprises a set P = {p_1, ..., p_P} of P identical processing elements (i.e., cores or other bus masters). Processing elements implement the out-of-order execution paradigm and access a globally-shared memory by means of an interconnect and a memory controller.
All the processing elements share a global DRAM memory G. A crossbar switch allows conflict-free point-to-point communication between each processing element and the DRAM memory controller. We consider a single-channel [18] and single-rank DRAM subsystem, where the DRAM memory is composed of one rank divided into a set B of N_B banks.
Following the MPAM specifications, different sources of memory transactions are identified by partition identifiers (PARTIDs). This work considers a set R of PARTIDs, where each individual PARTID r_i ∈ R is used to determine the partitioning of memory resources. As extensively discussed later in Section 4, a PARTID can be, for example, assigned to a task or a virtual machine.
Memory Controller Model. In this paper, we build on the memory controller (MC) model used in previous work [19], which is inspired by realistic designs used both in industry [20], [21], [22] and academia [18], [23], [24]. Its high-level structure is shown in Fig. 1. For each bank b_y ∈ B, the MC provides a queue to accommodate read requests. Each of these queues exposes at most one request that is ready for inter-bank arbitration. An inter-bank scheduler selects the request to be forwarded to the DRAM memory among those exposed by the intra-bank queues. Requests are in general composed of multiple commands (i.e., the PRE, ACT, and CAS presented in Section 2). The inter-bank scheduler selects a new request to be scheduled when the previous one (if any; otherwise the new request is served upon arrival) is completed, i.e., when all its commands have been issued and the corresponding JEDEC constraints are satisfied. Write requests are enqueued separately and served in batches [18], [19], [24]. This technique is called write batching and is typically used in COTS controllers [21] to improve throughput (in architectures where writes do not stall the processing pipeline and can be processed asynchronously) by giving precedence to reads over writes while avoiding starvation. This memory controller model is used next as a baseline MC in our paper. As this work studies different alternatives to implement the MPAM specifications, when required, the MC behavior defined in this section is either enriched or revised in some of its parts, but keeping the overall structure unaltered. The behavior of the baseline MC is formalized by means of a set of rules, which are classified into three categories: (i) intra-queue arbitration rules, (ii) inter-queue arbitration rules, and (iii) write handling rules.
Intra-queue arbitration rules:
MC1. FCFS: Requests in each per-bank queue are organized in first-come-first-served (FCFS) order. FCFS prioritizes older requests over newer requests.
MC2. FR: In each per-bank queue, row-hits are privileged over row-conflicts, i.e., requests causing row-hits can pass ahead of older requests causing row-conflicts. The number of requests that can pass ahead of any other request, with respect to the FCFS order, is limited by the constant N_thr.
MC3. FR-FCFS: The overall scheduling policy applies, in order, first rule MC2 and then rule MC1.
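A minimal sketch of the FR-FCFS intra-bank pick (rules MC1-MC3) may help fix the intended behavior; the request representation and the per-request bypass counter are our own illustrative choices, not mandated by the model:

```python
def pick_fr_fcfs(queue, open_row, n_thr):
    """Pick the next request from a per-bank queue under FR-FCFS.

    queue: FCFS-ordered list of dicts {'row': int, 'bypassed': int},
           where 'bypassed' counts how many younger requests have
           already passed ahead of this request.
    open_row: row currently loaded in the bank's row buffer.
    n_thr: N_thr cap on reordering with respect to FCFS (rule MC2).
    Returns the index of the request to serve and updates the bypass
    counters of the requests it passes over.
    """
    # Rule MC2 (FR): serve the oldest pending row-hit, unless doing so
    # would bypass a request that has already been passed N_thr times.
    for i, req in enumerate(queue):
        if req['row'] == open_row:
            if all(r['bypassed'] < n_thr for r in queue[:i]):
                for r in queue[:i]:
                    r['bypassed'] += 1
                return i
            break  # cap reached: fall back to plain FCFS
    # Rule MC1 (FCFS): otherwise, serve the oldest request.
    return 0
```

With `n_thr = 1`, a row-hit can jump ahead of an older row-conflict exactly once; the next pick reverts to FCFS, which is what bounds the reordering in the worst-case analysis.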
Inter-queue arbitration rules:
MC4. Round Robin: The inter-bank arbitration employs the round-robin scheduling algorithm. At most one request is served at each round-robin turn. No inter-bank reordering is allowed.
Write handling rules:
MC5. Write Batching: The memory controller enqueues write requests and serves at least N_wb of them in a batch as soon as W_thr write requests are enqueued (W_thr is referred to as the watermarking threshold [24]). The write batch stops after N_wb writes if there is at least one pending read, and at least one read request is served afterwards.
For write batching, we use the same assumptions as in prior work [18], [19], [24]. Namely, in our model: (i) the write buffer has a size Q_write; (ii) W_thr ≥ N_wb, so that when the watermark threshold is exceeded there are always at least N_wb write requests to form a batch; (iii) Q_write − N_wb < W_thr, i.e., after issuing a batch the overall number of writes falls below the watermarking threshold; and (iv) the write buffer is large enough to ensure it never becomes full. Furthermore, it is worth highlighting that if a read request targets data for which there exists a pending request in the write buffer, then it is served from the write buffer without accessing the DRAM. This is required to guarantee data causality.
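The watermark-triggered behavior of rule MC5 can be sketched as follows; the class and method names are ours, and the model deliberately omits the DRAM timing details:

```python
class WriteBatcher:
    """Minimal model of write batching (rule MC5): reads take
    precedence over writes, except while a batch is in progress."""

    def __init__(self, w_thr, n_wb):
        assert w_thr >= n_wb        # assumption (ii) in the text
        self.w_thr, self.n_wb = w_thr, n_wb
        self.write_queue = []
        self.in_batch = 0           # writes remaining in the current batch

    def enqueue_write(self, w):
        self.write_queue.append(w)
        # Watermark crossed: start serving a batch of N_wb writes.
        if self.in_batch == 0 and len(self.write_queue) >= self.w_thr:
            self.in_batch = self.n_wb

    def next_is_write(self, reads_pending):
        """True if the next served request should be a write."""
        if self.in_batch > 0:
            return True             # batch in progress
        # Otherwise writes are served only when no read is pending.
        return not reads_pending and bool(self.write_queue)

    def serve_write(self):
        self.in_batch = max(0, self.in_batch - 1)
        return self.write_queue.pop(0)
```

Note how, once the batch of N_wb writes completes, a pending read is served before any further write, which is the starvation-avoidance property the rule encodes.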
Definitions. We conclude the section by providing some useful definitions. A request r_x issued by r_i suffers interference from another request r_y when r_y is served while r_x is pending. Given a request r_y, the interference it can cause to another request r_x is categorized as either intra-bank or inter-bank interference. Namely, r_y causes intra-bank interference to r_x if r_y and r_x target the same bank and r_x suffers interference from r_y. Instead, r_y causes inter-bank interference to r_x if r_x suffers interference from r_y and they target different banks.
The inter-bank interference is further divided into direct and transitive. A request r_x experiences direct inter-bank interference when it participates in the inter-bank arbitration (i.e., it is at the top of its intra-bank queue) and suffers inter-bank interference. Conversely, a request r_x experiences transitive inter-bank interference when it is not at the top of its intra-bank queue and suffers inter-bank interference. The latter phenomenon occurs when r_x suffers inter-bank interference due to other requests that cause intra-bank interference to it and that, in turn, suffer inter-bank interference.
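The categorization above reduces to two observable conditions on the interfered request; a small sketch (with our own function and label names) makes the case split explicit:

```python
def interference_kind(same_bank, at_queue_head):
    """Categorize the interference a served request r_y causes to a
    pending request r_x.

    same_bank: True if r_x and r_y target the same bank.
    at_queue_head: True if r_x is at the top of its intra-bank queue,
                   i.e., it participates in the inter-bank arbitration.
    """
    if same_bank:
        return "intra-bank"
    # Different banks: direct if r_x competes in inter-bank arbitration,
    # transitive if it is still queued behind other requests.
    return "direct inter-bank" if at_queue_head else "transitive inter-bank"
```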
A table of symbols is reported in the appendix, available in the online supplemental material.

A CLOSE LOOK AT THE MPAM SPECIFICATION
Next, we proceed in two steps. First, in this section we take a close look at the official MPAM specification as provided by Arm [17] for what concerns memory-bandwidth and memory-priority partitioning. The specification presents several mechanisms to improve memory-access predictability and isolation, but unfortunately provides only high-level and informal descriptions that leave large room for interpretation to the hardware vendors that want to actually implement such mechanisms.
Later, in Section 5, we propose some possible instantiations of the MPAM specification, with the goal of both enabling timing analysis and guiding hardware vendors towards the design choices that best maximize predictability. In our proposals, to focus on solutions that are practically viable and that can be integrated into existing architectural designs with limited effort, the MPAM mechanisms are combined with other standard design choices typically employed in COTS platforms to maximize average-case performance and throughput.
Overview. The Memory System Resource Partitioning and Monitoring (MPAM) [17] specification released by Arm for Armv8-A is designed to allow the partitioning of memory system components (MSCs) shared among different applications and virtual machines. The specification states that each MSC, such as caches, interconnects, and DRAM memory controllers, might support MPAM (page 47 of [17]). The resource partitioning is based on the PARTID, an identifier that can be assigned, for example, to a VM or a task (when MPAM PARTID virtualization is enabled). The PARTID is then propagated through the MSCs to allow resource control and monitoring. When a new memory request r arrives at an MSC implementing MPAM, the measured usage of the MSC resources is compared with the control settings configured for the PARTID with which r is associated. The MPAM architecture defines different control interfaces for MSCs, discussed below.
The MPAM specification provides two main categories of techniques: those related to caches and those related to memory bandwidth regulation.
The first category includes cache-portion partitioning, a mechanism that allows allocating storage portions of the cache to partitions, and cache maximum-capacity partitioning, which specifies a capacity limit for the storage used by a PARTID in the cache. This category is not studied in this paper. Rather, this work focuses on the second category, which includes the following memory-access regulation techniques: memory-bandwidth minimum-maximum partitioning, priority partitioning, and memory-bandwidth portion partitioning. These three techniques are extensively reviewed below based on the information provided by the MPAM specification [17].

Memory-Bandwidth Min-Max Partitioning
The MPAM reference manual defines an interface for the management of memory bandwidth by configuring two values α_i^M and α_i^m for every partition r_i. These values respectively represent the maximum and minimum bandwidth to be assigned to r_i (e.g., to a task assigned to r_i). They can be specified by writing to the MPAMCFG_MBW_MAX and MPAMCFG_MBW_MIN registers, available in every MPAM-compliant MSC (e.g., a memory controller or an interconnect) that implements this partitioning strategy. Quoting the MPAM manual, "The control parameters for bandwidth partitioning schemes are all expressed in a fixed-point fraction of the available bandwidth", which means the user can specify, by means of a percentage, how much of the total bandwidth can be used by each partition.
Since the MPAM specification does not provide a formal definition of bandwidth, in compliance with it, we assume it should be interpreted as the number of memory transactions in a given time interval. The size of such a time interval (for partition r_i) is referred to as the accounting window and denoted by w_i. Based on these premises, the bandwidth used by each partition r_i can be measured by counting the number q_i(t) of transactions served in the current accounting window. This value can then be compared with the minimum and maximum budgets. To keep track of the bandwidth usage, it is necessary to define the time interval over which the traffic measurement is performed. To this end, the MPAM manual proposes two possible high-level strategies to define a time window: fixed accounting window and moving accounting window. The specification also states that a vendor is free to implement its own custom accounting window strategy, as long as it is "in line with the schemes described" above. Since our purpose is to enable timing analysis, which requires well-defined accounting windows, this paper considers the fixed accounting window scheme. Memory traffic is hence measured by counting the requests issued over a fixed period of time that periodically repeats. When an accounting window terminates, a new window begins with no history of the used bandwidth (i.e., if an accounting window starts at t*, then q_i(t* + k·w_i) = 0, ∀r_i ∈ R, k ∈ N). The MPAM specification offers a configurable register to customize the size of the window for each PARTID (MPAMCFG_MBW_WINWD), expressing its length in microseconds.
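The fixed accounting window scheme described above can be sketched as a small per-PARTID counter; the class and method names are ours, chosen for illustration under the assumption of a time-stamped request stream:

```python
class FixedWindowAccounting:
    """Fixed accounting window for one PARTID: the counter q_i(t)
    is reset at every window boundary t* + k*w_i, with no carried
    history of the bandwidth used in previous windows."""

    def __init__(self, w_i):
        self.w_i = w_i            # window length (e.g., microseconds)
        self.q = 0                # q_i(t): requests in the current window
        self.window_start = 0.0   # start of the current window

    def _advance_to(self, t):
        # Cross every window boundary up to time t, resetting the counter.
        while t >= self.window_start + self.w_i:
            self.window_start += self.w_i
            self.q = 0

    def on_request(self, t):
        """Account for a request issued at time t; return q_i(t)."""
        self._advance_to(t)
        self.q += 1
        return self.q
```

Note that a moving (sliding) accounting window would instead require remembering the timestamps of recent requests, which is one reason the analysis in this paper sticks to the fixed scheme.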
Next, we provide details on maximum-minimum bandwidth partitioning, whose two parts may be applied either jointly or separately by an MPAM implementation.
Memory-Bandwidth Minimum Partitioning (MB-mP). MB-mP is a bandwidth partitioning strategy based on the design goal of enforcing a minimum bandwidth for every partition.
When using MB-mP, each PARTID r_i is characterized by a minimum bandwidth α_i^m. A minimum budget Q_i^m is defined to be compared with the current q_i(t) value at run time. The behavior of MB-mP complies with the following high-level rules, which have been extracted from the MPAM specification:
min1 If, during the current accounting window, partition r_i issued less than Q_i^m requests (i.e., q_i(t) < Q_i^m), requests from PARTID r_i are preferentially selected to be served.
min2 If, during the current accounting window, partition r_i already issued at least Q_i^m requests (i.e., q_i(t) ≥ Q_i^m), "requests from PARTID r_i may compete with other requests, as enabled by other regulation mechanisms implemented", if any (e.g., maximum-bandwidth partitioning).
As one may note, the MPAM specification leaves a lot of room for interpretation: for example, if q_i(t) < Q_i^m holds for multiple PARTIDs, it is not specified how their requests compete with each other. To better specify the behavior of MB-mP, we discuss in Section 5 some design choices that we deem reasonable.
Memory-Bandwidth Maximum Partitioning (MB-MP). MB-MP is a bandwidth partitioning strategy based on the design goal of enforcing a maximum bandwidth for every partition. When MB-MP is enabled, each partition r_i is characterized by a maximum bandwidth α_i^M assigned to it. Furthermore, for each PARTID, a boolean flag h_i ∈ {T, F} can be configured to control how requests are handled when the current bandwidth consumption is above the maximum. In the MPAM specification, this flag corresponds to the HARDLIM bit of the MPAMCFG_MBW_MAX register.
A maximum budget Q_i^M is defined to be compared with the current q_i(t) value at run time. The behavior of MB-MP complies with the following high-level rules (extracted from the MPAM specification):
MAX1 If, during the current accounting window, partition r_i issued a number of memory requests between Q_i^m and Q_i^M (i.e., Q_i^m ≤ q_i(t) < Q_i^M), requests from r_i are served when "there are no competing minimum bandwidth requests to serve". Requests from PARTIDs r_j such that Q_j^m ≤ q_j(t) < Q_j^M compete with each other to use bandwidth, as allowed by the other implemented MPAM regulation mechanisms.
MAX2 If, during the current accounting window, partition r_i already issued at least Q_i^M memory requests (i.e., q_i(t) ≥ Q_i^M) and h_i = F, requests from r_i "compete with other requests to use bandwidth only when there are no competing requests" to serve for PARTIDs r_j ≠ r_i such that q_j(t) < Q_j^M.
MAX3 If, during the current accounting window, partition r_i already issued at least Q_i^M memory requests (i.e., q_i(t) ≥ Q_i^M) and h_i = T, requests from r_i are not served and are saved for later service, namely, until a time t_2 > t at which q_i(t_2) < Q_i^M.
When minimum bandwidth partitioning is not implemented, the rules above still apply considering Q_i^m = 0. Again, these rules are not precise enough in defining the behavior of MPAM to allow hardware vendors to directly implement them. To this end, additional design choices need to be considered, which are crucial to achieve a truly predictable behavior. Therefore, later in the paper we discuss different alternatives to implement the specification.
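Our reading of rules min1-min2 and MAX1-MAX3 can be condensed into a single classification function; the return labels are our own shorthand, not terminology from the specification:

```python
def regulate(q_i, q_min, q_max, hardlim):
    """Classify a new request from PARTID r_i given its counter q_i(t),
    its minimum/maximum budgets, and its HARDLIM flag h_i.

    "preferred"  -> min1: below the minimum budget, served preferentially.
    "compete"    -> MAX1: between budgets, competes with peers.
    "stall"      -> MAX3: over the maximum with h_i = T, held until the
                    counter drops (i.e., the next accounting window).
    "background" -> MAX2: over the maximum with h_i = F, served only
                    when no under-budget PARTID has pending requests.
    """
    if q_i < q_min:
        return "preferred"
    if q_i < q_max:
        return "compete"
    return "stall" if hardlim else "background"
```

When minimum partitioning is not implemented, passing `q_min = 0` reproduces the degenerate case mentioned in the text.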

Priority Partitioning
The MPAM reference manual defines an interface for the management of memory requests via the configuration of a fixed priority for the traffic generated by a specific partition.
To model this strategy, a priority p_i is assigned to each partition r_i. In MPAM, this priority can be assigned to the PARTID by writing to the INTPRI field of the MPAMCFG_PRI register of the MSC. Since the reference manual describes the priority encoding as implementation-defined behavior, in this paper we assume that higher values of p_i correspond to higher priorities.

Memory-Bandwidth Portion Partitioning
The MPAM reference manual describes a bandwidth partitioning strategy based on the division of the total available bandwidth into portions. Portions are not properly defined by the MPAM specification (it just says that "a portion is a quantum of the bandwidth"), and every vendor is free to implement a different strategy to enforce it.
Overall, this mechanism is unfortunately largely unspecified by the standard: Section 9.3.3 of the MPAM manual [17] describes it in just five lines (plus a few other lines in Appendix A.3 to describe the bitmap register, available in the online supplemental material).
Discussion. MPAM provides very general and flexible mechanisms that can be applied to different types of MSCs. Nevertheless, the implementation of the MPAM specification by hardware vendors requires exploring several design choices that need to be carefully evaluated. To better understand how to implement the MPAM specification to maximize predictability, and to provide recommendations to hardware vendors, we next explore some possible implementations of MPAM at the level of the DRAM memory. Then, we derive a memory contention analysis for each of them, which is finally used in the evaluation to compare different design alternatives. In this paper, we focus on memory-bandwidth maximum-minimum partitioning and priority partitioning. Furthermore, we consider the former policy only for the case in which the HARDLIM bit is not set, so that there is no hard limitation on the maximum number of memory transactions that can be issued in an accounting window. This is because such a behavior would introduce a stall for the PARTID under analysis that requires complex analysis techniques and an even richer notation that we cannot fit in this paper. For the same reason, and given its very vague specification in [17], we leave the consideration of the portion partitioning strategy as future work.

MODELING MPAM MECHANISMS
Next, we model in detail some solutions to actually implement MPAM for controlling contention at the DRAM level. We start with the MB-mP/MP partitioning scheme.

MB-mP/MP Partitioning in the MC
The first design choice for implementing this policy consists in deciding whether to implement it inside the MC, i.e., as part of its scheduling policy, or outside the MC, making MPAM act as a filter for requests arriving at the MC. In our proposal, we consider the former solution: indeed, the min-max policy involves a dynamic prioritization of requests from different PARTIDs that, if implemented outside the MC, would be superseded by the internal scheduling policy of the MC.
To study a practical choice from a hardware vendor's point of view, it is highly advisable to integrate MB-mP/MP partitioning with the standard mechanisms that can be found within a COTS MC, where requests causing row-hits (also called open requests, as they target a row that is already open, i.e., loaded in the row buffer) are privileged because they result in a shorter latency and hence maximize throughput. In the baseline memory controller presented in Section 3, this typical design choice is specified by Rule MC2 (FR).
The baseline memory controller model has to be enriched to cope with the behavior of MPAM's MB-mP/MP mechanisms specified by Rules min1, min2, MAX1, MAX2, and MAX3. We note that such a behavior can be modeled as an MPAM-related dynamic priority attributed to the memory requests issued by each PARTID r_i. The priority to be assigned to new requests changes over time across three classes (i.e., three dynamic priority levels), depending on whether q_i(t) < Q_i^m, Q_i^m ≤ q_i(t) < Q_i^M, or q_i(t) ≥ Q_i^M. The three conditions correspond to the priority levels o_i = 3 (maximum priority), o_i = 2 (medium priority), and o_i = 1 (low priority), respectively. Once a request is assigned a priority, it remains unchanged during its entire lifetime.
To proceed, we need to define a set of accurate rules describing how the dynamic priority of the MPAM min-max partitioning changes over time, and how the budget of transactions is managed:
mM1. Counter reset: At system startup and every w_i time units since then, the counter q_i(t) is set to 0.
mM2. Dynamic priority: If q_i(t) < Q_i^m, memory requests from r_i arriving at the MC are assigned priority o_i = 3; if Q_i^m ≤ q_i(t) < Q_i^M, they are assigned priority o_i = 2; if q_i(t) ≥ Q_i^M, they are assigned priority o_i = 1. Once a request is assigned a priority, its priority does not change.
mM3. Counter increment: Every time a memory request from r_i enters the MC, q_i(t) is incremented by one.
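Rules mM1-mM3 can be sketched per PARTID as follows; the class name and the discrete-time bookkeeping are our own illustrative choices:

```python
class MinMaxPriority:
    """Per-PARTID counter q_i(t) and MPAM-related dynamic priority o_i
    attached to each arriving request (rules mM1-mM3)."""

    def __init__(self, q_min, q_max, w_i):
        self.q_min, self.q_max, self.w_i = q_min, q_max, w_i
        self.q = 0
        self.next_reset = w_i     # first window boundary after startup

    def _tick(self, t):
        # mM1: counter reset every w_i time units since startup.
        while t >= self.next_reset:
            self.next_reset += self.w_i
            self.q = 0

    def admit(self, t):
        """A request from r_i enters the MC at time t; return its o_i."""
        self._tick(t)
        # mM2: the priority is fixed upon arrival and never changes.
        if self.q < self.q_min:
            o_i = 3               # maximum priority
        elif self.q < self.q_max:
            o_i = 2               # medium priority
        else:
            o_i = 1               # low priority
        self.q += 1               # mM3: counter increment
        return o_i
```

Because the priority is sampled at arrival (mM2), two requests from the same PARTID may coexist in a queue with different priorities, which is precisely what the contention analysis must account for.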
Building on these priority classes, the intra-bank arbitration behavior is summarized by the following rules, which refine Rules MC1 and MC2 of the baseline memory controller (Section 3).
mM4. MINMAX: Each per-bank queue is ordered by MPAM-related dynamic priority. FCFS tie-breaking is used for requests with the same priority.
mM5. FR-MINMAX: The overall intra-bank scheduling policy applies, in order, first rule MC2 and then rule mM4. The resulting intra-queue scheduling behavior privileges, in order, (i) row-hits over row-conflicts, and (ii) requests with a higher MPAM-related dynamic priority over requests with a lower priority.
Finally, we leave both the inter-bank arbitration (Rule MC4) and write batching (Rule MC5) policies unaltered with respect to the baseline memory controller of Section 3.
Indeed, implementing the MB-mP/MP partitioning at the inter-bank level would privilege accesses to certain banks, possibly causing long delays to requests targeting the other banks. For this reason, we deem it reasonable to keep the fair round-robin policy at the inter-bank level. Second, the considered baseline memory controller handles write requests in batches, i.e., in a delayed fashion with respect to when they are issued. Since the dynamic priority of the MB-mP/MP policy changes over time, applying it in a delayed fashion would make little sense.

Priority Partitioning in the MC
As discussed in Section 4.2, under priority partitioning each PARTID r_i ∈ R is assigned a static priority p_i. Therefore, for each r_i ∈ R we define the sets of PARTIDs r_j ∈ R \ {r_i} with higher, equal, and lower priority as hp_i, ep_i, and lp_i, respectively. Requests inherit the priority of the issuing PARTID. The priority assigned to each PARTID affects the intra-bank scheduling policy. Combining it with the prioritization of row-hits over row-conflicts typically used in COTS memory controllers, we consider two possible variants for the intra-bank behavior of priority partitioning. The first, called PP-FR, retains the prioritization of requests directed to open rows and is specified by the following rules:
PP1. Priority Partitioning: Requests in each per-bank queue are ordered according to the priority assigned to the corresponding PARTID. FCFS tie-breaking is used for requests with the same priority.
PP2. The overall intra-bank scheduling policy applies, in order, first rule MC2 (Section 3) and then rule PP1.
The second variant, called FR-PP, privileges requests directed to open rows only when there is a tie in priority, and is specified by Rule PP1 together with the following rule:
PP3. The overall intra-bank scheduling policy applies, in order, first rule PP1 and then rule MC2.
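The difference between the two variants amounts to the order of the sort criteria; the following sketch (our own request representation, ignoring the N_thr reordering cap of rule MC2 for simplicity) shows the two orderings as Python sort keys:

```python
# PP-FR: open-row preference (MC2) first, then static priority (PP1),
# then FCFS tie-breaking. Higher 'prio' means higher priority.
def pp_fr_key(req):
    return (not req['is_hit'], -req['prio'], req['arrival'])

# FR-PP: static priority (PP1) first, open-row preference (MC2) only
# as a tie-breaker among equal priorities, then FCFS.
def fr_pp_key(req):
    return (-req['prio'], not req['is_hit'], req['arrival'])

# The head of a per-bank queue under each variant is then:
#   next_req = min(queue, key=pp_fr_key)   # or fr_pp_key
```

Under PP-FR a low-priority row-hit can still overtake a high-priority row-conflict, whereas FR-PP prevents this, which is the trade-off the analysis later quantifies.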
As for the previous case, we leave the inter-bank arbitration policy (Rule MC4) unaltered with respect to Section 3, to avoid privileging accesses to certain banks. Similarly, we leave the write batching rule unaltered with respect to the baseline MC (Rule MC5), as writes are handled asynchronously in MCs with write batching.

CONTENTION ANALYSIS: PRELIMINARIES
The analysis approach used in this paper is based on a linear programming formulation. Given an arbitrary schedule S and an arbitrary time window of length Δ, we define a set of variables and constraints to bound the memory contention experienced by read requests in the time window. As detailed in the appendix, available in the online supplemental material, the memory contention analysis can then be integrated with a response-time analysis to compute memory-aware response-time bounds [19], [25].
As noted in prior work [19], when deriving a response-time bound, only the contention due to reads actually delays the task under analysis. Indeed, writes do not stall the processing pipeline and can be handled asynchronously by the MC (if the write buffer is large enough, as assumed in Section 3).
The proposed optimization problem maximizes the delays experienced by memory transactions. Constraints to exclude impossible contention scenarios are enforced so that the solution of the optimization problem yields a safe memory-related delay bound. This approach is highly compositional: each constraint can be proved in isolation (i.e., with local reasoning) and reused whenever the property encoded by the constraint holds in the analyzed setting.
In this work we extensively benefit from this analysis approach: given that the scope of this work is to analyze several different MPAM mechanisms, it is possible to independently study each mechanism, derive the corresponding constraints, and finally plug them into the optimization problem together with those of the baseline MC.
Before proceeding, we introduce some additional notation that is required for the analysis. As in prior work [18], [24], we further introduce the architectural constant $N_{pend} > 1$ that defines the maximum number of outstanding read requests in the memory controller.¹
The proposed analysis is general enough to be applied under different circumstances: for this reason, we do not tie our analysis to a specific task model. Nevertheless, as an example, in the appendix, available in the online supplemental material, we show how to analyze the memory contention of sporadic real-time tasks under fixed-priority partitioned scheduling, also providing specific definitions of the following functions. We consider each PARTID to be associated with a sequential computational activity that issues memory transactions over time, modeled as follows. Functions $RD_{j,u}(\Delta)$ and $WR_{j,u}(\Delta)$ bound the maximum number of read and write requests, respectively, issued from $r_j$ to a bank $b_u \in B$ in any interval of length $\Delta$. These functions must include all the requests issued in the considered system. Functions $RD_j(\Delta)$ and $WR_j(\Delta)$ are defined for the same purpose, but considering all requests irrespectively of the target bank. Similarly, $RI_{j,u}(\Delta)$ and $WI_{j,u}(\Delta)$ bound the number of reads and writes from $r_j \in R$ to bank $b_u \in B$ in any interval of length $\Delta$ that are issued by $r_j$ when it is the PARTID under analysis. $RI_j(\Delta)$ and $WI_j(\Delta)$ bound the same quantities but irrespectively of the target bank. When clear from the context, the dependency on $\Delta$ is hereafter omitted for brevity.

¹ For example, in [24] this constant is given by the size of the Miss Status Handling Register (MSHR) of the shared last-level cache, as in architectures with caches accesses to the DRAM are generated only in correspondence of a cache miss. Therefore, the size of this register determines the maximum number of cache misses that can be handled simultaneously, and consequently the maximum number of outstanding requests in the memory controller.
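As an illustration of the shape these functions take under the sporadic-task instantiation mentioned above, a common step-wise bound counts one carry-in job plus the jobs that can be released inside the window. This is a hedged sketch under that assumption; the paper's exact definitions are in its online appendix and may differ.

```python
import math

def jobs_in_window(delta, period):
    """Maximum jobs of a sporadic task with the given minimum
    inter-arrival time that can overlap any window of length delta.
    The +1 accounts for a carry-in job released before the window."""
    return math.ceil(delta / period) + 1

def RD(delta, period, reads_per_job):
    # Step-wise, monotonic non-decreasing bound on the reads issued in
    # any window of length delta (illustrative; not the paper's exact
    # appendix definition).
    return jobs_in_window(delta, period) * reads_per_job
```

For instance, a task with period 5 and 4 reads per job can issue at most `RD(10, 5, 4)` reads in any window of length 10; the bound grows in steps as `delta` crosses period multiples, which is exactly the property exploited later by the window-mapping constraints.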

BASELINE ANALYSIS FRAMEWORK
This section presents the baseline analysis framework. In particular, we introduce the modeling variables and the delay bounds that are common to all settings to be analyzed for MPAM. While we liberally take inspiration from the analysis framework of [19], the analysis proposed in this paper is fundamentally different. Indeed, [19] makes use of a different model for the cores and tasks. It assumes at most one pending read at a time from each core, tasks leveraging a three-phase execution model, and non-preemptive execution. Furthermore, it does not consider any bandwidth regulation technology. Conversely, this paper proposes a more general setting where multiple outstanding requests can be in the MC at the same time from different processing elements, as MPAM is designed for Armv8-A processors that leverage the out-of-order execution paradigm, and makes less stringent assumptions on the execution model.
We focus on the contention experienced by read requests. Indeed, in both our setting and in [19], writes do not stall the processing pipeline and the write buffer in the MC is assumed to never become full. Consequently, contention on writes issued by $r_i$ does not delay $r_i$. Overall, the delay experienced by read requests is composed of (i) the contention due to other read requests, and (ii) the contention due to other write requests. Similarly to [19], we provide an optimization-based approach for bounding the maximum contention due to reads, and a closed-form bound for the contention due to writes. Later, we use the baseline analysis as the capstone to analyze MPAM. Therefore, we define four different classes of variables of the optimization problem to model contention in an arbitrary schedule $S$ and in an arbitrary time interval $[t, t+\Delta)$, for each request index $n$, each PARTID $r_j \in R$, and each bank $b_u \in B$, with respect to the requests issued by $r_i$. Each class considers a different type of memory contention:

$X^{IP}_{n,j,u} \in \{0,1\}$ (IP: intra-bank promoted) indicates whether the $n$-th request issued by $r_j \in R$ to bank $b_u \in B$ causes intra-bank interference to any request of PARTID $r_i$ by being promoted by the FR policy.

$X^{INP}_{n,j,u} \in \{0,1\}$ (INP: intra-bank not promoted) indicates whether the $n$-th request issued by $r_j \in R$ to bank $b_u \in B$ causes intra-bank interference to any request of PARTID $r_i$ not because of the FR policy.

$X^{CP}_{n,j,u} \in \{0,1\}$ (CP: cross-bank promoted) indicates whether the $n$-th request issued by $r_j \in R$ to bank $b_u \in B$ causes inter-bank interference to any request of $r_i$ by interfering with a request promoted by the FR policy.

$X^{CNP}_{n,j,u} \in \{0,1\}$ (CNP: cross-bank not promoted) indicates whether the $n$-th request issued by $r_j \in R$ to bank $b_u \in B$ causes inter-bank interference to any request of $r_i$ by interfering with a request that was not promoted by the FR policy.
To ease the notation, we further define $X^{I}_{n,j,u}, X^{C}_{n,j,u} \in \{0,1,2\}$, such that $X^{I}_{n,j,u} = X^{IP}_{n,j,u} + X^{INP}_{n,j,u}$ and $X^{C}_{n,j,u} = X^{CP}_{n,j,u} + X^{CNP}_{n,j,u}$, to denote the entire intra-bank and inter-bank interference, respectively. Note that a request $r_n$ can interfere with different pending requests from $r_i$ under analysis in different ways.
The objective function maximizes the overall memory contention and is defined as follows:

$\max \sum_{r_j \in R \setminus \{r_i\}} \sum_{b_u \in B} \sum_{n} \max_{T \in \mathcal{T}} L^T(X^T_{n,j,u})$,

where $\mathcal{T} = \{IP, INP, CP, CNP\}$ and $L^T(x)$ denotes the contention delay implied by $x$ requests that generate type-$T$ contention². These functions have been established in prior works [18], [19] and are recalled in the appendix, available in the online supplemental material, for the sake of completeness. First, note that the first summation of the objective function excludes the requests from $r_i$. This is necessary because these requests cannot be a source of interference for $r_i$ (e.g., if $r_i$ is associated with a task, their duration is already taken into account in the task's WCET). However, they must be included in the analysis, given that their presence may allow additional interference by requests from other PARTIDs. Second, note that the objective function considers the maximum among all possible types of contention an arbitrary interfering request $r_n$ may cause to $r_i$. This is because, while $r_n$ can simultaneously interfere in different ways with multiple pending requests issued by $r_i$, the time spent by $r_i$ waiting for the memory controller to serve $r_n$ simultaneously elapses for each request that is interfered, and it cannot be longer than the overall time required to serve $r_n$ (e.g., think of a case in which $r_n$ interferes with both a request of $r_i$ in the same per-bank queue of $r_n$ and with one in another queue).

Constraints of Baseline MC
We present several sets of constraints: each of them acts from a different angle to exclude impossible schedules and hence improve the accuracy of the analysis. The proofs are available in Section 3 of the appendix, available in the online supplemental material.
The first constraint bounds the intra-bank interference due to requests promoted by FR scheduling (2), where $NR(\Delta) = \sum_{b_u \in B} \big( RI_{i,u}(\Delta) + \sum_{r_j \in R \setminus \{r_i\}} RD_{j,u}(\Delta) \big)$, $NW(\Delta) = \sum_{b_u \in B} \big( WI_{i,u}(\Delta) + \sum_{r_j \in R \setminus \{r_i\}} WR_{j,u}(\Delta) \big)$, and $L^{WR}$ is the maximum delay generated by a write request (defined in the appendix, available in the online supplemental material).
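The two aggregate terms above are plain summations over banks and PARTIDs; a direct transcription follows (a sketch with illustrative names, where the bound functions are assumed to be already evaluated to plain numbers for the window of interest):

```python
def NR(RI_i, RD_others):
    """NR(delta): reads of the PARTID under analysis plus reads of all
    other PARTIDs, summed over banks.
    RI_i[u] = RI_{i,u}(delta); RD_others[j][u] = RD_{j,u}(delta)."""
    banks = range(len(RI_i))
    return sum(RI_i[u] + sum(RD_j[u] for RD_j in RD_others) for u in banks)

def NW(WI_i, WR_others):
    """NW(delta): the same aggregation for write requests."""
    banks = range(len(WI_i))
    return sum(WI_i[u] + sum(WR_j[u] for WR_j in WR_others) for u in banks)

# Example: two banks, one interfering PARTID.
total_reads = NR([1, 2], [[3, 4], [0, 1]])
total_writes = NW([0, 1], [[2, 0]])
```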

PRIORITY PARTITIONING
Priority partitioning inherits the same variable definitions, objective function, and constraints from the baseline analysis of Section 7. To start, we recall that the sets of PARTIDs $r_j \in R$ (including $r_i$ itself) with higher, equal, and lower priorities are denoted with $hp_i$, $ep_i$, and $lp_i$, respectively. In Section 5.2, two variants of priority partitioning have been defined, FR-PP and PP-FR, depending on whether the FR policy prevails over the priority ordering.
We next introduce a constraint, common to both FR-PP and PP-FR, that leverages priority partitioning to limit the interfering requests due to other PARTIDs.

Constraint 5. For each bank $b_u \in B$: $\sum_{r_j \in lp_i} \sum_{n=0}^{RD_{j,u}-1} X^{INP}_{n,j,u} \leq RI_{i,u}(\Delta)$.

Proof. The LHS counts the number of reads from $r_j \in lp_i$ to $b_u$ that generate intra-bank interference to $r_i$'s requests not because of the FR policy. By Rule PP1, reads from $r_j \in lp_i$ cannot be placed ahead of those issued by $r_i$ in the queue of $b_u$. However, since requests are served non-preemptively, each read $r$ from $r_i$ can be delayed by at most one request from $r_j \in lp_i$ that may have started to be served before $r$ enters the MC. Since $r_i$ can issue at most $RI_{i,u}(\Delta)$ requests to bank $b_u$, the constraint follows. □

Next, we distinguish between FR-PP and PP-FR. Under FR-PP, the FR policy is applied before priority partitioning (Rule PP2). Therefore, Constraint 1 of the baseline analysis suffices to encode the FR part of the policy.
Conversely, under PP-FR, the priority partitioning policy prevails over the FR one. This information is used next to derive an accurate constraint on variables $X^{IP}_{n,j,u}$.

Constraint 6. Under PP-FR, for each bank $b_u \in B$: $\sum_{r_j \in R \setminus ep_i} \sum_{n=0}^{RD_{j,u}-1} X^{IP}_{n,j,u} \leq 0$.

Proof. Due to Rule PP3, the priority partitioning policy is applied before the first-ready policy. Therefore, the requests promoted due to the FR policy must have the same priority as those of $r_i$. The constraint follows. □
The interference due to write requests is bounded as in Section 7, as this feature of the baseline MC is left unaltered by the priority partitioning policy.

MB-MP/MP PARTITIONING ANALYSIS
Next, we show how to extend the baseline analysis framework to work under the MB-mp/MP policy of MPAM (Section 5.1). Aiming at bounding the contention experienced by a PARTID $r_i \in R$ under analysis, we start by providing some bounds that are useful to formulate fine-grained constraints limiting the memory contention.

Bounds on the Number of Interfering Requests
We first derive two bounds for the requests issued by every PARTID $r_j$ that can be served at a given dynamic priority $o_j \in \{3, 2\}$ in any interval of length $\Delta$. The bounds can be obtained by considering that, in an arbitrary time interval $[t, t+\Delta)$, each PARTID can issue a limited number of requests before its priority class is changed (rules mM1-mM3). This is caused by the fact that when $r_j$ crosses the thresholds of $Q^m_j$ or $Q^M_j$ transactions in the accounting window $w_j$, its priority is decreased by MPAM.
Lemma 2. The number of transactions at priority $o_j \in \{3, 2\}$ of a PARTID $r_j \in R$ in any time interval $[t, t+\Delta)$ is bounded by

$I^3_j(\Delta) = (\lceil \Delta / w_j \rceil + 1) \cdot Q^m_j$, and $I^2_j(\Delta) = (\lceil \Delta / w_j \rceil + 1) \cdot (Q^M_j - Q^m_j)$.

Proof. We discuss the two bounds separately. $I^3_j(\Delta)$: at priority $o_j = 3$, the PARTID can issue at most $Q^m_j$ transactions for each accounting window before its priority is decreased to 2 (rules mM1-mM3).
$I^2_j(\Delta)$: similarly, at priority $o_j = 2$, the PARTID can issue at most $Q^M_j - Q^m_j$ transactions for each window before its priority is decreased to 1 (rules mM1-mM3).
The lemma follows by noting that the number of requests that can be issued at each priority level during $[t, t+\Delta)$ is composed of: (i) at most $\lceil \Delta / w_j \rceil \cdot Q^m_j$ and $\lceil \Delta / w_j \rceil \cdot (Q^M_j - Q^m_j)$ transactions, respectively, for the accounting windows starting within $[t, t+\Delta)$, and (ii) at most $Q^m_j$ and $Q^M_j - Q^m_j$ transactions, respectively, for a carry-in accounting window started before time $t$ and completing in $[t, t+\Delta)$. □
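The two bounds of Lemma 2 are simple window-counting formulas; they can be transcribed directly (a sketch with illustrative parameter names):

```python
import math

def I3(delta, w, Qm):
    # Transactions a PARTID can issue at priority 3 in any window of
    # length delta: Qm per accounting window starting inside the window,
    # plus Qm for one carry-in window (Lemma 2).
    return (math.ceil(delta / w) + 1) * Qm

def I2(delta, w, Qm, QM):
    # Same reasoning at priority 2, with a per-window budget of QM - Qm.
    return (math.ceil(delta / w) + 1) * (QM - Qm)
```

For example, with an accounting window of 40 time units, an analysis interval of 100 units contains at most three windows starting inside it plus one carry-in window, so a PARTID with $Q^m_j = 5$ can issue at most 20 priority-3 transactions.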

Bounding the Memory Contention
Consider the PARTID $r_i \in R$ under analysis. With respect to the case of Section 7, the MB-mp/MP memory controller analysis needs to re-define variables $X^{INP}_{n,j,u}$, as they leverage rule MC1, which is superseded by Rule mM4. Furthermore, we need to introduce some new variables to model the behaviors that are specific to the MB-mp/MP policy. Therefore, for each partition $r_j \in R$, for each memory bank $b_u \in B$, and for each pair of MB-mp/MP priorities $o_i, o_j \in \{1, 2, 3\}$, we define:

$X^{INP,o_i,o_j}_{n,j,u} \in \{0,1\}$ indicates whether the $n$-th request issued by $r_j$ to bank $b_u$ is served at dynamic priority $o_j$ and causes intra-bank interference to at least one request from $r_i$ assigned to dynamic priority $o_i$ because of the MB-mp/MP or the FCFS policies.

The objective function is the same used in Section 7, provided that we redefine $X^{INP}_{n,j,u} = \max_{o_i, o_j \in \{1,2,3\}} \{X^{INP,o_i,o_j}_{n,j,u}\}$. The maximum is required because, even though a request $r_n$ can simultaneously interfere with multiple pending requests from $r_i$ having different dynamic priorities, the time spent by $r_i$ waiting for the memory controller to serve $r_n$ simultaneously elapses for each interfered request.
We start by introducing bounds for the newly defined variables. Constraint 7 enforces that, given a priority $o_i \in \{1,2,3\}$, an interfering request can have only one priority.

Constraint 7. For each $r_j \in R$, each bank $b_u \in B$, each request index $n$, and each $o_i \in \{1,2,3\}$: $\sum_{o_j \in \{1,2,3\}} X^{INP,o_i,o_j}_{n,j,u} \leq 1$.

Proof. By contradiction, if more than one of the variables $X^{INP,o_i,o_j}_{n,j,u}$ is equal to 1 for a given request, then there exists at least one request that interfered with $r_i$ while having more than one priority. This is impossible by rule mM2, which states that the dynamic priority of any request cannot change over time. □
However, Constraint 7 does not prevent the same (arbitrary) request $r_n$ from $r_j$ from interfering, by taking more than one dynamic priority $o_j$, with multiple requests of the PARTID under analysis $r_i$ assigned to different priorities (e.g., $o_i$ and another one in $\{1,2,3\} \setminus \{o_i\}$). Again, this is clearly impossible because the dynamic priority of $r_n$ does not change over time due to rule mM2. Constraint 8 excludes this impossible case.

Proof. Since multiple outstanding requests are allowed by the MC, $r_i$ can have simultaneously pending requests assigned to dynamic priorities 1, 2, and 3. By contradiction, assume there exists a request $r_n$ from $r_j$ to bank $b_u$ such that $r_n$ takes two different priorities $o_j$ and $o_k \neq o_j$ over time. Assume also that $r_n$ interferes with multiple requests of the PARTID $r_i$ under analysis with priorities $o_i$ and $o_h \neq o_i$. It hence means that $X^{INP,o_i,o_j}_{n,j,u} = 1$ and that $X^{INP,o_h,o_k}_{n,j,u} = 1$. Therefore, the constraint reduces to $1 \leq 0$, which is impossible, reaching a contradiction. □

Constraint 9 leverages the bounds of Lemma 2 to limit the number of interfering requests. Next, we impose Constraint 10 to limit the interference due to requests at a lower dynamic priority.
Proof. The LHS of the inequality counts all the requests issued with a priority $o_j < o_i$. Due to Rule mM4, no request from $r_j$ can cause intra-bank interference at a priority $o_j < o_i$ if there are pending requests from $r_i$ at priority $o_i$. However, since requests are served non-preemptively, for each request $r$ from $r_i$, at most a single request may contribute to the LHS of the inequality because it started before $r$ was enqueued. Since $r_i$ can issue at most $RI_{i,u}(\Delta)$ requests to bank $b_u$, the constraint follows. □

The following constraint leverages the constant $N_{pend}$ to limit the maximum number of interfering requests at the same priority (FCFS tie-breaking). Proof. First note that requests from PARTID $r_j \in R$ served at priority $o = o_i = o_j$ and producing intra-bank interference to pending requests from $r_i$ at the same priority need to be issued before those of $r_i$ due to FCFS tie-breaking (Rule mM4). Whenever a request $r$ with priority $o$ is issued by $r_i$, there can be at most $N_{pend} - 1$ other requests pending in the MC with priority $o$, which can interfere with $r$ because they were issued before.
The LHS counts the overall amount of requests by $r_j$ interfering because of the FCFS policy at priority $o$. These requests are no more than $N_{pend} - 1$ for each request issued at priority $o$ by $r_i$. The constraint follows by noting that the number of requests issued by $r_i$ at priority $o$ in the analysis window is bounded by $I^o_i(\Delta)$ (see Lemma 2). □

Interference to Requests with Dynamic Priority 1
We now focus on the interference to requests with priority 1. Note that such requests are not addressed by Constraints 10 and 11, while bounding the interference they suffer is of the utmost importance.
Indeed, it may happen that $r_i$ starts its execution with the budget of higher-priority requests already exhausted, and all its requests enter the MC during this first accounting window. In this case, all of them would have priority 1. Using the previously discussed constraints only, it would be legal for all the requests of interfering PARTIDs to enter the MC just a moment before those of $r_i$; consequently, $r_i$ would suffer all the possible interference due to the FCFS policy, as if MPAM's MB-mp/MP policy were not present. Ultimately, this would make the MB-mp/MP strategy useless, since the $X^{INP,1,1}_{n,j,u}$ variables would not be bounded by any MB-mp/MP-specific constraint.
To address this issue, we need to introduce a mapping between interfering requests and the window in which they enter the MC. In this way, we allow the optimizer to decide how many requests from $r_i$ enter the MC in each accounting window, and to match them with interfering requests. To this end, we leverage a typical property of function $RD_{j,u}(\Delta)$. Indeed, $RD_{j,u}(\Delta)$, as shown in the appendix, available in the online supplemental material, is a monotonic non-decreasing function, most commonly in a step-wise periodic fashion. Therefore, not all interfering requests are available in all windows (e.g., think of the case in which the PARTID is assigned to a periodic task).
Bounding the interference that can be suffered by requests with the lowest priority is not trivial, and requires introducing additional notation.
For each $b_u \in B$ and for each accounting window with index $z \in W_i = \{0, \ldots, \lceil \Delta / w_i \rceil\}$, we define the variables: $Y_{u,z} \in \mathbb{N}$ denotes the number of requests by the PARTID under analysis $r_i$ to bank $b_u$ that enter the MC during the $z$-th window elapsed in the interval $[t, t+\Delta)$.
We start by dividing the total number of requests based on the accounting window in which they enter the MC.

Constraint 12.
For each $b_u \in B$: $\sum_{z \in W_i} Y_{u,z} \leq RI_{i,u}(\Delta)$.

Proof. By definition, $Y_{u,z}$ counts the number of requests issued by $r_i$ to bank $b_u$ during the $z$-th accounting window. Provided that $W_i$ contains all the windows elapsed during the time period $\Delta$, since the windows do not overlap, it is straightforward that their sum must not be larger than the total number of requests issued by $r_i$ to bank $b_u$. □

A bound on the requests that may be issued at priority 1 can now be derived in a specific constraint, which also introduces the auxiliary term $H^1_{i,u}(\Delta)$.

Constraint 13. The maximum number of requests by $r_i$ to bank $b_u$ that can enter the MC at priority 1 in the analysis window $[t, t+\Delta)$ is bounded by: $H^1_{i,u}(\Delta) = RI_{i,u}(\Delta) - \sum_{z \in W_i \setminus \{0\}} \min\{Q^M_i, Y_{u,z}\}$.

Proof. The total number of requests issued by $r_i$ to bank $b_u$ in the analysis window of interest is bounded by $RI_{i,u}(\Delta)$, irrespectively of their dynamic priority. Among such requests, for each accounting window with index $z > 0$, by Rule mM2, $Q^M_i$ requests have to be issued at dynamic priorities 3 and 2 before issuing requests at priority 1. Furthermore, by definition of variables $Y_{u,z}$, no less than $\min\{Q^M_i, Y_{u,z}\}$ requests can anyway be issued at dynamic priorities 2 and 3. The same cannot be argued for the first accounting window ($z = 0$) because it may have started outside the analysis window of interest, i.e., before time $t$. Hence, $\sum_{z \in W_i \setminus \{0\}} \min\{Q^M_i, Y_{u,z}\}$ of the $RI_{i,u}(\Delta)$ requests cannot enter the MC at dynamic priority 1. The constraint follows. □
Thanks to the auxiliary term $H^1_{i,u}(\Delta)$ introduced by Constraint 13, we can derive Constraint 14, which is analogous to Constraint 11 but considers requests at dynamic priority 1. The proof is analogous and is thus omitted.
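The auxiliary term of Constraint 13 can be transcribed directly (a sketch; `Y` is the per-bank vector of the $Y_{u,z}$ variables, and window $z = 0$ is the possibly carry-in first accounting window, which is excluded from the summation):

```python
def H1(RI_iu, QM, Y):
    """Upper bound on the requests of r_i to bank b_u that can enter
    the MC at dynamic priority 1 (Constraint 13). In each window z > 0,
    min(QM, Y[z]) requests are guaranteed to be issued at priorities
    2 or 3 and therefore cannot count toward priority 1."""
    return RI_iu - sum(min(QM, Y[z]) for z in range(1, len(Y)))

# Example: 10 reads overall, max budget 3, windows issuing 4, 5, 2 reads.
bound = H1(10, 3, [4, 5, 2])
```

With these numbers, windows 1 and 2 guarantee 3 and 2 higher-priority requests respectively, so at most 5 of the 10 reads can ever be demoted to priority 1.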

Bounding the Number of Transactions per Window
Lastly, we need a constraint that prevents requests that are available only after a specific offset in the analysis interval (e.g., those due to jobs that have not started yet, if PARTIDs are associated with periodic tasks) from being accounted for in an accounting window that already ended. We start by bounding the number of requests by $r_i$ that can enter the MC at priorities 1 and 2, but only starting from a certain accounting window with index $z_0$.
Constraint 15. For each bank $b_u \in B$ and for each $z_0 \in W_i$, let $W_i(z_0) = \{z \in W_i : z \geq z_0\}$. The number of requests by $r_i$ that enter the MC at or after the $z_0$-th window with priority 1 is bounded by $B^{1,z_0} = \sum_{z \in W_i(z_0) \setminus \{0\}} \max\{0, Y_{u,z} - Q^M_i\}$ if $z_0 \neq 0$, plus the additional term $Y_{u,0}$ if $z_0 = 0$. Similarly, those with priority 2 are bounded by $B^{2,z_0} = \sum_{z \in W_i(z_0) \setminus \{0\}} \max\{0, \min\{Y_{u,z} - Q^m_i, Q^M_i - Q^m_i\}\}$ if $z_0 \neq 0$, plus the additional term $\min\{Y_{u,0}, Q^M_i\}$ if $z_0 = 0$.

Proof. First consider requests at priority 1. We distinguish two cases: (i) $z_0 \neq 0$, and (ii) $z_0 = 0$. Case (i). The accounting windows with indexes in $W_i(z_0)$ must all have started in the analysis interval $[t, t+\Delta)$. By Rule mM2, to issue requests at priority 1 in one of the accounting windows with index in $W_i(z_0)$, $r_i$ must have previously issued $Q^M_i$ requests with priorities $> 1$ in the same window. By recalling the definition of $Y_{u,z}$, the number of requests with priority 1 in such accounting windows is hence bounded by $\sum_{z \in W_i(z_0)} \max\{0, Y_{u,z} - Q^M_i\}$. Case (ii). The first accounting window ($z = 0$) may have started before the beginning of the analysis interval $[t, t+\Delta)$. Irrespectively of their priority, no more than $Y_{u,0}$ requests can be issued in the first accounting window. The same reasoning of case (i) applies to the subsequent accounting windows with index in $W_i(z_0) \setminus \{0\}$. Hence the constraint for term $B^{1,z_0}$ holds.
Consider now requests at priority 2 and let us distinguish the same two cases as above. Case (i). By Rule mM2, no more than $Q^M_i - Q^m_i$ requests with priority 2 can be issued in each accounting window. Furthermore, for the same reasons discussed for requests at priority 1, to issue requests at priority 2 in one of the accounting windows with index in $W_i(z_0)$, $r_i$ must have previously issued $Q^m_i$ requests with priority 3 in the same window. Hence, $Y_{u,z} - Q^m_i$ also yields a safe bound for the number of requests at priority 2 issued in the $z$-th window, with $z > 0$. Therefore, the total number of requests with priority 2 in such accounting windows is bounded by $\sum_{z \in W_i(z_0)} \max\{0, \min\{Y_{u,z} - Q^m_i, Q^M_i - Q^m_i\}\}$. Case (ii). The same reasoning used for case (ii) of requests with priority 1 can be used after noting that $\min\{Y_{u,0}, Q^M_i\}$ bounds the number of requests with priority 2 in the first accounting window. Hence the constraint for term $B^{2,z_0}$ holds. □

With Constraint 15 in place, we can bound the interference produced by the requests from $r_j$ that entered the MC during the $z_0$-th accounting window or later. Due to the FCFS policy, these requests can only produce interference to requests from $r_i$ that are enqueued after the beginning of the $z_0$-th accounting window (Constraint 16), where $B^{1,z_0}$ and $B^{2,z_0}$ are defined by Constraint 15.
The proof is available in Section 3 of the appendix, available in the online supplemental material.
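Following the case analysis in the proof of Constraint 15, the two bounds can be transcribed as follows (a sketch reconstructed from the proof's reasoning; names are illustrative, and `Y[z]` stands for $Y_{u,z}$):

```python
def B1(Y, QM, z0):
    """Bound on priority-1 requests entering the MC at or after window
    z0: in each fully observed window, the first QM requests must take
    priorities 2 or 3."""
    later = sum(max(0, Y[z] - QM) for z in range(max(z0, 1), len(Y)))
    # The first window may be a carry-in one: nothing better than Y[0]
    # can be guaranteed for it.
    return later + (Y[0] if z0 == 0 else 0)

def B2(Y, Qm, QM, z0):
    """Analogous bound for priority-2 requests, with a per-window
    budget of QM - Qm after the first Qm priority-3 requests."""
    later = sum(max(0, min(Y[z] - Qm, QM - Qm))
                for z in range(max(z0, 1), len(Y)))
    return later + (min(Y[0], QM) if z0 == 0 else 0)
```

For example, with per-window issue counts `Y = [4, 7, 2]`, $Q^m_i = 2$, and $Q^M_i = 5$, only the second window can push requests down to priority 1 when starting from $z_0 = 1$.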
The contention due to writes is bounded analogously to the baseline MC, as this feature is left unaltered by the MB-mp/MP policy.
Relaxation. Differently from other problems that require the value of specific variables (e.g., decision problems such as task partitioning), the expected output of the optimization problem is only the (maximized) value of the objective function. Therefore, from a practical point of view, the problem was implemented by relaxing all the binary variables representing the interfering requests to real variables in the interval $[0,1]$. This does not affect the correctness of the computed response-time bound since, by relaxing the variables, the search space available to the optimizer is enlarged. Therefore, the maximum of the objective function of the relaxed problem can only be higher than or equal to the maximum of the original problem. Hence, the upper bound is still safe. Furthermore, to evaluate the difference between the two models, we performed a set of tests by running the same problem instance with and without relaxation (based on some configurations used in our evaluation, presented in Section 10). In all the tested cases, the optimal solution did not change.
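The safety argument is the standard one for LP relaxations: enlarging the feasible region can only increase a maximized objective. A toy instance (numbers illustrative; solved by enumeration rather than by CPLEX) makes this concrete:

```python
from itertools import product

# Toy instance: maximize x1 + x2 subject to x1 + x2 <= 1.5.
def feasible(x):
    return sum(x) <= 1.5

# Exact binary optimum, by enumerating {0,1}^2.
best_binary = max(sum(x) for x in product([0, 1], repeat=2) if feasible(x))

# Relaxed optimum over [0,1]^2: the constraint can be made tight
# (e.g., x = (1, 0.5)), so the relaxed objective reaches 1.5.
best_relaxed = 1.5

# The relaxation only enlarges the feasible region, so its optimum
# upper-bounds the binary one: the resulting delay bound stays safe.
assert best_relaxed >= best_binary
```

Here the relaxed optimum (1.5) strictly exceeds the binary one (1), illustrating why the relaxed objective is always a safe, possibly pessimistic, upper bound.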

EXPERIMENTAL EVALUATION
This section reports on the experimental evaluation we performed to test the effectiveness of the MPAM mechanisms. To this end, we implemented the memory-aware response-time analysis for the case in which a PARTID is a sporadic real-time task (see the appendix for more details, available in the online supplemental material). The proposed optimization problems have been implemented with IBM CPLEX on a machine with 128 GB of memory and two Intel Xeon E5-2640 v4 CPUs @ 2.40 GHz, with 40 cores in total. The evaluation focuses on a case study derived from the task set proposed by Bosch for the WATERS 2019 Industrial Challenge [27], which includes nine tasks (tasks 0-8). The task mapping is based on the challenge solution of [28]. Additional details about the tasks and the number of memory requests are available in Table 2 in the Appendix, available in the online supplemental material. When considering memory contention in bounding the worst-case response time, most tasks were actually not schedulable with the original setting. We therefore introduced two additional parameters in $[0,1]$, a read scaling factor and a write scaling factor, to scale the number of read and write requests, respectively. In the evaluation, we explored a vast range of values for these factors to evaluate different trade-offs. We considered the JEDEC timing constraints for a DDR3 MC running at 1333 MHz, which are reported in the appendix, available in the online supplemental material, for the sake of completeness. We configured the MC parameters as in prior work [19], [24], namely, $N_{thr} = 18$, $N_{wb} = 18$, $Q_{write} = 64$, and $N_{pend} = 24$. We consider a DRAM memory with $N_B = 4$ banks.
In our experiments, we compare the memory contention under different configurations: (i) baseline: the baseline MC without any MPAM mechanism; (ii) priority partitioning with FR-PP; (iii) priority partitioning with PP-FR; and (iv) MB-mp/MP: the MB-mp/MP MPAM mechanism.
The first three charts focus on the comparison between the two variants of the priority partitioning strategy and the baseline system. To this end, we explore three different priority-assignment configurations: (i) max improvement, which independently considers the case in which each task is assigned the maximum MPAM priority, while all the other tasks are assigned the same lower priority; (ii) rate monotonic, which assigns priorities inversely proportional to the tasks' periods; (iii) request monotonic, which assigns priorities inversely proportional to the number of requests per job of each task.
All the values on the y-axis represent the overall memory interference, as a percentage of the memory interference obtained when using the baseline configuration (i.e., when MPAM is disabled). Fig. 2 reports a comparison of the FR-PP and PP-FR policies under three different priority-assignment strategies. The read and write scaling factors are 0.35 and 0.05, respectively. The same experiments were also carried out with other scaling factors and the results are presented in the Appendix, available in the online supplemental material. Fig. 2a shows that FR-PP and PP-FR allow achieving a maximum improvement of up to 21% and 27% over the baseline, respectively. Furthermore, it highlights that PP-FR generally performs better than FR-PP in reducing the memory interference delay suffered by the highest-priority task. This is attributed to the fact that under PP-FR the prioritization prevails over requests targeting open rows, which allows posing additional constraints on the intra-bank interference. Thus, increasing the priority has a larger effect when using PP-FR. In each configuration, all other tasks (which have been assigned the same low priority in this case) showed no improvement.
In Fig. 2b, instead, tasks are assigned priorities according to rate monotonic. In this case, tasks with shorter periods (e.g., tasks 1-4) achieve a more significant improvement (up to a 24% reduction with task 1 and PP-FR), while tasks with longer periods (e.g., tasks 5-8) see very small improvements because they are assigned lower priorities.
In Fig. 2c, priorities are assigned according to the number of requests per job of each task (request monotonic). In this case, task 6 achieves the most remarkable improvement of 19%.
The next three plots consider the improvement that can be granted to the most memory-intensive task of the benchmark (i.e., $\tau_6$) while varying the parameters of the MB-mp/MP strategy. An additional hybrid task set (details available in the Appendix, Table 3, available in the online supplemental material) was synthesized for these specific experiments, still starting from the WATERS 2019 one. This was deemed necessary to show some relevant configurations where the MB-mp/MP strategy provides benefits from a worst-case analysis point of view, since in most configurations we observed very few improvements over the baseline. For the sake of completeness, other experiments that use the original task set are available in the Appendix, available in the online supplemental material.
In the following, the number of read requests per job for a task $\tau_j$ is denoted as $rd_j = \sum_{b_u \in B} rd_{j,u}$, the ratio between $Q^m_j$ and $Q^M_j$ of $\tau_j$ as $\gamma^{mM}_j = Q^m_j / Q^M_j$, and the ratio between $Q^M_j$ and $rd_j$ as $\beta^M_j = Q^M_j / rd_j$. In Fig. 3a, we show the effects of the variation of parameter $\beta^M_j$ for each interfering task $\tau_j \neq \tau_6$ (the same value of $\beta^M_j$ is used for all interfering tasks). The plot reports three curves, each corresponding to a representative configuration we selected by varying the ratio $\gamma^{mM}_j$. All the other parameters were assigned as follows: $Q^m_6 = Q^M_6 = rd_6$; $w_6 = 0.03 \cdot T_6$; and $w_j = T_j$, for $j \neq 6$. The values of $w_i$ were carefully selected after running a large number of experiments to allow Constraints 12, 13, 14, 15, and 16 to have a remarkable effect in reducing the memory interference suffered by $\tau_6$, thanks to a large number of accounting windows, and hence high-priority requests, for $\tau_6$ compared to the other tasks. Indeed, the MB-mp/MP strategy proved hard to configure properly to achieve a substantial reduction of memory-contention delays, with little improvement under considerably different values of $w_i$. Fig. 3a shows that $\tau_6$ has a considerable improvement (about 15%) with respect to the baseline MC only when $\beta^M_j = 0$, which implies $Q^M_j = 0$. Indeed, it can be noted from the figure how an increase of $\beta^M_j$ causes a rapid reduction of the improvement. The only exception is the case with $\gamma^{mM}_j = 0$, where MPAM is able to provide a 12% gain with respect to the baseline configuration for $\tau_6$ also for larger values of $\beta^M_j$. Note that in this case $Q^m_j = 0$: hence interfering tasks cannot release any request with maximum priority. Fig. 3b shows the effect of the variation of $\gamma^{mM}_j$ for the interfering tasks. The plot reports three curves, for three representative configurations we selected by varying the ratio $\beta^M_j$.
The results confirm how an increase of the maximum budget causes up to a 12% reduction of memory interference for the task under analysis. It is also interesting to notice how the improvement reaches 0% as soon as $\gamma^{mM}_j = 0.3$. This indicates that it is necessary to keep the number of higher-priority requests from interfering tasks really low to allow the MB-mp/MP strategy to have a positive effect for $\tau_6$ in the worst-case scenario. Fig. 3c investigates how changing the total number of read requests of the interfered task (indicated as a percentage of the base case) affects the memory interference limitation provided by MB-mp/MP. In this case, task 6 has been assigned $Q^M_6 = Q^m_6 = rd_6$, and the other tasks have $\gamma^{mM}_j = 0.2$. Three representative configurations are reported, corresponding to $\beta^M_j \in \{0, 0.2, 0.4\}$. It is interesting to notice how the interference is very close to the baseline when the number of requests is at the two extremes of the tested interval, while they are more distant in the middle (up to a 16% improvement). First, to understand this behavior, note that, since $Q^m_6 = rd_6$, no other PARTID can issue requests at a priority higher than those of $r_6$. We then need to take into account the fact that non-promoted intra-bank interfering requests (i.e., those due to variables $X^{INP,o_i,o_j}_{n,j,u}$ that are affected by MB-mp/MP) with lower or equal priority can interfere for two reasons: (i) they can have a lower priority than the interfered request and thus cause a delay due to the non-preemptive service of requests; or (ii) they can have an equal priority, and thus interfere because of the FCFS policy. Case (i) causes one interfered request to match at most one interfering request (see Constraint 10), while case (ii) causes each interfered request to be matched with at most $N_{pend}$ interfering requests (Constraints 11 and 16).
When the interfered task (i.e., τ_6) has a low number of requests, the requests of the interfering tasks τ_j ≠ τ_6, i.e., the interfering requests available at the beginning of the interval (due to the functions RD_j(Δ)), are enough to interfere with most of τ_6's requests because of the FCFS policy during the first accounting window (case (ii)), and the solver does not need to move requests of the interfered task to subsequent windows to match more interfering requests. Thus, in this case, Constraints 12, 13, 14, 15, and 16 have no effect and the MB-mp/MP policy is ineffective.

Fig. 2. Memory interference obtained using the FR-PP and PP-FR strategies under different priority assignments, as a percentage of the memory interference obtained with the baseline MC. The scaling factors are R = 35% and W = 5%.
This case does not improve upon the baseline, which always matches each interfered request with N_pend interfering requests (Constraint 4). When the number of τ_6's requests increases (rd_6 between 20% and 80% of the base case), the solver moves some interfered requests to other accounting windows to increase the interference (due to Constraints 12, 13, 14, 15, and 16), forcing the presence of more requests of τ_6 at high priority, each of which can be interfered with by at most one interfering request (case (i) only) instead of N_pend. In this way, the interference decreases. However, when τ_6's requests grow further (rd_6 over 80%), the interference again approaches the baseline case. For example, if β^M_j = 0.4, this happens because the number of interfered requests approaches the number of interfering requests, making it irrelevant whether they are matched in a one-to-one or one-to-N_pend fashion, since the interfered requests are anyway enough to match all the interfering ones. In all the configurations in Fig. 3, all tasks other than τ_6 showed no improvement over the baseline.
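The two matching cases above reduce to a simple counting bound: a high-priority interfered request is matched with at most one lower-priority interfering request (non-preemptive service, case (i)), an equal-priority interfered request with at most N_pend interfering requests (FCFS, case (ii)), and the total can never exceed the interfering requests actually available. The following sketch is a minimal illustration of this reasoning under those assumptions; the function name and parameters are our own and not part of the paper's formal analysis.

```python
def worst_case_matches(n_hp, n_eq, n_interfering, N_pend):
    """Upper-bound the interfering requests matched against interfered ones.

    n_hp: interfered requests served at high (promoted) priority,
          each delayed by at most one lower-priority request (case (i)).
    n_eq: interfered requests at equal priority, each delayed by at
          most N_pend pending requests under FCFS (case (ii)).
    The bound is capped by the interfering requests actually available.
    """
    bound = n_hp * 1 + n_eq * N_pend
    return min(bound, n_interfering)

# Promoting requests shrinks the per-request match factor from N_pend to 1:
few_promoted = worst_case_matches(0, 10, 100, N_pend=4)   # 40 matches
all_promoted = worst_case_matches(10, 0, 100, N_pend=4)   # 10 matches
```

This also mirrors the saturation effect seen at the right end of Fig. 3c: once the available interfering requests become the binding term of the `min`, promotion no longer changes the bound.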
Another aspect we investigated is the scalability of the proposed optimization problem with respect to the total number of memory requests issued by the tasks. To better understand its run-time performance, we collected the execution times of the entire response-time analysis. The experiment was repeated by applying five different scaling factors R, W ∈ {0.2, 0.4, 0.6, 0.8, 1}, with R = W, to the number of requests of the original task set. As previously stated, with a high number of requests the task set is actually not schedulable; however, this aspect was not taken into consideration during this test. Fig. 4 reports the results, which show that the optimization problem is significantly slower when using the MB-mp/MP strategy (up to 102 hours of execution). On the other hand, both the FR-PP and the PP-FR priority-partitioning techniques have total running times similar to the baseline approach (up to about 2.5 hours), which is compatible with standard design activities that take place off-line.

RELATED WORK
The literature on memory-bandwidth reservation mechanisms and on methods to control and analyze contention is too vast to be exhaustively discussed within the space limits; therefore, a selection of the most closely related works is reported next.
One of the first proposals to regulate the memory bandwidth is due to Yun et al. [2], who implemented a per-core regulator called MemGuard, using performance counters to implement a budgeting mechanism. Farshchi et al. [29] proposed a custom hardware component, called BRU, to regulate the memory bandwidth of multiple cores at the same time. Sohal et al. [4] implemented a framework for analyzing the memory demand and predicting the timing of real-time workloads on CPUs and hardware accelerators. Several memory-bandwidth regulation mechanisms have also been proposed for hardware accelerators, such as GPUs or FPGAs [8], [9], [10]. Several techniques have also been proposed to improve the predictability of cache memories: the interested reader can refer to the survey by Gracioli et al. [30]. Other authors [23], [31] proposed methods to reduce contention by adopting bank-aware memory allocations.
Several efforts have been devoted to realizing custom memory-controller designs that enhance predictability [14], [16], [32], [33], [34]. While advisable in the context of real-time systems, such designs are not present in COTS platforms, which strongly limits their adoption. On the contrary, the MPAM specification has the potential to be present in all Arm platforms, and hence in billions of devices.
Another way of enhancing memory-access predictability is to enforce predictable execution models [11], [12], [35], [36]. However, these schemes require hardware-, OS-, or compiler-level support, which may not always be available.
Finally, other works focused on bounding the memory delays in the presence of contention. The closest to our work are those considering DDR memories, namely the ones by Kim et al. [23], Hassan et al. [18], [37], and Casini et al. [19]. Overall, no previous work has studied the effects of the MPAM mechanisms on memory contention.

CONCLUSION
With Arm proposing the MPAM specification, COTS platforms have an unprecedented opportunity to improve their predictability on a large scale. In this paper, we studied the MPAM specification by Arm in depth, highlighting many points in which it is underspecified and leaves room for ambiguity. Then, we explored some possible instantiations of the specification at the level of the DRAM memory controller. For each of them, we derived a memory-contention analysis that has been used to compare different design alternatives.
Lesson Learned. Our extensive experimental evaluation showed that it is easy to find configurations of the priority-partitioning strategies FR-PP and PP-FR that foster predictability by reducing memory-contention delays, with improvements of up to 41% with respect to the baseline configuration. Conversely, it proved hard to achieve comparable improvements with the much more complex MB-mp/MP strategy, whose gains were limited to 16% and observed only for very specific configurations. As is common in real-time systems, this confirmed that simpler is better for predictability: MB-mp/MP exhibits a higher degree of dynamism and more parameters to configure, which unavoidably leads to a more complex and more pessimistic analysis, arising from the consideration of a larger set of corner cases that cannot be excluded a priori. Nevertheless, we do not exclude that MB-mp/MP might lead to improvements in terms of average-case behavior. To assess this, future work will implement this policy in a state-of-the-art DRAM simulator, e.g., [38].
Other possible research directions include the consideration of the HARDLIM configuration of MB-mp/MP and the portion partitioning strategy.
Matteo Zini received the graduate (cum laude) degree in embedded computing systems engineering, the master's degree jointly offered by the Scuola Superiore Sant'Anna of Pisa and University of Pisa. He is currently working toward the PhD degree with the Real-Time Systems (ReTiS) Laboratory of the Scuola Superiore Sant'Anna of Pisa. His research interests include the design and implementation of real-time operating systems and hypervisors, cyber-physical systems, and analysis of the effects of memory interference on schedulability.
Daniel Casini (Member, IEEE) received the graduate (cum laude) degree in embedded computing systems engineering, the master's degree jointly offered by the Scuola Superiore Sant'Anna of Pisa and University of Pisa, and the PhD degree in computer engineering from the Scuola Superiore Sant'Anna of Pisa (with honors), working under the supervision of Prof. Alessandro Biondi and Prof. Giorgio Buttazzo. He is an assistant professor with the Real-Time Systems (ReTiS) Laboratory of the Scuola Superiore Sant'Anna of Pisa. In 2019, he was a visiting scholar with the Max Planck Institute for Software Systems (Germany). His research interests include software predictability in multi-processor systems, schedulability analysis, synchronization protocols, and the design and implementation of real-time operating systems and hypervisors.
Alessandro Biondi (Member, IEEE) received the graduate (cum laude) degree in computer engineering from the University of Pisa, Italy, within the excellence program, and the PhD degree in computer engineering from the Scuola Superiore Sant'Anna under the supervision of Prof. Giorgio Buttazzo and Prof. Marco Di Natale. He is associate professor with the Real-Time Systems (ReTiS) Laboratory of the Scuola Superiore Sant'Anna. In 2016, he has been visiting scholar with the Max Planck Institute for Software Systems (Germany). His research interests include design and implementation of real-time operating systems and hypervisors, schedulability analysis, cyber-physical systems, synchronization protocols, and component-based design for real-time multiprocessor systems. He was recipient of six Best Paper Awards, one Outstanding Paper Award, the ACM SIGBED Early Career Award 2019, and the EDAA Dissertation Award 2017.
For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl. Open Access funding provided by "Scuola Superiore 'S.Anna' di Studi Universitari e di Perfezionamento" within the CRUI CARE Agreement.