Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication

The network topology of modern parallel computing systems is inherently heterogeneous, with a variety of latency and bandwidth values. Moreover, contention for bandwidth can arise on different levels when many processes communicate with each other. Many-pair, point-to-point MPI communication is thus characterized by heterogeneity and contention, even on a cluster of homogeneous multicore CPU nodes. To obtain a detailed understanding of the individual communication cost per MPI process, we propose a new modeling methodology that incorporates both heterogeneity and contention. First, we improve the standard max-rate model to better quantify the actually achievable bandwidth as a function of the number of competing MPI processes. Then, we make a further extension that models, in more detail, the bandwidth contention when the competing MPI processes have different numbers of neighbors as well as non-uniform message sizes. Thereafter, we add more flexibility by considering interactions between intra-socket and inter-socket messaging. Through a series of experiments done on different processor architectures, we show that the new heterogeneous and contention-constrained performance models can adequately explain the individual communication cost associated with each MPI process. The largest test of realistic point-to-point MPI communication involves 8,192 processes and in total 2,744,632 simultaneous messages over 64 dual-socket AMD Epyc Rome compute nodes connected by InfiniBand, for which the overall prediction accuracy achieved is 84%.

I. INTRODUCTION

On a system purely based on multicore CPUs, the connectivity between the CPU cores has several layers. The cores that belong to the same CPU socket can communicate more efficiently with each other than with cores on a different socket, because the inter-socket memory bandwidth is lower than its intra-socket counterpart. At the cluster level, between any pair of compute nodes, the communication speed is even lower and can depend on the actual locations of the nodes in the network. All these levels of interconnect heterogeneity translate into vastly different values of the effective latency and bandwidth of point-to-point MPI communication.
Another complicating factor for many-pair, point-to-point MPI communication on today's parallel platforms is the potential competition between different MPI processes. This is because each CPU socket can, if needed, support a large number of concurrent MPI processes. Contention arises when multiple pairs of sending-receiving processes simultaneously communicate over the same connection. Such contention may exist, in different magnitudes, over the entire network. Moreover, the competition situation is often dynamically changing; for instance, an MPI process pair communicating a small message may complete before the other MPI process pairs, resulting in a reduced level of contention.
This paper aims to model, in detail, the per-process overhead associated with realistic, many-pair, point-to-point MPI communication. We want to improve the state-of-the-art quantitative models of point-to-point MPI communication with a new methodology for modeling the bandwidth contention among a large number of mutually competing MPI processes. Here, communication can exist on different levels: intra-socket, inter-socket and inter-node. In addition, we target the real-world situation where different MPI processes can have different numbers of neighbors to exchange data with, while the size of each message is also highly non-uniform. One typical example of such a heterogeneous scenario arises from numerically solving a partial differential equation (PDE) over an irregular solution domain. The first step of parallelizing a mesh-based PDE solver is to partition an unstructured computational mesh, which covers the irregular solution domain, into a desired number of subdomains, each assigned to an MPI process. The usual partitioning result is that the number of nearest neighbors varies from process to process, and so does the size of each MPI message.
Specifically, we will propose a new modeling methodology that quantifies both heterogeneity and contention. At the same time, we want to inherit a level of simplicity from the fundamental postal model [1], [2] that describes a single pair of MPI processes, and from the successor max-rate model [3]. The elegantly simple postal model relies on only two parameters to quantify the cost of point-to-point communication, i.e., a start-up latency τ and a bandwidth value BW. The max-rate model lets BW depend linearly on the number of competing MPI processes while being limited from above by a maximum bandwidth value, which is prescribed as a third parameter. Our new performance models are also based on a fair competition among the MPI processes, but the value of BW will depend dynamically on the actual number of competing MPI processes and on how these processes affect each other across two specific levels: intra-socket and inter-socket. We focus our modeling and experiments on large messages, which use the rendezvous MPI protocol. The contributions of our work are as follows:
- We extend the osu_bibw micro-benchmark of MVAPICH [4] to easily tabulate the various τ and BW values, both of which depend on the connection type, with the latter also being a function of the number of competing MPI processes. These tabulated values serve as a characterization of the communication performance of a heterogeneous interconnect.
- We improve the accuracy of the max-rate model [3] for the case of multiple MPI process pairs concurrently exchanging equal-sized messages. Specifically, the tabulated BW values replace an often idealized relationship between the achievable bandwidth and the number of competing MPI pairs.
- We propose a "staircase" strategy to detail the contention between many MPI processes with varying numbers of neighbors and message sizes, when competing over a single level of interconnect.
- We extend the single-level "staircase" modeling to mixed-level "staircase" modeling that also quantifies the interaction between two particular communication levels: intra-socket and inter-socket.
The remainder of the paper is organized as follows. Section II introduces a new bi-directional multi-pair micro-benchmark, which can be used to pinpoint the achievable bandwidth as a function of the number of competing send-receive pairs. Section III is devoted to a new "staircase" modeling strategy that can be adopted to handle the various types of heterogeneity, more specifically, non-uniform message sizes, a varying number of messages per MPI process, and the interaction between intra-socket and inter-socket communication. Section IV tests the "staircase" strategy and the resulting new models in realistic cases of many-pair, point-to-point MPI communication. Section V places our current work in the context of existing work on modeling MPI point-to-point communication, whereas Section VI provides some concluding remarks. The source code that implements the mixed-level modeling strategy can be found in the appendix, available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2023.3253881.

II. DETAILING BANDWIDTH CONTENTION
As mentioned above, the postal model [1], [2] provides an elegant and simple way of quantifying the time needed to pass a message between a single pair of MPI send-receive processes. Its formula is as follows:

    T(s) = τ + s / BW_SP,                                            (1)

where the constant parameter τ denotes the start-up latency, s denotes the size of the MPI message, and the constant parameter BW_SP denotes the communication bandwidth. The subscript "SP" stands for single-pair and thus highlights the applicability of the postal model. Experiments (see e.g. [5]) have shown that this two-parameter model can produce estimates of T(s) that agree very well with the actual single-message time usages, as long as the message size s is within the regime of the same protocol (i.e., short, eager, or rendezvous). It also means that each protocol is associated with its specific set of τ and BW_SP parameters.
The max-rate model [3] extends the postal model by considering N competing MPI messages belonging to N send-receive process pairs. If all the messages are of the same size s, they will have the same time usage due to a fair competition for the bandwidth. The simplest formula of the max-rate model is as follows:

    T(s, N) = τ + N·s / BW_MP(N).                                    (2)

In the above formula, BW_MP(N) is meant to model a shared bandwidth to be fairly competed for among the N messages, and the subscript "MP" stands for multi-pair. The dependency of BW_MP on N is considered by the max-rate model, in its simplest form, as follows:

    BW_MP(N) = min(N·BW_SP, BW_max),                                 (3)

where BW_max denotes the upper limit of the achievable communication bandwidth, i.e., a max rate. The existence of BW_max illustrates a saturation effect, which applies both to communication over a network connection and to communication through shared memory. An extended max-rate model was presented in [6] to consider variations in the message size and the number of messages per process. It mainly targets inter-node communication:

    T = M·τ + max( s_max / BW_SP , s_total / BW_max ).               (4)

Here, M is the maximum number of messages per process, and s_total and s_max denote, respectively, the total messaging volume per node and the maximum per-process volume. We note that (4) only models the slowest process per node. While the max-rate model is simple to use, it has several weaknesses. First, the actual BW_MP(N) value may not be linearly proportional to N before hitting the upper limit BW_max, especially for intra- and inter-socket traffic. Second, the bandwidth contention may not be accurately modeled when the processes have largely varying message numbers and/or sizes. Third, only the slowest process per node is modeled, so individual per-process time estimates are not available.
Algorithm 1: Bi-Directional Multi-Pair Benchmark.
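For concreteness, the postal and max-rate estimates in (1)-(3) can be written out in a few lines of Python. The sketch below is only illustrative: the function names and the numerical parameter values are ours, not measured values or code from this work.

```python
# Sketch of the postal model (1) and the max-rate model (2)-(3).
# Parameter values below are placeholders, not measurements.

def postal_time(s, tau, bw_sp):
    """Postal model (1): time for a single message of s bytes."""
    return tau + s / bw_sp

def maxrate_bw(n, bw_sp, bw_max):
    """Max-rate roofline (3): aggregate bandwidth shared by n messages."""
    return min(n * bw_sp, bw_max)

def maxrate_time(s, n, tau, bw_sp, bw_max):
    """Max-rate model (2): per-message time when n equal-sized messages compete."""
    return tau + n * s / maxrate_bw(n, bw_sp, bw_max)

if __name__ == "__main__":
    tau, bw_sp, bw_max = 2.0e-6, 5.0e9, 20.0e9        # seconds, bytes/s (invented)
    print(postal_time(1 << 20, tau, bw_sp))            # one 1 MB message
    print(maxrate_time(1 << 20, 8, tau, bw_sp, bw_max))  # 8 competing 1 MB messages
```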
In the remainder of this section we will improve the max-rate model with respect to its first shortcoming. This will be achieved by extending a micro-benchmark of MVAPICH [4]. The other two shortcomings will be addressed in Section III.

A. A New Micro-Benchmark
We want to pinpoint the actual BW_MP values through measurements obtained by a simple benchmark, instead of using the often idealized formula (3). Specifically, we will adopt a new "bi-directional multi-pair" micro-benchmark, as described in Algorithm 1. It is a modification of the osu_bibw benchmark of MVAPICH [4]. The new micro-benchmark involves P processes forming P/2 sender-receiver pairs, each simultaneously handling two equal-sized messages of opposite directions. The total number of competing messages is thus P. To avoid the unwanted side-effect of MPI messages being cached, each repetition of a message is loaded/stored from/to a different location in a pre-allocated long buffer.
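Since the listing of Algorithm 1 is not reproduced here, the following mpi4py sketch conveys the idea of the bi-directional multi-pair measurement under our own assumptions about repetition count, pairing and buffer layout; it is not the authors' exact benchmark. Process placement (intra-socket, inter-socket or inter-node) is assumed to be controlled by the MPI launcher's binding options.

```python
# Sketch of a bi-directional multi-pair bandwidth test in the spirit of
# Algorithm 1 (extended from osu_bibw).  Loop counts, pairing and buffer
# layout are our assumptions, not the paper's exact algorithm.
from mpi4py import MPI
import numpy as np

def bibw_multipair(msg_size=1 << 20, reps=64, slots=8):
    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    assert nprocs % 2 == 0, "needs an even number of processes (P/2 pairs)"
    partner = rank + nprocs // 2 if rank < nprocs // 2 else rank - nprocs // 2

    # Long buffers; each repetition uses a different slot to avoid cache reuse.
    sendbuf = np.ones(slots * msg_size, dtype=np.uint8)
    recvbuf = np.empty(slots * msg_size, dtype=np.uint8)

    comm.Barrier()
    t0 = MPI.Wtime()
    for r in range(reps):
        off = (r % slots) * msg_size
        reqs = [comm.Irecv(recvbuf[off:off + msg_size], source=partner),
                comm.Isend(sendbuf[off:off + msg_size], dest=partner)]
        MPI.Request.Waitall(reqs)
    comm.Barrier()
    elapsed = MPI.Wtime() - t0

    # P messages of msg_size bytes are in flight per repetition.
    total_bytes = nprocs * msg_size * reps
    if rank == 0:
        print(f"aggregate BW_MP({nprocs}) ~ {total_bytes / elapsed / 1e9:.2f} GB/s")

if __name__ == "__main__":
    bibw_multipair()
```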

B. Example: Measuring BW_MP(N) on Four Machines
We have run the bi-directional multi-pair micro-benchmark (Algorithm 1) on four machines with dual-socket CPUs. The specific CPU types are: (1) ARM Cavium 32-core ThunderX2 CN9980; (2) ARM 64-core Kunpeng 920-6426; (3) Intel Xeon 26-core Gold-6230; and (4) AMD 64-core Epyc Rome-7742. OpenMPI v4.0.5 was used as the MPI installation. Compilation used GCC v10.1.0 with the -O3 -march=armv8-a options on the ThunderX2 and Kunpeng machines, and GCC v10.2.0 with the -O3 option on the Intel Xeon and Epyc Rome machines. On each machine, we measure the time usage for a series of message sizes in the regime of the rendezvous protocol. This is repeated for some chosen numbers of MPI processes. For each chosen value N, the series of time measurements (for the different message sizes) undergoes a linear regression to recover the values of τ and BW_MP(N) that can be used in the max-rate model (2), or later in our new models to be introduced in Section III. On a given machine and for a specific communication level, the recovered τ values are very similar for the different choices of N. So we have consistently used the τ value associated with N = 2 in Table I and in the later experiments. We remark that the models and experiments presented in this paper target message sizes in the regime of the rendezvous protocol. Our modeling approach remains the same for the other protocols.
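The recovery of τ and BW_MP(N) from a series of (message size, time) measurements is an ordinary least-squares fit. The sketch below assumes the per-message time follows the form of (2), so that the fitted slope equals N/BW_MP(N) and the intercept equals τ; the sample numbers are invented, not measured values.

```python
import numpy as np

def fit_tau_bw(sizes, times, n):
    """Fit t = tau + n*s/BW_MP(n) by linear regression of time against size."""
    slope, tau = np.polyfit(np.asarray(sizes, float), np.asarray(times, float), 1)
    return tau, n / slope        # BW_MP(n) = n / slope

# Invented rendezvous-regime measurements for N = 8 competing processes:
sizes = [64e3, 128e3, 256e3, 512e3, 1024e3]          # bytes
times = [1.2e-4, 2.3e-4, 4.5e-4, 8.9e-4, 17.7e-4]    # seconds
tau, bw = fit_tau_bw(sizes, times, n=8)
print(f"tau ~ {tau:.2e} s, BW_MP(8) ~ {bw / 1e9:.2f} GB/s")
```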
The measurement-determined BW_MP(N) values for all four machines are summarized in Table I, where we also distinguish between the intra-socket and inter-socket scenarios. The former means that all sender-receiver pairs are placed on the same socket, whereas the latter means that each pair is split across two sockets. In this table, and also in the remainder of the paper, the value N denotes the number of active MPI processes per socket that concurrently receive incoming messages, i.e., running the micro-benchmark of Algorithm 1 with P = N for the intra-socket measurements and P = 2N for the inter-socket measurements. The reason for this definition of N is that the new performance models to be introduced in Section III consider a socket as the "base unit" when modeling intra- and inter-socket communication. The BW_max values in Table I are obtained by running as many MPI processes per socket as there are CPU cores. The intra-socket BW_SP value is obtained by running two MPI processes on just one socket, with only one uni-directional message.
One clear observation from Table I is that the τ measurements associated with the intra-socket and inter-socket cases confirm that intra-socket MPI communication is faster than its inter-socket counterpart. Another important observation is that the intra- and inter-socket BW_MP(N) values do not follow the formula (3) of the max-rate model, as clearly demonstrated in Fig. 1. These tabulated BW_MP(N) values will be heavily used in our new models of Section III, which are capable of handling varying numbers of incoming/outgoing messages per MPI process, non-uniform size per message, and the interaction between intra- and inter-socket traffic.

III. NEW PERFORMANCE MODELS OF MANY-PAIR, POINT-TO-POINT COMMUNICATION
Apart from replacing the max-rate formula (3) with tabulated BW_MP(N) values that are determined by the new micro-benchmark of Algorithm 1, another improvement is to more accurately model the bandwidth contention when the competing MPI processes receive largely different volumes. We will thus develop in this section a new modeling strategy that quantifies, in detail, the per-process time usage for the cases where the message size is non-uniform and/or the number of neighbors per process varies. A more general situation is that many MPI messages of various sizes are concurrently communicated over multiple levels of a heterogeneous network. In total, three new models of increasing complexity will be introduced.

A. General Many-Pair, Point-to-Point Communication
First, let us define the general situation of many-pair, point-to-point MPI communication in Algorithm 2. More specifically, the MPI process with rank i will simultaneously receive M_i^in messages from M_i^in different neighbors. Each process thus has a list of "inward" neighbor identities neigh_{i,j}^in, j = 1, ..., M_i^in. The neighbors can be of mixed types, i.e., some may reside on the same socket as process i, others may reside on a different socket but inside the same compute node, whereas the rest may reside on other compute nodes. Each process is also assumed to send M_i^out outgoing messages to M_i^out neighbors (which can be different from the M_i^in "inward" neighbors). All the messages can be of varying sizes, and the sizes of the "inward" and "outward" messages do not have to match. For this purpose, we have for each process two sets of message sizes: s_{i,j}^in for the incoming messages and s_{i,j}^out for the outgoing ones. An example of the resulting sparse communication matrix is shown in Fig. 2, which involves 16 processes. It should be remarked that Algorithm 2 describes irregular point-to-point MPI communications that frequently arise in scientific computations. One such example can be found in the copyOwnerToAll function of the DUNE software framework [7], [8], [9], in connection with parallelizing sparse matrix-vector multiplications.
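The listing of Algorithm 2 is likewise not reproduced here. An irregular exchange of this kind is typically realized by posting all non-blocking receives and sends and then waiting for completion; the mpi4py sketch below illustrates the pattern with a toy ring-shaped neighbor list, which is our own example rather than the paper's.

```python
# Sketch of irregular many-pair point-to-point exchange in the spirit of
# Algorithm 2; neighbor lists and message sizes are illustrative only.
from mpi4py import MPI
import numpy as np

def irregular_exchange(in_neighbors, in_sizes, out_neighbors, out_msgs, comm):
    """Post all receives and sends, then wait; returns the received buffers."""
    recv_bufs = [np.empty(sz, dtype=np.uint8) for sz in in_sizes]
    reqs = []
    for src, buf in zip(in_neighbors, recv_bufs):
        reqs.append(comm.Irecv(buf, source=src))
    for dst, msg in zip(out_neighbors, out_msgs):
        reqs.append(comm.Isend(msg, dest=dst))
    MPI.Request.Waitall(reqs)
    return recv_bufs

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    # Toy ring pattern: each rank exchanges one 64 KB message with each ring neighbor.
    left, right = (rank - 1) % nprocs, (rank + 1) % nprocs
    out_msgs = [np.full(1 << 16, rank % 256, dtype=np.uint8) for _ in range(2)]
    irregular_exchange([left, right], [1 << 16, 1 << 16],
                       [left, right], out_msgs, comm)
```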

B. A Staircase Modeling Strategy
Before developing new performance models of increasing complexity for quantifying the per-process time usage associated with Algorithm 2, we need to first introduce a "staircase" modeling strategy. The purpose is to quantitatively describe the dynamically changing scenarios of competition, i.e., the number of competing processes and the available communication bandwidth are both dynamic. The following principles are associated with this modeling strategy:
1) The time needed by process i to complete all its communication tasks is determined by either how soon it can finish receiving all its M_i^in incoming messages, or how soon all its M_i^out outgoing messages are received at the destinations. The maximum of the two time usages applies. Also considering the delivery of the outgoing messages at the destinations is motivated by the most common situation using the rendezvous protocol without intermediate buffering, where experiments show that a sending process cannot complete until all its outgoing messages have been delivered.
2) The "expected" order in which the M_i^in incoming messages are received by process i is determined by the sizes of these messages. (The actual order is stochastic, and thus not modelable.) The smallest incoming message completes first, followed by the second smallest message, and so on. The completion time points for the different incoming messages can be estimated using a staircase strategy, which will be detailed by formula (9). If the M_i^in incoming messages are of the same size, they are considered to complete simultaneously, due to fairness.
3) When multiple receiving processes, each with one or more incoming messages, compete for the same network connection, another staircase principle applies, as described by formula (5). The process with the least amount of incoming communication will complete first, letting the remaining processes continue with their remaining communication. This procedure repeats itself until all the processes have completed. At all times, the bandwidth is shared evenly among the still-active processes, while the actually available bandwidth can depend on the number of concurrently competing processes.

C. Model for One-Level, Single-Neighbor, Non-Uniform Message Size
The staircase modeling strategy will be first applied to the scenario of each MPI process having exactly one incoming message and one outgoing message. The messages sent/received by the different processes can be of various sizes. For this simple scenario we consider that all the MPI processes communicate on the same level, i.e., either all intra-socket, or all inter-socket, or all inter-node.
To model the per-process time usage, we first group the MPI processes that share the same bandwidth, i.e., the processes that reside on the same socket for the level of intra-socket communication, or the processes on one socket that share the same inter-socket bandwidth, or the processes that reside in one compute node and share the same inter-node bandwidth. We let N denote the number of processes in a group. The bandwidth under contention is characterized by a set of pre-tabulated values BW_SP, BW_MP(2), ..., BW_MP(N), as well as a latency constant τ.
Modeling of the per-process time usage will start with sorting the N processes of each group with respect to the size of the per-process single incoming message:

    s_1 ≤ s_2 ≤ ... ≤ s_N.

Then, the per-process time usage T_i can be estimated as a "staircase," more specifically

    t_i^recv = t_{i-1}^recv + (s_i − s_{i-1}) · (N − i + 1) / BW_MP(N − i + 1),   t_0^recv = 0,  s_0 = 0,     (5)

    T_i = τ + max( t_i^recv , t_i^send ).                                                                    (6)

The recursive formula in (5) accounts for a decreasing degree of bandwidth contention, in the form of a staircase, as more and more processes complete. Also, we have used BW_MP(1) = BW_SP. In (6), t_i^send denotes the time needed for the outgoing message of process i to be delivered on the destination process with rank k = neigh_i^out (see Algorithm 2), which is considered the same as the receiving time t_k^recv on process k. We note that the time usage model in (5)-(6) is only valid for single-message per-process communication, but it can be extended to account for multiple message sources and destinations per process. This will be covered in Section III-D below.
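A direct transcription of the staircase recursion (5)-(6), as reconstructed above, might look as follows in Python. The bandwidth table and the example numbers are placeholders, not the measured values of Table I.

```python
def staircase_recv_times(sizes, bw_mp):
    """Receive-completion times per (5): `sizes` are the per-process single
    incoming message sizes of one group, `bw_mp[k]` is the tabulated aggregate
    bandwidth for k competing processes (bw_mp[1] = BW_SP)."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    n = len(sizes)
    t_recv = [0.0] * n
    t_prev, s_prev = 0.0, 0.0
    for step, i in enumerate(order):
        active = n - step                        # processes still receiving
        t_prev += (sizes[i] - s_prev) * active / bw_mp[active]
        s_prev = sizes[i]
        t_recv[i] = t_prev
    return t_recv

def staircase_total_times(sizes, out_neigh, tau, bw_mp):
    """Per-process totals per (6): latency plus the slower of the own receive
    and the delivery of the outgoing message at its destination."""
    t_recv = staircase_recv_times(sizes, bw_mp)
    return [tau + max(t_recv[i], t_recv[out_neigh[i]]) for i in range(len(sizes))]

# Three pairs exchanging s/2, s and 2s within one socket (cf. Fig. 3):
s = 1 << 20
sizes = [s // 2, s // 2, s, s, 2 * s, 2 * s]
out_neigh = [1, 0, 3, 2, 5, 4]                   # each process sends to its partner
# Placeholder bandwidth table (bytes/s), NOT the measured values of Table I:
bw_mp = {1: 5e9, 2: 8e9, 3: 10e9, 4: 11e9, 5: 11.5e9, 6: 12e9}
print(staircase_total_times(sizes, out_neigh, tau=2e-6, bw_mp=bw_mp))
```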
Verification Example: As verification experiments for this new model, as well as for the other new performance models introduced later, we always run a benchmark code that implements Algorithm 2. (The total number of MPI processes, the number of neighbors per process and the sizes of the messages may differ from experiment to experiment.) More specifically, each experiment executes Algorithm 2 ten times, and the average time usage per iteration is recorded individually for each MPI process.
We have tested the new model (5)-(6) by quantifying the per-process time usage of 6 MPI processes running on one ARM Cavium ThunderX2 processor (all the messages are intra-socket). The 6 processes form three sender-receiver pairs, where each pair exchanges the same amount of data in the two opposite directions. However, the three pairs simultaneously work with three different message sizes: s/2, s and 2s, with s ranging between 2 bytes and 2 MB. The actual time usages for the three pairs are plotted against the model estimates in Fig. 3.
In Fig. 3 we can see that the model estimates agree very well with the actual time measurements for most of the message sizes. The model (5)-(6) is able to estimate the time usage of each individual process. The two processes of each pair have the same time measurements and estimates, thus we only see three sets of measurement-estimate curves. For each set, we can also observe two clear jumps in the time measurements and estimates, each corresponding to a change in the message protocol, first between the short and eager protocols, then between eager and rendezvous. For example, the eager-rendezvous protocol threshold at 8 KB is clearly visible for the mid-sized pair in the plot.

D. Model for One-Level, Multi-Neighbor, Non-Uniform Message Size
Now let us consider a more general situation where each process communicates with multiple neighbors. For this situation, we modify the recursive formula in (5) by replacing the single-message size s_i with a per-process incoming volume V_i = s_{i,1} + s_{i,2} + ... + s_{i,M_i^in}, where M_i^in denotes the number of "inward" neighbors for process i, and s_{i,j} denotes the size per incoming message. The MPI processes are organized into groups as in the previous model. Again, N denotes the number of processes in a group, and the processes inside the group are now sorted with respect to their incoming volumes:

    V_1 ≤ V_2 ≤ ... ≤ V_N.

The time usage estimate in (5) for single-neighbor, non-uniform message size communication can now be extended to model the multi-neighbor situation:

    t_i^recv = t_{i-1}^recv + (V_i − V_{i-1}) · (N − i + 1) / BW_MP(N − i + 1),   t_0^recv = 0,  V_0 = 0,     (7)

    T_i = τ + max( t_i^recv , max_{j=1,...,M_i^out} t_{i,j}^send ),                                          (8)

where t_{i,j}^send denotes the delivery time needed by each outgoing message, and it equals the time needed by the destination process (with rank neigh_{i,j}^out) to receive this message. Since each process can have multiple incoming messages, we thus need to "theoretically" pinpoint the individual time points at which the different messages are received on each destination process. Take for instance process i. It is the destination of M_i^in incoming messages, and the total receiving time (without the latency overhead) has been calculated as t_i^recv using (7). The following recursive formula, which is again based on a staircase principle, will be used to pinpoint the individual time points at which the M_i^in incoming messages are received:

    t_{i,j}^recv = t_{i,j-1}^recv + (s_{i,j} − s_{i,j-1}) · (M_i^in − j + 1) · t_i^recv / V_i,   t_{i,0}^recv = 0,  s_{i,0} = 0.     (9)

In the above recursive formula, we have sorted the M_i^in incoming messages of process i with respect to the message sizes, s_{i,1} ≤ s_{i,2} ≤ ... ≤ s_{i,M_i^in}. We remark that the completion time point for the largest incoming message, which is theoretically expected to finish last, equals the total receiving time needed by process i, i.e., t_{i,M_i^in}^recv = t_i^recv.
Verification Examples: To demonstrate the single-level, multi-neighbor model (7)-(9), we have conducted three experiments, each consisting exclusively of intra-socket, inter-socket or inter-node traffic. All the results were obtained on a cluster of Epyc Rome CPUs, described in Section II-B. For the intra-socket experiment, a synthetic communication pattern with 64 MPI processes, spread over two sockets, has been used. As shown in Fig. 4(b), each process sends and receives 3 intra-socket messages of different sizes. The per-process time estimates are plotted against the actual time measurements in Fig. 4(a). For comparison, the plot also includes per-process estimates produced by the extended max-rate model (4), marked as purple dots. More specifically, we have used the following variation of (4):

    T_i = M_i·τ + max( V_i / BW_SP , min(V_total, N·V_i) / BW_max ).                                         (10)

That is, the total messaging volume s_total in (4) is replaced by min(V_total, N·V_i), and the maximum per-process volume s_max is replaced by the per-process volume V_i. Note that (10) is the same as (4) for the slowest process. The inter-socket experiment uses 64 MPI processes (also spread over two sockets) that exclusively send and receive inter-socket messages, see Fig. 4(d). The number of neighbors per MPI process and the message sizes arise from partitioning a realistic unstructured computational mesh into 64 pieces, where the maximum number of neighbors per process is 7, and two processes have no neighbors. The message sizes also vary considerably. The actual and estimated per-process times by (7)-(9) are displayed in Fig. 4(c). Max-rate estimates are also included in Fig. 4(c).
The inter-node experiment involves 256 MPI processes spread over four nodes, and the results and the inter-node communication pattern are presented in Fig. 4(e)-(f). This example also arises from a realistic case of partitioning an unstructured computational mesh. Every MPI process only sends and receives inter-node messages, where the number of neighbors per process and the message sizes vary considerably. (The maximum number of neighbors per process is 58, whereas the minimum is 0.) The extended max-rate model (4) has been used to estimate the slowest process's time usage per node. Per-process estimates are not included, to prevent cluttering the plot.
To quantify the accuracy of our model (7)-(9), we calculate a total relative error defined as follows:

    E = Σ_i | T_i − T_i^actual | / Σ_i T_i^actual,                                                           (11)

where T_i^actual denotes the measured actual time usage on process i and T_i denotes the per-process model estimate. Using this error definition, we have calculated the modeling error for the experiments shown in Fig. 4 to be 11.5% for the intra-socket experiment, 13.0% for the inter-socket experiment and 12.8% for the inter-node experiment, see Table II.
As shown in Fig. 4(a)-(c), the extended max-rate model under-estimates the per-process time for the intra- and inter-socket scenarios. This is not surprising because it incorrectly models the bandwidth contention. Thus, our one-level model (7)-(9) gives better accuracy than the extended max-rate model for intra- and inter-socket traffic. For the third experiment, shown in Fig. 4(e), our model has approximately the same accuracy as the extended max-rate model. This result is expected because two competing MPI processes can already saturate the maximum inter-node bandwidth on this system. A more elaborate model of the bandwidth contention thus does not pay off.
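To make the multi-neighbor model concrete, the sketch below follows our reading of (7)-(9): an outer staircase over per-process incoming volumes, and an inner staircase that distributes a process's total receive time over its individual messages (so that the largest message finishes exactly at t_i^recv). It is an illustrative sketch, not the authors' implementation, and the placeholder bandwidth values are invented.

```python
def outer_staircase(volumes, bw_mp):
    """t_i^recv per (7): staircase over per-process incoming volumes V_i."""
    order = sorted(range(len(volumes)), key=lambda i: volumes[i])
    n, t_recv, t_prev, v_prev = len(volumes), [0.0] * len(volumes), 0.0, 0.0
    for step, i in enumerate(order):
        active = n - step
        t_prev += (volumes[i] - v_prev) * active / bw_mp[active]
        v_prev = volumes[i]
        t_recv[i] = t_prev
    return t_recv

def inner_staircase(msg_sizes, t_recv_total):
    """Per-message completion times per (9): the process's effective receive
    rate V_i / t_i^recv is shared fairly among its still-active messages."""
    vol = sum(msg_sizes)
    if vol == 0:
        return [0.0] * len(msg_sizes)
    rate = vol / t_recv_total
    order = sorted(range(len(msg_sizes)), key=lambda j: msg_sizes[j])
    times, t_prev, s_prev = [0.0] * len(msg_sizes), 0.0, 0.0
    for step, j in enumerate(order):
        active = len(msg_sizes) - step
        t_prev += (msg_sizes[j] - s_prev) * active / rate
        s_prev = msg_sizes[j]
        times[j] = t_prev
    return times

# Per (8), process i finishes at tau + max(t_recv[i], latest delivery time of
# its outgoing messages, looked up via inner_staircase on each destination).
if __name__ == "__main__":
    bw_mp = {1: 5e9, 2: 8e9, 3: 10e9}            # placeholder values (bytes/s)
    t = outer_staircase([1e6, 2e6, 4e6], bw_mp)
    print(t, inner_staircase([1e6, 3e6], t[2]))
```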

E. Model for Mixing Intra- and Inter-Socket Messages
The two new models (5)-(6) and (7)-(9) can be used to quantify the per-process time usage when all the point-to-point communication happens on one level. However, communications that concurrently take place on different levels can affect each other. We will thus develop a third model to handle a mixture of intra-socket and inter-socket (intra-node) messages. This is important in practice because point-to-point MPI communications on these two levels are particularly subject to bandwidth contention, which is in essence contention for the shared memory bandwidth.
The main idea is that the available communication bandwidth, which is to be shared among N competing processes, is neither the pure intra-socket value BW_MP^on(N) nor the pure inter-socket value BW_MP^off(N). For process i, its achievable bandwidth will depend on the ratio between these two traffic types, more specifically

    θ_i = V_i^on / (V_i^on + V_i^off),    BW_i(N) = θ_i · BW_MP^on(N) + (1 − θ_i) · BW_MP^off(N),            (12)

where V_i^on and V_i^off denote, respectively, the total intra- and inter-socket incoming volumes on process i. To model the per-process receiving time usage t_i^recv when intra- and inter-socket traffic is mixed, we start with dividing all the MPI processes into groups based on which socket they reside on. As in the two preceding models, N denotes the number of processes in a group. The challenge now is that we can no longer easily see beforehand the exact order in which the processes will finish receiving their incoming traffic. This is due to a more dynamic competition for the available bandwidth. We will thus adopt a modeling algorithm that is based on yet another staircase strategy consisting of N steps, where in each step we find out, "on-the-fly," which process will be the next one to finish.
For step k = 0, 1, ..., N − 1:

    Find the unfinished process q, and the step duration t_step^k, that give
        t_step^k = min over unfinished i of (V_i^on + V_i^off − V_i^{on,recv} − V_i^{off,recv}) · (N − k) / BW_i(N − k);   (13)

    Compute t_q^recv = Σ_{ℓ=0}^{k} t_step^ℓ;   (process q finishes after step k)                                           (14)

    Compute, for each unfinished process i, the volume received during this step,
        ΔV_i = t_step^k · BW_i(N − k) / (N − k);                                                                           (15)

    For all unfinished processes:
        V_i^{on,recv} ← V_i^{on,recv} + θ_i · ΔV_i,                                                                        (16)
        V_i^{off,recv} ← V_i^{off,recv} + (1 − θ_i) · ΔV_i.                                                                (17)

In the above model, we have used V_i^{on,recv} and V_i^{off,recv} to record the accumulated volumes of intra- and inter-socket data received so far on process i. The values of V_i^{on,recv} and V_i^{off,recv} are increased during each step using (16)-(17). After each step, one process q finishes, as expressed by (13)-(14). On each process, its individual ratio θ_i remains the same throughout the entire procedure. A source code that implements the above mixed-level model can be found in the appendix, available in the online supplemental material.
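The authors' implementation of the mixed-level model is given in the appendix; as an illustration only, the sketch below shows one way to realize the N-step loop, assuming the blended bandwidth of (12) is the linear, θ_i-weighted combination of the tabulated on- and off-socket values (that weighting is our assumption).

```python
def mixed_level_recv_times(v_on, v_off, bw_on, bw_off, eps=1e-12):
    """Sketch of the N-step mixed intra/inter-socket staircase (12)-(17) for one
    socket group.  bw_on[k]/bw_off[k]: tabulated BW_MP^on/off for k active
    processes.  The linear blend below is our reading of (12)."""
    n = len(v_on)
    theta = [v_on[i] / max(v_on[i] + v_off[i], eps) for i in range(n)]
    remaining = [v_on[i] + v_off[i] for i in range(n)]
    t_recv, elapsed = [0.0] * n, 0.0
    active = set(range(n))
    for step in range(n):
        k = len(active)
        # Per-process share of the blended bandwidth with k active processes.
        share = {i: (theta[i] * bw_on[k] + (1 - theta[i]) * bw_off[k]) / k
                 for i in active}
        # (13): the next process to finish is the one with the least remaining time.
        q = min(active, key=lambda i: remaining[i] / share[i])
        t_step = remaining[q] / share[q]
        elapsed += t_step
        t_recv[q] = elapsed                      # (14)
        # (15)-(17): all still-active processes advance by their share of data.
        for i in active:
            remaining[i] = max(remaining[i] - t_step * share[i], 0.0)
        active.remove(q)
    return t_recv

if __name__ == "__main__":
    bw_on = {1: 6e9, 2: 9e9, 3: 11e9}            # placeholder values (bytes/s)
    bw_off = {1: 3e9, 2: 4e9, 3: 4.5e9}
    print(mixed_level_recv_times([1e6, 0.0, 2e6], [0.0, 1e6, 2e6], bw_on, bw_off))
```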

Verification Examples: We illustrate the mixed intra/inter-socket model (12)-(17) with two simple examples, one using the dual-socket ThunderX2 machine and the other using the dual-socket Kunpeng machine (see Table I in Section II-B). In the first experiment we use in total 48 processes, where 32 processes (16 on each socket) send and receive intra-socket messages and the remaining 16 processes (8 on each socket) send and receive inter-socket messages. The results are presented in Fig. 5(a), and the communication pattern is presented in Fig. 5(b). The plot in Fig. 5(a) includes the actual per-process time measurements as blue bars and the staircase model estimates as green bars. Notice that the mixed intra/inter-socket model correctly predicts that the off-socket processes complete after the on-socket processes. Using the error metric defined in (11), the total relative error is found to be 6.6% for the staircase estimate.
In the second experiment, we use 64 processes spread over two Kunpeng sockets. The process connectivity is the same as shown in Fig. 4(b), but this time we deliberately "reshuffle" the process mapping (by the OpenMPI option -map-by socket) to incur a mixture of intra- and inter-socket communication. The per-process message sizes and types are displayed in Fig. 5(d). The actual time measurements and staircase model estimates are displayed in Fig. 5(c). The resulting total relative error is 3.6%.

TABLE III: Experiments with realistic communication patterns on ThunderX2, Kunpeng, Xeon-Gold and Epyc Rome. The results are obtained on one, two and four nodes, using up to 256 processes on ThunderX2, up to 512 processes on Kunpeng and Epyc Rome, and up to 208 processes on Xeon-Gold. The table displays the total number of messages (M), the maximum number of per-process messages, the total communication volume (in MB), the on-socket (intra-socket), off-socket (inter-socket) and inter-node percentages of the total volume, and the total relative prediction error.

IV. REALISTIC TESTS

So far, we have mostly used synthetic experiments, the only exception being those shown in Fig. 4(c)-(f). In this section, we will apply the new performance models to realistic communication patterns.

A. Mixing Intra-Node and Inter-Node Messages
The most general point-to-point MPI communication can simultaneously take place at all the levels: intra-socket, inter-socket and inter-node. Our approach is to use the one-level model (7)-(9) for estimating the inter-node cost per process as T_i^inter-node, whereas the mixed-level model (12)-(17) is used for estimating the mixed intra/inter-socket cost as T_i^intra-node. The total time per process is given by

    T_i = T_i^intra-node + T_i^inter-node.                                                                   (18)

The above model of the total time cost assumes that the intra-node and inter-node communication tasks cannot proceed simultaneously, whereas the intra-socket and inter-socket messages can compete for the same memory bandwidth (while dynamically affecting each other). We remark that summing up the inter-node costs with the intra-node costs is the same approach as adopted in [6].
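In code, (18) amounts to classifying each message by level, feeding the intra-node (mixed intra/inter-socket) and inter-node groups to the respective models, and summing the two per-process estimates. A minimal sketch, with a hypothetical rank-to-node and rank-to-socket mapping:

```python
def classify_message(src, dst, node_of, socket_of):
    """Label a message as 'on' (intra-socket), 'off' (inter-socket) or 'inter'
    (inter-node), given hypothetical rank->node and rank->socket maps."""
    if node_of[src] != node_of[dst]:
        return "inter"
    return "on" if socket_of[src] == socket_of[dst] else "off"

def total_time_per_process(t_intra_node, t_inter_node):
    """Model (18): the intra-node and inter-node estimates are assumed not to
    overlap in time, so the two per-process costs are simply added."""
    return [a + b for a, b in zip(t_intra_node, t_inter_node)]
```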

B. Realistic Communication Patterns
For realistic communication patterns, we have adopted some examples of partitioning of realistic 3D unstructured meshes that have been used for reservoir simulations [10], [11], [12]. This is in connection with an MPI parallelization of a reservoir simulator, where Algorithm 2 needs to be repeatedly executed when numerically solving the involved partial differential equations. We refer the reader to [12] for the details. Given a decomposition of the mesh, based on a specified number of MPI processes, we can create the corresponding communication pattern of the irregular point-to-point MPI communication operations. Each MPI message is classified as on-socket, off-socket or inter-node.

C. Small-Scale Experiments
To test the performance model (18) on realistic communication patterns, we compare the per-process execution times of Algorithm 2 with the model estimates when running different numbers of MPI processes on four computing platforms: ThunderX2, Kunpeng, Xeon-Gold and Epyc Rome (see Table I in Section II-B), using up to four compute nodes. One realistic example of per-process communication volumes can be seen in Fig. 6. Table III shows that the performance model (18) achieves good prediction results. The total relative error ranges from 7.5% to 18.9% on ThunderX2 and Kunpeng, from 12.2% to 15.6% on Xeon-Gold, and from 9.3% to 14.3% on Epyc Rome. A more detailed comparison for two ThunderX2 and two Kunpeng nodes is presented in Fig. 7. Here, the per-process execution times and model estimates are displayed in bar plots. Fig. 7(a) displays the execution times and model estimates for 128 processes on two ThunderX2 nodes, while Fig. 7(b) displays the timing results and model estimates for 256 processes on two Kunpeng nodes. We can observe good correspondence between the model estimates and the execution times in most cases.

D. Large-Scale Experiments
We have also tested our model on the Betzy supercomputer [13], which is a BullSequana XH2000 system with in total 1344 compute nodes. Each compute node has two 64-core AMD Epyc Rome CPUs, for a total of 172,032 CPU cores. The system uses an InfiniBand HDR 100 network connected in a Dragonfly+ topology [14], which consists of 14 groups, each with 96 nodes. The Dragonfly+ topology improves scalability over the original Dragonfly, while maintaining the cost benefits when compared with the Fat-tree topology [15]. The Dragonfly+ topology also provides full bisection bandwidth for intra-group communications, but inter-group communications are oversubscribed and require the use of non-minimal routing to achieve good performance. This means that the Betzy system will see excellent inter-node communication for MPI processes running inside a single group, but performance may suffer with regard to both bandwidth and latency when the MPI processes are spread across two or more groups, depending on the overall network load.
The prediction accuracy of our model (18) has been tested on Betzy using communication patterns that include a larger proportion of inter-node messages, arising from partitioning a large reservoir mesh that is ten times larger than the mesh used for the experiments on ThunderX2, Kunpeng and Xeon-Gold in Section IV-C. Fig. 8 shows a breakdown of the per-node communication volumes for this large case when it is partitioned into 1024 and 8192 subdomains, using 8 and 64 compute nodes on Betzy, respectively.
From Fig. 8 we can observe that the communication patterns attained by partitioning the large mesh result in a majority of on-socket communication. However, we also notice that there is a significant amount of off-socket and inter-node communication. The model prediction accuracy for our large-scale Epyc Rome experiments is presented in Table IV. In this table, we include results attained on 4 to 64 nodes. In the scenario of 64 nodes, the MPI processes are distributed (unevenly) across two Dragonfly+ groups in order to enforce non-uniform network connectivity and to test the prediction accuracy when both inter- and intra-group communications are present. The total relative prediction errors are displayed in Table IV.

V. RELATED WORK

The postal model [1], [2] is very simple, but its performance estimates are only accurate for fixed-size, single messages on massively parallel processing (MPP) systems. These traditional MPP systems were homogeneous, and there was only inter-node communication (i.e., no intra-node communication). The limitation to single messages was addressed in the more recent max-rate model [3], which extends the postal model to support N concurrent messages from N send-receive process pairs. Improvements of the max-rate model were introduced in [16] to clearly distinguish between inter-node, intra-node and intra-socket messages and to estimate network contention. Nevertheless, the max-rate model and its successor still lack support for variable-sized messages, in addition to the weakness of adopting an over-simplified "roofline" model for quantifying the aggregate bandwidth that is available to multiple sender-receiver pairs. In comparison, our work in the present paper adopts accurate aggregate bandwidth values that are measured by a simple bi-directional, multi-pair micro-benchmark, extended from the osu_bibw benchmark of MVAPICH [4]. The main difference between our work and the max-rate models [3], [16] is the adoption of a staircase modeling strategy to quantify the per-process cost for the general cases where different MPI processes can have different numbers of neighbors and the messages are of different sizes.
Apart from modeling point-to-point communications on a single interconnection level, our models can specifically handle the dynamic competition between concurrent intra-socket and inter-socket messages.
Apart from the postal model and the max-rate models, another body of work on communication modeling is the LogP family of models. This originated with the LogP model, which was first proposed by Culler et al. in [2]. The LogP model adds communication overhead as an additional parameter to better estimate the cost of sending and receiving messages, but it is still constrained to fixed-size, single messages and limited in accuracy for long messages. The latter was improved in the LogGP model [17], which extends the LogP model with support for linear modeling of long messages. Several other refinements also exist, including LogGPC [18], which adds parameters for network contention, LogGPS [19], which estimates synchronization costs in the communication library, and LogGPO [20], which characterizes the overlap between communication and computation. A comprehensive survey of the different models was provided in [21]. The main drawback of the LogP-family models is the adoption of hard-to-estimate parameters, such as network contention and synchronization costs, in addition to the lack of support for variable-sized messages and heterogeneous levels of latency and bandwidth. In comparison, instead of detailing the different terms contributing to the overall latency/overhead, which is the main modeling principle adopted by the LogP family, we let a micro-benchmark quantify the overall latency/overhead and focus instead on modeling the dynamic competition between variable-sized, concurrent messages.
The particular modeling challenges due to the wide use of shared-memory nodes in modern compute clusters, together with the resulting concurrent messages, were investigated in the τ-Lop model [22] and its extension [23]. Our work follows the same notion of concurrent transfers from [22] and the reasoning about the resulting contention among multiple MPI processes for the same bandwidth. However, our "staircase" modeling strategy for bandwidth contention is completely different from that used in [22].

VI. CONCLUSION
The basic assumption for the work presented in this paper is that concurrent MPI messages transmitted over the same connection of a communication network should fairly share the available aggregate bandwidth. Moreover, when the messages are of different sizes, the principle of fairness implies that the shortest message should complete first, followed by the second shortest one, and so on. Such a dynamically changing situation of competition will, at the same time, lead to a dynamic change of the available bandwidth, which may not follow a simple profile such as (3), as suggested by the max-rate model. This has inspired the "staircase" modeling strategy that is heavily used in our new performance models.
In reality, the actual completion sequence of the different messages may indeed not follow the fairness principle. Various stochastic features will lead to different sequences even when an MPI program is repeated several times on the same parallel system while using the same configuration. We consider it impractical to properly model such stochasticity. Instead, we adopt an "idealized" expectation based on fairness. Despite this simplification, numerical experiments reported in Sections III and IV have shown a consistently good agreement between the model predictions and the actual time measurements. In particular, a realistic case that involves 8192 MPI processes and 2.7 million concurrent messages (see Table IV) has achieved an average overall accuracy of 83.8%.
Overall, our "staircase" modeling strategy is manageably simple. The models from Sections III-D and III-E can be efficiently implemented, as shown in the appendix, available in the online supplemental material. As preparation before using our new models, the bi-directional, multi-pair micro-benchmark (Algorithm 1) needs to be executed for different MPI process counts and for three situations: within a CPU socket, between a pair of CPU sockets (with shared memory), and between a pair of nodes in a cluster. The measured values of τ and BW_MP can be tabulated once and for all. To reduce the preparation work, the τ and BW_MP values for the short and eager protocols can be skipped if the message sizes of interest are known to lie in the rendezvous regime. In this paper, the possible costs associated with packing and unpacking MPI messages are not included. Such extra costs arise if the outgoing messages are composed of data values that lie non-contiguously in the sender's memory, or if the values of the incoming messages need to be placed in designated, non-contiguous locations in the receiver's memory. There exist models, such as log_nP or log_3P [24], that include regularly strided memory accesses, which are the simplest operations for packing and unpacking messages. As future work, we will extend our models to cover the packing and unpacking costs, and we will address these from the angle of memory bandwidth contention. Specifically, we will consider the situation where a preceding computation produces the values before they are collected from irregularly non-contiguous memory locations and packed as outgoing messages. Moreover, the different MPI processes on the same node may carry out the "packing step" at different paces (e.g., some processes have less data to pack and can thus initiate the MPI commands earlier), such that message transfer and message packing by different MPI processes may compete for the same memory bandwidth. As another topic of future work, we will consider modeling overlaps between computation (including message packing and unpacking) and point-to-point MPI communication. Contention for the same memory bandwidth, see e.g. [25], [26], will also be central in this regard.
One of the main benefits of using our "staircase" modeling approach is the capability of pinpointing per-process time usage, thanks to an elaborate modeling of the bandwidth contention between messages of various sizes. Realistic experiments have shown that actual time usages vary greatly from process to process, so the fastest-finishing MPI processes may have considerable idle time. We plan to incorporate such modeling into future mesh partitioning strategies, with the goal of producing better partitioning results where the fastest-finishing processes do not have to wait long for the slowest processes to complete.