PerfSim: A Performance Simulator for Cloud Native Microservice Chains

The cloud native computing paradigm allows microservice-based applications to take advantage of cloud infrastructure in a scalable, reusable, and interoperable way. However, in a cloud native system, the vast number of configuration parameters and highly granular resource allocation policies can significantly impact performance and deployment cost. To understand and analyze these implications in an easy, quick, and cost-effective way, we present PerfSim, a discrete-event simulator for approximating and predicting the performance of cloud native service chains in user-defined scenarios. To this end, we propose a systematic approach for modeling the performance of microservice endpoint functions by collecting and analyzing their performance and network traces. Combining the extracted models with user-defined scenarios, PerfSim can simulate the performance behavior of all services over a given period and provide an approximation of system KPIs, such as the requests' average response time. Using the processing power of a single laptop, we evaluated both the simulation accuracy and speed of PerfSim in 104 prevalent scenarios and compared the simulation results with identical deployments on a real Kubernetes cluster. We achieved ~81-99% simulation accuracy in approximating the average response time of incoming requests and a ~16-1200x simulation speed-up factor.


INTRODUCTION
Cloud Native Computing is an emerging paradigm of distributed computing that "empower[s] organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid cloud" [1]. One of the key purposes of introducing this paradigm was to answer the increasing need to mitigate the effort of application-level clustering in the cloud, in line with the emergence of the microservice architecture, which promotes decoupling the components of a software system into multiple independently manageable services, known as microservices [2]. Since Google first introduced Kubernetes at the Google Developer Forum in 2014 as an approach for "decoupling of application containers from the details of the systems on which they run" [3], it has become the de-facto enabler of the microservice architecture based on technologies such as OS-level virtualization, better known as containers [4].
Amongst the main advantages of cloud native computing is the possibility of allocating highly granular resources to large-scale chains of services in a cluster. This additional granularity, while facilitating the scalability of service chaining, imposes complexity in resource allocation, traffic shaping, and placement of containers in a cluster. Therefore, as microservice networks grow larger, the need for tools and techniques that shed light on the implications of service chaining for system performance becomes critical. Furthermore, with the rising trend of cloud native computing in containerized cluster environments, many researchers are nurturing new methods and schemes for deployment [5] and performance optimization [6] at various levels: from modern container scheduling methods [7]-[10] to predictive request load-balancing [11] and resource auto-scaling algorithms [12], [13]. Analyzing the performance behavior of a service chain in a real testbed gives the most reliable results. However, evaluation in real clusters is not always possible. In many cases, it might be too costly, notably time-consuming, and sometimes various skill sets are required to configure, run, and manage the testbed. Moreover, most performance optimization techniques need to evaluate various scenarios in a timely manner to eventually minimize a cost function. Performing such evaluations in a real testbed, while providing accurate results, imposes a dramatic burden on achieving a scalable and efficient optimization method.
Moreover, modern cloud native distributed systems have important performance-affecting properties that earlier-generation software systems, such as monoliths (single-tiered software systems consisting of multiple tightly-coupled components), did not have to cope with to the same extent: agile horizontal/vertical scaling, highly granular resource allocation, context-awareness, contention with other services, service chaining, and dynamic load-balancing between replicas.
To mitigate the aforementioned implications of using real testbeds for the performance evaluation of cloud native microservice chains, we propose PerfSim, a simulation platform that aims to approximately predict the performance of service chains under various placement, resource allocation, and traffic scenarios using the profoundly limited resources of a laptop. We also propose a systematic performance modeling approach to model the time-predictable endpoint functions of microservices using performance traces generated by profiling and tracing tools such as perf and eBPF, as well as distributed network tracing systems such as Jaeger and Zipkin. Using these models and a user-defined scenario, PerfSim can then simulate the performance behavior of all service chains under a desired placement, resource allocation policy, network topology, and traffic scenario.
Using the profoundly limited resources of a laptop, we evaluated the simulation accuracy and speed of PerfSim under 104 prevalent scenarios by deploying and running them on a real Kubernetes cluster and comparing the measured KPIs (i.e., average request latency) with the simulation results. We used sfc-stress, a synthetic service chain generation toolkit, to generate various service chains and microservice-based workloads. In our evaluation, we achieved ∼81-99% accuracy in predicting the average latency of incoming requests and a ∼16-1200x speed-up factor between the simulation time and the actual execution time on a real cluster. With the same laptop, we also simulated a large service chain consisting of 100 microservices interconnected with 200 links over 100 hosts and showed that PerfSim can be effectively used for large-scale simulations.
To summarize, with PerfSim we contribute to the relatively new but rich area of cloud native computing by enabling a fast, accurate, and easy way to evaluate various user-defined policies in microservice-based applications.

RELATED WORKS
In this section, we briefly review existing works in the areas of (1) simulation platforms, (2) emulation tools, and (3) analytical performance modeling approaches. Table 1 presents a comparison between the key properties of the most popular frameworks and PerfSim. In this table, besides common features, we also compare the additional challenges imposed by simulating the performance of microservices in cloud native environments.

Simulation tools
Cloud native systems have intricate provisioning and deployment requirements. Evaluating and predicting the performance of such services, studying the impact of provisioning policies, and correlating workload models with achievable performance are not straightforward due to the diversity and complexity of system interactions. Even though studying these interactions on real testbeds provides the most accurate results, several researchers and companies argue that computer simulation can be a powerful method to test multiple scenarios and evaluate various policies before enforcing them at different levels [14]-[17]. In recent years, several works have aimed to predict services' performance using computer simulations in order to develop adequate resource policies and make decisions at different levels that meet the required Quality of Service (QoS) in a dynamic manner.
CloudSim [15] is one of the most popular simulators in this category; it is designed to simulate cloud computing infrastructures and allows modeling of various data centers, workload scheduling, and allocation policies. Using CloudSim in many scenarios can boost the development of innovative methods and algorithms in the cloud computing paradigm without the need to deploy applications in production environments. In recent years, several tools and modules have been designed based on CloudSim. For example, the iFogSim toolkit [18] inherits the features of CloudSim and extends them with the ability to model IoT and Edge/Fog environments. The authors in [19] also build upon CloudSim to simulate the specifics of edge/fog computing and support the required functionalities. There exist several other CloudSim-based simulators for simulating various use cases in fog or edge environments [20], [21].
CloudSim and the modules/plugins/tools based on it provide a sophisticated and straightforward way for researchers to simulate cloud/edge/fog computing infrastructure for modeling service brokers, data centers, and scheduling policies. However, they are not designed for rigorous performance testing of microservices under the extremely granular resource allocation and placement policies that exist for containerized services in cloud native applications. Moreover, to effectively simulate a set of service chains, we need to precisely specify the links and connections between the microservices within all chains, and then route the requests based on user-defined traffic scenarios and a network topology model, something the CloudSim category of simulators has not been designed to address.
In addition to CloudSim, there are other simulators in the context of cloud simulation. Yet Another Fog Simulator (YAFS) [22] is a discrete-event simulator based on SimPy that makes it possible to simulate the impact of application deployments in edge/fog computing environments through customizable strategies. YAFS models the relationships between applications, infrastructure configurations, and network topologies, and uses those relationships to predict network throughput and latency in dynamic and customized scenarios, such as path routing and service scheduling. Even though YAFS offers a novel approach to simulating the performance of large network topologies, it is not designed to simulate microservices' performance within the context of complex service chains over a set of hosts in a cluster.
Apart from cloud/edge/fog simulators, there exist other simulators that focus on specific aspects of the cloud computing paradigm, such as energy efficiency or network modeling and optimization. For example, GreenCloud [14] is a packet-level simulator for capturing the energy footprint of data center components, aimed at providing an environment for researchers to design more energy-aware data centers [23], [24]. Other examples are NS-3 [17], OMNeT++ [25], and NetSim [26], which are primarily designed to simulate various types of networks and topologies. Although these simulators can smoothly simulate the specific aspects of cloud systems they are designed for (e.g., networking and energy efficiency), they cannot be used to simulate cloud native applications' performance, or can at best simulate specific aspects of service chains, such as their network performance or placement efficiency.

Performance emulators
Another popular experimentation approach for evaluating the performance of cloud systems is emulation. Using emulation, users can analyze the system's performance behavior, supported by the available hardware, in a more realistic way.

(Table 1 compares the frameworks along the following properties: Type; Target Environment; Granular Resource Allocation; Vertical Scaling; Horizontal Scaling; CPU Scheduling; Multi-threaded Endpoints; Contention Modeling; Fast Performance Prediction; PTE-Independent/Low Cost; Service Chains Support; Multi-chain Support; Dynamic Control; Load-balancing; Congestion Control; Advanced Routing.)

After Mininet introduced the idea of using containers for network and process emulation, other works started to adopt a similar concept. For example, Dockemu [33] adopted Docker for emulating network nodes and NS-3 for simulating the network traffic. NEaaS [34], a cloud-based network emulation platform, utilizes both Docker and virtual machines to emulate various networking scenarios.
However, the widespread popularity of such emulators stands in stark contrast to the fact that they are bound to the available computational power and bandwidth of the underlying hardware they are deployed on; consequently, they cannot be efficiently used to predict the resource utilization of large-scale and complex cloud native applications. Moreover, emulation cannot dramatically improve the speed of evaluating various resource allocation or placement policies, and therefore cannot be used for exhaustive policy testing.

Analytical performance modeling approaches
There is a third category of approaches to modeling the performance of cloud native applications, based upon an analytical framework. Contrary to simulation/emulation techniques, which employ a bottom-up approach to modeling and simulating various performance aspects of cloud applications, analytical methods adopt a top-down approach: they collect and analyze application KPIs by running various stress tests on the system or by using historical performance measurements.
As an example, the work proposed in [35] aims to model the response time of microservices by stress testing while concurrently collecting performance traces over predefined intervals to learn a predictive auto-scaling model, such that the response time requirements are satisfied. The collected traces are used to learn the resource provisioning policy model via regression analysis. In [36], microservices are tested individually using service-based sandboxing to construct the corresponding model; the analyzed data and performance model are then presented to the user. Regarding the modeling of throughput and latency, in our previous work [37] we used a combination of stress testing and regression modeling to understand the impact of microservices' resource configurations, modeling the correlation between the KPIs and resource configurations in a smaller Performance Testing Environment (PTE) to predict the performance in a larger Production Environment (PE).
Although the aforementioned methods benefit from accurate performance measurements on real systems and can be considered the most widely used techniques for performance engineering in production environments, they suffer from three major limitations. The first is the high cost of preparing a test environment that can provide accurate performance measurements: to gather accurate performance insights, the tests need to be performed either on the production environment or on a PTE that mimics it, which imposes a high deployment cost in either case. The second is the slow stress-testing procedure, as obtaining accurate results requires several hours of performance testing. The third is their limited applicability to advanced optimization techniques, such as deep learning or metaheuristic-based optimization, which require numerous re-deployments to learn, train, or find an optimal policy.

PERFSIM DESIGN AND IMPLEMENTATION
In this section, we introduce the system architecture and implementation details of PerfSim, as well as the mathematical notation used in this paper (summarized in Table 2).

Modeling elements of cloud native systems
PerfSim has models for various elements of cloud native systems and their underlying infrastructure.
As described in the previous sections, the joint problem of placement and resource allocation of a service chain over a cluster is complex, as multiple dimensions impact the achievable performance and latency. However, several parameters have only a marginal impact on the prediction quality. Consequently, we aim at providing a simplified model that is tractable, yet powerful enough to predict the performance of typical cloud native service chain architectures when deployed on well-known container orchestration platforms such as Kubernetes.

(Table 2 summarizes the notation; its Services and Endpoints group includes: the set of all services; the set of resource control parameters r_S ∈ R_S; the set of replicas of S_i; the initial/current r_S capacity of ŝ_ij, V̂_S(r_S, S_i) and V_S(r_S, ŝ_ij); the host Π(ŝ_ij) ∈ H on which ŝ_ij is currently placed; the set of endpoint functions of S_i; and the set of service chains.)

Hosts
A host h_k is a model of a Linux-based physical machine that has h_k^cores CPU cores with h_k^clock clock speed (in Hertz). It also has a set of consumable resources, such as a memory capacity (in bytes), a NIC with limited ingress and egress network bandwidth (in bytes per second), and a local storage with limited read/write bandwidth and storage capacity. The CPU resource is measured in millicores: each host h_k introspects the OS to determine h_k^cores and then multiplies it by 1000 to denote its total capacity (Eq. 1).
To facilitate our formulation, we denote hosts' consumable resources as R_H = {millicores, mem, in_bw, out_bw, blkio_bw, blkio_size}. For each resource r_H ∈ R_H, a host h_k has an initial resource capacity denoted as V̂_H(r_H, h_k) as well as a current capacity denoted as V_H(r_H, h_k).
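As an illustration of this host model (the class and attribute names below are our own choices, not PerfSim's actual API), the millicore capacity of Eq. 1 and the initial/current resource capacities can be sketched as:

```python
# Illustrative sketch of the host model; names are hypothetical, not PerfSim's API.
class Host:
    def __init__(self, cores, clock_hz, mem, in_bw, out_bw, blkio_bw, blkio_size):
        self.cores = cores
        self.clock_hz = clock_hz
        # Initial capacities V̂_H(r_H, h_k) for each consumable resource r_H ∈ R_H.
        self.initial = {
            "millicores": cores * 1000,  # Eq. 1: total CPU capacity in millicores
            "mem": mem,
            "in_bw": in_bw,
            "out_bw": out_bw,
            "blkio_bw": blkio_bw,
            "blkio_size": blkio_size,
        }
        # Current capacities V_H(r_H, h_k) start equal to the initial ones
        # and shrink as replicas are placed.
        self.current = dict(self.initial)

h = Host(cores=8, clock_hz=3_000_000_000, mem=32 * 2**30,
         in_bw=10**9, out_bw=10**9, blkio_bw=500 * 10**6, blkio_size=10**12)
print(h.initial["millicores"])  # 8000
```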

Network topology
A network topology, denoted as τ, has a directed acyclic graph representation G(τ) = (P, τL) with routers ρ_z ∈ P as nodes and directed links τl_o = (ρ_i, ρ_j) ∈ τL between them as edges. To model the egress and ingress bandwidth of each interconnection separately, we hypothetically assume there are always two links with opposite directions between each pair of routers. We denote the maximum ingress/egress bandwidth of a router as ρ_z^in_bw and ρ_z^out_bw. Hosts are attached to the topology using hypothetically separated egress/ingress directed links, denoted as Hl_o ∈ HL, which form ordered host→router (h_k, ρ_i) and router→host (ρ_i, h_k) pairs.
Each router ρ_z and link l_o ∈ L may respectively add an extra latency of ρ_z^lat and l_o^lat nanoseconds to each request. These additional latencies differ from the delay caused by network congestion, which we model separately.
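A minimal sketch of this topology model (class and method names are our illustration, not PerfSim's API) with separate directed links per direction and additive router/link latencies:

```python
# Hypothetical sketch of the topology model: routers as nodes, and a pair of
# opposite directed links between each connected pair so that egress and
# ingress bandwidth can be modeled separately.
class Topology:
    def __init__(self):
        self.routers = {}  # name -> {"in_bw": Bps, "out_bw": Bps, "lat": ns}
        self.links = {}    # (src, dst) -> {"bw": Bps, "lat": ns}

    def add_router(self, name, in_bw, out_bw, lat=0):
        self.routers[name] = {"in_bw": in_bw, "out_bw": out_bw, "lat": lat}

    def connect(self, a, b, bw, lat=0):
        # Two directed links with opposite directions between each pair.
        self.links[(a, b)] = {"bw": bw, "lat": lat}
        self.links[(b, a)] = {"bw": bw, "lat": lat}

    def path_latency(self, path):
        """Sum of router and link latencies along a path of router names (ns)."""
        total = sum(self.routers[r]["lat"] for r in path)
        total += sum(self.links[(a, b)]["lat"] for a, b in zip(path, path[1:]))
        return total

t = Topology()
t.add_router("r1", 10**9, 10**9, lat=500)
t.add_router("r2", 10**9, 10**9, lat=500)
t.connect("r1", "r2", bw=10**9, lat=1000)
print(t.path_latency(["r1", "r2"]))  # 500 + 500 + 1000 = 2000
```

The congestion-induced delay described in the text would be computed on top of these fixed latencies.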

Services
We assume that each chain of services is composed of a set of |S| containerized services S = {S_i}_{i=1}^{|S|}, such that each S_i ∈ S may have |S_i| single-process/multi-threaded replicas {ŝ_ij}_{j=1}^{|S_i|}, load-balanced among |H| hosts and connected through |P| routers. A given service S_i has a set of endpoint functions F_i = {f_in}_{n=1}^{|F_i|} that can be executed based on the type of incoming request. Each endpoint function f_in may spawn a set of threads t_m ∈ f_in. We explain the properties of these threads in section 3.1.10.

CPU scheduler and resource controller
A service S_i may optionally have resource constraints on its containerized replicas. We denote this set of resources as R_S = {CPU_requests, mem_requests, in_bw, out_bw, blkio_bw, blkio_size}, which corresponds to the resources R_H in each host (R: R_S → R_H). By default, there is no reservation or usage limitation on any resource.
In the Linux kernel, a service replica takes the form of a container and is managed by Control Groups (cgroups) [38], which are responsible for auditing and restricting a set of processes. The limits associated with cgroups isolate the resource usage of a collection of processes.
We partially modeled the behaviour of cgroups in PerfSim by allowing a service S_i to initially reserve V̂_S(r_S, S_i) resource capacities for each of its replicas. When reserving CPU requests V̂_S(CPU_requests, S_i), PerfSim allows a replica to use as many CPU units as are available, even more than its reserved quota. We also partially modeled the behaviour of the CPU bandwidth controller [39] of Linux by allowing the definition of S_i^CPU_limits (in millicores) as an upper bound on a replica's CPU consumption. At the kernel level, the CPU request and limit are translated into the cgroups cpu.shares, cpu.cfs_quota_us (in microseconds), and cpu.cfs_period_us (in microseconds) control parameters. We denote them as S_i^CPU_share, S_i^CPU_quota, and S_i^CPU_period. However, for simplicity, we assume a fixed value of S_i^CPU_period = 100ms in our calculations and only consider the CPU share and quota when estimating the available resources for each replica. A usage example for this model is simulating the Kubernetes Best Effort and Guaranteed QoS classes for CPU resources, which we cover in section 5.2.
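The translation from millicore requests/limits into these cgroup parameters can be sketched as follows. This mirrors how Kubernetes derives cpu.shares and cpu.cfs_quota_us; the helper function itself is our illustration, not PerfSim code:

```python
CFS_PERIOD_US = 100_000  # fixed S_i^CPU_period = 100 ms, as assumed in the text

def cpu_cgroup_params(cpu_requests_millicores, cpu_limits_millicores=None):
    """Illustrative translation of millicore requests/limits into cgroup
    control parameters, mirroring how Kubernetes derives them."""
    # cpu.shares: relative weight; 1000 millicores correspond to 1024 shares.
    shares = max(2, cpu_requests_millicores * 1024 // 1000)
    # cpu.cfs_quota_us: hard cap per period; -1 means no upper bound.
    if cpu_limits_millicores is None:
        quota = -1
    else:
        quota = cpu_limits_millicores * CFS_PERIOD_US // 1000
    return {"cpu.shares": shares,
            "cpu.cfs_quota_us": quota,
            "cpu.cfs_period_us": CFS_PERIOD_US}

# A 500m request with a 1500m limit: 512 shares, 150000 us quota per 100 ms period.
print(cpu_cgroup_params(500, 1500))
```

A replica with a request but no limit (quota = -1) corresponds to the case where PerfSim lets it consume more CPU than its reserved quota whenever capacity is available.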
V̂_S(mem_requests, S_i) specifies the initial memory allocation of the replicas of S_i (in bytes). We modelled the memory.limit_in_bytes control parameter of cgroups in a way that it may only affect the placement of replicas.
To control traffic shaping and the maximum network bandwidth of the replicas in S_i, the V̂_S(in_bw, S_i) and V̂_S(out_bw, S_i) control parameters can be used (in Bps).
The storage size of replicas in PerfSim is controlled by a model of the blkio cgroups controller and denoted as V̂_S(blkio_size, S_i). We also denote the available bandwidth for blkio reads and writes as V̂_S(blkio_bw, S_i). We intended to model and simulate the disk bandwidth throttling feature of cgroups, blkio.throttle.write_bps_device.

Affinity controller
A service may optionally have a set of in-service affinity/anti-affinity rules for its placement. We define two binary decision vectors A_Si and Ā_Si that respectively hold all the affinity and anti-affinity rules related to service S_i. Each vector has a dimension of |S| × 1, where each row represents one of the services. An entry A_Si[S_j] takes the value '1' if the replicas of services S_i and S_j must be placed on the same host and '0' otherwise; similarly, an entry Ā_Si[S_j] takes the value '1' if no replicas of the two services may be placed on the same host and '0' otherwise. It is also possible to define host affinities, but due to the page limitation we omit the formal definition.
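A sketch of how these binary vectors constrain a candidate host during placement (we use dict-of-dicts in place of the |S| × 1 vectors; the helper name and arguments are our own illustration):

```python
# Illustrative check of the affinity vectors A_Si and Ā_Si from the text.
def satisfies_affinity(service, candidate_host, placement, affinity, anti_affinity):
    """placement: host -> set of services already placed there."""
    on_host = placement.get(candidate_host, set())
    # Anti-affinity: Ā_Si[S_j] = 1 forbids sharing a host with S_j.
    if any(anti_affinity.get(service, {}).get(s, 0) for s in on_host):
        return False
    # Affinity: A_Si[S_j] = 1 requires colocation with S_j once S_j is placed.
    for partner, required in affinity.get(service, {}).items():
        if required:
            placed_hosts = {h for h, ss in placement.items() if partner in ss}
            if placed_hosts and candidate_host not in placed_hosts:
                return False
    return True

placement = {"h1": {"S2"}, "h2": {"S3"}}
affinity = {"S1": {"S2": 1}}        # S1 must be colocated with S2
anti_affinity = {"S1": {"S3": 1}}   # S1 must not share a host with S3
print(satisfies_affinity("S1", "h1", placement, affinity, anti_affinity))  # True
print(satisfies_affinity("S1", "h2", placement, affinity, anti_affinity))  # False
```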

Placement algorithm
The placement of each ŝ_ij among all hosts takes place using a placement algorithm. By default, we implemented the Least Allocated bin packing algorithm, which partially simulates the container scheduling mechanism of Kubernetes. This algorithm favors hosts with fewer resource requests and is controlled by a weight for each resource r_H ∈ R_H, denoted as W_rH, used for scoring nodes. We describe the details of this algorithm in the next section.

Container scheduler
We use a binary matrix Π of size (Σ_{i=1}^{|S|} |S_i|) × |H| to keep track of service replica placement. In this matrix, an entry p_{j,k} ∈ Π equals 1 if the replica ŝ_j is placed on host h_k and 0 otherwise. Since each replica can only be placed on one host, Π(ŝ_ij) ∈ H indicates the host on which replica ŝ_ij is placed. After placing each replica, the amount of available resources V_H(r_H, h_k) on each host is recalculated for all resources.

Service chains
The goal of services is to serve requesters through a set of service chains C = {C_l}_{l=1}^{|C|}. These service chains represent the connections and traffic flow between the endpoint functions of services and are specified as a directed flow network G(C_l) = (lF, lE), where lF is the subset of endpoint functions that have a role in providing service chain C_l, and lE is the subset of all virtual links belonging to C_l. In other words, all endpoint functions within C_l are connected through |lE| ordered directed virtual links, each annotated with a payload size in bytes, which indicates the request size flowing from the edge's first vertex to the second.
The first node in C_l is its only request entry point, known as the source node; it receives requests from potential request senders, and the cluster load-balancer directs the traffic to a replica ŝ_ij based on a desired load-balancing algorithm (the default is Round-robin). A service chain also has one or more sink nodes with no immediate outgoing edges.
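The default Round-robin load balancing at the entry point can be sketched in a few lines (the function name is ours; PerfSim's actual interface may differ):

```python
# Minimal sketch of Round-robin load balancing: each incoming request at the
# source node is directed to one of the |S_i| replicas in turn.
import itertools

def round_robin(replicas):
    """Return an infinite iterator cycling over the replica names."""
    return itertools.cycle(replicas)

lb = round_robin(["s1_1", "s1_2", "s1_3"])
print([next(lb) for _ in range(5)])  # ['s1_1', 's1_2', 's1_3', 's1_1', 's1_2']
```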

Traffic control
Assume a service chain C_l is active for C_l^duration seconds. The traffic rate of each C_l is controlled with two parameters, C_l^rate and C_l^batch: every 1/C_l^rate seconds, a batch of C_l^batch requests arrives at service chain C_l. In other words, each service chain C_l may be requested by a plurality of C_l^batch users with an average arrival rate of C_l^rate req/s. Therefore, the number of requests |lU| for service chain C_l over a period of C_l^duration seconds equals |lU| = C_l^duration × C_l^rate × C_l^batch.
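Under our reading of this traffic model (the function is an illustrative sketch, not PerfSim code), the arrival times and total request count follow directly:

```python
# Sketch of the traffic model: every 1/C_rate seconds a batch of C_batch
# requests arrives, so over C_duration seconds the chain receives
# |U| = C_duration * C_rate * C_batch requests.
def arrival_times(rate, batch, duration):
    """Return the arrival time (s) of every request over the active period."""
    times = []
    n_batches = int(duration * rate)
    for b in range(n_batches):
        times.extend([b / rate] * batch)  # the whole batch arrives together
    return times

reqs = arrival_times(rate=10, batch=5, duration=2)  # 20 batches of 5 requests
print(len(reqs))  # 2 * 10 * 5 = 100
```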

Service endpoint functions and associated threads
During the life cycle of C_l, each endpoint function lf_n ∈ lF spawns |lf_n| threads {t_m}_{m=1}^{|lf_n|} on one of its replicas. Since each replica is placed on a host, we denote the subset of all threads running on a host h_k as f̂_hk.
Each thread t_m ∈ lf_n has a set of properties that are specified in the original model and extracted from measurements of performance traces and monitoring tools during the modeling phase, which we explain in section 3.2.
A thread t_m executes t_m^inst CPU instructions, of which t_m^maccs are due to memory accesses, with an average of t_m^CPI CPU Cycles per Instruction (CPI). Moreover, t_m has a total of t_m^crefs cache references of which, when ŝ_ij is deployed on an isolated, single-core reference machine, t_m^cmiss will be cache misses with an average miss penalty of t_m^cpenalty cycles. During the simulation, PerfSim dynamically re-calculates cache misses based on various factors, such as co-located active threads, the number of cores, CPU requests, and cache size.
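From these profiled counters, a back-of-the-envelope estimate of a thread's on-CPU time follows the usual cycles-per-instruction accounting. The formula below is our simplification (PerfSim additionally re-computes cache misses dynamically under contention):

```python
# Estimate a thread's on-CPU time from the profiled counters of the text:
# cycles = inst * CPI + cache_misses * miss_penalty, time = cycles / clock.
def thread_cpu_time_ns(inst, cpi, cmiss, cpenalty, clock_hz):
    cycles = inst * cpi + cmiss * cpenalty  # execution + miss-penalty cycles
    return cycles / clock_hz * 1e9

# 1e9 instructions at CPI 1.2, 2e6 misses costing 100 cycles each, on a 3 GHz core:
t_ns = thread_cpu_time_ns(inst=1_000_000_000, cpi=1.2, cmiss=2_000_000,
                          cpenalty=100, clock_hz=3_000_000_000)
print(round(t_ns / 1e6, 1))  # 466.7 (ms)
```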
A thread may also read/write a total of t_m^blk_rw bytes of data from/to the storage device. Moreover, a thread might be idle for some time during its active life cycle; we denote the accumulated idle time of t_m as t_m^idle. On a real Linux machine, threads are load-balanced among all CPU runqueues in the host by Linux's Completely Fair Scheduler (CFS). We implemented a simplified version of CFS in PerfSim to imitate the impact of the thread scheduling mechanism on performance. We describe the details of our implementation in section 5.
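The core idea that such a simplified CFS captures can be illustrated in a toy form: always run the runnable thread with the smallest virtual runtime, and advance its vruntime inversely to its weight. This is our minimal illustration of the CFS principle, not PerfSim's actual scheduler code:

```python
# Toy CFS: pick the thread with the smallest vruntime each time slice;
# a higher weight (cpu.shares) makes vruntime grow slower -> more CPU time.
import heapq

def cfs_schedule(threads, slice_ns, total_ns):
    """threads: dict name -> weight. Returns on-CPU time per thread (ns)."""
    heap = [(0.0, name) for name in threads]  # (vruntime, thread name)
    heapq.heapify(heap)
    runtime = {name: 0 for name in threads}
    elapsed = 0
    while elapsed < total_ns:
        vruntime, name = heapq.heappop(heap)   # smallest vruntime runs next
        runtime[name] += slice_ns
        elapsed += slice_ns
        heapq.heappush(heap, (vruntime + slice_ns / threads[name], name))
    return runtime

# A thread with twice the weight receives roughly twice the CPU time:
print(cfs_schedule({"a": 2048, "b": 1024}, slice_ns=1_000_000, total_ns=30_000_000))
```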

Modeling process
In order to accurately simulate the performance of service chains in a given scenario, PerfSim requires performance models of every active endpoint function within the running services of the chain. To introduce such performance models into PerfSim, the user can either provide a pre-defined model file or run a new modelling procedure. For the latter case, we propose a systematic performance testing and modeling approach to extract the KPIs of endpoint functions, construct the aforementioned modeling elements, and identify each equipment's resource capacities in the simulation. Once such performance models are extracted for an endpoint function, they can be reused for simulating various types of user-defined scenarios without any need for a new profiling phase.
As presented in Figure 1, the modeling process starts by placing the microservices of all service chains on a set of reference hosts in the Performance Testing Environment (PTE). Then, one by one, each microservice is isolated on a single host and each of its endpoint functions is stressed by flowing a single request through the function; in parallel, an automated script captures various performance traces and resource utilization measurements and stores the correlated traces in a separate database.
While running the network traffic, various network tracing tools can be used to identify the connections between microservices and the service chains within the cloud native system. For example, a service mesh layer based on Istio [40] enables distributed tracing capabilities through Envoy [41], which "allows developers to obtain visualizations of call flows in large service oriented architectures" by taking advantage of tracing tools such as LightStep [42], Zipkin [43], and Jaeger [44]. Leveraging a service mesh layer allows PerfSim to extract service chain flow models, which facilitates the modeling procedure. However, these models can also be provided manually via PerfSim's JSON model files or dynamic control objects.
After completing the profiling phase for all endpoint functions, a similar procedure can optionally be repeated for all target network equipment, such as routers and NICs, using network performance measurement tools such as iperf. The intention is to separately measure the available egress/ingress bandwidth of each piece of network equipment and the link latencies in order to effectively simulate network congestion.
Finally, based on these extracted models, various user-defined scenarios can be specified by the user. These scenarios include resource allocation policies, affinity rule-sets, placement configurations and algorithms, the equipment used in a cluster (i.e., active hosts and routers), service chains, incoming traffic scenarios, and the underlying network topology. To start simulating a given scenario, all the aforementioned models and simulation parameters can be fed into PerfSim using a JSON model file or a Python script (i.e., a dynamic control object). We will cover the details of defining a scenario in section 3.3.1.

System Architecture

Figure 2 presents the layered architecture of PerfSim. We employed a multi-tier, object-oriented approach to architect PerfSim: (1) the presentation layer (client), (2) the business layer (PerfSim's core), and (3) the data layer. In the presentation layer, the PerfSim user defines the simulation scenarios and/or provides the control object. The user can provide the scenario in 3 general ways:

• Using a JSON file for static simulations
• Using a perfsim.Cluster object for dynamic simulations
• Using the GUI for quick experimentation

A scenario's structure in PerfSim is as follows:
1) Prototypes define the models of all microservices, endpoint functions, threads, hosts, routers, links and traffic.
2) Equipments define the hosts and routers that are going to be used in the cluster scenarios. These equipments should be created based on the defined "prototypes".

5) Resource allocation scenarios consist of resource policy templates that can later be assigned to each replica in a cluster scenario.
Affinity rule-sets define the affinity/anti-affinity ruleset scenarios that can be used in the cluster scenario.
"affinity_rulesets": {∀ affinity ruleset → "{affinity_set_name}": ...

Listing 7: Rulesets containing affinity/anti-affinity rules

8) Cluster scenario defines the ultimate cluster scenarios and consists of a combination of all the aforementioned sections, as well as additional configuration to define the number of replicas for each service and network timeouts.
"cluster_scenarios": {∀ cluster scenario → "{scenario_name}": {
    "service_chains": {∀ C_l ∈ C → "C_l": {
        "traffic_type": "{traffic_type_name}",
        "nodes_settings": {∀ S_i ∈ S → "S_i": {
            "replica_count": |S_i|,
            "res_scenario": "{res_scenario_name}"}}}},
    "placement_scenario": "{placement_scenario_name}",
    "topology": "{topology_name}"}}

Listing 8: The ultimate simulation scenario object

Similar to static JSON-based scenarios, a user can define the ScenarioManager object in pure Python and provide the control object to PerfSim to start the simulation. As shown in Listing 1, since the JSON-based scenario is human-readable, a PerfSim user can easily modify scenarios and/or add new ones as needed. Moreover, a user can define and simulate various types of scenarios without any need for performance trace logs or profiling data. The aforementioned scenarios can be provided based on either the extracted performance models mentioned in the previous section or user-defined models. A PerfSim user can then retrieve simulated insights into the final performance of the system in the given scenario.
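Assuming the schema of Listing 8, a concrete cluster scenario could look like the following Python dict before serialization to JSON. All the specific names (services, scenario, traffic type, topology) are invented for illustration:

```python
import json

# Hypothetical concrete instance of the Listing 8 schema; every name here
# is an invented example, not taken from PerfSim's distribution.
scenario = {
    "cluster_scenarios": {
        "baseline": {
            "service_chains": {
                "C1": {
                    "traffic_type": "steady_10rps",
                    "nodes_settings": {
                        "S1": {"replica_count": 2, "res_scenario": "guaranteed_small"},
                        "S2": {"replica_count": 3, "res_scenario": "best_effort"},
                    },
                }
            },
            "placement_scenario": "least_allocated",
            "topology": "two_rack_tree",
        }
    }
}
print(json.dumps(scenario, indent=2)[:60])  # human-readable, hand-editable
```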

Business and data layers
In the business layer, PerfSim creates all the necessary objects to perform the simulation. The details of modeling and simulation in PerfSim are covered in the next section.
After performing the simulation, the extracting performance simulation resutls will be saved in a database. The user can choose between saving the result in (1) a JSON file or (2) store it in a database (e.g., neptune.ai or MySQL).

PLACEMENT, CHAINING AND ROUTING OF SER-VICES OVER THE CLUSTER
In PerfSim, the first step towards approximating the performance of service chains in a user-defined scenario is to place replicas on a cluster of hosts. After the placement, PerfSim simulates the traffic scenario by first processing the defined service chain flow graphs, identifying parallel subchains, and extracting the exact route between service replicas based on the given network topology τ .

Placement of service replicas on the cluster
By default, we implemented a simplified version of the Least Allocated bin packing algorithm for the placement of service replicas, partially imitating the Kubernetes scheduler's placement strategy. The key idea behind this strategy is to ensure that replicas are placed on hosts with adequate resources while balancing the resource utilization of hosts. It consists of four steps: (1) enqueuing replicas, (2) filtering hosts, (3) scoring hosts, and (4) placing the replica on a host. All these steps are managed by a schedule procedure. Algorithm 1 (Simplified placement of service replicas among several hosts in a cluster) presents each of these steps, and Figure 3 shows an example of the resulting placement.

Enqueuing replicas
Replicas are placed one after the other. When a service requests the scheduling of a replica on the cluster, the scheduler adds that request to a queue (denoted as Q in Algorithm 1) and then attempts to find a suitable host for placing it.

Filtering hosts
Considering all affinity/anti-affinity rules as well as the resource constraints of the service replica, the scheduler filters the hosts, keeping only those that satisfy the limitations specified by the service. The filtered hosts, denoted as H_ŝij, are potentially eligible to host the replica ŝ_ij ∈ S_i.

Scoring hosts
In the Least Allocated bin packing strategy, in order to fairly distribute a replica ŝ_ij among several hosts in a cluster, all eligible hosts in H_ŝij are scored based on the request-to-capacity ratio of the primary resources r_S ∈ R_S and r_H ∈ R_H, considering the weight of each resource, denoted as W_rH. The score ψ^hk_ŝij of each host h_k for a replica ŝ_ij lies between ψ_min and ψ_max and is calculated as follows:

Placement of replicas
After calculating the scores of all hosts eligible to host the replica ŝ_ij, the host h*_ŝij with the lowest score is selected:
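The four scheduling steps above can be sketched in Python. This is a minimal illustration of a least-allocated strategy, not PerfSim's actual implementation: the score formula (a weighted request-to-capacity ratio, where the lowest score wins per the selection rule above) and all names are our assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    capacity: dict                          # e.g. {"millicores": 4000, "mem": 16000}
    allocated: dict = field(default_factory=dict)

def filter_hosts(hosts, request):
    """Filtering: keep only hosts with enough free capacity for every resource."""
    def fits(h):
        return all(h.capacity[r] - h.allocated.get(r, 0) >= q
                   for r, q in request.items())
    return [h for h in hosts if fits(h)]

def score(host, request, weights):
    """Scoring (assumed form): weighted request-to-capacity ratio after placement."""
    return sum(weights[r] * (host.allocated.get(r, 0) + q) / host.capacity[r]
               for r, q in request.items())

def place(hosts, request, weights):
    """Placement: pick the eligible host with the lowest score and allocate."""
    eligible = filter_hosts(hosts, request)
    if not eligible:
        raise RuntimeError("no eligible host")
    best = min(eligible, key=lambda h: score(h, request, weights))
    for r, q in request.items():
        best.allocated[r] = best.allocated.get(r, 0) + q
    return best
```

Placing two identical replicas on two empty hosts lands the second one on the other host, since the first host's score increases after the first placement.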

Service chain flow graphs
A network flow graph G(C_l) = (lS, lE) of a service chain C_l is a directed ordered graph that can have any form, from a simple tree-like service chain to a complex multi-degree cyclic multigraph. To efficiently simulate these service chains, we should be able to estimate the processing time of each service replica lŝ_ij ∈ lS_i ∈ lS ⊆ S, and then, considering the outgoing request payload and the network congestion, estimate the network transfer time between two services (lS_i, lS_i') ∈ lE. To accomplish these tasks, we should first identify the exact execution order of service requests. In any given service chain C_l, there might be subchains that run in parallel, and their execution order depends entirely on their execution times; these, in turn, depend on a vast range of parameters, from the processing power of the host to the resources allocated to each service.
To identify all subchains within a service chain flow graph G(C_l), we first form an alternative graph G'(C_l) = (lS', lE', lF') from the original graph G(C_l) in which every service lS_i ∈ lS is visited only once. We form G'(C_l) by duplicating the services in lS that are visited more than once; in other words, we duplicate service nodes with an indegree of δ⁻(lS_i) ≥ 2 and reroute the connecting edges to the newly generated nodes. Then, based on the newly formed G'(C_l), we identify all n_G'(C_l) subchains in C_l, denoted as {lc_x}^{n_G'(C_l)}_{x=1}.

Each subchain lc_x has a flow subgraph G'(lc_x) = (lS'_x, lE'_x, lF'_x) with a source service, which is the first service of the subgraph, and a sink service, which is the last. In case G'(lc_x) has only one service, that single service is both the source and the sink node. We form each subgraph G'(lc_x) of the subchain lc_x by first identifying its source service. Starting from the very first source service lS_1 ∈ lS', which initiates the first subchain lc_1, whenever a service lS_i ∈ lS' has an outdegree of δ⁺(lS_i) ≥ 2, meaning it has more than one outgoing edge {e_ly = (lS_i, lS_x)} ⊆ lE', all immediate successor services lS_x are marked as source services and each initiates a new subchain. A subchain ends where we meet a service node with an outdegree of δ⁺(lS_y) = 1. A request lu_o ends when all sink service nodes conclude their executions.
As an example, Figure 4 demonstrates all the aforementioned steps. In subfigure (a), we see an example flow graph G(C_l) of service chain C_l with all its services lS (circles) and their constituent service replicas (cubes), together with all edges lE connecting the services (arrows), with the source service node S_l1. In this graph, node S_l3 initiates two parallel subchains and has an indegree of δ⁻(S_l3) = 2. Also, S_l6 forms a cycle with S_l7, with an indegree of δ⁻(S_l6) = 3. In subfigure (b), we form the alternative graph G'(C_l) by duplicating S_l3 and S_l6 and rerouting the connected edges e_l,6, e_l,8, and e_l,10. In subfigure (c), we identify all subchains.
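The node-duplication step can be illustrated with a small Python sketch. This is a simplified reading: only the incoming edges of shared nodes are rewired to fresh copies, while the full procedure in the paper also reroutes outgoing edges; node names and the copy-naming scheme are hypothetical.

```python
from collections import defaultdict

def duplicate_shared_nodes(edges):
    """Form the alternative graph: every node with indegree >= 2 is split into
    one copy per incoming edge, so each service node is entered only once."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    copy_no = defaultdict(int)
    rewired = []
    for u, v in edges:
        if indeg[v] >= 2:                    # shared node: give this edge its own copy
            copy_no[v] += 1
            rewired.append((u, f"{v}.{copy_no[v]}"))
        else:
            rewired.append((u, v))
    return rewired
```

For example, a diamond where S1 and S2 both feed S3 yields two copies S3.1 and S3.2, one per incoming edge.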

Routing requests
When a subchain lc_x of a service chain C_l receives a user request lu_o ∈ lU, then, based on G'(lc_x), the request is routed to one of the service replicas of the source service lS_x1 ∈ lS_x ⊆ lS ⊆ S, chosen by a round-robin load-balancing algorithm (neglecting the possibility of session affinity). If the chosen replica ŝ_1j ∈ lS_x1 is placed in the same host as the requester's service replica, or if lS_x1 is the source service S_l1 ∈ lS of the service chain C_l, then, assuming lS_x1 is not a sink service, after the request is processed in ŝ_1j it is simply routed to the next replica inside the same host. But in case the next replica is not in the same host, it needs to be routed to the destination replica based on the network topology graph G(τ). Considering that G(τ) is a directed acyclic graph and that there should be one and only one active path carrying information from a source to a destination in the network, we extract the path between all pairs of hosts and form a set of |H|·|H| paths, where P_ha,hb is the set of routers between h_a and h_b that are connected using the links L_ha,hb. As an example, in Figure 3 we have eleven hosts H = {h_1, ..., h_11} that are connected using six routers P = {ρ_1, ..., ρ_6} with a tree topology τ and network topology graph G(τ). Figure 5 represents the path from ŝ_11 to ŝ_21 based on the network topology τ represented in Figure 3.
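Since there is exactly one active path between any pair of hosts, the path can be recovered with a plain breadth-first search over the topology graph; a minimal sketch (the adjacency map and node names are hypothetical):

```python
from collections import deque

def path_between(adj, src, dst):
    """Unique path between two hosts in the topology graph G(tau).
    `adj` maps each node (host or router) to its list of neighbours."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in prev:              # first visit fixes the predecessor
                prev[nxt] = node
                queue.append(nxt)
    # Walk predecessors back from the destination to rebuild the path.
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]
```

For a chain host–router–router–host, this returns the full ordered sequence of hops, from which the routers P_ha,hb and links L_ha,hb can be read off directly.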

APPROXIMATING EXECUTION TIME OF THREADS IN A MULTI-CORE HOST
The next step after identifying and deploying the service chains and network routes is to estimate the execution time of each endpoint function lf_in ∈ lF_l ⊆ lF when a request arrives alongside other running processes in a host. In each cluster scenario, there exists a set of service replicas, each capable of running a set of endpoint functions. When a request arrives, an endpoint function may spawn a few threads during its execution, and each of these threads, when load-balanced onto a core's runqueue, may perform different tasks. To be precise, a thread may perform one of the following tasks at any given time:
• Execute a CPU-intensive task on the processor
• Read/write bytes of data from/to memory/cache
• Read/write a file from/to storage
• Send/receive packets of information over the network
• Be in idle mode

Additionally, the other threads running inside a host have a direct influence on the CPU time of an application because of the CPU scheduling mechanism in the operating system. For instance, Linux's CFS load-balances threads over dozens of CPU runqueues based on their load, CPU share, CPU quota, and other parameters. Moreover, a typical multithreaded service has a main thread and a number of worker threads that coordinate with each other through synchronization primitives, which results in additional overhead on the execution time [45].
Even though predicting the exact execution time of a thread in a host can be very complex due to the aforementioned factors, and considering that cycle-level simulation of multi-threaded services can be very time-consuming, we can obtain a rough approximation of its run time by properly categorizing the types of tasks running in a host and approximating the execution time of each task, considering each thread's parent cgroup's resource constraints as well as the other parallel active threads in the host. Consequently, this approximation neglects a few performance factors, such as microarchitecture-dependent variables (e.g., CPU instruction sets and CPU architecture). However, during the modeling phase in a reference PTE, these effects are measured and incorporated into the model, and therefore they contribute to the final approximation.
To approximate the total execution time of a thread, we categorize its tasks into three main types and accumulate the measured values:
1) Accumulated CPU instructions to be executed
2) Accumulated stall cycles due to cache misses, memory accesses, or block I/O
3) Payload to be sent to the next service over the network

Load-balancing of Threads over CPU runqueues
Before being able to approximate the execution time of threads on multi-core hardware, we first need to estimate the placement of threads on the set of active CPU cores. This is crucial for the estimation, as the execution time is directly affected by the other threads running in a core's runqueue. Therefore, we implemented a simplified version of Linux's CFS load-balancing algorithm [46] in PerfSim. We present the details of our implementation in Algorithm 2. For simplicity, we assume there is only one NUMA node (one CPU socket with multiple cores/runqueues in it).
The CFS balances core runqueues based on their load, a measure that combines the threads' accumulated weights (CPU shares) and average CPU utilization. To estimate the weight of a thread t_m ∈ f_in on a core, we first divide the CPU shares of its parent service S_i ∈ lS by the total number of its running threads, and then divide the result by the sum of all CPU shares currently running on the core (where f̂_hk denotes the set of running threads placed on h_k). The load of a thread is then its weight scaled by the ratio t^runnable_sum_m / t^runnable_period_m, in which t^runnable_sum_m is the amount of time the thread was runnable and t^runnable_period_m is the total time the thread could potentially have been running.
Additionally, the load-balancing procedure takes place by considering cache locality and its hierarchical levels called scheduling domains. Within each scheduling domain, load-balancing occurs between a set of cores, called scheduling groups. Since we assumed there is only one CPU socket in each host, there exist only two scheduling domains: NUMA node level and core-pair level. Algorithm 2 represents our simplified version of the CFS algorithm used in PerfSim.
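The weight and load computations just described can be sketched as follows. This is an assumed reading of the formulas: the service's shares are split evenly across its threads and normalised by everything on the core, and the load scales the weight by the runnable-time ratio; function names are hypothetical.

```python
def thread_weight(service_shares, n_service_threads, other_shares_on_core):
    """Per-thread share of its parent service's CPU shares, normalised by the
    total shares on the core. `other_shares_on_core` lists the per-thread
    shares of the other threads already on the runqueue."""
    own = service_shares / n_service_threads
    return own / (own + sum(other_shares_on_core))

def thread_load(weight, runnable_sum, runnable_period):
    """CFS-style load: the weight scaled by the fraction of time the thread
    was runnable out of the time it could have been running."""
    return weight * runnable_sum / runnable_period
```

For example, a thread of a 1024-share service with two threads, sharing a core with one 512-share thread, gets a weight of 0.5; if it was runnable half the period, its load is 0.25.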

Approximating CPU time
Here we explain PerfSim's approach to approximating the CPU time of threads co-located in a host.

Calculating auxiliary CPU share of a thread
When a thread is co-located with other threads on a multi-core machine, its execution time may be affected by the other threads on its runqueue. Since we do not intend to simulate CPU time on a cycle-by-cycle basis, we approximate the effect of the CPU bandwidth and CPU quota of threads by first defining an auxiliary CPU share for each thread, denoted as t^share_ratio_m. For simplicity, we assume threads run in either (1) the Best Effort mode, which implies that no CPU quota and CPU bandwidth are defined, or (2) the Guaranteed mode, where CPU shares are fixed and guaranteed. For every thread in a runqueue, we calculate t^share_ratio_m ∈ (0, 1024] as follows:

Approximating cache miss rate based on cache stores and CPU size
The cache miss rate of a single thread can be affected by various hardware-dependent factors, such as:
• CPU L1, L2, and Last Level Cache (LLC) sizes
• Support for CPU Cache Allocation Technology (CAT), allowing software control over LLC allocation per process
• Microarchitecture-level details

The cache miss rate may also be affected by the following significant software-dependent factors:
• The allocated CPU size of the process
• The memory access rate of co-located threads in the runqueue

Since the focus of PerfSim is to model the performance of microservices in merely software-oriented scenarios, such as container placement, container/host affinity/anti-affinity, and resource allocation policies, we assumed fairly similar microarchitecture and cache implementation details between the CPUs of the PTE and the PE.
Hence, during the performance modeling phase, the effects of both (1) the CPU size and (2) the memory access rate (stores/loads) of co-located threads are measured, and a logarithmic regression model is trained to predict the excess cache miss rate of each thread. We denote the first effect on t_m ∈ f_in as t^CMC_m and the second one as t^CMT_m.
We then apply these excess penalties to the miss rate measured on the reference PTE as follows:

CPU time approximation
We used the CPU Performance Equation presented in [47] to approximate the CPU time of an isolated thread t_m ∈ f_in on an idle core (in the Best Effort mode). Note that t^inst_m, the accumulated number of instructions that thread t_m executes during its lifetime, also includes the additional instructions the thread performs due to memory accesses, cache misses, blkio reads/writes, and network I/O.
Assuming t̂t^inst_m is the number of remaining instructions at any given time t̂, we approximate the CPU time of a thread on a CPU runqueue at any given time as follows: Therefore, given a time period ΔT ≥ t̂t^CPU_time_m (nanoseconds), we can estimate the instructions executed in that period as follows:
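As a concrete reading of the two relations above, the following sketch uses the textbook form of the CPU Performance Equation (CPU time = instructions × CPI / clock rate), scaled by the thread's effective share of the core; the CPI value and the linear share scaling are our assumptions, not PerfSim's exact formulas.

```python
def cpu_time_ns(instructions, cpi, clock_hz, share_ratio=1.0):
    """Approximate CPU time of a thread: instructions x CPI divided by the
    effective clock (core clock scaled by the thread's share of the core)."""
    return instructions * cpi / (clock_hz * share_ratio) * 1e9

def instructions_within(period_ns, cpi, clock_hz, share_ratio=1.0):
    """Inverse form: how many instructions retire within a period Delta T."""
    return period_ns / 1e9 * clock_hz * share_ratio / cpi
```

For instance, 3 billion instructions at CPI 1.0 on a 3 GHz core take one second of CPU time; halving the share ratio halves the instructions retired in a fixed period.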

Approximating the storage I/O time
Assuming a thread t_m is placed on a host h_k with accumulated reads/writes of t^blk_rw_m, we can approximate its storage I/O time as in Equation 18.

Approximating execution time of a thread
As described in the previous section, the execution time of a thread consists of the CPU time, the blkio read/write time, and the idle time. Therefore, given a time t̂, we can estimate the execution time of a thread on a core's runqueue as follows:

Approximating network transfer time
When a service replica lŝ_ix ∈ lS_i ∈ lS attempts to send le^payload_v bytes of data to another service replica lŝ_jy ∈ lS_j ∈ lS, either they are placed on the same host and the network transfer time is zero, or they are placed on two different hosts h_a = Π(lŝ_ix) and h_b = Π(lŝ_jy). In the latter case, a request needs to traverse an ordered subgraph G(τ)_ha,hb = (N_ha,hb, L_ha,hb), where N_ha,hb consists of both hosts h_a and h_b and all routers between them. Also, L_ha,hb is the set of all network links between the two hosts, including host links and topology links.
Depending on the bandwidth usage of active requests flowing over a network link, the available bandwidth of a link changes over time. We therefore denote the maximum available bandwidth of each link l_o ∈ L_ha,hb at any given time t̂ as θ(l_o)^t̂ and calculate the maximum available bandwidth over all links in L_ha,hb as the minimum across the path's links. Then, for any given time t̂, we calculate the request bandwidth between lŝ_ix and lŝ_jy, denoted as θ(lŝ_ix, lŝ_jy)^t̂. To calculate the network time between the two service replicas lŝ_ix and lŝ_jy, denoted as ω(lŝ_ix, lŝ_jy)^t̂, we divide the payload size by the available bandwidth and then add the router and link latencies:
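Putting the bottleneck-bandwidth and latency terms together gives a short sketch; the min-over-links bottleneck and the per-request share factor are assumptions consistent with the description above, and all names are hypothetical.

```python
def network_time_ns(payload_bytes, link_bw_bps, router_latencies_ns,
                    link_latencies_ns, request_share=1.0):
    """Transfer time = payload / (bottleneck bandwidth x this request's share)
    plus the accumulated router and link latencies along the path."""
    bottleneck_bps = min(link_bw_bps) * request_share   # slowest link bounds the path
    transfer_ns = payload_bytes * 8 / bottleneck_bps * 1e9
    return transfer_ns + sum(router_latencies_ns) + sum(link_latencies_ns)
```

With a 1250-byte payload over a path whose slowest link offers 10 Mbit/s, the serialization time alone is 1 ms, to which each router and link along the path adds its latency.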

Putting all together: The simulation
During each iteration, PerfSim predicts the next event to simulate. We categorized events into five main categories:
• Request Generation simulates the incoming traffic to each service chain by generating requests.
• Threads Generation checks for newly queued threads and generates them.
• Threads Execution Time Estimation estimates the execution time of threads and checks whether their execution will end before the next transmission completes, or vice versa.
• Threads Execution estimates the consumed instructions in all threads and repeats load-balancing among the available cores in each host.
• Network Transmissions estimates the remaining payload in each active transmission in the network.

SIMULATION ACCURACY EVALUATION
In this section, we address the following points to highlight the accuracy, speed, applicability, and significance of PerfSim:
1) Potential threats to evaluation validity (Section 6.2), explaining our approach for ensuring evaluation validity.
2) Performance modeling accuracy (Section 6.3), reporting PerfSim's simulation error using a comprehensive set of prevalent scenarios (Table 4).
3) Execution time of the PerfSim prototype (Section 6.4), reporting the amount of time required to simulate each scenario using an early-stage prototype of PerfSim.
4) Simulating large-scale service chains (Section 6.5), demonstrating the applicability of PerfSim for simulating large-scale service chains.
5) Challenges and limitations (Section 6.6), highlighting PerfSim's practical limitations as well as key challenges in simulating the performance of computer systems.

Experimental setup
Cluster setup. To evaluate the simulation error of service chains' execution time between PerfSim and the real setup, we deployed a Kubernetes cluster using four physical machines (as compute nodes) and connected them through one or more Netgear 10 Gbps routers (depending on the topology). All our servers are based on the Intel Core i7 microarchitecture, and their hardware details are specified in Table 3. The latency of the routers for processing a packet is 7.3e5 ns, and the link latencies are 4.2e5 ns.
Workload. To evaluate the various scenarios described in this section, we used sfc-stress [48], a customizable synthetic service chain benchmarking suite capable of generating different types of service chains with various types of CPU/memory/blkio/storage-intensive as well as user-defined workloads. The workloads in sfc-stress are comparable to benchmark suites used for microservices, such as DeathStarBench [49]. They are similar to real-world cloud native microservices in various ways: they can invoke any requested number of threads when their endpoint function is called, and they are REST-API based, written in Node.js, cloud native, deployable on both Kubernetes and Docker, fully parametric, and easily customizable. Using sfc-stress provided us with the flexibility of easily changing the workload type, size, thread count, execution duration, and resource allocation settings per service (both vertically and horizontally), and, most importantly, of designing custom service chains and generating periodic automated requests based on a given arrival rate (requests per second).
Simulation parameters. We used the pre-release prototype version (alpha-0.1) of PerfSim for this evaluation. To drive the simulations, we first extracted the performance traces of various endpoint functions in sfc-stress and fed them to PerfSim as the simulation's initial parameters (Table 3). We set only W_millicores and W_mem to 1 and set the other weights to 0 (similar to Kubernetes settings).
Placement of service replicas. We used the default Kubernetes kube-scheduler with the weights specified in Table 3 to place and govern service replicas.

Potential threats to evaluation validity
As with any evaluation, potential threats exist to reduce its validity. We, therefore, identified several of these threats before starting our evaluation process and paid great attention to address them during our evaluation.

Evaluation consistency
During our initial experiments, we realized that a single experiment might not be representative of a scenario's performance behaviour. Thus, we repeated each experiment four times and calculated the average for each data point. Finally, we calculated moving averages for both the simulated and the actual results as appropriate.

Procedural rigor
Co-locating the hosts of our testbed with other nodes in a cluster may introduce additional noise into the system. Moreover, we realized that connecting various hosts to the same router may affect the performance of the network-intensive scenarios.
To mitigate this problem and prevent any possible noise from affecting the experimental results, we isolated the entire testbed by locating the hosts in a separate rack and making sure no other hosts were connected to the routers. We also noted that keeping energy-saving options enabled in the BIOS may slightly affect CPU performance by altering its clock frequency. To mitigate this effect, we disabled the energy-saving options in the BIOS and optimized it for production.

Comprehensiveness of evaluation scenarios
Another threat to our evaluation was neglecting common scenarios used in today's systems. Nowadays, a real distributed system may consist of different types of services, each with various resource demands, workload sizes, interconnections, and expected average traffic. Thus, to ensure the comprehensiveness of our evaluation scenarios when studying the accuracy of PerfSim, we designed six categories of scenarios, each focusing on a distinct aspect of PerfSim. These categories, as presented in Table 4, include CPU-intensive, memory-intensive, disk-intensive, and network-intensive scenarios, scenarios with multiple replicas, and scenarios with multiple endpoint functions per service.

Latency prediction accuracy
In this section, we evaluate the accuracy of PerfSim when simulating 104 different scenarios in six distinct categories.

Evaluating distinctive classes of scenarios
We performed our evaluation by executing various workloads on our real Kubernetes cluster for ΔT = 60 seconds and comparing the average execution time of requests (lu^exe_time_o) with the values computed using PerfSim. We summarize the simulation error of all scenarios in Table 4 and illustrate the latency-over-requests line graphs in Figure 8. We also illustrate the average latency (lu^exe_time_o) of each scenario as a bar graph in the same figure. We repeated each experiment four times to avoid any unaccounted-for artifacts. Each scenario has (1) a service chain as illustrated in Figure 6, (2) a network topology as illustrated in Figure 7, (3) an arrival rate C^rate_l (req/s), and (4) a resource allocation setting. In scenarios 1-20, our goal was to evaluate PerfSim's simulation accuracy when assigning different CPU sizes to a single CPU-intensive container. For example, in scenario #1 we assigned 100 millicores to the only replica available in S_1, generated traffic at a rate of 1 req/s on C_1, and measured the requests' average execution time (1u^exe_time_o). We repeated the exact same scenario using PerfSim and calculated the simulation percentage error by comparing the 1u^exe_time_o extracted in PerfSim with the one measured on our real testbed.
In scenarios #2-19, we increased the CPU size by 100 millicores in each scenario and repeated the same procedure; in scenario #20, we also evaluated the case where all resources run in the best-effort mode. In scenarios #21-40, we repeated all the aforementioned procedures for the memory-intensive workloads (also implemented in the sfc-stress toolkit) to measure the accuracy of PerfSim in simulating memory-intensive workloads. Similarly, in scenarios #41-60, we focused on the disk-intensive workload.
To evaluate simulations involving networks and service chains, we designed scenarios #61-80. In the first half of this category (scenarios #61-70), we aimed to evaluate the accuracy of PerfSim when assigning egress bandwidth to an outgoing service in service chain C_4 (Figure 6), and in the second half (scenarios #71-80), our goal was to measure the accuracy when assigning ingress bandwidth to the incoming node S_3. Scenarios #81-100 were designed to measure the same effect when the system is deployed in the slightly more complex network topology τ_2 (Figure 7, which shows the network topologies τ_1 and τ_2 used in the evaluation). To evaluate scenarios involving multiple replicas, we designed scenarios #101-102. In these scenarios, we assigned 4 replicas to S_1, 2 replicas to S_2, and 2 replicas to S_3. We tested two different arrival rates on C_5: (a) 1 req/s in scenario #101, and (b) 3 req/s in scenario #102. In scenarios #103-104, we aimed to evaluate deployments that involve multiple endpoint functions, as well as cases where an endpoint function spawns multiple threads (e.g., f_12). Another goal of these scenarios was to test the performance of PerfSim on complex service chains (i.e., C_6) with multiple loops and nested diamond-shaped connections.

Reflections on simulation accuracy
In the vast majority of our experiments, as shown in Figure 8, PerfSim predicted the latency trend with high accuracy (an average error rate of 9%). However, in all scenarios, the actual service latency exhibits visible fluctuations over time (the orange line), while PerfSim's predictions (the blue line) do not capture those slight variations. This is because PerfSim is a discrete-event simulator and captures changes in the system state based on events such as request generation/conclusion, thread spawn/kill, network transmission start/end, queue start/end, task scheduling start/end, etc. Therefore, as PerfSim's events are not defined at the CPU clock resolution (i.e., the kernel clock), in order to maintain a fast simulation speed, it does not consider the CPU noise factors that produce slight fluctuations (e.g., slight changes in CPU frequency or clock speed).
Another observation from the experimental results is the clear harmony between almost all simulation results and reality. Nonetheless, in some intense scenarios (e.g., scenarios 21-23 or 41-42) where very low CPU resources were assigned to a memory-intensive or storage-intensive service (100-300 millicores), even though PerfSim accurately simulated the performance trend, we observe a gap between PerfSim's approximated latency and the latency actually obtained. This variation is due to the use of a static model for each service in our experiments. In highly overloaded or intense scenarios, especially in memory-intensive tasks, the cache-miss rate may be significantly affected and undergo substantial changes. Even though this is already considered in PerfSim's core, combining it with the ever-increasing rate of context switches and CPU migrations in such intense scenarios creates high competition between service threads on the one hand and system/user threads on the other, which puts the entire system in a chaotic state. This problem can be addressed by utilizing more dynamic performance models per service and by considering system-level threads in the simulation.

Execution time of PerfSim
One of the key ideas behind designing PerfSim is to eliminate the barrier to utilizing advanced machine learning techniques (such as deep reinforcement learning) for optimizing the performance of large-scale service chains, which require fast and efficient evaluation of various resource allocation and placement scenarios. In this section, we evaluate the execution time of all 104 scenarios and compare it with the simulation time in PerfSim.
Because the main users of PerfSim will be researchers and performance engineers with limited access to computational resources, and to highlight the lightweight nature of PerfSim, a single personal laptop was used to measure the simulation speed. We ran PerfSim in single-threaded mode on a MacBook Pro with a 2.6 GHz Intel Core i7 CPU and 16 GB of RAM.
We present the detailed speed comparison in Figure 9. As shown in the graph, a single-threaded execution of PerfSim evaluates the performance of a scenario 16-1200 times faster than running/evaluating the same deployment on a real cluster.

Simulating large scale service chains
When designing PerfSim, we initially aimed to use it for optimizing large-scale service chains. Therefore, one aspect of this evaluation focuses on the ability of PerfSim to simulate large-scale service chains. For this purpose, we simulated the performance of a large randomized service chain deployed over a cluster of 100 hosts. We illustrate both our generated service chain G(C_l) and the extracted alternative graph G'(C_l) in Figure 10 (simulating a large service chain with different workload types, payload sizes, and heaviness, e.g., a large number of instructions and cache reads/writes). This service chain consists of 100 nodes and 200 edges with randomized payloads, workload heaviness, and workload types.
Similar to the previous scenarios, we used a single-threaded version of PerfSim for the entire experiment. We present both the simulation parameters and the results in Table 5. As shown, PerfSim was able to simulate the performance of this large-scale service chain up to 10 times faster than the targeted execution time ΔT while using reasonable CPU and memory resources.

Discussion on challenges and limitations
In this section, we discuss the main challenges of performance simulation, as well as the key limitations of using PerfSim.

Comments on time-predictability of tasks
The time-predictability of a task in computer software mainly refers to the properties of its execution time, including the execution pattern of instructions or the spectrum of events occurring during a job's execution [50], [51]. These properties directly affect the process of modeling a job's execution time and, in many cases, can make it impossible to extract any reusable pattern [52]. Hence, time-predictability is a vital property in real-time and embedded systems due to their time- and mission-critical nature.
To shed light on the surface of the problem, we provide an example in Listing 2, representing two functions (f1 and f2), both calculating n MD5 hashes. Function f1 receives the value of n as an argument, whereas f2 initializes n with a randomly generated number between 0 and 10^10. Modeling f1 is relatively straightforward because its execution time is directly related to the input argument n; for f2, on the other hand, it is extremely hard to approximate the execution time because it depends on the randomly generated variable n. With the growing convergence of High-Performance Computing (HPC) systems and the cloud computing paradigm, modern design patterns tend to progressively consider the properties of time-predictable computing in their approaches, in order to be able to plan the underlying resource allocation policies and infrastructure costs, as well as to avoid Service Level Agreement (SLA) penalties. However, many organizations are still not ready for such a transition, and this imposes immense challenges in modeling and simulating the performance of their designed systems.

Modeling the software architecture
The software architecture, as the key foundation of any software system, can significantly impact the performance of a chain of services. As described in the previous sections, PerfSim can adequately model the architecture of a software system by defining the connections between microservices and hosts, and by translating each microservice into a set of endpoint functions while describing their properties.
However, one of the critical challenges in simulating large-scale systems is modeling components that change their behavior based on context. For example, some architectures place an HTTP cache server in front of the main web server to reduce the latency of requests or API calls. When a request arrives, such a cache server checks its temporary storage for the requested content and only forwards the request to the upstream server when the content is missing from the cache or outdated. Modeling services that change their behavior based on the situation, although possible in PerfSim, requires designing content-aware modules for the simulator. For this, interested users can extend the corresponding classes to override the methods responsible for consuming resources, as well as the pattern of executing endpoint functions and their corresponding threads.
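Such a content-aware extension could look like the following minimal sketch. The class and method names (EndpointFunction, execute, full_cost) are illustrative assumptions for this example, not PerfSim's actual API:

```python
class EndpointFunction:
    """Base model: every request pays the full execution cost."""

    def execute(self, request, now=0.0):
        return self.full_cost(request)

    def full_cost(self, request):
        # Placeholder CPU-time units for a full (upstream) execution.
        return 10.0


class CachedEndpointFunction(EndpointFunction):
    """Content-aware model of an HTTP cache in front of an upstream server."""

    def __init__(self, hit_cost=0.1, ttl=60.0):
        self.cache = {}          # content key -> simulation time it was cached
        self.hit_cost = hit_cost # near-zero cost of serving from cache
        self.ttl = ttl           # seconds before a cached entry is stale

    def execute(self, request, now=0.0):
        key = request["url"]
        cached_at = self.cache.get(key)
        if cached_at is not None and now - cached_at < self.ttl:
            return self.hit_cost          # cache hit: negligible cost
        self.cache[key] = now             # miss or stale: refresh the entry
        return self.full_cost(request)    # and pay the full upstream cost
```

A first request for a URL pays the full cost, a repeat request within the TTL pays only the hit cost, and a request after the TTL pays the full cost again.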

6.6.3 Model portability
We rigorously evaluated PerfSim in various scenarios and demonstrated its accuracy in different settings. However, we found that even though the extracted models can be reused across hosts and settings, they cannot be generalized to simulate performance on other microarchitectures. For example, a model extracted on an Intel CPU cannot accurately simulate the performance of the ARM-compatible version of the same software on an ARM-based architecture (e.g., Apple's M-series chips), even with dynamic binary translation enabled.

6.6.4 Simulating highly overloaded scenarios
Two paramount features of PerfSim are (1) its ability to simulate contention for various resources when multiple containers are packed in a host, and (2) its ability to simulate network congestion when microservices communicate over a network topology. However, when a process enters a highly overloaded state (i.e., when the rate of incoming requests exceeds the processing capacity of the serving processes), the request queue grows without bound while the system aggressively throttles CPU and network resources. This, in turn, increases the rate of context switches, CPU migrations, and cache misses.
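The unbounded queue growth in an overloaded state can be illustrated with a minimal fluid-queue sketch (a hypothetical standalone example, not part of PerfSim):

```python
def queue_backlog(arrival_rate, service_rate, steps):
    """Discrete-time fluid approximation of a single-server queue:
    each step, arrival_rate requests arrive and at most service_rate
    are served; the backlog can never go negative."""
    backlog = 0.0
    history = []
    for _ in range(steps):
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history

# Below capacity, the backlog stays bounded (here, at zero); above
# capacity it grows without bound, by (12 - 10) = 2 requests per step.
stable = queue_backlog(arrival_rate=8, service_rate=10, steps=100)
overloaded = queue_backlog(arrival_rate=12, service_rate=10, steps=100)
```

This simple model ignores the second-order effects the text mentions (context switches, CPU migrations, cache misses), which is precisely why accurately simulating overloaded systems is so much harder than this sketch suggests.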
In large-scale systems, performance engineers generally define a rigorous set of constraints and resource management policies throughout a software's design and deployment lifecycle to prevent the system from entering such unpredictable situations [53]. Hence, accurate simulation of such chaotic situations is extremely challenging.

6.6.5 Simulating containers on system virtual machines
One of the widely used schemes for deploying container orchestrators (e.g., Kubernetes) on a cluster involves hypervisor-based virtualization platforms, such as OpenStack or VMware. This deployment scheme has become popular because system virtual machines (1) allow the use of different operating systems on one machine, (2) allow OS-level customization for each service, and (3) provide better data and resource isolation between neighboring services. However, studies show that virtualization suffers from noticeable performance overhead due to the additional OS layer [54], [55].
One of the main intentions behind designing PerfSim is to simulate the thread-level performance of services and to enable accurate prediction of endpoint functions' performance when various threads compete over resources in a host. Thus, the additional complexity of OS-level threads in the aforementioned schemes, if not appropriately modeled, may affect the accuracy of simulation in scenarios where the virtualization overhead is considerably high.
6.6.6 Challenges of performance modeling

As described earlier, the process of performance modeling occurs on a PTE that mimics a PE in order to (1) obtain complete isolation for services so that accurate measurements can be collected for each endpoint function and (2) prevent interference with the experience of real cloud users. However, preparing a PTE may impose both technical and financial challenges that need to be addressed in the design of DevOps process flows. Automating the entire process of performance testing after each release may dramatically reduce the cost of modeling and increase the accuracy of simulations.

CONCLUSION AND FUTURE WORK
In this work, we presented PerfSim, a systematic method and simulation platform for modeling and simulating the performance of large-scale service chains in the context of cloud native computing. Using performance tracing and monitoring tools, PerfSim allows performance modeling of various microservices and their corresponding service chains in cloud native orchestration platforms (such as Kubernetes, which we used in this article) and makes it possible to simulate the effectiveness of different resource management scenarios and placement policies. We evaluated the accuracy of PerfSim in a set of prevalent scenarios and obtained ∼81-99% prediction accuracy as well as a ∼16-1200 times speed-up factor in comparison to testing on a real system (excluding the time needed for setting up the system and configuring the cluster). We also evaluated the capability of PerfSim in simulating large-scale service chains and showed that, using a single core of a laptop, it can achieve a 10-fold speed-up factor when simulating a scenario with complex service chains (composed of 100 microservices interconnected with 200 links deployed over a cluster with 100 hosts).
Our future work will cover three main categories: (1) improving speed, (2) improving accuracy, and (3) expanding the use cases of PerfSim. To increase its speed-up factor, we plan to refactor it to harness the parallel processing capabilities of GPUs. To improve its accuracy, we will add more layers/features, such as support for defining cache size per container (to simulate various scenarios based on Intel's CAT) or simulation of poll mode in NICs (e.g., scenarios involving DPDK). Because advanced performance optimization methods can harness PerfSim's fast scenario assessment capability to quickly perform numerous policy trials, we will work towards proposing novel performance optimization methods. We will also use PerfSim to train deep reinforcement learning agents that can autonomously control different policies in the cluster. Finally, we will implement accounting features in PerfSim that allow detailed reporting on the cost of deployment (e.g., when deploying on a public cloud) based on pricing/cost information provided within the scenario config files.