Multi-Provider IMS Infrastructure With Controlled Redundancy: A Performability Evaluation

In modern telecommunication networks, services are provided through Service Function Chains (SFC), where network resources are implemented by leveraging virtualization and containerization technologies. In particular, the possibility of easily adding or removing network resources has prompted service providers to redefine some concepts including performance and availability. In line with this new trend, we propose a performability study of a multi-provider containerized IP Multimedia Subsystem (cIMS), an SFC-like infrastructure used in the core part of 4G/5G networks to handle multimedia sessions. On the one hand, performance issues are tackled by modeling each cIMS node in terms of a G/G/m queueing system to derive the Call Setup Delay (CSD), a performance metric related to the end-user experience in multimedia communications. On the other hand, availability issues are addressed through the Multi-State System (MSS) formalism, to take into account different performance rates of the system. Then, we devise an algorithm called PE-MUGF (Performability Evaluation through Multidimensional Universal Generating Function) to identify the minimum-redundancy cIMS configuration which meets given performance and availability targets at the same time. Finally, an extensive experimental analysis based on Clearwater, a containerized IMS testbed, allows us to estimate most of the system parameters, whose robustness is evaluated through a sensitivity analysis.


Luigi De Simone, Member, IEEE, Mario Di Mauro, Senior Member, IEEE, Maurizio Longo, Member, IEEE, Roberto Natella, Senior Member, IEEE, and Fabio Postiglione

I. INTRODUCTION AND CONTRIBUTION
Service function chains (SFCs) have revolutionized the provision of telecommunication services thanks to the flexibility to reconfigure, on demand, the hardware and software resources needed to provide specific functionalities [1], [2]. Since SFCs are arranged as a series of software-based nodes to be traversed in a predefined order, a network operator can insert or remove one or more nodes from the chain to modify the service provisioning. Likewise, it is possible to add or remove redundant nodes to strengthen or relax performance and availability (or, simply, performability) requirements. Obviously, increased redundancy implies higher costs; thus, accurate modeling and planning of additional resources is desirable.
Accordingly, we propose a performability evaluation of a container-based version of IP Multimedia Subsystem (cIMS), a popular framework typically deployed as an SFC-like architecture, which is broadly exploited in the core part of 4G/5G networks to support multimedia communications [3], [4], [5]. Due to its generality, our assessment can be easily employed to characterize the performability of different chained structures which today are often implemented via the softwarization paradigm. Examples include: i) WAN softwarized chains, where the data flow may traverse in sequence softwarized elements such as an intrusion detection system, a load balancer, and a router, before arriving into the Internet core; ii) Radio softwarized chains, where the software-defined radio paradigm makes it possible to realize radio access elements completely in software, resulting in chains made of base stations, radio network controllers, signalling/packet gateways, and the data network; iii) WLAN softwarized chains, where the wireless local data traffic can sequentially traverse (for example) an access point, a firewall and a Web server.
In line with modern cloud concepts, the considered IMS infrastructure is assumed to be shared among different providers; thus, we refer to a multi-provider cIMS. Remarkably, the multi-provider qualification is an opportunity offered by softwarized network systems where, to reduce costs, part of the infrastructure is shared. Today, multi-provider solutions include: [6], where a configuration known as Gateway Core Network (GWCN) allows different network operators to connect to a shared radio access network; [7], where operators can deploy access or core nodes independently while sharing a common infrastructure with other providers; [8], where an exemplary multi-operator IMS framework is deployed into a virtual data center, with different services offered by different operators on the same physical infrastructure.
It is useful to decompose our analysis into two parts: the first one concerns the performance aspects that we take into account through the Call Setup Delay (CSD), a performance indicator [9], [10], [11] defined as the time difference between the first message sent from a caller (Invite message) and the first message received from a callee (Ringing message). Obviously, the CSD is influenced by the number of nodes that the messages have to traverse, and by the time spent at each node for the message processing. According to ETSI [9] and ITU-T [12], the CSD value should not exceed 400-500 ms.
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The second part of the problem concerns the availability aspects, namely, the ability of a system to provide a service despite the occurrence of failures. Precisely, the availability requirement of a technological system (the cIMS in our case) can be measured in terms of the number of "nines", which can be translated into a maximum annual downtime (MAD): for instance, a four nines availability (namely, a probability of 0.9999 that the system is working) corresponds to a MAD of about 52 minutes and 36 seconds, whereas a five nines availability (namely, a probability of 0.99999 that the system is working) corresponds to a MAD of about 5 minutes and 15 seconds. This latter requirement, often known as high availability, is usually included in the Service Level Agreement offered by all modern telecom infrastructures [13].
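The mapping from "nines" to MAD can be checked with a few lines of Python. This is only an illustrative sketch: the function name `max_annual_downtime` and the choice of a 365.25-day year are assumptions made here, not part of the paper.

```python
def max_annual_downtime(nines: int, days_per_year: float = 365.25):
    """Return (minutes, seconds) of maximum annual downtime.

    `nines` is the number of nines of the availability target
    (4 -> 0.9999, 5 -> 0.99999). The year length is an assumption.
    """
    unavailability = 10.0 ** (-nines)               # e.g. 1e-4 for four nines
    total_seconds = unavailability * days_per_year * 24 * 3600
    minutes, seconds = divmod(total_seconds, 60)    # split into min / sec
    return int(minutes), seconds


for n in (4, 5):
    m, s = max_annual_downtime(n)
    print(f"{n} nines -> {m} min {s:.1f} s per year")
```

With these assumptions the results fall within a second of the figures quoted above; small differences come from rounding and from the year length adopted.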
A joint performance and availability assessment makes it possible to pinpoint the optimal-redundancy cIMS setting that: i) exhibits the best performance (with CSD below a critical threshold); ii) is able to satisfy a given availability requirement (e.g., the five nines); iii) has the minimum cost, namely, the minimum number of redundant elements.
The present paper represents a substantial extension of our previous conference paper [14], both from a methodological and an experimental perspective, as detailed in the final part of Section II. The key contributions are summarized as follows:
• To characterize performance, we propose a G/G/m queueing model of a cIMS node by exploiting the Krämer/Langenbach-Belz approximation.
• To characterize availability, we: i) exploit the Multi-State System (MSS) formalism to model a cIMS node; and ii) devise an ad hoc algorithm to find the minimum-redundancy cIMS setting which fulfills the performance and availability requirements.
• We carry out an extensive experiment through which we are able to: i) estimate some system parameters (e.g., service times, repair rates); and ii) perform a dedicated sensitivity analysis.
The remainder of the paper is organized as follows. Section II proposes an overview of related work concerning both performance and availability aspects, including our previous conference work [14], where we highlight the differences with respect to the work presented here. In Section III we provide an architectural perspective of cIMS and of the Clearwater testbed, and we introduce the concept of containerized function (CF), the basic modeling element of a cIMS node. Section IV analyzes in depth the state-space model of a CF through the MSS formalism. In Section V we provide analytical details about the adopted G/G/m queueing model and the pertinent approximations. Section VI introduces the PE-MUGF algorithm to evaluate the optimal cIMS settings. Section VII presents the experimental part along with original results, and Section VIII concludes the paper along with some future research hints.

II. RELATED WORK
Recently, there have been many efforts to model both performance and availability aspects of service function chains in the realm of network management [15], [16], [17], [18]. We find it more convenient to keep the two aspects separate, so as to highlight differences and improvements of our proposal with respect to the existing technical literature.

A. Performance Issues
As regards the performance issues, a large part of the literature is aimed at characterizing the effect of delays introduced by single nodes belonging to a chain, which unavoidably affect the overall SFC delay. Relevant studies include: a performance evaluation of chained services through a solution strategy named MaxZ [19]; a mathematical formulation of an optimization problem which takes into account the delay guarantees provided by SFCs [20]; an SFC orchestration solution with the objective of minimizing the cost of the composing virtual network functions (VNFs) [21]; service rate control problems in SFC requests scheduling [22]; a technique (named Network Queueing Assessment) to detect bottlenecks in SFCs based on the network queue occupation [23]; a solution (called eRESERV) to evaluate performance of SFCs [24]; a delay-based performance analysis of SFCs along with the problem of CPU allocation [25]; a reliable SFC placement problem in softwarized 5G networks [26].
All the aforementioned works adopt M/M/1 queueing models to characterize the elements belonging to a service chain. On the one hand, such models offer the comfort of a tractable closed-form solution. On the other hand, they could fail to represent some real-world situations since they assume predefined (exponential) distributions both for inter-arrival and service times, and assume single-server nodes, whereas in many cases each node is able to manage more than one service request at a time.
Other studies [27], [28], [29], [30] admit the presence of multi-server nodes to model VNFs, but they adopt M/M/m queueing systems that restrict the generality of the model. Similarly, previous studies [14], [31] characterized each SFC node in terms of an M/G/m queueing model.
Differently from all the mentioned works, we propose a G/G/m queueing model to characterize each node of the SFC. It represents the most general case, which can deal with realistic use cases where the classic assumption of exponential distributions (both for inter-arrival and service times) is inaccurate.

B. Availability Issues
As regards the availability aspects of SFCs, the technical literature proposes a number of useful techniques to optimize the redundancy. For example, Petri-based formalisms provide a compact way to model the availability of chained structures through the analysis of the state changes. Among the works which exploit such a formalism we find: [32], including a VNF migration strategy where the underlying SFC has been modeled according to the Petri formalism; [33], where stochastic Petri networks (SPNs) have been exploited to set up an automatic method to evaluate the availability of SFCs; [34], where the authors propose a comparative analysis of different SFC configurations exploiting the stochastic reward networks (SRNs) formalism, a variant of classic stochastic Petri networks with a reward function; [35], where SRNs have been used to characterize, from an availability viewpoint, homogeneous and heterogeneous deployments of SFCs; [36] and [37], where stochastic activity networks (SANs) have been adopted to assess the availability of an end-to-end NFV-aware network service; [38], where generalized stochastic Petri networks (GSPNs) have been employed to model availability problems in data centers in charge of managing SFCs.
The compactness of the described methods is both an advantage and a disadvantage: they provide high-level expressiveness during the modeling stage, but make it difficult to access the analytical details that we exploit in our MUGF-based approach. Another limitation is that Petri-based techniques typically require specific software tools such as SHARPE [39], SPNP [40], TimeNET [41], WebSPN [42]. The Universal Generating Function (UGF) technique (the non-multidimensional version) has also been exploited to deal with availability aspects of virtualized environments [43], [44]. A limitation is that such a method is not suitable for application to a multi-provider environment, which we are able to address through the proposed multidimensional UGF technique.
We conclude this section by pinpointing the main methodological and experimental advances over [14]. As to the former: i) we adopt a more general queueing model for a cIMS node, namely G/G/m, where: the inter-arrivals of cIMS sessions (the first G) are assumed to be Gamma-distributed with different shape parameters accounting for a broader set of possibilities; the generic service times (the second G) are estimated through experiments; and m containerized instances are managed by P providers which share a cIMS node; ii) we introduce a formalization of series and parallel structures useful to mathematically justify the Multidimensional Universal Generating Function (MUGF) method; iii) we devise an effective ad hoc algorithm named PE-MUGF (Performability Evaluation through MUGF) to search for the minimum-redundancy cIMS setting which meets given performance and availability requirements. On the experimental side, we conduct an extensive campaign based on Clearwater, a container-based IMS deployment, through which we are able to: i) obtain on-field estimates of relevant model parameters (e.g., service times, repair rates); ii) elaborate on possible variations of redundant cIMS configurations; iii) conduct a dedicated sensitivity analysis focused on some critical parameters.

III. MOTIVATING EXAMPLE: MULTI-PROVIDER CIMS
As an exemplary SFC infrastructure we consider a container-based version of IP Multimedia Subsystem (cIMS) realized through the open-source project Clearwater release 130 [45]. The leftmost panel in Fig. 1 shows the nodes that we have implemented in our testbed:
• Proxy-CSCF (P-CSCF): the ingress point of the cIMS architecture, which exposes its SIP-based interface to the external world. The corresponding Clearwater name is Bono.
• Serving-CSCF (S-CSCF): it is responsible for the multimedia sessions control, including authentication and routing procedures. The corresponding Clearwater name is Sprout/S.
• Interrogating-CSCF (I-CSCF): it enables IMS requests to be routed towards the correct S-CSCF. The corresponding Clearwater name is Sprout/I and is co-located with Sprout/S.
• Home Subscriber Server (HSS): it stores information about IMS subscribers (including authentication keys). The corresponding Clearwater name is Homestead.
In line with the decoupling logic of softwarized infrastructures, each cIMS node is realized through a 3-layer structure that we call Containerized Function (CF), shown in the middle panel of Fig. 1. The CF upper layer (application layer) hosts the specific cIMS logic (e.g., Proxy, Serving, etc.) embodied into containers; the middle layer (docker layer) provides support for containers and is realized through the popular docker daemon engine; the lower layer (infrastructure layer) models all the physical parts including CPU, power supplies, etc. A cIMS node can be made of one or more redundant CFs connected in parallel to improve the availability, as will be detailed in Section VI. Finally, the rightmost panel of Fig. 1 shows that each CF can be shared by $P$ different providers. Each provider $p$ ($p = 1, \dots, P$), modeled as a $G/G/m_p$ queue, represents a set of containerized instances (briefly, instances), each of which is able to manage the cIMS requests in queue.

IV. A MULTI-STATE SYSTEM APPROACH FOR THE AVAILABILITY MODELING OF A CONTAINERIZED FUNCTION
The Multi-State System (MSS) formalism was introduced to overcome the limitation arising from the binary models [46] where, from an availability perspective, a system can be characterized according to two extreme cases: perfect functioning or complete failure. Conversely, in many real-life situations, the systems and their components can assume a certain range of performance rates between the two aforementioned extreme cases [47]. By applying the MSS modeling to service function chains, it is possible to: i) evaluate the performance rates of the single components (e.g., the nodes) ruled by failures and repair operations, and ii) employ the MUGF to recombine, through simple algebraic operations, the performance rates of single components and derive a macroscopic performance model of the whole chain.
Figure 2 represents the MSS model of a single CF, where each performance rate can be mapped into a given state, providing information about the operating (working or failed) condition of a specific component.A failed component is indicated by 0 and a working (or repaired) component is indicated by 1.
The inter-arrival times of failures are treated as independent and identically distributed (iid) random variables and, more precisely, as exponentially distributed random variables with parameter $\lambda$, whereas the times taken for repair are treated as exponentially distributed random variables with parameter $\mu$ [48], [49]. For example, starting from a completely working system in state $(s_1, s_2, \dots, s_P)$, the failure of one instance of provider 1 is ruled by the failure rate $\lambda_{C_1}$ and brings the system to the state $(s_1 - 1, s_2, \dots, s_P)$. In contrast, the system comes back to the completely working state when the failed instance of provider 1 gets repaired with rate $\mu_{C_1}$. Remarkably, when the docker layer fails, each state of the system (except for state $(0, 0, \dots, 0)_I$) is forced to reach the state $(0, 0, \dots, 0)_D$ with failure rate $\lambda_D$. Likewise, when the infrastructure layer fails, each state of the system is forced to reach the state $(0, 0, \dots, 0)_I$ with failure rate $\lambda_I$. Please also note that, as usual in real-world systems, the repairs of both docker and infrastructure layers conclude with a recovery of the whole system, with repair rates $\mu_D$ and $\mu_I$, respectively.
The overall state space of the MSS can be defined as $\omega = \omega_C \times \omega_D \times \omega_I$, where:
• $\omega_C = \prod_{p=1}^{P} \{0, 1, \dots, s_p\}$ represents the state space of the application layer (with containerized instances) of all providers;
• $\omega_D = \{0, 1\}_D$ represents the state space of the docker layer, where 0 indicates the docker failure condition and 1 indicates the docker working condition;
• $\omega_I = \{0, 1\}_I$ represents the state space of the infrastructure layer, where 0 indicates the infrastructure failure condition and 1 indicates the infrastructure working condition.
At this point, to completely characterize the MSS, we need to formally define two descriptors. The first one is the performance rate $r_{p,\sigma} = \gamma \sigma_p$, with $\gamma$ being the so-called serving capacity; that is, $r_{p,\sigma}$ is the number of cIMS requests that the containerized instances of provider $p$ can concurrently manage when $\sigma_p$ instances are currently available. Thus, $r_\sigma = (r_{1,\sigma}, \dots, r_{P,\sigma})$ represents the stochastic vector containing all performance rates, which takes values in the set

$$\{ r_\sigma : \sigma \in \omega_C \}. \quad (1)$$

The second descriptor is the structure function $\psi$, which provides a mapping between all possible combinations of states of the system components $(\sigma, z_D, z_I) \in \omega$ and the whole system state. Thus, we have:

$$\psi(\sigma, z_D, z_I) = \begin{cases} \sigma, & z_D = 1,\ z_I = 1, \\ (0, \dots, 0)_D, & z_D = 0,\ z_I = 1, \\ (0, \dots, 0)_I, & z_D = 0,\ z_I = 0, \end{cases} \quad (2)$$

where the first case corresponds to docker and infrastructure layers both working, the second to a failed docker layer, and the third to failed docker and infrastructure layers. We can conclude that the MSS performance rate can be expressed as the vector stochastic process

$$R(t) = (R_1(t), \dots, R_P(t)) = \gamma\, \psi(\boldsymbol{\Sigma}(t), Z_D(t), Z_I(t)), \quad (3)$$

where: $\boldsymbol{\Sigma}(t) = (\Sigma_1(t), \dots, \Sigma_P(t))$ is an $\omega_C$-valued process representing the failure/repair condition of the application layer, and $Z_D(t)$ [$Z_I(t)$] is an $\omega_D$-valued [$\omega_I$-valued] process representing the failure/repair condition of the docker [infrastructure] layer.
Being built on the MSS of Fig. 2, $R(t)$ is a process with a finite number of states, amounting to $\prod_{p=1}^{P} (s_p + 1) + 2$, and is irreducible, meaning that each state can be reached from any other state. Moreover, since all the $\lambda$ and $\mu$ parameters do not change with time, the process $R(t)$ is an ergodic Continuous-Time Markov Chain (CTMC) whose probability vector $\pi(t)$ can be obtained by solving:

$$\frac{d\pi(t)}{dt} = \pi(t)\, Q, \quad (4)$$

with $Q$ being the infinitesimal generator matrix [50]. We also have to consider the normalization condition $\sum_{\sigma} \pi_\sigma(t) = 1$. The steady-state probabilities are $\pi_\sigma = \lim_{t \to \infty} \pi_\sigma(t) = \lim_{t \to \infty} \Pr\{R(t) = r_\sigma\}$. Since we are interested in the steady-state behavior of the MSS, we can safely conclude that the set $\{\pi_\sigma, r_\sigma\}$ uniquely describes the steady-state behavior of the containerized function.
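As an illustration of how the steady-state probabilities can be obtained in practice, the following Python sketch builds the generator matrix of a single-provider CF ($P = 1$) and solves $\pi Q = 0$ together with the normalization condition. It is a simplified reading of the model described above, not the authors' implementation: the function names, the numeric rates (per hour), and the choice of a state-independent instance failure rate are assumptions made for the example.

```python
def build_generator(s, lam_c, mu_c, lam_d, mu_d, lam_i, mu_i):
    """Generator Q for one provider with s instances.

    State indices: 0..s = number of working instances,
    s+1 = docker failed (0)_D, s+2 = infrastructure failed (0)_I.
    """
    n = s + 3
    D, I = s + 1, s + 2
    Q = [[0.0] * n for _ in range(n)]
    for sigma in range(s + 1):
        if sigma > 0:
            Q[sigma][sigma - 1] = lam_c   # one instance fails
        if sigma < s:
            Q[sigma][sigma + 1] = mu_c    # one instance is repaired
        Q[sigma][D] = lam_d               # docker layer fails
        Q[sigma][I] = lam_i               # infrastructure layer fails
    Q[D][I] = lam_i                       # infrastructure can fail during docker outage
    Q[D][s] = mu_d                        # docker repair restores the whole CF
    Q[I][s] = mu_i                        # infrastructure repair restores the whole CF
    for i in range(n):
        Q[i][i] = -sum(Q[i])              # rows of a generator sum to zero
    return Q


def steady_state(Q):
    """Solve pi Q = 0, sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]   # transpose of Q
    A[-1] = [1.0] * n                     # replace one balance equation
    b = [0.0] * (n - 1) + [1.0]           # ... with the normalization
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


# Hypothetical rates, chosen only to exercise the code.
Q = build_generator(s=2, lam_c=1e-3, mu_c=1.0, lam_d=5e-4, mu_d=2.0,
                    lam_i=1e-4, mu_i=0.5)
pi = steady_state(Q)
print([round(p, 6) for p in pi])
```

For this toy case the number of states is $s + 3$, in agreement with the general count given above for $P = 1$.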

V. PERFORMANCE OF A CONTAINERIZED FUNCTION THROUGH
As shown in the rightmost panel of Fig. 1, each provider $p$ at the application layer is represented through a queueing system made of a set of software instances able to manage the cIMS sessions. In order to treat the problem in the most general way possible, we propose a $G/G/m_p$ ($p = 1, \dots, P$) queue modeling of a CF. For this model we have: generic inter-arrival times (the first G), generic service times (the second G), and a finite number $m_p$ of containerized instances per provider $p$. Before delving into the details of such a queueing model, we want to highlight an important fact. In principle, the number $m_p$ of instances in charge of managing multimedia sessions can vary over time since, as specified in the previous section, some of them may fail; thus, we would have a $G/G/m_p(t)$ queueing model. Having also defined $R(t)$ as the performance rate of the MSS, the queueing model can be denoted by $G/G/R_p(t)$ so as to stress the dependency on a particular state.
Interestingly, we note that the failure time scale completely dominates the queueing time scale, since failure times are in the order of thousands of hours, whereas service times are in the order of a few milliseconds (see Table I further ahead). This condition leads to a decoupling of the two time scales that, as well explained in [51], makes it possible to neglect the transient effects of the dominated time scale (the queueing time scale in our case). In other words, the steady-state condition of the queues is achieved much faster than the steady-state condition of the faults. Thus, we can safely assume a $G/G/R_p$ model, where the time dependency is "absorbed". This notwithstanding, when we need to stress the time dependency on a particular state, we will occasionally use the $G/G/R_p(t)$ notation.
At this point, it is useful to recall that the CSD performance indicator is directly related to the amount of time that a cIMS request spends at each CF waiting to be processed.For example, in our IMS case study, the CSD is given by the total latency across the four stages (i.e., the CFs) in the service chain (P-CSCF, S-CSCF, I-CSCF, HSS).Intuitively, higher sojourn times at each CF imply higher CSD that, in turn, implies worse performance.
In line with these considerations, we characterize the average sojourn times of cIMS requests, which depend on the provider $p$ and on the particular state $\sigma$. Remarkably, $G/G/m_p$ queueing systems do not admit analytical closed forms; thus, some approximating formulas are required. To address this issue, we introduce an equivalent $M/M/m_p$ model with Poisson arrivals with rate $\alpha$ and exponential service times with rate $\beta$. According to classic queueing theory [52], we can express the mean sojourn time at a single CF as:

$$\delta_{p,\sigma} = q_{p,\sigma} + \frac{1}{\beta}, \quad (5)$$

where: $q_{p,\sigma}$ represents the mean waiting time spent by a cIMS request in the queue of provider $p$ in state $\sigma$, and $1/\beta$ is the mean service time spent by the cIMS request to be processed. This latter quantity can be experimentally estimated (see numeric values in Table I further ahead) for each node.
In contrast, $q_{p,\sigma}$ is a random quantity which can be obtained by approximating the corresponding mean waiting time of the equivalent $M/M/m_p$ queueing system with the expression in (6), where: $\pi_m$ is the steady-state probability of the equivalent $M/M/m_p$ queueing system [52], $\rho = \alpha/(\beta \cdot m)$ is the utilization factor, $V_A$ and $V_S$ are the coefficients of variation ($\sigma(\cdot)/E(\cdot)$) for inter-arrival and service times, respectively, and $F$ is a correction factor which implements the Krämer/Langenbach-Belz approximation formula [53], [54], [55] for a $G/G/m_p$ queueing model, obeying the relation in (7). Thus, by substituting (6) in (5), we obtain the mean sojourn time per CF modeled as a $G/G/m_p$ queueing system. Moreover, since the time spent by a cIMS request also depends on the particular state reached by the MSS in Fig. 2, we can define the structure function $\psi_\Delta : \omega \to \{\mathbb{R}^+ \cup \{+\infty\}\}^P$ specialized to the mean sojourn times. Similarly to the definition introduced in (2), we have that $\psi_\Delta(\sigma, z_D, z_I) = \delta_\sigma = (\delta_{1,\sigma}, \dots, \delta_{P,\sigma})$ for $z_D = 1$ and $z_I = 1$, $\psi_\Delta(\sigma, z_D, z_I) = (\infty, \dots, \infty)_D$ for $z_D = 0$ and $z_I = 1$ (when the docker layer is not working, we have infinite delay), and $\psi_\Delta(\sigma, z_D, z_I) = (\infty, \dots, \infty)_I$ for $z_D = 0$ and $z_I = 0$ (when infrastructure and docker layers are not working, we have infinite delay).
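To make the approximation concrete, the sketch below implements the Krämer/Langenbach-Belz correction in the form commonly reported in queueing textbooks, where the Erlang-C probability of waiting of the equivalent M/M/m queue plays the role of the steady-state probability. The exact normalization used in (6) and (7) may differ, so the function names and the formula layout should be read as assumptions rather than as the authors' expressions.

```python
import math


def erlang_c(m, rho):
    """Probability of waiting in an M/M/m queue with utilization rho < 1."""
    a = m * rho                                   # offered load
    tail = a ** m / (math.factorial(m) * (1.0 - rho))
    head = sum(a ** k / math.factorial(k) for k in range(m))
    return tail / (head + tail)


def klb_factor(rho, p_wait, va, vs):
    """Kraemer/Langenbach-Belz correction factor (textbook form).

    Requires va**2 + vs**2 > 0.
    """
    ca2, cs2 = va ** 2, vs ** 2
    if va <= 1.0:
        return math.exp(-(2.0 / 3.0) * (1.0 - rho) / p_wait
                        * (1.0 - ca2) ** 2 / (ca2 + cs2))
    return math.exp(-(1.0 - rho) * (ca2 - 1.0) / (ca2 + 4.0 * cs2))


def mean_sojourn_time(alpha, beta, m, va, vs):
    """Approximate mean sojourn time of a G/G/m queue.

    alpha: arrival rate, beta: service rate of one instance,
    m: number of available instances, va/vs: coefficients of variation.
    Returns infinity when no instance is available or the queue is unstable.
    """
    if m == 0:
        return math.inf
    rho = alpha / (beta * m)
    if rho >= 1.0:
        return math.inf
    p_wait = erlang_c(m, rho)
    wait = (p_wait / (beta * m * (1.0 - rho))     # M/M/m mean waiting time
            * (va ** 2 + vs ** 2) / 2.0           # variability scaling
            * klb_factor(rho, p_wait, va, vs))    # KLB correction
    return wait + 1.0 / beta                      # waiting + service, as in (5)


# Gamma inter-arrivals with shape k have coefficient of variation 1/sqrt(k).
for k in (1, 2, 4):
    print(k, mean_sojourn_time(alpha=150.0, beta=100.0, m=2,
                               va=1.0 / math.sqrt(k), vs=1.0))
```

With $V_A = V_S = 1$ the correction factor equals 1 and the result reduces to the M/M/m mean sojourn time, which is a convenient sanity check. The rates in the loop are hypothetical.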
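Remarkably, the structure function $\psi_\Delta$ is useful to characterize the mean sojourn time in each possible state through the vector stochastic process $\Delta(t) = (\Delta_1(t), \dots, \Delta_P(t)) = \psi_\Delta(\boldsymbol{\Sigma}(t), Z_D(t), Z_I(t))$, where $\boldsymbol{\Sigma}(t)$, $Z_D(t)$ and $Z_I(t)$ are the processes describing the failure/repair condition of the application, docker and infrastructure layers, respectively. Moreover, similarly to the $R(t)$ process, $\Delta(t)$ is also an ergodic CTMC, where the set of pairs $\{\pi_\sigma, \delta_\sigma\}$ determines the steady-state performance behavior of a CF in terms of mean sojourn times.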

VI. PERFORMABILITY OF A SERVICE CHAIN: THE MUGF APPROACH
Since we are dealing with a chain of nodes where each node is made of replicated CFs for availability purposes, we want to stress that: i) a series connection implies that the whole chain is functioning when each node $n \in N$ is functioning, where $N$ = {P-CSCF, S-CSCF, I-CSCF, HSS} is the set of cIMS nodes; ii) a parallel connection implies that each node $n$ is made of redundant CFs. Specifically, CF$^{(n,\ell)}$ represents the $\ell$-th parallel CF ($\ell = 1, \dots, L_n$) associated to node $n$. Since we assume that all CFs composing a node have to share the load among them, the redundancy is supposed to be "hot standby" (this working hypothesis is also known as flow dispersion hypothesis [47]). The resulting series/parallel structure is shown in Fig. 3. Such a model is meant to capture a high-level architectural perspective, and considers neither synchronization problems nor link availability (links are supposed to be always on). At this point, we evaluate the mean CSD introduced by a chain, denoted by $\Delta^c(t)$, through the definition of two operators: the series structure function and the parallel structure function. We find it more convenient to start by defining the latter operator.
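Since the call flow traverses the chain in series, the overall mean CSD $\Delta^c(t)$ can be obtained as the sum of the mean CSDs introduced by each single node. Such a quantity can be evaluated by introducing the following:
Proposition 2 (Series Structure Function): We define a series structure function $\psi_{ser} : \omega^{\sum_n L_n} \to \mathbb{R}^P \cup (+\infty, \dots, +\infty)$. The overall mean CSD $\Delta^c(t) = (\Delta^c_1(t), \dots, \Delta^c_P(t))$ introduced by the chain is given by:

$$\Delta^c(t) = \sum_{n \in N} \Delta^{(n)}(t), \quad (9)$$

where $\Delta^{(n)}(t)$ denotes the mean sojourn times vector of node $n$ resulting from (8).
From (8) we can derive the steady-state distribution of the mean sojourn times for node $n$, viz.,

$$\{\pi^{(n)}_\sigma, \delta^{(n)}_\sigma\}, \quad (10)$$

where: $\delta^{(n)}_\sigma = (\delta^{(n)}_{1,\sigma}, \dots, \delta^{(n)}_{P,\sigma})$ is the mean sojourn times vector of node $n$ in state $\sigma$, and $\pi^{(n)}_\sigma = \lim_{t \to \infty} \Pr\{\Delta^{(n)}(t) = \delta^{(n)}_\sigma\}$ the corresponding limiting probability. Likewise, from (9) we can derive the steady-state mean CSD distribution of the whole chain, viz.,

$$\{\pi^c_\sigma, \delta^c_\sigma\}, \quad (11)$$

where: $\delta^c_\sigma = (\delta^c_{1,\sigma}, \dots, \delta^c_{P,\sigma})$ is the mean sojourn times vector of the system in state $\sigma$, and $\pi^c_\sigma = \lim_{t \to \infty} \Pr\{\Delta^c(t) = \delta^c_\sigma\}$ the corresponding limiting probability.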
In order to solve the proposed model in a computationally efficient way, we use an approach based on the Universal Generating Function (UGF), also known as u-function [56]. The UGF is a hierarchical technique to compute steady-state performance distributions of complex MSSs characterized by series/parallel interconnections among the components. The UGF of the steady-state performance metric $Y$, whose distribution is given by the set of pairs $\{\pi_j, y_j\}$, is the polynomial-shape function

$$u(z) = \sum_j \pi_j z^{y_j}, \quad (12)$$

with $y_j$ being the $j$-th performance rate, and $\pi_j$ the corresponding steady-state probability. The UGF of the whole system (i.e., the combination of multiple CFs in series/parallel) can be obtained by combining the UGFs of individual CFs through simple sums and products. Thus, the steady-state probabilities for the entire system can be obtained from the combined UGF. Since we deal with multiple service providers in the system, we adopt a multidimensional extension of (12) dubbed MUGF [31] along with (8) and (9). Thus, we are able to obtain the MUGF of the whole cIMS chain where: i) 4 nodes are connected in series, ii) each node is made of replicated CFs to guarantee redundancy, iii) each CF is shared among $P$ different providers and can be in a particular state $\sigma$. In summary, we can write the MUGF $u^c(z)$ of the whole chain as the product of the MUGFs of single nodes, viz.

$$u^c(z) = \prod_{n \in N} u^{(n)}(z) = \sum_{\sigma \in \omega_{tot}} \pi^c_\sigma \prod_{p=1}^{P} z_p^{\delta^c_{p,\sigma}}, \quad (13)$$

with $\omega_{tot} = \prod_n \omega^{L_n}$. In practice, $u^c(z)$ represents a polynomial-shape function in the indeterminates $z_1, \dots, z_P$, where each term corresponds to the mean CSD vector $\delta^c_\sigma$ (exponents of $z$), whereas its steady-state probability $\pi^c_\sigma$ is the pertinent coefficient. Such quantities can be used to calculate the steady-state availability of the whole service chain as explained below.
First, we denote by $S$ a particular cIMS setting where each node $n$ is made of a number of redundant CFs ($\ell = 1, \dots, L_n$). Further, we denote by $\delta^* = (\delta^*_1, \dots, \delta^*_P)$ a $P$-dimensional vector which contains the maximum tolerated steady-state values of the mean CSD. Thus, we define the steady-state availability of a particular cIMS setting $S$ as

$$A(S) = \sum_{\sigma \in \omega_{tot}} \pi^c_\sigma \, 1\big(\delta^c_{p,\sigma} \le \delta^*_p,\ p = 1, \dots, P\big), \quad (14)$$

where $1(\cdot)$ is a function which amounts to 1 if the condition holds true and 0 otherwise. We note that $\pi^c_\sigma$ and $\delta^c_{p,\sigma}$ in (14) are directly derived from the MUGF expression (13).
We also stress the fact that (14) provides the steady-state availability of a generic cIMS setting $S$, but we are interested in finding the steady-state availability of the setting with the minimum cost, namely, with the minimum number of redundant CFs.
Accordingly, denoting by $E(n,\ell)$ the cost (or expenditure) of the $\ell$-th CF belonging to node $n$, we can define the cost of a cIMS setting $S$ as $E_c(S) = \sum_{n} \sum_{\ell=1}^{L_n} E(n,\ell)$. In summary, we search for the solution of the following optimization problem:

$$S^* = \arg\min_{S} E_c(S) \quad \text{s.t.} \quad A_c(\boldsymbol{\delta}^*, S) \ge A_0. \qquad (15)$$

A. PE-MUGF Algorithm
Algorithm 1 describes PE-MUGF, an algorithm devised to evaluate all the possible cIMS settings in terms of steady-state availability and cost. Such a choice is due to two reasons: first, it is impossible to pinpoint beforehand the optimal cIMS setting with no knowledge of its composition (in terms of redundant CFs); second, with more cIMS settings available, a network provider can make different choices or compare several settings according to customized criteria. Thanks to the MUGF approach, the steady-state availabilities can be computed efficiently even for a large number of combinations. PE-MUGF has been realized with Wolfram Mathematica™ and is available upon request.
The initial line of the algorithm specifies the input parameters. In Section VII, we show how these values can be defined in the context of an experimental use case. Lines 1-13 report a function called BuildSetting, which builds, in a combinatorial way, all the possible cIMS settings made of $N$ nodes, where each node is made of at most $L_n$ CFs. We note in passing that, to highlight the recursion in the BuildSetting function, we adopt the notation $S_N$ in place of $S$ to indicate that a given setting is made of $N$ nodes. In such a way, except for the case $N = 0$ (line 2), each $N$-node setting can be obtained by prepending a number $i$ ($i = 1, \ldots, L_n$) to an $(N-1)$-node setting.
Line 15 is the MUGF per cIMS setting, derived as a combination of the MUGFs of the single nodes. Such an expression allows us to evaluate the steady-state availability per setting (line 16). Then, a cost per setting, obtained as the sum of the costs per CF, is assigned at line 17, and the optimal cIMS setting is provided as the output (line 19).
As mentioned before, to evaluate the optimal setting $S^*$, the PE-MUGF algorithm must evaluate all the built settings, which are a byproduct of the procedure.
From a time complexity perspective, it is useful to notice that the MUGF construction is very fast since it relies on simple algebraic operations such as sums and products. In contrast, the BuildSetting function requires more time since it has to build all the possible settings, thus bringing the complexity of PE-MUGF to $O(L_n^N)$. This notwithstanding, for typical values of $N$ and $L_n$ in real-world applications (see Section VII for numerical values), PE-MUGF is reasonably affordable.
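The recursive enumeration and the exhaustive search described above can be sketched in Python as follows. This is not the authors' Mathematica implementation: the names `build_settings`, `cheapest_feasible`, and `toy_availability` are hypothetical, and the availability function is a deliberately simple stand-in (independent CFs, a node is up if at least one CF is up) used in place of the MUGF-based evaluation.

```python
def build_settings(limits):
    """Recursively enumerate all settings.

    `limits[k]` is the maximum number of CFs of node k; a setting is a tuple
    with one CF count (1..limit) per node.
    """
    if not limits:
        return [()]  # base case: the empty (0-node) setting
    shorter = build_settings(limits[1:])
    # Prepend every admissible CF count to each (N-1)-node setting.
    return [(i,) + s for i in range(1, limits[0] + 1) for s in shorter]

def toy_availability(setting, cf_up=0.99):
    """Stand-in availability (assumption): series of parallel, independent CFs."""
    avail = 1.0
    for replicas in setting:
        avail *= 1.0 - (1.0 - cf_up) ** replicas
    return avail

def cheapest_feasible(settings, availability_of, target, unit_cost=1):
    """Return (setting, cost, availability) of the minimum-cost feasible setting.

    Ties on cost are broken in favor of the higher availability.
    """
    feasible = [
        (unit_cost * sum(s), -availability_of(s), s)
        for s in settings
        if availability_of(s) >= target
    ]
    if not feasible:
        return None
    cost, neg_avail, best = min(feasible)
    return best, cost, -neg_avail

settings = build_settings([3, 3, 3, 3])  # N = 4 nodes, at most 3 CFs each
print(len(settings))                      # 3**4 settings are enumerated
print(cheapest_feasible(settings, toy_availability, target=0.999))
```

With $N = 4$ and $L_n = 3$ the search space contains only $3^4 = 81$ settings, which is why the exhaustive strategy remains affordable for values of this size.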
Even if not explicitly stated, our evaluation can be easily applied to a single-provider architecture, which is a special case obtained by setting $P = 1$ in (13) and which drastically simplifies the MSS in Fig. 2. Indeed, in the single-provider case, we have to take into account only the containerized instances (that can be in working or failed conditions) of the provider under analysis, and the performance rate vectors in (1) reduce to scalars.

VII. EXPERIMENTAL RESULTS
We present results from a complex experiment that incorporates real-world hardware and software technologies. Our testbed is based on the Clearwater IMS that was previously introduced. We deployed the three main Clearwater nodes (Bono, Homestead, Sprout/S-I, represented in the leftmost panel of Fig. 1) on three dedicated server machines, each equipped with: Intel Xeon™ (16-Core, 1.80 GHz), 64 GB of RAM, 2 SATA HDDs of 500 GB each, and 1 NetApp Network Storage Array (32 TB of storage and 4 GB of SSD cache). The operating system on top of each node is based on Linux kernel 4.4.0 with Docker engine version 19.03.5. All the nodes are connected through an Ethernet network switch supporting a maximum throughput of 1 Gbps. Moreover, an additional node hosts SIPp, a SIP stress tool that we use as workload generator.
Now, we find it convenient to split the remainder of this section into three parts: in the first one, we deal with the estimation of parameters to insert into MSS and queueing models (in particular, repair times and service times); in the second one, we describe how to use PE-MUGF to find the best cIMS settings; in the last part we evaluate, through a sensitivity analysis, the robustness of the obtained settings when some critical parameters deviate from their nominal value.

A. Parameters Estimation
Through our testbed we are able to estimate two classes of parameters. The first class pertains to the service time distributions obtained by analyzing the logs of all cIMS nodes. In particular, we have stressed the cIMS architecture with 10000 SIP requests from 10000 subscribers automatically generated via SIPp, and we have built the empirical service time distributions per node and derived the corresponding mean values. The results are reported in the first part of Table I, expressed as the inverse of the service rate ($1/\beta$). We note that all the mean service times are in the order of $10^{-3}$ seconds except for the mean service time of the I-CSCF, which is in the order of $10^{-2}$ seconds. We will see experimentally that, in view of (5), such a higher value will adversely affect the availability of the whole cIMS if not adequately countered through specific redundancy strategies. From the empirical distributions of service times we are also able to derive the coefficient of variation $V_S$ for each node, expressed as the ratio between the standard deviation and the mean value calculated from such distributions, whose values are also reported in Table I.
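The estimation of the mean service time and of the coefficient of variation from logged samples can be sketched as follows. The sample values are hypothetical (they are not the measurements behind Table I), and the function name is introduced here only for illustration.

```python
import statistics

def coefficient_of_variation(samples):
    """Ratio between the sample standard deviation and the sample mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical service times of one node, in milliseconds.
service_times_ms = [2, 4, 6]

mean_ms = statistics.mean(service_times_ms)        # estimate of 1/beta
v_s = coefficient_of_variation(service_times_ms)   # estimate of V_S
print(mean_ms, v_s)
```

Being a ratio of two quantities with the same unit, the coefficient of variation is dimensionless and does not depend on whether times are logged in seconds or milliseconds.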
The second class of parameters pertains to the repair (or reboot) times for each layer of a CF. To perform such an estimation we have implemented a fault injection routine [57], [58] which automatically injects into a CF three types of faults: container faults (simulated by I/O exceptions and resource exhaustion to force containers to crash and to reboot), docker faults (simulated by forcing an abrupt termination of the dockerd process), and infrastructure faults (simulated by a physical machine crash). In total, we have performed 360 fault injection experiments (30 fault injections for each of the 3 layers and each of the 4 types of CFs). Once a layer is restored after a fault, we experimentally measure the pertinent repair time. Specifically, as regards repair times differentiated per layer and per CF type, our experiments reveal that slight differences in repair times for each single layer might arise, as shown in Fig. 4. Such a behavior is due to the technological differences among CFs. For instance, the container and docker layers of the HSS-type CF exhibit a slightly longer reboot time (w.r.t. the reboot times of the container and docker layers of the remaining CFs), due to the internal database structure that requires more time to be restored. For the sake of simplicity, in Table I we report only the average values of repair times per layer without specifying the CF type they refer to.
Moreover, through dedicated scalability tests, we have also investigated the behavior of application layer reboot times when multiple containers are deployed on top of a CF. In particular, we have deployed onto an I-CSCF-type CF 16 containers, in line with the number of cores in our server machines (similar results were observed on CFs for P-CSCF, S-CSCF, and HSS). Then, we configured the containers to optimize the performance and recovery of the system (following the best practice from real-world systems [59], [60]), by configuring CPU affinity policies to avoid CPU contention between containers and common-mode failures due to a CPU failure. This configuration makes the recovery time insensitive to the number of containers. As shown in Fig. 5, despite some variability due to shared resources between the containers (e.g., communication with the container manager process), the average recovery time does not significantly vary even if we increase the number of containers. Thus, the estimated container repair time can be reasonably considered constant and scarcely dependent on the number of containers deployed in parallel. The remaining parameters, namely, the mean times to failure per layer ($1/\lambda_C$, $1/\lambda_D$, $1/\lambda_I$), have been derived from the scientific literature [48], [61], [62], [63].

B. Optimal cIMS Settings
The second part of the experiment aims to demonstrate how to determine the optimal cIMS setting in accordance with (15). We have implemented an exemplary cIMS model where each CF can support $P = 2$ providers and whose corresponding MSS is shown in Fig. 6. We assume that provider 1 is able to support 2 instances, and provider 2 is able to support 3 instances. It is easy to note that, in accordance with (3), the total number of states amounts to 14. Precisely, we have 12 states ($S_1, \ldots, S_{12}$) embodying the failure/working status of each instance (i.e., the $(2+1) \times (3+1)$ combinations of working instances of the two providers) and 2 states ($S_D$ and $S_I$) embodying the failure/working status of docker and infrastructure layers, respectively. Analyzing the MSS in Fig. 6 we can see that, starting from the initial state that is the completely working state $S_{12}$, we can reach $S_{11}$ with failure rate $2\lambda_{C_1}$, since one of the two working instances of provider 1 may fail. In contrast, when returning into $S_{12}$ from $S_{11}$, we have a repair rate of $\mu_{C_1}$ since only one failed instance needs to be repaired. All the pairs $\{\pi_\sigma, \delta_\sigma\}$ associated to the considered MSS can be found by solving the differential equation (4), along with the normalization condition $\sum_{i=1}^{12} \pi_i(t) + p_D(t) + p_I(t) = 1$, and considering the limit $t \to \infty$. The infinitesimal generator matrix $Q$ derived from the MSS in Fig. 6 can be expressed in the compact form (16), where the diagonal of $Q$ is reported separately in (17) and where all numerical values of parameters are drawn from Table I.
At this point, we need to set the remaining input parameters for the PE-MUGF algorithm. As arrival rates for the two providers we set, somewhat arbitrarily, $\alpha_1 = 100$ s$^{-1}$ and $\alpha_2 = 200$ s$^{-1}$. Furthermore, we choose one and the same value for the maximum tolerated steady-state mean CSD of the two providers. Such a choice (one order of magnitude less than ETSI values) is justified since, in a local testbed, we neglect all the propagation delay contributions arising in wide geographical networks. As the maximum number of CF redundant replicas $L_n$, we set 3 for all nodes in the cIMS, and as the availability target $A_0$, we set the classic five nines 0.99999. Finally, $N = 4$ (we have 4 cIMS nodes).
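The steady-state solution of a Markov model of this kind (solving $\pi Q = 0$ together with the normalization condition) can be sketched on a reduced example. The snippet below considers only the two instances of one provider, with three states (2, 1, 0 working instances); the rates, the single-repair assumption in the fully failed state, and the function name `steady_state` are illustrative assumptions and do not reproduce the 14-state matrix (16).

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination (pure Python)."""
    n = len(Q)
    # Rows of the transposed generator are the balance equations; the last
    # one is replaced by the normalization condition sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [x - f * y for x, y in zip(A[r], A[col])]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in reversed(range(n)):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi

lam, mu = 1.0, 10.0  # toy failure and repair rates (not the Table I values)
Q = [
    [-2 * lam, 2 * lam, 0.0],   # 2 working: either instance may fail
    [mu, -(mu + lam), lam],     # 1 working: repair or further failure
    [0.0, mu, -mu],             # 0 working: one repair at a time (assumption)
]
pi = steady_state(Q)
print([round(p, 6) for p in pi])
```

Each row of a generator matrix sums to zero, so one balance equation is redundant; replacing it with the normalization condition makes the linear system uniquely solvable.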
The CF cost parameter is an arbitrary value that can be customized with no loss of generality. For the sake of simplicity, we assume that each CF has a unitary cost $E(n,\ell) = 1$.
The last parameter to be provided to the PE-MUGF algorithm is the coefficient of variation of inter-arrivals $V_A$ which, we recall, in a G/G/m queueing system depends on the particular shape of the inter-arrival distribution. Unlike the coefficient of variation of service times $V_S$, which we have estimated from the empirical service time distributions, empirical inter-arrivals cannot be simply emulated since they strongly depend on the behavior of users. This notwithstanding, and in line with the literature [64], [65], generic inter-arrivals can be modeled by exploiting the versatility offered by the Gamma distribution.
In particular, we employ the distribution $\Gamma(\theta, 1)$, namely a Gamma distribution with a variable shape parameter $\theta$ and the scale parameter set to 1 (as suggested in [64]). By varying the shape parameter of the Gamma distribution, we observe different distribution shapes of inter-arrivals, including the exponential distribution obtained for $\theta = 1$ (corresponding to $V_A = 1$), which represents the M/G/m queueing model. Figure 7 shows a set of inter-arrival cIMS request distributions corresponding to 7 different values of $\theta$ and, hence, of the coefficient of variation. We choose the exponential case as the benchmark (black dashed curve with $\theta = 1$ and $V_A = 1$), and we span some values around such a benchmark. We note that, for $\theta < 1$ the coefficient of variation decreases and the corresponding distribution stretches out. In contrast, for $\theta > 1$ the coefficient of variation increases and, as expected according to (6), this increase will adversely affect the availability, as we will numerically show in a while.
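A quick way to generate Gamma-distributed inter-arrival times with a prescribed coefficient of variation is sketched below. It assumes the standard shape-scale parametrization used by Python's `random.gammavariate`, under which the coefficient of variation equals $1/\sqrt{\text{shape}}$ regardless of the scale; how this shape maps onto the $\theta$ values of Fig. 7 is not assumed here. The helper names are hypothetical.

```python
import random
import statistics

def shape_for_cv(cv):
    """Gamma shape yielding the requested coefficient of variation.

    In the shape-scale parametrization, CV = 1 / sqrt(shape).
    """
    return 1.0 / cv ** 2

def empirical_cv(shape, n=20_000, seed=0):
    """Estimate the CV of Gamma(shape, scale=1) inter-arrivals by sampling."""
    rng = random.Random(seed)
    xs = [rng.gammavariate(shape, 1.0) for _ in range(n)]
    return statistics.stdev(xs) / statistics.mean(xs)

# V_A values considered in the availability analysis of this section.
for v_a in (0.7, 1.0, 1.2, 1.3):
    k = shape_for_cv(v_a)
    print(f"V_A={v_a}: shape={k:.3f}, sampled CV={empirical_cv(k):.3f}")
```

Sampling is used here only as a sanity check of the analytical relation; in a queueing formula the value of $V_A$ can be plugged in directly.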
We run PE-MUGF as many times as the number of $V_A$ values. For the sake of simplicity, let us start with the reference value $V_A = 1$. As mentioned before, PE-MUGF returns the optimal cIMS setting (namely, the one exhibiting the maximum availability at the minimal cost) and a list of sub-optimal settings. Among the listed settings, we choose 6 of them ($S_1, \ldots, S_6$), where $S_1$ represents the optimal one since it has the highest availability value at the minimum cost. Table II summarizes the composition of such 6 settings by specifying, in the second column, the number of redundant CFs per node. For instance, with respect to the optimal setting $S_1$, the P-CSCF node is made of 2 redundant CFs (CF(P) = 2), the S-CSCF node is made of 2 redundant CFs (CF(S) = 2), the I-CSCF node is made of 3 redundant CFs (CF(I) = 3), and the HSS is made of one CF (CF(H) = 1). The third column reports the cost of each setting, simply obtained as the sum of unitary costs of each CF per node. The fourth column reports the corresponding steady-state availability value.
Now, by re-running PE-MUGF with a set of $V_A$ values chosen among the most significant ones shown in Fig. 7, we re-evaluate the availability of the same six settings to make useful comparisons. Such results are shown in Fig. 8 where, for convenience, the y-axis reports (in log scale) the unavailability values ($1 - A_c(\boldsymbol{\delta}^*, S)$, lower is better) of the six settings. We also draw three availability thresholds as horizontal black dashed lines at: $10^{-4}$ (four nines), $10^{-5}$ (five nines), and $10^{-6}$ (six nines). For example, when a bar lies above the $10^{-5}$ threshold, it means that the five nines steady-state availability requirement is violated.

Fig. 7. Gamma-distributed inter-arrival times and corresponding coefficients of variation ($V_A$).

TABLE II REDUNDANCY DEGREES FOR THE SIX EXEMPLARY SETTINGS (FOR
For each setting we report 4 cases corresponding to different values of the coefficient of variation $V_A$. The first case includes a range of values obtained for $V_A \le 0.7$ (blue bars). Focusing on this case, we see that $S_1$ (the optimal setting for the exponential case $V_A = 1$), $S_2$, and $S_6$ meet the five nines requirement ($S_6$ even satisfies the six nines requirement since the blue bar lies below the $10^{-6}$ line). Among the remaining settings, it is interesting to note that $S_4$ does not meet the five nines requirement even if its cost is higher than the $S_1$ cost ($E_c(S_1) = 8$, and $E_c(S_4) = 9$). As mentioned before, such an apparently counterintuitive behavior is due to a bad allocation strategy of redundant CFs for $S_4$. In particular, too few redundant CFs have been assigned to the I-CSCF which, as can be seen from the values in Table I, is slower than the other nodes in serving the IMS requests. This translates into higher values of mean sojourn times due to (5), with consequent impact on the overall availability according to (14).

Fig. 8. Effect of the inter-arrival times variation on the steady-state availability of $S_1, \ldots, S_6$ settings.
By increasing $V_A$ up to 1 (representing the benchmark case, whose availability values are reported numerically in Table II), we can notice that the availability of $S_1$ remains stable (0.999992), whereas the availability of $S_2$ falls below the five nines threshold, reaching 0.999985. We observe an availability decrease also for $S_3$ (from 0.999984 to 0.999957), $S_4$ (again from 0.999984 to 0.999957) and for $S_5$ (from 0.999975 to 0.999949), whereas $S_6$ continues to be stable. Similar considerations hold true for a $V_A$ value of 1.2 (yellow bars), where in some cases the steady-state availability remains stable ($S_1$, $S_4$, $S_6$), whereas in the remaining cases it undergoes a deterioration. Finally, $V_A = 1.3$ (violet bars) seems to be a critical value since $S_1$ violates the five nines condition and $S_5$ even violates the four nines condition (achieving three nines), whereas, surprisingly, $S_6$ continues to fulfill the six nines condition. Thus, a network operator could decide to deploy $S_6$ (even if it is not optimal due to its cost) since it appears to exhibit great robustness to the variation of the inter-arrival times.
At this point we can summarize some useful facts. First, we have seen how the availability is adversely affected when the inter-arrival time distributions show a greater variance (namely, $V_A$ increases). There are two ways to counter such an effect: the first one is to increase the redundancy, paying the price of a higher cost; the second one is to improve the service times (so as to reduce the impact of $V_S$ in (6) up to a certain extent) but, also in this case, this translates into higher costs because more computation resources are needed.
The second fact is that the allocation strategy of CFs is crucial to obtain high availability values. In our case, in fact, $S_1$ achieves five nines even if no redundancy at all is provided for the HSS. In contrast, $S_3$, which is obtained from $S_1$ by moving a CF replica from the I-CSCF to the HSS, violates the five nines condition for all values of $V_A$. This is due to the fact that in $S_1$ we give more robustness to the I-CSCF (with 3 CF replicas), which suffers from the slow service time.
The last fact is that, by paying a slightly higher cost, we can obtain a very robust setting ($S_6$ in our case) with two advantages: first, it achieves the challenging six nines requirement (MAD of 32 seconds), and, second, it appears to be particularly insensitive to the variation of $V_A$ which, as seen before, is detrimental for the steady-state availability.
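The correspondence between a steady-state availability value and the yearly downtime it implies can be checked with a few lines of Python. This is a minimal sketch assuming a 365-day year; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year (assumption)

def yearly_downtime_seconds(availability):
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR

# Classic targets and one of the availability values discussed in this section.
for label, a in [("four nines", 0.9999), ("five nines", 0.99999),
                 ("six nines", 0.999999), ("S1 at V_A = 1", 0.999992)]:
    print(f"{label}: {yearly_downtime_seconds(a):.1f} s/year")
```

Under this convention six nines corresponds to roughly 31.5 seconds of downtime per year, consistent with the figure of 32 seconds quoted above.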

C. Sensitivity Analysis
As the last analysis, we are interested in evaluating how the availability values are impacted when failure and repair parameters deviate from their nominal conditions (e.g., due to estimation errors or to non-steady behaviors). Namely, we perform a sensitivity analysis wherein we fix the value of $V_A$ to 0.7 and compare the behavior of settings $S_1$ and $S_2$ since, for $V_A = 0.7$, both guarantee the high availability condition with the same value ($A_c(\boldsymbol{\delta}^*, S_1) = A_c(\boldsymbol{\delta}^*, S_2) = 0.999992$).
The three uppermost [lowermost] panels of Fig. 9 show the availability behavior in response to the variation of failure [repair] times for container, docker, and infrastructure layers. Each panel reports the five nines threshold as a horizontal blue dashed line. Moreover, the red circle marks the nominal value of the parameter as drawn from Table I. At first glance, we can notice that the different responses of the availability for $S_1$ and $S_2$ can be appreciated at the application layer, whereas the behaviors of the two considered settings tend to be the same at the docker and infrastructure layers. The reason is that failure and repair values for the application layer span a smaller range with respect to the remaining layers. More in detail, we can see that $S_1$ is more robust than $S_2$ to the variation of $1/\lambda_C$ (topmost-left panel). Precisely, for $S_1$ [$S_2$], $1/\lambda_C$ can be reduced by about 80% [65%] of its nominal value without violating the high availability condition. As concerns docker and infrastructure layers (topmost-middle and topmost-right panels), we note that parameters (both for $S_1$ and $S_2$) can be relaxed by about 16% and 50% of their nominal values, respectively. Likewise, the greater robustness of $S_1$ w.r.t. $S_2$ is evident for $1/\mu_C$, whereas no practical differences emerge in relaxing docker and infrastructure repair parameters for $S_1$ and $S_2$. Once again, it is useful to remark that, even if the number of CF replicas employed for $S_1$ and $S_2$ is the same (implying the same cost), the greater robustness of $S_1$ is explained through a better allocation of replicas: the only node with no redundancy is the HSS which, in terms of mean service times, exhibits the best performance.
We finally note that such an analysis could be useful to decide between $S_1$ and $S_2$ that, according to PE-MUGF evaluated with $V_A = 0.7$, have the same availability and the same cost (0.999992 and 8, respectively). In fact, $S_1$ could be preferred since it turns out to be more robust w.r.t. variations of parameters $\lambda_C$ and $\mu_C$.

Fig. 9. Sensitivity analysis to evaluate the steady-state availability variation for two settings $S_1$ and $S_2$, when failure parameters (topmost panels) and repair parameters (lowermost panels) deviate from their normal behavior.

VIII. CONCLUSION
In this paper, we have examined in detail both performance and availability aspects of a container-based IMS (cIMS) infrastructure implementing the service function chain logic. In the first part, we have formalized an MSS model of the containerized function to cope with availability issues, and a G/G/m queueing model to deal with performance aspects. In the second part, supported by an ad hoc algorithm named PE-MUGF, we were able to derive the optimal redundant cIMS setting where given performance (in terms of mean call setup delay) and availability (in terms of number of "nines") requirements are satisfied at the same time. The results highlight that the allocation strategy of redundant cIMS elements is crucial to guarantee an availability value that depends as little as possible on the variations of the system parameters. Some hints for future developments may include: the possibility of further decomposing the MSS to take into account additional components (e.g., the hypervisor in case of virtual machine deployment); the possibility of embodying Quality-of-Service indicators to differentiate the cIMS requests according to some service classes (e.g., gold, silver, bronze); the possibility of examining the availability variations when the system is under particularly stressed conditions (e.g., simulating busy hour requests).

Fig. 1. Architecture overview: the testbed based on the Clearwater platform (left panel); the 3-layer Containerized Function (CF) constituting a cIMS node (middle panel); the G/G/m queueing model for each provider p (right panel).

Fig. 3. Series/Parallel cIMS architecture: each node is connected in series and is made of a number of redundant CFs connected in parallel.

Fig. 4. Mean times to repair per layer and per CF type.

Fig. 5. Stress tests on containers repair times (boxplot representation). The number of concurrent running containers does not dramatically affect $1/\mu_C$.

Fig. 6. MSS model of the exemplary CF with 2 providers and 14 states.