Validation of Biologically Inspired Tactics to Increase Multi-Agent System Resilience

There is an urgent need to increase multi-agent systems’ resilience. Efforts, however, are hampered by three gaps: current approaches are case study specific, may depend on infrastructure investment, and are often unshared. In response to these needs, we propose 14 biologically inspired Tactics for individual agent design and communication protocols. The Tactics are generalizable and do not depend upon infrastructure investments. The Tactics are applied to a hybrid system dynamic and agent-based model of an electric motor manufacturing supply chain case study. The approach for applying each of the Tactics is clearly described to support future application. Tactic use successfully increased resilience by an average of 6.6%. By reporting a context-neutral solution approach to increase resilience, focusing on agent intervention, and clearly reporting the implementation to a supply chain case study, this article takes an important step toward our goal of using biologically inspired design to increase multi-agent system resilience.


I. INTRODUCTION
Resilient systems can respond to faults with limited degraded performance and minimal loss of service [1], [2], [3]. Due to the difficulty of increasing resilience, however, designers often invest in other "ilities" (reliability, sustainability, and performance) [1], [4]. As a result, systems are often resilient to a small set of expected faults, but susceptible to unexpected faults [1], [5], [6]. Examples of low resilience to unexpected faults include the Columbia Shuttle disaster and 2003 North American power outage [5], [7]. Resilience engineering is a new field, and standard design-for-resilience approaches are needed for analysis and detailed design [3], [8].
The following three challenges, however, limit design-for-resilience.
1) GAP 1: Approaches to improve resilience are frequently nongeneralizable. Studies often focus on a single case study or a specific set of flaws using a point design approach [6], [9]. Thus, developed approaches cannot be transferred to other cases [1], [6], [10].
2) GAP 2: Current infrastructure design-for-resilience approaches often focus on altering system network structure (e.g., component hardening or adding redundancy) [4], [11], [12], [13], but resilience can also be impacted through agent-level design. Increasing network density may be expensive, impractical, insufficient, or even reduce resilience by enabling new cascading faults [1], [14], [15]. For example, previous redesigns to increase resilience have altered the number of pipe connections (e.g., 25 to either 34 or 43) or electrical distribution lines (e.g., from 20 to 80) in a system [16], [17]. The proposed cyber-defense agents do consider agent characteristics [18], [19], [20], but these efforts focus on interdiction and repair from an intelligent adversary, not unforeseen faults.
3) GAP 3: Successful examples of design-for-resilience are often unshared between use-case communities. Resilience research spans power distribution, ecology, psychology, and engineering, with each having different terminology, concepts, and intervention strategies [21], [22].
Increasing the resilience of multi-agent systems (MASs) is nontrivial. Independent agents act autonomously but collaboratively to solve a common problem [23]. MAS performance can be difficult to predict. Also, agent intervention in response to a fault could result in greater harm than inaction [24]. Agent behavior impacts the system's ability to anticipate, identify, and recover from adversity. Thus, our key research question is: How can biological inspiration be used to design agent behavior for increased MAS resilience?
To explore this question, a previous investigation examined eusocial insect colonies, such as honeybees or ants [25]. Biologically inspired design (BID) is a recommended area for current resilience research [26]. Eusocial insect colonies demonstrate resilience when they react to external threats (e.g., changes in food availability). In total, 14 Tactics were identified through a process of functional decomposition and analogical transfer. Tactics are specific approaches to apply a transferred function from biological inspiration to an MAS. These Tactics provide guidance for individual agent design and the communication network between them.
Our previous work, however, did not present simulation or real-world evidence that the proposed Tactics result in increased resilience. Our goal in this work is to verify and validate the proposed Tactics (see Fig. 1). This is significant because the 14 Tactics provide an approach to proactively design-for-resilience, rather than reactively correct resilience deficiencies. Following current guidance, the agents respond autonomously and do not require human teaming [20], [24]. This goal led to our central hypothesis: If we apply functional behaviors from eusocial insects to an MAS simulation, then we will see increased resilience.
To test this hypothesis, the 14 Tactics are applied to a model of an electric motor manufacturing supply chain (EMMSC). The resulting change in resilience to unexpected link removal, as measured by the disturbance in performance and recovery dynamics, is analyzed. Link removal is considered due to the importance of constituent interfaces and to scope this validation. Fault impact is reported through the expected financial impact. While the previous work focused on deriving these Tactics [25], this article focuses on testing the Tactics and presenting them as a validated solution-neutral approach to increased resilience (see Fig. 1).
This article presents the following three contributions.
1) The 14 Tactics are a solution-neutral approach to increase MAS resilience. Theoretically neutral approaches are an identified need for resilience engineering [13]. The Tactics are phrased in terms of agent-based modeling (ABM) construction; thus they are not tied to a specific application domain (GAP 1). The Tactics are also connected to the five functions needed in autonomous defense agent architecture proposed by a NATO Research Group [19], [20].
2) By promoting the use of agent-based strategies to increase resilience rather than infrastructure investment, there is potential for significant cost savings (GAP 2).
3) The 14 Tactics are validated and applied to a supply chain case study. This provides evidence that these Tactics provide a viable approach to increase resilience. An important note: The goal of this article is not to identify or determine the optimal use of each Tactic, but rather to show that these Tactics can increase resilience (GAP 3).
The rest of this article is organized as follows. First, Section II discusses resilience, MASs, agent-based modeling, and the previously identified design-for-resilience Tactics. Next, Section III provides a description of the case study, the modeling approach, resilience measurement methodology, and testing procedure. Section IV analyzes the improvement when applying each Tactic. Finally, Section V concludes this article.

II. BACKGROUND
A. CURRENT APPROACHES TO INCREASE RESILIENCE
Resilience refers to the ability to anticipate, absorb the impact of, respond to, and recover from faults [1], [4], [13], [21], [24], [26], [27]. The Tactics tested in this article focus on the absorb, response, and recovery phases, while current research often focuses on the anticipation stage (e.g., plan/prepare stage) [13]. Resilience differs from the related concepts of robustness and reliability. A resilient system may display decreased performance after a fault, while a robust system would not. Reliable components prevent faults from occurring [4], while resilience also includes system recovery [1]. It is necessary to also absorb, respond to, and recover from faults, because as systems grow in complexity it is impossible to prevent all faults [13], [26]. In addition, reducing the impact (absorb phase) alone may not result in satisfactory performance without considering the response dynamics (response and recovery phases) [13]. Resilience reflects both the transient response to faults and the prevention or mitigation of long-term damage [13].
One area of considerable interest is resilient MAS for cyber-defense applications [18], [24], [26], [33]. It is often infeasible to use humans-in-the-loop for cyber-defense operations due to accessibility or the needed rapid response [18], [19], [24], [26], [33]. Therefore, some efforts have focused on developing autonomous resilient agents that can both detect and respond to cyber-attacks [20], [33]. Designing autonomous cyber-agents so that their interventions do not further degrade the situation is a significant challenge [20], [24]. Like this article, these efforts often focus on intentional agent design before an expected fault occurs, not human intervention during a fault [15], [24]. Agent approaches can be divided into four domains: physical, information, cognitive, and social [13], [36]. The Tactics examined in this study focus on the information and cognitive domains (how to share information and how agents should make decisions), while bypassing the often-costly physical interventions without advance knowledge of the threat specifics (i.e., type of fault or duration). Cyber-defense agent research has not considered eusocial-insect inspired interventions, rather it focuses on defining architectural requirements to guide and assess cyber-agent development [19], [20].
Biologically inspired design has been used, however, to increase the resilience of systems by seeking to mimic ecosystem structures in artificial networks. This includes applications, such as reconfiguring power grid connections, through ecological network analysis of food web structures [17], utilizing graph theory metrics based on ecological principles for supply chains [37], [38], and optimizing resource cycling in water distribution patterns. For example, a case study focusing on water distribution achieved substantial cost savings during both normal operations and fault recovery by examining a robustness metric [16], [17]. Another study examining over 60 000 simulated supply chains makes specific topology recommendations based on desired performance and ecosystem inspired metrics [38]. The degree of system order ecological metric has also been shown to enhance both the resilience and cost effectiveness for the design of a hypothetical combat and context-neutral system-of-systems [39], [40]. These efforts, however, do not leverage biologically inspired agent behavior to increase system resilience (GAP 2).

B. MULTI-AGENT SYSTEMS: CHARACTERISTICS, BENEFITS, USES, AND CURRENT LIMITATIONS
MASs have several unique characteristics. MASs combine independent entities (agents) to solve a problem beyond the ability of an individual [23], [41]. Autonomous agents communicate with each other, sense their environment, and act rationally in their own self-interest toward goals [19], [42], [43]. An agent can be a physical artifact or exist purely in software [26]. A recently proposed architecture for autonomous intelligent cyber-defense agents identified five key functions: sensing, planning, collaboration, acting, and learning [19], [20]. The Tactics tested in this article impact the execution of these functions. Although the previously proposed architecture focuses on cyber applications, that research addresses a subset of the general question we examine: MAS resilience.
Biologically inspired design has also been applied to MAS, although not with the goal of increasing resilience through agent behavior. Eusocial insects have inspired swarm-against-swarm combat approaches, improved collective decision making under uncertainty, Internet server allocation, and ant colony optimization routines. Research drawing on inspiration beyond eusocial insects includes synchronization approaches inspired by fireflies and frogs, improved area search routines inspired by territorial bird behavior, and the use of flocking behavior for data sorting and classification [47], [48], [49], [50].

C. AGENT-BASED MODELING
The simulations used in this study were performed in AnyLogic, a Java-based modeling platform commonly used in ABM investigations [51]; thus, some software-specific terms in the following discussion are particular to AnyLogic.
In ABM, local agent decisions are represented in statecharts [14], [52]. Statecharts describe each agent's internal transitions between predefined states (e.g., tasks or behaviors). States are annotated in italics (e.g., Search). An agent's movement between states is controlled by conditional statements, which can be activated by the environment, communication, completion of a previous task, or internal logic. Internal logic triggers include an elapsed timer, a probabilistic draw, or predefined transition rules. The conditions associated with the triggers are analogous to the use of algorithmic constraints in cyber-defense agent design [24]. An agent's triggers are impacted by the modeling parameters selected prior to the model execution. One advantage statecharts have over machine-learning-based agent control is increased explainability [24].

D. PREVIOUS INVESTIGATION INTO EUSOCIAL INSECT RESILIENCE
In our previous work, a literature review identified five functions that eusocial insects perform to increase colony resilience to link removal (see Table 1) [25].
1) Apply Genetic Variation: This function creates self-organized task allocation and enables more efficient task performance for individuals.
2) Implement Individual Learning: Learning impacts an agent's task performance as well as their willingness to perform the same task again in the future.
3) Incorporate Time Dependency: Regulating communications temporally ensures that agents do not act upon old data and ensures that agents maintain an acceptable level of awareness.
4) Communicate in Decentralized Networks: This function limits fault propagation while supporting data transfer.
5) Include Mechanisms for Information Amplification: This function allows repeated messages to percolate throughout a population and influence decision making.
The functions can be divided into two broad areas where the Tactics can be applied: individual agent design and communication protocols. Individual agent design refers to features that vary from individual to individual, based on an agent's limited capabilities, knowledge, and experience. An example is the varying response thresholds of ants to sugar. Communication protocols refer to aspects that impact how agents communicate with each other. For example, P. longicornis ants incorporate a "refresh" rate when making group decisions as they carry a large load back to their nest [53].
These functions can be applied through 14 Tactics (see Table 1). The Tactics are described in terms of ABM implementation to provide a solution-neutral approach to increasing MAS resilience. Framing the Tactics in terms of ABM modeling divorces the Tactics from any specific case study.

III. METHODOLOGY
A. CASE STUDY DESCRIPTION: EMMSC
The case study examined in this article is an EMMSC (see Fig. 2). EMMSC has 20 constituents divided into six roles (suppliers, subassemblers, final assemblers, distributors, in-use, and a recycling center). The suppliers receive raw materials from outside the system boundary and use them to produce one of the four primary components (rotor, stator, shield, or base). These components are sent to one of four subassemblers that produce drive-assemblies or case-assemblies. The two final assemblers then combine the assemblies to create motors. After assembly, the motors travel through four distributors and into use. Upon end of life, the motors are either sent to the recycling center or to the landfill (exports). The recycling center takes apart the motor and sends the refurbished primary components back to the subassemblers and the nonrecyclable components to the landfill.
EMMSC was previously used in studies examining approaches to measure and improve supply chain resilience [37], [54], [55]. Modifications made for this article include the addition of constituents 19 and 20, as well as the circular flow created by the recycling center. The recycling center was added to increase case study complexity and add cascading faults.
The steady-state stock and flow data for EMMSC are provided in Tables 2 and 3. The units are motor equivalents, where a component (e.g., a stator) is considered 1/4 motor equivalent and a subassembly is considered 1/2 motor equivalent. The components are weighted equally because the lack of either component prevents subassembly or final assembly manufacturing [37]. The supply chain is assumed to be balanced [37], [54]. The total output (sum of the flows to X_19) is 20 000 electric motors per week, and each motor is sold for a profit of 3225 USD [55]. We assume that the lifespan of the motor is ten years and that each constituent within the supply chain maintains a two-week safety stock. Each constituent is capable, however, of storing onsite a maximum of four weeks of safety stock. The initial recycling rate is 50%, and 50% of the motor parts entering the recycling center can be refurbished. We assume that 500 individuals (agents) work within the system. The individuals can work for different constituent systems. The number of individuals working at each constituent impacts the maximum production possible. The individuals can choose to change constituents to respond to faults. The approach to change roles and the information considered depend on the Tactic being tested.
The values chosen for motor lifespan, safety stock, recycling rate, and recycling effectiveness will impact the resilience of EMMSC. For this investigation, however, we are interested not in the specific resilience value of EMMSC, but how EMMSC resilience changes when implementing the Tactics in Table 1.

B. CASE STUDY MODEL FORMULATION
The EMMSC is formulated as a hybrid model. Flows between the constituents are modeled using a system dynamics approach (see Table 4), while agent behaviors are captured using an ABM. The agents respond to system level performance, and system level performance is impacted by agent behavior.
The model was simulated in AnyLogic 8.5.2 University edition, and the model unit time was weeks. Simulation runs were conducted on a personal laptop with an Intel Core i7-10750H CPU operating at 2.50 GHz and 16.0 GB of RAM.
If the role of agents is not considered, material flow between the constituents is given in Table 4. Additional model considerations include that if the link between X_19 and X_20 is broken, then all flows from X_19 become exports (motors do not simply stop breaking). Similarly, reduction in flows from the recycling center to the subassemblers also results in increased exports from the recycling center.
Agents impact system performance by shifting roles between constituents in response to faults. A constituent with more agents assigned to it has a higher possible output (higher productivity). Upon model initialization, 500 agents are spread equally between 20 different roles. The agents are either employees at constituents 1-18 and 20 or they are part of the company's recycling campaign (see Table 2). Those who work at a constituent impact the possible output of that constituent, modeled using the logistic curve in (1), where P(N) is the productivity of a constituent system as a function of the number of agents, N, working there. P(N) impacts the maximum production rate possible. The initial productivity of each constituent is 1.0 with the original 25 agents assigned. P(N) reaches a maximum value of two (with 50 workers) and a minimum value of zero (with no workers). P(N) captures the diminishing returns when adding extra workers and also reflects that when very few workers are present the constituent will barely function [P(3) = .03]. Thus, changing worker assignment changes constituent productivity, but not without bound and not linearly. In addition, changes in system productivity and costs are symmetric about the steady-state factory capacity (the productivity gained by adding one additional worker is offset by the productivity loss elsewhere in the system). The economic value of a change in productivity depends on the system (due to the systems having different motor equivalent export flow rates). For example, adding five agents to constituent 1 (for a total of 30) would result in a productivity of 1.46. The actual flow will depend upon sufficient inputs or safety stock. Agents have no impact on how long motors stay in use (i.e., the flow from X_19 to X_20 in Fig. 2).
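The shape of the productivity relation can be sketched numerically. Equation (1) is not reproduced in this text, so the sketch below assumes a standard logistic form, P(N) = 2 / (1 + exp(-k (N - 25))), with a hypothetical steepness k = 0.2. That value was chosen only because it reproduces P(25) = 1.0 and P(30) ≈ 1.46 and approaches the stated bounds of zero and two; it matches the reported P(3) = .03 only approximately (about 0.02). The function name, parameter names, and k are illustrative assumptions, not the authors' equation.

```python
import math

def productivity(n_agents, p_max=2.0, midpoint=25.0, steepness=0.2):
    """Assumed logistic productivity curve (a stand-in for the paper's Eq. (1)).

    p_max and midpoint follow the text (maximum of two, P(25) = 1.0);
    steepness is a hypothetical value fitted by eye to P(30) ~ 1.46.
    """
    return p_max / (1.0 + math.exp(-steepness * (n_agents - midpoint)))

# Print the curve at the staffing levels mentioned in the text.
for n in (0, 3, 25, 30, 50):
    print(n, round(productivity(n), 2))

# Symmetry about the steady state: moving one worker between two
# constituents at 25 agents leaves their combined productivity unchanged.
combined = productivity(26) + productivity(24)
```

Under this assumed form, the symmetry noted in the text follows directly from the logistic function: the gain at the receiving constituent equals the loss at the donating one when both start at the midpoint.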

C. RESILIENCE METRIC-(SOSRM)
Systems of Systems Resilience Metric (SoSRM) measures the change in resilience when using each Tactic [56]. The EMMSC is both an SoS and an MAS. SoSs are combinations of networked constituent systems that have three characteristics [56]. First, SoSs exhibit distributed control. Each constituent (X_{1-5}) is free to respond independently to supply chain disruptions according to their own self-interest. Second, SoSs demonstrate emergent behaviors. Supply chains exhibit emergent behaviors, such as the bull-whip effect (demand distortions that travel through a supply chain) [57]. Third, SoSs have evolved structures. Supply chains may add or subtract suppliers or distributors. SoSRM measures the expected resilience of an SoS to unexpected link removal between constituents. SoSRM focuses on link failures due to the critical role of system interfaces in an SoS [1], [58]. To calculate SoSRM, the model is first run without any faults (all links intact). Then, the model is rerun with each link broken for time T_T, after which the link is restored and the model performance is observed for an additional time T_T. T_T is the amount of time required for 63.2% of an energy input to flow through the system during nonfaulted operation. For EMMSC, T_T is 13.7 years. Observation over a period of years is often required as the impacts from a fault resonate through the system. For example, in [59] the system is observed for 100 years.
Thus, SoSRM reflects the dynamics of absorbing (decrease in motor production), responding (reassigning agents), and recovering (reestablishing equilibrium by reassigning agents) from the fault while having the fault duration and recovery timing dependent on network architecture (e.g., comparing resilience across fixed energy cycling). This is similar to approaches that advocate measuring resilience against a fixed adversary effort [26]. SoSRM provides a needed approach to compare resilience between different systems [13]. SoSRM analyzes a standard set of faults (each link) for a duration dependent on network structure (T_T) in order to avoid some of the concerns with time-integration based resilience measurements [26]. SoSRM's focus on recovery and adaptation improves a noted weakness of current resilience measurement approaches [26]. Because SoSRM provides a single numerical value, the impact of each stage is hidden when compared with other approaches, such as resilience matrices [13].
The process of breaking and restoring faulted links is repeated for the remaining links within an SoS, and the result is a ratio of average link faulted performance to no-fault performance, usually a value between 0 and 1. SoSRM is calculated using (2), where k is the number of links in the SoS (29 for EMMSC). The summation over k faults reflects that SoSRM is evaluated for each of the k links being disrupted, one at a time. n is the number of measures of performance (MOPs) in the SoS evaluated during each trial run. MOPs are defined by the system engineer. It is challenging to select MOPs that accurately reflect system performance [13], [26]. MOPs need not be continuous functions, but as recommended in the literature can be selected to reflect "mission parameters," such as friendly unit survivability [26], [56]. EMMSC has four MOPs, the flows from the distributors to in-use (X_{15,16,17,18} → X_19). Q_{g,fault}(t) is the MOP at time t impacted by one of the k evaluated faults. Q_g(0) is a constituent's steady-state MOP prior to fault occurrence and provides normalization for each of the MOPs considered. The fault occurs at time t_0, and T* is the trial run time per link (two times T_T). Thus, SoSRM is an index approach that combines several indicators of interest into a single value that can be used for comparison [13].
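The structure of this calculation can be illustrated with a short sketch. Equation (2) is not reproduced in this text, so the code below assumes one form that is consistent with the definitions above: for each of the k link faults, each of the n MOPs is time-averaged over the trial window T*, normalized by its steady-state value Q_g(0), averaged over the MOPs, and then averaged over the faults. Equally spaced samples stand in for the time integral. The function name, the data layout, the sample values, and this exact averaging are assumptions; see [56] for the authoritative definition.

```python
def sosrm(faulted_runs, steady_state):
    """Assumed SoSRM form: mean normalized MOP over all link faults.

    faulted_runs[i][g] holds equally spaced samples of MOP g during the
    trial window T* for the run in which link i is broken and restored.
    steady_state[g] is the no-fault value Q_g(0) used for normalization.
    """
    k = len(faulted_runs)      # number of link faults evaluated
    n = len(steady_state)      # number of measures of performance
    total = 0.0
    for run in faulted_runs:
        for g in range(n):
            samples = run[g]
            mean_q = sum(samples) / len(samples)  # time average over T*
            total += mean_q / steady_state[g]
    return total / (k * n)

# Hypothetical data for illustration only: two link faults, two MOPs,
# four time samples each. Only the first MOP of the first run degrades.
steady = [5000.0, 5000.0]
runs = [
    [[5000, 2500, 2500, 5000], [5000, 5000, 5000, 5000]],
    [[5000, 5000, 5000, 5000], [5000, 5000, 5000, 5000]],
]
score = sosrm(runs, steady)
print(score)
```

A value of 1 in this sketch means that no fault reduced any MOP, and lower values indicate larger or longer-lasting losses, which matches the ratio interpretation given in the text.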
All links in (2) are by default weighted equally because each constituent makes rational decisions in their own self-interest. Weighting of each link is also not required because cascading faults travel through the SoS after each link removal. SoSRM is a threat-agnostic resilience quantification approach, so SoSRM does not require identification of causes or associated probability of occurrence (e.g., a 10% chance that flooding occurs in a region during the next year). SoSRM assumes that faults in the links are possible and of equal likelihood (due to being an early design stage evaluation). This also allows for resilience evaluation without potentially expensive or biased expert evaluations or without a known threat profile [13], [26]. As an additional benefit, this approach bypasses the concern that resilience measurements could be biased by the magnitude of the fault evaluated for each system [26]. If, however, a system were to have links that could not be broken or a highly skewed occurrence distribution, then the problem should be reformulated (e.g., with a new k). SoSRM is intended for when detailed potential fault analysis is not available or desired.
Because SoSs vary widely in size and scope, it can be helpful to frame SoSRM in terms of the expected financial losses to provide a sense of physical significance, as given in (3), where $MOP NO FAULT is the USD equivalent value of the SoS's MOP during no-fault operation. Equation (3) assumes that all MOPs have equal values and are all impacted the same amount per change in the SoSRM unit. For this case study, all MOPs are the balanced output of the four distributors (X_{15-18}); SoSRM is thus the proportion of motors produced following faults, and MOP value is the total profit from selling the motors leaving the distributors. For EMMSC over 27.4 years (T*), $MOP NO FAULT is 92.1 billion USD in profits. Thus, E[$SoSRM] provides an indication of expected losses over the duration of the SoSRM calculation. For more information about SoSRM and SoSRM validation testing, see [56] and [60].
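The order of magnitude of the no-fault profit can be checked from the case study values stated earlier (20 000 motors per week at 3225 USD profit each, over T* = 27.4 years). The sketch below assumes 52 weeks per year, which gives roughly 91.9 billion USD, close to the reported 92.1 billion USD; the small difference presumably reflects the weeks-per-year convention or rounding of T*. Equation (3) is not reproduced in this text, so the loss function uses an assumed form, (1 - SoSRM) × $MOP NO FAULT, which follows from reading SoSRM as the proportion of motors still produced. All names and the example SoSRM value are hypothetical.

```python
MOTORS_PER_WEEK = 20_000        # total output stated in the case study
PROFIT_PER_MOTOR_USD = 3_225    # profit per motor stated in the case study
WEEKS_PER_YEAR = 52             # assumption: not stated in the text
TRIAL_YEARS = 27.4              # T* = 2 x T_T for EMMSC

# No-fault profit over the SoSRM trial window.
mop_no_fault_usd = (
    MOTORS_PER_WEEK * PROFIT_PER_MOTOR_USD * WEEKS_PER_YEAR * TRIAL_YEARS
)

def expected_loss_usd(sosrm_value, mop_value_usd=mop_no_fault_usd):
    """Assumed form of Eq. (3): the lost share of no-fault profit."""
    return (1.0 - sosrm_value) * mop_value_usd

print(f"No-fault profit: {mop_no_fault_usd / 1e9:.1f} billion USD")

# Hypothetical SoSRM value for illustration only (not a reported result).
example_loss = expected_loss_usd(0.90)
print(f"Expected loss at SoSRM = 0.90: {example_loss / 1e9:.2f} billion USD")
```

Under these assumptions, each 0.01 change in SoSRM corresponds to roughly 0.9 billion USD over the trial window, which is why small percentage changes in resilience can carry large financial weight.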

D. TESTING APPROACH
Testing the 14 Tactics compares the following five cases, summarized in Table 5.
1) EMMSC without agent behavior (NONE) as described in Section III-D1.
2) EMMSC with default agent behavior and no communication (DEFAULT AGENT) as described in Section III-D2. For this case, we assume no communication is needed (i.e., all agents have universal knowledge of the MAS). "Universal knowledge" for EMMSC is how many agents are assigned to each constituent system and each constituent's safety stock. This could be implemented in EMMSC through an online dashboard that reports the status of all constituents.
3) EMMSC with individual agent design Tactics implemented (Tactics 1-7 of Table 5) as described in Section III-D3. We expect that applying each Tactic will result in higher resilience than DEFAULT AGENT. Improvement significance was validated using a nonpaired two-tailed t-test with unequal variance (p<.01).
4) EMMSC with default agent behavior and communications (DEFAULT COMMS) as described in Section III-D4. For this case, we assume agents only receive information through communication from other agents. The information conveyed is the number of agents assigned to that agent's constituent system and the safety stock for that agent's constituent system. Communication could be implemented in EMMSC through web-based messaging or even pagers.
5) EMMSC with communication protocol Tactics (Tactics 8-14 of Table 5) as described in Section III-D5. Tactics 8-14 are compared with the default agent performance case (DEFAULT COMMS). Significance is validated using a two-tailed t-test with unequal variance (p<.01).
The agent behavior is stochastic (see Section III-D2), so evaluations required Monte Carlo approaches. A total of 25 trials was selected both to accommodate the lengthy run time required (a minimum of 40 min per trial) and because the results have a very narrow distribution (the standard deviation is less than 1% of the mean SoSRM value after only five trials). This narrow distribution is because the applied Tactic dominates performance, despite the individual runs demonstrating chaotic behavior and complex dynamics, such as the bull-whip effect.
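The significance test named above (a nonpaired t-test with unequal variance, commonly called Welch's t-test) can be sketched with the Python standard library. The sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom; converting these to a two-tailed p-value requires the t-distribution, which a statistics package such as SciPy provides through `scipy.stats.ttest_ind(..., equal_var=False)`. The function name and the sample values are hypothetical and are not results from this study.

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof

# Hypothetical SoSRM values for illustration only (five trials per case).
default_agent = [0.800, 0.802, 0.799, 0.801, 0.798]
with_tactic = [0.851, 0.849, 0.853, 0.850, 0.852]

t_stat, dof = welch_t(with_tactic, default_agent)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

Because the trial-to-trial spread is described as very narrow relative to the mean, even modest differences between cases would produce large t statistics in a test of this kind.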

1) NO AGENT BEHAVIOR (NONE)
In this case, EMMSC is simulated without agent action (i.e., resilience is due to network structure and default dynamics alone). Material flow is as given in Table 4. There is no agent stochasticity, so a single trial is sufficient to estimate resilience. The equations in Table 4 are still relevant for all other cases considered (Sections III-D2 to III-D5), but for the remaining cases, constituent output is also impacted by (1).

2) DEFAULT AGENT BEHAVIOR WITHOUT COMMUNICATION (DEFAULT AGENT)
In DEFAULT AGENT, agents can change which system they work for [impacting (1)], but no Tactics are implemented. DEFAULT AGENT provides verification that including agent behavior increases EMMSC resilience and provides a baseline to compare with Tactics 1-7.
DEFAULT AGENT's statechart is shown in Fig. 3. Prior to each model run, the agents are initialized by entering state Setup through either Link A, C, or P. Link B's transition to state GenWork occurs after a delay drawn from a uniform (0, 5) week distribution. During the initial 0-5 week delay, the agents contribute to (1). Once in GenWork, every five weeks the agent will transition to state Search through Link D. In state Search, the agent will determine if it should work at a different constituent (see Fig. 4). The logic in Fig. 4 is similar to the constraints proposed in [24] as a form of human supervision of agents. If the agent does not need to change roles, it transitions back to state GenWork through Link E; otherwise, the agent transitions through Link F to state ChangeRole and through Link L back to GenWork. Transitioning through Link L requires four days, during which the agent no longer contributes to (1). This delay captures the associated loss of effectiveness due to changing roles. The explanation of Links G through O is saved for Section III-D3. Inventory in Fig. 4 is a measure of current stock relative to no-fault stock levels. The economic impact of inventory depends on the no-fault safety stock of the system being evaluated. An inventory of 1.1 is an additional 1.2 million USD in safety stock at constituents 1-8 and 3.2 million USD at constituents 9-20 (not 19).

3) INDIVIDUAL AGENT DESIGN TACTICS (TACTICS 1-7)
Tactic 1-Vary parameters in transition triggers in identical statecharts: One approach to implement genetic variation is to vary agents' statecharts. Rather than identical logic trees in state Search (see Fig. 4), each agent draws values from a uniform distribution to determine how easy it is to change roles (see Fig. 5). We expect that implementing Tactic 1 will result in increased resilience over DEFAULT AGENT.
Tactic 2-Begin agents in different composite states: One result of genetic diversity within a colony is specialization [which impacts calculation of (1)]. Specialization can be applied using composite states (e.g., state Work within "SpecWork" in Fig. 4). Specialized agents are more efficient at their role [greater impact on (1)], but less flexible (unable to change roles). While in state Work, specialized workers provide an impact of 1.3N rather than 1.0N for (1). Thus, one specialized worker does the same amount of work as 1.3 generalized workers. Tactic 2 is consistent with Holland's hypothesis that under peaceful conditions, increasing specialization should be observed [61].
To implement Tactic 2, we begin agents in GenWork and SpecWork, rather than all in GenWork. This is analogous to starting a workforce with a portion that has specialized training and only performs a subset of the possible tasks. This Tactic is applicable for a new MAS standup or reorganization and would be centrally coordinated. In total, 24% of the agents (six workers per constituent) are assigned as specialists. We expect that specialization will increase resilience over a population of generalized agents (DEFAULT AGENT).
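The effect of the specialist weighting can be illustrated with a small calculation. The sketch below assumes that the weighted head count simply replaces N as the input to (1); the text states the 1.3 versus 1.0 weighting and the six specialists per 25-agent constituent, but how the weighted count enters (1) is an assumption here, and the function and constant names are hypothetical.

```python
SPECIALIST_WEIGHT = 1.3   # impact of one specialized worker, from the text
GENERALIST_WEIGHT = 1.0   # impact of one generalized worker, from the text

def effective_workers(generalists, specialists):
    """Weighted head count assumed to stand in for N in Eq. (1)."""
    return GENERALIST_WEIGHT * generalists + SPECIALIST_WEIGHT * specialists

# Initial Tactic 2 staffing at one constituent: 25 agents, six specialists.
n_eff = effective_workers(generalists=19, specialists=6)
specialist_share = 6 / 25

print(f"Effective workers: {n_eff:.1f}")
print(f"Specialist share: {specialist_share:.0%}")
```

With these numbers, a constituent staffed by 25 agents behaves as if it had about 26.8 generalized workers, at the cost of six agents who cannot change roles in response to a fault.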
Tactic 3-Limit the number of specialized agents: Despite specialization improving individual performance, colonies do not specialize their entire population [62]. This is because overly specialized populations lose the flexibility to respond to crises. Implementing Tactic 3 in practice would require a central authority to design a staffing plan that limits the number of specialized workers.
Tactic 4-Implement agent learning: For Tactics 1-3, agents are either designated as generalists or specialists and do not change designation during the model run. Tactic 4 implements learning to allow generalized agents to become specialized [influence on (1) of 1.3 versus 1.0]. Learning occurs in this model when the agents have searched multiple times to determine if they should switch roles (Links D and E).
Once the agent has searched for two years and has not needed to change roles, it then becomes specialized through Link H. Two years was chosen because it is the same interval at which a specialized agent checks whether it should change roles. Learning here affects the effectiveness of task completion; it does not represent adaptation to previous faults, revision of an agent's understanding of its operating environment, or the use of machine learning (a focus of some resilient cyber-defense agent research) [18], [24], [33]. Learning also influences the set of actions available to the agents [19], impacting the planning function of the agents.
Tactic 4 incorporates Links I-K (see Fig. 3). Link I functions similarly to Link D, but instead of checking every five weeks to see if the agent should change roles, the specialized agent performs this check every two years. The logic in CheckStatus is identical to Search, but if the agent should change roles, it is removed from the specialized composite state through Link K. Otherwise, if no change in role is needed, the agent transitions back to state Work through Link J. We expect that Tactic 4 will increase resilience over DEFAULT AGENT.
Tactic 5-Evolve parameters within statecharts: Agents can learn from their previous experiences [19]. Tactic 5 alters the probabilities associated with deciding to transition to another state (see Fig. 4, 20% and 50% occurrence probabilities). Each time an agent chooses to change roles, the occurrence probabilities are increased by 1% of their current value (e.g., from 20% to 20.2%). Likewise, choosing not to change roles causes the probabilities to decrease by 1% of their current values. Thus, experience in changing roles will make it easier (or harder) for the agent to change roles in the future. We expect that implementing Tactic 5 will result in a higher resilience than DEFAULT AGENT.
Tactic 5 (evolve parameters) and Tactic 1 (vary parameters) impact Fig. 4 transition logic differently. Tactic 1 changes the occurrence probability initially for all agents. Tactic 5, however, initializes all agents with the same initial transition logic (see Fig. 4), and then allows transition logic to change in response to individual experience.
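The Tactic 5 update rule is multiplicative, which a short sketch makes concrete. The function name `evolve_probability` and the cap at 1.0 are assumptions added for illustration; the 1% rate and the 20% starting value come from the description above.

```python
def evolve_probability(prob, changed_roles, rate=0.01):
    """Scale an occurrence probability by +/- rate of its current value.

    The cap at 1.0 is an assumption that keeps the value a valid probability.
    """
    factor = 1.0 + rate if changed_roles else 1.0 - rate
    return min(1.0, prob * factor)


p = 0.20
p = evolve_probability(p, changed_roles=True)    # about 0.202 (20% to 20.2%)
p = evolve_probability(p, changed_roles=False)   # about 0.19998
```

Because each step is relative to the current value, one increase followed by one decrease does not return exactly to the starting probability.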
Tactic 6-Use learning to reduce the difficulty of transitioning to composite states: Tactic 6 applies learning to the two-year timeout for Link H. As in Tactic 5, every time an agent successfully transitions to the composite state "SpecWork," the threshold used to transition through Link H is altered. Each time an agent travels down Link H, the required unsuccessful search time decreases by 10% (e.g., from 2 to 1.8 years). Thus, if an agent becomes specialized once, it is easier to become specialized in the future. Tactic 6 is a form of adaptive coupling, changing to improve response to repeated threats [13]. The threshold for Link H is reset to two years when entering Setup. We expect that Tactic 6 will outperform Tactic 4 as well as DEFAULT AGENT.
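A minimal sketch of the Tactic 6 threshold follows. The class `SpecializationThreshold` and its method names are hypothetical; the two-year starting value, the 10% reduction per Link H transition, and the reset on entering Setup are taken from the description above.

```python
class SpecializationThreshold:
    """Unsuccessful search time (in years) required before Link H fires."""

    def __init__(self, initial_years=2.0, reduction=0.10):
        self.initial_years = initial_years
        self.reduction = reduction
        self.years = initial_years

    def on_link_h(self):
        """Each trip down Link H shortens the next required search time by 10%."""
        self.years *= 1.0 - self.reduction

    def on_setup(self):
        """Entering Setup (start of a model run) restores the two-year threshold."""
        self.years = self.initial_years


threshold = SpecializationThreshold()
threshold.on_link_h()   # about 1.8 years
threshold.on_link_h()   # about 1.62 years
```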
Tactic 7-Use timeouts to exit composite states: Tactic 7 incorporates an automatic removal from specialized to generalized through Link O after ten years to promote population flexibility. Ten years allows the specialized agent to check five times to see if there was a need to change roles. The model began with 0% initial specialization. Transition to specialization occurs through Link H after two years of unsuccessful searching to change roles. We expect Tactic 7 to result in higher resilience than DEFAULT AGENT.

4) DEFAULT AGENTS WITH COMMUNICATIONS (DEFAULT COMMS)
Unlike the agents in DEFAULT AGENT, agents often do not have universal knowledge of all constituent systems' performance. Thus, DEFAULT AGENT no longer serves as a fair comparison. Rather, when investigating communication protocols, we develop a second nonbiologically inspired baseline: DEFAULT COMMS.
The communication network in EMMSC DEFAULT COMMS is a uniformly distributed network. Each agent is initially connected to two agents in every constituent (including its own initial constituent). Each agent has 50 connections and can receive messages from their connections requesting assistance. As the communication network evolves (due to agents changing roles), the information available to each agent also changes.
The DEFAULT COMMS statechart is shown in Fig. 6. Links T and Q are timeout transitions. When entering Distress, the agent sends a distress signal to all agents it is connected to if its constituent system's inventory is less than 1.0 and the value of P(N) is less than 1.9. After sending the distress signal, the agent transitions back to state GenWork after 1 day (Link T). Link S is unused for DEFAULT COMMS.
The timeout associated with Link D is given in (5) and restated in (6). In (5) and (6), Time_D is a discrete function where the kth transition depends upon the current stock of the constituent system that the agent is assigned to (N_i), the current inventory, and the instantaneous rate of change as approximated by the slope of the inventory. The frequency at which the slope is evaluated is a function of how often Link D is traveled. The subscript i in (6) is the constituent number (see Fig. 2). The distress signal is sent more frequently when there are fewer agents assigned to a constituent, its inventory is worse, or its inventory is decreasing rapidly.
After five weeks, the agents transition to state Search through Link D to see if there are any distress signals they should consider. The agents will not respond to distress signals (Link R) unless they receive them while in state Search. Internal logic in Search verifies that the received signal is valid. A valid distress signal is from another constituent with productivity <1.2 and is received by an agent eligible to change roles (its own constituent's inventory is increasing or stable, or there are excessive agents assigned to that constituent).
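The sending and validity conditions described above can be sketched as two predicates. All names below (`DistressSignal`, `should_send_distress`, `is_valid_signal`) are hypothetical, and representing "increasing or stable" as a non-negative inventory slope and "excessive agents" as a precomputed flag are assumptions; the thresholds 1.0, 1.9, and 1.2 are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass
class DistressSignal:
    sender_constituent: int      # constituent number of the sending agent
    sender_productivity: float   # productivity of the sender's constituent


def should_send_distress(inventory, p_n):
    """Condition checked on entering Distress: inventory < 1.0 and P(N) < 1.9."""
    return inventory < 1.0 and p_n < 1.9


def is_valid_signal(signal, own_constituent, own_inventory_slope, own_has_excess_agents):
    """Validity check applied in state Search to a received distress signal."""
    from_other = signal.sender_constituent != own_constituent
    sender_struggling = signal.sender_productivity < 1.2
    # Assumed reading: "increasing or stable" inventory means slope >= 0.
    receiver_eligible = own_inventory_slope >= 0 or own_has_excess_agents
    return from_other and sender_struggling and receiver_eligible
```

Keeping the two checks separate mirrors the statechart: the first runs in the sender when it enters Distress, the second in the receiver while it is in Search.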
After determining that the message was valid, the agent must evaluate if the valid message should be considered in the decision logic. This deliberation reflects individual agent stochasticity and decision making. It is more likely for the agent to consider a vote greater than its own constituent's productivity. If the message is to be considered, the agent tallies a "vote" to transition to that constituent. The agent will not consider the message if its own system's productivity is less than .05, which ensures that some agents remain assigned to every constituent. After receiving ten messages through Link R, the agent attempts to change roles to the constituent receiving the most votes through Link F. Link F includes a 1-hour delay.
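The vote tallying described above might be sketched as follows. The class `VoteTally` and its parameter `messages_required` are hypothetical, and the sketch assumes that every message arriving through Link R counts toward the ten-message threshold whether or not it earns a vote; the stochastic decision to consider a message is left to the caller.

```python
from collections import Counter


class VoteTally:
    """Collects votes from distress messages until enough have been received."""

    def __init__(self, messages_required=10):
        self.messages_required = messages_required
        self.received = 0
        self.votes = Counter()

    def on_message(self, constituent, considered):
        """Record one Link R message.

        Returns the constituent with the most votes once the threshold is
        reached, otherwise None.
        """
        self.received += 1
        if considered:
            self.votes[constituent] += 1
        if self.received >= self.messages_required and self.votes:
            return self.votes.most_common(1)[0][0]
        return None


tally = VoteTally()
senders = [4, 7, 7, 2, 7, 4, 7, 9, 7, 4]   # ten messages from constituents
results = [tally.on_message(c, considered=True) for c in senders]
target = results[-1]   # constituent 7 has the most votes (five of ten)
```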

5) COMMUNICATION PROTOCOL TACTICS (TACTICS 8-14)
An important methodological difference between comparing Tactics 8-14 to DEFAULT COMMS and comparing Tactics 1-7 to DEFAULT AGENT is that in Section III-D3 additional features were added to DEFAULT AGENT to see if SoSRM increased. For several of the tests in this section, however, DEFAULT COMMS already implemented the feature under test. Thus, we removed these features to verify that performance degraded as expected.
Tactic 8-Incorporate timeout safeguards for message triggers: Timeout safeguards ensure that the messages received by an agent are only considered for a set duration. Tactic 8 is analogous to the evaporation that occurs when ants communicate through pheromone deposition. Timeout safeguards are incorporated in Links E and L by discarding old message information. To test Tactic 8, EMMSC was evaluated without message history being discarded when traveling down Link E. We expect that removing the timeout safeguard (the Tactic 8 test) will result in lower resilience than DEFAULT COMMS.
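The difference between DEFAULT COMMS and the Tactic 8 test can be sketched as a message store that is, or is not, cleared when Link E is traveled. The class `MessageHistory` and the flag `discard_on_link_e` are hypothetical names; only the Link E behavior is shown, and Link L is omitted.

```python
class MessageHistory:
    """Distress messages an agent has accumulated (hypothetical structure)."""

    def __init__(self, discard_on_link_e=True):
        self.discard_on_link_e = discard_on_link_e
        self.messages = []

    def receive(self, sender_constituent):
        self.messages.append(sender_constituent)

    def on_link_e(self):
        """Called when the agent returns from Search to GenWork."""
        if self.discard_on_link_e:
            self.messages.clear()   # old information "evaporates"


with_safeguard = MessageHistory(discard_on_link_e=True)      # DEFAULT COMMS
without_safeguard = MessageHistory(discard_on_link_e=False)  # Tactic 8 test
for history in (with_safeguard, without_safeguard):
    history.receive(4)
    history.receive(7)
    history.on_link_e()
```

After the loop, the first history is empty while the second still holds both stale messages, which would then influence later decisions.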
Tactic 9-Avoid hierarchical communication chains: Tactic 9 recommends not using a hierarchical communication structure. This Tactic is consistent with complex network research that compares hierarchical and heterarchical (e.g., distributed) supervisory approaches. Although heterarchical approaches may prevent system-wide coordination and reaching a global optimal performance, they are usually able to react to faults with greater flexibility [41].
For this test case, hierarchy is determined by the constituent's position in Fig. 2. Connections per agent are given in Table 6. Of note, Tactic 9 yields slightly more average connections per agent (58 instead of 50) than for DEFAULT COMMS. We expect the performance with a hierarchical communication chain to be lower than the performance with a distributed communication network (DEFAULT COMMS).
Tactic 10-Avoid wisdom of the crowd approaches: Wisdom of the crowd approaches seek a consensus among a population before making a choice, rather than basing decisions on individual perceptions. To implement a wisdom of the crowd approach, the "vote" is put into a globally accessible pool. When executing Link F, the agent considers the pooled votes from all agents, rather than just the votes it received. We expect that the Tactic 10 test (wisdom of the crowd) will result in lower resilience than DEFAULT COMMS. Using a "wisdom of the crowd" approach also adds a single point of failure, negating a benefit of distributed decision making in MASs [20].
Tactic 11-Incorporate unconditional transitions: Unconditional transitions occur when the situation is so dire that waiting for repeated requests for aid would result in harm. To implement unconditional transitions, Link S is activated. If an agent receives a request for help from another agent whose constituent system's inventory is less than 1 and fewer than 25 agents are assigned to that constituent, then, as long as the receiving agent system's inventory and productivity are greater than 1.2, the receiving agent will automatically change roles to come to that constituent's aid. This transition bypasses the normal search logic through Link S. We expect Tactic 11 to have higher resilience than DEFAULT COMMS.
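The Link S condition for Tactic 11 combines a check on the requester with a check on the receiver. A minimal sketch, assuming a hypothetical function `unconditional_transition` and using the thresholds stated above (inventory below 1, fewer than 25 agents, receiver inventory and productivity above 1.2):

```python
def unconditional_transition(requester_inventory, requester_agents,
                             own_inventory, own_productivity):
    """Return True when the receiving agent should change roles via Link S."""
    requester_dire = requester_inventory < 1.0 and requester_agents < 25
    receiver_healthy = own_inventory > 1.2 and own_productivity > 1.2
    return requester_dire and receiver_healthy


# A well-stocked, productive agent answers a depleted, understaffed constituent.
bypass = unconditional_transition(0.8, 20, own_inventory=1.3, own_productivity=1.4)
```

When the predicate holds, the agent skips vote tallying entirely, which is the bypass that Link S provides.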
Tactic 12-Use parallel statecharts for unconditional transitions: In Tactic 11, Link S bypasses the normal search logic. Tactic 12 incorporates the unconditional transition into the normal search routine (from Search to Link F). We expect Tactic 12 to have higher resilience than DEFAULT COMMS, but lower resilience than Tactic 11.
Tactic 13-Use guards in message-triggered transitions: Guards help ensure that agents do not immediately act when it is not required. An important guard in DEFAULT COMMS is on Link R. This guard ensures that ten distress signals are received before Link F and its transition logic can be activated. We designate this threshold as GUARD MSG (10 for DEFAULT COMMS). GUARD MSG is removed for Tactic 13 so that after a single message, transition logic is evaluated. We expect Tactic 13 to have lower resilience than DEFAULT COMMS.
Tactic 14-Use learning to alter guards: Individual learning can also impact guards for message logic. For Tactic 14, GUARD MSG evolves due to agent experience. Each time an agent successfully transitions Link L, GUARD MSG is reduced by one (minimum value of zero). It becomes easier for agents to transition Link L the more times Link L is used. We expect Tactic 14 to have higher resilience than DEFAULT COMMS.
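The guard on Link R and its Tactic 14 learning rule can be sketched together. The class `MessageGuard` and its method names are hypothetical; the starting value of 10, the decrement of one per Link L transition, and the floor of zero come from the descriptions of Tactics 13 and 14.

```python
class MessageGuard:
    """GUARD MSG: number of distress signals required before Link F logic runs."""

    def __init__(self, guard_msg=10):
        self.guard_msg = guard_msg

    def allows_transition(self, messages_received):
        return messages_received >= self.guard_msg

    def on_link_l(self):
        """Tactic 14: each successful Link L transition lowers the guard by one."""
        self.guard_msg = max(0, self.guard_msg - 1)


default_guard = MessageGuard(guard_msg=10)   # DEFAULT COMMS
tactic13_guard = MessageGuard(guard_msg=1)   # guard removed: one message suffices
learning_guard = MessageGuard(guard_msg=10)
for _ in range(12):                          # twelve successful role changes
    learning_guard.on_link_l()               # reaches zero and stays there
```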

IV. RESULTS AND DISCUSSION
A summary of the results is given in Tables 7 and 8. Individual agent design Tactics had less uncertainty than communication protocol Tactics (average standard deviation of .002 versus .004), and all standard deviations were approximately 1% or less of the sample mean (the largest was 1.04%, for Tactic 9).

A. RESULTS: INDIVIDUAL AGENT DESIGN TACTICS (TACTICS 1-7)
Tactics 1-7 provide insight into the effectiveness of individual agent design when applied to the EMMSC (see Fig. 7). Tactic 2 had the largest positive impact on SoSRM (.064, 7.1%). All six Tactics tested produced statistically significant t-test results. The average improvement when using Tactics 1-7 (not 3) was .045 (a 5% improvement). The only Tactics whose samples had a standard deviation greater than .002 were Tactics 4 and 6, which both implemented agent learning. This indicates that strategies that incorporate agent learning are more susceptible to the stochastic impacts of individual choices. E[$SoSRM] for the smallest improvement (Tactic 1) was 2.3 billion USD. This provides noteworthy savings because Tactic 1 (vary transition parameters) is simply a procedural change (changing when each worker should change roles) and does not require additional equipment or training. The main cost associated with Tactics 2-7 is the training required to specialize some of the workers. Tactic 2 indicates that six workers per system (120 total) is sufficient to observe gains. We expect the cost of training these 120 individuals to be far less than the 165 million dollars (average for Tactics 1-7) in savings per year during SoSRM evaluation.
Tactics 1 and 5 provide two different approaches that result in a "genetic distribution" within the population. Tactic 1 initializes the population with different transition probabilities, while Tactic 5 uses learning. Of note, both Tactics 1 and 5 are superior to DEFAULT AGENT, but there is minimal difference in performance between them (Tactic 1 mean .923, standard deviation .002; Tactic 5 mean .925, standard deviation .0003). This may be because all agents have universal knowledge available, and thus learning does not provide a different outcome than random genetic initialization. The t-test result (p<.01) when comparing Tactics 1 and 5 indicates, however, that we can reject the null hypothesis that the Tactic 1 and Tactic 5 results are drawn from the same distribution.
Surprisingly, Tactics 4 and 6 resulted in the same improvement in SoSRM over DEFAULT AGENT. We expected Tactic 6 to result in greater improvement than Tactic 4, because the implemented learning impacted both effectiveness and ease of specialization. Additional tests were done to see if the rate of learning impacted this result (improving ease of transitioning by 10%, 50%, and 80% per transition for Link H), but the results remained unchanged. One potential explanation is that the initial network distribution may reduce the impact of agent learning. Future research will investigate whether specific conditions are necessary for learning that impacts transition logic to increase resilience.
Tactic 3 (limit the number of specialized agents) was evaluated with a sensitivity analysis, rather than by comparing to DEFAULT AGENT (see Fig. 8). As expected, increasing levels of specialization resulted in increased resilience, but only up to a limit. Additional initial specialization beyond 15 agents (60%) did not result in increased resilience. These results do not imply that 60% specialization is always ideal. We suspect that the ideal amount of specialization will depend on the problem being examined.
Although the Tactics proved effective, there are additional research questions worth investigating. Future work will examine how genetic distribution is manifested in nature, as well as testing additional types of initial random distributions (Tactic 1). For Tactic 5 (parameter evolution), future studies will focus on investigating the role of evolution rate on resilience improvement. The impact of timeout rate for Tactic 7 also warrants future investigation. These questions, however, do not undermine the contributions of this article: to demonstrate a context-neutral approach, focused on agent-level intervention, which is recorded clearly for the system engineering community.

B. RESULTS: COMMUNICATION PROTOCOL TACTICS (TACTICS 8-14)
Tactics 8-14 provide insight into the effectiveness of the communication protocol Tactics (see Fig. 9). The average improvement when using Tactics 8-14 was .063 (8.1%). Tactics 8-10 had the largest impact on SoSRM (14%, 14.9%, and 15.8% change). Tactic 8 ensures that old information is no longer considered, Tactic 9 determines which agents an agent receives information from, and Tactic 10 shows the impact of information aggregation prior to decision making. These results indicate that where the information comes from and how it is handled have a significant impact on resilience. The degraded performance during the Tactic 9 test (.879 versus .765) is of particular interest because the hierarchical network is more connected (58 versus 50 connections per agent). Even with (potentially) more information sources available to the hierarchical network, the distributed network had higher resilience. In addition, Tactic 9 had the largest standard deviation of any Tactic examined (1.04% of mean value).
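As a worked check of the Tactic 9 figures, the 14.9% change is reproduced when the difference between DEFAULT COMMS (.879) and the hierarchical test (.765) is expressed relative to the degraded value. That choice of denominator is an assumption inferred from the numbers, and `percent_change` is a hypothetical helper.

```python
def percent_change(baseline, degraded):
    """Improvement of baseline over degraded, as a percentage of degraded."""
    return 100.0 * (baseline - degraded) / degraded


# DEFAULT COMMS (.879) versus the hierarchical network of the Tactic 9 test (.765).
tactic9_change = percent_change(0.879, 0.765)   # about 14.9
```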
Tactics 11 and 12 both involved unconditional transitions, and the results indicate the importance of using a parallel statechart structure. Unconditional transitions with a parallel statechart resulted in increased SoSRM (.879 versus .886), but without a parallel transition path (Tactic 12), performance was worse than DEFAULT COMMS (.879 versus .870). It is insufficient to merely incorporate unconditional transitions into the normal decision-making; there must be a mechanism that causes the agent to "skip" normal processes and proceed directly to the transition (Link S). This may be because the normal processes would already recommend transitioning, so incorporating unconditional transitions into the existing transition logic merely adds a redundant check, slowing decision-making without changing the outcome.
Additional study is also warranted for the communication protocols. Tactic 8 (communication timeouts) could be refined by investigating if there are fundamental biological principles that can be used to determine information retention policy. Additional investigations into Tactic 10 (wisdom of the crowd approaches) could determine when wisdom of the crowd is appropriate. Fully incorporating ecosystem communication structure with multiple types of graph structures (small world, random, scale free, and ring bus) could refine Tactic 9.
Tactics 8-14 also showed potential for savings with an average E[$SoSRM] of 5.1 billion dollars over the 24.7 years. Tactics 11, 12, and 14 had the smallest savings, so application should be weighed against implementation costs. Tactics 8-14 only require procedural changes and thus should be relatively low cost to implement.

V. CONCLUSION
This article explored the research question: How can biological inspiration be used to design agent behavior for increased MAS resilience? We hypothesized that 14 design-for-resilience Tactics could be applied to both individual agent design and communication protocols. This investigation yielded three contributions, providing progress on three key gaps.
First, current approaches to increase resilience are often case study dependent. In this article, we present solution-neutral Tactics in terms of MAS statechart adjustments. Although demonstrated on a supply chain MAS, the Tactics can be applied to any MAS whose agent behaviors can be represented through statecharts. To clarify, the authors do not claim that these Tactics will be effective for any MAS with statecharts (although the results of EMMSC testing are promising), rather that the Tactics can be widely applied. Future work will validate when these solution-neutral Tactics result in increased resilience.
The second contribution of this work is an approach that focuses on agent-based strategies rather than network design to increase resilience. Agent-based approaches often only require small changes to individual behaviors (operating procedures), which could be less expensive than approaches that require infrastructure re-routing or upgrades. The Tactics tested are autonomous, do not use human teaming, and focus on the system's ability to absorb, adapt, and respond. As an additional benefit, when compared with current cyber-resilient agent approaches [15], [24], [63], the demonstrated approach does not require machine learning onboard the agents, knowledge of the fault, or implementation of game theory. The cyber-resilient agents, however, often consider adaptive, intelligent adversaries, which is beyond the scope of this investigation. In addition, several of the Tactics impact the five functions identified in a previously published generic architecture for autonomous cyber defense agents: sensing (Tactics 8 and 10), planning (Tactics 11-13), collaboration (Tactics 9 and 10), action (Tactics 7, 11, and 13), and learning (Tactics 3, 5, 6, and 14) [19].
Finally, successful examples of design-for-resilience are often unshared. In this article, we not only present 14 Tactics but also clearly record how they are implemented in a hybrid model of an EMMSC. This article provides both documentation of improved resilience for this case study and the approach used for implementation. The approach is laid out to enable other system engineers to apply these Tactics to their own MAS scenario.
Although a promising first step, this work has several limitations, which will be examined in future investigations. This study used the best practice of beginning validation with a simulation study [24], [26], but the effectiveness of the Tactics now needs to be tested against real-world case studies. Detailed implementation cost analysis should be performed to assess savings due to each Tactic after considering the cost of implementation and shifting roles. This analysis assumed no negative effects on the individuals due to changing systems (such as fatigue or emotional stress); future studies that integrate management concerns would strengthen these findings. The resilience measurement metric used (SoSRM) assumes equal link fault probability, a feature of being applied early in the design process. Future studies should validate the performance of these Tactics against specific threats and expected occurrence frequency. These threat details will be available later in the design process. In addition, future work will consider resilience measurement approaches that incorporate node or agent failure as well as link removal. Future work will also examine which Tactics should be employed together.
By reporting a solution-neutral approach to increase MAS resilience, focusing on agent intervention, and clearly reporting the implementation to a supply chain case study, this article takes an important step toward MAS design-for-resilience. The modern world contains a multitude of MASs. Our goal is to continue to promote and use biologically inspired design to identify approaches that can minimize the impact of disruptions, protect users, and enable continued operation in the face of adversity.

FIGURE 1. Anticipated role of the design-for-resilience tactics in system engineering.

FIGURE 3. Statechart for EMMSC DEFAULT AGENT. State names are shown inside each block. Link names (A-P) are next to each link.

FIGURE 4. Agent transition logic in state Search. An agent will change roles (assigned system) if its own system's inventory is high, it cannot contribute to the MAS due to a broken link, or a 20% probability AND another system is performing worse, AND a 50% probability AND the other system has fewer than 50 workers assigned to it. The 20% and 50% probabilities add stochasticity to agent response.

FIGURE 6. EMMSC DEFAULT COMMS statechart. Links performing the same function as in Fig. 3 have the same letters applied.

FIGURE 7. Plot of effectiveness of individual agent design tactics with graphical inset. Vertical error bars are the standard deviations reported in Table 7. For results of Tactic 3, see Fig. 8. Graphical inset provided to allow detailed view of results and error bars.

FIGURE 8. Varying levels of resilience compared with initial number of agents specialized (Tactic 3) with graphical inset. Note, zero specialized agents are DEFAULT AGENT. Error bars plotted are one standard deviation. All standard deviations are on the order of magnitude of 10^-4 and are thus difficult to discern. Graphical inset provided to allow detailed view.

FIGURE 9. Plot of effectiveness of communication protocol tactics with graphical inset. Vertical error bars are the standard deviations reported in Table 7. Graphical inset provided to allow detailed view.