Welfare Maximization Algorithm for Solving Budget-Constrained Multi-Component POMDPs

Partially Observable Markov Decision Processes (POMDPs) provide an efficient way to model real-world sequential decision making processes. Motivated by the problem of maintenance and inspection of a group of infrastructure components with independent dynamics, this letter presents an algorithm to find the optimal policy for a multi-component budget-constrained POMDP. We first introduce a budgeted-POMDP model (b-POMDP) which enables us to find the optimal policy for a POMDP while adhering to budget constraints. Next, we prove that the value function or maximal collected reward for a special class of b-POMDPs is a concave function of the budget for the finite horizon case. Our second contribution is an algorithm to calculate the optimal policy for a multi-component budget-constrained POMDP by finding the optimal budget split among the individual component POMDPs. The optimal budget split is posed as a welfare maximization problem and the solution is computed by exploiting the concavity of the value function. We illustrate the effectiveness of the proposed algorithm by proposing a maintenance and inspection policy for a group of real-world infrastructure components with different deterioration dynamics, inspection and maintenance costs. We show that the proposed algorithm vastly outperforms the policies currently used in practice.


Introduction
Sequential decision-making is an integral component of many real-world problems such as machine maintenance, structural inspection and autonomous robotics [1]. Markov Decision Processes (MDPs) provide an efficient framework to model and solve such problems while accounting for the corresponding uncertainty [2]. POMDPs generalize MDPs by additionally accounting for partial observability of the state of the system [3]. However, finding optimal policies for POMDPs is much more computationally intensive than for MDPs and has been proven to be PSPACE-complete [4]. Synthesis of optimal policies for POMDPs is a classical problem, and many algorithms have been proposed for it [5,6,7,8]. This paper considers optimal planning for a class of structured budget-constrained POMDPs. This setting is motivated by infrastructure maintenance planning, a widely studied problem [9,10,11] that involves finding the optimal policy for maintenance and inspection of an infrastructure component or a group of components within a certain budget [12,13]. For simplicity of planning, the stochastic dynamics of multi-component systems are modelled using a POMDP [14,15]. Also, the dynamics of individual components are assumed to be independent of each other [16]. We thus model such a setting by multi-component budget-constrained POMDPs where the transition probabilities of the individual component POMDPs are decoupled from each other. The individual component POMDPs are therefore weakly coupled in the sense that they are connected only by the shared total budget of the multi-component POMDP.
An algorithm for solving cost-constrained POMDPs has been proposed in [17]. However, this algorithm becomes computationally infeasible for multi-component POMDPs with very large state spaces. A POMDP-based solution for optimal maintenance and inspection of structures using Dynamic Bayesian Networks is presented in [18]. However, in addition to computational infeasibility, this algorithm does not account for budget constraints. Optimal allocation for MDPs has been studied in [19]. The paper models the statistical ranking and selection problem as an MDP and derives an approximately optimal allocation policy using value function approximation. In our work, we study optimal budget allocation for multi-component POMDPs for infrastructure management. A method for solving budget-constrained MDPs has been presented in [20]. The paper introduces a budgeted-MDP model which includes the budget as an implicit part of the state. This is because the paper uses a cost function, similar to Constrained MDPs [21], to keep track of the cost incurred by the policy. Hence, this algorithm cannot be directly extended to POMDPs because the partial observability of the state would cause a violation of the budget constraint in some cases.
In this work, we propose a computationally efficient algorithm for optimal policy synthesis of a multi-component budget-constrained POMDP. Our contributions here are:
• we introduce a b-POMDP model to facilitate strict adherence to budget constraints in POMDPs,
• we obtain an approximately optimal policy for multi-component POMDPs by finding the optimal budget split among the individual component POMDPs.
The b-POMDP model includes the total cost incurred up to a time instant $k$ explicitly as a part of the state vector. We show that the value function for a particular class of b-POMDPs is a concave function of the budget. Next, we compute the optimal policy for the individual component POMDPs by modeling them as b-POMDPs and using an online solver like POMCP [8]. Doing so gives us the approximate maximal total reward collected by the policy in terms of the budget allocated to the b-POMDP. We use these rewards to calculate the optimal distribution of the total budget among the individual component b-POMDPs and find the approximately optimal policy of the multi-component POMDP. The budget splitting is posed as a welfare maximization problem constrained by the total budget of the multi-component POMDP. The concave nature of the value function renders this a convex optimization problem, so any local optimum is also a global optimum. We demonstrate the utility of the proposed algorithm by finding optimal maintenance and inspection policies for multiple components of a realistic general administrative building, subject to a budget. Based on this real data, we show that our algorithm vastly outperforms the policy currently used in practice.

Preliminaries and Background
In this section, we provide background on Partially Observable Markov Decision Processes for sequential decision-making with stochastic dynamics. We start by defining the notation used in the paper.
Given a finite set A, |A| denotes its cardinality and ∆(A) denotes the set of all probability distributions over the set A.

Partially Observable Markov Decision Process
A discrete-time finite-horizon POMDP [22,23] $M$ is specified by the 8-tuple $(S, A, \pi, T, \Omega, O, R, H)$, where $S$ denotes a finite set of states, $A$ denotes a finite set of actions, $\pi : S \to A$ denotes the policy which specifies the action to take in a given state $s$, and $T : S \times A \to \Delta(S)$ denotes the transition probability function, where $\Delta(S)$ is the space of probability distributions over $S$. Furthermore, $\Omega$ denotes a finite set of observations and $O : S \times A \to \Delta(\Omega)$ denotes the observation probability function, where $\Delta(\Omega)$ is analogous to $\Delta(S)$. Finally, $R : S \times A \to \mathbb{R}$ denotes the reward function and $H \in \mathbb{N}_0$ denotes the finite planning horizon.
For the above POMDP, at each time step, the environment is in some state $s \in S$ and the agent interacts with the environment by taking an action $a \in A$. Doing so results in the environment transitioning to a new state $s' \in S$ in the next time step with probability $T(s, a, s')$. Simultaneously, the agent receives an observation $o \in \Omega$ regarding the state of the environment with probability $O(o \mid s', a)$, which depends on the new state of the environment and the action taken by the agent. In a POMDP the agent does not have access to the true state of the environment. However, the agent can update its belief about the true state of the environment using this observation. The agent also receives a reward $R(s, a)$.
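The belief update mentioned above can be sketched as a discrete Bayes filter. This is a minimal illustration rather than the paper's implementation: the function name `update_belief`, the nested-dictionary layout `T[s][a][s_next]` and `O[s_next][a][o]`, and the toy numbers are all assumptions made for the example.

```python
def update_belief(belief, action, observation, T, O):
    """Discrete Bayes filter: b'(s') is proportional to
    O(o | s', a) * sum_s T(s, a, s') * b(s)."""
    states = list(belief)
    new_belief = {}
    for s_next in states:
        # Prediction step: push the current belief through the transition model.
        predicted = sum(T[s][action].get(s_next, 0.0) * belief[s] for s in states)
        # Correction step: weight by the likelihood of the received observation.
        new_belief[s_next] = O[s_next][action].get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in new_belief.items()}


# Hypothetical two-state component: state 1 is healthy, state 0 is failed.
# Action "d" only returns the uninformative observation "e".
T = {0: {"d": {0: 1.0}}, 1: {"d": {1: 0.7, 0: 0.3}}}
O = {0: {"d": {"e": 1.0}}, 1: {"d": {"e": 1.0}}}
belief = update_belief({0: 0.0, 1: 1.0}, "d", "e", T, O)
```

Because the observation in this toy case carries no information, the updated belief simply equals the one-step prediction under the transition model.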
The problem of optimal policy synthesis for a finite-horizon POMDP is that of choosing a sequence of actions which maximizes the expected total reward.

Problem Formulation
We consider a multi-component POMDP, which is a collection of $n$ component POMDPs. The component POMDPs are weakly coupled in the sense that they have independent transition probabilities and are connected only by the shared total budget. In this paper, we consider optimal policy synthesis for POMDPs with a budget, i.e., each action incurs a cost and the total cost incurred by the optimal policy is limited by the budget. We first formally define the multi-component POMDP with a budget and then define the problem of finding the optimal policy for such a POMDP.

Multi-Component Decoupled POMDP with Shared Budget
For a multi-component POMDP, the state space $S \subseteq \mathbb{N}_0^n$ is given by $S = S_1 \times S_2 \times \dots \times S_n$. The state space $S_i \subset \mathbb{N}_0$ for component $i$ is given by $S_i = \{0, 1, 2, \dots, s_{\max}\}$, where $s_{\max} \in \mathbb{N}_0$. The state $s_k \in S$ at time step $k$ is given by $s_k = \{s^1_k, s^2_k, \dots, s^n_k\}$, where $s^i_k \in S_i$ represents the state of component $i$ at time step $k$. The action space is given by $A = A_1 \times A_2 \times \dots \times A_n$, where the action space $A_i$ for component $i$ is given by $A_i = \{d_i, q_i, m_i\}$. Action $d_i$ lets the component move to a new state according to the transition probabilities. The action $q_i$ provides an observation which is equal to the next state $s'_i$ of the component, and action $m_i$ drives the component state to $s_{\max}$. The transition probability function for the multi-component POMDP for $s, s' \in S$ and $a \in A$ is given by
$$T(s, a, s') = \prod_{i=1}^{n} T_i(s_i, a_i, s'_i),$$
where $T_i$ denotes the transition probability function for component $i$ and is defined as
$$T_i(s_i, a_i, s'_i) = \begin{cases} 1, & \text{if } s'_i = s_{\max} \text{ and } a_i = m_i, \\ p_i(s_i, a_i, s'_i), & \text{if } s'_i \le s_i \text{ and } a_i \in \{d_i, q_i\}, \\ 1, & \text{if } s'_i = 0 = s_i \text{ and } a_i \in A_i, \\ 0, & \text{otherwise.} \end{cases}$$
The probability $p_i(s_i, a_i, s'_i)$ is chosen according to a probability distribution specific to component $i$. From the above equation, it can be observed that 0 is an absorbing state.
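The observation space is given by $\Omega = S \cup \{e\}$, where $e \in \mathbb{N}_0$ is an observation that does not provide any information regarding the true state of the system, i.e., $e \notin S_i$ for all $i \in \{1, 2, \dots, n\}$. The observation function for the multi-component POMDP is given by
$$O(o \mid s', a) = \prod_{i=1}^{n} O_i(o_i \mid s'_i, a_i).$$
Here, $O_i$ is the observation probability function for component $i$: consistent with the action definitions above, the inspection action $q_i$ returns the next state $s'_i$ with probability 1, while the observation $e$ carries no information about the state. For each component $i$, the actions $d_i$, $q_i$ and $m_i$ incur costs $c^i_d$, $c^i_q$ and $c^i_m$ respectively, against a total budget $B$.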

Problem Statement
For a multi-component POMDP given by the formulation in the previous section, we consider the problem of finding an optimal policy $\pi^*$ which maximizes the time before reaching the absorbing state. Mathematically, $\pi^*$ maximizes $t$ such that $s_t > 0$. Furthermore, for a horizon of length $H$, $\pi^*$ should be such that the total cost incurred for the multi-component POMDP does not exceed the total budget. We propose an approximately optimal solution for the above problem through a two-step approach. In the first step, we solve a single-component POMDP for any given budget. In the second step, we partition the total budget $B$ into budgets for each individual component.
In this section, we detail our methodology for solving the problem of optimal policy synthesis of a multi-component POMDP. First, we introduce the b-POMDP model and discuss how to solve a single b-POMDP. Next, we discuss why the value function for such a POMDP is a concave function of the budget. Finally, we present our proposed method for finding the optimal policy for an $n$-component POMDP by computing the optimal split of the total budget among the individual component POMDPs.

Budgeted-POMDP Model (b-POMDP)
Our main goal is to find an optimal policy for a POMDP while adhering to a total budget for actions.The budgeted-MDP model in [20] tracks the incurred cost using a cost function similar to Constrained MDPs [21].This model can't be extended directly to a POMDP because the partial observability of the state would lead to budget violation in some cases.
Hence, we introduce a new b-POMDP model. In a b-POMDP, the budget constraint is incorporated by augmenting each state of the state space with the total cost incurred up to time instant $k$. Thus, if we consider a single-component POMDP with a total budget $B$, the modified state at time instant $k$ according to the b-POMDP formulation is given by $(s_k, c_k)$, where $s_k$ is as defined in the previous section for $i = 1$. We assume that, unlike $s_k$, $c_k$ is completely observable at all time instants. The cost component of the state accumulates the cost of the action taken at each step, and the transition function for the overall b-POMDP combines this cost update with the state transition function $T(s, a, s')$ defined in Section 3.1. The new formulation prevents the policy from violating the budget at any time instant $k$. This is done by making $c_k > \min\{B - c_m, B - c_i\}$ an absorbing state, similar to $s = 0$. The reward function for the b-POMDP assigns a constant positive reward to every state with $s > 0$ and zero reward to $s = 0$. To find an optimal policy for a b-POMDP, we use the method of Monte-Carlo Planning in POMDPs (POMCP [8]).
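The budget-augmented state described above can be sketched as a generative model of the kind a Monte-Carlo planner samples from. The snippet below is an illustration under stated assumptions, not the authors' code: the name `bpomdp_step`, the `decay` callback standing in for $p_i$, the rule that an unaffordable action ends the episode, the unit reward, and the choice that only the inspection action returns an informative observation are all hypothetical.

```python
import random

NO_INFO = "e"  # uninformative observation symbol


def bpomdp_step(state, spent, action, budget, costs, s_max, decay, rng):
    """Sample one step of a single-component b-POMDP (hypothetical sketch).

    state : hidden condition in {0, ..., s_max}; 0 is absorbing (failure)
    spent : cost incurred so far, the fully observable part of the state
    action: "d" (do nothing), "q" (inspect) or "m" (maintain)
    decay : callable (state, rng) -> next state <= state, standing in for p_i
    Returns (next_state, next_spent, observation, reward, done).
    """
    # Assumption: failure or an unaffordable action ends the episode with zero
    # reward, so the accumulated cost can never exceed the budget.
    if state == 0 or spent + costs[action] > budget:
        return state, spent, NO_INFO, 0.0, True
    next_spent = spent + costs[action]          # deterministic cost update
    next_state = s_max if action == "m" else decay(state, rng)
    # Assumption: only inspection reveals the next state.
    observation = next_state if action == "q" else NO_INFO
    return next_state, next_spent, observation, 1.0, False


# Toy usage with deterministic wear of one unit per step.
costs = {"d": 0, "q": 1, "m": 5}
wear = lambda s, rng: max(s - 1, 0)
rng = random.Random(0)
step1 = bpomdp_step(3, 0, "q", 6, costs, 3, wear, rng)
step2 = bpomdp_step(step1[0], step1[1], "m", 6, costs, 3, wear, rng)
step3 = bpomdp_step(step2[0], step2[1], "q", 6, costs, 3, wear, rng)
```

Keeping the spent cost in the returned state is what lets a planner reject over-budget actions at every step instead of only checking the total cost in expectation.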
POMCP is an online planning algorithm for large POMDPs, which combines a Monte-Carlo update of the agent's belief with a Monte-Carlo tree search for the best action from the current belief state.For a b-POMDP, the maximal collected reward (value function) obtained using POMCP will be a function of the budget B associated with it.We will now prove that this value function is concave in the budget for a special subclass of our overall problem.

Proof for Concavity of Optimal Value Function
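Consider an MDP with state space $S_{MDP} = \{0, 1, 2, \dots, s_{\max}\}$, where $s_{\max} \in \mathbb{N}_0$, and action space $A_{MDP} = \{m, d\}$.
The state of the system decreases by a value $d_0 \in \mathbb{N}_0$ unless we perform action $m$. The transition probability function is defined as:
$$T(s, a, s') = \begin{cases} 1, & \text{if } s' = s_{\max} \text{ and } a = m, \\ 1, & \text{if } s' = \max(s - d_0, 0) \text{ and } a = d, \\ 1, & \text{if } s' = 0 = s \text{ and } a \in A_{MDP}, \\ 0, & \text{otherwise.} \end{cases}$$
From the above transition function, we can clearly see that state 0 is an absorbing state. Also, $d_0$ is a decrease in the state value. The cost for the $d$ action is $c_d = 0$ and the cost for the $m$ action is $c_m > 0$. For simplicity, assume that $c_m = 1$. Let the available budget be denoted by $b \in \mathbb{N}_0$. This means that we can perform the action $m$ at most $b$ times.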
The reward function is similar to our original problem with a constant positive reward r for all states s > 0 and a zero reward for s = 0.
The value function for an MDP is the total expected reward collected by the optimal policy.For a b-POMDP, the value function is a function of the state value and the available budget.Let V H (s, b) represent the value function of the state and budget for a given horizon with H steps to go.
The following lemmas will be used for proving the concavity of the value function of the above-mentioned MDP with respect to the budget.

Lemma 4.1. For a given budget $b$ and horizon $H$, the value function $V_H(s_0, b)$ is a non-decreasing function of the state $s_0$, i.e., for two states $s_0$ and $s'_0$ such that $s_0 < s'_0$, the following holds:
$$V_H(s_0, b) \le V_H(s'_0, b).$$

Proof. Let $\pi^*$ be a policy that generates the value function $V_H(s_0, b)$, in the sense of maximizing the expected total return over the horizon $H$ given the initial state $s_0$ and budget $b$. Thus, we have $V_H(s_0, b) = V^{\pi^*}_H(s_0, b)$, where $V^{\pi}_H$ denotes the expected total return of a policy $\pi$. If we start at state $s'_0 > s_0$, then following the same policy $\pi^*$ the expected return will be at least as good as that for $s_0$. We can therefore say that:
$$V^{\pi^*}_H(s'_0, b) \ge V^{\pi^*}_H(s_0, b).$$
Hence, for a given budget $b$ and horizon $H$, the optimal policy for $s'_0$ is at least as good as the optimal policy for $s_0$, and hence:
$$V_H(s'_0, b) \ge V_H(s_0, b).$$

Lemma 4.2. For a given initial state $s_0$ and horizon $H$, the value function $V_H(s_0, b)$ is a non-decreasing function of the budget $b$, i.e., for two budgets $b$ and $b'$ such that $b < b'$, the following holds:
$$V_H(s_0, b) \le V_H(s_0, b').$$

Proof. Let $\pi^*$ be a policy that generates the value function $V_H(s_0, b)$, in the sense of maximizing the expected total return over the horizon $H$ given the initial state $s_0$ and budget $b$. The same policy remains feasible for the same initial state $s_0$ but budget $b'$. This is because following the same policy will result in the same trajectory of states for both budgets, but with a higher remaining budget at each time step for $b'$. Thus, we have
$$V^{\pi^*}_H(s_0, b') \ge V^{\pi^*}_H(s_0, b) = V_H(s_0, b).$$
Hence, the return generated by this policy with budget $b'$ will be at least as good as the value function with budget $b$. Since the optimal policy for budget $b'$ is at least as good as $\pi^*$, it follows that:
$$V_H(s_0, b') \ge V_H(s_0, b).$$

Lemma 4.3. For a given initial state $s_0 > d_0$ and horizon $H$, if $b > 0$, the optimal action to take at time $t = 0$ is $d$.
Proof. We will compare four cases:
1. We first take action $m$ and then action $d$. Doing so yields the maximal collected reward $V^{md}_H$ given by (3).
2. We first take action $d$ and then action $m$. Doing so yields the maximal collected reward $V^{dm}_{H+1}$ given by (4).
3. We first take action $m$ and then action $m$. Doing so yields the maximal collected reward $V^{mm}_H$ given by (5).
4. We first take action $d$ and then action $d$. Doing so yields the maximal collected reward $V^{dd}_{H+1}$ given by (6).
Using Lemma 4.1, we know that the value function is a non-decreasing function of the state for the same budget. Also, from Lemma 4.2, we know that the value function is a non-decreasing function of the budget for the same state. Using these results and (3), (4) and (5), we obtain relation (7). We do not need to compare $V^{dd}_H(s_0, b)$ because (7) is sufficient to prove that for $s_0 > d_0$, action $d$ is the optimal initial action.
Lemma 4.4. For a given initial state $0 < s_0 \le d_0$ and horizon $H$, if $b > 0$, the optimal action to take at time $t = 0$ is $m$.
Proof. We will consider two cases:
1. We take action $d$ in the first step. Doing so yields the maximal collected reward $V^{d}_H$ given by (8).
2. We take action $m$ in the first step. Doing so yields the maximal collected reward $V^{m}_H$ given by (9).
Clearly from (8) and (9), taking action $d$ in the first step leads to the absorbing state and hence we get a total reward of $r$. Taking action $m$ gives the same reward $r$ and drives the state to $s_{\max}$. This implies that the system will not reach the absorbing state for at least one more step, which leads to a total reward that is at least as high. Thus, we have $V^{m}_H(s_0, b) \ge V^{d}_H(s_0, b)$. Hence, we can say that if $b > 0$, action $m$ is the optimal action to perform at $t = 0$ when $0 < s_0 \le d_0$.
Proof of Theorem 4.5. First, we will show that the claims hold for a horizon of length 0. Then, using induction, we will prove that they hold for a horizon of length $H$.
Consider a horizon of length 0. In this case the collected reward does not depend on the remaining budget, so $V_0(s_0, b)$ takes the same value for every $b \ge 0$. Clearly, $V_0(s_0, b)$ is constant in the budget for all $b \ge 0$ and thus the first two claims hold. Also, for all values of $s_0 > 0$, both sides of relation (10) are zero, so the relation holds with equality. Now, for a horizon of length $H$, assume the following:
• For $s_0 > d_0$, $V_H(s_0, b)$ is constant with respect to the budget for all $b \ge \lfloor H/2 \rfloor$, where $\lfloor \cdot \rfloor$ represents the floor function,
• For $s_0 \le d_0$, $V_H(s_0, b)$ is constant with respect to the budget for all $b \ge \lceil H/2 \rceil$, where $\lceil \cdot \rceil$ represents the ceiling function,
• Relation (10) holds true for $b \ge 0$.
As can be clearly seen, the above assumptions hold true for $H = 0$, as we proved previously. Now, assume that they hold for some $H \ge 0$ and consider a horizon of length $H + 1$. We will prove that the assumptions hold true for horizon $H + 1$ when $s_0 > d_0$ and when $s_0 \le d_0$.
First we will consider $s_0 > d_0$. Using Lemma 4.3 we know that, if $b > 0$, action $d$ is the optimal action to take at time $t = 0$ when $s_0 > d_0$. Hence for horizon $H + 1$, if $s_0 > d_0$, the value function is given by:
$$V_{H+1}(s_0, b) = r + V_H(s_0 - d_0, b). \tag{11}$$
We will now consider two cases:
1. Case 1: $s_0 - d_0 > d_0$. In this case, using the results for horizon $H$, we get that $V_{H+1}(s_0, b)$ becomes constant in the budget for all $b \ge \lfloor H/2 \rfloor$. Using the properties of the floor function, we have $\lfloor H/2 \rfloor \le \lfloor (H+1)/2 \rfloor$.
2. Case 2: $s_0 - d_0 \le d_0$. In this case, using the results for horizon $H$, we get that $V_{H+1}(s_0, b)$ becomes constant in the budget for all $b \ge \lceil H/2 \rceil$. Using the properties of the ceiling and floor functions, we have $\lceil H/2 \rceil = \lfloor (H+1)/2 \rfloor$.
Thus we can say that $V_{H+1}(s_0, b)$ becomes constant with respect to the budget for all $b \ge \lfloor (H+1)/2 \rfloor$. Also, if we consider three budget values $b'' = b' + 1 = b + 2$ such that $b \ge 0$, we have:
$$V_H(s_0 - d_0, b'') - V_H(s_0 - d_0, b') \le V_H(s_0 - d_0, b') - V_H(s_0 - d_0, b),$$
where the inequality is due to the assumption that relation (10) holds for horizon $H$. Also, using (11), we know that each of these differences equals the corresponding difference of $V_{H+1}(s_0, \cdot)$, since the constant $r$ cancels. Hence, from this we can say that:
$$V_{H+1}(s_0, b'') - V_{H+1}(s_0, b') \le V_{H+1}(s_0, b') - V_{H+1}(s_0, b),$$
and thus relation (10) holds true for horizon $H + 1$ when $s_0 > d_0$.
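Using Lemma 4.4 we know that, if $b > 0$, action $m$ is the optimal action to take at time $t = 0$ when $0 < s_0 \le d_0$.
Hence for horizon $H + 1$, if $0 < s_0 \le d_0$ and $b > 0$, the value function is given by:
$$V_{H+1}(s_0, b) = r + V_H(s_{\max}, b - 1). \tag{12}$$
Note that if $s_0 = 0$, then we are in the absorbing state and we get $V(0, b) = 0$ for all $b \ge 0$. Thus, we can say that $V(0, b)$ is constant with respect to the budget for all $b \ge \lceil (H+1)/2 \rceil$. Now, for $b > 0$, using (12) and the results for horizon $H$, we get that $V_{H+1}(s_0, b)$ becomes constant in the budget for all $b - 1 \ge \lfloor H/2 \rfloor$, or $b \ge \lfloor H/2 \rfloor + 1$. From the properties of the floor and ceiling functions, we know that $\lfloor H/2 \rfloor + 1 = \lceil (H+1)/2 \rceil$. Hence, we can say that for a horizon $H + 1$ and $s_0 \le d_0$, the value function $V_{H+1}(s_0, b)$ becomes constant in $b$ for all $b \ge \lceil (H+1)/2 \rceil$. Now, to prove that relation (10) holds for $s_0 \le d_0$, we will consider two cases:
1. Consider three budget values $b'' = b' + 1 = b + 2$ such that $b > 0$. Then, using (12) we have:
$$V_H(s_{\max}, b'' - 1) - V_H(s_{\max}, b' - 1) \le V_H(s_{\max}, b' - 1) - V_H(s_{\max}, b - 1),$$
where the inequality is due to the assumption that relation (10) holds for horizon $H$. Also, using (12) we know that each of these differences equals the corresponding difference of $V_{H+1}(s_0, \cdot)$. Hence, using the above equations we can say that:
$$V_{H+1}(s_0, b'') - V_{H+1}(s_0, b') \le V_{H+1}(s_0, b') - V_{H+1}(s_0, b).$$
2. Consider three budget values $b'' = b' + 1 = b + 2$ such that $b = 0$. If we are at $s = s_{\max}$, we will reach $s = 0$ in $\frac{s_{\max}}{d_0}$ steps if we take only action $d$ repeatedly. We will now consider two cases:
(a) Case 1: $\frac{s_{\max}}{d_0} < H$. Using (12) for budgets $b'$ and $b''$ and comparing the resulting expressions with the value for $b = 0$, we can say that relation (10) holds.
(b) Case 2: $\frac{s_{\max}}{d_0} \ge H$. Using (12) in the same way, we can again say that relation (10) holds.
Hence, we can say that relation (10) holds for horizon $H + 1$ when $s_0 \le d_0$.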
Thus, using induction, we have shown that for any horizon of length $H \ge 0$ the following holds:
• The value function becomes constant with respect to the budget for $b \ge \lfloor H/2 \rfloor$ if $s_0 > d_0$ and for $b \ge \lceil H/2 \rceil$ if $s_0 \le d_0$.
• The marginal increase in the value function is non-increasing in the budget.
Corollary 4.6. Theorem 4.5 and Lemma 4.2 together imply the concavity of the value function in the budget, i.e., if $b' = \alpha b + (1 - \alpha) b''$ for some $\alpha \in [0, 1]$, then we have
$$V_H(s_0, b') \ge \alpha V_H(s_0, b) + (1 - \alpha) V_H(s_0, b'').$$
This proof can be easily extended to any general $c_m \in \mathbb{R}^+$ by scaling the costs and budget with $1/c_m$. Furthermore, this proof works under heavy technical assumptions of full observability and deterministic transitions. However, we empirically observe that the same property often holds for general systems (partially observable and stochastic). We believe the same proof approach could work in that setting and leave it for future work.
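The diminishing-returns property behind this corollary can be checked numerically on a small instance of the deterministic MDP. The sketch below is only an illustration under assumptions that are not fixed by the text: the reward $r$ is collected in the current state whenever $s > 0$, a horizon of $h$ means $h$ further transitions, and the names `make_value_function`, `values` and `gains` as well as the toy parameters are hypothetical.

```python
from functools import lru_cache


def make_value_function(s_max, d0, r=1.0):
    """Return V(h, s, b) for the deterministic MDP with actions d and m."""

    @lru_cache(maxsize=None)
    def V(h, s, b):
        """Optimal total reward with h steps to go, state s, remaining budget b."""
        if s == 0:          # absorbing failure state yields no reward
            return 0.0
        if h == 0:          # last step: only the current reward is collected
            return r
        # Action d: the state decays by d0 (never below 0), at no cost.
        best = V(h - 1, max(s - d0, 0), b)
        # Action m: the state is reset to s_max and one unit of budget is spent.
        if b > 0:
            best = max(best, V(h - 1, s_max, b - 1))
        return r + best

    return V


V = make_value_function(s_max=6, d0=2)
H, s0 = 10, 6
values = [V(H, s0, b) for b in range(6)]
# Marginal value of each extra unit of budget.
gains = [values[b + 1] - values[b] for b in range(5)]
```

For these toy parameters the marginal gains are non-increasing in the budget, which is the discrete form of concavity stated in relation (10).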
We will now use this concavity property to obtain the optimal budget split among the component POMDPs of an $n$-component POMDP. Doing so provides the approximately optimal policy for the individual component POMDPs and hence for the $n$-component POMDP.

Optimal Policy Synthesis for Multi-Component POMDP
Consider a multi-component POMDP with $n$ components and budget $B$ as described in Section 3.1. The size of the state space is $|S|^n$, where $|S|$ is the size of the state space of each component POMDP. Also, the size of the total action space is $3^n$. Directly applying a POMDP solver to such a large state and action space may not be computationally feasible. Hence, we propose an algorithm which decouples the $n$ component POMDPs by allocating a portion of the total budget to each of them prior to the beginning of the system run. We then compute the approximate value function for each component POMDP as a function of the budget and use it to obtain the optimal split of the total budget.
Given a total budget $B$, we assume that the $i$-th component POMDP is allotted a budget $b_i$ from the total budget. Hence,
$$\sum_{i=1}^{n} b_i = B. \tag{14}$$
We now have $n$ independent POMDPs, where each POMDP has its own total budget. We formulate each of them as a b-POMDP and solve each b-POMDP using the POMCP algorithm as discussed in Section 4.1. Let the maximal collected reward for component $i$, obtained using the POMCP algorithm for a given initial state $s_0$ and horizon $H$, be denoted by $V^i_H(s_0, b_i)$. We can then find the optimal budget split among the $n$ b-POMDPs by solving a welfare maximization problem. Welfare maximization is the concept of maximizing the overall well-being or welfare of a society, and is achieved by maximizing some measure of social welfare (e.g., maximal collected reward). We thus maximize the total maximal collected reward over all components with respect to $b_i$ while adhering to the constraint in (14), i.e.,
$$\max_{b_1, \dots, b_n} \sum_{i=1}^{n} V^i_H(s_0, b_i) \quad \text{subject to (14)}. \tag{15}$$
Using the results we proved in Section 4.2, we know that $V_H(s_0, b_i)$ is a concave function of $b_i$ in the special case mentioned in Section 4.2, and we empirically observe it to be concave in general. The welfare maximization problem then becomes a constrained convex optimization problem. Hence, it can be solved efficiently, and any local optimum is a global optimum. The solution to (15) provides the optimal budget allocation for each b-POMDP, which in turn gives us a policy for each of the $n$ component POMDPs under its allocated budget. Let $\pi_i : S_i \to A_i$ be a policy for component $i$ with budget $b_i$ obtained using the POMCP algorithm. Then, we define the overall policy for the $n$-component POMDP, $\pi : S \to A$, by
$$\pi(s) = (\pi_1(s_1), \pi_2(s_2), \dots, \pi_n(s_n)).$$
While such a policy is naturally not guaranteed to be optimal on the multi-component POMDP in general, it provably satisfies the budgetary constraints and performs well in practice. To illustrate its performance on real data, we now move to the implementation and evaluation section.
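When budgets are integers and each component's estimated value is concave and non-decreasing in its budget, the split in (15) can also be illustrated with a greedy marginal-gain allocation. Note that this is a swapped-in technique chosen to keep the example dependency-free, not the convex solver formulation used in the paper; the function name `split_budget`, the value tables and the total budget below are hypothetical.

```python
import heapq


def split_budget(values, total_budget):
    """Greedy split of an integer budget among components.

    values[i][b] is the estimated value of component i when given budget b.
    Assumption: every values[i] is concave and non-decreasing in b, in which
    case repeatedly funding the largest marginal gain is optimal.
    """
    alloc = [0] * len(values)
    heap = []  # max-heap on marginal gain, stored as (-gain, component index)
    for i, v in enumerate(values):
        if len(v) > 1:
            heapq.heappush(heap, (-(v[1] - v[0]), i))
    remaining = total_budget
    while remaining > 0 and heap:
        neg_gain, i = heapq.heappop(heap)
        if -neg_gain <= 0:
            # No component benefits from more budget; leftover units can be
            # assigned arbitrarily without changing the total value.
            break
        alloc[i] += 1
        remaining -= 1
        b = alloc[i]
        if b + 1 < len(values[i]):
            heapq.heappush(heap, (-(values[i][b + 1] - values[i][b]), i))
    return alloc


# Hypothetical concave value tables for three components, budgets 0..4.
values = [[0, 5, 8, 9, 9], [0, 4, 7, 9, 10], [0, 1, 2, 3, 4]]
alloc = split_budget(values, total_budget=4)
total_value = sum(values[i][alloc[i]] for i in range(len(values)))
```

The greedy rule relies entirely on concavity: if a component's marginal gain could rise again at larger budgets, a locally unattractive unit might unlock a better allocation and the greedy result would no longer be guaranteed optimal.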

Implementation and Evaluation
In this section we illustrate the utility of the proposed approach for multi-component decision making with budgetary constraints. In particular, we compare the policy described above with existing approaches on a scenario of multi-component building management. Our implementation utilizes the POMDPs.jl [24] Julia package for efficiently solving the budgeted-POMDPs using POMCP, as well as CVXPY [25] for solving the convex optimization formulation of the budget allocation problem. The initial budget split for solving the budget allocation problem is chosen randomly while satisfying the constraint of (14).
We model the components that comprise a typical administration building with a total size of 10,000 sq.ft.The building comprises multiple components such as lighting systems, roofing components, boilers, and carpeting, where each component's cost of replacement and inspection are based on empirically derived industry averages.Each component's health is defined by the Condition Index (CI) [26], which takes values between 0 and 100.The condition deteriorates stochastically over time, depending on various factors, and can only be observed through explicit inspections, which incur a cost.The component fails when the CI reaches below a failure threshold, which we assume to be 0. Components can also be replaced, restoring their CI to its full value.
The building is associated with an average maintenance budget of $2,200,000 for a given period of interest.Using historic CI data for each component, we synthesize the transition probabilities of their corresponding Partially Observable Markov Decision Processes (POMDPs).We consider 20 sustainable components from the building, with replacement costs ranging from 0.15% to 3% of the total budget and inspection costs ranging from 0.01% to 0.03% of the total budget.We scale the total budget to 10,000 units and appropriately scale the replacement and inspection costs of all components while ensuring that they are rounded to the nearest integers.The decision-maker's objective is to maximize the time until failure of the components by effectively allocating the budget among the components and taking replacement and inspections when needed.As in Section IV.C, we model this objective as a POMDP by assigning a reward of 1 when the CI is greater than the failure threshold and 0 otherwise, and modeling the state of 0 health and budget exhaustion as absorbing states.For our experiments, we consider simulations with a horizon of up to 100 decision steps with a 1 year step size.

Maintenance Policy Synthesis
In this section we compare the maintenance and inspection policies obtained from the proposed POMDP-based model with a realistic baseline approach.In the baseline approach, a building manager typically schedules component inspection at a regular interval and the true health of the component is only obtained at these regular intervals.In the absence of an inspection, the CI of a component at a given time step is estimated to be the most probable CI state as determined by its CI transition dynamics.The baseline policy used in this section replaces the component if its estimated CI is less than a pre-determined threshold.
We use time-to-failure (TTF), defined as the number of simulation steps until failure, as the performance metric. We run experiments to calculate the TTF for each component by averaging the values obtained over 5 independent simulations with 100 maximum possible simulation steps. We set the maximum tree depth for POMCP rollouts to 50 and use a UCB exploration constant of 10. In the baseline policy, we inspect the CI every 5 steps and replace the component if the estimated CI is below 15. Figure 1 shows the simulation results comparing the TTF obtained for different budget values using the baseline and the proposed approach. The proposed approach provides a clear advantage over the baseline strategy over the entire range of budget values for all 20 components, irrespective of the replacement costs. Figure 2 shows sample CI histories for the same component obtained from simulations using the proposed approach and the baseline policy. The proposed approach takes inspection and replacement actions only when deemed necessary based on the latest belief estimate and the potential loss of value due to an inaccurate estimate or due to not taking a replacement action. Such behavior holds true for every component without any component-specific parameter tuning. On the other hand, the estimated state from the baseline policy, based on the most probable transition, may not always be the same as the real transition, ultimately resulting in early failures. Although it is possible to enhance the baseline by incorporating component-specific parameters and budget-aware heuristics, our experiments indicate that its performance still lags behind the proposed approach, particularly when the budget is tightly constrained.
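The baseline rule described above can be written down in a few lines. This is a hedged sketch of the stated rule only: the function names, the action labels, the precedence of replacement over inspection, and the toy transition table are assumptions introduced for illustration, while the interval of 5 steps and the threshold of 15 are taken from the text.

```python
def most_probable_next(ci, transition):
    """Estimate the next CI as the most probable successor of the current CI."""
    row = transition[ci]
    return max(row, key=row.get)


def baseline_action(step, estimated_ci, inspect_every=5, replace_below=15):
    """Fixed-schedule baseline: replace on a low estimate, inspect periodically."""
    # Assumption: replacement takes precedence when both rules would fire.
    if estimated_ci < replace_below:
        return "replace"
    if step > 0 and step % inspect_every == 0:
        return "inspect"
    return "wait"


# Hypothetical CI transition probabilities for two condition levels.
toy_transition = {100: {100: 0.6, 90: 0.4}, 90: {90: 0.3, 80: 0.7}}
estimate = most_probable_next(90, toy_transition)
```

Because the estimate between inspections always follows the single most probable transition, any less likely but faster deterioration goes unnoticed until the next scheduled inspection.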

Budget Allocation
We demonstrate the effectiveness of our proposed budget allocation approach by comparing it to a baseline that depends on two component properties: (i) mean-time-to-failure (MTTF), which is the expected number of steps a component takes for its condition index to go below the failure threshold when starting from the maximum possible condition index, [...]. We quantify the performance of the budget allocation algorithms by running 20 independent simulations over all the components using the allocated budgets and calculating the overall TTF for the building. To ensure fairness, we compare both budget allocation algorithms by running simulations using policies obtained by the same decision-making strategy: the POMCP-based approach. Figure 3 summarizes the TTF results from the baseline and the proposed approaches for all components in the building. The proposed allocation approach achieves an overall TTF of 1510, outperforming the baseline, which achieves an overall TTF of 1355. Hence the proposed approach attains a higher overall TTF, in accordance with the objective defined by (15). Analyzing individual component data, we observe that instances where the proposed approach underperforms the baseline exhibit only slight differences in TTF values. In contrast, when the proposed approach outperforms the baseline, we observe a significant improvement. Note that the maximal TTF of 100 is achieved by both strategies for 50% of the components, the proposed strategy performs better for 35% of the components, and the baseline performs better for only 15% of the components.
Table 1: Comparison of time taken to find the optimal budget split among 5, 10 and 20 components respectively, for a total budget of 10,000 units.
The results in Table 1 present the time taken to solve the optimization problem given by (15). As can be clearly observed, the solution time increases with the number of components. However, the values are always in the range of a few hundred milliseconds.

Conclusions and Future Work
In this paper, we formulated the problem of optimal policy synthesis for a multi-component POMDP with a budget, in the sense of maximizing the time before reaching an absorbing state. We first introduced a b-POMDP model to facilitate optimal planning in POMDPs while adhering to budget constraints. Next, we showed that the value function, or maximal collected reward, for b-POMDPs is a concave function of the budget under significant assumptions of full observability and deterministic transitions. We then presented an algorithm to find the optimal budget split among the component POMDPs of an $n$-component POMDP with a given total budget $B$. The budget-splitting problem was posed as a welfare maximization problem. The concavity of the maximal collected reward with respect to the budget makes the problem a convex optimization problem. The experimental evaluations of the proposed algorithm in an infrastructure component management scenario verify its effectiveness in terms of performance. Performing simulations on real data, we observe that our algorithm vastly outperforms the policies currently used in practice.
There are several possible directions of future work. While the proposed approach makes it possible to compute an approximately optimal policy in a feasible amount of time, the computational cost of using the POMCP algorithm is still high. Our first direction of work is to reduce this cost by incorporating a learning framework so as to eliminate repeated runs of POMCP. In addition, the budget allocation scheme is fixed in the sense that the budget split is done before the start of the planning horizon. The second direction of work is therefore to consider other efficient budget allocation methods. A sequential algorithm for optimal computing budget allocation is presented in [27]. Applying this algorithm to our problem may result in more accurate budget allocation. Similarly, another method which can be explored is the allocation method presented in [28]. Following these methods may also allow us to generalize our algorithm to cases where the value function does not satisfy the concavity property. Furthermore, another direction is to derive an optimal dynamic budget allocation scheme that accounts for changes in the transition probabilities of the component states during the planning horizon, as well as for a cyclical budget.

Theorem 4.5. The value function for the MDP, $V_H(s_0, b)$, is concave in the budget $b$ for any horizon length $H \in \mathbb{N}_0$ and initial state $s_0 \in S_{MDP}$. More specifically:
• For $s_0 > d_0$, $V_H(s_0, b)$ is constant with respect to the budget for all $b \ge \lfloor H/2 \rfloor$, where $\lfloor \cdot \rfloor$ represents the floor function,
• For $s_0 \le d_0$, $V_H(s_0, b)$ is constant with respect to the budget for all $b \ge \lceil H/2 \rceil$, where $\lceil \cdot \rceil$ represents the ceiling function,
• For budget values $b$, $b'$ and $b''$ such that $b'' = b' + 1 = b + 2$ and $b \ge 0$, we have:
$$V_H(s_0, b'') - V_H(s_0, b') \le V_H(s_0, b') - V_H(s_0, b). \tag{10}$$

2. Consider three budget values $b'' = b' + 1 = b + 2$ such that $b = 0$. If we are at $s = s_{\max}$, we will reach $s = 0$ in $\frac{s_{\max}}{d_0}$

Figure 1: Comparison of the proposed and baseline approaches using time-to-failure for a range of budget values. (a) Overall results obtained by averaging over all components. (b) Results for the Air Handling Unit component with a replacement cost of 250 units. (c) Results for the Lighting Equipment component with a replacement cost of 24 units.

Figure 2: Sample condition index (CI) histories illustrating the performance of the proposed policy when compared to the baseline for the Boiler component with a replacement cost of 45 units, an inspection cost of 1 unit, and a total budget of 500 units. (a) CI history using the proposed approach, showing failure at 80 time steps. (b) Baseline approach failing at 39 time steps.

Figure 3: Comparison of baseline and proposed budget allocation approaches for all 20 components for an overall budget of 10,000 units.
Number of Components    Time (mean ± std. dev. of 7 runs)
5                       333 ms ± 21.4 ms
10                      412 ms ± 37.1 ms
20                      552 ms ± 31.6 ms