Managing the Balance Between Project Value and Net Present Value Using Reinforcement Learning

Project managers make decisions weighing financial returns (net present value, NPV) and value creation expected by stakeholders. Often, plans maximizing NPV neglect stakeholder benefits while those focused strictly on value creation may reduce financial viability. This paper puts forth a new stochastic optimization model handling this compromise using a mixed integer program solved with reinforcement learning. The model incorporates uncertain activity durations and considers positive and negative cash flows. Our Monte Carlo control method with $\epsilon$-greedy policies and timed start actions for activities facilitates the simultaneous maximization of NPV and project value. The resulting efficient frontier delineates various project plans, demonstrating the trade-off between maximizing NPV and project value, providing decision makers with visual analysis to select plans that fit organizational needs. Computational experiments demonstrate superior performance over a mathematical solver limited by the problem's complexity and a metaheuristic lacking guided online learning. The results help senior management select satisfactory plans that balance financial returns with stakeholder preferences. The methodology contributes a novel tool for quantitatively incorporating value creation alongside financial objectives in project planning.


I. INTRODUCTION
Solutions to the maximization of project net present value (max-NPV) problem are a sought-after commodity today. Decision makers need to evaluate different project alternatives, make go/no go decisions, and decide which projects will be part of their project portfolio [1]. It is common knowledge, however, that the evaluation of a project should not be based solely on financial considerations; a project can be unsuccessful by NPV criteria and yet deliver the expected value to customers and other stakeholders. (Project value and benefit are used interchangeably in the literature. In this paper we use the term project value.) For example, [2] describes a construction company planning a major industrial safety campaign in response to its poor safety record and high insurance premiums. The project aims to reduce insurance costs by $250,000 annually and improve the company's ranking in an industry safety review from the 90th percentile to the 10th percentile. The project has a negative NPV of −$350,000, which, taken alone, may mean a no-go decision. Nevertheless, if the company's board also considers the value criterion, namely the improved industry ranking, and judges that it outweighs the financial loss, it could decide to proceed. Thus, project value is increasingly becoming a vital factor in Project Management [2].
Project value can be defined as a combination of attributes that depend on the stakeholders' preferences "such as features, functions, reliability, size, speed, availability, design aesthetics, etc." [4]. This paper adopts the framework used in [5], where the attributes are formulated as an objective function that reflects the value according to the customers' and stakeholders' preferences. In Section IV, we present an example of how the project value is calculated.
The associate editor coordinating the review of this manuscript and approving it for publication was Fabrizio Messina.
Research has tended to focus on the max-NPV problem and project value as separate research tracks rather than considering them together. We feel strongly that the consideration of both goals in tandem presents decision makers with a more thorough evaluation of a project when reviewing project alternatives. In this paper, we introduce a new formulation of the maximization problem that includes both NPV and project value, model a multimode setting that allows the consideration of different project plan alternatives, develop algorithms to solve the problem, and consider the tradeoff between realizing both objectives.
Other key components of our formulation are stochastic activity durations and the use of a robust form of NPV. Uncertainties are common in real-life projects and often result in budget and schedule overruns. According to one report that analyzed over 50,000 projects in 1,000 organizations, more than half (56%) of the projects went over the planned budget and almost two-thirds (60%) of the projects fell behind schedule [6]. More recent findings by the Project Management Institute indicate global figures of 38% and 45% for project budget and time overruns, respectively [7]. By focusing on stable solutions, we think that we provide a more relevant tool for decision makers than is available today.
The literature dealing with this paper's topic can be divided into two main branches: the max-NPV problem and project value management.

A. THE MAX-NPV PROBLEM
There is a considerable amount of research on the max-NPV problem. An early study of the deterministic problem was carried out by [8], where the objective function was linearized by approximation using the first terms in the Taylor expansion. Since then, more research on the max-NPV problem has accumulated. A review of past literature can be found in [9]; we focus on more current research. The problem has been extended to include resource constraints, in the resource-constrained project scheduling problem with discounted cash flows (RCPSPDC). This is also an extension of the resource-constrained project scheduling problem (RCPSP), which was proven to be NP-hard [10]. Gu et al. [11] offered an exact solution approach for the RCPSPDC limited to small projects and a Lagrangian relaxation with a decomposition method for large problems. Leyman and Vanhoucke [12] solved the RCPSPDC by constructing sets of activities and moving them together. Later they extended their work to include capital constraints and different cash outflow models [13]. Klimek [14] examined projects with payment milestones and compared different scheduling techniques such as activity right-shift, backward scheduling and left-right justification. In [15], the RCPSPDC is solved by combining a genetic algorithm and an immune algorithm. The authors employ different crossover, mutation, and immunization operators and select the best one at each stage. In a similar study, the authors enhance the combined genetic and immune algorithm with a variable insertion-based local search, a forward-backward improvement, a restart mechanism and an activity move rule to delay the activities with negative cash flow [16].
The multimode version of the RCPSPDC is an extension of the original problem. Projects with up to 30 activities and three modes are solved optimally with a network flow model in [17]. The scheduling technique in [12], mentioned above, is extended in [18] to include multimode projects and different payment models for cash inflows. Zhang et al. [19] balanced the NPV of the contractor and client in a bi-objective optimization problem.
Another extension of the original deterministic max-NPV problem is the stochastic max-NPV problem (denoted as SNPV by [20]), where the activity durations and cash flows are random variables ([20] presents a detailed review of early literature on this topic). Creemers et al. [21] maximized the expected value of the NPV (eNPV) with variable activity durations, the risk of activity failure and different paths or modules to mitigate this risk, ignoring resource constraints. In a similar vein, [1] considered a general project failure risk that decreases with project progress, and activity-specific risks; earlier activity completion on the one hand eliminates its risk of failure, improving the eNPV, but on the other hand may also accelerate costs, which worsens the eNPV. Weather condition modeling was incorporated into stochastic durations by [22], where the decision variables are gates when resources are made available for specific activities.
Creemers [23] found globally optimal solutions for the SNPV problem where activity durations are phase-type distributed, cash flows are deterministic, and no resource constraints are considered. The author subsequently applied the results to finding the optimal sequence of stages in multistage sequential projects with stochastic stage durations, also obtaining exact, closed-form expressions for the moments of the NPV and using a three-parameter lognormal distribution to approximate the NPV distributions accurately [24]. He showed that the problem is equivalent to the least cost fault detection problem (LCFDP; this was also proven by [25]). Hermans and Leus [26] offered a new efficient algorithm and showed that in Markovian PERT networks, where activities are exponentially distributed and there are no resource constraints, the optimal preemptive solution solves the non-preemptive case as well. Two known proactive scheduling time buffering methods and two reactive scheduling models were employed by [27] to investigate the max-eNPV problem with stochastic activity durations. Time-buffer allocation was also proposed in [28], who added the expected penalty cost as a measure of solution robustness. Rezaei et al. [29] considered uncertainty in activity duration and cash flow and two objectives: maximization of eNPV and minimization of NPV risk. Their model ignores resource constraints.

B. PROJECT VALUE MANAGEMENT
In [30] we reviewed the growing body of literature on project value management; here we cite the main references. A qualitative approach was taken by some researchers who study value in projects. For example, these researchers developed a framework to evaluate and formulate value [31]. They also examined the influence of value management on project success [32], [33], [34], suggested a scale to determine target values [35], explained how value is created [36], investigated value management in the dismantling of infrastructure projects [37], and contrasted the value of projects done offshore with those done domestically [38].
Another research direction that focuses on value is quantitative. It involves measuring the progress of product development based on the value added to customers [39], computing the value contribution to the project by staff member skills [40], quantifying value according to the attributes that matter to stakeholders [41], and developing a framework to plan and monitor cost, schedule, risk, and technical performance based on these attributes [4]. Some researchers in this quantitative field also use Quality Function Deployment (QFD) for project management. QFD is a well-known tool that captures the voice of the customer and converts it into engineering requirements [42]. It measures the value or performance of a product in multiple dimensions. We apply QFD to determine project value by using value parameters in the activity modes. Section IV shows an example of how we translate the voice of the customer into product value parameters and calculate the value of a specific project. Other recent papers apply QFD to project management [43], [44], [45].
A third research direction in project value is a new branch of research integrating project scope with product scope, which is the outlook we adopt in this paper. This research branch is characterized by expanding the idea of activity mode to cover not only cost and duration but also value parameters. Mode selection will, therefore, affect project value. In [46], a cost-effective design strategy is investigated, aiming to maximize the effectiveness-to-cost ratio and integrating decisions on project schedules, resource allocations, and product performance. Balouka et al. [5], on the other hand, extended the deterministic multimode resource-constrained project scheduling problem to include project value. Project management is combined with systems engineering in [47], who synchronize each activity mode with selected architectural components. Shtub et al. [48] and [49] described the use of simulation-based training in the integration of project and product scopes.
Table 1 summarizes the main features and characteristics of this and existing studies. It also highlights the gaps in the literature that this paper addresses. The most prominent lacuna is that none of the previous papers considers both NPV and value as objectives in the same model. The present study, in contrast, aims to find the efficient frontier between these two alternative goals. Another gap is that most papers that involve NPV assume single-mode projects, whereas this paper deals with multimode projects, which allow generating alternate project plans that offer a range of value outcomes to stakeholders. Moreover, the present paper incorporates risk into the NPV calculation by using stochastic activity durations, while most papers that use NPV adopt deterministic models. Furthermore, we develop a quantitative optimization model for project value management, which is rare in the literature, as most papers focusing on project value are qualitative or descriptive. Finally, this paper employs reinforcement learning (RL) as a solution method, which is a novel and powerful approach for project management problems. To the best of our knowledge, the only previous paper that uses RL for project value management is our own [30], and this is the first paper that applies RL to project management problems related to NPV.
The aim of this paper is to model the tradeoff between project value and NPV in a multimode setting, where the selection of an activity mode will impact cost, duration, resource usage, and value, thus combining project scope (the tasks to be completed) with product scope (the characteristics and capabilities of the product and the resulting value) [50]. We consider stochastic activity durations to model realistic uncertain environments and introduce a new measurement of robustness in the NPV decision variable. Both objectives are introduced in a mixed integer program, and the evaluation of the objective function can be used to plot the efficient frontier (see Section IV for an example). To solve the proposed problem, we offer an innovative reinforcement learning (RL) based algorithm.
We have organized the rest of this paper in the following way. Section II presents the mathematical formulation. In Section III, our RL-based solution is explained. We describe an example in Section IV and the experimental setting in Section V. Section VI presents our results, which are discussed in Section VII. Some conclusions are drawn in the final section.

II. THE PROPOSED MIXED INTEGER PROGRAM (MIP) FORMULATION
We formulate the problem as a mixed integer program (MIP). To model resource allocations, we employ a flow-based formulation adopted in many recent project scheduling works, e.g., [51], [52], [53], [54], [55], [56], [57], [58]. This formulation is especially suitable for stochastic models, where the activity start or finish times may vary according to the realized durations. The multimode setting, where each activity mode represents an alternative with its own time, cost, resource, and value parameters, is essential for the generation of different solutions on the efficient frontier of the project value/NPV curve. In a single-mode problem no change in the project value is possible, and an efficient frontier cannot be constructed. The model seeks to maximize the robust project NPV and the project value. We tackle the chance constraints using a scenario approach (SA), introduced in [59] and applied in recent project scheduling papers [60], [61], [62]. The idea is to take $S$ samples or scenarios of the realization of the random variables in the constraints (in our case, the activity durations) and substitute the deterministic scenario constraints for the stochastic chance constraints. Table 2 lists the mathematical model's sets, parameters, and decision variables.
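The scenario approach can be illustrated with a minimal sketch that draws $S$ duration realizations per activity mode. The triangular distribution over optimistic, most likely, and pessimistic durations is an assumption made here for illustration, as are the function name `sample_scenarios` and the input layout; the formulation itself only requires some set of sampled durations $d_{jms}$.

```python
import random

def sample_scenarios(duration_params, num_scenarios, seed=0):
    """Draw one duration per (activity, mode, scenario).

    duration_params: {(j, m): (optimistic, most_likely, pessimistic)}
    Returns {(j, m, s): duration}, with s = 1..num_scenarios.
    The triangular distribution is an illustrative assumption.
    """
    rng = random.Random(seed)
    scenarios = {}
    for (j, m), (low, mode, high) in duration_params.items():
        for s in range(1, num_scenarios + 1):
            # random.triangular takes (low, high, mode) in that order.
            scenarios[(j, m, s)] = rng.triangular(low, high, mode)
    return scenarios

# Hypothetical data: one activity with two modes.
duration_params = {(1, 1): (2.0, 3.0, 5.0), (1, 2): (1.0, 2.0, 4.0)}
durations = sample_scenarios(duration_params, num_scenarios=100)
```

Each sampled duration then enters the model as a fixed parameter, so every scenario contributes ordinary deterministic constraints.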
Let us consider a project with $J$ activities. Each activity $j$ can be executed in one of $M_j$ modes and is preceded by a set of immediate predecessors $P(j)$. Each activity $j$ executed in mode $m$ in scenario $s \in \{1, \ldots, S\}$ has a duration $d_{jms}$. There are $K$ different renewable resources, each with unit cost $c_k$ per period. Activity $j$ executed in mode $m$ needs $r^k_{jm}$ units of resource $k$, which has a total availability of $R_k$. Apart from the duration-dependent resource costs, there is a fixed cash inflow or outflow $c_{jm}$ associated with activity $j$ executed in mode $m$, composed of fixed costs and payments received. Without loss of generality, we assume that payments are received or made at the end of each activity. The literature contains two main approaches to avoid gaps between activities and to prevent an activity with negative cash flow from being indefinitely postponed: 1) using a deadline [12] and 2) assuming a sufficiently large payoff at the end of the project that offsets the gains from postponing activities that affect project completion [21]. In this paper we adopt the latter approach.
For problems that seek to minimize project duration, a common robustness measure is the timely project completion probability (employed, for example, in [63]). We adopt this concept and define, in our problem, decision variable rNPV, the robust NPV, as the project NPV delivered with a probability of at least $\gamma$. This way, instead of applying the robustness measure to a given schedule, we search directly for a schedule with the desired robustness.
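For a fixed schedule, this definition can be written compactly as follows. The expression is a sketch in notation introduced here for illustration: $NPV_s$, the NPV realized in scenario $s$, is an assumed symbol, and the second form is the sample-based counterpart over $S$ scenarios.

```latex
\[
rNPV \;=\; \max\Bigl\{\, x \;:\; \Pr\bigl(NPV \ge x\bigr) \ge \gamma \Bigr\}
\;\approx\;
\max\Bigl\{\, x \;:\; \tfrac{1}{S}\,\bigl|\{\, s : NPV_s \ge x \,\}\bigr| \ge \gamma \Bigr\}
\]
```

Read this way, rNPV is a lower quantile of the NPV distribution: raising $\gamma$ can only lower or keep the rNPV of a given schedule.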
We set parameter $NPV^{UP}$ as an upper bound for rNPV. Parameter $\hat{r}$ is the discount rate, and $EF_{js}$ and $LF_{js}$ are the earliest and latest finish times for activity $j$ in scenario $s$, respectively. $T_{max}$ is an upper bound for the project duration.
Binary decision variable $\delta_{jm}$ indicates (value 1) if activity $j$ is carried out in mode $m$ (as presented in [64]), and decision variable $t_{js} \in \{EF_{js}, \ldots, LF_{js}\}$ denotes, for scenario $s$, the finish time of activity $j$, $j = 0, \ldots, J + 1$, where activities 0 and $J + 1$ are dummy activities (milestones) with a single mode, no duration, and no resources, and represent the start and end of the project, respectively. $\tau_s$ is a binary decision variable indicating (value 1) whether the scenario NPV is greater than rNPV. Decision variable $\beta_{js}$ is the discount factor for activity $j$ in scenario $s$, and parameter $\beta^{UP}$ is an upper bound for the discount factor. Binary decision variable $z_{ij}$ indicates (value 1) if activity $j$ starts after activity $i$ finishes. The amount of resource $k$ transferred from activity $i$ to activity $j$ is modeled by the flow variable $\varphi^k_{ij}$. The project has $V$ different value attributes. As noted in the Introduction, these attributes depend on the stakeholders' preferences (see there for examples of attributes). Let $V_{jmv}$ be the parameter that represents the value of attribute $v$ for activity $j$ performed in mode $m$. Let $V'_{jv}$ be the decision variable that denotes the value of attribute $v$ for activity $j$ performed in its chosen mode. We use a project-specific function $F_v(\cdot)$ that computes the project value for each attribute $v$ based on the individual attributes $V'_{jv}$ and a project-specific function $JV$ that determines the project value based on the values for each attribute (we introduced these value functions, decision variables and parameters in [50]). Parameters $w_1$ and $w_2$ represent the objective function weights for rNPV and project value, respectively. By solving the MIP for different weights $w_1$ and $w_2$, the efficient frontier between rNPV and the project value can be determined.
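The role of the weights can be illustrated with a minimal sketch that replaces the MIP by a lookup over a handful of already evaluated plans. The plans and their numbers are hypothetical and not taken from the paper; the sweep with $w_2 = 1 - w_1$ is likewise an assumed convention.

```python
from typing import List, Tuple

Plan = Tuple[str, float, float]  # (label, rNPV, project value)

def best_plan(plans: List[Plan], w1: float, w2: float) -> Plan:
    """Return the plan that maximizes w1 * rNPV + w2 * value."""
    return max(plans, key=lambda p: w1 * p[1] + w2 * p[2])

def weighted_sum_frontier(plans: List[Plan], steps: int = 10) -> List[Plan]:
    """Sweep w1 from 0 to 1 (w2 = 1 - w1) and collect the distinct winners."""
    frontier: List[Plan] = []
    for i in range(steps + 1):
        w1 = i / steps
        plan = best_plan(plans, w1, 1.0 - w1)
        if plan not in frontier:
            frontier.append(plan)
    return frontier

# Hypothetical plans; the numbers are illustrative only.
plans = [
    ("A", 100.0, 5.0),
    ("B", 80.0, 40.0),
    ("C", 40.0, 60.0),
    ("D", 50.0, 30.0),  # dominated by B in both objectives
]
frontier = weighted_sum_frontier(plans)
```

In this toy data the sweep returns plans C, B, and A, and never the dominated plan D. A weighted sum can only reach frontier points on the convex hull of the attainable set, so a finer weight grid does not by itself guarantee that every efficient plan is found.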
We also employ additional variables for linearizing two constraints. Binary variables $t^p_{js}$ are equal to 0 for all $p < t_{js}$ and 1 for all $p \ge t_{js}$, $p = 0, \ldots, T_{max}$. Variables $y_{jms}$ replace the products $\beta_{js} \cdot \delta_{jm}$. We now present the model, followed by an explanation of the objective function and constraints.
Max [...] subject to: [...] and [...]
The objective function (1) aims to maximize a weighted sum of the project's rNPV and value. The weighted-sum approach is commonly used in multi-objective optimization in general [65] and is applied in a number of project scheduling papers (for example, [28], [66]). Constraints (2) indicate whether a scenario's NPV is greater than the project's rNPV. Initially, these constraints would be nonlinear because the positive or negative cash flow associated with each activity mode, $c_{jm} + \sum_k c_k \cdot r^k_{jm} \cdot d_{jms}$, would have to be multiplied by the discount factor variable and the indicator variable, $\beta_{js} \cdot \delta_{jm}$, indicating that the cash flow would have to be discounted according to the finish time and realized only for the selected mode. To avoid this nonlinearity, we use variables $y_{jms}$ in constraints (2). Constraints (3)-(6) guarantee that $y_{jms} = \beta_{js} \cdot \delta_{jm}$.
We use a discrete discount factor as in [67], which has the form [...], since a predecessor will always assume the value of 1 before its successor. Likewise, constraints (11) and (12) fix the value of $t^p_{js}$ for finish times before the early finish and after the late finish, respectively, and constraints (13) fix the value for the initial dummy activity. Constraint (14) counts the fraction of scenarios that yield the rNPV and forces this fraction to remain above the predetermined threshold.
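To make the discounted cash flow concrete, the following sketch evaluates the NPV of one scenario once modes and finish times are fixed. It assumes the common discrete form $(1 + \hat{r})^{-t}$ for the discount factor, which may differ from the exact form in [67], and it assumes that costs are entered as negative numbers so that the activity cash flow mirrors $c_{jm} + \sum_k c_k \cdot r^k_{jm} \cdot d_{jms}$. All function names and data are hypothetical.

```python
def discount_factor(rate, finish_time):
    """Assumed discrete discount factor (1 + rate) ** -finish_time."""
    return (1.0 + rate) ** (-finish_time)

def activity_cash_flow(fixed_flow, unit_costs, usage, duration):
    """c_jm + sum_k c_k * r^k_jm * d_jms, with costs given as negative numbers."""
    per_period = sum(c * r for c, r in zip(unit_costs, usage))
    return fixed_flow + per_period * duration

def scenario_npv(activities, rate):
    """Sum of activity cash flows, each discounted at the activity finish time."""
    return sum(
        activity_cash_flow(a["fixed_flow"], a["unit_costs"], a["usage"], a["duration"])
        * discount_factor(rate, a["finish_time"])
        for a in activities
    )

# Hypothetical two-activity scenario with a single resource type.
activities = [
    {"fixed_flow": -100.0, "unit_costs": [-10.0], "usage": [2], "duration": 3, "finish_time": 3},
    {"fixed_flow": 500.0, "unit_costs": [-10.0], "usage": [1], "duration": 2, "finish_time": 5},
]
npv_undiscounted = scenario_npv(activities, 0.0)
npv_discounted = scenario_npv(activities, 0.1)
```

With a zero rate the two cash flows of -160 and 480 simply add up; with a positive rate the later inflow is discounted more heavily than the earlier outflow, so the NPV drops.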
The following constraints were introduced by us in a prior conference paper [64]. Constraints (15) and (16) avoid cycles of 2 and 3 or greater, respectively, thus guaranteeing that the network is acyclic [51], [68]. Constraints (17) enforce the precedence constraints. Constraints (18) link the continuous activity finish time variables with the binary sequencing variables. Constraints (19) give upper and lower bounds for the activity finish times. Constraints (20), from [51], connect the continuous resource flow variables with the binary sequencing variables and the binary mode variables. Constraints (21) force the selection of only one mode per activity. Outflow constraints (22) ensure that all activities, except for $J + 1$, send their resources to other activities. Inflow constraints (23) ensure that all activities, except for activity 0, receive their resources from other activities. Constraints (24) bound the flow variables with the maximum resource consumption modes. Finally, constraints (25), which we introduced in [50], determine the value attributes according to the selected modes.
With the linearization of the constraints described above, if the project's value function is linear, the MIP is a mixed integer linear program (MILP) and can be solved with a commercial solver. We use this method as a benchmark in the computational experiments (Section V).
We previously presented a scenario-based MIP model for the multimode RCPSP (MRCPSP) with the objective of minimizing the duration in [64]. In this paper, we extend that model by incorporating the following innovations: 1) the new objective function that jointly maximizes the robust NPV and the project value; 2) additional constraints that capture the NPV and value aspects of the project; 3) the extra variables and constraints for linearizing the nonlinear terms in the model.

III. THE REINFORCEMENT LEARNING SOLUTION
From learning to play backgammon at near the level of the world's best players [69], through landing unmanned aerial vehicles (UAVs) [70], beating the highest ranked players in Jeopardy! [71], and human-level performance in Atari games [72], RL has been successful in applications for uncertain environments. This success is the factor motivating our application of RL to the formulation discussed in Section II. RL-based heuristics have been applied to project scheduling [73], [74], [75], [76], but to the best of our knowledge, [30] is the only study that tackled multimode problems involving chance constraints.
The RL model begins with an agent in state $S$. It takes action $A$ and transitions to state $S'$, earning reward $R'$. Then it performs action $A'$, transitioning to state $S''$ and earning reward $R''$, and so on. The agent's life trajectory can be expressed as $S, A, R', S', A', R'', S'', A'', R''', S''', A''', \ldots$ The agent follows a policy $\pi(S, A)$ that indicates which action it should choose at each state. The goal of the RL problem is to learn a policy that maximizes the agent's reward. We also define an action-value function $q(S, A)$ as the estimated reward for choosing action $A$ in state $S$ and then following policy $\pi(S, A)$ [30]. Applying the RL model to the formulation presented in Section II, we define a state as project activity $j$. The agent undertakes an action by choosing a mode $m_j$ and start time $t_j$ for activity $j$, and then moves on to the next activity. After selecting modes and start times for all activities $j = 1, \ldots, J$, it can calculate its reward $R(j, m, t)$. As it receives rewards, it learns the action-value function $q(j, m, t)$ and which policy $\pi(j, m, t)$ to follow.
The RL method that we apply in this paper is Monte Carlo control (MCC), based on [77]. We employed MCC because it best fits the problem at hand, making full use of Monte Carlo simulation to run the project plans, determine the cumulative probability distributions for rNPV, and obtain an exact value of the reward for each simulation run, without the need for bootstrapping, i.e., estimating the reward based on another estimate. MCC is a state-of-the-art RL method that has been employed in recent works such as [78], [79], [80], [81], [82], [83], [84], [85], and [86] to solve various problems in different domains. Table 3 summarizes our RL notation, in addition to the notation employed in the quantitative model. Our pseudocode and an explanation of our MCC method follow. The main procedure is shown in Algorithm 1.
Our algorithm starts with the initialization of the action-value list (Algorithm 2). For each activity, the action taken is selecting the mode and the start time. We use $\eta$ start times, equally spaced from zero to an upper bound, the maximum sequential project duration. We initialize the table with artificially high values, a technique known as optimistic initial values, in order to allow initial exploration of all actions.
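A minimal sketch of this initialization step follows. The function name, the dictionary layout keyed by (activity, mode, start-time index), and the magnitude of the optimistic value are assumptions for illustration and are not taken from Algorithm 2.

```python
def initialize_action_values(modes_per_activity, num_start_times, horizon,
                             optimistic_value=1e9):
    """Build the candidate start times and an optimistic action-value table.

    modes_per_activity: list with the number of modes of activities 1..J.
    num_start_times: eta, the number of candidate start times (at least 2).
    horizon: upper bound on start times (maximum sequential project duration).
    """
    # Eta start times equally spaced from zero to the horizon, inclusive.
    step = horizon / (num_start_times - 1)
    start_times = [i * step for i in range(num_start_times)]
    # One entry per (activity, mode, start-time index), all set very high
    # so that every action looks attractive until it has been tried.
    q = {}
    for j, n_modes in enumerate(modes_per_activity, start=1):
        for m in range(1, n_modes + 1):
            for t_idx in range(num_start_times):
                q[(j, m, t_idx)] = optimistic_value
    return start_times, q

start_times, q = initialize_action_values([2, 2, 3], num_start_times=5, horizon=20.0)
```

The optimistic value only needs to exceed any reward the agent can actually obtain; the first real reward then pulls the estimate down and the remaining untried actions stay preferred.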
The action-value list is then used to calculate the policy (Algorithm 3). To balance exploration and exploitation, we adopt an $\epsilon$-greedy policy, meaning that in the policy list, we ascribe a probability $\epsilon$ of taking a random action and a probability $(1 - \epsilon)$ of taking a greedy action, i.e., the action with the highest action value.
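The policy calculation for a single activity can be sketched as below. The sketch assumes the textbook variant in which the random action is drawn uniformly over all actions, including the greedy one, so the greedy action ends up with probability $1 - \epsilon + \epsilon / n$; whether Algorithm 3 uses exactly this variant is not stated here. The names are hypothetical.

```python
import random

def epsilon_greedy_policy(action_values, epsilon):
    """Map each action of one activity to its selection probability.

    action_values: {(mode, start_index): q}; the key layout is hypothetical.
    """
    n = len(action_values)
    greedy = max(action_values, key=action_values.get)
    # Every action shares the exploration mass epsilon uniformly ...
    policy = {a: epsilon / n for a in action_values}
    # ... and the greedy action also receives the exploitation mass.
    policy[greedy] += 1.0 - epsilon
    return policy

def sample_action(policy, rng):
    """Draw one action according to the policy probabilities."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

q_activity = {(1, 0): 5.0, (1, 1): 9.0, (2, 0): 7.0, (2, 1): 1.0}
policy = epsilon_greedy_policy(q_activity, epsilon=0.2)
action = sample_action(policy, random.Random(0))
```

With four actions and $\epsilon = 0.2$, the greedy action (mode 1, second start time) receives probability 0.85 and each other action 0.05.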
Next, we take an action based on the policy (Algorithm 4), selecting, for each activity, the mode and start time according to the probabilities in the policy list. Then, by right-shifting the activities, adding to each start time the finish time of the latest-finishing immediate predecessor, we adjust the start times to make them precedence-feasible. This means that if we select a start time of $t_j$ for activity $j$, we right-shift this activity to start at time $t_j$ after its immediate predecessor finishes. The finish times are determined using the most likely duration of each activity in its selected mode. Thereafter, we sort all activities according to their adjusted start times, obtaining a precedence-feasible activity list with the activities and their selected modes. The construction of this precedence-feasible activity list is the first of two steps of the implementation of the start time selection. The second step (Algorithm 5, described below) is implemented in the calculate_reward function.
Algorithm 1 Main Procedure for MCC
    initialize_action_values($J$, $\eta$, $M_j$, $LS_j$, $\forall j = 1, \ldots, J$) from Algorithm 2;
    while not stopping criterion:
        calculate_policy($J$, $\eta$, $M_j$, $q(j, m, t)$, $\forall j = 1, \ldots, J$) from Algorithm 3;
        choose_mode_start($\pi(j, m, t)$, $d^{ML}_{jm}$, $P(j)$, $\forall j = 1, \ldots, J$) from Algorithm 4;
        calculate_reward(sorted($m_j$), $P(j)$, $\eta$, ...
Algorithm 4
    for activity $j = 1, \ldots, J$:
        choose $m$, $t$ according to $\pi(j, m, t)$;
        if $P(j) == \emptyset$:
            $t^*_j = t_j$;
        else:
            ...
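The right-shift and sorting steps described in the prose can be sketched as follows. The sketch assumes that activities are supplied in an order in which predecessors come first, and it takes the adjusted start as the chosen start time plus the latest predecessor finish time, following the verbal description rather than the truncated pseudocode. The function and variable names are hypothetical.

```python
def build_activity_list(chosen, predecessors, ml_duration):
    """Right-shift chosen start times and return a precedence-feasible list.

    chosen: {j: (mode, start_time)}, inserted so that predecessors come first.
    predecessors: {j: [immediate predecessors of j]}.
    ml_duration: {(j, mode): most likely duration}.
    """
    adjusted_start, finish = {}, {}
    for j, (mode, start) in chosen.items():
        preds = predecessors.get(j, [])
        if not preds:
            adjusted_start[j] = start
        else:
            # Shift by the finish time of the latest-finishing predecessor.
            adjusted_start[j] = start + max(finish[i] for i in preds)
        finish[j] = adjusted_start[j] + ml_duration[(j, mode)]
    # sorted() is stable, so ties keep the predecessor-first input order.
    order = sorted(chosen, key=lambda j: adjusted_start[j])
    return adjusted_start, [(j, chosen[j][0]) for j in order]

# Hypothetical four-activity network: 1 -> {2, 3} -> 4.
chosen = {1: (1, 0), 2: (2, 4), 3: (1, 0), 4: (1, 2)}
predecessors = {2: [1], 3: [1], 4: [2, 3]}
ml_duration = {(1, 1): 3, (2, 2): 5, (3, 1): 6, (4, 1): 2}
adjusted_start, activity_list = build_activity_list(chosen, predecessors, ml_duration)
```

Here activity 2 asks for a delay of 4 periods and therefore ends up after activity 3 in the list, even though both become eligible when activity 1 finishes at time 3.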
Note that the selection of start times to generate an activity list is really a surrogate for selecting different combinations of precedence relations between the activities. The range of possible start times between zero and the upper bound provides ample options of early start or postponement of each activity, providing a richer search space with the possibility of better solutions. Furthermore, the adjustment to generating only precedence-feasible activity lists avoids both wasting runtime with infeasible solutions and discarding potentially good solutions.
The next step in the algorithm is to calculate the reward for the actions taken (Algorithm 5). Here, we implement the second step of the start time choice: the insertion of each activity in the baseline schedule. We handle each activity sequentially. First, we determine the interval between the earliest precedence-feasible start and the latest activity finish time of the activities scheduled until this point. This interval is divided into $\eta$ equal periods and we start the activity according to the index of its start time $t_j$ in the policy list. For example, if $t_j$ is the third start time in the policy list, we use the third period in the interval, rounding it to the nearest activity finish time. If there are not enough resources, we repeatedly right-shift the activity to the next scheduled activity finish time until there are enough resources. With the schedule in place, we calculate the objective function value. To calculate rNPV we simulate the NPV cumulative distribution function (CDF). For example, if the decision makers desire a 95% probability of delivering the rNPV, the baseline schedule is simulated 1000 times, the realized NPVs are sorted in increasing order, and the 50th element of the NPV list is the rNPV. We define the reward as the objective function value. As pointed out in Section II, to calculate the objective function value we define weights $w_1$ and $w_2$ for rNPV and project value, respectively (equation 1); repeating the algorithm for different weight values gives us different points on the efficient frontier.
Algorithm 5 Calculating the Reward
    def calculate_reward(sorted($m_j$), $P(j)$, $\eta$, ...
        for activity mode $m_j$ in sorted($m_j$)[1:]:
            ...
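The rNPV step of the reward calculation can be sketched as a simple order statistic. The single-activity project used to generate NPVs (a payoff of 1000 received at completion, a triangular duration, and a 1% rate per period) is purely hypothetical; only the rule of sorting the simulated NPVs and taking the element at position $(1 - \gamma) \cdot n$ follows the description above.

```python
import random

def robust_npv(simulated_npvs, gamma):
    """Return the NPV that is met or exceeded in at least a fraction gamma of runs."""
    ordered = sorted(simulated_npvs)
    n = len(ordered)
    # Position (1-based) of the rNPV in the sorted list,
    # e.g. 50 when gamma = 0.95 and n = 1000.
    k = round((1.0 - gamma) * n)
    k = max(1, min(n, k))
    return ordered[k - 1]

rng = random.Random(0)
# Hypothetical project: 1000 received at completion, duration with
# optimistic 8, most likely 10 and pessimistic 15 periods.
npvs = [1000.0 * 1.01 ** (-rng.triangular(8, 15, 10)) for _ in range(1000)]
rnpv = robust_npv(npvs, 0.95)
```

Because the 50th smallest of 1000 values is returned, at least 951 simulated NPVs are greater than or equal to `rnpv`, which satisfies the 95% requirement.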
In [64], we introduced an RL algorithm for the MRCPSP with the objective of minimizing the project duration. In this paper, we extend that algorithm by incorporating a novel feature: Algorithm 5, which allows the agent to select the start time of each activity from a set of feasible options, rather than always choosing the earliest possible start time. This feature enables the agent to account for the impact of positive and negative cash flows. For instance, when an activity has a negative cash flow, delaying its start time can increase the NPV by reducing the present value of the cash outflow.
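A short numeric illustration of this effect follows, assuming a discrete discount factor of the form $(1 + \hat{r})^{-t}$ and hypothetical amounts, rate, and times.

```python
def present_value(cash_flow, rate, time):
    """Discount a single cash flow occurring at the given time."""
    return cash_flow * (1.0 + rate) ** (-time)

rate = 0.01  # hypothetical discount rate per period

# An outflow of 1000 paid at period 2 versus period 6.
outflow_early = present_value(-1000.0, rate, 2)
outflow_late = present_value(-1000.0, rate, 6)

# An inflow of 1000 received at period 2 versus period 6.
inflow_early = present_value(1000.0, rate, 2)
inflow_late = present_value(1000.0, rate, 6)

gain_from_delaying_outflow = outflow_late - outflow_early
loss_from_delaying_inflow = inflow_late - inflow_early
```

Delaying the outflow raises its present value (it becomes less negative), while delaying the inflow by the same amount lowers it by the same magnitude, which is why fixed earliest-start scheduling is not generally optimal for NPV.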
The last step in the algorithm is to update the action-value list using the reward. We can choose from two update methods, $RL_1$ (Algorithm 6) and $RL_2$ (Algorithm 7). For both, we only update the action values corresponding to the selected modes and start times. $RL_1$ learns an action value by averaging all the rewards this action (mode and start time) has received each time it was taken. This signifies that new rewards have an increasingly smaller impact the more the actions are taken. The means are calculated incrementally to speed up the process and save memory. $RL_2$ updates the action values using a formula very similar to the incremental mean from $RL_1$, but instead of using the decreasing step $\frac{1}{N}$, it uses a constant step $\alpha$, so that the weight of a reward decays exponentially with its age and the most recent rewards count the most. These methods are explained in [77]; $RL_1$ and, to a lesser extent, $RL_2$ appear in the recent MCC papers listed above in this section [78], [79], [80], [81], [82], [83], [84], [85], [86].
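The two update rules can be contrasted in a few lines. This is a sketch of the generic incremental-mean and constant-step updates described in [77], with hypothetical function names and a single action; it is not a transcription of Algorithms 6 and 7.

```python
def update_rl1(q, counts, action, reward):
    """Incremental mean: step size 1/N shrinks as the action is taken more often."""
    counts[action] = counts.get(action, 0) + 1
    q[action] += (reward - q[action]) / counts[action]

def update_rl2(q, action, reward, alpha):
    """Constant step alpha: older rewards fade geometrically."""
    q[action] += alpha * (reward - q[action])

action = (1, 2, 0)  # hypothetical (activity, mode, start-time index)
rewards = [10.0, 20.0, 30.0]

q1, counts = {action: 100.0}, {}  # 100.0 plays the role of an optimistic initial value
q2 = {action: 100.0}
for r in rewards:
    update_rl1(q1, counts, action, r)
    update_rl2(q2, action, r, alpha=0.5)
```

After the three rewards, the incremental mean equals their plain average of 20, and the optimistic initial value has vanished after the first update (since the first step size is 1). The constant-step estimate is 33.75: it leans toward the latest reward of 30 and still carries a residue of the initial value.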
We repeat the process, calculating the policy based on the updated action values, selecting the modes and start times based on the policy, calculating the reward, and updating the action values, until reaching a stopping criterion. The solution takes the form of a baseline schedule consisting of the selected activity modes and start times, the rNPV and the project value.
Algorithm 6 Action-Value Update Using Average Rewards ($RL_1$)
    def update_action_values_RL1($J$, $m_j$, $t_j$, $\forall j = 1, \ldots, J$):
        for activity $j = 1, \ldots, J$:
            $q(j, m_j, t_j) \mathrel{+}= \frac{1}{N} \left[ R(j, m_j, t_j) - q(j, m_j, t_j) \right]$;
        return $q(j, m_j, t_j)$

IV. EXAMPLE
We use a radar development example, a simplified version of a real project [30], to demonstrate our problem and RL solution approach. The AON network of the project is shown in Fig. 1, and Table 4 gives the five project activities with two modes each, the durations (O, ML and P) for optimistic, most likely and pessimistic scenarios, the fixed cost (FC) of each activity, the resources per period needed for each activity mode (engineers, E, and technicians, T), the value parameters, and the income received after completing the activity. Three activities have negative cash flows, comprising the fixed and resource costs, and the other two have positive cash flows due to the income. This is a practical example of how to define and compute value. The value attributes of range, quality, and reliability (R, Q and Re in Table 4) reflect the needs and expectations of the project stakeholders. They are influenced by the value parameters in each activity mode. We use the notation from Section II and have three value attributes, V = 3. The radar equation [87] is applied to calculate the range (R), quality (Q) and reliability (Re) of the radar system [46], since they depend on technical parameters such as transmitter power and antenna gain, which vary according to the technological alternatives considered for each mode. The mode selection for the project plan will affect not only the value, but also the cost and NPV, thus integrating both project value components.
We now present the value functions for each attribute, 25 , where [TP] is transmitter power, [RS] is receiver sensitivity and [AG] is antenna gain, extracted from the activities of ''transmitter design'' (TD), ''receiver design'' (RD), and ''antenna design'' (AD), respectively, in Table 4. Using the notation from Section II, [TP], [RS], and [AG] are decision variables $V'_{21}$, $V'_{31}$, and $V'_{41}$, respectively. When we select one of the two modes, say, for the TD activity, we are also determining which of the parameter values, $V_{211}$ (50 in Table 4) or $V_{221}$ (100 in Table 4), will be assigned to the decision variable $V'_{21}$ ([TP] in this example). This same mechanism applies for decision variables $V'_{31}$ and $V'_{41}$ ([RS] and [AG]), allowing us to compute the value of function $F_1$, the value for the (R) attribute.
The equation for the radar quality (Q) is denote the reliability of antenna design, integration effort, transmitter, and receiver, respectively. The value of the project JV is calculated by a weighted sum of the three value attributes, a technique that is widely used in multi-attribute utility theory [88]:

There are a total of 11 engineers and four technicians available and the resource unit costs per period are $100 for engineers and $50 for technicians. We want to solve the problem for different weights w 1 and w 2 and find the efficient frontier for an rNPV probability of 95%. The result is shown in Fig. 2 with four non-dominated points. We reached similar objective values using RL 1 and RL 2. For convenience, we normalized the project values to be between 0 and 100 as in [5]. Decision makers can conduct a tradeoff analysis and select the solution that best meets stakeholders' needs and requirements.
As explained in Section III, rNPV is determined in each iteration by simulating the NPV CDF. For the point (76.62, 40,772) in Fig. 2, the CDF plot is shown in Fig. 3 and the rNPV is marked. For any solution that the decision makers select, a baseline schedule can be constructed easily by the process highlighted in Algorithm 5 and explained in Section III. The solution highlighted above produces the Gantt chart shown in Fig. 4. The activity durations are the most likely durations from the three-point estimates, and the selected modes are shown next to the activities.
It is interesting to visualize how our RL agent learns better solutions. Recall from Section III that the agent wants to learn the best actions, i.e., select activity modes and start times that will maximize its reward. In our RL model, we defined the reward as the weighted sum of rNPV and project value; thus, the agent will learn the action values, generate policies from the action values, and take actions based on the policies, seeking to maximize the objective function.
Fig. 5 exhibits the learning curves for both action-value updating variants, RL 1 and RL 2. At the beginning of the curves we see the effect of the optimistic initial values (Section III): even though a near-maximum objective was found early on, the agent kept searching haphazardly, ''thinking'' that it could receive a better reward by taking other actions, since the action-value list was initialized with artificially high values. The objective eventually stabilized at about 20,000. Since we used ϵ-greedy policies (Section III), the agent still explored from time to time, so the objective never settled completely on the maximum, jumping occasionally to smaller values.
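The interplay between optimistic initial values and ϵ-greedy selection described above can be illustrated with a small sketch. The function name `epsilon_greedy`, the initial value of 1e6, and the four (mode, start time) actions are hypothetical choices made for this illustration; the paper's own policy computation is given in Section III.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    best = max(q_values.values())
    # Break ties among equally valued actions at random.
    return rng.choice([a for a, v in q_values.items() if v == best])

OPTIMISTIC_INIT = 1e6  # hypothetical value, chosen above any attainable reward

# Hypothetical actions for one activity: (mode, start time) pairs.
q = {(mode, start): OPTIMISTIC_INIT for mode in (1, 2) for start in (0, 3)}

# Once an action has been tried, its value drops to the observed reward.
q[(1, 0)] = 20_000.0

# Even a purely greedy agent (epsilon = 0) now prefers an untried action.
chosen = epsilon_greedy(q, epsilon=0.0)
```

This is the behavior visible at the start of the learning curves: a good reward found early does not stop the search, because every untried action still looks better than any reward actually received.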

V. HYPOTHESES AND EXPERIMENTAL SETTING
In this section, we explain our factorial experiment, summarized in Table 5. The experiments were conducted to test two hypotheses:

A. HYPOTHESIS H 1
Our RL methods are suitable for solving the formulation presented in Section II, compared to established benchmarks.

B. HYPOTHESIS H 2
The start time selection RL actions can be leveraged for solving the problem with positive and negative cashflows.

To evaluate H 1, we selected two additional methods as benchmarks. First, we solved our MILP from Section II using the Python interface for the commercial solver Gurobi version 9.5, and second, we employed tabu search (TS), applied in several project scheduling studies.⁴ In the different TS applications in the literature, the general TS algorithm, described in [91], is tweaked specifically for the problem at hand. We selected a multimode RCPSPDC application [92], closer to our subject matter, thus simplifying the adaptation of TS to our problem.⁵ Additionally, we compared the performances of the algorithms for four project sizes, each with three modes per activity: 10 and 20 activities, which are small problems with greater potential of quickly covering a wider search space and obtaining faster solutions, and 50 and 100 activities, closer to real-life projects. To evaluate H 2, we ran our RL algorithm using both methods for updating the action values described in Section III, RL 1 and RL 2, and compared them with RL ES 1 and RL ES 2, simplifications of RL 1 and RL 2 where all activities are scheduled as early as possible.
For the 10- and 20-activity projects we used the complete PSPLIB J10 and J20 datasets [93], and for the 50- and 100-activity projects, the complete MMLIB50 and MMLIB100 datasets [94]. These datasets are the standard in the multimode project management literature [95]. We analyzed projects with two types of cash flows: positive cash flows only (+), and cash flows that are both positive and negative (+−). In the former, the NPV criterion is a regular scheduling objective, meaning that in a given schedule it is never beneficial to delay an activity if it could be scheduled earlier. In the latter, this observation does not hold [96].
Because of the long runtimes, the solver runs were performed only for 10- and 20-activity projects.⁶ The TS application in [92] was developed for a deterministic multimode RCPSPDC problem with positive cash flows, and thus in this paper we use TS as a benchmark only for settings with positive cash flows (see Appendix A for more details about our TS implementation).
⁴ A literature review on TS applications in project scheduling falls outside the scope of this paper. Recent research includes [27], [28], [89], and [90].
⁵ We opted for TS because in that publication it produced smaller maximal relative deviations from the best solutions than simulated annealing.
⁶ In our tests the solver could not generate a single incumbent solution for a sample of four 50-activity projects after 48 hours of runtime. Even when we tried to run the sample for 100 scenarios instead of 1000, after 6.8 hours the solver was still running the linear relaxation and had not yet started to solve the MIP.
We calculated the activity start times for RL ES 1 and RL ES 2 in the calculate_reward function (Algorithm 5), as $t_j^{*} = \min\{\, t_j \mid t_j \ge t_i^{*} + d^{ML}_{im},\ \forall i \in P(j) \text{ and } r^{k}_{jm} \le R^{k}_{surplus},\ \forall k = 1, \dots, K \,\}$. The start times in TS were also calculated this way. RL ES 1 and RL ES 2 were the only RL methods used with the positive cashflow instances. In the positive and negative cashflow instances, they were used for comparison to evaluate the improvement in the objective function obtained by selecting the start times instead of starting as early as possible.
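The early start rule can be sketched as a search for the first period at which both the precedence and the resource conditions hold. This is only an illustration under stated assumptions: the function name `earliest_start`, the per-period `usage` dictionary, and the check of the resource surplus over the whole activity duration are choices made here, not details taken from Algorithm 5.

```python
def earliest_start(preds_finish, duration, demand, capacity, usage):
    """Smallest t at or after all predecessor finish times with enough surplus.

    preds_finish: finish times of predecessors (start plus most likely duration).
    usage: maps period -> units already committed per resource type.
    """
    if any(d > c for d, c in zip(demand, capacity)):
        raise ValueError("demand exceeds total capacity; no feasible start")
    t = max(preds_finish, default=0)
    while True:
        # The activity fits if every resource has surplus in every period it spans.
        fits = all(
            usage.get(p, [0] * len(capacity))[k] + demand[k] <= capacity[k]
            for p in range(t, t + duration)
            for k in range(len(capacity))
        )
        if fits:
            return t
        t += 1

# Hypothetical data: 11 engineers and 4 technicians available per period.
capacity = [11, 4]
usage = {0: [8, 2], 1: [8, 2], 2: [8, 2]}  # committed by running activities
start = earliest_start(preds_finish=[1], duration=2, demand=[5, 1],
                       capacity=capacity, usage=usage)
```

In this example the predecessor finishes at period 1, but only three engineers are free until period 3, so the activity is pushed to the first period with sufficient surplus.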
The stopping criterion for all RL methods was 1000 iterations after having visited all states with optimistic initial values. For TS, we used two stopping criteria: the maximum runtime between RL ES 1 and RL ES 2 for the corresponding instance⁷ and double this time (TS D). For the solver, because of the long runtimes, we set the gap between the lower and upper objective bounds to 10% (Gurobi parameter MIPGap = 0.1) and a maximum runtime of 30 minutes (Gurobi parameter TimeLimit = 1800). We employed 1000 scenarios for the solver for the 10-activity runs and 100 scenarios for the 20-activity experiments (because of the long runtimes); we used 1000 simulation runs to calculate each RL reward and TS objective function.
To determine the durations of different activity modes, we employed a three-point estimation technique.The dataset's duration was defined as the most likely duration, while the pessimistic duration was set at 2.25 times this value.Similarly, the optimistic duration was determined to be half of the most likely duration.These factors, which can be found in [97], align with the widely recognized observation that activity durations in project management tend to be skewed towards longer durations (refer to [98] for an example).To simulate realized durations for the activities, we utilized a triangular distribution, a commonly used method in project simulation (see the scenario presented in [99]).The resulting durations were then rounded to the nearest integer.The optimistic, most likely and pessimistic durations were used for the triangular distribution lower limit, mode and upper limit parameters, respectively [100].
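The sampling scheme described above can be sketched with the standard library alone. The function name `sample_duration`, the seed, and the example most likely duration of 8 periods are assumptions for illustration; the factors 0.5 and 2.25 and the rounding step are those stated in the text.

```python
import random

def sample_duration(most_likely: int, rng: random.Random) -> int:
    """Draw one realized duration from the three-point estimate."""
    optimistic = 0.5 * most_likely
    pessimistic = 2.25 * most_likely
    # random.triangular takes (low, high, mode), in that order.
    return round(rng.triangular(optimistic, pessimistic, most_likely))

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
draws = [sample_duration(8, rng) for _ in range(1000)]
mean_duration = sum(draws) / len(draws)
```

Because the pessimistic estimate lies further from the mode than the optimistic one, the distribution is skewed to the right and its mean, (0.5 + 1 + 2.25)/3 ≈ 1.25 times the most likely duration, exceeds the dataset duration.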
The objective function was evaluated with weights w 1 = w 2 = 0.5. We set γ, the desired probability of the project yielding the rNPV, to 0.95. We defined the discount rate r = 0.01 per period. For the experiments with positive cash flows, the activity mode cash flows were drawn randomly from a uniform distribution on the interval (0, 10); for the experiments with positive and negative cash flows, they were drawn from the interval (−100, 100). At the end dummy activity, a final payment of 10 was received in the experiments with positive cash flows; in the experiments with positive and negative cash flows, the final payments were 1000, 2000, 5000, and 10,000 for 10, 20, 50, and 100 activities, respectively.
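One way to read these settings is sketched below: the rNPV is taken as the largest simulated NPV that at least a share γ of the simulation runs reach or exceed, and the objective is the weighted sum of rNPV and project value. This empirical-quantile reading is an assumption of the sketch, as are the function names `robust_npv` and `objective` and the sample data; the paper derives the rNPV from the simulated NPV CDF as described in Section III.

```python
import math

def robust_npv(npv_samples, gamma=0.95):
    """Largest sampled value x such that at least a share gamma of samples are >= x."""
    ordered = sorted(npv_samples, reverse=True)
    needed = math.ceil(gamma * len(ordered) - 1e-9)  # tolerance for float error
    return ordered[needed - 1]

def objective(rnpv, project_value, w1=0.5, w2=0.5):
    """Weighted sum of robust NPV and project value."""
    return w1 * rnpv + w2 * project_value

samples = list(range(1, 101))  # hypothetical simulated NPVs: 1, 2, ..., 100
rnpv = robust_npv(samples)
score = objective(rnpv, project_value=80.0)
```

With 1000 simulation runs per reward, as in the experiments, the same function would return the 950th largest simulated NPV.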
To tune the RL algorithm parameters, we undertook a full factorial experiment based on the F-Race algorithm, following [101]. The inputs for F-Race are a target algorithm (in our case, the RL algorithms), a set of configurations, a set of problem instances, and a performance metric (in our case, the objective function). F-Race internally employs statistical tests to guide its search process, as follows. When a minimum sample of instances is run for all configurations, a rank-based Friedman test is performed. If the test indicates significant performance differences, Wilcoxon signed rank (WSR) pairwise tests are executed and the configurations with inferior performance are gradually eliminated (for more details, refer to [101]). The advantage of using F-Race in the factorial experiment is that we do not need to run all instances for all configurations, only for the ''winners'' at each step.
⁷ We wished to allow TS at least the same RL runtime.
7510 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
To run F-Race in our experiments, we randomly shuffled the complete datasets. In most cases, one best configuration (with the highest objective function values) was found after running some instances; in some cases, there were ties. We conducted the Friedman and WSR tests with a significance level of 0.05. In all experiments, after finding the best configuration, we continued running it on the remaining instances; thus, the best configurations were run on the complete datasets. Table 6 details the configurations evaluated.⁸

We used two value attributes (V = 2) and defined their relative weights as 0.6 and 0.4. We established an additive project value function $F_v$ for each attribute, forming the linear objective function $0.5\,rNPV + 0.5\left(0.6\sum_{j=1}^{J} V'_{j1} + 0.4\sum_{j=1}^{J} V'_{j2}\right)$ that could be tackled by Gurobi. The value parameters $V_{jmv}$ were drawn from uniform distributions from the interval (0, 10) for the experiments with positive cash flows and from the interval (0, 100) for positive and negative cash flows.
The algorithms were coded in Python. We ran all experiments on a computer with an Intel(R) Core(TM) i7-7700 CPU 3.60 GHz, 8 GB RAM. To analyze the data, we conducted pairwise comparisons of the objective function value generated by each method and used JMP to calculate the p-values (p) for the WSR tests with a significance level of 0.05. Pareto analysis was also used to gain more insight into the results.

VI. RESULTS
This section begins by examining the leading RL algorithm configurations found by the full factorial experiment. We then present the experiment results for instances with positive cash flows, and those with cash inflows and outflows. The files with the datasets used and the results obtained can be accessed in [102].

The best value found for the probability of random action ϵ and for step-size parameter α was 0.1. As regards η, the number of possible start times to select from, discounting two ties, the top values were 2, followed by 10. In our shared results [102], we show the F-Race process gradually narrowing down to the best configurations.
⁸ These parameter values are found in examples in [77] and other RL resources.

B. POSITIVE CASH FLOWS
Strong evidence of the suitability of the RL methods was found. Table 8 presents the results of the pairwise comparison. The average percent difference (%dif) and WSR p-value (p) for each pair of methods is shown.
TS and TS D generated objectives closest to the solver values in the smaller 10- and 20-activity projects, outperforming RL ES 1 and RL ES 2. For the larger 50- and 100-activity projects, however, RL ES 1 outperformed the TS algorithms, and RL ES 2 only lost to TS D in 50-activity projects. The average difference between the solver solutions and other methods increased for 20 activities in relation to ten activities.
Note that throughout this subsection and the next one, we considered only solver solutions with a maximum gap between the lower and upper objective bounds of 0.1 rounded up to the nearest tenth. The solutions with larger gaps were inferior; including them would therefore distort the results. Please refer to Appendix B for the results with gaps larger than 10%.
Table 9 reports the number of times each method generated the highest objective value. The results are in complete agreement with the pairwise comparison shown above. Where the solver found a solution within the time limit, the MILP solution generated more best solutions. Otherwise, TS D gave better results for ten and 20 activities, RL ES 1 outperformed the other methods for 50 and 100 activities, and RL ES 2 outperformed TS D for 100 activities.
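A count of this kind can be computed as in the following sketch. The function name `best_share`, the data layout, the method labels, and the objective values are hypothetical, and crediting every tied method is an assumption of this illustration rather than a rule stated in the paper.

```python
from collections import Counter

def best_share(results):
    """Percentage of instances in which each method attains the highest objective.

    results maps instance id -> {method name: objective value}.
    Tied methods are all credited, which is an assumption of this sketch.
    """
    wins = Counter()
    for scores in results.values():
        top = max(scores.values())
        wins.update(m for m, v in scores.items() if v == top)
    return {m: 100.0 * c / len(results) for m, c in wins.items()}

# Hypothetical objective values for three instances.
results = {
    "inst1": {"TS": 10.0, "RL1_ES": 12.0},
    "inst2": {"TS": 9.0, "RL1_ES": 9.0},
    "inst3": {"TS": 7.0, "RL1_ES": 6.5},
}
shares = best_share(results)
```

Under this tie rule the percentages of the methods can add up to more than 100, since an instance with a tie contributes to each of the tied methods.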

C. POSITIVE AND NEGATIVE CASH FLOWS
The tests showed that RL 1 generated the objective values closest to the MILP solver solutions and outperformed the other methods. The MILPs for the 20-activity projects, which had more difficulty in generating feasible solutions in the 30-minute runtime, only produced statistically significant comparisons with the RL 2 variants. Table 10 highlights the results of the pairwise comparisons. We omitted the differences RL 1 − RL ES 2 and RL 2 − RL ES 1 because we are interested in comparing start time selection with early start schedule generation for the same RL methods (Hypothesis H 2). Accordingly, in all cases, the non-early start strategies generated better results than the early start ones.
The count of the number of times each method generated the highest objective value (Table 11) reflects the results obtained above. RL 1 found more best solutions than RL 2 and RL ES 1, and RL 2 outperformed RL ES 2. Our results lead us to accept Hypotheses H 1 and H 2, validating both the quality of our RL results, particularly RL 1, and the leverage of RL start time selections to increase the objective values.

VII. DISCUSSION
Although all values for the probability of random action ϵ and step-size parameter α inputted into the F-Race algorithm are found in the literature, the best value found for both parameters, 0.1, is consistent with [77], where this configuration is the most common one employed.⁹ If we turn to η, the number of possible start times from which to select, in most RL variants and project sizes, two start-time actions, an early start and a late one, were sufficient to generate the highest objective values. Presumably, there is a balance between, on one hand, the improved learning generated by the higher frequency with which each start-time action is taken when there are fewer start-time options, and on the other hand, the potential gains accrued by a wider range of start-time alternatives. In most cases, the improved learning outweighed the finer start-time tuning. In some cases, η = 10 was found to be superior; understandably, with one exception, this was observed in the larger 50- and 100-activity projects, where the longer project durations could warrant the need for more intermediary start-time actions between the early and late starts.
As hypothesized, our experiments validate the usefulness of RL as a method for analyzing the tradeoff between the project value and its net present value compared to established benchmarks (Hypothesis H 1). Our RL agent captures a signal at each iteration (i.e., the reward) indicating how good the solution is, and immediately acts upon this signal. Therefore, from the beginning, an informed search is launched based on online information. TS, in contrast, is a neighborhood search with a memory mechanism to avoid being trapped in local optima. It does not use information obtained during the search to direct its next steps. Apparently, this works well for smaller projects, where the search space can be thoroughly covered by TS's local search mechanism. For example, for 10-activity projects, the average difference between TS D and TS was only 0.06% (Table 8) and in 238 instances (almost half the dataset) the difference was null, meaning that the search space was already covered before doubling the runtime. This lack of learning, however, hampers TS's ability to embark on more promising sections of the search space earlier on, and this factor could well explain why RL ES 1 outperformed TS for 50- and 100-activity projects, even when TS was given double the time.
As far as TS is concerned, the results point to a likely deterioration in the quality of its solutions vis-à-vis RL ES 1 as the projects increase in size. This can be seen in the 50- and 100-activity projects if we observe the significant differences between the solutions (Table 8; for 100 activities RL ES 2 also outperformed TS D) and the smaller percentage of instances where the TS algorithms rendered the highest objective values (Table 9). It would seem that the explanation reported above applies here also: larger search spaces slow down the process of exploring superior search space sections because of TS's lack of exploitation or learning.
A significant difference was identified between RL 1, RL 2 and their early start counterparts, confirming Hypothesis H 2. These results are consistent with [13] and further support the idea of NPV improvement by moving activity start times. We accomplished this in the RL framework by integrating the activity moves into the RL actions. The first step of the start time selection was implemented for the experiments with positive cash flows where the RL actions generated an activity list that was then decoded into a unique early start schedule (this process is known as a serial scheduling scheme; for a detailed review on this topic, see [103]). This worked well with the positive cash flows because, as pointed out in Section V, for a regular scheduling objective it is never beneficial to delay an activity if it could be scheduled earlier. The second step of the start time selection was implemented for the experiments with positive and negative cash flows, where the activity list no longer generated a unique early start schedule, but rather a schedule with start times determined by the start time selection actions.
As expected, the solver generated the best results. Nevertheless, as was noted in the Introduction, the problem is NP-hard and thus the long runtimes prevent the use of this method for larger problems. Even for 10-activity projects, the solver could not find an incumbent solution after the 30-minute limit in 39% of the projects for the experiments with positive cash flows,¹⁰ and in 43% of the projects for the experiments with positive and negative cash flows. For 20-activity projects, these figures grow to 85% and 95%, respectively, even with the reduced number of scenarios considered in the MILPs. Comparing the solver and RL 1 runtime distributions for both experiments (Figs. 6 and 7), MILP solver-based solutions tend to be a less attractive option.
We were surprised to find that RL 1 and RL ES 1 outperformed RL 2 and RL ES 2 in all our experiments.¹¹ Since the constant-step action-value update gives larger weight to the last actions and exponentially less weight to previous ones, we expected the RL 2 and RL ES 2 results to be more promising: the last decisions tend to be better because of the learning, and ascribing them more weight could point more quickly to better policies. It would appear that RL 2 and RL ES 2 could find acceptable results with fewer iterations than RL 1 and RL ES 1; however, after more iterations, while RL 1 and RL ES 1 stabilize the action values by averaging the rewards, RL 2 and RL ES 2 over-emphasize the last decisions, good or bad. This short memory causes the forgetting of near-optimal policies that could maximize the objective value. Further research is required to consider potential upgrades to the constant-step methods.
Finally, our findings suggest that analyzing the tradeoff between the project value and its NPV using the RL method can be a valuable tool for project managers. The near-optimal solutions obtained can be used to plot the efficient frontier between project value and rNPV and decision makers can conduct a tradeoff analysis to select the project plan that satisfies stakeholders' requirements sufficiently.
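The short memory of the constant-step update can be made explicit by unrolling both rules. The derivation below is the standard textbook result for sample-average and constant-step updates, written with symbols assumed here: $q_0$ is the initial action value, $R_i$ is the $i$-th reward received by the action, and $n$ is the number of times the action was taken.

```latex
% RL 1: step 1/n, every reward carries the same weight
q_n = q_{n-1} + \frac{1}{n}\left(R_n - q_{n-1}\right)
    = \frac{1}{n}\sum_{i=1}^{n} R_i

% RL 2: constant step \alpha, weights decay geometrically with age
q_n = q_{n-1} + \alpha\left(R_n - q_{n-1}\right)
    = (1-\alpha)^{n} q_0 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
```

With α = 0.1, the latest reward carries a weight of 0.1, while a reward received ten updates earlier carries only 0.1 × 0.9^10 ≈ 0.035, so a few poor recent rewards can outweigh a long record of good ones.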

VIII. CONCLUSION
This paper has investigated the tradeoff between project value and its NPV in a stochastic multimode setting. We have presented an MIP formulation for the problem using a flow-based model with a project-specific value function and a robust NPV decision variable, and modeled its robustness by means of chance constraints, tackled using a scenario approach. We have employed linearization techniques that allowed us, in the case of linear benefit functions, to produce MILP models that could be solved for small projects by a commercial solver.
We have proposed an RL-based solution method for the MIP formulation and illustrated its application with an example. We have designed and conducted a factorial experiment and obtained satisfactory results showing that the RL method is suitable for solving our formulation (Hypothesis H 1) and that the activities' start time selection can be leveraged as RL actions for solving the problem with positive and negative cashflows (Hypothesis H 2). Furthermore, this work has revealed that our RL method is able to tackle large multimode projects with 50 and 100 activities, where the search space is very large.
The usefulness of our contribution lies in finding the efficient frontier between the robust NPV and the project value, enabling the decision makers to make focused tradeoffs between different alternatives of project plans.Since these two factors represent the project scope and the product scope, decision makers are presented with a more thorough evaluation of each project alternative.
While results demonstrate the promise of the proposed approach, scalability to highly complex projects remains untested.Performance on large-scale programs with hundreds of project activities or multi-year durations may expose limitations in the optimization efficiency.Additionally, expanding benchmarking to include comparisons against a wider range of emerging metaheuristic and hyperheuristic algorithms for project scheduling problems could further validate effectiveness.Generalizability also requires examination through real-world project data case studies and evaluation in multiproject environments.
Several meaningful extensions present avenues for advancing the model.Applying state-of-the-art function approximation techniques may enhance scalability for mega-projects.Variations that combine project-and portfolio-level goals or integrate additional objectives such as flexibility metrics could significantly increase applicability to practice.

APPENDIX A THE TABU SEARCH (TS) ALGORITHM USED IN THIS PAPER
As a benchmark we customized the general TS algorithm found in [91] and adopted the solution representation and neighborhood moves published in [92]. The following adaptations were made for the formulation presented in Section II:
• The objective function (1) from Section II was used instead of the original pure NPV objective. As pointed out in Section V, TS was designed for a regular objective. Since our value function is time-independent, it does not affect the regular objective attribute and thus TS can be applied.
• The TS method was published for deterministic problems.To calculate the objective function in our stochastic problem, we proceeded as in the RL algorithms: we simulated 1000 project runs as shown in Algorithm 5 and explained in Section III, with the early start simplification explained in Section V.
• There were no penalty functions since all our solutions are feasible.
A discussion of TS falls outside the scope of this paper. More details on this topic can be found in [91] and in the references cited in footnote 4.

TABLE 10. Pairwise comparison between the objective values for projects with cash flows that are both positive and negative. Data is for the pairwise difference between the row value and the column value.

TABLE 11. Pareto count of highest objective values (in percentage).

FIGURE 6. Solver and RL ES 1 runtime distributions for experiment with positive cash flows and ten activities. The high frequency on 1800s corresponds to the runtime limit.

FIGURE 7. Solver and RL 1 runtime distributions for experiment with positive and negative cash flows and ten activities. The high frequency on 1800s corresponds to the runtime limit.

TABLE 12. Pairwise comparison between the objective values obtained with RL and TS and those obtained with the solver for positive cashflows for 10-activity projects. WSR tests for the 20-activity projects did not show statistical significance.

TABLE 13. Pairwise comparison between the objective values obtained with RL and those obtained with the solver for positive and negative cash flows.

TABLE 1. A summary of the main features and characteristics of this and existing studies.

TABLE 2. Sets, parameters, and decision variables for the mathematical model.

TABLE 3. Additional notation for the RL method.

TABLE 4. Summary of data for radar development activity modes.

FIGURE 2. Efficient frontier for radar project.

TABLE 6. Configurations evaluated in the factorial experiment.

Table 7 lists the top configurations for all RL algorithms and experiments.

TABLE 8. Pairwise comparison between the objective values for projects with positive cash flows only. Data is for the pairwise difference between the row value and the column value.

TABLE 9. Pareto count of highest objective values (in percentage).