Indirect Dynamic Negotiation in the Nash Demand Game

The paper addresses the problem of sequential bilateral bargaining with incomplete information. We propose a decision model that helps agents bargain successfully by negotiating indirectly and learning the opponent's model. Methodologically, the paper casts the heuristically motivated bargaining of a self-interested independent player into a framework of Bayesian learning and Markov decision processes. The special form of the reward implicitly motivates the players to negotiate indirectly, via closed-loop interaction. We illustrate the approach by applying our model to the Nash demand game, which is an abstract model of bargaining. The results indicate that the established negotiation: i) leads to coordination of the players' actions; ii) maximises the success rate of the game; and iii) brings more individual profit to the players.


I. INTRODUCTION
POLITICS and business are considered traditional spheres of human negotiation. The internet and modern means of communication have extended human negotiation to new domains such as social networks, deliberative democracy, e-commerce and cloud-based applications [1], [2]. Moreover, automatic bargaining and negotiation, being inevitable in modern cyber-physical-social systems [3], have been established in a variety of applications, such as network negotiation, energy trading [4], traffic management [5], multi-robot systems [6], manufacturing service allocation [7] and, recently, ransomware negotiation [8]. While solving a negotiation task, agents must take into account incomplete information and interact strategically with other, human or artificial, agents. The majority of the existing research, however, assumes negotiation with non-human agents.
Here we consider the simplest bilateral bargaining scenario with incomplete information, often found in e-commerce [9]. A typical example is two self-interested agents (say, a buyer and a seller) bargaining over some goods or service. As soon as their price preferences differ, the agents begin negotiations to reach a mutually acceptable price. Each agent strives to satisfy its own preferences as much as possible, but also has to take into account the opponent's preferences. Otherwise it is unlikely that an agreement can be reached. Additional aspects of real-life bilateral bargaining to be considered are: i) multi-attribute negotiation, when agents need to agree on goods or services characterised by several, possibly interrelated, attributes (say, the price of a product and the terms of its delivery); ii) limited negotiation time, as no agent can deliberate infinitely; iii) the absence of a moderator to coordinate the negotiation, so the agents must reach agreement themselves [11].
Negotiation has been widely addressed in diverse fields ranging from economics and sociology to computer science. The number of works is far too large to survey here. One can distinguish several main frameworks: the game-theoretic approach, the negotiation-protocol approach and the evolutionary approach. Existing works, however, have different limitations that prevent their wide use. The game-theoretic approach [12], [11] assumes that agents are perfectly rational and have common knowledge. The negotiation-protocol approach [13] needs clear rules for negotiation [14], and the results largely depend on the information available to the agents about each other. The evolutionary approach [15], inspired by biological evolution, finds an optimal negotiation via trial and error, and agents should have access to the policies of their opponents and their profits. Some approaches are based on an agent-coordinator responsible for assigning goods or services to agents. This coordinating (or planning) agent uses a negotiation mechanism to find the best share.
We consider finite-horizon bilateral sequential bargaining of two independent self-interested Bayesian decision making (DM) agents facing incomplete information. The key aspects of the targeted solution are as follows.
• Negotiation. The purpose of negotiation is to enable agents to coordinate their actions/decisions. Thus negotiation is a means to achieve coordinated behaviour of the agents. We consider the ability to negotiate an intrinsic part of an agent and treat it accordingly. The proposed solution allows indirect negotiation via information feedback and further leads to coordinated behaviour without conventional (explicit) negotiation.
• Domain-independence. Existing solutions are either domain-specific [16] or domain-independent [17]. The former may be more effective, but tailoring them to a new domain is often impractical. The ever-growing number of new applications makes domain-tailored solutions less favourable. The considered Bayesian DM agent is inherently domain-independent.
• Modelling and learning the opponent. Incomplete knowledge is given by uncertainty regarding the opponent's preferences and behaviour. This uncertainty may prevent agents from reaching a mutually beneficial agreement as well as their own DM goals. The proposed solution uses the Bayesian approach to dynamically learn the opponent's model based on observed actions (bids).
• Bounded rationality. The assumption of perfect rationality used by the game-theoretic approach is not valid in real-life tasks. Moreover, human agents often behave seemingly irrationally due to cognitive or social factors [18]. Their DM is also influenced by emotional state [19] and personal traits [20]: self-interest, altruism, ability to cooperate. The proposed solution is general enough and has already been shown to accommodate human-like factors [21]. Thus the approach can serve both an artificial agent and a human.
Another important aspect of the negotiation problem concerns limited deliberation. Obviously, no agent can bargain indefinitely, so the DM policy being designed must take that into account. It is hardly possible to set flexible limits on the length of negotiations, but we believe that the established internal feedback, complemented by a stopping rule, can adaptively influence the length of negotiations. A natural decrease of the utility of goods or services over time can also be counteracted by introducing a kind of forgetting [22] in the utility function.
Main contributions. The paper contributes to research on bilateral bargaining in distributed settings. We propose a self-interested probabilistic DM agent that maximises expected utility and is able to negotiate purposefully. The developed agent is domain-independent, can serve either human or artificial agents, and is equipped with the following abilities (which indicate the major contributions):
• Learning ability. To counteract incomplete knowledge and adapt to possible changes of its opponent, the agent is equipped with a learning ability. The algorithm is based on the Bayesian approach and learns the opponent's model from the bargaining history, i.e. from the bids the opponent proposes during a negotiation. This makes it possible to respect the opponent's dynamics as well as any other related uncertainty, cf. [23].
• Indirect negotiation. A key component of the proposed bargaining agent is a reward function that consists of two components. The first one respects the purely economic individual profit of the agent. The second component expresses the degree to which the bargaining agents exploit the game potential. It is important to note that the second component i) provides the agent with information feedback; ii) prompts the agents to negotiate indirectly; and iii) sets limits on the negotiation range. The trade-off between the individual profitability and the game potential is expressed by an agent-specific weight, cf. [24]. Naturally, an opponent equipped with a learning ability can model the weight and use this knowledge in the next rounds. The weight expresses the agent's preferences and partially reflects personal behavioural aspects of human bargaining. The latter opens an avenue for the design of automated agents reflecting human traits [25].
• Privacy preservation. The implicit nature of the resulting distributed interaction does not involve the exchange of any private data or models between players. Therefore the proposed approach fully preserves the players' privacy.
The proposed solution also allows prior knowledge of the opponent to be incorporated, though it does not require it. The methodology [26] makes it possible to use the available external or domain-specific knowledge to enrich the opponent's model. The paper also compares three types of prior knowledge reflecting typical cases and illustrates their use. The paper continues our previous work [27], which assumes complete knowledge of the opponent model; this is rarely achievable in real-world applications. Thus the present paper focuses on learning the opponent model as well as on intrinsic motivation to cooperate. The last contribution of the work is a comparison of the performance of the proposed bargaining agents with agents employing heuristic models built on an extensive experimental meta-study [28].
Related research. The literature on negotiation is very large, and space limitations prevent presenting it in its entirety here. Generally, there are several models focusing on explicit negotiation based either on game theory or on negotiation theory. The proposed approach considers independent dynamical self-interested DM agents with a learning ability and a special reward prompting indirect negotiation. The mentioned features are very practical and up to now missing within the otherwise well-elaborated and important area of the paper. To the best of the authors' knowledge, there is no similar approach. We use probabilistic models [29] of bargaining agents that interact in a closed loop and adopt Markov decision processes as a modelling methodology, cf. [30]. The area of agent negotiation and opponent modelling has many achievements, see for instance [31], [32], [33]. Comprehensive surveys can be found in [34] and [35]. The recent paper [36] discusses the main challenges and promises in the area. Most research on negotiation concerns static environments and focuses on i) developing utility-based negotiation strategies for rational DM agents, see for instance [37], [38], and ii) creating an agent-moderator that helps a DM agent in the negotiation task [17], [39], [40]. So far much less research describes negotiation in a dynamic environment, see [41], [42]. The recent approach [43] uses logistic regression for modelling the opponent, which requires collecting a significant amount of data for learning and initialisation. Paper [44] uses a similar utility based on the bargaining principles, though it constructs a subgame that relies on the perfect equilibrium. The most closely related work dealing with opponent modelling is probably [45]. It also employs Bayesian learning but relies on a specific structure of the preferences and policy of the opponent. Though work [46] also focuses on the design of negotiation agents in dynamic and uncertain environments, it relies on a negotiation agent and proposes a set of heuristics to make negotiation decisions. Our model introduces an intrinsic mechanism that motivates the agent to negotiate while learning the opponent's model via the Bayesian approach. The resulting bargaining policy is optimal with respect to the resources available and the individual preferences of the agents. It can also take into account human factors, which are important whenever human agents are involved.
We illustrate the approach using the Nash Demand Game (NDG) [12], a bilateral bargaining game for two players who should decide how to split a given amount of money. The players simultaneously demand the portion of the amount they would like to get. The demand of one player is unknown to the other (the opponent). If the players' demands can be satisfied simultaneously, both players get the respective profit. Otherwise, they both get nothing. Despite its seeming simplicity, the NDG is a good model of dynamical resource allocation that achieves coordination without explicit negotiation. It also poses a big challenge for understanding human negotiation.
The remainder of the paper is organised as follows. Section II introduces notation and the mathematical background. Section III formulates the Nash Demand Game as an MDP of a single player and introduces a heuristic model of the opponent and the prior models used in learning. Section IV describes and discusses simulated experiments. Sections V and VI summarise the results obtained and outline future research directions.

II. PRELIMINARIES
This section introduces and recalls necessary notions.

$\mathbb{N}$, $\mathbb{R}$: set of natural numbers, set of real numbers
$x_t \in X$: value $x$ from finite set $X$ at discrete time $t$
$p(x)$: probability mass function of discrete random variable $x$
$p(x|y)$: probability mass function of $x$ conditioned on $y$
$E[x|y]$: the expectation of $x$ conditioned on $y$

Note that no notational distinction is made between a random variable and its realisation.

B. MARKOV DECISION PROCESS
We model the player's decision making in the NDG via the Markov Decision Process (MDP) framework [47]. MDPs were first introduced and developed in operations research and economics [48]. Since then the MDP framework has been widely used to describe and solve decision-theoretic problems. An MDP captures the underlying stochasticity omnipresent in application domains and can respect multiple DM criteria. Typical examples of using the MDP framework include medical applications [49], predictive maintenance [50] and power systems [51]; for more examples see [52].
The overall scenario is as follows. A player interacts with the environment by taking actions to achieve its DM goal. The player is motivated by a reward it receives after each action taken. A finite state and action MDP is considered.
Definition 1 (MDP): The fully observable MDP is characterised by $\{\mathbf{T}, \mathbf{S}, \mathbf{A}, p, R\}$, where $\mathbf{T} = \{1, 2, \ldots, N\}$, $N \in \mathbb{N}$, is a set of decision epochs; $\mathbf{S}$ is a finite set of all possible environment states and $\mathbf{A}$ denotes a finite set of all actions available to the player. Function $p : \mathbf{S} \times \mathbf{S} \times \mathbf{A} \to [0, 1]$ is the transition model $p(s_{t+1}|s_t, a_t)$ that moves the environment from state $s_t \in \mathbf{S}$ to state $s_{t+1} \in \mathbf{S}$ after the player took action $a_t \in \mathbf{A}$; $R : \mathbf{S} \times \mathbf{S} \times \mathbf{A} \to \mathbb{R}$ is a real-valued function representing the player's reward $R(s_{t+1}, s_t, a_t)$ after taking action $a_t \in \mathbf{A}$ in state $s_t \in \mathbf{S}$.
The transition model captures the environment dynamics and is represented by a family of probability distributions $p(s_{t+1}|s_t, a_t)$, each giving the probability that the environment moves from $s_t$ to $s_{t+1}$ at time $t + 1$ when action $a_t$ is executed. The state transitions obey the Markov property: for fixed $s_t$ and $a_t$, the distribution over states at time $t + 1$ is independent of any previous state $s_{t-j}$ and action $a_{t-j}$, $j \ge 1$. The player's preferences are described by a reward function, $R$. The aim of the player is to choose a sequence of actions that maximises the total expected sum of rewards, as described in the following section.

C. OPTIMAL DECISION POLICY
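The player chooses action $a_t \in \mathbf{A}$ based on the randomised DM rule $p(a_t|s_t)$ in each decision epoch $t \in \mathbf{T}$.
A sequence of DM rules forms the DM policy $\pi_{t,h}$ at time $t$ over decision horizon $h \in \mathbb{N}$:
$$\pi_{t,h} = \{p(a_\tau|s_\tau)\}_{\tau=t}^{t+h-1}, \quad s_\tau \in \mathbf{S},\ a_\tau \in \mathbf{A}. \tag{1}$$
An MDP with finite horizon $h$ evaluates the quality of a DM policy by the expected total reward, defined as follows:
$$E_{\pi_{t,h}}\left[\sum_{\tau=t}^{t+h-1} R(s_{\tau+1}, s_\tau, a_\tau) \,\middle|\, s_t\right]. \tag{2}$$
The solution to the MDP [47] is a sequence of DM rules, $\{p^{opt}(a_\tau|s_\tau)\}_{\tau=t}^{t+h-1}$, that maximises the expected reward (2) and forms the optimal decision policy:
$$\pi^{opt}_{t,h} = \arg\max_{\pi_{t,h} \in \boldsymbol{\pi}} E_{\pi_{t,h}}\left[\sum_{\tau=t}^{t+h-1} R(s_{\tau+1}, s_\tau, a_\tau) \,\middle|\, s_t\right], \tag{3}$$
where $\boldsymbol{\pi}$ is the set of possible DM policies, see (1). The optimal policy (3) is computed by the dynamic programming algorithm [48], [53], which requires knowledge of the transition model $p(s_{\tau+1}|s_\tau, a_\tau)$.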
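The dynamic programming step can be illustrated with a minimal Python sketch of finite-horizon backward induction for a tabular MDP. It is a sketch under stated assumptions, not the authors' implementation: the array layout `P[s, a, s_next]` and `R[s, a, s_next]`, the function name `backward_induction`, and the use of deterministic greedy rules (a special case of the randomised rules above) are all choices made here for illustration.

```python
import numpy as np

def backward_induction(P, R, horizon):
    """Finite-horizon dynamic programming for a tabular MDP.

    P[s, a, s_next] : transition probability p(s_next | s, a)
    R[s, a, s_next] : reward R(s_next, s, a)
    Returns (policy, value) where policy[k, s] is the greedy action at the
    k-th decision epoch of the horizon and value[s] is the optimal expected
    total reward over the whole horizon when starting in state s.
    """
    n_states = P.shape[0]
    value = np.zeros(n_states)  # nothing can be earned after the horizon ends
    policy = np.zeros((horizon, n_states), dtype=int)
    for k in reversed(range(horizon)):
        # q[s, a] = sum over s_next of P * (immediate reward + future value)
        q = np.einsum("ijk,ijk->ij", P, R + value[None, None, :])
        policy[k] = q.argmax(axis=1)
        value = q.max(axis=1)
    return policy, value
```

With an unknown opponent, `P` would be replaced by the current estimate of the transition model, which is the topic of the next subsection.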

D. LEARNING TRANSITION MODEL
In bilateral bargaining, the transition model is a model of the opponent, that is, it predicts the opponent's reaction to the player's action. Generally it describes the dynamics of the opponent's decision making. In real-life tasks, the opponent model $p(s_{t+1}|s_t, a_t)$ is usually unknown to the player (it can be partially known or incorrectly specified). It reflects the player's knowledge about the behaviour of the opponent. Without loss of generality the model can be assumed time-invariant, i.e. $p(s_{t+1}|s_t, a_t) = p(s_t|s_{t-1}, a_{t-1})$, and can be learned from the observed data.
To simplify the presentation, let us drop the time index and introduce the following temporary notation, valid within Section II-D only: $s' = s_{t+1}$, $s = s_t$ and $a = a_t$. The transition model can then be written $p(s'|s, a)$.
We consider a parametrised form of the opponent's model with time-invariant parameter $\theta \in \Theta$,
$$p(s'|s, a, \theta) = \theta_{s'sa}, \tag{4}$$
where $\Theta$ is the set of all possible $\theta$'s and $0 \le \theta_{s'sa} \le 1$, $\sum_{s' \in \mathbf{S}} \theta_{s'sa} = 1$ for all $s \in \mathbf{S}$ and $a \in \mathbf{A}$. Thus, parameter $\theta$ in (4) is an array of transition probabilities $\theta_{s'sa}$ that the opponent's next state will equal $s'$ whenever the previous state is $s$ and the player takes action $a$. Our aim is to learn parameter $\theta$ in (4).
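Let the player have belief $b(\theta)$ about the opponent's dynamics, expressed via the probability density function of the parameter $\theta$. While interacting with the opponent, the player updates the belief about the parameter, $b(\theta)$, to a new value, $b'(\theta)$, given the observed transition $(s', s, a)$ as follows, see [54]:
$$b'(\theta) = \frac{p(s'|s, a, \theta)\, b(\theta)}{\int_{\Theta} p(s'|s, a, \theta)\, b(\theta)\, d\theta}. \tag{5}$$
Choosing belief $b(\theta)$ in the conjugate form of the Dirichlet distribution implies that the posterior (5) induced by Bayes' rule [54] is again of the Dirichlet form,
$$b'(\theta) \propto \prod_{\tilde{s}', \tilde{s}, \tilde{a}} \theta_{\tilde{s}'\tilde{s}\tilde{a}}^{\,\nu'_{\tilde{s}'\tilde{s}\tilde{a}} - 1}, \qquad \nu'_{s'sa} = \nu_{s'sa} + 1, \tag{6}$$
with all other entries of $\boldsymbol{\nu}'$ equal to those of $\boldsymbol{\nu}$. In (6) the concentration parameter $\boldsymbol{\nu} > 0$ is an array containing occurrences $\nu_{s'sa} > 0$ of triples $(s', s, a)$. Each observation of a triple $(s', s, a)$ increases the corresponding entry, $\nu_{s'sa}$, by one. Therefore, after $n \in \mathbb{N}$ observations, the updated $\nu'_{s'sa}$ contains the prior value plus the actual number of occurrences of $(s', s, a)$. Recalling (4), the expectation of (6) can be interpreted as the Bayesian estimate of the unknown parameter $\theta$ based on the observed data (i.e. the transitions that occurred):
$$\hat{\theta}_{s'sa} = E[\theta_{s'sa}] = \frac{\nu_{s'sa}}{\sum_{\tilde{s}' \in \mathbf{S}} \nu_{\tilde{s}'sa}}. \tag{7}$$
A recursive implementation of the prior statistics update is described in [55].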
Real-life dynamic decision making requires efficient and feasible learning that can be performed online. Markov models belong to the exponential family, for which exact estimation is feasible. Estimation and prediction within this family are very simple, especially with a conjugate prior in the form of the Dirichlet distribution. The needed update of functions (probability density functions, see (5)) reduces to an algebraic recursive update of finite-dimensional sufficient statistics. This explains why this kind of learning can be combined with decision making.
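The recursive update of the sufficient statistics can be sketched in a few lines of Python. The class name `DirichletTransitionModel`, the index order `nu[s, a, s_next]` and the default prior count of one per entry (a uniform prior) are assumptions made for this illustration only.

```python
import numpy as np

class DirichletTransitionModel:
    """Count-based Bayesian estimate of a discrete transition model."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # nu[s, a, s_next] > 0 holds prior pseudo-counts plus observed counts
        # (the text indexes the same quantity as nu_{s' s a}).
        self.nu = np.full((n_states, n_actions, n_states), float(prior_count))

    def update(self, s, a, s_next):
        """Each observed triple (s_next, s, a) raises its entry by one."""
        self.nu[s, a, s_next] += 1.0

    def estimate(self):
        """Posterior mean: nu normalised over the next state."""
        return self.nu / self.nu.sum(axis=2, keepdims=True)
```

For example, starting from a uniform prior over three states and observing the transition from state 0 under action 1 to state 2 twice gives the estimate (0.2, 0.2, 0.6) for that state-action pair.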

III. METHODOLOGY

A. MDP FORMALISATION OF NASH DEMAND GAME
The considered repetitive scenario of the game is as follows. Two structurally identical players, A and B, bargain over splitting an amount of money $q \in \mathbb{N}$. The roles of both players are the same. Each round consists of two stages: an action stage and a reward stage. During the action stage, each player decides how much to claim from the total available amount. The players do not communicate and their interests can be competitive. At the reward stage, the players announce their demanded shares, observe the demands of their opponents, and the reward is allocated. Note that in the action stage each player has no information about the opponent's demand or preferences. The game runs for a fixed and known number of periods.
Let $q \in \mathbb{N}$ be the total amount to split. At the beginning of round $t \in \mathbf{T}$, each player $k \in \{A, B\}$ chooses action $a^k_t \in \mathbf{A}^k$, the share of $q$ demanded in that round. The minimum demand equals 1 and the maximum is $q - 1$. If the sum of demands is less than or equal to $q$, both players get what they asked for; otherwise both players get zero reward. A player's profit in round $t \in \mathbf{T}$ equals the amount of money the player receives (upper indices indicate the player to whom an action or profit belongs):
$$z^k_t = \begin{cases} a^k_t & \text{if } a^A_t + a^B_t \le q, \\ 0 & \text{otherwise}, \end{cases} \qquad k \in \{A, B\}, \tag{8}$$
where $z^A_t, z^B_t \in \mathbf{Z}$ are the profits of A and B respectively, $\mathbf{Z} = \{0, 1, 2, \ldots, q - 1\}$ is the set of possible profits in one game round, and [...] (9)
The addressed distributed bargaining does not consider communication between the players or any agent-moderator. To find a fully distributed solution, the game is described from the point of view of a stand-alone player. [...] is preset to the same demand $a^A_0 = a^B_0 = a_0$.
Reward as motivation for negotiation. Let the reward of player A be defined as follows: [...] (10)
The first term in (10) is the pure economic profit of player A, cf. (8). The second term expresses the efficiency of using the game potential at round $t$, i.e. whenever $a^A_t + a^B_t < q$ some amount remains unclaimed and is thus lost for the players. The same happens when an agreement is not reached and the entire amount $q$ is lost.
Reward (10) ensures that, given a fixed $a^A_t$, player A receives the maximum possible reward iff its opponent, B, demands $q - a^A_t$. The proposed form of reward (10) "connects" A's action with that of B and thus encourages player A to negotiate indirectly with B during bargaining. The mechanism of dynamic indirect negotiation is as follows. Each player influences the amount left, while the opponent observes this influence and changes the next demand. Let us assume that there is a tendency for some unclaimed amount to remain. Then, if one player has consumed a small portion of it, the other player will observe that and may increase the demand in the next round. Another situation occurs when the joint claim of the players exceeds the available resources. Then either player may step back and reduce the demand in the next round. This behaviour can again lead to a large unclaimed amount and affects the future demands of the players. In particular, the desire to minimise the unclaimed amount, $|q - (a^A_t + a^B_t)|$ in (10), forces player A to modify the current demand while taking into account the history of the opponent's claims. By doing so, in each round, each player dynamically adapts the demand to the foreseen demands of the opponent, that is, negotiates indirectly with the opponent.
Weight $\omega^A \in [0, 1]$ in (10) reflects A's preferences between pure economic gain and exploiting the game's potential. The value $\omega^A = 0$ implies that player A considers pure economic profit only, while for $\omega^A = 1$ player A cares only about the efficient use of the game potential. A's reward (10) thus equals [...] (11)
Definition 2 and the considerations above describe the DM of player A. It is easy to see that the same considerations can be applied to formalise the decision making of player B.
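The interplay of the two reward terms can be explored with a minimal Python sketch. The profit function follows the game rules stated above. The reward function is an assumption: it combines the two described terms as $(1 - \omega)\cdot\text{profit} - \omega\cdot|q - (a^A + a^B)|$, one form consistent with the verbal description (pure profit at $\omega = 0$, only the unclaimed or over-claimed amount at $\omega = 1$), and should not be read as the exact expression (10). The function names are hypothetical.

```python
def profit(a_own, a_opp, q):
    """Profit in one round: the demand if demands are compatible, else zero."""
    return a_own if a_own + a_opp <= q else 0

def reward(a_own, a_opp, q, omega):
    """ASSUMED reward form: weighted profit minus weighted misallocation.

    omega = 0 -> pure economic profit only
    omega = 1 -> only the distance |q - (a_own + a_opp)| matters
    """
    misallocation = abs(q - (a_own + a_opp))
    return (1.0 - omega) * profit(a_own, a_opp, q) - omega * misallocation
```

Under this assumed form with q = 10 and a fixed demand of 6, `max(range(1, 10), key=lambda b: reward(6, b, 10, 0.5))` returns 4, matching the statement that the reward is maximal when the opponent demands exactly the remainder.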
The conditional independence of the players' actions given by the game rules and the definition of the state, see Definition 2, imply [...] (12)
From player A's point of view, the first factor in (12) is a part of A's optimal policy, while the second factor models the DM of player B and can be recursively estimated using the Bayesian paradigm [54], as described in Section II-D.

B. HEURISTIC MODEL OF OPPONENT
The proposed approach formalises and solves bilateral dynamic bargaining of a learning self-interested player within the MDP framework (Section II-C). To verify the approach we propose a probabilistic bargaining model for a non-learning and non-optimising opponent. The model is based on the reported experimental evidence obtained with human players, see [56], [28]. For simplicity, here we consider player B serving as an opponent to A.
The heuristic behaviour of B reflects the dependence of its future demand on the results of the previous round. If the previous round demands were incompatible, that is $a^A_{t-1} + a^B_{t-1} > q$, player B tends to decrease the next demand. If some money was left unclaimed in the previous round, B, on the contrary, increases the next demand. The proportion (speed) of the increase or decrease may depend on personal traits (i.e. reflect the personality of B).
The remainder of this section introduces a model that reflects the behaviour of the opposing player, B.

1) B had Low Demand in the Previous Round
Consider the case when the previous demand of player B is low, i.e. not greater than the fair split would have been, $a^B_{t-1} \le q/2$. The next demand (in the sense of its mean value) then depends on the success of the previous round, i.e. whether the demands in the previous round were compatible or not. Below we distinguish these two cases and provide the respective probabilistic description of B's actions.
i) Incompatible Demands ($a^A_{t-1} + a^B_{t-1} > q$): B tends to keep its next demand, $a^B_t$, close to the previous one, as the previous demand of A was certainly much higher than $a^B_{t-1}$. Any further increase could make the players' demands incompatible again, which implies zero profit. Therefore the new demand of player B can be modelled as follows: [...] (13) while [...]
ii) Compatible Demands ($a^A_{t-1} + a^B_{t-1} \le q$): opponent B will proportionally increase the next demand, expecting A to do the same in order to fully distribute the entire available amount, $q$. In other words, the player who received less in the previous round would also ask for proportionally less of the unclaimed money and vice versa. A model of B describing the new demand is then [...] (14) with [...]

2) B had High Demand in the Previous Round
Now let us consider the situation when the previous demand of B was high, i.e. its value was greater than the fair split would have been, $a^B_{t-1} > q/2$. Then B decreases or increases the demand while keeping its own share proportional to the previous round in order to fully distribute the entire amount. A player who received less in the last round would ask for proportionally less of the unclaimed money and vice versa. The model of B's new demand then has the same form as (14).
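The qualitative behaviour described in this subsection can be sketched in Python as a probability mass function over B's next demand. Everything specific in the sketch is an assumption: the discretised Gaussian shape, the parameter `sigma`, the choice of mean (the previous demand in the low-demand incompatible case, the proportional share $q\, a^B_{t-1} / (a^A_{t-1} + a^B_{t-1})$ otherwise) and the function name are illustrative and are not the expressions (13) and (14).

```python
import numpy as np

def heuristic_demand_pmf(a_b_prev, a_a_prev, q, sigma=1.0):
    """ASSUMED heuristic model of B's next demand on {1, ..., q - 1}.

    a_b_prev : B's previous demand, a_a_prev : A's previous demand.
    Returns (demands, probabilities).
    """
    total = a_b_prev + a_a_prev
    low_demand = a_b_prev <= q / 2
    if low_demand and total > q:
        # low demand, incompatible round: stay close to the previous demand
        mean = float(a_b_prev)
    else:
        # otherwise rescale B's share so that the whole amount q is distributed
        mean = q * a_b_prev / total
    demands = np.arange(1, q)
    weights = np.exp(-0.5 * ((demands - mean) / sigma) ** 2)
    return demands, weights / weights.sum()
```

With q = 10, a previous pair of demands (B: 3, A: 3) moves the most probable next demand of B up to 5, while (B: 8, A: 8) moves it down to 5; a larger `sigma` gives a vaguer version of the same model.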

C. PRIOR MODELS USED IN LEARNING
Our approach considers the decision making of the player in question, A, who models the behaviour of the opponent, B, and optimises its own demand in order to maximise the accumulated profit. The ability to accurately predict the opponent's behaviour significantly affects the success of A's decision making, (12). To learn a model of the opponent, A follows the approach described in Section II-D. It exploits knowledge available in the form of a parameter prior that quantifies A's belief about the dynamics of the opponent, B. Following the Bayesian paradigm, this prior is gradually updated as new data accumulate, see Section II-D and [55]. The choice of the prior model is important, especially when the number of game rounds is limited. In the implementation we use three prior models reflecting different knowledge of A about B:
• A uniform prior distribution. This model is used when A has little or no knowledge about the dynamics of B.
• A prior model describing the "rational" heuristic, see Section III-B. It is used when B follows some heuristic and does not optimise. In that case the prior model has the same structure as (13) or (14), but with a different (larger) standard deviation $\sigma$.
• A pre-trained prior model. The third way of building an a priori model mimics the natural learning process of human players, where the player first gathers some knowledge about the opponent's playing style and then updates this knowledge during the game. In practice, we run the game for 30 preliminary rounds and player A builds the prior model of B from the data obtained during these rounds. This way of building the prior is used whenever both players optimise and learn.

IV. SIMULATED EXPERIMENTS
The proposed approach is illustrated with the Nash demand game, described in Section III-A, using simulated examples. We selected the most representative experiments from a much wider set of experiments differing in the number of rounds and horizons. The selected experiments are long enough to perform learning (very short runs are not sufficient to learn the models used), while longer runs add no significant information about the results.

Goal of the experiments
The goal was to analyse the impact of the proposed distributed solution and indirect negotiation, and to verify that a player employing the proposed DM policy is capable of achieving better results than a heuristic player playing the same role. The main objectives of the performed experiments are to:
• illustrate the distributed DM approach in repetitive bargaining;
• show that the proposed form of the reward function leads to indirect negotiation and to a coordinated course of actions of both players, that is, to a more efficient allocation of the available limited resources;
• demonstrate the influence of weight $\omega$ in (10);
• show that the DM policy with indirect negotiation brings higher profit to every player compared to the heuristic model.

Common settings of the experiments
Each game has 60 rounds and the optimisation horizon $h \in \mathbb{N}$ equals 10 game rounds. The amount of money that the players can split (if they reach an agreement) is $q = 10$ CZK per round. The reward (10) is evaluated for the optimal policy (3) resulting from dynamic programming [48]. The initial state of each player [...] The simulation is performed for 11 different values of weight $\omega$ in (10). The weight ($0 \le \omega \le 1$) expresses a trade-off between the individual profitability and the efficiency of using the game resources. It thus reflects the extent to which the player is negotiating. The zero value of $\omega$ in (10) models the situation when the player is interested only in economic profit. Other values of $\omega$ ($0 < \omega \le 1$) correspond to cases when the player maximises the personal profit while minimising the unclaimed amount of money.

Experiments performed
The players used in the simulation are artificial agents with either the heuristic DM model (see Section III-B) or the proposed DM policy that optimises reward (10), see Section II-C. In each game at least one of the players uses the observed behaviour to update the opponent's model, see Section II-D. In order to display the behaviour of our bargaining model, five typical cases were considered:
Test 1: Both players are non-learning. The player in question, A, is of the MDP type and uses the proposed DM policy optimising (10) [...]

Approach verification
The players played the game repeatedly with different settings. The results are summarised in graphs depicting the individual cumulative profits of the players, the total profit of the game, and the success rate of the game depending on the value of parameter $\omega$. The success rate is defined as the number of game rounds in which the players' demands were compatible and thus satisfied. In other words, the value of the success rate shows how successfully the players collaborated, i.e. respected the opponent's actions. High values indicate high collaboration. The results show minimum, mean and maximum values of the individual cumulative profits and the game success rate. Note that:
• The maximum success rate does not necessarily imply the maximum total profit of the game.
• Compatibility of the players' claims does not guarantee a zero unclaimed amount in the game.
• It is not guaranteed either that the maximum profit of each player is obtained for the same value of the weight $\omega$. Thus the total maximum (minimum) profit of the game need not equal the sum of the individual maximum (minimum) profits of the players.
Player B behaves according to the heuristic model (13), (14). Player A is of the MDP type and uses DM policy (3) that optimises reward (10). In the optimisation, A uses a model $p(s_{t+1}|s_t, a_t)$ having the structure of the heuristic model, see Section III-B, but with a different parameter $\sigma = 3$. This imitates a situation when A has partial or vague knowledge of the opponent.
Cumulative profits of players A and B are shown in Figure 1 and Figure 2. The total cumulative profit and the success rate of the game as functions of parameter $\omega^A$ are shown in Figure 3 and Figure 4.
The players are successful in more than 51% of the rounds on average. The results show the influence of parameter $\omega^A$ on profit: the higher the parameter, the higher the profits of the individual players and the higher the total profit of the game. This indicates a positive effect of the second term in (10), which prompts A to negotiate indirectly with B by minimising the unclaimed amount in each round. As a result the players start to cooperate implicitly.
The results show a minimum at $\omega^A = 0.5$, where both A's profit and the success rate of the game are lowest. The maximum is reached for $\omega^A = 1$. Notably, the optimising player A earned slightly less on average than the non-optimising player B. This may be because player B used fixed decision making rules and A had to adapt to them.
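The quantities reported in the experiments (individual cumulative profits, total profit and success rate) can be computed from a recorded sequence of demands as in the following Python sketch. The function name and the returned dictionary keys are hypothetical, and the success rate is returned as a fraction of rounds rather than as a count.

```python
def summarise_game(demands_a, demands_b, q):
    """Cumulative profits and success rate for one recorded game."""
    profit_a = profit_b = successes = 0
    for a, b in zip(demands_a, demands_b):
        if a + b <= q:  # compatible demands: both players are paid
            successes += 1
            profit_a += a
            profit_b += b
    rounds = len(demands_a)
    return {
        "profit_a": profit_a,
        "profit_b": profit_b,
        "total_profit": profit_a + profit_b,
        "success_rate": successes / rounds,
    }
```

For the four-round toy record `summarise_game([5, 6, 4, 7], [5, 5, 4, 3], q=10)`, three rounds are compatible, yet the total profit is 28 rather than 30 because the third round leaves 2 units unclaimed, which illustrates that compatibility alone does not guarantee full use of the amount.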

B. TEST 2: A OPTIMISES AND LEARNS, B BEHAVES HEURISTICALLY
This experiment is similar to Test 1, see Section IV-A, i.e. player B behaves according to the heuristic model, Section III-B, and player A uses the optimal DM policy optimising the proposed reward (10). Unlike in Test 1, player A is learning. A takes a uniform prior as B's transition model and dynamically updates it with the data gathered, see Section III-C.
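The idea of starting from a uniform model and refining it with observed data can be sketched as follows. This is only an illustration under stated assumptions: it uses a Dirichlet-count estimate over a finite set of states, which is a common way to learn a discrete transition model but is not necessarily the estimator of Section III-C. The class name, the integer encoding of states and actions, and the prior count of 1 are hypothetical.

```python
from typing import Dict, List, Tuple


class TransitionLearner:
    """Dirichlet-count estimate of p(s_next | s, a) over a finite state set."""

    def __init__(self, n_states: int, prior_count: float = 1.0) -> None:
        self.n_states = n_states
        self.prior_count = prior_count  # equal counts encode a uniform prior
        self.counts: Dict[Tuple[int, int], List[float]] = {}

    def _row(self, state: int, action: int) -> List[float]:
        """Return the count vector for (state, action), creating it on demand."""
        key = (state, action)
        if key not in self.counts:
            self.counts[key] = [self.prior_count] * self.n_states
        return self.counts[key]

    def update(self, state: int, action: int, next_state: int) -> None:
        """Add one observed transition to the statistics."""
        self._row(state, action)[next_state] += 1.0

    def predict(self, state: int, action: int) -> List[float]:
        """Return the predictive distribution of the next state."""
        row = self._row(state, action)
        norm = sum(row)
        return [count / norm for count in row]


learner = TransitionLearner(n_states=4)
print(learner.predict(0, 1))  # uniform before any data: [0.25, 0.25, 0.25, 0.25]
for _ in range(4):
    learner.update(0, 1, 2)   # the same transition is observed four times
print(learner.predict(0, 1))  # [0.125, 0.125, 0.625, 0.125]
```

Before any data the prediction is uniform, which corresponds to a non-learning player with no information. Each observation then shifts probability mass towards the transitions actually seen.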
Cumulative profit and success rate obtained in Test 2 are shown in Figures 5-8 and Table 2. The learning clearly has a positive impact on the game results. On average, the players are successful in more than 66% of all rounds: the average success rate is about 15 percentage points higher than in Test 1, and the cumulative profit is higher as well. The minimum values of the individual profits and of the overall success rate are significantly higher, cf. Table 1. On the other hand, their maximum values have noticeably

C. TEST 3: BOTH PLAYERS OPTIMISE BUT NONE LEARNS
In this experiment both players are of MDP type and select the DM policy maximising the reward (10). However, neither of the players is learning. They use a fixed uniform model (see Section III-C), which represents the situation when there is no information about the opponent.
Cumulative profits and success rate of the game vs. parameters ω_A and ω_B are shown in [...] and Table 3.
The results illustrate the positive impact of i) optimal bargaining compared to heuristic behaviour, cf. the results of Test 1 and Test 2, and ii) the proposed reward (10), which prompts indirect negotiation. Even with non-informative prior knowledge, the players get higher profit. If the players' weights are ω_A ≥ 0.5 and ω_B ≥ 0.5, the success rate is 100% and the overall game profit gained is close to the maximum possible (600 CZK), see Table 3. In other words: when the players care about the optimal allocation of the resources (by assigning high weights to the second term in the reward (10)), the bargaining is more profitable. On average, the players are successful in more than 76% of all rounds.

D. TEST 4: PLAYERS OPTIMISE AND LEARN WITH UNIFORM PRIOR
This experiment is similar to Test 3, i.e. both players are of MDP type and maximise the reward (10). Unlike in Test 3, the players have been given the ability to learn the opponent's model. The agents dynamically enhance their non- [...] Cumulative profits and success rate of the game in dependence on parameters ω_A and ω_B are shown in Figures 13-16 and Table 5.
The results show significant improvement due to the learning. The minimum values of the individual profits and the success rate decreased, but their maximum values increased on average, see Table 4. The significant improvement oc- [...]

The results show further improvement, see Table 5, cf. Tests 3-4. The minimum values of the profits and of the success rate do not change, but the maximum and mean values noticeably increased, cf. Test 4 (Section IV-D). The players achieve much higher individual profits for low values of the weights ω because they coordinated their demands to make them almost always compatible (see Figure 20).

V. DISCUSSION
Section IV describes simulation results obtained on the NDG. They show that our DM model can help the players to bargain effectively and to counteract incomplete knowledge. The main advantages of the proposed DM model are as follows:
• The proposed reward function respects the individual economic profit of the bargaining agent and the unclaimed amount of money from the previous round. As the opponent's past actions enter the reward (10), the optimal policy of the agent implicitly respects them. And vice versa: the optimal policy of the opponent respects the agent's actions. Hence both players are forced to cooperate implicitly.
• The weight ω in (10) expresses the trade-off between individual profitability and efficient use of the game potential. At the same time it reflects the agent's preferences and, partially, its style of playing (personal traits). High values of the weight in the player's reward (10) indicate a high interest of the agent in the efficient use of game resources, i.e. in minimising the remaining unclaimed amount. In each round the reward thus encourages the agent to dynamically "adapt" its current demand to the predicted demand of the opponent. In the next round, the resulting profit 6, together with the updated opponent's model, is used in (2), (3) to select a new demand. This is the essence of the proposed indirect dynamic negotiation.
• Compared to the heuristic bargaining model, Section III-B, our optimal DM policy increased the mean value of the player's individual profit by more than 50% (in the case of an uninformative prior) and by about 65% (informative prior).
• Learning significantly improves the bargaining results. However, an optimising but non-learning agent can have worse individual results than the heuristic opponent. The reason is that the optimising agent implicitly cooperates with the opponent during bargaining but does not use the correct opponent model for this (and therefore cannot predict the opponent). In contrast, the opponent does not cooperate and uses a fixed heuristic model. As a result, the agent's effort brings more profit to the opponent than to itself.
• The best bargaining results were achieved when both players were learning and employed the proposed bargaining policy. An informative prior used in learning can significantly improve the agent's profit.
The proposed solution can be further extended i) to cover multi-issue bargaining; ii) to respect human non-rationality given by social and cognitive aspects; iii) to respect the emotional state of the agent, which has been proved to significantly influence DM [57].
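The trade-off expressed by the weight ω in the list above can be made concrete with a small sketch. The reward used here is a hypothetical stand-in and not the reward (10): it is assumed to be (1 - ω) times the agent's own profit minus ω times the unclaimed amount, with the whole amount counted as unclaimed when the demands are incompatible. The demand is chosen by a one-step expected-reward maximisation, which is simpler than the MDP policy (2), (3). All names, the amount of 10 and the predicted distribution are illustrative assumptions.

```python
from typing import Sequence


def illustrative_reward(own: int, opp: int, total: int, omega: float) -> float:
    """Hypothetical stand-in for the reward: weighted profit minus weighted waste."""
    if own + opp <= total:
        profit, unclaimed = own, total - own - opp
    else:
        profit, unclaimed = 0, total  # assumption: nothing is distributed on failure
    return (1.0 - omega) * profit - omega * unclaimed


def best_demand(opp_probs: Sequence[float], total: int, omega: float) -> int:
    """Pick the demand maximising the expected reward in a single round.

    opp_probs[d] is the predicted probability that the opponent demands d.
    """
    def expected(own: int) -> float:
        return sum(p * illustrative_reward(own, opp, total, omega)
                   for opp, p in enumerate(opp_probs))

    return max(range(total + 1), key=expected)


# Hypothetical prediction: the opponent demands 2 with probability 0.6
# and 6 with probability 0.4, out of 10 units.
probs = [0.0] * 11
probs[2] = 0.6
probs[6] = 0.4

print(best_demand(probs, 10, omega=0.0))  # 8: gambles on the low opponent demand
print(best_demand(probs, 10, omega=1.0))  # 4: compatible with both predicted demands
```

Under these assumptions a low weight leads to a risky high demand, while a high weight leads to a demand that never exceeds the available amount for any predicted opponent demand, in line with the coordinating effect described above.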

VI. CONCLUDING REMARKS
The paper addresses a problem of sequential bilateral bargaining with incomplete information. We proposed a DM model that helps agents to bargain successfully by performing indirect negotiation and learning the opponent's model. Methodologically, the paper casts heuristically-motivated bargaining of a self-interested independent agent into a framework of Bayesian learning and Markov decision processes. The proof of the main results is based on the standard methodology. However, the problem formulation and the solution obtained are novel and practically important. The special form of the reward implicitly motivates the players to negotiate indirectly, via closed-loop interaction. At the same time, the proposed method is privacy-preserving, since it does not require the exchange of data or models between the bargaining agents. We illustrate the approach by applying our model to the Nash demand game, which is an abstract model of bargaining. The paper provides our original formulation and solution of this practically important DM scenario. It presents an initial study which confirms that our formulation is meaningful and gives promising results. The results indicate that the introduced DM model: i) leads to coordinating the players' actions and to their indirect negotiation; ii) results in maximising the success rate of the game; and iii) brings more individual profit to the players compared to the heuristic model.
The proposed bargaining policy minimises losses caused by: (i) insufficient use of the resources; (ii) demands that exceed the total resources available; and (iii) incomplete knowledge.
The results obtained indicate the possibility of creating a realistic and applicable methodology of cooperation and negotiation in flatly organised networks of interacting agents without a fixed structure, cf. [58]. We believe that our approach is suitable for non-cooperative multi-agent networks, since it provides an easy route to implicit cooperation. The solution does not rely on a central authority, and the proposed DM model outperforms a heuristic model whenever both agents are rational, learning and follow the optimal strategy.
In future work we would like:
• to cover multi-issue bargaining;
• to extend the approach to multi-agent settings;
• to implement the approach for bargaining rules other than the NDG.
A further foreseen challenge is learning the weights of individual players from their bargaining history. The weights indirectly reflect the agent's model of bargaining and its preferences. Moreover, the weights may depend on the agent's personality [59], which allows the influence of personality traits on decision making to be taken into account.

Figure 4. Test 1 - Success Rate of the Games.
[...] Its opponent, B, behaves heuristically, see Section III-B.
Test 2: This case is similar to Test 1 but player A dynamically learns the opponent's model.
Test 3: Both players are of the MDP type and non-learning. They have no knowledge of their opponent and do not model it either (i.e. they use a uniform model).
A. TEST 1: A IS A NON-LEARNING MDP PLAYER, B BEHAVES HEURISTICALLY

Table 3. Test 3: Both players optimise but none learns.

Table 4. Test 4: Both players optimise and learn with uniform prior.

Table 5. Test 5: Both players have informative prior, learn and optimise.
Test 3 has been performed with the resulting prior instead of the uniform distribution. Cumulative profits and success rate of the game in dependence on parameters ω_A and ω_B are shown in