Solving the Datum Search as a Partially Observed Stochastic Game

The Flaming Datum (FD) problem refers to the search for a hostile Evader that is fleeing after momentarily revealing its position. If Evader's direction is fixed but unknown, while its maximal speed is known, Koopman argued in 1980 that the best trajectory for Searcher is a spiral starting from the revealed position. The objective of our study is to verify the hypothesis of a spiral search path by formulating the FD problem in the framework of a finite two-player zero-sum partially observed stochastic search game, in which the opponent repeatedly plays a fixed but unknown pure strategy. Using a realistic sensor model, current information about the position of Evader is represented by an occupancy map, updated in the Bayesian framework. The utility is computed as the entropy reduction of the occupancy map. The game was implemented in software and solved using the maxmin method. By repeatedly running the described search game, we have found that the search pattern, although random (due to the uncertainties in sensing and the mixed strategies), is indeed a spiral on every play of the game, thus confirming the hypothesis.


I. INTRODUCTION
Search is a "hide and seek" game with a long history. The theory was pioneered by Koopman [1]-[3], primarily in the military context, followed by developments by Stone et al. [4]. This framework provides the optimal a priori search plan for a given detection model, target motion model and cost of the search. Searching strategies in biological applications [5]-[8], robotics [9] and security applications [10], [11] are driven by sequentially acquired measurements and result in random search paths. A taxonomy of search problems is presented in [12], distinguishing the case where the target is static from the case where the target can move as a reactive intelligent player. The latter class of problems falls under the theoretical framework of game theory [13].
In situations characterised by dynamic interaction between the players, where the players do not have complete information about the environment, the class of partially observable stochastic games (POSG) [14] provides a very general theoretical framework of study. A POSG can be viewed as an application of partially observable Markov decision processes (POMDP) [15] to game theory, in which there are multiple agents (players) with possibly conflicting goals, and their joint actions determine the state transitions and rewards.
In general, POSGs are difficult to solve because a multiply exponential number of histories needs to be evaluated for each agent.
The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.
In this paper we focus on a specific POSG: a two-player zero-sum (TPZS) partially observable stochastic search game, in which Evader (target) repeatedly plays a fixed pure strategy, unknown to Searcher. This framework is appropriate for solving the so-called Flaming Datum problem [16]. Consider a situation where a submarine torpedoes a ship in a convoy and, in doing so, reveals its position. Suppose the attack happened at time $t = 0$. Retaliatory forces arrive at the position of the attack after a time delay and engage in an active search for the submarine until time $T$. The torpedoed ship marks the position where the submarine was at $t = 0$ and is referred to as the flaming datum (FD). In the meantime, the submarine has dived in an attempt to leave the search area undetected and survive. The payoff of the search is the probability that the submarine will be found within the fixed search interval $T$. Koopman [17] studied this problem assuming the submarine travels on a fixed but unknown course and at a known radial speed. In such circumstances (without using game theory) Koopman established that the searcher's best trajectory is generally a spiral of some kind, starting from the FD (see also [18]). The FD problem has been studied as a TPZS game under different constraints and assumptions on the behaviour and capabilities of the two players [16], [19], [20].
In our formulation of the FD problem, Searcher is equipped with a sensor characterised by a realistic range-dependent probability of detection. Under the assumption that Evader's course is fixed but unknown, we cast the problem in the framework of a finite TPZS POSG. Current knowledge about the location of Evader is represented with a probabilistic occupancy map. This dynamic map is updated, after every stage of the game, by fusion of sensor data with the current map using the Bayes rule. The payoff assigned to each pair of actions by the two players is defined as the entropy reduction of the occupancy map. The described search game has been implemented in software and played repeatedly, because the outcome of each play is naturally random in the adopted framework of POSG. The objective is to verify the hypothesis that the search trajectory is indeed a spiral.
The rest of the paper is organised as follows. Sec. II describes the problem in a formal manner, followed by a game theoretic formulation and its solution in Sec. III. Numerical simulation results are presented in Sec. IV, and conclusions drawn in Sec. V.

II. A PROBLEM STATEMENT
The search area is discretised and modelled by a square grid consisting of $M \times N$ nodes. A node represents a square area (or a cell) of size $R \times R$. The Cartesian coordinates of the $(m, n)$th node are denoted by $l_{m,n} = (x_{m,n}, y_{m,n})$. Thus the search area is fully specified by $L = \{l_{m,n};\ m = 1, \dots, M,\ n = 1, \dots, N\}$.
Suppose Evader has been observed at time $t = 0$ in the square area represented by node $(m^*, n^*)$. Being aware that its position has been revealed, Evader will attempt to escape the search area in the quickest possible manner, by moving at its maximum speed (which is known to Searcher) on the course $\psi_0 \in [0, 2\pi)$ rad, which is unknown to Searcher. Searcher arrives at the FD coordinates $l_{m^*,n^*} = (x^*, y^*)$, corresponding to node $(m^*, n^*)$, and performs sensing at time $\tau > 0$. Sensing, in the context of underwater surveillance, typically involves the deployment of a dipping sonar, an active sonar submerged under the sea surface [18]. The detection process is naturally imperfect: the probability of detection decreases with the range to the target, while there is a nonzero probability of false detections. The sensing model will be described later in this section. Based on the outcome of the sensing activity, Searcher will augment his knowledge about Evader's position and subsequently choose where to go next to perform sensing. An implicit assumption is that the speed of Searcher $U_s$ is much greater than the speed of Evader $U_e$ (e.g. helicopter vs submarine). The time it takes Searcher to move to its new location is denoted $\Delta_1$, and the time to carry out the sensing activity is $\Delta_2$, such that $\Delta_1 + \Delta_2 = \Delta$ is the update interval. The maximum time available for search, $T$, is limited by the available resources. The search is successful if Evader is found in time less than or equal to $T$. We will formulate the described problem as a two-player zero-sum partially observable stochastic search game. This game will then be implemented in software and played repeatedly in order to study the characteristics of the search trajectories and to estimate the probability of successful search as a function of the search time and the characteristics of the sensor.
Next we introduce the sensing model. Let Searcher at discrete time $k = 1, 2, \dots$ be in a node with coordinates $p(k) \in L$. The sensing activity at $k$ results in a set of detections $Z(k) = \{z_1(k), \dots, z_{\mu_k}(k)\}$, where $z \in Z(k)$ is a vector consisting of a measured range and azimuth from $p(k)$ to the perceived target location. The probability of detecting Evader, located at coordinates $l_{m,n} \in L$, is denoted $P_d^{(m,n)}(k)$. The probability of detection is typically a function of the distance $r^{(m,n)}(k)$ between the target position $l_{m,n}$ and the searcher position $p(k)$, i.e. $r^{(m,n)}(k) = \|p(k) - l_{m,n}\|$. The mathematical model of the probability of detection is adopted as given in (1), where the sensing constant $\alpha$ (dimension of length) depends on the characteristics of the sensor and environment. We assume $\alpha$ is known and specified in terms of the grid period $R$. Parameter $\alpha$ determines the sensing area (for a given probability of detection). According to (1), a target at distance $r = 3\alpha$ will be detected with probability $p_o = 0.05$. Assuming 360° coverage, the sensing area $L_k$ for $p_o$ is a circular area of radius $3\alpha$. The sensor could also report false detections. Their spatial distribution is assumed to be uniform over $L_k$ and homogeneous over the search area. The number of false detections in $L_k$ is modelled with the Poisson distribution whose mean rate is $\lambda > 0$. The measured range and azimuth (i.e. the components of vector $z \in Z(k)$) are assumed to be affected by additive zero-mean Gaussian noise with covariance matrix $\mathbf{R}$.
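To make the role of the sensing constant concrete, the following minimal Python sketch evaluates a range-dependent probability of detection. The exponential form $\exp(-r/\alpha)$ is an assumption made here for illustration: it is chosen only because it reproduces the stated value $p_o \approx 0.05$ at $r = 3\alpha$, and it should not be read as the exact expression (1). The function name `detection_probability` is hypothetical.

```python
import math

def detection_probability(searcher, cell, alpha):
    """Assumed detection model P_d = exp(-r / alpha); not necessarily Eq. (1)."""
    r = math.dist(searcher, cell)  # Euclidean distance between p(k) and l_{m,n}
    return math.exp(-r / alpha)

alpha = 2.0  # sensing constant, expressed in units of the grid period R
p_edge = detection_probability((0.0, 0.0), (3 * alpha, 0.0), alpha)
print(round(p_edge, 3))  # 0.05, i.e. exp(-3) at range r = 3 * alpha
```

Under this assumed form the probability of detection equals 1 at zero range and falls to about 5% at the edge of the sensing area of radius $3\alpha$.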

III. GAME THEORETIC FORMULATION
Because Evader is moving with a constant velocity from the FD, it effectively plays the same (unknown) action repeatedly. This fact dramatically simplifies the described search problem, because it can be expressed as a TPZS game in the normal (matrix) form of a constant size [21]. Such a game at each time $k$ is defined by a triple $(A_k, B_k, U_k)$, where $A_k$ and $B_k$ are the sets of admissible actions of Searcher (Player 1) and Evader (Player 2), respectively, and $U_k = [u_{ij}]$ is the payoff matrix whose rows are labelled by the elements of $A_k$ and whose columns are labelled by the elements of $B_k$. The value of $u_{ij}$ is the payoff (utility) for Player 1 if he plays $a_i \in A_k$ and Player 2 plays $b_j \in B_k$. Player 1 is referred to as the maximiser, because his goal is to act in such a manner as to maximise the utility. Player 2 is referred to as the minimiser, because his goal is the opposite: to minimise the payoff for Player 1. In deciding which action to take, each player is unaware of what the opponent does. We will consider a finite TPZS game, in which both $A_k$ and $B_k$ are finite sets. Set $A_k$ is formed as a set of nodes in the vicinity of the current Searcher position $p(k) \in L$. We restrict motion to the search area grid, specified by $L$, and assume the searching agent can move in a radial direction from its current coordinates. In the defining expression for $A_k$, the subscript notation "$\in L$" means "the nearest grid coordinates in $L$." Set $B_k$ is formed by the linear motion from the FD at a fixed course, with $t_k = \tau + (k - 1)\Delta$ being the time interval from $t = 0$ to the current sensing time.
VOLUME 10, 2022
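The construction of the two action sets can be sketched in Python as follows. This is an illustrative reading of the verbal description only: the step length `step`, the placement of node $(m, n)$ at $(mR, nR)$, the equally spaced candidate courses, the Evader speed and the grid size in the example are all assumptions, and the function names are hypothetical.

```python
import math

def snap_to_grid(x, y, M, N, R=1.0):
    """Nearest grid coordinates in L, assuming node (m, n) sits at (m*R, n*R)."""
    m = min(max(round(x / R), 1), M)
    n = min(max(round(y / R), 1), N)
    return (m * R, n * R)

def searcher_actions(p, step, M, N, V=8, R=1.0):
    """Candidate destinations of Searcher: V radial moves of length `step`."""
    actions = []
    for v in range(V):
        phi = 2.0 * math.pi * v / V  # assumed equally spaced headings
        x = p[0] + step * math.cos(phi)
        y = p[1] + step * math.sin(phi)
        actions.append(snap_to_grid(x, y, M, N, R))
    return actions

def evader_actions(fd, speed, t_k, W=8):
    """Candidate Evader positions: straight-line motion from the FD for time t_k."""
    positions = []
    for w in range(W):
        psi = 2.0 * math.pi * w / W  # assumed equally spaced candidate courses
        positions.append((fd[0] + speed * t_k * math.cos(psi),
                          fd[1] + speed * t_k * math.sin(psi)))
    return positions

A = searcher_actions((25.0, 25.0), step=3.0, M=50, N=50)  # hypothetical 50 x 50 grid
B = evader_actions((25.0, 25.0), speed=0.1, t_k=70.0)     # hypothetical Evader speed
print(len(A), len(B))
```

With $V = W = 8$ the two lists label the rows and columns of an $8 \times 8$ payoff matrix.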
The elements of a POSG in our case include the information state, the set of admissible actions for the two players and the reward function (the utility of Player 1; its negative value is the utility of Player 2 in TPZS games), assigned to each pair $(a, b) \in A_k \times B_k$. The information state of the considered POSG is the current knowledge about Evader's location, available to Player 1 (while Player 2 always plays the same strategy) in deciding how to act. We adopt the Bayesian probabilistic framework and take the information state to be a map of the search area, in which each node is assigned a probability that Evader is occupying it. This map will be referred to as the Evader occupancy map (EOM). The utility (the reward function) assigned to a pair $(a, b) \in A_k \times B_k$ is defined as the reduction of entropy in the EOM. This will be explained in more detail in Sec. III-A.
Note that we formulate the search as a sequence of sensing, decision making and movement to a new location. This effectively decomposes the overall FD search game into a sequence of sub-games played after every sensing activity. This approach, although it may lack a guarantee of theoretical optimality, avoids the computational complexity of solving the overall FD search game.

A. EVADER OCCUPANCY MAP
Following [22], the presence or absence of Evader in the $(m, n)$th node of the search area grid at time $k$ is modelled by a Bernoulli random variable $\delta_{m,n}(k) \in \{0, 1\}$, where by convention $\delta_{m,n}(k) = 1$ signifies that the target is present in this node ($\delta_{m,n}(k) = 0$ means the opposite). The EOM at time $k$ represents a two-dimensional array of posterior probabilities, one for each grid node of the search area, that Evader is present in that node at time $k$. It is computed sequentially, after processing a cumulative sequence of measurement sets $Z_{1:k} := Z(1), \dots, Z(k)$. For the $(m, n)$th node this posterior probability is defined as
$$P_{m,n}(k|k) = \Pr\{\delta_{m,n}(k) = 1 \mid Z_{1:k}\}.$$
The EOM is then a matrix
$$\mathbf{P}_{k|k} = \big[P_{m,n}(k|k)\big]_{m = 1, \dots, M;\ n = 1, \dots, N}.$$
At time $t = 0$, based on the full confidence in the FD $l_{m^*,n^*} = (x^*, y^*)$, the EOM $\mathbf{P}_{0|0}$ is specified with $P_{m^*,n^*}(0|0) = 1$ and zero otherwise.

1) UPDATING THE EOM
Let us denote the EOM, predicted to discrete time $k = 1, 2, \dots$ from the previous time $k - 1$, as $\mathbf{P}_{k|k-1}$. Prediction from time $k - 1$ to $k$ will be explained in Sec. III-A2. This predicted EOM is updated using Bayes' rule as follows. Given a detection set $Z(k)$ obtained by sensing at time $t_k = \tau + (k - 1)\Delta$, if none of the detections in $Z(k)$ falls into the $(m, n)$th cell of the EOM, the probability $P_{m,n}(k|k)$ is computed as in [22].
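As a rough illustration of how a "no detection" outcome lowers the occupancy probability of a cell, the Python sketch below applies the textbook Bayes update for a single Bernoulli cell. It is a simplification stated under explicit assumptions: false detections are ignored, so an empty cell never produces a detection, and the formula is therefore not claimed to coincide with the update expressions of [22]. The function name is hypothetical.

```python
def update_no_detection(prior, p_d):
    """Posterior occupancy of a cell after sensing it and detecting nothing.

    Assumed simplification: P(no detection | occupied) = 1 - p_d and
    P(no detection | empty) = 1, i.e. false detections are ignored.
    """
    denominator = 1.0 - p_d * prior
    if denominator <= 0.0:
        # prior = 1 and p_d = 1 would contradict a miss; leave the prior unchanged
        return prior
    return (1.0 - p_d) * prior / denominator

posterior = update_no_detection(prior=0.5, p_d=0.8)
print(round(posterior, 4))  # 0.1667
```

A cell far from Searcher has `p_d` close to zero and keeps its prior, which is why only the neighbourhood of the search trajectory is "cleared" in the map.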

2) PREDICTION OF THE EOM VIA DIFFUSION
During the time interval $[t_{k-1}, t_k]$, for $k = 1, 2, \dots$, Evader can move and, accordingly, the EOM needs to reflect the uncertainty induced by this motion. Conceptually, this can be achieved through diffusion of the EOM [22], [23]. Diffusion is carried out sequentially according to (7), with short time steps of duration $\delta t < \Delta$ starting from $t_{k-1}$, that is, for $t = t_{k-1}, t_{k-1} + \delta t, t_{k-1} + 2\delta t, \dots$ until $t_k$. The term $N_{m,n}$ in (7) represents the set of nodes which are the neighbours of the $(m, n)$th node on the grid, that is $N_{m,n} = \{(m - 1, n), (m + 1, n), (m, n - 1), (m, n + 1), (m - 1, n - 1), (m + 1, n + 1), (m + 1, n - 1), (m - 1, n + 1)\}$. The first summand on the right-hand side of (7) is due to the possible movement of Evader from the $(m, n)$th cell to one of its neighbours, while the second models its possible motion into the $(m, n)$th cell from one of its neighbours. The "spilling" factors $\eta_{i,j}$ are related to the maximum speed of Evader and the time interval $\delta t$. Because Evader can move in any direction, it makes sense to group the neighbouring cells contained in $N_{m,n}$ into non-diagonal and diagonal neighbours, and to treat them separately. The former are the first four members of $N_{m,n}$ listed above, and they are assigned an equal spilling factor $\eta$; the latter are the last four members of $N_{m,n}$ listed above, and they are assigned an equal spilling factor $\eta'$. Because the diagonal neighbours are further away than the non-diagonal neighbours, it makes sense to adopt $\eta' < \eta$. Furthermore, to ensure that the first summand on the right-hand side of (7) is non-negative, we obtain the condition $1 - 4\eta - 4\eta' \geq 0$, which implies $\eta + \eta' \leq \frac{1}{4}$.
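One diffusion step of the occupancy map can be sketched in Python as below. The update rule used here is an assumed reading of the verbal description of the two summands (probability retained in a cell plus probability spilled in from its eight neighbours); it is not a verbatim transcription of (7). The parameter names `eta` and `eta_diag` stand for the non-diagonal and diagonal spilling factors, and cells outside the grid are assumed to contribute nothing.

```python
def diffuse_step(P, eta, eta_diag):
    """One assumed diffusion step of the occupancy map P (list of rows)."""
    if 4.0 * eta + 4.0 * eta_diag > 1.0:
        raise ValueError("need eta + eta_diag <= 1/4 so the retained share is non-negative")
    M, N = len(P), len(P[0])
    side = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # non-diagonal neighbours
    diag = [(-1, -1), (1, 1), (1, -1), (-1, 1)]      # diagonal neighbours
    stay = 1.0 - 4.0 * eta - 4.0 * eta_diag          # share that remains in the cell
    new = [[stay * P[m][n] for n in range(N)] for m in range(M)]
    for m in range(M):
        for n in range(N):
            for offsets, weight in ((side, eta), (diag, eta_diag)):
                for dm, dn in offsets:
                    i, j = m + dm, n + dn
                    if 0 <= i < M and 0 <= j < N:    # off-grid cells contribute nothing
                        new[m][n] += weight * P[i][j]
    return new

P0 = [[0.0] * 5 for _ in range(5)]
P0[2][2] = 1.0                                       # full confidence in the datum cell
P1 = diffuse_step(P0, eta=0.1, eta_diag=0.05)
print(round(P1[2][2], 3), round(P1[2][3], 3), round(P1[3][3], 3))  # 0.4 0.1 0.05
```

Repeating the step many times spreads the initial spike into the expanding region in which Evader may be located.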

B. UTILITY MATRIX
The last element of the POSG that remains to be specified is the reward function. The utility matrix $U_k = [u_{ij}]$ is an expression of the reward assigned to Player 1 in response to a pair of actions $(a, b) \in A_k \times B_k$. First, let us define the entropy of the EOM $\mathbf{P}_{k|k}$:
$$H_{k|k} = -\sum_{m=1}^{M} \sum_{n=1}^{N} \Big[ P_{m,n}(k|k) \log_2 P_{m,n}(k|k) + \big(1 - P_{m,n}(k|k)\big) \log_2 \big(1 - P_{m,n}(k|k)\big) \Big]. \tag{8}$$
Note that the entropy of the initial EOM $\mathbf{P}_{0|0}$, in the case of full confidence in the FD, is $H_{0|0} = 0$. This reflects the fact that there is no uncertainty about Evader's location at $t = 0$. The entropy grows from time $t = 0$ to $t = \tau$ because of diffusion of the EOM, reflecting the uncertainty in Evader's position caused by its (unknown) motion. Consider the time instant $t_k = \tau + (k - 1)\Delta$, after the update of the EOM resulting in the posterior EOM $\mathbf{P}_{k|k}$. At this point in time Searcher (Player 1) and Evader (Player 2) play the game, that is, choose a pair of actions $(a_i, b_j) \in A_k \times B_k$. Let the EOM predicted to time $t_{k+1}$ be denoted as $\mathbf{P}_{k+1|k}$. Due to diffusion, its entropy will increase, that is $H_{k+1|k} \geq H_{k|k}$. The utility of $(a_i, b_j)$ for Player 1 is defined as the reduction of entropy (or the information gain) [24], as given in (9). Here $H_{k+1|k+1}(a_i, b_j)$ is the entropy of the EOM after it has been updated using the measurement set $Z(k + 1)$. This measurement set, however, is only hypothetical, because the decision has to be made before the strategy profile $(a_i, b_j)$ has been carried out. Hence, the expectation operator $\mathbb{E}$ with respect to the probability density function of $Z(k + 1) | Z_{1:k}$ features in (9). In the practical computation of the utility matrix, however, the expectation operator is discarded, while the fictitious measurement set is adopted as the least informative outcome, that is $Z(k + 1) = \emptyset$.
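The entropy of an occupancy map and the resulting entropy reduction can be computed with a few lines of Python. The sketch below follows the per-cell binary entropy summed over the grid; the helper `entropy_reduction` simply subtracts the entropy of an updated map from that of a predicted map, which is an assumed reading of "reduction of entropy" and not a transcription of (9). The two small maps in the example are invented for illustration.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli cell; 0 by convention at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def eom_entropy(P):
    """Sum of the binary entropies of all cells of the occupancy map P."""
    return sum(binary_entropy(p) for row in P for p in row)

def entropy_reduction(P_predicted, P_updated):
    """Information gain: entropy of the predicted map minus that of the updated map."""
    return eom_entropy(P_predicted) - eom_entropy(P_updated)

P_initial = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
print(eom_entropy(P_initial))                 # 0.0: no uncertainty at t = 0

P_predicted = [[0.5, 0.5]]                    # toy predicted map
P_updated = [[0.25, 0.5]]                     # first cell sensed, nothing detected
print(round(entropy_reduction(P_predicted, P_updated), 4))  # 0.1887
```

Evaluating this quantity for every pair of candidate actions, with the fictitious empty measurement set, would fill the utility matrix one entry at a time.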

C. A SOFTWARE IMPLEMENTATION OF THE GAME
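Given the utility matrix $U = [u_{ij}]$, the corresponding TPZS game is solvable if there exist a probability distribution $p_a^*$ over the actions of Player 1, a probability distribution $p_b^*$ over the actions of Player 2, and a number $\nu$ that together satisfy the saddle-point condition: by playing $p_a^*$, Player 1 secures an expected payoff of at least $\nu$ against any strategy of Player 2, while by playing $p_b^*$, Player 2 concedes an expected payoff of at most $\nu$ against any strategy of Player 1. In this case $p_a^*$ and $p_b^*$ are said to be the optimal mixed strategies, while $\nu$ is referred to as the value of the game. As von Neumann proved in 1928, all finite matrix games have a solution [21]. In solving the games we adopt the maxmin method, the rationale being the awareness that an intelligent competitor will always act in such a manner as to create the worst possible situation for the opponent [21]. A solution of the matrix game refers to finding $p_a^*$, $p_b^*$ and $\nu$, which in practice is carried out using the linear programming (LP) method [25].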
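For readers who wish to experiment without an LP solver, the Python sketch below approximates the solution of a zero-sum matrix game by fictitious play. This is a deliberately substituted technique, not the LP method used in the paper: fictitious play is a simple iterative scheme whose empirical strategies approach optimal mixed strategies in zero-sum games, and it only yields an approximation. The function name, the iteration count and the toy payoff matrix are assumptions.

```python
import random

def solve_matrix_game(U, iterations=20000):
    """Approximate (p_a, p_b, value) of a zero-sum matrix game by fictitious play.

    Rows belong to the maximiser (Player 1), columns to the minimiser (Player 2).
    """
    n_rows, n_cols = len(U), len(U[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_payoff = [0.0] * n_rows  # cumulative payoff of each row vs. the column history
    col_payoff = [0.0] * n_cols  # cumulative payoff of each column vs. the row history
    i, j = 0, 0                  # arbitrary initial pure actions
    for _ in range(iterations):
        row_counts[i] += 1
        col_counts[j] += 1
        for r in range(n_rows):
            row_payoff[r] += U[r][j]
        for c in range(n_cols):
            col_payoff[c] += U[i][c]
        i = max(range(n_rows), key=row_payoff.__getitem__)  # maximiser's best reply
        j = min(range(n_cols), key=col_payoff.__getitem__)  # minimiser's best reply
    p_a = [count / iterations for count in row_counts]
    p_b = [count / iterations for count in col_counts]
    lower = min(col_payoff) / iterations  # payoff that p_a guarantees
    upper = max(row_payoff) / iterations  # payoff that p_b concedes at most
    return p_a, p_b, 0.5 * (lower + upper)

U = [[1.0, -1.0], [-1.0, 1.0]]  # toy payoff matrix (matching pennies)
p_a, p_b, value = solve_matrix_game(U)
a_star = random.choices(range(len(p_a)), weights=p_a)[0]  # draw an action from p_a
```

For this toy matrix the empirical strategies approach the uniform mix and the value approaches zero. Drawing `a_star` at random from `p_a` mirrors the step in which Searcher samples its next action from the mixed strategy; when the game has a saddle point in pure strategies, `p_a` concentrates on a single row.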
The pseudo-code of the software implementation of the described TPZS partially observable stochastic search game is given in Alg. 1. According to line 4, Searcher arrives at the FD at $t_1 = \tau$, while Evader has moved to $e(1)$ following the course $\psi_0$ (unknown to Searcher). After the first sensing activity (line 6), the EOM is updated (line 7) and the peak of the EOM is tested for success (lines 8 and 9). If this peak probability is above the threshold $\xi_0$, the search terminates; otherwise it continues in a while-loop, from lines 10 to 25. The game is solved in line 13 using LP, resulting (in general) in a mixed strategy $p_a^*$. An action $a^*$ is drawn at random from $p_a^*$ in line 14, followed by the movement of Searcher to the next position (line 16). Evader continues following its course, and moves to its new position in line 18. Finally, after the sensing activity, the EOM is updated (lines 20 and 21) and the peak of the EOM is tested for success (lines 22 to 24). If the search has been unsuccessful during the search time $T$, this is reported as a failure in line 26.

IV. NUMERICAL RESULTS
A. A SETUP WITH ILLUSTRATIVE RUNS
The parameters used in simulations are listed in Table 1. Set $A_k$ includes the possible destinations of Searcher in directions {↑, ↗, →, ↘, ↓, ↙, ←, ↖} from its current position (hence $V = 8$). Evader moves in a straight line with a course $\psi_0$. We set $W = 8$, that is, the utility matrix at every stage of the game is an $8 \times 8$ matrix.
In simulations we ensure that the final position of Searcher at time $t_k$ is on the grid, i.e. $p(k) \in L$. The timing parameters are $\Delta_1 = 2$ units of time (u.t.), $\Delta_2 = 1$ u.t. and $\tau = 70$ u.t. Because of the uncertainty due to the mixed strategy $p_a^*$ (it involves randomness in decision making) and the random detection process, the trajectory of Searcher on every play of the game is random (and different), and so is the success of the search game. Fig. 1 displays the EOM at time: (a) $t = \Delta$ and (b) $t = \tau$, with white/black areas indicating a small/large probability of occupancy. Based on Fig. 1(a) it is obvious that the FD (i.e. Evader's position at time $t = 0$) is $(x^*, y^*) = (25, 25)$. The diffusion of the EOM via (7), reflecting the uncertainty caused by Evader's motion, can be observed in Fig. 1(b): the black area represents the circle, expanding with time, in which Evader is confined. Fig. 2 shows an example of: (a) the trajectory traversed by Searcher up to time $t = 101$ u.t.; (b) the corresponding EOM at $t = 101$ u.t. In this particular run of the search game, Evader has not been found yet, because all measurement sets $Z(k)$ were empty. The "no-detection" information is reflected in the EOM by the gray/white areas inside the expanding black circle, corresponding to the trajectory traversed by Searcher. Note that white areas become gray and then black with time, due to diffusion of the EOM. Fig. 3 shows four different search trajectories obtained with the same setup. Evader was found in panels (a), (b) and (c), but managed to escape after $T = 300$ s in panel (d). Note from Fig. 3 that the search trajectories obtained by the described POSG result in a randomised spiral search, starting from the FD, in accordance with the hypothesis in [17]. It has been consistently observed that: (1) the search trajectories are spiral; (2) the solution to the game is a mixed strategy only in the first few stages; subsequently, the solutions are pure strategies.
An AVI movie of a typical run is available in the Supplementary Material.

B. MONTE CARLO RUNS
Next we investigate the probability of a successful search as a function of the search time $T$. The results are shown in Fig. 4 for two values of the sensing parameter, namely $\alpha = 2$ and $\alpha = 2.5$. The probability of a successful search was estimated by running the described search game (implemented in software) 5000 times for each value of the search interval $T$. Naturally, the probability of a successful search is zero for $T < \tau$, and then grows quickly for $T > \tau$. However, this growth slows down as $T$ increases, reaching a saturation level below probability 1. The probability of a successful search, as shown in Fig. 4, depends on $\alpha$: a higher value of $\alpha$ corresponds to a higher probability of detection and therefore increases the probability of success.
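The Monte Carlo estimate itself is a simple success frequency, and its statistical precision follows from the binomial standard error. The Python sketch below shows the calculation with a stand-in for a single play of the game: the callable `play_once` and its success probability of 0.6 are purely hypothetical placeholders and do not reproduce any result of the paper.

```python
import math
import random

def estimate_success_probability(play_once, runs=5000, seed=0):
    """Estimate the success probability and its binomial standard error."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(runs) if play_once(rng))
    p_hat = successes / runs
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / runs)
    return p_hat, std_err

# Hypothetical stand-in for one play of the search game (True means Evader was found).
p_hat, std_err = estimate_success_probability(lambda rng: rng.random() < 0.6)
worst_case = math.sqrt(0.25 / 5000)  # largest possible standard error with 5000 runs
print(round(worst_case, 4))          # 0.0071
```

With 5000 plays per value of $T$, the standard error of each estimated probability is therefore at most about 0.007.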

V. CONCLUSION
The paper studied the flaming datum search under the assumption that Evader's course is fixed but unknown. It was established in the 1980s, using analytical methods, that the searcher's best trajectory in this case is a spiral, starting from the FD. The objective of our study was to verify the hypothesis of a spiral search path when the FD search is formulated in the framework of a finite two-player zero-sum partially observed stochastic search game. Using a realistic sensor, current knowledge about the location of Evader in this framework is represented with a probabilistic occupancy map. This map is updated after every stage of the game using Bayes' rule, while the utility is expressed as the entropy reduction of the occupancy map. The game was implemented in software and solved using the maxmin method. By repeatedly running the described search game in simulations, we have found that the search pattern is random on every play of the game, because of the mixed strategies and the uncertainty in sensing. However, the search trajectory is indeed always a spiral, starting from the flaming datum, thus confirming the hypothesis. Mixed search strategies have been observed only in the early stages of the search game.
ALEX SKVORTSOV received the Ph.D. degree in theoretical physics from the Moscow University of Applied Physics and Technology. He has significant research and development experience in defence sponsored projects (data fusion, sensor networks, flow noise, and new vibro-elastic materials) on which he worked in academia and industry. He has been working with the Defence Science and Technology Group, since 2005. Since 2014, he has been the Group Leader and the Principal Scientist in acoustic signature control. His research interest includes finding novel solutions for acoustic signature management.

HAYDAR DEMIRHAN is currently a Senior Lecturer with the School of Science, Mathematical Sciences Discipline, RMIT University. He has 60 research publications in high-quality scientific outlets. His research interests include Bayesian learning and its application to AI, computer simulation experiments, fuzzy inference, and cryptographic randomness.