In recent years, we have seen widespread deployment of smart camera networks for a variety of applications. Proper placement of cameras in such a distributed environment is an important design problem: the placement has a direct impact on the appearance of objects in the cameras, which in turn dictates the performance of all subsequent computer vision tasks. For instance, one of the most important tasks in a distributed camera network is to visually identify and track common objects across disparate camera views. This is a difficult problem because the proper identification of semantically rich features, or *visual tags*, such as faces or gaits depends highly on the pose of these features relative to the camera view. Using multiple cameras can alleviate this “visual tagging” problem, but the actual number of cameras and their placement become a non-trivial design problem.

To properly design a camera network that can accurately identify and understand visual tags, one needs a visual sensor planning tool—a tool that analyzes the physical environment and determines the optimal configuration for the visual sensors so as to achieve specific objectives under a given set of resource constraints. Determining the optimal sensor configuration for a large-scale visual sensor network is technically a very challenging problem. First, visual line-of-sight sensors are susceptible to occlusion by both static and dynamic objects. This is particularly problematic as these networks are typically deployed in urban or indoor environments characterized by complicated topologies, stringent placement constraints, and a constant flux of occupant or vehicular traffic. Second, from infrared to range sensing, from static to pan-tilt-zoom or even robotic cameras, there is a myriad of visual sensors, and many of them have overlapping capabilities. Given a fixed budget with limited power and network connectivity, the choice and placement of sensors become critical to the continuous operation of the visual sensor network. Third, the performance of the network depends highly on the nature of the specific tasks in the application. For example, biometric and object recognition require the objects to be captured at a specific pose; triangulation requires visibility of the same object from multiple sensors; object tracking can tolerate a certain degree of occlusion by using a probabilistic tracker.

The earliest investigation in this area can be traced back to the “art gallery problem” in computational geometry. Though an upper bound exists [1], finding the minimum number of cameras needed to cover a given area is an NP-complete problem [2]. Heuristic solutions over 3-D environments have recently been proposed in [3], [4], but their sophisticated visibility models can solve only small-scale problems. Alternatively, the optimization can be tackled in the discrete domain [5], [6], where the optimal camera configuration is formulated as a Binary Integer Programming (BIP) problem over discrete lattice points. These works, however, assume a less sophisticated model in a 2-D space rather than a true 3-D environment, and the loss in precision due to discretization has not been properly analyzed.

In this paper, we continue our earlier work in [7] on developing a binary-integer-programming-based framework for determining the optimal visual sensor configuration. Our primary focus is on optimizing the performance of the network for visual tagging. Our proposed visibility model supports arbitrarily shaped 3-D environments and incorporates realistic camera models, occupant traffic models, self-occlusion, and mutual occlusion. In Section II, we develop the visibility model for the “visual tagging” problem based on the probability of observing a tag from multiple visual sensors. Using this metric, we formulate in Section III the search for the optimal sensor placement as a Binary Integer Programming (BIP) problem. Experimental results demonstrating this algorithm using simulations are presented in Section IV. We conclude the paper by discussing future work in Section V.

SECTION II

## Visibility Model

Given a camera network, we model the visibility of a tag based on three random parameters *P*, **v**_{P}, and β_{s}, as well as two fixed environmental parameters *K* and *w*. *P* defines the 3-D coordinates of the center of the tag and **v**_{P} is the pose vector of the tag. We assume the tag is perpendicular to the ground plane and its center lies on a horizontal plane Γ. Note that the dependency of visibility on **v**_{P} allows us to model self-occlusion—the tag being occluded by the person who is wearing it. The tag will not be visible to a camera if the pose vector points away from the camera. We model worst-case mutual occlusion by considering a fixed occlusion angle β measured at the center of the tag on the Γ plane. Mutual occlusion is said to occur if the projection of the line of sight onto the Γ plane falls within the range of the occlusion angle. In other words, we model the occlusion as a cylindrical wall of infinite height around the tag, blocking a fixed visibility angle of β starting at a random position β_{s}. *w* is half of the edge length of the tag, which is a known parameter. The shape of the environment is encapsulated in the fixed parameter set *K*, which contains a list of oriented vertical planes that describe the boundary walls and obstacles.

Our visibility measurement is based on the projected size of a tag on the image plane of the camera. The projected size of the tag is very important, as the image of the tag has to be large enough to be automatically identified at each camera view. Due to the camera projection of the 3-D world onto the image plane, the image of the square tag can be an arbitrary quadrilateral. While it is possible to precisely calculate the area of this image, it is sufficient to use an approximation for our visibility calculation: we measure the projected length of the line segment *l* at the intersection between the tag and the horizontal plane Γ. The actual 3-D length of *l* is 2*w*, and since the center of the tag always lies on *l*, the projected length of *l* is representative of the overall projected size of the tag. Given a single camera with camera center *C*, we define the visibility function for one camera to be the projected length ‖ *l*′‖ on the image plane of the line segment *l* across the tag if the visibility conditions described below are satisfied, and zero otherwise. Fig. 1 shows the projection of *l*, delimited by *P*_{l1} and *P*_{l2}, onto the image plane Π. Based on the assumptions that all the tag centers have the same elevation and all tag planes are vertical, we can analytically derive the formulae for *P*_{l1}, *P*_{l2} as
$$P'_{li} = C + {\langle {\bf v}_{\bf C},O -C\rangle \over \langle {\bf v}_{\bf C},P_{li} -C\rangle}(P_{li}-C)\eqno{\hbox{(1)}}$$where 〈·, ·〉 denotes the inner product. The projected length ‖ *l*′‖ is simply ‖ *P*_{l1}′ − *P*_{l2}′‖.
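As a concrete sketch of this projection, the snippet below (hypothetical helper names, not from the paper) intersects the ray from the camera center *C* through a world point with the image plane Π passing through the principal point *O* with unit normal **v**_{C}, then measures ‖ *l*′‖:

```python
import math

def project(C, v_C, O, X):
    """Project world point X onto the plane through O with normal v_C,
    along the ray from camera center C (pinhole projection)."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    # Ray: C + t*(X - C); solve <v_C, point - O> = 0 for t.
    t = dot(v_C, [o - c for o, c in zip(O, C)]) / dot(v_C, [x - c for x, c in zip(X, C)])
    return [c + t * (x - c) for c, x in zip(C, X)]

def projected_tag_length(C, v_C, O, P_l1, P_l2):
    """Projected length ||l'|| of the tag segment l = (P_l1, P_l2)."""
    return math.dist(project(C, v_C, O, P_l1), project(C, v_C, O, P_l2))
```

For a camera at the origin looking down the z-axis with focal length 1, a 0.4 m segment at depth 2 m projects to length 0.2, as expected from similar triangles.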

After computing the projected length of the tag, we proceed to check four visibility conditions as follows:

**Environmental Occlusion**: We assume that environmental occlusion occurs if the line of sight between the camera center *C* and the tag center *P* is blocked by an obstacle. Specifically, the intersection between the line of sight *PC* and each obstacle in *K* is computed. If there is no intersection within the confined environment, or the points of intersection are higher than the height of the camera, no occlusion occurs due to the environment. We represent this requirement as the binary function chkObstacle (*P*, *C*, *K*), which returns 1 if there is no environmental occlusion and 0 otherwise.
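A minimal sketch of this test, under the simplifying assumption that each obstacle in *K* is a vertical wall described by a ground-plane segment and a height (the function name and obstacle encoding are illustrative, not the paper's):

```python
def chk_obstacle(P, C, obstacles):
    """Return 1 if the sight line from tag center P to camera center C clears
    every obstacle, 0 otherwise (the convention that zeroes the product in
    Eq. (2) under occlusion). Obstacle: ((x1, y1), (x2, y2), height)."""
    px, py, pz = P
    cx, cy, cz = C
    dx, dy = cx - px, cy - py
    for (x1, y1), (x2, y2), height in obstacles:
        ex, ey = x2 - x1, y2 - y1
        denom = dx * ey - dy * ex
        if abs(denom) < 1e-12:
            continue  # sight line parallel to the wall: no crossing
        # Solve P + s*(C - P) = (x1, y1) + u*(ex, ey) in the ground plane.
        s = ((x1 - px) * ey - (y1 - py) * ex) / denom
        u = ((x1 - px) * dy - (y1 - py) * dx) / denom
        if 0.0 <= s <= 1.0 and 0.0 <= u <= 1.0:
            z = pz + s * (cz - pz)   # elevation where the sight line crosses
            if z <= height:
                return 0             # blocked below the top of the wall
    return 1
```

The elevation check implements the rule above: a crossing above the wall top does not count as occlusion.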

**Field of View**: Similar to determining environmental occlusion, we declare the tag to be in the field of view if the image *P*′ of the tag center is within the finite image plane Π. Using a derivation similar to (1), the image *P*′ can be computed by replacing *P*_{li} with *P*. We then convert *P*′ to local image coordinates to determine if *P*′ is indeed within Π. We encapsulate this condition in the binary function chkFOV (*P*, *C*, **v**_{C}, Π, *O*), which takes the camera intrinsic parameters, tag location, and pose vector as input, and returns a binary value indicating whether the center of the tag is within the camera's field of view.

**Self Occlusion**: As illustrated in Fig. 1, the tag is self-occluded if the angle α between the line of sight to the camera *C* − *P* and the tag pose **v**_{P} exceeds π/2. We can represent this condition as the step function *U*(π/2 − |α|).

**Mutual Occlusion**: As illustrated in Fig. 1, mutual occlusion occurs when the tag center, or half the line segment *l*, is occluded. The angle β is subtended at *P* on the Γ plane. Thus, occlusion occurs if the projection of the line of sight *C* − *P* onto the Γ plane at *P* falls within the angular range [β_{s}, β_{s} + β]. We represent this condition using the binary function chkOcclusion (*P*, *C*, **v**_{P}, β_{s}), which returns one for no occlusion and zero otherwise.
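A sketch of this angular test (helper name illustrative), with the wraparound of the occluded arc handled by reducing the azimuth of the sight line modulo 2π:

```python
import math

def chk_occlusion(P, C, beta_s, beta):
    """Return 1 if the sight line is NOT mutually occluded, 0 otherwise.
    The occluder blocks azimuths [beta_s, beta_s + beta) around the tag
    center on the ground plane (angles in radians)."""
    azimuth = math.atan2(C[1] - P[1], C[0] - P[0])
    # Angular offset from the start of the occluded arc, wrapped to [0, 2*pi).
    offset = (azimuth - beta_s) % (2 * math.pi)
    return 0 if offset < beta else 1
```

The modulo reduction is what makes the test correct when the arc straddles the ±π boundary of `atan2`.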

Combining ‖ *l*′‖ and the four visibility conditions, we define the projected length of an oriented tag with respect to a camera with parameters Υ as *I*(*P*, **v**_{P}, β_{s}| *w*, *K*, Υ) as follows:
$$\eqalignno{&I(P, {\bf v}_{\bf P}, \beta_s\vert w,K, \Upsilon) = \Vert l'\Vert \cdot {\rm chkOcclusion}(P,C, {\bf v}_{\bf P}, \beta_s)\cdot\cr&\quad {\rm chkObstacle}(P,C,K) \cdot {\rm chkFOV}(P,C, {\bf v}_{\bf C},\Pi,O)\cdot\cr&\quad U \left({\pi \over 2}-\vert \alpha\vert\right)&\hbox{(2)}}$$where Υ includes all camera parameters. Most vision algorithms require the tags to be large enough for detection. Thus, a thresholded version is usually more convenient:
$$I_b (P,{\bf v}_{\bf P},\beta_s\vert w, T,K,\Upsilon)=\cases{1 &${\rm if}\ I(P,{\bf v}_{\bf P},\beta_s\vert w, K, \Upsilon) > T$\cr0 &${\rm otherwise}$}\eqno{\hbox{(3)}}$$To extend the single-camera case to multiple cameras, we note that the visibility of the tag from one camera does not affect the others, and thus each camera can be treated independently. Assume that the specific application requires a tag to be visible by *k* cameras. The tag at a particular location and orientation is visible if the sum of the *I*_{b}(·) values from all the cameras reaches *k* at that location.

SECTION III

## Optimal Camera Placement

In this section, we propose a binary integer program that finds the best placement given a target number of cameras. We first discretize the space of possible camera configurations, including possible locations, yaw, and pitch angles, into a uniform lattice *gridC* of *N*_{c} camera grid points, denoted as {Υ_{i} : *i* = 1, …, *N*_{c}}. We also discretize the tag space, which includes possible tag positions *P*, orientations **v**_{P}, and occlusion angles β_{s}, into a uniform lattice *gridP* of *N*_{p} tag grid points {Λ_{i} : *i* = 1, 2, …, *N*_{p}}.

The goal of the FIX_CAM algorithm is to maximize the average visibility for a given number of cameras. We first define a set of binary variables on the tag grid, {*x*_{j}: *j* = 1, …, *N*_{p}}, indicating whether a tag at the *j*th tag grid point is visible at *k* or more cameras. We also assume a prior distribution {ρ_{j} : *j* = 1, …, *N*_{p}, ∑_{j} ρ_{j} = 1} that describes the probability of having a person at each tag grid point. We define binary variables on the camera grid, {*b*_{i}: *i* = 1, …, *N*_{c}}, each set to one to indicate the placement of a camera at the corresponding camera grid point. The cost function, defined to be the average visibility over the discrete space, is given as follows:
$$\max_{b_i} \sum^{N_p}_{j=1}\rho_jx_j\eqno{\hbox{(4)}}$$The relationship between the camera placement variables *b*_{i}'s and visibility performance variables *x*_{j}'s can be described by the following constraints. For each tag grid point Λ_{j}, we have
$$\eqalignno{&\sum^{N_c}_{i=1}b_iI_b(\Lambda_j\vert w,T,K,\Upsilon_i) - (N_c - k +1)x_j \le k - 1&\hbox{(5)}\cr&\qquad\quad \sum^{N_c}_{i=1}b_iI_b(\Lambda_j\vert w,T,K,\Upsilon_i)-kx_j \ge 0&\hbox{(6)}}$$These two constraints effectively define the binary variable *x*_{j}. If *x*_{j} = 1, Inequality (6) becomes ∑^{Nc}_{i = 1} *b*_{i} *I*_{b} (Λ_{j}| *w*, *T*, *K*, Υ_{i}) ≥ *k*, which means that a feasible solution of *b*_{i}'s must have the tag visible at *k* or more cameras, while Inequality (5) becomes ∑^{Nc}_{i = 1} *b*_{i} *I*_{b} (Λ_{j}| *w*, *T*, *K*, Υ_{i}) ≤ *N*_{c}, which is always satisfied since *N*_{c} is the largest possible value of the left-hand side. If *x*_{j} = 0, Inequality (5) becomes ∑^{Nc}_{i = 1} *b*_{i} *I*_{b} (Λ_{j}| *w*, *T*, *K*, Υ_{i}) ≤ *k* − 1, which implies that the tag is not visible by *k* or more cameras, while Inequality (6) is always satisfied as it becomes ∑^{Nc}_{i = 1} *b*_{i} *I*_{b} (Λ_{j}| *w*, *T*, *K*, Υ_{i}) ≥ 0.

Two additional constraints are needed to complete the formulation. As the cost function focuses only on visibility, we need to constrain the number of cameras to be at most a maximum number *m*, or ∑_{i = 1}^{Nc} *b*_{i} ≤ *m*. For each camera location (*x*, *y*), we add the following constraint to ensure that only one camera is used at each spatial location: ∑_{all Υi at (x,y)} *b*_{i} ≤ 1.
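On a toy instance the whole formulation can be checked by exhaustive search instead of a BIP solver; the sketch below (illustrative, not the paper's solver) enumerates camera subsets of size at most *m* and scores each by the weighted count of tag points covered by at least *k* cameras, which is exactly the objective (4) under constraints (5)–(6):

```python
from itertools import combinations

def fix_cam_exhaustive(I_b, rho, m, k):
    """Brute-force stand-in for the FIX_CAM BIP on a small instance.
    I_b[i][j] = 1 if camera grid point i sees tag grid point j; rho[j] is
    the prior weight of tag point j. Returns (best average visibility,
    chosen camera indices)."""
    N_c, N_p = len(I_b), len(rho)
    best_val, best_set = -1.0, ()
    for n in range(m + 1):                       # camera budget: at most m
        for chosen in combinations(range(N_c), n):
            # x_j = 1 iff >= k chosen cameras cover tag point j.
            val = sum(rho[j] for j in range(N_p)
                      if sum(I_b[i][j] for i in chosen) >= k)
            if val > best_val:
                best_val, best_set = val, chosen
    return best_val, best_set
```

This only scales to a handful of grid points, which is precisely why the paper resorts to a BIP solver for realistic lattice sizes.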

SECTION IV

## Experimental Results

We first demonstrate the performance of FIX_CAM based on simulation results. All the simulations assume a room of dimension 10 m × 10 m with a single obstacle inside and a square tag with half-edge length *w* = 20 cm. For the camera and lens models, we assume a pixel width of 5.6 μm, a focal length of 8 mm, and a field of view of 60 degrees. These parameters closely resemble the real cameras that we use in real-life experiments. The threshold *T* for visibility is set to five pixels, which we find to be an adequate threshold for our color-tag detector. A tag is defined to be visible if *k* = 2 cameras are looking at it. While we use a discrete space for the optimization, we compute the average visibility for a given camera configuration with Monte Carlo sampling, using three orders of magnitude more sample points than the discrete lattice.

Table I shows the average visibility achieved under different numbers of cameras. The optimal average visibility over the discrete space is shown in the second column. The average visibility estimated by the Monte Carlo method is shown in the third column, and the last column shows the computation time on a Xeon 2.1 GHz machine with 4 GB of memory. The BIP solver is based on the software in [8]. The gap between the optimal discrete solution from FIX_CAM and the Monte Carlo estimate is due to discretization. Fixing the number of cameras at eight and varying the density of the grids, Fig. 2 shows that the resulting camera planning improves and the gap between the continuous and discrete measurements dwindles. The drawback of using a denser grid is a significant increase in computational complexity—it takes hours to complete the simulation at the highest density. One solution is to use the approximate solution discussed in our earlier work [7].

Next, we show how one can incorporate realistic occupant traffic patterns into the FIX_CAM algorithm. The previous experiments assume a uniform traffic distribution over the entire tag space—it is equally likely to find a person at each spatial location and at each orientation. This model does not reflect many real-life scenarios. For example, consider a hallway inside a shopping mall: while there are people browsing at the window displays, most of the traffic flows from one end of the hallway to the other. By incorporating an appropriate traffic model, the performance should improve under the same resource constraint. In the FIX_CAM framework, a traffic model can be incorporated into the optimization through non-uniform weights ρ_{j} in the cost function (4).

In order to use a reasonable traffic distribution, we employ a simple random-walk model to simulate a hallway environment. We imagine that there are openings on either side of the top portion of the environment. At each tag grid point, which is characterized by both the orientation and the position of a walker, we impose the following transition probabilities: a walker has a 50% chance of moving to the next spatial grid point following the current orientation, unless obstructed by an obstacle, and a 50% chance of changing orientation. In the case of changing orientation, there is a 99% chance of choosing the orientation facing the tag grid point closest to the nearest opening, while the rest of the orientations share the remaining 1%. At the tag grid points closest to the openings, we create a virtual grid point to represent the event of a walker exiting the environment. The transition probabilities from the virtual grid point back to the real tag points near the openings are all equal. The stationary distribution ρ_{j} is then computed by finding the eigenvector with eigenvalue one of the transition probability matrix of the entire environment.
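The last step can be sketched as follows, assuming a row-stochastic transition matrix; power iteration stands in for a full eigen-decomposition, since the stationary distribution is the left eigenvector with eigenvalue one:

```python
def stationary_distribution(T, iters=10000, tol=1e-12):
    """Stationary distribution rho of a row-stochastic transition matrix T,
    i.e. the fixed point rho = rho * T, found by power iteration starting
    from the uniform distribution."""
    n = len(T)
    rho = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(rho[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, rho)) < tol:
            return new
        rho = new
    return rho
```

For a two-state chain with transition matrix [[0.9, 0.1], [0.5, 0.5]], detailed balance of probability flow gives the stationary distribution (5/6, 1/6), which the iteration recovers.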

Fig. 3(a) shows this hallway environment. The four hollow circles indicate the tag grid points closest to the openings. Figs. 3(a) and (c) show the floor plan, with blue arrows indicating the optimal camera plans under the constraint of using four cameras. Figs. 3(b) and (d) show the coverage of the environment, computed as the local average visibility at different spatial locations. Clearly, the optimal configuration favors the heavy-traffic hallway area. If the uniform distribution is used instead, we obtain the configuration in Fig. 3(c) and the visibility map in Fig. 3(d). The average visibility drops from 0.8395 to 0.7538 due to the mismatch with the actual traffic pattern. The performance of FIX_CAM under other experimental conditions, such as mutual occlusion, camera elevations, and tag elevations, as well as comparisons with other schemes, can be found in [7].

SECTION V

## Conclusions

In this paper, we have described a binary integer programming framework for modeling, measuring, and optimizing the placement of multiple cameras. There are many interesting issues in our proposed framework, and in visual tagging in general, that deserve further investigation. The incorporation of models for different visual sensors, such as omnidirectional and PTZ cameras, or even non-visual sensors and other output devices such as projectors, is certainly a very interesting topic. The optimality of our greedy approach can benefit from a detailed theoretical study. Last but not least, the use of visual tagging in other application domains, such as immersive environments and surveillance visualization, should be further explored.