Literature Survey on Multi-Camera System and Its Application

A multi-camera system combines features from different cameras to exploit a scene of an event and increase the output image quality. Combining two or more cameras requires prior setup in terms of calibration and architecture. Therefore, this paper surveys the available literature on multi-camera systems' physical arrangements, calibrations, and algorithms, together with their advantages and disadvantages. We also survey the recent developments and advancements in four areas of multi-camera system applications: surveillance, sports, education, and mobile phones. In surveillance systems, the combination of multiple heterogeneous cameras and the advent of Pan-Tilt-Zoom (PTZ) and smart cameras have brought tremendous achievements in multi-camera control and coordination. Different approaches have been proposed to facilitate effective collaboration and monitoring across the camera network. Furthermore, the application of multi-camera systems in sports has made games more interesting in terms of analysis and transparency. The application of multi-camera systems in education has taken education beyond the four walls of the classroom. Teaching methods, student attendance enrollment, students' attention, and teacher and student assessment can now be determined with ease, and all forms of proxy and manipulation in education can be reduced by using a multi-camera system. In addition, the number of cameras on smartphones is growing noticeably. However, most of these cameras serve different purposes, such as zooming, telephoto capture, and a wider field of view (FOV). Future smartphones can therefore be expected either to carry more cameras or to develop in a different direction.


I. INTRODUCTION
The capture of still or moving images is an important step in object recognition, object behavior analysis, and object monitoring. One way to capture an image is with a camera, which can create a single image of an object (i.e., a still image) or a sequence of images in rapid succession (i.e., a video). Currently, there are many image acquisition technologies (e.g., cameras). Most of them are built on catadioptric or fisheye cameras, as well as image acquisition systems with static or moving parts. Camera classification or type can be based on one characteristic or another, such as the FOV, image sensor, image quality, focusing properties, or power of projection. In this survey, the discussion is based on unidirectional and omnidirectional cameras. Six types of cameras are considered in this survey: the Pan-Tilt-Zoom (PTZ) camera [1], smart camera [2], orthographic camera [3], perspective (pinhole) camera [3], omnidirectional (panoramic) camera [4], and thermal camera [5], as shown in Figure 1.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.
Pan-Tilt-Zoom (PTZ) cameras are usually applied in surveillance because of their high-resolution image output. They have a dynamic field of view and can be configured to monitor a specified coverage area. PTZ cameras can zoom in on distant objects to display the detail required to analyze a captured image. However, they can only cover a limited area of the scene at one time. Smart cameras are intelligent cameras that came into the limelight around the mid-80s. They can extract application-specific information from a captured image, create event information about an image, or make an intelligent decision that is applied in an automated process. When they were first introduced, their capabilities were limited in terms of sensitivity and processing power, but these have since improved greatly [6]. The orthographic camera captures an image without any perspective distortion. It produces a two-dimensional (2D) image output without any image depth. Perspective cameras display an image in a real-world view. They produce a three-dimensional (3D) image with depth. All pinhole cameras are also referred to as perspective cameras. Omnidirectional cameras can cover a 360-degree FOV with a high-resolution image of about 1600 × 1200 pixels [4], which allows them to capture images over a wide area. A thermal camera (infrared camera or thermal imager) uses infrared radiation to create an image. It senses infrared light with a wavelength from about 1 µm to 14 µm.
A multi-camera system is an arrangement of a set of cameras used to capture images or sequences of images of a scene. A multi-camera setup can be homogeneous (consisting of cameras of the same type) or heterogeneous (consisting of cameras of different types). The combination of two or more cameras can be employed to expand the span of the measuring area and to perform high-precision measurement. The FOV of a multi-camera setup is larger than that of a single camera. The high accuracy and low cost of visual measurement techniques have made multi-camera systems widely used in different areas of application. Figure 2 shows the coverage area of three heterogeneous cameras, which cannot be obtained with a single camera.
According to Yilmaz et al. [7] and Mehmood [8], the omnipresence of high-precision, low-cost cameras, the availability of high computational resources, and the quest for automation in video analysis have generated great interest in multi-camera system development. Awareness of multi-camera systems started around 1884, when Triboulet [9] used a multi-camera system consisting of seven cameras fastened to a balloon (one camera attached to the mouth of the balloon and six others attached to its circumference), mainly to perform aerial imaging. Similarly, in the mid-nineteen century, multi-camera systems were employed in industrial machine automation, where robots took over tasks from humans in industry [10]. Since then, research on multi-camera systems has been expanding, as shown by the number of papers on multi-camera applications. To demonstrate this, a search on IEEE Xplore was performed using the keyword "multi-camera systems". The result of this search is shown in Figure 3. The figure shows that the number of publications related to multi-camera systems is increasing, which indicates that research on multi-camera systems remains very active.
VOLUME 8, 2020
Even though several surveys concerning multi-camera systems have been published, virtually all of them focus on a selected aspect of the multi-camera system and on particular areas of application (e.g., tracking or surveillance) [11]-[17]. These surveys therefore do not cover the wide spectrum of multi-camera system studies. To address this shortfall in the literature, this survey covers the multi-camera system in four different areas of application, with emphasis on previous work, the problems with existing strategies, recent advancements, and future directions.
This literature survey is divided into ten sections. Single and multi-camera systems, their formation, and their calibration are discussed in Section II. Section III surveys the basic architectures employed in the literature for multi-camera setups, with their advantages and shortcomings. Algorithms used in multi-camera systems, their classification, and their various approaches to person re-identification and tracking fusion are discussed in Section IV. Multi-camera systems in surveillance applications are then discussed in detail in Section V. The application of multi-camera systems in sports analysis, its evolution, and image overlay are presented in Section VI. An overview of the multi-camera application and its evolution in education is given in Section VII. An overview of the progress made over the past decade in integrating multi-camera systems into mobile phones, together with the challenges and future outlook of this ubiquitous field, is presented in Section VIII. Section IX presents the various multi-camera application algorithms with their features and limitations. Our conclusion is presented in Section X.

II. CALIBRATION IN A MULTI-CAMERA SYSTEM
The field of view (FOV) of the cameras in a multi-camera setup determines the calibration method to be used and the arrangement of the cameras. Cameras can be arranged with overlapping or non-overlapping FOV. Figure 4 illustrates overlapping and non-overlapping FOV. In multi-camera systems with overlapping FOV, it is difficult for all cameras to view the calibration target simultaneously, especially in the overlapped areas. In such scenarios, the cameras are distributed across a wide coverage area, so calibrating them with a single calibration target is not easily realizable. For this situation, Devarajan et al. [18] proposed a distributed algorithm for calibrating distributed cameras on a network. The algorithm determines the position, orientation, and focal length of each camera. It addresses the complexity, memory limitations, and networking constraints commonly encountered with centralized calibration by introducing a scalable, parallel algorithm within a complete framework that detects the visual overlap between cameras and ends with an accurate parameter estimate for all cameras on the network [18]. A practical approach to video network surveillance was presented by Gemeiner et al. [19], in which cameras are calibrated within a multi-camera network. They treated calibration as a localization problem, in which the image is matched against a 3D model assembled a priori with a moving camera. The cameras are placed far apart, with non-overlapping FOV. Feng et al. [20] proposed a novel global planar calibration method in which a box of translucent glass with one side covered with a pattern serves as the target. The cameras to be calibrated are arranged on both sides of the glass box. Feng et al. [20] adopted Zheng's traditional method [21] to estimate the intrinsic and extrinsic parameters of the camera, but additionally addressed the calibration error caused by glass refraction. The refraction error was removed using a refractive projection model and ray tracing.
In a multi-camera system, it is critical to find the accurate position of each camera. This is the key to accurate vision measurement for all the cameras [20], [22]. It also improves collaboration between cameras on a network and supports computer vision tasks such as tracking multiple objects together, 3D reconstruction of objects in a scene, and novel view synthesis [18]. The process of finding an appropriate position for a camera to achieve accurate vision is referred to as camera calibration. A camera can be calibrated by determining its parameters from images of a special calibration pattern. The parameters are the intrinsic parameters (fixed to a camera), the extrinsic parameters (which may change with respect to the world frame), and the distortion coefficients of the camera [23]. There are many different approaches to calibrating multiple cameras for a specific camera setup. Multi-camera calibration can be classified into two groups: (i) calibration methods for overlapping FOV cameras [22] and (ii) calibration methods for non-overlapping FOV cameras [24], as shown in Figure 5. According to Shen et al. [22], multi-camera calibration for overlapping FOV falls into four categories: (i) Calibration based on a 3D reference object [25], which uses an object of precisely known 3D geometry as the reference. (ii) Calibration based on a 2D plane object [26], which employs a planar pattern specifically designed for that purpose. (iii) Calibration based on a 1D object [27], which is built on a 1D stick with three or more points. (iv) Self-calibration [28], which does not rely on a calibration object of any dimension but uses the static scene to recover the intrinsic parameters of the camera. Note that this categorization assumes cameras that are connected to a central node, apart from self-calibration, which is used in distributed networks.
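To make the roles of the intrinsic and extrinsic parameters concrete, the following is a minimal sketch of the standard pinhole projection that these parameters describe. It is an illustration under stated assumptions, not a method taken from the surveyed papers: the function name `project_point` and the numeric values are hypothetical, and lens distortion is deliberately left out.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with an ideal pinhole model."""
    # Extrinsic step: move the point from the world frame to the camera frame,
    # X_c = R * X_w + t (rotation is a 3x3 nested list, translation a 3-vector).
    x_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division onto the normalized image plane.
    xn, yn = x_cam[0] / x_cam[2], x_cam[1] / x_cam[2]
    # Intrinsic step: focal lengths (fx, fy) and principal point (cx, cy)
    # map normalized coordinates to pixels. Distortion is ignored here.
    return (fx * xn + cx, fy * yn + cy)


# Hypothetical camera placed at the world origin, looking along +Z.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([0.5, -0.25, 2.0], IDENTITY, [0.0, 0.0, 0.0],
                      fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(pixel)
```

Calibration is the inverse problem: given many observed pixel positions of known pattern points, estimate the rotation, translation, focal lengths, principal point, and distortion coefficients that best reproduce them.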
According to Xia et al. [24], multi-camera calibration methods for non-overlapping FOV are classified into six categories: (i) Calibration using large range measuring equipment [29], [30]. (ii) Calibration using a mirror [31], [32]. (iii) Calibration using motion models [33], [34]. (iv) Calibration using auxiliary markers with supporting cameras [35]. (v) Calibration using laser projection [36], [37]. (vi) Calibration using a wide range of targets [38].
Another calibration system, based on classical techniques, was used by Haraud et al. [39]; it employed default patterns with fixed cameras. Smart camera calibration, as studied by Basu and Ravi [40] and Borghese et al. [41], has received significant attention in research and development because of its applications in multi-camera setups for surveillance, tracking, monitoring, and video conferencing. In a multi-camera setup, camera calibration can be performed either as a group (i.e., centralized architecture) or on each individual camera (i.e., distributed architecture). In a centralized architecture, all the cameras in the multi-camera setup are connected through a central node that runs the calibration process to determine the parameters of all the cameras on that network. For a multi-camera setup with centralized architecture, both the extrinsic parameters (i.e., relative position and direction) and the intrinsic parameters are determined. In a distributed architecture, each camera performs self-calibration, which can only estimate the intrinsic parameters of the camera.
A one-dimensional calibration approach has been widely applied in many multi-camera systems [42]-[46]. The 1D calibration approach was first introduced by Zhang in 2002 [47]. His approach uses a one-dimensional calibration object that has three or more collinear points with known relative positions. He showed that a camera cannot be calibrated from a freely moving one-dimensional object, but that calibration becomes possible when one point of the object is fixed. To obtain an accurate result, he used the maximum likelihood (ML) approach to refine the estimate, and tested the algorithm with both computer simulations and real data, which produced encouraging results. One-dimensional camera calibration does not require a 2D or 3D arrangement of markers. It represents the simplest method of calibration. One advantage of one-dimensional calibration in a multi-camera system is that the object can be observed by all cameras, so all the cameras in the system can be calibrated simultaneously. Simultaneous calibration of cameras prevents error accumulation. However, 1D calibration has its own disadvantages [48]: (i) Due to the assumptions made in the model, exact collinearity of the points cannot be guaranteed. (ii) The accuracy obtained in corner extraction is below that of the 2D calibration method because of the tools employed in extracting the points.
A novel nonplanar target for fast calibration of a networked visual sensor was proposed by Shen and Hornsey [43]. They used two spheres mounted on a supporting rod as a calibration target. To save time, the nonplanar target was applied to each camera separately instead of to all the cameras simultaneously. A three-dimensional calibration method was proposed by Shin and Mun [49]. They used a direct linear transform (DLT) to determine the calibration parameters of cameras using a 3-axis frame. The main problem with their approach was the error generated by ellipse fitting, which was caused by lighting conditions and noise. Another shortcoming was that the precision achieved in center extraction could not be matched at the edges [48]. Three-dimensional calibration targets built from multiple planar patterns were proposed by Quan and Lan [50] and Xu et al. [51]; however, such targets constrain how the cameras can be distributed, which limits their application.
For non-overlapping FOV camera calibration, Lu and Li [29] proposed a global calibration method that employs a theodolite coordinate measurement system (TCMS). The system determines the 3D coordinates of the points on the calibration target. The global calibration of the cameras is then performed from the position of the calibration target relative to the cameras, using the transformation matrix between the TCMS and the cameras. Liu et al. [38] developed a global calibration method that uses multiple targets (MT). One of the multiple vision sensors was selected to define the global coordinate frame (GCF). The MT was placed in front of the sensors so that each sensor captured images of its corresponding sub-target several times (at least four). The transformation matrix from each sensor's coordinate frame to the GCF was then obtained.
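The idea of expressing every camera in one global coordinate frame can be sketched by chaining rigid transformation matrices. The example below is a simplified illustration, not the procedure of [29] or [38]: it assumes that the pose of one shared target frame is already known in each camera's frame, and all names and numbers are hypothetical.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | t]: the inverse is [R^T | -R^T t]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][j] * t[j][3] for j in range(3)) for i in range(3)]
    return [r_t[0] + [trans[0]],
            r_t[1] + [trans[1]],
            r_t[2] + [trans[2]],
            [0.0, 0.0, 0.0, 1.0]]


def translation(x, y, z):
    """Build a pure-translation 4x4 transform (rotation omitted for brevity)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


# Assumed measurements: pose of the target frame in each camera's frame.
ref_from_target = translation(0.0, 0.0, 2.0)    # reference camera defines the GCF
cam2_from_target = translation(-1.0, 0.0, 2.0)  # second camera sees the target shifted

# Chain: camera 2 -> target -> reference camera (global frame).
ref_from_cam2 = matmul4(ref_from_target, invert_rigid(cam2_from_target))
cam2_origin_in_gcf = [row[3] for row in ref_from_cam2[:3]]
print(cam2_origin_in_gcf)
```

In this toy setup the second camera ends up one unit along the X axis of the global frame. With real data each transform would also carry a rotation, which `invert_rigid` and `matmul4` already handle.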
In another approach, Lébraly et al. [31] used a planar mirror to calibrate two non-overlapping cameras placed on a vehicle for visual steersmanship. In this approach, the geometry of the scene was not determined. Xu et al. [32] proposed a multi-camera global calibration that unified the coordinates of two binocular vision pairs with non-overlapping views using a planar mirror. The method was then applied to obtain the global unification of the multi-camera system. Huang et al. [33] used a moving robot carrying a planar target to calibrate two fixed non-overlapping cameras. The relative position of the moving robot, the marker placed on it, and the images captured by the cameras were used to calibrate the cameras. This type of approach can be applied in a large camera network because of its simplicity, low cost, and ease of implementation, but the accuracy of the calibration will be very low. Wang and Liu [34] presented a tractable mechanism for choosing the best path a calibration target can take to enhance the result of a camera network calibration.
Zhao et al. [35] used a chessboard augmented reality (AR) marker and a supporting camera to calibrate non-overlapping cameras. The AR marker and the supporting camera were used to estimate the transformation between the marker and the cameras to be calibrated. The effectiveness of this method depends on the resolution of the supporting camera, so it is seriously impaired when calibrating a widely spaced camera network.
Zou and Li [36] used a laser projection method to calibrate inward-facing and outward-facing cameras in a vehicle. A laser pointer was mounted on the calibration target. The two cameras were linked by the laser rays from the pointer, and the pose of each ray was determined using the coordinates of the calibration target boards. The difficulty with this method lies in aligning the cameras with the laser ray on the calibration target to avoid linearity error, which affects the accuracy of the method. Xia et al. [24] described a global method for calibrating multiple cameras without overlapping FOV. Two planar targets are fixed together by a bar whose length equals the distance between the positions of the two cameras to be calibrated. A photogrammetry method was used to determine the relative position of the planar targets. The initial transformation matrix between the two camera coordinate frames was determined using a linear method, which requires a single image captured by each of the two cameras. A global calibration over multiple captured images was then obtained using Levenberg-Marquardt (LM) nonlinear optimization, with the output of the linear method serving as the initial estimate.

III. MULTI-CAMERA SYSTEM ARCHITECTURE
A multi-camera system architecture is a carefully designed structure of cameras in a multi-camera formation. A multi-camera arrangement can take different forms, as reported in the literature. They can be classified as (i) Centralized, (ii) Distributed, (iii) Hybrid, and (iv) Multi-tier [52]-[54]. In a centralized multi-camera architecture, analysis and accumulation of data are performed at a central unit: all the detailed information gathered by the individual cameras is sent to a central system, which is usually a workstation or a camera node. In this type of arrangement, no autonomous decision or processing is done by any of the distributed cameras. The function of the central unit can be any or all of the following: (i) integrating the images or data collected from the distributed cameras on the network, (ii) taking in and dissecting the information gathered from the subordinate cameras, and (iii) controlling the other cameras by serving as a remote access point to them [11]. Many works have used this method, such as Kang et al. [54], Lim et al. [55], Kattnaker and Zabih [56], Lu and Payande [57], Evert et al. [58], Sommerlade and Reid [59], and Piciarelli et al. [60]. All the cameras are connected to the central workstation. Therefore, when two or more cameras exchange data, the data must pass through the central workstation that connects them. Figure 6 (a) shows the connection setup of the centralized architecture.
In a distributed architecture, each camera performs its processing locally. Smart cameras, which combine sensing, digital signal processing, and communication components, are usually employed in a distributed architecture. This approach was used by Kim et al. [61], Quaritsch et al. [62], Rinner et al. [63], Fleck and Strasser [64], and Fleck et al. [65]. A distributed network can also be PC-based, where the distributed cameras are treated as autonomous entities on the network. Examples of this type of arrangement can be found in Qureshi and Terzopoulos [66], Micheloni et al. [67], Morioka et al. [68], Park et al. [69], Hodge and Kamel [70], Li and Bhanu [71], and Song et al. [72]. Figure 6 (b) illustrates the distributed architecture.
The hybrid architecture integrates the features of the centralized and distributed architectures. In this case, the subordinate cameras monitor and capture the target image and also process the image before sending it to the central system or workstation. The workstation assembles the data or images received from the subordinate cameras into the final information. Subordinate cameras make certain decisions at their level, while higher-level decisions are left to the central controlling system or workstation. Prati et al. [73] used a hybrid architecture in an integrated multi-sensor setup. Multi-tier architecture can also be called hierarchical architecture, in the sense that the level of decision making depends on the level of the subordinate cameras in the hierarchy. Multi-tier architecture was applied by Matsuyama and Ukita [74] and Bamberger et al. [75] as an alternative to the hybrid, centralized, and distributed architectures. Figure 7 illustrates the hybrid and multi-tier architectures. The advantages and disadvantages of these architectures are stated in Table 1.
The ubiquitous nature of cameras makes them applicable in nearly every aspect of human endeavor, although some areas remain unexploited. In this paper, the application of the multi-camera system is centered on the following areas: surveillance and tracking systems, sports and entertainment, education, and mobile phones.

IV. MULTI-CAMERA SYSTEM ALGORITHMS
Human tracking is the automatic monitoring of the trajectory of an individual, recorded as a time-ordered sequence of location data for the tracked person. Tracking with a multi-camera system provides complete coverage of human movement from many viewing angles. It therefore produces comprehensive information about the scene, which solves some single-camera problems such as occlusion, FOV limitation, and temporary disappearance [76]. Multi-camera tracking can be online (in real time) or offline (on stored images and videos) [77]. In multi-camera tracking, the correspondence of tracked individuals across the multiple cameras is very important, whereas in single-camera tracking this issue does not arise. Most of the algorithms used in multi-camera tracking are adopted from single-camera tracking, with modifications that cater to multiple cameras. Several schemes for classifying multi-camera tracking systems appear in the literature; some are based on the cameras' FOV [17], others on motion detection or on camera architecture [16]. Figure 8 depicts the multi-camera tracking classification.
Multi-camera tracking with overlapping FOV has been employed by many researchers. For example, Cai and Aggarwal [78] used intensity and geometry features to track people in an overlapping multi-camera setup. A multivariate normal distribution was used to model the densities of the features, and Bayesian classification schemes were used for matching points. Liem and Gavrila [79] presented a multi-person tracking system. Three body regions (head-shoulder, torso, and legs) are projected using overlapping camera views. A color histogram was used to describe appearance, and the tracking measurement was modeled using the Kalman filter. Calderara et al. [80] used overlapping cameras to track objects using consistent labeling methods. The position of the object in the multiple camera views is determined using the homography of the first detection in a camera view. In most of the above-mentioned tracking methods, appearance and motion modeling were carried out using a Kalman filter and a Bayesian network. The Kalman filter allows more precise motion estimation, especially for objects that are viewed by different cameras at the same time. The main drawback of Kalman-filter-based approaches is the assumption of a linear dynamic system with Gaussian posteriors, whereas the motion patterns of people are generally non-Gaussian. This can be partly addressed by the Extended Kalman filter (EKF), which linearizes the model and approximates the posterior estimate as Gaussian [81]-[83]. Another difficulty with the Kalman filter is defining a strategy for identifying sink nodes in a distributed multi-camera network.
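The linear, Gaussian assumption behind the Kalman filter is easiest to see in code. Below is a minimal sketch of a constant-velocity Kalman filter for one position coordinate, written for illustration only; it is not the tracker of any cited work. The function name `kalman_step`, the noise values `q` and `r`, and the diagonal process noise are all assumptions.

```python
def kalman_step(state, cov, z, dt=1.0, q=0.01, r=1.0):
    """One predict/update cycle of a constant-velocity Kalman filter (1D position)."""
    p, v = state
    (a, b), (c, d) = cov
    # Predict the state with the linear model x' = F x, F = [[1, dt], [0, 1]].
    p_pred = p + dt * v
    v_pred = v
    # Predict the covariance: P' = F P F^T + Q, with Q = q * I (an assumption).
    p00 = a + dt * (b + c) + dt * dt * d + q
    p01 = b + dt * d
    p10 = c + dt * d
    p11 = d + q
    # Update with a position-only measurement, H = [1, 0].
    s = p00 + r                # innovation variance
    k0, k1 = p00 / s, p10 / s  # Kalman gain
    y = z - p_pred             # innovation (measurement residual)
    new_state = (p_pred + k0 * y, v_pred + k1 * y)
    new_cov = (
        ((1 - k0) * p00, (1 - k0) * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return new_state, new_cov


# Hypothetical target walking at one unit per frame, observed without noise.
state = (0.0, 0.0)
cov = ((100.0, 0.0), (0.0, 100.0))  # large initial uncertainty
for t in range(1, 21):
    state, cov = kalman_step(state, cov, z=float(t))
position, velocity = state
print(position, velocity)
```

Because both the motion model and the measurement model are linear, the filter recovers the velocity from position-only measurements. A person who stops or turns abruptly violates the constant-velocity model, which is the kind of mismatch the limitations above refer to.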
Tracking objects with overlapping FOV cameras is only possible for objects moving within the cameras' views. Non-overlapping camera tracking is required for blind areas that lie outside the cameras' FOV coverage. Makris et al. [84] proposed a topographical model of a multi-camera network that performs unsupervised learning of activity around the camera views from a large set of observations. What is learned from these data is used for tracking targets across the whole area of the camera network. The method requires no reference for the correspondence, entry, and exit of a target from a camera's view. These are learned using the expectation-maximization algorithm, and the zones are modeled by a Gaussian Mixture Model (GMM). Stauffer [85] proposed a linking structure for non-overlapping cameras. This method models correlations based on inter-arrival times instead of the constant-rate assumption made by Makris et al. [84]. It also provides a means of estimating the validity of a link based on the total distribution instead of a single statistical value. In addition to these approaches, Rahimi et al. [86] presented a method that recovers the path taken by a target across the views of non-intersecting cameras and then reconstructs the calibration of the cameras. The location of the moving target was determined using the ground plane coordinates of the fields of view of the cameras on the network. The performance of the technique was not tested in an online application. Zhang et al. [87] presented a framework for tracking multiple targets in a multi-camera setup. They applied a Structural Support Vector Machine (SSVM) to obtain status information about the targets as a group or as individuals. The multi-target tracking task was converted to a network flow problem, which was solved using the K-shortest paths algorithm. Tesfaye et al. [88] addressed the problem of tracking an object across multiple non-overlapping cameras with a single three-layer approach. A version of standard quadratic optimization (constrained dominant sets clustering) was used to solve the tracking problem. An algorithm based on dynamics from evolutionary game theory was used to solve the inter-camera tracking, which is performed by merging the tracks of the target from all cameras.
The motion detection approach to multi-camera tracking is classified into temporal difference [89] and background subtraction [90]-[92]. The temporal difference technique subtracts two consecutive frames and then thresholds the result. The pixels whose difference value is higher than the threshold are referred to as the foreground pixels. This approach struggles when the background changes over time, and it therefore does not handle the overlapping parts of the camera views well when detecting moving persons. The background subtraction approach removes a background model from the frames of a video scene to determine the foreground pixels. This technique requires the model to be updated for every change in the video scene. It is generally employed in a fixed multi-camera tracking setup [93]. Among the researchers that employed background subtraction are Senior et al. [94], who used 2D and 3D ground planes to determine the positional information of a tracked person. They compared the tracking performance of four different algorithms: background subtraction, a face-detection-based tracker, a feature-matching particle filter, and edge alignment of a cylindrical model. The background subtraction method uses multiple Gaussian color tracking methods. The particle filtering tracker uses frame differencing, which is susceptible to noise and distraction. The face detection approach depends on the detected faces for tracking, and its accuracy can be seriously affected by occlusion, distance from the camera, and light intensity. Lastly, the edge-alignment-based tracker uses a 3D graphical human model made of cylinders connected by kinematic chains. The method can fail after initialization, which then requires a re-initialization strategy. Liang et al. [95] presented multi-camera collaboration using head detection and the trifocal tensor point transfer method. They used Kalman and PDA algorithms for tracking people and applied background subtraction to detect the head position.
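The difference between the two motion detection techniques can be illustrated with a small sketch on grayscale frames stored as nested lists. This is a toy illustration under stated assumptions: the function names, the threshold, and the running-average background update are hypothetical choices and do not reproduce the multiple-Gaussian model used in [94].

```python
def temporal_difference(prev_frame, curr_frame, threshold):
    """Mark pixels whose intensity changed by more than threshold between frames."""
    return [[abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_frame, curr_frame)]


def background_subtraction(background, frame, threshold):
    """Mark pixels that differ from the background model by more than threshold."""
    return [[abs(f - b) > threshold for b, f in zip(bg_row, fr_row)]
            for bg_row, fr_row in zip(background, frame)]


def update_background(background, frame, alpha):
    """Blend the new frame into the model with a running average (assumed rule)."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(bg_row, fr_row)]
            for bg_row, fr_row in zip(background, frame)]


# Hypothetical 2x3 scene: an empty background, then a bright object that
# appears in frame 1 and stays still in frame 2.
background = [[10, 10, 10], [10, 10, 10]]
frame1 = [[10, 10, 10], [10, 200, 10]]
frame2 = [[10, 10, 10], [10, 200, 10]]

appeared = temporal_difference(background, frame1, threshold=30)
still_by_difference = temporal_difference(frame1, frame2, threshold=30)
still_by_background = background_subtraction(background, frame2, threshold=30)
updated_model = update_background(background, frame2, alpha=0.5)
print(appeared, still_by_difference, still_by_background)
```

In this example the object is detected by temporal differencing only while it is changing between frames, whereas background subtraction keeps detecting it while it stands still. The price is that the background model has to be maintained, here with `update_background`, as the scene changes.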
Classification based on camera architecture depends on how the data acquired by the multi-camera sensors are processed. There are two categories. In the first category, detection and tracking are carried out after fusing the information acquired from the different sensors. This is referred to as a centralized approach. In the second category, called a distributed approach, each camera on the network performs its detection and tracking separately, and the outputs of the cameras are combined to obtain the final tracking trajectories. Figure 9 depicts this classification.

A. CENTRALIZED APPROACH
In this approach, the data obtained from the different cameras on the network are first combined before the people are tracked. The data obtained from each camera are usually preprocessed before being fused. Data fusion is used to obtain scene information from different camera views. Some researchers used data fusion to establish a ground plane occupancy map, for example, the works by Khan and Shah [96], Figueira et al. [97], Baltieri et al. [98], Minh et al. [99], and Chen et al. [100]. Others used it to establish the 3D geometry of the tracked person in the scene, such as the works by Li et al. [101], Hirzer et al. [102], Bouma et al. [103], Wang et al. [104], Wen et al. [105] and Brendel et al. [106]. A centralized multi-tracking approach has a well-coordinated control center, which makes the system more effective. Its limitations include camera synchronization and redundancy problems, and it is not suitable for online applications. Figure 10 shows a diagrammatic representation of the centralized approach. The centralized approach is usually applied in an overlapping multi-camera setup to reduce noise and occlusion problems. Table 2 gives a summary of the centralized approaches used by different researchers.
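As a rough illustration of fusing camera views into a ground plane occupancy map, the sketch below projects each camera's foot-point detections onto a common grid and counts how many cameras vote for each cell. It is a simplified sketch under stated assumptions, not the method of any cited work: each camera is assumed to have a known 3x3 image-to-ground homography, and the names `apply_homography`, `occupancy_votes`, and `occupied_cells`, as well as the grid size and cell size, are hypothetical.

```python
def apply_homography(H, x, y):
    """Map an image point (x, y) to ground-plane coordinates with a 3x3 homography."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def occupancy_votes(detections_per_camera, homographies, grid_w, grid_h, cell=1.0):
    """Count, for every ground cell, how many cameras report a detection there."""
    votes = [[0] * grid_w for _ in range(grid_h)]
    for detections, H in zip(detections_per_camera, homographies):
        seen = set()  # a camera votes at most once per cell
        for x, y in detections:
            gx, gy = apply_homography(H, x, y)
            col, row = int(gx // cell), int(gy // cell)
            if 0 <= col < grid_w and 0 <= row < grid_h:
                seen.add((row, col))
        for row, col in seen:
            votes[row][col] += 1
    return votes


def occupied_cells(votes, min_cameras):
    """Cells supported by at least `min_cameras` views, in row-major order."""
    return [
        (r, c)
        for r, row in enumerate(votes)
        for c, count in enumerate(row)
        if count >= min_cameras
    ]
```

Requiring agreement from several views is one simple way to suppress single-camera noise and occlusion errors, which is the motivation the text gives for using the centralized approach with overlapping cameras.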

B. HYBRID APPROACH
This approach balances the limitations and advantages of the centralized and distributed approaches. It groups the cameras into clusters, and detection, tracking, and data fusion are done within each cluster using a centralized approach. Person re-identification is carried out by combining the feature signatures obtained from different clusters on the camera network to create the trajectories of the tracked persons. Among the studies that embraced this approach is Kim et al. [61], who built a clustered wireless network of cameras that employed a Kalman filter to track people. Information communication is tracked based on the cluster location of the camera, which determines the position of the tracked person. The cluster head of each cluster then transmits the gathered information to the central station, where the trajectories of the tracked individuals are built. Besides, Agrawal and Davis [25] proposed a hybrid formation when extracting features from images. They used an occupancy map (ground plane) at the fusion node using the centralized approach and then performed ground plane color mapping for tracked persons using a color likelihood for each camera in a distributed approach. The color functions are then combined at the fusion node to produce a multiview color function. The trajectories of the tracked people are obtained by combining the occupancy map and color map results for each individual.

C. DISTRIBUTED APPROACH
The distributed approach consists of autonomous camera nodes that perform detection and tracking of people independently and then collate the correspondences of the tracked individuals from the different cameras' views. The collation of the correspondences is done in such a way that the trajectories of the tracked persons obtained in the first camera are compared with the correspondences in other cameras before fusing them to obtain the final trajectory. The image obtained from the first camera used to compensate for images obtained from other cameras is called the probe image and the process is referred to as person re-identification (Re-ID). The distributed approach is generally applied in a non-overlapping multi-camera setup. Figure 11 shows a diagrammatic representation of the distributed approach.
The major advantage of the distributed multi-tracking approach is its scalability, which makes it easy to deploy in a disjoint (non-overlapping) multi-camera setup without reference to the geometric relationship between the cameras. It is suitable for online applications. The major setbacks of this approach are that the failure of one camera renders its section of the network non-trackable and that occlusion is difficult to manage. Table 3 shows a list of studies that applied the distributed approach.
Recognizing and tracking people across spatially disconnected cameras requires knowledge of the transition from one camera to another. Person Re-ID provides a means of identifying people across disjoint cameras. Person Re-ID methods can be classified into two groups: (i) learning-based and (ii) non-learning-based methods.

1) LEARNING-BASED RE-ID
Person Re-ID in this approach is based on learning the distinct features of an individual from a dataset and extracting those features, which are then used to re-identify the person from the images of the tracked people. What differentiates the learning-based approaches is the learning metric (classifier) used to identify or re-identify a person. Some of the metrics or classifiers that are normally utilized are the shape model [127], boosting method [128], deep learning [123], [129], support vector machine (SVM) [130], [131], descriptor learning methods [132]-[135], distance learning methods [136], [137], nearest neighbor approach [108] and least square reduction [138].
Bak et al. [128] used an Adaboost-based learning algorithm with Haar-like features for person Re-ID. They employed the histogram of oriented gradients (HOG) for detection and tracking. After detection and tracking, the Haar-like features of the tracked people are used to obtain a visual signature using the Adaboost scheme. The extracted visual signature was used to re-identify people in other camera detections. Su et al. [139] presented three stages of deep CNN learning metrics for person Re-ID, while Chen et al. [116] used a deep learning approach that employed a fusion pyramid multi-scale metric for Re-ID.
Another learning approach is the descriptors-based type. The approaches under this method combine the appearance and shape features to track people. The learning metric is used to identify the features to be extracted from the tracked individual. Huang et al. [33] presented a discriminative approach that used an algorithm to learn and extract appearance-based descriptors from the body of each tracked person. The extracted features are used to create a model using a multi-scale feature learning framework obtained from the work of Qian et al. [140]. Also, Kuo et al. [141] have presented an approach that used a discriminative appearance learning feature together with the Adaboost algorithm for a group of appearance features extracted from tracked individuals in a different location. The obtained discriminative feature is used to identify who has a close affinity to a tracklet.
Avraham et al. [125] introduced an algorithm that is trained on correspondences of two people obtained from two overlapping camera views. The algorithm uses color histograms to decide whether two tracked individuals are the same: a positive pair if the people detected in the camera views are the same person, and a negative pair for two different individuals. A binary SVM classifier was used to differentiate between the two kinds of pairs. Similarly, Nakajima et al. [142] used a multi-class SVM classifier to detect the full body of a person. The SVM classifier was trained using color and shape-based features obtained from the tracked individual. Another study by Martinel et al. [143] presented a method for training multiple re-identification models using features obtained from pairs of images of the same or different individuals. The models are then used to decide whether new pairs of images show the same person.
A learning-based distance approach uses similarity measures in terms of the distance between pairs of images or features of an individual in different camera views. The distances (differences) between images that represent the same person (positive pair) and those of different persons (negative pair) are used for Re-ID instead of learning the individuals' visual features. One of the studies under this method was proposed by Hirzer et al. [136], where person Re-ID is divided into three stages: (1) feature extraction, (2) metric learning, and (3) classification.
Features like color, Haar-like, shape-based, body parts and texture are distinct characteristics for human identification. HSV, Lab color channels, and Local Binary Patterns can be used to extract the local features to create a global image representation. At the learning stage, an algorithm (for example, PCA) is used to reduce dimensionality and noise. However, a small dataset or low dimensional representation is enough. During the evaluation, the distance between two samples, for example $X_m$ and $X_n$, is calculated, where $X_m$ and $X_n$ describe the camera views of the same person. At the classification stage, the task is to find the image (probe image) that was first obtained from one of the cameras' views among all the images (gallery images) obtained from all other cameras. Similarly, Makris et al. [84] used relative distance comparison learning between a pair of true image matches and a pair of false image matches to perform re-identification. The comparison model was applied to the texture and color histogram features extracted from the tracked people.
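The evaluation and classification stages above can be illustrated with a short sketch that ranks gallery images by their distance to a probe image. It is a minimal sketch under assumptions: the feature vectors are taken as already extracted, the matrix `M` stands in for whatever a metric learning stage would produce (the identity matrix gives the plain squared Euclidean distance), and the names `metric_distance` and `rank_gallery` are hypothetical.

```python
def metric_distance(x_m, x_n, M):
    """Squared distance (x_m - x_n)^T M (x_m - x_n) under a metric matrix M."""
    d = [a - b for a, b in zip(x_m, x_n)]
    size = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(size) for j in range(size))


def rank_gallery(probe, gallery, M):
    """Return gallery identities ordered from the closest to the farthest match."""
    scored = [(metric_distance(probe, feat, M), pid) for pid, feat in gallery.items()]
    return [pid for _, pid in sorted(scored)]


# Toy two-dimensional features; real descriptors are much longer.
identity = [[1.0, 0.0], [0.0, 1.0]]
gallery = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.5]}
ranking = rank_gallery([1.0, 0.0], gallery, identity)
```

A learned `M` changes which feature dimensions matter, so the same probe can produce a different ranking; the first entry of the ranking is the re-identification decision in this simplified setting.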
Gou et al. [144] presented a comprehensive performance evaluation of person Re-ID (feature extraction and metric learning) algorithms. Table 4a and Table 4b summarize the algorithms. It was demonstrated that kLFDA and kMFA were the best metric learning methods because they resolve the eigenvalue problems of the scatter matrices, and that GOG, LDFV and LOMO were the best feature extraction algorithms, as shown in Table 4b.
Finally, Shen and Hornsey [43] presented an approach that is based on the space-time and appearance relationships between every pair of camera views. It is used to determine whether the observations in two camera views belong to the same person or to different persons. The appearance relationship is made up of the brightness transfer function (BTF) between the two camera views, while the space-time correspondence holds the information about the entrance point, exit point, direction of movement, and the time required for the tracked person to move from one camera view to the other.
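To make the appearance relationship more concrete, the sketch below estimates a brightness transfer function between two camera views by matching normalized cumulative histograms, which is one common construction in the Re-ID literature. Whether the cited work builds its BTF in exactly this way is an assumption, and the function names and the small number of brightness levels are chosen only for illustration.

```python
def cumulative_histogram(pixels, levels):
    """Normalized cumulative histogram of integer brightness values."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    total = float(len(pixels))
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf


def brightness_transfer_function(pixels_i, pixels_j, levels=256):
    """Map each brightness level of camera i to a level of camera j.

    For level b_i, pick the smallest level b_j whose cumulative frequency in
    camera j reaches the cumulative frequency of b_i in camera i.
    """
    cdf_i = cumulative_histogram(pixels_i, levels)
    cdf_j = cumulative_histogram(pixels_j, levels)
    btf, b_j = [], 0
    for b_i in range(levels):
        # Both curves are non-decreasing, so b_j only ever moves forward.
        while b_j < levels - 1 and cdf_j[b_j] < cdf_i[b_i]:
            b_j += 1
        btf.append(b_j)
    return btf


# The same object seen one brightness level higher in camera j (4 levels only).
btf = brightness_transfer_function([0, 0, 1, 1, 2, 3], [1, 1, 2, 2, 3, 3], levels=4)
```

Applying `btf` to the pixels from camera i before comparing color histograms compensates for illumination and sensor differences between the two views.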

2) NON-LEARNING-BASED RE-ID
These are person Re-ID methods that directly extract discriminative features from a tracked person without any training data. Non-learning-based Re-ID relies on the signature used to represent the tracked person, which can be extracted from local or global features. Appearance features, which include shape, texture, and color, are the features commonly employed in non-learning person Re-ID. One of the relevant studies that embraced this method is the work of Aziz et al. [153], which used the front or back appearance of the tracked individuals to perform person re-identification. They used SIFT and SURF algorithms to extract the discriminative signature of three segmented body levels (i.e., head, legs, and torso). For re-identification, the extracted signatures are matched with the body parts of the detected individuals. The effectiveness of the approach can be improved by employing a pose estimation approach, for example, the one presented by Hong et al. [154], [155], before performing person re-identification. Truong Cong et al. [119] used localized features of the body parts and extracted their descriptors for person re-identification, while Alahi et al. [156] presented an approach for creating an object descriptor built using cascaded grids of descriptors.
Wang et al. [157] presented an approach that combines shape and appearance features to prepare a matrix that serves as the descriptor of the tracked people. The descriptor is built based on the captured close relation between the appearance labels. The same approach was used to build a framework for real-time computation of the occurrence matrix. Besides, Truong Cong et al. [119] presented a color-position histogram obtained from the silhouette of the tracked persons. The color-position signature was divided into regions, and RGB values were extracted from each region to create a discriminative signature employed for matching tracked individuals from different camera views.
Jüngling et al. [127] presented a person Re-ID approach for a multi-camera setup that used the Implicit Shape Model (ISM) and SIFT features. The approach performs human detection and tracking, and SIFT feature models are built during tracking. ISM is used for human detection and tracking, while SIFT serves the purpose of feature extraction. The approach is sensor independent (homogeneous or heterogeneous) because it does not use color or other sensor-specific features for re-identification. Also, Hamdoun et al. [158] presented a method of identification using interest points collated from different motion images. The SURF algorithm was used to obtain the feature points with descriptors to compute a signature for the tracked persons.
Among the approaches based on local features, Wong et al. [17] presented a method that used local features like the color histogram, size, intensity gradient, and position of the tracked individuals to perform tracking using data association algorithms. Another approach in this group is the work of Piciarelli et al. [60], which used appearance features obtained from the upper body of the detected person, together with the spatial position, for re-identification in other camera views.

D. INFORMATION FUSION IN A MULTI-CAMERA SYSTEM
The fusion of information or data aims to produce something better than what can be obtained from each source separately. Information fusion in a multi-camera system is the combination of different features extracted from the same object, from different instances of the same object, or from the same scene of an object seen by different cameras. Camera views in a multi-camera system are best described as overlapping or non-overlapping, and information fusion is a multisource data fusion process across these views. Therefore, information fusion is discussed here from the perspective of overlapping and non-overlapping cameras.

1) INFORMATION FUSION IN OVERLAPPING CAMERAS
Many information fusion approaches have been applied by researchers for overlapping cameras. Some of the approaches are particle filtering (Sequential Monte Carlo methods), Bayesian estimation, support vector machine (SVM), and Kalman filtering. For example, Lu and Li [29] presented a novel image fusion approach that integrated particle filters and belief propagation in a unified framework. A dedicated particle-filter tracker was applied in each camera's view. Each local tracker works together with the trackers in other views through belief propagation as a means of passing genuine and important information across the views. Also, Zhao et al. [30] presented a distributed Bayesian formulation using multiple interactive trackers. The approach avoids a joint state-space representation to prevent tedious and complex joint data union, which makes it better suited to online real-time tracking applications. However, a conventional Bayesian tracking framework was modeled for problems like proximity, occlusion, or interactions between the objects' observations. Lébraly et al. [31] integrated Bayesian particle filtering with Dempster-Shafer theory to fuse the pieces of evidence obtained from multiple heterogeneous and unreliable sensors. The approach was used to solve the problem of tracking people in a multi-camera indoor environment. Xu et al. [32] also used the Bayesian framework to track a variable number of 3D persons in an overlapping multi-camera setup. In their approach, they employed a joint multi-object state-space formulation in which each object state is defined in 3D. This approach is not applicable in real-time applications because of the computational complexity and the large joint data required. Huang et al. [33] presented a system that fuses tracking information in an overlapping multi-camera setup using an approach called mapview mapping. A particle filtering algorithm was adopted for target tracking, and information fusion was performed according to the reliability and weighting of each source of information.
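Fusion according to the reliability of each source can be illustrated with a normalized weighted average of per-camera position estimates. The text does not specify the weighting rule used in the cited system, so this is only an assumed, simplified form; `fuse_estimates` and the example weights are hypothetical.

```python
def fuse_estimates(estimates):
    """Fuse ((x, y), weight) pairs into one position by a weighted mean.

    A larger weight means the source is considered more reliable.
    """
    total = sum(weight for _, weight in estimates)
    if total <= 0:
        raise ValueError("at least one source must have a positive weight")
    x = sum(pos[0] * weight for pos, weight in estimates) / total
    y = sum(pos[1] * weight for pos, weight in estimates) / total
    return x, y


# Two cameras report the same target; the second is trusted three times more.
fused = fuse_estimates([((0.0, 0.0), 1.0), ((4.0, 2.0), 3.0)])
```

In practice the weights could come from detection confidence, the amount of occlusion in each view, or the distance of the target from the camera.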

2) INFORMATION FUSION IN NON-OVERLAPPING CAMERAS
Other researchers applied information fusion in a non-overlapping camera setup. For example, D'Orazio et al. [159] proposed tracking of people in a multi-camera setup using appearance similarity. The color histogram mapping of the same object from different views was performed. The mean brightness transfer function (MBTF) and cumulative brightness transfer function (CBTF) were used to determine the appearance similarity, but the performance of the two is quite similar for simple association problems. Both methods struggled to detect a new entry into the scene. For this method, further research can be centered on the major differences between two similar bodies so that a technique can be applied based on those differences. Chilgunde et al. [160] presented a real-time tracking system for targets in a non-overlapping multi-camera setup. A Kalman filter was used to select the targets based on shape and motion from each camera's view. For areas that are not covered by the cameras, a prediction is made for each camera's view using the Kalman filter prediction. Gaussian distributions of the tracking parameters were computed for each camera to determine the target position and motion. Lin and Huang [161] also presented a framework that applied a client-server system for tracking targets in a multi-camera setup. The client manages single-camera object detection and tracking, and the server is responsible for collaboration between the multiple cameras. A Kalman filter was applied for object detection and tracking in a single camera view, and homography was used to determine the FOV lines for each camera's view. The features identified from the single-camera object tracking are fused for object matching, and the FOV lines are used for switching between the cameras. The framework was designed to unify object tracking for overlapping and non-overlapping multi-camera systems. Leoputra et al. [162] proposed a unified framework that used a particle filter to track target objects in a non-overlapping multi-camera system. The blind section between the camera views is handled by switching between tracking predictions and visual tracking. The trajectory information of the target object as it moves through the blind areas is mapped by the particle filter algorithm.
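The idea of predicting a target through an area that no camera covers can be sketched with the prediction step of a Kalman filter. The example assumes a one-dimensional constant-velocity motion model with state `[position, velocity]` and a simple process noise `q` added to the diagonal of the covariance; these modeling choices and the function names are assumptions for illustration, not the configuration of the cited systems.

```python
def kalman_predict(state, cov, dt, q):
    """One constant-velocity prediction step.

    state = [position, velocity]; cov = 2x2 covariance matrix.
    Implements x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]
    and Q = q * I (an assumed, simplified process noise).
    """
    position, velocity = state
    new_state = [position + velocity * dt, velocity]
    p00, p01 = cov[0]
    p10, p11 = cov[1]
    new_cov = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return new_state, new_cov


def predict_through_blind_area(state, cov, dt, steps, q=0.01):
    """Repeat the prediction while no measurement is available."""
    for _ in range(steps):
        state, cov = kalman_predict(state, cov, dt, q)
    return state, cov


# A target leaves a camera view at position 0 moving at 1.5 units per step.
final_state, final_cov = predict_through_blind_area(
    [0.0, 1.5], [[1.0, 0.0], [0.0, 1.0]], dt=1.0, steps=4
)
```

The position uncertainty in `final_cov` grows with every step without a measurement, which gives the next camera a search region for re-acquiring the target rather than a single point.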
Bauml et al. [163] presented a system for online capture and recognition that uses facial features for person re-identification in a non-overlapping multi-camera system. The system combines the support vector machine (SVM) and the Discrete Cosine Transform (DCT) to extract and classify the facial features. The challenges of using face appearance features for person re-identification, like low-resolution facial images, lighting, and pose, are addressed. Since people's faces are sometimes very similar, future techniques should integrate a face tracker with a body tracker or associate face tracking with other closely related features such as clothing for better identification. Avraham et al. [125] proposed an approach that modeled a camera with a transfer function for a multi-valued transformation for pedestrian re-identification. The system is metric independent and does not depend on learning object appearance from one domain to another. The approach used a binary SVM classifier with a radial basis function (RBF) kernel to reveal the unapparent differences between the camera views. Prosser et al. [123] described person re-identification as a ranking problem rather than a distance measuring problem. They introduced a novel ensemble RankSVM algorithm to perform the actual ranking and to reduce the computational time suffered by the original SVM approach. The approach is more scalable, and it requires less memory.

V. APPLICATION OF MULTI-CAMERA IN SURVEILLANCE SYSTEMS
In literature, many papers have been written on the application of multi-camera, especially in surveillance systems. Table 5 shows the list of studies that have been conducted on the application of multi-cameras in the surveillance system and their description.
The earliest research carried out on the multi-camera system was performed with fixed view cameras. The cameras employed had a low resolution because a wider view angle was used to cover wide areas. Eventually, with the emergence of Pan-Tilt-Zoom (PTZ) cameras, a clear image of the surveillance object can now be obtained. Initially, the problem of non-overlapping FOV of the static cameras prevented the solution to occlusion and localization of images in the real world. With the invention of PTZ cameras, a particular area of interest (AOI) can now be focused on. PTZ cameras are mostly configured as a master-slave system to quickly sense and capture a human face and objects' shape in any direction.
There has been a series of developments in the area of multi-camera system control and coordination, especially for heterogeneous camera systems. PTZ cameras are configured as a master-slave setup to collaborate, sense and monitor targets in a multi-camera arrangement. Several approaches like game theory, decision theory, and control theory were used to create collaborative monitoring among cameras, while others applied optimization structures [11]. The discovery of smart cameras contributed a lot in the area of object tracking, on-board processing and obtaining detailed information about the captured image from the scene. In distributed smart cameras (DSCs), information is shared between the individual cameras with distributed sensing and processing capability in a smart camera network. Pervasive smart cameras (PSCs) provide autonomous and adaptive functioning of smart cameras, which makes them easy to use and operate in various areas of application. With this development in smart cameras, the attention of researchers was directed towards self-control and collaboration among distributed cameras [52].
Many works in the literature have identified different problems associated with surveillance using multi-camera systems, while some have proffered solutions to the problems identified by others. Presently, surveillance problems using multi-camera systems rest on multi-camera coordination when multiple targets are to be observed. The number of targets that a particular camera can observe without compromising the quality of the captured image has not been stipulated, especially when the camera-to-target ratio is low. Also, occlusion in a multi-camera setup is a persistent issue for areas where there are physical obstructions and privacy issues. Furthermore, there is a problem with computation or processing, especially for real-time applications. Surveillance is a real-time activity that demands a lot of processing power from the cameras on a surveillance network. Presently, the processing capabilities of smart cameras still fall short when many cameras are involved. Even if the smart cameras can process faster, the communication medium still limits the rate of data transfer from one camera to another because of the traffic load.

VI. APPLICATION OF MULTI-CAMERA AND IMAGE OVERLAY IN SPORT ANALYSIS
A multi-camera system has a wide area of application in virtually every aspect of human activity [179], and entertainment is not an exception. This area ranges from film making, sports, and TV shows to other performances or activities that keep people's attention. This study will concentrate on the applications of the multi-camera system in sports, where football and other games will be the center of discussion. For example, the automatic display of views that draw special attention in sports programs, video-on-demand display [180], and other aspects like tracking of the football as a way of supporting comments by pundits in sports videos [181] will be the areas considered for review. Needham and Boyle [182] applied a monocular non-moving camera to monitor players in an indoor five-a-side soccer game. The tracking of the players was done with an algorithm called ''CONDENSATION''. This was carried out with a single low-resolution camera.
The single-camera view system has some shortcomings, ranging from low-resolution image output and lack of total coverage of the pitch due to the limited FOV of the camera to player occlusion, especially when the players line up in the direction of the camera. Also, the position of the ball on the pitch, or the ball rolling on the ground, constrains the 3D view of the ball when a single camera is used. This further limits the application of a single-camera view because the direction of the movement of the ball draws the special attention of the viewer [183]. Table 6 shows a summary of the applications of multi-camera systems in sports.
The earliest research according to Table 6 on the application of camera(s) in entertainment or sporting activities was a graphical overlay on a calibrated camera image. This is done by placing a sensor on the camera stand and lens, which allows the recording of the camera's panning, tilting, and zooming. This sensor allows images to be overlaid on the moving video and placed as a background for the video recorded by the camera. The method can also be used to create graphics that measure the distance of the player to the goal post and mark the off-side line by manually calibrating the position of the camera with regard to the scene. This was one of the reasons why video images and a single camera were the most commonly applied sensors in the first type of graphic technology in sport [191], [195]-[197]. Furthermore, a typical application of a mechanical sensor on a camera stand together with other groups of cameras was reported in [185]. The 'FoxTrax' was a puck-tracking system that used graphics overlay to display the movement of an ice hockey puck. Twenty IR LEDs were implanted into the puck, and the signals of the LEDs were picked up by a set of IR cameras hung from the roof. The position of the puck was then located using the images of the IR cameras. To make the puck easy to locate during motion, the system added a 'comet tail' and a blue trailing light [198].
The introduction of image processing into sporting display brought about a color segmentation algorithm (chroma keyer), which allows the segmentation of the foreground color from the background. This technology relies on the background color of a sporting field being uniform, for example, the green background color of a soccer pitch. A typical application of this can be found in '1st and Ten™' [199], where graphics in American football and ski race coverage are shown as if they were physically present. The application of the multi-camera view system to soccer game coverage or sports analysis in general tackles the single-camera shortcomings. Multi-camera system application in sports coverage and analysis provides a wider FOV that covers the whole pitch of play, reduces dynamic occlusion, and allows camera output integration or collaboration. However, a multi-camera system application in sports coverage requires camera calibration, and the camera position is also paramount [200]. Generally, the multi-camera system has found its purpose in various sports applications such as football, ice hockey, snooker, diving and ice skating, downhill ski racing, tennis, and tracking of an athlete.
When viewing sports activities, it is difficult to show how an athlete's motion or an object's trajectory, such as that of an ice hockey puck, a football, or a tennis ball, evolves over time and space. This is because the FOV of a fixed photographic camera cannot capture the entire spatial and temporal extent of the motion. A new idea was employed in which a set of cameras is arranged along the path of the athlete or object to be monitored to take snapshots of the athletes or objects as they pass. The resulting images of the motion are joined together to form the total view of the event. A typical example of this was applied in [188] to show athletes' motion in sports like ice skating and swimming. Compared to image overlay, this method improves player resolution and provides full coverage of the total sporting event with the application of multiple fixed cameras [183].
Steins [201] and Del Bimbo et al. [202] reportedly computed homography transformation for images obtained from the field of view of two overlapping uncalibrated cameras. All the targets on the images are considered to be on the same plane. Therefore, the homography computation result was meant for all the targets between the FOV of the two cameras. Khan et al. [203] and a similar work presented by Cai and Aggarwal [184] monitor a target using uncalibrated multi-cameras. It was highlighted in their approach that for cameras with overlapping FOV, the current camera should hand the field of view to the neighbor camera once the target leaves its field of view.
A different approach to the above method was the use of calibrated cameras to create a 3D position of an athlete or ball in a sporting event. Several approaches have been discussed in the literature. The first uses the foreground segmented image of the athlete together with calibrated cameras, where the lowest point of the segmented image of the athlete is assumed to be in direct contact with the ground. Another approach was to place the foreground segmented image of the player on a 3D model of a stadium. This allowed the seamless creation of a virtual view of the game, which can be tilted to a direction of view other than that of the physical camera. This approach was employed by Red Bee Media (RBM) [204] in segmenting players' images from the pitch.
The problem with the player or stadium modeling approach was that there are limitations in its applicability when using a single camera. The degree of tilting the virtual camera view relative to that of the physical camera is too small and the players' occlusion problem is also difficult to resolve. An alternative approach to the above method was the use of the multi-camera method. This approach made use of high frame rate calibrated broadcast cameras to capture the pitch or field of the sport. The system was first used for tracking tennis game in a 3D view [190].
One more peculiar problem in sport was the determination of the location and tracking of the individual players and the ball. There are many approaches to this problem in the literature. It is generally assumed that tracking players is more difficult than tracking the ball for some obvious reasons. The ball rolls on the pitch alone and has a definite shape, while the players are many and run after the ball [193]. Therefore, there will be a problem of occlusion between players. The ball has a particular pattern of movement which can easily be modeled, but players move erratically following the direction of the ball. In addition, players need to be identified, whether by number or jersey color, before being tracked.
Distributing multiple fixed cameras at different locations around the pitch was one of the methods [193] applied in tracking players and the ball, though it requires more cameras and is somewhat costly to implement. For commercial purposes, most multi-camera-based player tracking systems combine automated cameras with a manual camera tracking approach. Moments of spectacular tackling and accurate passes are identified and manually logged during the live match, and later highlighted for analysis at the end of the match.
Another method is the use of two cameras positioned to work together, then applying the triangulation method to determine the points in 3D space. Here, the projection matrices of the cameras involved (the camera projection function from 3D to 2D) must be known.
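As an illustration of two-camera triangulation, the following is a minimal sketch of linear least-squares triangulation, assuming the 3x4 projection matrices are already known from calibration. The matrices `P1` and `P2`, the test point `ball`, and all function names are hypothetical and are not taken from any of the surveyed systems.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def solve_3x3(a: Matrix, b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate camera geometry")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def project(p: Matrix, point: Sequence[float]) -> Tuple[float, float]:
    """Project a 3D point to pixel coordinates with a 3x4 projection matrix."""
    xh = list(point) + [1.0]
    u, v, w = (sum(p[i][j] * xh[j] for j in range(4)) for i in range(3))
    return u / w, v / w


def triangulate(cameras: Sequence[Matrix],
                pixels: Sequence[Tuple[float, float]]) -> List[float]:
    """Recover a 3D point from its pixel positions in two or more views."""
    rows = []
    for p, (u, v) in zip(cameras, pixels):
        # Each view contributes two linear constraints on (X, Y, Z, 1).
        rows.append([u * p[2][j] - p[0][j] for j in range(4)])
        rows.append([v * p[2][j] - p[1][j] for j in range(4)])
    # Normal equations of the least-squares problem A[:, :3] X = -A[:, 3].
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [-sum(r[i] * r[3] for r in rows) for i in range(3)]
    return solve_3x3(ata, atb)


# Hypothetical cameras: one at the origin, one shifted 1 unit along x.
P1 = [[800.0, 0.0, 320.0, 0.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P2 = [[800.0, 0.0, 320.0, -800.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
ball = [0.5, 0.2, 10.0]
estimate = triangulate([P1, P2], [project(P1, ball), project(P2, ball)])
```

With noise-free pixel positions the estimate reproduces the original point; with real detections the least-squares solution averages the disagreement between the views.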
Much has been done in applying multi-camera systems to sports and entertainment, ranging from image overlay, a single camera with a mechanical sensor, and a single camera with an image overlay to multiple cameras. Monitoring the position of the limbs of an athlete during sporting activities remains a challenge in this field of study. A marker-based motion approach [200] has been proposed, but it is only applicable during training and exercise; in live-match analysis it is practically not possible. Future studies should focus on automated systems for tracking athletes' limbs during live matches or in recorded videos.

VII. APPLICATION OF MULTI-CAMERA IN EDUCATION
The wind of the digital revolution blowing across all facets of our lives has brought tremendous growth and ease to the way we traditionally do things. Environmental digitalization is at the forefront of creating new ways of human interaction. The ability to carry out unobtrusive monitoring of distinct and trackable actions with large audiences gives us the chance to look at diverse facets of human undertakings. The ubiquitous presence of cameras in most automation applications, especially those that deal with light sensing, photo capturing, monitoring, and tracking, has made cameras relevant in every facet of our lives. For example, in education, technological inventions have made it possible to consider scenarios where teaching is carried out without a teacher (intelligent teaching systems) [205] or, from another perspective, where the students taught are not physically present (distance learning) [206].
The latest technological innovations focus on the scale at which lectures can be held and the possibility of transmitting them with quality to large numbers of students (Massive Open Online Courses (MOOCs)) [207], [208]. All these are made possible by advancements in video capturing, processing, compression, and delivery techniques. It all started in the late 1960s outside the classroom, with what is now referred to as distance learning. Distance education is based on noncontiguous communication between a school (represented by the teacher) and its students [209]. In other words, it is two-way communication. The first direction is from the teacher or supporting organization to the student, through the sending of instructional and other learning materials. The other direction is feedback from the students to the teacher. Other areas where cameras are applicable in education include the estimation of students' attention, smart attendance monitoring, teacher evaluation/appraisal and feedback on student performance, automatic student face recognition/detection, and the security and protection of teachers and students. All these can be achieved through technological innovations in education, and they are mainly carried out by the application of camera or multi-camera systems. Several papers related to camera or multi-camera applications in educational settings are found in the literature. Table 7 summarizes the related survey papers on camera or multi-camera applications in education. As described by most papers, the majority of the work where cameras or multi-camera systems are applied focuses on lecture capturing and broadcasting, students' attention estimation, teacher evaluation, protection and security of students and teachers, attendance monitoring, face recognition and detection, motion detection, and behavior analysis during lectures.

A. EVOLUTION OF CAMERA AND MULTI-CAMERA IN EDUCATIONAL SYSTEM
This section briefly discusses the evolution of multi-camera systems in education. The application of multi-camera systems started around 1985, when the majority of their usage was in object detection and recognition. The evolution of multi-camera systems was driven by the limitations of the single camera in terms of area of coverage, image resolution, and accuracy of the image output. This led to the introduction of multi-camera systems, in which accuracy is improved by fusing the images obtained from multiple cameras, thereby increasing the resolution of the output image and allowing a wider area of coverage. Earlier work on multi-camera systems focused on object detection and image classification [220]. Up to the early 2000s, the majority of the cameras used in surveillance and object tracking were fixed, low-resolution cameras. Although they have a wide view, they cannot capture distant objects. Due to these shortcomings, the Pan-Tilt-Zoom (PTZ) camera was introduced to produce high-resolution images with the ability to capture distant target objects. The first PTZ cameras were manually configured to capture specific target areas, for example, focusing on the presentation slide board, surveilling the entry and exit points, and tracking individual students' faces for recognition. In multi-camera setups, PTZ cameras were later configured automatically to focus on specific areas based on defined features, or configured as master-slave for automatic control and coordination [11].
In the early 2010s, smart cameras and wearable devices were introduced to carry out intelligent monitoring of events and tracking of human limbs. For example, Bianchi and Way [221] described the concept of the automatic auditorium (AutoAuditorium system), in which audiovisual lectures given in a lecture auditorium are captured automatically in real time without any human control apart from turning the system on and off. Erdmann and Gabriel [222] applied an automated smart camera system to perform audio-visual tracking of the lecturer and audience in a lecture hall. The system also performs automatic video editing with quality close to that of a human-coordinated system. This setup completely automates event or lecture capturing for distribution and easy access by students. Human action recognition and position monitoring can generally be performed in two ways: with wearable sensor-based devices or with vision monitoring devices. Wearable sensor-based devices are worn by the target to determine or monitor his/her activities [223]. This approach uses action models to infer the behaviors and actions of the target. For example, Zhang et al. [213] used wearable devices with head, pen, and eye-focus modules to analyze students' attention. These modules gather information through smart cameras, gyroscopes, and accelerometers embedded in the wearable devices. Another approach [224] is the use of an eye-tracker to determine the attention level of the student. However, this approach is reported to have a serious negative impact on the eyes of the target [225], and the gadgets are quite expensive. Wearable devices tend to offer higher accuracy, but there are some major challenges:
• The number of human features to be used for inference: each feature requires a sensor and each sensor generates data, so a diverse set of data has to be handled together.
• After gathering the data from different features, selecting which feature to use to obtain the best result is difficult because it involves physical and emotional feedback.
• Individual differences: the reaction or behavior of an individual toward different things at the same time might differ. An individual might pretend to be focused while his/her attention is somewhere else.
The vision monitoring devices employ a camera or multi-camera-based system to detect the actions of the targets [226]. Steriadis et al. used video cameras [227] to monitor students' behavior and their facial expressions to determine their level of attention [216]. Many factors affect the correctness of this approach, such as lighting intensity and background obstruction. It also requires high computer processing capability, especially in real-time applications.
Mothwa et al. [219] presented a conceptual model of a face recognition student attendance monitoring system. The authors use a full multi-camera view to capture and detect the faces of the students. The system is designed to perform periodic real-time recognition of students during lectures, repeating the recognition after a specific interval to ensure that students have not left the lecture hall after the first capture. The captured images of the students are compared against the students' images already stored in the database. This is then used to determine the presence or absence of a student in the lecture. The approach uses a centralized architecture in which all the cameras are connected to a single facial recognition unit. In this type of architecture, there is no redundancy: the whole setup fails when the central interface that connects the cameras is down. The authors employ histogram equalization, bilateral filtering, and elliptical cropping to preprocess the captured images.
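To illustrate the first of the preprocessing steps named above, the following is a minimal sketch of histogram equalization on an 8-bit grayscale image stored as nested lists. It is a generic textbook formulation, not the authors' implementation, and the tiny `dark_face` image is a hypothetical stand-in for a captured face region.

```python
from typing import List

Image = List[List[int]]


def equalize_histogram(image: Image, levels: int = 256) -> Image:
    """Spread the gray levels of an image over the full intensity range."""
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1

    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return [row[:] for row in image]  # flat image: nothing to equalize

    scale = (levels - 1) / (total - cdf_min)
    lookup = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lookup[value] for value in row] for row in image]


# Hypothetical low-contrast patch: all values lie between 50 and 58.
dark_face = [[50, 50, 52], [52, 54, 54], [56, 56, 58]]
equalized = equalize_histogram(dark_face)
```

After equalization the darkest pixels map to 0 and the brightest to 255, which reduces the effect of uneven classroom lighting before feature extraction.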

B. OBJECT TRACKING ALGORITHM AND IMAGE RECOGNITION
This subsection highlights the various tools and algorithms used in object detection or recognition from images or video frames. Object recognition in an image or a video of multiple frames is performed using object tracking techniques. Most object tracking algorithms are composed of three components: object representation, dynamic model, and search procedure [228]. There are two ways of representing objects: holistic and local descriptors. Examples of holistic descriptors are raw pixel values and color histograms, while examples of local descriptors are local histograms and color information. The dynamic model describes the motion between two successive frames, thereby reducing the computational burden of object recognition. The search procedure in the object recognition algorithm is treated as an optimization problem, which is solved using either a deterministic or a stochastic approach. The deterministic approach is only applicable when no local minima are involved; otherwise, stochastic or sampling methods are employed [228], [229]. Many object recognition algorithms have been proposed by researchers. Some are suited to real-time (online) object detection, which is mostly used in educational applications, while others are used for object recognition in still images. Examples of the algorithms are Tracking-Learning-Detection (TLD) [230], resonant tunneling device (RTD) [231], Multiple Instance Learning (MIL) [232], incremental visual tracker (IVT) [233], Beyond semi-supervised (BeSemiT) [234], L1 tracker (L1T) [229], visual tracking decomposition (VTD), semi-supervised tracker (SemiT) [235], variance ratio tracker (VRT) [236], fragment-based tracker (FragT) [237], and online boosting tracker (BoostT) [238]. TLD was shown to be the most reliable and robust online object tracking algorithm when tested on different video frames and images [228].
Wang et al. [228] tested the performance of the algorithms (Table 8) based on their motion modeling, the state vector movement between two successive frames (dynamic model), and their search methods. Boris et al. [232] presented multiple instance learning (MIL) to separate an object from its background. The researchers employed an appearance model composed of a discriminative classifier; the classifier trains itself by extracting positive and negative examples from the most recent frame. A slight inaccuracy in the tracker can cause incorrect labeling of the training examples, which degrades the classifier. Moreover, the motion model employed in the approach was too simple; it could be replaced with a more robust and sophisticated one such as a particle filter. Kainz et al. [239] used TLD to determine the number of students attending a lecture and later applied a face recognition algorithm to identify the particular students among the detected faces.
Feature extraction is an essential part of face recognition. It derives a subset of salient features from the main data following some rules. The advantage of feature extraction is that it increases the speed of machine training and reduces space complexity. Feature extraction methods fall into three categories: holistic, feature-based, and hybrid matching methods. The holistic methods are the most widely used; they employ the whole face as input data. Examples of holistic methods are Linear Discriminant Analysis (LDA) [240], Principal Component Analysis (PCA) [241], Independent Component Analysis (ICA) [242], and Local Binary Patterns (LBP). Mothwa et al. [219] used the Viola and Jones face detection algorithm [243], which employs a Haar cascade to identify faces and can work in real time.
Mothwa et al. [219] employed three different cameras of the same quality to capture students' faces and used PCA, LDA, and LBP to extract the features of the face images. The Euclidean distance between the test image and the training data was calculated to determine the match. The accuracy of the feature extraction algorithms was tested and compared. High recognition accuracy was obtained when PCA and LDA were combined and used on the first camera.
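The Euclidean-distance matching step can be sketched as a nearest-neighbor search over stored feature vectors. This is a minimal sketch under stated assumptions: the three-dimensional vectors, the student identifiers, and the rejection `threshold` are hypothetical, and real PCA or LDA features would have many more dimensions.

```python
import math
from typing import Dict, List, Optional, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(probe: List[float],
             gallery: Dict[str, List[float]],
             threshold: float) -> Tuple[Optional[str], float]:
    """Return the closest enrolled identity, or None if nothing is close enough."""
    best_id: Optional[str] = None
    best_dist = float("inf")
    for student_id, features in gallery.items():
        dist = euclidean(probe, features)
        if dist < best_dist:
            best_id, best_dist = student_id, dist
    if best_dist > threshold:
        return None, best_dist  # treat as an unknown face
    return best_id, best_dist


# Hypothetical enrolled features (e.g. projections onto a few components).
gallery = {
    "student_a": [0.9, 0.1, 0.3],
    "student_b": [0.2, 0.8, 0.5],
}
present, distance = identify([0.85, 0.15, 0.35], gallery, threshold=0.5)
stranger, _ = identify([5.0, 5.0, 5.0], gallery, threshold=0.5)
```

The threshold is the design choice that matters most here: without it, every detected face would be assigned to some enrolled student, which would defeat the purpose of attendance monitoring.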
All the algorithms mentioned above can be found in computer vision tools. Computer vision tools are software for implementing image and video frame processing; examples are OpenCV, VXL, LTI, OpenTLD, MATLAB, and FastCV (produced by Qualcomm). FastCV can be run on mobile devices. A comparison of these vision libraries shows that OpenCV is faster when run on computers of the same specifications [244].
Fuzail et al. [245] implemented real-time detection algorithms for managing student attendance. They used the Haar classifier for face detection, implemented in the OpenCV computer vision tool. Recognition was done using a Python algorithm named pyfaces. Apart from its fast speed of recognition, this algorithm relies on the pose, scale, and color of the image to differentiate between the compared image and the one stored in the database. However, the system cannot identify the individual students present in class. Tamimi et al. [246] presented a real-time group face detection system. This approach is similar to that of Fuzail et al. The face detection algorithm was implemented in MATLAB 2012, which is another example of computer vision software. This system also cannot identify the individual students in the captured image.
Zhang et al. [213] demonstrated the determination of student attention levels using wearable devices built on four modules: head movement, pen movement, visual focus, and Apps modules. In the head movement module, the Speeded-Up Robust Features (SURF) [247] algorithm was used to determine the correlation between the position of the student's head (whether up, down, or center) and the information on the white marker board. SURF was implemented using the OpenCV library.

VIII. APPLICATION OF MULTI-CAMERA IN MOBILE PHONES
The growing competition in the mobile communication world has prompted many mobile phone manufacturers to introduce more features in their products, such as the imaging-related functions of video recording, video calls, and video conferencing [248], [249], [222]. The importance attached to the smartphone camera by manufacturers and users has driven most manufacturers to work hard on improving the features and image quality of their products. The initial concern of most manufacturers was the resolution of the smartphone camera, which led to the production of cameras with ever more megapixels. Another feature was digital zoom, which was initially based entirely on software interpolation because of the limitations of the camera focal length and the size of the mobile phone. The thin shape of the smartphone body [250] and the design of the lens make it complicated to fit a zoom lens. The implementation of multiple cameras on mobile phones has made several features possible and much easier, such as optical zoom, portrait mode, 3D, better high dynamic range (HDR), and low-light photography. The most popular smartphone manufacturers have moved to a stereo camera design. However, not all cameras on multi-camera smartphones serve the same function. Some smartphones have a primary camera that does the actual capturing of images, while the secondary camera is either a telephoto lens (thick and wider) or a monochrome camera with a wider field of view (FOV). A telephoto lens has a longer focal length and can bring a distant object or scene closer, as shown in Figure 12. The advantage of a dedicated telephoto camera module on a mobile phone is that it solves the zoom problem and produces better images at long focal lengths. The result is better than cropping and scaling the image output of the main camera.
The outputs of a two-camera module can be integrated to produce an improved image. However, this poses image processing challenges, such as white balance and focus distance problems, because the two images are slightly offset from each other.
According to the literature, this area has not been well studied. The images produced by the two asymmetric lenses need to be aligned in terms of their geometric and photometric properties. Anirudh et al. [250] proposed a computational algorithm that addresses the fusion of images produced by multiple asymmetric (tele-wide) cameras in terms of color and brightness integration. The images obtained from the telephoto and wide-view lenses are first brought to the same scale and FOV (referred to as global image registration) using several algorithms (Oriented FAST and Rotated BRIEF (ORB), Fast Library for Approximate Nearest Neighbors (FLANN), Random Sample Consensus (RANSAC), and an affine transform). The brightness and color of the two images are then matched, since both now cover the same FOV. The brightness of the resulting image is corrected in the Y channel and its color in the U and V channels. However, the algorithm was tested on static images and scenes with occlusion but not on motion pictures. The approach of Liu and Zhang [251] was applied to six calibrated symmetric wide-view cameras. A global correction gain is calculated for each camera view using the appropriate color index that matches the global brightness and color. The differences between the corresponding overlapping samples of the views are minimized by joint optimization of the gains. Then, a tone mapping curve is applied to remove the photometric misalignment. This approach makes many assumptions and is not adaptable to different lighting conditions.
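As a rough illustration of what matching brightness between two registered views can mean, the sketch below shifts and scales one channel so that its mean and spread agree with a reference channel. This is a generic mean and standard deviation transfer, swapped in for illustration; it is not the algorithm of [250] or [251], and the sample values are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List


def match_channel(source: List[float], reference: List[float]) -> List[float]:
    """Shift and scale `source` so its mean and spread match `reference`."""
    src_mean, ref_mean = fmean(source), fmean(reference)
    src_std, ref_std = pstdev(source), pstdev(reference)
    gain = ref_std / src_std if src_std > 0 else 1.0
    return [(value - src_mean) * gain + ref_mean for value in source]


# Hypothetical Y-channel samples taken from the region both views share.
tele_y = [90.0, 100.0, 110.0, 120.0]
wide_y = [120.0, 140.0, 160.0, 180.0]
corrected_y = match_channel(tele_y, wide_y)
```

The same function could be applied separately to the U and V channels for color; a per-channel global correction of this kind cannot handle spatially varying differences such as vignetting.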
The focal length limitation imposed on the mobile phone by its thin, flat body can be resolved using folded optics. This is a method whereby a mirror or prism is used to direct the light from the scene to the lenses and the sensor. Tremblay et al. [252] compared the performance of the conventional miniature refractive lenses used by mobile cameras with multiple-fold reflective optics of lower thickness, larger light collection, and higher resolution. In miniature conventional optics, the focal length of the lens is small relative to the size of the pixel and its array (pixel pitch). A smaller pixel pitch means higher pixel density and higher resolution. Pixel pitch is important because it influences viewing distance: the smaller the pixel pitch, the closer the viewing distance. Folded optics create a longer focal length without increasing the physical distance between the lens and the image sensor. They also increase the diameter of the image sensor, thereby increasing the light collection of the aperture area. The effective aperture diameter of a circular folded optic lens compared to that of the miniature conventional lens is given as:
$$d_{eff} = d_{outer}\sqrt{1 - o^{2}}$$
where $o$ is the inner aperture diameter divided by the outer aperture diameter (obscuration ratio), $d_{outer}$ is the outer diameter (OD) of the folded optic, and $d_{eff}$ is the diameter of an unobscured circular aperture with the same area as the folded optic.
The multi-camera arrangements on the front and rear of smartphones are usually fixed at the top. Each camera lens is fixed on the same module as its sensor. The rear cameras are usually placed in a cluster, either at the top right or the top center of the phone [253].

A. MULTI-CAMERA PHONE DEPTH ESTIMATION
The presence of homogeneous or heterogeneous cameras on a phone makes it possible to perform depth estimation of the objects in the scene. Depth estimation uses images obtained from two or more cameras to carry out triangulation and estimate distance [254]. The process determines the distance of an object from the two cameras using the shift in its position between the two images (referred to as parallax). Objects closer to the cameras appear far apart between the two images, while those far away from the cameras appear closer together. One of the advantages of knowing the depth of objects in a scene is that it enables special portrait modes in multi-camera phones, which keep the subject sharp and display a pleasantly blurred background. Another way of blurring the background of an image is referred to as "bokeh"; it is the natural method of blurring the background by using wide-aperture optics. This is usually done in hardware and is difficult to replicate computationally. Depth is estimated using different methods: one uses two or more input images (stereo vision [255] and shape from motion [256]), and the second, proposed more recently, uses a single monocular image [257]-[259]. The first class of methods produces the most accurate depth information. Guo et al. [260] used the Markov Random Field (MRF) model to estimate the depth map and determine the relationships between the different parts of a single image. The MRF model was trained using supervised learning. The estimated depth details and the geometric information were then used to generate pedestrian candidates. Saxena et al. [254] combined monocular and stereo (triangulation) methods to estimate depth information. A Markov Random Field was used to obtain the monocular cues, and the result was incorporated into the stereo system.
The advantage of this approach is that the combined result remains usable in situations where either of the systems (monocular or stereo) performs poorly on its own. Other depth estimation approaches are summarized in Table 9.
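The parallax relation used in stereo depth estimation can be made concrete for the idealized case of two identical, parallel (rectified) cameras, where depth is Z = f B / d, with f the focal length in pixels, B the baseline between the cameras, and d the disparity in pixels. The sketch below assumes this idealized geometry; the focal length and baseline are hypothetical values chosen to resemble a phone camera module.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Depth in metres of a point seen by two rectified, parallel cameras."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


FOCAL_PX = 1500.0    # hypothetical focal length expressed in pixels
BASELINE_M = 0.012   # hypothetical 12 mm spacing between the two lenses

near = depth_from_disparity(FOCAL_PX, BASELINE_M, 30.0)  # about 0.6 m
far = depth_from_disparity(FOCAL_PX, BASELINE_M, 3.0)    # about 6 m
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters little for near objects but a great deal for far ones, which is one reason the short baseline of a phone limits the useful range of stereo depth.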

B. IMAGE DETAILS ENHANCEMENT THROUGH MORE CAMERAS
Image details can be enhanced when multiple cameras are employed in mobile phones. For example, consider image demosaicing: a typical camera sensor cannot record color on its own unless a color filter array is laid over it. Each photosite then records only one color (red, green, or blue). From this, an RGB image is produced through a process called demosaicing. However, this method has disadvantages, such as reduced resolution and lower light sensitivity. For this reason, some smartphones pair a color camera with a monochrome sensor that captures all the available light at full resolution. The outputs of the two (color camera and monochrome sensor) are then combined to produce a more detailed image. Some manufacturers instead use two high-resolution color cameras and combine their output images, but the process of aligning and combining is complex, and the result is not as detailed as that obtained with the monochrome sensor. In low-light and high-contrast conditions, combining the image from the more light-sensitive monochrome camera with the image from the color camera produces a better output. This combination can, however, produce artifacts due to the alignment difference between the outputs of the two cameras.
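One simple way to picture the combination of a monochrome and a color image is to keep the chrominance of the color pixel and take the luminance from the monochrome pixel. The sketch below does this per pixel using the analog YUV (BT.601) weights. It assumes the two images are already perfectly aligned, which is the hard part in practice, and it is an illustrative simplification rather than any manufacturer's pipeline; the pixel values are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def rgb_to_yuv(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert RGB to luminance Y and chrominance U, V (BT.601 weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.492 * (b - y), 0.877 * (r - y)


def yuv_to_rgb(y: float, u: float, v: float) -> Tuple[float, float, float]:
    """Exact inverse of rgb_to_yuv."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def clamp(value: float) -> int:
    """Round to the nearest integer and keep it in the 8-bit range."""
    return max(0, min(255, round(value)))


def fuse(color: List[List[Pixel]], mono: List[List[int]]) -> List[List[Pixel]]:
    """Replace the luminance of each color pixel with the monochrome value."""
    fused = []
    for color_row, mono_row in zip(color, mono):
        out_row = []
        for (r, g, b), luma in zip(color_row, mono_row):
            _, u, v = rgb_to_yuv(r, g, b)
            rr, gg, bb = yuv_to_rgb(luma, u, v)
            out_row.append((clamp(rr), clamp(gg), clamp(bb)))
        fused.append(out_row)
    return fused


# Hypothetical aligned inputs: a 1x2 color image and its monochrome partner.
color_image = [[(100, 100, 100), (200, 50, 50)]]
mono_image = [[150, 95]]
fused_image = fuse(color_image, mono_image)
```

In this example the gray pixel becomes brighter because the monochrome sensor recorded more light, while the red pixel is essentially unchanged because its monochrome value is close to its own luminance.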
Multi-camera phones have the advantage of using the differences between the images produced by their cameras to create a depth map. This map can be used to enhance various augmented reality (AR) applications. Depth information need not come from a multi-camera module, however: some manufacturers use a dedicated depth sensor based on time of flight (TOF) or other technologies to generate the depth map required for AR.
The addition of more cameras undoubtedly makes smartphones more capable, but some challenges accompany it. Cost and space are not the only limitations; the processing power of smartphones is also a concern. Processing multiple image streams is significantly more complex than working with images obtained from a single camera. Additional work is required to align the images obtained from multiple cameras properly, to reduce ghosting artifacts, and to carry out the other steps required to create a high-quality, detailed output image.

IX. MULTI-CAMERA APPLICATION ALGORITHMS
The ubiquitous nature of multi-camera systems has led to their wide application in different areas of human life, including surveillance, sport, and mobile phones. Most multi-camera applications are based on tracking, matching, and surveillance. Humans and objects can be correctly observed, tracked, or identified under changes in scale, appearance, and shape, provided they reveal enough texture. Local features are the most commonly used features because they are good invariants. They are categorized into two groups: (i) local features based on absolute values and (ii) those based on relative values, or discriminative descriptors.
One of the most common local feature techniques built on absolute values is the scale-invariant feature transform (SIFT). SIFT is invariant to image scale, rotation angle, and brightness, and it builds a histogram with gray and gradient quantization. Some of the advantages of SIFT are:
• Locality: Features are local, so it is robust to occlusion and clutter (no prior segmentation is needed).
• Distinctiveness: Individual features can be matched against a large database of objects.
• Quantity: It can work with many features.
• Efficiency: Its performance is close to real time.
• Extensibility: It supports a wide range of feature types, with each adding robustness.
However, SIFT is computationally intensive, which makes it impractical for real-time applications such as visual odometry and for low-power devices such as mobile phones. Ke and Sukthankar [267] described a principal component analysis technique (PCA-SIFT) that replaces the histogram in SIFT and improves the computational speed. Speeded-Up Robust Features (SURF) [268] is another algorithm that performs better than SIFT, especially in terms of computational speed. Oriented FAST and Rotated BRIEF (ORB) [269] is a binary descriptor based on BRIEF; it is a robust algorithm that is illumination and rotation invariant. It is highly resistant to noise and can compute 10 times faster than SURF, but it has a scale variation problem. In general, local feature algorithms based on relative values face challenges in the identification and description of local feature points, in descriptive capability, and in computational intensity. Examples of the relative value-based methods include Binary Robust Independent Elementary Features (BRIEF) [270], a modified feature point descriptor based on BRIEF (MBRIEF) [271], Ordinal Spatial Intensity Distribution [272], and Binary Robust Invariant Scalable Keypoints (BRISK) [273].
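The idea behind relative value-based binary descriptors can be sketched in a few lines: each bit records whether one pixel in a patch is darker than another, and two descriptors are compared by counting differing bits (Hamming distance). The sketch below is a simplified BRIEF-like descriptor written for illustration, without the smoothing, keypoint detection, or orientation steps of the published methods; the patch contents and the random sampling pattern are hypothetical.

```python
import random
from typing import List, Tuple

Patch = List[List[int]]
Pair = Tuple[Tuple[int, int], Tuple[int, int]]


def make_test_pairs(size: int, n_bits: int, seed: int = 0) -> List[Pair]:
    """Fixed random pixel pairs shared by every descriptor."""
    rng = random.Random(seed)
    return [((rng.randrange(size), rng.randrange(size)),
             (rng.randrange(size), rng.randrange(size)))
            for _ in range(n_bits)]


def describe(patch: Patch, pairs: List[Pair]) -> int:
    """Pack one intensity comparison per pair into an integer bit string."""
    bits = 0
    for (r1, c1), (r2, c2) in pairs:
        bits = (bits << 1) | (1 if patch[r1][c1] < patch[r2][c2] else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two descriptors differ."""
    return bin(a ^ b).count("1")


SIZE = 8
pairs = make_test_pairs(SIZE, n_bits=64)

# Hypothetical patch with a smooth gradient, a brighter copy, and an inverted copy.
patch = [[10 * r + 3 * c for c in range(SIZE)] for r in range(SIZE)]
brighter = [[value + 40 for value in row] for row in patch]
inverted = [[255 - value for value in row] for row in patch]

same = hamming(describe(patch, pairs), describe(brighter, pairs))
different = hamming(describe(patch, pairs), describe(inverted, pairs))
```

Because only the ordering of intensities is stored, a uniform brightness change leaves the descriptor unchanged, and the XOR-based comparison is far cheaper than the floating-point distances used with SIFT-style histograms.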
Some researchers use the texture of an object to track non-rigid objects. Nummiaro et al. [274] integrated color distributions into particle filtering together with edge-based image features to perform real-time tracking of non-rigid objects. The method is robust to partial occlusion, rotation, and scale variation, and it is computationally efficient. Mathes and Piater [275] combined low-level features to form a model of the shape and appearance of an object. The resulting model performs very well on images with severe partial occlusion. The approach is built to detect and track textured objects in cluttered scenes for non-static cameras. The algorithm was tested on a 160-frame soccer sequence, tracking 6 players as targets.
Others perform tracking using the 2D or 3D geometry of the objects' shape. These approaches are employed in surveillance, human-computer interfaces (HCI), and communication support services. Senior et al. [276] applied tracking in 2D and on the 3D ground plane to determine the positional information of a tracked object. In their approach, four different algorithms were used to track a person in an indoor environment: background subtraction, particle filtering, face detection, and edge alignment of a cylindrical model. The particle filtering approach used particle trackers adapted from the work of Nickel et al. [277]. It does not use background model subtraction because of its limitation in handling moving targets; instead, it employs frame differencing, which is itself susceptible to noise and distraction. The face detection approach depends on detected faces to perform tracking and can be seriously affected by occlusion, distance from the camera, and light intensity. The background subtraction approach relies on maintaining a good background model, which cannot be guaranteed. Lastly, the edge-alignment method uses a 3D graphical model of a human, represented by cylinders coupled in kinematic chains, aligned in multi-camera views. This method requires a re-initialization strategy whenever it fails and is therefore not suitable for online applications. Liang et al. [95] presented object matching with multi-camera collaboration using head detection and the trifocal tensor point transfer method. They used Kalman and PDA algorithms for tracking people and then applied background subtraction to detect the head position. Using the corresponding head points, trifocal tensor transfer was used to detect objects in the upper view of the two cameras. Straw et al. [278] estimated the body inclination of animals by tracking their 3D motion and location using the Kalman filter and the nearest neighbor filter algorithm. The system was designed to study the neurobiological behaviors of freely flying animals. It used 11 cameras to track three flies simultaneously at 60 fps in real time.
Dockstader and Tekalp [279] proposed a method for tracking people in motion using a multi-view implementation of a Bayesian belief network that integrates the 2D features of each camera view. Sparse motion estimation and Kalman-like state propagation were used for observation and filtering, respectively. Mittal and Davis [280] presented a method for tracking people in a cluttered area using synchronized cameras. The system employed a region-based stereo algorithm to detect 3D points and Bayesian classification for segmentation. Calderara et al. [80] used overlapping multi-camera views to track people using consistent labeling methods. The position of the people in the multiple camera views is determined using the homography of the first detection in the camera view. Table 10 shows a summary of multi-camera application tracking algorithms.
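Several of the trackers surveyed in this section rely on Kalman filtering combined with nearest-neighbor association. The following is a minimal one-dimensional sketch of that combination, assuming a constant-velocity motion model and scalar position measurements. The noise variances, the gating distance, and the synthetic detections are hypothetical, and the cited systems track in 2D or 3D with considerably richer models.

```python
from typing import List, Optional


class ConstantVelocityKalman:
    """Kalman filter with state [position, velocity] and position measurements."""

    def __init__(self, position: float, velocity: float = 0.0, dt: float = 1.0,
                 process_var: float = 1e-2, meas_var: float = 1.0) -> None:
        self.x = [position, velocity]
        self.p = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.dt, self.q, self.r = dt, process_var, meas_var

    def predict(self) -> float:
        """Propagate the state one frame ahead: x = F x, P = F P F^T + Q."""
        dt = self.dt
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        (p00, p01), (p10, p11) = self.p
        self.p = [[p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]
        return self.x[0]

    def update(self, z: float) -> float:
        """Correct the prediction with a position measurement z."""
        innovation = z - self.x[0]
        s = self.p[0][0] + self.r              # innovation variance
        k0, k1 = self.p[0][0] / s, self.p[1][0] / s
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        (p00, p01), (p10, p11) = self.p
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]


def nearest_detection(predicted: float, detections: List[float],
                      gate: float) -> Optional[float]:
    """Pick the detection closest to the prediction, if it lies within the gate."""
    best = min(detections, key=lambda d: abs(d - predicted), default=None)
    if best is None or abs(best - predicted) > gate:
        return None
    return best


# Synthetic sequence: the target moves 2 units per frame; a second object
# 15 units away acts as a distractor in every frame.
track = ConstantVelocityKalman(position=0.0)
for frame in range(1, 21):
    detections = [2.0 * frame, 2.0 * frame + 15.0]
    predicted = track.predict()
    match = nearest_detection(predicted, detections, gate=5.0)
    if match is not None:
        track.update(match)
```

After twenty frames the filter has locked onto the target's position and velocity and ignores the distractor. The gate is what prevents a track from jumping to another person when its own detection is missing, for example during occlusion.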

X. CONCLUSION
Multi-camera systems have gained significant attention during the past few years, especially in the areas of surveillance, tracking, image recognition, image sensors, and computer vision. This presents an excellent opportunity to combine the techniques and advancements in these fields with multi-camera systems to proffer sustainable solutions to the problems of humanity. In this survey paper, we discussed camera calibration and architecture in a multi-camera formation because of their importance. We also focused on the application of multi-camera systems in surveillance, sports, education, and mobile phones. Multi-camera application algorithms were also discussed with reference to their areas of application. More importantly, we have discussed the current challenges, the progress made, and potential future directions to guide researchers and scientists who need to understand how this area of research is evolving.