Video Big Data Analytics in the Cloud: A Reference Architecture, Survey, Opportunities, and Open Research Issues

The proliferation of multimedia devices over the Internet of Things (IoT) generates an unprecedented amount of data. Consequently, the world has stepped into the era of big data. Recently, with the rise of distributed computing technologies, video big data analytics in the cloud has attracted the attention of researchers and practitioners. Current technology and market trends demand an efficient framework for video big data analytics. However, existing work is too limited to provide a complete survey of recent research on video big data analytics in the cloud, including the management and analysis of large amounts of video data, the challenges, opportunities, and promising research directions. To serve this purpose, we present this study, which provides a broad overview of the state-of-the-art literature on video big data analytics in the cloud. It also aims to bridge the gap among large-scale video analytics challenges, big data solutions, and cloud computing. In this study, we clarify the basic nomenclature that governs the video analytics domain and the characteristics of video big data while establishing its relationship with cloud computing. We propose a service-oriented layered reference architecture for intelligent video big data analytics in the cloud. Then, a comprehensive review is conducted to examine cutting-edge research trends in video big data analytics. Finally, we identify and articulate several open research issues and challenges raised by the deployment of big data technologies in the cloud for video big data analytics. To the best of our knowledge, this is the first study that presents a generalized view of video big data analytics in the cloud. This paper covers the research studies and technologies advancing video analysis in the era of big data and cloud computing.

With these, various leading industrial organizations have successfully deployed video management and analytics platforms, backed by more bandwidth and high-resolution cameras that collect videos at scale, which has become one of the latest trends in the video surveillance industry. For example, more than 400 hours of video are uploaded to YouTube every minute [5], and more than one hundred and seventy million video surveillance cameras have been installed in China alone [6]. It has been reported that the data generated by various IoT devices will see a growth rate of 28.7% over the period 2018-2025, with surveillance videos holding the majority share [7].
Such enormous video data is considered "big data" because a variety of sources generate a large volume of video data at high velocity, and that data holds high Value. Even though surveillance videos, which hold 65% of the big data share, are monitored, a significant portion of the video data still goes unnoticed [8]. That neglected data contains valuable information directly related to real-world situations. Video data provide information about interactions, behaviors, and patterns, whether of traffic or of humans. However, traditional data analytics approaches are not adequate for handling such a large amount of complex video data. Therefore, more comprehensive and sophisticated solutions are required to manage and analyze such large-scale unstructured video data.
Due to the data-intensive and resource-hungry nature of large-scale video data processing, extracting insights from video is a challenging task. The considerable size of video data poses significant challenges for video management and mining systems, which require powerful machines to deal with large-scale video data. Moreover, a flexible solution is necessary to store and mine this large volume of video data for decision making. Large-scale video analytics has nevertheless become a reality owing to the popularity of big data and cloud computing technologies.
Cloud computing is an infrastructure for providing convenient and ubiquitous remote access to a shared pool of configurable computing resources. These resources can be provisioned and released with minimal management effort or service-provider interaction [9]. Big data technologies, such as the Hadoop or Spark ecosystem, are software platforms designed for distributed computing to process, analyze, and extract valuable insights from large datasets in a scalable and reliable way. The cloud is well suited to offer the big data computation power required for processing these large datasets. [10], Amazon Web Services [11], Microsoft Azure [12], and Oracle Big Data Analytics [13] are some examples of video big data analytics platforms. Large-scale video analytics in the cloud is a multi-disciplinary area and the next big thing in big data, which opens new research avenues for researchers and practitioners.
This work aims to conduct a comprehensive study on the status of large-scale video analytics in the cloud computing environment while deploying video analytics techniques. First, this study builds the relationship between video big data and cloud computing and defines the terminology that governs the study. Then, a service-oriented, layered reference architecture is proposed for large-scale video analytics in the cloud, focusing on architectural properties like reliability, scalability, fault-tolerance, extensibility, and intermediate results orchestration. Further, an intensive survey is conducted to project the current research trends in video analytics, encompassing the taxonomy of video analytics approaches and cloud-based scholarly and industrial studies. Finally, open research issues and challenges are discussed, with a focus on the proposed architecture, i.e., the deployment of an array of computer vision algorithms for large-scale videos in the cloud.

A. VIDEO BIG DATA, CLOUD COMPUTING, AND THEIR RELATIONSHIP
The term big data was popularized by John R. Mashey in the late 1990s [14] and refers to large volumes of data that are impractical to store, process, and analyze using traditional data management and processing technologies [15]. The data can be unstructured, semi-structured, or structured, but mostly unstructured data is considered. The definition of big data has evolved and has been described in terms of three, four, or five characteristics. In the literature, three of these characteristics are shared, i.e., Volume, Velocity, and Variety, while the others are Veracity and Value [16]-[19]. Various video stream sources generate a considerable amount of unstructured video data on a regular basis, making video a new application field of big data. The data generated by such sources are further subject to contextual analysis and interpretation to uncover hidden patterns for decision-making and business purposes.
In the context of a large volume of video data, we specialize the generic big data characteristics. The size of data is referred to as Volume [20], and the majority share, i.e., 65%, is held by surveillance videos alone. The types of data generated by various sources, such as text, pictures, video, voice, and logs, are known as Variety [20]. Video data are acquired from multimodal video stream sources, e.g., IP cameras, depth cameras, body-worn cameras, etc., and from different geolocations, which augments the Variety property. The pace of data generation and transmission is known as Velocity [21]. Video data also possess the Velocity attribute, i.e., a Video Stream Data Source (VSDS) primarily produces a video stream 24/7, which is acquired by the data center storage servers. Veracity can be defined as the diversity of quality, accuracy, and trustworthiness of the data [22]. Video data are acquired directly from real-world domains and meet the Veracity characteristic. Value refers to contextual analysis to extract significant value for decision-making and business purposes [23], [24]. Video data has high Value because of its direct relation to the real world. Automatic criminal investigation, illegal vehicle detection, and abnormal activity recognition are some examples of Value extraction. Video data dominate almost all the big data properties, which encourages us to coin the term Video Big Data.
These five characteristics impose many challenges on the organizations when embracing video big data analytics. Storing, scaling, and analyzing are some apparent challenges associated with video big data. To cope with these challenges, converged and hyper-converged infrastructure and software-defined storage are the most convenient solutions. Distributed databases, data processing engines, and machine learning libraries have been introduced to overcome video big data management, processing, and analysis issues, respectively.
These big data technologies are deployed over a computer cluster to process and manage a massive amount of video data in parallel. A computer cluster may consist of a few to hundreds, or even thousands, of nodes that work together as a single integrated computing resource on different parts of the same program [25], [26]. Deploying an in-house computer cluster is an option for big data technologies, but it carries hardware cost and maintenance issues. An alternative solution is cloud computing, which reduces the costs associated with the management of hardware and software resources [27].
Typically, cloud services are provided on demand in a "pay-as-you-go" manner for the convenience of end-users and organizations [28]. Cloud computing follows the "as-a-Service" philosophy and offers its services according to different models, for example, Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) [9].
Under IaaS (e.g., Amazon's EC2), the cloud service provider allows consumers to provision fundamental computing resources and deploy arbitrary software. In PaaS, the service provider offers a convenient platform that enables customers to develop, run, and manage applications without dealing with the complexities of building and maintaining the infrastructure. Examples of PaaS are Google App Engine and Microsoft Azure. In SaaS, applications (e.g., email, docs, etc.) are deployed on cloud infrastructure by service providers, and consumers subscribe to them. These applications can be easily accessed from various client devices using a thin client or program interfaces [9].

B. RESEARCH OBJECTIVES AND CONTRIBUTIONS
This paper presents a detailed survey and review of cloud-based large-scale Intelligent Video Analytics (IVA). We also propose big data technological solutions for the challenges faced by IVA researchers and practitioners. The contributions of this paper are listed below:
• We standardize the basic nomenclature that governs the IVA domain. The term video big data is coined, and we clarify how it inherits the big data characteristics while establishing its relationship with cloud computing.
• We propose a distributed, layered, service-oriented, and lambda-style [29] inspired reference architecture for large-scale IVA in the cloud. Each layer of the proposed Cloud-based Video Analytics System (CVAS) is elaborated technologically, i.e., with the big data technological alternatives available for each layer. The base layer of the proposed architecture, i.e., the video big data curation layer, is based on the notion of Intermediate Results (IR) orchestration, which can play a significant role in the optimization of the IVA pipeline. Under the proposed architecture, we perform a thorough investigation of scalable traditional video analytics and deep learning techniques and tools on distributed infrastructures, along with popular computer vision benchmark datasets.
• To the best of our knowledge, this is the first survey of video big data analytics. Our research targets the most recent approaches that encompass broad IVA research domains like content-based video retrieval, video summarization, semantic-based approaches, and surveillance and security.
• We also investigate how researchers exploit big data technologies for large-scale video analytics and show how industrial IVA solutions are provided.
• We further describe real-world IVA application areas, which project the significance of big data and cloud computing in IVA.
• Finally, we identify the research gap and list several open research issues and challenges.
These research issues encompass orchestration and optimization of the IVA pipeline; big dimensionality; online learning on video big data; model management, parameter servers, and distributed learning; evaluation issues and opportunities; IVA services statistics maintenance and ranking in the cloud; IVA-as-a-Service (IVAaaS) and cost model; video big data management; and privacy, security, and trust. The remainder of the paper is organized as follows. Sections II and III cover recent studies, and scope and nomenclature, respectively. Section IV discusses the proposed CVAS. The literature review is presented in Sections V and VI. The IVA applications are covered in Section VII. In Section VIII, several video big data challenges and opportunities are discussed. Finally, Section IX concludes this study.
VOLUME 4, 2016

II. RECENT STUDIES
Various studies have been conducted that discuss video analytics, which can be classified as IVA and big data studies, as shown in Table 1. In the former class, the surveys overlook the big data characteristics and challenges of video analysis in the cloud. The majority of the work focuses on a specific domain of IVA and the application of video analysis. For instance, comprehensive surveys were presented by Liu et al. [30] and Olatunji et al. [31], who discussed IVA and its applications in the conventional context of surveillance. Content-Based Video Retrieval (CBVR) was reviewed by Hu et al. [32], Patel et al. [33], and Haseyama et al. [34]. Similarly, vehicle surveillance systems in Intelligent Transportation Systems (ITS) and abnormal behavior analytics in surveillance videos were studied by Tian et al. [35] and [36], respectively.
Furthermore, big data studies were conducted from different perspectives, i.e., cloud computing, ML, mining, multimedia, and ITS. Hashem et al. [16] and Agrawal et al. [39] presented research issues and challenges on big data analytics in cloud computing. Tsai et al. [38] and Khan et al. [37] focus on big data techniques and general applications along with challenges. Zhou et al. [40], Che et al. [41], and Pouyanfar et al. [42] studied and reported the research issues of big data practices concerning ML techniques, data mining algorithms, and multimedia data, respectively. Zhu et al. [43] presented a comprehensive study on big data analytics in ITS, analytical methods and platforms, and categories of big data video analytics. In these big data studies, IVA has been overlooked, and only minimal discussion can be found in the studies of Khan et al. [37], Tsai et al. [38], Pouyanfar et al. [42], Zhu et al. [43], and Zahid et al. [44].
From the literature, it is clear that recent studies have ignored large-scale unstructured video analytics in the cloud in their discussion of big data. Some surveys focus on big data management and its related tools, while others offer only a limited investigation of IVA in a particular context. They also do not consider the growing volume of unstructured videos as video big data. Unlike existing work, this paper provides a comprehensive assessment of the state-of-the-art literature and proposes an in-depth, distributed, cloud computing-based IVA reference architecture.

III. SCOPE AND NOMENCLATURE
We clarify some of the nomenclature used in this study and define its scope. The first term is IVA, which is "any surveillance solution that utilizes video technologies to automatically manipulate and/or perform actions on live or stored video images" [45]. IVA services are implemented through hardware called a Video Analytics System (VAS) [30]. A VAS acquires videos continuously and monitors them unblinkingly. VAS fall into four categories, i.e., Embedded Video Analytics System (EVAS), On-site Video Analytics System (OVAS), Fog-based Video Analytics System (FVAS), and CVAS.
EVAS embeds IVA solutions and performs video analysis directly on the edge device, e.g., a camera or encoder, and can produce alerts in case of abnormality. EVAS provides very plain video analytics solutions: it can apply two or three rules simultaneously to its stream but cannot run complex algorithms such as fire detection, facial recognition, or cross-video-stream analytics. OVAS, typically used by small and medium-sized companies, consists of networked or wireless cameras, a network router, a system running the video analytics and management software (e.g., IBM smart surveillance system [46] and Zoneminder [47]), and a storage device. All the cameras send their video data for analysis by contextual video analytics algorithms, and the system warns the operator if an anomaly is detected. OVAS has many limitations, e.g., maintenance, software upgrades, expensive hardware, limited scalability, and the inability to deal with large-scale video data.
When IVA solutions are provided in fog and cloud computing environments, they are called FVAS and CVAS, respectively. In such environments, the IVA solutions are made available under the as-a-Service (aaS) paradigm, i.e., IVAaaS. In FVAS, the IVA solutions are geographically distributed and configured near the edge devices, i.e., video stream sources, to meet the strict real-time IVA requirements of large-scale video analytics, which must address latency, bandwidth, and provisioning challenges. CVAS, in contrast, is more suitable for offline IVA because of its relatively high response time and latency. The hierarchy and relation among CVAS, FVAS, and EVAS are shown in Fig. 1. Batch video data analytics are performed in the cloud, while real-time IVA is performed in the fog. The EVAS can play a passive role in the proposed architecture, e.g., feeding the video streams to the CVAS if motion is detected.
The scope of this paper is FVAS and CVAS, i.e., the real-time and batch IVA solutions deployed in the fog and cloud computing environments, respectively, while utilizing big data computing technology. However, for ease of understanding, we use the notion of CVAS throughout this paper.
To analyze a video in CVAS, the video goes through different phases, as shown in Fig. 2a. In Fig. 2a, a Video Source either generates video streams from devices connected directly to real-world domains, such as an IP camera, or consists of already acquired videos in the form of datasets residing in a file system. If IVA in the cloud is performed on real-time video streams, we call it Real-time IVA (RIVA); if it is performed on batch videos, we call it Batch IVA (BIVA). The Ingest phase implements interfaces to acquire videos from the Video Source. In the context of IVA, the acquired videos can be represented as a hierarchy, as shown in Fig. 2b. An acquired video from a Video Source may be decomposed into its constituent units in either the temporal or the spatial domain. In a video, a frame represents a single image, whereas a shot denotes a consecutive sequence of frames recorded by a single camera. A scene is a sequence of semantically related shots that depicts a high-level story. A collection of scenes composes a sequence/story. Frames and shots are low-level temporal features suitable for machines, while scenes and sequences/stories are considered high-level features suitable for human perception. Such constituent units are further subject to low-, mid-, and/or high-level processing. In low-level processing, primitive operations (in the Transformations phase) are performed, e.g., noise reduction and histogram equalization. The Infer phase encompasses mid- and high-level processing. Mid-level processing extracts features from the sequence of frames, e.g., segmentation, description, classification, etc. High-level processing makes sense of an ensemble of recognized objects and performs the cognitive functions normally associated with vision. Finally, the extracted information can be persisted to the data store and/or published to the end-user.
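The temporal hierarchy described above (frame, shot, scene, sequence/story) can be illustrated with a small data-structure sketch. This is only a minimal illustration under our own assumptions: the class names, fields, and the helper `count_frames` are hypothetical and are not part of the proposed architecture.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    index: int  # position of the frame in the source video


@dataclass
class Shot:
    # Consecutive frames recorded by a single camera.
    frames: List[Frame] = field(default_factory=list)


@dataclass
class Scene:
    # Semantically related shots that depict a high-level story.
    shots: List[Shot] = field(default_factory=list)


@dataclass
class Sequence:
    # A collection of scenes (sequence/story).
    scenes: List[Scene] = field(default_factory=list)


def count_frames(sequence: Sequence) -> int:
    """Walk the hierarchy top-down and count the frames it contains."""
    return sum(len(shot.frames) for scene in sequence.scenes for shot in scene.shots)


# Hypothetical example: one scene made of two shots with 3 and 2 frames.
shot_a = Shot([Frame(0), Frame(1), Frame(2)])
shot_b = Shot([Frame(3), Frame(4)])
story = Sequence([Scene([shot_a, shot_b])])
```

In such a representation, low-level processing would operate on `Frame` and `Shot` objects, while scene- and story-level units would be the natural targets of high-level processing.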
Furthermore, the basic unit of an IVA service pipeline is an algorithm, e.g., encoder, feature extractor, classifier, etc. The input of an algorithm can be a Video Source, keyframes, or features. Similarly, the output of an IVA algorithm can be high-, mid-, or low-level features. Throughout this study, all possible outputs of IVA algorithms are termed IR. Multiple algorithms can be pipelined to build a domain-specific IVA service. The input and output of an IVA service are restricted to the Video Source and IR, respectively. The User represents a stakeholder of the CVAS, such as an administrator, consumer, IVA researcher, or practitioner. A Domain is a specific real-world environment, e.g., street, shop, road traffic, etc., for which an IVA service needs to be built for automatic monitoring. Domain knowledge facilitates IVA in discovering interesting patterns from domain video streams. The combination of software and hardware constitutes a distributed System (cloud environment) where IVA services and algorithms can run fast. Nevertheless, upgrading existing IVA algorithms to a distributed architecture requires customization of how IVA algorithms are implemented and deployed.
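The notion of an IVA service as a pipeline of algorithms, where every algorithm output is an IR, can be sketched as follows. This is a toy, single-process illustration under stated assumptions: `run_service` and the three stand-in algorithms are hypothetical names, and frames are represented simply as lists of pixel intensities.

```python
from typing import Any, Callable, Dict, List, Tuple

Algorithm = Callable[[Any], Any]


def run_service(video_source: Any, pipeline: List[Tuple[str, Algorithm]]) -> Dict[str, Any]:
    """Run the algorithms in order; each output (an IR) feeds the next algorithm."""
    intermediate_results: Dict[str, Any] = {}
    data = video_source
    for name, algorithm in pipeline:
        data = algorithm(data)
        intermediate_results[name] = data  # keep every IR, not only the final one
    return intermediate_results


# Toy stand-ins for real IVA algorithms (assumptions for illustration only).
def extract_keyframes(frames):
    return frames[::2]  # keep every second frame


def extract_features(frames):
    return [sum(f) / len(f) for f in frames]  # mean intensity per frame


def classify(features):
    return ["bright" if value > 100 else "dark" for value in features]


toy_video = [[10, 20], [200, 220], [150, 130], [5, 5]]
results = run_service(
    toy_video,
    [("keyframes", extract_keyframes), ("features", extract_features), ("labels", classify)],
)
```

Keeping every IR in the returned dictionary mirrors the idea that IR, and not only the final output, are first-class results of an IVA service.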

IV. LAMBDA CVAS: A REFERENCE ARCHITECTURE
In this section, we briefly present the technical details of the proposed CVAS, called Lambda CVAS (L-CVAS), and describe each layer in the subsequent sub-sections. Fig. 3 shows the proposed architecture. The Video Big Data Curation Layer (VBDCL) is the foundation layer and is responsible for large-scale big data management throughout the life-cycle of IVA, i.e., from data acquisition to early persistence to archival and deletion [48]. VBDPL is responsible for distributed video pre-processing, feature extraction, etc. The VBDML deploys IVA algorithms on top of distributed processing engines to produce high-level semantics from the processed sequence of frames. On top of the VBDML, the KCL has been designed to link the low-level features in spatial and temporal relation across videos in a multi-stream environment. KCL deploys a generic video ontology. The KCL maps the extracted IR to the video ontology to bridge the semantic gap between the low-level features in Euclidean space and temporal relations across videos while utilizing semantically rich queries. The proposed architecture incorporates the functionalities of the above four layers into a simple, unified, role-based WSL, which enables L-CVAS users to manage, build, and deploy a wide array of domain-specific near RIVA and BIVA services.
All the layers are made available as aaS. These IVAaaS are provided to domain experts, who can pipeline them in a specific context to build an IVA service. The resulting IVA services are made available as IVAaaS, to which users can subscribe Video Sources.
Functionalities like security, scalability, load-balancing, fault-tolerance, and performance are mandatory and common to all the layers and are shown as cross-cutting concerns in Fig. 3. The cloud infrastructure provides the underlying hardware and software under IaaS, on which the L-CVAS can be deployed. The cloud infrastructure is out of the scope of this paper. It has already been studied in detail in the context of big data by [16], [36], [39]. However, in the discussion of IVA in the cloud, some resources like CPU, GPU, FPGA, HDD, and SSD can be considered.

A. VIDEO BIG DATA CURATION LAYER
Effective data management is key to extracting insights from data. We design the VBDCL for L-CVAS to efficiently manage video big data; it is a petascale storage architecture that can be accessed in a highly efficient and convenient manner. L-CVAS's data storage stack consists of three main components: Real-time Video Stream Acquisition and Synchronization (RVSAS), Distributed Persistent Data Store (DPDS), and VBDCL Business Logic.

1) Real-time Video Stream Acquisition and Synchronization
The real-time video stream needs to be collected from the source device and forwarded to the executors for on-the-fly processing against the subscribed IVA service.
When handling a tremendous number of video streams, both processing and storage are subject to loss [52]. To handle large-scale video stream acquisition in real time and to manage the IR, anomalies, and the communication among RIVA services, we design the RVSAS component while assuming a distributed messaging system. A distributed message broker, also known as message-oriented middleware [53], is an independent application that is responsible for buffering, queuing, routing, and delivering the messages received from the message producer to the consumers [54]. A message broker should be able to handle permission control and failure recovery. A message broker generally supports routing methods like direct worker queue and/or publish-subscribe [55]. The message consumer component receives messages from the message broker either periodically (cron-like consumer) or continuously (daemon-like consumer). Generally, for the sake of scalability, message consumers are deployed on separate servers independently of message producers [55]. Some popular distributed messaging systems are shown in Table 2.
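The two routing methods mentioned above, direct worker queue and publish-subscribe, can be contrasted with a toy in-memory broker. This is a single-process sketch under our own assumptions: the class `InMemoryBroker` and its method names are hypothetical and do not correspond to the API of any real messaging system.

```python
from collections import defaultdict, deque
from typing import Any, Callable, DefaultDict, Deque, List


class InMemoryBroker:
    """Toy, single-process stand-in for a distributed message broker."""

    def __init__(self) -> None:
        self._queues: DefaultDict[str, Deque[Any]] = defaultdict(deque)
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    # Direct worker queue: each buffered message is handed to exactly one consumer.
    def send(self, queue: str, message: Any) -> None:
        self._queues[queue].append(message)

    def receive(self, queue: str) -> Any:
        """Return the oldest buffered message, or None if the queue is empty."""
        pending = self._queues[queue]
        return pending.popleft() if pending else None

    # Publish-subscribe: every subscriber of the topic receives a copy.
    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver the message to all subscribers and return how many were notified."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])
```

A cron-like consumer would call `receive` on a schedule, whereas a daemon-like consumer would poll it (or be called back) continuously; a production broker would additionally provide persistence, permission control, and failure recovery.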
RVSAS provides client APIs on top of a distributed messaging system for the proposed framework. The RVSAS component is responsible for handling and collecting real-time video streams from device-independent video data sources. Once a video stream is acquired, it is sent temporarily to the distributed broker server. The worker system on which an IVA service, e.g., activity recognition, is configured reads the data from the distributed broker and processes it. The DMBM is used to manage the queues in the distributed message broker cluster for RIVA services. Three types of queues, RIVA_ID, RIVA_IR_ID, and RIVA_A_ID, as shown in Fig. 4, are automatically generated by the DMBM module on the distributed message broker when a new RIVA service is created. Here RIVA, ID, IR, and A stand for RIVA service, the unique identifier of the service, IR, and Anomalies, respectively. These queues hold, respectively, the actual video stream acquired by the VSAS, the IR produced by an algorithm, and the anomalies detected by the video analytics services.
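The queue naming convention can be sketched as follows. The text only gives RIVA_1 as a concrete queue name; the names produced here for the IR and anomaly queues follow by analogy and are an assumption, as are the helper `riva_queue_names` and the toy `QueueManager` class.

```python
from typing import Dict, Set


def riva_queue_names(service_id: int) -> Dict[str, str]:
    """Derive the three queue names for a RIVA service from its unique identifier."""
    return {
        "stream": f"RIVA_{service_id}",                   # raw video stream mini-batches
        "intermediate_results": f"RIVA_IR_{service_id}",  # IR produced by algorithms
        "anomalies": f"RIVA_A_{service_id}",              # anomalies detected by the service
    }


class QueueManager:
    """Toy stand-in for the DMBM: registers queues when a RIVA service is created."""

    def __init__(self) -> None:
        self.queues: Set[str] = set()

    def create_service(self, service_id: int) -> Dict[str, str]:
        names = riva_queue_names(service_id)
        self.queues.update(names.values())
        return names
```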
b: Video Stream Acquisition Service
The VSAS module provides interfaces to VSDS and acquires large-scale streams from device-independent video data sources for on-the-fly processing. If a particular video stream source is subscribed to an RIVA service, the VSAS gets the configuration metadata from the Data Source DS in the Immediate Structured Distributed Data Store (ISDDS) and configures the source device for video streaming. After successfully configuring the source device, the VSAS decodes the video stream, detects the frames, and performs some necessary operations on each frame, such as metadata extraction and frame resizing; the result is then converted to a formal message. These messages are serialized in the form of mini-batches, compressed, and sent to the Distributed Broker. If a video stream source "C1" is subscribed to the RIVA service "S1", the VSAS will route the mini-batches of the video stream to queue RIVA_1 in the Broker Cluster, as shown in Fig. 4.
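The message, mini-batch, and compression steps of the VSAS can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: JSON serialization and zlib compression are our illustrative choices (the text does not specify formats), frames are represented as plain lists of integers, and the function names are hypothetical.

```python
import json
import zlib
from typing import Dict, Iterable, Iterator, List


def to_message(source_id: str, frame_index: int, pixels: List[int]) -> Dict:
    """Wrap one decoded frame and its metadata into a message."""
    return {"source": source_id, "frame": frame_index, "size": len(pixels), "pixels": pixels}


def mini_batches(messages: Iterable[Dict], batch_size: int) -> Iterator[bytes]:
    """Group messages into mini-batches, serialize each batch, and compress it."""
    batch: List[Dict] = []
    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            yield zlib.compress(json.dumps(batch).encode("utf-8"))
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield zlib.compress(json.dumps(batch).encode("utf-8"))


def decode_batch(payload: bytes) -> List[Dict]:
    """Consumer-side inverse: decompress and deserialize one mini-batch."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

On the consumer side, `decode_batch` recovers the messages from a payload read from the broker queue; a real deployment would more likely use a binary image encoding than JSON lists of pixels.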

c: Video Stream Consumer Service
The acquired video streams now reside in the distributed broker, in different queues, in the form of mini-batches. To process these mini-batches, we have different groups of computer clusters known as Video Stream Analytics Consumer (VSAC) clusters. On each cluster, three types of client APIs are configured, i.e., RIVA services, VSCS, and LVSM. Each VSAC cluster has different domain-specific RIVA services, whereas the VSCS is common to all. The VSCS assists the RIVA service in reading the mini-batches of the video stream from the respective queue in the distributed broker for analytics, as shown in Fig. 4. The VSCS module has two main functions. First, it allows the RIVA service to read the mini-batches of the video stream from the respective queue in the distributed broker. Second, it saves the consumed unstructured video streams and their metadata to the raw video space in the DPDS and to the Data Source DS metastore, respectively.

d: Intermediate Results Manager
During the IVA service life-cycle, a sequence of algorithms is executed, so the output of one algorithm can be the input of another. The IR demands proper management in the distributed environment because one algorithm may run on one computer while the next runs on another. We therefore design the IRM, which sends and gets the IR to and from the queue RIVA_IR_ID in the distributed broker cluster. This module is also responsible for reading the IR from the respective queue and persisting it to the IR data store for future use, to avoid recomputation.
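The benefit of persisting IR, namely avoiding recomputation, can be illustrated with a small in-memory store. This is only a sketch under our own assumptions: the class `IntermediateResultStore`, the keying of results by a hash of the algorithm name and its input, and the hit/miss counters are hypothetical and not part of the proposed IRM.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IntermediateResultStore:
    """Toy in-memory stand-in for an IR data store that avoids recomputation."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(algorithm_name: str, algorithm_input: Any) -> str:
        # Assumption: inputs are JSON-serializable, so they can be hashed deterministically.
        payload = json.dumps([algorithm_name, algorithm_input], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, algorithm_name: str, algorithm: Callable[[Any], Any],
                       algorithm_input: Any) -> Any:
        """Return the stored IR if present; otherwise run the algorithm and persist its output."""
        key = self._key(algorithm_name, algorithm_input)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = algorithm(algorithm_input)
        self._store[key] = result
        return result
```

When two IVA services share a prefix of their pipelines (e.g., the same feature extractor on the same stream), the second service can reuse the stored IR instead of recomputing it.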

e: Lifelong Video Stream Monitor
The domain-specific RIVA service processes the video stream for anomalies or abnormal activities. If anomalies are detected, they are sent to the distributed broker queue (i.e., RIVA_A_ID) by using the LVSM instance. To generate notification-based responses, the LVSM follows the standard observer-based concept [56]. Based on this approach, the LVSM module reads the anomalies from the respective distributed broker queue, i.e., RIVA_A_ID, notifies the clients in near real-time, and simultaneously persists them to the ISDDS.
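The observer-based notification flow can be sketched as follows. This is a toy illustration under stated assumptions: the class `AnomalyMonitor`, its methods, and the dictionary layout of an anomaly are hypothetical, and an in-memory list stands in for persistence to the ISDDS.

```python
from typing import Callable, Dict, List

Anomaly = Dict[str, str]


class AnomalyMonitor:
    """Toy observer-pattern subject standing in for the LVSM."""

    def __init__(self) -> None:
        self._observers: List[Callable[[Anomaly], None]] = []
        self.persisted: List[Anomaly] = []  # stand-in for the anomalies store

    def attach(self, observer: Callable[[Anomaly], None]) -> None:
        """Register a client callback to be notified of every anomaly."""
        self._observers.append(observer)

    def report(self, anomaly: Anomaly) -> None:
        """Persist the anomaly and notify all attached clients."""
        self.persisted.append(anomaly)
        for observer in list(self._observers):
            observer(anomaly)
```

In a deployment, `report` would be driven by messages read from the anomaly queue, and each observer would push a notification to a subscribed client.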

2) Distributed Persistent Data Store
The second component of the VBDCL is the DPDS. The DPDS component provides permanent, distributed big data persistence storage for both structured and unstructured data. The DPDS provides two levels of abstraction over the acquired video data, i.e., the ISDDS and the Unstructured Persistent Distributed Data Store (UPDDS). The philosophy behind the DPDS and its two levels of abstraction in the context of L-CVAS is manifold. From the users' perspective, the CVAS demands geo-based, real-time, low-latency, and random read-write access to the data in the cloud. Similarly, the DPDS should also provide high-performance locality-based access to the data when the other layers deploy data-intensive IVA services. To meet such diverse requirements of the DPDS component, a Distributed File System (DFS) and a Distributed Big Datastore (DBDS) can be leveraged technologically.

a: Immediate Structured Distributed Datastore
The ISDDS is provided to manage large-scale structured data in the distributed environment over a DBDS. Because of the data-intensive operations and the requirements of the other layers, a distributed big data store can be deployed. The ISDDS hosts five types of data, each of which is described in this section.
L-CVAS provides role-based access to its users. L-CVAS user logs and the respective role information are maintained through the User Profile and Logs metastore. The proposed framework manages two types of video data sources through the Data Source metastore: video stream sources, for example, IP cameras, Kinect, body-worn cameras, etc., and batch video datasets. The former can be subscribed to RIVA services, while the latter are eligible for BIVA services. The meta-information of these sources, along with access rights, is managed through the Data Source metastore. Administrator and developer roles can develop, create, and deploy video analytics algorithms through the L-CVAS. Similarly, different IVA algorithms can be pipelined into an IVA service. Video analytics algorithms and services are managed through the Video Analytics Algorithm and Service metastores, respectively. As stated earlier, in an IVA pipelining environment, the output of one IVA algorithm can be the input of another algorithm. In this context, we design a general container called the IR datastore to persist and index the output of RIVA algorithms and services. This datastore is significant and can play a vital role in IVA pipeline optimization and in fast content-based searching and retrieval. Finally, L-CVAS users are allowed to subscribe data sources to the IVA services. The subscription information is maintained through the Subscription metastore, and the anomalies are maintained through the Anomalies metastore.
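The subscription rule stated above (stream sources go to RIVA services, batch datasets to BIVA services) can be expressed as a small validation sketch. The record types, field names, and the in-memory `subscriptions` list are hypothetical and only illustrate how a Subscription metastore might enforce the rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str  # "stream" (e.g., IP camera) or "batch" (video dataset)


@dataclass(frozen=True)
class Service:
    name: str
    mode: str  # "RIVA" (real-time) or "BIVA" (batch)


def can_subscribe(source: DataSource, service: Service) -> bool:
    """Stream sources may subscribe to RIVA services, batch datasets to BIVA services."""
    return (source.kind, service.mode) in {("stream", "RIVA"), ("batch", "BIVA")}


subscriptions: List[Tuple[str, str]] = []  # stand-in for the Subscription metastore


def subscribe(source: DataSource, service: Service) -> None:
    """Record a subscription, rejecting combinations the rule does not allow."""
    if not can_subscribe(source, service):
        raise ValueError(f"{source.kind} source cannot subscribe to a {service.mode} service")
    subscriptions.append((source.name, service.name))
```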
The ISDDS Data Model of the L-CVAS demands an efficient distributed data store. The distributed data store should provide horizontal scalability, high availability, partition tolerance, consistency, and durability. Furthermore, the data store should fulfill the read/write access demands of the BIVA and RIVA operations, interaction, and visualization. Traditional relational databases have little or no ability to scale out to accommodate the growing demands of big data, and as a result new distributed data stores have emerged. The distributed data stores can be grouped into two major categories, i.e., NoSQL (Not Only SQL) and NewSQL (excluding graph data stores).
NoSQL is a schema-free data store designed to support massive data storage across distributed servers [63], [64]. The features of NoSQL data stores include horizontal scalability, data replication, distributed indexing, simple API, flexibility, and consistency [65]. Unlike a Relational Database Management System (RDBMS), NoSQL lacks true Atomicity, Consistency, Isolation, Durability (ACID) transactions. In the context of the Consistency, Availability, Partition Tolerance (CAP) theorem [66], it has to compromise on either consistency or availability while choosing partition tolerance. NoSQL can further be categorized into Document, Key-value, and Extensible Record stores. A key-value data store is responsible for storing values and indexes for searching. A document data store is used for document storage, indexing, and retrieval. An extensible record data store stores extensible records that can be partitioned vertically and horizontally across the nodes.
NewSQL data stores provide the characteristics of both NoSQL and RDBMS: the ACID transactional consistency of relational databases together with SQL facilities, and the scalability and performance of NoSQL. MySQL Cluster, VoltDB, and ClustrixDB are examples of NewSQL. Some of the popular NewSQL and NoSQL data stores are shown in Table 4 along with their respective properties. Note that the 'A' and 'C' in CAP are not the same as the 'A' and 'C' in ACID [67].

b: Unstructured Persistent Distributed Data Store
Raw Video Space is used for the management of the video data. Raw Video Space is further divided into two types of video spaces. The first type is batch video that has been uploaded to the L-CVAS for batch analytics, whereas the second type is acquired and persisted from the VSDS. The entire acquired stream [...]
The UPDDS is supposed to be designed on top of a DFS. In this context, diverse types of open-source DFS have been developed. The implementation quality of big data applications depends on the file system of the storage tier. From an architectural perspective, handling large volumes and high throughput of data is challenging. Commonly, big data solutions exploit a cluster of computers, ranging from a few to hundreds of machines, connected through high-speed networks while deploying specialized distributed data and management software. In distributed data-intensive applications, large-scale data is always moving across the cluster and thus demands a distributed, scalable, reliable, and fault-tolerant file system [73], known as a DFS. A DFS is a file system that provides access to replicated files across multiple hosts on a computer network, ensuring performance, data locality, high availability, scalability, reliability, security, uniform access, and fault tolerance. Currently, various DFS are available and may differ in terms of performance, fault tolerance, content mutability, and read/write policy. Some state-of-the-art popular DFS that can be used for the UPDDS, along with a short description, are shown in Fig. 3.

c: Active and Passive Data Readers and Writer
This module gives secure read-write access to the underlying data according to the VBDCL Business Logic and the registered user access rights. This sub-module is composed of four types of readers and writers, i.e., ISDDS Active Data Reader, ISDDS Passive Data Reader, UPDDS Active Data Reader, and UPDDS Passive Writer. The active data reader and writer are used for real-time read-write operations over the data residing in the ISDDS and UPDDS, such as CRUD operations, file creation, and video stream writing to the DFS. For offline analytics over bulk video using distributed processing engines, the Passive Data Reader and Writer (PDRW) allows the processing engines to load bulk data and persist it to the ISDDS and UPDDS.

3) VBDCL Business Logic
The User Manager module encapsulates all the user-related operations, such as new user account creation, access role assignment, and session management. Through the Data Source Manager and Model Manager modules, the user can manage the VSDS, video data uploading, and model management. The (R/B)IVA Algorithm and Service Managers are built to manage, develop, and deploy new IVA algorithms and services, respectively. The former is provided aaS to the L-CVAS developers, while the latter is provided aaS to the consumers. The developer role can create and publish a new video analytics algorithm. The algorithm is then made available aaS to other developers, who can use it. Once IVA services are created, the L-CVAS users are allowed to subscribe the streaming video data sources and batch data to the provided RIVA and BIVA services, respectively. Similarly, the Ontology Manager allows the developer to get the IR for decision making. The Ontology Manager provides a secure way of getting the IR and maps it according to the ontology. This module also allows the user to manage the functionalities of the KCL. Finally, Cluster Management and Monitoring allows the administrator to monitor the health of the cluster.

B. DISTRIBUTED INTELLIGENT VIDEO ANALYTICS
IVA performs the complex tasks of extracting significant knowledge and information of interest from the video data, i.e., structural patterns, behavior patterns, content characteristics, event patterns [74], [75], and their relationships in the form of classification and clustering. Extracting knowledge and information from video big data is a processing-intensive task. Value extraction from video big data is beyond the capabilities of traditional tools and demands technological solutions that meet the processing requirements.
Therefore, video big data analytics is preferably performed in a distributed environment in a parallel manner while utilizing distributed scale-out computing technologies [76], such as MapReduce and Apache Spark. Distributed IVA not only significantly improves the performance but also reduces the analytics cost. Based on the generic IVA life-cycle and motivated by scikit-learn [77], the distributed IVA is divided into two layers, i.e., VBDPL and VBDML. These two layers are further elaborated in the following sub-sections.

1) Video Big Data Processing Layer
IVA requires video data pruning and strong feature extraction. With this intention, the VBDPL is designed to consist of three components, i.e., Video Preprocessing, Feature Extractor, and Dimensionality Reduction.

a: Video Preprocessing
The quality of data plays an active and significant role in solving a problem with ML. Raw videos have an unstructured format and contain noise and uncertainties, making them unsuitable for knowledge and information mining. The Video Preprocessing component is designed to address this and is supposed to deploy several distributed video preprocessing operations, including frame extraction [78], frame resizing, frame conversion from RGB to grayscale [79], shot boundary detection [80], segmentation [81], transcoding [82], and many more. In the first step, frames are extracted from a video for processing. Several frame selection algorithms are available, for example, keyframe extraction. The number of frames to be extracted depends on the user's objective and the task. Candidate frames can be all frames, step frames (every second frame, fifth frame, etc.), or keyframes [83]. The spatial operations highly depend on the scenario and objective. Spatial operations include frame resizing (for reducing computational complexity), corrections (brightness, contrast, histogram equalization, cropping, keyframes), mode conversion (RGB, grayscale, etc.), and many other operations. Segmentation is used for various purposes, such as partitioning video into semantically related chunks.
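To make the preprocessing operations above concrete, the following is a minimal single-machine sketch of step-frame selection, RGB-to-grayscale conversion, and frame resizing. It is an illustration only and not the L-CVAS implementation: the frame representation (nested lists of RGB tuples) and the function names `select_step_frames`, `to_grayscale`, and `resize_nearest` are assumptions introduced here, and the grayscale conversion uses the common ITU-R BT.601 luma weights as one possible choice.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Frame = List[List[Pixel]]      # rows of pixels (hypothetical in-memory format)


def select_step_frames(frames: List[Frame], step: int) -> List[Frame]:
    """Keep every `step`-th frame; step=1 keeps all frames."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return frames[::step]


def to_grayscale(frame: Frame) -> List[List[int]]:
    """Convert an RGB frame to grayscale using BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in frame]


def resize_nearest(gray: List[List[int]], new_h: int, new_w: int) -> List[List[int]]:
    """Resize a grayscale frame with nearest-neighbour sampling."""
    old_h, old_w = len(gray), len(gray[0])
    return [[gray[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
            for y in range(new_h)]
```

In a distributed setting, each of these functions would be applied independently per frame or per video chunk, which is what makes such preprocessing amenable to map-style parallelism.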

b: Feature Extractor
The Feature Extractor component implements distributed feature extraction algorithms. The performance of ML is highly dependent upon the type of data representation or features [84]. Features represent the characteristics of classes in the dataset and have a heavy impact on the ML algorithm's generalizability and performance; inappropriate or irrelevant features degrade that performance. Thus, feature extraction extracts the features from the raw videos that can be interpreted by the ML algorithm [85], [86]. In this context, several feature extraction algorithms have been introduced for video data. These feature extraction approaches can be categorized into static features of keyframes [87]-[89], object features [90], [91], dynamic/motion feature extraction [85], [92], [93], trajectory-based feature extraction [85], [86], [94], [95], and deep learning-based feature extraction [96]-[104].

c: Feature Selection and Dimensionality Reduction
Feature selection and dimensionality reduction reduce the size of the feature set. Large feature sets are expensive in terms of the time needed for training and for classification with the trained classifiers. For example, Principal Component Analysis (PCA) and its variants are used to reduce the size of features. During feature selection, the most relevant features are selected by discarding irrelevant and weak features. The performance of ML classifiers is also directly related to the quality of features: garbage in, garbage out (GIGO). Inappropriate or partially relevant features can negatively affect model performance. Therefore, only a limited set of features should be selected and used for training classifiers. This is precisely the purpose of this component, which deploys different algorithms in this context. Similarly, some feature reduction techniques select a specific, limited set of features in real time. For example, Online Feature Selection selects a small, fixed number of features and feeds them to the classifiers in real time [105]. In order to accelerate the training process, the authors of [106] used non-linear and group-based feature selection techniques, based on Adaptive Feature Scaling (AFS), to process data with very high dimensionality. The authors of [107] proposed an unsupervised feature reduction technique that selects highly relevant features and assigns suitable weights to the distinct feature dimensions.
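As a simple illustration of discarding weak features, the sketch below keeps the k features with the largest variance across the samples. This is a deliberately basic filter chosen for brevity; it is neither PCA nor any of the methods in [105]-[107], and the function names `select_by_variance` and `project` are hypothetical.

```python
from statistics import pvariance
from typing import List


def select_by_variance(samples: List[List[float]], k: int) -> List[int]:
    """Return the (sorted) indices of the k features with the largest variance."""
    n_features = len(samples[0])
    variances = [pvariance([s[j] for s in samples]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def project(samples: List[List[float]], keep: List[int]) -> List[List[float]]:
    """Keep only the selected feature columns of every sample."""
    return [[s[j] for j in keep] for s in samples]
```

A constant feature has zero variance and is dropped first, which matches the intuition that a feature that never changes cannot help a classifier separate classes.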

2) Video Big Data Mining Layer
The VBDML utilizes diverse types of machine-learning algorithms, i.e., supervised, semi-supervised, and unsupervised algorithms, to find different types of information in the videos [74], [75]. In this context, the VBDML hosts three types of components, i.e., Classification, Regression, and Clustering.
The Classification component provides various ML algorithms, e.g., Support Vector Machine (SVM), Nearest Neighbors, Random Forest, Decision Tree, Naïve Bayes, etc., that identify which of a set of predefined classes a particular object in a video frame belongs to. The Regression component includes different algorithms, e.g., Linear Regression, Decision Tree Regression, Logistic Regression, and many more, that predict a continuous-valued attribute associated with objects rather than discrete values. The Clustering component encapsulates algorithms, e.g., K-Means, spectral clustering, etc., that produce groups of data depending upon the similarity of data items.
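As a minimal example of what a Classification component does, the following sketch assigns a feature vector to one of a set of predefined classes using a 1-nearest-neighbour rule. The training data layout and the function name `nearest_neighbor_predict` are assumptions made for illustration; a production component would rely on a tested library implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple


def nearest_neighbor_predict(
    train: List[Tuple[Sequence[float], str]],
    query: Sequence[float],
) -> Optional[str]:
    """Return the label of the training sample closest to `query` (Euclidean)."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        d = math.dist(features, query)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

The function returns `None` for an empty training set, so callers can distinguish "no model" from a real prediction.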
L-CVAS supports both BIVA and RIVA. In this context, the VBDML should support batch learning and online learning. Batch learning considers the complete training data to learn and generate models. A batch-learning algorithm is expected to generalize, but it often does not perform well in a real environment. Unlike batch learning, online learning continuously learns from new input without making any statistical assumptions about the data [108]. In the context of model generalization, online learning is expected to work well by accurately predicting the predefined set of inputs [108]. Online learning is used instead of batch learning in environments where continuous learning from the data is required to learn new patterns.
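The difference between the two learning modes can be illustrated with a classic online learner. The sketch below is a perceptron that updates its parameters one sample at a time, so it can keep learning as new stream data arrives; a batch learner would instead require the full training set up front. The class name `OnlinePerceptron` and its `partial_fit` method are illustrative choices, not part of L-CVAS.

```python
from typing import List, Sequence


class OnlinePerceptron:
    """Binary perceptron trained one sample at a time (labels are -1 or +1)."""

    def __init__(self, n_features: int, lr: float = 1.0) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x: Sequence[float]) -> int:
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1

    def partial_fit(self, x: Sequence[float], y: int) -> None:
        """Update the model only when the current sample is misclassified."""
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y
```

Each incoming sample is seen once and then discarded, which is what makes this style of learner suitable for unbounded video streams.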

3) Distributed Deep Learning for IVA
Handcrafted features, e.g., Scale Invariant Feature Transform (SIFT) [109], Local Binary Pattern (LBP) [110], and Histogram of Oriented Gradients (HOG) [111], generate high-dimensional feature vectors and consequently face scalability issues. Recently, Convolutional Neural Network (CNN) based approaches have shown superior performance in tasks like optical character recognition [112] and object detection [113]. The motive of deep learning is to scale the training in three dimensions, i.e., the size and complexity of the models [114], the proportionality of the accuracy to the amount of training data [115], and the scalability of the hardware infrastructure [116]. The results of deep learning are so promising that deep learning is soon expected to give equivalent or higher performance compared to humans when trained over large datasets [113]. A CNN or ConvNet is a type of neural network that can recognize visual patterns directly from the pixels of images with less preprocessing. CNN-based video classification methods have been proposed in the literature to learn features from the raw pixels of both short videos and still images [97], [98], [117], [118].

Table 5 (excerpt):
- The error rate was decreased for deeper networks
- Introduced the notion of residual learning
- Mitigates the vanishing gradient issue
- Identity mapping based skip connection
- A bit complex architecture
- Lowers information of feature-map in feed forwarding
- Over-adaptation of hyperparameters for a specific task, due to the stacking of the same modules
In the proposed L-CVAS framework, both the VBDPL and the VBDML are capable of deploying deep-learning approaches for distributed IVA.
Since the dawn of deep learning, various open-source architectures have been developed. Some of the well-known and state-of-the-art CNN architectures are LeNet-5 [119], AlexNet [118], ZFNet [120], GoogleNet [121], VGGNet [122], and ResNet [123]. The comparison of these architectures can be found in Table 5. Similarly, several frameworks have been developed to eliminate the need for a manual definition of gradient propagation. Table 6 summarizes these libraries along with the comparisons.
TensorFlow is a popular library designed for ML and deep learning. It supports the deployment of computation on both CPUs and GPUs. TensorFlow allows the fast implementation of deep neural networks on the cloud. TensorFlow is also suitable for other data-driven research purposes and is equipped with TensorBoard (a visualization tool). Higher-level programming interfaces such as Luminoth, Keras, and TensorLayer were built on top of TensorFlow. Caffe, developed by Berkeley AI Research, is another library for building deep learning models efficiently, with GPU support in a distributed environment. PyTorch, maintained by Facebook, is a scientific computing framework with wide support for machine learning models and algorithms. PyTorch offers rich pre-trained models that can be easily reused. MXNet is a deep learning library suitable for fast numerical computation in both single-machine and distributed ecosystems. Likewise, some more deep learning libraries have been developed, such as CNTK, Deeplearning4j, Blocks, Gluon, and Lasagne, which can also be employed in the cloud environment.
Training big DL models with large-scale training data is a challenging task. For example, S. Gao et al. [130] utilized six learning algorithms, i.e., biogeography-based optimization, particle swarm optimization, genetic algorithm, ant colony optimization, evolutionary strategy, and population-based incremental learning, to find the best combination of user-defined neural network parameters during training. It is a hectic job for a single system when [...]
In the second approach, the deep learning model is replicated to all the cluster's worker-agents, as shown in Fig. 7. The training dataset is partitioned into non-overlapping sub-datasets, and each sub-dataset is loaded onto a different worker-agent of the cluster. Each worker-agent executes the training on its own sub-dataset. The model parameters are then synchronized among the cluster worker-agents to update the model. The data distribution approach naturally fits the distributed computing MapReduce paradigm [132]. MapReduce splits the input based on some predefined parameters. The map tasks then process these chunks in parallel. The output from the map tasks is shuffled and given as input to the reduce tasks for generating intermediate results. The intermediate results are combined to produce the complete result. Hadoop and Spark naturally process data distributed in this manner, and this approach is a popular research trend nowadays.
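The data-parallel scheme described above can be sketched on a single machine as follows. Each simulated worker-agent computes a gradient on its own non-overlapping shard (the map step), and the gradients are averaged to update the replicated parameter synchronously (the reduce step). The one-parameter model y = w * x, the round-robin partitioning, and all function names are assumptions chosen to keep the example small; averaging shard gradients with equal weights matches the full-batch gradient only when the shards have equal size.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def partition(data: List[Sample], n_workers: int) -> List[List[Sample]]:
    """Split the data into non-overlapping shards in round-robin order."""
    if not 1 <= n_workers <= len(data):
        raise ValueError("need 1 <= n_workers <= len(data)")
    return [data[i::n_workers] for i in range(n_workers)]


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train_data_parallel(data: List[Sample], n_workers: int = 2,
                        lr: float = 0.05, epochs: int = 200) -> float:
    shards = partition(data, n_workers)
    w = 0.0  # model parameter replicated on every worker
    for _ in range(epochs):
        grads = [local_gradient(w, s) for s in shards]  # map: one gradient per worker
        w -= lr * sum(grads) / len(grads)               # reduce: average and synchronize
    return w
```

In a real cluster, the list comprehension over shards would run on separate worker-agents, and the averaging step would be an all-reduce or a parameter-server update.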
In the third case, the DL model is partitioned, and each worker-agent loads a different segment of the DL model for training, as shown in Fig. 8. The training data are given to the worker-agents that carry the input layer of the DL model. In the forward pass, the output signal is computed and transmitted to the worker-agents that hold the next layer of the DL model. In the backpropagation pass, gradients are calculated starting at the workers that carry the DL model's output layer and propagate to the workers that hold the input layers of the DL model [133].
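The forward and backward hand-off between worker-agents in the model-parallel case can be illustrated with a toy network whose layers live on different simulated workers. Each `LayerWorker` below holds a single scalar linear layer; activations flow from the input-layer worker to the output-layer worker, and gradients flow back in the opposite direction. The class, the scalar layers, and the squared-error loss are simplifying assumptions, not the architecture of Fig. 8.

```python
from typing import List


class LayerWorker:
    """A simulated worker-agent holding one scalar linear layer: out = w * inp."""

    def __init__(self, w: float) -> None:
        self.w = w
        self._inp = 0.0

    def forward(self, inp: float) -> float:
        self._inp = inp  # cache the activation for the backward pass
        return self.w * inp

    def backward(self, grad_out: float, lr: float) -> float:
        grad_w = grad_out * self._inp
        grad_inp = grad_out * self.w  # gradient sent to the previous worker
        self.w -= lr * grad_w
        return grad_inp


def train_step(workers: List[LayerWorker], x: float, y: float, lr: float) -> float:
    """Run one forward/backward pass and return the loss before the update."""
    act = x
    for wk in workers:            # forward: input layer -> output layer
        act = wk.forward(act)
    grad = 2 * (act - y)          # derivative of the squared error
    for wk in reversed(workers):  # backward: output layer -> input layer
        grad = wk.backward(grad, lr)
    return (act - y) ** 2
```

Only activations and gradients cross worker boundaries here; the weights of each layer never leave the worker that owns them, which is the defining property of model parallelism.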

4) Big Data Engines, ML Libraries, and IVA
The VBDPL and VBDML are assumed to be built on top of distributed computing engines. This section overviews some of the latest big data engines that can be utilized for scale-out IVA.
Hadoop MapReduce [145] is a distributed programming model, developed based on GFS [73], for data-intensive tasks. Apache Spark follows a programming model similar to MapReduce but extends it with Resilient Distributed Datasets (RDDs), a data-sharing abstraction [135]. Hadoop's MapReduce operations are heavily dependent on the hard disk, while Spark is based on in-memory computation, which makes Spark up to a hundred times faster than Hadoop [135], [146]. Spark supports interactive operations and Directed Acyclic Graph (DAG) processing, and it processes streaming data in the form of mini-batches in near real-time [147]. Apache Spark is batch-centric and treats stream processing as a special case, lacking support for cyclic operations, memory management, and window operators. These issues have been elegantly addressed by Apache Flink [136]. Apache Flink treats batch processing as a special case and does not use micro-batching. Similarly, Apache Storm and Samza are other prominent solutions focused on working with large data flows in real-time. A brief description and comparison of the open-source big data frameworks are listed in Table 7.
To achieve scalability, existing video analytics modules can exploit big data techniques. The VBDPL is not provided by default and needs to be implemented on top of these big data engines. However, the ML approaches can be categorized into two classes in [...] [124], DeepLearning4J [140], Keras [141], Caffe [125], H2O [148], BigDL [142], and PyTorch [126]. All these libraries provide support for various ML algorithms and feature engineering. These libraries introduce an independent layer between front-end algorithms and a back-end engine to facilitate the migration from one big data engine to another, as shown in Table 7. By providing distributed environment abstraction and optimization, these algorithms can be used to process large datasets just as they would be processed on a single machine.

5) Computer vision benchmark datasets
In the advancement of IVA, public datasets always play a vital and active role. Over the years, several benchmark datasets have emerged. Some of the recent popular datasets are listed in Table 8. ImageNet [149] is one of the significant datasets in deep learning and is utilized for training neural networks such as ResNet, AlexNet, and GoogleNet. Some more datasets have been developed for human action and motion recognition, including [151], [151], [153], [156], [157]. Google released YouTube-8M [160], consisting of eight million diverse, automatically labeled videos. Deng A et al. [159] proposed the HowTo100M dataset, comprising web videos with narrated instructions. Datasets like MediaEval2015 [154] and Trecvid2016 [151] are designed to support CBVR-related research. For sports IVA, the Sports-1M dataset [karpathy2014large] is proposed and is composed of 487 classes along with ground truth. YACVID [155] is a labeled image sequence dataset for benchmarking video surveillance algorithms. All these benchmark datasets are used for different IVA tasks, such as action recognition, event detection, and classification.

C. KNOWLEDGE CURATION LAYER
Low-level video processing produces feature descriptors that summarize the characteristics of the data quantitatively. High-level analytics is more associated with visual data understanding and reasoning. The feature descriptors serve as input to the high-level analytics, which generates abstract descriptions of the contents. The difficult problem is to bridge the semantic gap between the low-level features and the high-level concepts suitable for human perception [161].
With the same intention, the KCL has been proposed under the L-CVAS architecture, on top of the VBDML. It maps the IR (both online and offline) into the video ontology in order to allow domain-specific semantic video and complex event analysis. The KCL Video Ontology Vocabulary standardizes the basic terminology that governs the video ontology, such as concept, attributes, objects, relations, video temporal relation, video spatial relation, and events. Video Ontology is a generic semantic-based model for the representation and organization of video resources that allows the L-CVAS users to perform contextual complex event analysis, reasoning, search, and retrieval. Semantic Web Rules express domain-specific rules and logic for reasoning. When videos are classified and tagged by the VBDML, the respective IR are persisted to the VBDCL and also mapped to the Video Ontology using the FeatureOnto Mapper. Finally, SPARQL-based semantically rich queries are allowed for knowledge graph, complex event reasoning, analysis, and retrieval.

D. WEB SERVICES LAYER
Finally, to provide the functionality of the proposed L-CVAS over the web, it incorporates top-notch functionality into simple, unified, role-based web services. The Web Service Layer is built on top of the VBDCL Business Logic. Sequence diagrams for IVA algorithm and service creation are shown in Fig. 10, whereas the role-based use case diagram of the proposed platform is shown in Fig. 9.

E. IVA SERVICE EXAMPLE SCENARIOS
We show two example scenarios, i.e., how to develop BIVA and RIVA services under L-CVAS. Hadoop and Spark MapReduce-type operations naturally fit a BIVA. Fig. 11 shows the block diagram along with a sample script for distributed BIVA, i.e., object classification, where a DFS is configured to read and store video files, e.g., in a standard video file format such as MPEG, AVI, H.264, etc. First, the videos are loaded, and distributed video transcoding (a preprocessing algorithm under VBDPL) is performed. For example, a user uploads an MPEG or other video file to the DFS. The transcoder algorithm first splits the file into image frames, which may be throttled to key-value frames, and converts them into a sequence file format. These frames are then mapped, and features are extracted (utilizing some feature extractor from VBDPL). The features are then classified (using a classification algorithm of VBDML). For each detected object, a key value and the coordinates of where the object is located within the image are computed. For each detected object in each frame, the map element processes the frame. The map operation generates a composite key-value pair that includes the visual key, a time-stamp that identifies the frame, and the coordinates. The map stages then send this composite key-value pair to the respective reduce stages. The reduce stages provide the output to an output stage, which is persisted in the IR-DS of the ISDDS.
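The map and reduce stages of the BIVA scenario can be imitated on a single machine as below. This sketch is not the script of Fig. 11: it assumes that each frame already carries its detections (so no detector is invoked), and the type aliases and function names `map_frame`, `shuffle`, `reduce_key`, and `run_biva` are hypothetical. The map stage emits a composite pair of visual key, time-stamp, and coordinates; the reduce stage groups the occurrences per key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

BBox = Tuple[int, int, int, int]        # (x, y, width, height)
Detection = Tuple[str, BBox]            # (visual key, coordinates)
Frame = Tuple[float, List[Detection]]   # (time-stamp, detections in the frame)
Occurrence = Tuple[float, BBox]


def map_frame(frame: Frame) -> Iterator[Tuple[str, Occurrence]]:
    """Map stage: emit (visual key, (time-stamp, coordinates)) per detected object."""
    timestamp, detections = frame
    for key, bbox in detections:
        yield key, (timestamp, bbox)


def shuffle(pairs: Iterable[Tuple[str, Occurrence]]) -> Dict[str, List[Occurrence]]:
    """Group the intermediate pairs by visual key."""
    groups: Dict[str, List[Occurrence]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_key(key: str, values: List[Occurrence]) -> Tuple[str, List[Occurrence]]:
    """Reduce stage: order the occurrences of one visual key by time-stamp."""
    return key, sorted(values)


def run_biva(frames: Iterable[Frame]) -> Dict[str, List[Occurrence]]:
    pairs = (pair for frame in frames for pair in map_frame(frame))
    return dict(reduce_key(k, v) for k, v in shuffle(pairs).items())
```

The returned dictionary plays the role of the intermediate results that would be persisted to the IR-DS.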
The flow for a single RIVA along with an example pseudo-code is shown in Fig. 12. The VSAS component sends the video stream to the queue in the broker cluster. Then the consumer service (VSCS) is used to extract the mini-batch of video streams from the queue and process it. Once the mini-batch is consumed, then it is transcoded, features are extracted for classification, and finally, the classification results are persisted to the required destination (IR-DS and/or dashboard).
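The consume-process-persist loop of a RIVA service can be sketched with the standard library, using `queue.Queue` as a stand-in for the distributed broker. This is not the pseudo-code of Fig. 12: the message layout `(frame_id, payload)`, the caller-supplied `classify` function, and the names `consume_mini_batch` and `run_riva` are assumptions for illustration.

```python
import queue
from typing import Any, Callable, List, Tuple

Message = Tuple[int, Any]  # (frame id, payload); hypothetical message layout


def consume_mini_batch(q: "queue.Queue[Message]", max_size: int) -> List[Message]:
    """Take up to `max_size` messages from the queue without blocking."""
    batch: List[Message] = []
    while len(batch) < max_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def run_riva(q: "queue.Queue[Message]", classify: Callable[[Any], str],
             batch_size: int) -> List[Tuple[int, str]]:
    """Consume mini-batches until the queue is drained and collect the results."""
    results: List[Tuple[int, str]] = []
    while True:
        batch = consume_mini_batch(q, batch_size)
        if not batch:
            break
        # In L-CVAS terms, this is where transcoding, feature extraction,
        # and classification of the mini-batch would take place.
        results.extend((frame_id, classify(payload)) for frame_id, payload in batch)
    return results
```

A real stream never drains, so a deployed consumer would block or poll instead of stopping on an empty queue; the early exit here only keeps the example finite.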

F. EXECUTION SCENARIOS
L-CVAS follows the lambda architecture style [29], and execution follows one of two scenarios, i.e., the Streaming Execution Scenario and the Batch Execution Scenario. These two scenarios aim to execute a massive amount of real-time video streams and batch videos against the subscribed IVA services. In the literature, these execution scenarios are referred to as the Speed Layer and the Batch Layer, respectively [29]. The data of both scenarios are managed through a common layer called the Serving Layer. L-CVAS components are deployed on various types of clusters in the cloud, and each cluster is subject to scale-out on demand. Fig. 13 illustrates these execution scenarios, and the explanation is given in the following subsections.
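The role of the Serving Layer in a lambda-style design can be illustrated with a small sketch in which a query combines a precomputed batch view with an incrementally updated speed view. The choice of per-label object counts as the view, and the names `batch_view`, `SpeedView`, and `serve`, are assumptions made for this example rather than parts of L-CVAS.

```python
from collections import Counter
from typing import Iterable


def batch_view(archived_labels: Iterable[str]) -> Counter:
    """Batch layer: recompute label counts from the complete archived data."""
    return Counter(archived_labels)


class SpeedView:
    """Speed layer: maintain counts for data that arrived after the last batch run."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, label: str) -> None:
        self.counts[label] += 1


def serve(batch: Counter, speed: SpeedView, label: str) -> int:
    """Serving layer: answer a query by merging the batch and speed views."""
    return batch[label] + speed.counts[label]
```

Counts merge by simple addition; views that are not additive would need a merge function specific to the query.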

1) Streaming Execution Scenario
L-CVAS is supposed to deploy a pool of contextual RIVA services that are made available to the user for subscription. Once a video stream data source is subscribed to a service in the pool of contextual RIVA services, the life-cycle of the Streaming Execution Scenario passes through different stages while using distinct L-CVAS components. For ease of understanding, these components are deployed on six types of computing clusters in the cloud, which are labeled explicitly as 'P', 'V', 'K', 'S', 'N', and 'I', as shown in Fig. 13.
The cluster 'P' hosts VSAS and provides interfaces to external VSDS. On configuration, the video streams are captured and transformed into proper messages, which are then grouped into micro-batches, compressed, and loaded to the respective queue in the cluster 'K'.
The cluster 'K' deploys the distributed messaging system, where the acquired video streams, IR, and anomalies produced by LVSM are buffered. In this context, the cluster 'K' is composed of a set of three types of queues for each service, i.e., RIVA_ID, RIVA_IR_ID, and RIVA_A_ID, as described in IV-A1.
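The per-service queue layout can be made concrete with a small sketch. It assumes that the `ID` suffix in RIVA_ID, RIVA_IR_ID, and RIVA_A_ID stands for a service identifier, which is an interpretation of the naming rather than something the text states; the helper `service_queues` and the `InMemoryBroker` class are hypothetical stand-ins for a distributed messaging system.

```python
from collections import defaultdict, deque
from typing import Any, Deque, Dict, Optional


def service_queues(service_id: str) -> Dict[str, str]:
    """Derive the three queue names of one RIVA service (assumed naming scheme)."""
    return {
        "stream": f"RIVA_{service_id}",       # acquired video stream mini-batches
        "ir": f"RIVA_IR_{service_id}",        # intermediate results
        "anomalies": f"RIVA_A_{service_id}",  # anomalies
    }


class InMemoryBroker:
    """Single-process stand-in for the distributed messaging system in cluster 'K'."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Any]] = defaultdict(deque)

    def publish(self, queue_name: str, message: Any) -> None:
        self._queues[queue_name].append(message)

    def consume(self, queue_name: str) -> Optional[Any]:
        """Return the oldest message of the queue, or None if it is empty."""
        q = self._queues[queue_name]
        return q.popleft() if q else None
```

Keeping streams, IR, and anomalies in separate queues lets each downstream cluster subscribe only to the traffic it is responsible for.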
The mini-batches of video streams residing in the distributed broker's RIVA_ID queue need to be persisted to the UPDDS and ISDDS data stores. For this purpose, the cluster 'V' deploys three types of L-CVAS modules, i.e., VSCS, Video Processor and Persistence. The first module allows the cluster 'V' to read the video stream mini-batches from RIVA_ID topics in the Cluster 'K'. The cluster 'V' then processes the consumed video data, encodes it, and extracts the metadata from the video. Finally, the video stream persistence module saves the video data and the respective metadata to the UPDDS and ISDDS, respectively.
The cluster 'S' is responsible for processing the video stream in near real-time while using the IVA services. Different stream processing engines, e.g., Apache Spark Streaming, can be used for RIVA. The cluster 'S' deploys four types of L-CVAS modules. The first module is the VSCS, which is used to consume the video streams from the RIVA_ID queue in the cluster 'K'. The second type of module is the RIVA services. The RIVA service is the actual video stream analytics service that analyzes the videos. The RIVA service is loaded according to the RIVA service subscription contract made by the L-CVAS user. The RIVA services can be pipelined, and the IR might be needed by other applications in the multi-subscription scenario. Thus, the third type of module, the IR producer/subscriber, is used to send and receive the IR to and from the IR queue in the cluster 'K' according to the application logic.
The fourth type of module is the LVSM producer. The contextual IVA service (IVAS) instance deployed in the cluster 'S' should have some domain-specific goal and can produce anomalies if any are detected. The L-CVAS supports a real-time anomaly delivery system. The IVAS sends the anomalies continuously to the LVSM producer, which forwards them to the respective anomalies queue RIVA_A_ID in the cluster 'K'.
The cluster group 'I' continuously reads the IR from the RIVA_IR_ID queue in the cluster 'K' and sends it to the ISDDS's IR data store for persistence. If the subscribed service uses the ontology, then the IR are also mapped to the VidOnto triples residing in the Knowledge Curation Server 'T'.
The final type of cluster in the Streaming Execution Scenario is the cluster 'N', known as the Anomalies Notification Cluster. This cluster reads anomalies from the RIVA_A_ID queue in the cluster 'K', sends them to the ISDDS for persistence, and also delivers them in real-time to the video stream source owner in the form of alerts.

2) Batch Execution Scenario
The L-CVAS architecture is also equipped with BIVA. Unlike RIVA, BIVA is performed in an offline manner. The execution time of offline analytics is proportional to the video dataset size and the computational complexity of the subscribed BIVA service. The batch execution life-cycle passes through three types of clusters, i.e., 'R', 'M', and 'B'.
The cluster 'R' allows the L-CVAS user to upload the batch video dataset to the L-CVAS cloud and configures three types of L-CVAS modules. The first type of service is the Batch Video Acquisition Service, which is used to acquire batch video datasets. Once uploaded to the node buffer, the batch dataset is processed by the activated Video Processor to extract the metadata from the batch videos. After processing, the batch video dataset and the respective metadata are persisted to the UPDDS and ISDDS, respectively. Similarly, the cluster 'M' works the same way as the cluster 'R', but it is responsible for model management.
In batch video analytics, the supporting layers deploy various contextual multi-domain offline BIVA services. The cluster 'B' loads the instance of the BIVA services as per the user contract and processes the videos in an offline manner. Once subscribed, this cluster loads the batch video dataset and model from the UPDDS. Similarly, the IR and anomalies are maintained in the ISDDS. The acquired video streams residing in the UPDDS are also eligible for offline analytics.
Finally, the Web Server 'W' deploys the Web User Interface (as described in IV-D), which allows the users to interact with L-CVAS.

V. IVA: CONSTITUENTS AND PREDOMINANT TRENDS
This section reviews the existing IVA literature, which can be classified into four classes under the umbrella of IVA, i.e., CBVR, IVA Surveillance and Security, Video Summarization, and Semantics Approaches, as shown in Fig. 14. We also show how L-CVAS can be used in these application areas.

A. CONTENT-BASED VIDEO RETRIEVAL
CBVR has applications ranging from video browsing to intelligent management of video surveillance and analysis. To uphold advancement in CBVR, the National Institute of Standards and Technology has been sponsoring the annual Text Retrieval Conference Video Retrieval Evaluation since 2001 [162], [163]. CBVR is an active research area, and several surveys are available, e.g., [32], [163]-[166]. Here, however, we discuss some of the scalable CBVR systems proposed in the literature.
In the literature, some researchers have tried to exploit distributed computing technology for the development of large-scale CBVR systems. Shang et al. [167] utilize the time-oriented temporal structure of videos and the relative gray-level intensity distribution of the frames as a feature base. Their method is expensive to parallelize because, owing to the video semantics, all the frames of a video must be processed within the same execution environment. Hence, this approach is challenging to parallelize accurately and efficiently even for state-of-the-art big data frameworks. Wang et al. [168] proposed a novel MapReduce framework, called Multimedia and Intelligent Computing Cluster, for near-duplicate video retrieval over large-scale multimedia data; it combines the computing power of CPUs and GPUs to speed up video data processing. They extract keyframes using uniform sampling, store the keyframes in HDFS, and perform local feature extraction using the Hessian-Affine detector [169] to detect interest points. K-means clustering over the feature vectors is used to generate visual words following the BoF [170] model, thus producing BoF-based feature vectors. Ding et al. [171] used big data processing technologies to design SurvSurf, a human retrieval system for extensive surveillance video data. Motion-based segments, called M-clips, were detected and used to remove redundant videos. The Hadoop MapReduce framework was used to process M-clips for human detection and motion feature extraction. Vision algorithms were accelerated by processing only sub-areas with significant motion vectors rather than entire frames. Further, a distributed data store called V-BigTable was designed on top of HBase to structure the semantic information of M-clips and enable large-scale M-clip retrieval. They stated that SurvSurf outperforms the baseline solutions in terms of computational time while achieving acceptable human retrieval accuracy.
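To make the BoF pipeline described above more concrete, the following minimal sketch shows uniform keyframe sampling, visual-word generation with k-means, and the construction of a BoF histogram. It is only an illustration under our own assumptions and not the implementation of [168]: the function names (`uniform_keyframes`, `kmeans`, `bof_histogram`) are hypothetical, descriptors are toy 2-D points rather than Hessian-Affine descriptors, and everything runs on a single machine instead of MapReduce.

```python
def uniform_keyframes(n_frames, step):
    """Indices of keyframes picked by uniform sampling (every `step`-th frame)."""
    return list(range(0, n_frames, step))


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; seeding with the first k points is an assumption."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def bof_histogram(descriptors, vocabulary):
    """L1-normalized histogram of visual-word assignments for one keyframe."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d, vocabulary)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


if __name__ == "__main__":
    # Toy local descriptors pooled from all keyframes (two obvious groups).
    pooled = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    vocabulary = kmeans(pooled, k=2)  # the visual words
    print(bof_histogram([(0.2, 0.1), (9.5, 10.2), (10.1, 9.9)], vocabulary))
```

In a distributed setting, the per-keyframe histogram computation is the naturally parallel part, since each keyframe can be encoded independently once the vocabulary is shared.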
Authors in [172] proposed Marlin for video big data similarity search. They used parallel computing to extract features from the acquired video micro-batches, which are then persisted in a distributed feature indexer. The proposed indexer supports incremental updates and real-time queries. They designed fine-grained resource allocation with a resource-aware data abstraction layer over streaming videos to increase system throughput. They reported that Marlin achieved 25X and 23X speedups over the sequential feature extraction algorithm and similarity search, respectively. Lv et al. [173] addressed the challenge of extracting distinctive features so that closely related videos can be retrieved efficiently from large-scale data using Spark. To balance precision and efficiency, they introduced a multi-feature distributed system that includes local and global features. They combined the local feature SIFT, Local Maximal Occurrence, and the global feature Color Name, and they developed the system in a distributed environment based on Apache Spark. Further, M. N. Khan et al. [174] proposed FALKON for large-scale CBVR, which utilizes distributed deep learning on top of Apache Spark for accuracy, efficiency, fault tolerance, and scalability. Motivated by the fact that Apache Spark does not provide a native video data structure by default, they developed a wrapper on top of Spark's RDD called VidRDD. Utilizing VidRDD, they first performed structural analysis on the videos and then indexed the extracted deep spatial and temporal features in their distributed indexer. Finally, they evaluated the proposed system and showed performance improvements in terms of scalability and accuracy. Likewise, Lin FC. et al. [175] put forward a cloud-based face video retrieval system that utilizes deep learning. First, pre-processing operations such as the removal of blurry images and face alignment are performed.
Then the refined dataset is constructed and used to pre-train the CNN models, i.e., ArcFace, FaceNet, and VGGFace, for face recognition. The results of these three models were compared, and the most efficient one was chosen for the development of the retrieval system.
The input query in their proposed system is a person's name. If the system detects a new person, it enrolls that person. Finally, timestamped results are returned against a query. A prototype of the proposed face retrieval system was implemented, and its recognition accuracy and computational time were reported.

1) CBVR under L-CVAS Architecture
L-CVAS provides an elegant and flexible six-step solution for the implementation and customization of scalable video indexing and retrieval, as shown in Fig. 15. First, the VSAS component acquires the video streams from VSDS in the form of mini-batches; in the case of batch analytics, video data is loaded from the RAW-DS and fed to the VBDPL.
In the second step, VBDPL performs pre-processing operations and feature extraction. The former encompasses structural exploration of a video, i.e., video scene and shot detection and frame and keyframe extraction, while the latter extracts the low-level features. These low-level features can be keyframes' static features (texture-based, color-based, and shape-based), object features, and/or motion features (trajectory-based, statistics-based, and objects' spatial relationships-based). The extracted features are then handed over to the VBDML for classification and annotation.
In the third step, the indexes of the semantic and high-dimensional video feature vectors form the representative index for the persisted video sequences in IR-DS. The IR-DS is synchronized with RAW-DS.
In the fourth step, users are allowed to query the desired videos through the WSL. Various types of queries have been utilized for video retrieval in the literature. The queries can be categorized as Query by Example, Sketch, Objects, Keyword, and Natural Language. In Query by Example, videos or images similar to the given sample video or image are retrieved using feature similarity. In Query by Sketch, features are extracted from user-drawn sketches (a sketch represents the required videos) and compared against the features generated from the stored videos [176]. In Query by Object, the user provides an object image, and the CBVR system then searches for all occurrences of the object in the video database [177]. Likewise, the last two approaches use Keywords and Natural Language [178] as the query. Combination-based Query, which combines different types of queries for CBVR, has also been adopted by some researchers, such as [179], [180].
In the fifth step, the CBVR system searches for videos similar to the user query. Similarity measure approaches can be classified as feature, text, ontology, and combination-based matching. The similarity measure depends on the type of query. The feature matching-based similarity measure is the average distance between the features of the corresponding frames [181]. The query uses low-level features for extracting relevant videos. The benefit of this approach is that the video similarity can be found easily from the features, but it is not appropriate for semantic similarity, which is more familiar to users. In the text matching-based similarity measure, the name of each concept is matched with the given query terms to find the videos that satisfy the query. The best example of this approach is that of Snoek et al. [182]. This approach is simple, but the query text must include the relevant concepts to get satisfactory search results. In ontology-based matching, the similarity between semantic concepts is measured utilizing an ontology. Query descriptions are enriched from knowledge sources, such as the ontology of concepts. Adding extra concepts can improve the retrieval results [183] but can also degrade search results unexpectedly. This approach is further explained in section V-D. Combination-based matching "leverages semantic concepts from a training collection by learning the combination strategies [87], and query-class-dependent combination models [180]" [89]. With this approach, the concept weights can be automated to some extent and hidden semantic concepts can be handled, but the query combination models are difficult to learn.
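The feature matching-based measure above can be sketched in a few lines. The snippet below is a minimal illustration under our own assumptions, not the method of [181]: frames are paired by index, the Euclidean distance is used as the per-frame distance, and the names `average_frame_distance` and `rank_videos` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_frame_distance(query_frames, video_frames):
    """Mean distance between features of corresponding frames (paired by index)."""
    n = min(len(query_frames), len(video_frames))
    if n == 0:
        raise ValueError("both videos need at least one frame feature")
    # zip stops at the shorter sequence, so exactly n pairs are compared.
    return sum(euclidean(q, v) for q, v in zip(query_frames, video_frames)) / n


def rank_videos(query_frames, database):
    """Video ids ordered from most to least similar (smallest distance first)."""
    scored = [
        (average_frame_distance(query_frames, feats), vid)
        for vid, feats in database.items()
    ]
    return [vid for _, vid in sorted(scored)]


if __name__ == "__main__":
    query = [[0.0, 0.0], [1.0, 1.0]]
    database = {
        "a": [[0.0, 0.0], [1.0, 1.0]],  # identical to the query
        "b": [[3.0, 4.0], [4.0, 5.0]],  # every frame is 5.0 away
    }
    print(rank_videos(query, database))
```

Because the measure operates on low-level features only, two videos that are semantically related but visually different would still receive a large distance, which is the limitation noted above.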
Finally, the ranked result set is presented to the user for browsing. To increase efficiency and to optimize the results of the CBVR system, many researchers use relevance feedback obtained from the user. This feedback can be categorized as explicit, implicit, and pseudo feedback. In the first case, users are asked to actively select relevant videos from the previous results [184]. The explicit feedback approach obtains better results than the other two approaches, but direct user involvement is required. In the second case, the retrieval results are refined by exploiting the user's interaction with the system, e.g., the clicking pattern [185]. In the third case, there is no involvement of the users. The feedback is produced from positive samples (closely related to the query sample) and negative samples (different from the query sample) taken from the previous retrieval. These samples are fed to the system for the next search. The approaches of Yan et al. [186] and Hauptmann et al. [187] are based on pseudo-relevance feedback. This approach substantially reduces user interaction, but because of the semantic gap between low-level and high-level features, the similarities obtained from different videos do not always agree with the similarities between the user-defined videos.

B. REAL-TIME INTELLIGENT VIDEO ANALYSIS AND SURVEILLANCE
In the context of growing security concerns, surveillance is commonly understood to mean criminality and intrusion detection. Video surveillance is not limited to these, however; it encapsulates all aspects of monitoring to capture the dynamics of diverse application areas, e.g., transportation, healthcare, retail, and service industries. A generic use case for RIVA has been shown in section IV-E, Fig. 12. Likewise, domain-specific RIVA services for security and surveillance can be created under the L-CVAS architecture. In this section, we discuss recent literature on how researchers are advancing activity and behavior analytics in video streams with the aim of intrusion and crime detection, scene monitoring, and resource tracking.

1) Video Segmentation for Action Recognition
A combination of numerous actions, objects, and scenes forms complex events [188]. Video analytics against complex events is a nontrivial task. Complex event detection demands the association of multiple semantic concepts because it is almost impossible to capture a complex event through a single event class label [189]. Video segmentation is required to mine informative segments regarding the event happening in the video. For effective event detection in video segmentation, it is vital to take into account the temporal relations between key segments in a particular event. Event videos exhibit intra-class variation, and several training videos are required to cover all possible instances of event classes. Song and Wu [190] suggested a methodology to extract key segments automatically for event detection in videos by learning from a loosely tagged pool of web videos and images. For the positions of key segments and content depiction in the video, they used an adaptive latent structural SVM model and semantic concepts, respectively. They also developed two types of models, i.e., the Temporal Relation Model and the Segment-Event Interaction Model, for the temporal relations between video segments and for evaluating the correlation between key segments and events, respectively. They adapted labeled videos and images from the Web into their model and employed 'N' adjacent point sample consensus [191] for noise elimination. Zhang et al. [192] produced a knowledge base to decrease the semantic gap between complex events. To effectively model event-centric semantic concepts, they used a large number of web images for learning noise-resistant classifiers.
Action recognition in videos encompasses both segmentation and classification. Action recognition can be tackled by sliding windows and aggregation in a sequential as well as an isolated manner, or by performing both tasks in parallel. Good examples of video segmentation-based event detection are [189], [193]. The related literature on video segmentation-based action recognition can be classified as action segmentation, depth-based approaches, and deep learning-based motion recognition.
The most popular action segmentation model is the dynamic time warping scheme [194]. Action segmentation from videos using an appearance-based method considers a comparison between the start and end frames of contiguous actions. Quantity of movement [195] and KNN along with HOG [196] are commonly used to identify the start and stop frames of actions. Similarly, for action recognition, depth-based approaches have been developed, which include the binary range-sample feature [197], capturing local motion and geometry information [198], the histogram of oriented 4D normals [199], and the combination of depth motion maps and HOG [200]. Deep learning approaches have been applied to depth-based action recognition methods in [201]-[203]. Besides, for motion recognition, deep learning has been utilized in many ways. One is a suboptimal way, in which video represented as still images is fed to the channel of a CNN [97], [117]. Sometimes video in the form of compact images is given as input to an already trained CNN to achieve good performance [204]. Video in the form of a sequence of images is input to a recurrent neural network (RNN) for sequential parsing of video frames for long-term as well as short-term patterns [205], [206]. In order to introduce a temporal dimension, video is regarded as a volume, and the 2D filters of the CNN are substituted with 3D filters [207], [208].
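As an illustration of dynamic time warping, the sketch below computes the classic DTW alignment cost between two one-dimensional sequences, for example per-frame quantity-of-movement values of a template action and of a candidate segment. It is a textbook formulation rather than the specific scheme of [194]; the function name `dtw_distance` and the use of the absolute difference as local cost are our assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays (b[j-1] is reused)
                cost[i][j - 1],      # b advances, a stays (a[i-1] is reused)
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    template = [1, 2, 3]
    slower_repeat = [1, 2, 2, 3]  # same action performed more slowly
    print(dtw_distance(template, slower_repeat))
```

Because the warping path may reuse elements, an action performed at a different speed can still align with its template at low cost, which is why DTW is attractive for locating action boundaries.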
In many application areas, e.g., resource tracking, action recognition, human behavior recognition, and traffic control, object tracking and motion detection in videos are vital. The process of detecting and following moving objects across video frames is known as object tracking. For object tracking, the foreground information is first extracted from the videos, and the background of the scene is modeled using a background subtraction algorithm, e.g., IAGMM [209], [210]. To increase the accuracy, shadow elimination algorithms are subsequently applied to the foreground frame [211]. A connected component algorithm is used to determine the bounding box of an object. To ensure frame-to-frame matching of the detected object, a method such as adaptive mean shift [212] can be used for comparison. Factors like size and distance are used for object matching between frames. Finally, an occlusion scheme is used to detect and resolve occlusion. On the other hand, motion can be detected via foreground images extracted by Gaussian Mixture Model background modeling, with a connected component algorithm for noise removal. The area of detection is refined using a connected component algorithm, which produces the bounding box information of the moving objects [209]. The output of the detection is a binary mask representing the moving object for each frame in a particular sequence. Detecting moving objects is challenging, especially in the presence of shadows and cloud movement [213]. The related literature on moving object detection and classification can be categorized as moving objects with a stationary camera and moving objects with a moving camera.
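The detection chain described above (foreground extraction, noise removal, bounding boxes) can be sketched as follows. This is a deliberately simplified illustration: a thresholded difference against a single static background image stands in for the Gaussian Mixture Model or IAGMM background model, frames are small grayscale grids given as nested lists, and the names `foreground_mask` and `bounding_boxes` as well as the threshold values are our assumptions.

```python
from collections import deque


def foreground_mask(frame, background, threshold=30):
    """Binary mask: 1 where the pixel differs from the background by more than threshold."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def bounding_boxes(mask, min_area=1):
    """Label 4-connected components with BFS; return (top, left, bottom, right) boxes.

    Components smaller than min_area pixels are treated as noise and dropped.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    background = [[0] * 5 for _ in range(5)]
    frame = [row[:] for row in background]
    for y, x in ((1, 1), (1, 2), (2, 1), (2, 2)):
        frame[y][x] = 200   # a 2x2 moving object
    frame[4][4] = 200       # an isolated noisy pixel
    mask = foreground_mask(frame, background)
    print(bounding_boxes(mask, min_area=2))
```

Shadow elimination, frame-to-frame matching, and occlusion handling are intentionally left out; they would operate on the mask and boxes produced here.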

a: Moving object detection with stationary camera
In fixed-camera video, the background image pixels in each frame remain the same, and thus simple background subtraction techniques suffice. The object detection approaches using a fixed camera can be grouped as feature-based, motion-based, classifier-based, and template-based models [214]. Object tracking in videos has been categorized into point tracking, kernel tracking, and silhouette tracking by [214], and into feature-based, region-based, and contour-based tracking by [215]. Unlike with a fixed camera, moving object detection with a moving camera is relatively challenging because camera motion causes background modeling for separating foreground and background pixels to fail [216]. Approaches proposed for the moving-camera case include the following.
• Trajectory classification involves computing long trajectories for feature points and discriminating the trajectories that belong to different objects from those of the background using a clustering method. Some recommended algorithms include compensating long-term motion based on the optical flow technique [217], a bag-of-words classifier, and a pre-trained CNN method for detecting moving object trajectories [218].
• In background modeling-based methods, a frame-by-frame background is created for each sequence utilizing the motion compensation method. Some popular algorithms for background modeling are Mixture of Gaussian [219], complex homography [220], the Gaussian-based method [221], adaptive MoG [222], multi-layer homography transform [223], thresholding [224], and the CNN-based method [225].
• In the extension of the background subtraction method, the low-rank and sparse matrix decomposition method for a static camera [226] is extended to a moving camera. If there exists coherency between a set of image frames, then a low-rank representation of the matrix created by these frames contains the coherency, and the sparse matrix representation contains the outliers, which represent the moving objects in these frames. Low-rank and sparse decomposition involves segmenting moving objects from the fixed background by applying principal component pursuit. It is a valuable technique in background modeling. The mathematical formulation and optimization of this method can be found in [227].
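For orientation, the low-rank and sparse decomposition mentioned in the last item is commonly stated as the principal component pursuit problem below. The notation is our own assumption and may differ from [227]: M is the matrix whose columns are the vectorized frames, L is the low-rank (background) component, S is the sparse (moving object) component, the first norm is the nuclear norm (sum of singular values), the second is the entrywise l1 norm, and lambda > 0 is a trade-off weight.

```latex
\[
  \min_{L,\,S} \; \|L\|_{*} + \lambda \, \|S\|_{1}
  \quad \text{subject to} \quad M = L + S
\]
```

The nuclear norm acts as a convex surrogate for the rank of L and the l1 norm as a convex surrogate for the number of nonzero entries of S, which is what makes the decomposition tractable.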

2) Behavior Analysis
Usually, a camera is mounted near a digital display to analyze and understand human behavior by investigating how users interact with the display [228]. Commercial tools have been developed to analyze audience behavior using video analytics while considering parameters like age, gender, distance from the display, sight, and time spent. The obtained data can then be used to improve advertising campaigns in combination with sales data [229].
Recently, crowd analytics, i.e., human detection and tracking, has attracted the attention of researchers. Crowd analysis covers the exploration of both group and individual behavior to determine abnormality. Congestion analysis, motion detection, tracking, and behavior analysis are the main attributes of crowd analytics. While performing crowd analysis, factors like terrain features, geometrical information, and crowd flow can be considered.
For the analysis of crowded scenes, motion features are vital and can be categorized as flow-based features, local spatiotemporal features, and trajectory features [230]. These features have applications in crowd behavior recognition, abnormality detection in crowds, and motion pattern segmentation.

a: Flow-based Features
The flow-based features are pixel-level features, and different schemes have been proposed in the literature [231], [232], which can be further classified as optical flow, particle flow, and streak flow. The optical flow technique computes pixel-wise motion between successive frames and can handle multi-camera object motion. It has been applied to crowd motion detection and crowd segmentation [233]. However, optical flow is unable to capture the spatiotemporal attributes of the flow and long-range dependencies. Particle flow moves a grid of particles with the optical flow and provides trajectories that map a particle's initial position to its current or future position. It has applications in crowd segmentation and the detection of abnormal behavior [230]. However, particle flow is unable to handle spatial changes. Mehran et al. [232] proposed streak flow to overcome the shortcomings of particle flow and to analyze crowd videos while computing the motion field. Streak flow captures motion information similar to particle flow, but it responds faster to changes in the flow and performs well in dynamic motion flow.
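The idea behind particle flow, i.e., moving particles along successive optical flow fields to obtain trajectories, can be sketched as follows. This is a minimal illustration under our own assumptions: each flow field is a grid of (dx, dy) displacements that is assumed to be already computed, the flow at a particle is read from the nearest grid cell rather than interpolated, and the name `advect_particles` is hypothetical.

```python
def advect_particles(flows, starts):
    """Move each particle through successive flow fields.

    flows  : list of 2-D grids; flows[t][row][col] is a (dx, dy) displacement.
    starts : list of (x, y) initial particle positions.
    Returns one trajectory (list of (x, y) positions) per particle.
    """
    trajectories = []
    for x, y in starts:
        path = [(x, y)]
        for flow in flows:
            h, w = len(flow), len(flow[0])
            # Nearest-cell lookup, clamped so particles leaving the grid
            # keep using the border flow.
            row = min(max(int(round(y)), 0), h - 1)
            col = min(max(int(round(x)), 0), w - 1)
            dx, dy = flow[row][col]
            x, y = x + dx, y + dy
            path.append((x, y))
        trajectories.append(path)
    return trajectories


if __name__ == "__main__":
    # A uniform flow of one pixel to the right on a 3x5 grid, over three frames.
    flow = [[(1.0, 0.0)] * 5 for _ in range(3)]
    print(advect_particles([flow, flow, flow], [(0.0, 1.0)]))
```

Each returned trajectory maps a particle's initial position to its later positions, which is the representation that crowd segmentation and abnormality detection methods build on.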

b: Local Spatiotemporal Features
Flow-based features fail on very crowded scenes; as a result, local spatiotemporal feature techniques have been developed, which represent the scene as 2D patches or 3D cubes. Spatiotemporal features can be categorized as spatiotemporal gradients and motion histograms. To capture steady-state motion behavior, Kratz and Nishino [234] used a spatiotemporal motion pattern model and confirmed the detection of abnormal activities. On the other hand, the motion histogram considers motion information within a local region. It is not appropriate for crowd analysis because it takes a substantial amount of time and is subject to error. However, some improvements to the motion histogram have been shown in the literature, e.g., [231], [235].

c: Trajectory Features
Trajectory features represent tracks in videos. Motion features such as the distance between objects can be extracted from the trajectories of objects and utilized to analyze crowd activities. The failure to obtain a full trajectory in a dense crowd leads to the concept of the tracklet. A tracklet is a fragment of a trajectory obtained within a short period, and it terminates when occlusion occurs. Tracklets are extracted from dense regions, and the spatiotemporal correlation between them is enforced to detect patterns of behavior. Tracklets have been used for human action recognition [31], [236] and for the representation of motion in crowded scenes [237], [238].
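The notion of a tracklet as a short trajectory fragment that ends at an occlusion can be illustrated with the sketch below. It assumes that occlusion shows up as missing detections, i.e., a gap in the frame indices of one object's observations; the function name `split_into_tracklets` and the parameters `max_gap` and `min_length` are our own and are not taken from the cited works.

```python
def split_into_tracklets(observations, max_gap=1, min_length=2):
    """Cut one object's observations into tracklets at detection gaps.

    observations : list of (frame_index, x, y) tuples sorted by frame_index.
    max_gap      : largest frame-index step still treated as continuous.
    min_length   : fragments with fewer observations are discarded.
    """
    tracklets, current = [], []
    for obs in observations:
        # A jump in frame index larger than max_gap is read as an occlusion.
        if current and obs[0] - current[-1][0] > max_gap:
            if len(current) >= min_length:
                tracklets.append(current)
            current = []
        current.append(obs)
    if len(current) >= min_length:
        tracklets.append(current)
    return tracklets


if __name__ == "__main__":
    track = [(0, 1.0, 1.0), (1, 1.5, 1.0), (2, 2.0, 1.1),
             (5, 3.4, 1.2), (6, 3.9, 1.2), (9, 5.0, 1.3)]
    for tracklet in split_into_tracklets(track):
        print([frame for frame, _, _ in tracklet])
```

Downstream methods would then correlate such fragments in space and time instead of relying on one complete trajectory per person.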

3) Anomaly Detection
Anomaly detection is an application area of crowd behavior analysis and is domain-dependent. An anomaly in a video occurs when the analyzed pattern drifts from the normal pattern in a training video. The related literature on anomaly detection can be categorized into three groups, i.e., trajectory-based, global pattern-based, and grid pattern-based methods of anomaly detection.

a: Trajectory-based method of anomaly detection
In trajectory-based anomaly detection, objects are obtained by segmenting the scene, and each object is then tracked through the video. The tracked object produces a trajectory, which describes the behavior of the object [239]. To evaluate abnormality, trajectory-based methods have used single-class SVM [240], zone-based analysis [241], semantic tracking [242], string kernels clustering [243], spatiotemporal path search [244], and deep learning-based approaches [245].

b: Global pattern-based method of anomaly detection
In a global pattern-based technique, the video sequence is analyzed as a whole, i.e., low-level or medium-level features are extracted from the video using spatiotemporal gradients or optical flow methods [246]. The technique is suitable for crowd analysis because it does not track each object in the video individually, but locating the position where the anomaly occurred is challenging. Approaches used in the global pattern-based method are the Gaussian Mixture Model [247], energy model [248], SFM [249], stationary-map [250], Gaussian regression [251], PCA model [252], global motion-map [253], motion influence map [254], and salient motion map [255].

c: Grid pattern-based method of anomaly detection
A grid pattern-based method splits frames into blocks and analyzes the pattern of each block individually [256]. Ignoring inter-object connections leads to processing efficiency. Spatiotemporal anomaly maps, local features probabilistic framework, joint sparsity model, mixtures of dynamic textures with Gaussian Mixture Model, low-rank and sparse decomposition, cell-based texture analysis, sparse coding, and deep networks are used in evaluating grid pattern-based methods [257].
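To illustrate the block-level analysis, the following sketch splits frames into blocks, computes a simple motion energy per block, and flags blocks whose energy deviates strongly from what was seen in normal training footage. The z-score rule used here is a deliberately simple stand-in and is not one of the models listed above; the names `block_energy` and `anomalous_blocks` and the threshold values are our assumptions.

```python
def block_energy(prev, curr, block):
    """Mean absolute frame difference inside each block x block cell of the frame."""
    h, w = len(curr), len(curr[0])
    grid = []
    for r0 in range(0, h, block):
        row = []
        for c0 in range(0, w, block):
            cells = [
                abs(curr[r][c] - prev[r][c])
                for r in range(r0, min(r0 + block, h))
                for c in range(c0, min(c0 + block, w))
            ]
            row.append(sum(cells) / len(cells))
        grid.append(row)
    return grid


def anomalous_blocks(train_grids, test_grid, z_thresh=3.0, eps=1e-6):
    """Blocks whose energy exceeds the training mean by more than z_thresh std devs."""
    flagged = []
    for r in range(len(test_grid)):
        for c in range(len(test_grid[0])):
            samples = [g[r][c] for g in train_grids]
            mean = sum(samples) / len(samples)
            var = sum((s - mean) ** 2 for s in samples) / len(samples)
            z = (test_grid[r][c] - mean) / (var ** 0.5 + eps)
            if z > z_thresh:
                flagged.append((r, c))
    return flagged


if __name__ == "__main__":
    prev = [[0] * 4 for _ in range(4)]
    curr = [row[:] for row in prev]
    for r in (0, 1):
        for c in (2, 3):
            curr[r][c] = 10  # sudden motion confined to the top-right block
    print(block_energy(prev, curr, block=2))
```

Because every block is scored independently, the blocks can be processed in parallel, and the flagged block index directly localizes the anomaly in the frame.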

C. VIDEO SUMMARIZATION
Video big data face the challenge of sparsity and redundancy, i.e., hours of video with little meaningful information, which creates many issues for viewing, mining, browsing, and storing videos. This has motivated researchers to find ways to shorten hours of video and has led to the field of video summarization. Video summarization is the process of generating a shorter version of the original video without impairing the ability to comprehend the meaning of the whole video [258]. Video summarization can be classified as Static Video Abstracts, Dynamic Video Skimming, and Video Synopsis.

a: Static Video Abstract
These approaches include a video table of contents, a storyboard, and a pictorial video summary. For example, Xie and Wu [259] propose an algorithm to automatically generate a video summary for broadcast news videos. An affinity propagation-based clustering algorithm is used to group the extracted keyframes into clusters, aiming to keep the relevant keyframes that distinguish one scene from the others and to remove redundant keyframes. J. Wu et al. [260] drew on the high-density peaks search clustering algorithm. They proposed a clustering algorithm that incorporates significant properties of video to gather similar frames into clusters. Finally, the centers of all clusters were presented as a static video summary. Bhaumik et al. [261] proposed a summarization technique in which keyframes are detected from each shot and redundancy is eliminated at the intra-shot and inter-shot levels. To eliminate redundant frames, SURF and GIST feature descriptors were extracted for computing the similarity between the frames. The quality of the summaries obtained using SURF and GIST descriptors is also compared in terms of precision and recall.
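Redundancy elimination based on descriptor similarity can be sketched as a greedy filter: a candidate keyframe is kept only if it is not too similar to any keyframe already kept. The snippet is an illustration under our own assumptions and not the procedure of [261]: toy vectors stand in for SURF or GIST descriptors, cosine similarity is used as the similarity measure, and the name `remove_redundant` and the threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptor vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_redundant(descriptors, threshold=0.95):
    """Greedy filter: keep a frame only if it is dissimilar to every kept frame.

    Returns the indices of the retained keyframes in temporal order.
    """
    kept = []
    for idx, desc in enumerate(descriptors):
        if all(cosine_similarity(desc, descriptors[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


if __name__ == "__main__":
    candidates = [
        [1.0, 0.0],    # scene A
        [0.99, 0.01],  # near-duplicate of scene A
        [0.0, 1.0],    # scene B
        [0.02, 1.0],   # near-duplicate of scene B
    ]
    print(remove_redundant(candidates))
```

Applying the filter first within each shot and then across the surviving keyframes of all shots would mirror the intra-shot and inter-shot levels mentioned above.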
Similarly, Zhang et al. [262] propose a subset selection technique that leverages supervision in the form of human-created summaries to perform automatic keyframe-based video summarization. They were motivated by the intuition that similar videos share similar summary structures. The fundamental notion is to nonparametrically transfer summary structures from annotated videos to unseen test videos. Concretely, for each new video, they first compute the frame-level similarity between annotated and test videos. Then the summary structures in the annotated videos are encoded with kernel matrices made of binarized pairwise similarity among their frames. Those structures are then combined into a kernel matrix that encodes the summary structure for the test video. Finally, the summary is decoded by feeding the kernel matrix to a probabilistic model called the determinantal point process to extract a globally optimal subset of frames. M. Gygli et al. [263] used a supervised approach to learn the importance of global characteristics of a summary by extracting deep features of video frames. J. Mohan et al. [264] proposed a technique that utilizes sparse autoencoders. Motion vectors are used to eliminate redundant frames, and then high-level features are extracted from the frames using sparse autoencoders. These high-level feature vectors are then clustered using the K-means algorithm. The frames closest to the centroid of each cluster are selected as keyframes of the input video. Ji et al. [265] employed tag information, i.e., titles and descriptions, as side information for summary generation. A sparse autoencoder was used as the primary model to generate the final summary, where the input and output were multiple videos and a keyframe set, respectively. They fused the visual and tag information to guide visual features, which constrained the sparse autoencoder in selecting the candidate keyframes.

b: Dynamic Video Skimming
A summary video is formed from segments of the original video in order to remove redundancy or to summarize based on objects, actions, or events. As an example of object-based skimming, Peker et al. [266] proposed a video skimming algorithm that utilizes face detection on broadcast video programs. In the algorithm, attention is given to faces, as they constitute the focus of most consumer video programs. Ngo et al. [267] represent a video as a complete undirected graph and exploit the normalized cut algorithm to form optimal graph clusters. One shot was taken from each cluster of visually similar shots to remove duplicate shots. Xiao et al. [268] mine frequent patterns from a video. A video shot importance evaluation model is utilized to choose useful video shots to create a video summary. For personal videos, Gao et al. [269] developed a video summarization technique that encompasses a two-level redundancy detection procedure. First, they removed redundant video content at the shot level with the hierarchical agglomerative clustering method. Then parts of shots were selected, based on the ranking of the scenes and keyframes, to generate an initial video summary. Finally, to remove the remaining redundant information, a repetitive frame segment detection step was utilized. They verified the proposed technique through a prototype using TV datasets (movies and cartoon videos) and reported the performance in terms of compression ratio (81%) and recall (87.4%). In [270], both the representativeness and the quality of the segments selected from an original video were considered for user-generated video summarization. The authors stated that user-generated videos contain semantic and emotional content and that its preservation is vital. They designed a scheme to pick representative segments that include consistent semantics and emotions for the whole video.
To ensure the quality of the summary, they computed quality measures, i.e., motion and lighting conditions, and integrated them with the semantic and emotional clues for segment selection.
Ji et al. [271] address the issue of supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input and output are a sequence of original video frames and a keyshot sequence, respectively. The notion is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. The proposed framework was called attentive encoder-decoder networks for video summarization. They utilized a BiLSTM encoder for encoding the contextual information among the input video frames. For the decoder, two attention-based LSTM networks are explored, using additive and multiplicative objective functions, respectively. The results demonstrate the superiority of the proposed framework over state-of-the-art approaches. J. Wu et al. [272] were motivated by the fact that multi-video summarization is significant for video browsing and proposed a technique in which multi-video summarization is formulated as a graph problem. They also introduced a dynamic graph convolutional network to measure the importance and relevance of each video shot locally as well as globally. They adopted two approaches to address the inherent class imbalance issue of video summarization. Additionally, a diversity regularization was introduced to encourage the model to generate a diverse summary. The results demonstrate the effectiveness of their proposed model in generating a representative summary for multiple videos with encouraging diversity. Z. Sheng-hua et al. [273] proposed a deep learning-based dynamic video summarization model. First, they addressed the issue of the imbalanced class distribution in video summarization. An over-sampling algorithm is used to balance the class distribution of the training data. They proposed a two-stream deep architecture with cost-sensitive learning to handle the class imbalance problem in feature learning.
RGB images are utilized to represent the appearance of video frames in the spatial stream. Likewise, multi-frame motion vectors with a deep learning framework are introduced to represent and extract the temporal information of the input video. Moreover, they stated that the proposed method highlights the video content with an active level of arousal in affective computing tasks and can automatically preserve the connection between consecutive frames.
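The class imbalance mentioned above arises because only a small fraction of frames or shots are labeled as belonging to the summary. The sketch below shows plain random over-sampling, one common way to balance such training data. It is not necessarily the over-sampling algorithm used in [273]; the function name `oversample` and the fixed seed are our assumptions.

```python
import random


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equally large."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw (with replacement) as many extra copies as the class is short of.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


if __name__ == "__main__":
    frames = list(range(10))          # stand-ins for frame feature vectors
    is_keyframe = [0] * 8 + [1] * 2   # only two frames belong to the summary
    balanced_frames, balanced_labels = oversample(frames, is_keyframe)
    print(balanced_labels.count(0), balanced_labels.count(1))
```

Cost-sensitive learning addresses the same problem from the other side: instead of duplicating minority samples, it weights their errors more heavily in the loss.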

c: Video Synopsis
In this approach, activities from a stated time interval are collected and shifted in time to form a shorter video synopsis showing maximum activity, as shown in Figure 16. The notion of video synopsis was pioneered in 2006 by [274], which proposed a two-phase approach, i.e., online and offline. The former phase queues the generated activities. The latter phase starts after a time interval is selected for the video synopsis and comprises tube readjustment, background formation, and object stitching. A global energy function was defined that encompasses activity, temporal consistency, and collision costs. Then the simulated annealing method was applied for energy minimization. The video synopsis domain was further researched in single-view and multi-view scenarios.
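The energy-minimization idea can be sketched as follows. This is a strongly simplified illustration and not the formulation of [274]: each tube is reduced to a hypothetical `(original_start, duration, lane)` triple, a shared "lane" stands in for spatial overlap, and the activity cost is zero because every tube is kept inside the synopsis. The weights, the toy tubes, and the linear cooling schedule are all assumptions.

```python
import math
import random

# Each tube: (original_start, duration, lane). Only tubes in the same
# lane can collide; "lane" is a crude stand-in for spatial position.
TUBES = [(0, 4, 0), (10, 3, 0), (20, 5, 1), (30, 4, 0), (40, 2, 1)]
SYNOPSIS_LEN = 12

def overlap(s1, d1, s2, d2):
    """Length of the temporal overlap of two intervals."""
    return max(0, min(s1 + d1, s2 + d2) - max(s1, s2))

def energy(starts, w_collision=1.0, w_order=0.5):
    """Collision cost plus temporal-consistency cost of a tube placement."""
    e = 0.0
    for i in range(len(TUBES)):
        for j in range(i + 1, len(TUBES)):
            oi, di, li = TUBES[i]
            oj, dj, lj = TUBES[j]
            if li == lj:
                e += w_collision * overlap(starts[i], di, starts[j], dj)
            # Penalize pairs whose chronological order is reversed.
            if (oi - oj) * (starts[i] - starts[j]) < 0:
                e += w_order
    return e

def anneal(steps=5000, t0=5.0, seed=0):
    """Simulated annealing over the new start time of each tube."""
    rng = random.Random(seed)
    starts = [0] * len(TUBES)
    cur = energy(starts)
    best, best_e = list(starts), cur
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-9   # linear cooling
        i = rng.randrange(len(TUBES))
        cand = list(starts)
        cand[i] = rng.randint(0, SYNOPSIS_LEN - TUBES[i][1])
        ce = energy(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if ce <= cur or rng.random() < math.exp((cur - ce) / temp):
            starts, cur = cand, ce
            if cur < best_e:
                best, best_e = list(starts), cur
    return best, best_e

best_starts, best_energy = anneal()
print(best_starts, best_energy)
```

Accepting occasional uphill moves is what lets the search leave placements where every single-tube shift would increase the energy.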
Some recent examples of single-view approaches are [275]-[277]. He et al. [275], [276] advanced activity collision analysis by describing collision statuses between activities, such as collision-free, colliding in the same direction, and colliding in opposite directions. They also offered a graph-based optimization technique that considers these collision states to improve the activity density, putting activity collisions at the center of their optimization strategy. Baskurt and Samet [277] concentrated on improving the robustness of object detection by suggesting an adaptive background generation method. In another study, Baskurt and Samet [278] designed an object tracking method tailored to video synopsis requirements. Their approach focused on long-term tracking to represent each target with just one activity in the video synopsis. A scalable single-view approach for video synopsis was proposed by Lin et al. [279] using distributed computing technology. Their approach encompasses steps like object detection, tracking, classification, and optimization, which were performed in a distributed environment. Ahmed, S. A. [280] proposed a query-based method to generate a synopsis of long videos. Objects were tracked, and deep learning was utilized for object classification (e.g., car, bike, etc.). Through unsupervised clustering, they identified regions in the surveillance scene. Spatiotemporal object trajectories were represented by their source and destination. Finally, user queries were used to generate the video synopsis by smoothly blending the appropriate tubes over the background frame through energy minimization.
Further, examples of multi-view approaches are [281], [282]. Zhu et al. [281] proposed a framework to generate a unified synopsis of multi-view videos. The synopsis is visualized by mapping multiple views to a common ground plane. Activities from multiple cameras were associated via trajectory matching in overlapping camera views. Producing a synopsis requires a balance among minimizing the synopsis length, maximizing the information coverage, and reducing collisions among object tracks that are presented concurrently. Likewise, Mahapatra et al. [282] proposed a multi-camera approach for an overlapping camera network and modeled synopsis generation as a scheduling problem. They utilized three distinct methods, i.e., a table-driven approach, a contradictory binary graph coloring approach, and simulated annealing. Action recognition modules were integrated to recognize significant actions, i.e., walking, running, bending, jumping, hand-shaking, and waving one or both hands. Including such essential actions can help reduce the synopsis length while preserving its value. The synopsis length was further reduced by a fuzzy inference system that computes a visibility score for each object track. They stated that the contradictory binary graph coloring approach achieved the maximum reduction in synopsis length. Zhang, Z. et al. [283] addressed video synopsis through joint object-shifting and camera view-switching to show multiple synopsis results more compactly and understandably. The input videos were synchronized, and occurrences of the same object in different videos were grouped together. Then they shifted the grouped objects along the time axis to obtain multiple synopsis videos. They constructed a simultaneous object-shifting and view-switching optimization framework to achieve encouraging synopsis results.
To address the unified optimization, they further presented an optimization strategy composed of graph cuts and dynamic programming.
Fig. 17 shows the flow of video summarization under the proposed L-CVAS architecture. Through the WSL, the user first subscribes the video data source to the video summarization service. The user preferences then allow the user to set the parameters required for the video summary service and personalization. The summary parameters encompass the granularity level, the type of summary to be produced (e.g., overview, highlights, synopsis, etc.), and any others required by the video summary service scenario. The personalization can be in terms of specific features of the video, like people, objects, events, etc. Through the VBDCL, videos are acquired from the video data source and sent to the VBDPL for pre-processing. In pre-processing, video units are extracted (segmentation, shots, frame extraction, etc.) as per the requirements of the video summarization service. Then multiple low-level and high-level features, such as motion, color, aesthetics, semantics, etc., are extracted from the video units. The features extracted from the basic video units are input to the VBDPL for object/activity identification and clustering. The next step is video summarization, which deploys the actual logic, i.e., video unit selection and redundancy removal. These steps decide which video units should be included in the video summary based on unit significance, summary length, and other user parameters. This block also removes similar video units within the video summary to achieve the best possible summary, covering the required details of the original video. Finally, the summary results are delivered to the respective user through the WSL.
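The video unit selection and redundancy removal step can be sketched as a greedy procedure. This is only one possible realization, assumed for illustration: the significance scores, the feature vectors, the cosine-similarity threshold, and the function names are hypothetical and are not prescribed by the L-CVAS architecture.

```python
def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_units(units, max_units, sim_threshold=0.9):
    """Greedy summary: units is a list of (unit_id, significance, features).

    Units are visited in order of decreasing significance; a unit is kept
    only if it is not too similar to any unit already in the summary.
    """
    summary = []
    for unit in sorted(units, key=lambda u: u[1], reverse=True):
        if len(summary) >= max_units:      # summary length constraint
            break
        if all(cosine(unit[2], kept[2]) < sim_threshold for kept in summary):
            summary.append(unit)           # redundancy check passed
    # Restore temporal order, assuming unit_id increases with time.
    return sorted(u[0] for u in summary)

units = [
    (0, 0.9, [1.0, 0.0]),
    (1, 0.8, [0.99, 0.05]),  # near-duplicate of unit 0
    (2, 0.7, [0.0, 1.0]),
    (3, 0.2, [0.5, 0.5]),
]
print(select_units(units, max_units=2))  # [0, 2]
```

Here `max_units` plays the role of the summary length parameter, and `sim_threshold` controls how aggressively similar units are removed.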

D. SEMANTIC-BASED VIDEO ANALYSIS
To bridge the semantic gap and to allow machines to understand visual data, Semantic Web technologies can be incorporated [161]. The Semantic Web has opened a new avenue for knowledge-based computer vision while enabling data exchange between video analysis systems in an open and extensible manner. The scientific community has exploited the Semantic Web concept for intelligent video analytics to bridge the so-called semantic gap between low-level features and high-level human-understandable concepts. State-of-the-art scholarly work can roughly be classified as semantic-based low-level, mid-level, and high-level analysis, and semantic-based video search and retrieval.

1) Semantic-based Low-level analysis
Semantic-based low-level analysis refers to the formalization of the objects of interest extracted from videos, followed by reasoning over that formalization, for example for detection and tracking across domain-specific videos [284], [285]. Dasiopoulou et al. [286] proposed a multimedia ontology for domain-specific video analysis. As semantic concepts, they consider object attributes, low-level features, spatial object relations, and the processing approaches, while defining F-logic rules for reasoning that govern the application of analysis methods. García et al. [287] proposed a knowledge-based framework for video object segmentation, where relationships among analysis phases are utilized. The main contribution is a detailed description of the scene at low, mid, and high semantic levels through an ontology. The notion is to offer a semantically rich description of a scene via an ontology that includes occurrences in the scene from high to low semantic levels and controls iterative decisions at every stage. The low-level analysis modules (background subtraction, short-term change detection, point or region tracking, etc.) are provided with a structure to collaborate and achieve consistent, contextual results. The results of the vision algorithms are mapped to the ontology, representing low-level scene occurrences. The following stages build Point Hypothesis Maps and Region Hypothesis Maps, which hold the most probable occurrences for each point and region. The points and regions are coded according to the ScenePoint and SceneRegion hierarchies of the analysis ontology. The quality of the results was evaluated through a feedback path. Gomez et al. [288] proposed a computer vision framework for surveillance that consists of two layers, i.e., a tracking layer and a context layer. The proposed framework relies on an ontology-based representation of the scene in combination with contextual information and sensor data.
The notion is to apply logical reasoning to the data acquired from a classical tracker in order to construct an ontological model of the objects and activities in the area of observation. Reasoning procedures were utilized to detect and predict tracking errors, sending feedback to the tracker to adjust the low-level image processing algorithms. Vision operations, like movement detection, blob-track association, and track and trajectory generation, were performed in the tracking layer. The context layer was responsible for producing a high-level interpretation of the scene. The RACER reasoner [289] was utilized for scene interpretation since it allows abductive reasoning. Abductive rules are defined in the proposed framework to interpret what is happening in the scene from the primary tracking data.

2) Semantic-based Mid and High-level Analysis
Atomic events such as loitering, falls, direction changes, group formations, and separations [290], and "complex" events such as aggressions, fights, thefts, and other general suspicious events [291], [292] fall into mid-level and high-level video analytics, respectively.
The utilization of semantic technology for video event representation in surveillance videos was initiated by the Video Event Representation Language [293]. The main idea is to model simple events in a hierarchical framework intended for detecting complex events. The authors stated that a sequence of simple events (car door opening, leaving a car, car door closing, walking and opening a building door, and entering a building) forms a complex event (a person arrived by car and entered a building). They used Allen's interval algebra to handle temporal relationships between sub-events. They illustrated their proposed application through the detection of an example event in a surveillance video, i.e., accessing a secure zone by entering behind an authorized individual. For complex event recognition in surveillance videos, Snidaro et al. [294] proposed an ontology composed of three high-level concepts, i.e., background, entities, and events. The event concept is composed of sub-concepts that represent simple events, spatial events, and transitive events. The focus was on complex events, which can be described by sequencing simple events. SanMiguel et al. [295] proposed an ontology for representing the prior knowledge related to video event analysis. Such knowledge is described in terms of scene-related entities (Object, Event, and Context) and system-related entities. The key contribution of the work is the integration of different types of knowledge in an ontology for detecting the objects and events in a video scene. In the same direction, Greco et al. [296] proposed a hybrid approach for recognizing simple abnormal events (person falling) and complex abnormal events (person aggression) using Semantic Web technologies.
They mapped the extracted general tracking information to the proposed tracking ontology for advanced reasoning. The modeled data comprised the output of the tracking component (frames, bounding boxes), knowledge about the scene (static and dynamic objects, occluding objects), and situations and events (people leaving the scene, falling to the ground, fighting). For event detection, SPIN rules and functions are used, while SPARQL queries are employed for analytics tasks. The system was reported to successfully recognize mid-level events (e.g., people falling to the ground) and high-level events (e.g., a person being attacked) on the PETS2016 dataset.
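Allen's interval algebra, mentioned above for relating sub-events in time, distinguishes thirteen relations between two intervals. The following minimal sketch classifies a pair of intervals; the function name, the tuple representation, and the example timestamps are assumptions made for illustration and are not taken from [293].

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Each interval is a (start, end) tuple with start < end.
    """
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if a_e == b_s:
        return "meets"
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s:
        return "starts" if a_e < b_e else "started-by"
    if a_e == b_e:
        return "finishes" if a_s > b_s else "finished-by"
    if b_s < a_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_e < a_e:
        return "contains"
    if a_s < b_s < a_e < b_e:
        return "overlaps"
    # Only the inverse relations remain at this point.
    if b_e < a_s:
        return "after"
    if b_e == a_s:
        return "met-by"
    return "overlapped-by"

# Hypothetical sub-event intervals (in seconds) for a tailgating scenario.
authorized_entry = (10, 14)
follower_entry = (12, 16)
print(allen_relation(authorized_entry, follower_entry))  # overlaps
print(allen_relation((0, 2), (2, 5)))                    # meets
```

A rule for the "entering behind an authorized individual" example could then require that the follower's entry overlaps, or closely follows, the authorized entry.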
Researchers have also utilized semantic technologies to address human activity recognition in daily living. In this context, Chen et al. [297] introduced a method for activity recognition using ontological modeling, representation, and reasoning. They analyzed the nature and characteristics of daily life activities and modeled the related concepts through ontologies. The authors describe activity recognition algorithms that make full use of the reasoning power of semantic modeling and representation. They claimed that the proposed ontological models for daily life activities could easily be customized, deployed, and scaled up. Likewise, in [298], an approach exploiting the synergy between semantic technologies and tracking methods was presented for object labeling. The work aims to augment and comprehend situation awareness, as well as critical alerting conditions. Unmanned aerial vehicles with an embedded camera were utilized to recognize moving and stationary objects along with the relations between them. Contextual information was used for abnormal event detection. A prototype was designed in which a drone captured videos at the University of Salerno. They stated that the proposed system could recognize an abnormal event by means of SWRL rules associated with mid-level activities, such as a human kicking a ball and a car passing along the same road.

3) Semantic-based Video Retrieval
Some researchers exploited semantic technologies for video search and retrieval purposes. In this regard, Yao et al. [299] proposed an image-to-text framework to extract events from images (or video frames) and then provide semantic and text annotations. The And-or-Graph incorporates vocabularies of visual elements, like objects and scenes, along with a stochastic image grammar that identifies semantic relations among the visual elements. In this way, low-level image features are linked with high-level concepts, and the parsed image can be transformed into semantic metadata to form the textual description. Video contents are expressed in both OWL and text format, and users can then search images and video clips through keyword searching and semantic-based querying. Xue et al. [300] proposed an ontology-based content archive and retrieval framework for surveillance videos. A surveillance ontology was proposed that represents the semantic information of video clips as a resource ontology. The ontology models the basic feature description at the low level, the video object description at the mid-level, and the event description at the high level. The proposed system was tested for object and event retrieval, such as walking and car parking.
Furthermore, Xu et al. [301], [302] proposed a method to annotate video traffic events while considering their spatial and temporal relations. They introduced a hierarchical semantic data model called structural video description, which consists of three layers, i.e., a pattern recognition layer (ontological representation of the concepts extracted from the video), a video resources layer (links video resources with their semantic relations), and a demands layer (retrieval interface). They defined various concepts in the ontology, such as persons, vehicles, and traffic signs, that can be used to annotate and represent video traffic events. In addition, the spatial and temporal relationships between objects in an event are defined. As a case study, an application to annotate and search traffic events is considered. Sah et al. [303] proposed a multimedia standard-based semantic metadata model to annotate globally interoperable data about abnormal crowd behaviors from surveillance videos. Similarly, Sobhani et al. [304] proposed an advanced intelligent forensic retrieval system that takes advantage of an ontological knowledge representation, considering the UK riots in 2011 as a use case. A. Alam et al. [305] proposed a layered architecture, called IntelliBVR, for large-scale distributed intelligent video retrieval that exploits deep learning and semantic approaches. The base layer is responsible for large-scale video data curation. The second and third layers process and annotate videos, respectively, using deep learning on top of a distributed in-memory computing engine. Finally, in the knowledge curation layer, the extracted low-level and high-level features are mapped to the proposed ontology and can be searched and retrieved using semantically rich queries. They demonstrated the effectiveness of IntelliBVR through experimental evaluation.

VI. STATE-OF-THE-ART CVAS
In this section, we review the state-of-the-art CVAS. This discussion also supports our claim on the relationship between video big data analytics and cloud computing. The discussion is divided into two subsections, i.e., Scholarly CVAS and Industrial CVAS.

A. SCHOLARLY CVAS
In this subsection, we explore academic research trends (summarized in Table 9), i.e., how the scientific community has investigated and proposed cloud-based IVA solutions using big data technologies. In this direction, Ajiboye, S.O. et al. [306] stated that the network video recorder is already equipped with intelligent video processing capabilities but pointed out its limitations, i.e., isolation and limited scalability. To resolve such issues, they proposed a general high-level theoretical architecture called Fused Video Surveillance Architecture (FVSA). The design goals of the FVSA were cost reduction, unified data mining, public safety, and scalable IVA. The FVSA architecture consists of four layers, i.e., an Application Layer (responsible for system administration and user management), a Services Layer (for storage and analytics), a Network Layer, and a Physical Layer (physical devices like cameras, etc.). They asserted the compatibility of FVSA with the hierarchical structure of computer networks and emerging technologies. Likewise, Lin, C.-F. et al. [307] implemented a prototype of a cloud-based video recorder system under IaaS using big data technologies like HDFS and MapReduce. They showed the scalability of the video recording, backup, and monitoring features only, without implementing any video analytics services. Similarly, Liu, X. et al. [308] presented a cloud platform for large-scale video analytics and management. They stated that the existing work failed to design a versatile video management platform in a user-friendly way and to effectively use Hadoop to tune the performance of video processing. They developed a cloud platform using the same big data technologies, i.e., Hadoop and MapReduce, and provided three video processing services, i.e., video summary, video encoding and decoding, and background subtraction.
Tan, H. et al. [106] used Hadoop and MapReduce for fast distributed video processing and analytics. They developed two video analytics services, i.e., face recognition and motion detection, using JavaCV. Furthermore, Ryu, C., et al. [309] proposed a cloud video analytics framework using HDFS and MapReduce along with OpenCV [310] and FFmpeg. They implemented a face recognition and tracking algorithm and reported the scalability of the system and the accuracy of the algorithm. Ali M. et al. [311] proposed an edge-enhanced stream analytics system for video big data called RealEdgeStream. They addressed video stream analytics issues by offering a filtration phase, which increases the value of the streams, and an identification phase, which performs analytics on them. The stages are mapped onto available in-transit and cloud resources using a placement algorithm to satisfy the Quality of Service constraints specified by a user. They demonstrated that for a data stream of 10K elements, with a frame rate of 15-100 frames per second, job completion took 49% less time and saved 99% of the bandwidth compared to a centralized cloud-only approach.
White et al. [312] studied MapReduce for IVA services, comprising classifier training, clustering, sliding windows, bag-of-features, image registration, and background subtraction. However, experiments were performed for k-means clustering and Gaussian background subtraction only. Tan and Chen [106] presented face detection, motion detection, and tracking using MapReduce-based clusters on Apache Hadoop. They utilized JavaCV since Hadoop is developed and designed for Java. Pereira, R. et al. [313] proposed a cloud-based distributed architecture for video compression based on the Split-Merge technique using the MapReduce framework. They stated that they optimized the Split-Merge technique against two synchronization problems. The first was splitting and merging video fragments without loss of synchronization. The second was that the synchronization between audio and video can be greatly affected, since the frame sizes of the two may not be equal. Similarly, Liu et al. [314] used Hadoop and MapReduce for video sharing and transcoding purposes.
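The split, process, and merge pattern behind such MapReduce-style video pipelines can be sketched in a few lines. This is an illustrative single-machine analogue using a thread pool, not the implementation of [313]: the per-frame operation is a placeholder for encoding, and all names are hypothetical. Each chunk carries its index so that the merge step can restore the original frame order regardless of the order in which workers finish.

```python
from concurrent.futures import ThreadPoolExecutor

def split(frames, chunk_size):
    """Split a frame sequence into (chunk_index, frames) pairs."""
    return [(i // chunk_size, frames[i:i + chunk_size])
            for i in range(0, len(frames), chunk_size)]

def process_chunk(indexed_chunk):
    """Map step: a placeholder per-frame operation standing in for encoding."""
    index, chunk = indexed_chunk
    return index, [frame.upper() for frame in chunk]

def merge(results):
    """Reduce step: reassemble chunks by index so frame order is preserved."""
    merged = []
    for _, chunk in sorted(results, key=lambda r: r[0]):
        merged.extend(chunk)
    return merged

frames = [f"f{i}" for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_chunk, split(frames, chunk_size=3)))

# Even if the chunk results arrive out of order, the merge restores the order.
output = merge(list(reversed(results)))
print(output)
```

Audio and video synchronization, the second problem noted above, is not modeled here; it would require splitting both streams at matching timestamps rather than at fixed frame counts.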
Zhang, W. et al. [52] proposed a cloud-based architecture for large-scale intelligent video analytics called BiF. BiF combines the merits of RIVA and BIVA while exploiting distributed technologies like Storm and MapReduce, respectively. The BiF architecture considers non-functional architectural properties and constraints, i.e., usability, scalability, reliability, fault tolerance, data immutability, re-computation, storing large objects, batch processing capabilities, streaming data access, simplicity, and consistency. The BiF architecture consists of four main layers, i.e., a data collection layer, a batch layer, a real-time layer, and a serving layer. The data collection layer collects the streaming video frames from the input video sources (cameras) and forwards them to the batch layer and the real-time layer for batch processing and real-time analytics, respectively. The serving layer queries both batch views and real-time views and integrates them to answer queries from a client. To evaluate the performance of the BiF architecture, they developed a video analytics algorithm that detects and counts faces from the input source over a specific interval of time. During the evaluation, they showed that BiF is efficient in terms of scalability and fault tolerance. Zhang et al. [320] introduced an Apache Kafka and Spark Streaming framework for efficient real-time video data processing. They also proposed a fine-grained online video stream task management scheme to boost resource utilization and experimented with license plate extraction and human density analysis.
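How a serving layer can integrate batch views and real-time views is sketched below for the face-counting example. This is a minimal in-memory illustration and not the BiF implementation: the class and method names are hypothetical, and the sketch assumes that a completed batch run covers all frames received up to that moment, so the real-time increments can be discarded when a new batch view arrives.

```python
from collections import Counter

class ServingLayer:
    """Merges a batch view and a real-time view of per-camera face counts."""

    def __init__(self):
        self.batch_view = Counter()     # recomputed from the full history
        self.realtime_view = Counter()  # increments since the last batch run

    def on_stream_result(self, camera_id, face_count):
        """Called by the real-time layer for each processed frame."""
        self.realtime_view[camera_id] += face_count

    def on_batch_complete(self, new_batch_view):
        """Called by the batch layer; its results supersede the increments."""
        self.batch_view = Counter(new_batch_view)
        self.realtime_view.clear()

    def query(self, camera_id):
        """Answer a client query by combining both views."""
        return self.batch_view[camera_id] + self.realtime_view[camera_id]

serving = ServingLayer()
serving.on_batch_complete({"cam1": 120})
serving.on_stream_result("cam1", 3)
serving.on_stream_result("cam1", 2)
print(serving.query("cam1"))  # 125
```

Because the batch view is recomputed from immutable raw data, an error in the real-time path is corrected by the next batch run, which is one way the fault tolerance and re-computation properties listed above can be realized.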
Azher et al. [319] proposed a CVAS for RIVA and BIVA using Spark Streaming and Spark, respectively. They implemented IVA services such as human action recognition and face recognition. In another work, Azher et al. [321] proposed a novel feature descriptor to recognize human actions on Spark and utilized Spark MLlib [143] to recognize the action from the feature vector generated by ALMD [321]. Wang et al. [322] also performed human action recognition on Spark. The aim was to speed up some key processes, including trajectory-based feature extraction, Gaussian Mixture Model generation, and Fisher Vector encoding. A distributed video processing system called Streaming Video Engine is introduced in [323] as a distributed IVA framework at Facebook scale, addressing three major challenges, i.e., low latency, application-oriented flexibility, and robustness to faults and overload.
Zhang et al. [317], [318] stated that historical video data could be used together with the updated video stream to determine the current status of an activity, e.g., the status of traffic on the road, and to predict its future status. To make this possible, they proposed a video cloud-based service-oriented layered architecture called Depth Awareness Framework, which consists of four layers, i.e., a data retrieval layer, an offline video analytics layer, an online video processing layer, and a domain service layer. The data service layer is supposed to handle large-scale video data and webcam streams. The offline layer performs operations on batch videos, whereas online processing occurs in the real-time video processing layer. On top of the proposed cloud platform, they implemented a deep convolutional neural network for obtaining in-depth raw context data from big video, and a deep belief network-based method to predict the workload status of different cloud nodes, as part of the knowledge of the system's running status. They prepared a dataset consisting of seven traffic videos, each 2GB in size. In the evaluation, they reported improvements in object prediction accuracy, fault tolerance, and scalability. Zhang et al. [324] performed pedestrian recognition on real-time video data using deep learning. Here, the CNN is improved to fine-CNN, which consists of a nine-layer neural network. Moreover, the Apache Storm framework, along with a GPU-based scheduling procedure, is presented.

B. INDUSTRIAL CVAS
Various leading industrial organizations have successfully deployed CVAS. Some of the most popular are briefly described in the following subsections.

a: Google Vision
On March 8, 2017, at the Google Cloud Next conference in San Francisco, Google announced the release of the IVA REST API [328]. The API lets developers recognize objects in videos automatically and can detect and tag scene changes. Furthermore, it enables users to search and discover unstructured video content by providing information about entities (20,000 labels). Its main features are label detection, explicit content detection, shot change detection, and regionalization [331]. It exploits deep-learning models and is built on top of the TensorFlow framework. The Google IVA API targets unstructured video content analytics rather than surveillance and security. Its application domains include large media organizations that want to build their media catalogs or find easy ways to manage crowd-sourced content. It can also be helpful for product recommendations, medical-image analysis, fraud detection, and many more.

b: IBM CVAS
In April 2017, at the National Association of Broadcasters Show, IBM announced its CVAS services [332]. The BlueChasm [329] development team came up with a prototype app, known as "VideoRecon," that combines IVA via IBM Watson and the IBM cloud stack. The IBM CVAS service can extract metadata like keywords, concepts, visual imagery, tone, and emotional context from video data. It allows users to upload video footage to the IBM Cleversafe object storage [333] and subscribe to a service. When an object or event of interest is detected, the VideoRecon service creates a tag along with a timestamp of the point in the video when either the object was recognized or the event occurred. The tags are then stored in IBM Cloudant, a fully managed NoSQL JSON document store [334], for future use.

c: Azure CVAS
Microsoft Azure, a cloud computing service launched in 2010, offers Media Services, which enables developers to build scalable media management and delivery applications [330]. Media Services is based on REST APIs that enable users to securely manage video or audio content for both on-demand and live streaming delivery to clients. More recently, it provides CVAS APIs to customers for (R/B)IVA (as shown in Table 10).

d: Citilog CVAS
Citilog [325], also known as CT-Cloud, provides intelligent video analytics and surveillance solutions in the domain of transportation. Citilog provides services like automatic incident detection, traffic data collection (vehicle counting, classification, average speed, occupancy, and levels of service), interaction control, video management, and license plate recognition. Citilog is an open platform that provides APIs and widgets for quick development of services. According to Citilog, they process approximately 32000 hours of video data and detect about five incidents per minute from 900 sites worldwide.

e: CheckVideo CVAS
CheckVideo offers cameras, recorders, and a cloud video management solution for security dealers, integrators, and end-users. Initially, they provided OVAS, but with the advancement of cloud technology, CheckVideo launched CVAS. They provide a domain-specific RIVA solution. The main features of CheckVideo are RIVA, a video search engine, cloud video storage, and an alert system. The provided services can be categorized as basic analytics, object classification, and business analytics (see Table 10). According to the company, they have analyzed 108,458,000 hours of video and detected 61,233,000 events per month.

f: Intelli-Vision CVAS
Intelli-Vision [327], founded in 2002, is a company in the field of Artificial Intelligence (AI) and deep learning-based video analytics that offers a video cloud platform. It exploits state-of-the-art AI technology for security and monitoring purposes while targeting multiple business domains, including home, retail, transportation, and advanced driver assistance systems for cars. Intelli-Vision's analytics adds the "Brains Behind the Eyes" for cameras by analyzing the video content, extracting metadata, sending out real-time alerts, and providing intelligence on the video. Currently, it provides a wide range of video analytics services in the domains mentioned above, ranging from object left to night vision and enhancements (see Table 10). In a February 2018 press release, Intelli-Vision stated that it had deployed four million cameras worldwide, which have been subscribed to various IVA services.

VII. IVA APPLICATIONS
IVA at scale drives many application domains, ranging from security and surveillance to self-driving and healthcare. This section presents several application areas of video big data analytics, which illustrate the significant role of big data and cloud computing in IVA.

a: Traffic and Transportation
IVA has been extensively used in traffic control and transportation, e.g., lane traffic counts, incident detection, illegal U-turn detection, and many more. Traffic-related accidents are one of the main causes of deaths and injuries [335]. Proactive analytics is required to predict abnormal events so as to minimize or avoid such accidents. In this direction, VisionZero [336] has been developed and deployed successfully. Another application in transportation is vehicle tracking, where license plate tracking, speeding detection, and collision cause analysis can be performed by analyzing video data. Kestrel [337] is a vehicle tracking system that uses information from various non-overlapping cameras to detect vehicle paths. Gao et al. [338] used an automatic particle filtering algorithm to track vehicles and monitor illegal lane changes. Chen et al. [chen2009machine] used hidden Markov models to determine the traffic density state probabilistically. An incident detection framework based on generative adversarial networks was proposed in [339].

b: Intelligent Vehicle and Self-driving cars
Currently, the term self-driving car means that the vehicle exploits computer vision for safe and intelligent driving while assisting the driver. In an intelligent vehicle, different sensors and high-definition cameras (cameras for the vehicle cabin, forward roadway, and instrument cluster) are integrated with the vehicle. They generate multi-modal data, which is sent to the cloud for real-time analytics [340], [341]. In this context, video analytics with optimum algorithm accuracy is vital. Researchers have developed several algorithms, including pedestrian detection, traffic light detection, and other driver assistance systems. For example, Wang et al. [342] proposed a method for pedestrian detection in urban traffic conditions using a multilayer laser sensor mounted onboard a vehicle. Tsai et al. [343] proposed an algorithm to detect three sign condition changes, i.e., missing, tilted, and blocked signs, using GPS data and video log images. A CNN-based visual processing model is proposed in [344] to automatically detect traffic signs and reduce the sign inventory workload. In [345], driver decision making when taking a right turn in left-hand traffic at a signalized intersection was improved in a simulation study, using an in-car video assistant system that presents the occluded view when the driver's view is blocked by a truck. Driver body tracking and activity analysis, and posture recognition and action prediction, have been studied in [346] and [347], respectively.

c: HealthCare
Recently, video big data analytics has been reshaping the healthcare industry, which is yet another vital application area that demands special focus when deploying video analytics. Surveillance video streams can help understand a tracked person's behavior, for example by monitoring elderly or blind people to detect falls or any other possible threat. Fleck and Strasser reported a prototype 24/7 system that was installed in an assisted-living home for several months and showed quite promising performance [348]. Zhou et al. studied how video analytics can be used in eldercare to assist the independent living of elders and improve the efficiency of eldercare practice [349]. Further studies [350]-[353] analyzed activities, recognized posture, and detected falls or other substantial events. A smart gym can exploit the video camera stream to determine frequently used equipment, the duration of exercise, and the time spent on a particular piece of equipment, which is useful for real-time assessment.

d: Smart City Security (IoT) and Surveillance
Security is becoming an essential concern in many organizations, ranging from large enterprises to schools, homes, and law enforcement agencies, and security centers are turning to video analytics to keep their premises safe. Law enforcement agents can use a body-worn camera to identify criminals in real time while transmitting the video stream to the video analytics cloud. IVAaaS can provide customized services that adjust quickly to changing needs and demands. People detection and tracking [354], motion detection, intrusion detection [355], line crossing [356], object lift, loitering [357], and license plate recognition [358] are examples of video analytics services for security. For security in subway stations, Krausz [359] developed a surveillance system to detect dangerous events. Shih et al. [360] extracted the color features of an employee's uniform to verify the legality of entry into a restricted area. In the context of security, abandoned object detection is indispensable because an abandoned object may be linked to a terrorist attack. Video analytics applications are developing fast, and they are changing the way the security industry works.

e: Augmented reality and Personal Digital Assistance
One of the many aspects of augmented reality is visual, where devices like special glasses, helmets, or goggles are used to project additional information onto, or offer an interactive experience of, the surrounding real-world environment. The visual aspect of augmented reality may encompass complex IVA and demand powerful hardware. Likewise, vision-based digital assistants (e.g., the personal robot Jibo) are a rising technology that could deeply alter our regular activities while offering personalized and interactive experiences. Such devices could offload complex IVA to the CVAS for uninterrupted, low-latency processing.

f: Retail, Management, and Business Intelligence Analysis
Managing products, services, and staff at a large scale while adjusting to consumer demands can be challenging without timely and up-to-date information. Smart retail solutions take advantage of smart cameras combined with IVA to gather data on store operations and customer trends. Dwell analysis, face recognition, queue management, customer count, customer metrics, and consumer traffic maps are some of the example services in this context [361]. Gaze analysis provides a means to learn customers' interest in merchandise by following their attention [362], [363] on a store display. Actions like reaching for or grabbing products were analyzed in [364], [365] to understand customers' interest. Emotion analysis can identify customers' views regarding a product and their interaction with the company's representative [335]. IVA can also be used for business intelligence analysis to answer queries like "number of people visited per unit time?" or "customer interest in items?" while utilizing the same security infrastructure. Such information is beneficial for retailers in improving customer experience and marketing strategies.

VIII. RESEARCH ISSUES, OPPORTUNITIES, AND FUTURE DIRECTIONS
Intelligent video big data analytics in the cloud opens new research avenues, challenges, and opportunities. This section details these research challenges, which are summarized in Table 11.

A. IVA ON VIDEO BIG DATA
Big data analytics engines are general-purpose engines and are not specifically designed for big video analytics. Consequently, video big data analytics over such engines is challenging and demands optimization. Almost all of these engines inherently lack support for elementary video data structures and processing operations. Further, such engines are not optimized for iterative IVA or for dependencies among processes.
Optimizing cluster resource allocations among multiple workloads of iterative algorithms often involves an approximation of their runtime, i.e., predicting the number of iterations and the processing time of each iteration [366]. By default, Hadoop lacks support for iterative jobs, although they can be handled through speculative execution. Spark, in contrast, supports not only MapReduce and fault tolerance but also caches data in memory between iterations. IVA on video big data leaves immense room for the research community to make further progress in this direction. The research community is already trying to develop basic video processing and IVA support over big data engines, but this work is still in its beginning. How to optimize such engines for iterative IVA remains an open question. It is also worth investigating whether the existing distributed computing engines fulfill the demands of IVA on video big data or whether a specialized engine is needed.
Furthermore, existing research on IVA focuses on volume, velocity, and variety, whereas veracity and value have been overlooked. One promising direction in addressing video big data veracity is to research methods and techniques capable of assessing the credibility of video data sources so that untrustworthy video data can be filtered. Another is to come up with novel ML models that can make inferences from defective video data. Likewise, users need assistance to comprehend IVA results and the reasoning behind a decision in order to realize the value of video big data in decision support. Thus, understandable IVA can be a significant future research area.

B. IVA AND HUMAN-MACHINE COORDINATION
IVA on video big data grants a remarkable opportunity for learning with human-machine coordination for numerous reasons:
• IVA on video big data in the cloud demands that researchers and practitioners master both IVA and distributed computing technologies. Bridging both worlds is challenging for most analysts, especially in an educational environment, where the researcher ends up focusing more on understanding and configuring the system and its many parameters than on innovation and research contribution. Thus, there is a growing need to design CVAS that provide high-level abstractions to hide the underlying complexity.
• For IVA services to become commercially worthwhile and to achieve pervasive recognition, they must be usable by consumers lacking technical IVA knowledge. The consumers should be able to configure, subscribe to, and maintain IVA services with comfort.
• In traditional IVA, consumers are usually passive. Further research is required to build more interactive IVA services that assist consumers in gaining insight into video big data. An efficient interactive IVA service depends on the design of innovative interfacing practices based on an understanding of consumer abilities, behaviors, and requirements [367]. Interactive IVA services will learn from the consumer and decrease the need for administration by a specialist. They will also enable consumers to design custom IVA services to meet domain-specific requirements.

C. ORCHESTRATION AND OPTIMIZATION OF IVA PIPELINE
L-CVAS is a service-oriented architecture, i.e., (R/B)IVAaaS. The real-time and batch workflows depend heavily on the messaging middleware (Table 2) and the distributed processing engines (Table 7). In L-CVAS, dynamic (R/B)IVA service creation and the multi-subscription environment demand optimization and orchestration of the IVA service pipeline and offer opportunities for further research. In the literature, two types of techniques have been presented for real-time scheduling, i.e., static and dynamic [368], [369]. Static approaches are advantageous if the number of services and subscription sources is known in advance, but this is not the case with L-CVAS. Dynamic methods are suitable but expensive in terms of resource utilization. Likewise, the main issues in the BIVA service workflow on video big data are data partitioning, scheduling, execution, and the integration of numerous predictions. The BIVA service workflow can be affected by the data flow feature of the underlying big data engine (as shown in Table 7): Hadoop MapReduce lacks loops and chaining of stages, Spark supports DAG-style chaining, and Flink supports a Controlled Cyclic Dependency Graph (CCDG).
In the MapReduce infrastructure, a slowdown predictor can be utilized to improve the agility and timeliness of scheduling decisions [370]. Spark and Flink can assemble a sequence of algorithms into a single pipeline, but research is needed to examine their behavior in a dynamic service creation and subscription environment. Further, concepts from the fields of query and queuing optimization can be applied, taking the messaging middleware and distributed processing engines into account, to orchestrate and optimize the IVA service pipeline.
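As a rough illustration of DAG-style chaining, the following minimal sketch orders the stages of an IVA service pipeline so that every stage runs only after its dependencies, and groups independent stages into waves that could run in parallel. The stage names are hypothetical and are not taken from L-CVAS; the sketch uses Python's standard `graphlib` module rather than any big data engine.

```python
from graphlib import TopologicalSorter

# Hypothetical IVA pipeline: each stage maps to the set of stages it depends on.
pipeline = {
    "decode": set(),
    "detect_objects": {"decode"},
    "extract_features": {"decode"},
    "track_objects": {"detect_objects"},
    "index_results": {"track_objects", "extract_features"},
}


def execution_waves(dag):
    """Group stages into waves; stages within one wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(pipeline)
for number, stages in enumerate(waves, start=1):
    print(f"wave {number}: {stages}")
```

A static scheduler could compute such waves once, whereas a dynamic one would have to recompute them whenever a service or subscription is added, which is one source of the extra resource cost noted above.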

D. IVA AND BIG DIMENSIONALITY
The VSDS multi-modality can produce diverse types of data streams. Similarly, the IVA algorithm, developer, and IR have a triangular relationship. An array of algorithms can be deployed, generating varied sorts of multi-dimensional features from the acquired data streams. The high-dimensionality factor poses many intrinsic challenges for data stream acquisition, transmission, learning, pattern recognition, indexing, and retrieval. In the literature, this has been referred to as the "Big Dimensionality" challenge [371].
VSDS variety leads to key challenges in acquiring and effectively processing heterogeneous data. Most existing IVA approaches consider one specific kind of input, but in many cases different kinds and formats of input can be relevant to a single IVA goal.
With growing feature dimensionality, current algorithms quickly become computationally intractable and, therefore, inapplicable in many real-time applications [372]. Dimension reduction approaches will remain a hot research topic because of data diversity, increasing volume, and complexity. Efficient learning algorithms based on first-order optimization, online learning, and parallel computing will be preferred.
Similarly, designing a generic, efficient, and scalable multi-level distributed data model for indexing and retrieving multi-dimensional features is becoming tougher than ever because of the exponential growth and speed of video data. Varied situations, requirements, and parameters demand further investigation, such as complex data type indexing (e.g., objects that contain multiple types of data), multi-dimensional features (which require a different feature matching scheme for each type), cross-scheme matching (e.g., spatiotemporal, spatial-object, object-temporal, etc.), a giant search space, incremental updates, on-the-fly indexing, and concurrent query processing. This gives the research community the opportunity to optimize existing hashing schemes and indexing structures such as R-trees [373], M-trees [374], X-trees [375], and locality-sensitive hashing [376] on top of big data engines. Distributed data stores based on the log-structured merge-tree [377] (see Table 4) can be leveraged to improve multi-dimensional query performance. Additionally, ML classification models (including neural networks) can be used to capture the semantics by inspecting the association between features and the context among them to better index multi-dimensional data, which makes them more precise and effective than traditional, non-ML indexing approaches [378].
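To make the hashing idea concrete, the sketch below shows one common variant of locality-sensitive hashing, random-hyperplane hashing, in plain Python. It is an illustrative assumption, not the scheme of [376] or of any particular data store: the function names, the 8-bit signature length, and the toy feature vectors are all hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random hyperplanes (normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: the side of the plane on which the vector falls."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(features, hyperplanes):
    """Bucket feature vectors by signature; similar directions tend to collide."""
    buckets = defaultdict(list)
    for key, vector in features.items():
        buckets[signature(vector, hyperplanes)].append(key)
    return buckets


hyperplanes = make_hyperplanes(num_bits=8, dim=4)
base = [1.0, 0.5, -0.2, 0.1]
# A positively scaled copy points in the same direction, so it shares the bucket.
features = {"clip_a": base, "clip_a_scaled": [2.0 * x for x in base]}
index = build_index(features, hyperplanes)
```

A query vector is hashed with the same hyperplanes and compared only against the keys in its bucket, which shrinks the search space at the cost of occasionally missing a true neighbor.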

E. ONLINE LEARNING ON VIDEO BIG DATA
The value of RIVA depends on the velocity of the video streams, i.e., their newness and relatedness to ongoing happenings. However, the existing de facto big data standards fall short in dealing with changing streams [379]. RIVA services must address continuous and changing video streams. In this context, online learning can be utilized, which represents a group of learning algorithms for constructing a predictive model incrementally from a sequence of data, e.g., the Fourier Online Gradient Descent and Nystrom Online Gradient Descent algorithms [380]. Nallaperuma et al. [381] proposed an ITS platform utilizing unsupervised online learning and deep learning approaches. Involving data fusion from heterogeneous data sources offers further research opportunities [381].
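The incremental nature of online learning can be sketched with plain online gradient descent on a linear model. This is deliberately the basic algorithm, not the Fourier or Nystrom kernel variants of [380], and the synthetic stream below is an assumption made only for illustration.

```python
def online_gradient_descent(stream, dim, learning_rate=0.1):
    """Update a linear model one example at a time under squared loss."""
    weights = [0.0] * dim
    for features, target in stream:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        # The gradient of 0.5 * error**2 with respect to weight i is error * x_i.
        weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    return weights


# Synthetic, noise-free stream generated from target = 2*x1 - 1*x2.
stream = [
    ((x1, x2), 2.0 * x1 - 1.0 * x2)
    for _ in range(200)
    for x1, x2 in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
]
weights = online_gradient_descent(stream, dim=2)
print(weights)  # close to [2.0, -1.0]
```

Because each example is used once and then discarded, the model can follow a stream without storing it, which is the property that matters for continuously arriving video features.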

F. MODEL MANAGEMENT
The L-CVAS architecture is designed to deploy an array of IVA algorithms, contributed by both administrators and developers. An algorithm might hold a list of parameters. The model selection process encompasses feature engineering (feature selection, Section IV-B1), IVA algorithm selection, and hyperparameter tuning. Feature engineering is a laborious activity and is influenced by many key factors, e.g., domain-specific regulations, time, accuracy, video data, and IVA properties, which slow down and hinder exploration. IVA algorithm selection is the process of choosing a model that fixes the hypothesis space of prediction functions explored for a given application [382]. This process depends on technical and non-technical aspects, which force the IVA developer to try multiple techniques at the cost of time and cloud resources. Hyperparameter tuning is vital because hyperparameters govern the trade-offs between accuracy and performance. IVA analysts usually do ad-hoc manual tuning by iteratively choosing a set of values or by using heuristics such as grid search [382]. From the IVA analyst's perspective, model selection is an expensive job in terms of time and resources that slows down the video analytics lifecycle [383]. Model selection is an iterative and investigative process that generally creates an endless search space, and it is challenging for IVA analysts to know a priori which combination will produce acceptable accuracy or insights. In this direction, theoretical design trade-offs are presented by Arun et al. [384], but further research is required on how to shape a unified framework that acts as a foundation for a novel class of IVA analytics while making the procedure of model selection easier and quicker.
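The cost of grid search can be seen in a minimal sketch: every combination of candidate values is evaluated once, so the number of trials is the product of the list lengths. The scoring function here is a hypothetical stand-in for training an IVA model and measuring validation accuracy, and the hyperparameter names are assumptions.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every combination in `grid`; return the best (params, score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical stand-in for 'train a model, return validation accuracy'."""
    penalty = (params["learning_rate"] - 0.01) ** 2
    penalty += 0.001 * abs(params["batch_size"] - 64)
    return 1.0 - penalty


grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_params, best_score = grid_search(toy_validation_score, grid)
print(best_params, best_score)
```

With three values for each of two hyperparameters the search already needs nine training runs; adding a third hyperparameter with three values would need twenty-seven, which illustrates why model selection consumes so much time and cloud resources.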

G. PARAMETER SERVERS AND DISTRIBUTED LEARNING
Training a model (e.g., with Stochastic Gradient Descent) for video big data analytics in a distributed environment carries an intrinsic issue of sharing and updating high-dimensional parameters, whose number can easily run into the billions or trillions. The Parameter Server notion has been introduced to address this issue; it aims to store the parameters of an ML model, such as the weights of a neural network, and serve them to clients. The Parameter Server is a framework for building distributed ML algorithms and encompasses diverse design goals, e.g., efficient communication, flexible consistency, elasticity when adding resources, resource utilization, and ease of use. Recently, various studies have tried to optimize the Parameter Server. PS2 [385] builds the parameter server on top of Spark. SketchML [386] compresses the gradient values with a sketch-based method. FlexPS [387] introduces a multi-stage abstraction to support flexible parallel control. The Parameter Server can be optimized further against the stated design goals and needs further investigation.
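The pull/push interaction at the heart of this idea can be sketched in a single process. The class and function names below are hypothetical and do not reflect the API of PS2, SketchML, or FlexPS; a real parameter server would shard the parameters across machines and handle communication, consistency, and failures.

```python
class ParameterServer:
    """Holds shared parameters; workers pull values and push gradients."""

    def __init__(self, initial, learning_rate=0.1):
        self.params = dict(initial)
        self.learning_rate = learning_rate

    def pull(self, keys):
        """Return the current values of the requested parameters."""
        return {key: self.params[key] for key in keys}

    def push(self, gradients):
        """Apply a gradient-descent update for each pushed gradient."""
        for key, gradient in gradients.items():
            self.params[key] -= self.learning_rate * gradient


def worker_step(server, shard):
    """Pull the weight, compute a gradient on the local data shard, push it."""
    w = server.pull(["w"])["w"]
    # Gradient of the mean squared error for the one-dimensional model y = w * x.
    gradient = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    server.push({"w": gradient})


server = ParameterServer({"w": 0.0}, learning_rate=0.05)
# Two workers, each holding a shard of data generated from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (1.5, 4.5)]]
for _ in range(100):
    for shard in shards:
        worker_step(server, shard)
print(server.params["w"])  # close to 3.0
```

Here the workers take turns, which corresponds to a fully synchronous setting; relaxing that ordering is where the flexible consistency design goal comes in.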

H. EVALUATION ISSUES AND OPPORTUNITIES
Haralick [388] initiated the discussion of IVA performance evaluation, which was followed by dedicated workshops [389] and journals [390], [391]. As a result, performance evaluation tools (ViPER) and datasets (ETISEO [392], TrecVID [393], i-LIDS [394]) were introduced. Traditional IVA has an established set of prediction-accuracy-based metrics for performance evaluation, ranging from accuracy, error rate, and precision to optimization and estimation error. Some further evaluation parameters are adopted from big data analytics when IVA is run on distributed computing, e.g., scalability, fault tolerance, memory usage, throughput, etc. [395] (as shown in Table 9). The amalgamation of these two types of metrics might not be enough.
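For reference, the accuracy-oriented metrics mentioned above can be computed from the four confusion-matrix counts, and throughput from a frame count and elapsed time. The sketch below uses the standard definitions; the example counts are made up for illustration.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": _ratio(tp + tn, total),
        "error_rate": _ratio(fp + fn, total),
        "precision": _ratio(tp, tp + fp),
        "detection_rate": _ratio(tp, tp + fn),  # also called recall
        "false_alarm_rate": _ratio(fp, fp + tn),
    }


def throughput_fps(frames_processed, seconds):
    """System-side metric: frames processed per second."""
    return _ratio(frames_processed, seconds)


# Hypothetical counts for an event detector evaluated on 200 video segments.
metrics = detection_metrics(tp=80, fp=10, tn=100, fn=10)
print(metrics["accuracy"], metrics["false_alarm_rate"])
print(throughput_fps(frames_processed=3000, seconds=60))
```

The first group of numbers says nothing about scalability or fault tolerance, and the throughput figure says nothing about correctness, which is why combining the two families of metrics is a starting point rather than a complete evaluation.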
IVA services provided by a system like L-CVAS have to perform predictably across an intractable number of scenarios and environmental circumstances, meeting requirements that vary according to the situation, domain, and user. For the L-CVAS consumer, the IVA services work as a black box where the significant metrics relate to overall system performance, such as false alarms, accuracy, and detection rate. However, from the developer and researcher perspective, L-CVAS consists of numerous computer vision algorithms with complex interfaces among them. Proper performance evaluation metrics are required for IVA service developers to comprehend these relations and to innovate and develop novel IVA services. Another critical issue is how to guarantee accurate and predictable IVA service performance when porting technology between distributed algorithm development environments and deployment code environments with hardware-specific optimizations. Such porting results in algorithmic alterations that can influence the performance of IVA services.
IVA performance evaluation is goal-oriented, and the factors should be determined carefully. Key factors that influence the performance of video big data analytics utilizing distributed computing engines in the cloud include, but are not limited to, the following.
• VSDS: holds diverse types of parameters, i.e., the video type (color, grey-scale, infrared, omnidirectional, depth map, etc.), properties (frame rate, field depth), and quality (resolution, pixel depth) of the video as generated by the camera.
• VSDS and messaging middleware parameters: encompass VSDS connection, frame reduction and transformation, messaging queue (broker server), compression artifacts, mini-batch size, and possibly the involvement of internal and external networks.
• VSDS environmental parameters: some features of the configuration remain constant in a given use case but differ between configurations, possibly influencing performance. These parameters comprise camera location (mounting height, angle, indoor or outdoor), mounting type (still or in motion), camera view (roads, water, foliage), and weather (sun, cloud, rain, snow, fog, wind).
• Distributed processing environment: the video processing hardware (FPGA, CPU, GPU, etc.) and network communication channels potentially impose additional limitations in terms of locality, speed, and memory.
• Big data analytics engines: the nature and characteristics of big data engines, such as data flow, windowing, computation model, etc., affect the performance of the IVA. Further, big video analytics in the cloud demands complex trade-offs between different evaluation criteria. For instance, consider the intricate trade-off between accuracy and response time. Iterative tasks have an inverse relation with fault tolerance concerning scalability (e.g., MapReduce is highly fault-tolerant but lacks iteration). Similarly, non-iterative IVA algorithms scale better than iterative ones at the cost of performance degradation.
• IVA service parameters: application parameters are domain-specific, e.g., the objects of interest (vehicles, carts, humans, etc.), tolerable missed detection and false alarm rates and their desired trade-off, IVA type, and maximum acceptable latency.
• Computation and communication trade-off: IVA algorithms and services in the distributed environment should be designed and developed wisely with the intent to minimize computation time, which is associated with data locality and loading.
Because diverse types of factors affect video big data analytics performance in the cloud, constructing a comprehensive evaluation of all use cases is nearly impossible. This provides opportunities for researchers to design a unified and generic evaluation framework that can be adapted by any CVAS. Investigating these issues would significantly contribute to the academic and industrial communities interested in building IVA algorithms, services, and CVAS.

I. IVA ALGORITHM, MODEL, AND SERVICES STATISTICS MAINTENANCE, RANKING, AND RECOMMENDATION
The L-CVAS architecture is designed under the Customer-to-Customer (C2C) business model. In L-CVAS, a user can develop and deploy an IVA algorithm, model, or service (here collectively called an IVA service) that can be extended, utilized, or subscribed to by other users. The community members run such an architecture, and the number of domain-dependent or domain-independent IVA services can rapidly become very large. This scenario creates a complex situation for users, i.e., which IVA service to choose in a specific situation when several services share similar functionality, especially during service discovery. Each IVA service is associated with a list of Quality of Service (QoS) parameters. These QoS parameters include, but are not limited to, user trust, satisfaction, domain relevance, security, usability, availability, reliability, documentation, latency, response time, resource utilization, accuracy, and precision.
Selecting among such IVA services against the QoS parameters leads to a 0-1 knapsack problem. In this direction, one possible solution is to utilize multi-criteria decision-making approaches. This gives the research community further opportunities to investigate how to rank and recommend IVA algorithms, models, and services. Similarly, the QoS data can form a high-dimensional sparse matrix [396]. Research is therefore required on how to utilize such parameters for IVA service recommendation.
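One of the simplest multi-criteria decision-making approaches is the weighted sum model, sketched below: each QoS value is min-max normalized, inverted where lower is better, and combined with user-supplied weights. The service names, QoS values, and weights are hypothetical, and the weighted sum is only one of many possible methods.

```python
def rank_services(services, weights, lower_is_better=()):
    """Rank services by a weighted sum of min-max normalized QoS values."""
    criteria = list(weights)
    lowest = {c: min(qos[c] for qos in services.values()) for c in criteria}
    highest = {c: max(qos[c] for qos in services.values()) for c in criteria}

    def normalized(value, criterion):
        span = highest[criterion] - lowest[criterion]
        if span == 0:
            return 1.0  # all services are equal on this criterion
        score = (value - lowest[criterion]) / span
        return 1.0 - score if criterion in lower_is_better else score

    totals = {
        name: sum(weights[c] * normalized(qos[c], c) for c in criteria)
        for name, qos in services.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical IVA services offering the same functionality.
services = {
    "svc_a": {"accuracy": 0.92, "latency_ms": 120, "availability": 0.99},
    "svc_b": {"accuracy": 0.88, "latency_ms": 40, "availability": 0.999},
    "svc_c": {"accuracy": 0.80, "latency_ms": 300, "availability": 0.95},
}
weights = {"accuracy": 0.5, "latency_ms": 0.3, "availability": 0.2}
ranking = rank_services(services, weights, lower_is_better={"latency_ms"})
print(ranking)
```

Changing the weights changes the ranking, so in a C2C setting the weights would have to come from the subscribing user or be learned from past subscriptions.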

J. IVAAAS AND COST MODEL
Recently, cloud-based analytics platforms have become the key means for enterprises to provide services on the pay-as-you-go cost model. Existing cost metrics are usually determined by hardware usage, comprising processing (CPU, GPU), disk space, and memory usage. These prices are either static or dynamic [397]. An example of the former is Amazon's EC2, which offers tiered levels of service. In the latter case, a cost model uses analytics to determine the price of the service, taking into account factors such as peak hours and competitors' cost models. The hardware cost is usually minimal compared to the cost of software such as L-CVAS, where the IVA analytics itself is valued more.
L-CVAS is supposed to provide IVA-Algorithm-as-a-Service (IVAAaaS) and IVAaaS in the cloud while adopting the C2C business model. Unfortunately, current SaaS cost models might not be applicable because of the involvement of diverse types of parameters that drastically affect the cost model. Such parameters are the business model (Business-to-Business (B2B), Business-to-Customer (B2C), and C2C), unit of video, user type (developer, researcher, and consumer), services (IVAAaaS and IVAaaS), service subscription (algorithm, IVA service, single, multiple, dependent or independent), cloud resource utilization, user satisfaction, QoS, location, service subscription duration, and cost model fairness. The addition of further parameters is subject to discussion, but the listed ones are the basic parameters that govern the L-CVAS cost model. The cost model demands further research and investigation to develop an effective pricing scheme for IVA services while considering the stated parameters.
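The difference between a static, usage-based price and a dynamic one can be sketched as follows. All rates, the peak-hour window, and the multiplier are hypothetical placeholders, and the sketch covers only hardware usage and peak hours, not the service-level parameters listed above.

```python
def usage_cost(cpu_hours, gpu_hours, storage_gb, rates):
    """Static pay-as-you-go cost: usage multiplied by fixed unit rates."""
    return (
        cpu_hours * rates["cpu_hour"]
        + gpu_hours * rates["gpu_hour"]
        + storage_gb * rates["storage_gb"]
    )


def dynamic_price(base_cost, hour_of_day, peak_hours=range(9, 18), peak_multiplier=1.5):
    """Dynamic variant: the same usage costs more during assumed peak hours."""
    return base_cost * (peak_multiplier if hour_of_day in peak_hours else 1.0)


# Hypothetical unit rates in an arbitrary currency.
rates = {"cpu_hour": 0.05, "gpu_hour": 0.9, "storage_gb": 0.02}
base = usage_cost(cpu_hours=10, gpu_hours=2, storage_gb=100, rates=rates)
print(base, dynamic_price(base, hour_of_day=10), dynamic_price(base, hour_of_day=22))
```

An IVA-specific cost model would need additional terms for the factors that have no hardware equivalent, such as the subscribed algorithm, the user type, or the QoS level.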

K. VIDEO BIG DATA MANAGEMENT
Although video big data holds high value, its management, indexing, retrieval, and mining are challenging because of its volume, velocity, and unstructured nature. IVA has been investigated over the years (Section V) but is still evolving and needs to address diverse types of issues, such as:
• In the context of video big data management, the main issue is the extraction of semantic concepts from primitive features. A general domain-independent framework is required that can extract semantic features and analyze and model the multiple semantics of videos by using the primitive features. Further, semantic event detection is still an open research issue because of the semantic gap and the difficulty of modeling the temporal and multi-modality features of video streams. Temporal information is significant in video big data mining, mainly in pattern recognition.
• Motion analysis is vital, and further research is required on moving object analysis, i.e., object tracking, occlusion handling, and moving objects with static cameras [214], moving cameras [216], and multiple camera fusion.
• Limited research is available on CBVR (see Section V-A) that exploits distributed computing. Further study is required to consider different features, ranging from local to global spatiotemporal features, while utilizing and optimizing deep learning and distributed computing engines.
• For video retrieval, semantic-based approaches have been utilized because of the semantic gap between low-level features and high-level human-understandable concepts. Ontology adds extra concepts that can improve the retrieval results [183] but can also lead to unexpected deterioration of search results. In this context, a hybrid approach can be fruitful, and different query planes need to be designed that can fulfill diverse queries in complex situations.
• Insufficient research is available on graph-based video big data retrieval and analysis, opening doors for further investigation. Researchers can conduct studies on questions such as the formation of the video graph, tuning the similarity value and its effect on the graph formation, and studying the properties and meta-analytics of the formed video graph.
• How can reinforcement learning and real-time feedback query expansion techniques [398] be exploited to improve the retrieval results?
• Recently, video query engines have been introduced to retrieve and analyze video at scale [399]-[401]. A special focus of the database community is required to design, implement, optimize, and operationalize such video query engines.

L. PRIVACY, SECURITY AND TRUST
The acquisition and storage of video big data and subscriptions to shared IVA in the cloud become mandatory, which leads to privacy concerns. For the success of such platforms, privacy, security, and trust are always central. In the literature, the word 'trust' is commonly used as a general term for 'security' and 'privacy' [402]. Trust is a social phenomenon where the user has expectations of the IVA service provider and is willing to take action (subscription) based on the evidence-supported belief that the expected behavior will occur [403], [404]. In the cloud environment, security and privacy play an active role in trust-building. To ensure security, the CVAS should offer different levels of privacy control. Privacy and security concerns apply across the VSDS, storage security, multi-level access controls, and privacy-aware IVA and analysis. We list some research directions that can provide opportunities for cloud security specialists.
• The volume, variety, and velocity of video big data boost security threats. Recently, disputes and news have been circulating regarding the misuse of user-generated content and the unauthorized hacking of cameras. The CVAS vendor, in collaboration with or under the law enforcement agencies, must come up with new rules, laws, and agreements, which can differ from country to country. Utilizing such policies, the CVAS vendor should ensure that all the IVA service, subscription, and storage level agreements are adequately followed. Research can examine whether such policies offer adequate protection for individuals' data while performing video big data analytics and public monitoring.
• Unlike other data, videos are more valuable for the owner, and their exposure can be a direct threat, e.g., through live broadcast, blackmailing, etc. Likewise, traditional privacy approaches focus on data management, which becomes obsolete when it comes to data security. Novel algorithms are required to secure users' data for both shared IVA and storage and to make video stream acquisition more secure.
• In the context of security, the blockchain (popularized by Bitcoin) [405] has been studied and operationalized across academia and industry alike. A blockchain is a modification-resilient cryptographic technique, known as a distributed ledger, in which records are linked and managed by a decentralized peer-to-peer network [405]. Blockchain techniques are still at an early stage and can be researched further to form a novel automated security system for CVAS.
• ML techniques have matured over the years and have been successfully utilized for security, i.e., for modeling attack patterns with their distinctive features. However, a change in features in the case of sophisticated attacks may lead to security failure. ML could enhance the performance of security solutions to alleviate the dangers of existing cyberattacks. The research community can further investigate how ML techniques, especially deep learning, can be deployed to analyze logs produced by network traffic, IVA processes, and users to recognize suspicious activities.
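The modification resilience attributed to the blockchain in the list above comes from linking each record to the hash of its predecessor. The following single-process sketch illustrates only that linking; it omits the decentralized peer-to-peer network and consensus that a real blockchain requires, and the audit-log payloads are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def block_hash(index, payload, previous_hash):
    """SHA-256 over the block contents, including the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev": previous_hash}, sort_keys=True
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a record that is linked to the current tail of the chain."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "prev": previous_hash,
        "hash": block_hash(index, payload, previous_hash),
    })


def is_valid(chain):
    """Recompute every hash and link; any edited record breaks the chain."""
    previous_hash = GENESIS_HASH
    for block in chain:
        if block["prev"] != previous_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["payload"], block["prev"]):
            return False
        previous_hash = block["hash"]
    return True


# Hypothetical audit log of video access events.
ledger = []
for event in ["camera_7 stream subscribed", "clip_42 stored", "clip_42 viewed"]:
    append_block(ledger, event)
print(is_valid(ledger))  # True

ledger[1]["payload"] = "clip_42 deleted"  # tamper with a past record
print(is_valid(ledger))  # False
```

For a CVAS, such a ledger could record who subscribed to or accessed which video source, so that later tampering with the audit trail becomes detectable.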

IX. CONCLUSION
In the recent past, the number of public surveillance cameras has increased significantly, and an enormous amount of visual data is being produced at an alarming rate. Such large-scale video data exhibit the characteristics of big data. Video big data offers opportunities to the video surveillance industry and permits it to gain insights in almost real time. Big data technologies such as Hadoop, Spark, etc., have been deployed in the cloud under the aaS paradigm for the last few years to acquire, persist, process, and analyze large amounts of data. This approach has changed the context of information technology and has turned the assurances of the on-demand service model into reality. This paper provides an extensive study on intelligent video big data analytics in the cloud. First, we define basic terminologies and establish the relation between video big data analytics and cloud computing. A comprehensive layered architecture, called L-CVAS, has been proposed for intelligent video big data analytics in the cloud under the aaS model. VBDCL is the base layer that allows the other layers to develop IVA algorithms and services. This layer is based on the concept of IR orchestration and takes care of data curation throughout the life cycle of an IVA service. The VBDPL is in charge of preprocessing and extracting the significant features from the raw videos. The VBDML is accountable for producing the high-level semantic result from the features generated by the VBDPL. The KCL deploys video ontology and creates knowledge based on the extracted higher-level features obtained from the VBDML. When all these layers are pipelined in a specific context, they become an IVA service to which users can subscribe video data sources under the IVAAaaS paradigm. Furthermore, to show the significance and recent research trends of IVA in the cloud, a broad literature review has been conducted.
The research issues, opportunities, and challenges raised by the uniqueness of the proposed L-CVAS and by the triangular relation among video big data analytics, distributed computing technologies, and the cloud have been reported.

IRFAN ULLAH received his master's degree in computer science from the National University of Sciences and Technology (NUST), Islamabad, Pakistan. Currently, he is serving as a research assistant at the Department of Computer Science and Engineering, Kyung Hee University (Global Campus), South Korea. His research interests are in the areas of natural language processing, social computing, crisis informatics, machine/deep learning, OS design and optimization on memory systems, big data analytics, and distributed computing.