Towards an Ontology Design Pattern for UAV Video Content Analysis

Video scene understanding is attracting increased research investment in artificial intelligence technologies, pattern recognition, and computer vision, especially with the advances in sensor technologies. Developing autonomous unmanned vehicles able to recognize not just the targets appearing in a scene but the complete scene the targets are involved in (events, actions, situations, etc.) is becoming crucial in advanced intelligent surveillance systems. Alongside these consolidated technologies, Semantic Web technologies are also emerging, yielding seamless support for the high-level understanding of scenes. To this purpose, the paper proposes a systematic ontology modeling approach to support and improve video content analysis by generating a comprehensive, high-level scene description through semantic reasoning and querying. The ontology schema results from the integration of new and existing ontologies and provides a design pattern guideline to obtain a high-level description of a whole scenario. It starts from the description of the basic targets in the video, supported by video tracking and target classification algorithms; it then provides a higher-level interpretation by compounding event-driven target interactions (for local activity comprehension), gradually reaching a level of abstraction that enables a concise and complete scenario description.


I. INTRODUCTION
Unmanned Aerial Vehicles (UAVs) are extensively used for research, monitoring and assistance in several fields of application, ranging from defense, emergency and disaster management to agriculture, delivery of items, filming and so on. Their performance is often assessed by how accurate and precise the provided scenario description is, ranging from the basic identification of fixed and mobile targets to the recognition of target actions that constitute events occurring in the real-time scenario. Especially when a high-level description of the scenario is strongly desired, UAVs should be able to process the initial tracking data and, by adding environmental information, interpret the scene captured by the on-board camera. Although human remote control of these vehicles is often decisive for clearly understanding the scene and taking action, UAVs equipped with such abilities could support human operators in many situations, especially those dangerous for humans.
Scenario comprehension requires analyzing low-level data and then building knowledge on different aspects of the scene, collecting distinct feature data and merging them, incrementally, to get a complete picture of what is happening [8]. A straightforward interpretation of a road scenario first requires detecting the principal actors of the scene, such as the people and vehicles moving in it. Then, their movements and interactions need to be understood in order to recognize events or actions. Combinations of events involving one or more objects depict higher-level activities or situations. This process gradually transforms primitive data (e.g., from sensors or tracking) into high-level information to reach a high-level view of the scenario. Figure 1 shows an incremental multi-layer knowledge extraction schema that depicts this process. Each layer produces a knowledge "granule" that is used and integrated in the upper layer with additional features to increase the knowledge granularity on the initial entities.

FIGURE 1: Multi-layer knowledge schema: an incremental design process for video scene description

In the figure, the original video frames are processed by tracking methods (the focus is on a zoomed frame portion), obtaining target bounding boxes. The Raw sensor data layer constitutes the level of primitive data from tracking, such as object dimensions, positions, width and height of bounding boxes, etc., as well as possible sensing data if sensors are involved in this phase. Tracked targets are the output of this initial data transformation step. The next level is defined on the detailed features of the scene objects, obtained through the tracking process. The Object layer is composed of all the recognized targets, including the target identification and classification activities. In Figure 1, for example, the targets identified in the video frames are classified and labeled as Person. In other words, Person is the (class) label associated with the bounding boxes identified as id1 and id2. The Activity layer describes the relations between the objects appearing in the scene: moving objects can interact with other (moving or fixed) objects, involving actions, movements, or any change of the scene. For example, people's movements and interactions state that the objects labeled as Person are walking. The upper layer represents, at a high level, the interpretation of the scene through the activities carried out by the named objects in the scene. The Situation layer abstracts the object movements in the environment to achieve a final, human-like interpretation of the scene. In this case, the revealed situation People Crossing is a higher-level description of the activity Person Walking of the previous layer, carried out by the recognized Person objects. The situation People Crossing explains what is happening in the scene, straightforwardly and concisely. The multi-layer knowledge extraction schema, depicted in Figure 1, shows a methodological infrastructure to incrementally recognize objects and the activities they are involved in and to systematically describe a video frame scenario. The logic behind this schema needs solid formal modeling, which finds its solution in a thorough ontological design. Ontologies indeed provide formal models to describe axiom-based knowledge and infer new knowledge through semantic reasoning.
Bearing in mind one of the focal principles of the Semantic Web, viz., data re-usability, the multi-layer knowledge schema is achievable by integrating existing upper and domain ontologies, aligning similar concepts and extending them, in order to bridge different domain knowledge.
Ontology integration is not an easy task to fulfill, due to the difficulty of relating distinct domains (ontology alignment). Poor ontology integration can result in excessive redundancy of information, with a consequent reduction in performance [14], which inevitably affects semantic reasoning and query processing [22].
To address this issue, this work proposes a novel and systematic ontology design to support Computer Vision methods in video scene comprehension. The idea is to add ontology-based semantic support to the well-known approaches and methods for Video Analysis, in order to increase the effectiveness of video content analysis. Basically, the output of video tracking and target classification (and labeling) is encoded in ontological assertions to infer new, enhanced knowledge that describes target interactions, events, activities and, finally, situations appearing in the scene. The multi-layer knowledge schema shown in Figure 1 provides a systematic design process to incrementally yield a scenario description of the video content, formally supported by ontology modeling. The knowledge produced by each layer is modeled as ontology concepts corresponding to the main scene actors, and their relationships constituting movements, events, activities and, finally, situations in the scene. At each layer, higher-level knowledge is built from the information of the previous layer, thanks to the corresponding ontology model that describes the conceptualization at that layer. By semantic reasoning, new assertions, inferred from the previously generated knowledge, enable a high-level view of the video content. The remainder of the paper is structured as follows. Section II presents an overview of the main literature in video content analysis, with a focus on semantic knowledge-based approaches; Section III describes the individual ontology models used in this approach as well as the final ontology model resulting from their integrated design. Finally, Section IV shows an illustrative example that highlights, step by step, the whole process in an actual scenario. Conclusions close the paper.

II. RELATED WORK
This section provides a literature review of situation recognition, analyzing the proposed Machine Learning and knowledge-based methods. It also discusses the main ontologies designed for situation modeling.

A. COMPUTER VISION AND DEEP LEARNING
Situation interpretation has been a highly debated topic in the literature, supporting devices such as smart cameras, robots and unmanned vehicles in accomplishing complex surveillance and monitoring tasks. As a first step, scenario interpretation from mobile cameras requires the detection and tracking of the main actors of the video scene, such as people, vehicles and animals. To this purpose, tracking algorithms [20], [31] have been proposed in the literature. Object tracking from mobile cameras is a challenging topic because there is no fixed scene [5], [11], [12], [21]; therefore, traditional techniques such as background subtraction [29] cannot be applied to accomplish the task. After scene object detection has been performed, scenario interpretation requires the identification of object identity, as well as the recognition of the environment and of specific features of the scene that can support activity and event detection. To this purpose, Machine Learning and, especially, Deep Learning methods [30], [34] have been widely investigated to recognize object identity, scene elements or events, and even activities from egocentric videos [34]. These methods exhibit very good performance, but the training phase is quite expensive due to the huge number of training samples required, and they are often designed ad hoc for a specific domain (e.g., pedestrian events) [33]. Furthermore, a camera-equipped UAV flying over outdoor areas can capture different types of environments and objects, performing various activities, under different light conditions and angles. These conditions can significantly increase the number of training samples required for well-performing object and event detection. The implementation of Deep Learning methods is driven by the availability of high-performance GPUs and software frameworks. Additionally, the training, classification and validation of Deep Learning models have been shown to be not always trivial [28].

B. KNOWLEDGE-BASED SYSTEMS APPLIED TO SURVEILLANCE
Recent literature [7], [9] focuses on enhancing UAVs as knowledge-based systems that become aware of situations occurring in a real-world scenario. Knowledge-based methods have been used to perform sensor fusion, integrating heterogeneous data and supporting various applications [2], [32], such as UAV-driven object detection in video scenes [6], [18]. Cognitive models have been proposed to improve object detection and tracking by fusing information on the scene to catch tracking faults such as occlusion, ID loss and motion blur. Other researchers [10], [17], [26] proposed new models to cope with UAV-based event detection in both indoor and outdoor environments. They proposed ontology-based approaches to model knowledge of the scene and objects. Some approaches focus on robust interpretations of events over time to abstract higher-level knowledge of a scene and provide refined descriptions of the whole scenario [6], [26], [27]. In [27], the authors propose a novel reasoning mechanism to deal with uncertainty in activity detection. In [6], the ontology-based model introduced in [7] is extended by considering a query-based temporal window to analyze spatio/temporal relations among tracked people and detect events over time. In [26], an ontology-based system, namely iKnow, detects activities of daily living by merging dependencies among low-level and high-level concepts, such as the locations and objects involved in activities. This model introduces the telicity criterion, which is applied to group already detected activities for situation interpretation. Our previous approach [10] employs ontology-based modeling of UAV-recorded video scenes to detect activities carried out by people and vehicles in various environmental contexts. The approach detects simple activities carried out by tracked scene objects; compositions of these activities over time then enable the definition of higher-level complex activities. The knowledge modeling is achieved through ontology axioms and by applying reasoning on them.
The knowledge-based system proposed in [17] introduces a context layer over tracking, which employs an ontology composed of several sub-ontologies, each one devoted to a specific aspect/layer of the scene, from the lowest to the highest level (i.e., tracking data, scene objects, situations).
The approaches [7], [9], [10], [17], [26] employ knowledge-based methods to detect activities and situations, but they do not provide a methodological approach to achieving a scene description; this work presents an ontology design pattern that provides the incremental steps (in the form of ontological models) to describe a scene at different levels of detail. Coding design patterns into ontologies has proven useful for supporting and improving Semantic Web ontology engineering [15]. In [15], content-oriented patterns are shown to be useful to abstract knowledge and support composition. This paper introduces a multi-ontology process design pattern to support knowledge acquisition and reuse about a UAV-taken scenario. The employment of a knowledge-based approach does not preclude the use of statistical or probabilistic approaches. In fact, in [16], ontologies and Markov Logic Networks are used synergistically to accomplish activity recognition.

C. ONTOLOGIES FOR SITUATION MODELING
Recent studies evidence the role of ontologies in modeling the features arising from the UAV-observed scene [1], [7], [23]. In [7], an ontology, namely TrackPOI, is proposed to represent scene mobile objects (i.e., people, vehicles, etc.) and environments (roads, buildings, etc.), starting from tracked scene data. The Activity Ontology Design Pattern (ODP) [1] introduces a core ontology for activity modeling that can be used in different contexts. The activity is modeled along with its features (time duration, people involved, etc.). This ontology also allows the modeling of an activity as composed of simpler activities. An ontology similar to ODP is proposed in [25], where the authors present a core ontology to model the activity and its features. The model is then extended with a specialization pattern and a composition pattern to, respectively, specialize the core ontology to model a specific domain and build complex activities from simpler ones. The Situation Theory Ontology (STO) [23] concerns the modeling of concepts in Situation Theory (additional details will be provided in the next sections).
In the Situation Awareness domain, ontologies often combine classes modeling sensor-related information with classes modeling high-level features, such as relations among scene objects, events, and situations. The ontologies proposed in the literature are upper ontologies, representing general relations among the data, which can be specialized for a specific application. In [24], a novel method for knowledge representation for Situation Awareness is discussed: it uses RuleML-based domain theories and proposes the Situation Awareness (SAW) ontology. The ontology models a situation as a collection of goals, entities or objects, and relations among these objects. The ontology also models events as acquired by sensors and allows a dynamic representation over time by updating specific properties. It is a core ontology, but its classes can be extended to represent situations occurring in specific domains. In [17], several connected upper ontologies are proposed to describe different aspects of the scene, such as tracked entities, scene objects, activities, etc.

III. THE ONTOLOGY MODEL
This section presents the whole ontology design: firstly, the individual ontologies involved in the integrated formal model are introduced, according to the multi-layer knowledge schema of Figure 1; then, the whole model, with the relative conceptual alignments, is presented along with the generated semantic knowledge.

A. RAW SENSOR DATA LAYER
This layer represents the basic level, namely the 0-layer, to highlight the fact that it is an initial processing step on which the ontological model is based. It collects the input data from the UAV-recorded video, sensing the main actors of the scene and the environmental context. Video Analysis techniques are widely employed to accomplish this task: video tracking is performed to track the movements of the mobile scene objects, such as people, vehicles, etc.; target classification information is also returned for each detected scene object.
The output is an XML-based file including the information on the scene objects detected frame by frame. To detect the environment type, area classification is also provided for the types of ground areas present in the video. The classification results annotate each tracked object along with the area where it appears and the areas in its surroundings. In general, the XML file collects information such as bounding box dimensions and positions, speed, direction, object identity, area classification, etc. The ontology modeling approach assumes that the generated XML file is the result of accurate video tracking, object recognition and classification, in order to guarantee an effective nested knowledge generation layering. Deep learning and reinforcement learning are established techniques in Video Analysis and represent a solid basis on which to build our ontological modeling.
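As a minimal sketch of this layer-0 output (the approach does not prescribe a fixed schema, so all tag and attribute names below are hypothetical), a tracked person annotated with identity and area information could look like:

    <frame number="2">
      <!-- one entry per tracked bounding box in this frame -->
      <object id="1" class="Person">
        <bbox x="412" y="230" width="34" height="78"/>
        <speed value="1.4" unit="m/s"/>
        <direction value="87" unit="deg"/>
        <area type="Route"/>   <!-- area classification annotation -->
      </object>
    </frame>

Whatever the concrete schema, what matters for the upper layers is that each tracked object carries an ID, a class label and an area annotation, since these are the fields translated into ontology assertions at layer 1.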
The output of the Raw sensor data layer is, roughly, the set of main mobile and fixed objects present in the scene, annotated with their class labels. These data constitute the raw knowledge on the scene, on which our approach incrementally builds higher-level knowledge about the UAV-monitored scene.

B. OBJECT LAYER: TRACKPOI ONTOLOGY
The output of tracking, along with the target classification tasks, needs to be coded into semantic assertions. The TrackPOI ontology [7] is designed to describe road scenarios, where mobile and fixed objects move and interact with each other. Figure 2 shows the main classes and relations of the TrackPOI ontology. Our videos usually show road scenarios, but the layered knowledge process could easily be customized for different scenario types by plugging the appropriate domain ontology into this layer.
The mobile objects in the TrackPOI ontology are tracks annotated as people, vehicles, animals or things moved by people. The Track class, indeed, represents the bounding box marking the detected object (viz., the track) in each frame of the video. Therefore, each detected object in a frame sequence is modeled as a collection of instances of the Track class, identified by the same ID value. Track is a general class of the TrackPOI ontology and includes all the recognized moving objects. It is specialized to identify instances of its subclasses, such as the classes Person and Vehicle. Thus, according to the classification results, a Track instance can also be a Person, Vehicle or Unknown instance.
The fixed objects include environmental features, such as rivers, buildings, stores, etc. They are coded as Points of Interest (POIs) retrieved via the Google Maps service. In Figure 2, some fixed objects, namely Highway, Route, Park and Parking_lot, are represented as sub-classes of the class POI.
TrackPOI imports the GeoRSS ontology to model POI GPS data and also employs the Time ontology to represent the instant of a track instance.
TrackPOI also defines the spatio/temporal relations between tracks and POIs in a video scene. Relation modeling allows describing the interactions among tracks, and the track movements in the environment. According to the layered knowledge schema of Figure 1, TrackPOI models the first-layer knowledge, dedicated to describing the mobile objects of the scene, and is in charge of generating assertions on tracking data. TrackPOI provides the formal model to describe what appears in each frame, frame by frame. As stated, track objects identified by the same ID and appearing in a frame sequence represent the same physical object. Moreover, in terms of ontology coding, the axioms related to the object's presence in a time interval are replicated as many times as the number of frames. To this purpose, TrackPOI provides a further class, namely TrackPOI:ThingObject, that supports the conceptual abstraction of the object's presence over time by a digest, time-based axiom. Figure 3 shows the class TrackPOI:ThingObject, which is related to the class TrackPOI:Track by the relation TrackPOI:hasTrack; conversely, each TrackPOI:Track is part of (TrackPOI:trackOf) a TrackPOI:ThingObject. In other words, an instance of TrackPOI:ThingObject is the actual object appearing in the scene, described by a sequence of TrackPOI:Track instances (identified by the same ID) over time.
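As an illustrative sketch in Turtle (the namespace IRI and instance names are hypothetical; the class and property names follow the TrackPOI schema described above), a person tracked across two frames could be asserted as:

    @prefix trackpoi: <http://example.org/trackpoi#> .  # hypothetical namespace

    trackpoi:t_1_1 a trackpoi:Person .   # track ID 1, frame 1
    trackpoi:t_1_2 a trackpoi:Person .   # track ID 1, frame 2

    trackpoi:s_1 a trackpoi:ThingObject ;                # the physical person
        trackpoi:hasTrack trackpoi:t_1_1 , trackpoi:t_1_2 .

    trackpoi:t_1_1 trackpoi:trackOf trackpoi:s_1 .
    trackpoi:t_1_2 trackpoi:trackOf trackpoi:s_1 .

The per-frame Track instances carry the replicated, frame-level axioms, while the single ThingObject instance provides the digest, time-based view of the object.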

C. ACTIVITY/EVENT LAYER: ODP ONTOLOGY
The activities carried out by the main actors in the scenes are modeled by using an ontology design pattern [1] (briefly, ODP) that models the common core of activities in different domains. Figure 4 shows the Activity ODP schema with its classes and properties. According to the schema in the figure, a generic activity has a starting and a finishing time (described by the properties hasStart and hasEnd, respectively), represented as xsd:time; it lasts over time, and the range of the property hasDuration is xsd:duration, which represents the activity's time duration. Moreover, a generic activity can be composed of other activities. In fact, an activity individual, represented as an instance of the Activity:Activity class, can be related to its component activities through the hasPart property. The Activity:Activity class is connected by the relations Activity:hasRequirement and Activity:produces to the two main classes that characterize the activity, Activity:Requirement and Activity:Outcome, which represent the input and the output of the activity, respectively. These classes enable the modeling of a logical order among the activities.
Classes from external ontologies are also used to contextualize the activity. Accordingly, in the figure, the POI:place class models the place where the activity occurred. The foaf:Agent class represents the participants in the activity.
The ODP ontology has been employed to model knowledge on detected activities (which specialize this generic class) and to support the definition of higher-level complex activities.
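A minimal Turtle sketch of an activity instance under this pattern (namespaces and individual names are hypothetical) could be:

    @prefix activity: <http://example.org/activity#> .
    @prefix ex:       <http://example.org/scene#> .
    @prefix foaf:     <http://xmlns.com/foaf/0.1/> .
    @prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

    ex:act_1 a activity:Activity ;
        activity:hasStart "00:00:12"^^xsd:time ;       # starting time
        activity:hasEnd   "00:00:42"^^xsd:time ;       # ending time
        activity:hasDuration "PT30S"^^xsd:duration ;   # overall duration
        activity:hasParticipant ex:s_1 .               # the participating agent

    ex:s_1 a foaf:Agent .   # the scene object carrying out the activity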

D. SITUATION LAYER: STO ONTOLOGY
In common sense, a situation is often represented as a combination of circumstances in which someone or something finds itself, or a specific status with regard to conditions and circumstances. A situation can be a simple people's activity, or the effect caused by some complex events. In Situation Awareness [13], a situation is defined as the perception of some situational elements, the comprehension of their meaning and the projection of their state into the future. The Situation Theory Ontology (STO) models the fundamental concepts involved in situation theory [23]. Situation theory concerns the situation semantics developed by Barwise and Perry [3], [4], [19] to reason over commonsense and real-world situations. In this theory, a situation is composed of infons, elementary units of information that characterize a situation. More formally, an infon is defined on an n-ary relation R among n objects or individuals a_1, . . . , a_n, and is written as ⟨⟨R, a_1, . . . , a_n, 0/1⟩⟩. The infon represents a fact that can be true or false; the last argument in the infon definition (0/1) expresses its polarity. The relation R in the infon represents the type of event or action involving one or more individuals. The individuals a_1, . . . , a_n are entities (i.e., people, animals, etc.) that participate in the situation.
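For instance (an illustrative infon, not taken from the original text), the fact that a person p1 is crossing a route r1 at time t1 would be written as ⟨⟨crossing, p1, r1, t1, 1⟩⟩, where the final argument 1 states that the fact actually holds; polarity 0 would assert that it does not.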

E. ONTOLOGY INTEGRATION AND LAYERED KNOWLEDGE GENERATION
The ontologies are the building blocks of our layered knowledge schema, shown in Figure 1. They contribute to providing a high-level abstraction of the scene in a dynamic environment. Conceptual alignments or, more generally, portions of ontology merging and integration need to be harmonized in a comprehensive ontology model that reflects our schema. Figure 6 shows the final ontology schema, with the integration model design (the additional relations connecting the individual ontologies) in evidence. The figure strictly reflects the layered knowledge schema, from the bottom: Raw sensor data (layer 0), Scene object (layer 1), Activity/event (layer 2), Situation (layer 3).
Layer 0 provides the XML-based data describing the bounding boxes and their positions, as well as their class labels (e.g., whether the bounding box represents a person, a car, etc.), as described in Section III-A.
At layer 1, the data generated at the previous layer are translated into semantic assertions that describe the recognized mobile and fixed objects as instances of TrackPOI:Track and TrackPOI:POI, respectively, from the TrackPOI ontology. The track identifiers and class names are coded into semantic assertions: for example, the triple <t_1_2 a TrackPOI:Person> states that the track with ID:1 in the second frame (numbered 2) represents a Person (in other words, t_1_2 is an individual of the class Person). POIs collected via the Google Maps service, or detected by area classification at layer 0, are described by ontology assertions in a similar way.
Interactions between fixed (e.g., POIs) and moving objects are also identified at this layer. To this purpose, object positions with respect to a specific area, or just generic spatio-temporal relations occurring in the scene, are detected. Therefore, triples representing spatio-temporal relations among tracks are generated. Furthermore, at this layer, the identification of the scene objects, each composed of the tracks appearing in a frame sequence, is accomplished as individuals of the TrackPOI:ThingObject class. SPIN rules support the consolidation of the object movements and interactions, as well as the merging of the tracks associated with the same object (see Section III-B for details). For instance, the generated triple <s_1 a TrackPOI:ThingObject> represents the mobile scene object s_1 composed of the tracks with ID equal to 1 from the video frame sequence, such as <t_1_1 a TrackPOI:Person>, <t_1_2 a TrackPOI:Person>, <t_1_3 a TrackPOI:Person>, etc.
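SPIN rules can be expressed as SPARQL CONSTRUCT templates. A minimal sketch of the track-merging rule follows (prefix declarations are omitted, as in Listing 1; the IRI-minting scheme is a hypothetical choice, since SPIN engines differ on how new individuals are created):

    CONSTRUCT {
      ?ob a trackpoi:ThingObject .       # one scene object per track ID
      ?ob trackpoi:hasTrack ?track .
      ?track trackpoi:trackOf ?ob .
    }
    WHERE {
      ?track a trackpoi:Track .
      ?track trackpoi:track_ID ?id .
      # mint the ThingObject IRI from the shared track ID
      BIND(IRI(CONCAT("http://example.org/scene#s_", STR(?id))) AS ?ob)
    }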
At layer 2, SPARQL queries are designed to elicit activities, based on the generated TrackPOI:ThingObject instances and the spatio-temporal relations among tracks. More specifically, the queries allow the detection of high-level activities over time [10]. The detected activities are represented as instances of the Activity:Activity class, and new triples are generated. These triples relate the activity to the thing objects that carried out or participated in the activity and to the place where it happened. In the figure, for instance, a generic activity act_1 is characterized by the participant in that activity (the thing object named s_1), the place where it occurs (the POI o_2), and the starting and ending times (at seconds 0.12 and 0.42, respectively).
Let us notice that layer 1 and layer 2 are joined by new additional relations (isEquivalentTo) that connect similar concepts from the ontologies TrackPOI and ODP, respectively. More specifically, a TrackPOI:ThingObject instance is the high-level object that carries out the activity; since it represents the main participant of the activity, the class is equivalent to the foaf:Agent class.
In this way, through the Activity:hasParticipant property (which connects the Activity:Activity class to the foaf:Agent class), the activity (i.e., an Activity:Activity instance) is related to the object doing it (i.e., a TrackPOI:ThingObject instance).
Similarly, the TrackPOI:POI and POI:place classes are equivalent and are related to the Activity:Activity class through the property Activity:takesPlaceAt. At layer 3, the high-level ontology STO is in charge of the situation description. Figure 6 shows the connection with the two underlying layers. As stated in Section III-D, the STO:Individual class in the STO ontology models entities (i.e., people, animals, etc.) that carry out activities or are involved in events and situations. The TrackPOI:ThingObject class represents the same concept as STO:Individual (i.e., the two classes are assumed to be equivalent). The Activity:Activity class exclusively represents activities carried out by one or more scene objects. Activities are also modeled in the STO ontology by the STO:Relation class. The Activity:Activity class is therefore designed as a subclass of STO:Relation, which directly connects the ODP ontology to the STO ontology.
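These alignments can be rendered as standard OWL axioms (the figure labels the equivalence links isEquivalentTo); a Turtle sketch, with hypothetical namespaces, is:

    @prefix owl:      <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix trackpoi: <http://example.org/trackpoi#> .
    @prefix activity: <http://example.org/activity#> .
    @prefix sto:      <http://example.org/sto#> .
    @prefix poi:      <http://example.org/poi#> .
    @prefix foaf:     <http://xmlns.com/foaf/0.1/> .

    trackpoi:ThingObject owl:equivalentClass foaf:Agent ;       # layer 1 - layer 2
                         owl:equivalentClass sto:Individual .   # layer 1 - layer 3
    trackpoi:POI         owl:equivalentClass poi:place .        # layer 1 - layer 2
    activity:Activity    rdfs:subClassOf     sto:Relation .     # layer 2 - layer 3

With these axioms in place, a reasoner propagates every asserted ThingObject, POI and Activity instance across the three layers without duplicating individuals.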
When new Activity:Activity instances are generated at layer 2, the same instances are also of type STO:Relation. At layer 3, infons are produced on each generated STO:Relation instance. Precisely, an instance of STO:ElementaryInfon is yielded for each detected activity type in Activity:Activity (equivalent to STO:Relation). These instances represent the detected activities along with the time, location and participants of the activity. Concatenations of infons, defined by SPIN rules, allow the definition of high-level situations. For instance, given the infons Infon_1 and Infon_2 and the situation Sit_1 defined by the rule R: Infon_1 ∧ Infon_2 ⟹ Sit_1, if the two infons Infon_1 and Infon_2 are generated, the rule R allows the detection of the situation Sit_1.
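A rule of the form R can again be sketched as a SPARQL CONSTRUCT (the STO property names and the IRI-minting below are hypothetical illustrations, not the paper's exact vocabulary):

    CONSTRUCT {
      ?sit a sto:Situation .               # the inferred situation Sit_1
      ?sit sto:relevantInfon ?i1 , ?i2 .   # built from the two infons
    }
    WHERE {
      ?i1 a sto:ElementaryInfon ;
          sto:relation ex:Activity_A .     # Infon_1: first activity type
      ?i2 a sto:ElementaryInfon ;
          sto:relation ex:Activity_B .     # Infon_2: second activity type
      # mint a fresh situation IRI for each matching infon pair
      BIND(IRI(CONCAT("http://example.org/scene#Sit_", STRUUID())) AS ?sit)
    }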

IV. A CLOSER LOOK AT THE INCREMENTAL ONTOLOGY MODELING: A SCENARIO EXAMPLE
This section presents a case study, on a real-world video, showing the applicability of the proposed ontology modeling and its effectiveness in scene description. Figure 7 shows the generation of the ontology population through the layers of the knowledge schema, starting from the initial raw data and yielding a high-level scenario description. The video frames, at layer 0, show a typical outdoor scenario recorded by a camera-equipped UAV. A vehicle is running while a person is crossing and another person is walking on the lawn beside the road. As stated, data retrieved by sensors and tracking algorithms allow us to recognize the targets in the scene. The tracking algorithm used in this case study estimates camera movements for background scene extraction and identifies object positions. Moreover, feedforward control [7] has been used to improve the trajectory tracking of objects through frames. In the example, the tracking algorithm returns the objects identified by id:0, id:1 and id:2. Then, classification algorithms have been employed for object and background area annotation. Our object classifier considers three object categories: people, vehicles and unknown objects. The object classification is performed frame by frame, and the object label is then obtained through a majority voting approach [7]. The classification results are used to annotate each detected scene object, adding a class-type field expressing its identity.
The area classifier detects the main background environments (e.g., lawn or road) where the objects stay or the places they get close to [10]. Identity and area annotations on scene objects are added as attributes to the tags expressing the tracked objects in the original tracking output file. Tracking and classification data are then encoded into ontology assertions [7], generating actual instances of the TrackPOI ontology. At layer 1, for each frame, the instances of Vehicle and Person are created. In the frame numbered 1, the generated instances TrackPOI:Track_0_1, TrackPOI:Track_1_1 and TrackPOI:Track_2_1 represent the tracks produced at layer 0 and are individuals of the TrackPOI classes TrackPOI:Vehicle and TrackPOI:Person. Considering the video frames, it is possible to follow the same track through the frames.
Tracks with the same ID are grouped into a unique dynamic entity (i.e., thing object) representing the mobile object in the scene. For instance, the instances TrackPOI:Track_1_1, TrackPOI:Track_1_2 and TrackPOI:Track_1_3 represent the tracks with ID equal to 1 in frames 1, 2 and 3, respectively. These tracks, representing the same instance of the TrackPOI:Person class through the frames, are grouped to build the TrackPOI:ThingObject_1 instance of the class TrackPOI:ThingObject. At the same time, the generated TrackPOI:Track instances are related to the TrackPOI:POI instances representing the environments where they move, through the TrackPOI:inArea property. Through this property, the tracks of the vehicle and of the person with ID:1 are found in the area of the route, while the other person, with ID:2, is found on the lawn beside the route. These spatial relations are also timed, because they are related to a specific frame. Therefore, the generated spatio/temporal relations support the contextualization of the object movements and interactions with other objects. The outcome of layer 1 is the identification of three objects (belonging to the class TrackPOI:ThingObject) and their relations with the places where they appear (i.e., the route and the lawn).
At layer 2, some rules are designed on the TrackPOI:ThingObject instances and the spatio/temporal relations. Collecting data on the objects and their spatio-temporal relations, by SPARQL reasoning, activities are detected. In the figure, some specialized activities are shown: they are carried out by the two people and the vehicle, as shown at layer 2 of Figure 7. More precisely, the following activities are elicited: Activity:_0_vehicleStopping, Activity:_1_ManOnTheRoad and Activity:_2_ManOnTheLawn. At a high level of description, the observed scenario shows a vehicle stopping (Activity:_0_vehicleStopping) when a person crosses the route (Activity:_1_ManOnTheRoad). Then, the other person is simply walking in the lawn area (Activity:_2_ManOnTheLawn).

    SELECT ?ob ?track ?time ?poi
    WHERE {
      ?track a trackpoi:Person .
      ?track trackpoi:inArea ?poi .
      ?poi a trackpoi:Route .
      ?track trackpoi:hasTime ?time .
      ?track trackpoi:track_ID ?id .
      ?track trackpoi:trackOf ?ob .
    } ORDER BY ?id ?time

Listing 1: manOnTheRoad activity: SPARQL query for detecting people on the road

As a SPARQL query example for activity definition, let us consider the query to detect the activity instance Activity:_1_ManOnTheRoad, shown in Listing 1. The SPARQL query detects people walking on the road over the video time. This query makes it possible to create an instance of a specific class Activity:ManOnTheRoad, subclass of Activity:Activity, for each track that carried out this activity. The query returns a list of tracks ordered by their ID and by the time when they appear in the video. The TrackPOI:trackOf property supports the identification of the person (a TrackPOI:ThingObject instance) walking on the road, while its track time serves the detection of the times of entrance to and exit from the road.
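The creation of the activity instances themselves can be sketched as a CONSTRUCT variant of Listing 1 (the activity vocabulary is the ODP one discussed in Section III-C; the IRI-minting scheme is hypothetical):

    CONSTRUCT {
      ?act a activity:ManOnTheRoad ;       # subclass of activity:Activity
           activity:hasParticipant ?ob ;   # the ThingObject on the road
           activity:takesPlaceAt ?poi .    # the route where it happens
    }
    WHERE {
      ?track a trackpoi:Person ;
             trackpoi:inArea ?poi ;
             trackpoi:trackOf ?ob .
      ?poi a trackpoi:Route .
      # one activity instance per scene object found on the road
      BIND(IRI(CONCAT("http://example.org/scene#act_",
                      STRAFTER(STR(?ob), "#"))) AS ?act)
    }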
At layer 3, the scene description becomes concise and reaches a very high level of abstraction. Situation Theory is applied to the detected activities and scene objects to abstract knowledge from them and provide high-level situations describing the whole scene. Infons are generated on the detected activities and scene objects to relate all the information and build situations. The situations are defined by SPIN rules as concatenations of infons. The outcome of layer 3 is the infons Infon_1 and Infon_2, generated in correspondence with the activities Activity:_0_vehicleStopping and Activity:_1_ManOnTheRoad, respectively. The SPIN rules define a situation, namely STO:_0_vehicleStopToLetPeopleCross, that comes from the concatenation of these infons in the road context. This situation exactly captures the main action happening in the road scenario and provides a human-oriented, high-level view of the scene.
The proposed ontology modeling provides a systematic way to feed a knowledge base describing a video, ranging from the identification of the individual objects to the occurring activities, up to incrementally achieving a general, high-level scenario description.
In order to assess the applicability of this approach and its effectiveness in terms of scenario description, some videos have been processed as described in the case study. Three videos recorded on our campus have been processed: they show people and vehicles carrying out some activities in different environments, such as roads, lawns and heliports. Table 1 shows the results of the application of the proposed ontology model, according to the multi-layer knowledge schema. The table provides the video content description: specifically, for each video, the situations and the activities that compound these situations are shown in the time interval when they occur. Then, each activity includes the thing object that carried out the activity, the thing object type, the POI where the activity happened, and the activity beginning and ending times. Figure 8 shows one of the situations recognized in each of the three videos (i.e., people grouping from Video #1, people crossing from Video #2, people moving on a heliport from Video #3). Situations are described exactly by the time interval in which they occur, expressed by the starting and ending frames. Let us notice that, by comparing the situations, objects and times in the figure with the table results, the detected situations correspond to those found in the videos. For instance, looking at Video #2 in Table 1, the recognized situations are Sit_3_ManCrossing, Sit_1_Grouping, Sit_2_Grouping, and Sit_0_VehicleStopstoLetPeopleCross.
Video #2 shows a road scene with people grouping, and a crossing happening in the presence of an oncoming vehicle (see Figure 8). In Table 1, for Video #2, the situation Sit_3_ManCrossing is produced by the individual activity 0_ManCrossing (in the Activity column); the situations Sit_1_Grouping and Sit_2_Grouping are described by the grouping activities (identified as 0_Grouping, 1_Grouping, 2_Grouping and 3_Grouping). Each grouping activity can be carried out by only one thing object, so there are as many grouping activities as there are thing objects involved in the grouping. The thing objects named TO_3 and TO_1 are both recognized as persons (Type column) and participate in the situations Sit_2_Grouping and Sit_1_Grouping. More interesting is the situation Sit_0_VehicleStopstoLetPeopleCross, which represents a vehicle stopping to let people cross the road, described by the activities 1_ManOnTheRoad and 4_Stopping. The two activities involve two thing objects recognized as a person (TO_1) and a vehicle (TO_2). Situations and activities last a certain amount of time, from a starting to an ending time (Start and End columns in the table). The starting and ending times allow us to describe the temporal succession of the situations detected in the video. Video #2 indeed initially shows two people grouping (Sit_2_Grouping), then moving away from each other; one of them crosses the street (Sit_3_ManCrossing) while an oncoming vehicle stops to let the person cross (Sit_0_VehicleStopstoLetPeopleCross); in the end, the people meet again (Sit_1_Grouping) (see Figure 8).

V. CONCLUSION
This paper introduces a novel knowledge modeling of a video scenario recorded by a UAV. Rather than using only tracking and classification methods to detect targets and their movements, the use of Semantic Web technologies supports the enrichment of the scenario description, reflecting the way a human observes a scene. The approach presents a systematic ontology-based design process based on the introduced multi-layer knowledge schema, which composes the scene incrementally at a high level of abstraction. The layered knowledge model indeed allows feeding knowledge on the scene incrementally, from tracked data up to the situations describing the scene. The integrated ontology model exploits the features of several well-known ontologies to thoroughly model different aspects of the scene and achieve complete scene comprehension. Data tracking, along with activity and situation (theory) modeling, supports the three levels underpinning Situation Awareness: Perception (collecting raw sensing data), Comprehension (seeking the main actors in the scene, e.g., objects and the activities they carry out), and Projection (assessing possible critical issues from the detected situations).
The proposed ontology design is a kind of guideline that, reflecting the multi-layer knowledge schema, produces a formal knowledge model as well as the semantic description of an observed scene.
In the light of the recent literature on situation comprehension, the main benefits of the proposed approach are briefly listed below.
• An ontology design pattern for scenario understanding. The whole ontology can be considered a sort of ontology design pattern, coming from the modeling and integration of ontologies intended to portray the layering of our proposed knowledge schema described in Figure 1. In particular, the ontologies ODP and STO are indeed ontology design patterns, in charge of covering the Activity and Situation layers, respectively. The Object layer is the only one achieved with a domain ontology and, for this reason, it can easily be replaced with another ontology if a different video context appears (for example, video scenes taking place in an environment other than a road scenario).

• A modular design process for easy methodological integration. The ontology design not only offers seamless extensibility at the ontology design level, but the modular layering also guarantees high flexibility and interchangeability of the methodological approaches for target tracking and classification in the Raw sensor data layer. The employment of high-performance Machine and Deep Learning methods for target tracking and classification tasks, for example, can enhance the effectiveness of the global system. Depending on the computer vision methods used in the Raw sensor data layer, the ontology model can combine/compound more or less accurately detected scene objects, in order to produce higher-level scene descriptions.

• A knowledge base to support video content analysis.
The ontology model allows populating a knowledge base describing the video content, collecting, according to the layering of the knowledge schema, the information granule associated with the corresponding knowledge layer. The knowledge base is accessible by SPARQL queries: objects, activities, and situations appearing in a video (or in a portion of it) can easily be recovered by a query. The collected knowledge becomes a flexible repository to facilitate video content analysis targeted, for instance, at surveillance and monitoring applications.

• A human-oriented scenario description. The role of semantics is crucial in the scenario description: modeling a situation as a composition of activities and, in turn, an activity as spatio-temporal relations among objects and between the object and the environment enables the logical "thinking" process for understanding what is really happening in a scene and explaining why a particular conclusion is reached. The logic behind a situation can yield a human-like video content description along with the reasoning steps that build the situation.
The proposed approach provides semantic support for object detection and scenario description when used in combination with Machine and Deep Learning methods, a synergy that provides solid performance. Future directions will focus on knowledge synthesis methods to analyze the collected information over time, in order to reduce the knowledge base dimensions and, consequently, speed up system performance.
DANILO CAVALIERE received the master's degree in computer science from the University of Salerno, Italy, in 2014. Since 2015, he has been a predoctoral fellow at the Computer Science Department and then at the Department of Information and Electrical Engineering and Applied Mathematics (DIEM). He is currently a PhD student at the University of Salerno, Italy. His research interests are in the areas of artificial and computational intelligence, machine learning, data mining and knowledge discovery, areas in which he has published several papers.