Machine learning methods for forest image analysis and classification: a survey of the state of the art

The advent of modern remote sensors alongside the development of advanced parallel computing has significantly transformed both the theoretical and real implementation aspects of remote sensing. Several algorithms for detecting objects of interest in remote sensing images and subsequent classification have been devised, and these include template matching based methods, machine learning and knowledge-based methods. Knowledge-driven approaches have received much attention from the remote sensing fraternity. They do, however, face challenges in terms of sensory gap, duality of expression, vagueness and ambiguity, geographic concepts expressed in multiple modes, and semantic gap. This paper aims to review and provide an up-to-date survey on machine learning and knowledge driven approaches towards remote sensing forest image analysis. It is envisaged that this work will assist researchers in coming up with efficient models that accurately detect and classify forest images. There is a mismatch between what domain experts expect from remote sensing data and what remote sensing science produces. Such a mismatch or disparity can be reduced or alleviated by adopting an ontology paradigm methodology. Ontologies should be used to support the future of remote sensing in forest object classification. The paper is presented in five parts: (1) a review of methods used for forest image detection and classification; (2) challenges faced by object detection methods; (3) analysis of segmentation techniques employed; (4) feature extraction and classification; and (5) performance of the state-of-the-art methods employed in forest image detection and classification.


I. INTRODUCTION
REMOTE sensing science is rapidly growing. The evolution of high spatial resolution remote sensors in conjunction with advanced computing has significantly transformed the specification and practice of remote sensing [1]. High spatial resolution remote sensing images provide more explicit information on the earth's surface than medium and coarse resolution images [2]. Machine learning methods for analyzing and classifying forest images are continuously evolving to provide more advanced automatic land cover pattern recognition on aerial images. This paper surveys existing methods for forest ecosystem image classification. In particular, machine learning classifiers and deep learning techniques for forest image classification are reviewed.
Several algorithms are geared towards detecting objects of interest in remote sensing images for further regional analysis and classification. These algorithms are categorized into three groups, namely template matching based, knowledge based, and machine learning based methods [3]. The taxonomy of image classification methods is depicted in Figure 1.

(a) Template matching based detection methods
The template matching method determines whether an image contains a previously defined object, or whether a predefined sub-image (template) has an exact match in an image. Although this method provided one of the first approaches for object analysis [3], its dependence on handcrafted matching criteria limited its applicability to complex object recognition. Once a suitable template is determined, a measure of matching between the template and every possible location in the image is calculated, and a classification decision is made based on the measure of certainty. The most popular metric based measures are the Euclidean distance, squared difference, and cross correlation, defined in Equations (1) to (3).

FIGURE 1. Methods for object detection in optical sensing images [3]

There are two types of templates, namely, global and local templates. When a template is used to reference the whole (global) object in an image, it is referred to as a global template. However, when local features of an object in an image are referenced with several templates, these templates are referred to as local templates [4]. Figure 2 shows the stages to be followed to determine the best templates for object detection. The challenge with this approach is that the method does not cater for the scale and orientation of the template [5]. It fails in the presence of occlusions and distortions on the boundary [6], and it is very sensitive to changes in shape and viewpoint. The suggested solution was to represent the template separately for each orientation and scale, but this becomes computationally expensive.
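The three matching measures named above can be sketched in a few lines. This is a minimal illustration using the usual textbook forms of the squared difference, Euclidean distance and (unnormalised) cross correlation, which may differ in detail from Equations (1) to (3); the helper names and the toy `IMAGE` and `TEMPLATE` are assumptions made for the example.

```python
from math import sqrt


def patch_at(image, top, left, height, width):
    """Return the height x width window of `image` with top-left corner (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def squared_difference(template, patch):
    """Sum of squared pixel differences; 0 means an exact match."""
    return sum((t - p) ** 2
               for t_row, p_row in zip(template, patch)
               for t, p in zip(t_row, p_row))


def euclidean_distance(template, patch):
    """Square root of the squared difference."""
    return sqrt(squared_difference(template, patch))


def cross_correlation(template, patch):
    """Unnormalised cross correlation; larger means more similar."""
    return sum(t * p
               for t_row, p_row in zip(template, patch)
               for t, p in zip(t_row, p_row))


def best_match(image, template, score, higher_is_better=False):
    """Slide the template over every location and return the best (top, left)."""
    height, width = len(template), len(template[0])
    scored = [
        (score(template, patch_at(image, top, left, height, width)), (top, left))
        for top in range(len(image) - height + 1)
        for left in range(len(image[0]) - width + 1)
    ]
    pick = max if higher_is_better else min
    return pick(scored, key=lambda item: item[0])[1]


# Hypothetical 4 x 4 image containing the 2 x 2 template at row 1, column 1.
IMAGE = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 9, 0],
    [0, 0, 0, 0],
]
TEMPLATE = [[9, 8], [7, 9]]

print(best_match(IMAGE, TEMPLATE, squared_difference))
print(best_match(IMAGE, TEMPLATE, cross_correlation, higher_is_better=True))
```

The exhaustive scan over every location, repeated for each scale and orientation of the template, is what makes the approach computationally expensive. The unnormalised cross correlation also favours bright regions, which is why normalised variants are commonly preferred in practice.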

FIGURE 2. Template matching based criteria [3]

(b) Machine Learning based approaches
An input image first goes through an initial phase in which regions or objects are extracted. Then, for each object, features of interest are computed using Convolutional Neural Networks (CNN). Optimal features are obtained after a subsequent series of feature fusion and dimension reduction processes. Finally, classifiers such as Support Vector Machines (SVM), k-Nearest Neighbor (kNN), Sparse Representation based Classification (SRC), AdaBoost, Conditional Random Fields (CRF), and others are used to classify each region/object. Figure 3 shows the main phases of machine learning object detection, i.e., feature extraction, feature fusion and dimensionality reduction, and classification. Machine learning-based approaches coupled with innovative algorithms and higher performance computing seem to have gained popularity in remote sensing science because they produce better results considering the accuracy of the created maps [1]. As a result, they are used in big land cover applications that rely on pixel based statistical analysis of massive image data sets [7][8][9].

FIGURE 3. Machine learning methods [3]

Pixel based approaches pose challenges in the analysis of high spatial resolution images [10] because they rely on spectral information for analyzing and classifying high spatial resolution images, neglecting spatial and temporal information, which are of paramount importance. These methods are less efficient in dealing with symbolic knowledge, that is, when concepts are characterized by symbols, for instance, vegetation is made of grass [11]. They also do not allow a super class to be created once classes have been defined. Suppose one has defined the following classes of interest: "trees", "grass", "road", and "building". It will then be impossible for the user to define a vegetation class unless it was defined beforehand as a super class. The methods do not offer the facility to add spatial rules [12]; for instance, "grass" cannot be found inside a building, but can be found in a field. Because of this reasoning limitation, data-driven approaches are unsuitable for use in application areas, such as ecology, that deal with earth observation.

(c) Knowledge based detection methods
These methods have been applied to landslides, crops, urban land change and forests [13][14][15][16]. Figure 4 shows the process: an input image goes through a hypothesis generation phase, and the hypothesis is validated and tested using the established knowledge and rules. The validation results are then post-processed and passed to machine learning for final object detection. Knowledge and rules from geometric information and context information are used to test the validity of the hypothesis generated from an input image; if the hypothesis is valid, it is subjected to machine learning for object detection [17]. Generally, there are two types of knowledge that have been used on target objects, and these are geometric knowledge and context knowledge.

FIGURE 4. Knowledge based object detection systems [3]

(a) Geometric Knowledge
This type of knowledge is the most important in a knowledge-based approach and is widely used for object detection. It encompasses generic shape models or parametric specifics. For instance, it is proposed in [18] that buildings are square or composed of rectangular segments, and these shapes are used as conventional shape models to distinguish buildings.

(b) Context knowledge
Context knowledge is very important for key objects, and it is expressed by rules derived from relationships between objects of interest and their respective backgrounds [14][19][20]. For instance, shadow evidence has been used for building detection [21], and the correlation between artificial structures such as buildings and their respective shadows has been used to project the locations and shapes of buildings [22]. Recently, knowledge-driven approaches seem to be the direction taken by the remote sensing science community [3] since they incorporate domain expert knowledge.

VOLUME 4, 2016

Geographic Object Based Image Analysis (GEOBIA), which classifies image objects based on a priori domain expert knowledge, is proving to be a key trend in remote sensing image analysis. GEOBIA is a classification technique that divides a remote sensing image into objects of interest and evaluates the objects based on their spectral, temporal and spatial characteristics. The objects of interest are generated using different segmentation approaches such as random walker, Canny, histogram-based segmentation, etc. An algorithm is deemed effective in segmentation if and only if a segmented image object completely matches the corresponding Actual Image Region (AIR) of a scene object. [2] proposed a blend of area coincidence methods and boundary coincidence methods for assessing segmentation quality. The area coincidence methods select the image object that has the dominant or largest area of intersection with the AIR. The boundary coincidence methods calculate the distance between a point of interest in a segmented image and its corresponding point on the AIR. The segmentation quality is high when the measured distance is close to zero. Segmentation evaluation methods can be either unsupervised or supervised. Supervised techniques evaluate a segmented image against a ground truth image, also referred to as the reference image.
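The two coincidence ideas can be illustrated with regions stored as sets of pixel coordinates. This is a sketch under stated assumptions, not the procedure of [2]: the boundary measure used here (mean distance from each segment boundary pixel to the nearest AIR boundary pixel) is one possible concretisation, and the helper names and sample regions are hypothetical.

```python
from math import hypot


def block(rows, cols):
    """Build a rectangular region as a set of (row, col) pixels."""
    return {(r, c) for r in rows for c in cols}


def area_coincidence(segments, air):
    """Return the label of the segment sharing the most pixels with the AIR."""
    return max(segments, key=lambda label: len(segments[label] & air))


def boundary(region):
    """Pixels of the region that have at least one 4-neighbour outside it."""
    return {
        (r, c) for (r, c) in region
        if any(n not in region
               for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))
    }


def mean_boundary_distance(segment, air):
    """Average distance from segment boundary pixels to the nearest AIR boundary pixel."""
    seg_edge, air_edge = boundary(segment), boundary(air)
    total = sum(min(hypot(r - ar, c - ac) for ar, ac in air_edge)
                for r, c in seg_edge)
    return total / len(seg_edge)


# Hypothetical 3 x 3 reference region and two candidate segments.
AIR = block(range(0, 3), range(0, 3))
SEGMENTS = {
    "a": block(range(0, 3), range(0, 2)),
    "b": block(range(0, 3), range(2, 5)),
}

print(area_coincidence(SEGMENTS, AIR))
print(mean_boundary_distance(SEGMENTS["a"], AIR))
```

A segment identical to the AIR yields a boundary distance of exactly zero, which corresponds to the "close to zero" condition for high segmentation quality.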
Unsupervised evaluation depends solely on the segmented image, since it assesses the extent to which the image matches the desirable features of a good segmentation. [23] proposed four metrics (Equations 4-7) for assessing supervised segmentation quality, namely F-measure; SUM, which should be less than 2; ED, which indicates the distance to point (0,0) in the space; and ED', which indicates the distance to point (1,1) in the space.
Two other metrics that take into account the over- and under-segmentation errors, GOSE and GUSE, respectively, are proposed in [24]. Rand Error (RE) is another widely used metric for evaluating supervised approaches. RE is defined in Equation 8 [25]. Let R_1 and R_2 be segmentation regions of image S with t pixels, such that the following holds:
• n corresponds to the number of pixels in image S that appear in both R_1 and R_2
• m corresponds to the number of pixels in image S that are in neither R_1 nor R_2
A criterion for unsupervised techniques that balances intra-segment homogeneity and inter-segment heterogeneity is proposed by Wang et al. [26], as in Equation 9, where T and D represent intra-segment homogeneity and inter-segment heterogeneity, respectively. Another metric for unsupervised techniques, proposed by Gao et al. [27], is the Global Score (GS). GS incorporates weighted variance (WV) and Moran's I and is defined in Equation 10.
The final step of GEOBIA is image classification. The common image classifiers for GEOBIA are Random Forest (RF), Support Vector Machines (SVM), k-Nearest Neighbor (kNN) and Naive Bayes (NB) [28]. Figure 5 shows the GEOBIA workflow of [28], which implemented three different segmentation algorithms, namely Large Scale Mean Shift (LSMS) in OTB, the Shepherd segmentation algorithm in RSGISLib and the Multi-resolution segmentation (MRS) algorithm in eCognition. However, GEOBIA solutions do not give answers to every segmentation problem. Even though GEOBIA is more efficient than pixel-based approaches, segmenting a multi-spectral image made up of thousands of megapixels remains a challenging task [29]. Another drawback of GEOBIA is that it approximates, to some extent, computer-aided photo interpretation, which has been criticized as being highly subjective [5]. Nevertheless, in the last decade, knowledge-driven techniques like GEOBIA have gained traction as a means of bridging the gap between implicit data representation and end-user needs. Knowledge-driven approaches consist of translating symbolic knowledge, expressed in a format understandable by humans, into numerical knowledge.
Vegetation indices obtained from satellite images provide valuable information which is essential for the mapping of vegetation. The Normalised Difference Vegetation Index (NDVI) has proven to be a valuable tool, particularly in tropical dry forests, where it serves as a foundation for estimating overall green biomass, tree density, and species diversity [30][31][32]. NDVI is an indicator of the green component in the analyzed satellite images. NDVI provides a balance between the energy received and the energy emitted by objects on the earth's surface [33].
In the context of plant communities, it indicates how green an area is, which is influenced by the quantity of vegetation in that particular area and its state of health. The NDVI values range from -1 to +1. Values below 0.1 correspond to water bodies and bare ground, while higher values indicate the presence of agricultural activities, temperate forests, and rain forests [33]. The NDVI values can be used to group the vegetation ecosystems into four major categories as follows [34]: forests made up of semi-deciduous and evergreen trees have NDVI ≥ 0.71; woodlands are defined by the range 0.61 ≤ NDVI < 0.71; a mixed class composed of a) shrub land, b) woodland/shrub land/exposed lands, and c) cactus forest has the range 0.49 ≤ NDVI < 0.61; and dwarf woodland and shrub land have NDVI < 0.49. For instance, a forest concept made up of semi-deciduous and evergreen trees is described by high NDVI values, and when translated into numerical knowledge, it is implemented by the classification rule set: Forest = (NDVI ≥ 0.71). Figure 6 shows the symbolic to numerical knowledge conversion.
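The symbolic to numerical conversion can be made concrete with a short sketch. The NDVI formula below is the standard normalised difference of near-infrared and red reflectance. The cut-offs 0.71, 0.61 and 0.49 are taken from the ranges quoted above and are tested from highest to lowest so that every value receives exactly one class; that ordering, the class labels and the function names are assumptions of this example.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # no signal in either band; treated as neutral here
    return (nir - red) / total


def vegetation_class(value):
    """Map an NDVI value to one of the four vegetation categories."""
    if value >= 0.71:
        return "forest"
    if value >= 0.61:
        return "woodland"
    if value >= 0.49:
        return "mixed shrub land / woodland / cactus forest"
    return "dwarf woodland and shrub land"


def is_forest(value):
    """The rule set 'Forest = (NDVI >= 0.71)' as a predicate."""
    return value >= 0.71


# Hypothetical reflectance pairs (near-infrared, red).
for nir, red in [(9, 1), (8, 2), (5, 5)]:
    value = ndvi(nir, red)
    print(nir, red, round(value, 2), vegetation_class(value))
```

The symbolic label "high NDVI" lives in the class names, while the numerical knowledge lives in the thresholds, which is the separation that Figure 6 describes.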

A. DIFFERENT MODES OF DEFINING GEOGRAPHIC CONCEPTS
A geographical concept can be defined from different perspectives; the definition might be based on a physical, historical, functional, or conventional mode [35]. Different ways of defining the same geographic concept give rise to alternative perspectives on it; for example, a concept can be characterized by alternative definitions that are not identical, even though they are normally correlated [36]. From a functional viewpoint, a forest primarily acts as a repository for storing carbon, a role that correlates with its Net Primary Productivity (NPP) values. Forests can also be defined based on physical attributes such as vegetation cover, phenology, vegetation, age, etc. A tremendous effort is still under way to standardize land cover classes in land cover classification systems (LCCS) [37]. The term "forest" is defined differently by different organizations and countries; for instance, in Brazil a forest is an area that exceeds one hectare, is characterised by a 30% canopy, and is composed of trees with a minimum height of 5 m [38]. A forest in China is defined as an area larger than 0.67 hectares in size, with at least 20% crown cover and trees standing at least 2 meters tall. The Food and Agriculture Organization (FAO) standardised the definition of forest to refer to a land area spanning over 0.5 hectares covered by trees of 5 m and above, with a canopy cover of 10%. This definition excludes land under agricultural or urban land use [39].
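The effect of these competing definitions is easy to see when they are written as thresholds. The sketch below encodes the three definitions quoted above; treating the area bound as strict and the canopy and height bounds as inclusive is an assumption, the land-use exclusion in the FAO definition is not modelled, and the names `ForestDefinition`, `DEFINITIONS` and `is_forest` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestDefinition:
    min_area_ha: float        # area must exceed this value
    min_canopy_pct: float     # canopy (crown) cover must reach this value
    min_tree_height_m: float  # trees must reach this height


DEFINITIONS = {
    "Brazil": ForestDefinition(1.0, 30.0, 5.0),
    "China": ForestDefinition(0.67, 20.0, 2.0),
    "FAO": ForestDefinition(0.5, 10.0, 5.0),
}


def is_forest(area_ha, canopy_pct, tree_height_m, definition):
    """Check one land parcel against one definition of 'forest'."""
    return (area_ha > definition.min_area_ha
            and canopy_pct >= definition.min_canopy_pct
            and tree_height_m >= definition.min_tree_height_m)


# Hypothetical parcel: 0.8 ha, 25% canopy cover, 6 m trees.
for name, definition in DEFINITIONS.items():
    print(name, is_forest(0.8, 25.0, 6.0, definition))
```

The same parcel is a forest under the Chinese and FAO definitions but not under the Brazilian one, which illustrates why a shared, explicit conceptualisation is needed before classification results can be compared.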

B. DUALITY OF GEOGRAPHIC CONCEPTS
Two major terms arise from the concept of duality, that is, scene and image. A scene is real and exists on the ground, whereas an image is a collection of spatially arranged measurements drawn from the scene [1]. Components obtained from images are regarded as abstractions of real objects in the ground scene [40]. Forest concepts can be viewed either from a real world perspective (a forest concept is characterized by high NPP values) or from image properties (a forest concept is defined by high NDVI values).
In the case of forests, the assertion that NDVI is correlated with NPP is not always valid because NPP lacks information on attributes such as vegetation height, vegetation cover, and so forth. This anomaly is also referred to as the sensory gap. Sensory gaps therefore cause uncertainty in describing geographic objects [41].

C. VAGUENESS AND AMBIGUITY OF GEOGRAPHIC CONCEPTS
The process of connecting a range of attribute values (for instance, "high" NDVI) to a geographic concept (e.g. the forest concept) is not easy [1]. The reason is that the associated value "high" is qualitative in nature, so the obtained classification rule becomes vague and ambiguous. Some pixels (image objects) inside the image in Figure 7 are not classified as forest, although they are forest areas on the ground. When qualitative terms like "high" are employed to identify objects with sharp, crisp boundaries, threshold ambiguity occurs. Qualitative description of geographic objects raises partiality issues. For instance, the symbolic classification rule "high NDVI" only partially describes the forest because:
(a) It is very difficult to single out only forests in areas that have other crops with the same NDVI values, as in Figure 9.
(b) It is also impossible to classify all the varying types of forests because, in some cases, there are forests that have "low NDVI" values, such as the degraded forests in Figure 8.
Ambiguity arises in all cases where a natural language expression can have various meanings [42]. The most common form is lexical ambiguity, which emerges from homonymy in natural language, that is, an expression with more than one meaning, such that each meaning points unambiguously to only one ontological concept [42]. More than 800 different definitions of forest concepts are provided in [43]. Deep ambiguity, also referred to as open texture, exists where there is no clear boundary between concepts or terms, or where the meaning of a concept changes over time, for instance, when a new technology appears or the physical or social context of the term evolves.

D. SEMANTIC GAP
It arises from the vagueness and ambiguity of geographic concepts. It is defined as a mismatch between data extracted on the basis of visual information and the interpretation drawn from the same data in a given situation [41]. This is so because converting visual data (from human perception) to computational representation is a very difficult task. The translation first requires expressing perception of visual data into a symbolic knowledge representation format (e.g. forests have high NDVI values). Such a conversion is a very difficult task since some concepts have vague meanings when expressed in natural language [44]. For instance, color may be considered a significantly important biophysical property [45], but its perception varies amongst humans and it is difficult to express.

III. INTRODUCTION ON ONTOLOGIES
Sharing knowledge among people is feasible only if people speak a common language [43]. Knowledge is traditionally defined as a subset of true beliefs [46]. It is the intersection between truth and beliefs, as represented in Figure 10. Ontologies enable formal (machine-understandable) representation of knowledge. In computer science, an ontology is defined as an explicit, formal specification of a shared conceptualization [47]. An ontology is a systematic description of existence, and the term is drawn from philosophy. What "exists" for Artificial Intelligence (AI) systems is that which can be represented. The following properties, with corresponding definitions, should be observed: (1) conceptualization means that an ontology is an abstract model of a real world phenomenon; (2) explicit implies that all ontology concepts must be clearly defined; (3) formal implies that an ontology is machine understandable; and (4) shared means that there should be consensus amongst a community of people about the knowledge represented by the ontology.

A. FORMAL ONTOLOGIES
Remote sensing science experts are conversant with working on numerical knowledge that has been derived from an image viewpoint [48]. Numerical knowledge representation by nature suffers from the problem of partiality and implicit knowledge representation; hence it becomes difficult to share the knowledge with other scientists, such as ecologists, agronomists, etc., who are used to working with symbolic knowledge in describing a geographic concept, for instance, a forest concept defined by "High NDVI" values. Formal ontologies provide a road map that caters for the representation of both symbolic and numerical knowledge. Formal ontologies can be used to describe an observation unambiguously from different perspectives; for instance, the Extensible Observation Ontology (OBOE) is used to describe the semantics of scientific observations. An observation of an entity encompasses the characteristics (e.g. biomass) of the entity based on a measurement standard (e.g. grams). Ontologies for remote sensing science applications based on description logic offer the following advantages:
• Symbolic language - it associates concepts with relevant sensing data and also promotes the binding of related concepts.
• Knowledge sharing - it advocates for common conceptualization and the adoption of a standard ontology language such as the web ontology language.
• Reasoning - description logic in ontology allows the inferring of new knowledge from explicit descriptions.

B. ONTOLOGY KNOWLEDGE BASED AS A SOLUTION
This section outlines how the adoption of ontologies in knowledge base approaches helps in alleviating the problems addressed in Section II.

1) Duality of geographic concepts
Ontologies incorporate the concept of perspectivalism. That is, they allow the separate description of a field point of view of a forest concept. For instance, a forest concept can be specified in terms of attributes such as "high" NPP, leaf type, and so on. The other angle of description is from the point of view of an image of geographic objects. That is, a forest can be defined in terms of attributes such as "high" NDVI, texture, and wavelength. In general, it allows for the separate description of geographic entities and geographic objects alongside their characteristics. Figure 11 shows the dual representation of a geographic feature, that is, it can be described either from the perspective of a geographic entity or from the perspective of a geographic object.

2) Vagueness and ambiguity of geographic concepts
Fuzzy logic is the most popular way of handling the vagueness of geographic concepts [49]. Processing of data is done using partial set membership rather than strict set membership. For example, a forest concept is not considered to be strictly "green", but rather is considered to belong, to some degree, to the set of things that are green. [50] defined two thresholds, i.e. the ambiguity reject threshold and the distance reject threshold. The ambiguity reject threshold is defined by the rule α_amb ∈ [0.5, 1] and defines the degree of confidence required to recognise an object. The distance reject threshold is defined by the rule α_dist ∈ [0, 1]; it means that an object x is unlikely to belong to either class C_k or ¬C_k and might belong to a concept not yet learnt. Vagueness can also be addressed by adopting probabilistic ontologies. They use probability sets to define concepts of interest. Attributes in the set properties have probabilities attached to them, and the statistical measure of the probability value of the geographic concept [51] is used to determine whether a geographic concept is a member of a class. Ambiguity in ontologies can be reduced by limiting the information that describes a concept [42].
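Partial membership and the two reject thresholds can be sketched as follows. The piecewise-linear membership function, its breakpoints (0.5 and 0.71), the default threshold values and the decision order are all assumptions chosen for illustration; they are one possible reading of the thresholds described above, not the procedure of [50].

```python
def high_ndvi_membership(value, low=0.5, high=0.71):
    """Degree (0..1) to which an NDVI value belongs to the fuzzy set 'high NDVI'."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def recognise(mu_class, mu_not_class, alpha_amb=0.8, alpha_dist=0.2):
    """Accept C_k or not C_k, or reject the object as ambiguous or unknown."""
    if not 0.5 <= alpha_amb <= 1.0 or not 0.0 <= alpha_dist <= 1.0:
        raise ValueError("thresholds outside their permitted ranges")
    best = max(mu_class, mu_not_class)
    if best < alpha_dist:
        # Far from both classes: possibly a concept not yet learnt.
        return "distance reject"
    if best < alpha_amb:
        # Not confident enough to recognise the object.
        return "ambiguity reject"
    return "C_k" if mu_class >= mu_not_class else "not C_k"


# Hypothetical NDVI values tested against the forest concept.
for value in (0.45, 0.60, 0.75):
    mu = high_ndvi_membership(value)
    print(value, round(mu, 2), recognise(mu, 1.0 - mu))
```

A pixel with NDVI 0.60 is neither clearly forest nor clearly non-forest under these assumed settings, so it is rejected as ambiguous instead of being forced into a crisp class.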

3) Sensory Gap
The discrepancy between real objects and their depiction in images is known as the sensory gap. According to [52], the sensory gap can be reduced by explicitly defining the domain and world knowledge in the system. Knowledge about physical laws, laws governing the behavior of objects, and how people perceive them is incorporated into the system in the hope of enhancing recognisers and thereby assisting machines in bridging the sensory gap [53]. In ontologies, the real world description of forest entities is correlated with the matching image-based description of forest objects, i.e., NDVI is correlated with NPP [2]. Figure 12 shows how a real world description of a forest concept can be mapped to an object description in an image. An object description of an image is easily formalised on a computer.

4) Semantic Gap
The semantic gap is the discrepancy between the high level descriptions of images by humans and the low-level detection used by machines to detect images [54]. Adding captions and annotations to images addresses the problem [51], but the method is time-consuming and costly because it requires a lot of effort, machine algorithm tweaking, and close attention to vocabulary and content to ensure that photos are appropriately labeled [51]. In ontologies, however, an image feature (e.g. NDVI) and its associated value ("HIGHNDVI") are used to define a pixel (image object) of a forest concept. The "HIGHNDVI" concept is formalized as a result of the established relationship between symbolic information (e.g. "HIGHNDVI") and numerical knowledge (e.g., NDVI>0.7); hence the semantic gap is reduced.

IV. ONTOLOGICAL FRAMEWORK FOR RSI (REMOTE SENSING IMAGE)
[55] proposed a novel framework for RSI. The framework is made up of important terms or concepts. These include satellite, sensor, image, spatial resolution, and spectral resolution. The elements are shown in Table 1. The slot is mainly concerned with the spatial and spectral resolutions, which relate to the scope, although there are no related elements in the range component. Spectral resolution is one of the most important concepts for the framework. It is defined top down: the concept is divided into two subcomponents, i.e. the visible part and the infrared part. The visible part is made up of three color segments, i.e. RGB (red, green, and blue). The infrared part is also made up of three segments, i.e. thermal infrared, near infrared, and far infrared. The parameters suited for the slot are explicitly defined and include has_spatial_resolution, has_spectral_resolution, etc.
[56] developed a simple ontological approach for remote sensing image classification. The prototype was built upon the expert remote sensing knowledge expressed in [57].
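The top-down decomposition of spectral resolution and the slot names in the RSI framework can be pictured as a small data structure. This is only an illustrative sketch of the description above: the `Image` class, the unit of the spatial resolution and the sample scene are hypothetical, and only the slot names `has_spatial_resolution` and `has_spectral_resolution` come from the text.

```python
from dataclasses import dataclass

# Spectral resolution split into its two subcomponents and their segments.
SPECTRAL_RESOLUTION = {
    "visible": ["red", "green", "blue"],
    "infrared": ["thermal infrared", "near infrared", "far infrared"],
}


@dataclass
class Image:
    name: str
    has_spatial_resolution: float  # metres per pixel (unit assumed)
    has_spectral_resolution: list  # names of the spectral segments recorded


def parent_of(segment):
    """Return the subcomponent ('visible' or 'infrared') a segment belongs to."""
    for part, segments in SPECTRAL_RESOLUTION.items():
        if segment in segments:
            return part
    return None


def parts_covered(image):
    """List the spectral subcomponents an image covers, ignoring unknown segments."""
    parents = (parent_of(s) for s in image.has_spectral_resolution)
    return sorted({p for p in parents if p is not None})


# Hypothetical scene recorded in the red and near-infrared segments.
scene = Image("scene_1", 10.0, ["red", "near infrared"])
print(parts_covered(scene))
```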

A. ONTOLOGICAL FRAMEWORK FOR OBJECT FEATURE EXTRACTION
After an image goes through a segmentation process, each region is characterised by a set of features. The feature extraction process from eCognition software follows the general upper ontology defined using the top down method [58]. The features are divided into six categories, namely LayerProperty, GeometryProperty, PositionProperty, TextureProperty, ClassProperty, and ThematicProperty. The selection of features of interest is performed by an expert to allow object detection. Figure 13 shows a hierarchical breakdown of object features from the six categories. GeometryProperty, TextureProperty, and ThematicProperty are important features in detecting forest objects [59].

B. ONTOLOGY MODEL OF THE LAND COVER CLASS HIERARCHY
The upper-level ontology is developed using concepts from land cover classification systems (LCCS) [39]. Figure 13 shows a hierarchically simplified way of representing classes of interest emanating from the main land cover class [55]. [58] designed an upper level ontology for the Chinese Geographic Condition Census Project [60]. Figure 14 depicts the design of an ontology of eight land cover classes. The procedure was as follows:
1) The first step was to establish a set of important terms, in this case: Fields, Woodland, Grassland, Orchards, Bare land, Roads, Building and Water.
2) Classes and class hierarchies were then defined. A land cover class was defined through a top down approach.

1) Ontology Model of the decision tree classifier
Ontologies typically express two algorithms, namely decision trees and semantic rules [58]. [61][62] used decision trees in the field of ontologies to cluster and classify image objects. Their findings showed that decision trees help ontologies to granulate information, thereby increasing image classification accuracy. [62] used decision trees to solve the problem of inconsistency between overlapping ontologies. [48] used decision trees for ontology matching; the matching process is purely based on decision tree rules derived for an ontology, which are compared with rules for external ontologies. [58] designed an ontology model for the decision tree classifier that consists of three parts: (1) a set of decision trees composed of all essential terms and concepts, for instance, a node and a leaf; (2) a slot defined by the inequality symbols >, ≥, <, and ≤; and (3) the creation of the nodes. Figure 15 shows the elements of the decision tree classifier.
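The three parts of the decision tree model (nodes and leaves, a slot holding an inequality symbol, and the creation of concrete nodes) map naturally onto a small data structure. The sketch below is an assumption-laden illustration, not the model of [58]: the feature names, thresholds and tree shape are hypothetical, and ASCII `>=` and `<=` stand in for ≥ and ≤.

```python
import operator
from dataclasses import dataclass
from typing import Union

# The slot: each inequality symbol is bound to its comparison function.
SLOT_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class Leaf:
    label: str  # land cover class assigned at the end of a branch


@dataclass
class Node:
    feature: str      # name of the object feature being tested
    symbol: str       # one of the slot's inequality symbols
    threshold: float
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree, features):
    """Walk from the root to a leaf, applying each node's inequality test."""
    while isinstance(tree, Node):
        test = SLOT_OPERATORS[tree.symbol]
        passed = test(features[tree.feature], tree.threshold)
        tree = tree.if_true if passed else tree.if_false
    return tree.label


# Creating the nodes: a hypothetical two-level tree.
TREE = Node("ndvi", ">=", 0.71,
            Leaf("Woodland"),
            Node("brightness", "<", 0.2, Leaf("Water"), Leaf("Bare land")))

print(classify(TREE, {"ndvi": 0.8, "brightness": 0.5}))
```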
2) Ontology model of the semantic rules
[58] followed a two-phased approach to designing an ontology model for semantic rules: the first phase is the establishment of mark rules, followed by decision rules. Mark rules convert low level features to semantic concepts. Decision rules, on the other hand, are inferred from mark rules and a priori knowledge.
• Ontology model for mark rules
The semantic notions are classified as follows: the morphology is strip or planar; the shape is regular or irregular; the texture is smooth or rough; the brightness is light or dark; the height is high, medium, or low; and the position relationship is adjacent, disjunct, or contained. The ontology model of the mark rules is shown in Figure 16.
• Ontology model of the decision rules
Ontologies explicitly represent concepts in the same way humans describe concepts in their domain of interest. However, ontologies that are developed disregarding decision rules have proved to be computationally expensive [63]. This is due to their inability to capture the kinds of decision-making knowledge that arise in practice, such as those involving multiple ontologies. Decision rules on ontologies help in three ways [64][65]: (a) they take into account primitives from multiple ontologies as well as primitives that are not part of the rule framework; (b) they are time dependent; and (c) they incorporate default assumptions. Eight types of land cover obtained from the Chinese Geographic Census Project [60] were defined in terms of a rule as outlined in Figure 17.
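The two phases can be sketched as two small functions: mark rules turn numeric features into the semantic labels listed above, and decision rules combine those labels into a land cover class. All numeric cut-offs, feature names and the three decision rules below are hypothetical and are not the rules of Figure 17.

```python
def apply_mark_rules(features):
    """Mark rules: convert low-level numeric features into semantic labels."""
    height = features["height_m"]
    return {
        "morphology": "strip" if features["length_width_ratio"] >= 4 else "planar",
        "texture": "rough" if features["texture_variance"] >= 0.3 else "smooth",
        "brightness": "light" if features["brightness"] >= 0.5 else "dark",
        "height": "high" if height >= 10 else "medium" if height >= 2 else "low",
    }


# Decision rules: each pairs required semantic labels with a land cover class.
DECISION_RULES = [
    ({"morphology": "strip", "height": "low", "texture": "smooth"}, "Roads"),
    ({"height": "high", "texture": "rough"}, "Woodland"),
    ({"brightness": "dark", "height": "low", "texture": "smooth"}, "Water"),
]


def apply_decision_rules(marks):
    """Return the class of the first rule whose conditions all hold, else None."""
    for conditions, land_cover in DECISION_RULES:
        if all(marks.get(key) == value for key, value in conditions.items()):
            return land_cover
    return None


# Hypothetical image object: tall, rough-textured, compact and dark.
obj = {"brightness": 0.3, "height_m": 15, "texture_variance": 0.6,
       "length_width_ratio": 1.2}
marks = apply_mark_rules(obj)
print(marks)
print(apply_decision_rules(marks))
```

Keeping the two phases separate means the decision rules are written purely in semantic terms, so the numeric thresholds can be retuned for a new sensor without touching them.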

C. SEMANTIC NETWORK MODEL
Semantic networks graphically represent knowledge in the form of nodes and links, whereby links provide hierarchical relationships between objects [66]. The semantic network model explicitly expresses knowledge through concepts and their corresponding semantic relations [58]. This is shown in Figure 18. The network bridges the gap between low-level characteristics and high-level semantics, reducing the semantic gap.
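A semantic network of this kind can be held as a list of (source, relation, target) links. The sketch below is illustrative only: the concept names, the `is_a` and `has_feature` relation names and the sample links are assumptions, not the content of Figure 18.

```python
# Each link connects two nodes through a named semantic relation.
LINKS = [
    ("Vegetation", "is_a", "LandCover"),
    ("Woodland", "is_a", "Vegetation"),
    ("Grassland", "is_a", "Vegetation"),
    ("Vegetation", "has_feature", "high NDVI"),
    ("Woodland", "has_feature", "rough texture"),
]


def related(node, relation):
    """Targets reachable from `node` over one link of the given relation."""
    return [target for source, rel, target in LINKS
            if source == node and rel == relation]


def ancestors(node):
    """All concepts above `node` in the is_a hierarchy, nearest first."""
    found = []
    frontier = related(node, "is_a")
    while frontier:
        parent = frontier.pop(0)
        if parent not in found:
            found.append(parent)
            frontier.extend(related(parent, "is_a"))
    return found


def features_of(node):
    """Low-level features attached to a concept or inherited from its ancestors."""
    features = related(node, "has_feature")
    for parent in ancestors(node):
        features.extend(related(parent, "has_feature"))
    return features


print(ancestors("Woodland"))
print(features_of("Woodland"))
```

Following the links from a high-level concept down to its features is the sense in which the network connects high-level semantics to low-level characteristics.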

D. ONTOLOGIES FOR KNOWLEDGE MANAGEMENT
Framework ontologies and domain ontologies are the two most important types of ontologies. Framework, or foundation, ontologies consist of concepts explicitly expressed in high-level knowledge (for human understandability), and they are not designed for a specific domain. A domain ontology has knowledge tailor-made for a specific domain, e.g., remote sensing. A domain ontology draws on a framework ontology. Domain ontologies have a hierarchical structure of two levels: the first level is called the ABox, and the second level is called the TBox. The ABox contains assertions (or rules) that comprise the theory that the ontology describes in its domain of application [67]. The TBox is where experts conceptualise their knowledge in a specific scientific domain [56]. There are many paradigms for modelling ontologies, chief amongst them Description Logics (DL) [68] and rule formalism. The DL formalism serves as a foundation for building ontologies using the web ontology language (OWL) [69]. New knowledge can be inferred from ontologies using DL, which also makes ontologies machine understandable.

FIGURE 17. Decision rules based on ontology [58]
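The split between the two levels, and the kind of inference DL enables, can be shown with a toy example. It follows the usual description logic reading, in which the TBox holds concept-level axioms and the ABox holds assertions about individuals; the concept names, the individuals and the restriction to simple subclass axioms are assumptions of this sketch.

```python
# TBox: subclass axioms, read as "key is a kind of value".
TBOX = {
    "Forest": "Vegetation",
    "Grassland": "Vegetation",
    "Vegetation": "LandCover",
    "Building": "LandCover",
}

# ABox: assertions that a given image object belongs to a given concept.
ABOX = [
    ("segment_1", "Forest"),
    ("segment_2", "Building"),
    ("segment_3", "Grassland"),
]


def subsumed_by(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in the TBox."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = TBOX.get(concept)
    return False


def instances_of(concept):
    """Individuals whose asserted concept is subsumed by `concept`."""
    return [individual for individual, asserted in ABOX
            if subsumed_by(asserted, concept)]


print(instances_of("Vegetation"))
```

No assertion states that segment_1 is vegetation; that fact is inferred from the TBox. This is the kind of super class query that purely data-driven classifiers cannot answer unless the super class was defined in advance.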

E. MODULAR ONTOLOGICAL APPROACH
The modular approach is the best way of building complex ontologies from simpler (modular) ontologies in a consistent and well-defined way [70]. Such an approach allows many different domain experts to collaborate on a single ontology through the integration of independently developed ontologies. The approach requires that the meaning of a TBox T' is not changed when elements of T' are reused in another TBox T. This property is formalised by the notion of a conservative extension [71].
Definition 1 (Conservative Extension). Let T and T' be TBoxes, Sig(α) be the signature of an axiom α and Sig(T') be the signature of T'. Then T ∪ T' is a conservative extension of T' if
\[ T \cup T' \models \alpha \iff T' \models \alpha \quad \text{for every axiom } \alpha \text{ with } \mathrm{Sig}(\alpha) \subseteq \mathrm{Sig}(T'). \]
If independent parts T_1 and T_2 of an ontology T are constructed in a modular way, then T remains modular as well [70].
FIGURE. The semantic network model [58]
Falomir et al. [69] proposed three levels of knowledge that are imperative for designing a modular ontological approach: the reference conceptualisation (which provides a description of images and image objects), the contextual knowledge (a set of rules defined by a domain expert) and the image facts (semantic descriptions of image content). Figure 19 illustrates how the reasoner assigns image objects to their corresponding concepts based on image facts and contextual knowledge, both of which are expressed in terms of the reference conceptualisation.
(a) The Reference Conceptualisation
It is a general model for describing image objects in remote sensing. It consists of two packages, namely, (1) the image structure package and (2) the image processing package [56]. The image structure package is organised around the ImageObjects concept, which describes objects according to their characteristics, and the ImageObjectFeature concept, which links related concepts with associations such as "hasfeature". The image processing package is composed of the PseudoSpectralIndex and SpectralBand concepts. These concepts help remote sensing experts describe contextual knowledge, since notions such as spectral bands and texture are what experts use to interpret remote sensing images.

(b) The Contextual Knowledge
The purpose of contextual knowledge is to represent remote sensing expert knowledge using DL, hence the name "contextual knowledge." The basis of this knowledge comes from the remote sensing expert. As a result, it is a "subjective" description of image rules rather than an "objective" depiction of image structure. Figure 20 shows the concepts, relations, and instances in contextual knowledge.

(c) The Image Facts
These are facts extracted from image analysis, and they are stored in the ABox [48]. The TBox contains the reference conceptualisation and the contextual knowledge [48]. Facts in the ABox provide semantic descriptions of image objects, and the description is done with the help of the reference conceptualisation and the contextual knowledge.

V. VEGETATION DETECTION
Both unsupervised and supervised classification algorithms are crucial for identifying vegetation areas.

A. UNSUPERVISED CLASSIFICATION INDICES
Spectral indices are used in these methods to detect vegetation areas. The Normalized Difference Vegetation Index (NDVI), which is calculated for each pixel in an image, is one of the indices utilized. The NDVI image is represented in a gray scale image. As shown in Figure 21: image (a) is a representation of an image using the RGB channel; image (b) is the representation of the same image in an NDVI format using the gray scale.
Equation 4 illustrates the calculation of NDVI, where ψ_IR and ψ_R are the pixel values in the infrared and the red band, respectively:
\[ \mathrm{NDVI} = \frac{\psi_{IR} - \psi_{R}}{\psi_{IR} + \psi_{R}} \tag{4} \]
The formula identifies vegetation as areas that have a higher reflectance in the infrared band than in the red band. The approach was then refined to take into account the spectral index (SI) [73].
Equation 5 illustrates the calculation of SI, where ψ_B is the pixel value in the blue band and ψ_R is the pixel value in the red band. The NDVI and SI values are then binarized to create a vegetation mask. This is shown in Figure 22.
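The NDVI thresholding step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [73]: the function names, the toy reflectance values and the threshold of 0.3 are assumptions introduced here, and the SI refinement is omitted.

```python
def ndvi(nir, red, eps=1e-12):
    """Per-pixel NDVI = (NIR - R) / (NIR + R) for two equally sized 2-D lists.

    eps is a small constant (an assumption of this sketch) that avoids a
    division by zero when both bands are zero.
    """
    return [
        [(n - r) / (n + r + eps) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]


def vegetation_mask(ndvi_image, threshold=0.3):
    """Binarize an NDVI image: 1 = vegetation, 0 = non-vegetation.

    The threshold value is hypothetical and would need tuning per sensor.
    """
    return [[1 if v > threshold else 0 for v in row] for row in ndvi_image]


# Toy 2x2 reflectance values (hypothetical): the top row behaves like vegetation.
nir = [[0.60, 0.50], [0.20, 0.10]]
red = [[0.10, 0.10], [0.20, 0.30]]
mask = vegetation_mask(ndvi(nir, red))
```

In this toy example the top row has a much higher infrared than red response, so only those two pixels end up in the mask.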

B. SUPERVISED CLASSIFICATION INDICES
Detection of vegetation by spectral indices is highly dependent on spectral characteristics. In other cases, supervised classification methods are used, and these are primarily based on Support Vector Machines (SVMs). The feature vector that defines each pixel in the training data set contains four characteristics, namely, the reflectance values of the pixel in the infrared, red, green, and blue bands. Supervised methods do well in distinguishing between non-vegetation and vegetation areas through spectral indices. This requires an SVM capable of determining the best linear separator. Random Forest (RF), k-Nearest Neighbour (kNN), SVM and sparse representations are among the pixel-wise classifiers that have been used over the last decade [74]. These traditional methods consider only spectral information as the basis of the classification process, disregarding spatial contextual information, which contributes significantly to the classification performance [74]. Several researchers have proposed hybrid spectral-spatial classification that takes into account both the spatial context and the spectral information, based on the assumption that pixels from a local region have similar spectral information. The authors of [74] proposed a hybrid model of kNN combined with a guided filter for hyper-spectral image (HSI) classification of forest trees. The joint hybrid model of kNN and guided filter (PGF-kNN) was used to optimise the hyper-spectral images produced by kNN.
Optimised hyper-spectral images were taken as input by the joint kNN and processed to produce the classification maps. Each class map was converted into a probability value, and the class map with the highest probability value was chosen as the classification result.
The authors of [75] conducted a study to determine the reliability of the RF and SVM algorithms in the classification of very high resolution (VHR) images obtained from oak woodlands of a Mediterranean ecosystem. The first stage was data acquisition, where images were subjected to a Structure-from-Motion (SfM) technique to identify common features in overlapping images. Each image was then orthorectified using the interpolated digital surface model (DSM). Finally, all the images were combined into an orthomosaic. The workflow of the study followed four main steps, namely, preprocessing, segmentation, classification and accuracy assessment. Figure 23 shows the workflow of the proposed model. In the preprocessing stage, each input layer was subjected to a linear band stretch covering a range of 8 bits, that is, from a minimum of 0 to a maximum of 255. This was done to normalise each band and to suppress the effect of possible outliers on the segmentation. A layer stacking process was then performed on the images containing R-G-NIR (Red, Green, Near-Infrared) bands, obtained during the spring and summer seasons, by integrating NDVI and DSM data to obtain the two final five-band orthomosaics. This step was important because OTB (Orfeo ToolBox) segmentation accepts only one raster image as input data. Spectral separability is of significant importance when it comes to image classification. The M-statistic defined in Equation 13 was employed [75] to measure the separability of the NDVI and DSM layers for varying types of vegetation.
\[ M = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2} \tag{13} \]
where µ_1 and µ_2 are the mean values of class 1 and class 2, and σ_1 and σ_2 are the standard deviations of class 1 and class 2, respectively. M < 1 signifies that the classes overlap, whereas M > 1 denotes that the classes are well separable. The segmentation process considered both semantic properties and radiometric information. Large-scale mean shift (LSMS) segmentation was used in the study because of its ability to perform tile-wise segmentation of large VHR imagery [76]. The OTB LSMS segmentation process followed the steps of LSMS smoothing, LSMS segmentation, LSMS merging and LSMS vectorisation. Classification was performed for five different land cover classes, namely, grass, cork oak, soil, shrubs and shadows. Two supervised learning algorithms, RF and SVM, were used to perform the classification. SVM performs linear separation in a hyperspace using a µ(.) mapping function. Where objects are not linearly separable, the kernel method is used, which projects the data into a feature space in which they can be separated [76]. RF applies bagging to build an ensemble of decision trees, each trained on a different subset of the data. Every decision tree in the RF participates in the classification process, and the classification label returned is the class with the most votes.
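As a small worked example of the separability measure, the sketch below computes the M-statistic, M = |µ_1 − µ_2| / (σ_1 + σ_2), for two sets of per-class values. The sample NDVI values are invented for illustration, and the use of the population standard deviation is an assumption, since [75] is not quoted on that detail here.

```python
from statistics import mean, pstdev


def m_statistic(class_1, class_2):
    """M = |mu_1 - mu_2| / (sigma_1 + sigma_2) for two lists of values.

    Uses the population standard deviation (an assumption of this sketch).
    """
    mu_1, mu_2 = mean(class_1), mean(class_2)
    sigma_1, sigma_2 = pstdev(class_1), pstdev(class_2)
    return abs(mu_1 - mu_2) / (sigma_1 + sigma_2)


# Hypothetical NDVI samples for two well-separated classes.
grass_ndvi = [0.60, 0.62, 0.58, 0.61, 0.59]
soil_ndvi = [0.10, 0.12, 0.08, 0.11, 0.09]
m_separable = m_statistic(grass_ndvi, soil_ndvi)

# Hypothetical samples for two classes whose distributions overlap.
m_overlapping = m_statistic([0.50, 0.60, 0.70], [0.55, 0.65, 0.75])
```

With these toy values the first pair gives M well above 1 (separable classes) and the second pair gives M below 1 (overlapping classes).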
Another study [77] analysed the performance of kNN and RF classifiers for mapping forest fire areas. The authors implemented kNN and RF to classify forest areas and explained the effects of different satellite images on both classifiers. Figure 24 shows the flow chart of the model. The model, being a supervised approach, was implemented using multi-spectral images obtained from the Landsat-8, Landsat-2, and Terra sensors. The classification accuracy was determined by the confusion matrices. The machine learning classifiers based on kNN and RF produced excellent results with k set to 5 for kNN and 400 trees for RF. The hybrid model achieved a very high classification accuracy, with an overall accuracy (OA) > 89% and a Dice coefficient (DC) > 0.8. Other studies [78][79] have also implemented non-parametric algorithms such as kNN and RF in remote sensing applications.
FIGURE 23. Workflow that presents the stages of preprocessing, segmentation, classification and accuracy assessment [75].
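The two accuracy measures reported above can be derived from a confusion matrix as in the following sketch. The matrix layout (rows are true classes, columns are predicted classes) and the counts are assumptions made for illustration; they are not the results of [77].

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def dice_coefficient(cm, cls):
    """Dice coefficient 2*TP / (2*TP + FP + FN) for one class index."""
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp                    # true class cls, predicted otherwise
    fp = sum(row[cls] for row in cm) - tp     # predicted cls, true class otherwise
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical 2-class matrix: cm[i][j] = samples of true class i predicted as j.
cm = [[45, 5],
      [10, 40]]
oa = overall_accuracy(cm)          # 85 of 100 samples are correct
dc_burned = dice_coefficient(cm, 0)
```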

VI. IMAGE SEGMENTATION
Image segmentation partitions (or subdivides) an input image into meaningful image objects (segments). Methods for evaluating image segmentation can be classified into two categories: supervised (empirical discrepancy methods) and unsupervised (empirical goodness methods) [80]. Unsupervised approaches evaluate a segmentation result based on how well the image objects match a human perception of the desired set of segments, using quality criteria that are typically created in accordance with human perceptions of what constitutes a good segmentation. Supervised methods compare a segmentation result with a ground truth [2]. If ground truth can be reliably established, supervised methods are preferred.

A. TYPES OF IMAGE SEGMENTATION
Pixel-based, edge-based, and region-based methods are the three primary types of traditional image segmentation [81].

(a) Pixel Based Methods
This method involves two important processes: (1) image thresholding and (2) segmentation in feature space. For image thresholding, image pixels are divided according to their intensity level [82]. There are three types of thresholding [83][84]:
(1) Global thresholding - a single appropriate threshold value T is used. The output image q(x,y) is obtained from an original image p(x,y) as
\[ q(x,y) = \begin{cases} 1, & p(x,y) > T \\ 0, & p(x,y) \le T \end{cases} \]
(2) Variable thresholding - the value of T varies over the image, and it comes in two flavours:
• Local threshold - T depends on the neighborhood of (x,y).
• Adaptive threshold - the value of T is a function of x and y.
(3) Multiple thresholding - several values of T are used. With two thresholds T_1 < T_2, for example, the output image is computed as follows:
\[ q(x,y) = \begin{cases} a, & p(x,y) > T_2 \\ b, & T_1 < p(x,y) \le T_2 \\ c, & p(x,y) \le T_1 \end{cases} \]
where a, b and c are distinct intensity labels.
However, these methods suffer from incomplete segmentation, so the output results need to be clumped. Also, these methods are appropriate for images in which the objects are lighter than the background.
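Global and multiple thresholding can be illustrated with a short Python sketch. The pixel values, the thresholds and the output labels (0, 1, 2) are hypothetical choices made for this example only.

```python
def global_threshold(image, t):
    """Return 1 where the pixel exceeds the single threshold t, else 0."""
    return [[1 if p > t else 0 for p in row] for row in image]


def multi_threshold(image, t1, t2, labels=(0, 1, 2)):
    """Two-threshold variant of multiple thresholding (requires t1 < t2).

    Pixels <= t1 get labels[0], pixels in (t1, t2] get labels[1],
    and pixels > t2 get labels[2].
    """
    low, mid, high = labels

    def label(p):
        if p > t2:
            return high
        if p > t1:
            return mid
        return low

    return [[label(p) for p in row] for row in image]


# Toy 8-bit gray-level image (hypothetical values).
image = [[10, 120],
         [200, 90]]
binary = global_threshold(image, t=100)
classes = multi_threshold(image, t1=50, t2=150)
```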

(b) Edge Segmentation methods
Edge-detecting operators are employed to detect all possible edges that are found in an image. Adjacent regions are separated by a sharp change in gray value, that is, by a boundary across which the gray value is not continuous [85]. The edges are therefore represented by discontinuities in gray level, color, texture, etc. These discontinuities are detected by using derivative operations such as differential operators [86]. The Prewitt, Roberts, and Sobel operators are the most frequently utilized first order differential operators [87]. Other edge detection operators, such as template matching edge detectors, are also available. Filtering, enhancement, and detection are the three processes in edge segmentation algorithms [81]. The purpose of the filtering process is to reduce the amount of noise present in the imagery. The enhancement step uses high pass filtering to reveal local changes in intensity. Finally, the edges detected (using threshold techniques) are combined or linked together to form the boundaries of the image object. One challenge with edge-based segmentation is that it sometimes presents edges in locations where there is no border.
FIGURE 24. Flowchart of the model that harmonises RF and kNN [77]
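To make the enhancement and detection steps concrete, the sketch below applies the Sobel operator and thresholds the gradient magnitude. It is a simplified illustration: the noise-filtering and edge-linking steps are omitted, border pixels are left as non-edges, and the test image and threshold are invented.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_kernel(image, kernel, r, c):
    """Weighted sum of the 3x3 neighborhood centred on pixel (r, c)."""
    return sum(
        kernel[i][j] * image[r + i - 1][c + j - 1]
        for i in range(3)
        for j in range(3)
    )


def sobel_edges(image, threshold):
    """Mark interior pixels whose gradient magnitude exceeds the threshold."""
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = apply_kernel(image, SOBEL_X, r, c)
            gy = apply_kernel(image, SOBEL_Y, r, c)
            if math.hypot(gx, gy) > threshold:
                edges[r][c] = 1
    return edges


# Toy image with a vertical step between a dark and a bright region.
step_image = [[0, 0, 100, 100] for _ in range(4)]
edge_map = sobel_edges(step_image, threshold=200)
```

On the step image, the interior pixels on both sides of the intensity jump are marked as edges, while a uniform image yields no edges.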
(1) Region Growing Segmentation
It starts with a seed point, which is then subjected to a rule that joins surrounding pixels to the starting region as long as a particular similarity threshold is met [85]. The procedure is repeated until there are no more pixels to assign, that is, until the entire image is segmented. These algorithms, however, suffer from a lack of control over the break-off criterion for the region's growth [90].
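A minimal sketch of region growing from a single seed is given below. The similarity rule (absolute difference from the seed value within a tolerance), the 4-connected neighborhood and the toy image are assumptions of this example; other joining rules are possible.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from seed = (row, col) over 4-connected neighbours.

    A neighbour joins the region when its value differs from the seed
    value by at most `tolerance`. Returns the set of (row, col) pixels.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if (inside and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region


# Toy image: a dark 2x2 block in the top-left corner, bright elsewhere.
image = [[10, 12, 90],
         [11, 13, 95],
         [80, 85, 92]]
dark_region = region_grow(image, seed=(0, 0), tolerance=5)
```

Growth stops on its own once no neighbouring pixel satisfies the rule, which is the break-off criterion in this simplified version.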
(2) Region splitting and region merging
The original image is split or subdivided into sub-images. Each sub-image is recursively divided into its own sub-images based on the condition or predicate given. If the condition is not satisfied, further splitting ceases [91]. Figure 25 shows the splitting process.
The authors of [80] proposed a segmentation process that improves segmentation accuracy by modifying the super-pixel extraction methodology so as to increase robustness to added noise. The segmentation method is based on Gabor filtering and Principal Component Analysis (PCA). Figure 26 presents the state-of-the-art segmentation process. The method depends on two principal tasks: (1) pre-segmentation (super-pixel extraction), and (2) clustering of the previously extracted super-pixels.

FIGURE 26. Image segmentation state of the art [80]
(a) Pre-segmentation
An input image is subdivided into a number of regions of interest. Each region is made up of pixels with similar features. The Watershed Transform (WT) clustering based super-pixel algorithm has previously been considered for super-pixel extraction [92] [93].
(b) Gabor filter
Gabor filters are used to extract spatially localized spectral features [80]. They have been advocated because they are based on principles similar to those found in the human visual system and have key features that can be utilized to segment images.
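For illustration, the real part of a 2-D Gabor filter can be generated as in the sketch below, which uses the common formulation of a Gaussian envelope multiplied by a cosine carrier. The parameter names and values (wavelength, theta, sigma, gamma, psi) are assumptions of this example and are not the settings used in [80].

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter sampled on a size x size grid (size odd).

    wavelength: period of the cosine carrier, theta: orientation in radians,
    sigma: width of the Gaussian envelope, gamma: spatial aspect ratio,
    psi: phase offset.
    """
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates to the filter orientation.
            xr = x * cos_t + y * sin_t
            yr = -x * sin_t + y * cos_t
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# A small bank of four orientations (hypothetical parameter choices).
bank = [gabor_kernel(5, wavelength=4.0, theta=k * math.pi / 4, sigma=2.0)
        for k in range(4)]
```

Convolving an image with each kernel in such a bank yields one texture response per orientation, which is the kind of spatially localized feature the text refers to.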
Before the introduction of deep learning, machine learning techniques such as SVM, K-means clustering, Random Forest, etc., were the chief algorithms for image segmentation. Semantic segmentation using deep learning has proven to work better than the aforementioned techniques because it classifies each pixel of an image rather than the entire image object. The next section gives an overview of semantic segmentation techniques.

VII. SEMANTIC SEGMENTATION USING DEEP LEARNING
This section introduces fundamental ideas of CNNs and subsequent variants for semantic segmentation, as well as their network structures [94].

A. ALEXNET, VGGNET AND GOOGLENET
These are the three chief deep neural networks for image classification, and they formed the foundation of later developments. They also serve as the basis for network architectures used in semantic segmentation.

1) AlexNet
AlexNet is made up of five convolutional layers and three fully connected layers [95]. In between the convolutional layers is a pooling layer that reduces dimensionality and computational complexity. AlexNet uses max pooling, which keeps the largest value covered by the filter and thereby helps remove noisy components [94]. Filters of sizes 11x11 and 5x5 are used in the first and second convolutional layers, respectively. The last three convolutional layers use small filters of size 3x3. The whole process is described in Figure 27. The primary purpose of these filters is feature extraction, and varying the filter size accommodates objects of different scales.
FIGURE. AlexNet [95]
AlexNet has the following notable characteristics:
1) It supports the application of the non-saturating Rectified Linear Unit (ReLU), whose output is defined by f(x) = max(0, x).
2) It employs the overlapping max pooling strategy (which means that each filtering operation's step size (stride) is smaller than the filter's overall size).
3) To reduce over-fitting, it uses the dropout approach in the fully connected layers.
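The first two characteristics can be sketched in plain Python as follows. The 3x3 window with a stride of 2 matches the overlapping pooling commonly attributed to AlexNet; the toy feature map is invented, and dropout is not shown.

```python
def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)


def max_pool2d(feature_map, window=3, stride=2):
    """Max pooling without padding; windows overlap when stride < window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(window) for j in range(window))
            for c in range(0, cols - window + 1, stride)
        ]
        for r in range(0, rows - window + 1, stride)
    ]


# Toy 5x5 feature map with values from -12 to 12 (hypothetical).
feature_map = [[5 * r + c - 12 for c in range(5)] for r in range(5)]
activated = [[relu(v) for v in row] for row in feature_map]
pooled = max_pool2d(activated)   # 5x5 input -> 2x2 output
```

Because the stride (2) is smaller than the window (3), the middle row and column of the input contribute to more than one output value.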

2) VGGNet
The network is made up of three fully connected layers and a varying number of convolutional layers. This is shown in Figure 28. Unlike AlexNet, VGGNet has fixed small filters of size 3x3 in the convolutional layers [96]. The number of weights in the network is reduced by using small filters, which minimizes the training complexity. Just like AlexNet, VGGNet uses max pooling, here over a 2x2 window with a stride of 2 pixels. The advantage of simplifying the convolutional layers to this extent is that the network depth can be increased, thereby improving the accuracy of the network. The network's performance in tasks like semantic segmentation and target detection is improved by using features extracted from the CNN that are structured in a hierarchy of scales [97]. Other classifiers, such as SVMs, can use the features without fine-tuning [98].
FIGURE 28. VGGNet [95]
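The saving in weights from small filters can be checked with simple arithmetic. The comparison below, between two stacked 3x3 layers and a single 5x5 layer covering the same 5x5 receptive field, is a standard illustration that is added here and not taken from the surveyed text; the channel count of 64 is an arbitrary choice and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical channel count, kept constant across layers

# Two stacked 3x3 layers see a 5x5 region of the input.
stacked_3x3 = 2 * conv_weights(3, channels, channels)

# A single 5x5 layer covers the same region directly.
single_5x5 = conv_weights(5, channels, channels)

saving = 1 - stacked_3x3 / single_5x5   # fraction of weights saved
```

For these settings the stacked design uses 73,728 weights against 102,400 for the single large filter, a saving of 28%, while also inserting an extra non-linearity between the two layers.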

3) GoogLeNet
The architecture differs from the other two in three respects: it uses inception modules, it requires auxiliary classifiers at the training stage, and it has only one fully connected layer [99]. Within an inception module, filters of different sizes are applied in parallel, and the output results from these filters are concatenated with the maximum pooling result. Figure 30 shows the inception module. Between the inception modules, maximum pooling is employed, and after the last inception module, average pooling followed by dropout is used [100]. The flow chart is shown in Figure 29. The network is very deep because it is made up of nine inception modules and up to three convolutional layers. Because of this depth, the smooth flow of the gradient from layer to layer becomes an issue. The issue is addressed by adding auxiliary classifiers in the middle of the network, whose role is to process the outputs from intermediate inception modules. The loss from these classifiers is added to the overall loss of the network during training.
Auxiliary classifiers are not used during the prediction phase.

B. FULLY CONVOLUTIONAL NETWORK
Fully convolutional networks (FCNs) for semantic image segmentation are an extension of AlexNet, VGGNet, and GoogLeNet [101]. Multi-layer convolution, deconvolution, and fusion are the three steps that define the network. The flow chart is shown in Figure 31. The fully connected layers are replaced by convolutional layers, with the class scores computed using a 1x1 convolution. Because of pooling, the output from the convolutional layers is smaller than the input image. The deconvolutional process is used to restore the image size. It uses the same operations as the convolutional process, but pads the feature map and spreads the input elements through a deconvolution filter to increase the output size. Recovering the original image size through deconvolution has some side effects; for example, some details are lost as a result of the dilution of class scores. To circumvent the side effects, the skip architecture combines semantic information obtained from deep layers with location details obtained from shallower, earlier layers. The upsampled deep layer is fused with the output of a shallow layer by element-wise addition.
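The fusion step can be sketched as follows. For simplicity the sketch upsamples by nearest-neighbour repetition, which is an assumption made here; an FCN learns its deconvolution filters instead. The score maps are toy values for a single class.

```python
def upsample2x(score_map):
    """Nearest-neighbour 2x upsampling (a stand-in for learned deconvolution)."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(2)]   # repeat each column twice
        out.append(wide)
        out.append(list(wide))                      # repeat each row twice
    return out


def fuse(deep_scores, shallow_scores):
    """Skip connection: upsample the deep map, then add the shallow map."""
    upsampled = upsample2x(deep_scores)
    return [
        [u + s for u, s in zip(up_row, sh_row)]
        for up_row, sh_row in zip(upsampled, shallow_scores)
    ]


# Toy class-score maps: a coarse 2x2 deep map and a finer 4x4 shallow map.
deep = [[1, 2],
        [3, 4]]
shallow = [[0.5] * 4 for _ in range(4)]
fused = fuse(deep, shallow)
```

The fused map has the resolution of the shallow layer while retaining the coarse class evidence of the deep layer.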

C. UNET
The building blocks of UNet are the convolutional and deconvolutional layers. The network works well with small images, hence the paramount step is the downsizing of input images [102]. The convolutional layers use filters of size 3x3, whose outputs are passed through ReLU and then through max pooling with a stride of two, which generates downsized outputs. The number of feature channels in the convolutional layers doubles at each downsampling step. The deconvolutional layers perform upsampling, with a 2x2 convolution used to reduce the number of feature channels. The network generates the segmentation result by applying a 1x1 convolution to the final feature map and labeling the pixels. The interconnection of layers in UNet is shown in Figure 32.

D. SEGNET
The network is composed of two subnetworks, namely, the encoder and decoder networks. The encoder network is responsible for downsizing the feature maps. It consists of a varying number of convolutions and subsequent max pooling operations for feature extraction [103]. However, the features produced have vague or ambiguous spatial information. The issue is solved by saving the index of each pooled element, which is used later in the decoder network's up-sampling procedure. In the decoder network, low-resolution features are mapped to high-resolution features. A 2x2 low-resolution feature map, for example, is up-sampled to a 4x4 matrix. This process on its own would result in the loss of spatial information; reusing the pooling indices from the encoder network helps recover it. The SegNet network is depicted in Figure 33.
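The role of the saved pooling indices can be illustrated with a small sketch of 2x2 max pooling and the corresponding index-based up-sampling. The function names and the toy feature map are assumptions of this example, and the convolutions that follow up-sampling in the decoder are omitted.

```python
def max_pool_with_indices(feature_map):
    """2x2 max pooling with stride 2 that also records where each max was."""
    pooled, indices = [], []
    for r in range(0, len(feature_map), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(feature_map[0]), 2):
            window = [(feature_map[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, rows, cols):
    """Place each pooled value back at its recorded position, zeros elsewhere."""
    out = [[0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


# Toy 4x4 encoder feature map (hypothetical values).
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [6, 2, 3, 4]]
pooled, indices = max_pool_with_indices(feature_map)
restored = unpool(pooled, indices, rows=4, cols=4)
```

The restored map is sparse: only the positions of the original maxima are filled, which is why the decoder applies further convolutions to densify it.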

E. DEEPLAB
It is a variant of FCN that employs dilated (atrous) convolution to broaden the scope of filters to include image context in a larger neighborhood while also allowing for flexibility over feature response resolution [17]. DeepLab uses Atrous Spatial Pyramid Pooling (ASPP) to capture image context at multiple scales. In this scheme, several atrous convolutions with the same kernel but different sampling rates are applied in parallel, and an additional operator combines the outputs of all the convolutions. Downsampling and the subsequent max pooling operators make segmentation results lose some fine details. To solve the problem, conditional random fields (CRFs) are employed to improve the spatial localization of the segmentation. CRF models smooth the segmentation based on the underlying image intensities [104]. They boost the accuracy score by 1% to 2%.
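The idea of atrous convolution and of combining several sampling rates can be sketched in one dimension. The zero padding, the choice of summation as the combining operator, and the toy signal and kernel are assumptions of this example; real ASPP operates on 2-D feature maps with learned kernels.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D dilated convolution with zero padding and an odd-length kernel.

    Kernel taps are spaced `rate` samples apart, so a larger rate widens
    the field of view without adding weights.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * rate
            if 0 <= j < len(signal):      # samples outside the signal count as 0
                total += weight * signal[j]
        out.append(total)
    return out


def aspp_1d(signal, kernel, rates):
    """Apply the same kernel at several rates and sum the branch outputs."""
    branches = [atrous_conv1d(signal, kernel, rate) for rate in rates]
    return [sum(values) for values in zip(*branches)]


signal = [1, 2, 3, 4, 5]
kernel = [1, 1, 1]
rate_1 = atrous_conv1d(signal, kernel, rate=1)   # ordinary convolution
rate_2 = atrous_conv1d(signal, kernel, rate=2)   # taps two samples apart
combined = aspp_1d(signal, kernel, rates=(1, 2))
```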

F. RESNET
The residual network (ResNet) is well recognized for its 152-layer depth and for introducing the residual block [105]. The residual block is presented in Figure 35. In principle, the greater the number of layers in a neural network, the better its performance. However, because of the vanishing gradient problem, the weights of the first layers are not updated correctly through the backpropagation algorithm [106]. As the error gradient is propagated to earlier layers, it goes through repeated multiplications such that it becomes very small; hence the network performance saturates and then starts to decrease. This problem is solved by using the identity function, whereby the gradient is multiplied by one so as to preserve the input and avoid any loss of information. The network is made up of the following components: 3x3 filters, CNN downsampling layers with a stride of 2, global average pooling, and a 1000-way fully connected layer with softmax at the end. ResNet employs a skip connection, which means that the original input is also added to the convolution block's output. This helps to solve the vanishing gradient problem by allowing the gradient to flow through an alternative path. The network diagram of the residual network is shown in Figure 34.
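The effect of the skip connection can be illustrated with a toy scalar model. In the sketch below, each layer is assumed to have the same local derivative, which is a deliberate simplification introduced here: a plain stack multiplies these derivatives together, whereas a residual stack multiplies terms of the form (1 + derivative) because each block computes y = x + F(x).

```python
def residual_block(x, transform):
    """Residual block output: F(x) + x, applied element-wise to a list."""
    return [fx + xi for fx, xi in zip(transform(x), x)]


def plain_gradient(local_derivative, depth):
    """Gradient through `depth` plain layers: product of local derivatives."""
    return local_derivative ** depth


def residual_gradient(local_derivative, depth):
    """Gradient through `depth` residual blocks: product of (1 + F'(x))."""
    return (1 + local_derivative) ** depth


# With a zero transform the block reduces to the identity mapping.
identity_out = residual_block([1.0, -2.0, 3.0], lambda x: [0.0 for _ in x])

# Hypothetical local derivative of 0.1 over 20 layers.
g_plain = plain_gradient(0.1, 20)        # shrinks towards zero
g_residual = residual_gradient(0.1, 20)  # does not vanish
```

In this toy setting the plain gradient is about 1e-20, whereas the residual gradient stays above 1, which mirrors the argument that the identity path keeps the gradient from vanishing.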

G. APPLICATION OF DEEP LEARNING TECHNIQUES
Emerging technologies such as deep learning have gained ground in the remote sensing community because they automate image processing that previously depended chiefly on human expert knowledge, which has changed the way land surveys are done [107]. The main advantage of deep learning approaches is the automatic extraction of features, unlike other machine learning algorithms where feature extraction is typically manual [108]. The strength of deep learning algorithms lies in learning from examples. The learning process consists of a number of steps. First, the architecture of a network of nodes is defined. The nodes that form an Artificial Neural Network (ANN) are arranged into layers. An ANN with many layers is referred to as a Deep Neural Network (DNN). The behaviour of the DNN is determined by the type and number of nodes as well as the connections between the nodes [107]. If an existing DNN is to be customized for a new application context, its weights are iteratively updated to achieve the new desired response. This process is referred to as "transfer learning". Deep learning was originally used for locating and classifying different tree species in a mosaic built from UAV-acquired images [109] [110]. The authors of [111] devised a deep learning technique to detect and identify tree species. The objective of the study was to classify patches corresponding to tree species. The authors developed a Deep Learning (DL) architecture, a hybrid of ResNet and UNet, to obtain a semantic segmentation algorithm for tree species that is precise and efficient. Seven orthomosaic images were collected using a UAV in the winter, and one orthomosaic image was collected using a UAV in the summer. The algorithm pipeline is presented in Figure 36. The first step of the technique identified the classes corresponding to each mosaic patch. The focus was on classifying the pixels in each mosaic patch.
The incorporation of the ResNet architecture into the DL network enhanced the accuracy and efficiency in classifying forest images [112] [113]. Images were divided into patches in response to the prescribed annotations, and each patch was assigned to a list corresponding to the classes that matched it. Patches could belong to more than one class, resulting in patches having to be labelled repeatedly. Because of the repeated labelling of patches, the algorithm is referred to as a Multi-label Patch (MLP) based classifier. The ResNet architecture went through the training phase so that it would be able to classify the patches. The MLP classification algorithm produced coarsely segmented images. A watershed segmentation algorithm was applied to refine the segmentation process. The UNet architecture, originally used for medical image segmentation [114], is also very useful for remote sensing images. The UNet architecture was trained with data and pixel-wise annotation patches. The segmentation process follows a number of steps: (1) mosaic images were split into patches for processing, (2) a UNet model was trained to predict patch segmentation, and (3) patch joining was used to obtain semantic segmentation for the entire mosaic image. The model achieved an effective learning transfer with a 12.48% improvement over random weights. Overall, the model reached a higher accuracy of nearly 95%.
Another study [115] proposed a Residual Neural Network (ResNet) architecture for classifying tree species in images acquired using a camera mounted on a UAV platform. In temperate forests, UAV images have been successfully used to distinguish between living and dead forest species [116]. The motivation of the study was that most of the existing methods for tree species classification are costly because they require very large data sets and are restricted to specific tree species [117]. The study proposed a model based on CNN to classify tree species at an individual level by analysing high resolution RGB images obtained from the UAV. A CNN was chosen in the study because of its ability to learn highly descriptive features from tree canopies. The study proposed a CNN model with 50 convolutional layers, referred to as ResNet50. Figure 38 shows the architecture of ResNet50. The procedure for performing tree crown delineation was based on the iterative local maxima filtering technique, which was used to identify probable tree tops. Tree tops were designated as markers, and a marker-controlled watershed segmentation was then performed on the DSM to segment the tree crowns. Figure 37 shows a tree crown segmented polygon. The tree crown delineation process enables tree crown identification and labelling. In the training phase, images were shuffled in unison with their corresponding labels to randomise the input data so that the neural network generalises well. The model achieved an overall classification accuracy of 80%. The study concluded that classification accuracy increases with an increase in the number of training images.
The task of classifying and mapping vegetation images has been difficult because the conventional methods employed are highly labour intensive. Deep learning and CNNs came as solutions to the problems posed by traditional methods, but they are still not efficient in detecting ambiguous objects [118]. There is little research that employs CNNs to detect and classify vegetation in remote sensing images [118]. A study by Guirado [119] successfully used a CNN to detect wild shrubs in Google Earth images. The author demonstrated that a CNN performs much better than traditional object detection methods. Another study [118] used a deep learning model and the chopped picture method to detect vegetation in Google Earth images. The study was carried out against the backdrop that existing work still faces huge challenges in classifying vegetation that has ambiguous and amorphous shapes, such as clonal plants. The training data was prepared using the chopped picture method, and images were put into two sets: one set with images completely covered with bamboo trees and the other set without bamboo trees. Images were then chopped into small squares and subsequently used as training images. A classical deep learning model in the form of a LeNet network was employed by the study because it is efficient in processing small-sized images. The network is composed of two convolution layers, two pooling layers, and one fully connected layer. The final layer was used to detect bamboo coverage in Google Earth images. Input images were randomly shuffled to avoid overlap between training and validation data. 72% of the data was used for training and 25% of the data for testing. The model achieved an average classification accuracy of 97.52%.
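The data preparation described above, chopping images into small squares and shuffling before the split, can be sketched as follows. The tile size, the random seed and the training fraction of 0.75 are assumptions chosen for this illustration and are not the settings reported in [118].

```python
import random


def chop(image, tile):
    """Cut a 2-D image into non-overlapping tile x tile squares.

    Pixels that do not fill a complete square at the right or bottom
    edge are discarded.
    """
    rows, cols = len(image), len(image[0])
    return [
        [row[c:c + tile] for row in image[r:r + tile]]
        for r in range(0, rows - tile + 1, tile)
        for c in range(0, cols - tile + 1, tile)
    ]


def shuffle_split(items, train_fraction, seed=0):
    """Shuffle a copy of the items, then split into training and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Toy 4x4 image chopped into four 2x2 squares.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tiles = chop(image, tile=2)
train_tiles, test_tiles = shuffle_split(tiles, train_fraction=0.75)
```

Splitting after chopping means that every square lands in exactly one of the two sets, so no square is used for both training and testing.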

VIII. FEATURE EXTRACTION TECHNIQUES
This section delves into the main techniques for feature extraction.
FIGURE 37. Tree crown delineation [115]
FIGURE 38. CNN model architecture [115]

1) Principal Component Analysis (PCA)
PCA is popularly used as a dimensionality reduction technique [120]. It was first proposed by [54]. From the original data input, the PCA method tries combinations of input features in order to determine the best features that summarise the original data. This is accomplished by looking at pairwise distances to maximize variance and minimize reconstruction error [121]. Since PCA is an unsupervised learning algorithm, it leads to misclassification of data in some cases [120]. Distortion errors arise when data is reconstructed because the samples have been projected onto a subspace [122].

2) Independent Component Analysis (ICA)
ICA, like PCA, is a linear dimensionality reduction method. It treats the input data as a mixture of independent components and aims to correctly identify each of them [120]. It is based on the principle that two features are deemed independent if their linear and nonlinear dependence are both zero [123]. ICA is extensively used in medical applications such as Electroencephalography (EEG) and functional Magnetic Resonance Imaging (fMRI) analysis to differentiate useful from unhelpful signals [120].

3) Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction technique and a machine learning classifier [120]. The method is similar to PCA in that it projects the data along a direction, but instead of maximising the variance of the data, LDA uses label information to determine a projection that maximises the ratio of between-class variance to within-class variance [122]. The goal of LDA is formulated as the Fisher criterion [124].
Recently, this technique has been used in indoor positioning and localisation systems to obtain higher accuracy [125]. The performance of LDA in the construction of data using independent variables is directly proportional to the number of data patterns [125]. However, its performance is yet to be confirmed in the context of nonlinearity [126].
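For reference, the Fisher criterion mentioned above is commonly written as follows. The notation is assumed here for illustration and is not taken from the cited works: $\mathbf{w}$ is the projection direction, $S_B$ and $S_W$ are the between-class and within-class scatter matrices, $\boldsymbol{\mu}_c$ and $n_c$ are the mean and size of class $c$, and $\boldsymbol{\mu}$ is the overall mean.

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \, \mathbf{w}}{\mathbf{w}^{\top} S_W \, \mathbf{w}},
\qquad
S_B = \sum_{c} n_c \, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top},
\qquad
S_W = \sum_{c} \sum_{\mathbf{x} \in c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^{\top}
```

Maximising $J(\mathbf{w})$ favours directions along which the class means are far apart relative to the spread within each class.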

4) Locally Linear Embedding (LLE)
LLE is built on a foundation of manifold learning. A manifold is a D-dimensional object that is embedded in a higher-dimensional space. A manifold is treated as an integration of small linear patches, which is achieved through piece-wise linear regression [127]. To integrate the patches, [128] proposed the construction of a kNN graph similar to that of Isomap. Each sample is then represented by a weighted summation of its k nearest neighbors. With w_i denoting row i of the n x k weight matrix W, the solution is expressed in terms of G, called a Gram matrix, and V, an n x k matrix. After representing the samples as weighted summations of their neighbors, LLE represents the samples in the lower dimensional space by their neighbors with the same weights. The method has been successfully used in feature extraction of Motor Imagery Electroencephalography (MI-EEG), where it outperformed methods such as the Discrete Wavelet Transform (DWT) in classification accuracy with fewer feature dimensions [129].
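As a point of reference, the two steps described above are commonly formulated as the optimisation problems below. This is the standard textbook formulation and is offered as an assumption, not as a reconstruction of the expression used in [128]. Here $\mathbf{x}_{ij}$ denotes the $j$-th nearest neighbor of sample $\mathbf{x}_i$, $w_{ij}$ is the corresponding weight, and $\mathbf{y}_i$ is the low-dimensional embedding of $\mathbf{x}_i$.

```latex
% Step 1: reconstruct each sample from its k nearest neighbors
\min_{W} \; \sum_{i=1}^{n} \Big\| \mathbf{x}_i - \sum_{j=1}^{k} w_{ij} \, \mathbf{x}_{ij} \Big\|_2^2
\quad \text{subject to} \quad \sum_{j=1}^{k} w_{ij} = 1, \qquad i = 1, \dots, n

% Step 2: embed the samples using the same weights
\min_{Y} \; \sum_{i=1}^{n} \Big\| \mathbf{y}_i - \sum_{j=1}^{k} w_{ij} \, \mathbf{y}_{ij} \Big\|_2^2
\quad \text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} \mathbf{y}_i \mathbf{y}_i^{\top} = I, \qquad \sum_{i=1}^{n} \mathbf{y}_i = \mathbf{0}
```

The weights found in the first step are held fixed in the second, which is how the local linear structure of the original space is carried over to the embedding.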

5) t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is an improvement of Stochastic Neighbor Embedding (SNE) [130], which is used for data visualisation. The main goal is to preserve the joint distribution of data samples in the original and embedding spaces. Let P_ij and Q_ij denote the probabilities that x_i and x_j are neighbors and that y_i and y_j are neighbors, respectively. Embedded samples are then obtained by gradient descent, minimizing the Kullback-Leibler divergence [131] between the p and q distributions. The main advantage of t-SNE is the ability to deal with the problem of visualising "crowded" high dimensional data in a low dimensional space (e.g., 2D or 3D) [132] [131].
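For reference, the neighbor probabilities and the objective described above are commonly defined as follows. This is the standard t-SNE formulation, stated here as an assumption about what the survey refers to. The bandwidth $\sigma_i$ of the Gaussian centred on $\mathbf{x}_i$ and the sample count $n$ are notation introduced for illustration.

```latex
p_{j \mid i} = \frac{\exp\!\big(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2\big)}
                    {\sum_{k \neq i} \exp\!\big(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2\big)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}

q_{ij} = \frac{\big(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\big)^{-1}}
              {\sum_{k \neq l} \big(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\big)^{-1}}

\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed Student t kernel in $q_{ij}$ lets moderately distant points sit further apart in the embedding, which is what relieves the crowding problem.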

A. FEATURE EXTRACTION STATE OF THE ART
In image retrieval, calibration, classification, and clustering, it is critical to extract useful features or characteristics from the image [133]. Color histogram is the most significant method to represent color features [134]. [135] provided a state-of-the-art feature extraction model that consists of two parts: (a) adaptive color region extraction via the definition circle (DC) model, and (b) corner feature extraction via the edge detection model, which includes a suppression mechanism.
The purpose of the algorithm was to produce a clear and precise forest saliency map. The algorithm is broken down into three parts: (a) color feature extraction; (b) determination of the center of the DC model; and (c) an accurate description of color. The algorithm is expressed in Figure 36.
(A) Colour feature extraction
The model appropriate for the extraction of color features is the DC model, which comprises the following steps: (1) using the G histogram of the RGB picture to calculate the DC model's center; (2) mapping the image to the HSI or Lab color space; (3) using the k-means procedure to find the DC model's radius. The flow chart of the DC model is shown in Figure 41.
(1) Determine the center of the DC model
While the DC model can describe color fluctuations under specific gradients, the forest region's dominating hue is generally green, implying that the 'greenish' pixels in the forest area must be filtered out. As a result, the G (green) channel in the RGB three-channel system is the focal point for filtering out pixels that fall within a given range and calculating the mean value within that range. That value is regarded as the center of the circle.
(2) Color description
It is critical to note that the purity of the green is determined by the circle's center, thus the radius must be adjusted to account for a variety of color variations and fault tolerance. The RGB color space, on the other hand, does not function well for color adjustments. To fix the problem, the RGB color system is converted to the Hue, Saturation, and Intensity (HSI) or Lab color space. The color can then be defined more accurately using only two channels, namely hue and saturation, than in the RGB color space.
(3) Adjustment of the DC model radius
To improve the accuracy and adaptability of forest region extraction, the center and the entire remote sensing picture acquired in the first phase are mapped to the HSI color space. Each pixel's Euclidean distance to the RSI center is calculated. The k-means clustering algorithm subdivides the forest into clusters and determines the Euclidean distance between the cluster center and the DC model's center, which is then used as the DC model's radius.
P denotes the center of the DC model, whose value is obtained from the histogram model in the RGB to HSI color scheme. R is the Euclidean distance and δ(i) represents the iterations of the clustering algorithm. Figure 39 shows the color feature extraction of the DC model.
FIGURE 39. DC model in color extraction feature [135]
(B) Edge feature extraction
The goal of this procedure is to successfully eliminate non-forest areas. [136] proposed the Canny operator as the edge detection operator because it performs better than other operators in edge feature detection. Denoising is key in image processing, and in this instance a Gaussian filter was employed to smooth the image while preserving the edges. The amplitude and direction of the gradient are then calculated using finite-difference approximations of the partial derivatives. The Canny operator retains only the local maximum values, using non-maximum suppression to suppress the remaining points in the neighbourhood, which results in corner points with high precision and clarity. Finally, by using a dual threshold setting, discrete edges are linked together to form a continuous edge. Figure 35 shows the stages of an edge feature extractor.
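Two of the edge extraction stages described above, the finite-difference gradient and the dual threshold, can be illustrated with a minimal sketch in plain Python. The function names, the toy image, and the threshold values are hypothetical. Gaussian smoothing, non-maximum suppression, and the final edge linking are deliberately omitted, so this is not a complete Canny detector.

```python
import math


def gradient(image):
    """Central finite differences on interior pixels.

    Returns two grids with the same shape as the image: the gradient amplitude
    and the gradient direction in radians. Border pixels are left at zero.
    """
    height, width = len(image), len(image[0])
    amplitude = [[0.0] * width for _ in range(height)]
    direction = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            amplitude[y][x] = math.hypot(gx, gy)
            direction[y][x] = math.atan2(gy, gx)
    return amplitude, direction


def double_threshold(amplitude, low, high):
    """Label each pixel: 2 = strong edge, 1 = weak edge, 0 = suppressed."""
    return [[2 if a >= high else 1 if a >= low else 0 for a in row]
            for row in amplitude]


if __name__ == "__main__":
    # Toy image with a vertical step edge between columns 2 and 3.
    image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
    amplitude, direction = gradient(image)
    labels = double_threshold(amplitude, low=2.0, high=4.0)  # assumed thresholds
    for row in labels:
        print(row)
```

In a full detector, weak pixels are kept only when they connect to strong ones, which is how the dual threshold links discrete edges into a continuous edge.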

IX. PERFORMANCE EVALUATION METRICS
The major metrics used to measure the performance of a model in forest image classification are: False Positive Rate (FPR), Accuracy (Acc), F1 score, Precision-Recall Curve, and Average Precision (AP). Measurements for image segmentation area evaluation are presented in Table 2. The Area Fitness Index (AFI) was proposed by [66] and the remaining measurements by [2]. The overall Area Discrepancy Index (ADI) combines over-segmentation and under-segmentation; when ADI is zero, segmentation quality is optimal. The average distance between a reference object and its matching image object is described by the Position Discrepancy Index (PDI). The overall PDI is the average of the PDI values:
$$PDI_{Overall} = \frac{1}{n} \sum_{i=1}^{n} PDI(i)$$
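The classification metrics listed above and the overall PDI can be computed as in the following minimal sketch. The function names and the example counts are hypothetical. The formulas are the standard definitions based on true and false positives and negatives, which the survey does not spell out.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "fpr": fpr, "precision": precision,
            "recall": recall, "f1": f1}


def overall_pdi(pdi_values):
    """Average the per-object Position Discrepancy Index values."""
    return sum(pdi_values) / len(pdi_values)


if __name__ == "__main__":
    # Hypothetical counts for a forest versus non-forest classifier.
    metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
    for name, value in metrics.items():
        print(name, round(value, 4))
    print(overall_pdi([1.0, 2.0, 3.0]))
```

Reporting precision and recall alongside accuracy is useful when forest pixels are a small fraction of the scene, because accuracy alone can remain high even if most forest pixels are missed.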

X. PERFORMANCE ANALYSIS OF THE STATE OF THE ART
The results are based on a CNN with hyperparameter settings of patch size L = 15, regularisation strength α = 0.001, and C = 32 filter kernels in the first convolutional layer, up to a maximum of C' = 128 kernels. Implemented with TensorFlow and Keras, the final CNN classifiers trained on the hyperspectral imagery outperformed those trained on the RGB subset image, as indicated by precision, recall, and F-score. Results are presented in Table 3. Object fate analysis and the method proposed by [66] do not objectively express segmentation quality results. AFI ranges from 0.561 to -0.280 when shape and compactness are both set to 0.1 and the scale parameter is changed from 60 to 120.

XI. RECOMMENDATION
Pixel-based techniques have been commonly used for image analysis and classification for a very long time. However, given the massive growth of high spatial resolution images and the fact that pixel-based methods only work with spectral information, these techniques cannot fully exploit such imagery because they do not incorporate spatial, texture, and shape information [137]. Previous studies have also shown that such approaches cause noise in the output, otherwise known as the "salt and pepper effect" [138]. Due to the limitations of traditional pixel-based methods in coping with high-resolution imagery, OBIA methods have become increasingly popular because they offer a high degree of information utilization, strong anti-interference, a high degree of data integration, and high classification accuracy [139] [140]. However, GEOBIA techniques are built on knowledge and rules drawn purely from domain experts, which increases the subjectivity of image interpretation processes. Given the evolution of remote sensing science as a result of artificial intelligence, this study suggests that more attention be paid to Good Old-Fashioned Artificial Intelligence (GOFAI), which is based on sound mathematics and logic to construct symbolic representations of abstract notions [1]. This research highly recommends a shift towards remote sensing image analysis with ontologies because such technology allows management, aggregation, and sharing of the knowledge of remote sensing and domain experts. Formal ontologies explicitly define the expert knowledge that is used to interpret remote sensing images. This improves the sharing and reuse of formalized remote sensing expert knowledge.

XII. CONCLUSION
This paper is a critical and analytical survey of the methods for forest image detection and classification. It is a comprehensive review of the techniques used to detect objects of interest in an image that will be analysed for the classification of forests. These techniques cover semantic segmentation techniques, feature extraction methods, and finally classification techniques. Knowledge-based approaches in the form of GEOBIA were analysed, along with how their shortcomings in terms of the dual mode of defining geographic concepts, the vagueness and ambiguity of geographic concepts, and semantic gaps are addressed by ontology-based knowledge approaches. The performance of state-of-the-art TensorFlow and Keras models for image classification was analysed. Knowledge representation with formal ontologies was recommended as a state-of-the-art approach for detecting objects of interest. CNN methods for semantic segmentation were critically analysed, and these were: AlexNet, VGGNet, GoogLeNet, FCN, UNet, SegNet, DeepNet and ResNet.
MANDLENKOSI GWETU is a senior lecturer at the University of KwaZulu-Natal (UKZN), South Africa. He holds a PhD in Computer Science (CS), specializing in medical image processing, from UKZN. He is currently serving as the academic leader of CS at UKZN. He is the principal investigator of the UKZN node in the Erasmus+ funded Living Labs for Climate Change multi-national project, and is an alumnus of the Heidelberg Laureate Forum. His research areas focus on deep learning, pattern recognition and computer vision.
DR JEAN VINCENT FONOU-DOMBEU is a Senior Lecturer in the Department of Computer Science of the University of KwaZulu-Natal (UKZN). He holds a PhD in Computer Science from the North-West University, South Africa, an MSc in Computer Science from the University of KwaZulu-Natal, South Africa, and BSc (Hons) and BSc in Computer Science from the University of Yaoundé I, Cameroon. Dr Fonou-Dombeu's research is in Ontology Engineering, Semantic Web, Machine Learning; specifically, in ontology building, learning, modularization, ranking, summarization and visualization, artificial intelligence, machine learning and data mining methods for the Semantic Web, knowledge representation and reasoning on the web, and knowledge graphs and deep semantics.