Ontology-assisted Expert System for Algae Identification with Certainty Factor

Harmful Algal Bloom (HAB) is one of nature’s responses to nutrient enrichment in aquatic systems and increasingly occurs in coastal waters, such as in Lampung Bay and Jakarta Bay, Indonesia. HABs present environmental and fisheries management challenges due to their unpredictability, spatial coverage, and detrimental health effects on coastal organisms, including humans. Here, we propose an automated algae species identification system assisted and validated by expert judgment. The system uses an ontology as guidance to determine the species of algae and certainty factors to indicate the level of confidence of the experts when providing a statement or judgment about a particular object or event under consideration. We tested the system on 60 samples using 51 predetermined algal characteristics. The tests were narrowed down to the 20 most common HAB-causing algae types found in the study sites and compared with identification by the experts. The results showed that the system identified the test data with an accuracy of 73.33%. The system also has a high agreement (confidence level above 79.75%) with the identification performed by the experts on six algae species. Further improvement of the system’s accuracy could facilitate its use as an alternative tool for rapid algal identification or as part of an early warning system for HABs.


I. INTRODUCTION
Algal blooms and red tides have become common in many saline or freshwater systems worldwide, including in Indonesia. Both are phenomena in which one or more species of algae/phytoplankton increase rapidly in the water column. The bloom takes on a red or brown color depending on the pigment contained in the phytoplankton [1]. Some causative red tide species are harmful: they produce toxins that can directly kill the affected organisms or be transferred through the food web, hence the name harmful algal blooms, or HABs [2]. Besides being toxic, HABs can also cause respiratory damage to aquatic organisms through the build-up of algae or their remains in the respiratory system and through suffocating hypoxic conditions.
HABs have become a crucial environmental issue in the last decade, particularly for aquaculture industries, due to their impact on fish culture. Between 1990 and 2015, at least 23 cases of HABs were recorded in Indonesian waters, distributed across the Java Sea, Jakarta Bay, Ambon Bay, Lampung Bay, and West Sumatera [3]. However, although HABs are frequently reported by the media, research institutions, and government, effective prevention or mitigation to avoid or reduce mass fish mortality and losses to fish farmers has yet to be seen.
The outbreak of HABs is usually related to changes in environmental conditions. Some physical and chemical water parameters induce the rapid growth of HAB species. Some scientists have argued that the apparent increase in harmful blooms is strongly linked to eutrophication in aquatic systems [4], because nutrient enrichment provides the vast quantity of nutrients that phytoplankton need to grow. Therefore, detecting changes in environmental conditions could serve as an early warning system for HAB prevention or mitigation.
Furthermore, identifying the harmful impacts of HABs is another mitigation approach. For example, various studies have reported the detrimental effects on aquatic organisms of cyanobacteria (blue-green algae), which are predominantly found in eutrophic fresh and brackish waters. Some cyanobacteria species produce microcystins, a group of chemically stable cyclic peptide toxins responsible for cases of poisoning in animals and humans [5]. Thus, knowledge of the abundance characteristics and poisoning potential of algae provides critical information for preventing HABs or mitigating their effects.
Various species of algae have been identified as causes of HABs. At least 32 [6], 37 [7], and 28 [8] causative species of algae were reported to cause HAB events in the Yellow Sea, the South China Sea (SCS), and the Southern SCS, respectively. However, some causative species of HABs are still unidentified, or the accuracy of their identification is at least debated [8]. As such, studies on increasing the accuracy and speed of identification of HAB-causing species have been proposed. For example, [9] developed a framework for automatic identification of one species of algae, namely Neomeris vanbosseae M. A. Howe (NVH), which belongs to the Chlorophyta division. In another study, [10] proposed morphological processing and fluorescence imaging for phytoplankton identification. The study utilized a database of the shape of each alga. Similarly, [11] developed a novel multi-target deep learning framework for algal detection and classification. The study collected a large-scale colored microscopic algae dataset and ran extensive identification experiments on it. Lastly, [12] described a critical development direction for algae classification and determination based on image analysis of biological-morphological differences. However, this study had to deal with some challenges, such as the high degree of sample imbalance and the difficulty of formalizing the description of local physiological features.
These previous studies relied on identifying algae species through image analysis and required large datasets processed with machine learning (ML) techniques. An ML system focuses on statistical modeling: it builds a recognition model from existing training data and then uses that model to infer conclusions from new observations. For example, in a support vector machine (SVM) or neural network (NN), the model consists of parameters such as weights and biases. These parameters are usually optimized with gradient descent-based methods to minimize an error/cost function, that is, to find a minimum on the error surface. Nevertheless, the result might not be easy to explain, particularly in artificial neural network-based systems. The final result is a functional inference system that offers no explanation, which is problematic in some circumstances [13]. For example, a study by [14] demonstrated that an SVM classified poorly when inferring outside its training set. Furthermore, SVMs and NNs require large datasets of hundreds or thousands of images, which carry a high collection cost.
A type of artificial intelligence (AI), the knowledge-based system (KBS), offers the explainability that SVMs and NNs lack. A KBS uses rules to perform inference and draws a judgment based on facts or knowledge. The drawback of a KBS lies in building the knowledge base (KB), which may be time-consuming [15]. Its advantage, however, is that it provides more information about the reasoning behind a result. The KB serves as ground truth, which can be used for various purposes, such as building a faceted search [16] or even a knowledge-based chatbot [17]. Furthermore, the system can explain its logical "thinking" process and why it reached a particular conclusion. To the best of our knowledge, this study is the first of its kind to identify algal species based on their morphology using a KBS in the form of an ontology-assisted expert system coupled with certainty factors.
Currently, most laboratories use labor-intensive manual identification techniques, comparing the algal cell morphology observed under a light microscope with the closest resembling images in printed plankton books or electronic databases. The accuracy of this manual identification process varies depending on the analyst's level of expertise and experience. The method is also tedious and time-consuming because morphological characteristics are compared one by one.
An algal bloom outbreak does not occur suddenly. Although the growth rate of algae increases rapidly once a bloom is under way, observable environmental precursors and developmental stages of the bloom, such as a gradual change in water color, can be easily recognized. By determining the causative algal bloom species as early as possible, before the population reaches its peak, the negative impacts of algal blooms, particularly harmful ones, can be prevented or mitigated. A rapid identification system is needed primarily to shorten the algae identification process in the laboratory. Rapid species identification allows managers to make correct and timely decisions, particularly in the mariculture industry. Therefore, we propose an expert system to assist researchers and laboratory technicians in rapidly identifying algae based on their morphological characteristics. The system uses an ontology to guide the identification process in determining the type of algae, coupled with a certainty factor indicating its confidence level.
The rest of the paper is organized as follows. Section II describes the data and the research methodology. Sections III and IV present the results and the discussion, respectively. Finally, Section V gives the conclusion and future work.

II. MATERIALS AND METHODS
This section describes materials and methods used in this research, including study sites and data, knowledge base, inference engine, dialog interface, and the design of the experiment.

A. STUDY SITES AND DATA
This study took place in two locations, namely Lampung Bay and Jakarta Bay. Lampung Bay is located in Lampung province, adjacent to the Sunda Strait, Indonesia. The bay is a shallow-water area with an average depth of 20 meters and is strongly influenced by anthropogenic activities from the surrounding dense settlements. Tidal currents dominate its water exchange: water mass from the Indian Ocean enters the bay along the east coast and then turns directly to the west side at the tip of the bay. Current velocity decreases on the west coast of the bay due to coastal morphology, as there are several small bays and small islands along the west coast [18]. As a result, nutrients accumulate there and generate frequent algal blooms under certain environmental conditions [19]. Jakarta Bay is located in the north of Jakarta province, Indonesia. It is a shallow-water area with an average depth of 12.5 m, affected by water mass from the Java Sea, whose movement is strongly influenced by the monsoon. Like Lampung Bay, Jakarta Bay is strongly influenced by anthropogenic activities.
Lampung Bay and Jakarta Bay receive a high input of nutrients from land-based activities, which generates frequent algal blooms [20]. The occurrence of algal blooms in Jakarta Bay has increased since 2000. Sixteen species of algae have been identified as bloom-causing [21], of which the most abundant are Skeletonema and Chaetoceros [22]. In Lampung Bay, the frequency of red tides has increased since 2012, caused by blooms of the alga Cochlodinium polykrikoides [23]. The data used in this study are algae data obtained from water samples taken in the two bays from 2017 to 2019. Fig. 1 shows a map of the research sites located in Lampung Bay and Jakarta Bay.

B. KNOWLEDGE BASE
We used an ontology as the knowledge base of the expert system. Apart from being a taxonomic reference for the species and genera under study, the ontology provides axioms that act as boundaries on the characteristics of an object. For example, Alexandrium is a genus of algae with a flagella length ranging between 16 and 55 µm. Therefore, during the identification process, if an alga has no flagella or has flagella with a length outside this range, the system will refuse to classify it into the genus Alexandrium. Fig. 2 shows the ontology design process using the Protégé editor.
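The boundary role of an axiom can be illustrated with a minimal Python sketch. It is not the system's implementation: the `RangeAxiom` class, the property name `flagella_length_um`, and the helper `can_be_alexandrium` are hypothetical names introduced here, and only the 16 to 55 µm range for Alexandrium comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeAxiom:
    """Hypothetical stand-in for an ontology restriction on a numeric property."""
    genus: str
    prop: str
    min_value: float
    max_value: float

    def admits(self, observed: Optional[float]) -> bool:
        # A missing structure (None) or an out-of-range value violates the axiom.
        return observed is not None and self.min_value <= observed <= self.max_value


# Range taken from the text: Alexandrium flagella are 16 to 55 µm long.
ALEXANDRIUM_FLAGELLA = RangeAxiom("Alexandrium", "flagella_length_um", 16.0, 55.0)


def can_be_alexandrium(flagella_length_um: Optional[float]) -> bool:
    """Return False when the observation contradicts the Alexandrium axiom."""
    return ALEXANDRIUM_FLAGELLA.admits(flagella_length_um)
```

Under these assumptions, a specimen without flagella (`None`) or with 60 µm flagella would be refused, while one with 30 µm flagella would remain a candidate for the genus.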
Another benefit of using an ontology in this expert system is that it assists in the "pruning" process. The ontology guides the dialogue so that no questions put to users are wasted. This is done by narrowing down the candidate questions to be asked of the user based on the answers to the previous questions. This pruning technique is explained in Section II.C.
This study used a certainty factor (CF) to express how confident an expert is in the assertion that an alga under consideration has specific characteristics. CF is used when expert judgment is needed on a statement about object/event features, i.e., how sure it is that an object/event has specific features. Two variables must be obtained from the experts, namely the measure of belief (MB) and the measure of disbelief (MD). The certainty factor, CF, of the hypothesis H given evidence E is defined in equation (1):

$$CF(H, E) = MB(H, E) - MD(H, E) \tag{1}$$

where $MB(H, E)$ is the measure of belief in the hypothesis H given evidence E (between 0 and 1), and $MD(H, E)$ is the measure of disbelief in the hypothesis H given evidence E (between 0 and 1).

The certainty factor can take values not only from an expert's level of confidence but also from initial facts found by users [24]. In this case, CF(H, E) is defined in equation (2):

$$CF(H, E) = CF(E) \times CF(\mathit{rule}) \tag{2}$$

where $CF(E)$ is the CF value of evidence E determined by the user, and $CF(\mathit{rule})$ is the CF value of the rule determined by the expert.

Equation (3) combines two or more CFs into a single CF that represents them:

$$CF_{combine}(CF_1, CF_2) = CF_1 + CF_2 \, (1 - CF_1) \tag{3}$$

where $CF_{combine}(CF_1, CF_2)$ is the combined certainty factor, also called the confidence level (CL); $CF_1$ is certainty factor 1; and $CF_2$ is certainty factor 2.

We used a relational database to accommodate the certainty factor calculations in the developed expert system, with the structure shown in Fig. 3.
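The certainty factor calculations can be sketched in a few lines of Python. This is an illustration under assumptions rather than the system's code: it takes equation (1) as CF = MB − MD, equation (2) as CF(E) × CF(rule), and equation (3) in its usual form for two non-negative CFs, CF1 + CF2 × (1 − CF1). The function names are hypothetical.

```python
from functools import reduce
from typing import Iterable


def certainty_factor(mb: float, md: float) -> float:
    """Equation (1), assumed form: CF(H, E) = MB(H, E) - MD(H, E)."""
    if not (0.0 <= mb <= 1.0 and 0.0 <= md <= 1.0):
        raise ValueError("MB and MD must lie in [0, 1]")
    return mb - md


def rule_certainty(cf_evidence: float, cf_rule: float) -> float:
    """Equation (2), assumed form: user CF of the evidence times expert CF of the rule."""
    return cf_evidence * cf_rule


def combine(cf1: float, cf2: float) -> float:
    """Equation (3), assumed form for two non-negative CFs."""
    return cf1 + cf2 * (1.0 - cf1)


def confidence_level(cfs: Iterable[float]) -> float:
    """Fold equation (3) over any number of CFs; an empty input gives 0.0."""
    return reduce(combine, cfs, 0.0)
```

For example, combining CFs of 0.6 and 0.5 gives 0.6 + 0.5 × 0.4 = 0.8, and adding a third CF of 0.5 raises the confidence level to 0.8 + 0.5 × 0.2 = 0.9. Each additional supporting feature therefore increases the confidence level, but never beyond 1.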

C. INFERENCE ENGINE
The application begins by displaying a list of algae images. The user is invited to choose the image that most closely resembles the alga to be identified. Next, the system asks the user a question about a characteristic of the alga. The user only needs to answer 'Yes' or 'No' to indicate whether the characteristic is present in the alga in question. The system processes the user's answer and uses it as input for choosing the next question. The answers given by the user are also the basis on which the system selects the candidate objects. The application continues in this loop until one of three conditions is met: (1) the object in question is identified; (2) the object in question cannot be identified; or (3) the user cancels the consultation session. The flowchart of the inference engine is presented in Fig. 5.
The flowchart shows that three critical processes guide the application in identifying objects. The first is finding the next feature to be asked in a consultation session with the user (fetchNextQuestion). The second is getting the selected candidate object(s) based on the questions asked by the system and the answers given by the user (getObjectCandidates). The third is calculating the CF value of the selected object given the characteristics that the user has confirmed. This CF value is calculated using equation (1), described in Section II.B.
The fetchNextQuestion function takes three parameters: the candidate objects for the algae to be identified, the features that have been asked of the user, and the features selected by the user as characteristics of the object. The main principle of this function is to retrieve, as the next feature to be asked, the feature with the minimum degree of feature similarity among the candidate objects. This is done so that when the user answers 'Yes' to a feature, the system can eliminate candidate objects that do not have that feature as quickly as possible. Thus, to determine the next feature to be asked of the user, we search for the feature that is shared by the fewest candidate objects. Let ς(F1) be the feature similarity set of F1 (the set of objects that have the F1 feature) and δ(F1) be its degree (the number of objects that have this feature). In an example with three candidate objects, F1 has the smallest degree, so the F1 feature is chosen for the next question. The algorithm for the fetchNextQuestion function is as follows.

Algorithm 1 fetchNextQuestion
Input
    ObjectCandidates: candidate objects for the algae to be identified
    AskedFeatures: features that have been asked of the user
    SelectedFeatures: features that have been selected as the characteristics of the object. Initial value: all features in the database
Output
    FeatureCandidate: feature candidate that will be asked of the user for the next question
Description
    This function determines the next feature that will be asked of the user for the next question in a dialogue session.
begin
    FeatureGroups ← enumerate feature groups of SelectedFeatures
    FeatureCandidate ← []
    Candidates ← ...

The getObjectCandidates function retrieves all objects that have the features selected by the user. It filters the current object candidates based on the new set of selected features and an unselected feature (if any) to generate new object candidates. Objects that do not have all the features listed in the selected features will be excluded from the object candidates. Moreover, when the user answers 'No' for a dominant feature possessed by an object, the object is removed from the list of object candidates. The algorithm of this function is as follows.
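The two functions can be sketched in Python as shown here. This is a simplified illustration, not the published algorithms: the knowledge base `EXAMPLE_KB` with objects A, B, C and features F1 to F4 is invented for the example, every feature is treated as dominant (so any 'No' answer removes the objects that have the feature), and ties between features of equal degree are broken alphabetically.

```python
from typing import Dict, Optional, Set

KnowledgeBase = Dict[str, Set[str]]

# Hypothetical knowledge base: object name -> features it possesses.
EXAMPLE_KB: KnowledgeBase = {
    "A": {"F1", "F2", "F3"},
    "B": {"F2", "F3", "F4"},
    "C": {"F3", "F4"},
}


def get_object_candidates(kb: KnowledgeBase, candidates: Set[str],
                          selected: Set[str], rejected: Set[str]) -> Set[str]:
    """Keep objects that have every 'Yes' feature and none of the 'No' features."""
    return {
        obj for obj in candidates
        if selected <= kb[obj] and not (rejected & kb[obj])
    }


def fetch_next_question(kb: KnowledgeBase, candidates: Set[str],
                        asked: Set[str]) -> Optional[str]:
    """Return the not-yet-asked feature held by the fewest candidates."""
    degree: Dict[str, int] = {}
    for obj in candidates:
        for feature in kb[obj] - asked:
            degree[feature] = degree.get(feature, 0) + 1
    if not degree:
        return None  # nothing left to ask
    # Smallest degree first; alphabetical order breaks ties (an assumption).
    return min(degree, key=lambda f: (degree[f], f))
```

With the example data, the degrees are F1 = 1, F2 = 2, F3 = 3, and F4 = 2, so F1 is asked first. A 'Yes' answer leaves only object A, while a 'No' answer leaves B and C, for which F2 becomes the next question.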

D. DIALOG INTERFACE
The expert system was implemented as a web-based application called Algies (Algae Identification Expert System). It can be accessed via a web browser at http://www.masagus.id/algies/. On first entering the application, the user is shown a welcome screen that states how many species/genera are included in the database. The screen also provides buttons to start the consultation and to select the language to be used.
During the consultation session, users are presented with questions generated by the system according to the fetchNextQuestion algorithm described in Section II.C. When the user answers 'Yes' if the characteristic is present in the observed algae, or 'No' if not, the system selects or excludes object candidates according to the getObjectCandidates algorithm. The consultation continues until the alga is identified in the database or the object can no longer be identified. When the alga is identified, the system calculates the confidence level of the identification. At any time, the user can cancel the ongoing consultation session. In addition, we added a facility to evaluate system performance by displaying a feedback form at the end of the consultation session. Fig. 6 shows screenshots of a consultation with Algies: (a) the system asks about the characteristics of the algae, and (b) the system provides identification results based on the characteristics given.

E. EXPERIMENTAL DESIGN
Once the expert system was developed, it was tested to evaluate its performance, i.e., how accurately it provides the correct answer from the characteristics described by the user. We took 60 samples of algae, which were distributed evenly among three experts. Algies then tried to identify them one by one. Finally, the algae experts verified the identification results. Algae identification accuracy is calculated as the ratio of correct answers (correctly identified objects) to the number of experiments, as defined in equation (4):

$$\mathit{Accuracy} = \frac{\text{number of correctly identified objects}}{\text{number of experiments}} \times 100\% \tag{4}$$

Equation (5) is used to calculate the overall confidence level. It is the average of all confidence levels for correctly identified objects:

$$\overline{CL} = \frac{1}{n} \sum_{i=1}^{n} CL_i \tag{5}$$

where $CL_i$ is the confidence level of the i-th correctly identified object, calculated using equation (3), and n is the number of correctly identified objects.
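The two evaluation measures can be computed as in the following minimal Python sketch. The function names and the sample values are illustrative assumptions, not data from the experiment; each test result is represented as a pair of a correctness flag and a confidence level in percent.

```python
from typing import Sequence, Tuple

# One test result: (was the identification correct?, confidence level in %)
Result = Tuple[bool, float]


def accuracy(results: Sequence[Result]) -> float:
    """Percentage of correctly identified objects among all experiments."""
    if not results:
        raise ValueError("at least one experiment is required")
    correct = sum(1 for is_correct, _ in results if is_correct)
    return 100.0 * correct / len(results)


def overall_confidence(results: Sequence[Result]) -> float:
    """Mean confidence level over the correctly identified objects only."""
    levels = [cl for is_correct, cl in results if is_correct]
    return sum(levels) / len(levels) if levels else 0.0


# Made-up sample: three correct identifications and one incorrect one.
sample = [(True, 100.0), (True, 80.0), (False, 60.0), (True, 90.0)]
```

For the made-up sample, the accuracy is 75% and the overall confidence level is 90%. As a plausibility check on the arithmetic only, 44 correct identifications out of 60 experiments would correspond to an accuracy of 73.33%.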

III. RESULTS
This section describes the results obtained in the research that has been carried out.

B. KNOWLEDGE BASE
The collected algal data and their characteristics were then compiled into a table of species/genera ↔ feature relationships. There were a total of 51 features possessed by the 20 algae. The table was then given to five algae experts for assessment and validation. The experts have at least ten years of experience in algal identification and analysis. The assessment results are shown in Table 2.
The Char column contains the feature numbers as listed in Table 1. Columns 1 to 20 represent the species/genera codes listed in Section III.A. A number in a cell represents the dominance level of the feature in the species/genus. For example, in the first row, columns 1, 2, 15, and 18 contain the numbers 4.375, 4.375, 4.5, and 7, respectively. This means that the stretched straight shape (feature no. 1(a)) is possessed by the species/genera Nitzschia, Pseudo-nitzschia, Ceratium furca, and Trichodesmium with the dominance levels given above. Each value is the average of the scores given by the experts according to their beliefs about the dominance level of the feature in a species/genus. The scores follow the criterion ranges listed in Table 3.

C. EXPERIMENT
Tests were carried out to see the system's performance in identifying algae types based on their characteristics. First, we trained three algae experts to use and evaluate Algies. One evaluator was from the Ministry of Marine Affairs and Fisheries, and the other two were university academics. Two evaluators hold doctorate degrees and one has a master's degree in related algal science. We then provided each evaluator with 20 samples of algae to be identified using Algies. The results of Algies identification were then compared with the results of evaluator identification, i.e., correct or incorrect. The results of testing the Algies system by the three evaluators are presented in Table 4. The accuracy is calculated using equation (4). We also calculated the average value of confidence level and accuracy to see the overall system performance. Moreover, we captured the incorrect species/genus names suggested by the system in the last column for analysis purposes.

IV. DISCUSSION
An expert system simulates human judgment by using encoded knowledge. In general, the quality of a judgment relies on the techniques used to represent knowledge and perform reasoning. For example, correctly identifying HAB-causing species or genera of algae requires precise knowledge. Combining an automated expert system with high-quality encoded knowledge could, in theory, outperform contemporary methods of algal identification in providing rapid results for decision-making. Unfortunately, the lack of automatic identification tools has kept researchers and laboratory technicians from delivering fast results on HAB causative species. In this work, a rapid identification system is proposed based on an expert system with certainty factors. First, the morphological characteristics of the targeted algae were formalized as a knowledge base (KB). After that, an inference engine was developed to drive a dialogue interface between users and the KB. The interface facilitates a structured exploration of the KB through a series of questions and answers. As a result, the species or genus of a specimen is identified.
We incorporated 51 algae characteristics as features into our expert system. We tested the system on 20 types of algae and obtained an accuracy of 73.33%. For the correctly identified species/genera, confidence in the results was high: the confidence levels range from 52.67% to 100%, with an average of 83.51%. The confidence levels are lower for the incorrectly identified species/genera, ranging from 50% to 96.2%, with an average of 62.28%. This finding indicates that the developed expert system has worked as expected, in that the certainty factor has a linear impact on the results. When experts are confident in their judgments, the system has a high chance of producing a correct result.
In Table 4, the genus Nitzschia is the object most easily identified by Algies, with the highest confidence level (100%) in each test carried out by the evaluators. This result indicates that the description of the genus Nitzschia is clear and significantly different from those of the other objects. The other objects consistently identified by the system are Noctiluca scintillans, Trichodesmium, Pyrodinium bahamense, Prorocentrum micans, and Dinophysis caudata. The system's identification results for these species/genera were in line with the identification results of all evaluators, with a confidence level above 79.75%.
Algies consistently misidentified Gymnodinium sanguineum as Gymnodinium catenatum. This result underlines a critical finding: the description of Gymnodinium sanguineum has to be improved to differentiate it from Gymnodinium catenatum, and vice versa. The result also illustrates the strength of a KBS in exposing its "logical thinking" and "conclusion" for further improvement, as discussed at the beginning of this paper.
One thing that should be noted for further inspection is the identification result of Gyrodinium lachryma. There is an inconsistency in the species identification process. Two of the three evaluators showed correct results with a high confidence level (94.75%). However, the other evaluator showed that the result was incorrect with a high confidence level, i.e., 96.5%. This finding indicates that there might be inconsistencies in the weighting of the characteristics dominance level of this species.
Further, we examined the disagreement between evaluators by counting how many evaluators disagreed with each result from the developed expert system. We found that 33.33% of the disagreement cases involved one evaluator, 55.56% involved two evaluators, and 11.11% involved all three evaluators. As this result shows, more than half of the disagreement cases (66.67%) involved two or more evaluators. It indicates that the judgment of the majority of experts has a high impact on the produced results.

V. CONCLUSION
A novel expert system for identifying algae is proposed in this paper. The proposed method uses knowledge of algal morphology in the form of an ontology as the system's guide. The knowledge of 51 algae characteristics was acquired from five algae experts on 20 HAB causative species/genera. Furthermore, certainty factors were used to indicate the level of confidence. The experimental results showed that the system's accuracy and the confidence level of correctly identified algae were 73.33% and 83.51%, respectively.
There is room for improvement to increase the system's accuracy, such as attributing different weights to each characteristic or adjusting the certainty factors. For example, more weight could be given to the more essential determinants (based on experts' knowledge) of an algae type. Furthermore, the certainty factors could be tuned automatically using machine learning techniques. Moreover, a hybrid approach that combines the system with computer vision is worth trying in the future.