Unsupervised Outlier Detection Mechanism for Tea Traceability Data

The presence of outliers in tea traceability data can mislead customers and significantly damage the reputation and profits of tea companies. To solve this problem, an unsupervised outlier detection mechanism for tea traceability data is proposed. First, tea traceability data are uploaded to a MySQL database and then preprocessed to aggregate features by relevance, which makes abnormal features easier to identify. Second, the LOKI algorithm, based on the Local Outlier Factor (LOF), Isolation Forest (IForest), and K-Nearest Neighbors (KNN) algorithms, is used to achieve unsupervised outlier detection on the tea traceability data. In addition, a Density-Based Spatial Clustering of Applications with Noise (DBSCAN)-based tuning method for unsupervised outlier detection algorithms is provided. Finally, the identified outliers are classified by anomaly type to investigate the causes of the anomalies and develop remedial procedures to eliminate them, and the analysis results are fed back to the tea companies. Experiments on real datasets show that the DBSCAN-based tuning method can effectively help the unsupervised outlier detection algorithms optimize their parameters, and that the LOF-KNN-IForest (LOKI) algorithm can effectively identify outliers in tea traceability data. This demonstrates that the proposed unsupervised outlier detection mechanism can effectively guarantee the quality of tea traceability data.

of globalization, more regulatory authorities have focused on the traceability of tea safety and reliability, and customer expectations for tea quality are increasing. The majority of existing tea quality monitoring tools offer customers traceability information, but there are few tools that can be used by businesses to examine and manage this information. Tea traceability data analysis can assist tea businesses in identifying issues in the production management process and can be used to control tea quality at the source.

Traceability data show how things have evolved and may be used to investigate the root and source of things. The gathering of traceability data may be classified into three categories based on the input method used: manual, semi-automatic, and sensor input. With the rapid growth of the […]

[…] linear-model-based methods are the most common unsupervised outlier detection methods.

The credibility of tea enterprises would suffer greatly if they gathered incorrect tea traceability information throughout the manufacturing process and presented it to customers, who were then misled by it; this would in turn harm the profits of tea enterprises. However, enhancing the quality of the traceability data can contribute to the product's value growth. High-quality tea traceability data may also be utilized to help tea enterprises resolve production and administrative problems.

In order to solve the problems caused by the poor quality of tea traceability data and to obtain the benefits of high-quality tea traceability data, the main contributions of this paper are as follows.

(1) An unsupervised outlier detection mechanism is proposed, with the goal of identifying outliers in the data, analyzing the results, and then returning the analysis results to the tea enterprises.

(2) The LOKI algorithm is proposed with the aim of combining different types of outlier detection algorithms to improve the accuracy of outlier detection.

(3) A DBSCAN-based [19] tuning method for unsupervised anomaly detection algorithms is proposed to help the unsupervised outlier detection algorithm determine its parameters.

The remainder of this work is arranged in the following manner. The study on the use of outlier detection in many domains is reviewed in Section 2. The unsupervised outlier detection mechanism for tea traceability data is described in Section 3. The experimental data and analyses are presented in Section 4. Section 5 concludes the article, examines its limitations, and proposes future research directions.

The use of unsupervised outlier detection is popular in tea traceability data as well as in other areas. A significant amount of research has been conducted on how to identify abnormalities in complicated systems using unlabeled data. Liu et al. [20] suggested the use of an incremental unsupervised anomaly detection method to rapidly analyze large-scale, real-time data from industrial control systems. This technique generates a random binary tree set from the data stream's sampled data, continuously merges fresh data into the current model, and provides a weighting mechanism to ensure that the set's findings remain reasonably stable even if some trees are eliminated. Mikhailova [21] employed deep learning approaches to address civil infrastructure engineering challenges and created an unsupervised system that can automatically identify the 'train event' point. Yanjun et al. [22] established an anomaly detection framework and, through data representation, gathered more detailed information on the shape and morphological characteristics of time series in order to better detect outliers in time series data. Time series data outlier identification is also commonly employed […]

[…] proposed in this research for tea traceability data may be able to reliably discover several abnormal characteristics.

To begin, the data are merged based on feature correlation to establish the types of abnormal feature combinations, the reasons for the existence of abnormal features in each group are analyzed, and appropriate improvement methods are then implemented. Simultaneously, the LOKI algorithm, which combines the LOF [27], IForest [28], and KNN [29] algorithms, is proposed to increase outlier detection accuracy by merging multiple types of outlier detection algorithms. In addition, a parameter tuning method for unsupervised outlier detection algorithms is suggested to aid the optimization of parameters in an unlabeled data environment. The experimental results suggest that the proposed mechanism is capable of detecting outliers in tea traceability data.
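The DBSCAN-based tuning method is only summarized in this excerpt; one plausible reading, in which the fraction of points DBSCAN labels as noise is used to set the contamination parameter of a downstream unsupervised detector, can be sketched as follows (the eps and min_samples values and the synthetic data are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for preprocessed (normalized) tea traceability features.
normal = rng.normal(0.5, 0.05, size=(200, 4))
outliers = rng.uniform(0.0, 1.0, size=(10, 4))
X = np.vstack([normal, outliers])

# Step 1: run DBSCAN; points labelled -1 are noise (label-free outlier hints).
noise = DBSCAN(eps=0.15, min_samples=5).fit_predict(X) == -1
noise_fraction = noise.mean()

# Step 2 (assumption): use the noise fraction to set the contamination
# parameter of the unsupervised detector being tuned.
contamination = float(np.clip(noise_fraction, 0.01, 0.5))
iforest = IsolationForest(contamination=contamination, random_state=0).fit(X)
pred = iforest.predict(X)  # -1 = outlier, 1 = normal
print(f"noise fraction: {noise_fraction:.3f}, flagged: {(pred == -1).sum()}")
```

The appeal of this kind of tuning is that DBSCAN needs no labels, so the contamination estimate is obtained entirely from the data's density structure.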

As illustrated in Figure 1, the tea traceability data outlier detection mechanism consists of four parts: data collection, data access, outlier detection, and anomaly analysis. Manual input, sensor input, and semi-automatic input are all examples of data collection methods. The data are uploaded to a MySQL database, which is accessed using JDBC, and the various features are then integrated via correlation analysis [30]. The outlier detection part first detects outliers using the LOF, IForest, and KNN algorithms, assigns weights to the data in the detection results of the three algorithms, and finally filters the optimal common subset of the three result sets using the weights to achieve more effective outlier detection. The anomaly analysis identifies abnormal types […]

[…] the tea traceability data is compiled and the reasons for each anomaly are identified so that appropriate steps may be taken to eradicate them at their sources.

Table 1 shows the feature fields for each data set. There […] fertilizing dates, pruning dates, and picking dates may contain anomalies due to employee errors, such as repeated data entry, data omissions, and data input errors.

Normalization [33] compresses data between 0 and 1 to eliminate order-of-magnitude differences between samples, ensure each data point is of the same order of magnitude, and make the data points comparable. The normalized data fall within the range [0, 1], and the formula is as follows:

x' = (x − x_min) / (x_max − x_min)

where x_max represents the maximum value in the data and x_min represents the minimum value in the data.

Correlation analysis is a method for analyzing the inherent links between data features. It may be used to visually illustrate the direction and degree of an intrinsic association. The linear relationship [34] between two features can be examined using the Pearson correlation coefficient.
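The normalization and correlation steps above can be sketched in a few lines; the column names and values below are illustrative, not the paper's actual schema:

```python
import pandas as pd

# Illustrative records; the real feature fields come from the MySQL database.
df = pd.DataFrame({
    "temperature": [18.2, 19.1, 17.8, 35.0, 18.6],
    "humidity":    [62.0, 64.5, 61.2, 20.0, 63.1],
})

# Min-max normalization: x' = (x - x_min) / (x_max - x_min), mapping to [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())

# Pearson correlation matrix between features (values in [-1, 1]).
corr = df.corr(method="pearson")
print(normalized.round(3))
print(corr.round(3))
```

In this toy data the anomalous row (35.0 °C, 20.0 % humidity) dominates the correlation, producing a strong negative coefficient between the two columns.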
The value ranges from −1 to 1: the closer it gets to −1, the stronger the negative linear correlation between the two features; the closer it gets to 1, the stronger the positive linear correlation; and the closer it gets to 0, the weaker the linear correlation between the two features. The formula used to determine the Pearson correlation coefficient is

r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² )

The reachability distance from sample point n to m is, at least, the k-distance of m:

reach-dist_k(n, m) = max{ k-distance(m), d(n, m) }

Definition 4. Let lrd_k(m) be the local reachability density of sample point m:

lrd_k(m) = |N_k(m)| / Σ_{n ∈ N_k(m)} reach-dist_k(m, n)

The local reachability density of sample point m represents the […]

According to the local outlier factor algorithm, if the ratio of the local reachability density of the k nearest neighbor samples of sample point m to the local reachability density of m is close to 1, point m is similar to its neighborhood points. If the ratio is less than 1, the density of m is greater than that of its neighborhood points; and if the ratio is greater than 1, the density of m is less than that of its neighborhood points and m can be regarded as an isolated point, so the possibility that sample point m is an outlier is greater.
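Under these definitions, the LOF computation can be carried out directly; a minimal numpy sketch on a one-dimensional sample (the data values are illustrative):

```python
import numpy as np

X = np.array([0.0, 0.1, 0.2, 0.3, 5.0])   # last point is clearly isolated
k = 2
D = np.abs(X[:, None] - X[None, :])        # pairwise distance matrix

# Indices of each point's k nearest neighbours (column 0 is the point itself).
knn = np.argsort(D, axis=1)[:, 1:k + 1]
k_dist = np.take_along_axis(D, knn, axis=1)[:, -1]   # k-distance of each point

def reach_dist(p, o):
    # reach-dist_k(p, o) = max{k-distance(o), d(p, o)}
    return max(k_dist[o], D[p, o])

# lrd_k(p) = |N_k(p)| / sum of reachability distances from p to its k-NN.
lrd = np.array([k / sum(reach_dist(p, o) for o in knn[p]) for p in range(len(X))])

# LOF_k(p): mean ratio of the neighbours' lrd to p's own lrd.
lof = np.array([np.mean([lrd[o] / lrd[p] for o in knn[p]]) for p in range(len(X))])
print(np.round(lof, 2))   # points in the cluster score ~1; the isolated point scores far above 1
```

The four clustered points get LOF values of about 1, while the isolated point at 5.0 gets a value far greater than 1, matching the ratio argument above.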
The IForest algorithm is an unsupervised fast outlier detection method based on ensemble learning, mainly suitable for outlier detection on large data sets with continuous eigenvalues. The basic principle of the algorithm is to locate outliers by randomly cutting the data set. The algorithm is described as follows:

Assume that there is a data set D of size n, that the number of base classifiers (iTrees) is m, and that the height limit is h.

To build an iTree, x data points are randomly selected from the training dataset as the sample dataset for that iTree. Then, a feature p of the sample data is randomly selected, the maximum and minimum values of all data in the sample data set in this feature dimension are calculated, and a data partition threshold q is randomly selected within this range. The data whose eigenvalues are less than or equal to q are put into the left subtree, and the data whose eigenvalues are greater than q are put into the right subtree. The previous step is then repeated in the left and right child nodes to continuously and randomly divide the data until a child node contains only one data point or reaches the height limit, at which point cutting stops and an iTree is constructed. Finally, after repeating the above method to construct m iTrees, they are merged into an IForest. Because of the big difference between normal values and outliers, outliers are more likely to be isolated quickly and are therefore more likely to appear near the root of an iTree.
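The recursive cutting described above can be sketched as a toy one-dimensional iTree; this is an illustrative reimplementation, not the paper's code (real iTrees also sample a random feature at each split):

```python
import random

def build_itree(data, height, limit):
    """Recursively split 1-D data at a random threshold until a node is
    isolated (one point) or the height limit is reached."""
    if height >= limit or len(data) <= 1:
        return {"size": len(data)}
    lo, hi = min(data), max(data)
    if lo == hi:
        return {"size": len(data)}
    q = random.uniform(lo, hi)                       # random partition threshold
    return {"q": q,
            "left":  build_itree([x for x in data if x <= q], height + 1, limit),
            "right": build_itree([x for x in data if x > q],  height + 1, limit)}

def path_length(tree, x, height=0):
    """Depth at which x lands in a leaf; outliers tend to land shallow."""
    if "q" not in tree:
        return height
    branch = tree["left"] if x <= tree["q"] else tree["right"]
    return path_length(branch, x, height + 1)

random.seed(0)
sample = [0.48, 0.50, 0.51, 0.52, 0.53, 9.0]   # 9.0 is the obvious outlier
trees = [build_itree(sample, 0, limit=8) for _ in range(50)]
avg = lambda x: sum(path_length(t, x) for t in trees) / len(trees)
print(avg(9.0), avg(0.51))
```

Averaged over many trees, the isolated value 9.0 consistently gets a much shorter path than a point inside the cluster, which is exactly the property the anomaly score exploits.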

When the IForest construction is completed, abnormal data points in the test data can be identified. First, the path height of the test data on each iTree is calculated as follows: the initial height of the test data is set to 0, the test data are sent into the iTree, and the tree is traversed downward according to the branch condition at each node. As each node is passed, 1 is added to the path height, and the path height is returned once the test data reach a leaf. Secondly, the average path height of the test data over the whole IForest is calculated. Then, the anomaly score is calculated using the average path height. Finally, the […]

[…] in the LOKI algorithm is shown in Table 2. The algorithm inputs data set X and outputs the outliers R. First, the LOF algorithm, IForest algorithm, and KNN algorithm are used to detect the data, and the labels L_label, I_label, and K_label are obtained. Data points labelled 0 represent normal data, and data points labelled 1 represent suspicious data. […]

There are three main types of outliers in tea traceability data: outliers of sensor input data, outliers of semi-automatic input data, and outliers of manual input data.
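Table 2 itself is not reproduced in this excerpt; the combination step can be sketched as follows, with a simple majority vote standing in for the paper's weighted optimal-common-subset selection (the detector parameters, the KNN scoring rule, and the synthetic data are all illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.5, 0.05, size=(200, 2)),
               rng.uniform(0.0, 1.0, size=(10, 2))])   # 10 planted outliers
contamination = 10 / 210

# LOF label: 1 = suspicious, 0 = normal.
lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
L_label = (lof.fit_predict(X) == -1).astype(int)

# IForest label.
iforest = IsolationForest(contamination=contamination, random_state=0)
I_label = (iforest.fit_predict(X) == -1).astype(int)

# KNN label: distance to the k-th neighbour; the top fraction is flagged.
dists, _ = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
kth = dists[:, -1]
K_label = (kth >= np.quantile(kth, 1 - contamination)).astype(int)

# Majority vote over the three binary label sets (assumption; the paper
# instead filters an optimal common subset using weights).
votes = L_label + I_label + K_label
R = np.where(votes >= 2)[0]
print(f"flagged {len(R)} points")
```

The intuition the sketch captures is the same as the paper's: points that several heterogeneous detectors agree on are far more likely to be true outliers than points flagged by any single algorithm.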

Equipment damage and aging are the most common causes of outliers in sensor input data. To eliminate these anomalies, the following steps should be taken: (1) equipment maintenance and repair should be improved, and the equipment's key performance should be evaluated on a regular basis; and (2) managers should be familiar with the typical state of the equipment and should debug it often in order to keep it in the best condition.

An incorrect operation method is the most common cause of outliers in semi-automatic input data. The following procedures should be taken to eliminate this type of anomaly: (1) the enterprise should develop a reasonable operating technique process based on the product's manufacturing processes; and (2) strict labor discipline should be implemented, with frequent checks and supervision to ensure that staff carry out the manufacturing process in strict conformity with the company's operating procedures.

The major causes of outliers in manual input data include employees who are sloppy in their production operations, do not precisely follow the enterprise's production process, or simply repeat the same activity, resulting in employee complacency. To prevent this, (1) staff education on product quality awareness should be strengthened and their sense of responsibility increased; (2) technical job training should be strengthened by requiring each employee to learn and closely adhere to the enterprise's production workflow; (3) production and inspection employees should improve their manufacturing process control and conduct thorough process inspections; and (4) enterprises should establish an environment that allows employees to work in peace and comfort.

Before detecting outliers in the tea data, the features need to be combined [37] in order to determine the type of anomaly present. The correlation heat map obtained from the correlation analysis is shown in Figure 5, in which the degree of linear correlation between features can be visualized. The weeding dates, digging terraces dates, planting dates, fertilizing […]

[…] where ACC stands for the accuracy rate, which is defined as the proportion of data successfully predicted by the algorithm […]

The experimental results of each algorithm were averaged over each combination of characteristics under the same outlier ratio, and the LOKI algorithm was compared with the seven typical algorithms described above. The experimental results show that the LOKI algorithm is highly reliable and outperforms the others in every respect.
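ACC, together with the TPR and TNR used in the comparisons that follow, can be computed from a confusion matrix; a minimal sketch with illustrative labels (1 = outlier, 0 = normal):

```python
def detection_metrics(y_true, y_pred):
    """Return (ACC, TPR, TNR): overall accuracy, true-positive rate on
    outliers, and true-negative rate on normal points."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return acc, tpr, tnr

# Illustrative example: 8 normal points, 2 outliers, one outlier missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(detection_metrics(y_true, y_pred))  # (0.9, 0.5, 1.0)
```

Note that with rare outliers a detector can score a high ACC while missing many outliers, which is why the TPR and TNR comparisons below matter separately.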

The detection ACC of the eight techniques at varying outlier ratios is compared in Figure 9. The identification results of the PCA and OCSVM algorithms are much worse. The ACC of the IForest, ABOD, and CBLOF algorithms is slightly lower than that of the LOKI algorithm when the proportion of outliers is small, but as the proportion of outliers increases, the detection effect of the LOKI algorithm remains excellent, while the detection effects of the IForest, ABOD, and CBLOF algorithms deteriorate. The ACC of the LOKI algorithm is higher than that of the LOF and KNN algorithms, giving it a clear relative advantage. The KNN algorithm has […] 3.4% lower than that of the LOKI algorithm. […] algorithm being 2% lower on average. The LOKI algorithm remains stable when the fraction of outliers changes, but the OCSVM, PCA, and ABOD algorithms vary more. The TNR is the most crucial evaluation indicator for businesses, since they do not want to pass on any outliers to their customers.

The TPR of the eight algorithms at different outlier proportions is compared in Figure 11. As the percentage of outlier points goes from 5% to 10%, the ABOD algorithm shows the largest difference in TPR, with a 31.6% decrease. The KNN algorithm is closest to the LOKI algorithm and is 3.9% lower. […]

[…] recorder are performed in the training phase to synchronize the sampling frequency and reduce random noise in the sensor signal. The preprocessed flight features are then reduced by feature subset selection to select features that are highly correlated with the dynamic flight characteristics. The selected features are then used to train model classes to predict common patterns in flight performance during the takeoff and ascent phases. The monitoring phase simulates the flight data recorder dataset and introduces its real-time data into the trained model to validate the detection capability of the proposed framework in real-time situations. Anomalous flight performances are detected when the predicted feature values violate the safety boundaries. However, the framework is incapable of achieving high-performance anomaly detection and feedback. Enrico et al. [45] proposed an online remote fault detection system for underwater gliders to identify undesirable behaviors on the horizon. The system is tested using a deployment dataset of undesirable vehicle behaviors. Once the effectiveness of the system is determined, a trained anomaly detection scheme can be used online from a remote-control center to notify the pilot of a possible failure of the underwater glider after each surfacing and maintenance connection. The system does not allow for more granular detection of anomalies and does not provide an analysis of the anomalies. Wada et al. [46]
proposed an adaptive-model-based anomaly detection system for daily life activities that adapts to new data corresponding to changes in human behavioral habits over time. A forgetting-factor data-driven filtering approach was proposed to help the system adapt to the current behavioral habits of individuals while discarding features that are no longer relevant to old habits. The forgetting factor allows the system to identify outdated activity data that should be discarded while incorporating data representing changes in human behavioral routines for adaptation. Two forgetting-factor approaches are proposed in the paper: the data-aging-based forgetting factor and the data-difference-based forgetting factor. A set of anomaly detection models is then used for behavior modeling. The system cannot locate anomalous data at a fine-grained level and does not provide analysis or feedback on the anomalies. A comparison of the functions of each framework is shown in Table 4.

The above analysis compares the functionality of the existing anomaly detection frameworks, each of which is lacking in terms of completeness. The mechanism proposed in this paper is functionally complete and is capable of locating outliers with fine granularity, achieving high-performance outlier detection, analyzing the anomalies, and providing feedback on the detection and analysis results.

This work provides an unsupervised outlier detection mechanism for tea traceability to improve the quality of tea traceability data and address the challenges caused by poor data quality. The LOKI algorithm is proposed to improve the accuracy of outlier detection. It is suggested that the features of tea traceability data can be combined according […]

[…] The results of this study have the potential to encourage knowledge sharing in the tea supply chain. The described technology can assure the accuracy of tea traceability data and allow tea enterprises to fully comprehend production and operation issues and make timely, targeted adjustments.

The following are some of the future research goals: (1)

The proposed unsupervised outlier detection mechanism for tea traceability data needs to be applied to specific tea pro-