Investigating and Suggesting the Evaluation Dataset for Image Classification Model

Image processing systems have become widespread with the digital transformation driven by artificial intelligence. Many researchers have developed and tested image classification models using machine learning and statistical techniques. Nevertheless, current research seldom focuses on the quality assurance of these models. Existing methods fail to verify quality assurance, lacking the test cases needed to prepare an evaluation dataset for testing the model, which can cause critical drawbacks in the nuclear field and defense systems. In this article, we discuss and suggest preparing the evaluation dataset using improved test cases derived through Cause-Effect Graphing. The proposed method can generate the evaluation dataset with automated test cases through a quantification method that consists of 1) selecting image characteristics, 2) creating a cause-effect graph of the image with the features, and 3) generating all possible test coverage. The testing is performed with the COCO dataset and shows declining prediction accuracy as brightness and sharpness are adjusted in the range of −75% to 75%, indicating that important characteristics are neglected in the existing test dataset. The experiment shows that prediction fails when sharpness is below 0%, and that brightness fails at −75%, with fewer objects detected between −50% and 75%. This indicates that characteristic changes affect both the prediction accuracy and the number of detected objects in an image. Our approach demonstrates the importance of the characteristic selection process over the overall image set for generating a more efficient model and increasing the accuracy of object detection.


I. INTRODUCTION
Advances in machine learning (ML) techniques have spread to a wide range of applications that perform advanced perception and decision-making in various important security-related fields [1]. Safety-critical machine learning systems need strong quality assurance methods for identifying risks in real time. Currently, the methods used to assess model accuracy include precision-recall, the F-measure, and the ROC curve [2]-[4]. Quality assurance (QA) has a strong effect on social acceptance because it has always been a way to deliver safety and security [5].
Research on datasets has mostly focused on dataset balance, feature selection, and skewness [6]. Among these, feature selection is an important process that identifies and removes irrelevant features. Ghotra et al. [7] developed metrics to confirm the impact of feature selection on the dataset by comparing accuracy across 30 feature selection techniques. Given the importance of feature selection, many researchers have focused on new feature selection techniques based on machine learning, such as random forest [8], decision tree [9], MapReduce [10], and so on. Similarly, much of the research covers the balance between the training and testing datasets, explaining its importance for test accuracy and image classification [11]. Ponce et al. [12] give a detailed account of dataset issues in object recognition, especially with image datasets such as Caltech and PASCAL, focusing mainly on object detection against different backgrounds. The issues in the Corel dataset and its annotations were analyzed using a support vector machine (SVM) [13]. Zhang et al. [14] also trained and tested on the PASCAL dataset with different combinations of backgrounds. Catal and Diri [15] investigate the effect of dataset size and metrics on the fault prediction problem, while Bennin et al. [16] investigate the effect of balanced training data on the prediction system. Improper data preparation can decrease accuracy and increase the number of errors [17]. Recently, researchers have focused on the characteristics of the dataset. Oreski et al. [18] emphasize the importance of dataset characteristics for feature selection but ignore image characteristics. Dodge and Karam [19] examine the importance of image quality and its effect on the accuracy of deep neural networks; five types of image quality distortion, namely blur, noise, contrast, JPEG, and JPEG2000 compression, are applied across different deep learning techniques. Machine learning algorithms are also used to analyze and improve image quality [20] and for image denoising [21]. All these studies reveal that there is no single method to prepare a balanced dataset with the required image characteristics. Hence, in this article, we focus on the impact of image characteristics in building a balanced dataset.
(The associate editor coordinating the review of this manuscript and approving it for publication was Hossein Rahmani. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Most researchers claim the highest performance for their models, but some reveal unsuccessful results for the same images in a different environment [22]. Even though some researchers have claimed that their models provide the highest performance, these models prove unsuccessful when evaluated on public datasets [23]. Therefore, it is extremely crucial to benchmark available fault prediction models under different conditions on public datasets. We acknowledge the availability of effort-aware predictive models, but we argue that the selected models cover imbalanced datasets across different domains of techniques, ranging from statistical and data-mining to machine learning techniques. Hence, these models are representative of the many models that exist in the literature.
The current regulatory framework for different kinds of software relies on a software and system engineering paradigm that was clearly not designed with machine learning in mind. Widely used standards for software development life-cycle processes, such as IEC 62304 for medical device software [24], ISO 26262 [25] for automotive, or ISO 25000 [26] for general-purpose software, are based on defining requirements, defining the architecture, decomposing the system into smaller units, integrating, and verifying and validating the results. Among the 5 V's of Big Data [27], namely Variety, Velocity, Volume, Veracity, and Value, Variety is the most important with respect to quality assurance. Therefore, in this article, we focus on the variety of datasets.
In fact, recent accidents during the use of several experimental systems in security-related applications have shown that QA frameworks are imperative for addressing this upcoming social issue [28]. Although image classification for security-related fields is actively developed and proposed, QA concepts and technologies to ensure safety and security have not yet been systematized. Therefore, in this study, we organize the review of the open QA problems on safety-critical machine learning systems, using the public image dataset COCO with machine learning models as an example. Kim [29] validated the COCO dataset using the YOLO model and explained its labelling issues. Nevertheless, the COCO dataset has been used in many image classification analyses [30]-[36]. Pont-Tuset and Van Gool [37] also report a preference for COCO over PASCAL. We likewise chose the COCO dataset for our analysis. The contributions of this study are:
• Clarification of the problems related to the quality assurance of a model trained on a given dataset in image classification;
• Identification of problems in safety-related systems, especially those using machine learning models;
• Demonstration of the effectiveness of cause-effect graphing testing to prepare the evaluation dataset;
• Discussion of the open issues in current testing and the next research directions.

II. RESEARCH QUESTIONS
Most research presumes that prediction models will improve if appropriate features and classes are selected for the dataset and trained, so that high performance can be achieved. We embark on experimenting with such a model to demonstrate the importance of an evaluation dataset that covers the different characteristics of the images in the overall dataset. The objective of this paper is to show that the preparation of the evaluation dataset is just as important as the preparation of the training dataset. The objective was inspired by the following questions, which are discussed later in the results.
1. Can the trained model detect all the objects in an image if there is a change in the image characteristics, for example, an image taken at night or in a darker area?
2. Could the possible characteristics of the image be identified to generate an accurate model? The search for characteristics is not limited and can even be considered a form of feature selection for the dataset. Initially, it may seem reasonable to assume that the dataset contains all the possible characteristics needed to produce the best model, but that is not always the case.
3. What if the model fails in a crucial industry connected with national security or health care? Image classification can be sensitive in such fields, which demand more rigor and place high importance on the models. However, most public datasets do not produce high accuracy when tested with different images from different places. It is therefore important to consider all possible characteristics of the overall images together with the feature selection of the dataset.
4. How can more accurate results be obtained? It may seem justified that the training dataset should contain all the characteristics of the images to produce an efficient model. We explore this domain to investigate the evaluation dataset on the prediction model for high-accuracy results.

III. EXPERIMENTAL SETTING
First, we describe the datasets used in the study. We conducted the experiment with the open-access COCO dataset. The evaluation dataset was prepared with Cause-Effect Graphing, and the YOLO framework was used to train and test the model.

A. DATA COLLECTION
The Common Objects in Context (COCO) dataset [38] has 91 object classes, each with more than 5,000 labeled instances. Across a total of 328,000 images, there are 2,500,000 labeled instances. The COCO dataset consists of fewer categories than the ImageNet dataset [39] but is rich in instances, which improves the learning of object models for precision and accuracy. The dataset is also larger in the number of instances than the PASCAL VOC [40] and SUN [41] datasets. In addition, the COCO dataset can be used for learning contextual information, as it has considerably more object instances per image than ImageNet and PASCAL. Although the SUN dataset contains significant contextual information, it has very few instances. Therefore, we chose the COCO dataset for our study.

B. CAUSE-EFFECT GRAPHING
Cause-Effect Graphing [42] graphically represents combinations of various input conditions and the issues that influence the result. The graph is later transformed into a decision table to obtain the test cases. The cause-effect graphing technique is used because boundary value analysis and equivalence class partitioning do not consider combinations of input conditions, and some critical behavior only appears when such combinations are tested.
Steps used in deriving test cases with this technique are:
1. Divide the specification into small pieces and convert them into cause-effect graphs.
2. Identify the possible input and effective output conditions in the specification.
3. Using Boolean expressions, transform the specification into a cause-effect graph, adding constraints where required.
4. Convert the graph into a decision table.
5. Condense the decision table into potential test cases.
Although the method can detect ambiguity and inadequacy from a unique perspective, it cannot guarantee that all the generated test cases are useful.
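Steps 3-5 above can be sketched in a few lines of Python. This is a minimal illustration, not the CETA tool itself: the cause names and the effect rule are hypothetical stand-ins based on the three image characteristics discussed later (brightness, sharpness, grayscale).

```python
from itertools import product

# Hypothetical causes, loosely modeled on the image characteristics
# discussed in Section III-C.
CAUSES = ["brightness_ok", "sharpness_ok", "grayscale"]

def effect(row):
    """Illustrative effect rule: the object is expected to be detected
    (E1) when brightness and sharpness are both within tolerance."""
    return row["brightness_ok"] and row["sharpness_ok"]

def decision_table(causes):
    """Enumerate every Boolean combination of causes (step 4) and
    attach the expected effect, yielding one test case per row (step 5)."""
    table = []
    for values in product([False, True], repeat=len(causes)):
        row = dict(zip(causes, values))
        row["E1_detected"] = effect(row)
        table.append(row)
    return table

test_cases = decision_table(CAUSES)
print(len(test_cases))  # 2^3 = 8 candidate test cases before pruning
```

In practice the constraints added in step 3 prune infeasible rows, which is why the number of useful test cases is smaller than the full 2^n table.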

C. CHARACTERISTIC ANALYSIS
Identifying the characteristics of an image is similar to feature selection for images. Feature selection is the process of selecting the most important features that affect model performance, influencing overfitting and accuracy and largely determining training time. Similarly, the characteristics of every image are important; they need to be identified and included in the training and evaluation datasets, as they greatly affect the accuracy of the prediction model.
When the COCO dataset is used in our experiment, we first analyze and evaluate the characteristics of the images. The characteristics of the objects themselves are also important for accuracy; they will be analyzed and discussed in our future work with other comparative datasets. From the 91 object classes in COCO, we chose room images for our analysis, containing a TV monitor, dining table, chair, vase, clock, refrigerator, potted plant, and people. After careful observation and analysis, we condensed the characteristics of the images into three causes: brightness, sharpness, and grayscale. The brightness and sharpness of the images are altered over different ranges and verified against predefined test scenarios obtained through cause-effect graphing.
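The brightness alteration can be sketched as follows. The paper does not specify how a percentage maps to a pixel operation, so this sketch assumes a simple multiplicative mapping (0% leaves the image unchanged, −75% gives factor 0.25, +75% gives factor 1.75) applied to a grayscale image represented as nested lists; the real experiment presumably used an image library.

```python
def pct_to_factor(pct):
    """Map an adjustment percentage in [-75, 75] to a multiplicative
    factor (our assumption; the paper does not state the mapping)."""
    return 1.0 + pct / 100.0

def adjust_brightness(pixels, pct):
    """Scale every pixel of a grayscale image (rows of 0-255 ints)
    by the factor and clamp back to the valid 0-255 range."""
    f = pct_to_factor(pct)
    return [[min(255, max(0, round(v * f))) for v in row] for row in pixels]

img = [[100, 200], [0, 255]]
print(adjust_brightness(img, -75))  # [[25, 50], [0, 64]]
print(adjust_brightness(img, 75))   # [[175, 255], [0, 255]]
```

Note the clamping: at +75%, bright pixels saturate at 255, which is one reason detail (and hence detection confidence) degrades at extreme adjustments.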

IV. EXPERIMENT DESCRIPTION
The experiment revolves around analyzing the prediction accuracy of the model under different scenarios involving characteristic differences in the images. The YOLO-Darknet framework [43] was used to test and compare the model on the normal images and on new images with adjusted brightness and sharpness. In this case, more types of datasets are required to improve the accuracy of object detection.
Table 1 shows the confidence scores of the objects detected in the brightness-adjusted images, which were obtained from the COCO dataset. Among the many characteristics, we chose two causes, brightness and sharpness, to test and compare the test images. The same result is shown in Table 2 with the detected images, along with the bounding boxes. From the results, we can clearly see the reduction in detected objects as the brightness of the images changes; the confidence score also declines for each adjusted image. The results indicate that object detection is greatly affected when the brightness adjustment goes beyond ±50%. Although a few objects are still detected, the prediction percentage is affected.
Similarly, the sharpness of the images is adjusted between −75% and 75%, and the results are tabulated in Tables 3 and 4. In contrast to brightness, the sharpness-adjusted images are more stable under increases, but decreases in sharpness cause detection failures. This implies that the training dataset lacks images with different characteristics, which may affect the training model and the test results.
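The sweep described above can be expressed as a small test harness: run the detector on the image at each adjustment level and count how many objects survive a confidence cut. Everything here is a stand-in: `fake_detect` is not YOLO, and its confidence decay merely mimics the trend reported in Tables 1-4.

```python
def count_detections(detections, threshold=0.5):
    """Count (label, confidence) predictions above the threshold."""
    return sum(1 for _, score in detections if score > threshold)

def sweep(detect, image, adjust, levels):
    """Run the detector on the image at each adjustment level and
    record how many objects survive the confidence cut."""
    return {pct: count_detections(detect(adjust(image, pct))) for pct in levels}

def fake_detect(image):
    """Stub detector: our stub 'image' is just the adjustment level,
    and confidence degrades as |adjustment| grows."""
    base = [("chair", 0.9), ("tv", 0.7), ("vase", 0.55)]
    return [(label, score - abs(image) / 200) for label, score in base]

result = sweep(fake_detect, 0, lambda img, pct: pct, [-75, -50, 0, 50, 75])
print(result)  # 3 detections at 0%, fewer at the extreme levels
```

With a real model, `detect` would wrap the YOLO-Darknet inference call and `adjust` the brightness or sharpness transformation; the harness itself stays unchanged.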
Testing a machine learning model has several potential limitations, so we discuss the possible threats to identifying the objects in the images. This study identifies faults in dataset preparation by comparing brightness- and sharpness-adjusted images with the original images. Considering only the original images, it is difficult to generalize the accuracy of the trained model; we need to prepare an evaluation dataset that can test all possible characteristics of the images. For this reason, we use the CETA (Cause-Effect Test Analysis) tool [44] to draft all possible test cases for verifying the accuracy of object detection. Using the CETA tool, which was developed in our earlier research, we derived all possible test cases, including the characteristics of the images and objects. This quality assurance tool creates test cases with the help of cause-effect graphing. Apart from the time to identify and enter the characteristics into the tool, test case generation is very fast with negligible time complexity, owing to the graphing method. Figure 1 shows the test case generation in the tool with 13 causes and 2 effects; the two effects represent whether the objects in the image are detected or not detected.
The layout of the tool has six windows: 1) Characteristics of cause-effect for the images, giving a detailed description of the specification and conditions of the cause-effect; 2) Construction of the cause-effect, containing the cause and effect lists; 3) Test Scenario; 4) Test Values; 5) Cause-Effect Graph; and 6) Test Cases.
By adjusting the brightness and sharpness of the images between −75% and 75%, the test cases are derived for the COCO dataset using the cause-effect method and tabulated in Table 5. The characteristics are categorized as causes C1 to C13, and the effects are divided into E1 and E2. A total of 6,345 test cases were obtained, with 140 test scenarios generated from a decision table consisting of 13 causes and 2 effects. When the number of detected objects exceeds 50%, under the combination of all image characteristics, the expected result of the test case is success. With these test cases, the evaluation dataset can be prepared to test the prediction accuracy of the model.

V. ANALYSIS OF EXPERIMENT
Comparing the performance of the evaluation dataset with and without the test cases has given detailed insight into the importance of dataset preparation. The research questions that motivated the study are discussed in detail alongside the experimental results.

Question 1: Does the training dataset include all possible characteristics?
We compared the results for the original images and the altered images on the same trained YOLO model, as shown in Tables 2 and 4. The results show that the confidence score changes with minimal changes in the brightness and sharpness of the images, while detection fails at the maximum changes. From the tables, it is evident that the training dataset does not include all possible characteristics, resulting in low detection accuracy.

Question 2: Could the possible characteristics of the image be identified to generate an accurate model?
The pre-trained YOLO model covers 91 categories of images from different sources. For object detection, it is undoubtedly a well-trained model that can detect most objects with good accuracy. However, examining the comparison results in Table 1, it can be observed that detection performance is inconsistent.
In other words, we need to include more training data with the additional characteristics of the images.

Question 3: What if the model failed in the crucial industry connected with national security and health care centers?
Pattern recognition plays a significant role in many high-stakes domains such as national security, nuclear energy, and the medical field. With advanced methods and algorithms, machine learning algorithms have gained trust. Hence, if a simple characteristic is missed during training, it may cause a major issue in a high-risk field. To emphasize the importance of the issue, this paper compares the results for the altered images with those for the actual images. With the cause-effect graphing tool, it becomes even more efficient to assure the quality of the trained model.

Question 4: How to obtain more accurate results?
To find out whether the model can produce more accurate results, we need a quality assurance method. Quality assurance for machine learning techniques has always remained a challenge for the following reason: if the training dataset does not include all the features of the images, it will not produce accurate results.
To minimize the probability of failures, ML models have to be tested with a balanced dataset, especially in sensitive domains. Generally speaking, the outcome of an ML model is a prediction, which is not easy to compare or verify against some kind of expected value. Nevertheless, developers test machine learning model performance by comparing predicted values with the model output values, which is different from testing the ML model for any input, due to this limitation. The so-called black-box testing of ML models can employ a variety of techniques, such as metamorphic testing, model performance testing, dual coding, comparison with linear models, coverage-guided fuzzing, and testing with varying data slices. There is also the problem of causality: a machine learning algorithm does not know whether a regularity found in the input data is a cause of a prediction or just a correlation. This makes quality assurance difficult for ML-based models.
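The metamorphic testing mentioned above fits this study's setting well: instead of comparing predictions against ground truth, we check a relation between outputs on the original and transformed inputs, e.g. "a mild brightness change should not change the predicted label". The sketch below uses a stub model and transform (stand-ins for YOLO and the brightness adjustment), since the relation itself is what matters.

```python
def metamorphic_check(model, image, transform, levels):
    """Metamorphic relation: under mild characteristic changes, the
    predicted label should not change. Returns the adjustment levels
    that violate the relation -- no ground-truth label is needed."""
    baseline = model(image)
    return [pct for pct in levels if model(transform(image, pct)) != baseline]

# Stubs: the "image" is a single brightness value, and the stub model
# flips its answer once the image becomes too dark.
stub_model = lambda img: "chair" if img > 30 else "unknown"
shift = lambda img, pct: img * (1 + pct / 100)

violations = metamorphic_check(stub_model, 100, shift, [-75, -50, -25, 0, 25])
print(violations)  # [-75]: the relation breaks only at the darkest level
```

With a real detector, `model` would return the set of detected labels, and a violation at some level would flag exactly the kind of characteristic gap in the training data that Tables 1-4 exhibit.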
In line with this issue, we use the cause-effect graphing tool to create the test cases, prepare the evaluation dataset, and test all possible outcomes of the images with the public COCO dataset. The comparison results show the need for characteristic identification and demonstrate that cause-effect graphing tools help create an evaluation dataset for testing the results. The existing literature focuses on the quality and accuracy of machine learning models, but our study explains that missing image characteristics can disturb the accuracy of object detection and cause imbalanced learning. We intend to consider more unique characteristics in various datasets in future studies.

VI. CONCLUSION
We have analyzed and examined the importance of preparing the evaluation dataset through the cause-effect testing method. The open COCO dataset was used in the experiment to test two types of scenarios involving the characteristics of the images. Of particular interest was how the prediction accuracy on the images varied with the adjustment percentage in the test dataset. Two characteristics, brightness and sharpness, were identified to address the research questions, and the CETA tool was used to create the test cases covering the various characteristics. The results show a decline in prediction accuracy with respect to the changes in the adjusted images, indicating the importance of image characteristics. To the best of our knowledge, this is the first study that focuses on preparing an evaluation dataset with test cases that include the characteristics of the images. Assessing the performance of testing with and without the adjusted images, our results lead to the conclusion that a model can achieve better results with well-prepared training and evaluation datasets built from cause-effect-based test cases.
SUN IL CHON received the master's degree in electronics engineering from Jeonbuk National University. He is currently a Key Researcher with ThinkforBL Consultancy Services. He has conducted consulting in software engineering and is researching the development of a smart barn service incorporating artificial intelligence software technology.
DO YEON CHOI majored in software engineering at Jeonbuk National University and conducted consulting in software engineering and testing. She is currently working on smart barn research incorporating AI technology and conducting quality assurance and software engineering activities, leading core research at ThinkforBL Consultancy Services.
JI HWAN PARK majored in electronics at Sungkyunkwan University and has conducted consulting in the software engineering field for about 300 companies. He was an Adjunct Professor with the Department of Knowledge and Information Engineering, Ajou University, and the Department of Software Engineering, Jeonbuk National University. He is working on the application of software technology to the field of smart livestock. He is currently the Vice President of the Korea Software Engineering Network (K.SEN) and the Korean Representative of the Asia Software Quality Network (ASQN).