A Hybrid Method for Objective Quality Assessment of Binary Images

In the paper, a novel hybrid method for the automatic quality assessment of binary images is proposed that may be useful, e.g., for computationally limited embedded systems or Optical Character Recognition applications. Since the quality of binary images used as the input for further image analysis strongly influences the obtained results, a reliable evaluation of their quality is a crucial element for the validation of such systems. Assuming the availability of several video frames, an objective quality assessment of individual video frames may also be helpful for the choice of a binarized image leading to the “best” final results. Nevertheless, most of the image quality assessment methods concern the analysis of grayscale or color images, utilizing the available image datasets containing numerous images subject to various distortions together with the corresponding subjective quality scores. Therefore, a reliable quality assessment of binary images is troublesome due to the small number of datasets and methods dedicated to binary images. The approach presented in the paper is based on the combination of one such metric, known as Border Distance, with some other methods, utilizing a nonlinear model based on the weighted sum and product of individual metrics. The experimental verification of the proposed hybrid metric, conducted for the publicly available Bilevel Image Similarity Ground Truth Archive dataset, leads to a significantly higher correlation with subjective quality scores in comparison to the other methods.


I. INTRODUCTION
One of the most crucial problems in computer vision and image analysis applications is a reliable quality assessment of images, necessary for automatic verification of their usefulness for further processing. The ability to perform an objective image quality assessment (IQA) is also significant for real-time video analysis applications, where the automatic choice of the ''best'' video frame may improve the final results, particularly in low processing power systems.
Depending on the specific application, various kinds of images may be used, e.g., natural images, such as photos, synthetic images generated for realistic computer games, panoramic images [1], multi-spectral remote sensing
ones [2], [3], medical images, underwater images [4], [5], screen content images [6], etc. Many such types of images may contain specific types of distortions, although there are also some more typical issues related to blurring, the presence of noise, or lossy compression artifacts.
The motivation of research towards a reliable quality assessment of binary images is related to their applicability in many areas, e.g., related to the analysis of results of document image binarization. Despite the growing popularity of deep learning methods, mainly used for high-level feature extraction and classification purposes, where color or grayscale image patches are often used directly, binary images are still valuable for Optical Character Recognition (OCR) and other shape analysis tasks. OCR engines are a good example of applications where the input images are binarized, not always using the best possible method, and their quality is crucial for the results of further analysis as well as for the development and verification of the post-processing stage in multi-step image binarization algorithms.
The choice of a better binarization method before the text recognition, particularly for non-uniformly illuminated images, should however result from a reliable assessment of the quality of its results. Therefore, some novel binarization algorithms are still of interest [7], particularly tested for challenging historical document images due to the presence of many distortions and artifacts. Binary images are also useful in a variety of other applications, including machine vision, forensics, personal identification, etc. [8]. Binary images may also be an output from some types of sensors and devices, such as touch screens, resistive drawing tablets, or pens.
Many image binarization competitions have been organized at major thematic conferences in recent years, including the Document Image Binarization Competitions (DIBCO) [9], as well as the recent Time-Quality Document Image Binarization [10] and binarization of photographed documents [11] competitions. Although time measurement is relatively simple for such algorithms [12], quality evaluation of resulting binary images may be conducted using several methods, including Levenshtein distance calculation or OCR accuracy. Nevertheless, such metrics require text recognition to be performed, hence a faster reliable quality assessment of binary images would be useful for predicting the OCR accuracy for degraded images. Recent challenges are related to the binarization of non-uniformly illuminated document images captured by smartphones, where the quality of the obtained binary image is crucial for further OCR results [13].
During recent years many general-purpose IQA metrics were proposed which may also be applied for a limited number of image types mentioned above. A review of the most popular metrics may be found in survey papers [14] and [15]. Typically, general-purpose IQA metrics are based on the optimization of their correlation with subjective quality scores, hence, several available datasets may be used for their verification. Such databases contain images subject to various types and amounts of visible distortions, with subjective quality ratings expressed as Mean Opinion Score (MOS) values delivered by human observers. One of the most popular such general-purpose IQA datasets is known as Tampere Image Dataset (TID2008), further extended into TID2013 [16], which contains 3000 test images assessed by volunteers from 5 countries performing 985 subjective experiments for 25 reference images and 24 types of distortions for each reference image with 5 levels for each type of distortion. A recent extension of these datasets, known as HTID, has been developed for the verification of no-reference IQA metrics [17]. It contains 2880 color images cropped from real-life photos acquired by mobile phone cameras to the size of 1536 × 1024 pixels. An even bigger dataset proposed for the same purpose, known as KonIQ-10k [18], contains 10,073 quality scored images. On the other hand, additional subjective studies have led to the development of the Cardiff University Image quality Database (CUID) [19]. Another good example may be the application of ''blind'' IQA for the evaluation of super resolution methods and super resolved images [20]. Similar, although usually smaller, datasets are also available for some specific types of images mentioned earlier, e.g., panoramic [1], underwater [21], or screen content images [22]. Nevertheless, such images are usually more or less similar to natural images included in general-purpose IQA datasets. Therefore, some general-purpose metrics, in modified versions, may also be partially useful for such types of images.
Nevertheless, a direct application of general-purpose IQA metrics, even with some modifications, for quality assessment of binary images is not possible due to a different ''nature'' of image data as well as their interpretation, e.g., much larger contrast sensitivity in comparison to grayscale or color images. Therefore, different metrics should be provided for this purpose, preferably optimized with the use of different image datasets containing only binary images with distortions typical for them.
Unfortunately, such binary image quality datasets, containing distorted binary images with MOS values, are not as popular as large-scale databases containing color images. The most widely known metric, proposed by Zhang et al. [23], referred to as Border Distance, has been developed using the Bilevel Image Similarity Ground Truth Archive, provided by Zhai et al. [24], [25], [26]. This dataset contains a set of 7 bilevel images distorted in a number of ways and to a number of different degrees, leading to 315 distorted images assessed subjectively on a scale from 0 (lowest quality) to 1 (identical to the reference image). Some other tools, used, among others, for the evaluation of image binarization methods, may also be considered as methods for an indirect quality assessment of the resulting binary images. Most of them, such as precision, recall, F-Measure, accuracy, etc., originate from data classification and utilize the confusion matrix. Since they are based on the comparison of the resulting binary images with the available ground-truth data, provided, e.g., in the Document Image Binarization Competitions (DIBCO) datasets [9], [27], they may also be useful for the general quality evaluation of binary images. Nevertheless, the verification of image binarization as well as the quality assessment of binary images requires the knowledge of ''ground truth'' images [28], [29]. Therefore, experiments presented in the paper focus on the full-reference approach to the binary IQA.
As the application of elementary metrics for general-purpose IQA does not always lead to a high correlation with subjective scores, some hybrid (combined) metrics have been proposed to overcome this issue [30], [31], [32], also based on the use of neural networks [33] and genetic algorithms [34], [35] for a higher number of elementary metrics.
The motivation of the paper is the integration of the approaches to the binary IQA with the idea of combined metrics, and the verification of the usefulness of individual elementary metrics for this purpose. Two different approaches are proposed and experimentally verified, leading to significantly better performance in comparison to the use of single elementary metrics.
The rest of the paper is organized as follows: Section II concerns a brief overview of the binary IQA metrics, Section III discusses the proposed approach, the experimental results are presented in Section IV, whereas Section V concludes the paper.

II. OVERVIEW OF METRICS USED FOR THE EVALUATION OF BINARY IMAGES
Since binary images are usually the result of image processing, particularly image thresholding, their quality assessment is often conducted indirectly and is related to the evaluation of image binarization algorithms. For this purpose, some typical classification metrics may be easily used, assuming the availability of the binary ''ground truth'' (reference) images.
Considering black and white pixels as ''zeros'' and ''ones'', equivalent to positive and negative samples in classification (or reversely, depending on the application, e.g., for document images, black pixels are typically treated as ''ones'' due to the white background), the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) may be easily determined by a simple comparison of all pixels from the reference and assessed images. Then, some typical classification metrics may be calculated, such as precision (ratio of TP to the sum of all positives), recall (ratio of TP to the sum of TP and FN), also known as sensitivity or true positive rate (TPR), and some additional metrics, such as F-Measure (F1-score), defined as the harmonic mean of precision and recall, or accuracy (ratio of TP+TN to all samples), which, however, might be misleading for imbalanced datasets.
Some other metrics, used in the paper, that may be determined in a similar way based on the pixel-to-pixel correspondence between the reference and the assessed image [27], are: specificity (ratio of TN to the sum of TN and FP), SFMeasure (F-measure based on sensitivity and specificity), BCR (balanced classification rate being the mean of specificity and sensitivity), BER (balanced error rate), NRM (negative rate metric), PSNR (peak signal to noise ratio), and GAccuracy (geometric mean of sensitivity and specificity).
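As an illustration, the confusion-matrix-based measures listed above can be computed directly from a pair of binary images, for example as in the following sketch (function and variable names are illustrative, and foreground pixels are assumed to be encoded as ones):

```python
import numpy as np

def confusion_counts(ref, img):
    """Count TP/TN/FP/FN for two binary images of equal shape,
    treating foreground (text) pixels as positives."""
    ref, img = ref.astype(bool), img.astype(bool)
    tp = np.sum(ref & img)
    tn = np.sum(~ref & ~img)
    fp = np.sum(~ref & img)
    fn = np.sum(ref & ~img)
    return tp, tn, fp, fn

def classification_metrics(ref, img, eps=1e-12):
    """A few of the pixel-based measures discussed above."""
    tp, tn, fp, fn = confusion_counts(ref, img)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)                 # sensitivity / TPR
    specificity = tn / (tn + fp + eps)
    return {
        'precision': precision,
        'recall': recall,
        'specificity': specificity,
        'f_measure': 2 * precision * recall / (precision + recall + eps),
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'bcr': 0.5 * (recall + specificity),      # balanced classification rate
        'gaccuracy': np.sqrt(recall * specificity),  # geometric mean of sensitivity and specificity
    }
```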
Additionally, three metrics suggested in the paper [27] are used, namely: pseudo-Precision, pseudo-Recall, and pseudo-F-Measure, where the numbers of TN, TP, FN, and FP are determined utilizing an additional morphological thinning leading to a skeletonized ''ground truth''. The idea behind these metrics is to improve the pixel-based metrics due to the gap between them and the OCR results [36]. Nevertheless, an early attempt at the development of combined metrics of binary images for the OCR accuracy prediction is presented in the paper [37].
The application of the metrics typically used for the evaluation of document image binarization results for the full-reference quality assessment of binary images is also possible for two other measures, namely Distance Reciprocal Distortion (DRD) [38], and Misclassification Penalty Metric (MPM) [39].
The idea of the DRD metric is based on the observation that the perceived quality of binary document images strongly depends on the distance between two pixels, hence the diagonally neighboring pixels influence a central pixel in focus less than horizontal or vertical neighbors. Therefore, all flipped pixels between the reference and assessed images are weighted by the reciprocal of their distance to the center pixel. The normalized weighting matrix is typically calculated for the 5 × 5 pixels neighborhood of all flipped pixels and then multiplied by the values of all pixels from the original image block that differ from the flipped pixel. Finally, the local DRD values determined for the neighborhoods of all flipped pixels are aggregated and divided by the number of non-uniform (not fully black or fully white) 8 × 8 blocks in the reference image [38].
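A minimal sketch of such a DRD computation, assuming binary images encoded with foreground pixels equal to one and the usual 5 × 5 weighting window and 8 × 8 blocks, could look as follows (names are illustrative):

```python
import numpy as np

def drd_metric(ref, out):
    """Sketch of the Distance Reciprocal Distortion metric [38]."""
    ref, out = ref.astype(np.uint8), out.astype(np.uint8)
    # normalized 5x5 weight matrix: reciprocal of the distance to the center pixel
    yy, xx = np.mgrid[-2:3, -2:3]
    d = np.hypot(yy, xx)
    d[2, 2] = 1.0                      # avoid division by zero at the center
    w = 1.0 / d
    w[2, 2] = 0.0
    w /= w.sum()
    ref_pad = np.pad(ref, 2, mode='edge')
    drd_sum = 0.0
    for y, x in zip(*np.nonzero(ref != out)):          # flipped pixels only
        block = ref_pad[y:y + 5, x:x + 5].astype(int)
        drd_sum += np.sum(w * np.abs(block - int(out[y, x])))
    # NUBN: number of non-uniform (not fully black or white) 8x8 blocks in the reference
    h8, w8 = (ref.shape[0] // 8) * 8, (ref.shape[1] // 8) * 8
    blocks = ref[:h8, :w8].reshape(h8 // 8, 8, w8 // 8, 8).swapaxes(1, 2)
    sums = blocks.reshape(-1, 64).sum(axis=1)
    nubn = np.count_nonzero((sums > 0) & (sums < 64))
    return drd_sum / max(nubn, 1)
```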
The Misclassification Penalty Metric was originally developed for the evaluation of motion segmentation results. It penalizes the misclassified (flipped) pixels by their distances from the border of the reference objects. Finally, the normalization is conducted by dividing by the sum of all pixel-to-contour distances in the image [39].
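Following the description above, the MPM may be sketched using a distance transform computed with respect to the reference contour; the exact contour definition and normalization in [39] may differ slightly:

```python
import numpy as np
from scipy import ndimage

def mpm_metric(ref, out):
    """Sketch of the Misclassification Penalty Metric [39]."""
    ref, out = ref.astype(bool), out.astype(bool)
    # contour of the reference objects: object pixels with a background neighbor
    contour = ref & ~ndimage.binary_erosion(ref)
    # distance of every pixel to the nearest contour pixel
    dist = ndimage.distance_transform_edt(~contour)
    flipped = ref ^ out
    # penalties for misclassified pixels, normalized by the sum of all distances
    return dist[flipped].sum() / max(dist.sum(), 1e-12)
```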
Another interesting approach, considering the distances between the objects and the flipped pixels, is the Border Distance [23], based on the assumption that the changes located close to the border regions are less visible. Hence, the Border Distance for the flipped background pixels is defined as the distance to the nearest pixel representing the object that may be calculated using three various methods: chessboard (D8), city-block (D4) or Euclidean distances, leading to three types of the metric. Considering these distances together with the image resolution, the Border Distance values may be applied as weights during the calculation of the Mean Square Error (MSE) or Peak Signal to Noise Ratio (PSNR), leading to three types of the BDMSE and BDPSNR metrics, respectively [23].
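A loose sketch of such a Border-Distance-weighted PSNR is given below; the exact weighting and resolution-dependent scaling used in [23] may differ, and the handling of flipped object pixels here (distance to the nearest background pixel) is a simplifying assumption:

```python
import numpy as np
from scipy import ndimage

def border_distance(ref, dist_type='chessboard'):
    """Distance of each pixel to the nearest pixel of the opposite class in the reference."""
    if dist_type == 'euclidean':
        d_bg = ndimage.distance_transform_edt(~ref)   # background -> nearest object pixel
        d_fg = ndimage.distance_transform_edt(ref)    # object -> nearest background pixel
    else:
        metric = 'chessboard' if dist_type == 'chessboard' else 'taxicab'
        d_bg = ndimage.distance_transform_cdt(~ref, metric=metric)
        d_fg = ndimage.distance_transform_cdt(ref, metric=metric)
    return np.where(ref, d_fg, d_bg)

def bd_psnr(ref, out, dist_type='chessboard'):
    """Sketch of a Border-Distance-weighted PSNR in the spirit of [23]."""
    ref, out = ref.astype(bool), out.astype(bool)
    dist = border_distance(ref, dist_type).astype(float)
    flipped = ref ^ out
    bd_mse = np.sum(dist[flipped] ** 2) / ref.size    # distance-weighted MSE, peak value 1
    return 10.0 * np.log10(1.0 / max(bd_mse, 1e-12))
```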
The experiments related to the use of the above metrics and their combinations were conducted utilizing images from the Bilevel Image Similarity Ground Truth Archive [24], [25], [26], selected as the only possibility of the verification of objective metrics with subjective quality scores. Some representative images from this dataset are presented in Fig. 1.
To verify the performance of metrics, Pearson Linear Correlation Coefficients (PLCC) between the individual (elementary and combined) objective metrics and Mean Opinion Score (MOS) values provided in the dataset were computed for 301 distorted images. Since the values of the full-reference metrics, based on the comparison with reference images, are obvious for images identical to the originals, in contrast to no-reference (''blind'') metrics, the MOS values provided for the 14 reference and losslessly compressed images (identical to the originals) were not used [40]. The performance of elementary metrics was additionally verified using prediction monotonicity metrics, namely the Spearman Rank Order Correlation Coefficient (SROCC) and the Kendall Rank Order Correlation Coefficient (KROCC), typically used for the evaluation of newly proposed general-purpose IQA metrics. The obtained results are presented in Table 1.
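The performance indices used throughout the paper can be computed, for instance, with SciPy (the vector of objective scores and the MOS vector are assumed to be aligned per image):

```python
import numpy as np
from scipy import stats

def correlation_report(objective_scores, mos):
    """PLCC, SROCC and KROCC between an objective metric and the MOS values."""
    q = np.asarray(objective_scores, dtype=float)
    s = np.asarray(mos, dtype=float)
    plcc, _ = stats.pearsonr(q, s)
    srocc, _ = stats.spearmanr(q, s)
    krocc, _ = stats.kendalltau(q, s)
    return plcc, srocc, krocc
```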
Typically, changes in individual pixels, perceived as image distortions, influence the image structure and may be considered as increasing its complexity. Therefore, an increase in variance and image entropy may be expected in this case. Nevertheless, an appropriate use of these two features as potential binary image quality metrics requires an additional masking. Hence, both features are calculated only for the flipped (false positive and false negative) pixels, using the assumed neighborhood (mask sizes from 3 × 3 to 11 × 11 have been considered in the experiments), aggregated and divided by the image size. The performance of such obtained metrics is presented in Table 2.
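A possible reading of this masked computation is sketched below, assuming binary images with foreground pixels equal to one and a two-symbol (binary) entropy estimated from the fraction of ones in each window; the exact aggregation used in the paper may differ:

```python
import numpy as np

def local_features_on_flipped(ref, out, mask_size=5):
    """Local entropy and variance computed only around flipped pixels,
    summed and normalized by the image size."""
    ref, out = ref.astype(np.uint8), out.astype(np.uint8)
    r = mask_size // 2
    out_pad = np.pad(out, r, mode='edge')
    ent_sum, var_sum = 0.0, 0.0
    for y, x in zip(*np.nonzero(ref != out)):          # flipped pixels only
        block = out_pad[y:y + mask_size, x:x + mask_size].astype(float)
        p = block.mean()                               # fraction of 'ones' in the window
        var_sum += block.var()
        if 0.0 < p < 1.0:
            ent_sum -= p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)
    n = ref.size
    return ent_sum / n, var_sum / n
```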

III. PROPOSED APPROACH
A. THE IDEA OF HYBRID METRICS
The idea of combining several metrics was initially verified in the conference paper [40], where a simple weighted product of four metrics was proposed, leading to an increase of the PLCC from 0.8145, obtained for the chessboard type BD-PSNR metric, to 0.8611, achieved for the combination of MPM, Specificity, pseudo-Precision, and pseudo-Recall. To achieve better performance, a further improvement of this approach is proposed, based on the extension of the formula by the additional use of the weighted sum:

Q_comb = \prod_{n=1}^{N} Q_n^{a_n} + \sum_{n=1}^{N} b_n \cdot Q_n^{c_n},   (1)

where Q_comb denotes the proposed combined metric, a_n, b_n, and c_n are the weighting coefficients obtained as the result of their optimization maximizing the correlation with MOS values, and Q_n are the values of the N elementary metrics, with n denoting the index of the elementary metric. The proposed metric fusion, based on the nonlinear combination of multiple metrics with the use of a weighted average and a weighted product, makes it possible to obtain the values of the weighting coefficients using typical optimization methods, such as the Nelder-Mead simplex or quasi-Newton gradient optimization. The experiments were conducted using the 19 elementary metrics presented in Table 1 and the best pair of entropy and variance metrics listed in Table 2, starting from the optimization of the weighting coefficients for the combination of all metrics according to formula (1). Then, one of the five pairs of the local entropy and the local variance was added and the optimization procedure was launched to verify which of the five pairs led to the highest performance. Due to relatively small changes in performance for various mask sizes (as shown in Table 2) and a simplified implementation, the same mask size was used for the calculation of both features.
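For illustration, the optimization of the weighting coefficients of formula (1) (in the reconstructed form given above) may be sketched as follows, using the Nelder-Mead method from SciPy with random restarts; the matrix Q collects the elementary metric values for all images, and all names and settings are illustrative:

```python
import numpy as np
from scipy import optimize, stats

def combined_metric(Q, a, b, c):
    """Hybrid metric of formula (1): weighted product plus weighted sum of powered metrics.
    Q has shape (num_images, N); a, b, c have length N."""
    return np.prod(Q ** a, axis=1) + np.sum(b * Q ** c, axis=1)

def optimize_weights(Q, mos, n_restarts=10, seed=0):
    """Maximize the PLCC with MOS over the weights a_n, b_n, c_n (a sketch)."""
    rng = np.random.default_rng(seed)
    N = Q.shape[1]

    def neg_plcc(w):
        a, b, c = w[:N], w[N:2 * N], w[2 * N:]
        q = combined_metric(Q, a, b, c)
        if not np.all(np.isfinite(q)):
            return 1.0                                  # penalize invalid weight sets
        plcc, _ = stats.pearsonr(q, mos)
        return -plcc if np.isfinite(plcc) else 1.0

    best = None
    for _ in range(n_restarts):
        w0 = rng.normal(scale=0.5, size=3 * N)
        res = optimize.minimize(neg_plcc, w0, method='Nelder-Mead',
                                options={'maxiter': 20000, 'xatol': 1e-6, 'fatol': 1e-8})
        if best is None or res.fun < best.fun:
            best = res
    return best.x, -best.fun
```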
Although the highest correlations of individual metrics listed in Table 2 were achieved using the 11 × 11 pixels mask, additional verification showed that the use of smaller masks led to better performance of the combined metric. A similar situation may also be observed for general-purpose IQA metrics, as well as for the use of combined metrics for remote sensing images, since it is known that ''the elementary metrics being complementary to each other constitute the best solutions'', as stated, e.g., in the paper [41]. The results obtained for the five considered pairs added to the optimized combination of 19 elementary metrics from Table 1 are presented in Fig. 2. The illustration of the scatter plot obtained for 21 metrics, including the local entropy and the local variance, is presented in Fig. 3.
To verify the advantages of the proposed idea of metrics' combination according to formula (1), a simplified ablation study has been conducted. This verification is based on the elimination of the first component (weighted product) or the second component (weighted sum) from the formula (1) before the optimization of weights. The comparison of the results obtained for the ''best'' combinations of 19 metrics and two features is presented in Table 3.
The computational complexity of the proposed approach, based on the combination of 21 metrics, is relatively high, even with the possible parallel computation of metrics. Therefore, limiting their number might be beneficial, assuming a relatively small decrease in performance of the designed hybrid metrics compared to the combination of 21 elementary metrics. Starting from the optimized combination of 21 metrics, the optimization of 20 metrics, excluding a selected one, was conducted. Then, the highest PLCC obtained for each of such 21 scenarios was selected to identify the ''least contributing'' metric. Such an approach is quite similar to backward sequential feature selection methods used in machine learning [42]; however, in this case the classifier is replaced with the Pearson correlation. Then, a similar procedure was repeated for the reduced number of metrics down to the combination of two elementary metrics.
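The reduction procedure described above may be sketched as follows; fit_and_score is assumed to re-optimize the weights for a given subset of metrics (e.g., with a routine like optimize_weights above) and return the achieved PLCC:

```python
def backward_elimination(Q, mos, metric_names, fit_and_score, min_metrics=2):
    """Backward-selection-like reduction of the elementary metrics,
    guided by the Pearson correlation instead of a classifier."""
    active = list(range(Q.shape[1]))
    history = []
    while len(active) > min_metrics:
        candidates = []
        for i in active:
            subset = [j for j in active if j != i]
            plcc = fit_and_score(Q[:, subset], mos)
            candidates.append((plcc, i))
        best_plcc, removed = max(candidates)   # dropping the 'least contributing' metric
        active.remove(removed)
        history.append((metric_names[removed], best_plcc))
    return active, history
```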
The results obtained for the limited number of metrics are presented in Section IV.

B. THE APPLICATION OF NEURAL NETWORKS
The combination of metrics based on formula (1) makes it possible to achieve a relatively high correlation with subjective quality scores, yet not much higher than 0.95 for 21 metrics. Given the growing popularity of neural networks in various applications and some earlier successful attempts to combine general-purpose IQA metrics using shallow neural networks [33], [41], [43], their application for the binary IQA seems worth investigating. Nevertheless, the application of popular deep learning methods for this purpose is limited by the lack of large-scale training datasets [44]. Therefore, instead of formula (1), a simple feed-forward neural network with 10 neurons in the hidden layer, typically used for approximation and regression purposes, was applied, where the elementary metrics were used as inputs and the MOS values as the set of target values. The training of the network was conducted using the well-known Levenberg-Marquardt algorithm, selected after the analysis of results obtained for three other training methods, as discussed later.
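A minimal sketch of this NN-based fusion is given below using scikit-learn; note that scikit-learn does not provide Levenberg-Marquardt training, so L-BFGS is used here as a stand-in, and the simple train/test split only approximates the training/validation/test division mentioned in the paper:

```python
import numpy as np
from scipy import stats
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

def train_nn_metric(Q, mos, seed=0):
    """Shallow feed-forward network with 10 hidden neurons regressing MOS
    from the elementary metrics (a sketch, not the exact setup of the paper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        Q, mos, test_size=0.15, random_state=seed)
    net = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh',
                       solver='lbfgs', max_iter=5000, random_state=seed)
    net.fit(X_train, y_train)
    plcc, _ = stats.pearsonr(net.predict(X_test), y_test)
    return net, plcc
```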
Similarly to the construction of the simplified hybrid metrics discussed in Section III-A, the trained network with 21 elementary metrics used as inputs was the starting point for limiting the number of metrics used in the combination. In contrast to the previously described approach, starting from the best results obtained for 19 elementary metrics, the ''best'' pair of the local variance and the local entropy turned out to be the one calculated for the 5 × 5 pixels mask.
Since the network training was carried out starting from randomly selected weights, and the 301 samples were automatically divided into training, validation, and test sets, the final results depended on these conditions. To minimize their influence on the obtained results, 100 repetitions of the network's training were conducted and the final PLCC value was calculated as the median of the achieved correlations.
Additionally, the verification of the influence of the training method and the network structure was made for two types of neural networks: simple feed-forward network and the cascade-network with the additional connection from the input layer to the output layer as shown in the bottom part of Fig. 4.
The experimental comparison of the influence of the network training algorithms on the obtained results for those two structures was conducted for three assumed numbers of elementary metrics used as network inputs. For this purpose, 2, 4, and 6 elementary metrics, selected during the reduction procedure based on an idea similar to backward sequential feature selection as discussed at the end of Section III-A, were used. Nevertheless, even for a relatively small number of metrics, their pairwise products may be used as additional inputs for the network, leading to an increase of the correlation of its output with the original MOS values. The concept of the additional pairwise products of elementary metrics used as the network's inputs allows for an increase of the PLCC values with subjective scores without the computation of any other metrics. The results obtained using both NN-based approaches in comparison to the application of the hybrid metrics based on formula (1), as well as the results obtained using various training algorithms, are presented in Section IV.
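The pairwise-product input expansion mentioned above may be implemented, for example, as the following sketch; for N elementary metrics it yields N + N(N-1)/2 network inputs (e.g., 10 inputs for 4 metrics):

```python
import numpy as np
from itertools import combinations

def add_pairwise_products(Q):
    """Append the pairwise products of the elementary metrics to the input matrix Q."""
    products = [Q[:, i] * Q[:, j] for i, j in combinations(range(Q.shape[1]), 2)]
    return np.column_stack([Q] + products)
```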

IV. RESULTS AND DISCUSSION
The three approaches presented in Section III were independently tested during the ablation study, applying the method described in the last part of Section III-A to decrease the number of elementary metrics used as inputs. The PLCC values obtained for various numbers of elementary metrics, after the optimization of weighting coefficients according to formula (1), are presented in Fig. 5. For better readability, the names of the metrics included in the combined metrics are shown in the reverse order of their removal during the ablation study (as they were added starting from the ''best'' elementary metric during the additional verification of the optimization results). The number of elementary metrics used in each combination is shown in curly brackets.
Similarly, the results obtained using the simple shallow feed-forward neural network with various numbers of elementary inputs, trained using the Levenberg-Marquardt algorithm, are presented in Fig. 6, whereas the average results achieved for 100 repetitions using various training algorithms for the obtained combinations of 2, 4, and 6 metrics are presented in Table 4. For comparison, results obtained for the combinations of the same metrics, although with the use of the additional pairwise products of these metrics as network inputs, are presented in Table 5. Since the number of input metrics is more important than the choice of the training algorithm, in view of the relatively small differences in the obtained results, the Levenberg-Marquardt method was selected for the feed-forward network training in the final design of the network based on the reduction of the number of elementary metrics starting from 21 elementary metrics with their pairwise products. The correlations finally obtained for the NN-based metrics with additional inputs (pairwise products of elementary metrics) are illustrated in Fig. 7.
As presented in Fig. 5, the combination of two to five metrics leads to significant improvements in the performance of the designed metrics compared to elementary metrics. Since computing the considered elementary metrics for binary images is much faster than computing such IQA approaches for grayscale or color images, using more metrics does not significantly reduce the processing speed, especially assuming potential parallel calculations of individual elementary metrics. Therefore, the application of the combined metric consisting of 7 or 9 metrics can be considered a reasonable choice to balance the correlation with MOS values and the overall computational complexity. As shown in Fig. 5, the use of additional elementary metrics does not lead to a noticeable improvement in the results.
Even better results can be achieved using fewer metrics as inputs for a shallow feed-forward neural network, as depicted in Fig. 6, especially for two, three, or four input elementary metrics. Nevertheless, the obtained results depend on the random selection of the training, testing, and validation subsets, and the starting point, therefore they differ for different numbers of input metrics. Comparing the median performance presented in Fig. 6 with the PLCC values achieved for the combined metrics shown in Fig. 5, the NN-based results are higher, and using at least five input metrics always leads to better performance than the combination of all elementary metrics according to formula (1).
Further performance increase for a relatively small number of input metrics can be achieved by using their pairwise products as additional network inputs. In this case, even the use of four metrics makes it possible to obtain a PLCC above 0.96, as shown in Fig. 7. Nevertheless, using more metrics together with their pairwise products does not lead to better results, and in this case the number of inputs is significantly higher than the number of neurons.
Analyzing the order of the eliminated elementary metrics, it may be easily noticed that the last eliminated metrics, considered ''the most influential'' in view of the correlation with subjective quality scores, are almost the same for the combined metric and the NN-based approach. The differences in the three main elementary metrics are related only to the mask size used for the local entropy and the local variance. Two other metrics (PSNR and NRM) are also present in the combination of 6 metrics in both cases. However, when using the additional pairwise products as the inputs for the neural network, the local entropy and the local variance are not as important, as both these metrics are eliminated in the first two steps of the ablation procedure.
A visual comparison of the Pearson correlations achieved for various kinds of combinations of 4 elementary metrics is depicted in Fig. 8. As shown in the scatter plots, a highly linear correlation between the subjective and objective quality scores may be observed for the proposed combination of metrics using the shallow feed-forward neural network with pairwise products as additional inputs (Fig. 8c).

V. CONCLUSION
The novel hybrid method for objective quality assessment of binary images proposed in the paper in three variants makes it possible to achieve a very high correlation with subjective quality scores, significantly outperforming known elementary metrics. The proposed approach has been verified using the publicly available Bilevel Image Similarity Ground Truth Archive which contains not only the distorted binary images but also their subjective quality evaluation results expressed as Mean Opinion Scores. Although the best results may be achieved using the combination of metrics based on the application of a shallow neural network, satisfactory results may also be obtained by applying the model (1) without training.
The proposed metric may be used for the quality evaluation of binary images in various image analysis applications utilizing binary images, e.g., shape analysis or OCR engines, or for the indirect evaluation of image binarization results.
Since the subjective quality scores are not always equivalent to the final quality of image analysis results, our further research may be related to the additional verification and optimization of the proposed approach, e.g., considering the text recognition accuracy. Nevertheless, such experiments would require the development of novel datasets containing binary images resulting from document image binarization, serving as a possible comparison tool for further investigations.
Depending on the availability of the test datasets, further research on binary image quality assessment methods may also focus on binary masks used for segmentation and annotation purposes, particularly in the context of training the neural networks. Nevertheless, potential distortions of such masks may differ significantly from those present in the Bilevel Image Similarity Ground Truth Archive used in the paper. Therefore, the proposed hybrid method would require additional verification and tuning.
Other directions of future work include the verification of different network structures and training functions that may further improve the obtained correlation with subjective quality scores, as well as research on more advanced deterministic models that may outperform neural networks in terms of the correlation with subjective scores.