Deep Feature-Based Three-Stage Detection of Banknotes and Coins for Assisting Visually Impaired People

Owing to the rapid advancements in smartphone technology, there is an emerging need for a technology that can detect banknotes and coins to assist visually impaired people using the cameras embedded in smartphones. Previous studies have mostly used handcrafted feature-based methods, such as scale-invariant feature transform or speeded-up robust features, which cannot produce robust detection results for banknotes or coins captured in various backgrounds and environments. With the recent advancement in deep learning technology, some studies have been conducted on banknote and coin detection using a deep convolutional neural network (CNN). However, these studies also showed degraded performance depending on changes in background and environment. To overcome these drawbacks, this paper proposes a new three-stage detection technology for banknotes and coins by applying a faster region-based CNN (Faster R-CNN), geometric constraints, and a residual network (ResNet). In experiments performed using the open database of Jordanian dinar (JOD) and 6,400 images of eight types of Korean won banknotes and coins captured with our smartphones, the proposed method exhibited better detection performance than the state-of-the-art methods based on handcrafted features and deep features.


I. INTRODUCTION
With the rapid advancements in technology, smartphones have been widely used in various applications. As one such application, there is an emerging need for a technology that can detect banknotes and coins to assist visually impaired people using the cameras embedded in smartphones [1], [2].
In previous studies on banknote detection, high detection performance was observed by applying speeded-up robust features (SURF) to banknotes [3]. However, the performance of SURF was significantly degraded when images captured in complicated and diverse backgrounds were used [4]. In other research [5], the classification of fake banknotes using deep learning was proposed, which did not require the pre-classification of banknote images by denomination and input direction. However, the regions of banknotes were manually segmented from the input image, which requires the user's assistance when this method is used on an actual smartphone. In addition, most previous studies on banknote detection using deep learning have used databases with simple backgrounds or with the application of a slight rotation such that the objects can be easily recognized. Thus, studies that examine the detection performance using images captured under various conditions are lacking [6], [7].
Therefore, the problem can be defined as follows: the difficulty of automatically detecting banknotes in complicated backgrounds and the lack of performance evaluations under various experimental conditions. Moreover, considering that coins are commonly used in everyday life, small coins, in addition to banknotes, should also be regarded as detection classes, unlike in most previous works.
Based on these research motivations, our research objective is accurate banknote and coin detection and recognition in complicated backgrounds and under various experimental conditions. Our research has the following contributions, significances, and advantages compared with previous works:
- This study is the first deep learning-based approach to detect and recognize bills and coins in images captured by smartphone cameras in complicated backgrounds and various experimental conditions for assisting visually impaired people. Unlike most previous works, small coins are also regarded as detection classes in our research.
- To improve the detection performance of the VGG-16-based faster region-based convolutional neural network (Faster R-CNN) in the first stage of detection, false positive (FP) candidates are removed by applying the post processing of the second stage based on three features: the width-to-height ratio, detection box size, and detection score.
- The candidates remaining after post processing are divided into coin, bill, and coin and bill sections according to the detection box size; for coin and bill candidates, the verification of the third stage of detection is performed using a ResNet-18-based Faster R-CNN to detect the final banknote region.
- Experimental results in various conditions and backgrounds confirm that our method outperforms the state-of-the-art methods for banknote detection. In addition, the self-collected Dongguk Korean Banknote database version 1 (DKB v1) and the developed models with algorithms are disclosed for a fair evaluation by other researchers, as shown in [8].
This paper is organized as follows. In Section II, related works are described, and the proposed method is introduced in Section III. In Sections IV and V, we present the experimental results with analysis, and conclusions, respectively.

II. RELATED WORKS
Previous studies on banknote detection can be largely classified into handcrafted feature-based and deep feature-based methods. Several handcrafted feature-based studies examined detection and recognition methods using SURF. SURF was used mainly because it is effective for images with rotation or scaling changes. Compared with other handcrafted feature models, this method requires lower computational costs, which leads to a shorter time for the localization or matching of features. The detection method with SURF involves extracting features within images using the Hessian matrix [3]. Subsequently, approximately 20 images per class are obtained under various conditions, and then the features are matched to verify the detection performance. In addition to this handcrafted feature-based extraction method using entire images, some studies examined more efficient ways of using SURF based on foreground segmentation [1], [2]. These studies distinguished the banknote and background within images using a pixel-based adaptive segmenter (PBAS). Distinguishing foreground and background not only reduces the computation cost and improves the computation speed but also prevents features from being extracted from unnecessary areas of the image outside the banknote. Moreover, the number of false matches in the recognition results can be reduced if features are not extracted from unnecessary areas. Adaptive boosting (AdaBoost), which is a widely used algorithm [9], involves transforming a weak classifier into a strong classifier through repeated training.
In a previous study [10], banknote recognition was conducted using AdaBoost and SURF-based methods. However, detection and classification processes using SURF are inefficient for finding the desired objects in images with complicated backgrounds or environments. The reason is that the SURF algorithm transforms color images into grayscale images to find features; consequently, important information is lost when color images captured by smartphone cameras are used. Moreover, even when the object and background in a color image are clearly different, SURF transforms the image into grayscale and may represent the object and background with similar features, which leads to false detection of objects. Another handcrafted feature-based method involves using the fast radial symmetry (FRS) transform [11], in which geometrical patterns are extracted. The FRS transform is a gradient-based interest operator. When the FRS transform was applied to regions with radial symmetry, gradients were obtained from the number parts of the banknotes. Then, the unique geometrical patterns of the number parts in the denominations of Mexican banknotes were extracted using these gradients. Finally, the extracted patterns were applied to the test banknote to classify to which denomination the respective banknote belongs [12].
Another study applied principal component analysis (PCA) [13] as a handcrafted feature-based method [14]. In this study [14], the region of interest (RoI) was extracted based on the number parts located to the left and right of the denominations. The optimal number of eigenimages was determined using the extracted RoIs. Here, a total of 24 eigenimages were obtained by applying six denominations and four RoIs. Subsequently, the banknotes were recognized by applying the Mahalanobis distance using the features extracted based on eigenimages as input.
In addition, the studies on banknote recognition [15] were conducted by applying the k-nearest neighbors (KNN) classifier [16] and the decision tree classifier (DTC) [17]. In this study [15], features with RGB values of RB, RG, or GB were extracted for each denomination of Malaysian banknotes. The two algorithms, KNN and DTC, were applied to the three extracted features for training with 10-fold cross validation.
A deep CNN (DCNN) extracts features through fast computations using a convolution filter based on a large number of datasets. Unlike handcrafted feature-based methods, a DCNN learns both the object and the background and thus can detect objects more effectively using a large amount of information. Several previous studies focused on the recognition of banknotes for assisting the visually impaired. However, some DCNN models with outstanding recognition and detection performance, such as Faster R-CNN and you-only-look-once (YOLO) [18], have numerous layers, which results in a large number of parameters and computations, thus requiring high hardware specifications. In other words, these models are difficult to realize in a wearable device with low specifications. Thus, MobileNet, which requires a relatively lower amount of computations, is often used [19]. The amount of computations of MobileNet v1 is three times less than that of GoogleNet [20], and the number of parameters is only 60% of that of the latter. Furthermore, the number of parameters of MobileNet v1 is 30 times less than that of the visual geometry group (VGG)-16, and the amount of computations is approximately 3% of that of the latter.
A previous study [6] examined the recognition performance based on MobileNet for Indian banknotes, reporting an accuracy of approximately 96.6%. Another study [7] performed banknote detection and recognition using a shallow CNN designed by the researchers. In this study, features are extracted through a newly designed CNN network. Then, the output feature map is divided into an S × S grid of cells based on the extracted features. Each cell contains the vector of bounding boxes and the information on class predictions. A grid cell handles the prediction of each object in the image; each grid cell predicts the bounding box and class probabilities. Banknote detection and recognition are performed based on the results of these grid cells. In [21], a YOLO-v3-based banknote detection and recognition method was used. The authors collected images of different denominations and augmented those images with different geometric and image transformations. Another previous study proposed a currency recognition method based on deep learning and compared the performances of Faster R-CNN, the single shot multibox detector (SSD), and MobileNet [22].
Chowdhury et al. proposed an Indian banknote recognition method based on a CNN [23]. In their method, the images of currency notes are first checked to determine whether they are Indian banknotes. If so, the denomination is classified by k-NN and also by a CNN. In other research [24], a CNN model based on the AlexNet architecture was adopted with Chilean bill data, which were augmented by translation, rotation, scaling, brightness variation, etc. The authors also proposed their method for automatically classifying bills. Jadnav et al. proposed a method of currency identification and forged banknote detection using deep learning [25]. They used Saudi and Indian currencies, extracted features in depth, and analyzed the banknotes using a deep CNN.
Regarding general object detection, YOLO v2 was proposed in previous research [26]. Using a multi-scale training method, the YOLOv2 model can be operated at varying sizes, providing an easy tradeoff between speed and accuracy.
To overcome the aforementioned drawbacks, this paper proposes a new three-stage detection technology for banknotes and coins by applying Faster R-CNN, geometric constraints, and a residual network (ResNet). Moreover, previous handcrafted feature-based and deep feature-based methods consider only bills in the detection and recognition processes, but not coins. However, coins are commonly used in everyday life; therefore, banknote detection and recognition were performed using a database of both coins and bills in this study.
The advantages and drawbacks of the proposed and previous methods are summarized in Table 1.

III. PROPOSED METHOD

A. OVERVIEW OF THE PROPOSED METHOD
The flowchart of the proposed banknote detection method is shown in Figure 1. First, the input images were applied to the pretrained Faster R-CNN as the experimental data (first stage). Among the box regions detected through Faster R-CNN, FP candidates were removed by performing post processing when there were multiple detection boxes in the image (second stage). In the post processing step, FP candidates were removed based on the box size, the width-to-height ratio of the box, and the detection score. Finally, the detection boxes that were not removed in the second stage were classified into coin, coin and bill, and bill candidates according to the detection box size. The coin and bill candidates were verified using the pretrained Faster R-CNN to detect the final banknote regions (third stage). A simplified sketch of this flow is given below.
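The following is a minimal sketch of this three-stage flow, assuming detections are represented as simple dictionaries; the two detector functions are hypothetical stubs standing in for the trained networks, and the filtering and routing rules are detailed in Sections III.C and III.D.

```python
# Hedged sketch of the pipeline in Figure 1; stage1_vgg16_rcnn and
# stage3_resnet18_rcnn are hypothetical stubs, not the authors' API.
def stage1_vgg16_rcnn(image):
    # Would run the VGG-16-based Faster R-CNN and return candidate boxes,
    # e.g., {"x": 10, "y": 20, "w": 120, "h": 60, "score": 0.9, "label": "bill"}.
    return []

def stage3_resnet18_rcnn(image, candidates):
    # Would re-verify coin-only and bill-only candidates with the
    # ResNet-18-based Faster R-CNN and return the final detections.
    return candidates

def detect_banknotes(image, passes_post_processing, split_by_box_size):
    candidates = stage1_vgg16_rcnn(image)                     # first stage
    candidates = [b for b in candidates
                  if passes_post_processing(b, candidates)]   # second stage
    coins, coins_and_bills, bills = split_by_box_size(candidates)
    verified = stage3_resnet18_rcnn(image, coins + bills)     # third stage
    return verified + coins_and_bills  # mixed-size candidates kept as-is
```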

B. FIRST STAGE OF DETECTION
In the first stage of detection, banknotes were detected using Faster R-CNN. Starting from R-CNN [27], Fast R-CNN [28] improved the detection speed, and Faster R-CNN [29] further improved both the detection performance and the detection speed. Accordingly, Faster R-CNN was used in this study, as shown in Figure 1. VGG-16 [30] was used among the pretrained CNNs as a feature extractor, and the input image size (height × width × channel) was set to 600 × 800 × 3.
After extracting the feature map from VGG-16, region proposals were output through the region proposal network (RPN). RoI pooling was performed using the region proposals and feature map, and, subsequently, the results were used to classify objects and detect the bounding box. Tables 2-4 summarize the detailed architectures of the network used in this study. The feature extractor, whose architecture is summarized in Table 2, has 13 convolutional layers with rectified linear units (ReLUs) and four max pooling layers. In the existing VGG-16 network, the structure up to the last max pooling layer was used. Filters with a width × height of 3 × 3 are used in the convolutional layers. The padding and stride are both 1 × 1 so that the size of the output feature map does not change when a convolutional layer is processed. A filter with the size of 2 × 2 × 1 is used in the max pooling layers. A stride of 2 × 2 was applied, and, consequently, the width and height of the output feature map of the preceding convolutional layer were halved. The original image used in this study is an RGB image with the size of 1080 × 1920 × 3. The size of the input image is 600 × 800 × 3, which is obtained through bilinear interpolation. Through the processes listed in Table 2, a feature map with the size of 38 × 50 × 512 is finally output. The feature map thus obtained is used as the input of the RPN and classifier. Table 3 summarizes the information of the RPN, which generates the region proposals. The RPN takes as input the Conv5_3 output, which is the end of the aforementioned feature extractor, and consists of a 3 × 3 convolutional layer followed by two 1 × 1 convolutional layers. In Conv6, a 3 × 3 window is slid over the pixels of the feature map to perform convolution. There are nine anchor boxes at the center of the sliding window. The feature map extracted through Conv6 becomes the input to the classification and regression layers. For each anchor box, the classification layer outputs the object and background scores, and the regression layer outputs the bounding box regression vector.
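As a quick check of the sizes in Table 2, the following sketch traces the spatial dimensions through the four stride-2 max pooling layers (assuming, as a convention, that odd sizes are rounded up) and counts the anchors scored by the RPN:

```python
import math

# Trace the 600 x 800 input through four stride-2 max pooling layers
# (rounding odd sizes up, an assumption about the framework's padding).
h, w = 600, 800
for _ in range(4):
    h, w = math.ceil(h / 2), math.ceil(w / 2)
print(h, w)       # 38 50, matching the 38 x 50 x 512 map in Table 2

# With nine anchor boxes at each feature-map position (Table 3),
# the RPN evaluates 38 * 50 * 9 anchors per image.
print(h * w * 9)  # 17100
```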
As shown in Equations (1) and (2), the bounding box regression vectors were used to transform the anchor boxes into proposal boxes.
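The regression vectors follow the standard Faster R-CNN parameterization [29]; assuming that form, with v_x, v_y, v_w, and v_h denoting the components of the regression vector, Equations (1) and (2) can be written as:

```latex
x_{proposal} = w_{anchor}\, v_x + x_{anchor}, \qquad
y_{proposal} = h_{anchor}\, v_y + y_{anchor} \tag{1}

w_{proposal} = w_{anchor}\, e^{v_w}, \qquad
h_{proposal} = h_{anchor}\, e^{v_h} \tag{2}
```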
In Equations (1) and (2), x_proposal, y_proposal, w_proposal, and h_proposal are the center x and y coordinates and the width and height of the proposal box, respectively; x_anchor, y_anchor, w_anchor, and h_anchor are those of the anchor box. Non-maximum suppression (NMS) with an intersection over union (IoU) threshold was applied based on the score obtained from the classification layer, and only the surviving boxes remained as proposal boxes. Here, the top 300 boxes become the region proposals and the input to the classifier in Table 4. The RPN and classifier were trained first during the training of Faster R-CNN. For the classification, nine classes (i.e., 10, 50, 100, 500, 1000, 5000, 10000, 50000 KRW, and background) were trained for KRW. Meanwhile, 10 classes (i.e., 1 qirsh, 5, 10 piastres, 1/4, 1/2, 1, 5, 10, 20 dinars, and background) were trained for JOD. When training the RPN and classifier, the weights were trained so that the loss function value is minimized.
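The loss has the form of the Faster R-CNN multi-task loss [29]; assuming that form, with σ as the balancing weight described below, Equation (3) is:

```latex
L(\{p_i\}, \{v_i\}) = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_i, p_i^{*})
  + \sigma\, \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}(v_i, v_i^{*}) \tag{3}
```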
In Equation (3), i is the index within the mini-batch, p_i is the probability of whether anchor i is an object or background, and p_i* is the ground-truth label, which has the value of 1 when the anchor is a banknote object and 0 when it is the background. L_cls is the classification loss function, which denotes the log loss of the class. v_i is the bounding box regression vector of the anchor box, which corresponds to Equations (1) and (2) mentioned above. v_i* is the bounding box regression vector for the ground truth related to the respective class. L_reg is the smooth L1 loss function and is only used for positive anchors, i.e., when p_i* = 1. N_cls and N_reg are the mini-batch size and the number of anchor locations, respectively, by which the two loss functions are normalized. Accordingly, as the two loss functions, L_cls and L_reg, were trained, they were balanced through the weight σ. When the two loss functions were minimized through this balancing, the banknote location was detected, and the denomination of the detected banknote was distinguished. Table 4 summarizes the architecture of the network that detects objects using the feature map and the RPN output. The feature map of 38 × 50 × 512, which is the output of VGG-16, and the 300 region proposals of the RPN were used as the input. A fixed feature map of 7 × 7 was obtained in the RoI pooling layer. The newly extracted feature map was used to obtain a 4096-dimensional feature vector through the fully connected layers FC6 and FC7.
The respective vector becomes an input to the classification and regression layers, which output the probabilities of the 9 classes for KRW and the 10 classes for JOD. In the classification layer, each proposal was classified into multiple classes. In the regression layer, the proposal box was transformed into the predicted box by outputting the bounding box regression vector. The final detection result is obtained through NMS.
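For reference, greedy NMS can be sketched as follows; the box representation and the IoU threshold of 0.5 are illustrative assumptions, not values stated for this step in the paper.

```python
# Minimal greedy NMS sketch; boxes are (x1, y1, x2, y2) tuples.
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, drop boxes overlapping it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # [0, 2]: the second box is suppressed
```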

C. SECOND STAGE OF DETECTION
In this study, all cases in which an object is detected from parts that are not banknote objects, or in which another class is detected from an object, are counted as FPs. When a banknote object is not detected in the image, the case is counted as a false negative (FN). As shown in Figure 2(b), incorrectly detected boxes remain after Faster R-CNN. Therefore, FPs were removed in the second stage of detection to improve the detection performance obtained with Faster R-CNN. Specifically, the first stage of detection by Faster R-CNN was configured to minimize the FN error, and the FPs thereby generated were removed in the second stage of detection. The following three handcrafted feature-based post processing methods were applied to remove the FPs in the second stage of detection.
First, the FP box was removed by post processing according to the width-to-height ratio of the detected box. In general, the ground-truth box of coins is closer to a square than the ground-truth box of bills, which is a rectangle. The width-to-height ratio of coins in this study is close to 1; thus, a minimum value is not required. Using the width-to-height ratio of bills, the threshold range of the minimum and maximum ratios was determined using the training data. When the width-to-height ratio of the detection box is not within this threshold range, the detection box was removed as an FP. Figure 3 shows an image in which FPs have been removed after the first post processing.

Some FPs may be present even after the first post processing. To remove these FPs, the second post processing method involves removing FPs based on the detection box size (the number of pixels in the detected box). As shown in Figure 4, the minimum and maximum ranges of the coin box size and bill box size observed in the training data were obtained. Then, based on the class label of the detected box, the box was removed as an FP if its size does not fall within the range of coin box sizes or bill box sizes. Figure 5 shows an image in which FPs have been removed after the second post processing.

Some FPs may remain even after the second post processing. As shown in the left image of Figure 6, there may be two detection boxes on one banknote object. To address this problem, when the IoU of the detection boxes is 50% or greater, the third post processing method removed the remaining FPs by treating only the first-ranked objects as true positives (TPs) based on the detection score obtained from Faster R-CNN. Here, TP refers to the case where the banknote in the input image has been correctly detected according to the corresponding class. This post processing was not applied to candidates that had been detected as coins and was only applied to candidates that had been detected as bills. The reason is that TPs were removed along with coins when the FPs were removed based on the detection score of Faster R-CNN alone, as the coin size is small. Figure 6 shows an image in which FPs have been removed after the third post processing. A simplified sketch of the three steps is given below.
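The following sketch summarizes the three steps; all threshold values are hypothetical placeholders, since the paper derives the actual ranges from the training data rather than stating them, and the iou function from the sketch in Section III.B is passed in.

```python
# Hedged sketch of the second-stage post processing; detections are dicts
# {"box": (x1, y1, x2, y2), "score": float, "label": "coin" or "bill"}.
BILL_RATIO_RANGE = (1.2, 3.0)                          # assumed, from training data
SIZE_RANGE = {"coin": (5e2, 2e4), "bill": (2e4, 4e5)}  # assumed pixel areas

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def step1_ratio(dets):
    # Coins are near-square (ratio close to 1), so only bills are range-checked.
    lo, hi = BILL_RATIO_RANGE
    return [d for d in dets if d["label"] == "coin"
            or lo <= (d["box"][2] - d["box"][0]) / (d["box"][3] - d["box"][1]) <= hi]

def step2_size(dets):
    # Remove boxes whose area is outside the range learned for their class.
    return [d for d in dets
            if SIZE_RANGE[d["label"]][0] <= area(d["box"]) <= SIZE_RANGE[d["label"]][1]]

def step3_score(dets, iou, thresh=0.5):
    # Bills only: among boxes overlapping by IoU >= 0.5, keep the top score.
    coins = [d for d in dets if d["label"] == "coin"]
    bills = sorted((d for d in dets if d["label"] == "bill"),
                   key=lambda d: d["score"], reverse=True)
    kept = []
    for d in bills:
        if all(iou(d["box"], k["box"]) < thresh for k in kept):
            kept.append(d)
    return coins + kept
```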

D. THIRD STAGE OF DETECTION
As explained in Section III.C, the distributions of coin class (class 1), coin and bill class (class 2), and bill class (class 3) based on the detection box size with threshold information, shown in Figure 4, were obtained from the training data. Furthermore, a Faster R-CNN-based banknote detector, which uses the ResNet-18 as the feature extractor, was trained using the training data of each class.
Subsequently, during the testing process, it was determined to which of the three classes (classes 1-3) in Figure 4 each detection box obtained after the second stage of detection belongs, based on the detection box size. The third stage of detection was performed using the pretrained Faster R-CNN only for the data belonging to class 1 or class 3. The training and testing performance can be improved if a separate Faster R-CNN is used, as the forms and sizes of class 1 and class 3 vary. However, the third stage of detection was not performed for class 2 because the training and testing performance of Faster R-CNN is not guaranteed for data with mixed coins and bills. Therefore, in this study, the last fully connected layer of the ResNet-18 [31] pretrained with the ImageNet database [32] was removed so that it could be used as the feature extractor of Faster R-CNN. Table 5 shows the architecture of ResNet-18. The input image has the size of 224 × 224 × 3 in the input layer of ResNet-18. In the first convolution, an output of size 112 × 112 × 64 is produced through 64 filters with the size of 7 × 7 × 3. A feature map of size 56 × 56 × 64, which is half the aforementioned size, is obtained through max pooling. This feature map becomes the input to Conv2, and it then passes through four convolutional layers with the filter size of 3 × 3 × 64. In Conv3, filters with the sizes of 3 × 3 × 128, 3 × 3 × 64, and 1 × 1 × 64 were employed. The feature information of small objects such as coins can be delivered to the next layer by using the shortcuts and residual blocks of ResNet. Similarly, the final feature map of size 7 × 7 × 512 is obtained through Conv3, Conv4, and Conv5. As shown in Figure 1, the result was input to the RPN and classifier to perform the final banknote detection. The architectures of the RPN and classifier are summarized in Tables 6 and 7, respectively. A sketch of the size-based routing is shown below.
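The size-based routing can be sketched as follows; the two area thresholds are hypothetical stand-ins for the class boundaries derived from the training-data distributions in Figure 4.

```python
# Hedged sketch of routing a detection box by its pixel area (Figure 4).
COIN_MAX_AREA = 2e4  # assumed boundary below which only coins occur
BILL_MIN_AREA = 6e4  # assumed boundary above which only bills occur

def route_by_size(box_area):
    """Return 1 (coin), 2 (coin and bill), or 3 (bill)."""
    if box_area < COIN_MAX_AREA:
        return 1  # passed to the third-stage ResNet-18 Faster R-CNN
    if box_area > BILL_MIN_AREA:
        return 3  # passed to the third-stage ResNet-18 Faster R-CNN
    return 2      # mixed size range: third-stage verification is skipped
```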

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. EXPERIMENTAL ENVIRONMENT
There is a lack of an open database of banknote images captured using smartphone cameras. Therefore, the experiment in this study was conducted using the Dongguk Korean Banknote database version 1 (DKB v1). The DKB v1 contains  eight classes, namely, 10, 50, 100, 500, 1000, 5000, 10000, and 50000 KRW, with each class having 800 images, yielding a total of 6,400 images. The images were captured using the frontal viewing camera of Galaxy Note 5 [33]. The images of the banknotes were captured from various distances. To reflect the real-world environment as closely as possible, the images were captured under conditions of various locations, lighting, and cases where the bills were randomly folded. The size of the obtained image is 1920 × 1080 pixels. Figure 7 shows the images in the DKB v1.
Furthermore, the experiment was conducted using the open database of JOD [34] to verify whether the proposed algorithm can be applied to various types of banknote images. The JOD open database contains nine classes (i.e., 1 qirsh, 5, 10 piastres, 1/4, 1/2, 1, 5, 10, 20 dinars), yielding a total of 330 images. The size of the obtained images is 3264 × 2448 pixels. Figure 8 shows the images in the open database of JOD. As shown in Figure 8(b), the open database of JOD includes several images of street quality, such as bills being severely folded or occluded and images with poor quality due to uneven lighting or light saturation. The algorithm proposed in this paper was trained and tested using a desktop computer equipped with an Intel Core i7-950 CPU @ 3.07 GHz, 20 GB of memory, and an NVIDIA GeForce GTX 1070 graphics card (1920 compute unified device architecture (CUDA) cores) [35]. The algorithm was implemented using MATLAB R2019a and R2017b [36] based on CUDA (version 10.0) [37] with the CUDA deep neural network library (cuDNN) (version 7.1.4) [38].

B. TRAINING OF THE PROPOSED METHOD
As explained in Section III.B, Faster R-CNN was used in the first stage of detection in this study. For end-to-end training, two-fold cross validation was performed by dividing the 6,400 images of the DKB v1 into 3,200 images each for the training and testing (validation) sets. In other words, 3,200 images were used for training, and the remaining 3,200 images were used for testing (validation) in the first fold. In the second fold, the two subsets for training and testing (validation) were switched. The average accuracy of the two tests was used as the final performance indicator. Faster R-CNN was trained with the four-step alternating training method [29]. By using stochastic gradient descent [39], we determined the weight values that minimize the difference between the training result and the ground-truth value. The parameters used in this study for training are as follows: base learning rate = 0.001, batch size = 1, gamma = 0.1, momentum = 0.9, weight decay = 0.0005, and number of iterations = 120,000. Figure 9 shows the training loss graphs when the DKB v1 was used. Figure 10 shows the training loss graphs when the JOD open database was used. In both Figures 9 and 10, the training loss converged as the number of iterations increased, which indicates that Faster R-CNN was sufficiently trained in this study.
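Although the experiments were implemented in MATLAB, the reported optimizer settings can be expressed, for illustration, in the following PyTorch fragment; the learning-rate drop schedule (when gamma = 0.1 is applied) is an assumption, since the paper does not state it.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # placeholder standing in for the detector weights
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,           # base learning rate
                            momentum=0.9,
                            weight_decay=0.0005)
# gamma = 0.1 as reported; the step size of 40,000 iterations is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40000, gamma=0.1)
```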

C. CLASS ACTIVATION MAP OF FEATURE USING THE PROPOSED METHOD
In this subsection, the features obtained through the proposed network are analyzed through the class activation map (CAM) of the third-stage detector of Figure 1 for the input images. The regions whose color is close to red represent the important features extracted by our network. This kind of analysis with CAM images has been widely adopted in research on deep learning-based image processing and recognition [40]. CAMs were obtained from Conv1, Conv2_4, Conv3_5, Conv4_5, and Conv5_5 of the feature extractor in Table 5. As shown in Figure 11, the CAMs were obtained by considering the images of coins and bills of the DKB v1 and JOD open database as the input. Figure 11 illustrates that more abstract CAM images were obtained as the convolutional layers became deeper. Both coins and bills had more highly activated results (marked in red) in the object region compared with the background. These results indicate that the proposed network generates feature maps from which banknotes can be easily detected.

D. TESTING OF THE PROPOSED METHOD

1) FIRST STAGE OF DETECTION
Testing was performed with the proposed method to evaluate its detection performance. Based on these results, the detection accuracy was calculated by determining the recall, precision, and F1 score [41], [42] according to Equations (4)-(6).
P, Q, and R represent the numbers of TPs, FPs, and FNs, respectively:

precision = P / (P + Q) (4)

recall = P / (P + R) (5)

F1 score = (2 × precision × recall) / (precision + recall) (6)
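As a check, these three measures can be computed directly from the counts:

```python
# Computes Equations (4)-(6) from the numbers of TPs (P), FPs (Q), and FNs (R).
def detection_metrics(P, Q, R):
    precision = P / (P + Q)                              # Equation (4)
    recall = P / (P + R)                                 # Equation (5)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (6)
    return precision, recall, f1

print(detection_metrics(P=90, Q=10, R=10))  # (0.9, 0.9, 0.9)
```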
Table 8 lists the recall, precision, and F1 score values as the output detection threshold of Faster R-CNN is adjusted from 0.1 to 0.5 in the DKB v1. As listed in Table 8, with the increase in threshold, the recall value decreased, whereas the precision value increased. The criteria for determining TP become stricter with an increase in the threshold; thus, R of Equation (5) increases, but Q of Equation (4) decreases. R is minimized even when Q increases in the first stage of detection, and Q is then reduced in the following steps. Therefore, the output detection threshold of 0.1 with the highest recall value in Table 8 was used for Faster R-CNN. Table 9 lists the accuracies of the first stage of detection for the coin, bill, and coin and bill sub-datasets. As summarized in Table 9, the first stage of detection for bills has a higher accuracy because the object size is larger than that of coins and the features are more easily observed. Table 10 lists the recall, precision, and F1 score values as the output detection threshold of Faster R-CNN is adjusted from 0.1 to 0.5 in the JOD open database. As listed in Table 10, with the increase in the threshold, the recall value decreased, whereas the precision value increased. As explained previously, R of Equation (5) is minimized even when Q of Equation (4) increases in the first stage of detection, and Q of Equation (4) is then reduced in the following steps. Therefore, the output detection threshold of 0.1 with the highest recall value in Table 10 was used for Faster R-CNN. Figure 12 illustrates the result images of the first stage of detection. As shown in Figure 12, the result images contain several FPs in addition to TPs. These FPs are removed during the second and third stages of detection.

2) SECOND STAGE OF DETECTION
The FPs present after the first stage of detection are removed sequentially based on the width-to-height ratio of the box (first step), the box size (second step), and the detection score of Faster R-CNN (third step), as explained in Section III.C. Table 11 lists the accuracies of the second stage of detection per step for the DKB v1 when the first, second, and third steps are applied sequentially. As summarized in Table 11, the precision value and F1 score were the highest when steps 1-3 were applied. Moreover, the second post processing step applied to both coins and bills resulted in a higher precision and F1 score for the same recall value, compared with when applied to either coins or bills. The accuracies of applying the third post processing step only to bills and to both coins and bills were compared, as presented in Table 11. The precision and F1 score for applying the third post processing step to both coins and bills were slightly higher than those when applying it only to bills, but the recall value also decreased. The reason is that TPs were removed along with coins when the FPs were removed based on the detection score of Faster R-CNN alone, as the coin size is small. This study used a method where the FPs are removed in the subsequent step by maintaining the recall as high as possible in each step; therefore, the third post processing step is applied only to bills, as explained in Section III.C. Table 12 lists the accuracies of the second stage of detection for the coin, bill, and coin and bill sub-datasets in the DKB v1. As summarized in Table 12, the second stage of detection for bills has a higher accuracy because the object size is larger than that of coins and the features are more easily observed. Table 13 lists the accuracies of the second stage of detection of Figure 1 per step for the JOD open database when the first, second, and third steps are applied sequentially. As summarized in Table 13, the precision value and F1 score were the highest when steps 1-3 were applied. Moreover, the second post processing step applied to both coins and bills resulted in a higher precision and F1 score for the same recall value, compared with when applied to either coins or bills. The accuracies of applying the third post processing step only to bills and to both coins and bills were compared, as summarized in Table 13. The precision and F1 score for applying the third post processing step to both coins and bills were slightly higher than those when applying it only to bills, but the recall value also decreased. This study used a method where the FPs are removed in the subsequent step by maintaining the recall as high as possible in each step; therefore, the third post processing step is applied only to bills. Figure 13 illustrates the result images of the second stage of detection. As shown in Figure 13, the result images contain FPs (the yellow box in the upper right image and the green box in the lower left image) in addition to TPs. These FPs are removed in the third stage of detection.

3) THIRD STAGE OF DETECTION
The accuracy of the third stage of detection was computed. Table 14 lists the accuracies of the third stage of detection of Figure 1 for the coin, bill, and coin and bill sub-datasets in the DKB v1. As summarized in Table 14, the third stage of detection for bills has a higher accuracy because the object size is larger than that of coins and the features are more easily observed. Table 15 lists the accuracies of the third stage of detection of Figure 1 for the coin, bill, and coin and bill sub-datasets in the JOD open database. As summarized in Table 15, the third stage of detection for bills has a higher accuracy because the object size is larger than that of coins and the features are more easily observed. Figure 14 shows examples of correct detection for all three stages of detection performed using the proposed method. As shown in this figure, correct detection results were still obtained when banknotes were folded, some parts were occluded in the input image, or there was severe light saturation on the surface. Figure 15 shows examples of incorrect detection for all three stages of detection performed using the proposed method. As shown in this figure, incorrect detection results were obtained when banknotes were severely folded, the object was slanted, some parts were not in the input image or were occluded, or there was severe light saturation on the surface.
In summary, our method produces the minimum number of FNs (causing the higher recall of Equation (5)) in the first stage of Figure 1, although the number of FPs is also increased (causing the lower precision of Equation (4)). The results are shown as the bold numbers of Tables 8 and 9 in the case of DKB v1, and those of Table 10 in the case of JOD. Then, the number of FPs is further reduced by the second stage of Figure 1 (causing the higher precision of Equation (4) and F1 score of Equation (6) while the recall is maintained). The results are shown as the bold numbers of Tables 11 and 12 in the case of DKB v1, and those of Table 13 in the case of JOD. Finally, the images including only the candidates of classes 1 (coin) and 3 (bill) of Figure 4 are processed by the third stage of detection of Figure 1 (causing the higher recall, precision, and F1 score). The results are shown as the bold numbers of Table 14 in the case of DKB v1 and those of Table 15 in the case of JOD.

E. COMPARISONS WITH THE STATE-OF-THE-ART METHODS
In the following experiment, the accuracy of the proposed method was compared with those of the state-of-the-art methods. The performance of the proposed method was compared with those of the SURF-based banknote detection methods based on handcrafted features [1]-[3], [10], the Faster R-CNN-based banknote detection method based on deep features [22], the MobileNet-based banknote detection method [6], and the YOLO v2 [26] and YOLO v3-based detection methods [21]. As summarized in Table 16, the proposed method exhibited a higher accuracy than the state-of-the-art methods for the DKB v1. As also summarized in Table 17, the proposed method exhibited a higher accuracy than the state-of-the-art methods for the JOD open database. The following observations can be made from Tables 16 and 17. MobileNet-based detection, which is based on deep features, uses a small number of layers. Thus, it shows a lower accuracy for detecting small objects such as coins and ultimately demonstrates a lower accuracy than the Faster R-CNN-based and YOLO-based detection methods. The YOLO-based detection methods exhibited a higher accuracy than the Faster R-CNN-based method, but the former had a lower detection performance than the proposed method. Figures 16 and 17 present the graphs of the changes in precision, recall, and F1 score according to the IoU threshold of the proposed method and the state-of-the-art methods for the DKB v1 and JOD open databases, respectively. As shown in these figures, the proposed method had a higher detection accuracy than the state-of-the-art methods for all IoU thresholds.

V. CONCLUSION
In this study, a new method was proposed for banknote detection with banknote images captured in complicated backgrounds and various environments using a smartphone camera. To improve the detection performance of the VGG-16-based Faster R-CNN in the first stage of detection, post processing methods were applied as the second stage of detection based on three features, namely, the width-to-height ratio, detection box size, and detection score, to remove the FP candidates. For the candidates remaining after the post processing, verification was performed as the third stage of detection by the ResNet-18-based Faster R-CNN to detect the final banknote region. Furthermore, the self-collected DKB v1 and the developed models with algorithms were disclosed for a fair evaluation by other researchers, as shown in [8]. When the experiments were conducted with the DKB v1 and JOD open databases, high detection performance was obtained for bills, but FP detection errors were still produced for coins.
In future work, we will study deep networks that can detect small objects such as coins in images more accurately. In addition, a shallow network-based detection method that can shorten the processing time will be examined.