Scene Graph Generation with Structured Aspect of Segmenting the Big Distributed Clusters

Accurate fruit counting is one of the important phenotypic traits for crucial fruit harvesting decisions. Existing approaches count either through detection or through regression. Detecting fruit instances is very challenging because the fruits are very small compared to the whole image of a tree, while regression-based counting gives impressive results but becomes inaccurate as the number of instances increases. Moreover, most approaches lack scalability and are applicable to only one or two fruit types. In this paper, we propose a fruit counting mechanism that combines loose segmentation and regression counting and works on six fruit types: Apple, Orange, Tomato, Peach, Pomegranate and Almond. Through relaxed segmentation, fruit clusters are segmented to extract small image regions, each containing a small cluster of fruits. The extracted regions are forwarded for regression counting of fruits. Relaxed segmentation is achieved through a state-of-the-art deconvolutional network, while a nonlinear regression module based on modified Inception Residual Networks (Inception-ResNet) is proposed for fruit counting. For segmentation, 4,820 original images of all six fruit types, with their corresponding mask images, are augmented to 32,412 images through different augmentation techniques, while 21,450 extracted patches are augmented to 89,120 images used for training the regression module. The proposed approach has superseded the counting accuracy of existing approaches of individual fruit types, but we have achieved an overall 94.71% accuracy.


I. INTRODUCTION
Yield estimation is becoming increasingly important in digital agriculture. It helps farmers streamline harvesting resources, which cuts harvesting costs, and enables them to market the yield more effectively for higher profits. With a prior estimate of yield, farmers can make well-founded decisions to arrange the labor and machinery for reaping the crop, order the required packing material early, manage transfer logistics, and prepare sizable storage and processing facilities [1]. With such early decisions, farmers can devise a better marketing and sales strategy to get a higher price. On the other hand, manual fruit estimation in orchards is quite labor-intensive, gives inaccurate numbers, and is infeasible at a large scale.
Agricultural vision, which applies image processing through computer vision, is a growing field that can also assist in yield estimation. Traditional image processing techniques are inefficient because of varying lighting conditions, color complexion, lack of robustness, occlusion, and their reliance on hand-engineered features for each specific scenario [3], [4]. Deep learning techniques, such as Convolutional Neural Networks (CNN), have extended the limits of image processing and solve complex computer vision problems such as classification, detection and segmentation [5]. In addition, deep learning techniques generalize well across various fruit types and across environments with dynamic lighting conditions. Significant progress has been made in devising efficient and accurate systems for orchard fruit counting. Several object detection methods have been developed based on localization and classification of fruits [6], [7]. Still, these methods lack accuracy when the instance size is small, and capturing images from a closer distance is not practically feasible. Counting through object estimation based on density maps is also quite an effective technique [8]. Explicit object counters, which use unmanned aerial vehicles (UAVs) as input sources, have also been developed [9], [60].
Regression models give state-of-the-art results in object counting because they directly optimize a loss function for count prediction [14], [38]. In contrast, detection-based models must be optimized for more complex tasks, such as estimating the shape, size, and spatial location of the object instances. Perfect detection brings a perfect count; however, bounding-box and pixel-level annotated training data is expensive to acquire compared to point-level annotations, which give an approximate localization of the object, and the exact shape is not required for counting [39], [40].
The main contributions of this research are as follows. (1) We estimate the orchard's yield by counting fruit instances. The primary focus is not only to build a highly accurate deep learning mechanism but also to develop a generalized approach so that the model can be trained without prior knowledge of the fruit type.
(2) To count the fruit instances without segmenting the individual instances, we formulate the estimation problem as a non-linear regression problem, which is helpful for several reasons. First, regression on small patches is more efficient than segmenting out the individual fruit instances. Second, from the supervised learning point of view, annotating individual instances is a challenging task compared to annotating segmented regions that contain a small number of fruit instances. Finally, generalization is very important so that the model learns directly from annotated data without explicit information about the fruit type.
(3) Deep learning inspired models are implemented to generalize the solution across different datasets, lighting conditions and variable sizes.
Driven by the limitations of existing techniques, we have devised a completely data-driven counting method based on loose semantic segmentation and direct regression from images. With the emergence of CNNs, segmentation has become crucial for image analysis tasks in various fields, including precision agriculture [10]. Semantic segmentation is a clustering process that groups objects of the same category together [11]. Fruits are entangled in clusters; therefore, individual fruit instances are not always obvious, which strengthens the case for a segmentation-based counting method for tiny fruit instances. We have developed a convolutional-deconvolutional segmentation network for binary segmentation of the image to separate the fruit instances from other regions. As deep learning models are data-hungry, data augmentation techniques are applied to the images and, identically, to their respective point-level mask images. Tiny patches containing fruit clusters are then extracted from the original RGB image through connected component analysis of the segmented image.
The extracted patches go to the counting module to get the count on individual patches, which is summed up to get a total count for a single image. Our experiment illustrates that the proposed loose segmentation-based counting model obtains better and more efficient counting output than detection-based regression methods.
The next section reviews existing approaches to counting. Section III then explains our estimation mechanism for fruit counting in relation to the problems of previous systems. Section IV elaborates on the dataset, training setup, and evaluation of the output. Lastly, Section V summarizes our work and outlines future directions.

II. RELATED WORK
Object detection has been in the spotlight over the past years, and considerable work has been done [14], [15]. Attempts have been made to solve the counting problem in an unsupervised way by clustering objects based on motion similarities [12] or structural similarity [13], but such unsupervised approaches have accuracy limitations, and supervised approaches are considered to achieve higher accuracy. Broadly, counting solutions fall into three categories [16]: (1) clustering-based counting, (2) regression-based counting, and (3) object detection-based counting.
Clustering-based approaches, which rely on unsupervised learning, represent the initial work on counting problems. Objects are clustered based on similar features, such as texture, appearance, color and motion [17], and the objective is to maximize the likelihood, which groups the individual object instances on low-level features. For example, a motion analysis-based mechanism is proposed for moving objects, where a parallel KLT tracker is used to observe the motion and appearance of feature points, and clusters are formed based on the observed features [18]. Unsupervised approaches use lower-level features and count inaccurately in comparison with state-of-the-art deep learning approaches.
Regression-based counting approaches are very accurate and efficient because the counting mechanism is learnt explicitly rather than by optimizing object localization. They learn a direct mapping from image features to count labels, and this learning requires a huge amount of annotated data. The authors of [14] proposed a method, called glance, which explicitly learns counting by mapping labelled counts onto the image. However, regression-based approaches become inefficient and give low accuracy when object instances are present in large numbers [15].
Counting through object detection draws bounding boxes on the detected objects and simply counts the bounding boxes. For training, ground-truth labels are given as bounding boxes around objects [19][20][21][22]. Perfect detection leads to perfect counting, but Chattopadhyay et al. [15] showed that detection methods can perform poorly because the model needs to learn the object shape and size and localize it despite occlusion in real-world conditions. Therefore, detection methods based on pixel-level ground truth have also been proposed [2], [23].
Song et al. [24] suggested a counting method with two models: (1) a bag-of-words model to discover the fruit instances in an image, and (2) an aggregate model to sum up the count using a statistical approach on a set of given images. Maldonado et al. [25] presented a method for counting green orange fruits based on the correlation between visible fruits and whole fruits on trees. Feature extraction is performed by combining techniques such as Gaussian blur thresholding, histogram, color conversion, spatial filtering and the Sobel operator. The input image is converted into a bas-relief representation on which filtering is applied, and the result is forwarded to an SVM, which decides whether each object is a fruit; the positive decisions are counted. However, the large number of adjustable parameters and the manual feature extraction make the method slow, and it is not robust in occluded conditions. Linker [35] suggested an estimation procedure based on light distribution. Dorj et al. [36] used color features to recognize fruit instances, converting the RGB image to HSV and applying different preprocessing techniques for counting.
Rahnemoonfar et al. [7] proposed an Inception-ResNet based estimation approach that maps the labelled count onto images and reduces detection and localization cost. Training is performed on synthetic data and testing on real tomato images. Chen et al. [27] suggested a deep learning approach that directly maps the total count to input images. Candidate regions are extracted through a convolutional network-based blob detector, another convolutional network is employed to estimate the count in each extracted region, and a regression model maps the estimated counts to a final count. Qureshi et al. [28] proposed two methods: (1) a texture-based method built on K-nearest neighbor classification and segmentation, and (2) a segmentation-based method which uses a support vector machine for classification. Bargoti et al. [29] presented a segmentation-based approach that consists of a multilayer perceptron and a convolutional neural network. Segmentation is generated using watershed segmentation, and individual fruits are counted through a Hough transform.
Liu et al. [30] presented a segmentation and 3D localization model for counting. A fully convolutional network is used for segmentation, and localization is performed using an incremental structure-from-motion algorithm. Ponce et al. [31] proposed a counting method based on mathematical morphology, which segments the olives to extract a feature representation. Häni et al. [32] proposed a semantic segmentation model based on the U-Net architecture, with a CNN for classification. Bellocchio et al. [33] presented a weakly supervised framework for explicit counting without supervised labels; only a label indicating whether instances of the fruit class are present is required. They proposed an objective function to keep track of the predictions at different spatial locations of the image. Roy et al. [37] presented a counting approach in which semi-supervised clustering based on color is performed for fruit identification, and spatial characteristics are exploited through unsupervised clustering. Xiong et al. [65] used YOLOv2 for fruit detection and linear regression for fruit counting.
Tu et al. [34] presented a counting framework based on detection through a multiple-scale Faster R-CNN, which detects lower-level features effectively by incorporating feature maps for regions of interest. First, high-level and lower-level features are extracted through a multiple-scale detector; then RGB and depth detectors are trained, which are finally combined through late fusion methods.

III. PROPOSED APPROACH
This section illustrates the proposed yield estimation approach based on loose binary semantic segmentation, in which binary segmentation is used to extract small patches, each containing a fruit cluster, from an image. The high-level design of our approach follows a traditional computer vision workflow in which the counting module follows the segmentation module. The two-step computational process for yield estimation is outlined in Fig. 1. In the first step, the proposed segmentation module generates loosely segmented fruit cluster regions from RGB images. Then, the corresponding fruit cluster regions are extracted from the input RGB image based on the segmented regions. Each extracted region is forwarded as input to the counting module to obtain its individual fruit count. At the end, the individual counts are summed to get the overall predicted count for the given input image. Both the segmentation and counting modules are built on deep learning architectures. For each module, a task-oriented convolutional architecture is introduced and trained without prior knowledge of the fruit type, in order to build a generalized yield estimation approach that can be trained from the data alone. Although both modules are trained separately, they are not entirely independent, since the binary masks produced by the segmentation module are used to extract the sub-patches containing fruit instances from the original images. These extracted sub-images are used for training the counting module. In the two subsections below, both modules are described along with the rationale behind their design.
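The two-step flow above can be sketched in a few lines. The following is a minimal illustration only, not the implementation used in this work: it assumes the segmentation output is already available as a binary mask stored as nested lists, uses 4-connectivity for the connected component analysis, and takes the per-patch counter as a caller-supplied function. The names `cluster_bounding_boxes`, `crop` and `count_fruits` are hypothetical.

```python
from collections import deque


def cluster_bounding_boxes(mask):
    """Return one (top, left, bottom, right) box per 4-connected foreground region.

    `mask` is a list of rows of 0/1 values; bottom and right are exclusive.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected component.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom + 1, right + 1))
    return boxes


def crop(image, box):
    """Cut the rectangular region `box` out of `image` (a list of rows)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def count_fruits(image, mask, count_patch):
    """Sum the per-patch counts over all extracted cluster regions."""
    return sum(count_patch(crop(image, box)) for box in cluster_bounding_boxes(mask))
```

In such a sketch `count_patch` would wrap the trained counting network, while the segmentation network would supply `mask`; the loose mask only has to cover each cluster, not trace individual fruits.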

A. Segmentation
Regression counting on the whole image at once is computationally expensive when an orchard contains hundreds of trees. It requires many labelled samples, which are tedious to obtain, and labelling becomes extremely time-consuming when there are hundreds of fruit instances in a single image. As established earlier in [14], [15], regression-based counting achieves great results when the number of instances in an image is small; however, accuracy is compromised as the number of instances per image increases. Moreover, fruits grow in clusters, and processing the whole image is costly. So, instead of processing the image as a whole, counting over non-overlapping patches containing clusters of fruit is required. Therefore, disjoint patches of segmented fruit clusters are generated to provide thousands of small patches for training the counting module.
From the design point of view, the output of the segmentation module is kept loose because, instead of segmenting individual fruit instances, we want to segment the clusters so that the corresponding patches can be extracted. Moreover, because of the large number of fruit instances in an image, annotating the exact ground truth is highly tedious, and it becomes even more expensive as the deep learning paradigm requires thousands of such annotated images. Since the background is almost uniform, regression learns similarly from loose and exact segmentation, and the chance of picking up distinctive features from the background is very low. Zhou et al. [41] support this claim by visualizing the network, which reveals saliency in the foreground. Fruit instances are very small compared to the whole image and are also partially occluded; therefore, soft segmentation of fruit clusters is a suitable choice.
Because it has fewer trainable parameters, we have used the SegNet architecture [42], [43] to generate the loose segmentation of fruit clusters, instead of deconvolutional networks with fully connected (FC) layers [63], which have many more trainable parameters. As far as the cardinality of categories and the nature of the domain are concerned, loose segmentation is less complex than multi-class semantic segmentation because variations in pixel intensities are restricted within a single image. Additionally, the main purpose is not to obtain a highly accurate overall segmentation mask; rather, the aim is not to miss any fruit cluster region in the image, since these regions are eventually used for training the counting module. Fig. 2 illustrates the segmentation network used. The front-end convolutional substructure of the segmentation network is based on the VGG architecture [26], in which five 2×2 max-pooling operations are interleaved with convolutional and nonlinearity layers; this compresses the feature map by a factor of 32 before the back-end deconvolutional operations.

A rectified linear unit [64] and batch normalization [46] are used after every convolutional and deconvolutional layer. In addition, 2×2 max-pooling with stride 2 is used in the front-end network, while 2×2 max-unpooling in the back-end network uses the corresponding pooling indices from the front-end network.
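To make the role of the pooling indices concrete, the sketch below shows 2×2 max-pooling with stride 2 that records where each maximum came from, and the matching max-unpooling that writes each value back to its recorded position. It is a plain-Python illustration on a single-channel map; the function names are hypothetical and the code is not taken from the network used here.

```python
def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2; also records the position of each maximum."""
    pooled, indices = [], []
    for r in range(0, len(feature_map) - 1, 2):
        pooled_row, index_row = [], []
        for c in range(0, len(feature_map[0]) - 1, 2):
            # Gather the four values of the window together with their positions.
            window = [(feature_map[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; all other cells are 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out
```

Because only positions are stored, unpooling needs no learned parameters, which is consistent with the smaller parameter count that motivates this kind of decoder.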
Class imbalance is very high in the fruit counting domain, since the ratio of fruit cluster to background is heavily skewed. To address the class imbalance problem, weighted categorical cross-entropy is used as the loss function, which allows the weights to be adjusted depending on the misclassification.
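As an illustration of how class weights enter such a loss, the sketch below computes a weighted cross-entropy over a flat list of pixels. The weighting scheme is not specified in the text, so the inverse-frequency weights shown here are an assumption made purely for the example, and both function names are hypothetical.

```python
import math


def weighted_cross_entropy(probabilities, labels, class_weights):
    """Mean weighted cross-entropy over pixels.

    probabilities: per-pixel lists of class probabilities (each sums to 1)
    labels:        per-pixel integer class ids
    class_weights: one weight per class; a larger weight makes errors costlier
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total += -class_weights[label] * math.log(max(probs[label], eps))
    return total / len(labels)


def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight each class by n / (num_classes * class_count)."""
    n = len(labels)
    counts = [labels.count(k) for k in range(num_classes)]
    return [n / (num_classes * count) if count else 0.0 for count in counts]
```

With three background pixels and one fruit pixel, this scheme weights the fruit class three times as heavily as the background, so a missed fruit pixel contributes as much to the loss as three misclassified background pixels.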

B. Counting
As Fig. 1 shows, the RGB images used by the counting module are extracted after the segmentation module has segmented out the fruit clusters. Extracting multiple patches from the original large image satisfies the need for a huge training dataset, so that the model can learn the input-output mapping. We have used a deep learning inspired counting approach to obtain a generalizable and robust counting solution. A CNN, a combination of convolutional and pooling layers, is a deep learning approach that replicates the operational mechanism of the human vision system [44]. The input to the CNN is an image that passes through different convolutional and pooling layers, producing a representative feature map as output. Between the input and output layers, the feature map passes through many hidden layers, which consist of a stack of convolutional and pooling layers. Training of a CNN goes through two stages: (1) feedforward and (2) backward propagation. During the feedforward stage, the loss is calculated from the predicted output, derived from the produced feature maps, and the labelled outputs. In backpropagation, the gradient of the loss is calculated with respect to each weight parameter, and the parameters are updated based on the gradient for the next feedforward calculation. This two-stage process is repeated for many iterations and terminates when the loss stops decreasing.
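The two-stage cycle can be illustrated on a deliberately tiny model. The sketch below runs gradient descent on a one-weight linear model with a mean squared error loss, not on a CNN; the learning rate, tolerance and function name are assumptions chosen only to show the forward pass, the gradient step, and the stopping rule.

```python
def train(samples, learning_rate=0.05, tolerance=1e-9, max_iterations=10_000):
    """Fit y ~ w * x by alternating a forward pass and a gradient update."""
    w = 0.0
    loss = float("inf")
    previous_loss = float("inf")
    for _ in range(max_iterations):
        # Feedforward: predictions and mean squared error against the labels.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if previous_loss - loss < tolerance:
            break  # the loss has stopped decreasing
        previous_loss = loss
        # Backpropagation: gradient of the loss with respect to the weight.
        gradient = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * gradient
    return w, loss
```

In a CNN the same loop applies to every weight at once, with the gradients obtained layer by layer through the chain rule.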
Typically, a CNN learns feature maps over two spatial dimensions and one channel dimension simultaneously, which increases the number of parameters. Inception models, on the other hand, ease this process and learn feature representations with fewer parameters because they work on spatial and cross-channel correlations. Although different inception models have been introduced with slight variations [45], [47], Inception-ResNet [48] outperformed them on the ImageNet dataset [49]. Influenced by this performance, we used a modified Inception-ResNet-A within the proposed CNN network. Fruits are usually extremely crowded and vary in size, both through natural variation and through the image capturing position; therefore, high-level semantic features play a crucial role compared to receptive fields. For this reason, the modified Inception-ResNet-A is included, which enlarges the receptive field [50]. The proposed network architecture is shown in Fig. 3. The first layer of the network is a 5×5 convolution followed by 2×2 max-pooling with stride 2, producing 64 feature maps. To reduce the dimensions of the first-layer feature map, a 1×1 convolution is applied. Next, two 3×3 convolutional layers followed by 2×2, stride-2 max-pooling layers are applied, producing 96 and 126 feature maps. Then come the modified inception layers, which take feature maps of multiple sizes by concatenating residual units [51] and the results of different filter sizes. A residual network converges faster because its residual (skip) connections provide a path for gradient flow. Fig. 4 illustrates the architecture of the modified Inception-ResNet-A module. The last layer, a 1×1 convolution, calculates 126 feature maps instead of 256 as in the original Inception-ResNet [48]. The inception layer consists of three concatenated layers, and the result is added to the activation of the previous layer, which passes through a rectified linear unit.
After the inception layer, a 3×3 convolution is applied again, followed by 2×2 max-pooling with stride 2, which increases accuracy when used before a fully connected layer [52]. The size of the fully connected layer is 626. Deep learning models are prone to overfitting, which can be mitigated through the dropout technique [53]; we randomly drop 40% of the connections. Instead of a regression output, we apply SoftMax with 11 outputs, because the number of fruit instances in each extracted region is less than 12, which makes SoftMax suitable.
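Reading a count from a SoftMax head can be sketched as follows. The example assumes that class index k stands for a count of k fruits, which is an interpretation rather than a detail stated above, and it works for any number of output classes; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_count(logits):
    """Class index with the highest probability, read as a fruit count."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def image_count(patch_logits):
    """Total count for one image: the sum of the per-patch predictions."""
    return sum(predicted_count(logits) for logits in patch_logits)
```

Treating the count as a class keeps every prediction an integer within the labelled range, which is only workable because each patch holds a small, bounded number of fruits.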

IV. EXPERIMENT
This section demonstrates the effectiveness of the proposed methodology. First, we explain the datasets used for both modules. Next, we describe the training methodology for both modules, along with the training setup and implementation details. Lastly, we give an evaluation and a comparison with other proposed approaches.

A. Dataset
The dataset consists of 6 different fruits: Apple, Almond, Orange, Peach, Tomato, and Pomegranate. A sample image of each fruit type is shown in Fig. 5. The 4,820 original images are augmented to 32,412 images, of which 80% are used for training the segmentation module, while the remainder are used for validation. Although the images for segmentation are gathered from different datasets, including Google Images, and have different sizes, they are all resized to 900 × 650 pixels. Fifty images of each fruit type are used to test the system's accuracy. Annotations, especially for segmentation, are very tedious to obtain for a huge dataset; for this reason, different augmentation techniques are used to enlarge the dataset. Data augmentation is essential to teach the network the desired invariance and robustness properties when only a few training samples are available. We have applied transformations with generalization ability [54]. Commonly used transformations, such as left-right flipping, elastic deformations [55], and rotation, are applied. These transformations are also applied with the same parameters to the corresponding mask images. A breakdown, by fruit type, of the augmented images used for segmentation training is given in Tab. 1. For the training of the counting module, 21,450 sub-patches are extracted from the original images, and each patch contains 0 to 11 fruit instances. The maximum value of the assigned label was 11. We also used augmentation techniques, such as left-right flipping, color changes, and rotation, to enlarge this set while preserving the assigned labels. After augmentation, we have 89,120 sub-images divided into training, validation and test sets. Of these, 88,820 sub-images are used for training, of which 17,764 (about 20%) are set aside for validation. Finally, 300 sub-images, 50 of each fruit type, are used to test the counting accuracy.
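Applying a transformation "with the same parameters" to an image and its mask amounts to drawing the random parameters once and reusing them for both. The sketch below illustrates this with left-right flipping and rotation only; restricting rotation to multiples of 90 degrees and omitting elastic deformation are simplifications made for the example, and the function names are hypothetical.

```python
import random


def flip_left_right(grid):
    """Mirror a 2-D grid horizontally."""
    return [row[::-1] for row in grid]


def rotate_90(grid):
    """Rotate a 2-D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment_pair(image, mask, rng):
    """Draw the parameters once, then apply them to both image and mask."""
    do_flip = rng.random() < 0.5
    quarter_turns = rng.randrange(4)
    results = []
    for grid in (image, mask):
        if do_flip:
            grid = flip_left_right(grid)
        for _ in range(quarter_turns):
            grid = rotate_90(grid)
        results.append(grid)
    return results[0], results[1]
```

Calling `augment_pair(image, mask, random.Random(seed))` with different seeds yields different variants, and in every variant the mask still marks the same pixels of the transformed image.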

B. Training Setup
For segmentation, the network was trained for 2,000 epochs with a batch size of 16 over the augmented dataset. To minimize the error, SGD with momentum was used with learning rate 0.02, momentum 0.8, and weight decay 0.0002. The Xavier initializer was used to initialize the parameters [56]. For the training of the counting module, the Adam optimizer was used with learning rate and weight decay both equal to 0.0001. The network was then trained for 150,000 epochs with batch size 32. Both networks were implemented using Keras on a machine with 16 GB RAM and an Nvidia 1080Ti GPU.
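For reference, one common formulation of the SGD-with-momentum update is sketched below using the segmentation hyperparameters quoted above as defaults. Treating weight decay as an L2 term added to the gradient is an assumption, since the text does not say how the framework applied it, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, gradients, velocity,
                      learning_rate=0.02, momentum=0.8, weight_decay=0.0002):
    """One parameter update; returns the new weights and the new velocity."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, gradients, velocity):
        g = g + weight_decay * w               # assumed: decay as an L2 gradient term
        v = momentum * v - learning_rate * g   # accumulate a running direction
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity
```

The velocity term carries part of the previous step forward, so consecutive gradients pointing the same way produce progressively larger steps.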

C. Analysis & Comparisons
The proposed approach has been evaluated qualitatively, and the counting results are also compared with results reported by different articles on fruit counting for the fruit types considered here. Loss and accuracy graphs of both the segmentation and counting modules are also given.

C.1 Segmentation Module
Here, the segmentation is evaluated against three metrics. The performance of the proposed segmentation module in generating loose binary segmentation is assessed, and precision, recall, and accuracy are calculated. The values for precision (~87) and recall (~84) seem low because the ground-truth masks are loosely annotated; however, the loosely marked contours cover almost all the fruit patches in the image. We have visually examined the test segmentation results and found almost no fruit-containing region left undetected by the segmentation network. Nevertheless, higher segmentation accuracy is achieved. The precision, recall and accuracy scores are shown in Tab. 2, where TP is the number of true positives (correct segmentation), TN is the number of true negatives, FP is the number of false positives (false segmentation), and FN is the number of false negatives (misses). The segmentation module is trained for 2,000 epochs, and the final training and validation accuracies are 95.5% and 87.8%, respectively. From the gap between training and validation accuracy, it can be concluded that the model slightly overfits the training data, a common problem with deep learning models. The accuracy graph in Fig. 6 shows the training and validation accuracies over the epochs. The segmentation training loss decreased from 9.2 to 0.11, while the validation loss fell from 9.35 to 0.25 after 2,000 epochs. The graph in Fig. 7 shows the loss throughout training.
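The three metrics follow the standard definitions in terms of TP, TN, FP and FN. The sketch below tallies the four quantities from a predicted and a ground-truth binary mask and then computes the scores; the function names are hypothetical, and the sketch assumes at least one predicted and one true foreground pixel so that no denominator is zero.

```python
def confusion_counts(predicted, truth):
    """Tally TP, TN, FP, FN pixel by pixel from two binary masks."""
    tp = tn = fp = fn = 0
    for predicted_row, truth_row in zip(predicted, truth):
        for p, t in zip(predicted_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def segmentation_metrics(tp, tn, fp, fn):
    """Return (precision, recall, accuracy)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy
```

Because accuracy also rewards correctly labelled background, it can sit well above precision and recall when the background dominates the image.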

C.2 Counting Module
The counting module is trained for 150,000 epochs with approximately 89,000 images divided into training and validation sets in an 80% to 20% ratio. Training and validation losses (Fig. 8) both started at approximately 8.6; at the final epoch, the training loss had fallen to 0.07 and the validation loss to 0.12. The validation loss dropped below the training loss at some epochs but remained higher for most of the training time. From Fig. 9, it can be seen that the counting module also overfits, as the validation accuracy remained lower than the training accuracy. At the end of the last epoch, training and validation accuracy stood at 97.2% and 93.9%, respectively. Then, the counting module is evaluated by comparing the predicted fruit count with the ground-truth count. During training, 97.57% accuracy is achieved, but during testing accuracy drops to 92.5% on average across all fruits. Apple achieved the highest accuracy, 96.2%, while Almond achieved the lowest, 89.5%, among all the fruit types. A breakdown of test accuracies by fruit type is given in Tab. 3.

V. CONCLUSION
To the best of our knowledge, this is the first attempt to estimate fruit yield for multiple fruit types simultaneously. Almost all the known fruits share some common characteristics, such as circular shape, skin texture, and background, which makes a shared model a suitable way to counter the lack of a big dataset for any single fruit type. Through shared features, we built a single pipeline for fruit counting. Moreover, relaxed segmentation avoids unnecessary processing of image regions where no fruit instances are present. It is very difficult to obtain the exact mask of an image, so loose segmentation allows the cluster regions to be extracted for further processing to count the instances. The use of SegNet makes segmentation faster because it has a smaller number of parameters. As established in the literature, the regression method shows state-of-the-art results, and the inclusion of Inception-ResNet-A brings not only higher accuracy but also lower computation cost.
In the future, we plan to include more fruit types and build a counting mechanism for video, which will eventually be converted into a mobile application.