Intelligent Underwater Stereo Camera Design for Fish Metric Estimation Using Reliable Object Matching

The resolution of computed depth maps of fish in an underwater environment limits 3D fish metric estimation. This paper addresses this problem using object-based matching for underwater fish tracking and depth computation with convolutional neural networks (CNNs). First, for each frame of a stereo video, a joint object classification and semantic segmentation CNN segments fish objects from the background. Next, the fish objects in these images are cropped and matched for subpixel disparity computation using a video interpolation CNN. The calculated disparities and depth values are used to estimate the fish metrics, including length, height, and weight. The fish are then tracked across the frames of the input stereo video to compute their metrics frame by frame, and the median metrics are calculated to reduce the noise caused by fish motion. Finally, fish with incorrect stereo measurements are removed before generating the final fish metric distributions, which are relevant inputs for learning decision models to manage a fish farm. We also constructed underwater stereo video datasets with actual fish metrics measured by humans to verify the effectiveness of our approach. Experimental results show a 5% error rate in our fish length estimation.


I. INTRODUCTION
Deep learning methods for stereo image processing are gaining popularity. Many studies have addressed stereo imaging [1][2][3], but few have been applied to modeling underwater scenes. Underwater images pose more significant challenges: captured images generally suffer from image degradation, poor contrast, blurring, color deviation [4], poor visibility, light attenuation, and water turbidity [5]. For stereo images, depth and disparity map information are essential for quality 3D modeling [6][7][8]. To enable 3D modeling in the underwater environment, we use an underwater stereo camera to capture and match each fish between the left and right images based on the proposed unsupervised underwater stereo matching neural network. The dense disparities obtained between the left and right fish are used to compute the depth map of the 3D model of each fish. From this 3D model, the fish's body length, height, and width can be estimated more accurately. The estimated values, combined with various sensors and weight regression formulas, can establish the growth curve of the fish. Monitoring fish size helps breeding experts record the breeding process and reduces labor costs. Figure 1 shows the diagram of our proposed stereo matching for underwater object reconstruction, with the left and right images as inputs. Stereo image rectification is a preprocessing step that uses the correct intrinsic and extrinsic parameters obtained by calibrating the stereo camera system. Each rectified image is then fed to the instance segmentation neural network, which transforms each image frame into a set of fish objects and background objects. For object matching, the correspondence of each object in the left image is searched in the right image to generate the disparity map. However, a single disparity value cannot accurately restore the pixel depths of a 3D object.
Another potential problem is that the object's pose in the left image may differ slightly from that in the right image because the left and right cameras of a low-cost stereo camera system might not be accurately synchronized. Also, the target fish assumes multiple postures as it swims freely in the underwater environment, which introduces additional noise into measuring its exact 3D information. Lastly, fish may overlap with other fish in the captured images, which degrades the mask accuracy of the fish objects even with well-designed instance segmentation CNNs.
To address these difficulties, the left-image object and the right-image object are cropped and aligned to form an input pair for further processing. Our stereo matching algorithm, based on the video interpolation CNN (VICNN) [9], calculates the residual disparity of each pixel in the left object. The core of the algorithm is the VICNN, which synthesizes the intermediate object to establish pixel correspondences between the left and right objects. Instead of using a single frame, the proposed object matching scheme tracks each fish across frames and calculates a sequence of 3D models to reduce the biometric noise introduced by the posture variations of a freely swimming fish. This mechanism is nonintrusive: it reduces manual handling of the fish, preventing stress [10] and disturbance, and avoids the injury caused by catching fish to estimate their biological information. The contributions of the proposed approach are as follows. First, the proposed deep neural network establishes real-time pixel correspondences between stereo images. Second, traditional deep learning models require a large set of human-annotated labels for training; our approach is trained directly from raw video data, so the training complexity of the deep neural networks for 3D model reconstruction is significantly reduced. Third, it is difficult to establish precise pixel correspondences from texture-less stereo images using traditional stereo matching algorithms; the interpolated signals of the matched object pairs ensure the correctness of the computed disparity image, with precise correspondences and minimal object matching error. Finally, the object-based stereo matching optimization algorithm contributes to the disparity image warping design that improves the smoothness of the result. The remainder of this paper is organized as follows. Section II details the proposed method, Section III describes the experimental setup and datasets, Section IV presents the experimental results, and Section V concludes with future works.

II. PROPOSED METHOD

A. RELIABLE OBJECT MATCHING WITH SEMANTIC SEGMENTATION CNN
As shown in Figure 1, our approach integrates camera calibration as a preprocessing stage to obtain the stereo system's correct intrinsic and extrinsic parameters. The intrinsic and extrinsic parameters are then used in the image rectification process to obtain the rectified images. Rectification projects both images onto a common image plane, warping the left and right image pixels so that corresponding pixels share the same row coordinate; as a result, all epipolar lines become parallel and horizontal in the image plane. In the following, we assume the input stereo images have been rectified before being applied to the proposed system. Each rectified left and right image pair is first fed to a semantic segmentation neural network [11] with excellent object detection and segmentation performance. The semantic segmentation CNN transforms each frame into a set of reliable fish objects and the background object. Let $O_t^L$ ($O_t^R$) and $O_{t+1}^L$ ($O_{t+1}^R$) be the tracked objects in frames $t$ and $t+1$ of the left (right) video, respectively. The object pair $(O^L, O^R)$ is a stereo object pair if the matching cost between $O^L$ and $O^R$ is small and their motion vectors are similar. Given the object $O^L$, the basic processing of object matching is to search for the corresponding object in the right image along the x-axis, since the images have been rectified to have a horizontal epipolar geometry. Figure 2 depicts the basic concept of image capturing with a stereo camera system. In Figure 2, the foreground object, i.e., the fish, overlaps different backgrounds in the rectified left and right images, which decreases the accuracy of object matching. Thus, we apply the semantic segmentation CNN to obtain the masks of $O^L$ and $O^R$, in which the backgrounds are removed.
The object matching scheme first computes the motion vector between $O_t^L$ and $O_{t+1}^L$ (respectively $O_t^R$ and $O_{t+1}^R$) based on the matching cost of the object pair $(O_t^L, O_{t+1}^L)$ ($(O_t^R, O_{t+1}^R)$) as the input. The matching cost between objects $O_t$ and $O_{t+1}$ is measured by aggregating pixel-wise matching costs over both objects with adaptive support weights [12]. Let $\mathbf{x}_t$ ($\mathbf{x}_{t+1}$) be the center of the object $O_t$ ($O_{t+1}$). The support weight between pixels $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$ is defined as

$$w(\mathbf{x}_t, \mathbf{x}_{t+1}) = \exp\!\left(-\frac{\Delta c_{\mathbf{x}_t,\mathbf{x}_{t+1}}}{\gamma_c} - \frac{\Delta g_{\mathbf{x}_t,\mathbf{x}_{t+1}}}{\gamma_g}\right), \tag{1}$$

where $\Delta c_{\mathbf{x}_t,\mathbf{x}_{t+1}}$ and $\Delta g_{\mathbf{x}_t,\mathbf{x}_{t+1}}$ represent the color difference and the spatial distance between pixels $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$, respectively; $\gamma_c$ is the variance of the color difference; $\gamma_g$ is determined according to the size variance of all the objects. The value of $w(\mathbf{x}_t, \mathbf{x}_{t+1})$ measures the strength of the pixel correspondence $(\mathbf{x}_t, \mathbf{x}_{t+1})$. Notice that the motion vector of the pixel $\mathbf{x}_t$ can be computed as $\mathbf{u}_{\mathbf{x}_t} = \mathbf{x}_{t+1} - \mathbf{x}_t$. Assuming every pixel in $O_t$ has a similar motion vector, we can compute the matching cost of the object pair $(O_t, O_{t+1})$ by combining the pixel-wise support weights in both objects:

$$C(O_t, O_{t+1}) = \frac{\sum_{(\mathbf{p}_t, \mathbf{p}_{t+1})} w(\mathbf{p}_t, \mathbf{x}_t)\, w(\mathbf{p}_{t+1}, \mathbf{x}_{t+1})\, e(\mathbf{p}_t, \mathbf{p}_{t+1})}{\sum_{(\mathbf{p}_t, \mathbf{p}_{t+1})} w(\mathbf{p}_t, \mathbf{x}_t)\, w(\mathbf{p}_{t+1}, \mathbf{x}_{t+1})}, \tag{2}$$

where each pixel pair $(\mathbf{p}_t, \mathbf{p}_{t+1})$ is constrained to have the motion vector $\mathbf{u}_{\mathbf{x}_t}$, and $e(\mathbf{p}_t, \mathbf{p}_{t+1})$ is the pixel-wise matching cost. Using (2), for each object $O_t$ in frame $t$, we can define the matched object $O_{t+1}^{*}$ with center $\mathbf{x}_{t+1}^{*}$ in frame $t+1$ as

$$O_{t+1}^{*} = \arg\min_{O_{t+1} \in N_{t+1}(\mathbf{x}_t)} C(O_t, O_{t+1}), \tag{3}$$

where $N_{t+1}(\mathbf{x}_t)$ is the set of all possible objects within the search window centered at $\mathbf{x}_t$ in frame $t+1$.
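The support-weighted cost aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the $\gamma$ values are illustrative defaults, and a plain absolute intensity difference stands in for the pixel-wise matching cost, with the two patches assumed to be already aligned by the candidate motion vector.

```python
import numpy as np

def support_weight(color_diff, spatial_dist, gamma_c, gamma_s):
    """Adaptive support weight between two pixels (Yoon-Kweon style)."""
    return np.exp(-color_diff / gamma_c - spatial_dist / gamma_s)

def matching_cost(patch_a, patch_b, center_a, center_b,
                  gamma_c=10.0, gamma_s=15.0):
    """Support-weighted aggregation of absolute intensity differences.

    patch_a / patch_b: HxW grayscale crops of the two objects, already
    aligned by the candidate motion vector; center_a / center_b are the
    (row, col) object centers.
    """
    h, w = patch_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # per-pixel weight of each patch with respect to its object center
    wa = support_weight(np.abs(patch_a - patch_a[center_a]),
                        np.hypot(ys - center_a[0], xs - center_a[1]),
                        gamma_c, gamma_s)
    wb = support_weight(np.abs(patch_b - patch_b[center_b]),
                        np.hypot(ys - center_b[0], xs - center_b[1]),
                        gamma_c, gamma_s)
    e = np.abs(patch_a - patch_b)  # pixel-wise matching cost
    return np.sum(wa * wb * e) / np.sum(wa * wb)
```

A perfectly matched pair yields cost 0, and a constant intensity offset between the patches yields exactly that offset as the aggregated cost, which makes the normalization easy to sanity-check.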
Once the motion vectors $\mathbf{u}^L$ ($\mathbf{u}^R$) of the objects $O^L$ ($O^R$) are determined, the matching cost defined in (2) for stereo object searching can be refined as

$$C_s(O^L, O^R) = C(O^L, O^R) + \lambda\, \|\mathbf{u}^L - \mathbf{u}^R\|_2, \tag{4}$$

where $\mathbf{x}^L$ and $\mathbf{x}^R$ are the center pixels of $O^L$ and $O^R$, respectively; $\|\mathbf{u}^L - \mathbf{u}^R\|_2$ is the $L_2$ distance between the motion vectors $\mathbf{u}^L$ and $\mathbf{u}^R$; and $\lambda > 0$ is the Lagrange multiplier. Using (4), for each object $O^L$ in the left image, we can define the best-matched object $O^{R*}$ with center $(x^{R*}, y^L)$ in the right image as

$$O^{R*} = \arg\min_{O^R \in S} C_s(O^L, O^R), \tag{5}$$

where $S$ is the set of all possible objects in the right image for matching the object $O^L$. Notice that the disparity of the object $O^L$ with center pixel $\mathbf{x}^L = (x^L, y^L)$ can be computed as $d_{O^L} = x^L - x^{R*}$. Obviously, the pixel-wise motion vectors (optical flow) within an object are similar but not identical, because individual parts of the object may move differently. Similarly, the pixel-wise disparities of an object differ from that of the center pixel, since the depth of a real-world 3D object is not the same everywhere. Hence, given a detected object pair $(O^L, O^R)$, (5) yields only a single basic disparity for every pixel in $O^L$; for each pixel $\mathbf{x}$ in $O^L$, a residual disparity $\Delta d_{\mathbf{x}}$ should be estimated to model the disparity of $\mathbf{x}$ as

$$d_{\mathbf{x}} = d_{O^L} + \Delta d_{\mathbf{x}}, \tag{6}$$

where $d_{O^L}$ is the disparity value of the center pixel $\mathbf{x}^L$ defined by (5).
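A schematic version of this refined search, combining the object matching cost with the motion-vector penalty, might look as follows. The candidate dictionaries, their fields, and the value of the multiplier are illustrative assumptions; the costs and motion vectors are presumed to come from the object matching stage.

```python
import numpy as np

def stereo_match(left_obj, right_candidates, lam=0.5):
    """Pick the right-image object minimizing matching cost plus
    lam times the L2 distance between motion vectors.

    left_obj: dict with 'u' (motion vector) and 'cx' (center x).
    right_candidates: list of dicts, each with 'cost' (matching cost
    against the left object), 'u', and 'cx'.
    Returns the best candidate and the object-level disparity, i.e.
    the horizontal offset between the matched centers.
    """
    best = min(right_candidates,
               key=lambda o: o['cost']
               + lam * np.linalg.norm(left_obj['u'] - o['u']))
    return best, left_obj['cx'] - best['cx']
```

Note how the motion term can override a slightly lower raw matching cost: a candidate whose motion vector disagrees with the left object's is penalized, which is exactly the role of the Lagrange multiplier in the refined cost.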

B. PIXEL-WISE RESIDUAL DISPARITY ESTIMATION WITH VIDEO INTERPOLATION CNN
Suppose we translate the centers of the matched objects to a common origin $(0, 0)$. The left and right objects are then aligned with each other and form a new object pair $(O'^L, O'^R)$, which can be used to estimate the residual disparity $\Delta d_{\mathbf{x}}$ of each pixel $\mathbf{x} \in O^L$. Figure 3 shows the proposed stereo matching algorithm based on the video interpolation CNN (VICNN) [9], which synthesizes the middle object $O'_{1/2}$ with the object pair $(O'^L, O'^R)$ as the input. For each pixel $\mathbf{x}$ in $O'_{1/2}$, the VICNN computes a pixel-wise kernel pair $(K^L(\mathbf{x}), K^R(\mathbf{x}))$ to interpolate the pixel value of $\mathbf{x} = (x, y)$ using the following equation:

$$O'_{1/2}(\mathbf{x}) = \langle K^L(\mathbf{x}), P^L(\mathbf{x})\rangle + \langle K^R(\mathbf{x}), P^R(\mathbf{x})\rangle, \tag{7}$$

where $\langle\cdot,\cdot\rangle$ is the inner product operator, and $P^L(\mathbf{x}) \subset O'^L$ and $P^R(\mathbf{x}) \subset O'^R$ are patches with the common center $\mathbf{x}$. The kernel pair can also be used to compute the disparity difference of $\mathbf{x}$, derived from the horizontal offset between the mass centers of the two kernels:

$$\Delta d_{\mathbf{x}} = c_x\!\left(K^R(\mathbf{x})\right) - c_x\!\left(K^L(\mathbf{x})\right), \tag{8}$$

where $c_x(\cdot)$ denotes the horizontal mass center of a kernel. Although CNN-based video interpolation can generate accurate interpolated images for both uniform regions and edges, it cannot ensure the correctness of displacement vectors for pixels in uniform regions. Therefore, instead of proposing a new architecture for the VICNN, we modified the loss function used by the VICNN by adding additional terms, and further retrained the VICNN on our own set of underwater training videos to precisely generate the displacement image $D^{L \to R}$ for the underwater image pair $(O'^L, O'^R)$.
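The kernel-based interpolation, and one plausible way of reading a displacement out of the kernel pair, can be sketched as follows. This is our interpretation rather than the paper's exact formula (which is not recoverable from the text): the displacement is taken as the offset between the mass centers of the two 1D kernels.

```python
import numpy as np

def interpolate_pixel(patch_l, patch_r, k_l, k_r):
    """Middle-view pixel value as the sum of two inner products."""
    return np.sum(k_l * patch_l) + np.sum(k_r * patch_r)

def displacement_from_kernels(k_l, k_r):
    """Horizontal displacement estimate from a pair of 1D kernels:
    the gap between their mass centers along the x-axis."""
    n = k_l.shape[-1]
    xs = np.arange(n) - (n - 1) / 2.0  # coordinates centered on the patch
    cx_l = np.sum(k_l * xs) / max(np.sum(k_l), 1e-12)
    cx_r = np.sum(k_r * xs) / max(np.sum(k_r), 1e-12)
    return cx_r - cx_l
```

For example, a kernel pair that samples one pixel left of center in the left patch and one pixel right of center in the right patch yields a displacement of two pixels, matching the intuition that the kernels implicitly encode the motion between the two views.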
We revised the training procedure of the original VICNN by integrating the total variation of the detected displacement vectors to ensure that the estimated displacement field is smooth. The authors of VICNN [9] used two input receptive patches $R_{i,1}$ and $R_{i,2}$ centered at $(x, y)$, together with the corresponding input patches $P_{i,1}$ and $P_{i,2}$, which are smaller than the receptive-field patches and share the same centers. Let $\tilde{C}_i$ be the ground-truth color and $\tilde{G}_i$ the ground-truth gradient at $(x, y)$. The interpolated color at a pixel is obtained by convolving the estimated kernel with the two input patches,

$$\hat{C}_i = [P_{i,1} \; P_{i,2}] * K_i, \tag{9}$$

where the subscript $i$ denotes the $i$th training example and $K_i$ is the output of the neural network's convolutional kernel. Initially, the loss function measures the difference between the interpolated pixel color and its corresponding ground truth, defined as

$$E_c = \sum_i \left\| \hat{C}_i - \tilde{C}_i \right\|_1. \tag{10}$$

Using only the color loss, even with the $\ell_1$ norm, which is the sum of the absolute values of the distances in the original space and preserves the edges of the image [13], still leads to blurry results. Integrating gradients into the loss function corrects this shortcoming of the color loss: the gradients of the input patches are first computed and then convolved with the estimated kernel, which generates the gradient of the interpolated image at the pixel of interest. Based on the eight immediate neighboring pixels, eight versions of the gradient are computed using finite differences and all are added to the gradient loss function, defined as

$$E_g = \sum_i \sum_{k=1}^{8} \left\| [G_{i,1}^{k} \; G_{i,2}^{k}] * K_i - \tilde{G}_i^{k} \right\|_1, \tag{11}$$

where $k$ ranges over the eight ways to compute the gradient, $G_{i,1}^{k}$ and $G_{i,2}^{k}$ are the gradients of the input patches $P_{i,1}$ and $P_{i,2}$, and $\tilde{G}_i^{k}$ is the corresponding ground-truth gradient. The final loss of VICNN combines the color and gradient losses as

$$E = E_c + \alpha E_g, \tag{12}$$

where $\alpha$ ($= 1$) determines the smoothness of the output. To improve the displacement vector estimation quality of the original VICNN loss (12), we add a TV-L1 [14] factor as an additional term; total variation is a popular approach that removes impulse noise while preserving image edges [13], and here it regularizes the optical flow detection.
We integrate the sum of the absolute gradients of the displacement vectors, i.e., $|\nabla D^{L \to R}| + |\nabla D^{R \to L}|$, as the total variation term to smooth the detected displacement fields. The final loss function is

$$E' = E + \left|\nabla D^{L \to R}\right|_1 + \left|\nabla D^{R \to L}\right|_1. \tag{13}$$

The gradient loss provides shape information and the TV-L1 term temporal information, which together increase the overall robustness and reliability of the displacement vector estimation.
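The total-variation term amounts to summing absolute forward differences of each displacement field along both image axes. A minimal sketch, with a placeholder weight `beta` for the TV term (the text adds the term unweighted, so `beta=1.0` reproduces that choice):

```python
import numpy as np

def tv_term(disp):
    """Anisotropic total variation of a displacement field:
    sum of absolute forward differences along both image axes."""
    dx = np.abs(np.diff(disp, axis=1)).sum()
    dy = np.abs(np.diff(disp, axis=0)).sum()
    return dx + dy

def final_loss(color_loss, grad_loss, tv_lr, tv_rl, alpha=1.0, beta=1.0):
    """Combined loss sketch: color + alpha * gradient + beta * TV of
    the left-to-right and right-to-left displacement fields."""
    return color_loss + alpha * grad_loss + beta * (tv_lr + tv_rl)
```

A constant displacement field contributes zero total variation, so the term penalizes only spatial fluctuations of the estimated flow, which is exactly the smoothing behavior described above.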
In training the neural network, ground-truth images are needed to learn the network parameters. Following the concept of VICNN, we use three consecutive video frames from both the left ($I_t^L, I_{t+1}^L, I_{t+2}^L$) and right ($I_t^R, I_{t+1}^R, I_{t+2}^R$) videos, taking the second frame ($I_{t+1}$) as the ground truth, which helps determine the final parameters of the VICNN used to generate the displacement vectors via (8). Figure 4 shows an example of estimating the final disparity image of the target based on the object-based stereo matching algorithm.

C. 3D OBJECT RECONSTRUCTION FOR FISH METRIC ESTIMATION
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited, and content may change prior to final publication.

Once the disparity value $d_i$ of pixel $i$ in $O^L$ has been computed, the depth value of the pixel and its 3D coordinates can be computed as

$$Z_i = \frac{f B}{d_i}, \tag{14}$$

$$X_i = \frac{(x_i - c_x)\, Z_i}{f}, \qquad Y_i = \frac{(y_i - c_y)\, Z_i}{f}, \tag{15}$$

where $f$ is the focal length, $B$ is the baseline between the left and right lenses, and $(c_x, c_y)$ is the principal point. For each point $\mathbf{p} \in O^L$, we project $\mathbf{p}$ onto the three eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3$ (the columns of the matrix $V$) to get the new 3D coordinate points:

$$\mathbf{p}' = V^{\mathsf{T}} \mathbf{p}. \tag{16}$$

The converted coordinate points are used to estimate the body length $L$, body height $H$, and body width $W$ of the fish based on the fish posture recognition result. As shown in Figure 5, the training fish objects are labeled either 'side view' or 'front view' to train our semantic segmentation CNN to segment fish objects with correct posture labels from the input image in the testing phase. Based on the posture label and the coordinate points converted using (16), the fish metrics are estimated from the extents along the principal axes:

$$L = \max_i p'_{i,1} - \min_i p'_{i,1}, \quad H = \max_i p'_{i,2} - \min_i p'_{i,2}, \quad W = \max_i p'_{i,3} - \min_i p'_{i,3}. \tag{17}$$

In this work, objects identified as the 'side view' class are considered reliable for fish metric estimation, while 'front view' objects are skipped to avoid adding extra noise to the fish metric measurements.
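The back-projection and PCA-based metric extraction described above can be sketched as follows. The camera parameters in the example are illustrative (the baseline matches the paper's 11.4 cm, but the focal length and principal point are assumed), and reading length, height, and width as the point-cloud extents along the eigenvectors is our interpretation of the metric formula.

```python
import numpy as np

def disparity_to_point(x, y, d, f, B, cx, cy):
    """Pinhole back-projection: Z = f*B/d, X and Y by similar triangles.
    All lengths come out in the same unit as the baseline B."""
    Z = f * B / d
    X = (x - cx) * Z / f
    Y = (y - cy) * Z / f
    return np.array([X, Y, Z])

def fish_metrics(points):
    """Project an Nx3 object point cloud onto its PCA eigenvectors and
    read body length/height/width as the extents along the three axes
    (side-view objects only). Points are centered for the PCA step;
    the extents are unaffected by the centering."""
    centered = points - points.mean(axis=0)
    # eigenvectors of the 3x3 covariance matrix, sorted by eigenvalue
    vals, vecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(vals)[::-1]
    proj = centered @ vecs[:, order]
    extents = proj.max(axis=0) - proj.min(axis=0)
    return {'length': extents[0], 'height': extents[1], 'width': extents[2]}
```

With `f = 1000` pixels and the paper's `B = 11.4` cm, a 57-pixel disparity back-projects to a depth of 200 cm, the maximum effective range reported later; an axis-aligned box of points recovers its side lengths as length, height, and width.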
Another factor that affects the estimation accuracy of underwater fish metrics is the baseline $B$, the distance between the left and right camera lenses. The larger the value of $B$, the more accurate the fish metrics will be. However, increasing the baseline requires a large stereo camera, which is difficult to deploy in offshore cages. Thus, we used a small baseline ($B = 11.4$ cm) to set up the camera, though this limits the measurement of fish far away from the camera.
With this camera setup, we performed experiments to determine the effective measurement range. First, we used a fake fish with a body length of 30 cm placed at distances of 70 cm and 90 cm from the lens. To test further, we also used shooting distances of 120 cm, 150 cm, 180 cm, and 200 cm, as shown in Figure 6. After calibrating the captured images, we used the method proposed in this paper to estimate the body length and compared the estimates against the actual body length to obtain the error at each distance. Based on Table 1, the highest error rate (9.6%) occurs at a distance of 200 cm, which we consider the maximum effective range from the camera lens. Fish more than 200 cm away from the camera lens are discarded in the body length estimation due to the higher error rate.

III. EXPERIMENTAL SET-UP AND DATASETS
We used two GoPro cameras to build the stereo camera setup, as shown in Figure 3(a). The computer used for training the neural networks for object detection, image matching, and 3D reconstruction has an Intel Core i7-8700 3.2 GHz CPU, 32.0 GB RAM, and an NVIDIA GeForce GTX 1080 Ti GPU running a Python environment. The experimental environments were Pool A13 of the University Aquatic Center at National Taiwan Ocean University in Keelung City, stocked with 150 porphyry seabream, and the Pingtung Hengchun aquaculture site with approximately 40,000 golden pomfret. Two environments were considered to ensure that our approach works for both sparsely and densely stocked aquaculture tanks or cages. The videos collected from these locations, captured as left and right images by the stereo camera, were used to train and test the neural networks. Figure 7(a) shows the labeled images for training our semantic segmentation CNN and the video interpolation CNN. In total, 2 hours, 25 minutes, and 3 seconds of video from the A13 Aquatic Center pool and 1 hour, 29 minutes, and 41 seconds from the Hengchun open-sea cage were used for the experiment.

IV. EXPERIMENTAL RESULTS
The left and right underwater images taken by the stereo camera are passed through the Mask-RCNN neural network, where the fish and the background are segmented through instance segmentation. Two fish species are involved: the pool contains porphyry seabream, and the Hengchun ocean site has golden pomfret. To train Mask-RCNN for instance segmentation, we manually labeled the images, using 200 images for training and 500 images for testing from the Aquatic Center, achieving an accuracy rate of 90%. The segmentation results are shown in Figure 8(a). Meanwhile, from the data collected at the Hengchun aquaculture site, 500 images were used for training and 800 images for testing, with an accuracy rate of 85%; the segmentation results are shown in Figure 8(b). The Mask-RCNN training and testing loss curves are shown in Figure 9; the loss value approaches zero at the 200th iteration. Segmenting underwater target objects from their background is challenging given the fish's continuous, active movement, varied textures, water quality, and luminosity. These problems must be addressed to achieve the robust and accurate fish detection required for fish length and density estimation.

FIGURE 9. Mask-RCNN (a) training loss and (b) test loss for the A13 pool at the Aquatic Center, and (c) training loss and (d) test loss for the Hengchun offshore fish cage.
The detection accuracy results of Mask-RCNN for the two data collection sites are shown in Table 2. Accuracy differs between the two environments because the quality of the collected video varies in water quality and turbidity, which affects the identification of the target objects. Video collected from the pool environment, with its smaller fish population, has a higher accuracy rate of 95%, compared to 90% for the open fish cage with its dense fish population. After rectification, the proposed object matching scheme tracks objects across video frames and searches for the matched right object for each left object, whose initial disparity value is then determined by the displacement vector between the centers of the matched objects. The 3D object pairing results using the left and right images are shown in Figure 10. The matching accuracy for images collected from the A13 pool of the Aquatic Center is 90%, while that for the Hengchun open-sea cage is 80%. The lower result at Hengchun is due to the highly dense fish population, which makes matching difficult because the fish objects overlap heavily. The initial disparity of a detected 3D object is obtained by subtracting the objects' center points, but with this method every pixel of the target object shares the same disparity value. This effectively reduces the measurement to 2D and causes a considerable loss of information about the original three-dimensional object. To deal with this, we integrate the VICNN [9] to fine-tune the disparity and recover the three-dimensional information of the original object. We compare the initial and fine-tuned disparities in Figures 11 and 12, which show slight changes from the original objects after the disparity fine-tuning.
Figure 13 compares the disparity of stereo image matching using our approach with semi-global block matching (SGBM) [15], which integrates pixel-wise matching based on mutual information with an approximation of a global smoothness constraint; SGBM detects occlusions and computes disparities with subpixel accuracy. Even with poor image quality due to water turbidity, our video-interpolated optical flow yields better disparity results than SGBM, because the structure and appearance of the fish are more visible and apparent with our method. Based on the fake fish experiment results in Table 1, the maximum effective stereo camera distance is 200 cm, and fish beyond that distance yield significant errors. Although the stereo camera system is an affordable device for obtaining image depth and estimating object size, it is limited to a certain distance; this filtering step addresses that limitation and ensures the best results. We also considered the continuous movement of the fish, whose body and tail swing while swimming, making the estimated body length differ in each frame. To deal with this, we used a tracking method to estimate the fish body length in each frame and computed the average of the body lengths accumulated over the frames. The obtained average value is used as the final estimated body length of the fish.
To verify the error relative to the actual body length of the fish, we tracked a single fish and measured its body length across all frames. We first manually measured the exact body length of the fish, placed it separately in a water tank, and recorded a video with the stereo camera. We then tracked this single fish in the captured left and right images and estimated its body length in every frame. Since the estimated body length is not consistent across frames due to the fish's continuous movement, we take the average of the per-frame body lengths as the final body length.
The body length estimation of the single fish over 20 frames is shown in Figure 14; the average body length is 20.895 cm, with an average error of 2.38%. The maximum error occurs in Frame 2, at 5.52%. The average body length is computed as $\bar{l} = \frac{1}{N}\sum_{n=1}^{N} l_n$ and the average error as $\bar{e} = \frac{1}{N}\sum_{n=1}^{N} e_n$, where $l_n$ is the estimated body length and $e_n$ the error value for frame $n$. The length error for each frame $n$ is computed as

$$e_n = \frac{|l_n - l_{\text{actual}}|}{l_{\text{actual}}} \times 100\%.$$

Figure 15 illustrates our proposed effective-range filtering, executed after the 3D estimation. Figure 15(a) shows the instance segmentation result from the video collected at Cage-1; (b) is the resulting disparity image; and the blue dots in (c) represent the 12 segmented fish images from (a). The depth value of a fish is the depth of the 3D point cloud at the point where the fish is closest to the lens. The effective range is therefore filtered based on this depth value, and only fish within the range of 50 cm to 200 cm are considered for body length estimation. Table 3 shows the manually measured body lengths of the fish, the estimation results, and the corresponding estimation errors for 2D estimation, 3D estimation, and estimation within the most effective camera distance (50 cm to 200 cm), where fish outside the effective range were discarded due to the high error rate established in the earlier experiment. For the 2D measurement, we used the flat left and right images and combined them into a single source image. The fish is first segmented, and the straight line through the segmented fish body is used as the fish length: the length is measured pixel-wise from the nose to the tail fork along the longest axis of the segmented fish image. The 2D measurement is problematic because the body of a free-swimming fish is not straight, which makes the length measurement inaccurate.
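The per-frame averaging and error formula above amount to a few lines of code; the numbers below are illustrative rather than taken from Figure 14.

```python
import numpy as np

def final_length(per_frame_lengths):
    """Final body length: mean of the per-frame estimates, which
    suppresses the posture noise of a freely swimming fish."""
    return float(np.mean(per_frame_lengths))

def length_error_pct(estimated, actual):
    """Per-frame percentage length error."""
    return abs(estimated - actual) / actual * 100.0
```

For instance, per-frame estimates of 20, 21, and 22 cm average to 21 cm, and a 21 cm estimate of a 20 cm fish corresponds to a 5% frame error.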
For the 3D measurement, we used the depth and 3D coordinate positions, since the fish body is often curved or in a different posture, and we measured the length of the fish accounting for its distance from the camera. The results show that the error of the estimated body length after 3D reconstruction is significantly reduced compared with the 2D estimation. Integrating the effective camera distance for the porphyry seabream, located in a smaller pond, contributes little additional error reduction, since the results are almost the same as the plain 3D estimation; in contrast, the golden pomfret shows a significant error reduction. The effective camera distance matters more for large fish cages, as shown by the golden pomfret results in an open cage with clearer water and a larger cage size. When the water is clear, fish remain visible even far from the stereo camera lens; thus, discarding fish outside the effective camera distance significantly reduces the error for large cages.

V. CONCLUSIONS AND FUTURE WORKS
First, the deep neural network establishes pixel correspondences between stereo images in real time. Second, traditional deep learning models require large human-annotated labeled datasets for training; our proposed approach is trained directly from raw video data, which reduces the training complexity of the deep neural networks for 3D model reconstruction. Third, establishing precise pixel correspondences from textureless stereo images using traditional stereo matching algorithms is very challenging; the interpolated signals of the matched object pairs ensure the correctness of the computed residual disparity image, with precise correspondences and minimal object matching error. Lastly, the object-based stereo matching optimization algorithm contributes to the disparity image warping that improves the smoothness of the result. The stereo camera can capture fish information from the underwater environment for fish body length estimation and fish weight conversion. This information can help establish the fish growth curve and, combined with data on the feed amount, the feed conversion rate can be derived. Thus, farmers can optimize fish feeding, reduce production costs, and increase farm profit. The noncontact method eases labor costs and prevents fish deaths and illnesses, and it provides aquaculture farmers with the current condition of their fish products at any time. As the next step of our research, we plan to combine the stereo camera with a sonar system for a more accurate and precise measurement of fish body length and weight. The sonar can provide a depth reference value for the 3D images, and identifying fish species in the sonar data will further improve the estimation accuracy when the two sensors are combined.