Girth Measurement Based on Multi-View Stereo Images for Garment Design

In this paper, we propose a novel girth measurement system based on multi-view stereo images for garment design. The system is installed at a fixed location and captures three pairs of stereo images of the subject with six calibrated and synchronously triggered cameras. One important feature of the system is an optimized semantic segmentation network that efficiently segments the girth region in the six captured views. Another is the use of color subspace classification and coordinate clustering, which effectively constrain stereo matching to the scope of the markers. Stereo matching is then performed only on the corresponding clusters, so that the stereo matching point pairs of the markers are extracted correctly. The space coordinates of the 3D point corresponding to each stereo matching point pair are calculated in the coordinate system of the respective stereo camera pair. The coordinates of these 3D markers are then transformed from the three different coordinate systems into one unified coordinate system. Girth is measured by fitting a curve to these markers and calculating the length of the fitted curve. The proposed system performs passive and intelligent girth measurement for garment design, and overcomes the problem of too many invalid stereo matching point pairs in girth measurement. Experimental results demonstrate its accuracy: the maximum bust measurement error is 1.28 cm for women and 1.31 cm for men, and the maximum waist measurement error is 1.18 cm for women and 0.99 cm for men, which are within the error limits specified by the China national standards GB/T 2664-2017, 2665-2017 and 2666-2017 and the textile industry standard FZ/T 81004-2012.


I. INTRODUCTION
With the rapid progress of information technology, garment design is developing towards automation and digitalization [1]-[3]. Nowadays, simple ready-to-wear design can no longer satisfy people's pursuit of individuality, and garment design is bound to be customized [4]-[8]. Clothing fit is the primary factor affecting the quality of customization. Therefore, precise anthropometric measurements of different individuals must be made first. Traditional anthropometry measures the human body manually with a tape, which offers strong controllability and flexibility [9]. However, the manual measurement process is complex and time-consuming, and its measurement error depends on the experience of different tailors. Modern anthropometry measures the human body automatically with a device. It can be divided into two main categories according to whether the device emits light: active and passive.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhan-Li Sun.
The most common active devices are the 3D laser scanner [10]-[12] and the 3D structured light scanner [13]-[15]. The 3D laser scanner emits lasers towards the subject, receives the lasers bounced off the subject's surface, and performs precise anthropometry according to the positions, time intervals and optical axis angles of the received lasers [11]. The 3D structured light scanner projects structured light onto the subject, records the interference fringes formed on the subject's surface, and performs accurate anthropometry according to the recorded distortion [14]. However, these high-end devices are expensive and bulky [16]. What's more, laser and structured light scanning often make the subject uncomfortable, because of worries about eye damage during data acquisition [17]. The more recently developed active devices are Kinect [18], [19] and RGB-D cameras [20], [21], which focus on low cost and portability. These devices emit infrared light towards the subject, obtain the depth map, and generate a 3D model of the subject [20]. However, the resolution of the depth map is low, image registration is difficult due to the lack of geometric features, and the anthropometry is rough [22]. For both high-end and low-cost active devices, the subject is required to stand still without any movement for at least a few seconds to complete the scan. Otherwise, the scanned image will be deformed, which affects the precision of anthropometry.
The passive device can effectively solve this problem [22]-[29]. It uses common cameras to capture images of the subject in one shot. With technological advancement in vision systems, cameras have become cheaper and the captured images have higher resolution, with rich geometry and texture details. These images are used for body reconstruction and anthropometry. References [23] and [24] are single-view anthropometric systems. They are scale-based: they estimate the anthropometric length from the observed known value of a reference, which may be a fixed line segment or a grid. These single-view systems are simple and cost effective. However, they can only measure length, not girth. Moreover, the measurement points must be selected manually, so the process lacks automation. References [25] and [26] are dual-view anthropometric systems. They capture the front and side views of the subject, calculate the width and thickness of the measured girth based on the scale, and obtain the girth data by linear regression. These dual-view systems are also simple and cost effective, and they can measure both length and girth. But it is difficult to completely recover complicated curved surfaces such as the bust and hip, and the measurement error of girth is large. References [27], [28] and [22] are multi-view anthropometric systems. References [27] and [28] are both single-camera multi-view systems. They capture multi-view images of the rotating subject with a single camera. Reference [27] reconstructs a 3D human body model from the captured multi-view images, and selects the exact cross section of the model to measure the girth. Reference [28] calculates the length of the same girth in different views based on the scale, and estimates the girth data by model fitting. Nevertheless, the multi-view images are not captured synchronously, and the subject's movement may decrease the accuracy. Reference [22] is a multi-camera multi-view system. It comprises 60 cameras and a carefully designed synchronization trigger, and captures 30 pairs of multi-view stereo images in a single shot. Camera calibration is performed for each stereo camera pair, and stereo matching is performed for each stereo image pair, resulting in 30 dense and uniformly distributed point clouds. Multi-view registration and surface meshing are carried out to reconstruct the 3D human body model from the 30 point clouds, which completes the anthropometry. This system can achieve high anthropometric accuracy. However, it is expensive and bulky, with complex matching and fusion for 60-view images. Hundreds of thousands of feature points have to be identified and matched, and complex algorithms must be used to remove as many mismatches as possible, which is a difficult and time-consuming task and not suitable for garment design. Furthermore, in order to improve the matching accuracy and reduce the reconstruction error, the image should be preprocessed to remove the background.

In this paper, we propose a novel girth measurement system based on multi-view stereo images for garment design. Fig. 1 shows the overall view of our proposed system. The proposed system integrates semantic segmentation with stereo matching for simple, accurate and intelligent girth measurement. Three pairs of stereo images are captured by six synchronously triggered cameras (R1, L1; R2, L2; R3, L3) from the front (0°), side (90°) and back (180°) of the subject. An optimized Pyramid Scene Parsing Network (PSPNet) is used to segment the specific regions of the measured girths from each pair of captured stereo images. Stereo matching is carried out only within the segmented regions of the specific girths in each stereo image pair, with cluster constraints, to extract the markers. The actual spatial coordinates of the markers in the space of each stereo camera pair are calculated in accordance with the respective calibrated parameters. A Euclidean spatial coordinate transformation is performed to transform the coordinates of the three groups of markers from three different spaces into one space. The spatial coordinates of these markers are finally fitted to achieve girth measurement. Our proposed system is small in volume and simple in structure. It can effectively eliminate measurement errors caused by the subject's movement. It can also simplify the stereo matching process and improve the matching accuracy. Since bust and waist are the girths most significantly affected by the respiratory movement of the subject, these two girths are selected for experiment and verification in our system.

The rest of the paper is divided into four parts. In Section II, we discuss the related works. In Section III, we present our proposed system, which includes system configuration, semantic segmentation, stereo matching and spatial coordinates calculation, coordinates transformation and girth fitting. In Section IV, we report experimental procedures and results. In Section V, we draw a conclusion.

II. RELATED WORKS
Precise matching in the stereo image pair is a prerequisite for the girth measurement in our system. Survey on the image matching methods can be found in reference [30]. The common matching methods can be broadly divided into three categories: grayscale correlation based method, feature based method and transform domain based method. The grayscale correlation based methods calculate the correlation between the template and the image to be matched to search the best matching position. The possible correlations include Mean Absolute Difference (MAD) [31], Sum of Absolute Differences (SAD) [32], Sum of Squared Differences (SSD) [33], etc. The larger the template, the higher the computation cost. The feature based methods extract the feature descriptors of two images and match the features according to the similarity of descriptor. Potential image features include points [34], edges [35], surfaces [36], etc. The most commonly used point matching algorithms include Scale Invariant Feature Transform (SIFT) [37], Speeded-Up Robust Features (SURF) [38], etc. They have good translation and rotation invariance and anti noise performance. Usually, the mismatch removal is conducted to achieve accurate feature matching results [39]- [41]. The transform domain based methods transform the rotation in time domain into translation in frequency domain by Fourier Transform [42], Walsh Transform [43], Wavelet Transform [44], etc. They have fast algorithms with easy implementation, as long as all points in the two images are shifted by the same direction and amount. In our system, the stereo images to be matched are of the same size, so the method based on grayscale correlation is not suitable. There is not only translation but also slight rotation between the two images. Hence, the method based on transform domain is also inappropriate. In order to complete the girth measurement, color markers in the two images need to be well matched. 
Therefore, the feature based method SURF is first selected to match the captured stereo image pair. Fig. 2 shows the matching result of the complex background images with SURF. Fig. 3 shows the matching result of the simple background images of the same subject with SURF. It can be observed that in both cases, there are a large number of invalid matching points in the matching results, which are useless for girth measurement. Table 1 shows the statistical values of the matching results in Fig. 2 and Fig. 3. Invalid and valid but mismatching point pairs account for up to 72.1%. As shown in Fig. 4, invalid matching point pairs (a) and valid but mismatching point pairs (b) are useless for girth measurement. Only valid and matching point pairs (c) are required. Hence, as many invalid and valid but mismatching point pairs as possible must be removed [45], [46]. This is a difficult task considering the large proportion of these pairs. Therefore, before matching, it is necessary to properly segment the regions where the girths are located (in our system, bust and waist) to reduce the matching areas and background interference.
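The grayscale correlation measures named earlier in this section (MAD, SAD and SSD) can be illustrated with a minimal sketch. It only shows how the three scores are computed for a template and an equally sized image patch; the function names are hypothetical, and the paper does not use these measures in its own pipeline.

```python
import numpy as np

def sad(template, patch):
    """Sum of Absolute Differences between two equally sized grayscale arrays."""
    t = np.asarray(template, dtype=float)  # float avoids uint8 wrap-around
    p = np.asarray(patch, dtype=float)
    return float(np.abs(t - p).sum())

def ssd(template, patch):
    """Sum of Squared Differences between two equally sized grayscale arrays."""
    t = np.asarray(template, dtype=float)
    p = np.asarray(patch, dtype=float)
    return float(((t - p) ** 2).sum())

def mad(template, patch):
    """Mean Absolute Difference: SAD divided by the number of pixels."""
    return sad(template, patch) / np.asarray(template).size
```

In a template search, one of these scores is evaluated at every candidate position and the position with the lowest score is kept, which is why the cost grows with the template size.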
Traditional image segmentation methods include threshold based segmentation, region based segmentation and graph theory based segmentation. The threshold based segmentation methods [47]- [49] divide the grayscale histogram of an image into several classes by several thresholds. The pixels in the same class belong to the same object. They are simple, fast and efficient, but sensitive to noise, and not suitable for images with complex background. The region based segmentation method [50] connects pixels with similarity, thus forming the final segmented region. It is suitable for images with complex background, but is complicated and slow. The graph theory based segmentation methods [51]- [53] map an image  into a weighted graph, and the complex image segmentation problem is simplified into an optimization problem by the optimal partition theory of graphs. They are fast, but only suitable for binary segmentation, and require manual intervention. All these traditional image segmentation methods do not meet the segmentation requirements in our system: fast and intelligent, in complex background, and for multiple segmentation results.
Modern image segmentation methods have developed with the progress of Convolutional Neural Networks (CNN) [54], [55]. The Fully Convolutional Network (FCN) was proposed in 2015 [56]. Since then, many semantic segmentation methods based on deep learning have emerged, including Deconvolution Network (DeconvNet) [57], Deep Parsing Network (DPN) [58], RefineNet [59], Conditional Random Fields as Recurrent Neural Networks (CRFasRNN) [60], Piecewise [61], DeepLab [62], PSPNet [63], etc. FCN takes a CNN as the basic framework and replaces the fully connected layers with convolutional layers, but it does not make full use of context information, and its segmentation precision is low. DeconvNet improves FCN by introducing a deep deconvolution network, but its segmentation is not as good as that of FCN in scenes with strong illumination contrast. DPN makes use of group convolution to reduce computation complexity, but it ignores the finer details of the image. RefineNet improves the decoder structure and fuses low-level and high-level semantic features by up-sampling. However, the network capacity is large and the training time is long. CRFasRNN combines CRF and RNN into an end-to-end network to improve the segmentation accuracy of FCN; nevertheless, it makes little use of context information. Piecewise combines CNN and CRF to effectively improve performance, but the model takes up much memory and the training time is long. DeepLab uses atrous (dilated) convolution instead of up-sampling, but fails to capture fine object boundaries. PSPNet fuses features of different scales, so as to learn the features of the subject more effectively, increase the multi-resolution receptive field, and further improve the segmentation accuracy. Table 2 shows the Mean Intersection over Union (MIoU) of the various methods tested on the PASCAL VOC 2012 data set [63]. The MIoU of PSPNet is 82.6%, which is the highest among all the methods. Therefore, we choose PSPNet, based on the spatial pyramid structure, to segment the regions where the girths are located.
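For reference, MIoU can be computed from a class confusion matrix as in the following minimal sketch. The function name and the convention that rows are ground-truth classes and columns are predicted classes are assumptions made for this illustration; the paper does not describe its evaluation code.

```python
import numpy as np

def mean_iou(confusion):
    """Mean Intersection over Union from a square confusion matrix.

    confusion[i, j] = number of pixels of true class i predicted as class j
    (an assumed convention for this sketch).
    """
    c = np.asarray(confusion, dtype=float)
    intersection = np.diag(c)                          # correctly labelled pixels per class
    union = c.sum(axis=0) + c.sum(axis=1) - intersection
    valid = union > 0                                  # skip classes absent from both maps
    return float(np.mean(intersection[valid] / union[valid]))
```

For a three-class girth segmentation (background, bust, waist), the matrix would be 3 × 3 and the score is the average of the three per-class IoU values.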

III. PROPOSED SYSTEM
In the proposed girth measurement system, we use six POINT GREY GS3-U3-28S4C-C industrial cameras to build the multi-view stereovision system, as shown in Fig. 5(a). Each camera has a Sony ICX687 CCD chip with a 1/1.8 inch size, a maximum resolution of 1928 × 1448, 2.8M effective pixels, a 128 MB onboard buffer, and a 2 MB data flash memory. The multi-view stereovision system also includes a PC with an Intel Xeon E5-2620 CPU, 32 GB of RAM and an Nvidia GeForce GTX 1080 discrete graphics card with 8 GB of memory. The PC controls the cameras and communicates with them via a USB 3.0 interface. Zhengyou Zhang's camera calibration method [64] is used to calibrate our multi-view stereovision system. The calibration board, shown in Fig. 5(b), has a 15 mm × 15 mm size. As shown in Fig. 5(c), the subject wears tights during measurement. On the surface of the tights at the measurement position, there are circular markers in a repeated order of yellow, orange, purple and blue, with 1.5 cm spacing.

VOLUME 8, 2020

The whole procedure of the proposed girth measurement system is shown in Fig. 6. The system inputs are three pairs of left and right view images (0°, 90°, 180°) synchronously captured by the calibrated multi-view stereovision cameras. The system output is the girth measurement data. The system consists of three main parts: semantic segmentation of the girth region, stereo matching and coordinates calculation of markers, and unified coordinates transformation and girth fitting. In the semantic segmentation part, an optimized PSPNet is trained. The trained network is then used to segment the bust and waist regions from each captured stereo image pair. In the stereo matching and coordinates calculation part, stereo matching is performed only within the bust and waist regions of each stereo image pair, with cluster constraints, to extract the markers. The actual spatial coordinates of the markers in the space of each stereo camera pair are calculated. In the coordinates transformation and girth fitting part, the coordinates of the markers are transformed into a unified space. The unified spatial coordinates of these markers are finally fitted to achieve girth measurement.

A. SEMANTIC SEGMENTATION OF THE GIRTH REGION
We first perform semantic segmentation of the girth region. As described in the introduction, we choose bust and waist as the girths to be measured. We analyze 2700 anthropometric images captured by our system, with image resolution 1928 × 1448, to further verify the necessity of segmentation. The statistical distribution bar graph in Fig. 7(a) indicates that the bust regions are all in the resolution range between 400 × 400 pixels and 850 × 850 pixels. Specifically, the number of bust region sizes below 400 × 400 pixels accounts for 0%, the number of bust region sizes between 400 × 400 pixels and 600 × 600 pixels accounts for 33.3%, the number of bust region sizes between 600 × 600 pixels and 850 × 850 pixels accounts for 66.7%, and the number of bust region sizes above 850 × 850 pixels accounts for 0%. The statistical distribution bar graph in Fig. 7(b) indicates that the waist regions are all in the resolution range between 400 × 400 pixels and 800 × 800 pixels. Specifically, the number of waist region sizes below 400 × 400 pixels accounts for 0%, the number of waist region sizes between 400 × 400 pixels and 600 × 600 pixels accounts for 44.4%, the number of waist region  sizes between 600×600 pixels and 800×800 pixels accounts for 55.6%, and the number of waist region sizes above 800 × 800 pixels accounts for 0%. In conclusion, the bust region only accounts for less than 25.9% of the total image size, and the waist region only accounts for less than 22.9% of the total image size. Therefore, semantic segmentation before stereo matching can effectively reduce the matching region and improve the matching accuracy. As described in the related works, we select PSPNet as the network structure to segment the girth region. Fig. 8 shows the existing PSPNet network structure [63]. As an important part of the PSPNet, ResNet is used to extract the feature map of the input image. 
ResNet consists of conv1, conv2_x, conv3_x, conv4_x and conv5_x, with various layer depths [65]. The more layers, the deeper the network and the more adequate the feature extraction, but the more complex the model. Hence, it is necessary to choose an appropriate network depth for our girth measurement system, which not only ensures accurate feature extraction, but also keeps the model as simple as possible. As shown in Table 3, the Pixel Accuracy (PA), Mean Pixel Accuracy (MPA) and MIoU of ResNet18, ResNet34, ResNet50 and ResNet101 increase with the number of network layers. For ResNet101, MPA and MIoU show significant increases (1.7% and 0.42%, respectively) compared to the other networks (0.24% and 0.22% on average, respectively).
Hence, we select ResNet101 for feature extraction of PSPNet. Although the girth region has no obvious difference with other parts of a human body in color, shape and texture, it has significant difference in spatial and proportional relation. With adequate training, PSPNet can extract the features associated with the spatial and proportional relation of the girth region, and use the features to correctly segment the girth region from the human body.
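The area fractions quoted earlier in this subsection (less than 25.9% for the bust region and less than 22.9% for the waist region) follow directly from the largest region sizes and the 1928 × 1448 image resolution. A minimal check, with a hypothetical helper name:

```python
def region_fraction(side_px, width=1928, height=1448):
    """Fraction of the full image covered by a square region of side_px pixels."""
    return (side_px * side_px) / (width * height)

bust_max = region_fraction(850)    # largest bust region, 850 x 850 pixels
waist_max = region_fraction(800)   # largest waist region, 800 x 800 pixels
print(f"bust: {100 * bust_max:.1f}%, waist: {100 * waist_max:.1f}%")
```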
We also improve the existing PSPNet network structure by replacing the 7 × 7 convolution kernel of conv1 with three 3 × 3 convolution kernels in series, as shown in Fig. 9. For the sake of distinction, we use PSPNet+ to refer to this improved version of PSPNet. As shown in Fig. 10, the receptive field and output size are the same for the 7 × 7 convolution kernel and the three 3 × 3 convolution kernels in series. The PSPNet+ can increase the network capacity and reduce the number of parameters without significantly increasing the network complexity and can thus enhance the network performance.
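The equivalence of receptive fields, and the effect on the number of weights, can be checked with a short sketch. It assumes stride-1 layers and the same channel width C at the input and output of every layer; both are assumptions made for this illustration and not details given in the text. With a narrow input (for example 3 image channels) and wider hidden layers, the weight comparison can come out differently.

```python
def receptive_field(kernels, strides):
    """Receptive field of a stack of convolutions, seen from the last layer."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= s              # stride multiplies the step between output samples
    return rf

def conv_weights(kernel, c_in, c_out):
    """Number of weights in one 2D convolution layer (bias ignored)."""
    return kernel * kernel * c_in * c_out

C = 64  # assumed channel width, for illustration only
single_7x7 = conv_weights(7, C, C)
three_3x3 = 3 * conv_weights(3, C, C)
print(receptive_field([7], [1]), receptive_field([3, 3, 3], [1, 1, 1]))
print(single_7x7, three_3x3)
```

Under these assumptions the stacked 3 × 3 layers need 27·C² weights against 49·C² for the single 7 × 7 layer, while inserting two extra non-linearities between the layers.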
We select 2500 images containing bust and waist from the Look Into Person (LIP) open source dataset (50462 images with various resolutions) established by Sun Yat-Sen University [66]. In addition, 1500 images (resolution 1928 × 1448) are taken by the POINT GREY GS3-U3-28S4C-C industrial camera in our lab at a distance of 2-4 m. In summary, 4000 images are obtained. With random image clipping and scaling, the dataset size is expanded by a factor of three, that is, 12000 images. Fig. 11 shows some image examples of our bust and waist dataset.
We train the girth semantic segmentation model on a training set of 7200 images with both the PSPNet and the PSPNet+. We set the model training parameter batch_size to 8 and epochs to 50. After training, we obtain a trained [...] that of the PSP-based model. In summary, the PSPNet+ has better performance. Since there is no exact boundary between the girth region and other regions, there may be irregular upper and lower edges in the segmentation results, and the segmented region is not exactly the same as the ground truth, which leads to a relatively low MIoU. However, the measurement results of our system are not affected, because the segmented regions include the exact girths to be measured.

The semantic segmentation results of the anthropometric stereo image pair captured by our girth measurement system with PSPNet+ are shown in Figs. 12(a) and 12(b). The image is divided into three parts: yellow for the waist region, magenta for the bust region, and black for the background. The segmented image pairs of the bust region and the waist region can be obtained with the respective masks, as shown in Figs. 12(c) and 12(d). In this way, the matching areas are intelligently reduced from the whole images to the smaller girth regions.

Nevertheless, the spatially related features are sensitive to image rotation. Fig. 13 shows the semantic segmentation results of bust and waist regions from captured human body images rotated clockwise by 0°, 45°, 90° and 180°, respectively. Only the semantic segmentation result in Fig. 13(a) is correct. Figs. 13(b) and 13(c) have no semantic segmentation results, while the semantic segmentation result of Fig. 13(d) is upside down. Therefore, in order to segment the girth region from the human body correctly, there are constraints on image acquisition. The subject is required to stand upright, keeping feet together and breathing normally, to ensure the correct spatial and proportional relation in the captured images.
Fig. 14 shows the matching results of the bust and waist segmented image pairs with SURF, in which the subject is the same as in Fig. 2 and Fig. 3. Table 5 shows the statistical values of the matching results in Fig. 14, for comparison with Table 1. The total number of matching point pairs is reduced to 21, which is much lower than the 175 for the complex background and the 43 for the simple background in Table 1, so the computation cost of matching is greatly reduced. There are no invalid matching point pairs, so no further effort is required to remove them. However, about a third of the point pairs are still mismatched. If not removed correctly, they will lead to incorrect girth measurement. Even if they are removed correctly, the number of matching point pairs is less than the total number of markers, which decreases the girth measurement accuracy. Hence, it is necessary to correctly match as many markers as possible.

B. STEREO MATCHING AND COORDINATES CALCULATION
In our girth measurement system, there are four colors of circular markers on the measured girth. These markers repeat in a yellow, orange, purple and blue pattern. It can be observed that the four colors are located in separate spatial areas of the HSV color space. Therefore, these separate spatial areas can be used to distinguish the markers of different colors. Table 6 shows the HSV range corresponding to the four colors. According to this HSV range, markers in the segmented image are classified into four different color categories denoted as Y, O, P and B. All pixels in the segmented image constitute a data set $Z = \{z_1, z_2, \ldots, z_i, \ldots, z_n\}$, $i = 1, 2, \ldots, n$, $n = 1928 \times 1448$. The segmented image is converted from RGB color space to HSV color space. Then the pixel $z_i$ has three components, i.e., $z_i = (H_i, S_i, V_i)$. If $V_i$ is less than 46, the three components $H_i$, $S_i$ and $V_i$ of $z_i$ are all set to 0. If $V_i$ is greater than 46, the pixel $z_i$ is classified according to its $H_i$, and a fourth component $C_i$ representing the color category is added to that pixel:

If $11 < H_i < 25$, the pixel $z_i$ belongs to the color category O, $C_i$ = orange.
If $26 < H_i < 34$, the pixel $z_i$ belongs to the color category Y, $C_i$ = yellow.
If $100 < H_i < 124$, the pixel $z_i$ belongs to the color category B, $C_i$ = blue.
If $125 < H_i < 155$, the pixel $z_i$ belongs to the color category P, $C_i$ = purple.

According to statistical analysis of the segmented image, the color pixels account for less than 1% of the total pixels of the whole image. In this way, the color pixels in the segmented image constitute a much smaller data set $Z_C = \{z_{c1}, z_{c2}, \ldots, z_{ci}, \ldots\}$.
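The thresholding rule just described can be sketched as follows. The hue ranges and the brightness threshold come from the text; treating the bounds as inclusive, keeping V = 46 as a marker pixel, and using an OpenCV-style 8-bit HSV scale (H in 0 to 179) are assumptions of this sketch, and the names `HUE_RANGES` and `classify_markers` are hypothetical.

```python
import numpy as np

# Hue ranges taken from the text; inclusive bounds are an assumption.
HUE_RANGES = {
    "orange": (11, 25),
    "yellow": (26, 34),
    "blue": (100, 124),
    "purple": (125, 155),
}
V_MIN = 46  # pixels darker than this are treated as background

def classify_markers(hsv):
    """Split an HSV image of shape (rows, cols, 3) into per-color pixel sets.

    Returns a dict mapping each color name to an array with one row
    (H, S, V, x) per marker pixel, where x is the column index.
    """
    hsv = np.asarray(hsv)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    bright = v >= V_MIN
    subsets = {}
    for name, (lo, hi) in HUE_RANGES.items():
        mask = bright & (h >= lo) & (h <= hi)
        _, cols = np.nonzero(mask)   # row-major order, same as boolean indexing
        subsets[name] = np.column_stack([h[mask], s[mask], v[mask], cols])
    return subsets
```

Each returned array plays the role of one color subspace: the color category is carried by the dictionary key, and only the horizontal coordinate is kept because the later clustering works along the horizontal axis.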
Each element $z_{ci}$ has five components, i.e., $z_{ci} = (H_{ci}, S_{ci}, V_{ci}, C_{ci}, x_{ci})$, where $x_{ci}$ is the horizontal coordinate of the pixel. The dataset $Z_C$ can be further divided into four subdatasets $Z_{Cj}$, $j = 1, 2, 3, 4$. Fig. 16 shows the four subdatasets obtained by classification of a left-view segmented image and a right-view segmented image. Each of the four subdatasets belongs to a different color subspace. In each color subspace, the spacing between adjacent markers of the same color increases by a factor of 4, while the size of the markers remains the same, which guarantees accurate clustering of pixels belonging to different markers. According to statistical analysis, the maximum horizontal distance between pixels in the same marker is no more than 40 pixels, while the minimum horizontal distance between pixels in adjacent markers of the same color is no less than 100 pixels. Hence, all pixels in a subdataset can be further clustered into several clusters, each cluster corresponding to a marker. For subdataset $Z_{Cj}$, each element $z_{cjk} = (H_{jk}, S_{jk}, V_{jk}, C_{jk}, x_{jk})$ has five components. The distance between any two elements is defined as:

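The $\varepsilon$-neighborhood $N_\varepsilon(z_{cjk})$ of an element $z_{cjk}$ is defined as the set of elements in $Z_{Cj}$ whose distance to $z_{cjk}$ is no more than $\varepsilon$. We set $\varepsilon$ to 40 and select the first element $z_{cj1}$ as the initial pixel. The $\varepsilon$-neighborhood of $z_{cj1}$ is denoted as $M_{j1}$, and the number of pixels in $M_{j1}$ is denoted as $N_{j1}$, initialized to $N_{j1} = 1$. If the second element $z_{cj2} \in N_\varepsilon(z_{cj1})$, then $z_{cj2} \in M_{j1}$ and $N_{j1} = N_{j1} + 1$. If $z_{cj2} \notin N_\varepsilon(z_{cj1})$, the $\varepsilon$-neighborhood $N_\varepsilon(z_{cj2})$ is denoted as $M_{j2}$, and the number of pixels in $M_{j2}$ is denoted as $N_{j2}$, initialized to $N_{j2} = 1$. If the third element $z_{cj3} \in N_\varepsilon(z_{cj1})$ or $z_{cj3} \in N_\varepsilon(z_{cj2})$, then $z_{cj3} \in M_{j1}$ and $N_{j1} = N_{j1} + 1$, or $z_{cj3} \in M_{j2}$ and $N_{j2} = N_{j2} + 1$, respectively. If $z_{cj3} \notin N_\varepsilon(z_{cj1}) \cup N_\varepsilon(z_{cj2})$, the $\varepsilon$-neighborhood $N_\varepsilon(z_{cj3})$ is denoted as $M_{j3}$, and the number of pixels in $M_{j3}$ is denoted as $N_{j3}$, initialized to $N_{j3} = 1$. The process continues until all the elements in the subdataset $Z_{Cj}$ have been visited. In this way, we get several clusters $M_{jm}$ with pixel counts $N_{jm}$, $m = 1, 2, \ldots, M$, where $M$ is the number of markers in $Z_{Cj}$. We calculate the average horizontal coordinate $\bar{x}_{jm}$ of the pixels in $M_{jm}$:

$$\bar{x}_{jm} = \frac{1}{N_{jm}} \sum_{z_{cjk} \in M_{jm}} x_{jk}$$

We reorder the clusters $M_{jm}$ so that $\bar{x}_{j1} < \bar{x}_{j2} < \ldots < \bar{x}_{jM}$.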
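The sequential clustering above can be sketched in a few lines. The sketch assumes that the distance between two elements is the absolute difference of their horizontal coordinates, which is consistent with the 40-pixel and 100-pixel statistics quoted in the text but is an assumption here, and the function name is hypothetical.

```python
def cluster_by_x(xs, eps=40):
    """Group horizontal coordinates into marker clusters.

    Each cluster is seeded by the first pixel that falls outside every
    existing neighborhood; later pixels join the first cluster whose seed
    is within eps. Clusters are returned sorted by mean coordinate.
    """
    seeds = []     # x coordinate of the seed pixel of each cluster
    clusters = []  # member coordinates of each cluster
    for x in xs:
        for seed, members in zip(seeds, clusters):
            if abs(x - seed) <= eps:      # assumed distance: |x - x_seed|
                members.append(x)
                break
        else:                             # no neighborhood contains x
            seeds.append(x)
            clusters.append([x])
    return sorted(clusters, key=lambda m: sum(m) / len(m))

markers = cluster_by_x([10, 15, 130, 12, 140, 260, 135])
```

Because markers of the same color are at least 100 pixels apart and each marker spans at most 40 pixels, a single pass with a fixed seed per cluster is enough to separate them in this setting.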
In the above way, the subdatasets of the left view and the right view shown in Fig. 16 are clustered into $M^L$ and $M^R$ clusters, respectively. If $M^L = M^R + 1$, the cluster $M_{jm}^L$ with the smallest $N_{jm}^L$ is removed and the other clusters $M_{jm}^L$ are reordered according to $\bar{x}_{jm}^L$. If $M^L + 1 = M^R$, the cluster $M_{jm}^R$ with the smallest $N_{jm}^R$ is removed and the other clusters $M_{jm}^R$ are reordered according to $\bar{x}_{jm}^R$. Finally, the number of clusters in the subdatasets $Z_{Cj}^L$ and $Z_{Cj}^R$, i.e., the number of markers that can be stereo matched in the left and right views, is the same. Each $M_{jm}^L$ corresponds to exactly one $M_{jm}^R$: $M_{jm}^L \rightarrow M_{jm}^R$. Fig. 17 shows the matching results of the bust and waist segmented image pairs with SURF, taking $M_{jm}^L \rightarrow M_{jm}^R$ as the matching constraint. What's more, only the one pair of matching points closest to the center is kept in each pair of $M_{jm}^L$ and $M_{jm}^R$. Table 7 shows the improved stereo matching results. The x and y coordinates of the matching pixels in the corresponding markers are output. All the markers are matched correctly (100%). However, as shown in Table 8, 2D errors still exist between the 2D matching results and their corresponding 2D ground truths, which will cause 3D errors in the subsequent coordinate calculation.

After stereo matching of the markers, we calculate the space coordinates of the real 3D points corresponding to the stereo matching marker pairs with the calibration parameters of the stereovision camera pairs. Fig. 18 shows the convergent stereovision model [67] used in our system. The internal parameters of the cameras include the focal lengths of the left-view camera and the right-view camera, $f_L$ and $f_R$. The external parameters include a rotation matrix $R$ and a translation vector $T$, which can be obtained by using Zhengyou Zhang's calibration method [64].
With $(x_L, y_L)$, $(x_R, y_R)$, $f_L$, $f_R$, $R$ and $T$, the space coordinates $(x, y, z)$ of the markers can be calculated as [68]:

The stereo calibration method introduces a relative depth error of 1/20000 in the case of a long focal length (>25 mm) and a fixed baseline length [68], which is exactly the case in our system. In our system, the measuring distance is 60 cm, so the depth error is about 600 mm / 20000 = 0.03 mm, which is small enough to be ignored. Table 9 shows the 3D errors of coordinate calculation for the stereo matching results in Table 8. As shown in Tables 8 and 9, a 2D coordinate deviation of one pixel in the matching points produces a 3D coordinate deviation of at least 1 mm in the subsequent coordinate calculation. The larger the 2D coordinate deviation, the greater the 3D error of coordinate calculation.
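As an illustration of this coordinate calculation, the following sketch recovers a 3D point from one matched pair with a generic linear least-squares triangulation, rather than with the closed-form expression of [68]. It assumes a pinhole model with image coordinates measured from the principal point in the same unit as the focal length, and the convention that a point X in the left camera frame maps to R·X + T in the right camera frame; both conventions and the function name are assumptions of this sketch.

```python
import numpy as np

def triangulate(p_left, p_right, f_left, f_right, R, T):
    """Least-squares 3D point (left camera frame) from one stereo match."""
    x_l, y_l = p_left
    x_r, y_r = p_right
    R = np.asarray(R, dtype=float)
    T = np.asarray(T, dtype=float).reshape(3)
    # Left camera:  x_l = f_l * X / Z,  y_l = f_l * Y / Z
    # Right camera: x_r = f_r * (R[0].P + T[0]) / (R[2].P + T[2]), same for y_r
    A = np.vstack([
        [f_left, 0.0, -x_l],
        [0.0, f_left, -y_l],
        f_right * R[0] - x_r * R[2],
        f_right * R[1] - y_r * R[2],
    ])
    b = np.array([
        0.0,
        0.0,
        x_r * T[2] - f_right * T[0],
        y_r * T[2] - f_right * T[1],
    ])
    point, *_ = np.linalg.lstsq(A, b, rcond=None)  # 4 equations, 3 unknowns
    return point
```

With noise-free matches the four equations are consistent and the point is recovered exactly; with a one-pixel matching deviation the least-squares solution shifts, which is the mechanism behind the 3D errors reported in Table 9.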

C. UNIFIED COORDINATES TRANSFORMATION AND GIRTH FITTING
In our system, three pairs of left and right view images (0°, 90°, 180°) are synchronously captured by three stereovision camera pairs. Each pair of cameras has its own coordinate system. After stereo matching and spatial coordinates calculation, three sets of 3D coordinates of markers in three different coordinate systems are obtained, denoted as $S_1$, $S_2$ and $S_3$.
We take the coordinate system of S 2 as the unified coordinate system. We again use Zhengyou Zhang's calibration method [64] to obtain the relative external parameters R 12 and T 12 from the coordinate system of S 1 to the coordinate system of S 2 . The unified coordinate transformation follows [69]. Table 10 shows an exemplary unified coordinate transformation result from s 1 i to s 2 i . As a consequence, we have one set of unified 3D coordinates of markers in the coordinate system of S 2 , denoted as S 2 .
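The transformation from the coordinate system of S 1 into that of S 2 can be illustrated with a standard rigid-body transform. The sketch assumes the common convention s 2 i = R 12 s 1 i + T 12 , which may differ in notation from the formulas of [69]; the function name is hypothetical.

```python
def to_unified(points, R12, T12):
    """Map 3D points from the S1 frame into the S2 (unified) frame.

    Assumed convention: s2 = R12 * s1 + T12, with R12 a 3x3 rotation matrix
    given as nested lists and T12 a length-3 translation vector.
    """
    return [
        tuple(sum(R12[i][k] * p[k] for k in range(3)) + T12[i] for i in range(3))
        for p in points
    ]
```

The same function would apply to S 3 given the relative external parameters of its camera pair with respect to S 2 .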
As can be seen from Table 10, the y coordinates of the markers are almost the same, i.e., the markers lie almost on the same horizontal cross section. Therefore, the markers of the measured girth are projected onto the XOZ plane for fitting. Polynomial curve fitting (PCF) [70], Cubic Bezier curve fitting (CBCF) [71] and Polynomial with Intermediate Variable curve fitting (PIVCF) [72] are each adopted to fit the markers. Fig. 19 shows exemplary curve fitting results for four typical cases: woman's bust, woman's waist, man's bust and man's waist. The blue circles represent the original marker data projected onto the XOZ plane. The green line represents the polyline obtained by directly connecting the original data points. The blue, orange and red lines represent the fitting curves obtained by the PCF, CBCF and PIVCF methods, respectively. For Figs. 19(b), 19(c) and 19(d), the three methods show similar fitting results. However, for Fig. 19(a), the blue line shows obvious under-fitting; that is, the PCF method is not suitable for woman's bust and thus is not suitable for our system. The other two methods show similar fitting results. Table 11 shows the RMSE of the other two methods (CBCF and PIVCF) for 10 samples randomly selected from 140 measurements. The RMSE of the CBCF method is lower than that of the PIVCF method. Therefore, the CBCF method fits the markers better. However, the main purpose of our system is to measure the girth for garment design, while the markers cover only a small part of the girth. We should therefore double check the accuracy of these two methods for girth measurement. The fitting curve is divided into N short segments, and the length of the curve L is approximated by the sum of the lengths L i of these short segments:

$$L \approx \sum_{i=1}^{N} L_i$$
In our experiment, N = 100. For garment design, the human body is considered to be symmetrical. Hence, the circumference of the measured girth should be 2L. Table 12 shows the exemplary girth measurement results of these two methods. Table 13 shows the errors of the results in Table 12. The maximum absolute error of the circumference obtained by CBCF is 1.36cm, while that obtained by PIVCF is smaller, at 1.03cm. The MAD of the errors obtained by CBCF is 1.07cm, while that obtained by PIVCF is smaller, at 0.81cm. In each measurement, the circumference obtained by PIVCF is always closer to the ground truth than that obtained by CBCF. Therefore, the PIVCF method is more accurate for girth measurement than the CBCF method, and we choose the PIVCF method to perform girth fitting in our system. We can not only measure the circumference of the girth, but also draw the contour of the girth, which is very helpful for customized garment design.
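The length approximation and the symmetry assumption above can be sketched as follows. The example samples a parametric curve at N + 1 uniformly spaced parameter values, sums the N chord lengths, and doubles the result. A cubic Bezier curve is used only as a stand-in for the fitted curve; the function names and the uniform parameter spacing are assumptions of this sketch, not details taken from the paper.

```python
import math


def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve with 2D control points at parameter t."""
    s = 1.0 - t
    return tuple(
        s ** 3 * a + 3 * s * s * t * b + 3 * s * t * t * c + t ** 3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def curve_length(curve, n=100):
    """Approximate the length of curve(t), t in [0, 1], by n short chords."""
    pts = [curve(i / n) for i in range(n + 1)]
    return sum(math.dist(pts[i], pts[i + 1]) for i in range(n))


def girth_from_half_curve(curve, n=100):
    """Circumference 2L, assuming the fitted curve covers half of a symmetric girth."""
    return 2.0 * curve_length(curve, n)
```

Any callable that maps a parameter in [0, 1] to a point in the XOZ plane can be passed as `curve`, so the same helper could be applied to a curve fitted by another method.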

A. EXPERIMENT SETUP
In the practical girth measurement test, we adjust the focal length of the six POINT GREY GS3-U3-28S4C-C industrial cameras so that they can capture images of the subject with clear markers at a distance of 60cm. The normal breathing rate for an adult is 12 to 20 breaths per minute, so the cameras shoot at a low frame rate of 10 FPS. We take the size measured manually with a tape as the ground truth of our test, in which the anthropometric size definitions and measurement methods are strictly in accordance with China national standard GB/T 16160-2017 "Anthropometric definitions and methods for garment" [9]. A total of 70 subjects are tested in this experiment, including 33 women and 37 men, aged from 20 to 30 years old, with heights from 150cm to 185cm. Table 14 shows the statistical characteristics of these subjects. To avoid occlusion, the subjects stand with their arms outstretched during the measurement. To avoid the error caused by breathing, six frames captured at the same instant by the six synchronized cameras are selected. Fig. 20 shows three pairs of stereo images captured synchronously from the front (0°), side (90°) and back (180°) of the subject. To avoid random errors, each subject is measured five times manually and five times by our system. The average value of the five measurements is taken as the final measurement result.

Our girth measurement experiment is divided into two groups by gender: man and woman. For simplicity, only some of the measurement results are shown, including those with the maximum absolute errors. Table 15 shows the girth measurement results of 12 subjects selected from a total of 33 female subjects, including 2 subjects with the maximum absolute error of bust and the maximum absolute error of waist. The remaining 10 subjects are selected randomly. Subject No.9 has the maximum absolute error of bust, i.e., 1.28cm. 
It conforms to China national standard GB/T 2665-2017 "Women's suits and coats", in which the tolerance for bust is ±2.0cm [73]. Subject No.4 has the maximum absolute error of waist, i.e., 1.18cm. It conforms to China textile industry standard FZ/T 81004-2012 "Dress and lady suit", in which the tolerance for waist is ±1.5cm [74]. Fig. 21 shows the comparison of the bust and waist measurement results of these 12 subjects between our proposed method and the manual method. The blue line with squares represents the measurement results of our proposed method, while the red line with circles represents the measurement results of the manual method. The two lines are very close and almost overlap. Table 16 shows the statistical analysis of the girth measurement results of all 33 female subjects. It can be seen that the mean value µ and standard deviation σ of the measurement results of our proposed method and the manual method are almost the same, which indicates that the proposed method can replace the manual method. The MAD of the measurement error for bust is 0.98cm, and the corresponding data tolerance is 100% ≤ |±1.5|cm, i.e., all of the errors are within ±1.5cm. The MAD of the measurement error for waist is 0.87cm, and the corresponding data tolerance is 100% ≤ |±1.5|cm.

Table 17 shows the girth measurement results of 13 subjects selected from a total of 37 male subjects, including 3 subjects with the maximum absolute error of bust and the maximum absolute error of waist. The remaining 10 subjects are selected randomly. Subject No.16 has the maximum absolute error of bust, i.e., 1.31cm. It conforms to China national standard GB/T 2664-2017 "Men's suits and coats", in which the tolerance for bust is ±2.0cm [75]. Subjects No.16 and No.21 have the maximum absolute error of waist, i.e., 0.99cm. It conforms to China national standard GB/T 2666-2017 "Trousers", in which the tolerance for waist is ±1.0cm [76]. Fig. 22 shows the comparison of the bust and waist measurement results of these 13 subjects between our proposed method and the manual method. The blue line with squares represents the measurement results of our proposed method, while the red line with circles represents the measurement results of the manual method. The two lines are very close and almost overlap. Table 18 shows the statistical analysis of the girth measurement results of all 37 male subjects. It can be seen that the mean value µ and standard deviation σ of the measurement results of our proposed method and the manual method are almost the same, which indicates that the proposed method can replace the manual method. The MAD of the measurement error for bust is 0.99cm, and the corresponding data tolerance is 100% ≤ |±1.5|cm. The MAD of the measurement error for waist is 0.83cm, and the corresponding data tolerance is 100% ≤ |±1.0|cm.
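The error statistics reported for both groups can be reproduced from paired measurements with a few lines of code. The sketch below assumes that MAD denotes the mean absolute difference between the system and manual measurements, and that the data tolerance is the fraction of subjects whose absolute error does not exceed the stated limit. Both readings are assumptions, and the function and key names are hypothetical.

```python
def error_statistics(system_cm, manual_cm, tolerance_cm):
    """Summarise errors between system and manual girth measurements (in cm)."""
    if not system_cm or len(system_cm) != len(manual_cm):
        raise ValueError("need two non-empty lists of equal length")
    abs_err = [abs(s - m) for s, m in zip(system_cm, manual_cm)]
    return {
        # Assumed definition of MAD: mean absolute difference.
        "mad": sum(abs_err) / len(abs_err),
        "max_abs_error": max(abs_err),
        # Fraction of subjects whose absolute error is within the tolerance.
        "within_tolerance": sum(e <= tolerance_cm for e in abs_err) / len(abs_err),
    }
```

A `within_tolerance` value of 1.0 corresponds to a reported data tolerance of 100% for the chosen limit.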

C. GIRTH MEASUREMENT EXPERIMENT FOR MAN
Overall, the maximum measurement error of bust is 1.28cm for woman and 1.31cm for man, which are within the ±2.0cm bust tolerance for woman and the ±2.0cm bust tolerance for man regulated by the national standards. The maximum measurement error of waist is 1.18cm for woman and 0.99cm for man, which are also within the ±1.5cm waist tolerance for woman and the ±1.0cm waist tolerance for man regulated by the textile industry standard and the national standard, respectively. In summary, the error mainly comes from four steps: semantic segmentation, stereo matching, coordinate calculation and girth fitting. As mentioned above, the contributions of semantic segmentation and coordinate calculation to the error are relatively small and can be ignored. The contribution of girth fitting to the error is almost constant. The contribution of stereo matching, however, decreases as the matching accuracy increases. Stereo matching is the main contributor to the final girth measurement error and should be further improved in the future.
We compare the girth measurement error with five other cost-effective and portable anthropometric methods, namely, those of [19], [25], [77], [78] and [79], as shown in Table 19. The bust MAD of our proposed system is 0.99cm for man and 0.98cm for woman, which is less than the bust MAD of [77], [78] and [79] with 1.97cm, 1.45cm and 1.60cm, respectively. The waist MAD of our proposed system is 0.83cm for man and 0.87cm for woman, which is less than the waist MAD of [19], [77], [78] and [79] with 2.57cm, 2.03cm, 1.47cm and 2.50cm, respectively. The bust data tolerance of our proposed system is 100% ≤ |±1.5|cm for woman, which is better than that of [25] with 86% ≤ |±2.0|cm for woman. The waist data tolerance of our proposed system is 100% ≤ |±1.5|cm for woman, which is better than that of [25] with 98% ≤ |±1.5|cm for woman. In summary, our system not only measures the girth simply and intelligently with low cost and portability, but also achieves better measurement accuracy than the other methods.

V. CONCLUSION
In this study, we solved the problem of intelligently measuring girth on the basis of images captured by a multi-view stereovision system. We presented a system composed of girth region semantic segmentation, marker stereo matching and coordinate calculation, unified coordinate transformation and girth fitting. We integrated girth semantic segmentation within the PSPNet+ network structure for accurate and intelligent girth semantic segmentation. We classified the segmented images into different color subspaces and clustered the pixels in each color subspace into several clusters corresponding to markers. We performed stereo matching only on the corresponding clusters to obtain the matching marker pairs. We calculated the space coordinates of the real 3D points corresponding to the stereo matching marker pairs with the calibration parameters of the stereovision camera pairs. We transformed the space coordinates of the markers into one unified coordinate system. We performed curve fitting on the markers in unified coordinates and calculated the length of the fitting curve. In this way, the girth was measured and the contour of the girth was depicted. The girth measurement performance of our proposed system was verified by bust and waist measurement experiments for woman and man. The results show that our system is efficient and reliable in the practical application of girth measurement. In our measurements, the measured girths have a maximum bust absolute error of 1.28cm for woman and 1.31cm for man, which are within the ±2.0cm error limit of China national standard GB/T 2665-2017 and the ±2.0cm error limit of GB/T 2664-2017, respectively. The measured girths also have a maximum waist absolute error of 1.18cm for woman and 0.99cm for man, which are within the ±1.5cm error limit of China textile industry standard FZ/T 81004-2012 and the ±1.0cm error limit of China national standard GB/T 2666-2017, respectively. In particular, our system is passive, portable and low in cost, and is suitable for quick and accurate girth measurement.