A Robust End-to-End Speckle Stereo Matching Network for Industrial Scenes

Deep learning-based stereo matching is inherently limited in industrial applications by weak texture and inconsistent reflectance, making it difficult to accurately recover complex surface details. To achieve accurate measurements, this paper presents an end-to-end speckle stereo matching network that incorporates fringe, Gray code, and speckle projection patterns. The model is trained on a high-precision dataset consisting of thousands of image pairs generated through binocular Gray code-assisted phase shifting. After establishing local correspondences between the left and right images using speckle patterns, the images are fed into the network. The proposed network consists of two siamese 2D feature extraction networks: one dedicated to cost volume computation, the other to weight refinement feature extraction. The former incorporates a lightweight module for extracting high-dimensional fusion features, which are obtained from different dilation scales and randomly concatenated along the channel dimension. Patch convolution is utilized to adapt to pixel features at various levels, reducing redundancy within the cost volume and improving the network's capacity to learn from ill-posed regions. Experimental results demonstrate that the proposed network achieves an improvement of approximately 10.7% in matching accuracy over state-of-the-art networks on public datasets. Furthermore, the method delivers outstanding matching results in diverse industrial scenarios. The reconstruction error for the radius of optical standard spheres is below 0.06 mm, which meets the demands of the majority of industrial applications.


I. INTRODUCTION
Optical 3D measurement is extensively employed in various fields, including material science [1] and biometrics [2], due to its advantages of high speed, high accuracy, and non-contact nature. Based on different illumination and imaging approaches, it can be categorized into passive and active measurement. Active measurement techniques offer higher reconstruction accuracy in textureless regions, making them particularly suitable for industrial inspection [3].

Obtaining comprehensive three-dimensional information for key components of trains is crucial for the holistic health monitoring of high-speed and subway rail systems. Commonly employed measurement techniques include static measurement methods, light sectioning, and multi-line laser scanning. Nonetheless, these methods exhibit low measurement efficiency or fail to capture complete three-dimensional dimensions in a single attempt. Active 3D measurement, involving the use of an active light source for illumination, employs distinct modulation strategies based on the object's surface features to interpret the structured light field and subsequently derive the object's three-dimensional information. This approach finds applications in tasks such as wheel size measurement, bolt loss detection, automated guided vehicle navigation, and robotic arm-assisted positioning. Nevertheless, the diverse dimensions of the objects subjected to holistic detection for ensuring the secure operation of trains pose a challenge for traditional structured light methodologies in obtaining a comprehensive and dependable point cloud. Furthermore, attaining high-precision point clouds necessitates the aggregation of multiple frames, thereby constraining their applicability in dynamic measurement contexts.

The associate editor coordinating the review of this manuscript and approving it for publication was Pinjia Zhang.
Fringe and speckle patterns are two commonly used projection modes in structured light-based measurement methods. Speckle Profile Projection (SPP) utilizes speckle patterns projected with a spatial encoding strategy to provide local uniqueness for pixel labeling. However, ensuring the uniqueness of spatial pixels solely through projecting a single speckle pattern is challenging [4], [5]. To address the matching difficulties in SPP, local matching is often utilized, where the differences between pixels in each region are regressed for correction [6]. Alternatively, global matching estimates the disparity of all pixels directly by constructing an energy function that incorporates global information [7], [8]. However, these methods often trade off matching accuracy for reliable matches through disparity smoothing, making it challenging to obtain precise 3D information from a single-frame speckle projection.
In recent years, scholars have proposed several deep learning-based stereo matching methods that exhibit higher accuracy and robustness than traditional algorithms [9]. Kendall employed a 3D-CNN to construct a 4D cost volume and utilized a soft attention mechanism, known as soft-argmin, to enable sub-pixel disparity regression [37]. To enhance the accuracy of feature extraction, Chang proposed a method that converts local matching into a global stereo matching approach by utilizing patch convolution with different sizes [10] together with a commonly used spatial pyramid pooling structure [40]; by expanding the receptive field, it simultaneously removes the constraint of a fixed input image size. Numerous existing stereo matching methods predominantly focus on optimizing performance for specific datasets, resulting in limited generalization to other datasets. This limitation arises from the susceptibility of these methods to domain shift, making it challenging to extend their performance to unexplored domains [11], [12]. For example, taking into account the occurrence of both positive and negative disparity in real-world scenarios, a semi-dense disparity map can be computed using binocular views, and the remaining regions can then be completed using monocular views [13]. By utilizing a pyramid-based warping cost volume, the fusion of multi-scale composite costs enables the extraction of domain-invariant features [14], and generalization from the synthetic domain to the real domain can be accomplished through drone imagery and LIDAR point cloud reconstruction. Recent research has demonstrated that by guiding and filtering the cost volume, it is possible to suppress redundant information, thereby simultaneously reducing the burden of cost aggregation and enhancing prediction accuracy. For instance, feature correlation can be effectively enhanced by employing image-guided
weights [15]. Xu proposed a method that utilizes edge-preserving filtering with slice operations to effectively enhance the resolution of the cost volume [36]. In order to capitalize on the strengths of both the group-wise correlation volume [28] and the concatenated cost volume, Guo proposed a cost volume filtering technique that directly concatenates feature maps from different levels to compute the cost volume [43]. By utilizing stereo matching methods that employ edge-preserving filtering techniques [16], [17], the contour information of the target object in the predicted results can be effectively enhanced, thus improving the preservation of its shape and edges; during the training process, accurate Ground truth disparity can be sparsely sampled by incorporating edge and saliency information [18]. However, this approach may lead to inaccurate surface depth information. In addition to leveraging RGB images, utilizing non-visible spectral information has proven to be highly effective. One approach involves the use of an infrared projector and a single camera to construct a monocular infrared structured light system, which serves as guidance information [19]. However, this method has limitations in dynamic detection since it cannot operate in a single-frame imaging manner. Nonetheless, three-dimensional reconstruction methods based on deep learning are often constrained by the uniqueness of the data: when the detected scene or target changes, the reconstruction accuracy of the model tends to degrade.
This paper proposes a single-frame stereo matching method designed for high-precision 3D measurements. Fringe patterns, Gray code patterns, and speckle patterns are acquired simultaneously for industrial scenes, and a rich, high-precision dataset of 6480 scene pairs was constructed using a combination of binocular Gray code and phase shifting techniques; the network takes speckle-pattern images as input [20]. In contrast to other frequently employed architectures that rely on pyramid pooling structures for feature extraction, the network first constructs a lightweight cascaded encoder-decoder module to extract high-dimensional fused features; the features acquired from various dilation scales are randomly concatenated along the channel dimension, resulting in enhanced matching capability for speckle points. Patch convolution is utilized to adapt to pixel features at various levels, enhancing the network's ability to refine the cost volume and suppress redundant information; this further improves the feature matching capability of the network, and the precision of the edge regions in the disparity map is strengthened by incorporating an additional edge loss function. Through a series of experiments, this method has been proven to achieve high precision and robustness in sub-pixel 3D reconstruction. The remainder of this paper is organized as follows. Section II presents the proposed method, primarily encompassing the principles of the manufacture of training data and the design of the stereo matching network. In Section III, the passive measurement capability of the model is verified using public datasets, the performance of the proposed model in active measurement is evaluated using our speckle industrial dataset and optical standard spheres, and the feasibility of the overall approach for industrial applications is discussed. Lastly, Section IV provides a summary of this paper.

II. METHOD
A. METHOD OF DATASET CONSTRUCTION
The standard phase shifting profilometry (PSP) technique utilizes a set of sinusoidal fringes [21] that undergo equal phase shifts within one cycle and are projected onto the target scene, as shown in Fig. 2(a).
The intensity distribution of the fringe patterns captured by the cameras is as follows:

I_n(x, y) = A(x, y) + B(x, y) cos(ϕ(x, y) − 2πn/N) (1)

where A(x, y) represents the background light intensity, B(x, y) represents the modulation degree, n represents the phase shift index with n = 0, 1, 2, . . ., N − 1, and ϕ(x, y) represents the corresponding phase, which can be obtained [22] by the following:

ϕ(x, y) = arctan[ Σ_{n=0}^{N−1} I_n(x, y) sin(2πn/N) / Σ_{n=0}^{N−1} I_n(x, y) cos(2πn/N) ] (2)

In (2), the wrapped phase is calculated using the arctangent function. In this case, the phase ϕ(x, y) is truncated within the range of (−π, π), and it is necessary to unwrap the phase to restore it to a continuous phase. Gray code utilizes a projection mode with black and white fringes [23], offering the advantages of high speed and error-free transmission, as shown in Fig. 2(b). In this paper, it is utilized for phase unwrapping in the phase shifting method [24]. By projecting N sets of Gray code images, 2^N periods of fringe patterns are marked, and each period can be uniquely identified through its binary intensity sequence.
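As a concrete illustration, the wrapped-phase recovery of (1)-(2) can be written in a few lines of NumPy. This is a minimal sketch, not the authors' code; it assumes the N frames are equally shifted by 2π/N as in (1):

```python
import numpy as np

def wrapped_phase(frames):
    """N-step phase shifting: recover the wrapped phase from N fringe images.

    frames: array of shape (N, H, W); frame n is modelled as
        I_n = A + B * cos(phi - 2*pi*n/N)
    Returns phi wrapped to (-pi, pi].
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    shifts = 2.0 * np.pi * np.arange(n) / n
    # numerator and denominator of the arctangent in Eq. (2)
    num = np.tensordot(np.sin(shifts), frames, axes=(0, 0))
    den = np.tensordot(np.cos(shifts), frames, axes=(0, 0))
    return np.arctan2(num, den)
```

Using arctan2 rather than a plain arctangent keeps the correct quadrant over the full (−π, π] range.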
Before decoding the Gray code, it is essential to perform binary thresholding on the Gray code images. The threshold is determined based on the fringe pattern images:

T(x, y) = (1/m) Σ_{i=0}^{m−1} I_i(x, y) (3)

where m represents the number of captured fringe patterns and I_i denotes the grayscale value of a pixel during the projection of the sine fringe patterns. The absolute phase Φ(x, y) of the left and right images can be represented by (4), where k represents the decoded Gray code level:

Φ(x, y) = ϕ(x, y) + 2πk(x, y) (4)
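A sketch of the thresholding in (3) and the Gray-code decoding that yields the period index k used in (4) is given below. This is an illustrative NumPy reimplementation; the most-significant-bit-first plane order and the 0/255 image convention are assumptions:

```python
import numpy as np

def gray_to_binary(bits):
    """Convert Gray-code bit planes (N, H, W) to plain binary bit planes."""
    binary = np.empty_like(bits)
    binary[0] = bits[0]
    for i in range(1, bits.shape[0]):
        binary[i] = np.bitwise_xor(binary[i - 1], bits[i])
    return binary

def decode_gray(code_imgs, fringe_imgs):
    """Per-pixel threshold from the mean of the fringe images (Eq. (3),
    which approximates A(x, y)), then binarize the Gray-code images and
    decode the period index k."""
    thresh = fringe_imgs.mean(axis=0)
    bits = (code_imgs > thresh).astype(np.uint8)
    binary = gray_to_binary(bits)
    weights = 2 ** np.arange(binary.shape[0] - 1, -1, -1)
    return np.tensordot(weights, binary, axes=(0, 0))  # period index k

def unwrap(phi, k):
    """Absolute phase, Eq. (4): Phi = phi + 2*pi*k."""
    return phi + 2.0 * np.pi * k
```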
Epipolar rectification is then performed on the absolute phase, and the disparity of the absolute phase between the left and right images is computed based on the principles of binocular imaging [25]:

D(x, y) = argmin_{d ∈ [0, disp]} | I_L(x, y) − I_R(x − d, y) | (5)

where D represents the disparity with respect to the left image as the reference, I_L and I_R denote the absolute phase pixel values corresponding to the left and right images, and disp represents the maximum estimated disparity in the current scene.
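One way to realize the phase-based disparity search of (5), with a simple linear sub-pixel refinement added on top (the refinement step is our assumption, not stated in the text), is:

```python
import numpy as np

def phase_disparity(phase_l, phase_r, max_disp):
    """Per-pixel disparity from rectified absolute phase maps: for each left
    pixel, search d in [0, max_disp] minimizing |Phi_L(x, y) - Phi_R(x - d, y)|,
    then refine to sub-pixel by linear interpolation on the (locally monotone)
    phase profile of the right image."""
    h, w = phase_l.shape
    disp = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            lo = max(0, x - max_disp)
            costs = np.abs(phase_r[y, lo:x + 1] - phase_l[y, x])
            xr = lo + int(np.argmin(costs))  # best right-image column
            if 0 < xr < w - 1:
                p0, p1 = phase_r[y, xr], phase_r[y, xr + 1]
                if p1 != p0:
                    frac = (phase_l[y, x] - p0) / (p1 - p0)
                    xr = xr + float(np.clip(frac, -0.5, 0.5))
            disp[y, x] = x - xr
    return disp
```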
Obtaining three-dimensional information about a scene through passive single-frame stereo vision still faces several challenges. In industrial scenarios, many objects under inspection exhibit discontinuous surfaces, weak textures, or low reflectance. To enhance target features and improve matching accuracy in such cases, active illumination through controlled lighting is required. Laser speckle [26], [27], with its operational simplicity and cost-effectiveness, finds widespread application in active three-dimensional imaging in industrial settings. However, when measuring surfaces with abrupt changes, it is challenging to obtain dense disparity due to surface discontinuity. Therefore, this study employs DLP-projected speckle images [34], which, in comparison to laser speckle, possess superior local randomness, global uniqueness, and higher matching accuracy. Initially, a black image of size 1280 × 720 is created and divided into several regions. Within each region, a random selection of pixels is made as seed points. Region growing is performed on these seed points, taking into account the continuity of pixel space, until a complete speckle image is generated. To ensure local randomness, it is required that the number of pixels with values of 255 or 0 in the image regions occupies approximately 42-45% of various sliding windows. Fig. 3 presents the speckle pattern utilized in this paper.
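The region-growing generation described above might be sketched as follows. This is a simplified illustration: the cell size, growth probability, and number of growth rounds are assumptions, and the 42-45% white-pixel ratio is only approximated by these choices rather than enforced by rejection testing as a production generator would:

```python
import numpy as np

def speckle_pattern(h=720, w=1280, cell=8, grow=3, seed=0):
    """Seed one random pixel per cell, then grow each seed into a small
    4-connected blob so neighbouring white pixels stay spatially continuous."""
    rng = np.random.default_rng(seed)
    img = np.zeros((h, w), dtype=np.uint8)
    for cy in range(0, h, cell):
        for cx in range(0, w, cell):
            sy = cy + int(rng.integers(0, min(cell, h - cy)))
            sx = cx + int(rng.integers(0, min(cell, w - cx)))
            img[sy, sx] = 255
            frontier = [(sy, sx)]
            for _ in range(grow):  # region growing, 4-connected
                nxt = []
                for y, x in frontier:
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny, nx] == 0 and rng.random() < 0.6):
                            img[ny, nx] = 255
                            nxt.append((ny, nx))
                frontier = nxt
    # a production generator would also reject patterns whose local white
    # ratio falls outside the 42-45% window
    return img
```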
Deep learning-based stereo matching networks exhibit severe domain shift issues with such datasets [41], indicating poor generalization to data from different scenes. The underlying cause is that conventional datasets are tailored to specific fixed scenes, so the learned weights depend on target colors and shapes. When the ambient lighting changes or the model encounters previously unlearned target objects, it fails to correctly match corresponding points. To address the poor generalization caused by domain-specific biases in the dataset, we substitute locally unique markers based on speckle patterns for conventional RGB information as inputs to the network model, which resolves cross-domain generalization problems for different detection objects.
This paper adopts 7 Gray code images and 5 sine fringe images for binocular Gray code phase shifting imaging. Partial examples of the dataset are shown in Fig. 4.

B. NETWORK ARCHITECTURE
In this paper, we propose an effective end-to-end speckle matching network, primarily designed to address the challenge of accurate 3D measurement in complex industrial scenes.We propose two pairs of siamese feature extraction networks: the cost feature extraction network and the cost weight extraction network.Firstly, in the cost feature extraction network, we incorporate a high-dimensional feature fusion module, which generates high-dimensional fused features during the process of down-sampling feature extraction from input images.This module takes as input the fused features at different dilation scales.After regression, the weight (Multip Weight) is generated to adjust the Group-wise [28] cost volume, resulting in the final cost volume.
As illustrated in Fig. 5, the left and right speckle images are fed into two pairs of siamese networks. They enter the pyramid pooling module following the red arrows, and then the cost volume is calculated. Afterwards, a decoding-encoding and regression operation is conducted, resulting in a Weight Correction Map of size 1 × H/4 × W/4. The obtained map is used to correct the Gwc-volume obtained through the blue arrow path. Subsequently, the volume is resized to (disp_max − disp_min) × H × W through cost aggregation. Finally, the disparity regression and reprojection processes are applied to generate the point cloud model.
In the feature extraction, we incorporate a lightweight feature fusion module in the cost feature extraction section. This module utilizes depthwise separable convolutions, which decompose the conventional convolution operation into a depthwise convolution and a pointwise convolution, preserving the output dimensions while reducing computational complexity. Each Neck module is designed as follows: it consists of two convolutional layers with kernel size 1 × 1, with a 3 × 3 convolutional layer inserted in between. The purpose of these layers is twofold: one is to modify the number of channels, and the other is to downsample the tensor. By representing the feature weights of the same pixel from different receptive fields, the network enhances the utilization of pixel neighborhood information. Fig. 7 illustrates the differences between the proposed method and pyramid pooling feature extraction at the same level. The Color Bar indicates the magnitude of positive differences, with darker colors indicating larger differences. We observe that the proposed method shows a stronger feature representation in areas with limited texture.
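The parameter saving from making the middle 3 × 3 stage depthwise separable can be checked with simple counting. This is a sketch; the exact channel widths of the Neck block are not given in the text, so the figures below are illustrative:

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) followed by a
    1 x 1 pointwise conv that mixes channels."""
    return c_in * k * k + c_in * c_out

def neck_params(c_in, c_mid, c_out):
    """Parameter count of a 1x1 -> 3x3 -> 1x1 Neck block with the 3x3 stage
    made depthwise separable (hypothetical channel widths)."""
    return (conv_params(c_in, c_mid, 1)
            + depthwise_separable_params(c_mid, c_mid, 3)
            + conv_params(c_mid, c_out, 1))
```

For 64-channel features, the separable 3 × 3 stage needs 4,672 parameters against 36,864 for a standard 3 × 3 convolution, roughly an 8x reduction.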
Ultimately, the feature maps generated by this module have a final output size of 320 × H/4 × W/4. In the cost weight extraction network, we continue to utilize a pyramid-like structure [30] for weight extraction.
For the cost volume, this paper follows the approach of the Group-wise correlation volume [28]:

C_gwc(g, d, x, y) = (N_P / N_C) ⟨ f_l^g(x, y), f_r^g(x − d, y) ⟩ (6)

where N_C represents the number of channels in the 2D features, which are divided into N_P groups along the channel dimension, f_l^g and f_r^g denote the g-th feature group of the left and right images, and ⟨·, ·⟩ denotes the inner product. This cost volume calculation method provides rich similarity features for 3D cost aggregation while reducing the parameter requirements. After obtaining the refinement weights (Multip Weight), the cost volume is further adjusted. The adjustment in the k-th channel of the cost volume follows:

C_final^k = Mult(C_gwc^k, w_Multip) (7)

where C_final represents the cost volume used for cost aggregation, Mult represents element-wise multiplication of matrices, and w_Multip represents the correction weight.

Cost aggregation aims to accurately reflect the correlation between pixels while aggregating feature information. Similar to previous 3D convolutional stereo matching networks, this paper utilizes stacked hourglass modules for cost aggregation. Each module consists of two three-dimensional convolutional layers with Batch normalization and ReLU activation, as well as two hourglass modules. During training, weighted losses are computed using the cost weight extraction network and the outputs of the two decoder-encoders, and backpropagated to supervise the network. During testing, only the output of the second hourglass is used, and the disparity map is obtained by upsampling along the disparity dimension.

The regression of predicted disparity values follows the soft-argmin mechanism using soft attention:

d̂ = Σ_{l=D_min}^{D_max} l × g_l (8)

where l represents the disparity level, g_l represents the probability at that level (obtained by applying softmax to the negated cost), and D_max represents the maximum disparity trained by the model. Due to the limitations of industrial scenes, there may be a disparity shift between the left and right fields of view, where the x-coordinate of a point in the left image is smaller than that in the right image. Therefore, a minimum negative disparity D_min also needs to be learned.
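Equations (6) and (8) can be illustrated with a small NumPy sketch. The paper's implementation is a 3D-CNN pipeline; this only reproduces the arithmetic of the group-wise correlation and the soft-argmin regression:

```python
import numpy as np

def gwc_volume(feat_l, feat_r, max_disp, num_groups):
    """Group-wise correlation cost volume, Eq. (6): channels of (C, H, W)
    features are split into num_groups groups and the per-group inner
    product is averaged over the channels of each group."""
    c, h, w = feat_l.shape
    cpg = c // num_groups
    fl = feat_l.reshape(num_groups, cpg, h, w)
    fr = feat_r.reshape(num_groups, cpg, h, w)
    vol = np.zeros((num_groups, max_disp, h, w))
    for d in range(max_disp):
        if d == 0:
            vol[:, d] = (fl * fr).mean(axis=1)
        else:
            vol[:, d, :, d:] = (fl[..., d:] * fr[..., :-d]).mean(axis=1)
    return vol

def soft_argmin(cost, disp_min=0):
    """Soft-argmin regression, Eq. (8): softmax over the negated cost, then
    the expectation over disparity levels starting at disp_min."""
    e = np.exp(-cost - (-cost).max(axis=0, keepdims=True))  # stable softmax
    prob = e / e.sum(axis=0, keepdims=True)
    levels = np.arange(disp_min, disp_min + cost.shape[0]).reshape(-1, 1, 1)
    return (prob * levels).sum(axis=0)
```

Passing a negative disp_min mirrors the learned minimum disparity D_min discussed above.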
To sharpen target disparity edges, the predicted map and the Ground truth are first thresholded and dilated. The threshold for binarization is set to half of the normalized value of the predicted map. Then, the Binary Cross-Entropy loss is computed between the two to provide additional supervision for edge pixels:

L_BCE = − Σ [ Gt log(pre) + (1 − Gt) log(1 − pre) ] (9)

where pre and Gt represent the binarized and dilated predicted results and Ground truth.
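A plain-NumPy sketch of this edge supervision, with a 3 × 3 dilation as an assumed structuring element, is:

```python
import numpy as np

def binarize(disp):
    """Binarize a disparity map at half of its normalized maximum."""
    return (disp > 0.5 * disp.max()).astype(np.float64)

def dilate(mask, it=1):
    """Binary dilation with a 3x3 structuring element (plain-numpy sketch)."""
    m = mask.astype(bool)
    for _ in range(it):
        p = np.pad(m, 1)
        m = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2]
             | p[1:-1, 2:] | p[:-2, :-2] | p[:-2, 2:] | p[2:, :-2] | p[2:, 2:])
    return m.astype(np.float64)

def edge_bce_loss(pred, gt, eps=1e-7):
    """Eq. (9): threshold and dilate both maps, then take the mean binary
    cross-entropy between them."""
    p = np.clip(dilate(binarize(pred)), eps, 1 - eps)
    g = dilate(binarize(gt))
    return float(-(g * np.log(p) + (1 - g) * np.log(1 - p)).mean())
```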
The total loss function in this paper is defined as follows, where the Smooth L1 loss [37] is computed for the regression results:

L = Σ_i λ_i SmoothL1(d_i − d_gt) + λ_w SmoothL1(d_w − d_gt) + λ_E L_BCE (10)

where d_i represents the predicted result of the i-th decoder-encoder, d_gt represents the Ground truth, E represents the spatial index of the edge pixels supervised by L_BCE, d_w represents the result after disparity regression using cost weights, and λ denotes the weight assigned to each individual loss.
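The combination in (10) reduces to a simple weighted sum. The sketch below uses example λ values, which this excerpt does not specify:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber, beta = 1): 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a * a, a - 0.5)

def total_loss(preds, d_w, edge_loss, d_gt, lams, lam_w, lam_e):
    """Eq. (10) as a weighted sum: smooth-L1 terms for each decoder-encoder
    output d_i and for the cost-weight branch output d_w, plus the edge BCE
    term. The lambda values are caller-supplied assumptions."""
    loss = sum(l * smooth_l1(p - d_gt).mean() for l, p in zip(lams, preds))
    loss += lam_w * smooth_l1(d_w - d_gt).mean()
    loss += lam_e * edge_loss
    return float(loss)
```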

III. EXPERIMENT
In this section, the capability of the model in passive measurement is verified using the public datasets Scene Flow [31] and KITTI [32], [33], and the performance of the proposed model in active measurement is evaluated using our industrial dataset. The experimental datasets are described in Section III-A. Details of the experimental setup are presented in Section III-B. The evaluation metrics used in this paper and the effectiveness of the proposed modules and optimal settings are discussed in Sections III-C and III-D. Section III-E discusses the experimental results of our method on optical standard spheres. Section III-F analyzes the reconstruction results of our method in real industrial scenes and conducts a feasibility analysis.

A. DATASETS
Our industrial dataset was captured with two industrial cameras (Basler ace acA1920-40gc) with a resolution of 1920 × 1200 and a consumer-grade projector with a resolution of 1280 × 720. The fringe period is set to 16, and the working distance ranges from 0.5 m to 1.5 m. The camera baseline is 165 mm, and the focal length is 12 mm. In this study, we utilized 5 fringe patterns and 7 Gray code images.
The public datasets Scene Flow and KITTI are commonly employed for pre-training network weights and conducting network comparison experiments. Both utilize color (RGB information) as features to identify corresponding points for matching. However, due to limitations in industrial settings such as weak textures and non-continuous surfaces, this approach proves ineffective. Therefore, it is necessary to locally annotate feature points by training the model on speckle patterns projected onto the object surfaces. After obtaining qualified weights, achieving high-precision 3D reconstruction in industrial scenes only requires capturing a single frame with projected speckle patterns.

B. IMPLEMENTATION DETAILS
The model in this paper is implemented on the PyTorch framework and trained using the Adam optimizer [35] with β1 = 0.9 and β2 = 0.999. Network training and testing are performed on a Windows computer equipped with an Intel i9 9900X CPU and two NVIDIA TITAN RTX GPUs. A batch size of 6 is used during training and a batch size of 4 during testing.
For the Scene Flow dataset, we use an initial learning rate of 0.004, and the model is trained for 32 epochs. The learning rate is halved at the 6th, 16th, and 26th epochs. During training, the input data at a resolution of 960 × 540 is randomly cropped to 512 × 256.
For the KITTI dataset, this paper applies transfer learning from the Scene Flow weights. The model is trained for 200 epochs with an initial learning rate of 0.0002, following a cosine annealing learning rate decay strategy with warm restarts (T_0 = 5, T_mult = 2).
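For reference, the learning rate produced by this warm-restart schedule at integer epochs can be computed directly from the SGDR cosine formula (a sketch mirroring PyTorch's CosineAnnealingWarmRestarts with T_0 = 5, T_mult = 2, i.e. cycle lengths 5, 10, 20, ...):

```python
import math

def cosine_warm_restart_lr(epoch, lr0, t0=5, t_mult=2, eta_min=0.0):
    """LR at an integer epoch under cosine annealing with warm restarts."""
    t_i, start = t0, 0
    while epoch >= start + t_i:  # find the current restart cycle
        start += t_i
        t_i *= t_mult
    t_cur = epoch - start
    return eta_min + 0.5 * (lr0 - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

With lr0 = 0.0002 the rate restarts to its full value at epochs 5 and 15 and halves mid-cycle.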
Finally, for the industrial scene dataset, the model is trained for 48 epochs, with a maximum disparity setting of 768 and a minimum disparity of −256.

C. ABLATION STUDY
In this section, we discuss the network performance under different settings, including the high-dimensional feature fusion module (Neck Group), the edge-supervision loss, and the cost volume correction weight (Multip Weight). As shown in Table 1, the new modules significantly outperform the baseline (without any of the proposed modules). Together, the Neck Group feature extraction module, the edge loss, and the Multip Weight structure reduce the three-pixel error (3PE) by 37.4% on the KITTI dataset and decrease the end point error (EPE) by 23.3% on the Scene Flow dataset. In this context, D1-All counts a disparity estimate as erroneous if its error exceeds the maximum of 3 px and 0.05Gt, where Gt denotes the Ground truth disparity.
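The metrics used throughout this section can be stated precisely in a few lines (an illustrative implementation of EPE, the t-pixel error rate, and the D1-All rule described above):

```python
import numpy as np

def epe(pred, gt):
    """End-point error: mean absolute disparity error."""
    return float(np.abs(pred - gt).mean())

def pixel_error(pred, gt, tau=3):
    """t-pixel error rate (e.g. 3PE): fraction of pixels whose absolute
    disparity error exceeds tau."""
    return float((np.abs(pred - gt) > tau).mean())

def d1_all(pred, gt):
    """D1-All: a pixel is erroneous if its error exceeds both 3 px and 5%
    of the ground-truth disparity."""
    err = np.abs(pred - gt)
    return float(((err > 3) & (err > 0.05 * gt)).mean())
```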

D. CONTRAST EXPERIMENTS
In this section, we conducted an evaluation and comparison of our proposed model with other models to validate its effectiveness.For the Scene Flow dataset, we utilized the EPE and D1.Fig. 8 showcases the 3D reconstruction results obtained by PsmNet, BgNet, AcvNet, and our proposed method.For ease of visualization, zoomed-in results are displayed below the predicted results.
Our proposed method achieves more accurate disparity structures for each test sample. Compared to other methods, our approach demonstrates improved accuracy in areas with weak textures (image b, guitar fretboard) and small objects (image c, toy knife blade). This is attributed to the use of Multip Weight with different dilation scales in our network, which allows capturing cost volumes at various receptive field sizes.

TABLE 3. Evaluation of different methods (KITTI).
To quantitatively evaluate the performance of the compared methods, we compared nine different models and further calculated the average EPE and D1, as shown in Table 2. A dash (-) is used in the table where a method relied solely on the EPE metric for evaluation on the Scene Flow dataset, without reporting D1. Overall, our method demonstrates a 10.7% improvement in D1 performance compared to the state-of-the-art AcvNet model on the Scene Flow test set.
To evaluate the effectiveness of our proposed model in complex scenes, we conducted comparative experiments on the KITTI2012 and 2015 urban street test datasets.Fig. 9 illustrates the disparity results comparison on the KITTI2015 dataset, with zoomed-in results displayed below the predicted results for better visualization.
Table 3 presents a performance comparison on the KITTI dataset.For KITTI2012, evaluation is conducted using the 3PE and the 2PE, where the maximum allowable error is set to three pixels (or two pixels).For KITTI2015, D1 errors are calculated for both background and foreground regions.
On the KITTI2012 dataset, our proposed method achieves a 9.8% improvement in 2PE compared to AcvNet, with a slight degradation of 1.36% in 3PE. For KITTI2015, however, our model demonstrates a significant performance improvement of 12.7%.

E. STANDARD SPHERE EXPERIMENT
To quantitatively analyze the accuracy of our proposed method, we fitted a sphere using 3D point cloud processing software to obtain the radius and center coordinates of the sphere. Fig. 10 illustrates the three-dimensional information of optical standard spheres obtained in a non-laboratory setting. The output is a dense spherical point cloud close to the ground truth. There are minor inaccuracies at the contact area between the sphere and the tabletop, which could be attributed to the difficulty of projecting speckle patterns accurately onto the contact region. Apart from that, there are no apparent protrusions or indentations in the remaining regions. Table 4 exhibits the parameters of the fitted sphere. It is worth noting that the standard sphere used as the test dataset achieves sub-pixel reconstruction accuracy, with a reconstruction radius error of no more than 0.06 mm. This level of accuracy meets the precision requirements for industrial inspection.
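The sphere fit itself is a standard linear least-squares problem; the paper uses point cloud processing software, but an equivalent minimal implementation is:

```python
import numpy as np

def fit_sphere(pts):
    """Linear least-squares sphere fit: from |p|^2 = 2 c.p + (r^2 - |c|^2),
    solve A [cx, cy, cz, t]^T = b with t = r^2 - |c|^2."""
    pts = np.asarray(pts, dtype=np.float64)
    a = np.hstack([2.0 * pts, np.ones((len(pts), 1))])
    b = (pts ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(a, b, rcond=None)
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius
```

Given the fitted radius, the reconstruction error reported in Table 4 is simply the deviation from the sphere's certified radius.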

F. INDUSTRIAL SCENARIO EXPERIMENTS
In this section, to thoroughly demonstrate the effectiveness of our proposed method, we trained the model using abundant training data.We conducted tests on common industrial scenes with weak textures and inconsistent reflectivity, such as wheels, snap fasteners and hollow axles.
Fig. 11 presents some of the test results obtained from high-speed rail data in our experiments. Images a-c showcase the results for wheels, which contain large metallic areas. These areas are crucial and challenging to reconstruct in industrial scenes. The main components of the wheels consist of the tread surface and the axle. By using speckle images as input, our method successfully reconstructs dense disparity in regions where high-reflectance phenomena occur. This helps mitigate disparity matching errors caused by non-uniform reflectivity to some extent. These results validate the robustness of our method in handling metal imaging scenarios effectively. Image d includes a steel rail with surface rust and a snap fastener. By establishing local correlations using speckle patterns, our proposed method is capable of mitigating erroneous matches caused by rusting.
For common reflective scenes, we performed 3D reconstruction on a hollow axle in Image e.The reconstruction was not affected by the large-area reflection along the axle's central axis, resulting in dense and continuous point clouds.Image f depicts the interface between the inner base of the wheel and the axle connection point.Approximately half of the field of view comprises low-texture regions, making it challenging to identify corresponding points.However, the proposed method in this paper enables clear reconstruction of the 3D information of the axle and base surface, which holds great significance for subsequent axle localization tasks.
Image h shows the pantograph pull rod on the roof of a train. Such scenes are typically challenging to reconstruct due to their special material properties. However, by utilizing the speckle projection technique, our method reconstructs dense point clouds free of clumping artifacts. This demonstrates the strong robustness of our proposed method in handling targets with low reflectance.

G. THEORETICAL ANALYSIS EXPERIMENTS
In order to facilitate a clear comparison between the conventional passive imaging method and the speckle imaging capability proposed in this study within an industrial setting, imaging experiments were conducted on the train bogie using both approaches.
The scene comprises the most challenging regions in stereo imaging, including low-texture and boundary areas, which are represented by the green and red boxes in Fig. 12.
Fig. 13 presents the disparity distribution of conventional passive imaging and the proposed method in two regions.This validation approach has been commonly utilized in prior studies [37], [40].
As depicted in Fig. 13(a), in the case of low-texture regions, the conventional method yields considerable noise, as there are no prominent features available for accurate matching. The speckle projection pattern successfully suppresses noise in the disparity probability by establishing local feature uniqueness, leading to highly standardized and unimodal probability distributions.
The disparity discontinuity arising from the disparity jump between foreground and background leads to dual peaks in the disparity distribution at object edges. When disparity regression is further applied, this results in flocculent point clouds. As illustrated in Fig. 13(b), the conventional method exhibits dual probability peaks influenced by both the background and foreground, resulting in fragmented point clouds. Conversely, the proposed method achieves accurate unimodal matching at edges, effectively eliminating flocculent disparity along the edges.
In conclusion, when dealing with scenarios involving weak texture or non-contiguous regions in the target, persisting with passive measurement techniques, where the model learns object color or shape information, can lead to numerous erroneous matches.However, our proposed method, which involves locally labeling target features, effectively addresses the challenges of matching points in weak-textured regions and mitigates the issue of flocculent disparity caused by probability jumps.

IV. CONCLUSION
This paper proposes an end-to-end speckle stereo matching network that achieves high-precision, highly generalizable, and highly robust sub-pixel 3D reconstruction. To ensure reliable network reconstruction, a dataset of 6,480 high-precision stereo pairs was created using 7+5 binocular Gray code phase shifting. Speckle images are used as the network input. Quantitative results demonstrate that the proposed network achieves approximately 10.7% higher matching accuracy than state-of-the-art networks on public datasets. It attains radius reconstruction errors of no more than 0.06 mm for optical standard spheres and delivers excellent results on real high-speed rail data with weak texture and non-uniform reflectivity.
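The standard-sphere radius evaluation requires fitting a sphere to the reconstructed point cloud. The paper does not state its fitting procedure; a common choice is the linear least-squares fit sketched below, where the function name and synthetic check are illustrative:

```python
import numpy as np

def fit_sphere(points):
    """Linear least-squares sphere fit to an (N, 3) point cloud.

    Rearranging (x-a)^2 + (y-b)^2 + (z-c)^2 = r^2 gives the linear
    system 2ax + 2by + 2cz + k = x^2 + y^2 + z^2 in the unknowns
    (a, b, c, k), with r = sqrt(k + a^2 + b^2 + c^2).
    """
    A = np.c_[2.0 * points, np.ones(len(points))]
    b = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius

# Synthetic check: noiseless points on a sphere of radius 15 mm
# centered at (1, 2, 3); the fit should recover both exactly.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
points = np.array([1.0, 2.0, 3.0]) + 15.0 * dirs
center, radius = fit_sphere(points)
```

The radius error reported in the paper would then be the absolute difference between the fitted radius and the calibrated radius of the optical standard sphere.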
Several aspects of this method could be improved. First, owing to limited GPU memory, random cropping is required during data input; while this provides data augmentation and faster training, it can adversely affect the generalization of the learned weights to other data. Second, the proposed method performs well in single-frame industrial inspection, but due to computational time constraints it can only be applied to wheel-drop detection or low-speed rail-side inspection. Based on these analyses, we will explore lighter and faster stereo matching methods in the future.
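For stereo training data, the random crop must be applied at the same location in the left image, the right image, and the disparity ground truth, since disparity d = x_left − x_right is invariant when both rectified images are shifted by the same offset. A minimal sketch (function name and shapes are illustrative, not the paper's pipeline):

```python
import numpy as np

def random_crop_stereo(left, right, disparity, crop_h, crop_w, rng=None):
    """Crop a rectified stereo pair and its disparity ground truth at
    the same random location so epipolar correspondences stay aligned.

    Disparity values need no adjustment: shifting both images by the
    same (y, x) offset leaves x_left - x_right unchanged.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = left.shape[:2]
    y = int(rng.integers(0, h - crop_h + 1))
    x = int(rng.integers(0, w - crop_w + 1))
    window = np.s_[y:y + crop_h, x:x + crop_w]
    return left[window], right[window], disparity[window]

# Example with random stand-in data.
rng = np.random.default_rng(0)
left = rng.random((256, 512, 3))
right = rng.random((256, 512, 3))
disp = rng.random((256, 512))
lc, rc, dc = random_crop_stereo(left, right, disp, 128, 256, rng)
```

The generalization concern noted above stems from the crop window rarely covering full objects, so the network sees fewer large-scale structures than exist in the full frames.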

FIGURE 1. Dataset production process. The figure illustrates the creation of the dataset through the three projection modes used in this paper.

FIGURE 2. Gray code and fringe projection patterns from the 7+5 Gray code phase-shifting method used in dataset generation.

FIGURE 3. Example of a speckle pattern: a pseudo-random speckle pattern generated from a black-and-white binary structured pattern.


FIGURE 4. Dataset samples. From left to right: the image under normal illumination, the image under pseudo-random speckle projection, the ground-truth disparity map, and the ground-truth point cloud obtained using the calibration parameters.

FIGURE 5. Schematic diagram of the proposed network structure. The network consists of five components: feature extraction, 4D cost volume construction, cost volume correction, cost aggregation, and disparity regression.

FIGURE 6. Pixel sampling weights at different dilation scales. (a), (b), and (c) show the sampling windows of the same central point at different dilation scales.

FIGURE 7. Feature-map differences at the same level. The difference between feature maps of the proposed method and a common pyramid pooling structure at the same layer; brighter colors indicate larger differences in feature strength.

FIGURE 8. Comparison of methods (Scene Flow). From left to right: the original left image and the predictions of PSMNet, BGNet, ACVNet, and the proposed method. The proposed method maintains good accuracy in the ill-posed regions of the synthetic dataset.

FIGURE 9. Comparison of methods (KITTI 2015). From left to right: the original left image and the predictions of PSMNet, BGNet, ACVNet, and the proposed method. Our approach demonstrates excellent performance in reconstructing small objects.

TABLE 4. Comparison of standard sphere experiments.

FIGURE 10. Standard sphere experiment. Lateral view of the ground truth and the reconstructed standard-sphere point cloud.

FIGURE 11. Test results on real railway data. These images, selected from the test dataset, all show important components in the high-speed rail field. Like the standard sphere, they are untrained data used to test the effectiveness and robustness of the proposed method.

FIGURE 12. Reconstruction results for a high-speed train bogie. The first row shows the disparity map and point cloud obtained by conventional passive imaging; the second row shows the results of the speckle projection method proposed in this study.

FIGURE 13. Probability distribution of disparities after applying softmax. The samples are taken from the two most challenging regions: (a) a low-texture region and (b) a boundary region. The first row shows the results of conventional passive imaging; the second row shows the results of the proposed method. The x-axis represents disparity and the y-axis the probability.

TABLE 1. Model evaluation under different settings.

TABLE 2. Evaluation of different methods (Scene Flow).