Multiscale Adjacency Matrix CNN: Learning on Multispectral LiDAR Point Cloud via Multiscale Local Graph Convolution

Multispectral LiDAR can rapidly acquire 3D and spectral information of objects, providing richer features for point cloud semantic segmentation. Despite the remarkable performance of existing graph neural networks in point cloud segmentation, extracting local features remains challenging in multispectral LiDAR point cloud scenes because geometric and spectral information is unevenly distributed. To address this challenge, recent research predominantly focuses on extracting multiscale local features to compensate for shortcomings in feature extraction. We therefore propose a multiscale adjacency matrix convolutional neural network (MS-AMCNN) for multispectral LiDAR point cloud segmentation. In MS-AMCNN, a local adjacency matrix convolution module is first proposed to efficiently leverage the point cloud's topological relationships and perceive local geometric features. Subsequently, a multiscale feature extraction architecture is adopted to fuse local geometric features, and a global self-attention module is used to globally model the multiscale semantic features. The network effectively captures global and local representative features of the point cloud by harnessing the capabilities of convolutional neural networks in local feature modeling and the self-attention mechanism in global semantic feature learning. Experimental results on the Titan dataset demonstrate that the proposed MS-AMCNN achieves promising multispectral LiDAR point cloud segmentation performance, with an overall accuracy of 94.39% and a mean intersection over union (MIoU) of 86.57%. Compared with other state-of-the-art methods, such as DGCNN, which achieved an MIoU of 85.43%, and RandLA-Net, with an MIoU of 85.20%, the proposed approach achieves the best segmentation performance.


I. INTRODUCTION
With the continuing improvement of devices such as 3D laser scanners and depth sensors, the application scope of 3D data is becoming increasingly wide. Because they carry rich geometric shape information, these data play an important role in fields such as autonomous driving [1], [2], urban modeling [3], [4], and engineering surveying [5], [6]. As a typical form, point cloud data contains spatial coordinates and related attribute information and plays a crucial role in the recognition and segmentation of 3D scenes. Achieving accurate point cloud semantic segmentation enables a more detailed understanding and description of scenes, thereby improving scene perception capabilities. To enhance the ability to discern and segment objects, the initial single-wavelength light detection and ranging (LiDAR) systems have progressively evolved into multispectral LiDAR systems, aiming to obtain more comprehensive spectral information. Due to the lack of spectral information, traditional single-wavelength LiDAR cannot comprehensively describe the features of point cloud scenes or obtain satisfactory segmentation results. In contrast, multispectral LiDAR and hyperspectral LiDAR can simultaneously obtain spectral information from multiple wavelengths, thereby achieving more accurate point cloud segmentation results [7], [8]. In real point cloud scenes, multispectral LiDAR data is voluminous, and feature selection is complex, making the task of multispectral LiDAR semantic segmentation both challenging and valuable.
In recent years, two mainstream approaches have emerged for multispectral LiDAR point cloud segmentation: the image-based approach [9], [10] and the point cloud-based approach [7], [11]. The image-based approach primarily converts the multispectral intensity, echo, and elevation information of the multispectral LiDAR into 2D raster data for classification. From the constructed multispectral imagery, 2D spectral attributes [12], texture characteristics [13], and normalized vegetation indices [13], among other representative features, are then extracted. Subsequently, a classic machine learning method is applied to discern various land cover types. This approach simplifies data processing. However, converting 3D point clouds into 2D image data results in the loss of 3D spatial information and spectral information, thereby decreasing the accuracy of semantic segmentation. With the advancement of computer technology and hardware, more research has shifted towards directly processing multispectral LiDAR point clouds. The point cloud-based segmentation method, similar to the image-based method, utilizes intensity, elevation, and designed spatial descriptors to classify the 3D point cloud, thereby accurately segmenting spatial information. Luo et al. [7] employed the random forest algorithm to classify Titan's multispectral LiDAR point cloud directly, thus verifying the potential of spectral information derived from multispectral LiDAR for land cover classification. Shi et al. [14] performed feature selection on the spectral and contextual attributes of the multispectral LiDAR point cloud using an equalization-based optimization algorithm, followed by classification of the 3D point cloud using a support vector machine. Their approach achieved higher classification accuracy than solely utilizing the raw coordinate information. However, designing optimal spatial features is a cumbersome task in large-scale and complex scenes, which may lead to unreliable accuracy in point cloud semantic segmentation.
Deep learning techniques have advanced significantly in various fields, including speech recognition, natural language processing, and computer vision. Deep learning has facilitated an extensive exploration of the abundant spectral characteristics and the potential for point cloud segmentation in multispectral LiDAR. In the domain of deep learning for point clouds, PointNet [15], a pioneering end-to-end model, directly takes raw point clouds as input, extracts point features through multilayer perceptrons (MLP), and has demonstrated outstanding performance in tasks such as point cloud classification and semantic segmentation. To address PointNet's poor performance in capturing local structural information, researchers have proposed a series of improved methods. DGCNN [16] employs edge convolution operations to replace the stacked MLPs in PointNet in order to preserve permutation invariance while extracting local geometric features from point clouds. HDGCN [17] introduces a graph convolution operator, the DGConV block, which aggregates local neighborhood features within the graph and propagates them to the neighboring points. By utilizing a hierarchical structure of DGConV blocks, the network achieves local and global feature extraction from point clouds. DDGCN [18] constructs a similarity matrix in the local graph that incorporates both point cloud distances and orientations, thereby enabling the extraction of local features in a dynamic neighborhood graph. Graph-based methods model point clouds as the topological structure of graphs and design corresponding convolutional operators, utilizing graph convolutional neural networks (CNNs) for feature extraction and classification. These methods [16], [17], [18], [19], [20] have demonstrated exemplary performance in semantic segmentation. However, most graph convolution-based methods only consider the relationship between the central point and its neighboring points in the local graph while neglecting the importance of relationships among neighboring points. Moreover, most methods based on graph CNNs only extract local geometric features at a single scale, neglecting the multiscale neighborhood structure information. This limitation restricts the capability of the network to describe scene features. The research [8], [21]

A. Projection-Based Method
Projection-based methods are closely related to 2D image processing, where the fundamental idea is to project 3D point cloud data onto a 2D plane or utilize multiple-view images and then process them using 2D CNNs. As a pioneering work in this approach, MVCNN [22] aggregates multiview image features into a global feature descriptor, enhancing segmentation accuracy and precision by observing visual information from different object viewpoints. To improve network robustness and the accuracy of multiview fusion features, several variant methods [23], [24], [25] have been proposed based on MVCNN. GVCNN [26] groups different visual descriptors extracted by CNNs under different viewpoints based on discriminative scores. It then aggregates the visual feature operators of each group through global pooling to obtain corresponding segmentation results. In contrast to previous methods, View-GCN [27] adopts a graph convolutional network structure. It converts multiview point cloud data into a View-Graph, which is used to aggregate node features of multiple views for learning global shape descriptors. In summary, projection-based methods integrate the projection information of multiple viewpoints or multiple point clouds to enhance the expressive power of point clouds. However, these methods often sacrifice the 3D spatial features of point clouds, and the extensive use of projections brings higher time costs and memory consumption.

B. Voxel-Based Method
Voxel-based methods divide the point cloud into multiple regular 3D grids and then utilize deep learning models to segment each voxel. Among these methods, VoxNet [28] is one of the earliest point cloud segmentation networks that introduced voxelization. VoxNet directly processes sparse 3D point clouds and effectively captures their shape information by incorporating voxel feature encoding. These networks transform unstructured point cloud data into structured voxel grids and employ CNNs for learning. However, these networks frequently struggle to establish high-resolution voxelized models. To address this issue, OctNet [29] introduces an octree structure, which efficiently handles and represents 3D point clouds with irregular distributions and nonuniform densities, thereby enhancing network performance and efficiency. Additionally, PointGrid [30] adopts space-filling curves to map point cloud data onto a 3D voxel grid. It better learns local geometric feature details by performing convolutions and pooling operations on the voxels. It is worth noting that although voxel-based methods have performed well in point cloud segmentation, the voxelization process sacrifices specific spatial details. Moreover, constructing and storing high-resolution voxel grids requires substantial memory resources, typically resulting in lower computational efficiency.

C. Point-Based Method
Point-based methods directly process 3D point clouds without voxelization or projection. As a pioneering work, PointNet [15] introduces a method based on MLP that can directly perform deep learning on unstructured point clouds while ensuring permutation and rotation invariance in the results. To address the inability of PointNet to capture local features, PointNet++, proposed by Qi et al. [31], models multiscale and multilevel local regions through processes such as hierarchical sampling, local feature extraction, and feature aggregation to capture a broader range of contextual information. The PointNet series networks have demonstrated exemplary performance in tasks such as point cloud classification and segmentation, and many networks [32], [33], [34] have been developed based on this foundational framework. Recent research has found that multiscale features are robust in capturing density variations within point clouds and in aggregating geometric information from different scales. To this end, 3DMAX-Net [35] has designed a multiscale contextual feature learning block that combines upsample and downsample unit blocks to obtain rich contextual features from point clouds at different scales. Inspired by global feature aggregation algorithms in image processing, 3D-PSPNet [36] adopts a pyramid structure. At each scale of the pyramid, local contextual information within subscenes is independently obtained through grid pooling, and global features are obtained through feature aggregation, thereby enhancing the interaction between large-scale contextual information and small-scale local details. MS-PCNN [37] employs a U-shaped structure of upsampling and downsampling to utilize multiscale global and local features. MNFEAM [38] accelerates semantic abstraction and aggregation of features in the network by capturing semantic relationships in the local feature space with different receptive fields from multiscale neighboring point sets. Enlarging the receptive field of networks to obtain rich local and global contextual information and improve the expressive capability of point cloud features has become an active research topic in the field.

D. Graph-Based Method
With the advancement of graph neural networks, graph-based methods have been widely used for mining unstructured data. The GACNet network proposed by Wang et al. [39] utilizes the graph attention convolution block to establish local graphs between points and their neighboring points. By employing attention mechanisms to compute edge weights between the central point and its neighbors, the relationships between different nodes are weighted, enabling better propagation of important node information. Similarly, DGCNN [16] utilizes EdgeConv to construct dynamic local graphs to extract and learn local semantic features efficiently. DGCNN variants have been developed to enhance the capability of extracting local features and improve overall performance. LGGCM [40] leverages local spatial attention convolution and a global spatial attention module to capture geometric features of local point cloud spaces and global contextual information. AGConv [41] dynamically learns point clouds of different semantic parts, generating adaptive graph convolutional kernels, thereby enhancing the flexibility of local convolutions. 3D-GCN [42] introduces a deformable kernel in 3D space for extracting point cloud features at multiple scales. By establishing topological structures and fully considering the interrelations between points, graph-based methods have proven effective for 3D point cloud data.
Most existing graph-based methods typically construct local graphs using ball queries or k-nearest neighbor searches and utilize MLPs to extract point features. However, these approaches only consider the pairwise relationships between the center point and its neighborhood in the local graph while neglecting the inherent connections among the neighborhood points. This limitation leads to insufficient network capturing of

A. Study Area
The study area is located near the University of Houston, USA. The data were acquired on February 16, 2017, using an Optech Titan MW (14SEN/CON340) LiDAR system. Optech Titan is a multispectral LiDAR system containing three bands (1550, 1064, and 532 nm) with a pulse repetition frequency of 175 kHz per channel (525 kHz total) and a scan angle of ±26°. The average flight height during scanning is 500 m above ground level, and the average point density of the multispectral LiDAR point cloud after data preprocessing is 11.2 points/m². The whole dataset covers 4167 m × 1200 m, and a total of 14 LAS datasets were obtained, which include 20 land cover classes, e.g., healthy grass, artificial turf, evergreen trees, deciduous trees, bare earth, residential buildings, roads, and cars. In this study, we manually selected 17 sample scenes from the pool of 14 preprocessed multispectral LiDAR point cloud LAS datasets. These scenes covered 27 20 000 m² and were partitioned into training, validation, and testing sets according to the methodology illustrated in Fig. 1. Following related studies on LiDAR data classification, we mainly considered six classes of land cover, i.e., impervious ground, grass, buildings, trees, cars, and powerlines. The samples for impervious ground, grass, buildings, and trees are sufficiently numerous, while the samples for cars and powerlines are only one-eighth the size of the trees class.

B. Multispectral LiDAR Data Processing
Optech Titan is not strictly a multispectral LiDAR system. The three channels of laser beams have different downward tilting angles, so not every point has intensity data for all three channels [11]. To consolidate the intensity values of the three channels in the Titan point cloud data onto a single point, the point clouds from the three independent channels were merged into a single point cloud using the method described in [7], based on the principle that adjacent points have correlated intensity information. Inclusion was determined solely by the return points of the Titan dataset's first channel (1550 nm). Any point lacking intensity return values from the other channels at that position was excluded.
To obtain the ground truth for the scene, the multispectral LiDAR point cloud is manually labeled point by point with corresponding class labels. Due to the limited capacity of the GPU, the whole sample area cannot be directly input into the network. Therefore, we constructed the multispectral LiDAR dataset for the study area by following the S3DIS [43] and Semantic3D [44] datasets. Typically, when using the S3DIS dataset for deep learning-based point cloud methods, the number of points input into the network is fixed. Meanwhile, considering the point density of the Titan dataset in this area, we have improved the farthest point sampling and K-nearest neighbor (FPS-KNN) sampling method [44] to quickly obtain a fixed number of training samples while preserving the integrity of the scene. Fig. 2 shows the process of our sampling strategy.
where $F_{out}$ is the result of feature normalization, and $[\min, \max]$ is the range of values after feature normalization. $F_{in}$, $F_{in\_min}$, and $F_{in\_max}$ represent the original input feature and its minimum and maximum values, respectively. The 9-D features of every point in the original data are individually normalized using (1).
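The symbols defined above match the usual min-max rescaling, $F_{out} = (F_{in} - F_{in\_min}) / (F_{in\_max} - F_{in\_min}) \cdot (\max - \min) + \min$. Since (1) itself is not reproduced here, this form is an assumption. A minimal Python sketch of per-feature normalization under that assumption follows; the function names and the toy 3-D features (instead of the 9-D features) are illustrative only.

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale one feature column to [new_min, new_max] (assumed form of Eq. (1))."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; map it to the lower bound.
        return [new_min for _ in column]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in column]


def normalize_points(points, new_min=0.0, new_max=1.0):
    """Normalize every feature dimension of an N x D point list independently."""
    columns = list(zip(*points))                       # D columns of N values
    scaled = [min_max_normalize(c, new_min, new_max) for c in columns]
    return [list(row) for row in zip(*scaled)]         # back to N rows


# Three toy points with (x, y, intensity) standing in for the full feature vector.
toy = [[0.0, 10.0, 50.0], [5.0, 20.0, 150.0], [10.0, 30.0, 250.0]]
normalized = normalize_points(toy)
```

Each dimension is scaled on its own so that coordinates and intensities with very different ranges end up on a common scale.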
2) Optimized FPS-KNN: Uniform sampling [15], voxel sampling [45], and block sampling [46] are standard point cloud sampling methods. Compared with these methods, the FPS-KNN sampling method can better preserve the integrity of objects in multispectral LiDAR point cloud data and effectively generate samples that cover the whole scene. By acquiring a fixed number of neighboring points around each FPS point, KNN compensates for some of the pointwise spatial relationships lost in downsampling. Considering the density of the multispectral LiDAR point cloud and data augmentation in this study area, we improved the FPS-KNN method, and the flowchart is as follows.
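For orientation, the baseline FPS-KNN idea (not the optimized variant proposed here, whose flowchart is not reproduced) can be sketched as follows: farthest point sampling picks well-spread seed points, and a KNN query around each seed yields one fixed-size sample. The function names, the brute-force search, and the toy grid are assumptions made for illustration.

```python
import math


def farthest_point_sampling(points, n_seeds, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all chosen seeds."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen seed.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_seeds:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen


def knn_indices(points, center, k):
    """Indices of the k points closest to `center` (brute force, O(N log N))."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]


def fps_knn_samples(points, n_seeds, k):
    """One fixed-size sample (k point indices) around each FPS seed."""
    seeds = farthest_point_sampling(points, n_seeds)
    return [knn_indices(points, points[s], k) for s in seeds]


# Toy scene: a 4 x 4 grid of points on the ground plane.
grid = [(float(x), float(y), 0.0) for x in range(4) for y in range(4)]
samples = fps_knn_samples(grid, n_seeds=2, k=4)
```

On real airborne data a spatial index (for example a k-d tree) would replace the brute-force distance loops; the sketch only shows how seeds and fixed-size neighborhoods fit together.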

A. Network Architecture
This study highlights the importance of selecting the optimal segmentation scale in the multispectral LiDAR point cloud scene, where objects exist at multiple scales. To address this challenge, we designed the detailed architecture of MS-AMCNN, as depicted in Fig. 3. Finally, the fused features are transformed into multispectral LiDAR point cloud class labels using a fully connected layer.

B. Local Adjacency Feature Convolution
We build the local graph of each point in the scene by searching the point cloud using KNN for the input samples. As shown in Fig. 4, consider the local graph $G(V, E)$ consisting of the set of points $P_i = \{p_i, p_{i1}, p_{i2}, \ldots, p_{iK}\} \subset \mathbb{R}^{3+C}$, where $p_i$ is the central point of the point set $P_i$, and $p_{ij}$ denotes the $K$ neighboring points of $p_i$. Here, 3 and $C$ stand for the spatial coordinate dimension and the spectral feature dimension of the multispectral LiDAR point cloud, respectively. For the local graph, $V_i = \{1, 2, \ldots, K\}$ and $E_i \subseteq V_i \times V_i$ represent the set of vertices and edges, respectively.
Let $F_i = \{f_i; f_{i1}; f_{i2}; \ldots; f_{iK}\}$ be the feature set corresponding to $P_i$. The correlation between neighboring nodes is calculated by constructing the adjacency matrix of the local graph using a self-attention mechanism. To capture feature differences in the local neighborhood, we calculate the relative positional coordinates $\Delta P_i = \{\Delta p_{i1}, \Delta p_{i2}, \ldots, \Delta p_{iK}\}$ of all adjacent points $p_{ij}$ with respect to the central point $p_i$, together with the differences $\Delta F_i = \{\Delta f_{i1}, \Delta f_{i2}, \ldots, \Delta f_{iK}\}$ between their corresponding features $f_{ij}$ and the central feature $f_i$. The implementation is as follows:

$$\Delta p_{ij} = p_{ij} - p_i, \qquad \Delta f_{ij} = f_{ij} - f_i, \qquad j = 1, 2, \ldots, K.$$

Subsequently, a self-attention mechanism is employed to generate the adjacency matrix of the neighboring points, which further captures the relationship between adjacent nodes in the local graph. This method satisfies the point cloud permutation invariance [47]. We obtain the local autocorrelation matrix by concatenating the relative positional coordinates and the differences in high-dimensional features. The correlation matrix $R$ measures the high-dimensional spatial relationship between features of adjacent nodes, and it is defined as follows: In the equation, $\gamma$ and $\theta$ represent two different MLPs with nonlinear activation functions. The symbols $\|$ and $\times$ denote the concatenation operation and matrix multiplication, respectively.

We utilize Softmax to diminish the redundant features among different nodes in the correlation matrix and generate the adjacency matrix $A$. Each element of $A$ is defined as follows:

$$A_{ij} = \frac{\exp(R_{ij})}{\sum_{k=1}^{K} \exp(R_{ik})}$$

where $A_{ij}$ and $R_{ij}$ are the elements of the adjacency matrix $A$ and the correlation matrix $R$, respectively. The central node features are updated by multiplying the adjacency matrix with the original features. The formula is as follows: Finally, by incorporating the captured global shape structural information $f_i$ with the local neighborhood information $\tilde{f}_i$, the output features of the central point $p_i$ in the local graph are obtained. The output features of the LAF-ConV are as follows:
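The steps described above can be illustrated with a small sketch. Because the defining equations are not reproduced here, several choices below are assumptions rather than the authors' exact formulation: $\gamma$ and $\theta$ are each reduced to a single linear layer with ReLU, the correlation matrix is taken as $\gamma(\cdot) \times \theta(\cdot)^\top$ on the concatenated differences, the re-weighted neighbor features are max-pooled, and the result is concatenated with the central feature. All names (`laf_conv`, `w_gamma`, `w_theta`) are hypothetical.

```python
import math
import random


def linear_relu(rows, weights):
    """Apply one shared linear layer (list of output columns) plus ReLU to each row."""
    return [[max(0.0, sum(x * w for x, w in zip(row, col))) for col in weights]
            for row in rows]


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def laf_conv(center_xyz, center_feat, nbr_xyz, nbr_feat, w_gamma, w_theta):
    # Relative coordinates and feature differences, concatenated per neighbor.
    edges = [[a - b for a, b in zip(p, center_xyz)] +
             [a - b for a, b in zip(f, center_feat)]
             for p, f in zip(nbr_xyz, nbr_feat)]
    g = linear_relu(edges, w_gamma)                   # K x D
    t = linear_relu(edges, w_theta)                   # K x D
    # Assumed correlation matrix R = g @ t^T, then row-wise softmax gives A.
    corr = [[sum(a * b for a, b in zip(gi, tj)) for tj in t] for gi in g]
    adj = [softmax(row) for row in corr]
    # Re-weight the neighbor features with A (K x C).
    mixed = [[sum(w * f[c] for w, f in zip(row, nbr_feat))
              for c in range(len(center_feat))] for row in adj]
    # Assumed aggregation: max-pool over neighbors, concatenate with the center.
    pooled = [max(col) for col in zip(*mixed)]
    return adj, list(center_feat) + pooled


rng = random.Random(0)
K, C, D = 4, 3, 8
center_xyz, center_feat = [0.0, 0.0, 0.0], [0.2, 0.5, 0.1]
nbr_xyz = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(K)]
nbr_feat = [[rng.random() for _ in range(C)] for _ in range(K)]
w_gamma = [[rng.gauss(0, 0.5) for _ in range(3 + C)] for _ in range(D)]
w_theta = [[rng.gauss(0, 0.5) for _ in range(3 + C)] for _ in range(D)]
adj, out = laf_conv(center_xyz, center_feat, nbr_xyz, nbr_feat, w_gamma, w_theta)
```

Under these assumptions, reordering the neighbors permutes the rows and columns of the adjacency matrix consistently, so the pooled output is unchanged, which is the permutation invariance property mentioned in the text.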

C. MSFE Block
We have designed a multiscale structure to improve the accuracy of point cloud segmentation and enhance the diversity of local features. The effectiveness of this approach has been well established in previous studies [31], [45], [46]. Based on the locally self-attentive adjacency matrix, LAF-ConV is implemented to extract local features. To improve the efficiency of point cloud segmentation and the sensitivity of local feature extraction, we design an MSFE block that constructs local graphs with different numbers of neighboring points and utilizes LAF-ConV for feature extraction and multiscale feature fusion.
Previous studies have demonstrated the effectiveness of multiscale structures; therefore, we adopt the MSFE block architecture, as shown in Fig. 5. For the original training samples input to the network, we perform KNN searches with different numbers of neighboring points to construct multiscale sampling results. We construct corresponding local graphs at the different neighborhood scales and use the LAF-ConV block for local self-attentive adjacency matrix feature extraction. For the extraction results of the LAF-ConV block at each scale, we use three different MLPs with nonlinear, learnable activation functions for feature transformation and pool the features through a pooling layer for feature fusion to obtain the output results. The entire process of the MSFE block is shown in the following: where
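The multiscale construction can be illustrated with a short sketch: the same point is described from KNN neighborhoods of several sizes and the per-scale descriptors are fused. This is a simplified illustration under stated assumptions, not the MSFE block itself: `local_descriptor` is a hypothetical stand-in for LAF-ConV (max-pooled absolute differences to the center), the per-scale MLPs are omitted, and fusion is done by concatenation. The neighborhood sizes 12, 20, and 32 follow the defaults reported in the ablation study.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]


def local_descriptor(points, feats, i, nbrs):
    """Stand-in for LAF-ConV: max-pooled absolute differences to the center."""
    center = points[i] + feats[i]
    rows = [[abs(a - b) for a, b in zip(points[j] + feats[j], center)]
            for j in nbrs]
    return [max(col) for col in zip(*rows)]


def msfe(points, feats, scales=(12, 20, 32)):
    """Concatenate one local descriptor per neighborhood size for every point."""
    fused_all = []
    for i in range(len(points)):
        fused = []
        for k in scales:
            fused += local_descriptor(points, feats, i, knn(points, i, k))
        fused_all.append(fused)
    return fused_all


# Toy sample: a 6 x 6 planar grid with one spectral-like feature per point.
pts = [[float(x), float(y), 0.0] for x in range(6) for y in range(6)]
fts = [[p[0] / 5.0] for p in pts]
fused = msfe(pts, fts)
```

Each point ends up with one descriptor per scale, so small neighborhoods contribute fine detail while large ones contribute broader context.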

D. Global Self-Attention Mechanism
The self-attention mechanism can dynamically capture long-range dependencies between features. Moreover, it can adaptively adjust weights based on the information from different positions in the input, thereby better modeling contextual relationships and significantly enhancing the model's generalization ability [48].
Although the LAF-ConV and MSFE blocks can extract local structural features of point clouds through neighborhood graphs at different scales, they still lack global contextual information about the scene. Therefore, we introduce a GSA block that learns from the fused multiscale local point features to sufficiently perceive global contextual relationships and improve the accuracy and generalization ability of point cloud scene understanding. The GSA block is shown in Fig. 6.
TABLE I AVERAGE ACCURACY EVALUATION METRICS FOR DIFFERENT METHODS UNDER THE THREE SCENARIOS

The output features of the MSFE block are used as the embedding layer input of the GSA block. Specifically, the input features are $Y = \{y_1; y_2; y_3; \ldots; y_S\}$. The following formula can express the computation of the GSA block: Here, $Q$, $K$, and $V$ represent the query, key, and value vectors, respectively, and $W^Q_i$, $W^K_i$, and $W^V_i$ are the learnable linear transformation matrices for the query, key, and value, respectively. In the GSA block, the self-attention mechanism is used to calculate attention, which can be expressed as follows: Here, the Softmax function is used for normalization to calculate the weight of each key vector, which is then multiplied by the value vector to obtain the self-attention result of that head, and $\sigma$ is used for the linear transformation of the results. Finally, the self-attention results obtained from all heads are weighted and fused and added to the original embedding features to obtain the output of the GSA block. The GSA block can be represented as a whole as follows:
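As an illustration of the computation described above, the following sketch implements a generic multi-head self-attention block with a residual connection. It is an assumption-laden stand-in, not the exact GSA formulation: the scores are scaled by the square root of the head dimension as in standard scaled dot-product attention, the heads are concatenated before a single output projection (playing the role of $\sigma$), and the weights are random. Names such as `gsa_block` and `rand_matrix` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention_head(y, w_q, w_k, w_v):
    """Softmax(Q K^T / sqrt(d)) V for one head; the scaling is an assumption."""
    q, k, v = matmul(y, w_q), matmul(y, w_k), matmul(y, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, list(zip(*k)))]
    return matmul(softmax_rows(scores), v)


def gsa_block(y, heads, w_out):
    """Concatenate all heads, project linearly, and add the input (residual)."""
    per_head = [self_attention_head(y, *h) for h in heads]
    concat = [sum((h[i] for h in per_head), []) for i in range(len(y))]
    projected = matmul(concat, w_out)
    return [[a + b for a, b in zip(r, p)] for r, p in zip(y, projected)]


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


S, D, HEADS = 5, 4, 2                      # S embeddings of width D
y = rand_matrix(S, D)
heads = [tuple(rand_matrix(D, D // HEADS) for _ in range(3)) for _ in range(HEADS)]
w_out = rand_matrix(D, D)
z = gsa_block(y, heads, w_out)
```

Because every embedding attends to every other one, the output of each point mixes information from the whole sample, while the residual term keeps the original multiscale local features.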

E. Implementation Details
All experiments were conducted on a workstation with 64 GB of memory, an Intel Core i7-12700K processor, and an NVIDIA GeForce RTX 3090. In the experiments, we utilized cross-entropy as the loss function and Adam [49] as the optimizer, with an initial learning rate set to 0.001, and trained the model for 500 iterations. During the model training process, we saved and evaluated the best-performing model on the test data. To quantitatively assess the performance of the proposed network, we employed six commonly used metrics: overall accuracy (OA), average precision (AP), average recall (AR), average F1-score (A-F1), mean intersection over union (MIoU), and Kappa coefficient [50].
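For reference, the six metrics can all be derived from a confusion matrix. The sketch below uses the standard textbook definitions; the function name, the dictionary keys, and the two-class toy matrix are illustrative assumptions and not data from the experiments.

```python
def segmentation_metrics(conf):
    """conf[t][p] = number of points with true class t predicted as class p."""
    n = len(conf)
    total = sum(map(sum, conf))
    tp = [conf[c][c] for c in range(n)]
    true_n = [sum(conf[c]) for c in range(n)]                        # row sums
    pred_n = [sum(conf[r][c] for r in range(n)) for c in range(n)]   # column sums

    def ratio(num, den):
        return num / den if den else 0.0

    precision = [ratio(tp[c], pred_n[c]) for c in range(n)]
    recall = [ratio(tp[c], true_n[c]) for c in range(n)]
    f1 = [ratio(2 * p * r, p + r) for p, r in zip(precision, recall)]
    iou = [ratio(tp[c], true_n[c] + pred_n[c] - tp[c]) for c in range(n)]

    oa = ratio(sum(tp), total)
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ratio(sum(t * p for t, p in zip(true_n, pred_n)), total ** 2)
    kappa = ratio(oa - expected, 1.0 - expected)
    return {
        "OA": oa,
        "AP": sum(precision) / n,
        "AR": sum(recall) / n,
        "A-F1": sum(f1) / n,
        "MIoU": sum(iou) / n,
        "Kappa": kappa,
    }


# Toy two-class confusion matrix (rows: ground truth, columns: prediction).
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

OA weights every point equally, whereas MIoU averages over classes, which is why sparse classes such as cars and powerlines affect MIoU much more strongly than OA.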

A. Overall Performance
Due to the limited research on point-based deep learning methods for multispectral LiDAR, several classic point cloud deep learning semantic segmentation algorithms were selected as comparison methods in this study, including PointNet++ [31], GACNet [39], RandLA-Net [51], DGCNN [16], LDGCNN [52], and PointTransformer [53]. These methods have achieved significant results on point clouds and have been widely used for semantic segmentation tasks with point cloud data.
The average accuracy evaluation metrics and the segmentation comparison of the methods are shown in Table I and Fig. 7, respectively. In general, the network architecture proposed in this study achieved state-of-the-art performance in most metrics. Compared with other models, it improved by 2 to 3 percentage points. Our model accurately delineated the contour shapes of different categories in the multispectral LiDAR point cloud scenes, effectively extracting the boundaries of objects when compared with the ground truth. Due to the sparse point cloud density in the multispectral LiDAR point cloud scenes considered in this study, PointNet++ and GACNet, which are suited to indoor point cloud scene semantic segmentation, struggled to effectively extract local information in complex urban scenes, making the extraction and recognition of categories such as powerlines, cars, buildings, and trees difficult. The RandLA-Net and PointTransformer methods involve downsampling operations on the sample point clouds and, while compensating for key point information loss by designing local feature aggregation modules or self-attention layers, still experience some degree of local and global information loss. These two networks exhibited limited recognition capabilities for small impervious ground and grassland regions in the scene, making them inadequate for comprehensive semantic feature modeling of multispectral LiDAR scenes. Due to the similarity of spectral features between the impervious ground and grassland categories, as well as the similarity in height between buildings and trees, many of the comparative methods exhibited lower accuracy when classifying these categories. Conversely, graph-based methods such as DGCNN, LDGCNN, and our proposed model performed well in the semantic segmentation of scenes.
Our model showed higher consistency with the ground truth and fewer misclassified outlier points compared with the other two graph-based methods, demonstrating its strong ability to extract geometric features. Our network achieved an average OA and MIoU of 94.39% and 86.57%, respectively, on the three test datasets. Compared with other methods, our approach achieved the best segmentation performance, demonstrating our method's superiority.

TABLE II SEMANTIC SEGMENTATION RESULTS ON THE MULTISPECTRAL LIDAR DATASET EVALUATED
In particular, Table II reports the intersection over union (IoU) evaluation results of our proposed method for the various classes in TestArea3. Within multispectral LiDAR point cloud scenes, our approach exhibits remarkable segmentation accuracy for grass, buildings, trees, and impervious surfaces compared with previous methods. It is worth noting that our method achieves an OA of 95.82%. However, our model's identification of cars and powerlines falls short of ideal performance. This can be attributed to the diverse shapes observed in the powerline category within multispectral LiDAR scenes and the substantial discrepancies in their spectral characteristics and uneven distribution, resulting in segmentation ambiguity. Generally, our model ranks highly in IoU for several categories, effectively enabling the segmentation and recognition of objects in multispectral LiDAR point cloud scenes.

B. Specific Scenario
As shown in Fig. 8, to better showcase the model's performance, we selected four typical subscenes from the three test sets. We conducted a comparative analysis to highlight the distinctions between our proposed method and DGCNN and PointTransformer. DGCNN is a graph-based approach, and our method was developed from a variant of it. PointTransformer, on the other hand, is a Transformer-based method. Both of these approaches demonstrate remarkable semantic segmentation accuracy. In Scene A, the area represents a grass region in a parking lot. Compared with DGCNN, our model can extract the grass area more accurately, mainly because it constructs the adjacency autocorrelation matrix in the local neighborhood graph, effectively allocating weights to capture the relevance between neighboring points. Similarly, in Scene B, the area consists of complex impervious ground and grassland. DGCNN's EdgeConV block relies on a fixed number of points in its KNN search when constructing the local graph, resulting in relatively limited local geometric features that fail to fully reflect the actual characteristics of the impervious ground and grassland in this scene. In Scene C, the area is mainly composed of buildings and trees. The buildings are surrounded and obscured by unevenly distributed trees. Our model segments edge points between different object types more accurately. For example, DGCNN misclassifies several building edges as tree points in this area, struggling to distinguish building points close to trees effectively. Our model's MSFE block successfully integrates neighborhood graph features extracted at multiple scales by LAF-ConV, enhancing the interclass separability of our model. In Scene D, the area represents a parking lot with

C. Analysis of Computational Cost
We compared our proposed method with other networks regarding network parameter count and computational complexity on the multispectral LiDAR dataset.Table III shows that our method has a higher number of parameters compared with PointNet++, GACNet, RandLA-Net, DGCNN, and LDGCNN networks.Additionally, our method also exhibits relatively higher FLOPs (floating-point operations).Despite having a larger parameter count, our method achieves the highest accuracy regarding OA and MIoU within acceptable hardware constraints, striking a good balance between accuracy and computational cost.
On the other hand, when comparing the approaches with and without the MSFE block and GSA block, it is evident that the majority of model parameters are concentrated in the MSFE block. This is primarily due to the increased computational cost of the multiscale local LAF-ConV.

D. Ablation Experiments
From Table V, as the sample quantity increases, models trained with samples generated using the same sampling strategy exhibit an upward trend in accuracy. When the sample point quantity is set to 4096, the accuracy improves by approximately 0.5% compared with the experiments with 1024 and 2048 points. In other words, the size of the multispectral LiDAR point cloud samples is positively correlated with the accuracy of scene segmentation. On the other hand, when the sample points are all set to 4096, our proposed sampling method achieves the highest semantic segmentation accuracy for multispectral LiDAR compared with random sampling and FPS-KNN. This can be primarily attributed to our method's ability to expand the sample quantity, enhance the expression of scene semantics in multispectral LiDAR point clouds, and better match the density of airborne multispectral LiDAR point clouds.

Compared with GAC [39], AdaptConv [20], and EdgeConV [16], our LAF-ConV model demonstrates superior performance in accuracy evaluation metrics, with the highest achieved MIoU. Hence, our proposed LAF-ConV method facilitates improved capture of the local characteristics of multispectral LiDAR by the network, thereby enhancing its robustness.
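For reference, the OA and MIoU values quoted throughout these ablations can be computed from a confusion matrix. The following is a minimal Python sketch using the standard definitions, which are assumed here to match the metrics reported in the tables; the function names are illustrative and are not taken from the MS-AMCNN implementation.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the reference class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """OA: correctly labeled points divided by all points."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """MIoU: mean over classes of TP / (TP + FP + FN).

    Classes that never occur in the reference or the prediction are skipped
    (an assumption; other conventions count them as zero).
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example with three classes and six points.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(overall_accuracy(cm), mean_iou(cm))
```

Because MIoU averages per-class scores, it penalizes errors on small classes more strongly than OA does, which is consistent with the ablations showing larger differences in MIoU than in OA.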
3) Effectiveness and Performance of MSFE Block: In order to fuse multiscale local geometric features, we designed an MSFE structure to enhance the diversity of local features. In the MS-AMCNN model, the number of LAF-ConV blocks in the MSFE block and the number of KNN neighbor points in each local graph are two key parameters. In our model, the default number of LAF-ConV blocks in the MSFE block is set to 3, corresponding to neighborhood point numbers of 12, 20, and 32, respectively. To set these parameters reasonably, we conducted a series of comparative experiments on Test Area 3 to verify the effectiveness of the multiscale structure in improving network accuracy. The specific evaluation metrics are shown in Table IV.
For the number of LAF-ConV blocks in the MSFE block, we can observe that when the number of LAF-ConV blocks is set to 1 or 2, the network performs poorly compared with the MSFE block with three layers of LAF-ConV blocks, especially in terms of MIoU. When the number of LAF-ConV blocks increases to three, there is a significant improvement in the fusion of local features and in perception capability. This increase also allows for a more diverse range of geometric information among different objects without significantly increasing time consumption. However, when the number of LAF-ConV layers increases to 4, the network's performance decreases. Therefore, moderately integrating multiple LAF-ConV blocks can increase the model's receptive field, whereas excessive stacking may negatively impact the model's performance. Consequently, the optimal setting that balances performance and effectiveness is determined to be three layers.

4) Number of Neighboring Points in LAF-ConV: The number of selected neighboring points in the LAF-ConV block significantly impacts the extraction of local features in multispectral LiDAR data. In the MSFE block, we observed that the network's performance improves as the number of neighborhood points in the LAF-ConV blocks increases beyond 8. However, when the number of neighborhood points reaches 16, there is no significant improvement in network accuracy compared with when the number of neighborhood points is 12. Instead, it increases both time and memory consumption. Since the geometric information constructed by the two different neighborhood point settings may be similar, we chose the smaller option of 12 neighborhood points to minimize the cost. Additionally, when the number of neighborhood points is set to 20 or 32, there is a significant improvement in network performance. Therefore, based on the MSFE block with three layers of LAF-ConV blocks, we select 12, 20, and 32 as the KNN search points for the three local neighborhood graphs, respectively.
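To make the multiscale neighborhood setting concrete, the following minimal Python sketch builds three KNN index lists per point with k = 12, 20, and 32. It only illustrates the neighbor search; the LAF-ConV adjacency matrix construction is not reproduced. The brute-force search, the function name, and the choice to count the point itself among its neighbors are assumptions made for illustration.

```python
import math


def multiscale_knn_graphs(points, scales=(12, 20, 32)):
    """Return {k: neighbor index lists} for every scale k.

    Brute-force O(N^2 log N) search; each point counts as its own
    nearest neighbor (an assumption, not stated for LAF-ConV).
    """
    graphs = {k: [] for k in scales}
    for p in points:
        # Sort all point indices by Euclidean distance to p once,
        # then take a prefix of that ordering for each scale.
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        for k in scales:
            graphs[k].append(order[:k])
    return graphs


# Toy scene: 40 points on a regular grid.
points = [(float(i % 8), float(i // 8), 0.0) for i in range(40)]
graphs = multiscale_knn_graphs(points)
print({k: len(neigh[0]) for k, neigh in graphs.items()})
```

Since every scale is a prefix of the same distance ordering, the 12-neighbor graph is nested inside the 20- and 32-neighbor graphs, so larger scales extend the local context rather than replacing it.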
5) Effectiveness and Performance of GSA Block: Despite integrating multiscale local geometric features in the MSFE block, we introduce the GSA block to perceive global contextual semantic information more comprehensively. We investigate the impact of different global feature learning methods based on self-attention mechanisms on network accuracy. As shown in Fig. 9, we compare and analyze the self-attention mechanism modules in P-A [54], A-SCN [55], and PCT [56] through experiments. This examination explores their contributions to global feature learning and demonstrates, through experimental evidence, the influence of various global feature learning approaches on model accuracy.
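As background for this comparison, the sketch below shows generic scaled dot-product self-attention over per-point features in plain Python. It is not the GSA block, nor any of the P-A, A-SCN, or PCT modules: the learned query, key, and value projections are replaced by the identity, which is a simplification assumed here purely to keep the example short.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention on an N x d feature list.

    Query, key, and value are all the input features (identity
    projections, a hypothetical simplification).
    """
    d = len(features[0])
    scale = math.sqrt(d)
    output = []
    for q in features:
        # Similarity of this point to every point in the sample.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in features]
        weights = softmax(scores)
        # Each output feature is a weighted sum over all points.
        output.append([sum(w * v[c] for w, v in zip(weights, features)) for c in range(d)])
    return output


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(feats))
```

Every output row mixes information from all N points, which is why such a module can supply the global context that purely local graph convolutions lack.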
As shown in Fig. 10, when the network does not use a self-attention mechanism, the segmentation's OA is not directly affected, but the model produces more mis-segmentation results, yielding an MIoU of only 86.90%. Therefore, it is crucial to employ a self-attention mechanism when fusing local geometric information for global modeling in the MSFE block. When we replaced the GSA block with other self-attention mechanisms, the model's performance declined further, with both the OA and MIoU of the segmentation results being lower than those obtained using the GSA block, which achieved an OA of 94.52% and an MIoU of 90.38%. These results confirm the significant advantage of the transformer-based GSA block in modeling global contextual information.

Although this study achieves satisfactory segmentation accuracy on multispectral LiDAR, the network's computational cost and memory consumption are relatively high due to the involvement of multiscale processing and self-attention mechanism learning. Therefore, future directions include designing a lightweight, high-precision network to handle large-scale multispectral LiDAR point cloud scenes. Additionally, this experiment is limited to the Titan multispectral LiDAR dataset, which includes only three spectral channels, far fewer than the corresponding hyperspectral imaging channels. The potential of the rich spectral information of hyperspectral LiDAR for effective scene classification remains to be explored.

Index Terms-Deep learning, graph convolution, multiscale structure, multispectral LiDAR, point cloud segmentation, self-attention mechanism.

1) Data Normalized: In order to accelerate the convergence of the network, normalization of the original data is necessary. Each point in the data is represented by a 9-D vector (X, Y, Z, R, G, B, X', Y', Z'), where X, Y, Z represent the position coordinates of each point in the scene, ranging over [−1, 1]; R, G, B represent the intensity values of the 1550, 1064, and 532 nm channels of the Titan point cloud, ranging over [0, 1]; and X', Y', Z' represent the position coordinates of each point relative to its location in the scene, ranging over [0, 1]. Titan multispectral point cloud data are normalized using the mapminmax function after removing the offset. The data normalization is implemented as follows:

1) Randomly select one point from the input multispectral point cloud scene as the initial point. Then, use this point as the center of KNN to search for the k1 and k2 nearest neighbors in its neighborhood. To minimize the loss of multispectral LiDAR point cloud scene features during the sampling process, we establish a reasonable number of sampling points based on the density of the multispectral LiDAR point cloud in the dataset. The rationality of this parameter setting has been validated through ablation experiments to ensure that the sampling points adequately represent the characteristics of the entire multispectral LiDAR scene. In this experiment, k1 is set to 4096, and k is set to 1024, both including the point itself. Use the k nearest neighbors as a training sample and remove the k nearest neighbors from the point cloud scene. During the generation process of multispectral LiDAR point cloud samples, it is essential to record the index number of each point in the scene to determine the presence of duplicate regions within the samples.
2) Calculate the 3D distance from the previous seed point to the remaining point cloud scene, and designate the point with the farthest spatial distance as the next seed point. Repeat the operation in step 1) to obtain another sample.
3) Iterate the operation in step 2) until the samples cover the entire point cloud scene, and obtain a fixed number of output samples.

For point cloud samples in overlapping areas, the method uses the class with the highest predicted count for each duplicated point as the final segmentation result. Compared with the original FPS-KNN sampling method, this method can effectively perform data augmentation and expand the scene's sample size.

IV. METHODOLOGY

Point clouds have plenty of 3D spatial features, which can intuitively describe the characteristics of natural spatial objects. However, point clouds are characterized by a discrete and uneven distribution, which creates the need for a topological relationship between points. Standard convolution operators cannot handle disordered point cloud features. Inspired by previous related research, we propose a novel approach to obtain topological relationships between points more effectively and to better perceive the 3D spatial information and local semantic features of points. This network enhances the topological information of point clouds by modifying the structure of local adjacency graphs. Furthermore, it utilizes a multiscale structure to improve point cloud segmentation efficiency and the ability to obtain local geometric features. The designed network mainly consists of three key components:
(1) A point cloud local adjacency matrix feature convolution that fully uses the spatial relationship between 3D points to extract local features effectively.
(2) An MSFE block that performs feature extraction on points at diverse levels and aggregates multidimensional features to further encode local features.
(3) A GSA block that utilizes the excellent global feature learning ability of the self-attention mechanism to enhance the global feature from the multiscale feature block.

Fig. 3. Proposed MS-AMCNN.
The network comprises multiple MSFE blocks, GSA blocks, and MSFE-GSA operations stacked together to output the point cloud's semantic segmentation results end-to-end during training. The MSFE-GSA operation effectively combines the proposed MSFE and global self-attention mechanism. Adopting a multiscale local graph feature extraction method containing LAF-ConV can effectively extract the local features of the multispectral LiDAR point cloud. The MSFE block aggregates the multiscale features, while the GSA block enhances global contextual features. Following multiple MSFE-GSA operations, global features are extracted through the max-pooling layer. After that, global features and local features are fused and passed through fully connected layers to output each point's label. The network's details are described in the following.

First, a multispectral LiDAR point cloud sample of dimensions N × d (excluding batch dimension) is inputted into the network, where N represents the number of points in the sample and d represents the initial feature dimension of the input points. The initial features undergo processing by the MSFE-GSA operation layer, resulting in local features of dimensions N × 32. Within the MSFE-GSA operation, a multiscale local graph is established on the input multispectral LiDAR point cloud samples through KNN nearest neighbor search. For each local graph, local adjacency matrices are constructed using the LAF-ConV approach to acquire local features. Using MLP and pooling operations, the multiscale local features are connected to generate the output MSFE features. These features are then fed into the GSA block for global feature modeling, thereby obtaining the output results of the graph convolutional layers. Subsequently, the features extracted by the previous MSFE-GSA operation layer are used as inputs to the subsequent MSFE-GSA operation layer, resulting in two levels of local features, each with dimensions of N × 64. Like a CNN network, the three local features are fused and inputted into an MLP layer, resulting in a global feature of dimensions N × 1024, further enhanced by a max-pooling layer to obtain a global descriptor. The 1-D global descriptor is then repeated to expand to each point, resulting in a new feature of dimensions N × 1024. Finally, the global descriptor is concatenated with the previous three local features to obtain the fused global and local features with dimensions of N × 1184.
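The feature dimensions above can be checked with a small Python sketch of the pooling, repeat, and concatenation steps: 32 + 64 + 64 local channels plus 1024 global channels give 1184. The MLP that produces the N × 1024 feature is not modeled and is replaced by placeholder values; the helper names and the concatenation order are assumptions for illustration.

```python
def max_pool(features):
    """Channel-wise maximum over all points: N x C -> C."""
    return [max(column) for column in zip(*features)]


def fuse_global_local(local_sets, pointwise_global):
    """Repeat the pooled global descriptor for each point and
    concatenate it with the per-point local features."""
    descriptor = max_pool(pointwise_global)
    fused = []
    for i in range(len(pointwise_global)):
        row = list(descriptor)        # repeated global descriptor
        for local in local_sets:      # local features of this point
            row.extend(local[i])
        fused.append(row)
    return fused


n = 4  # tiny N for illustration
local_32 = [[0.1] * 32 for _ in range(n)]
local_64a = [[0.2] * 64 for _ in range(n)]
local_64b = [[0.3] * 64 for _ in range(n)]
# Placeholder for the N x 1024 MLP output (values are arbitrary).
global_1024 = [[float(i)] * 1024 for i in range(n)]

fused = fuse_global_local([local_32, local_64a, local_64b], global_1024)
print(len(fused), len(fused[0]))
```

The max-pooling step is what makes the global descriptor independent of point order, while the repeat and concatenation give every point access to both its own local features and the scene-level summary.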

Fig. 4. Local adjacency feature convolution block on the local graph.

Fig. 7. Results of different segmentation methods in the context of testing scenarios.

Fig. 8. Comparison of segmentation methods in different scenes (differences highlighted with black boxes).
In Scene D, although our model still faces challenges in recognizing and extracting the edges of the car and impervious ground categories, where some boundaries that belong to the impervious ground are misclassified as buildings, it still outperforms DGCNN. It should be noted that PointTransformer exhibits unsatisfactory performance in differentiating between impervious surfaces and grass, particularly in Scenes A and B, with the issue being most pronounced in Scene C. Similarly, in Scene D, PointTransformer wrongly classifies impervious surfaces in parking lots as buildings. These segmentation disparities are validated in Table II. Due to the lower point density generated by airborne multispectral LiDAR and the point cloud downsampling operations utilized in PointTransformer, RandLA-Net, and similar networks, the learned features fail to capture local semantic information adequately. These detailed visualization results further validate the effectiveness of our model in capturing spatial geometric structures and extracting local features.

Fig. 9. Architecture of various self-attention mechanisms in 3D point cloud processing.

Fig. 10. Comparison of model evaluation metrics using different self-attention mechanisms.
Jian Yang, Binhan Luo, Ruilin Gan, Ao Wang, Shuo Shi, Member, IEEE, and Lin Du

on deep learning-based segmentation of multispectral LiDAR point clouds encompasses a comprehensive exploration of global and local representative feature acquisition. Nevertheless, in the context of multispectral LiDAR point cloud scenes, the necessity of achieving multiscale adaptive point cloud feature extraction becomes remarkably prominent, owing to the uneven distribution of geometric spatial patterns and spectral information.

1) Effectiveness of Optimized FPS-KNN: Different quantities of training samples from multispectral LiDAR point clouds reflect varying scene semantics, object continuity, and integrity. To validate the effectiveness of the improved FPS-KNN method, we tested the model's performance under different training sample configurations. Considering the limitations of GPU memory, we set the maximum sample quantity to 4096, consistent with the settings of other datasets such as ModelNet40. Additionally, we compared the two sample generation methods: random sampling [51] and FPS-KNN [21].
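The optimized FPS-KNN sample generation evaluated here (seed point, KNN sample, farthest-point reseeding, and majority voting on overlaps) can be sketched in Python as follows. This is one possible reading of the procedure, not the authors' code: it assumes that the larger neighborhood (`k_sample`) forms the training sample while only the smaller one (`k_remove`) is removed from the scene, and that neighbors are searched over the full scene so every sample has a fixed size. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def generate_samples(points, k_sample, k_remove, seed=0):
    """Cover a scene with overlapping, fixed-size KNN samples."""
    if not 0 < k_remove <= k_sample <= len(points):
        raise ValueError("expected 0 < k_remove <= k_sample <= len(points)")
    rng = random.Random(seed)
    remaining = set(range(len(points)))        # points not yet removed
    seed_idx = rng.choice(sorted(remaining))   # step 1: random initial seed
    samples = []
    while remaining:
        center = points[seed_idx]
        # KNN over the full scene (assumption) keeps the sample size fixed.
        order = sorted(range(len(points)), key=lambda j: math.dist(center, points[j]))
        samples.append(order[:k_sample])       # indices recorded per sample
        remaining.difference_update(order[:k_remove])
        remaining.discard(seed_idx)            # seed always dropped, so the loop terminates
        if remaining:
            # Step 2: farthest remaining point from the previous seed.
            seed_idx = max(sorted(remaining), key=lambda j: math.dist(center, points[j]))
    return samples


def vote_labels(samples, sample_predictions, num_points):
    """Majority vote per point across overlapping samples."""
    votes = [Counter() for _ in range(num_points)]
    for indices, preds in zip(samples, sample_predictions):
        for i, label in zip(indices, preds):
            votes[i][label] += 1
    # Points never covered by any sample get None.
    return [v.most_common(1)[0][0] if v else None for v in votes]


scene = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
samples = generate_samples(scene, k_sample=8, k_remove=4)
print(len(samples), "samples of size", len(samples[0]))
```

Removing fewer points than each sample contains is what produces the overlap between neighboring samples, and therefore both the larger sample count and the need for voting at inference time.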

TABLE III
FLOPS AND PARAMETERS OF DIFFERENT METHODS ON MULTISPECTRAL LIDAR DATASETS ("M" AND "G" FOR MEGABYTES AND GIGABYTES)

TABLE IV
EVALUATION METRICS FOR DIFFERENT PARAMETERS OF THE MSFE BLOCK ON TEST AREA 3

TABLE V
TEST RESULTS OF MS-AMCNN WITH DIFFERENT SAMPLING METHODS ON TEST AREA 3

2) Effectiveness and Performance of LAF-ConV: To further extract localized geometric and spectral information from multispectral LiDAR point clouds, we propose the LAF-ConV method based on extracting features from the localized adjacency matrix. As presented in Table III, we investigate the effectiveness of the LAF-ConV model by substituting the conventional convolutional kernels in our model with alternative graph convolutional kernels from established literature.