RT-DLO: Real-Time Deformable Linear Objects Instance Segmentation

Deformable linear objects (DLOs), such as cables, wires, ropes, and elastic tubes, are widespread in both domestic and industrial environments. Unfortunately, robotic systems handling DLOs are rare and have limited capabilities because DLOs are challenging to perceive. Hence, we propose a novel approach named RT-DLO for real-time instance segmentation of DLOs. First, the DLOs are semantically segmented from the background. Afterward, a novel method to separate the DLO instances is applied. It generates a graph representation of the scene from the semantic mask, where the graph nodes are sampled from the DLOs' center-lines and the graph edges are selected based on topological reasoning. RT-DLO is experimentally evaluated against both DLO-specific and general-purpose instance segmentation deep learning approaches, achieving overall better performances in terms of accuracy and inference time.


Alessio Caporali, Kevin Galassi, Riccardo Zanella, and Gianluca Palli are with the DEI - Department of Electrical, Electronic and Information Engineering, University of Bologna, 40136 Bologna, Italy (e-mail: alessio.caporali2@unibo.it; kevin.galassi2@unibo.it; riccardo.zanella2@unibo.it; gianluca.palli@unibo.it).
Bare Luka Žagar and Alois C Knoll are with the Chair of Robotics, Artificial Intelligence and Real-Time Systems, Technical University of Munich, 85748 München, Germany (e-mail: bare.luka.zagar@tum.de; knoll@mytum.de).
This article has supplementary material provided by the authors and color versions of one or more figures available at https://doi.org/10.1109/TII.2023.3245641.
Digital Object Identifier 10.1109/TII.2023.3245641

I. INTRODUCTION
DEFORMABLE linear objects (DLOs) belong to the generic class of deformable objects and consist of wires, cables, strings, ropes, and elastic tubes, as the main relevant examples according to [1]. Although vastly present in every domestic and industrial environment, DLOs still represent a problematic task for automated robotic systems, both at perception and manipulation levels [1]. From the perception side, this is a result of the lack of any specific shape, color, texture, or feature making them easily distinguishable with respect to other objects.
In addition, DLOs are characterized by small diameters, posing an additional challenge for their 3-D perception with most sensors [2]. From the manipulation side, the DLOs' intrinsic deformability results in a high-dimensional state space with complex and nonlinear dynamics. Thus, modeling and predicting their behavior during a manipulation task is challenging [3], [4]. The problem of DLO segmentation is usually addressed in simple settings, like color thresholding with a single DLO instance [5] or markers [6]. In recent years, several DLO-specific approaches have addressed the instance segmentation problem more thoroughly, e.g., [7], [8], [9], with remarkable improvements at the introduction of every novel approach. Although the very recent method named FASTDLO [9] reaches good accuracy results with reasonable computation time, i.e., above 20 frames per second (FPS), it is still far from being real-time capable. In the domain of general-purpose deep convolutional neural networks (DCNN) tackling the instance segmentation task, several real-time-capable approaches exist, e.g., YOLACT [10] and YOLACT++ [11]. However, applying these methods directly to DLO-like objects usually does not guarantee satisfactory accuracy results [9].
To mitigate the aforementioned drawbacks and challenges, we propose a real-time-capable and highly accurate algorithm for instance segmentation of DLOs, named real-time instance segmentation of deformable linear objects (RT-DLO). In Fig. 1, the plot of FPS versus accuracy shows how RT-DLO stands against the competition, being the fastest DLO-specific approach and the most accurate overall on the test set of [9].
RT-DLO does not require any assumption about the background or the number of DLOs present in the scene. It takes an RGB image as input and provides as output a pixel-mapped colored mask where each DLO is represented by a unique color identifying its ID. In addition, since the DLO instances are modeled as sequences of key-points, a representation of the scene with spline curves can be easily obtained, e.g., for manipulation tasks employing a state-space representation different from the image space [3].
First, as a preprocessing step, the input RGB image is propagated through a DCNN trained on synthetically generated data aiming at segmenting the background, i.e., pixels not representing a DLO, and providing as output a binary mask. Then, a graph representation of the scene is constructed by efficiently sampling the vertices from the segmentation mask. The edges connecting the graph's vertices are instead computed by reasoning about the topology expressed by the mask, with an approach that considers both the proximity and orientation constraints among the vertices. Ideally, at most two edges per vertex should be sampled. In the case of intersections of DLOs resulting in the presence of high-degree vertices in the graph, subgraphs around the target vertices are extracted and further processed to disentangle the DLOs in the graph. Finally, the single DLOs are extracted from the graph based on an analysis of its connectivity. RT-DLO achieves a processing rate higher than 30 FPS with an input image of 640 × 360 pixels. To summarize, the main contributions of this article are as follows.
1) First instance segmentation approach concerning DLOs able to reach a processing rate higher than 30 FPS, i.e., real-time capable.
2) Robust graph-based enhanced representation of the DLOs configuration in the scene given the segmentation mask.
3) Improved overall performance compared to several baselines, i.e., +2.9% intersection over union (IoU) with +7 FPS compared to [9] and +3.4% IoU with +32 FPS compared to [8].
The source code implementing RT-DLO and the associated data are available at https://github.com/lar-unibo/RT-DLO.

II. RELATED WORK
A. Real-Time Instance Segmentation
The instance segmentation task consists in predicting object-wise segmentation masks. Remarkable results in this challenging task were achieved by mask region-based convolutional neural network (R-CNN) [12] with its detect-and-segment approach. However, due to this two-phase method, mask R-CNN is not real time. Recent approaches for instance segmentation of general objects are [10], [11], [13], [14], [15]. Among those, only [10], [11], and [15] are capable of real-time performances. However, their applicability to DLOs requires attention due to the challenges highlighted in Section I. Also considering the dataset supply problem, satisfactory results were obtained only concerning the semantic segmentation task [16] and not for the instance segmentation one. Indeed, the performances of these methods are affected by the DLO instances lacking distinctive embeddings. On the contrary, thanks to the high-level abstraction provided by a graph representation of the DLOs, RT-DLO can achieve better performances and robustness.

B. Segmentation of DLOs
The limited adoption of automatic or robotic solutions in manufacturing and assembly tasks dealing with DLOs has made the perception of such objects an important research topic of the last decade. In the past, simplifying assumptions were usually made, e.g., knowledge of the background [3], [5], [17], number of DLOs in the scene [5], markers [6].
Specific to DLOs, the first approach tackling complex backgrounds is Ariadne [7], which employs a convolutional neural network (CNN) for DLO endpoints detection and a walking algorithm along the superpixels originated from the image. Ariadne+ [8] improves on Ariadne in accuracy by employing a DCNN for the background semantic segmentation, removing the need for endpoint detection and thus also significantly reducing the processing time. In Ariadne+, a graph representation of the scene is obtained by exploiting a superpixel-based approach where the graph nodes are selected based on the superpixels' centroids and the edges based on the overlap of the superpixels' contours.
Recently, FASTDLO [9] was introduced, employing a skeleton-based approach on the segmentation mask and a similarity network for the correct interconnection of DLO segments. FASTDLO is currently the state-of-the-art approach for instance segmentation of DLOs, achieving a processing rate of more than 20 FPS.
RT-DLO employs an efficient and informative graph representation of the scene, as opposed to the skeleton-originated segment-based approach of FASTDLO and the superpixel-based one of Ariadne+, resulting in faster processing times and improved accuracy, especially at the DLOs' intersections. Indeed, RT-DLO can handle degraded masks more effectively since the continuity of the segmentation mask foreground along a DLO is not required.

III. METHOD
The idea exploited in RT-DLO is to model the current configuration of the DLOs present in the image with a graph structure $G = (V, \mathcal{E})$ and then to extract the DLO instances from the obtained graph. The approach, schematized in Fig. 2, can be subdivided into six main steps, as follows.
A) Mask generation: Obtaining the binary mask $M_b$ of the scene from the RGB image.
B) Vertices: Sampling the graph vertices from $M_b$ and estimating their local orientations.
C) Edges: Selecting the graph edges based on proximity and orientation constraints.
D) Intersections processing: Resolving the ambiguous high-degree vertices at the DLOs' intersections.
E) DLOs instances extraction: Extracting each DLO from the graph based on its connectivity.
F) Intersections layout: Assessing the correct instances locally at the intersections.
In the rest of this section, the procedures for obtaining the graph representation and coherently extracting the DLO instances from it are presented. First, the generation of the binary mask $M_b$ is discussed in Section III-A. Then, concerning the graph formation process, the vertices are examined in Section III-B and the edges in Section III-C. Thereafter, the algorithm employed for processing problematic regions of the graph is provided in Section III-D. Finally, the extraction of the DLO instances, given the graph representation, is presented in Section III-E, while the approach for analyzing their layout is in Section III-F.

A. Mask Generation
The mask generation step can be considered a preprocessing phase of RT-DLO since the graph representation of the DLOs is obtained employing only the binary mask $M_b$ of the scene and not the RGB image. In this article, we use a DCNN, specifically DeepLabV3+ [18], trained on synthetically generated data [9]. This choice is convenient since 1) good performances are shown in [9] concerning the semantic segmentation capabilities of this method; 2) it simplifies the comparison of RT-DLO against the baseline methods. The binary mask $M_b$ is then obtained by setting the pixels predicted to belong to a DLO to 1 and the remaining ones to 0.
It is worth mentioning that RT-DLO is independent of the method used to obtain the semantic segmentation mask. Different approaches can be employed depending on application requirements.

B. Vertices
First, the vertices of the graph $G$ are sampled from the binary mask $M_b$ and then characterized in terms of local orientation by a CNN.
1) Vertices Sampling: The set $V = \{v_i\}_{i=1}^{n}$ contains the $n$ vertices of the graph, efficiently sampled from the binary mask $M_b$. First, the distance transform operator is executed on $M_b$, obtaining $M_{dist}$ (see Fig. 3). This operator computes the Euclidean distance between each nonzero pixel of $M_b$ and the nearest boundary (zero/black) pixel [19], thus assigning an intensity value to each pixel based on the computed distance. Then, $M_{dist}$ is dilated with a small square kernel (i.e., 3 × 3). Dilation is a maximum-locating morphological operation: as the kernel slides over the target image, the maximal pixel value overlapped by the kernel is computed and the image pixel at the anchor position is replaced with it. Dilation is usually applied to binary masks to enlarge the foreground (white) portion. Instead, in this article, it is applied to $M_{dist}$, which contains intensity values, i.e., $M_{dist}$ is not binary, obtaining $M_{dil}$. The local maxima of $M_{dist}$ are retrieved by comparing $M_{dist}$ and $M_{dil}$ pixel-wise, masked using $M_b$, as follows:

$$M_{max}(i, j) = \begin{cases} 1 & \text{if } M_{dist}(i, j) = M_{dil}(i, j) \text{ and } M_b(i, j) = 1 \\ 0 & \text{otherwise.} \end{cases}$$

Indeed, if the value of pixel $(i, j)$ is the same in $M_{dist}$ and $M_{dil}$, the considered pixel is a local maximum. By assigning the pixel value of 1 to the maxima and 0 to the rest of the pixels, a new mask is obtained, denoted with $M_{max}$ and illustrated in Fig. 3(c). It is worth mentioning that, by construction, $M_{max}$ approximates the center lines of the DLOs in the mask.
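The local-maximum extraction described above can be sketched on a toy mask as follows. This is a minimal illustration in plain Python and not the authors' implementation: the brute-force `distance_transform` stands in for the operator of [19], and all function names are hypothetical.

```python
import math

def distance_transform(mask):
    """Brute-force Euclidean distance of each foreground pixel to the nearest
    background pixel (assumes the mask contains at least one background pixel)."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if mask[r][c] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                dist[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
    return dist

def dilate3x3(img):
    """Grey-scale dilation: each pixel becomes the maximum of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    return [[max(img[rr][cc]
                 for rr in range(max(0, r - 1), min(h, r + 2))
                 for cc in range(max(0, c - 1), min(w, c + 2)))
             for c in range(w)] for r in range(h)]

def local_maxima(mask):
    """M_max: 1 where M_dist equals M_dil inside the foreground, 0 elsewhere."""
    dist = distance_transform(mask)
    dil = dilate3x3(dist)
    h, w = len(mask), len(mask[0])
    return [[1 if mask[r][c] and dist[r][c] == dil[r][c] else 0
             for c in range(w)] for r in range(h)]

# Toy mask: a horizontal band three pixels thick (rows 1 to 3 of a 5x7 image).
mask = [[0] * 7, [1] * 7, [1] * 7, [1] * 7, [0] * 7]
m_max = local_maxima(mask)
```

On this toy band, only the middle row survives in `m_max`, which matches the observation that the maxima approximate the center line.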
The set of maximum pixels of $M_{max}$, i.e., pixels whose value is equal to 1, is denoted as $V_{max}$. The cardinality of $V_{max}$ is relatively large and not tractable when real-time performance is sought. Thus, the farthest point sampling algorithm [20] is employed for down-sampling $V_{max}$. A sampling ratio $\alpha \in [0, 1]$ is used to specify the amount of down-sampling: the set of vertices $V$ of the graph $G$ is obtained by retaining a fraction $\alpha$ of the points of $V_{max}$. In Fig. 3(d), the vertices extracted from the sample mask of Fig. 3(a) with $\alpha = 0.15$ are depicted.
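A minimal sketch of farthest point sampling with a ratio $\alpha$ is given below. It assumes a greedy variant that starts from the first point of the list (an arbitrary choice, not taken from the article), and the name `farthest_point_sampling` is hypothetical.

```python
import math

def farthest_point_sampling(points, alpha):
    """Keep about alpha * len(points) points, greedily maximizing their spread."""
    n_keep = max(1, round(alpha * len(points)))
    chosen = [0]  # assumption: start from the first point
    # Distance of every point to the closest already-chosen point.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < n_keep:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return [points[i] for i in chosen]

# Toy V_max: 20 center-line pixels along a straight line.
v_max = [(x, 0) for x in range(20)]
vertices = farthest_point_sampling(v_max, alpha=0.15)
```

With 20 candidate pixels and $\alpha = 0.15$, three vertices are kept: the two ends of the line and a point near its middle.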
2) Vertices Orientations: In the context of linear objects and linear shape representation, an orientation can be assigned to each vertex of the graph. The objective is to describe locally the section of the linear object in the vicinity of the vertex as an orientation attribute of the vertex itself. Thus, the local orientation $\theta$ of a given vertex at pixel coordinates $(x, y)$ is derived from a local patch of size $\delta \times \delta$ pixels, centered at $(x, y)$ and with intensity values extracted from the distance transform image $M_{dist}$.
A CNN is used to estimate an angular value from a given patch. Predicting an angular value via a learning-based method can become quite a complex task because the periodicity of angular data yields inaccurate distance representations when computing the loss function. Indeed, an angle of 2° describes an orientation quite close both to 5° and to 179°, although the corresponding loss values when applying common losses, e.g., L1-loss or mean square error loss (MSE-loss), are quite different. An approach pioneered in [21] is thus employed to address the angular periodicity and ambiguity in the loss computation. A given angular value $\theta$ in the range [0°, 180°] is encoded as a 180-D vector with entries defined by applying a Gaussian function centered at $\theta$ and with variance $\sigma$. In this way, the angle $\theta$ is propagated smoothly to its neighborhood, which benefits the loss computation. The network structure is composed of two convolutional layers followed by a fully connected linear layer. Each convolutional layer comprises a 2-D convolution followed by batch normalization. Between the two layers, a max-pooling operation takes place. After the convolutional layers, the embedded data are flattened and the fully connected layer is used as output to classify the patch into the 180-D vector. Binary cross entropy is used as loss function during training, effectively shaping the learning task as a classification problem of the angular value into one of the 180 available classes. Consequently, the predicted angle is easily obtained from the 180-D vector as the index associated with the maximum probability. This angular value characterizes the orientation of the vertex associated with the processed patch.
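The smooth angle labeling and its decoding can be illustrated with the short sketch below. The wrap-around of the Gaussian with a period of 180° and the spread value `sigma=3.0` (treated here as a standard deviation in degrees) are assumptions made for the illustration; they are not values stated in the article.

```python
import math

N_BINS = 180  # one class per degree

def encode_angle(theta_deg, sigma=3.0):
    """Soft label: Gaussian bump centered at theta, wrapped with period 180 degrees."""
    label = []
    for k in range(N_BINS):
        diff = abs(k - theta_deg) % N_BINS
        diff = min(diff, N_BINS - diff)  # circular distance between bin and angle
        label.append(math.exp(-0.5 * (diff / sigma) ** 2))
    return label

def decode_angle(scores):
    """Predicted angle: index of the entry with the maximum score."""
    return max(range(N_BINS), key=scores.__getitem__)

label = encode_angle(2)
```

With this encoding, the bins at 5° and 179° receive the same label value for a target of 2°, so both are treated as equally close to it.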

C. Edges
The set $\mathcal{E} = \{e_j\}_{j=1}^{m}$ contains the $m$ edges of the graph. Identifying the correct edges to be inserted in the graph is a complex task. Indeed, the connections between the vertices should consider both their relative proximity and orientation constraints, the latter in the form of vertex orientation and edge orientation. The vertices orientations were described in Section III-B2. For convenience, a matrix $E \in \mathbb{R}^{m \times 2}$ describing the edge set $\mathcal{E}$ as organized tuples is introduced. The relative proximity between vertices is exploited to obtain an initial candidate set of edges, denoted as $\mathcal{E}_{knn} = \{e_j\}_{j=1}^{m_{knn}}$. That is, for each vertex, the $K_{nn}$ nearest neighbors in $V$ are retrieved as edges. The value of $K_{nn}$ is a user-defined parameter and it follows that $m_{knn} = n \times K_{nn}$ if we consider the edges as directed. In addition, $E_{knn} \in \mathbb{R}^{m_{knn} \times 2}$ is the matrix description of $\mathcal{E}_{knn}$. The $K_{nn}$ nearest neighbor case with $K_{nn} = 8$ for a sample vertex is depicted in Fig. 4(a).
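The construction of the candidate edge set can be sketched as follows, assuming a brute-force neighbor search (a spatial index would be preferable for large vertex sets). The function name `knn_edges` is hypothetical.

```python
import math

def knn_edges(vertices, k_nn):
    """Directed candidate edges (source, target): each vertex to its k_nn nearest vertices."""
    edges = []
    for i, p in enumerate(vertices):
        others = sorted((j for j in range(len(vertices)) if j != i),
                        key=lambda j: math.dist(p, vertices[j]))
        edges.extend((i, j) for j in others[:k_nn])
    return edges

vertices = [(0, 0), (1, 0), (2, 0), (3.5, 0), (10, 0)]
e_knn = knn_edges(vertices, k_nn=2)
```

For $n = 5$ vertices and $K_{nn} = 2$, the list contains $m_{knn} = n \times K_{nn} = 10$ directed edges.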
1) Vertex-Vertex Similarity: The orientation constraints between two generic vertices $v_1$ and $v_2$ are evaluated by assigning a score to their connection by means of the cosine similarity, defined as

$$s_{12} = \frac{d_v^1 \cdot d_v^2}{\|d_v^1\| \, \|d_v^2\|}. \quad (1)$$

In particular, $d_v^1$ is obtained as $[\cos(\theta_1), \sin(\theta_1)]^\top$, where $\theta_1$ is the orientation of $v_1$ obtained as in Section III-B2. For $d_v^2$ the derivation is similar. The denominator of (1) is the product of the norms. The cosine similarity is then used to score the orientations of a pair of vertices.
For efficiency reasons, the cosine similarity is evaluated by means of matrix operations. Given the matrix $D_v \in \mathbb{R}^{n \times 2}$ of vertices orientations in the form of direction vectors obtained from the predicted angles, i.e., row $i$ is $d_v^i / \|d_v^i\|$, the cosine similarity between each pair of vertices of the set $V$ can be obtained as

$$S_{v,v} = |D_v D_v^\top| \quad (2)$$

with $S_{v,v} \in \mathbb{R}^{n \times n}$ and $|\cdot|$ denoting the element-wise absolute value.
2) Vertex-Edge Similarity: Similarly to the vertex-vertex case, the matrix $D_e \in \mathbb{R}^{m_{knn} \times 2}$ of edges orientations can be defined. It contains the direction vectors obtained by subtracting the coordinates of the associated vertices, followed by a normalization by their distance. The cosine similarity between each vertex of $V$ and each edge of $\mathcal{E}_{knn}$ is obtained as

$$S_{v,e} = D_v D_e^\top \quad (3)$$

with $S_{v,e} \in \mathbb{R}^{n \times m_{knn}}$ being the obtained similarity matrix between vertices and edges.
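The two similarity matrices can be computed with a few matrix operations, as in the NumPy sketch below. The toy coordinates, angles, and candidate edges are made up for the illustration, and the signed form of the vertex-edge similarity is an assumption consistent with the later use of the score sign.

```python
import numpy as np

# Toy data: three vertices with predicted orientations (degrees) and pixel coordinates.
theta = np.deg2rad([0.0, 10.0, 90.0])
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
E_knn = np.array([[0, 1], [0, 2], [1, 0]])  # candidate directed edges (source, target)

# D_v: unit direction vectors of the vertices, shape (n, 2).
D_v = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# D_e: unit direction vectors of the edges, shape (m_knn, 2).
diff = coords[E_knn[:, 1]] - coords[E_knn[:, 0]]
D_e = diff / np.linalg.norm(diff, axis=1, keepdims=True)

S_vv = np.abs(D_v @ D_v.T)  # (n, n), unsigned vertex-vertex similarity
S_ve = D_v @ D_e.T          # (n, m_knn), signed vertex-edge similarity
```

Because the rows of `D_v` and `D_e` have unit norm, the matrix products directly give cosine similarities without a separate normalization step.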
3) Combining $S_{v,v}$ and $S_{v,e}$: At the current stage, because of the dimension mismatch, it is not possible to combine $S_{v,v}$ [see (2)] and $S_{v,e}$ [see (3)]. Thus, an augmented vertex similarity score matrix $\hat{S}_{v,v} \in \mathbb{R}^{n \times m_{knn}}$ is introduced. This matrix is obtained by mapping the values of $S_{v,v}$ into a vector, employing the entries of $E_{knn}$ as row-column pairs to access $S_{v,v}$. Then, a matrix is constructed by repeating this vector $n$ times along the rows. Notice that this is a valid operation since $S_{v,v}$ is a symmetric matrix. The complete similarity score matrix is obtained as

$$S = \hat{S}_{v,v} \odot S_{v,e} \odot B \quad (4)$$

where $B \in \mathbb{R}^{n \times m_{knn}}$ is the oriented incidence matrix and $\odot$ is the Hadamard product. The matrix $B$ is used to inject into the scores the knowledge of the edge existence (entries 0) and direction (entries ±1), i.e., source vertex to target vertex. This information is very helpful since it allows the discrimination of the edge set based on the sign of the similarity score, i.e., the entries of $S$. An illustration of the two possible situations that can occur is provided in Fig. 4(b). The cosine similarity between the sample vertex 30 and its $K_{nn}$ neighbors can provide either positive values, in case the edge direction vectors and the orientation vector of vertex 30 are both in the green region, or negative values if instead they lie in the red region.
Based on the scores contained in the similarity matrix $S$, a positive and a negative edge are sought for each vertex of $V$, where an edge is characterized as positive or negative according to the sign of the associated score in $S$. Notice that a positive or negative edge may not exist for a given vertex, e.g., for a vertex describing the terminal region of a DLO. In Sections III-C4 and III-C5, the computations to extract the positive and negative edges from the similarity matrix $S$ of (4) are provided.
4) Positive Edges: Let us define $B_+ \in \mathbb{R}^{n \times m_{knn}}$ as the positive incidence matrix where the entries −1 of $B$ are set to zero, i.e., $B_+$ contains values of the set {0, +1}. Let us also define a row vector $d \in \mathbb{R}^{1 \times m_{knn}}$ containing the lengths of the edges. A matrix $D \in \mathbb{R}^{n \times m_{knn}}$ can be created by stacking $d$ $n$ times along the rows. Thus, the entries of $D$ can be filtered based on $B_+$ as $D_+ = D \odot B_+$. Then, a generic entry $(i, j)$ of $S$ is weighted based on the associated edge length, with the weight computed from $d^i_+$, the $i$th row of $D_+$. The matrix containing all the computed weights is denoted as $W_+ \in \mathbb{R}^{n \times m_{knn}}$. The presence of $B_+$ makes $W_+$ sparse since only the entries associated with an entry +1 in $B_+$ have a weight different from zero. It follows that $S_+ = S \odot W_+$, where $S_+$ is the similarity matrix associated with the positive incidence matrix. Finally, an edge, if it exists, is selected for each row of $S_+$ as the edge associated with the maximum entry of $S_+$ along the considered row. Thus, considering the generic vertex $i$, i.e., row $i$ of $S_+$, its positive edge $e^i_+$ is obtained as

$$e^i_+ = \{E_{knn}\}_{j^*}, \quad \text{with } j^* = \operatorname{argmax}(s^i_+), \quad s^{ij^*}_+ > \mu$$

where $s^i_+$ denotes the $i$th row of $S_+$, $\{E_{knn}\}_{j^*} \in \mathbb{R}^{1 \times 2}$ is the row of $E_{knn}$ at index $j^*$ containing the indices of the source and target vertices, and $\mu$ is a small threshold to avoid selecting edges with a very low similarity score.
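The row-wise selection with threshold can be sketched as below, taking an already weighted matrix $S_+$ as input. The score values and the threshold `mu=0.1` are made up for the illustration, and the name `select_positive_edges` is hypothetical.

```python
def select_positive_edges(S_plus, E_knn, mu=0.1):
    """For each vertex (row), pick the candidate edge with the largest score above mu."""
    selected = {}
    for i, row in enumerate(S_plus):
        j_star = max(range(len(row)), key=row.__getitem__)
        if row[j_star] > mu:  # otherwise no positive edge exists for vertex i
            selected[i] = E_knn[j_star]
    return selected

E_knn = [(0, 1), (0, 2), (1, 2)]
S_plus = [
    [0.90, 0.40, 0.00],  # vertex 0: edge (0, 1) wins
    [0.00, 0.00, 0.05],  # vertex 1: best score is below the threshold
    [0.00, 0.00, 0.00],  # vertex 2: no candidate, e.g., a terminal vertex
]
positive = select_positive_edges(S_plus, E_knn)
```

The negative edges can be selected in the same way by taking the minimum entry of each row of $S_-$ instead of the maximum.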
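5) Negative Edges: Following a discussion similar to the one of Section III-C4, let us define $B_- \in \mathbb{R}^{n \times m_{knn}}$ as the negative incidence matrix where the entries +1 of $B$ are set to zero, i.e., $B_-$ contains values of the set {−1, 0}. The weight matrix $W_-$ is obtained analogously to $W_+$. It follows that $S_- = S \odot W_-$, obtaining $S_-$ as the similarity matrix associated with the negative incidence matrix. Finally, an edge, if it exists, is selected for each row of $S_-$ as the edge associated with the minimum entry of $S_-$ along the considered row. The generic edge $e^i_-$ is obtained as

$$e^i_- = \{E_{knn}\}_{j^*}, \quad \text{with } j^* = \operatorname{argmin}(s^i_-)$$

with the threshold condition applied analogously to the positive case.
6) Edge Set: The edges obtained from Sections III-C4 and III-C5 are combined into a single edge set denoted as $\mathcal{E}$, with which the graph $G$ is generated [see Fig. 4(c)].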

D. Intersections Processing
Although the graph $G$ should contain only vertices having a degree, i.e., number of neighbors, of 1 or 2, depending on whether the considered vertex is an endpoint or not, vertices having a higher degree, i.e., 3 or more, are still possible. This happens if the considered vertex is placed at the intersection area between multiple DLOs, resulting in several ambiguous edge connections, e.g., Fig. 4(d).
To address this problem, Algorithm 1 is employed: it detects the problematic vertices, extracts subgraphs around each of them, and finds the correct edges by employing the cosine similarity approach.
In more detail, Algorithm 1 takes as input the graph $G$ just created and provides as output the updated graph $G$ where the ambiguous vertices are removed and their edges redistributed correctly in their local subgraphs. First, the ambiguous vertices are detected as those vertices with a degree larger than 2 and collected in $V_{int}$ (line 1). Then, for each $v$ in $V_{int}$, the neighbor vertices are collected (lines 2 to 5). In case one or more vertices of one set of neighbors overlap with another one, those sets are merged (line 6), grouping all vertices and treating the problematic area as the composition of the original ones. Each resulting set $N$ of neighbors defines a subgraph around the problematic area. For each subgraph defined by the vertices in $N$, the number of connections (edges) to establish, $k_{conn}$, is determined as the integer division between the cardinality of $N$ and 2 (line 9). The combinations of 2 elements of the vertices contained in $N$ are collected in the set $C$ (line 10). These tuples of elements can be considered as edge candidates for the subgraph. For instance, in Fig. 4(e), the candidate edges of the subgraph under analysis are depicted in red (wrong) and green (valid). Thus, an edge solver (line 11) is employed to assign a score to each of those. In particular, given two sample vertices $v_1$ and $v_2$ whose connection is to be scored, the direction $d_e$ of the edge connecting them is computed by subtracting their coordinates and normalizing by their distance. Then, the connection cosine similarity score, similarly to (1), is evaluated from $d_e$ and the vertices orientations $d_v^1$ and $d_v^2$. Notice that the absolute value of the similarity is employed since we are not interested in its sign, but only in its magnitude. Each $(v_i, v_j)$ of $C$ is, therefore, augmented by the computed score $s_{ij}$ as $(v_i, v_j, s_{ij})$ and collected in the set $Z$, which is then sorted based on the score values in descending order (line 12). Finally, an iterative procedure loops through the elements of $Z$ and collects the $k_{conn}$ new edges into $E_{new}$ as those defined by vertices not already assigned to other edges (lines 13 to 19). The sample subgraph analyzed throughout this article is solved, obtaining the final graph depicted in Fig. 5(a).
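The procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' Algorithm 1: the scoring function (product of the absolute cosines between the edge direction and the two vertex orientations) is assumed, ambiguous vertices are excluded from the neighbor sets, and all names are hypothetical.

```python
import math
from itertools import combinations

def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def connection_score(i, j, coords, orient_deg):
    """Assumed score: product of |cos| between edge direction and each vertex orientation."""
    (x1, y1), (x2, y2) = coords[i], coords[j]
    norm = math.hypot(x2 - x1, y2 - y1)
    d_e = ((x2 - x1) / norm, (y2 - y1) / norm)
    score = 1.0
    for v in (i, j):
        d_v = unit(orient_deg[v])
        score *= abs(d_v[0] * d_e[0] + d_v[1] * d_e[1])
    return score

def solve_intersections(adj, coords, orient_deg):
    """adj: dict vertex -> set of neighbors. Returns (ambiguous vertices, new edges)."""
    v_int = [v for v, nb in adj.items() if len(nb) > 2]
    merged = []  # pairwise-disjoint neighbor sets, one per problematic area
    for v in v_int:
        group = set(adj[v]) - set(v_int)
        rest = []
        for other in merged:
            if other & group:
                group |= other  # overlapping neighbor sets are merged
            else:
                rest.append(other)
        merged = rest + [group]
    new_edges = []
    for group in merged:
        k_conn = len(group) // 2
        scored = [(i, j, connection_score(i, j, coords, orient_deg))
                  for i, j in combinations(sorted(group), 2)]
        scored.sort(key=lambda t: t[2], reverse=True)
        used, picked = set(), 0
        for i, j, _ in scored:  # greedy pick of the best non-conflicting candidates
            if picked == k_conn:
                break
            if i not in used and j not in used:
                new_edges.append((i, j))
                used.update((i, j))
                picked += 1
    return v_int, new_edges

# Toy crossing: vertex 0 sits where a horizontal and a vertical DLO intersect.
coords = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}
orient_deg = {0: 45, 1: 0, 2: 0, 3: 90, 4: 90}
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
removed, new_edges = solve_intersections(adj, coords, orient_deg)
```

In this toy crossing, the ambiguous vertex 0 is removed and the four neighbors are reconnected as the horizontal pair (1, 2) and the vertical pair (3, 4).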

E. DLOs Instances Extraction
The single instances of the DLOs present in the scene are retrieved considering the connectivity of the graph, i.e., each DLO is represented as an isolated subgraph of the initial global graph. For each subgraph, the path from one endpoint (vertex with degree 1) to the other is extracted. A path $P_t$ can be denoted as an ordered sequence of vertices $P_t = \{v^t_1, v^t_2, \ldots, v^t_{t_n}\}$. The extracted path denotes the sequence of key-points describing the DLO instance. From these key-points, a spline curve can be fitted to better approximate the DLO shape and then an estimate of the DLO thickness can be obtained from the distance transform mask $M_{dist}$. Thus, a colored mask $M_c$ can be drawn as shown in Fig. 5.
In some cases, two or more DLO instances may be denoted by a single path. This situation can occur, for instance, when the intersection between two DLOs happens along the border of the image. RT-DLO, employing only the mask image, tries to solve this scene by connecting jointly the two distinct DLOs; see as an example Fig. 6, showing the obtained DLO instances given the source image and mask. To handle this condition, as a final consistency check along the obtained path, the cosine similarity is computed between each vertex of the path and its two neighbors. In particular, given a sample vertex $v^t_i$, the similarities $s_{i,i+1}$ and $s_{i,i-1}$ with its two neighbors $v^t_{i+1}$ and $v^t_{i-1}$ are computed. If the product $s_{i,i+1} s_{i,i-1}$ is negative, the path is not smooth at vertex $v^t_i$. Thus, the path $P_t$ is split at vertex $v^t_i$ into two different paths (see Fig. 6).
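The extraction of one ordered path per isolated subgraph can be sketched as follows, assuming that every vertex has degree at most 2 after the intersections processing. Closed loops without endpoints are not handled in this sketch, and the name `extract_paths` is hypothetical.

```python
def extract_paths(n_vertices, edges):
    """Walk each chain of the graph from one endpoint (degree 1) to the other."""
    adj = {v: [] for v in range(n_vertices)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    visited, paths = set(), []
    for start in range(n_vertices):
        # Only unvisited endpoints start a new path.
        if len(adj[start]) != 1 or start in visited:
            continue
        path, cur = [start], start
        visited.add(start)
        while True:
            nxt = [v for v in adj[cur] if v not in visited]
            if not nxt:
                break
            cur = nxt[0]
            visited.add(cur)
            path.append(cur)
        paths.append(path)
    return paths

# Two separate DLOs: vertices 0-1-2 and 3-4-5-6.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (5, 6)]
paths = extract_paths(7, edges)
```

Each returned list is the ordered sequence of key-points of one DLO instance, to which a spline could then be fitted.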

F. Intersections Layout
To correctly assign the DLO IDs in the intersection areas between two or more DLOs, additional color information is required. Indeed, this information cannot be recovered from the binary mask $M_b$ and the corresponding constructed graph alone. In this work, we deploy the approach first described in [9]: the standard deviation of the RGB color along the edge connecting two vertices in the area of the intersection is used. For a given intersection, all the involved edges are collected and the standard deviations of the RGB values along the edges are compared. The edge corresponding to the smallest value is selected as the one at the top of the pile. Therefore, the mask $M_c$ is drawn taking into account this information.
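The comparison of color spreads can be sketched as below. Summing the per-channel standard deviations into a single value is an assumption made for this illustration, as are the sample pixel values and the function names.

```python
from statistics import pstdev

def color_spread(pixels):
    """Sum of the per-channel standard deviations of RGB samples along an edge."""
    return sum(pstdev(channel) for channel in zip(*pixels))

def top_edge(edge_pixels):
    """edge_pixels: dict edge -> list of (r, g, b) samples; returns the most uniform edge."""
    return min(edge_pixels, key=lambda e: color_spread(edge_pixels[e]))

edge_pixels = {
    (1, 2): [(200, 30, 30), (198, 32, 29), (201, 31, 30)],  # uniform red along the edge
    (3, 4): [(30, 30, 200), (199, 31, 30), (31, 29, 201)],  # blue interrupted by red
}
on_top = top_edge(edge_pixels)
```

The edge with uniform color is reported as the one on top, since the color along the occluded DLO is interrupted by the DLO crossing over it.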

IV. EXPERIMENTAL VALIDATION
The experiments were performed on a workstation with an Intel Core i9-9900K CPU clocked at 3.60 GHz and an NVIDIA GeForce RTX 2080 Ti. PyTorch 1.4 is used for the software implementation.

A. Test Dataset and Metrics
To evaluate the RT-DLO performances on real data, a test set originally deployed in [8] and extended in [9] is used. It consists of 135 manually labeled real images of electrical wires with varying diameters, grouped into three categories, each consisting of 45 images defining a specific scenario, labeled as C1, C2, and C3. Each category is further divided into subclasses based on the number of intersections present in the images, i.e., 1, 2, and 3, with 15 samples each.
As evaluation metric, the IoU ($\mathrm{IoU} = |M \cap M_{gt}| / |M \cup M_{gt}|$, where $M$ is the mask under evaluation and $M_{gt}$ is the ground truth) is employed. The mask $M$ corresponds to the colored mask $M_c$ where each DLO instance is denoted by a unique color, and the IoU score is the average score across the instances of the image.
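The instance-averaged IoU can be sketched as follows, representing each instance mask as a set of pixel coordinates. The sketch assumes that predicted and ground-truth instances are already matched by index, which is a simplification introduced here.

```python
def iou(pred, gt):
    """IoU between two instance masks given as sets of (row, col) pixels."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def mean_instance_iou(pred_instances, gt_instances):
    """Average IoU across instances (assumes matching by index)."""
    scores = [iou(p, g) for p, g in zip(pred_instances, gt_instances)]
    return sum(scores) / len(scores)

pred = [{(0, 0), (0, 1), (0, 2)}, {(2, 0), (2, 1)}]
gt = [{(0, 1), (0, 2), (0, 3)}, {(2, 0), (2, 1)}]
score = mean_instance_iou(pred, gt)
```

Here the first instance overlaps on 2 of 4 pixels (IoU 0.5) and the second matches exactly (IoU 1.0), giving an image score of 0.75.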

B. Training
The training dataset and the training details for the semantic segmentation network employed in Section III-A are those of [9]. Following [9], a threshold of 0.3 is used to binarize the segmentation output into the mask $M_b$.
Concerning the CNN of Section III-B2, the dataset was obtained from the synthetic dataset of [9] by randomly cropping a square patch at the vertices, obtained by applying Section III-B, and by using the knowledge of the 2-D ground truth curve to label the orientations. A patch size $\delta = 15$ was used. The convolutional layers have 32 filter channels, while the last linear layer has 180 neurons. The network was trained for 50 epochs, employing a batch size of 32 and a learning rate equal to $5 \times 10^{-4}$. Adam was selected as optimizer, with the final network weights selected based on the validation loss. In Fig. 7(a), the smooth decay of the training and validation loss curves can be observed, validating the choice of the smooth angle labeling approach.

C. Angle Prediction Evaluation
The network employed to estimate the vertices orientations (see Section III-B2) is compared to a baseline method and tested both on a synthetic test set (100 samples like those of Section III-A) and on the real test set. As baseline method, an approach based on Gabor filters [22] is used. A Gabor filter is a linear filter whose impulse response is a sinusoid modulated by a Gaussian envelope, which makes it orientation-selective. The ground truth angular value for each vertex is directly available in the synthetic data. In the real test set, instead, it is recovered from the ground truth instances mask: spline curves are fitted for each instance and the vertices' reference orientations are computed as the tangent of the curve at each vertex position.
Overall, the proposed network approach shows better performances, especially in the real scenario [see Fig. 7(b) and 7(c)]. Indeed, we found that the Gabor filter approach is more sensitive to the mask's noisy edges and to its characterizing parameters. Considering the real scenario, the error distributions are characterized by the following mean and standard deviation statistics: −0.03° ± 6.00° for the network; −0.17° ± 7.07° for Gabor.

D. Parameters Choice and Influence
RT-DLO employs two user-defined parameters that can affect the method performances: the vertex sampling ratio $\alpha$ and the number $K_{nn}$ of nearest neighbors. In Table I, the performances of RT-DLO on the test set are compared by varying $\alpha$ and $K_{nn}$. RT-DLO maintains strong performances across a wide range of values for $\alpha$, i.e., between 0.1 and 0.3. On the contrary, selecting $\alpha$ as 0.05 results in a considerably reduced number of vertices, hurting the descriptive power of the graph. The selection of $K_{nn}$ is also not critical, with a value of 8 already sufficient to reach top performances.

E. Baseline Methods
RT-DLO is compared against both DLO-specific and general-purpose instance segmentation methods. To the first group belong the algorithms named Ariadne+ [8] and FASTDLO [9]. Both approaches employ the same segmentation network architecture as the one deployed in Section III-A. In particular, the network weights are those of [9], thus allowing a straightforward comparison with [9] and [8].

F. Evaluation
The comparison of RT-DLO against the baseline methods of Section IV-E is presented in Table II by means of the IoU score computed starting from the color masks provided as output by each method.The table also provides details about the average inference time, FPS, and key-points availability as output.Overall, RT-DLO shows strong performances both in terms of IoU score, i.e., +2.14% and +3.76% improvements against FASTDLO, i.e., top-performing algorithm, when deploying the same segmentation mask M b .In particular, RT-DLO can provide the same level of performance of FASTDLO employing a lighter backbone, thus making it possible to reach a frame-rate of 36 FPS, +13 FPS over FASTDLO.
A qualitative comparison between RT-DLO and the DLO-specific methods on a few samples of the test set is provided in Fig. 8, where the superiority of RT-DLO is especially visible at the intersections. Indeed, the major advantage of RT-DLO over the competing approaches resides in its graph representation, which is based on $M_b$ but is less susceptible to degraded areas than the skeleton approach of FASTDLO and the mask-guided superpixels method of Ariadne+. In this regard, a deeper analysis of the robustness of RT-DLO is reported in Section IV-H, where the requirement of an accurate segmentation mask $M_b$ is experimentally relaxed in two different scenarios: 1) the mask is artificially corrupted with an erosion process; 2) the segmentation back-end specifically trained on electric wires is replaced with other back-ends trained on general-purpose datasets.

G. Evaluation of Inference Time
In Table III, a characterization of the average timing on the test set for each stage of the proposed method is provided. Faster processing times can be achieved by deploying a lighter backbone, such as residual neural network (ResNet)-50, saving several milliseconds in the binary segmentation phase and obtaining a total processing time of about 27 ms as opposed to 31 ms. The graph generation time is below 10 ms, highlighting the efficiency of RT-DLO. If the colored mask is not required, the last two stages can be skipped, shortening the computation time by 1 to 2 ms depending on the number of intersections, as shown in Table III. The timings in the table are obtained employing the hardware setup described at the beginning of Section IV. A similar timing of about 13.5 ms is obtained for the total time without segmentation on a consumer laptop (Intel Core i7-12700H CPU). Indeed, high computation power is mostly required for the deep segmentation network. Thus, if the application does not require a complex deep model for scene semantic segmentation, the hardware specifications can be relaxed or, alternatively, a higher overall FPS can be achieved.
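As a rough cross-check of the figures quoted above, latency and frame rate are related by FPS = 1000 / t, with t in milliseconds. The helper below is only an illustrative sketch: the function name is hypothetical, and the inputs are the approximate totals quoted in the text, not the per-stage entries of Table III.

```python
def fps_from_latency_ms(latency_ms):
    """Convert an average per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Approximate totals quoted in the text (milliseconds).
quoted_totals_ms = {
    "ResNet-50 backbone": 27.0,
    "heavier backbone": 31.0,
    "without segmentation": 13.5,
}
fps = {name: round(fps_from_latency_ms(t), 1) for name, t in quoted_totals_ms.items()}
print(fps)
```

Conversely, the reported 36 FPS corresponds to roughly 27.8 ms per frame, which is consistent with the "about 27 ms" total.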

H. Mask Degradation and Different Segmentation Back-Ends
The improvements of RT-DLO over the main competing methods, i.e., the DLO-specific algorithms Ariadne+ and FASTDLO, are not limited to faster processing time and better accuracy. Indeed, an important benefit of the graph representation approach of RT-DLO is its ability to better handle degraded semantic segmentation masks $M_b$. To illustrate this graph-based advantage, a two-fold study is conducted. On one hand, the performance drop of RT-DLO and the competing methods is evaluated after an erosion process is applied to $M_b$. On the other hand, different segmentation networks trained on public datasets, i.e., not DLO-specific ones, are employed.
Concerning the first study, the masks $M_b$ of the test set are iteratively eroded, i.e., the foreground area of the binary mask is progressively thinned, with a kernel of 3 × 3 pixels to simulate the effects of less precise masks coming from Section III-A. The evaluation is performed by comparing RT-DLO to the DLO-specific methods on the masks obtained from the two different backbones, i.e., ResNet-50 and ResNet-101 (see Fig. 9). From the plots of Fig. 9(a), RT-DLO maintains an almost steady performance after the first round of erosion, followed by a drop in the scores in the subsequent iterations. On the contrary, the drop in the scores of FASTDLO and Ariadne+ is significant from the very first iteration. Considering the mask IoU score as an upper bound, RT-DLO is capable of maximizing its score as opposed to the compared approaches. The images of Fig. 9(b) better illustrate the effects of the erosion process and the advantages of RT-DLO on the test sample C1 of Fig. 8.
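The iterative erosion described above can be sketched as follows. This is a minimal NumPy-only illustration with a 3 × 3 square structuring element; the function names are hypothetical, and treating pixels outside the image as background is an assumption that may differ from the border handling of the library actually used in the experiments.

```python
import numpy as np


def erode_3x3(mask):
    """One binary erosion step with a 3x3 square structuring element.

    A pixel stays foreground only if its whole 3x3 neighborhood is foreground.
    Assumption: pixels outside the image are treated as background.
    """
    m = np.asarray(mask, dtype=bool)
    padded = np.pad(m, 1, mode="constant", constant_values=False)
    h, w = m.shape
    out = np.ones_like(m)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def erode_iteratively(mask, iterations):
    """Return the list of masks obtained after 1, 2, ..., `iterations` erosions."""
    results = []
    current = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        current = erode_3x3(current)
        results.append(current)
    return results


# Toy mask: a band 5 pixels thick and 7 pixels long, not a real DLO mask.
mask = np.zeros((7, 9), dtype=bool)
mask[1:6, 1:8] = True
areas = [int(m.sum()) for m in erode_iteratively(mask, 3)]
print(int(mask.sum()), areas)
```

In the toy example, a band that is only 5 pixels thick disappears entirely at the third iteration, which illustrates why thin DLO masks degrade quickly under repeated erosion.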
A study on replacing the segmentation back-end of Section III-A with ImageNet-pretrained salient object segmentation (SOS) approaches is reported in Fig. 10, avoiding the need to train a segmentation network on a DLO-specific dataset. The SOS architectures tested are as follows: EGNet [23], F3Net [24], CPD [25], and PoolNet [26]. When evaluated on the test set, RT-DLO continues to achieve strong performance compared to the competing approaches [see Fig. 10(a)]. The advantages of RT-DLO in the case of degraded masks are even more apparent in the sample images of Fig. 10(c), which show how RT-DLO is able to minimize the number of extracted instances.

V. CONCLUSION
In this article, a novel method for real-time instance segmentation of DLOs is presented. The representation of the DLOs as a graph offers an efficient, simple, and intuitive way to obtain the DLO instances. The segmentation performance improvements compared to current state-of-the-art approaches for DLO detection are noticeable. More importantly, the inference time of RT-DLO makes it stand out even more compared to existing approaches. In future work, RT-DLO can be improved and expanded in several ways. For example, RT-DLO currently processes each image individually. However, the segmentation stage can be substituted with a different approach exploiting the previous frames of a video sequence for a better and possibly faster segmentation mask. In this context, a tracking system can also be investigated to match the DLO instances across the video sequence. Finally, the graph-based representation of DLOs can be easily extended to other structures, like wiring harnesses, and to other sensors, like 3-D cameras, resulting in a 3-D graph.

Manuscript received 5 August 2022; revised 13 November 2022 and 16 January 2023; accepted 6 February 2023. Date of publication 16 February 2023; date of current version 19 September 2023. This work was supported by the European Commission's Horizon 2020 Framework Programme with the project REMODEL, Robotic technologies for the manipulation of complex deformable linear objects, under Grant 870133. Paper no. TII-22-3348. (Corresponding author: Alessio Caporali.)
A) Mask generation: Obtaining a binary mask $M_b$ from the input color image via a DCNN.
B) Vertices sampling: Processing $M_b$, with vertices orientation characterization employing a CNN.
C) Edges sampling: Exploiting the proximity among the vertices and the orientation between vertices and edges.
D) Intersections processing: Disentangling the DLOs in the graph representation via subgraphs analysis.
E) DLOs instances extraction: Computing pixel-wise DLOs instances masks in the image plane.

Fig. 3. Vertices sampling key elements: (a) the mask $M_b$, (b) $M_{dist}$, (c) $M_{max}$, and (d) obtained vertices. The bright regions in (b) denote high intensity values.
i.e., $B^-$ contains values in $\{-1, 0\}$. The entries of $D$ can be filtered out based on $B^-$ as $D^- = D \odot B^-$. The weight matrix associated to $D^-$ can be defined as $W^- \in \mathbb{R}^{n \times m_{knn}}$, where only the entries associated to $-1$ in $B^-$ are different from zero. A generic entry w

Fig. 5. The connectivity graph (a) is processed to extract the DLOs instances and obtain the colored mask $M_c$ (b). (a) Final graph. (b) Colored mask.

Fig. 6. DLOs instances extraction with and without consistency check in case of a problematic mask.

belonging to path $P^t$. Its two neighboring vertices are $v^t_{i-1}$ and $v^t_{i+1}$, while the two edge directions are $d_e^{i,i-1}$ and $d_e^{i,i+1}$. According to (1), the cosine similarity between $d_v^i$ and $d_e^{i,i-1}$ can be denoted as $s_{i,i-1} = s(d_v^i, d_e^{i,i-1})$, where $d_v^i$ describes the orientation of vertex $v^t_i$. Similarly, $s_{i,i+1} = s(d_v^i, d_e^{i,i+1})$

Fig. 7. Evaluation of the CNN angular prediction network and comparison against the baseline approach based on Gabor filters. (a) Training and validation losses. (b) Evaluation on synthetic dataset. (c) Evaluation on real dataset. The error density is recovered via a kernel density estimation.

Fig. 8. Qualitative evaluation of RT-DLO versus FASTDLO and Ariadne+ on the test set classes.

Fig. 9. RT-DLO, FASTDLO, and Ariadne+ performance comparison after iteratively degrading the binary mask $M_b$. (a) Evaluation on the test set eroding $M_b$. (b) Qualitative comparison given $M_b$ eroded for 1, 2, and 3 iterations.

Fig. 10. Comparison of RT-DLO, FASTDLO, and Ariadne+ when employing popular salient object segmentation networks. (a) Evaluation on the test set employing $M_b$ obtained by SOS networks. (b) Qualitative comparison of the instances masks given $M_b$ from EGNet.

TABLE I PERFORMANCES OF RT-DLO WHEN VARYING THE VERTICES SAMPLING RATIO α AND THE NUMBER OF $K_{nn}$ NEAREST NEIGHBORS

linear filter usually employed for texture analysis. By properly defining its main parameters, it is possible to obtain a patch similar to the one processed by the network. Thus, the baseline approach consists in: generating 180 Gabor filters spanning [0, 180) degrees; finding the filter with the smallest cumulative difference with respect to the input local patch; assigning as angle prediction the angular value used to generate the filter.
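The three steps of the Gabor baseline can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel size, `sigma`, `wavelength`, and `gamma` values are hypothetical, the cumulative difference is taken as the sum of absolute pixel differences, and the input patch is assumed to have the same size and intensity range as the filters. None of these choices are specified in the text.

```python
import numpy as np


def gabor_kernel(theta_deg, size=15, sigma=3.0, wavelength=8.0, gamma=0.5):
    """Real-valued Gabor kernel oriented at theta_deg (hypothetical parameters)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    theta = np.deg2rad(theta_deg)
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * x_r / wavelength)


def build_bank(n_angles=180, **kernel_args):
    """Step 1: generate n_angles filters spanning [0, 180) degrees."""
    angles = np.arange(n_angles) * (180.0 / n_angles)
    bank = np.stack([gabor_kernel(a, **kernel_args) for a in angles])
    return angles, bank


def predict_angle(patch, angles, bank):
    """Steps 2 and 3: pick the filter with the smallest cumulative absolute
    difference to the patch and return the angle used to generate it."""
    diffs = np.abs(bank - patch[None]).sum(axis=(1, 2))
    return float(angles[np.argmin(diffs)])


angles, bank = build_bank()
patch = gabor_kernel(37.0)  # synthetic patch generated at a known angle
predicted = predict_angle(patch, angles, bank)
print(predicted)
```

Because every prediction is tied to the fixed filter parameters, a mismatch between them and the local appearance of the mask directly affects the result, which is in line with the sensitivity to the characterizing parameters noted in the evaluation.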

TABLE III AVERAGE EXECUTION TIMES [MS] OF THE MAIN RT-DLO STAGES WITH RESPECT TO THE NUMBER OF INTERSECTIONS IN THE IMAGE