Road Severity Distance Calculation Technique Using Deep Learning Predictions in 3-D Space

Some roadside objects pose a significant danger to pedestrians and vehicles when they are too close to the road; a few examples are trees, poles and fences. Their proximity to the road can change over time due to natural conditions or human activities. Early detection of severe roadside conditions can help avoid accidents and save lives. However, detecting severe roadside objects requires many resources and new techniques due to the size and complexity of the road network. Deep learning and image processing techniques can be leveraged to address this requirement and build an automatic roadside severity detection system. In this work, we propose a novel roadside attribute and distance calculation technique that extends our previous work in this area (the lane-line method). The past work depended on the detected lane-line widths to calculate the distances, and it made mistakes under challenging road conditions and misclassifications. Here, we combine camera configuration data with a neural network detector to develop a distance-versus-pixel model for reliable road severity distance calculation. We use camera metadata to transform the 2D image data predicted by the deep neural network into 3D space. The improved model was tested with a real-world dataset. Compared to the lane-line method, the new combined model reported 36% and 37.5% accuracy improvements in left and right-hand side distances, respectively.


I. INTRODUCTION
Every year, many lives are lost, and significant property damage is reported in accidents involving fixed roadside objects. Roadside objects such as trees, poles and barriers increase the severity of the injuries and the property damage caused by a crash. The offset of fixed roadside objects is a significant contributor to roadside crash severity [1].
Australian government road safety statistics [2] show 1,126 road deaths in the 12 months ending November 2021. In 2018-19, 39,755 people were admitted to hospital with road crash injuries, an annual admission rate of 143 per 100,000 population. Similar statistics were published by the Department of Transport and Main Roads (DTMR), Queensland [3].
DTMR collects data from the Queensland road network to assess the road safety conditions, improve the road infrastructure and reduce fatalities in road accidents [4], [5]. The data collection is performed annually for Mobile Laser Scanning (MLS) data and Digital Video Recording (DVR) data.
The associate editor coordinating the review of this manuscript and approving it for publication was Nikhil Padhi.
The data collection aims to improve road conditions, road safety compliance, environmental risk assessment and obstacle proximity assessment.
The Australian Road Assessment Program (AusRAP) has worked with state, territory and local governments since 2001 to optimize and measure infrastructure safety investment [6]. AusRAP identifies physical roadside attributes and assesses the safety risk associated with each to produce a Star Rating Score.
DTMR uses its collected data for identifying AusRAP attributes and analyzing them for road safety applications. The past work has been focused on identifying and developing neural network and image processing techniques for AusRAP attributes using MLS and DVR data [4], [5], [7].
Our study focuses on a custom image dataset extracted from DVR data collected by the DTMR. This work proposes a novel combined model for attribute detection and distance calculation for roadside severity detection. We leverage the camera calibration data with the predicted road attributes to reliably measure the roadside severity distances.
The proposed technique does not rely solely on attribute predictions. As misclassification is an unavoidable issue in object detection, the method's dependency on camera metadata improves the severity distance calculation. When developing the fused model, a fixed four-point projective transformation model was combined with the predicted attributes of the fully convolutional network. The projective transformation model was pre-calculated manually from the camera metadata. If more than one camera angle is used, a matching projective transformation model can be created for each angle, or a single model can be used for approximate results.
A distance graph was created using the pixel-distance relationship between the image and world planes. The graph can be used to look up the unknown real-world distance represented by a reference pixel for a roadside attribute. Because this graph does not change from frame to frame, it proved experimentally effective under challenging road conditions. Two real-world road image datasets were used for the experiments, and the results were compared with the recently published lane-line method [4], [5].
The contributions of this article are listed below.
• We introduce a novel distance calculation method for roadside attributes.
• We propose a combined model for detecting and rating roadside severity objects. The combined model uses deep neural network predictions and a pixel-distance relationship derived from camera metadata.

The rest of this paper is organized as follows. Section II discusses closely related work on roadside attribute detection and distance calculation, and the improvements proposed for distance calculation. Section III describes our proposed methodology. Section IV reports experimental results. A discussion of issues and potential improvements, together with conclusions, is presented in Section V.

II. RELATED WORK
Ikiriko et al. [1] highlighted the importance of roadside severity analysis using a Multinomial Logit Model. They used the vehicular crash data of twenty-four counties in the state of Maryland from 2016 to 2018. Their findings showed that the proximity of roadside attributes such as trees, sign support poles, traffic signal supports, and fences strongly influences the severity of road crashes. Trees were responsible for most of the fatal and injury crashes, and guardrails were responsible for most of the injury crashes and property damage.
A project similar to the DTMR AusRAP attribute analysis was conducted by the Kentucky Transportation Center [8]. They leveraged the U.S. Road Assessment Program (usRAP) to develop quantitative, objective roadside safety ratings for rural two-lane roads in the state of Kentucky. Their roadside severity attributes are similar to the AusRAP severity attributes (roadside severity - driver-side distance, roadside severity - driver-side object, roadside severity - passenger-side distance, and roadside severity - passenger-side object).
Some studies used simulated crash data to analyze roadside severity. Cheng et al. [9] conducted a severity analysis of crashes into roadside trees using crash simulation software. They also proposed a new index for evaluating the accuracy of accident severity classification. Relatively similar work has been reported in [10]. However, the effectiveness of the proposed severity metrics was not further validated with real-world data.
Many related studies [1], [8]-[10] conclude that fixed roadside attributes such as trees, poles and walls, and moving attributes such as pedestrians [11], pose dangerous road conditions for motorists when they are close to the road. In addition to crash data and simulations, real-world visual data can be used to assess the severity of roadside objects. A few notable findings on roadside attribute detection and severity analysis can be found in past work [4], [12], [13].
A Fully Convolutional Network (FCN) was proposed to detect AusRAP attributes from a real-world image dataset recorded on Queensland roads [14]. An improvement to the FCN was proposed in [7]. The study suggested optimizations to the number of convolution layers, activation function, pooling type, attribute image size, number of iterations and learning algorithms. A fusion of multiple FCN models was also suggested to optimize training times when introducing new attributes [15]. Jan et al. [4] proposed a convolutional neural network for identifying AusRAP roadside attributes together with a distance calculation method. Their distance calculation was based on the predicted lane-line width, which is less reliable on roads without sharp lane-lines and in the presence of prediction errors. Our proposed method outperforms the lane-line method because the new approach is based on a mathematical model.

III. METHODOLOGY
This section describes our approach in detail. First, the method that uses the predicted lane-line width for distance calculation is briefly explained. Next, we describe the fully convolutional network used for road attribute prediction and the proposed distance calculation that integrates the camera metadata.
VOLUME 10, 2022

A. ROADSIDE DISTANCE FROM THE LANE-LINE METHOD
The lane-line method [4] was introduced to calculate the distances between the road and roadside attributes based on the detected lane-line width. This method introduced a deep neural network for AusRAP attribute segmentation and classification. When the attributes were detected, the distances were calculated using the line markings as a reference point. The width measurement is reliable when the classification is accurate. Figure 1 illustrates how to apply this method for distance calculation. The main steps of the method are summarized below.
After the segmentation and classification steps, the line marking width was measured. As shown in figure 1, the line width is 20 cm, and the distance between the pole and the line is 280 cm. Next, the number of pixels in the region of interest (e.g., the line width) was counted, and the pixel count was converted to centimeters. A weight was calculated and applied to each row of the pixel-wise distances. Next, the distance between the line and the objects (e.g., a tree) was calculated. The final distance was obtained using the following steps.
• Start from the base pixel of the object and move towards the line pixel column-wise.
• After reaching/touching the line, calculate the number of pixels from the base pixel to the shoulder line and multiply by pixel-wise width to convert it into meters.
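The conversion described in the steps above can be sketched as a short function. This is a minimal illustration using the figure 1 example values (a 20 cm lane-line), not the authors' implementation; the function and parameter names are ours.

```python
def lane_line_distance_cm(line_width_px, object_to_line_px, line_width_cm=20.0):
    """Estimate the road-to-object distance with the lane-line method:
    the known lane-line width gives a per-row centimetres-per-pixel
    scale, which converts the object-to-line pixel count to centimetres.

    line_width_px     -- detected lane-line width in pixels on this row
    object_to_line_px -- pixels from the object's base pixel to the line
    line_width_cm     -- known real-world lane-line width (20 cm here)
    """
    cm_per_pixel = line_width_cm / line_width_px
    return object_to_line_px * cm_per_pixel

# A 20 cm line spanning 10 pixels and an object 140 pixels from the
# line give 140 * 2 = 280 cm, matching the figure 1 example.
print(lane_line_distance_cm(10, 140))
```

Note that the whole estimate hinges on `line_width_px`, which is why the method degrades when the lane-line is faint or misclassified.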

B. PROPOSED METHOD
The architecture of our solution is presented in figure 2. The image frames were extracted from videos of Queensland roads and annotated for the objects (AusRAP attributes). The FCN performs segmentation, feature extraction and classification as described in section III.B of [14]. We used the same FCN architecture with more training and validation images. Our dataset consisted of 782 training images and 332 validation images. A total of 61 object classes were used. The FCN was trained on a high-performance computer. The output of the trained FCN model was used with the pixel-distance model for severity distance calculation. The pixel-distance model was pre-calculated from the camera metadata and an image transformation. More details on developing the pixel-distance model are given below.

As shown in figure 3, the camera angle θ was fixed for each road during the data recording. However, with respect to the world horizontal plane, the camera angle might differ depending on the road condition (slope, hill, bump, potholes, or curve). In this study, we make the following assumptions.
• The road is flat, and all the roadside attributes lie on flat ground.
• Vehicle movement does not alter the camera angle.
Using the above assumptions, we create a projective transformation model to calculate the distance between the road and the roadside attributes. A projective transformation maps the 2D image plane onto the ground plane in 3D space. The image frames captured by the DVR are in 2D space, and we transform them into 3D space to measure distances in the 3D scene. At least four point correspondences from the 2D image are needed for a reliable transformation. The model was pre-calculated using an image frame and the camera configuration.

1) CAMERA METADATA
When recording data with a DVR, the camera metadata provides valuable information about the camera configuration and the road. Some metadata are compression level, offsets, camera height, pan, tilt, resolution, field of view, calibration points, distances, and pixel values. For this study, we used the "near" and "far" calibration points (pixel coordinates), the real-world distance between those points, and their real-world perpendicular distances to the image middle line. These metadata can differ from road to road.

A survey vehicle with a digital imaging system can visually identify and locate roadside features. These systems are often used for road safety assessment [16]. The imaging, laser and GPS sensors in the survey vehicle provide reliable location estimates, which we consider the ground truth for our experiments.
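As a sketch, the metadata fields used in this study could be grouped per road as below. The field names and values are illustrative only; they are not the actual DTMR metadata schema.

```python
# Hypothetical per-road camera-metadata record; names and values are
# illustrative, not the actual DTMR schema.
camera_metadata = {
    "resolution": (1600, 1200),      # image width x height in pixels
    "near_point_px": (1100, 900),    # "near" calibration point (pixel coords)
    "far_point_px": (900, 450),      # "far" calibration point (pixel coords)
    "near_far_distance_m": 20.0,     # real-world distance between the points
    "mid_line_offset_m": 3.0,        # perpendicular distance to the mid-road line
    "camera_height_m": 2.0,
    "tilt_deg": 10.0,
    "field_of_view_deg": 60.0,
}
```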
A typical survey vehicle used to record Australian roadside data is shown in Figure 4. The equipment in the survey vehicle can reliably estimate the distance to the roadside objects using digital images, laser range readings and GPS coordinates. For our experiments, we use images only from the front digital camera to estimate the distances to the roadside attributes. The surveyed data (the distance ranges to the roadside attributes) was used as the ground truth.

2) IMAGE TRANSFORMATION
To transform the image frame to the world frame, the main requirement is finding the vanishing point of the image frame. The vanishing point is where all the parallel lines meet in the image frame. Depending on the available scene parallel lines, there can be many vanishing points in a frame.
Vanishing point calculation from the scene geometry is widely used in camera calibration applications [17]. Some studies focused on vanishing point calculation from scene lines, specifically edge lines. To obtain a reliable vanishing point, those methods must find at least two parallel lines. Edge detection, keypoint detection, and the Hough transform are a few popular techniques used for vanishing point calculation in the literature [18]-[20].
Our proposed method applies to the road severity distance measurement scenario. The roadside severity distance is the closest distance from the road edge to an attribute. Not all attributes belong to the severity category; attributes are considered severe if they are too close to the road edge and strong enough to increase crash severity. Some examples are poles, trees, and metal barriers.
Typically, the road lane-lines are visible in an image frame of the road. If the road is straight and the lines are visible, the vanishing point can be calculated from the available straight-line segments. However, this is not the case for many parts of the road network. Lane-lines can disappear due to wear and weather conditions, or be hidden by the shape of the road, traffic, or poor lighting conditions. Curved roads also produce undesirable results for vanishing point localization.
The camera configuration data provides the near and far points' pixel locations and their real-world distances. We use this information to form a four-point projective transformation model, as shown in figure 5. The line connecting the near and far points is parallel to the mid-road line. Two similar points can be found by mirroring the near and far points about the mid-road line. These four points are used for the transformation. The transformation represented in figure 5 is a 3D transformation: the image frame, captured at an angle to the ground, is transformed into the world frame, which lies on the assumed flat ground.
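Under the stated assumptions, the four-point transformation can be estimated with a standard direct linear transform (DLT). The sketch below uses hypothetical pixel and world coordinates for the near/far points and their mirrors about the mid-road line; it is a minimal illustration, not the authors' code.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 projective transformation H mapping src -> dst
    from four point correspondences (direct linear transform)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)        # null-space vector reshaped to 3x3
    return H / H[2, 2]

def image_to_world(H, x, y):
    """Project an image pixel onto the assumed-flat world ground plane."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Hypothetical correspondences: near and far calibration points and
# their mirrors about the mid-road line (image pixels -> world metres;
# world x = lateral offset, world y = distance along the road).
src = [(1100, 900), (900, 450), (500, 900), (700, 450)]
dst = [(3.0, 0.0), (3.0, 20.0), (-3.0, 0.0), (-3.0, 20.0)]
H = homography_from_points(src, dst)
```

With `H` in hand, any pixel assumed to lie on the ground plane can be mapped to world coordinates, so the lateral distance between a road-edge pixel and an attribute base pixel follows directly.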
Let the near and far points be (x_1, y_1) and (x_2, y_2), respectively. Using these two points, the slope of the near-far line in the 2D image frame can be found as

m = (y_2 - y_1) / (x_2 - x_1).
We know the pixel coordinates of d_1 from the FCN output, (x_{d1}, y_{d1}). Using this information, we can find (x_{d2}, y_{d2}). As d_1 and d_2 are in the same row, they have the same y-axis coordinate y, so

y_{d2} = y_{d1} = y,    x_{d2} = x_1 + (y - y_1) / m.
The reference pixel value ref_pixel (the real-world distance represented by a single pixel) for the y-axis coordinate y can be found by combining the above two equations:

ref_pixel = 3 / (x_{d2} - x_m),

where x_m is the x pixel coordinate of the mid-road line and 3 m is the real-world perpendicular distance between the near-far line and the mid-road line given in the camera configuration.
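The pixel-distance relationship can be sketched as a small function. The 3 m offset follows the camera configuration described in the text; the function itself is a minimal illustration with our own names, not the authors' implementation.

```python
def ref_pixel(y, p1, p2, x_mid, offset_m=3.0):
    """Real-world distance (metres) represented by one pixel in image row y.

    p1, p2   -- any two pixel points (x, y) on the near-far line
    x_mid    -- x pixel coordinate of the mid-road line
    offset_m -- real-world perpendicular offset between the near-far
                line and the mid-road line (3 m in this study)
    """
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)         # slope of the near-far line
    x_d2 = x1 + (y - y1) / m          # near-far line x at row y
    return offset_m / (x_d2 - x_mid)  # metres per pixel in row y
```

For example, with the near-far line passing through (1862, 1200) and the vanishing point (800, 330), row y = 1200 spans 1062 pixels for 3 m, i.e., roughly 2.8 mm per pixel.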
The above equations are further explained below with the aid of figure 5.
The real-world perpendicular distances from the near and far points to the mid-road line were also given in the camera configuration as 3 m. Additionally, the near-far line extended towards the horizon crosses the middle line at the vanishing point (pixel coordinates (800, 330)). If we extend the near-far line towards the camera, it meets the maximum y coordinate (y = 1200) at x = 1862 (a point outside the image frame).
Next, we consider the triangle formed by the image coordinates (800, 330), (800, 1200), and (1862, 1200). Inside this triangle, horizontal lines were drawn across each y-axis pixel (from y = 330 to y = 1200). Each horizontal line represents a 3 m distance in the real world but a different length (in pixels) in the image frame. That means the real-world distance represented by a single pixel differs from line to line. This pixel-distance relationship is shown in Figure 6.
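The triangle construction above can be turned into a per-row lookup table, which is effectively the graph of Figure 6. The sketch below uses the concrete numbers quoted in the text ((800, 330), (1862, 1200), 3 m per row) and is an illustration rather than the authors' code.

```python
def pixel_distance_table(vp=(800, 330), base=(1862, 1200),
                         x_mid=800, row_width_m=3.0):
    """Metres-per-pixel for every image row between the vanishing point
    and the bottom of the frame (the triangle described in the text)."""
    slope = (base[1] - vp[1]) / (base[0] - vp[0])
    table = {}
    for y in range(vp[1] + 1, base[1] + 1):
        x_edge = vp[0] + (y - vp[1]) / slope       # near-far line x at row y
        table[y] = row_width_m / (x_edge - x_mid)  # metres per pixel in row y
    return table

table = pixel_distance_table()
# Rows near the bottom of the frame have many pixels per 3 m, so each
# pixel covers a short distance; rows near the vanishing point cover
# more metres per pixel and are therefore less accurate.
```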
The area marked by a red-dashed line represents the most accurate region for the distance measurement. The right-hand side sub-figure shows the same area marked on the image frame. Distances close to the vanishing point do not give an accurate estimate, and we cannot measure distances beyond the vanishing point in the image frame.

IV. EXPERIMENTS AND RESULTS
The DVR data used for the experiments were provided by the Department of Transport and Main Roads (DTMR). They have done surveys for some roads using video data. The data collection was conducted with four cameras mounted on a vehicle (left, right, front, rear) for each road. The past experiments found that the video data from the front camera are appropriate for extracting the AusRAP attributes [7].
Survey data from two roads were used as test data for our experiments. The two roads were namely ''10A'' and ''10B.'' Image datasets from both roads were analyzed separately and combined. The summary of the image datasets is given below.
• 10A road: There were 2000 images. The vehicle travelled 19.7 km to collect these images. The survey Excel file reports data at 10-frame intervals. There were 198 valid entries to analyse.
• 10B road: There were 1834 images. The vehicle travelled 17.7 km to collect these images. The survey Excel file reports data at 10-frame intervals. There were 177 valid entries to analyse.

Both roads have a 100 km/h speed limit, and in both instances the survey vehicle operated at roughly 100 km/h. As the camera captured images from a fast-moving vehicle, the dataset does not include repeating images. The survey data were considered the ground truth: they were created by the DTMR and report the ground truth road severity distances and objects for the selected roads.

Neither the lane-line method nor the proposed method showed a high correlation with the survey data in our experiments. However, the proposed method showed significantly better overall performance than the lane-line method. Like any trained model, our neural network is not perfect and sometimes misclassifies attributes. Therefore, not all ground truth data could be used when comparing the road severity object of an image frame with the survey data. As our focus was on improving the distance calculation, we selected only the matched objects for the distance performance analysis. When both road image datasets were combined, there were 65 matches for the distance analysis experiments. A match was counted when both the proposed and the lane-line methods predicted the ground truth object; if one method misclassified an object and the other classified it correctly, that image was excluded from the analysis.

Figure 7 shows a distance calculation comparison between the proposed method (green) and the lane-line method (red). In this example, the distance measured by the lane-line method was roughly half the distance calculated by the proposed method. Here, by visual inspection, it can be determined that the 6.44 m distance is closer to the real-world distance.
As there was no solid reference to cross-check this distance's accuracy, the corresponding road width and the nearby vehicle width in the image can be used as clues for this conclusion.

B. PERFORMANCE ANALYSIS
In the survey data, the severity for each frame is indexed numerically from 1 to 4, and each severity index represents a ground truth distance range d_gt. To compare performance against the ground truth data, the distance indexes calculated by the proposed and lane-line methods were compared with the survey distance indexes. Figures 8(a)-(b) show the distance index comparison for roads ''10A'' and ''10B''. The severity indexes for both the left-hand side (LHS) and right-hand side (RHS) are illustrated. Figures 8(c)-(d) are the corresponding error plots for both roads. An error plot presents the absolute offset from the survey data. For most of the images, both methods have equal error values. However, for some images, the lane-line method produces slightly higher errors.
In addition to the distance index error calculation, we analysed the perfect matches of distance indexes. When the distance index calculated by either method equalled the survey data, it was considered a perfect match. The perfect distance index matches for roads ''10A,'' ''10B'' and ''10A + 10B'' are shown in Figure 9. When both road image datasets were combined, there were 65 entries to analyse (the entries matched with the survey object category). The new method produced a 36% increase in LHS perfect matches compared to the lane-line method. Similar performance could also be seen in the RHS matches, with a 37.5% increase for the proposed method (see Table 1). The severity of roadside attributes directly correlates with their distances to the road; therefore, estimating an accurate roadside severity distance is vital for road safety.
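The perfect-match comparison can be expressed in a few lines. The index sequences below are made up for illustration; they are not the experimental data, and the helper name is ours.

```python
def perfect_match_rate(predicted, survey):
    """Fraction of frames whose predicted severity distance index (1-4)
    exactly equals the survey (ground truth) index."""
    matches = sum(p == s for p, s in zip(predicted, survey))
    return matches / len(survey)

# Illustrative severity-index sequences (one entry per matched frame).
survey    = [1, 2, 2, 3, 4, 1]
proposed  = [1, 2, 3, 3, 4, 1]
lane_line = [2, 2, 3, 3, 4, 2]
rate_new = perfect_match_rate(proposed, survey)   # 5 of 6 frames match
rate_old = perfect_match_rate(lane_line, survey)  # 3 of 6 frames match
```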

V. CONCLUSION
Roadside severity distance calculation is a challenging research problem due to the complexity of road conditions and road networks. We have been conducting various experiments aimed at improving AusRAP attribute analysis. Our experiments showed promising results applicable to road safety and road rating. In this work, we proposed combining a fully convolutional network capable of segmenting and classifying roadside objects with a distance calculation model. The camera metadata and an image transformation technique were adopted as the primary tools for distance calculation. A pixel-distance model was derived mathematically to estimate the distance represented by a reference pixel at each y-coordinate. When the severity distance was found along the x-axis in pixels, it was converted to centimeters using the pixel-distance model. The proposed method works well on challenging curved roads and is robust to poor lighting, missing lane-lines and misclassifications. We experimentally validated that the proposed combined model is more reliable than the lane-line method, which depended on the predicted lane-line widths. Our future work will focus on more experiments and a comparative analysis with a 3D model for localizing the roadside severity attributes and tracking them over multiple frames for reliable distance calculations.