DOES: A Deep Learning-Based Approach to Estimate Roll and Pitch at Sea

The use of Attitude and Heading Reference Systems (AHRS) for orientation estimation is now common practice in a wide range of applications, e.g., robotics and human motion tracking, aerial vehicles and aerospace, gaming and virtual reality, indoor pedestrian navigation and maritime navigation. The integration of their high-rate measurements can provide very accurate estimates, but these suffer from error accumulation due to sensor drift over longer time scales. To overcome this issue, inertial sensors are typically combined with additional sensors and techniques. As an example, camera-based solutions have drawn considerable attention from the community, thanks to their low cost and easy hardware setup; moreover, impressive results have been demonstrated in the context of Deep Learning. This work presents the preliminary results obtained by DOES, a supportive Deep Learning method specifically designed for maritime navigation, which aims at improving the roll and pitch estimations obtained by common AHRS. DOES recovers these estimations through the analysis of the frames acquired by a low-cost camera pointing at the horizon at sea. The training has been performed on the novel ROPIS dataset, presented in the context of this work and acquired using the FrameWO application developed for this purpose. The promising results encourage testing other network backbones and further expanding the dataset, in order to improve the accuracy of the results and widen the range of applications of the method as a valid support to visual-based odometry techniques.


I. INTRODUCTION
The pose estimation problem consists in estimating the position and orientation of a vehicle, device, human or robot with respect to a reference frame, through the use of different kinds of internal or external sensors. The accurate measurement of the orientation plays a critical role in a wide range of activities, e.g., robotics and human motion tracking, bio-logging for animal behaviour research, aerial vehicles and aerospace, gaming and virtual reality applications, medicine and biotechnology, indoor and outdoor pedestrian navigation, maritime and/or autonomous navigation. When Global Navigation Satellite Systems (GNSS) are not able to provide correct information about the position and attitude of a vehicle, navigation and localization operations are generally performed through the integration of different kinds of sensors: inertial, odometry, laser and sonar ranging sensors, underwater positioning systems, etc. [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Pinjia Zhang.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, low-cost technologies have become widespread in numerous applications: this means that the accuracy of the pose obtained by these systems can be affected by even more disturbing factors than with traditional high-performance methods. In these circumstances, the development of accurate and reliable orientation estimation algorithms is still a very challenging task, as it underlies the localization process and, consequently, the performance of the device employed for any specific task. This applies in particular to navigation, be it aerial, maritime or pedestrian, underwater/underground or on the surface, autonomous, remotely operated or traditionally performed. In the specific case of maritime navigation, the position and orientation of a vessel are of great interest for seafarers in different operations and scenarios (e.g., open sea, congested harbours and waterways), as they are strictly related to the safety of navigation at any level [2]. The same goes for Unmanned Surface Vehicles (USVs), which are mainly employed in environmental monitoring, safety or navigation support and research operations. In this case, an inaccurate estimation of the orientation can severely compromise the ultimate success of the mission, especially when paired with low-cost sensors and poor GNSS support. The Inertial Measurement Unit (IMU) gives the instantaneous speed and position of the vehicle without the need for external references, by integrating the measurements of angular velocity and linear acceleration obtained through its three orthogonal rate-gyroscopes and accelerometers, respectively.
Unfortunately, several problems are associated with these sensors; among others, the measurements are noisy and biased, and the errors increase over time due to sensor drift. Micro Electro-Mechanical Systems (MEMS) Attitude and Heading Reference Systems (AHRS) add to this configuration a magnetometer, which measures the variation of the Earth's magnetic field: this makes it possible to instantly calculate an improved estimate while benefiting from lighter weight, smaller size and lower price. The great potential of these devices makes them suitable for several applications that rely on pure orientation estimation, such as geomatics, surveys, augmented reality, etc.
Vision-based methods are also frequently employed for this purpose: these techniques make it possible to understand the surrounding environment by detecting its visual features through a camera; the captured high-resolution color data contain a wealth of information, and the sensors are generally low-cost and easy to set up. In this context, the detection of the horizon line is an important step in maritime image processing, as it makes it possible to estimate the camera's orientation with respect to the sea surface, as well as to restrict the object search region when detection is performed, thus reducing the processing time and the number of false detections. Several approaches have been proposed to solve this task; however, the accuracy and the processing time of horizon line detection on high-resolution maritime images still face some issues [3].
In the last decade, Visual Odometry (VO) and Visual Simultaneous Localization and Mapping (VSLAM) techniques have been successfully developed; however, their application can be challenging too, especially when they are deployed in non-textured environments or in poor light conditions. To reduce these limitations, IMU and camera systems are integrated in Visual Inertial Odometry (VIO) techniques [4]; as a drawback, they require manual intervention to assess possible failure cases, careful and specific tuning of the parameters related to the environment, and a final refinement of the results. In recent years, increasing attention has been paid to Deep Learning (DL) techniques, which have proved robust to camera parameters and harsh scenarios: these methods are in fact able to extract and learn new feature representations from the images they are fed with, and these can further improve the motion estimation [5].
With the aim of providing further enhancements in orientation estimation methodologies, this paper presents DOES, Deep Orientation (of roll and pitch) Estimation at Sea, a new supportive DL model which can be combined with current low-cost IMU-based configurations. This approach is not intended to replace the current systems, but aims at improving the robustness of traditional methods when some limitations occur: the unavailability of GPS signals in indoor and under-surface environments, the undesirably high drift of inertial sensors in case of extended GPS outages and the possible confusion with nearby robots for SONAR and RADAR are some of the limitations associated with these navigation systems. Visual-based methods help in this sense, since they constitute a powerful tool to estimate the pose of a camera, from which the motion information is then recovered. These techniques can be classified as geometric or learning-based: in the first case the camera geometry is used to estimate the motion, whereas in the latter the model is fed with labeled data and then trained to accomplish the same task. The advantage of learning-based methods is that they do not require knowledge of the camera parameters and can estimate the orientation with correct scale even in monocular cases [6]. Moreover, visual methods can be further integrated with traditional, IMU-based orientation estimation algorithms to obtain a robust and reliable visual-inertial odometry system [7]. The work presented in this paper develops an affordable visual, learning-based backbone which estimates the attitude of a monocular camera to be mounted on a vehicle.
The idea behind DOES is in fact to train a DL model able to output the vehicle attitude (in terms of roll and pitch angles) by processing the sea horizon view recorded by a low-cost camera. In particular, the latter needs to be mounted on the surface of an autonomous robot (or, similarly, on the bridge of traditional ships) with its axis parallel to the vehicle longitudinal axis, to correctly frame the horizon line. A similar approach could be further tested on Unmanned Aerial Vehicles (UAVs) too. To lay the foundation for this task, preliminary intensive tests have been conducted to verify the validity of the approach. Different DL architectures have been tested for the processing of the images acquired through an Android smartphone's camera.
In this context, the lack of datasets specifically designed for DL-based orientation estimation at sea became evident. While tackling this issue, the need for acquisition methods ensuring the synchronism of the measurements for a reliable Ground Truth (GT) has been addressed too. For this reason, this paper also presents the first release of the ROll and PItch at Sea (ROPIS) dataset (Fig. 1), which has been created through FrameWO, an Android application developed for this purpose. The choice of employing low-cost sensors meets the need to develop affordable and smart tools to enhance orientation estimation; for this reason, the first release of the dataset has been acquired using open-source libraries and software. In this preliminary release, the operating user acquires the data in the proximity of the seashore, trying to simulate the real behaviour of a ship in navigation.
The aim of this project is to provide a supportive visual-based, low-cost technique for attitude estimation which can be easily deployed in the context of navigation at sea or other challenging scenarios, as it does not need to take into account camera models or related calibration issues.
More in detail, the main contributions of this work can be summarized as follows:
• The development of FrameWO, an Android smartphone application for the simultaneous acquisition of camera images and their corresponding device orientation.
• The release of the ROPIS dataset, consisting of 22173 RGB image/Euler angle samples acquired with the FrameWO application at eight different sea locations.
• A Deep Learning-based method to perform attitude estimation using horizon-depicting frames; DOES is specifically trained on the ROPIS dataset and provides fast and reliable estimations, further encouraging work towards its deployment in real-time scenarios.
The paper is organized as follows: Section II gives a brief overview of the existing literature on the orientation estimation task, addressed through different traditional, visual and DL-based methods; Section III gives a theoretical foundation to the subject, introducing the attitude estimation problem and then describing the DL architectures which best fit the task. In Section IV the ROPIS dataset is presented, highlighting the issues and solutions encountered during the app creation and the data acquisitions. Section V details the experiments performed on DOES, whereas the obtained results are presented and discussed in Section VI; final considerations and future objectives conclude the work in Section VII.

II. RELATED WORKS
The accurate measurement of the orientation plays a critical role in a wide range of activities. AHRS sensors (i.e. accelerometers, gyroscopes and magnetometers) provide reliable measurements whose integration gives accurate information about the pose (position and attitude) of any object they are rigidly attached to. In the last decade, traditional methods have seen a huge improvement due to the integration with different kinds of sensors, aiming at reducing the inertial-related error accumulation and the costs whilst enhancing the robustness of the methodology. As previously mentioned, one of the most effective integrations is with visual-based methods, which leverage the potential of visual features and the low cost of the devices. The following paragraphs give a concise review of the existing literature in the field of orientation estimation.

A. INERTIAL-BASED METHODS
There exists a large amount of literature on the use of inertial sensors for position and orientation estimation. The reason for this lies in their robust algorithms and accurate solutions, which make them suitable for use in several fields. Interestingly, relatively simple position and orientation estimation algorithms work quite well in practice, even if the model choice can appreciably affect the accuracy of the estimates [8].
There is a large and ever-growing number of application areas for inertial sensors, for example robotics and human motion tracking [9], [10], bio-logging for animal behavior research [11], aerial vehicles and aerospace [12], [13], gaming, virtual reality and indoor pedestrian navigation [14]-[16], etc. In fact, accurate inertial sensors and magnetic compasses were first introduced in the navigation field, but along with the development of MEMS technology, low-cost and small-size inertial and magnetic compass sensors appeared in various kinds of consumer electronics, game consoles, virtual reality applications and so on. Orientation representation and sensor fusion still remain challenges to overcome [17]. Real-time orientation estimation algorithms based on low-cost IMUs are analyzed in [18], where the approach is based on the relationships between the quaternion representing the platform orientation and the sensor measurements, and the integration is performed through an Extended Kalman Filter (EKF). The researchers in [19] developed a low-cost and low-weight attitude estimator for autonomous helicopters based on an inclinometer and a gyroscope, fusing the data coming from the sensors through a classic complementary filter; in [20] a gyro-free, quaternion-based attitude determination system which exploits low-cost sensors is presented. Reference [21] implemented a complementary filter able to infer Micro Aerial Vehicle (MAV) attitude from observations of gravity and magnetic field, with the final algorithm able to work with both IMU and MARG sensors. The authors in [13] exploited an AHRS device together with an Unscented Kalman Filter algorithm to perform attitude estimation on UAVs. The same filter has been used in [22], which developed a novel navigation system for autonomous underwater vehicles that works without a GPS device, which is not available in underwater scenarios.
Researchers in [23] proposed an Adaptive Kalman Filter which is able to provide pose estimations based on low-cost AHRS devices, whereas [24] and [25] investigated the use of AHRS in smartphones as cheap but reliable devices for angle estimation. A novel error-state Kalman filter is presented in [26], which provides highly accurate IMU orientation estimates that are robust to fluctuations, whether in the recorded local magnetic field or caused by abrupt movements. An indoor pedestrian navigation method based on a shoe-mounted MEMS IMU and ultra-wideband is discussed in [27], which used a quaternion-based Kalman Filter to integrate the data and to reduce the complexity of the method. In [28] a new orientation estimation strategy for a non-accelerated platform is presented. Based on a low-cost IMU, this method uses a nonlinear Luenberger observer to estimate the angles and a recursive least-squares algorithm to calibrate the common magnetometer offsets. The authors in [29] describe a calibration method for MEMS IMUs mounted on electric bicycles that can be run in real time thanks to its independence from sensor biases and its very low computational cost.

B. VISION-BASED METHODS
The possibility of employing visual data to perform orientation and, in general, pose estimation has been widely investigated in the past decades. Much research has focused on horizon line detection, due to its relevance for visual geo-localization, port security, etc. However, some special features of real marine environments (e.g., cloud clutter, sea glint and weather conditions) frequently result in different kinds of interference in optical images. The authors in [30] proposed a Sea-Sky Line (SSL) detection method for USVs based on the computation of the gradient saliency, through which the line features of the SSL are effectively enhanced while other disturbances are attenuated. The SSL identification is achieved according to region contrast, line segment length and orientation features, and optimal state estimation of the SSL detection is implemented by a cubature Kalman filter. In [31] a fast method for detecting the horizon line in maritime scenarios is presented. It combines a multiscale approach and a region-of-interest (ROI) detection, which efficiently reduces the amount of information to be processed. A single edge map is then produced, and the Hough transform and a least-squares method are sequentially applied to accurately estimate the horizon line. The Hough transform is also used in [32], which proposed a sea-sky line detection system based on local Otsu segmentation; similarly, the authors in [33] recognized the horizon line in maritime images through a two-phase, coarse-fine detection algorithm which increases the overall robustness of the method. Another quick horizon line detection method is proposed in [34], which extracts the horizon line in real maritime images with improved reliability and faster execution with respect to its competitors. Horizon detection through vision sensors is also frequently exploited to obtain redundant orientation information in the field of unmanned aerial navigation.
For example, authors in [35] proposed two attitude estimation methods: the first one searches for the best line fitting the horizon in thermal images, which allows to further estimate the pitch and roll angles using an infinite horizon line model. The second method exploits a Convolutional Neural Network (CNN) which predicts the angles on the basis of the raw pixel intensities from the same kind of images.
However, these methods alone cannot be considered totally robust and reliable, since the position and slope of the horizon are strictly related to the camera intrinsic (i.e., focal length, optical center, pixel aspect ratio and skew) and extrinsic (rotation and translation) parameters and to the model used to parametrize them. In [36] the authors surveyed a plethora of methods which perform pose estimation by fusing visual, inertial and magnetic measurements, integrating them through the use of an EKF. The combined use of IMU and vision information has been explored by [1], which exploits SURF visual features together with accelerometer and gyroscope data to retrieve the robot pose in an indoor setting. A comprehensive analysis of the behaviour of these features when used for visual odometry can be found in [37].
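The dependence of the horizon on the camera parameters can be made concrete with a minimal sketch. It assumes an ideal pinhole camera with square pixels, no lens distortion and a horizon at infinity; the function name, the argument names and the sign conventions (image y axis pointing down, positive pitch when the camera looks above the horizon) are illustrative assumptions, not part of the cited methods.

```python
import math

def horizon_to_roll_pitch(p1, p2, focal_px, cx, cy):
    """Roll and pitch (radians) from two pixel points lying on the horizon.

    Assumptions (illustrative only): ideal pinhole camera, square pixels,
    focal length focal_px in pixels, principal point (cx, cy), no lens
    distortion, horizon at infinity, image y axis pointing down.
    """
    (x1, y1), (x2, y2) = p1, p2
    if x2 < x1:  # order the two points from left to right
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("the two horizon points must be distinct")
    # Roll: inclination of the horizon line with respect to the image rows.
    roll = math.atan2(dy, dx)
    # Signed perpendicular distance (pixels) from the principal point to the
    # line; it is positive when the horizon lies below the image centre.
    dist = ((cx - x1) * dy - (cy - y1) * dx) / length
    # For a pinhole camera this distance equals focal_px * tan(pitch).
    pitch = math.atan2(dist, focal_px)
    return roll, pitch
```

The same horizon line yields a different pitch if `focal_px` or the principal point changes, which is why purely geometric approaches need a calibrated camera model.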
VO, VIO and SLAM algorithms have recently received much attention for their efficient and accurate ego-motion estimation in robotics. A VIO algorithm for the estimation of the motion state of UAVs with high accuracy is presented in [38]. Visual data and pre-integrated inertial measurements are here integrated in an optimization framework; the stable initialization of scale and gravity through pose constraints together with a local scale parameter allowed to take into account the uncertainty of the VIO initialization.
The use of stereo camera sensors for VO is a reliable and low-cost way to estimate attitude, but may encounter problems when deployed underwater. This setting is in fact characterized by poor imaging and usually inconsistent motion due to the water flow. This issue has been tackled by [39], which proposed an AUV localization technique based on a stereo underwater VO system to overcome the aforementioned difficulties. In the context of underwater robotics, [40] presented another VO method which proved robust to visual perturbations in many challenging scenarios. In [41] a novel key-frame based SLAM system is proposed, where a robust initialization aims at refining the scale through the use of depth measurements. Together with an improved image quality and a fast preprocessing step, this was shown to solve the localization drift and loss issues. A monocular VI-SLAM algorithm providing accurate and robust motion tracking is presented in [42]. This is developed in two parallel threads: the first one deals with the EKF motion tracking, updated through a consistent map to reduce the drift. In the second one, a visual-inertial bundle adjustment is performed on the obtained global maps to optimize the overall results. ORB-SLAM3 [43] is another method worth mentioning in this context. It supports monocular, stereo and RGB-D cameras in the VI and SLAM approach, ensuring robust real-time operation in any kind of environment thanks to Maximum-a-Posteriori estimation.
The rise of Deep Learning, with powerful architectures able to tackle complex tasks such as classification [44], detection [45], segmentation [46], denoising [47] and super resolution [48], has definitely changed the way vision data are exploited for pose estimation. Instead of relying on engineered, fixed features (e.g. SIFT [49], SURF [50]), recent algorithms exploit deep networks either as powerful feature extractors or to directly estimate the pose vector in an end-to-end model, from the input images to the output prediction. For example, in order to estimate camera orientation, [51] exploited an LSTM deep network together with a linear Kalman Filter to combine IMU and camera data, whereas in DeepVIO [5] the authors fused 2D optical flow features with standard inertial data, obtaining state-of-the-art results on the KITTI [52] and EuRoC [53] datasets. The combination of a traditional IMU with a LIDAR laser scan has been proposed in [54], where a recurrent CNN performs this aggregation on a scan-to-scan basis. In [55] the researchers proposed a method to estimate the six degrees of freedom and absolute scale of a camera by exploiting unsupervised data, obtaining good results in terms of pose accuracy on the KITTI benchmark. In [56], the authors developed a generative framework able to exploit a GAN [57] model on unlabelled RGB images for 6-DoF camera motion prediction, demonstrating the efficacy of their approach on both the KITTI and Cityscapes [58] datasets. The former method has been improved in [59] with a stack of GAN layers which proved effective on ego-motion estimation tasks. A comprehensive review of the state-of-the-art deep models for pose estimation can be found in [60].

III. METHOD
This section aims at providing a theoretical background to fully understand the fundamentals of the proposed work. In particular, a general overview on the orientation estimation process is given in subsection III-A, with some details on the sensors embedded in an AHRS and on the coordinate frame to which the smartphone device (and the related measures) is referred. Subsection III-B presents in a concise but detailed way the deep architecture models analysed and tested during the work.

A. ORIENTATION ESTIMATION OVERVIEW
The orientation of a rigid body is generally expressed through a parametrization based on Euler angles, unit quaternions, rotation vectors or rotation matrices [61]. Among them, the Euler angles allow for a more intuitive analysis in 3D space and can be defined as follows:
• φ is known as the roll angle and defines the rotation about the x axis;
• θ (pitch angle) refers to the rotation about the y axis;
• ψ (yaw or heading angle) represents the rotation about the z axis.
The correct integration of the raw data of the IMU, or of the more cost-effective AHRS, is at the basis of the orientation estimation process. The accelerometer measures the acceleration in m/s² applied to a device, including the force of gravity: velocity is obtained if the linear acceleration component is integrated once, and position if the integration is performed twice. The results can be of poor accuracy due to the extensive noise and accumulated drift from which the sensor suffers. The rotation angles can be obtained by integrating the angular velocities in rad/s provided by the gyroscope; although they are sensitive to sudden and fast motion, these sensors generally experience major drift issues due to error accumulation over long periods. For the aforementioned reasons, pose estimation is usually performed through the fusion of gyroscopes and accelerometers, to leverage their potential whilst attenuating their weaknesses. The measurements of the Earth's magnetic field (in µT) provided by the magnetometer can be added to the previous ones to improve the heading determination; however, they suffer from the influence of metallic objects, which can heavily affect the accuracy of the data collection. Moreover, the overall drift introduced by the sensor system causes error accumulation: this means that the reliability and accuracy of the navigation information are guaranteed only over short times, with the precision of the measurements decreasing throughout long missions.
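The effect of drift on an integrated angle can be illustrated with a short sketch. The scenario is hypothetical: a device at rest whose gyroscope reports a constant bias of 0.5 deg/s, sampled at 100 Hz; the function name and all numerical values are illustrative assumptions.

```python
def integrate_gyro(rates_dps, dt):
    """Integrate angular rates (deg/s), sampled every dt seconds, into angles (deg)."""
    angle = 0.0
    history = []
    for rate in rates_dps:
        angle += rate * dt  # simple rectangular (Euler) integration
        history.append(angle)
    return history

# Hypothetical scenario: the device does not move, but the gyroscope
# reports a constant bias instead of zero.
BIAS_DPS = 0.5      # illustrative constant bias, deg/s
DT = 0.01           # sampling period in seconds (100 Hz)
N_SAMPLES = 6000    # 60 s of data

angles = integrate_gyro([BIAS_DPS] * N_SAMPLES, DT)
drift_after_60s = angles[-1]  # grows linearly with time: bias * elapsed time
```

In this setting the estimated angle reaches about 30 degrees after one minute although the device never moved, which is why gyroscope integration alone is only reliable over short time spans.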
For this reason, the measurements provided by the three sensors are integrated in order to reduce the error accumulation caused by each single one; this is generally done through filtering techniques and fusion methods. Moreover, information provided by external devices can considerably improve the accuracy of the estimations, especially when low-cost sensors can facilitate the process and make it more practical.
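As one example of such a fusion method, a complementary filter can be sketched as follows. This is a generic textbook formulation, not the filter running inside any specific AHRS: it assumes that the accelerometer reads approximately (0, 0, +g) when the device is level and at rest, it treats the body angular rates as Euler angle rates (a small-angle approximation), and the blending factor `alpha` is an arbitrary illustrative value.

```python
import math

def accel_to_roll_pitch(ax, ay, az):
    """Roll and pitch (rad) from the gravity vector measured at rest.

    Assumes the accelerometer reads about (0, 0, +g) when the device is level.
    """
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    return roll, pitch

def complementary_filter(samples, dt, alpha=0.98):
    """Fuse gyroscope and accelerometer samples into roll/pitch estimates.

    samples: iterable of (gx, gy, ax, ay, az), with gyroscope rates in rad/s
    and accelerations in m/s^2. Returns a list of (roll, pitch) in radians.
    """
    roll = pitch = 0.0
    estimates = []
    for gx, gy, ax, ay, az in samples:
        acc_roll, acc_pitch = accel_to_roll_pitch(ax, ay, az)
        # High-pass the integrated gyroscope angle, low-pass the
        # accelerometer angle, and blend the two.
        roll = alpha * (roll + gx * dt) + (1.0 - alpha) * acc_roll
        pitch = alpha * (pitch + gy * dt) + (1.0 - alpha) * acc_pitch
        estimates.append((roll, pitch))
    return estimates

# Hypothetical test signal: device level and at rest, gyroscope x axis
# affected by a constant bias of 0.01 rad/s, sampled at 100 Hz for 60 s.
samples = [(0.01, 0.0, 0.0, 0.0, 9.81)] * 6000
final_roll, final_pitch = complementary_filter(samples, dt=0.01)[-1]
```

With these values, pure integration of the biased rate would accumulate 0.6 rad of roll error over the minute, whereas the fused estimate stays bounded below 0.01 rad because the accelerometer term continuously pulls it back towards the gravity direction.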
In this context, the objective of the present work is to provide a supportive means of improving the attitude estimations obtained by common AHRS: DOES is a low-cost DL architecture developed to recover orientation information from the view of a camera pointing at the horizon at sea, which will be placed on the bow of a navigating vehicle in future experiments. The training has been performed on the ROPIS dataset, acquired on an Android smartphone using an application developed for this purpose, which simultaneously collects the frames and calculates the corresponding Ground Truth data using the AHRS sensors.
The IMU-AHRS measurements of smartphones are generally expressed in a custom body reference frame. The Android developer website defines its frame relative to the device's screen when the device is held in its default orientation (see Fig. 2, [62]). In particular, the frame originates in the center of the device, with the horizontal x axis pointing to the right, the vertical y axis pointing up and the z axis pointing toward the outside of the screen face, so that the coordinates behind the screen have negative z values. The related attitude information is then referred to the same coordinates.
During the ROPIS dataset acquisition the smartphone was kept in landscape mode, recording the horizon view. It should be noted that the coordinate frame does not change its definition, so in this setting the z axis points towards the user, the y axis to his/her left and the x axis upwards.
[...] network, as for example the VGG16-19 [63] and ResNet18-50-152 [64]; the resulting numerical comparison will be reported in Section VI, Table 3.
The VGG-16 and VGG-19 networks are based on the popular VGG architecture. They are composed of several convolutional layers followed by a Rectified Linear Unit (ReLU) activation function and interspersed by max pooling layers. Two FC layers are concatenated in order to produce the final features which are fed to a classification layer. These two networks differ only by the quantity and dimension of the convolutional layers employed, with a total number of parameters equal to 138M and 144M respectively. Despite being among the first developed deep architectures, with a huge amount of trainable parameters making them prone to overfitting, VGG models are still incredibly widespread, thanks to their ease of use for finetuning purposes on different tasks [65], [66].
ResNet is a family of deep models based on the residual architecture. Unlike VGG, ResNet is made of a series of residual blocks in which the feature maps calculated by the convolutional layers are added to the input, so that each residual block calculates an update (hence residual) of the input feature maps. This approach makes the network resilient to the vanishing gradient problem [67], improving the convergence speed and the final accuracy. Moreover, all the ResNet models avoid the use of FC layers after the convolutional blocks, reducing the total number of trainable parameters and thus lessening overfitting on the training data. The authors of ResNet developed three versions with different numbers of layers (18, 50, 152) and different numbers of visual features before the classification step (512 for the first, 2048 for the others). The numbers of free parameters for the 18-, 50- and 152-layer models are 11M, 23M and 60M respectively.
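The residual update described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration: small dense layers stand in for the convolutional layers, batch normalization is omitted, and the function names and weights are hypothetical.

```python
def relu(vector):
    """Element-wise Rectified Linear Unit."""
    return [max(0.0, x) for x in vector]

def linear(weights, vector):
    """Matrix-vector product, standing in here for a convolutional layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def residual_block(features, w1, w2):
    """Compute ReLU(features + F(features)), with F = W2 * ReLU(W1 * features).

    The block only learns the update F that is added to its input, so with
    all-zero weights the non-negative input features pass through unchanged.
    """
    update = linear(w2, relu(linear(w1, features)))
    return relu([x + u for x, u in zip(features, update)])

# Illustrative 2-dimensional example with hypothetical weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
unchanged = residual_block([1.0, 2.0], zeros, zeros)       # input passes through
updated = residual_block([1.0, 2.0], identity, identity)   # input plus update
```

Because the input is carried forward by the addition, the gradient always has a direct path around the learned layers, which is the intuition behind the resilience to vanishing gradients mentioned above.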
In the experiments presented in this work, all the networks have been fine-tuned on the proposed ROPIS dataset starting from the ImageNet [68] pre-trained weights. The ResNet18 has been chosen among the others as the default DOES backbone since it produced the best accuracy while keeping at the same time a fast inference speed. Fig. 3 reports the DOES network with the default ResNet18 backbone.
Two FC layers have been added as separate branches on top of the highest-level set of visual features in the backbone network, to separately estimate the roll and pitch angles; for example, in the case of the ResNet models, this corresponds to the output of the global average pooling layer. Different estimation procedures have been tried, such as the one described in [69]: it proposes to map the float angle value to a set of fixed bins, which then undergo a standard classification procedure with a final mapping back to the float value. However, in this work it has been experimentally found that this approach adds a layer of complexity without increasing the overall performance; this led to the decision to add one FC layer for each angle, which is able to accomplish the regression task with good accuracy. Both the backbone network and the additional FC layers are jointly trained by back-propagation with a standard Mean Square Error loss (squared L2 norm). Two separate losses are calculated, one for each of the two angles, as reported in (1) for roll (L_roll) and (2) for pitch (L_pitch), where y and ŷ are the GT and predicted values respectively. The final loss L_final is then obtained as the simple sum of the aforementioned quantities, as shown in (3). The GT roll and pitch values have undergone a prior normalization process, which subtracts the mean from each of them and divides by the variance, both calculated over the entire dataset.
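Under the description above, each per-angle loss is a mean of squared differences between GT and predicted values, and the final loss is their sum. A minimal sketch in plain Python is given below; the function names are hypothetical, the variance is assumed to be the population variance, and the averaging over the samples of a batch is an assumption about how the Mean Square Error is reduced.

```python
def normalize(values):
    """Subtract the mean and divide by the variance, as described in the text.

    Assumption: population variance (division by the number of samples).
    Returns the normalized values together with the statistics used.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / variance for v in values], mean, variance

def denormalize(value, mean, variance):
    """Map a normalized prediction back to the original angle scale."""
    return value * variance + mean

def mse(y_true, y_pred):
    """Mean Square Error between GT and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def does_loss(roll_gt, roll_pred, pitch_gt, pitch_pred):
    """Return (L_final, L_roll, L_pitch) with L_final = L_roll + L_pitch."""
    l_roll = mse(roll_gt, roll_pred)
    l_pitch = mse(pitch_gt, pitch_pred)
    return l_roll + l_pitch, l_roll, l_pitch
```

At inference time a prediction would have to be mapped back with `denormalize`, using the same dataset statistics, before being compared with angles expressed in degrees or radians.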

IV. ROPIS DATA ACQUISITION PROCESS
The lack of datasets designed for DL-based orientation estimation at sea led to the need for a method to acquire a suitable set of data. In this section, the development of the Android application and the resulting ROPIS dataset will be described in detail.

A. DEVICE INTERNAL SENSORS AND CHARACTERISTICS
In order to train the model, the dataset needs to contain a large number of images showing the horizon and the corresponding GT data in terms of roll and pitch angles. The latter must be as accurate as possible, since the results of the learning process depend on it; this accuracy is in turn strictly related to the instrumentation employed for the acquisition. With the aim of producing a low-cost and flexible solution, in this work the authors avoided the use of costly, high-end IMU devices and developed the FrameWO Android application to acquire the dataset through a common smartphone. The presented ROPIS dataset in its first release has been entirely collected through a OnePlus Nord smartphone, equipped with the most common sensors (Table 1) and characterized by an average price.

The OnePlus Nord mounts a BMI260 IMU, which contains a 16-bit tri-axial gyroscope (G) and accelerometer (A) providing fast, precise inertial sensing in smartphones and Human-Machine Interface (HMI) applications (e.g., advanced gesture, activity and context recognition). The IMU is characterized by a noise density of 160 µg/√Hz (A) and 0.008 dps/√Hz (G), a Zero-g/Zero-rate offset of ±20 mg (A) and ±0.5 dps (G) and an output data rate up to 1.6 kHz (A) and 6.4 kHz (G). Moreover, it mounts the industry's first self-calibrating gyroscope with motionless Component Re-Trimming (CRT) functionality, which compensates for typical MEMS soldering drifts, ensuring post-soldering sensitivity errors down to ±0.4% [70].
The MMC5603 is a monolithic complete 3-axis Anisotropic Magnetoresistance Effect (AMR) magnetic sensor. It has an on-chip automatic degaussing with built-in SET/RESET function which eliminates the thermal variation-induced offset error and clears the residual magnetization deriving from strong external fields. Its true frequency response is up to 1 kHz and it can measure magnetic fields in a range of ±30 Gauss (G) with 2 mG total Root Mean Square (RMS) noise level, enabling heading accuracy of ±1° in electronic compass applications [71].
The Sony IMX586 stacked CMOS image sensor is mounted as the main camera of the OnePlus Nord, and features 48 effective megapixels with an ultra-compact pixel size of 0.8 µm. The sensor uses the Quad Bayer color filter array, where adjacent 2 × 2 pixels come in the same color, making high-sensitivity shooting possible. During low light shooting, the signals from the four adjacent pixels are added, raising the sensitivity to a level equivalent to that of 1.6 µm pixels (12 megapixels), resulting in bright, low noise images [73].

B. FrameWO APPLICATION DEVELOPMENT
The FrameWO app has been developed in a free Open Source environment, the B4X suite [74], which supports the majority of PC, smartphone and embedded operating systems (e.g., Android, iOS, Windows, macOS, Linux, Arduino, Raspberry Pi) and uses a modern version of Visual Basic as programming language. The Android version (B4A) allows existing Java code to be wrapped as an external library and then referenced from the B4A IDE, obtaining in release mode performance similar to that of Java. The size of a simple app is generally around 100 KB.
As previously mentioned, the necessary prerequisite for the dataset to meet the scope of this study is that each frame is associated with the corresponding GT; however, the size of the images is much larger than that of the IMU data, which introduces a delay in their storage and affects their simultaneity. For this reason, the app captures the frames in YUV format (allowing for a better compression of the image) and converts them to JPEG only at the end of the process; this also avoids running out of memory during the acquisition. A detailed overview of the YUV model can be found in [75]. Furthermore, several tests have been performed to determine an acquisition frequency value suitable for both the high-rate IMU data and the low-rate camera frames: the application in fact allows the camera acquisition interval to be set in milliseconds, so that the best option for the user's needs can be chosen.
As regards the GT, the Android API [62] has been used to process the raw measurements read by the sensors and to obtain the Euler angles of interest. The getRotationMatrix function performs a coordinate system transformation (in this case from the device frame to the world frame) and takes as input the gravity and geomagnetic field vectors to compute the inclination matrix I and the rotation matrix R. By definition, I is the rotation around the X axis which converts the geomagnetic vector into the gravity coordinate space, whereas R is the identity matrix when the device is aligned with the world coordinate system: in this setting, the device faces the sky with the X axis pointing East and the Y axis pointing to the North Pole (see (4), where g is the magnitude of gravity and m is the magnitude of the geomagnetic field).
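Equation (4) is not reproduced in this text. The relations given for getRotationMatrix in the Android SensorManager documentation, which the description above follows, can be written as below; that this matches the original (4) term by term is an assumption, and the vector names gravity and geomagnetic simply denote the two inputs of the function.

```latex
\begin{equation}
R \cdot \mathbf{gravity} = \begin{bmatrix} 0 \\ 0 \\ g \end{bmatrix},
\qquad
I \cdot R \cdot \mathbf{geomagnetic} = \begin{bmatrix} 0 \\ m \\ 0 \end{bmatrix}
\tag{4}
\end{equation}
```

In words, R maps the measured gravity vector onto the world vertical axis, while I applied after R maps the measured geomagnetic vector onto the world North axis.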
In order to isolate the gravity vector, a discrete-time low-pass filter with a smoothing factor α = 0.2 has been applied to the accelerometer measurements. The Euler angles are recovered through the getOrientation function, which calculates them from the elements of the rotation matrix R [62], [76].
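As an illustration of the gravity isolation step, the following minimal sketch applies a first-order discrete-time low-pass filter to a sequence of tri-axial accelerometer samples. The update rule g[k] = g[k-1] + α (a[k] - g[k-1]) is an assumption: the text only states the value α = 0.2, and the opposite convention (α weighting the previous estimate) is also in common use. The function name low_pass is hypothetical and not part of FrameWO.

```python
def low_pass(samples, alpha=0.2):
    """Smooth a sequence of (x, y, z) accelerometer samples.

    Assumed convention: alpha weights the NEW sample, i.e.
    g[k] = g[k-1] + alpha * (a[k] - g[k-1]).
    Returns one gravity estimate per input sample.
    """
    gravity = list(samples[0])          # initialise with the first reading
    estimates = [tuple(gravity)]
    for sample in samples[1:]:
        for axis in range(3):
            gravity[axis] += alpha * (sample[axis] - gravity[axis])
        estimates.append(tuple(gravity))
    return estimates


if __name__ == "__main__":
    # A sudden 10 m/s^2 spike on the x axis is attenuated to 2 m/s^2.
    print(low_pass([(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]))
```

With this convention a small α suppresses the fast, motion-induced components of the accelerometer signal at the cost of a slower response to genuine attitude changes.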

C. DATASET STRUCTURE
The ROPIS dataset in its first release has been mainly acquired in Italy, in the cities of Gaeta (Lazio) and Racale (Puglia). It consists of 22173 sRGB TrueColor JPEG images, with resolution set to 2592 × 1168, for a total size of 42.3 GB. Six different subsets have been acquired in as many locations, each presenting different characteristics in terms of scenarios and meteo-marine conditions; five of them have been chosen for the training set, from which a total of 100 frames has been separated for the validation set, and the last acquisition has been used as test set. The use of a dedicated test set with images coming from a separate location makes it possible to verify the ability of DOES to generalize to new scenes that differ from those of the training and validation sets. More specifically, in each place eight different acquisitions have been made, trying to simulate the behaviour of a ship in navigation in both static and dynamic conditions: this aims at emulating the induced oscillations which resemble the true motion of the ship. To improve the generalization ability of the model, the data have been acquired at different times of day and with both sunny and cloudy skies; Fig. 4 shows different samples of the ROPIS dataset. Some aspects of these data need to be highlighted:
• The point of view of the ROPIS images presents some differences with respect to acquisitions taken on board a ship, since it adds parts of the land in the image foreground, such as sand, rocks, etc. However, this does not affect the learning procedure, as DL networks are able to distinguish useful from useless image features and discard the latter.
• A frame representing the real view from a navigating vehicle should depict some elements in the scene, such as the bow structures and some part of the bridge floor of a ship, or some of the USV sections. Although these specific features do not appear in ROPIS, DOES demonstrated its robustness to similar image clutter present in the frames. Further experiments will be made to precisely assess their impact on the learning process.
• The data acquisition has been made with the camera at a roughly fixed height of 1.5 m with slight oscillations around this value: this accounts, among the different vehicle movements, also for the linear vertical (up/down) motion along the z axis (heave), corresponding to the smartphone x axis. It should be remarked that the pitch estimation is strictly related to the horizon height and thus to the camera axis and view; for this reason, the horizon line must always be visible in the frame.
Fig. 5 shows the workflow of DOES in its three main phases: the data acquisition, the training with its specific data augmentation process, and the test from which the evaluation metrics are finally calculated.
The ROPIS dataset is intended to be further enhanced. The use of other low-cost cameras (to take into account the differences in the camera parameters and lens distortion) and the setting of a range of different camera height values aim at considering their impact on the training phase. Moreover, the acquisitions will be made in different scenarios, which will include adverse meteo-marine conditions and locations such as ship bridges and USV platforms. The heterogeneity of the data fed to the network will enhance the capability of the model to generalize over more complex data and realistic settings, making it invariant to these parameters.

V. EXPERIMENTAL SETUP
In this section some details on the training process will be given, together with a brief overview of the evaluation metrics used to appraise the performance of DOES. Finally, the problem related to the comparison of DOES with other methods will be discussed.

A. TRAINING DETAILS
DOES has been developed in the Python programming language using the PyTorch framework; the code is publicly available (footnote 1). DOES has been trained using a standard fine-tuning procedure: the backbone convolutional kernels were pre-trained on ImageNet, while the additional FC layers have been initialized with random values drawn from the PyTorch default uniform distribution. Both convolutional and FC layers have been trained using the Adam optimizer [77] and a fixed learning rate set to 0.001. DOES has been trained on the ROPIS training set for a total of 10 epochs: it has in fact been noticed that a larger number of epochs led to increased overfitting without any improvement in accuracy.
The images have been squared to a preliminary 2592 × 2592 resolution by the application of zero-padding; this operation adds black bands along the smaller dimension to obtain a square input whilst preventing the loss of information. The images have then been resized to a final resolution of 224 × 224; a zero mean-unit variance normalization has been applied to both the images and the GT sets, with the corresponding mean and variance calculated over the specific training data.
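The geometry of the padding step and the normalization round trip can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether the black bands are split evenly between the two sides (an even split is assumed here), and the names square_padding, standardize and destandardize, as well as the spread argument standing for whatever scale statistic is used, are hypothetical.

```python
def square_padding(width, height):
    """Return (left, top, right, bottom) zero-padding that makes the image square.

    Assumption: the padding is split as evenly as possible between both sides.
    """
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


def standardize(values, mean, spread):
    """Map raw values to normalized ones: (v - mean) / spread."""
    return [(v - mean) / spread for v in values]


def destandardize(values, mean, spread):
    """Inverse of standardize, used before computing metrics in degrees."""
    return [v * spread + mean for v in values]


if __name__ == "__main__":
    # A 2592 x 1168 ROPIS frame needs 1424 rows of padding in total.
    print(square_padding(2592, 1168))
```

For a 2592 × 1168 frame this yields 712 padded rows above and below the image, after which the 2592 × 2592 result is downscaled to 224 × 224.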
The data augmentation process consisted of random changes in the colours of the images, using the ColorJitter transformation function of PyTorch, which randomly varies brightness, contrast, saturation and hue within configurable ranges: this resulted in an increase of the training dataset which further enhanced the generalization abilities of DOES. Neither random cropping nor image flipping has been applied during this process: the former would have discarded the relative sea height information given by the images, whereas the latter could have altered the network's perception of the correct roll angle. The data augmentation procedure has naturally been deactivated during the testing phase, whereas the zero-padding and resize processes have been applied also to the test images; furthermore, the predicted roll and pitch values have been de-normalized before calculating the evaluation metrics presented in Section V-B. The selected data augmentation values (brightness and hue equal to 0.5, contrast and saturation equal to 5), as well as all the other training hyperparameters, have been tuned on the validation set.

B. EVALUATION METRICS
DOES has been evaluated on the basis of the regression metrics implemented by the scikit-learn library in the sklearn.metrics module, which contains the most common utility functions to measure the regression performance.
The Mean Absolute Error (MAE) computes a risk metric corresponding to the expected value of the absolute error (5); it is the average absolute difference between the predicted and the true value, expressed in the same scale as the data being measured. Each error contributes to MAE in proportion to its absolute value.

$$\mathrm{MAE}(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \tag{5}$$

Footnote 1: https://github.com/fabidicia/does
The Root Mean Square Error (RMSE) represents the square root of the second sample moment of the differences between predicted values and the observed values (or the quadratic mean of these differences, also called residuals). It is a measure of accuracy and it is sensitive to outliers (6). In fact, since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors, making it more useful when large errors are particularly undesirable. RMSE does not necessarily increase with the variance of the errors, growing instead with the variance of the frequency distribution of error magnitudes.

$$\mathrm{RMSE}(y,\hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}} \tag{6}$$
The Standard Deviation (STD) is a measure of the amount of dispersion (or variation) of the samples. A low standard deviation indicates that the values tend to be close to the mean µ (also called the expected value) of the set, whereas a high standard deviation indicates that the values are spread out over a wider range (7).
Finally, the Median Absolute Error (MedAE) is calculated by taking the median of all the absolute differences between the GT and the prediction (8). It is a non-negative floating point with best value of 0.0, robust to outliers since the median is not affected by values at the tails.
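The four metrics can be illustrated with a short standard-library sketch. It is not the evaluation code of DOES, which relies on sklearn.metrics; the function name regression_metrics is hypothetical, and taking the STD over the signed errors (population form) is an assumption, since the text does not state which samples the standard deviation is computed on.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, STD and MedAE for two equally long sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    return {
        "MAE": sum(abs_errors) / len(abs_errors),
        "RMSE": math.sqrt(sum(e * e for e in errors) / len(errors)),
        # Assumption: dispersion of the signed errors around their mean.
        "STD": statistics.pstdev(errors),
        "MedAE": statistics.median(abs_errors),
    }


if __name__ == "__main__":
    # Toy angles in degrees; two small and two larger errors.
    print(regression_metrics([0, 0, 0, 0], [1, -1, 3, -3]))
```

In this toy case the MAE is 2 while the RMSE is the square root of 5, showing how the squared term gives more weight to the two larger errors.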

C. METHODOLOGY COMPARISON
The comparison between DOES and other state-of-the-art methods turned out to be a non-trivial task for several reasons; among others, the Deep Learning-based solutions currently developed for the estimation of roll and pitch are either released without source code (as for example in [35]) or employed for very different tasks (e.g., head pose estimation [69]), which makes a comparison either not fully fair or practically impossible. Generally speaking, traditional Horizon Line Detection (HLD) algorithms can be used as a proxy for this kind of estimation; the roll and pitch angles can in fact be correlated to the slope and position of the horizon line. However, as previously mentioned, this would require the correct knowledge of the intrinsic and extrinsic camera parameters and of the transformation matrix between the camera and the smartphone reference systems.
To address this problem, a Linear Least Squares method has been applied to calibrate the HLD algorithms by minimizing the squared error between their output predictions and the GT values. Two of the HLD algorithms best known in the scientific community have been selected to perform this comparison and are briefly described in the following lines.
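The calibration idea can be sketched with a closed-form least-squares fit. The text does not specify the parametrization used, so the single-variable affine model angle ≈ slope · hld_output + intercept, fitted separately for each angle, is an assumption made here for illustration; the function name fit_affine is hypothetical.

```python
def fit_affine(x, y):
    """Least-squares fit of y ~ slope * x + intercept.

    x: raw HLD outputs (e.g. horizon slope or vertical position)
    y: corresponding GT angles
    Returns (slope, intercept) minimizing the sum of squared residuals.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic data lying exactly on y = 2x + 1.
    print(fit_affine([0, 1, 2, 3], [1, 3, 5, 7]))
```

Once fitted, the same slope and intercept would be applied to the HLD outputs before computing the evaluation metrics, so that all methods are compared in the same angular units.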
The Otsu method [78] is a popular technique used to threshold the image into sky and non-sky regions. It is a reasonably fast and simple algorithm which performs fairly well on heterogeneous sets of data. The threshold value T is automatically computed under the assumption that the grayscale histogram of the image pixel intensities is bi-modal; T is chosen to maximize the between-class variance of the two resulting groups of pixels, i.e., to best separate the two histogram modes.
The algorithm of Ettinger et al. [79] is a computer vision-based HLD method that performs an exhaustive search in the 2D line parameter space over the whole image, looking for the values that best separate sky from terrain. However, since it is slow on high-resolution images, a modified version has been implemented that uses a two-stage objective: the global stage searches for a narrow range of combinations of the pitch and roll horizon line angles corresponding to a half-plane that likely separates the sky from the rest of the image, while the local stage searches exhaustively through these combinations to find the half-plane that maximizes the difference (in average intensity) of the two half-planes in their immediate vicinity. This method assumes that the sky pixels have higher intensity values than the ground pixels (higher mean), and that the sky has higher consistency of representation (lower variance).
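A minimal sketch of Otsu thresholding on a flat list of 8-bit grayscale values is given below. It is a generic textbook formulation, not the implementation used in the comparison; the function name otsu_threshold is hypothetical, and pixels with intensity less than or equal to the returned value are assumed to form the darker class.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold maximizing the between-class variance.

    pixels: iterable of integer intensities in [0, levels - 1].
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the darker class
    sum0 = 0    # intensity sum of the darker class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no valid split
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance w0*w1*(mu0-mu1)^2 / total^2.
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


if __name__ == "__main__":
    # Dark "sea" pixels around 10 and bright "sky" pixels around 200.
    print(otsu_threshold([10] * 50 + [200] * 50))
```

On a bi-modal input such as the one above, any threshold between the two modes separates the classes, and the sketch returns the first such value.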

VI. RESULTS AND DISCUSSION
This section contains an assessment of the results provided by DOES. Table 2 shows the performance of DOES with respect to the selected horizon line detection algorithms. DOES achieves noticeably better results on both roll and pitch angles, with a Mean Absolute Error close to 1.5°, as opposed to the other methods, which exhibit worse performance on all the indicators.
The MAE and the RMSE can be used together to diagnose the variation in the errors in a set of predictions. The RMSE is generally higher than the MAE, and the greater the difference between them, the greater the variance in the individual errors of the samples; moreover, if the RMSE is close to the MAE, then all the errors are of similar magnitude. In the case of the current comparison, the small gap between RMSE and MAE indicates that DOES produces fewer outliers than Otsu and Ettinger. In addition, the STD values of the three methods show that the results obtained by DOES are significantly more clustered than the others, meaning that they are closer to the mean value and as such can be considered more reliable. The good performance of DOES is further confirmed by the MedAE value, which is noticeably lower than that of its counterparts. These findings are summarized in Fig. 6, which shows the percentage of outputs belonging to different MAE intervals (Fig. 6a) together with the empirical cumulative distribution (Fig. 6b) for the roll angle. The same evaluation can be made for the pitch angle (Fig. 7), which exhibits similar performance to the roll angle.
Another important consideration related to this comparison regards the inference time of DOES; the average estimation time on a single image is 100-150 ms with any of the tested backbones, whereas the inference time of Otsu and Ettinger ranges between 100 and 11000 ms, making them unsuitable for real-time applications on high-resolution images.
Table 3 shows a detailed comparison between DOES with its default proposed network and some alternative backbones: DOES performs well with all the residual networks, whereas both VGG-19 and VGG-19bn struggle to produce reasonable results.
More in detail, the MAE and RMSE results of ResNet18 are slightly better than those of the 50- and 152-layer versions, with the powerful DenseNet161 model able to produce a similar accuracy only on the roll angle. The strong results obtained by ResNet18, together with the fastest training and inference speed (due to the smaller number of trainable parameters TP with respect to the other architectures), make ResNet18 the first choice for the deployment of DOES until new models specifically developed for this purpose are released. Future work will focus on the use of lighter architectures developed specifically for low-resource embedded hardware (e.g., MobileNet [80]); this will lay the foundation for the deployment of the proposed model on embedded devices (e.g., Nvidia Jetson [81]) in real-time scenarios, in accordance with the aim of making DOES a supportive smart technology to improve the attitude estimations provided by low-cost sensors.
Furthermore, the ROPIS dataset has been used for an additional test in which a 1.33x zoom has been applied to the frames to simulate different camera parameters. In some cases, this corresponded to a crop in the image which removed the horizon line, thus making DOES unable to correctly estimate the angles. This is reflected in a slight decrease in performance: the roll MAE is equal to 2.10°, with an RMSE of 2.81°, whereas the pitch angle exhibits a 2.02° MAE and a 2.90° RMSE.
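The effect of the zoom on horizon visibility can be illustrated with a small geometric sketch. The text does not describe how the zoom was implemented; a digital zoom realized as a central crop is assumed here, the horizon is simplified to a single image row, and the names center_crop_box and horizon_survives are hypothetical.

```python
def center_crop_box(width, height, zoom):
    """Return (left, top, right, bottom) of the central crop for a given zoom.

    Assumption: a zoom factor z keeps the central 1/z of each dimension.
    """
    crop_w = round(width / zoom)
    crop_h = round(height / zoom)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def horizon_survives(horizon_row, box):
    """True if a (roughly horizontal) horizon at this row is inside the crop."""
    _, top, _, bottom = box
    return top <= horizon_row < bottom


if __name__ == "__main__":
    box = center_crop_box(2592, 1168, 1.33)
    print(box, horizon_survives(100, box), horizon_survives(600, box))
```

Under these assumptions, a horizon lying in the outer bands of the original frame falls outside the crop, which corresponds to the failure cases mentioned above.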
Finally, a separate test (with no prior training or specific tuning) has been made on a set of 191 images presenting three main variations with respect to the ROPIS train and test data:
• The device: a Huawei P9 smartphone [82] has been used, with the FrameWO app, to collect the data. The mounted dual-lens Leica camera has different characteristics with respect to the OnePlus Nord Sony camera: the P9 Leica 12 MP has in fact an aperture size of f/2.2, a focal length of 27 mm (wide), a sensor size of 1/2.9 and a pixel size of 1.25 µm.
• The location: the acquisition has been made in a different area of the Racale city (LE).
• The environment setting: the data have been collected right after sunset, in a low-light condition which greatly reduced the contrast in the frame, resulting in a very challenging scenario.
Despite these substantial changes in the sensor and in the overall acquisition, DOES obtained remarkable results, achieving a 2.17° MAE and a 2.70° RMSE for the roll angle and a 2.22° MAE and a 2.71° RMSE for the pitch angle. This indicates that DOES can successfully generalize over various conditions and camera parameters, confirming its potential for more challenging settings and for further use in support of inertial systems and visual-based odometry tasks.
It is worth mentioning that the accuracy of the results depends on the precision of the GT data and thus of the systems employed to acquire it. In this case, the overall accuracy is strictly connected to the use of a smartphone AHRS which, although limited by the low-cost sensors mounted on it, is still able to provide reliable and accurate measurements. The use of high-end and more expensive devices would in fact ensure a higher grade of GT accuracy with consequent improvements in the performance of DOES.

VII. CONCLUSION
This paper presents a novel Deep Learning-based approach to the attitude estimation problem, which has been developed and intensively tested on a new dataset (the ROPIS dataset) specifically built for the scope and released in the context of this work. Deep Orientation (of roll and pitch) Estimation at Sea (DOES) is able to predict the attitude of the device in terms of roll and pitch angles by analysing the frames recorded by the camera pointing towards the sea horizon. DOES has been tested using several known architectures (e.g., ResNet152, ResNet18, VGG19) and with different configurations and hyper-parameters, obtaining excellent results. Unlike other visual-based methods, DOES is able to produce the output without the explicit knowledge of the camera intrinsic and extrinsic parameters or the distortions introduced by the camera lens. There is in fact no necessity to make any assumption on the use of specific models to parametrize the camera, since the model training only depends on the dataset given as input; the latter generally provides different sampling characteristics, thus making the network able to learn and then estimate the attitude regardless of the camera specifics.
The ROPIS dataset has been created for this particular task and is here presented in its first release; the lack of public datasets suitable for DL applications made it necessary to search for a valid alternative to conduct the experiments. For this reason, the FrameWO Android application has been developed using the Open Source B4A platform and will be made publicly available online. This app simultaneously acquires the frames to be fed to the model as input and the attitude estimations measured through the internal sensors of the smartphone, which are used as Ground Truth in the training/testing phases.
The ROPIS dataset is intended to be further improved by the introduction of more subsets of data collected in different scenarios (e.g., at dusk or dawn, on rainy days, etc.) and environments (e.g., the coastlines of different cities, on board vessels), using different acquisition devices. This will improve the ability of DOES to generalize over heterogeneous data, making it even more invariant to the camera configurations, the acquisition conditions and cluttering factors, thus providing better results in any kind of situation in which the vehicle will be navigating. In this regard, the authors wish to encourage users to download and test the FrameWO application with the aim of enhancing ROPIS and its usage among the scientific community, to give a concrete contribution to this task.
The objective of this project is to develop a supportive technology to be integrated with the existing low-cost methodologies employed for the attitude estimation task. In fact, it should be noted that this approach has been specifically designed using affordable devices and applications and, as such, its results are not intended (at least in its preliminary version) to reach the accuracy provided by high-precision modern sensors. Further experiments will be made to test other light-weight DL architectures, which could be deployed on low-resource embedded hardware with the aim of providing better accuracy in real-time applications on autonomous vehicles. These enhancements will make DOES a robust system to be integrated in visual and visual-inertial odometry methodologies.

SALVATORE TROISI received the degree (Hons.) in nautical sciences from the Faculty of Nautical Sciences, Naval University in Naples, with an experimental thesis in nautical astronomy on the use of optical amplification of light.
Since 1987, he has been a Researcher with the Group 135 (first discipline complements topography), Faculty of Nautical Sciences, Naval University of Naples. From November 1998 to September 2007, he served as an Associate Professor of SSD ICAR 06 with the Faculty of Nautical Sciences, Naval University of Naples. He has been a Full Professor SSD ICAR06 with the Faculty of Science and Technology, Parthenope University of Naples, since 2007. His research interests include deformations control networks, geoid by astrogeodetic methods and GPS, topographic methods in environmental emergencies, GPS survey for deformations, design and simulation of geodetic networks by GPS methodology, design of satellite constellations, laser scanning, filtering of laser scanning data, close-range photogrammetry for reverse engineering, and 3D building modeling by aerial laser scanning data. VOLUME 10, 2022