Immediate Vehicle Movement Estimation and 3D Reconstruction for Mono Cameras by Utilizing Epipolar Geometry and Direction Prior

Motion estimation of surrounding objects is indispensable to any mobile machinery. This paper proposes a method to solve the motion estimation and reconstruction problem of dynamic objects with a mono camera. Using the relative camera motion and the rigidly moving objects detected in the image, we estimate their movement up to a scale factor. Priors about their moving direction are then used to estimate the transformation that maps the 3D object from the previous frame to the current one. Because our method needs only two frames, an estimate is available at least twice as early as with methods that require three or more frames, and we achieve this without any further constraints. We evaluate our method on various traffic scenarios from different autonomous driving datasets.


I. INTRODUCTION
Estimating relative camera motion is of high research interest in many machine vision fields. UAVs, mobile robots, automated guided vehicles, autonomous cars, etc., require knowledge of the vehicle ego-motion to navigate. Methodologies like Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM) make it possible to reconstruct the static environment in addition to determining the camera motion. With other simple on-board sensors (e.g., an incremental encoder) or based on simple assumptions (identifying objects, knowing the camera height), the absolute scale of the vehicle displacement can be determined. However, moving objects are generally ignored: the reconstructions focus on the static scene, and the image points of dynamic objects are treated as outliers. Yet the other dynamic participants of the transportation system can pose a higher threat than static objects. Pedestrians are vulnerable [1], and other vehicles can move at high speed. When these objects move toward us, a quick reaction is needed for safety [2]. Techniques are needed that are capable of operating from the moment the dynamic object appears [3]. While other methods require a series of frames to reach a decision, our method can already give a reliable estimation of the object motion and shape from only two frames (so an estimate is available two times faster than when estimating from three; the calculation time is negligible in comparison), which can be critical for the braking distance. Our method can also be used as an initial estimation, supporting future trajectory estimation methods.
There are end-to-end deep learning-based methods for motion prediction of multiple moving objects, such as [4] and [5]. However, they model the vehicles as 2D objects. Considering the real extent of vehicles (instead of simplified models) is vital for passenger safety; that is why we reconstruct not just the trajectories but the objects in 3D as well. The movement of dynamic objects (so-called eoru-motions: motions of multiple moving objects not rigid with the scene [7]) can also be reconstructed up to a scale. The scale ratio between this reconstruction and the reconstruction of the environment is unknown, and thus so are the object's distance and velocity. In this paper, we propose a method to calculate the relative scale ratio and the displacement of moving objects at the scale of the environment reconstruction. In real-life scenarios, we utilize GPS data to estimate the metric scale. We perform this by estimating the essential matrix of a virtual camera pair corresponding to the target object's movement, reconstructing the motion at a selected scale, and estimating the movement's homogeneous transformation as the matrix that translates the object center parallel to a given direction. Our assumption is true for most vehicle movements (because of straight road segments), and our tests show that it can give a satisfactory solution in other cases as well, since we only need two applicable frames to calculate the proper scale for the whole trajectory. This assumption makes it possible to determine the transformation matrix between frames on the absolute scale and to achieve state-of-the-art accuracy in mono camera-based trajectory estimation and 3D reconstruction. Fig. 1 shows that the accuracy of our 3D reconstruction using only two camera frames is comparable to LIDAR measurements.

A. Contributions
The paper contributes the following:
• A new methodology is proposed to estimate the transformation matrix of the motion of moving objects not rigid with the scene (based on an image pair of a mono camera) at the scale of the background reconstruction.

B. Outline of the Paper
The paper is organized as follows: Section II surveys the literature about the related topics. Section III describes the proposed method and the concept in detail. Sections IV and V show our test results. In Section VI we propose a complement to our method to deal with its limitations. Section VII continues with discussion. Finally, Section VIII draws some conclusions.

II. RELATED WORKS
3D environment reconstruction is achievable from camera images by using SfM or SLAM [10] models, but it is more challenging for dynamic scenes. Reference [11] provides a thorough survey of the topic. There are a few solutions for multi-body non-rigid SfM [12], and some more for rigid multi-body SfM. However, multi-body SfM causes problems in practice [13], like regular bundle adjustment. In most cases, it is solved by factorization [7]. Trajectory triangulation of points moving along a line was first solved in [14] by using the Plücker representation. Later, [15] proposed a solution to reconstruct the trajectory of points with more general movement by utilizing a linear combination of trajectory basis vectors.
Estimating the scale relative to the background is one of the main issues in reconstructing trajectories of objects not rigid with the scene, as in general these trajectories can be determined only up to a one-parameter family. Additional constraints are required to obtain the correct scale. Constraints can arise from the non-accidentalness principle, as in [16], or from deep learning for depth estimation; object shape templates can be applied as well [9]. Motion models [8] can also serve as a constraint, such as the constant velocity model in [9]. The most common assumption is the planar motion of objects, which ties the movement to the ground by requiring it to be perpendicular to the normal vector of the ground plane [17] and the distance of the object points to the ground plane to be constant [18]. However, in most cases this calculation cannot be executed with a consecutive frame pair of a mono camera, even with constraints. The methods above require multiple frames for the estimation, knowledge of the ground plane's orientation, and special camera or object motion. Real-time ground plane estimation from a mono camera is hardly applicable (because of slopes, it requires computationally expensive dense reconstruction). Figure 1c shows how few road points can be reconstructed (gray points). Also, methods relying on it require specific camera motion, making them inapplicable to driving scenarios. Compared to the above, the present paper brings the following advantages:
• only two frames are required for the displacement estimation;
• the initial estimation can be refined with each frame;
• no specific camera motion is assumed;
• 2D motion of moving objects is not a requirement; objects can move through 3D space (e.g., uphill);
• only visual odometry is required, reconstruction is not necessary;
• no learning or object templates are required;
• the orientation of the ground plane is not needed.

III. THE PROPOSED METHOD
Here, we describe the method in detail. We divide our pipeline of scale and transformation estimation into four steps:
1) Preliminary data generation
2) Estimation of virtual camera motion and reconstruction up to an unknown scale
3) Estimation of the movement direction
4) Computation of the relative scale
These steps are described in the following subsections.
In Figure 2, we illustrate the transformation matrices and coordinate systems we use. (Symbols not indicated in the figure caption are defined in the following subsections.) Three coordinate systems are illustrated in the figure. Two of them are the camera coordinate systems at different positions, with camera centers $C_1$ and $C_2$. The third (global) one is, for this illustration, fixed to the traffic pole.

A. Preliminary Data Generation
To generate the input data for our algorithm, we propose to execute three preprocessing sub-tasks. First, the camera's intrinsic parameters should be determined [19]. The intrinsic matrix of a camera is constructed from these parameters and maps the coordinates of a 3D point $P$ (in the camera coordinate system) to its 2D projection $p$ on the image plane (in homogeneous coordinates, up to scale):
$$p \simeq K P,$$
where $K$ is the intrinsic camera matrix. Then, the relative camera poses should be estimated. In our real-life experiments we used GPS-based poses, but without other sensors, the poses can also be estimated utilizing only the mono camera, with an SfM software like COLMAP [20] for accuracy or a SLAM method like ORB [21] for real-time operation. (We used these in the case of virtual camera pose estimation.) In the latter case, the vehicle trajectory is determined on the scale of the ego-motion trajectory. Finally, the moving object must be detected and matched through consecutive frames. In our case, Yolo_v2 [22] is used for detection and the Hungarian algorithm [23] for tracking. The methods of the current subsection are not an essential part of our proposed method (e.g., [24] provides an excellent survey of detection alternatives). However, real-life experiments showed that the above methods perform well enough to supply the input data for our estimation.
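The projection by the intrinsic matrix can be illustrated with a minimal sketch. The numeric values of `K` and the helper name `project_point` are assumptions chosen for illustration; they are not taken from the paper or from any dataset calibration.

```python
import numpy as np

def project_point(K, P):
    """Project a 3D point P (camera coordinates) to pixel coordinates with intrinsics K."""
    p_h = K @ np.asarray(P, dtype=float)  # homogeneous image point, defined up to scale
    return p_h[:2] / p_h[2]               # normalize by the third (depth) component

# Hypothetical intrinsics: focal lengths of 1000 px, principal point at (960, 600).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 600.0],
              [0.0, 0.0, 1.0]])

pixel = project_point(K, [2.0, 1.0, 10.0])
```

The division by the third component is what makes the projection insensitive to the overall scale of the reconstruction, which is why a single camera cannot recover that scale without extra information.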

B. Estimation of Virtual Camera Motion and up to Scale Reconstruction
We can estimate a virtual camera motion between consecutive frames corresponding to the tracked objects, assuming that the vehicle is static and only the camera (on the ego vehicle) is moving. Later on, knowing the ego-motion, the motion of the tracked object can be calculated from this virtual camera motion.
By decomposing the estimated essential matrix [25] corresponding to the moving object, we can define the projection matrices of the virtual camera poses (denoted with the index $v$): where $T = [R|t]$ are homogeneous transformation matrices with rotation matrices $R$ and translation vectors $t$. The index $v$ marks virtual camera poses, and the numbering indicates the given frame. We determine the reconstruction up to a scale factor by defining $\lambda_1$ as the unknown scale. In our tests, we used COLMAP [26] to reconstruct the dynamic objects by estimating the motion of a virtual camera pair. Different image features [27] and robust estimators like [28] can be used for this estimation (customized to the needs), just as in the case of the real camera motion estimation.
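As an illustration of the decomposition step, the following sketch applies the textbook SVD-based factorization of an essential matrix into two candidate rotations and a unit translation direction. It is not the COLMAP pipeline used in the tests; the function names `skew` and `decompose_essential` are hypothetical, and the cheirality check that selects the physically valid candidate (points in front of both cameras) is left out.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Return the two candidate rotations and the translation direction (up to sign)."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (determinant +1); E is only defined up to sign.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R_a = U @ W @ Vt
    R_b = U @ W.T @ Vt
    t = U[:, 2]  # unit vector; the true translation is +t or -t times an unknown scale
    return R_a, R_b, t

# Synthetic check with a known virtual camera motion (values are illustrative).
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([1.0, 0.2, 0.1])
t_true /= np.linalg.norm(t_true)
R_a, R_b, t = decompose_essential(skew(t_true) @ R_true)
```

Only the direction of the translation is recovered, which corresponds to the unknown scale $\lambda_1$ of the virtual camera motion.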

C. Estimate Movement Direction 1) Coordinate Transformation:
The virtual camera motion determines the object motion and shape at a given scale (e.g., λ 1 = 1) and the 3D point cloud shape. However, we would like to determine the trajectory and object shape on an absolute scale. We will utilize the fact that the coordinate systems of the virtual cameras and the real ones of the static scene's reconstruction are the same.
The previously triangulated point cloud can be transformed to the coordinate systems of the virtual cameras by the previously determined $T_{v,i}$ transformations. We apply a statistical outlier removal algorithm to denoise the triangulated object cloud. This method applies a threshold to the average distance of each point to its neighboring points [29]. After that, using the $T_i$ transformations, which map the global coordinate system to the $i$th camera coordinate system, the object points before ($P_m$) and after the movement ($P'_m$) can be described as a function of $\lambda_1$: where $X_i$, $Y_i$ and $Z_i$ are the coordinates of the triangulated 3D points in the coordinate systems of the virtual cameras, and $i = 1, 2$ are the camera indices.
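The outlier removal step can be sketched as follows. This is a brute-force illustration of thresholding on the mean distance to the nearest neighbors; the parameters `k` and `std_ratio` and the mean-plus-standard-deviation threshold rule are assumptions in the spirit of common implementations, not values reported in the paper.

```python
import numpy as np

def statistical_outlier_removal(points, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbors is unusually large."""
    pts = np.asarray(points, dtype=float)
    # Full pairwise distance matrix: O(n^2) memory, fine for small object clouds.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    # Sort each row and skip column 0, which is the zero distance to the point itself.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    mean_knn = knn.mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    keep = mean_knn <= threshold
    return pts[keep], keep

# Illustrative cloud: 50 points around the origin plus one far-away spurious point.
rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(size=(50, 3)), [[100.0, 100.0, 100.0]]])
filtered, keep = statistical_outlier_removal(cloud)
```

For larger clouds, a spatial index (for example a k-d tree) would replace the full distance matrix.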

2) Principal Component Analysis (PCA):
On the triangulated and transformed point cloud, we apply PCA [30] to determine its oriented bounding box, and we use the direction in which the point cloud is the most scattered, denoted as $N$, as our estimation of the movement direction of the object's center. The covariance matrix of the point cloud coordinates is
$$C = \frac{1}{n} \sum_{j=1}^{n} P_{\mu,j} P_{\mu,j}^{T},$$
where $n$ is the number of points of the point cloud $P_\mu$, whose center point $P_c$ has been translated to the origin. By applying singular value decomposition to $C$, we get the eigenvalues and eigenvectors of the covariance matrix. The direction corresponding to the largest eigenvalue is our estimation of $N$. Figure 3 illustrates how the movement direction specifies the correct scale. The moving object in the illustration is a triangular prism. The estimated essential matrix gives the shape, knowing that the object only translates (it does not scale between views and, in this illustration, does not rotate either). Red arrows indicate a possible translation direction. Green arrows indicate the real translation of points in the known direction $N$. The movement direction (here, parallel to the object's longitudinal axis) determines the correct scale. The pair of smaller triangular prisms is a valid solution if the direction is not known, but the red arrows are not parallel to the direction we are searching for. Only choosing the scale of the bigger pair of triangular prisms results in the searched direction $N$.
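A minimal sketch of the direction estimate follows, taking the direction of largest variance (the most scattered direction named in the text). The $1/n$ normalization and the function name `principal_direction` are assumptions; the normalization does not affect the resulting direction.

```python
import numpy as np

def principal_direction(points):
    """Return the unit direction of largest variance of a point cloud and the spectrum."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)          # move the centroid to the origin
    cov = centered.T @ centered / len(pts)     # 3x3 covariance matrix
    # For a symmetric positive semi-definite matrix, the SVD yields the eigenvectors
    # (columns of U) with eigenvalues sorted in descending order.
    U, eigenvalues, _ = np.linalg.svd(cov)
    return U[:, 0], eigenvalues

# Illustrative elongated cloud: longest along x, like a vehicle seen in its own frame.
rng = np.random.default_rng(1)
cloud = rng.normal(size=(500, 3)) * np.array([4.0, 1.0, 0.5])
N, eigenvalues = principal_direction(cloud)
```

The sign of the returned vector is arbitrary, so only the axis is determined here; the displacement along it can come out positive or negative in the subsequent scale computation.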

D. Compute the Relative Scale
Here, we use the assumption that the object (or at least its centroid) goes through a transformation that can be approximated as a translation parallel to its longitudinal axis. Thus the difference vector of the two center points (in the global coordinate system) is equal to the unit vector of the movement direction multiplied by an unknown scale $\lambda_2$ of the displacement: utilizing the known transformations (from the camera coordinate systems to the global one) and the estimated points: where $P_{c,1}$ and $P_{c,2}$ are the centroids of the point cloud in the first and second camera coordinate systems. Rearranging the equation and writing the rotation and translation parts of the transformations separately gives: where $C_1$ and $C_2$ are the camera centers in the global static coordinate system, and their difference is the vector describing the ego-motion; the coefficient matrix of the system has a last column formed from $N$. The equation system above (three equations with two unknowns) can be solved in a least-squares sense: By substituting the searched scale $\lambda_1$ into $T_{v,1}$ and $T_{v,2}$, we get the transformation matrix of the moving object in the global coordinate system: This gives the transformation matrix of the object motion. Moreover, we also obtain the 3D reconstruction of the moving object at the correct scale.
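The least-squares step can be sketched under explicitly stated assumptions, since the exact form of the equations is not reproduced here. We assume that $T_i$ maps a global point $P$ to the $i$th camera frame as $R_i (P - C_i)$, so a centroid reconstructed at unit scale maps back to the global frame as $\lambda_1 R_i^{-1} P_{c,i} + C_i$. Requiring the difference of the two global centroids to equal $\lambda_2 N$ then gives three linear equations in $(\lambda_1, \lambda_2)$. The function name and the argument `N_global` (the direction expressed in the global frame) are hypothetical.

```python
import numpy as np

def solve_relative_scale(R1, R2, C1, C2, Pc1, Pc2, N_global):
    """Least-squares solution of lam1 * d - lam2 * N = C1 - C2 for (lam1, lam2)."""
    d = R2.T @ Pc2 - R1.T @ Pc1          # centroid difference at unit scale (R^-1 = R^T)
    A = np.column_stack([d, -N_global])  # 3x2 coefficient matrix
    b = C1 - C2
    (lam1, lam2), *_ = np.linalg.lstsq(A, b, rcond=None)
    return lam1, lam2

# Synthetic scene (illustrative values): the object moves 0.8 units along x,
# while the camera moves by (0.5, 0, 1.0) and rotates slightly about the y axis.
a = 0.1
R1 = np.eye(3)
R2 = np.array([[np.cos(a), 0.0, np.sin(a)],
               [0.0, 1.0, 0.0],
               [-np.sin(a), 0.0, np.cos(a)]])
C1, C2 = np.zeros(3), np.array([0.5, 0.0, 1.0])
N_global = np.array([1.0, 0.0, 0.0])
P_before = np.array([2.0, 0.0, 10.0])
P_after = P_before + 0.8 * N_global
true_scale = 2.0
Pc1 = R1 @ (P_before - C1) / true_scale  # centroids as seen at the arbitrary unit scale
Pc2 = R2 @ (P_after - C2) / true_scale
lam1, lam2 = solve_relative_scale(R1, R2, C1, C2, Pc1, Pc2, N_global)
```

With noise-free input, the sketch recovers the scale of 2.0 and the displacement of 0.8 used to generate the data; with real data, the residual of the fit indicates how well the translation-along-$N$ assumption holds.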

IV. VALIDATION
We evaluated the proposed method in the Vehicle Trajectory dataset [18], which is the only demanding dataset currently available for vehicle trajectory estimation and object reconstruction with a moving mono camera to the best of our knowledge.
The Vehicle Trajectory dataset is a virtual dataset containing seven scenes (each with one trajectory) and five cars. Ground truth object masks are published with the dataset, so trajectory estimation can be measured without preprocessing influencing the results. The dataset is designed for vehicle trajectory estimation with mono and stereo cameras. As we use mono cameras, the left images are used in our case. The seventh scene (Bumpy road) is an exception, as it contains only object motion parallel to the camera motion (a degenerate case for our algorithm). In this case, we used left and right camera images alternately (in different time steps) so that the whole dataset could be evaluated (note: only two frames without degeneracy would be enough to estimate the relative scale). For the single global scale ratio estimation, we used the scale ratios (calculated from frame pairs) with a degeneracy degree smaller than 0.75 (judged as trustworthy, see Fig. 10). The degeneracy degree is defined as the absolute value of the scalar product of the unit vectors in the direction of the estimated object motion and the camera motion (see more in Section VII-A.1). However, for some trajectories, all consecutive frame pairs were degenerate according to this threshold. In this case, we used the available frame pairs and estimations. We estimated a global scale for the whole trajectory in two different ways. In the first case, we used the geometric mean of the calculated scale ratios (from consecutive frame pairs), and in the second case, we constructed a linear equation system from the equations of the frame pairs (Eq. 9): where $i = 2..n$, $n$ is the number of frames and $R^{-1}$ … We solve the equation system in the least-squares sense (as in Eq. 10) and use the estimated $\lambda_1$ scale as the global scale relating the object trajectory to the background reconstruction. Table I compares the average performance measures to those of [18].
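The degeneracy filter and the geometric-mean variant of the global scale can be sketched as follows. The function names are hypothetical, and the fallback to all frame pairs mirrors the handling described above for trajectories where every pair exceeds the threshold.

```python
import numpy as np

def degeneracy_degree(camera_motion, object_direction):
    """Absolute scalar product of the two unit vectors: 0 = perpendicular, 1 = parallel."""
    a = np.asarray(camera_motion, dtype=float)
    b = np.asarray(object_direction, dtype=float)
    return abs(float(a @ b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def global_scale_geomean(scale_ratios, degrees, threshold=0.75):
    """Geometric mean of the per-pair scale ratios judged trustworthy (positive ratios)."""
    ratios = np.asarray(scale_ratios, dtype=float)
    trusted = ratios[np.asarray(degrees, dtype=float) < threshold]
    if trusted.size == 0:
        trusted = ratios  # every pair is degenerate: use what is available
    return float(np.exp(np.mean(np.log(trusted))))

# Illustrative values: the third pair is nearly degenerate and gives a wild estimate.
scale = global_scale_geomean([2.0, 8.0, 100.0], [0.1, 0.5, 0.99])
```

The geometric mean is the arithmetic mean in the logarithmic domain, so over- and under-estimates by the same factor cancel each other.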
Reference [18] used the whole image sequences to select view pairs satisfying specific conditions to estimate the scale ratio. We used only consecutive view pairs for the estimation. For the global scale estimation, we used the geometric mean of the estimated ratios (geomean) or estimated a single global scale ratio by gathering the equations of consecutive frame pairs (eq. sys). We estimated the average scale ratio at least as accurately as the baseline for four out of the five cars. The trajectory errors of the proposed estimation are shown in Figure 4 for all 35 trajectories. Comparing Figure 4 to a similar figure published in [18] shows that we outperformed the baseline in most cases. For both performance measures, we outperformed the proposal of [18] except in the case of the Smart car, which can be explained by the fact that this car has a small length/width ratio, so its longitudinal axis is harder to estimate. An example reconstructed trajectory is illustrated in Fig. 5. (The trajectory consists of 60 frames; the 1st and every 10th frame are illustrated.)

V. REAL-LIFE TESTS
For the real-life testing of the proposed method, we chose the Argoverse dataset [31]. From the dataset, we selected 8 vehicle trajectories corresponding to different traffic situations and vehicle movements. These data are summarized in Table II with some statistics. For simpler referencing, we assigned an integer number to identify the sequences and objects instead of the original Argoverse id; this is indicated as 'our' in the table.
Results of the trajectory estimation and reconstruction can be found in Table III. Note: utilizing camera frames with high frequency can make the estimation even more accurate, but in these test scenarios, only 10 Hz sampling rate was used because of the LIDAR ground truths and the analysis provided in Section VII-C.
Our results were compared to the 'lift and splat' part of [32], as it is a deep-learning-based method designed for and utilized by state-of-the-art trajectory prediction methods like [5]. We used [32] for comparison because this part of the network can be considered to have a similar purpose (localizing surrounding vehicles) to ours. However, some significant differences should be discussed. Although the lift part of the network estimates 3D from the images (in very low resolution), at the end of the splat part, only 2D information (in bird's-eye view) is available, in the form of a probability map of the surrounding vehicles with a simplified model. We provide 3D information about the vehicles with the actual shape, given positions, and data points. The depth accuracy of our two-frame estimations outperformed [32] on average and in most of the sequences, too; utilizing more than two frames resulted in an even more significant performance increase. Besides, the segmentation result of [32] is often a box estimation parallel or perpendicular to the ego-vehicle. At in-between angles, it often cannot estimate the correct pose of the vehicles, unlike our geometry-based estimation, as the examples in Fig. 6 demonstrate. The average depth error of our method over the different challenging sequences is 7.96 %. This is the same level of accuracy as the results reported by [34] and somewhat below the accuracy reported by [35] and [9] on similar real-life data, which, however, contain less challenging scenarios (only straight movement). Also, as it does not rely on the constant motion assumption of [34], our method can solve more generic and challenging trajectories (e.g., the Vehicle Trajectory dataset of the previous section). See more comparisons to these methods in Section VI. A reconstructed trajectory of the dataset is illustrated in Fig. 7. (Every 15th frame is illustrated.)

VI. COMPLEMENTARY METHOD
As previously mentioned, parallel camera and object movement results in degenerate cases for our method (see more in Section VII-A.1). On the one hand, it is a very frequent movement type, but also the least dangerous one (e.g., ADAS functions like adaptive cruise control already handle parallel moving vehicles in front of us); most hazardous scenarios happen at vehicle path intersections. On the other hand, only one frame pair with non-degenerate movement is sufficient for estimating the whole trajectory; also, our method is robust against cases close to degeneracy (Section VII-A.2). Finally, these parallel movement driving scenarios are not even challenging in terms of camera-based depth estimation. Methods like [34] or [9] demonstrate that they are capable of vehicle trajectory estimation in this simple case of approximately parallel movement of cars (on 9 trajectories of the KITTI dataset), so these can be used. Still, we propose a straightforward solution to deal with these degenerate cases, using mostly the same pipeline as we have used so far. We suggest altering it after $N$ is computed in Section III-C (and so the degeneracy degree is known). At that point, we can see that the target vehicle is moving approximately parallel to the ego vehicle. In this case, we can apply a simplified scale estimation because the target vehicle is parallel to our vehicle (and so its rear or front part can be approximated as a plane). Based on the knowledge of the camera installation position (height in absolute scale), we can use a homography to estimate the vehicle distance (depth) on the ground.
Figure: Reconstructed trajectory of the Argoverse dataset [31] for sequence 4. Colormap: black, LIDAR frame (just for reference); red, (LIDAR-based) ground truth object positions (just for reference); blue, reconstructed vehicle points from the camera at the given position.
Steps of the scale estimation in these cases:
• Estimate the distance ($D_{plane}$) of the car's (rear or front) plane perpendicular to our moving direction from the reconstructed point cloud. We can use the prior of its normal vector orientation (in the case of the KITTI dataset, with roll and pitch angles being approximately 0, this orientation is approximately [0 0 1] in the camera coordinate system). We used RANSAC [28] to estimate this plane of the target object.
• By projecting the reconstructed object cloud to a given image, we select the points with the highest $v$ coordinate (closest to the ground plane) inside the previously determined bounding box of the object.
• We assume that the previously determined points are on the ground plane, and we use the projective transformation determined from the known camera installation position to estimate the distance of these ground points ($D_{point}$).
• The relative scale which relates the target vehicle to the background can be computed as:
Figure: … [31] in case of sequence id 47 and object id 6. Colormap: black, LIDAR frame (just for reference); red, (LIDAR-based) ground truth object positions (just for reference); blue, reconstructed vehicle points from the camera at the given position.
Results of this estimation can be seen in Table IV, and a reconstructed trajectory and object shape are illustrated in Fig. 8. (Every 10th frame is illustrated.) One can see that using this assumption, we achieved state-of-the-art depth error for half of the benchmark. However, our goal was to show that this benchmark, which has frequently been used to evaluate trajectory estimation, is not challenging. It contains only one type of vehicle movement (and methods evaluated on it cannot be compared to our solution, which works for general camera and target motion). That is why we created a new benchmark for vehicle trajectory estimation by selecting diverse traffic scenarios from the Argoverse dataset (described in Table II and Section V).
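The steps above can be sketched in a simplified form. Two assumptions are made that are not stated in this form in the text: the ground-point distance is computed with the closed-form special case of a level pinhole camera (zero roll and pitch) at a known height above a flat ground, rather than with a general homography, and the relative scale is taken as the ratio $D_{point} / D_{plane}$ of the metric ground distance to the plane distance in the unscaled reconstruction. All names and numeric values are illustrative.

```python
def ground_point_depth(v, fy, cy, camera_height):
    """Depth of a ground pixel in image row v for a level camera above a flat ground.

    With the image v axis pointing down, a ground point at depth Z projects to
    v = cy + fy * camera_height / Z, which is inverted here.
    """
    if v <= cy:
        raise ValueError("a ground point must project below the horizon row cy")
    return fy * camera_height / (v - cy)

def complementary_scale(d_point, d_plane):
    """Assumed relative scale: metric ground distance over unscaled plane distance."""
    return d_point / d_plane

# Hypothetical numbers: camera 1.5 m above the ground, focal length 1000 px,
# principal point row 600, lowest object pixel in row 700, plane at 3.0 units.
d_point = ground_point_depth(v=700.0, fy=1000.0, cy=600.0, camera_height=1.5)
scale = complementary_scale(d_point, d_plane=3.0)
```

Under these assumptions, multiplying the reconstructed object cloud by `scale` would place the vehicle's rear or front plane at the metric distance of its ground contact points.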

VII. DISCUSSION
In this section, the known limitations of the proposed method and a comparison with an earlier approach are discussed, and finally, a running time analysis is provided.

A. Degeneracy
In this subsection, first, degenerate cases are derived mathematically, then robustness against degeneracy is investigated.
1) Degenerate Case: If the translation of the camera has the same direction as the translation of the moving object, the scale of the moving object's translation cannot be estimated. Substituting this condition, $\frac{C_2 - C_1}{|C_2 - C_1|} = N$, into Equation 9 and rearranging it gives: so $N_g$ must be a multiple of $N$ as well. This results in a coefficient matrix of rank one, or a scalar equation: which naturally cannot be solved for the two unknowns. Figure 9 illustrates this degenerate case. If the motion direction $N$ is parallel to the camera motion, both triangles (blue and orange) can be the solution. The scale remains ambiguous.
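The rank deficiency can be checked numerically. The sketch assumes the linear system has the form $\lambda_1 d - \lambda_2 N = C_1 - C_2$, where $d$ is the difference of the two object centroids at unit scale expressed in the global frame; this form follows from the definitions given earlier but is an assumption about the exact layout of Equation 9. All numeric values are illustrative.

```python
import numpy as np

N = np.array([0.6, 0.0, 0.8])        # unit movement direction of the object
true_scale, displacement = 2.0, 0.8  # hypothetical ground truth values

def coefficient_matrix(ego_motion):
    """Build the 3x2 coefficient matrix [d | -N] for a given camera displacement C2 - C1."""
    d = (displacement * N - ego_motion) / true_scale  # unit-scale centroid difference
    return np.column_stack([d, -N])

# Camera moving exactly along N versus a general camera motion.
rank_parallel = np.linalg.matrix_rank(coefficient_matrix(1.0 * N), tol=1e-9)
rank_general = np.linalg.matrix_rank(coefficient_matrix(np.array([0.5, 0.0, 1.0])), tol=1e-9)
```

In the parallel case both columns are multiples of $N$, so infinitely many pairs $(\lambda_1, \lambda_2)$ satisfy the system, which is the scale ambiguity described above.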

2) Degeneracy in the Validation Data (Robustness Against Degeneracy):
In the validation dataset, as we have said, the camera and object trajectory in the Bumpy Road scene was completely degenerate (from our method's point of view). However, other parts of the dataset contained degenerate or nearly degenerate cases too. Figure 10 illustrates how the proposed method behaves as a function of degeneracy. We measure the degree of degeneracy as the absolute value of the scalar product of the unit vectors in the directions of $C_2 - C_1$ and $N$ (0 means they are perpendicular, and 1 indicates that they are parallel). In the figure, the ratio of the estimated and ground truth scales ($\lambda_{1es} / \lambda_{1gt}$) is visualized on a logarithmic scale as a function of degeneracy.
The illustration corresponds to one car (Golf) through the different trajectories, but all the cars have a very similar data distribution. One can see that there were many consecutive frame pairs in the dataset close to degeneracy degree 1, and they resulted in erroneous estimations. However, they can be easily filtered out (as we can calculate this factor, which we call the degree of degeneracy). We get reasonable estimations a bit farther from the value 1 (we used 0.75, as mentioned earlier). On the other hand, on the logarithmic scale, the distribution seems symmetric about the x-axis. Calculating a geometric mean (or formulating a global equation system) could therefore result in a correct scale approximation close to degenerate motions as well (if we have enough measurements). Considering the dataset's overall degeneracy degree, our method proved to be robust.
Fig. 10. Distribution of the degeneracy of consecutive frame pairs (in terms of our estimation method) and the accuracy of the corresponding estimations in the Vehicle Trajectory dataset. Circles of different colors indicate consecutive frame pairs in different scenes. On the x-axis, the degeneracy (of the related object and camera movement) increases toward the right. On the y-axis, values closer to 0 mean more accurate position estimation.
Note: The parallel camera motion case cannot be solved by the method proposed by [18] either, but in fact, it has a lot more theoretically degenerate cases, discussed in the following subsection.

B. Earlier Attempt in Degenerate Case
The method of [18] is the only one available in the literature that is capable of achieving accuracy similar to our proposed one (Table I) for general vehicle trajectories (including general camera motion and turning vehicles). However, besides its earlier mentioned limitation (coming from ground plane estimation), its applicability to general driving scenarios (where the camera and the target object move in approximately parallel planes) is questionable. They provide no quantitative results on the Cityscapes dataset [36], which they use to illustrate their method in these cases. Reference [18] approximates the ground as a plane and aims to solve the relative scale estimation based on a planar motion assumption. They utilize the fact that object points should have the same distance to the estimated ground in each frame. Formulating this criterion for any point of the object (here, we use the object center) in the case of two frames, we can write the following equation (using the notations of this paper): where $n$ is the normal vector of the plane. Let us suppose a coordinate frame fixed to one point in the ground plane, oriented so that the direction of its $z$ axis is the same as the normal vector, pointing upward. In this coordinate frame, the equation of the ground plane is $0 = Ax + By + Cz - D = 0x + 0y + 1z - 0$. Here, the index $z$ denotes the $z$ coordinate of a vector. As can be seen from Eq. 16, if there is no difference between the $z$ coordinates of the camera positions (the camera moves parallel to the ground), the relative scale $\lambda_1$ cannot be determined. As the height difference of a camera (installed on a car) between consecutive frames is minimal (approximately 0), any general driving scenario could be a (close to) degenerate case for [18]. Numerous frames can be required for a proper estimation. They provide only qualitative results on two trajectories of this type of data (Cityscapes [36]), which cannot be reliably reproduced from their description.
Consequently, we cannot compare their performance to ours in the case of driving scenarios. However, on their dataset, we outperformed them in most of the cases (Table I).

C. Running Time Analysis
The theoretical speedup comes from sampling two frames instead of three or more, which is an important issue in risky traffic conditions. The running time analysis was carried out on our development machine with the following configuration: RAM: 31GB, CPU: Intel Core i7-7820X CPU @ 3.60GHz × 16, GPU: GeForce GTX 1080 Ti 12GB, operating system: Ubuntu 18, in a Matlab environment. Measured running time values are shown in Table V. The computation time depends on the image size of the dataset. The indicated values correspond to the Argoverse dataset with an image size of 1920 × 1200.
The ORB feature based reconstruction runs with bundle adjustment with an average running time of 569 ms. However, it is a background task that does not necessarily count toward the real-time frame rate. In this configuration (with high-resolution images), the system can run at about 5 FPS in a non-optimized Matlab environment.
The Yolo_v2 detector is reported in [22] to run at about 91 FPS (about 10 ms/frame) with 69.0 mAP, and ORB-SLAM at about 25 ms/frame in [37]. Substituting these values into the running time calculation of the pipeline (and not assuming any speed-up in the tracker) would result in about 58 ms total run time (17 FPS), close to the real-time speed of 20 FPS. Based on the summary in Table V, it is reasonable to assume that real-time operation can be achieved on a typical on-board embedded system.

VIII. CONCLUSION
This paper has presented an approach to reconstruct vehicle trajectories and vehicles in 3D with a mono camera, based on a direction prior and epipolar geometry. We provide reliable results for eoru-motions by utilizing only two frames. For this, we estimate the relative scale that relates the object and the background. Earlier methods relied on priors like ground estimates, object shape models, and learning, while working only in the case of specific camera motion. We do not require these or utilize any assumption besides the direction prior. Propagation of the scale ratio is possible, as the evaluation shows. We have discussed the limitation of the method in the degenerate case. We can detect these cases, and we proposed a complement to our method for them. With this proposal, we have reached state-of-the-art performance in the problematic cases too. Our method uses only two frames to estimate the pose, speed, and 3D shape of moving objects, two times faster (much earlier) than a method using three frames. This is advantageous in autonomous driving applications, where prediction speed (especially for a hazardous moving object) is essential. The method's quantitative evaluation supports the theoretical advantages. We estimated trajectories in general traffic scenarios that have not been targeted before in the literature. The method can also be used in any general vehicle trajectory estimation scenario, like estimation from a UAV (Vehicle Trajectory dataset [18]). In these cases, we have outperformed the state-of-the-art. Our method has potential extensions, as many natural and artificial objects are elongated parallel to their travel direction. If this is not the case, object recognition can extend the method to other predetermined directions. We plan to extend the method to estimate the movements of other object types in the future.