On Learning Mechanical Laws of Motion From Video Using Neural Networks

In computer vision, physics plays an important role in several applications. In this work, we teach a machine to detect the mechanical laws of motion of physical objects using video, and show how the results can be useful for computer vision tasks. We assume no prior knowledge of physics, beyond a temporal stream of bounding boxes. The problem is very difficult because a machine must learn not only a governing equation (e.g. projectile motion) but also the existence of governing parameters (e.g. velocities). We evaluate our ability to represent the physical laws of motion in video, such as the movement of a projectile or circular motion, in both real and constructed videos. These elementary tasks have textbook governing equations and enable ground truth verification of our approach. To establish the importance of the proposed method, we show a real-world use case in the domain of object tracking in confounding scenes, where existing state-of-the-art algorithms fail. Incorporating physics into computer vision not only serves the purpose of curiosity-driven research, but also provides an inductive bias for computer vision applications like object tracking.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Szidonia Lefkovits.

Many computer vision techniques use physical models as an inductive bias for neural frameworks [1]. It is impossible to cite the hundreds of such techniques, but a few representative papers include [2], [3], [4], [5], and [6]. All of these methods require a known physical model to serve as an inductive bias. But what happens if we do not know the physics, i.e., when we do not know the governing physical equation or its parameters? In contrast to previous work, this paper looks at the unexplored problem in vision of teaching a machine to recognize the laws of physics from video streams. In the apocryphal story, Isaac Newton's observation of a falling apple was a catalyst for deriving his physical laws. In like fashion, our machine aims to observe the dynamics of a moving object as a means to infer physical laws. We refer to this as recognizing physics from videos, as shown in Fig. 1. Recognizing physical laws is not only curiosity-driven research that tests the boundaries of computer vision; it also provides new sources of inductive biases that can be plugged into neural frameworks, e.g., yielding better object tracking in challenging cases like occlusions.
It is worth noting that the recognition problem is very difficult because a machine must derive not only the governing equations of a physical model but also governing parameters like velocities and acceleration. We emphasize that a recognition algorithm like ours does not know a priori what ''velocity'' means; it must learn the existence of velocity.

VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Recognizing physical equations from visual cues without human intervention. Here, we show how an input video of projectile motion can be processed by our method to recover both the governing equation of motion as well as two governing parameters, the initial velocities (both horizontal and vertical). These can be used for several applications in computer vision.
Additionally, the recognized physics should not only be easily interpretable to humans but also be able to work with any associated applications as a helpful reinforcement. In order to handle the under-determined nature of recovering both governing equations and governing parameters, we make a few assumptions. Section III expands on these assumptions, which we believe are the most relaxed to date. Our work is powered by methods from representation learning and evolutionary algorithms. The recognition of the underlying governing parameters is achieved using a modified β-variational autoencoder (β-VAE) to obtain latent representations. These are then used in an equation recognition step, driven by genetic programming approaches. Our approach learns equations that symbolically match ground truth, with governing parameters that correspond to human-interpretable constructs (e.g. velocity, angular frequency). Moreover, along with the basic idea of ''learn physics from vision, and then apply it for vision'', we show the effectiveness of our learnt symbolic equations on the task of challenging object tracking, highlighting the practical advantage of the proposed method.
Contributions: Our key contributions are summarized as follows. (1) An initial attempt1 at an algorithm that is able to recognize both physical governing equations and governing parameters from videos. Previous work [8], [9] can recognize either the governing equations or the parameters, but not both. (2) We test the algorithm on both synthetic and real data. Our method produces symbolically accurate expressions and interpretable recognition of governing parameters for a variety of simple mechanical motion tasks. Our method is robust to large amounts of positional noise and effective under a range of input data sizes. The ability to work robustly with real data highlights the real-world translational capabilities of the method. (3) Datasets: To lay the foundation for future work, we release two datasets: one is the physics-recognition dataset, consisting of both real and synthetic videos of dynamic physical phenomena; the other is a real basketball video dataset, on which we test the applicability of our proposed method. (4) We perform comprehensive experiments on object tracking in difficult settings. The results confirm the applicability of our proposed methods, moving one step closer towards physics-AI convergence. Code and data may be found at https://visual.ee.ucla.edu/visualphysics.htm/.

1 Aspects of this work form the basis for the thesis in [7].

II. RELATED WORK
Although our goals are different, we are inspired by work in physics-based computer vision, physical representation learning, and symbolic equation derivation. Physics-based computer vision encompasses the use of known physical models to either directly solve or inspire computer vision techniques. Techniques like shape from shading [10], [11] and photometric stereo [12] use known models of optical physics to estimate shape. Along this theme, recent work in the area of computational light transport has advanced the field to see around corners [13], [14], [15], [16] or infer material properties [17]. Known physical models can also be used to inspire the design of vision algorithms. Examples include deformable parts models [19], [20] or snakes [21], which use the physics of springs to design computer vision cost functions. The recent popularity of data-driven techniques has spawned a family of work that combines a known physical model with pattern recognition. For example, [22] and [23] unfold existing physical models as the backbone of the network architecture; [24], [25] use physics knowledge and laws to supervise the training process; [26] relies on gravity cues to improve depth estimation; and [2], [3], [27], [28], [29], [30], [31] introduce physics-based learning to set the new state-of-the-art in a range of vision problem domains. These approaches are powered by knowledge of a physical model, whereas our work has the complementary aim of learning the underlying model.
Learning physical parameters from visual inputs has been a topic of interest in recent years. For instance, [32], [33], [34], [35], [36], [37] estimate parameters or equivalent information for well-characterized physical equations with visual inputs. These can be incorporated into realistic physical engines to infer complex system behavior. Fragkiadaki et al. [38] integrate the model of external dynamics within the agent to play simulated billiards games. More recently, [39] and [40] deploy interaction networks with graph inputs to encode the interactions among objects in complex environments, and estimate other invariant quantities of the phenomenon using deep learning. In the field of controls, Shi et al. [41] learn the near-ground dynamics to achieve stable trajectory control. Jacques et al. [42] use a physics as inverse graphics approach in order to estimate physical parameters governing the dynamics in videos. While these prior attempts are capable of predicting the system dynamics precisely, they also require a well-characterized physical model that is already known.
Symbolic regression aims to generate symbolic equations from a space of mathematical expressions to fit the distributions of input samples. Genetic programming [43] is one of the prevalent methods in this field, with previous applications in recognizing Lagrangians [44] and nonlinear model structure identification [45]. Additional features from the input variables [46], [47] and partial derivatives pairs [48] can also be introduced into genetic programming for more reliable regression. Other evolutionary methods can also be used to derive partial differential equations (PDEs) [49]. Sparse regression [50] and dimensional function synthesis [51] are two other alternatives to conduct symbolic regression. Recently, deep neural networks (DNNs) have also been utilized to generate symbolic regression [52], [53], [54], [55]. These existing methods usually require predetermined terms or prior knowledge from physics.
Comparing with Prior Work: Fig. 2 highlights the differences between the proposed method and the two closest existing works. Our proposed method recognizes the complete scene physics (both equations and scene physical parameters), with the only prior assumptions being knowledge of the object of interest and a focus on position-dependent physics. Additionally, we show performance on synthetic as well as real data. In comparison, Huang et al. [8] assume prior knowledge of the specific physical parameter definitions (in terms of the relation of velocity with position) in order to recognize the governing equations. Udrescu and Tegmark [9], on the other hand, propose a warp-robust latent space for position encoding, from which differential equations governing the dynamics are identified; they too assume knowledge of the physical parameter definitions in order to identify the governing differential equations. Additionally, both prior works show performance only on synthetic data, with translation to real data left unexplored.

III. PROPOSED METHOD
Assumptions: This paper represents only a first attempt to recognize the laws of physics and physical parameters from videos. As such, we make certain assumptions. First, we restrict our focus to the dynamics of single objects (rather than groups of objects). Second, it is assumed that we know the object for which we would like to derive the physical equations. Third, we assume that the video frames are temporally ordered. We believe these assumptions are sufficiently general to allow us to characterize our technique as ''recognizing physics''. For example, the apocryphal story of Isaac Newton observing the apple falling aligns with the three assumptions outlined above. In the story, Newton was watching a temporal sequence of a single object in motion and was able to inductively reason about the laws of physics.
Defining ''Recognition of Physics'': We define recognition of physics as recognizing both the governing parameters and governing equations. Given the assumptions from the previous paragraph, we must therefore recognize all parameters except for the object location and time. Concretely, for a task like trajectory estimation, our framework has to tackle the challenging task of learning both the projectile equation and the existence of a ''velocity'' term from video input. Refer to Fig. 2 for details.
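To make the task concrete, the textbook governing equations for projectile motion (the known ground truth against which our recovered equations are compared, with y measured upward) can be written as:

```latex
x(t) = x_0 + v_{0x}\, t, \qquad
y(t) = y_0 + v_{0y}\, t - \tfrac{1}{2}\, g\, t^{2}
```

The machine observes only the positions (x, y) and times t; it must recover both this functional form and the existence of the governing parameters v_{0x} and v_{0y}.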
Defining ''Latent Space'': Subsequently in this manuscript, we use the phrase ''latent space'', similar to prior work in machine learning [9], to refer to the embedding space or learnt feature space of a deep learning/machine learning architecture, such as an autoencoder. While latent-space variables need not always have interpretable significance (though they do in this work), they encode information relevant to any learning-based task.

A. RECOGNITION FRAMEWORK
Our method consists of three interconnected modules that handle position detection, latent physics recognition, and equation recognition, respectively. Fig. 3 summarizes this framework.

1) POSITION DETECTION MODULE
We build the Visual Physics framework on the assumption that the underlying physical equations are reflected in the dynamics of an object across different time steps. Therefore, a robust object detection algorithm is required at the first stage to achieve accurate localization of the moving object across diverse object categories. We deploy a pretrained Mask R-CNN [56] to extract the bounding box of the object in the first frame, followed by object tracking using the STARK [57] algorithm, starting with the centroid of the detected box.
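A minimal sketch of the hand-off from this module to the rest of the pipeline: per-frame bounding boxes (from any detector/tracker, such as the Mask R-CNN and STARK components above) are reduced to a centroid trajectory. The helper names and the (x_min, y_min, x_max, y_max) box format are illustrative assumptions, not the released code.

```python
def bbox_centroid(box):
    """box = (x_min, y_min, x_max, y_max) in pixels -> centroid (cx, cy)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def boxes_to_trajectory(boxes_per_frame):
    """Stack per-frame centroids into a flat [x1, y1, x2, y2, ...] vector,
    the positional input assumed by the latent physics module."""
    traj = []
    for box in boxes_per_frame:
        cx, cy = bbox_centroid(box)
        traj.extend([cx, cy])
    return traj
```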

2) LATENT PHYSICS MODULE
The objective of the Visual Physics framework is to derive the governing physical laws without prior knowledge. This differentiates our work from the pioneering prior work in [8]. Specifically, since we do not assume knowledge of physical parameters, as in [8], we propose a technique to infer these parameters. To achieve this goal, we need to infer the associated latent governing parameters from positional observations. VAEs [58] have been widely deployed to extract latent representations, with applications in physics such as SciNet [59]. We adopt a modified β-VAE architecture for our latent physics module as well. The encoder takes a vector corresponding to the object trajectory at uniformly sampled time instants as input, and condenses it into a limited number of latent parameters. The decoder tries to reconstruct the object location (x_q, y_q) at an unseen time instant from these latent parameters [l_1, l_2, l_3]^T and the time instant t_q. This module is supervised by the object locations, without other prior physical knowledge. Once the network converges, the locations obtained from the position detection module and the corresponding learnt hidden representations from the latent physics module are paired as input to the equation recognition module. We term this module the latent (or unknown) physics module, as the physical parameter recognition is enabled through the latent space of the module (as described in the previous section).

FIGURE 2. (a) Previous work [8], [9] requires both video and the physical parameter definitions. Additionally, they show performance on synthetic videos, but no translation to real videos. (b) Our proposed technique also requires a video input, but is able to recognize latent parameters that correspond to true physical parameters, like velocity or angular frequency, along with the equation. Our method is able to work with both real and synthetic videos.
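A shape-level sketch of this module, under assumed toy dimensions (20 trajectory samples, and single linear maps with random weights standing in for the six fully-connected layers per side); it illustrates only the data flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 3                  # three latent nodes, as in the paper
T = 20                      # assumed number of sampled trajectory points

# Toy single-layer "networks" with random weights (sketch only).
W_enc = rng.normal(size=(2 * T, 2 * LATENT)) * 0.1
W_dec = rng.normal(size=(LATENT + 1, 2)) * 0.1

def encode(traj):
    """traj: flat (2T,) position vector -> (mu, log_var), each (LATENT,)."""
    h = traj @ W_enc
    return h[:LATENT], h[LATENT:]

def reparameterize(mu, log_var):
    """Standard VAE reparameterization trick."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z, t_q):
    """Latents plus the query time t_q -> predicted location (x_q, y_q)."""
    return np.concatenate([z, [t_q]]) @ W_dec

traj = rng.normal(size=2 * T)
mu, log_var = encode(traj)
x_q, y_q = decode(reparameterize(mu, log_var), t_q=0.5)
```

Training would fit the encoder and decoder weights so that `decode` reconstructs held-out locations, with the β-VAE penalty encouraging disentangled latents.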

3) EQUATION RECOGNITION MODULE
We concatenate the latent parameters and positional observations, and use this as input to a symbolic regression approach. Vanilla genetic programming approaches are usually subject to convergence issues, and may lead to trivial equations that are not descriptive for the physics associated with the data. Schmidt and Lipson [48] alleviate this problem by introducing partial derivative pairs between the input variables as a search criterion. We follow this strategy to design an equation recognition module, capable of generating multiple equations with a range of equation complexity and fit accuracy. The final output is a symbolic equation that is Pareto-optimal, described by the knee point [60] of the error-complexity curve.
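The knee-point selection can be sketched as follows; we use the common maximum-distance-to-chord heuristic as a stand-in for the criterion of [60], so the exact rule is an assumption:

```python
import math

def knee_point(points):
    """points: list of (complexity, error) pairs sorted by complexity.
    Returns the point farthest from the chord joining the two extremes,
    a common heuristic for the knee of an error-complexity curve."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    def dist(p):
        x, y = p
        return abs(dy * (x - x0) - dx * (y - y0)) / norm
    return max(points, key=dist)

# Illustrative candidate equations scored by (complexity, fit error).
candidates = [(1, 0.90), (3, 0.40), (5, 0.08), (9, 0.06), (15, 0.055)]
best = knee_point(candidates)  # (5, 0.08): big error drop, modest complexity
```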

B. IMPLEMENTATION
1) PHYSICS-RECOGNITION DATASET
To evaluate the proposed framework, we generate a dataset of both real and synthetic videos covering physical phenomena. We simulate several such phenomena, of which we present two in Table 1: free fall and damped oscillations. We cover the other phenomena in the Appendix. Each synthetic task includes 600 videos with randomly sampled physical parameters. We additionally include real video clips for free fall (411 videos) and uniform circular motion (80 videos). For all scenes, the physical phenomenon is known in closed form, enabling us to compare our proposed approach to ground truth. While the physics may seem elementary, we test in real-world conditions and add noise to make the task harder.
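A sketch of how one synthetic task could be generated; the sampling ranges, frame rate, and pixels-per-meter scale below are illustrative assumptions rather than the released dataset's exact settings:

```python
import numpy as np

def make_free_fall_dataset(n_videos=600, n_frames=30, fps=30.0,
                           g=9.8, scale=300.0, seed=0):
    """Sample random initial velocities and emit pixel-space trajectories.

    scale: assumed pixels-per-meter calibration.
    Returns (params, trajectories) with shapes
    (n_videos, 2) and (n_videos, n_frames, 2)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_frames) / fps
    v0 = rng.uniform(-5.0, 5.0, size=(n_videos, 2))   # (v0x, v0y) in m/s
    x = scale * v0[:, 0:1] * t                        # (n_videos, n_frames)
    y = scale * (v0[:, 1:2] * t - 0.5 * g * t ** 2)
    return v0, np.stack([x, y], axis=-1)

params, trajs = make_free_fall_dataset()
```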

2) SOFTWARE IMPLEMENTATION AND TRAINING DETAILS
For the physical inference module, both the encoder and the decoder consist of six fully-connected layers, and the number of latent parameters is set to three. We use the mean squared error (MSE) of the reconstructed locations and the β-VAE loss [61] to supervise the training process. The β-VAE penalty is introduced to encourage the disentanglement of latent representations, so that independent physical parameters are inferred in separate latent nodes. The entire loss function L of the latent physics network can be written as

L = L_mse(Y_{t_q}, Ŷ_{t_q}) + β L_kl(Z || N(0, I)),

where Y_{t_q} is the ground-truth location at time step t_q, Ŷ_{t_q} is the estimated location from the network, L_mse(·) is the MSE loss, Z denotes the extracted latent representations, L_kl(·) denotes the Kullback-Leibler divergence with respect to a Gaussian prior, and β is the balance factor for the β-VAE loss as described in [61]. We use the Adam optimizer [62] with an initial learning rate of 0.001, and this learning rate is decayed exponentially by a factor of 0.99 every 200 epochs. All the networks are implemented in the PyTorch framework [63]. We construct the equation recognition module using the Eureqa package [64]. The candidate operation set includes all the basic operations, such as addition, multiplication, and the sine function, in addition to powers and exponents. We recognize two equations, for the horizontal and vertical directions separately, and the R-squared value is used to measure the goodness of fit during the search.
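The training loss can be sketched numerically as follows (a minimal NumPy version for a diagonal-Gaussian posterior; the β value shown is an illustrative placeholder, not the paper's tuned setting):

```python
import numpy as np

def beta_vae_loss(y_true, y_pred, mu, log_var, beta=4.0):
    """Reconstruction MSE on the predicted location plus beta-weighted KL
    divergence between the diagonal Gaussian posterior N(mu, exp(log_var))
    and the standard normal prior N(0, I)."""
    mse = np.mean((y_true - y_pred) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return mse + beta * kl
```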

IV. EVALUATION
A. SYNTHETIC DATA EVALUATION
Fig. 4 illustrates various results from our framework, tested on the synthetically generated data described in Table 1. Results for additional tasks can be found in the Appendix.

1) FREE FALL (SYNTHETIC)
FIGURE 3. We use a number of video clips as inputs to our system. The extracted position information is fed through the physics parameter extractor, which identifies the governing physical parameters for the phenomenon. These are used as inputs to the genetic programming step, in order to identify a human-interpretable, closed-form expression for the phenomenon.

In this scene, all possible trajectories are completely parameterized by the initial velocities v_0x and v_0y along the x and y directions. Fig. 4(a) displays the output of our method for free fall, including both the embeddings and the recognized equation. The embedding trends show that our latent physics model successfully learns to separate the horizontal and vertical velocities into two separate nodes. The correlation of the three latent nodes with the two governing (ground-truth) parameters demonstrates that the nodes learn an affine transform of the ground-truth velocities. It is important to note that the third node does not show dependence on the input, assuming a constant value. This reconciles with human intuition in the sense that free fall is determined by only two parameters. In evaluating the final output, we observe that the recognized governing equation matches the form of the familiar kinematic equations. The value of the acceleration due to gravity is learned exactly, and the parametric dependence of the equation on the initial velocities is accurate up to an affine transform.
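The affine-transform claim can be checked with a small least-squares sketch (verification code one might run on the learned embeddings; the synthetic numbers below are illustrative):

```python
import numpy as np

def affine_fit_r2(latent, truth):
    """Fit truth ≈ A @ latent + b by least squares and return R^2.

    A value near 1 supports the claim that the active latent nodes are an
    affine transform of the ground-truth parameters. This is a
    verification sketch, not part of the recognition pipeline."""
    X = np.column_stack([latent, np.ones(len(latent))])  # add bias column
    coef, *_ = np.linalg.lstsq(X, truth, rcond=None)
    resid = truth - X @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((truth - truth.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
v0 = rng.uniform(-5, 5, size=(200, 2))                # ground-truth velocities
z = v0 @ np.array([[2.0, 0.3], [-0.5, 1.5]]) + 1.0    # affine "latents"
r2 = affine_fit_r2(z, v0)                             # ~1.0 for an exact affine map
```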

2) DAMPED OSCILLATION (SYNTHETIC)
In this experiment, we simulate videos of damped oscillation, where the oscillation amplitude decays exponentially with time. We vary the damping factor b and the angular frequency ω along the x direction, while the object location along the y direction is fixed. As shown in Fig. 4(b), the latent physics module is able to recognize the notions of ω and b in two different nodes, and the equation recognition module can generate equations that accurately describe the combination of periodic and damped motion.
B. REAL DATA EVALUATION
1) UNIFORM CIRCULAR MOTION (REAL)
Having recognized the equations for circular motion from synthetic data, we now extend experiments on this task to real data. Through this, we aim to further demonstrate the applicability of our method to real scenes. The dataset consists of 80 videos of an object rotating at a fixed angular velocity. The rotation radius is kept constant across the dataset, and the angular velocity ω is varied in the range [1.2π, 3π] radians/s. Videos with ω < 1.2π are excluded from the dataset in order to avoid non-linear effects of the motor at low frequencies. The first 200 frames of every video are used as input to the position detection module. The positions obtained are corrected for initial phase by an appropriate rotation of coordinates, so that all input trajectories have the same (zero) phase. The ground-truth ω for each video is calculated numerically from the zero-crossing frequencies of these detected locations. These are used for verification of the learnt representations, and are not used as part of the recognition process. Fig. 5(a) shows a graphical description of the data-collection setup. The latent physics module is trained with synthetic data, generated so as to match the parameters of the real dataset (frame rate, angular velocity range). We then apply the trained model to the real data in order to obtain the latent representations and the inputs for the equation recognition module. It may be observed from Fig. 5(b) that the first latent embedding l_1 obtained for the real data is well-correlated with ω. The other two nodes are close to zero in magnitude. This reconciles with the fact that there exists only one primary governing parameter for this setup. Additionally, the trend between the learnt embedding l_1 and ω suggests a quadratic relation. In Fig. 5(d), we verify that the recognized angular velocity ω_net (shown in Fig. 5(c)) corresponds to the ground-truth ω with high accuracy.
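The zero-crossing computation of the ground-truth ω can be sketched as follows (numerical details such as the linear interpolation of crossing times are our assumptions; the exact procedure is not specified above):

```python
import numpy as np

def omega_from_zero_crossings(x, fps):
    """Estimate angular frequency of a sinusoidal coordinate from its
    zero crossings. Successive zero crossings of x(t) = R cos(w t + phi)
    are half a period apart, so w = pi / (mean crossing spacing).
    Crossing times are refined by linear interpolation between samples."""
    t = np.arange(len(x)) / fps
    s = np.sign(x)
    idx = np.nonzero(s[:-1] * s[1:] < 0)[0]           # sign-change samples
    # Linearly interpolate the time of each crossing.
    tc = t[idx] - x[idx] * (t[idx + 1] - t[idx]) / (x[idx + 1] - x[idx])
    half_period = np.mean(np.diff(tc))
    return np.pi / half_period

fps, w_true = 60.0, 2.0 * np.pi                       # 1 Hz rotation
t = np.arange(200) / fps
w_est = omega_from_zero_crossings(np.cos(w_true * t), fps)
```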

2) FREE FALL (REAL)
We replicate free fall in the real world, as shown in Fig. 6(b), where the test set is a video sequence of a human tossing a ball with varying spins and uncontrolled air resistance. The motion may also not be perpendicular to the camera, leading to scale inconsistencies. 411 videos are collected, where each video represents a toss. To obtain ground-truth initial velocities, we fit the kinematic equations to the observed videos, using the appropriately scaled value of the acceleration due to gravity g. The proposed latent recognition module does not have the luxury of this information. We report results in two conditions. In Fig. 6(b), we train on real data and test on real data. Diversity in the dataset arises from the different types of spins and tosses. To show that our method is not overfitting, Fig. 6(a) displays results when we train on synthetic data and test on real data. Both cases achieve successful recognition of the ground-truth governing equation. The symbolic form of the equation we learn reconciles with the known physics model up to an affine transform in the governing parameters.
It is important to note that slight error is observed when testing on real data. We believe that physical non-idealities such as air resistance and drag account for a part of this inconsistency.

3) FREE FALL (APPLICATION)
Object relocation after long-time occlusion is a major challenge for object tracking. For example, in a real basketball game video, trackers often fail to track the ball when the scene is cluttered with long-time occlusions. As humans, with knowledge of physics, it is easy for us to predict the ball's trajectory, even with limited visibility. We apply the same idea to object trackers and teach them to use the recognized physics as an inductive bias to overcome these challenges. Fig. 6(c) illustrates the proposed framework. We test two state-of-the-art single-object trackers, KeepTrack [65] and STARK [57], for this case. Fig. 6(d) shows that both trackers can successfully track the ball before reaching the occlusion region but fail to relocate it when it reappears on the right-hand side. Vanilla tracking algorithms do not have prior knowledge of physics as an inductive bias. Therefore, when the algorithms update the object's position based on its previous state, the new search region can easily be misled by an incorrect state resulting from occlusions in the cluttered scene. We collect a real basketball dataset with video clips of both easy samples (clear and full basketball trajectories) and hard samples (partial basketball trajectories with occlusions). To solve the occlusion problem, we first recognize accurate governing physical equations of the basketball from easy samples, which are then generalized to the hard samples under the same physical conditions. We note that directly applying classical physical equations here could be wrong because (1) the real physical environment is non-ideal (e.g. air resistance); and (2) the camera projection is unknown, so the values of some physical constants (e.g. gravity) need to be translated to pixel coordinates. We train our latent recognition module with occlusion-free videos and recognize the corresponding governing equations.
Next, we design a plug-and-play method to insert our recognized physical equations as the inductive bias for the trackers, so that no retraining is needed. Thanks to the symbolic governing equations, it is efficient to obtain the corresponding latent parameters that represent the governing parameters (e.g. initial velocities), even when visible frames are limited. With both the governing equations and parameters known, we can compute a sequence of physics states to predict the location of the target object, serving as the inductive bias in parallel with the current state given by the trackers. A discriminant condition flags tracker failure when the difference between the two states is beyond a threshold ratio ϵ. When this condition is met, the trackers follow the inductive bias and replace the wrong current state with the physics state, thereby optimizing the search region for the next state. In Fig. 6(d), we show an example of how the proposed method helps relocation after long-time occlusion in a real basketball video clip. In this video, the basketball is fully occluded from frame 21 to 67, and both trackers fail to relocate the ball in subsequent frames; our proposed method successfully solves this problem.
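The discriminant step can be sketched as follows (the threshold value and the normalization by the physics displacement are illustrative assumptions; the text above only specifies a threshold ratio ϵ):

```python
import math

def fuse_states(tracker_xy, physics_xy, prev_xy, eps_ratio=0.5):
    """Plug-and-play correction sketch: if the tracker's state disagrees
    with the physics prediction by more than eps_ratio times the physics
    displacement since the previous frame, trust the physics state.
    eps_ratio=0.5 is an assumed illustrative threshold."""
    disagreement = math.dist(tracker_xy, physics_xy)
    step = math.dist(physics_xy, prev_xy) or 1.0   # guard divide-by-zero
    return physics_xy if disagreement / step > eps_ratio else tracker_xy
```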

V. PERFORMANCE ANALYSIS
We now analyze, in reasonable detail, the characteristics and performance of the proposed approach. These factors are especially important for the framework's use as a physics recognition unit in future application domains (e.g. biomedicine, astrophysics).
Robustness Against Noise: To assess performance in the presence of noise, we use the synthetic free fall task and add noise of varying strengths to the output of the position detection module. This corrupted data is then used to train the latent physics module and serves as the input to the equation recognition module. The plots of governing parameters in Fig. 7 show that with increasingly noisy input trajectories, the representations remain relatively robust, although their variance increases with the corruption level. Using even noisy (yet correlated) representations in the equation recognition step still enables us to recover output equations that resemble the known physical laws. The method eventually fails for corruption with noise of standard deviation 128 pixels. At this very high noise level, even the direction of the trajectory changes (i.e. the ball appears to travel backward). We can observe this in the last column of Fig. 7.
Equation Complexity Versus Accuracy: Here we discuss how the proposed framework is able to recover the correct equation. In order to choose an appropriate trade-off between fitting accuracy and complexity, we use plots such as those shown in Fig. 8. The knee point of the trade-off curve is chosen as the most meaningful equation, since it marks the point of maximum gain in error performance with minimal increase in complexity. This selection ensures that the genetic programming algorithm refrains from over-fitting on the data.

FIGURE 7. The proposed method is found to be robust when considerable zero-mean additive Gaussian noise is added to the trajectory. The framework is tested on synthetically added noise with standard deviation ranging from 4 to 128 pixels (at a scale of 300 pixels/meter). The representations are found to be robust for noise of standard deviation up to 32 pixels, with equations demonstrating analogous robustness. The method fails at a noise standard deviation of 128 pixels, which can be seen to completely bury the trajectory signal in noise.
Effect of Training Data Size: We analyze the performance of our proposed method with respect to varying amounts of training data. This is relevant to possible applications of the framework (or others inspired by it) to tasks with varying data availability. Fig. 9 shows the results of this analysis on the task of synthetically simulated free fall motion. We evaluate performance based on: (a) the normalized cross-correlation coefficient between the learned active latent node and the ground-truth governing parameters, and (b) the trajectory prediction accuracy obtained by feeding the latent values predicted by the physics recognition module on the test dataset into the recognized equations. The general trend of increasing correlation and decreasing prediction error with more training samples is clearly visible in the plots. Notably, even in the scenario with the fewest input samples (200 samples), the worst-case correlation remains as high as 0.95.
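The correlation metric in (a) can be sketched as follows (one standard Pearson-style definition, assumed to match the one used here):

```python
import numpy as np

def normalized_cross_correlation(a, b):
    """Normalized cross-correlation between a latent node and a
    ground-truth governing parameter: both signals are standardized and
    the mean of their product is returned (+1/-1 for exact affine maps)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

v = np.linspace(-5, 5, 100)      # ground-truth parameter values
latent = 2.0 * v + 1.0           # an affine latent correlates perfectly
c = normalized_cross_correlation(latent, v)
```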

VI. CONCLUSION
Through this work, we have demonstrated the ability to recognize physics from video streams. Our method is unique in that it is able to recognize both the governing equations and the physical parameters. Our results are powered by an encoder-decoder framework that learns latent representations. These, in conjunction with position and scale information, allow us to learn interpretable equations that describe the physics of the task at hand. We show that our method is able to learn from real videos in the wild. The learnt equations, used as plug-and-play priors, are shown to improve object tracking through challenging environments. Some future directions that could be explored:

1) BEYOND 2D PHENOMENA
The Visual Physics dataset consists of 2-dimensional scenarios. For example, the tossed ball is viewed from the side, such that the ball does not change in its axial depth. For engineering reasons, we assume that the physical phenomenon is observed in the 2D camera space of a video camera. If dynamics occur in 3 dimensions (e.g. motion in x, y, z), then our algorithmic framework is still valid, but we must use a 3D camera to capture these 3D dynamics. In general, the Visual Physics framework can apply to higher-dimensional scenarios, potentially outside of video, provided that the measurement space is able to capture the phenomena.

2) OPEN PROBLEMS
Analogous to the apocryphal story of Newton's apple, we have considered the dynamics of a single object. This work is therefore a stepping stone to understanding the dynamics of multiple objects. Another open problem is to extend the framework beyond the three modules we have proposed. Concretely, we could add a fourth module in which the equation and embeddings we recognize are used as input to another inference framework. For example, it might be possible to improve object detection given the velocities of objects, or to create computational imaging frameworks that learn to classify scenes based on scattering properties. In conclusion, this paper only scratches the surface of the possibilities at the seamline of computer vision, physics, and artificial intelligence. We are excited to see these fields continue to merge.

APPENDIX A MORE PHYSICAL PHENOMENA
A. CONSTANT ACCELERATION MOTION (SYNTHETIC)
In this task, the trajectory is governed by a single parameter: the acceleration a acting on the object. The obtained results are displayed in Fig. 10(a). As expected, since only one node is required to describe the phenomenon, the embedding trends show that two nodes are invariant to the input and learn an almost constant, low-magnitude value. The other node, which is correlated with the input, learns the acceleration. Turning to the output equations, we find that our method recognizes both the correct functional form and a latent variable that maps to an interpretation of a. Also note that the value of the y coordinate, which is expected to be constant, is recognized accurately.
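The governing law for this task can be sketched directly; the snippet below generates such a single-parameter trajectory (the initial conditions and sampling rate are illustrative assumptions, not the dataset's exact values):

```python
import numpy as np

def constant_acceleration_trajectory(a, x0=0.0, v0=0.0, fps=30, n_frames=200):
    """Single-parameter trajectory: x(t) = x0 + v0*t + 0.5*a*t**2.

    The y coordinate is held constant, matching the expectation that the
    method recovers a constant y. Initial conditions are illustrative.
    """
    t = np.arange(n_frames) / fps
    x = x0 + v0 * t + 0.5 * a * t ** 2
    y = np.zeros_like(t)
    return t, x, y

t, x, y = constant_acceleration_trajectory(a=2.0, v0=1.0)
```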

B. UNIFORM CIRCULAR MOTION (SYNTHETIC)
This task has a sinusoidal, rather than polynomial form. For a fixed radius of revolution, the governing parameter we seek to recognize is the angular frequency ω of the rotating object. Hence, this task also depends on a single governing parameter. Fig. 10(b) highlights that one of the latent parameters is correlated with angular frequency, while the other two are uncorrelated to the input. Based on the learned parameters and observed positions, the proposed method correctly identifies a sinusoidal dependence for both the x and the y coordinates.

C. HELICAL MOTION (SYNTHETIC)
The synthetic videos are generated with different angular velocities ω and horizontal translational velocities v_0x. There is no translational motion along the y direction, and the radius of the rotational motion is held constant for all the videos. Fig. 10(c) shows the learned representations and equations along the x and y directions. It may be observed that two of the latent representations are affine transforms of the governing physical parameters, v_0x and ω, and the derived equations are of the same functional form as the true equations. This emphasizes the performance of our framework on scenarios with multiple physical phenomena in action.
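One plausible parameterization consistent with the description above (fixed radius, horizontal translation at v_0x, no translation along y) can be sketched as follows; the exact functional form used to render the dataset is an assumption here:

```python
import numpy as np

def helical_trajectory(omega, v0x, r=1.0, fps=30, n_frames=200):
    # Fixed-radius rotation superposed on a horizontal translation at
    # v0x; no translational motion along y, matching the setup above.
    t = np.arange(n_frames) / fps
    x = v0x * t + r * np.cos(omega * t)
    y = r * np.sin(omega * t)
    return t, x, y

t, x, y = helical_trajectory(omega=2.0, v0x=1.0)
```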

APPENDIX B ADDITIONAL REAL DATA EXPERIMENT DETAILS
A. ABOUT CAMERA CALIBRATION AND SCALE
An essential aspect of the interpretability and usability of the learned equations and physical parameters is knowledge of the system of units and scale. Our goal of recognizing physics from a single camera is therefore complicated by the projective behavior of a typical camera: scale information is lost, and additional warping is introduced in the image frame. Our proposed framework uses a simple yet accurate way of resolving these ambiguities. As in experimental physics settings, where the experimental setup is completely characterized, we assume knowledge of the camera calibration parameters as well as the object size. Recognition is therefore applied to calibrated videos (i.e., the camera configuration, such as position and orientation, is known) with a specified scale (i.e., the conversion factor from pixel units to SI units is known, for example via the diameter of an object, like a ball, measured in both pixels and meters), and coordinates and dimensions are converted from pixel scale to SI units accordingly. This assumption (of knowledge of object size and camera parameters) is reasonable in the physical discovery/recognition setting, where the properties of the object under consideration and the experimental setup are known a priori.
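As a minimal sketch of the scale-resolution step, assuming a known object diameter in both pixel and SI units (the function and variable names are illustrative):

```python
def pixels_to_meters(coords_px, object_diameter_px, object_diameter_m):
    """Convert pixel coordinates to SI units via a known object size.

    The conversion factor comes from an object (e.g. a ball) whose
    diameter is known in both pixels and meters.
    """
    scale = object_diameter_m / object_diameter_px  # meters per pixel
    return [(x * scale, y * scale) for (x, y) in coords_px]

# If a 0.24 m ball spans 48 px, each pixel corresponds to 0.005 m.
pts_m = pixels_to_meters([(100, 200)], object_diameter_px=48,
                         object_diameter_m=0.24)
```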

B. REAL BASKETBALL DATASET RESULTS
TABLE 2. Description of the synthetic Visual Physics dataset. These three physical phenomena are representative of fundamental trajectory motion. Although all scenes describe trajectories, the governing equations and parameters are different (e.g. polynomial for some, and sinusoidal for others).

We collect a dataset containing 523 video clips of basketball tosses at a frame rate of 30 frames per second, with both easy samples (clear, complete basketball trajectories) and hard samples (partial basketball trajectories with occlusions), as mentioned in Section IV-B-(3). To solve the problem of object tracking through occlusions, we first recognize the governing physical equations of the basketball from easy samples; the recognized equations then generalize to any hard samples under the same physical conditions. We train our latent recognition module with 415 occlusion-free videos. The videos are pre-processed the same way as all the other datasets: object detection using the Mask R-CNN [56] framework, followed by tracking and bounding box estimation in each frame using the STARK [57] algorithm. The software implementation details are provided in Appendix D. The bounding box centroids are then passed to the β-VAE network for latent parameter learning. Here, the value of β for the β-VAE network is set to 12, and the module is trained for 4000 epochs with a learning rate of 0.001 that decays exponentially by a factor of 0.95 after every 200 epochs. The learned latent parameters, and subsequently the governing equations recognized from these parameters, are illustrated in Fig. 11. Based on this training, we can recognize equations that clearly describe the motion of the basketball at any given time instant.
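The training schedule described above (learning rate 0.001, decaying by a factor of 0.95 after every 200 epochs) can be expressed as a small helper; the step-wise form is an assumption consistent with the text:

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.95, step=200):
    # Step-wise exponential decay: the rate is multiplied by 0.95 after
    # every 200 epochs, across the 4000-epoch training run.
    return base_lr * decay ** (epoch // step)
```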

APPENDIX C QUANTITATIVE PERFORMANCE EVALUATION
The performance of the proposed Visual Physics framework may be measured along two fronts: (i) the mean error between the ground-truth trajectories and the trajectories from recognized equations, and (ii) the normalized cross-correlation coefficient between the latent representations and the corresponding ground-truth governing parameters. The analysis of the effect of training data size, described in Section V, utilizes these metrics for evaluation. Here, we describe these metrics in more detail.
Let the ground-truth trajectory coordinates be denoted by (x(t), y(t)) at a given time instant t. Based on the Visual Physics framework, let the learnt equations for x and y be given by x = f_x(t, l_1, l_2, ..., l_n) and y = f_y(t, l_1, l_2, ..., l_n), where l_1, l_2, ..., l_n are the latent node values. Then, the mean error between trajectories (ϵ) can be computed as

ϵ = (1/S) Σ_t sqrt( (x(t) − f_x(t, l_1, ..., l_n))² + (y(t) − f_y(t, l_1, ..., l_n))² ),

where S is the total number of time samples in the trajectory under consideration. Additionally, the values for l_1, l_2, ..., l_n are estimated through least-squares. Some values of ϵ evaluated on trajectories for the free fall case may be found in Fig. 8 of the main paper. A test set of unseen trajectories was evaluated using these metrics. A low value of the error implies that the model (equation) learnt is sufficiently parameterized to characterize the observed trajectory, and that the time evolution of the predicted trajectory matches that of the observed trajectory.

Let the ground-truth governing parameters be represented by g_1, g_2, ..., g_m, m ≤ n. On successful recognition, the hidden nodes of the latent physics module are expected to show strong correlations with the governing parameters. Hence, the normalized cross-correlation between a latent node l_i and a governing parameter g_j is given by

ρ(l_i, g_j) = (1/K) Σ_{k=1}^{K} (l_i^(k) − mean(l_i)) (g_j^(k) − mean(g_j)) / (σ_{l_i} σ_{g_j}),

where K is the number of test trajectories (with each trajectory consisting of a sequence of time-varying position coordinates), and σ_u is the standard deviation of any variable u. We look at the magnitude of the strongly correlated hidden node-governing parameter pairs, and use this magnitude as an indicator of 'goodness of latent representations'. Fig. 8 again highlights the computed values for the free fall task. It may be observed that the values of the correlation metric are acceptably high. An additional metric for the goodness of latent representations and complexity evaluation can be the number of latent nodes required for the task.
For instance, it would be interesting to apply this framework to multidimensional physics tasks, where the number of governing parameters far exceeds three, requiring a larger number of latent parameters.
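Under one plausible reading of these metrics (mean Euclidean error over time samples, and a Pearson-style normalized cross-correlation across test trajectories), they can be sketched as:

```python
import numpy as np

def mean_trajectory_error(x_gt, y_gt, x_pred, y_pred):
    # Mean Euclidean distance between ground-truth and recognized
    # trajectories, averaged over the S time samples.
    return float(np.mean(np.hypot(np.asarray(x_gt) - np.asarray(x_pred),
                                  np.asarray(y_gt) - np.asarray(y_pred))))

def normalized_cross_correlation(latent, param):
    # Correlation between one latent node and its ground-truth governing
    # parameter, across K test trajectories.
    l = np.asarray(latent, dtype=float)
    g = np.asarray(param, dtype=float)
    return float(np.mean((l - l.mean()) * (g - g.mean())) / (l.std() * g.std()))
```

An affine relation between a latent node and its governing parameter yields a correlation magnitude of 1, which is the behavior expected on successful recognition.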

APPENDIX D SOFTWARE IMPLEMENTATION DETAILS
This section details the synthetic dataset generation, the real-data pre-processing, and the overall framework implementation.

A. DATASET GENERATION (SYNTHETIC DATA)
The synthetic dataset comprises an object undergoing motions governed by a range of diverse physical laws. We use Python and associated toolkits, specifically NumPy and OpenCV, to simulate these phenomena. Each scene consists of a spherical object of fixed size. The background is chosen to be a constant frame, independent of the video. Frame rate, video duration and frame size are the tunable parameters of this setup. The trajectory of the ball is then calculated based on the initial positions, initial velocities and time. Specifically, the initial velocity range is chosen so that, for a given initial position, the object stays in the frame at all times. Based on these parameters, the object location at each time instant is determined using kinematic equations, and the corresponding frame is created. These sets of frames are then stored as the respective videos. Note that for the train-on-simulation, test-on-real regime, for the uniform circular motion and free fall tasks, the frame size, frame rate and scale were chosen to be consistent with the real data.
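The rendering step above can be sketched as follows; the paper's pipeline uses OpenCV for drawing, so this NumPy-only disk rasterizer is a stand-in with illustrative frame size and radius:

```python
import numpy as np

def render_frames(xs, ys, frame_hw=(128, 128), radius=4):
    """Rasterize a trajectory into video frames: a fixed-size disk on a
    constant (black) background, one frame per trajectory sample.
    """
    h, w = frame_hw
    yy, xx = np.mgrid[0:h, 0:w]
    frames = []
    for cx, cy in zip(xs, ys):
        frame = np.zeros((h, w), dtype=np.uint8)   # constant background
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
        frame[mask] = 255                          # draw the ball
        frames.append(frame)
    return np.stack(frames)

video = render_frames([10, 20, 30], [64, 60, 52])
```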

B. POSITION DETECTION MODULE
To process the videos, we developed a Mask R-CNN [56] and STARK [57] based framework, highlighted in Section III-A-(1), to convert the videos into position vectors that can be processed by the latent physics module. The input to the pretrained Mask R-CNN is the first frame of a video with N frames (N = 200 for synthetic data). The Mask R-CNN framework is obtained from the PyTorch torchvision library. The Mask R-CNN detects the bounding box of the ball in this frame, and the centroid of this bounding box is taken as the position of the ball in the first frame. This bounding box is then fed as input to the STARK [57] object tracking algorithm; we use the STARK-ST tracker with the baseline pretrained model. The algorithm then processes the video and provides the location of the ball in each subsequent frame as a pair of (x, y) coordinates. The frames are sampled alternately, such that the even-numbered frames are processed by the latent physics module.

The odd-numbered frames are then used as the query input set. The output of this module is hence an (N + 1)-length vector, where the first N/2 elements correspond to y coordinates, the next N/2 elements correspond to x coordinates, and the last element corresponds to the frame number used as the query. For the uniform circular motion task, the position detection module was modified slightly to avoid spurious detections by Mask R-CNN in the video frames. The modification is to convolve each frame of the video with a Gaussian blur kernel (using OpenCV), so that irrelevant stationary components of the frame are partially abstracted out and the Mask R-CNN detects only the object of interest. Since we deal with a single-object setting in our work, this blurring technique improves the robustness of Mask R-CNN for detecting the object of interest in a variety of real scenes.
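The vector layout described above can be sketched as follows (the even/odd frame convention follows the text; the helper name is illustrative):

```python
import numpy as np

def pack_positions(xs, ys, query_frame):
    """Assemble the (N+1)-length input vector: first the N/2 y
    coordinates from even-numbered frames, then the N/2 x coordinates,
    then the query frame number as the final element.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return np.concatenate([ys[::2], xs[::2], [float(query_frame)]])

# Toy example with N = 8 frames and an odd (query) frame index of 3.
vec = pack_positions(np.arange(8), np.arange(8) * 10, query_frame=3)
```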

C. LATENT PHYSICS MODULE
This module uses the position outputs from the previous step to identify governing parameters in the latent nodes. We use a feed-forward neural network for this purpose, specifically a modified β-Variational Auto-Encoder (β-VAE) architecture [61], [66]. The inputs of length N + 1 are obtained from the position detection module (N is the number of input video frames to the position detection module). The encoder and decoder each consist of 6 fully-connected layers. The dimension of every hidden layer is fixed at 256, and we use three latent nodes. For training, we concatenate a randomly chosen time query with the latent nodes as input to the decoder. The output of the decoder is the position of the object at the specified time query. As mentioned in Section III-B-(2), we use the mean squared error (MSE) loss on the predicted locations, regularized by the β-VAE disentanglement loss.
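The training objective, MSE reconstruction loss regularized by the β-weighted disentanglement term, can be sketched as follows for a diagonal-Gaussian latent posterior (a standard β-VAE formulation; the NumPy form here is illustrative rather than the exact training code, with β = 12 taken from the basketball experiment):

```python
import numpy as np

def beta_vae_loss(pred, target, mu, log_var, beta=12.0):
    """MSE on predicted positions plus beta times the KL divergence of
    the diagonal-Gaussian latent posterior from a unit Gaussian.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    mse = np.mean((np.asarray(pred, dtype=float)
                   - np.asarray(target, dtype=float)) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return float(mse + beta * kl)
```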

D. EQUATION RECOGNITION MODULE
As mentioned in Section III-B-(2), we use the genetic programming toolkit Eureqa [48]. For our experiments, we use a fixed configuration setup. The inputs for genetic programming are the positions along x and y, the time instants t (evaluated using the frame index and the frame rate of the video) and the latent node information for each trajectory, l_1, l_2, ..., l_n, where n is the number of latent nodes used. For our experiments we use n = 3. For M training trajectories and K samples per trajectory, we therefore have M × K sets of (x, y, t, l_1, l_2, ..., l_n) as inputs.
The error metric is chosen to be the R-squared goodness of fit. The candidate functions are chosen to be: (i) constant, (ii) input variable, (iii) addition, (iv) subtraction, (v) multiplication, (vi) division, (vii) sine, (viii) cosine and (ix) exponential. The complexity for each of the candidate functions is kept at the default value. No other configuration parameters of the toolkit are modified. Since the toolkit output includes several equations of varying complexity, the final equation is chosen based on Pareto optimality in the fit-complexity space, as mentioned in Section III-A-(3).
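The final selection step, keeping only equations that are Pareto-optimal in the fit-complexity space, can be sketched as follows (the tuple layout is an illustrative assumption):

```python
def pareto_front(candidates):
    """Keep candidate equations that are Pareto-optimal in the
    fit-complexity space: no other candidate is at least as good on both
    error and complexity and strictly better on one. Each candidate is
    an (error, complexity, equation) tuple.
    """
    front = []
    for e, c, eq in candidates:
        dominated = any(e2 <= e and c2 <= c and (e2 < e or c2 < c)
                        for e2, c2, _ in candidates)
        if not dominated:
            front.append((e, c, eq))
    return front

# 'c' is dominated by 'b' (worse fit and higher complexity), so only the
# genuine trade-off pair {'a', 'b'} survives.
front = pareto_front([(0.1, 5, 'a'), (0.2, 3, 'b'), (0.3, 4, 'c')])
```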

E. PLUG-AND-PLAY METHOD
Algorithm 1 illustrates the plug-and-play method mentioned in Section IV-A-(3), for plugging our recognized physical equations into the object tracking framework.
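A Python sketch of the same plug-and-play loop, with illustrative placeholder names for the governing equation E_0, the combine step, and the visibility horizon (the tracker's search-region proposal is omitted for brevity):

```python
import numpy as np

def plug_and_play_track(states, E0, combine, n_visible, eps=5.0):
    """Latent parameters are estimated on visible frames, the median of
    the latest five fixes the governing equation, and the tracker state
    is overridden by the physics prediction whenever the two disagree by
    more than eps (names are illustrative, following Algorithm 1).
    """
    latents, out = [], []
    for t, s_t in enumerate(states, start=1):
        if t < n_visible:                      # visible frames
            latents.append(E0(t, s_t))         # latent parameter from E_0
        l = np.median(latents[-5:])            # robust estimate, last 5
        E = combine(E0, l)                     # final governing equation
        p_t = E(t)                             # physics-predicted state
        if np.linalg.norm(np.asarray(s_t) - p_t) > eps:
            s_t = p_t                          # trust physics over tracker
        out.append(s_t)
    return out

# Toy illustration: the tracker state at t = 3 has drifted far from the
# physics prediction, so the physics state overrides it.
E0 = lambda t, s: 1.0
combine = lambda E0, l: (lambda t: np.array([float(t), 0.0]))
tracked = plug_and_play_track([(1.0, 0.0), (2.0, 0.0), (100.0, 0.0)],
                              E0, combine, n_visible=3, eps=5.0)
```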

F. RUNTIME ANALYSIS
Experiments were performed on a Linux (Ubuntu 18.04 LTS) machine with an Intel i5-8400 CPU (6 cores, 2.80 GHz), 16 GB of RAM, and an NVIDIA GeForce RTX 2070 GPU (8 GB of GPU RAM). Table 3 shows the runtime analysis for the helical motion task. As the table suggests, the overall runtime for this task is approximately 1.5 hours. The position detection module is the primary bottleneck in our framework, largely due to the size of the dataset. Depending on the complexity of the equation, the time required by the equation recognition module to converge to a plausible equation ranges from 60 s to 1800 s for equations along two dimensions.

APPENDIX E INTERPRETABILITY
Algorithm 1 Plug-and-Play Method
Given: current state s_t, initial annotation s_0
Input: governing equations E_0
 1: for t = 1, ..., N do
 2:   if t < n then                  # for visible frames
 3:     l_t ← E_0(t, s_t)            # calculate latent parameter l_t from the governing equations
 4:     L_t.append(l_t)              # store all l_t in a list
 5:   end if
 6:   l ← median(L_t[−5:])           # estimate l from the latest 5 states
 7:   E ← combine(E_0, l)            # final governing equations
 8:   p_t ← E(t)                     # physics state
 9:   if ||s_t − p_t|| > ϵ then      # state difference beyond the set threshold
10:     s_t ← p_t                    # update the current state with the physics state
11:   end if
12:   R_{t+1} ← T(s_t)               # propose new search region
13:   s_{t+1} ← (R_{t+1})            # update state
14: end for

FIGURE 12. Traditional classification on our learned latent space can provide additional interpretability. The decision boundaries highlight the range of the learned physical parameters to determine whether a ball will arrive at the target region in future frames (in the case of the free fall task), or whether a surface is rough (in the case of the constant acceleration task).

To further illustrate the significance of recognizing both the physical parameters and the associated physical laws (as opposed to prior work such as [8] and [9] that assume
knowledge of physical parameters), we highlight a new setting where the learned physical parameters are used to enhance existing computer vision tasks. First, we consider the free fall task with both real and synthetic samples, for the task of future position prediction. This holds relevance in several real-world applications, including sports analytics. For real data, we use 410 samples, split into 340 train and 70 test samples. For synthetic data, we use 500 train and 100 test samples. The ground-truth labels are assigned based on whether the tossed ball will travel through a predefined rectangular region in future frames (labeled "in") or not (labeled "out"). We also conduct a similar classification on the constant acceleration motion, to determine the roughness of various surfaces in the scenario of non-slipping frictional deceleration of a constant-mass object. We use 500 synthetic samples for training and 100 for testing. The associated ground-truth labels are generated based on the simulated frictional coefficients (µ): surfaces with 0 < µ ≤ 0.5 are labeled smooth, and those with 0.5 < µ ≤ 1 are labeled rough.
Since the goal is to predict behavior in future frames, the classification models are provided with only the learned representations of the first 20 frames of the videos in all experiments. We use a two-layer multilayer perceptron (MLP) with eight hidden units and the ReLU activation function. The inputs to the classifiers are the latent parameters found to be relevant in the learned equations. The classification results for the above experiments are illustrated in Fig. 12. For each task, we highlight the decision boundaries learned by the MLP, as well as the test sample locations within the learned governing parameter space. We observe clear boundaries between the two groups in each task, and each region of this learned space can be interpreted by tracking the definitions of the variables in the recognized equations.
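The classifier itself is small enough to sketch in full; the forward pass below assumes already-trained weights with illustrative shapes (two latent inputs, eight hidden units, two classes):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, matching the eight-hidden-unit
    classifier described above; a forward pass only, with weights
    assumed to be already trained.
    """
    h = np.maximum(0.0, x @ W1 + b1)   # hidden layer, ReLU activation
    logits = h @ W2 + b2               # class scores ("in" vs "out")
    return logits

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2))        # 4 samples, 2 latent inputs
logits = mlp_forward(x,
                     rng.standard_normal((2, 8)), np.zeros(8),
                     rng.standard_normal((8, 2)), np.zeros(2))
```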
Note: The decision boundaries in Fig. 12 come from a model trained with only the learnt latent parameters, and the plots are 2-dimensional (i.e., they show the active latent parameters) to enable easy visualization. However, tasks such as location classification (Fig. 12(a), (c)) benefit from additional inputs to the MLP (namely, the initial position of the object). With these additional inputs, our simple classification model achieves a test accuracy of 84.4% when trained on only 435 real-world trajectories.