Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation

The minimum cost lifted multicut problem is a generalization of the multicut problem (also known as correlation clustering) and is a means of optimizing a decomposition of a graph w.r.t. both positive and negative edge costs. It has proven useful in a large variety of computer vision applications because multicut-based formulations do not require the number of components to be given a priori; instead, it is deduced from the solution. However, the standard multicut cost function is limited to pairwise relationships between nodes, while several important applications either require or can benefit from a higher-order cost function, i.e., hyper-edges. In this paper, we propose a pseudo-boolean formulation for a multiple model fitting problem. It is based on a formulation of any-order minimum cost lifted multicuts, which allows partitioning an undirected graph with pairwise connectivity such as to minimize costs defined over any set of hyper-edges. As the proposed formulation is NP-hard and branch-and-bound algorithms (as well as obtaining lower bounds) are too slow in practice, we propose an efficient local search algorithm for inference in the resulting problems. We demonstrate the versatility and effectiveness of our approach in several applications: 1) We define a geometric multiple model fitting problem, more specifically a line fitting problem on all triplets of points, and group points that belong to the same line together. 2) We formulate homography and motion estimation as a geometric model fitting problem where the task is to find groups of points that can be explained by the same geometric transformation. 3) In motion segmentation, our model allows to go from modeling translational motion to Euclidean or affine transformations, which improves the segmentation quality in terms of F-measure.


INTRODUCTION
Multicut-based formulations have recently received considerable attention and have been successfully applied to a variety of tasks in computer vision [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. Their application to a new problem is particularly easy: one only needs to provide probability estimates for pairs of nodes to belong together; no information about the exact number of clusters or their expected sizes is necessary. These parameters are determined by the solution. Multicut-based frameworks are therefore particularly interesting for tasks such as multiple model fitting or multiple object motion segmentation, where one wants to avoid additional model selection steps and prefers the correct number of objects/models to be directly inferred from each problem instance.
However, the multicut cost function [13], [14] can assign a cost or a reward only to direct neighbors in the graph, which can be a serious limitation in certain applications. For example, in image segmentation on a 4-connected graph, the final solution is likely to deteriorate significantly, since inter-pixel edge probability estimates tend to be noisy. Keuper et al. [3] introduced additional (lifted) edges into the multicut objective, which capture information in a non-local neighborhood while preserving the original feasible set of solutions. Furthermore, Kim et al. [15] proposed a higher-order multicut formulation that allows modeling dependencies between more than two nodes. In this work, we combine these two ideas in one formulation. This generalization allows us to apply minimum cost multicuts to geometric model fitting as well as motion segmentation problems, which both require higher-order non-local costs and can have a variable number of objects per problem instance.
On the downside, Bansal et al. [14] showed that the multicut problem is NP-hard. This result extends to the above-mentioned multicut-based formulations. Although branch-and-bound algorithms [2], as well as LP relaxations [15], [16], are feasible for small problems, they do not easily scale [17]. Instead, we propose a local search algorithm based on an efficient move-making algorithm [3]. This original heuristic by Keuper et al. proposes feasible solutions for the (second-order) lifted multicut problem. Here, we extend it to also handle higher-order terms and their combinations. Such heuristics do not provide any guarantees on solution quality or computation time, but work well in practice [17] and provide feasible solutions at any time. Thanks to the affordable runtime of the proposed local search algorithm, we are able to apply higher-order (lifted) multicuts to large problems, which we describe in detail below.

Geometric Model Fitting
The task of robust geometric model fitting is to explain observational data under a given model assumption. In this paper we tackle the most general problem setting, i.e., we assume that the number of models is unknown, there is a significant amount of background noise, and the models themselves are perturbed.
The classical way to solve this task is random sample consensus (RANSAC) [18]. It starts with a random subset of data points and iteratively adds or removes data points to grow the inlier set, i.e., the set of points with a model error below a certain threshold w.r.t. a single model. A straightforward extension of RANSAC to multiple model fitting is to iteratively fit a model, remove all associated inliers, and proceed with fitting the next model. This can lead to undesired results, since information about possible relations between the removed and the remaining data is lost. Zuliani et al. [19] proposed multiRANSAC to improve over RANSAC in this respect, but it requires the user to specify the number of models. Like other recent geometric model fitting approaches [20], [21], it is based on random sampling and is thus sensitive to the initial conditions.
In contrast, we cast the geometric model fitting problem as a point grouping problem, as previously done, e.g., in [22], [23], [24]. Our approach is entirely based on local observations, over which a global and probabilistically motivated optimization is possible. In this setting, the considered problem sizes are usually small, since previous approaches (e.g., [20], [21]) require large computation times; we can therefore easily employ and solve models of order up to 5 and thereby show that our formulation and heuristic are principled.

Motion Segmentation
Motion segmentation, as addressed for example in [25], [26], [27], [28], [29], is the task of segmenting salient moving objects in a video. According to the Gestalt principle of common fate [30], the motion patterns of objects are often more homogeneous than their appearance and provide robust cues for moving object segmentation. Thus, from accurately estimated point-wise motion, object motion models can be fit by formulating a point grouping problem over local motion similarities.
The Euclidean difference between two local motion descriptors, such as optical flow vectors or point trajectories, measures how well the behavior of the two entities can be described by a single translational motion model. A simple model with only pairwise potentials can yield good performance in practice [4], [31]. While successful in simple scenarios, more complex motion patterns cannot be resolved with pairwise potentials alone. For example, scaling (e.g., zooming of the camera or movement of objects toward the camera), out-of-plane rotation, and highly non-rigid motion of object parts hinder high-quality motion segmentation. Therefore, higher-order motion models that can compare more than two motion vectors at a time are required.
Transformations describing translation, rotation and scaling can be estimated from two motion vectors. Thus, for any three points, one can estimate through residual errors how well their motion can be explained by one Euclidean transformation. Costs that describe such motion differences are thus at least of order three. Affine motion differences can be estimated from four motion vectors, and to assign costs to differences in homographies, the minimum required order is five. Our model offers the flexibility to combine edges of varying order in one problem instance, which we exploit to produce robust model fits up to Euclidean transformations. Yet, complementary to the geometric model fitting application, the relevant motion segmentation benchmarks yield rather large problem sizes, such that we are limited to models up to order three in this setting. Thus, the first application we consider, geometric model fitting, indicates the principled applicability of our formulation, while the second, motion segmentation, proves the practical relevance of these results.
One additional adversity in motion segmentation is distinguishing between different objects with similar underlying motion patterns. Lifted multicuts [3] have been shown to resolve such ambiguities appropriately in the context of image segmentation. We show that higher-order graphs with third-order edges, and their combination with lifted edges, yield better motion segmentations and disambiguate complex motion patterns such as similarly moving objects, scaling motions, and out-of-plane rotations of objects.
Contributions. In summary, we make the following contributions: We provide a formulation of the geometric model fitting problem using higher-order lifted minimum cost multicuts. We show its applicability to multiple model fitting problems such as line fitting from total least squares estimates and prove its practical benefit for motion segmentation. In contrast to previous approaches, our model allows the segmentation of noisy data into segments w.r.t. motion models beyond in-plane translation by combining second- and third-order edges, and it allows for efficient optimization.

RELATED WORK
Geometric Model Fitting. Most recent and well-performing approaches to geometric model fitting are based on preference analysis. They start by sampling a number of model hypotheses to build a preference matrix from the inliers of these hypotheses. J-Linkage [21] builds a preference matrix by assigning either 1 or 0, indicating whether a point belongs to a hypothesis, as determined by a threshold. It then performs greedy agglomerative clustering of points using the Jaccard distance, i.e., it groups points with similar preference sets. To avoid hard 0/1 assignments, T-Linkage [20] relaxes the membership values and uses a different similarity metric [32]. Magri and Fusiello [33] cast geometric model fitting as a maximum set coverage problem: given an integer k, exactly k subsets of hypotheses are selected, covering the maximum number of points.
A separate line of research seeks to find low-rank representations of the preference matrix and thus discover the models. Robust Preference Analysis (RPA) [34] builds a symmetric kernel of pairwise similarities (measured with the Tanimoto distance). It then performs robust PCA to obtain a rank-k representation and clusters points in this reduced space using symmetric non-negative matrix factorization. Avoiding all intermediate operations, Non-negative Matrix Underapproximation (RS-NMU) [35] directly finds a low-rank representation of the preference matrix and yields high-quality results. Their method gains robustness via a t-test that efficiently filters out statistically insignificant hypotheses. Denitto et al. [36] compute a sparse low-rank representation of the preference matrix using FABIA [37] and obtain bi-clusters, i.e., clusters over both rows and columns of the matrix, which allows points to belong to several geometric models simultaneously.
Isack and Boykov [38] formulated a combinatorial optimization problem with a discrete label space of possible model parameters. They iteratively find inliers to the models by applying the α-expansion algorithm [39], then re-estimate model parameters and prune redundant ones. Amayo et al. [40] proposed an efficient primal-dual optimization method for a convex relaxation of the discrete energy of [38]. Recently, Barath and Matas [41] introduced density modes in the models' parameter space and showed that applying mean-shift [42] can efficiently reduce the number of models.
Higher-order clustering has previously been used in [22], [23], [24], [43], [44], [45], [46] for geometric model fitting. Higher-order potentials are defined over k-tuples of points, and the probability of a tuple to belong together is computed from the residuals to sampled hypotheses. However, to perform clustering, these approaches transform the higher-order terms into pairwise potentials or use randomized techniques and apply spectral clustering. Thus, in principle, they also require the number of models to be given. In contrast, we work directly with the higher-order potentials and perform correlation clustering so as to automatically recover the optimal number of clusters. Still, the results presented in [22] and [24] can be seen as a motivation for the use of local evidence on larger-than-minimal sets, as employed in the proposed approach.
Our approach relies on the minimum cost lifted multicut formulation for hyper-graph decomposition. This formulation differs from both lines of previous work. In contrast to spectral clustering, the multicut formulation does not impose any balancing criterion. Moreover, we directly infer segmentations from the hyper-graph without any projection onto its primal graph. In contrast to MRFs, the proposed approach allows higher-order edges to connect vertices globally, violating the Markov property. Further, MRFs and CRFs aim at inferring a node labeling with labels given a priori, while multicut approaches infer an edge labeling that yields an optimal number of segments.
End-to-end trained CNN-based approaches to motion segmentation, in contrast, are based on single-frame segmentations from optical flow [59], [60], [61], [62], [63], [64]. Tokmakov et al. [59], [60], [65] use large amounts of synthetic training data [66] and learn to generate binary object masks. In [59], motion cues are combined with an ImageNet [67] pretrained appearance model and a GRU for increased temporal consistency. Similarly, Jain et al. [64] employ a realistic dataset extracted from pairs of frames of the ImageNet video dataset [67] to learn object motion and appearance cues. While these approaches directly yield pixel-accurate segmentations, they replace explicit motion model assumptions with huge amounts of training data and cannot inherently determine the number of moving objects. In DyStaB [68], unsupervised moving object segmentation is performed by partitioning the motion field w.r.t. a mutual-information-based objective and learning object models from the segments. Yang et al. [69] propose a self-supervised transformer model that segments optical flow fields into primary objects and background in a generative way. The combination of appearance-based detectors and geometric motion segmentation is used to segment rigid motions in [70].
In [71], motion segmentation is approached in a probabilistic way, and the camera motion is subtracted from each frame for improved training. In [72], this idea of prior camera motion subtraction is used to enable better CNN training for frame-wise segmentation. Bideau et al. [73] propose a multi-step procedure in which first the camera motion, fit using RANSAC in the first frames, is subtracted, then a set of rigid motion models is fitted, and finally object segmentation proposals from CNNs are used to combine the rigid motion parts into objects. In [74], a unified approach is proposed that handles the moving object detection problem separately in 2D and 3D scenes, based on geometric interpretations and parallax motion analysis. Our approach directly estimates rigidly moving object parts in a single step, accounts for camera motion using third-order models, and does not depend on external object proposals.
Minimum Cost Multicuts. Most previous works on minimum cost multicuts in computer vision address problems with pairwise potentials [3], [4], [75], [76]. The exception is the model first presented in [15], [77] and extensively studied in [16]. Therein, higher-order costs are used for image segmentation on superpixel graphs with pairwise neighborhood connectivity. In [16], a branch-and-bound algorithm is implemented for small problem instances. We propose a higher-order lifted multicut model, which allows the definition of higher-order edge costs for lifted as well as for connectivity-defining edges. Lifted multicuts w.r.t. pairwise graphs have been proposed in [3], along with a local search algorithm that finds solutions in reasonable time. Different algorithms for this problem were proposed in [78] and [79]. Here, we generalize the solver from [3] to facilitate inference in higher-order problems.

HIGHER-ORDER LIFTED MULTICUT PROBLEM
A decomposition of a graph $G = (V, E)$ can be represented by assigning to each vertex an identifier of the component it belongs to, i.e., a vertex labeling. The drawback of such an encoding is that a permutation of component identifiers results in a different vertex labeling while encoding the same decomposition. This ambiguity creates problems during optimization that are hard to deal with, because the search space of feasible solutions can be factorially large. An alternative approach is to assign either 0 or 1 to each edge such that edges labeled 1 connect nodes only inside connected components (Fig. 1a).
Such a 0/1 edge labeling, complying with the constraints we define below, is called a multicut of the graph. Chopra and Rao [80] define a binary linear program, the minimum cost multicut problem, that optimizes over such 0/1 edge labelings, in other words, finds an optimal decomposition of a graph. Its main advantage is that no prior information about the number of clusters in the data is necessary; instead, it is deduced from the solution. This is exactly the setting of the applications we consider in this paper, as we do not know beforehand how many geometric models or moving objects the data contains.
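As a minimal illustration (our sketch, not part of the original formulation), the edge-based encoding can be derived from a node labeling as follows; two node labelings that differ only by a permutation of component identifiers map to the same edge labeling:

import itertools

def edge_labels(node_labels, edges):
    """Encode a decomposition by edge variables: y_e = 1 iff both
    endpoints of e carry the same component identifier (are joined)."""
    return {e: int(node_labels[e[0]] == node_labels[e[1]]) for e in edges}

edges = list(itertools.combinations(range(4), 2))
a = {0: 0, 1: 0, 2: 1, 3: 1}          # one decomposition ...
b = {0: 1, 1: 1, 2: 0, 3: 0}          # ... with permuted identifiers
assert edge_labels(a, edges) == edge_labels(b, edges)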
However, the multicut problem has two main limitations: 1) It can assign costs or rewards only to direct neighbors in the graph, which limits the expressiveness of the cost function to local neighborhoods. Keuper et al. [3] introduced lifted edges (Fig. 1b) into the 2nd-order multicut problem and showed that they greatly improve image and mesh segmentation results. An important property of lifted edges is that they only define costs between vertices, not connectivity, and thus preserve the original feasible set of solutions. See [81] for a detailed proof that the lifted multicut problem is not simply equivalent to a multicut problem with more edges. 2) It can only encode pairwise edge costs, which is not enough for some applications. For example, line fitting and motion segmentation need 3rd-order costs, while homography estimation requires 5th-order costs. Kim et al. [15] and Kappes et al. [16] proposed a formulation that allows specifying costs of arbitrary order (Fig. 1c). They used a cutting-plane algorithm to solve the emerging models, which turns out to be impractical for large real-world instances due to the LP solver bottleneck. Below, we combine the two above-mentioned findings in a joint formulation.
Definition 1. Let $G = (V, E)$ be a simple, connected graph, let $F \subseteq \binom{V}{2} \setminus E$ be a set of lifted edges, and let $\mathcal{V} \subseteq \{U \in 2^V : |U| \geq 2\}$ denote a set of connected subsets of nodes of $G$. For a given cost function $c : \mathcal{V} \to \mathbb{R}$, an instance of the higher-order minimum cost lifted multicut problem is the binary program

$$\min_{y \in \{0,1\}^{E \cup F}} \; \sum_{U \in \mathcal{V}} c_U \prod_{\{v,w\} \in \binom{U}{2} \cap (E \cup F)} y_{vw} \qquad (1)$$

with $y$ subject to the following linear constraints:

$$\forall C \in \text{cycles}(G),\ \forall e \in C:\quad 1 - y_e \leq \sum_{e' \in C \setminus \{e\}} (1 - y_{e'}) \qquad (2)$$

$$\forall f = \{v,w\} \in F,\ \forall T \in vw\text{-cuts}(G):\quad y_f \leq \sum_{e \in T} y_e \qquad (3)$$

$$\forall f = \{v,w\} \in F,\ \forall P \in vw\text{-paths}(G):\quad 1 - y_f \leq \sum_{e \in P} (1 - y_e) \qquad (4)$$

Cycle inequalities (2) ensure that the cut in graph $G$ does not have holes. Path inequalities (3) guarantee that, for every $f = \{v,w\} \in F$, $y_f$ can be assigned the value 1 only if there exists a path $P$ of joined edges in $G$ that connects vertices $v$ and $w$; otherwise, the solver has two options: either create such a path or set $y_f = 0$. Cut inequalities (4) guarantee that, for every $f = \{v,w\} \in F$, $y_f$ can be assigned the value 0 only if there exists a cut $T$ in $G$ that separates vertices $v$ and $w$; otherwise, the solver has to either create such a cut or set $y_f = 1$.
Note that in our formulation (1), the set of decompositions is defined over a pairwise connected graph $G$. The costs, however, are defined over connected subsets $U \in 2^V$ of nodes of arbitrary cardinality larger than 1. Normally, only subsets of a fixed cardinality $k$ are used, e.g., $\mathcal{V} = \binom{V}{k}$. However, one can define a cost function over several cardinalities $K \subseteq \mathbb{N} \setminus \{1\}$. It is easy to see that in the case $K = \{2\}$, we recover the lifted multicut formulation from [3]. Therefore, the formulation proposed here is strictly more general.
Keuper et al. [3] showed the connection of this optimization to finding the most likely multicut in the Bayesian sense: let $p(y_{vw} = 1 \mid x_{vw})$ be a conditional probability estimate for two nodes $v, w \in V$ to belong together, given some features $x_{vw}$. If we set the costs to $c_{vw} = \log\frac{1 - p(y_{vw} = 1 \mid x_{vw})}{p(y_{vw} = 1 \mid x_{vw})}$, then minimizing (1) is the same as performing MAP inference in the induced Bayesian network. The extension from 2nd-order sets to the higher-order case is straightforward.
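As a small sketch of this probabilistic interpretation, the logit rule and the evaluation of objective (1) for a given node partition can be written as follows (using the fact that, for feasible solutions, a hyper-edge contributes its cost exactly when all of its nodes share one component):

import math
from itertools import combinations

def cost_from_probability(p, eps=1e-9):
    """Logit rule c = log((1 - p) / p): p > 0.5 yields a negative
    (attractive) cost, p < 0.5 a positive (repulsive) one."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log((1.0 - p) / p)

def objective(costs, node_labels):
    """Evaluate objective (1) for a decomposition given as a node
    labeling: the product over y_vw equals 1 exactly when all nodes
    of the subset U lie in one component."""
    total = 0.0
    for U, c_U in costs.items():
        if all(node_labels[v] == node_labels[w] for v, w in combinations(U, 2)):
            total += c_U
    return total

costs = {(0, 1, 2): cost_from_probability(0.9),   # attractive triplet
         (2, 3): cost_from_probability(0.2)}      # repulsive pair
print(objective(costs, {0: 0, 1: 0, 2: 0, 3: 1}))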
[Fig. 1: (a) A node labeling encodes a decomposition ambiguously: switching the green and red labels produces a different encoding for the same decomposition. Dashed lines, in turn, constitute a multicut of the graph and uniquely define its decomposition. (b) An example of a lifted graph decomposition and its encoding; blue lines denote lifted edges, which connect vertices that are not direct neighbors in the graph. (c) An example of 3rd-order costs that consider three nodes at a time (light blue triangles) for a better join/cut decision. If a higher-order cost does not correspond to a clique in the graph, we add lifted edges.]

This allows interpreting solutions of (1) in terms of local probabilities: if $p(\prod_{\{v,w\} \in \binom{U}{2} \cap (E \cup F)} y_{vw} = 1 \mid x_U)$ is greater than 0.5, then the corresponding cost $c_U$ is less than 0. This means that all nodes in $U$ are likely to belong together; we call such terms attractive. Conversely, if the probability is less than 0.5, then the corresponding cost $c_U$ is greater than 0 and acts as a penalty; we call such terms repulsive.

LOCAL SEARCH ALGORITHM
The multicut problem is known to be NP-hard [14]. Various cutting-plane and branch-and-bound algorithms [2], [15], [75], [77], [82], [83], [84] do not scale to instances of the size we consider, even for 2nd-order problems, cf. the comparative study [17]. Indeed, we implemented a branch-and-bound method in Gurobi [85] for the 3rd-order case by linearizing objective (1), but could not obtain a solution within 12 hours even for the smallest problem we consider. Toward scalable algorithms, Keuper et al. [3] generalized Kernighan and Lin's primal local search algorithm for graph partitioning to the lifted multicut problem; it shows the best performance for this problem in [17]. We generalize their algorithm further to lifted multicut problems with costs of arbitrary order.
Overview. The algorithm takes as input an instance of the higher-order lifted multicut problem and an initial decomposition of G, and outputs a decomposition of G whose higher-order lifted multicut has an objective value lower than or equal to that of the initial decomposition. Like the original KLj algorithm [3], it maintains a feasible decomposition of G throughout its execution. The pseudocode is given in Algorithm 1. New components are introduced by updating the boundary of a component against an empty set ∅, as given in lines 5-6, exactly as in the 2nd-order version [3]. The function UPDATE_BOUNDARY (Algorithm 2) receives two components A and B and updates only the cut between them. It greedily constructs a sequence M of elementary transformations of the components A and B such that every consecutive move operation increases the cumulative gain S maximally (or decreases it minimally). To this end, the operation COMPUTE_GAINS computes, at the beginning of each execution of UPDATE_BOUNDARY and for every element v ∈ A ∪ B, the difference in the objective function (the gain) when v is moved from A to B or from B to A. These differences are updated as described in Algorithm 2, ll. 9-21. To escape local optima, we determine i* = argmax_i S_i so as to maximize the total gain of the sequence of operations. If the objective value can be decreased either by executing the first i* elementary transformations or by joining the components A and B, the better of these two operations is carried out. While components are defined with respect to the graph G = (V, E), differences in objective value are computed with respect to the lifted graph G' = (V, E ∪ F). In [3] as well as in our algorithm, all transformations of feasible solutions are local, resulting in changes of the objective value that can be computed in linear time (in the size of the graph). The combination of the locality of individual transformations and the non-locality of sequences of transformations has proven effective in diverse applications [17]. As for KLj [3], the number of outer iterations of Algorithm 1 is not bounded by a polynomial, and we cannot give any guarantee of convergence. In practice, however, the algorithm converged in less than 50 iterations for all experiments described in Section 6.
Implementation Details. For efficiency, we pre-compute all gains for vertices in A ∪ B (line 1) and keep track of the vertices that currently lie on the boundary between A and B (lines 2 and 21); this dramatically improves the runtime for sparse graphs. We iteratively pick the vertex v* with the largest gain (line 7), which can also be negative. Then, we update the gains of all other vertices in A and B that share a subset U with v* (lines 8-18). Note that in the case of 2nd-order costs, the updates in lines 10-18 specialize to the corresponding updates in [3]. In the end, we determine the prefix of i* moves that produces the greatest decrease (note that i* can also be 0), and either merge A and B (lines 23-24) or undo all moves after i* (line 25).
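The following simplified Python sketch illustrates one pass of this boundary update. Unlike Algorithm 2, it recomputes gains from scratch via a caller-supplied gain oracle instead of updating them incrementally, and it omits the final test of joining A and B:

def update_boundary(A, B, gain):
    """One simplified pass of UPDATE_BOUNDARY: greedily move the vertex
    with the largest gain across the A/B boundary, track the cumulative
    gain S_i, and finally keep only the best prefix of moves (i*)."""
    A, B = set(A), set(B)
    moves = []                       # (vertex, moved-from-A?) in order
    S, best_S, best_i = 0.0, 0.0, 0
    frozen = set()                   # every vertex moves at most once
    while (A | B) - frozen:
        # pick the move of maximal gain; it may be negative, which is
        # what lets a sequence of moves escape a local optimum
        g, v = max((gain(u, A, B) if u in A else gain(u, B, A), u)
                   for u in (A | B) - frozen)
        from_A = v in A
        (A if from_A else B).remove(v)
        (B if from_A else A).add(v)
        frozen.add(v)
        moves.append((v, from_A))
        S += g
        if S > best_S:
            best_S, best_i = S, len(moves)       # i* = argmax_i S_i
    for v, from_A in reversed(moves[best_i:]):   # undo moves after i*
        (B if from_A else A).remove(v)
        (A if from_A else B).add(v)
    return A, B, best_S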

Geometric Model Fitting
We cast geometric model fitting as a point grouping problem, or in other words, as a graph decomposition problem defined over local observations. Specifically, we consider minimal residual errors of n-tuples of points as local features indicating the likelihood that these points were sampled from the same model. Given this local evidence, we phrase the geometric model fitting problem as a higher-order minimum cost multicut problem.
We compute residual errors relative to a model fit using total least squares (TLS). Unlike ordinary least squares, which minimizes only axis-aligned residuals, total least squares minimizes the distance between a point and its orthogonal projection onto the model. We solve TLS using singular value decomposition.
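A minimal sketch of this TLS fit: the line direction is the top right-singular vector of the centered data matrix, and the orthogonal residuals follow from the remaining singular direction:

import numpy as np

def tls_line(points):
    """Total least squares line fit: the direction minimizing the sum of
    squared orthogonal residuals is the top right-singular vector of the
    centered data matrix; the remaining singular vector is the normal."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0], vt[-1]     # point on line, direction, normal

def orthogonal_residuals(points, centroid, normal):
    """Signed orthogonal distance of each point to the fitted line."""
    return (np.asarray(points, dtype=float) - centroid) @ normal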
For the minimal set plus one data point, where the minimal set is the set of points required to estimate the model parameters, we compute the probability of these points belonging together, i.e., being sampled from the same model. This probability is inversely proportional to the points' residuals w.r.t. the estimated model. For line models, the minimal set has cardinality two. Therefore, we need edges of at least order three to assign such probabilistically motivated costs.
Line Fitting. For line fitting, we create a fully connected graph over all vertices, i.e., with edge set $\binom{V}{2}$, which defines the set of feasible decompositions. The cost function $c$ is defined over all triplets $\mathcal{V} = \binom{V}{3}$ as follows: we fit a line to each triplet $\{u, v, w\} \in \mathcal{V}$ using TLS and compute the residuals $r$; an example is given in Fig. 2a. We assume that the points have been sampled from a Gaussian centered on the ground-truth line with standard deviation $\sigma$. This gives $p_v(r_v) = \mathrm{erfc}(r_v; 0, \sigma^2)$, where $\mathrm{erfc}(\cdot)$ is the complementary error function. Further assuming that all points are i.i.d., we get $p_{uvw}(y_{uv} y_{uw} y_{vw} = 1 \mid r_u, r_v, r_w) = p_u(r_u)\, p_v(r_v)\, p_w(r_w)$. Finally, the cost is $c_{uvw} = \log\frac{1 - p_u(r_u)\, p_v(r_v)\, p_w(r_w)}{p_u(r_u)\, p_v(r_v)\, p_w(r_w)}$. This corresponds to minimizing the sum of residuals w.r.t. a noise-free line.
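Assembling the pieces, a 3rd-order cost can be sketched as follows, building on the tls_line and orthogonal_residuals helpers above, and reading erfc(r_v; 0, σ²) as the two-sided tail probability of a zero-mean Gaussian, i.e., erfc(|r| / (σ√2)):

import math
from scipy.special import erfc
# tls_line and orthogonal_residuals as in the sketch above

def triplet_cost(pu, pv, pw, sigma):
    """Cost c_uvw of one 3rd-order edge: TLS line fit to the triplet,
    per-point probabilities p(r) = erfc(|r| / (sigma * sqrt(2))), i.e.,
    the probability that an N(0, sigma^2) sample lies at least |r| from
    the line, multiplied under the i.i.d. assumption, then the logit rule."""
    centroid, _, normal = tls_line([pu, pv, pw])
    p = 1.0
    for r in orthogonal_residuals([pu, pv, pw], centroid, normal):
        p *= erfc(abs(r) / (sigma * math.sqrt(2.0)))
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return math.log((1.0 - p) / p)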
Homography and Motion Estimation. For homography and motion estimation, we proceed as described above. For this task, we have to subsample the cost terms, as $\mathcal{V} = \binom{V}{5}$ is prohibitive for $|V| > 150$. For every vertex, we thus sample all $\binom{20}{5}$ subsets of its 20 nearest neighbors (in image pixel space) to capture the local scope, and $2 \cdot 10^5$ random subsets to capture the global scope. We fit an elementary homography to each of these sets of 5 point correspondences and assume the points' probability to be inversely proportional to the distance between the ground-truth correspondence and its projection via the fitted homography, cf. Fig. 2b. This again corresponds to minimizing the sum of re-projection errors.
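The sampling of 5th-order cost terms can be sketched as follows; the neighborhood size k = 20 and the number of random subsets are taken from the text, while the use of a k-d tree is an implementation choice of this sketch:

import itertools
import random

import numpy as np
from scipy.spatial import cKDTree

def sample_hyperedges(points, k=20, order=5, n_random=200_000, seed=0):
    """Subsample 5th-order cost terms: all (k choose order) subsets of
    each vertex's k nearest neighbors (local scope) plus uniformly random
    order-subsets (global scope). Note that (20 choose 5) = 15504 subsets
    per vertex, so this is only feasible for small point sets."""
    rng = random.Random(seed)
    pts = np.asarray(points, dtype=float)
    tree = cKDTree(pts)
    edges = set()
    for i in range(len(pts)):
        _, nn = tree.query(pts[i], k=k + 1)             # k+1: the query
        neighbors = [int(j) for j in nn if j != i][:k]  # point is returned too
        for subset in itertools.combinations(neighbors, order):
            edges.add(tuple(sorted(subset)))
    for _ in range(n_random):                           # global scope
        edges.add(tuple(sorted(rng.sample(range(len(pts)), order))))
    return edges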

Point Trajectories
Point trajectories are spatio-temporal curves that describe the trajectory of a single object point in the image plane. They build the basis for many motion segmentation methods such as [4], [25], [31], [86], [87]. Here, we use the method from [25] to generate dense long-term point trajectories from precomputed optical flow [88], which allows a direct comparison to prior work. For a video of length N, [25] yields n point trajectories p_i of maximum length N, where n depends on the desired sampling rate. Due to occlusions and errors in the optical flow estimation, most trajectories are significantly shorter than N, and some trajectories start after frame 1 to ensure even point sampling throughout the sequence.
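In essence, a trajectory is formed by following the optical flow from frame to frame; the following simplified sketch omits the occlusion and consistency checks of [25] and stops only at the image border:

import numpy as np

def track_point(flows, x0, y0, t0=0):
    """Follow a point through a list of per-frame forward flow fields,
    each of shape (H, W, 2), to form one trajectory. This simplified
    tracker terminates only when the point leaves the image."""
    h, w, _ = flows[0].shape
    x, y = float(x0), float(y0)
    trajectory = [(t0, x, y)]
    for t in range(t0, len(flows)):
        ix, iy = int(round(x)), int(round(y))
        if not (0 <= ix < w and 0 <= iy < h):
            break                                 # point left the image
        dx, dy = flows[t][iy, ix]                 # nearest-neighbor lookup
        x, y = x + float(dx), y + float(dy)
        trajectory.append((t + 1, x, y))
    return trajectory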

Higher-Order Motion Models
For practical reasons, we restrict ourselves to edge potentials of order two and three, although these are not sufficient to accurately describe object motion in a 3D environment recorded with a possibly moving camera. This restriction still allows measuring the difference of point motions according to Euclidean motion models, i.e., the group of transformations describing translation, rotation and scaling in the 2D plane. This is a subset of the group of similarity transformations in the 2D plane, with reflections excluded.
We further argue that, in any case, the simplest model that can explain the motion of a set of points with a single transformation should be used. If two points move according to the same translational motion model, we can assume that they belong to the same object without looking at further points around them. Only if their motion differs under a purely translational model does looking at more complex motion models add information. This results in a motion-adaptive graph construction strategy.
Motion-Adaptive Graph Construction. We propose to construct the higher-order graph G from the pairwise costs computed from motion differences; the procedure is described in Algorithm 3. For any pair of trajectories, we compute their cost of belonging to the same translational motion model. Only if this cost is positive, i.e., repulsive, do we look at all further points and compute, for every three-tuple, the cost of belonging to the same motion model for translation, rotation and scaling. The respective third-order edges are inserted along with their costs.

[Fig. 2: (a) Line fitting: we fit a line using TLS to a set of points and assume that the latter are independently drawn from a 1D Gaussian centered on the line and orthogonal to it. (b) Homography estimation: for a pair of images with annotated correspondences (yellow line), we directly model the distance r between the corresponding and the projected point. In the example, a point from the second image corresponds to a red dot, but its projection via the estimated homography is slightly off (blue dot). The uncertainty model corresponds to a 2D isotropic Gaussian centered at the red point.]
This strategy allows integrating second- and third-order potentials without losing model capacity. Further, compared to generating the full graph with higher-order potentials, it yields a significant space reduction in practice.

Lifted Graph Construction. To construct the higher-order lifted graph G' = (V, E ∪ F), we compute for every trajectory the set 𝒩 of its 12 spatially nearest neighbors. The edge set E of edges between direct neighbors in G' is computed according to Algorithm 3. It contains exactly all pairwise edges e_ij for which at least one of the following three conditions holds: (1) p_i ∈ 𝒩(p_j), (2) p_j ∈ 𝒩(p_i), (3) the maximum spatial distance between p_i and p_j is below 40 pixels.
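A sketch of this split into connectivity-defining and lifted edges, under the simplifying assumption that each trajectory is represented by a single 2D position (e.g., its mean), and using the 100-pixel cutoff for lifted edges described in the implementation details below:

import numpy as np
from scipy.spatial import cKDTree

def split_edges(positions, k=12, local_radius=40.0, lifted_radius=100.0):
    """Split candidate pairwise edges into connectivity-defining edges E
    (one-sided k-nearest neighbors or closer than local_radius) and
    lifted edges F (farther apart, but within lifted_radius)."""
    pos = np.asarray(positions, dtype=float)
    tree = cKDTree(pos)
    knn = {i: set(tree.query(pos[i], k=k + 1)[1][1:])  # drop the point itself
           for i in range(len(pos))}
    E, F = set(), set()
    for i, j in tree.query_pairs(lifted_radius):
        near = np.linalg.norm(pos[i] - pos[j]) <= local_radius
        if j in knn[i] or i in knn[j] or near:
            E.add((i, j))
        else:
            F.add((i, j))
    return E, F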

Algorithm 3. Motion Adaptive Graph Construction
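In outline, the construction can be sketched as follows, with pairwise_cost and triplet_cost as cost oracles implementing the second- and third-order costs defined below (the actual implementation additionally subsamples triplets, as described in the implementation details):

from itertools import combinations

def motion_adaptive_graph(trajectories, pairwise_cost, triplet_cost):
    """Sketch of the motion-adaptive construction: pairwise edges carry
    translational costs; only where the pairwise cost is repulsive
    (positive) are third-order edges with richer motion costs added."""
    E2, E3 = {}, {}
    n = len(trajectories)
    for i, j in combinations(range(n), 2):
        c = pairwise_cost(trajectories[i], trajectories[j])
        E2[(i, j)] = c
        if c > 0:                     # translation alone cannot join i and j,
            for k in range(n):        # so consult the richer motion model
                if k not in (i, j):
                    E3[tuple(sorted((i, j, k)))] = triplet_cost(
                        trajectories[i], trajectories[j], trajectories[k])
    return E2, E3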
Second Order Costs. Second-order costs are computed from pairwise differences of point trajectories. We compute such differences only for trajectories that have at least two frames in common. Since it has proven successful in previous work [4], we compute these differences from motion, color and spatial distance cues. Following [31], we define the pairwise motion difference of two trajectories at time $t$ as

$$d_t(p_i, p_j) = \frac{\lVert \partial_t p_i - \partial_t p_j \rVert}{\sigma_t}. \qquad (5)$$

Here, $\partial_t p_i$ and $\partial_t p_j$ are the partial derivatives of $p_i$ and $p_j$ with respect to the time dimension, and $\sigma_t$ is the variation of the optical flow as defined in [31]. The motion distance of two trajectories is defined as the maximum over time,

$$d_{\mathrm{motion}}(p_i, p_j) = \max_t d_t(p_i, p_j). \qquad (6)$$

As proposed in [4], color and spatial distances $d_{\mathrm{color}}$ and $d_{\mathrm{spatial}}$ are computed as average distances over the common lifetime of two trajectories. These three cues are combined non-linearly to compute the costs, with weights and intercept values $\theta$ as proposed in [4].

Third Order Costs. We compute third-order motion differences as proposed in [27]. For any two trajectories $p_i$ and $p_j$, we estimate the Euclidean motion model $T_{ij}(t)$, consisting of a rotation $R_\alpha$, a translation $v := (v_1, v_2)^\top$ and a scaling $s$, with rotation angle

$$\alpha = \arccos \frac{(p_i(t_0) - p_j(t_0))^\top (p_i(t) - p_j(t))}{\lVert p_i(t_0) - p_j(t_0)\rVert \cdot \lVert p_i(t) - p_j(t)\rVert}, \qquad (7)$$

where $t$ denotes the first point in time at which both trajectories co-exist and $t_0$ the last. The distance to any third trajectory $p_k$ existing from $t$ to $t_0$ can then be measured by

$$d^t_{ij}(p_k) = \lVert T_{ij}(t)\, p_k(t) - p_k(t_0) \rVert. \qquad (8)$$

For numerical reasons, $d^t_{ij}(p_k)$ is normalized by a factor

$$g^t_{ij} = \frac{1}{\sigma_t} \left( \frac{\lVert p_i(t) - p_j(t)\rVert}{\lVert p_i(t) - p_k(t)\rVert} + \frac{\lVert p_i(t) - p_j(t)\rVert}{\lVert p_j(t) - p_k(t)\rVert} \right)^{\frac{1}{4}}, \qquad (9)$$

with $\sigma_t$ being the optical flow variation as in (5).

To render distances symmetric, [27] propose to consider the maximum $d^t_{\max}(i,j,k) = \max\big(g^t_{ij} d^t_{ij}(p_k),\ g^t_{ik} d^t_{ik}(p_j),\ g^t_{jk} d^t_{jk}(p_i)\big)$, which yields an over-estimation of the true distance. While this is unproblematic in a spectral clustering scenario, where distances are used to define positive point affinities, it can lead to problems in the multicut approach: over-estimated distances lead to under-estimated join probabilities and thus eventually to switching the sign of the cost function towards repulsive terms. To avoid this effect, we compute both $d^t_{\max}(i,j,k)$ and, analogously, $d^t_{\min}(i,j,k)$. For both, we take the maximum motion distance over the common lifetime of $p_i$, $p_j$ and $p_k$, i.e., $d_{\max}(i,j,k) = \max_t d^t_{\max}(i,j,k)$ and $d_{\min}(i,j,k) = \max_t d^t_{\min}(i,j,k)$. We evaluate the costs for both distances as $c(d) = \theta_0 + \theta_1 d$ and compute the final edge costs as

$$c_{ijk} = \begin{cases} c(d_{\min}(i,j,k)) & \text{if } c(d_{\max}(i,j,k)) > 0 \\ c(d_{\max}(i,j,k)) & \text{if } c(d_{\min}(i,j,k)) < 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

Thus, we make sure not to assign costs to edges whose underlying motion is controversial. We set $\theta_0 = 1$ and $\theta_1 = -0.08$ manually.

Implementation Details. In practice, we insert pairwise edges $e_{ij}$ into G and G' only if the spatial distance between $p_i$ and $p_j$ is below 100 pixels, even for lifted edges in F. This is in analogy to [4] and due to the fact that for nearby points, the approximation of the true motion by a simplified model is usually better than for points at a large distance. Also, since the number of pairwise edges increases quadratically with the maximal spatial distance, this heuristic decreases the computational load significantly. For the same reason, we introduce an edge sampling strategy for third-order edges: for every triplet of points, we compute the maximum pairwise distance $d$. From all triplets with $20 < d < 300$, we randomly sample 100, while we insert all edges $e_{ijk}$ with $d \leq 20$. This also prevents a too strong imbalance of long-range edges over short-range edges.
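The Euclidean model estimation underlying the third-order distance can be sketched as follows; trajectories are assumed to be indexable by frame, returning 2D positions, and the rotation angle is computed with a signed arctan2 variant of the arccos formula in (7):

import numpy as np

def euclidean_model(pi_t, pj_t, pi_t0, pj_t0):
    """Estimate scaling s, rotation R and translation v such that
    s * R @ p + v maps p_i(t) to p_i(t0) and p_j(t) to p_j(t0)."""
    d_t, d_t0 = pj_t - pi_t, pj_t0 - pi_t0
    s = np.linalg.norm(d_t0) / np.linalg.norm(d_t)
    alpha = np.arctan2(d_t0[1], d_t0[0]) - np.arctan2(d_t[1], d_t[0])
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    v = pi_t0 - s * R @ pi_t
    return s, R, v

def third_order_distance(p_i, p_j, p_k, t, t0):
    """Residual of a third trajectory p_k under the model fitted to
    p_i and p_j: d^t_ij(p_k) = ||T_ij(t) p_k(t) - p_k(t0)|| as in (8)."""
    s, R, v = euclidean_model(p_i[t], p_j[t], p_i[t0], p_j[t0])
    return float(np.linalg.norm(s * R @ p_k[t] + v - p_k[t0]))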

Geometric Model Fitting
We start with line fitting experiments on synthetic data from Toldo and Fusiello [21]. This dataset consists of 3 instances with lines arranged in different shapes (namely, stairs with 4 lines and stars with 5 and 11 lines), perturbed by Gaussian noise. Each instance contains around 50% uniformly sampled gross outliers, cf. Fig. 3, top row.
As a performance measure, we use the widely accepted [20], [21], [33], [35] misclassification error (ME), i.e., the ratio of misclassified points to the total number of points. The classification is performed by matching the predicted and the ground-truth clusters such that the number of misclassified points is minimized, using the Hungarian algorithm [90] for minimum weight bipartite matching. A point is then considered correctly classified if its cluster label corresponds to the matched ground-truth one.
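A sketch of this evaluation using the Hungarian algorithm as implemented in SciPy (treating the outlier class like any other cluster):

import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_error(pred, gt):
    """ME: match predicted to ground-truth clusters so that the number of
    misclassified points is minimal (Hungarian algorithm on the negated
    contingency table), then report the fraction of mismatches."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    p_ids, g_ids = np.unique(pred), np.unique(gt)
    # contingency[a, b] = #points in predicted cluster a and gt cluster b
    contingency = np.array([[np.sum((pred == p) & (gt == g)) for g in g_ids]
                            for p in p_ids])
    rows, cols = linear_sum_assignment(-contingency)   # maximize agreement
    return 1.0 - contingency[rows, cols].sum() / len(pred)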
Xiao et al. [23] note that it is impossible to estimate the number of models present in the data without introducing additional, stronger assumptions. For example, if we assume that points lying on a line constitute a model, then many more lines can be discovered in the presence of many outliers, as is the case for the data from Toldo and Fusiello [21]. Indeed, the term "outliers" is rather subjective here, as any two points perfectly define a line. A common solution [20], [21], [33] is to simply take the top-K largest clusters and assign the rest to the "outliers" class, where K is the number of clusters in the ground truth. A more principled approach was proposed by Tepper and Sapiro [35]; it is based on a statistical t-test that keeps only statistically significant models. In this paper, we adopt the usual strategy and match the predicted models to the ground-truth ones so as to minimize the misclassification error.
The results are reported in Table 1. All competing methods sample a small number of hypotheses before optimization, while we sample all possible triplets. For a fair comparison, we sample all possible hypotheses for the competing methods as well; however, this dramatically increases their runtimes. In our experiments, we noticed that the statistical filtering of Tepper and Sapiro [35] very efficiently removes irrelevant hypotheses; thus, we apply it to the sampled hypotheses before optimization. As can be seen from Table 1, our method is the fastest while showing competitive accuracy, especially on the hardest problem, Star11. Visualizations of our results are presented in Fig. 3.
We analyze the behavior of our method under uniformly random sub-sampling of the cost terms in Fig. 4a. It can be seen that our method can get away with as little as 5% of all possible triplets. As noted earlier, our method discovers more lines than there are in the ground truth, because any 2 points define a line in the Euclidean plane. This corresponds to an over-segmentation in the obtained graph decomposition. Yet, if we plot the costs of the resulting clusters, computed according to objective function (1), as in Fig. 4b, it is rather easy to spot the models (lines) of interest. We believe this may allow the development of a method for automatic model selection.
Next, we turn to the real-world challenge posed by the Adelaide Robust Model Fitting (AdelaideRMF) dataset [91]. It consists of 38 image pairs with a set of interest points and human-annotated correspondences for each pair. The dataset is split into two equal parts and offers two challenges: 1) estimating multiple homographies (transformations that relate points belonging to the same planar surface in 3D space) and 2) estimating multiple motion models.

Motion Segmentation
[Table 1: the total runtime is split into sampling and solving time.]

First, we apply the proposed higher-order lifted multicut model to the motion segmentation benchmark FBMS-59 [31], an extended version of the BMS-26 benchmark of Brox and Malik [25]. It contains 59 sequences of varying length (from 19 to 800 frames) and diverse content and motion: severe camera shaking, non-rigid multi-object motion, scaling, out-of-plane rotation of objects, zooming of the camera, etc.
To allow for training, the dataset has been split into two subsets of 29 and 30 sequences for training and testing, respectively. While we agree that training all model parameters is highly desirable, we did not do so. This is because (1) none of the state-of-the-art methods [4], [27], [31] is training-based and (2) the training set, with 29 sparsely annotated sequences, is rather small. To avoid confusion, we denote the training split by Set A and the test split by Set B.
FBMS-59 [31] provides manual annotations of all moving objects in the videos for every 20th frame, as well as ground-truth definition files that down-weight annotated segments in some scenes. Thus, objects in some sequences with severe camera motion, for example the mistakenly segmented wall in Fig. 9, attain a lower weight in the evaluation. All objects that move in at least one frame are segmented in all annotated frames. A second set of ground-truth annotations at a similar level of sparsity was later provided in [93] to evaluate a slightly different motion segmentation paradigm. In [73], [93], it was argued that all freely moving objects in 3D space, but nothing more, should be segmented per frame. Specifically, this means that objects moving only in a few frames are to be segmented only in those frames. Scene geometry and camera motion yielding apparent motion in the image plane, such as the wall in the example of Fig. 9, are not considered. Our work addresses the task originally annotated by [31]: it aims to segment all objects that move in at least one frame of a sequence, and it does not provide full 3D motion segmentations, as we limit ourselves to third-order motion terms. Yet, we find it interesting to consider both sets of annotations during evaluation to allow for a comparison to the respective competing methods. Note that the evaluation metrics also differ slightly. Both [31] and [93] measure precision, recall and f-measure; [31] evaluate precision and recall over all frames and compute the f-measure in the end, while [93] evaluate the f-measure per frame and report the mean value. In addition, [31] report the number of objects O that are segmented with an f-measure above 0.75, whereas [73] propose the ΔObj metric, which measures the average absolute difference between the number of ground-truth objects in each frame and the number of segmented objects in that frame. While O should be large, ΔObj should be small.

[Table 2: our model yields competitive results at low standard deviation. † These methods require the exact number of models before optimization.]

[Fig. 5: Some visualizations of our results on the Adelaide Robust Model Fitting dataset [91]. Only one image of each pair is shown. Colored boxes denote points assigned to the same geometric model; yellow dots denote points assigned to noise/background.]
For our evaluation, we employ the annotations provided in [31] when considering the metric proposed therein, and we use the annotations from [93] when evaluating with the metric proposed in [73], to allow for a direct comparison to their work. We start with an evaluation using the annotations and metrics from [31].
Evaluation. To assess the capacity of our model components, we first evaluate a purely higher-order, non-lifted version of our model. In this model, all pairwise costs are removed and all edges are connectivity-defining. We compare this simple model to [27], [31] and the purely motion-based version of [4]. While [31] and [4] only consider translational motion, the affinities in [27] are defined most similarly to our higher-order costs. Like the proposed approach, [4] formulate a multicut problem, while [27], [31] follow a spectral clustering approach. The results are given in Table 3 in terms of precision, recall, F-measure and the number of extracted objects. Precision and recall are not directly comparable, but they can serve as cues for under- or over-segmentation; the F-measure is a weighted harmonic mean of the two. From Table 3, we observe that our higher-order multicut model outperforms the higher-order spectral clustering method from [27] by about 10% on Set A and 3.5% on Set B. While there is a clear improvement on both sets, the imbalance is remarkable. A similarly remarkable imbalance can be observed when comparing the two spectral clustering methods [31] and [27]: the higher-order model [27] yields a lower F-measure on Set A than [31], yet outperforms [31] on Set B by about 3%. This indicates that the motion statistics of the two splits differ significantly. Comparing our higher-order model to the pairwise minimum cost multicut model from [4], we observe an improvement on Set A; on Set B, both models perform almost equally.
In Table 4, we evaluate our higher-order multicut model with motion-adaptive order, denoted AOMC (compare Algorithm 3). This model has access to pairwise cues similar to the motion- and color-based version of [4], denoted MCe, as well as to the higher-order motion cues from Equation (10).
As a sanity check for the motion-adaptive graph construction, we also generate graphs that simply contain all pairwise costs c_ij as well as all third-order edges with costs c_ijk, without any adaptation with respect to the costs. We denote this additive model by HOPMC (higher-order + pairwise multicut). On Set A, all three approaches produce similar results, whereas the proposed AOMC shows particularly good performance on the test set, with about 2% improvement over MCe [4] in F-measure.
Lifted versions of both types of problems (HOPMC and AOMC) yield a small further improvement on Set B. However, on Set A, the segmentation quality of Lifted HOPMC is even below that of MCe by about 1%. In contrast, the proposed Lifted AOMC consistently outperforms all competing methods and baselines.
In Table 5, we evaluate the impact of the quality of the point trajectories when they are computed from different optical flow methods [88], [94], [95]. The proposed Lifted AOMC consistently outperforms the pairwise MCe [4] on comparable optical flow. While the most recent FlowNet [95] performs best, the overall differences are small.
Several examples of pixel-level segmentations computed from our sparse segmentation using [58] are given in Fig. 6. The densified segmentations look reasonable. In the bear example, the articulated leg motion still causes some over-segmentation, and one of the horses in the horses05 sequence is missed. However, even small objects such as the tray in the marple12 sequence or the phone in the marple13 sequence are correctly segmented.

Such densified segmentations, computed from sparse results using FlowNet [95], can be evaluated with the metrics from [73] and their matching annotations [93]. Table 6 shows our results in the setting of [73], i.e., a frame-wise evaluation on all freely moving 3D objects. While the f-measure of our model is similar to the one reached by [4], the ΔObj metric is significantly improved, i.e., lower, and almost on par with the results from [73], which are dedicated to this setting. Evaluating binarized segmentations (Table 6) allows for a comparison to learning-based encoder-decoder models such as [59]. Without learning any prior on object saliency, our f-measure for binary segmentation on FBMS-59 is 76.16% and thus slightly better than the model learned from plain motion cues in [59] with 74.79%, but inferior to [60], who learn an additional appearance stream and reach 86.96% [73].

In the following, we discuss several example segmentations in detail. Fig. 7 shows the trajectory segmentation quality under scaling. In the horses05 sequence, scaling is caused by the motion of the white horse towards the camera. This causes over-segmentation with the competing method MCe [4], which cannot handle higher-order motion models; with the proposed Lifted AOMC, the segmentation is improved. Figs. 8 and 9 both show examples where the same label is assigned to distinct objects that move similarly. In the cars2 sequence in Fig. 8, this is due to similar real-world object motion, whereas in the marple10 sequence, the effect is due to camera motion and the scene geometry. In both cases, the formulation of the Lifted AOMC problem allows to tell the distinct objects apart. However, in the marple10 sequence (Fig. 9), we observe a spurious segment in the background, which is probably caused by imprecise flow.
In Fig. 10, we show an example from the goats01 sequence. Here, the head and body of the goat in front are segmented into distinct components by the pairwise method [4] because of the pronounced articulated motion. Although our third-order model cannot explicitly handle articulation, the over-segmentation is fixed in this case. Fig. 11 shows a failure case of the proposed method: due to the dominant camera motion in a scene with complex geometry, the Euclidean motion model fits particularly badly. Our model thus segments the scene into its depth layers, leading to strong over-segmentation.
Next, we evaluate our approach on two additional datasets widely considered for motion segmentation, the DAVIS 16 dataset [100] and the VSB100 dataset [97], [98]. Both were originally designed for different purposes. DAVIS 16 is a dataset for binary video object segmentation which has been used to learn appearance and motion patterns of salient objects in videos, e.g., in [60]. While its sequences may contain several moving objects, the task is to track the segmentation of the dominant object throughout the sequence. Complementary to this, the VSB100 dataset was originally proposed as a video segmentation dataset where the task is to mimic human boundary-level annotations, i.e., the segments do not necessarily have a notion of objectness. This general-purpose multi-label video segmentation dataset has a motion subtask which has been used to evaluate motion segmentation approaches before, e.g., in [4].

Table 7 shows our results on the DAVIS 16 dataset in terms of the Jaccard index (J) and the f-measure (F), which measures the boundary fidelity of the segmentation. We compare to the results of [4], which is, like ours, a method for multi-label motion segmentation. The proposed approach improves significantly over those results. Yet, note that dedicated video object segmentation approaches such as recently proposed in [68], [69] yield higher numbers on the DAVIS benchmark, with mean Jaccard indices of up to 80% on the validation set.

[Fig. 6: Samples of our Lifted AOMC segmentations densified by [58]. Even for articulated motion, our segmentations show little over-segmentation.]
[Table 6: results of [4], [96], [59], [60] and [73] are taken from [73].]
[Fig. 7: The scaling motion of the white horse moving towards the camera causes over-segmentation with a simple motion model [4]. With the proposed Lifted AOMC, this can be avoided.]
[Fig. 8: The two cars in front move in the same direction, leading to an assignment to the same cluster with the non-lifted multicut approach [4]. The Lifted AOMC can assign the different cars to distinct segments.]
[Fig. 9: Due to camera motion, the person and the wall are assigned to the same cluster with the non-lifted multicut approach [4]. The Lifted AOMC allows for correct segmentation.]
The evaluation of our model on the motion subtask of VSB100 is given in Fig. 12 in terms of boundary precision and recall (BPR) and the region metric volume precision and recall (VPR). It can be seen that the proposed higher-order model with adaptive edge order outperforms the previous models on this task. As expected, the differences in BPR are rather small, while they are more significant in VPR. The dashed lines indicate results of our model using FlowNet [95] to compute the optical flow, while the solid lines are based on [88] to ensure a fair comparison to [4] and [31]. The improved optical flow has a slightly larger impact on the BPR values, indicating that the issues addressed by more robust optical flow estimation and those addressed by our more complex motion model are complementary.
Scalability Analysis on FBMS-59. Last, we evaluate the proposed heuristic for higher-order minimum cost lifted multicut problems (compare Algorithm 1) in terms of computation time. Fig. 13 plots the computation times of our full pipeline on FBMS-59 w.r.t. the number of point trajectories. The runtime distribution indicates linear runtime behavior and shows that heuristic solutions can be generated within a few minutes for most instances. Yet, the number of large problem instances is too small to make a definitive claim.

CONCLUSION
We presented a multicut-based approach that can be applied to computer vision tasks such as motion segmentation and geometric model fitting. To this end, we proposed a pseudo-boolean formulation that allows defining costs on subsets of vertices of arbitrary cardinality and includes lifted edges. In motion segmentation, costs beyond simple pairwise terms allow modeling object motion more precisely (Euclidean instead of in-plane translational motion). Line fitting can only be formulated with 3rd-order costs, while homography estimation requires 5th-order costs. Since the emerging higher-order multicut problem is NP-hard to solve exactly, we proposed an efficient local search algorithm for inference. Applied to real and toy problems, our approach yields either competitive or state-of-the-art results, is highly flexible, and is easy to apply.

[Fig. 10: The articulated motion causes over-segmentation in [4]. The Lifted AOMC performs better.]
[Fig. 11: Failure case. The dominant camera motion causes strong over-segmentation with the proposed method. Here, our third-order model cannot model the motion appropriately.]
[Table 7: The more complex motion model in Lifted AOMC is beneficial on this dataset of binary object segmentation. Results marked with * are taken from [59].]
[Fig. 12: Evaluation on the motion subtask of the VSB100 dataset [97], [98]. We compare our results to SC [31], the video segmentation approach VS [99], the superpixel tracking baseline from [97], and the multicut model with pairwise terms MCe [4]. The proposed lifted adaptive-order model (LAOMC) outperforms the pairwise terms consistently. LAOMC* shows results based on FlowNet [95], while LAOMC is computed on flows from [88] for a fair comparison to [31].]