A Bayesian Nonparametric Model for Unsupervised Joint Segmentation of a Collection of Images

Jointly segmenting a collection of images with shared classes is expected to yield better results than single-image based methods, due to the use of the shared statistical information across different images. This paper proposes a Bayesian approach for tackling this problem. As a first contribution, the proposed method relies on a new prior distribution for the class labels, which combines a hierarchical Dirichlet process (HDP) with a Potts model. The latter classically favors a spatial dependency, whereas the HDP is a Bayesian nonparametric model that allows the number of classes to be inferred automatically. The HDP also explicitly induces a sharing of classes between the images. The resulting posterior distribution of the labels is not analytically tractable and can be explored using a standard Gibbs sampler. However, such a sampling strategy is known to have poor mixing properties for high-dimensional data. To alleviate this issue, the second contribution reported in this paper consists of an adapted generalized Swendsen-Wang algorithm which is a sampling technique that improves the exploration of the posterior distribution. Finally, since the inferred segmentation depends on the values of the hyperparameters, the third contribution aims at adjusting them while sampling the posterior label distribution by resorting to an original combination of two sequential Monte Carlo samplers. The proposed methods are validated on both simulated and natural images from databases.


I. INTRODUCTION
Image segmentation has received a lot of attention in the literature since it is a key step in image analysis.Biomedical imaging based diagnosis [1] or remote sensing [2] are examples of archetypal applications.Segmentation aims at identifying K homogeneous areas, referred to as classes, in an image or a collection of images.Within a statistical framework, region-based segmentation can be formulated as the estimation of class labels univocally assigned to pixels sharing the same statistical properties.Adopting a Bayesian The associate editor coordinating the review of this article and approving it for publication was Diego Oliva.
formulation, one has to elicit a prior distribution over the class labels.A classical choice is the Potts model, which promotes spatially piecewise homogeneous regions by yielding a higher prior probability for configurations where neighboring pixels are in the same class.However, a major limitation of the Potts model lies in the fact that the number of classes K must be set in advance.As an alternative, K can also be estimated jointly with the class labels.The resulting posterior distribution is highly intractable but can be explored using sampling methods, e.g., reversible jump Markov Chain Monte Carlo (RJ-MCMC) algorithms.However, the efficiency of RJ-MCMC may be limited by the dimension of the problem and is highly driven by the ability of designing efficient moves across spaces of different dimensions.Conversely, Bayesian nonparametric methods offer a scalable solution by allowing K to increase with the dimension of the data.Following this idea, a prior combining the Dirichlet process (DP) and the Potts model has been proposed in [3] for automatic image segmentation.In this case, the DP ensures the automatic inference of the number of classes while the Potts model enforces spatial homogeneity.
When considering the segmentation of a collection of J images, the most straightforward approach consists in partitioning separately each of the images in a set of m j •(j = 1, . . ., J ) regions and then matching the regions among the different images to identify common classes.However, this suboptimal two-step processing does not fully take the shared information into account to characterize the classes.As an alternative, a joint analysis can be carried out: such an approach deals jointly with all the images, fully exploiting the fact that a given class can be present in several of them, but possibly in different proportions.For instance, when analyzing remote sensing images, vegetation can be encountered in countryside but also in urban and suburban areas.Nevertheless, its density is far less significant in the last categories of images.Thus, the joint processing is expected to improve the detection and recognition of the vegetation class in images where it is quite rare.In such scenarios, the model in [3] may not be suitable because the DP cannot depict a sharing of classes between images.
A first nonparametric approach for shared segmentation of image databases is proposed in [4].A hierarchical model is considered so that the different regions of an image appear with proportions given by Pitman-Yor (PY) processes; such processes are well-suited to capture the power-law behavior of natural objects.These regions are assigned to global classes whose prior distribution is also driven by a PY process.To favor smooth regions, thresholded correlated Gaussian processes are used to generate the class labels.Keeping up with the random field formalism, in this work, we propose an extension of the nonparametric model in [3].The joint segmentation is made possible using a hierarchical Dirichlet process (HDP) [5] to model the labels of the regions and classes as well as their relationship, coupled with a Potts model that ensures spatial smoothness.The use of the HDP allows both the number of regions m j• , the overall number of classes K and the shared classes to be inferred automatically.
The posterior distribution of the labels can be written up to a normalization constant as the product of the prior and the likelihood of the set of observations.Since Bayesian estimators associated to this posterior distribution cannot be derived analytically, a Markov chain Monte Carlo (MCMC) approximation is considered.We presented preliminary results based on a Gibbs sampler in [6].However, that sampling scheme was not suitable for the segmentation of real images due its high computational cost.Moreover, the sampling procedure in [6] did not allow the model hyperparameters to be estimated.This paper addresses both limitations.
First, adopting the strategy followed by most image segmentation approaches, the images are decomposed into super-pixels, here obtained using the simple linear iterative clustering algorithm (SLIC) [7].Then, to fully characterize the super-pixels, both color and texture histograms are considered as descriptive features.As for the segmentation algorithm, a generalized Swendsen-Wang (SW) [8] based sampler is proposed to accelerate the convergence and overcome the mixing issues of the conventional Gibbs sampler.The SW algorithm proposed in [9] introduces a set of bond latent variables that allow the class labels of subsets of neighboring super-pixels to be jointly updated; as these class labels are typically highly correlated, this joint update improves the convergence rate of the sampler.The last contribution of this paper is to estimate the hyperparameters jointly with the partition.For that purpose, they could be also assigned a prior distribution and sampled within the SW procedure.However, due to the intrication of the Potts and HDP models, the presence of unknown normalization constants precludes the closed-form computation of the conditional distribution of the hyperparameters.To overcome this issue, we propose to embed the MCMC algorithm into a sequential Monte Carlo (SMC) sampler [10] which sequentially explores the posterior label distribution for a sequence of hyperparameter values.At each step, the likelihood of the current set of hyperparameters is also obtained up to an unknown constant, such that the optimal value can be selected afterwards.An original approach is proposed to compensate for this hyperparameterdependent constant which requires to run an additional SMC algorithm that targets the prior instead of the posterior label distribution.Finally, the best partition must be selected based on the posterior label samples.To avoid label switching issues, it is chosen in this work as the one minimizing the Dahl's criterion [11].
The paper is organized as follows.The statistical model of the images to be segmented and the proposed prior distribution over the labels are presented in Section II.Section III describes the single-site Gibbs sampling procedure for posterior inference and the more efficient generalized SW-based sampler is introduced in section IV.In section V, a sequential Monte Carlo sampler is derived to jointly estimate the model hyperparameters and the image partition.Finally, after deriving the selection procedure of the best partition, experimental results are reported in section VII.Section VIII concludes the paper and proposes perspectives to this work.

A. LIKELIHOOD
Let us consider a set of J images to be jointly segmented.To reduce the data dimension and hence save computational complexity at the processing stage, each image j (j ∈ {1, . . ., J }) is first divided into N j super-pixels.The presegmentation algorithm used in this paper is the simple linear iterative clustering (SLIC) presented in [7].It should be noted that the super-pixels are irregularly shaped but a neighborhood system can still be defined.Indeed, two superpixels are said to be neighbors if they share connected pixels.
Super-pixel n (n ∈ 1, . . ., N j ) in image j is characterized by the observation vector y jn and its associated distribution denoted f is assumed to be parameterized by a latent vector θ jn , i.e., y jn |θ jn ∼ f (y jn |θ jn ).To make the segmentation more efficient, both colors and textures of the super-pixels can be taken into account.More precisely, in this case, the observation vector is defined as a set of two descriptors, namely y t jn and y c jn representing the texture information and the red-green-blue (RGB) color histogram, respectively.As a consequence, the vector of observations writes y jn = (y t jn , y c jn ), inducing θ jn = (θ t jn , θ c jn ).Assuming conditional independence, the joint likelihood associated with super-pixel n in image j writes f (y jn |θ jn ) = f (y t jn |θ t jn )f (y c jn |θ c jn ).This composite model will be considered in the experiments conducted on real natural images (see Section VII-B).It should be noted that preliminary tests on synthetic images of small size will be based on a simpler model.More precisely, no presegmentation will then be required and each class will be characterized only by the gray level of its pixels.
The J images are assumed to be composed of K , possibly shared, homogeneous classes.The joint segmentation has two levels: i) first, each image j is divided into m j• (j = 1, . . ., J ) regions and ii) each region is assigned to a class.Thus, the classification task consists in recovering the region label c jn ∈ 1, . . ., m j• of super-pixel n in image j as well as the class label d jt ∈ {1, . . ., K } of region t in image j.
Pixels assigned to the same class k share the same latent parameter vector, denoted φ k .By denoting A k the set of superpixels assigned to class k, i.e., A k {(j, n)|d jc jn = k}, we have θ jn = φ k .Assuming the observations are conditionally independent given the parameter vector, their marginal likelihood is f (y) = K k=1 f (y A k ) where with y the collection of all observations, y A k = {y jn |(j, n) ∈ A k } the observations associated to the pixels with indexes in A k and h(•) the prior distribution on φ k .This model is conditional upon the class and region assignments.In the following section, we describe the considered prior model for the partition, i.e for the labels c jn and d jt .

B. PRIOR DISTRIBUTION
The proposed prior modeling for the set of class and region labels (c, d) combines two potential functions that promote sharing of the classes and spatial homogeneity, respectively.They are described in what follows.

1) HIERARCHICAL DIRICHLET PROCESS
Let G j be the distribution of the hyperparameter vector θ jn (j ∈ {1, . . ., J } and n ∈ {1, . . ., N j }).For the sake of generality, this distribution is assumed unknown and not enforced to belong to a specific parametric family.Within a Bayesian framework, it is assigned a prior distribution which has the specificity to be defined on the infinite dimensional space of the probability measures.In this work, a DP is chosen as a prior since its discrete realizations favor several pixels to share the same class labels [12].Under this assumption, G j writes: The DP depends on two parameters, a base measure denoted here G 0 and a scale parameter α 0 .The support points ψ jt are independent and identically distributed according to G 0 while the probabilities τ jt indirectly depend on α 0 through the stickbreaking generative process [13].
To induce class sharing among the images, the base distribution G 0 should be the same for all the images but also discrete to ensure that the distributions G j (j ∈ {1, . . ., J }) are defined with common atoms.As the number of atoms is unknown, a DP is also selected as a prior distribution for G 0 .In the sequel, its base distribution is denoted H , with probability density function h and scale factor γ , leading to In the literature, this model composed of nested layers of DPs is referred to as the hierarchical DP (HDP) [5].To better understand it, we can resort to the well-known metaphor of the Chinese restaurant franchise (CRF) [5], based on the Pólya urn scheme.First, since the value of the hyperparameter vector θ jn is uniquely determined by the region c jn and class d jc jn labels, the models (2) and (3) can be completely rewritten in terms of the latter.Furthermore, the distributions G 0 and G j can be marginalized out to focus on the sequential generation of the labels.Under the DP assumption, they are exchangeable so that each label can be sampled as if it were the last one.The procedure is the following: given the remainder of the super-pixels, super-pixel n in image j has a positive probability of being assigned to an existing region t which is proportional to the number ν jt of super-pixels in that region whereas the probability of being assigned to a new one t new is proportional to α 0 , i.e., Similarly, region t in image j can be assigned to an existing class k proportionally to the overall number m •k of regions assigned to class k, or to a new one proportionally to γ , i.e., where d −jt = {d j t |j = 1, . . ., J ; t = 1, . . ., m j • ; (j , t ) = (j, t)}.Thus, the prior potential on the partition induced by the HDP writes 2) POTTS MODEL The prior ( 5) is complemented by a Potts-Markov random field which promotes spatial homogeneity of the class labels [14].It is known to be particularly suitable for image segmentation [15] and thus has been widely used for various applications, ranging from remote sensing [2] to medical imaging [16].Given a description of the image through a neighboring system, the probability for a super-pixel to be assigned to a class k depends on the number of its neighbors assigned to the same class.In our context, two super-pixels are considered to be neighbors if they have at least two pixels that are themselves neighbors according to a conventional 8-pixel neighborhood structure.The prior potential can thus be written where V n denotes the set of super-pixels neighbors of superpixel n, 1 •,• is the Kronecker delta function and β is the so-called granularity parameter.For high values of this parameter, the image exhibits few classes.On the contrary, for small values of β, the images tend to be over-segmented.The parameter β is assumed to be known in this paper.It could be estimated using a procedure similar to [17].

3) PROPOSED PRIOR MODEL
The proposed composite prior combines a global penalization of the number of classes provided by the HDP and a local penalization coming from the Potts model.The HDP induces an implicit prior on the number of regions in each image, the number of classes and the shared classes.With the Potts model, a spatial proximity is taken into account.The joint prior model of the region and class label is given by

III. GIBBS SAMPLING
The posterior distribution Pr(c, d|y) ∝ Pr(c, d)f (y|c, d) is not analytically tractable, i.e., the optimal partition cannot be derived analytically.Thus, a Gibbs sampler is derived to draw samples asymptotically distributed according to the posterior.The region and class labels are iteratively sampled according to the posterior probabilities Pr(c jn |c −jn , d, y) and Pr(d jt |c, d −jt , y), respectively [18], with c −jn = {c j n |j = 1, . . ., J ; n = 1, . . ., N j ; (j , n ) = (j, n)}.This sampling scheme is inspired from the one initially proposed in [5].

A. SAMPLING OF THE CLASS LABELS
Sampling the class label d jt of the region t in image j is achieved by deriving the corresponding conditional posterior probability, which is proportional to the prior model (7) and to the distribution of the observations associated with the pixels of interest, conditionally on the other observations and current partition, i.e., with, using (7), The term ϕ(d jt |c, d −jt ) can be computed by resorting to the CRF metaphor.By assuming that K classes are active at the current iteration of the Gibbs sampler, two cases should be considered: the region t in image j can be assigned to an existing class, i.e., 1st Case (d jt = k With k ∈ {1, . . ., K }: The first term in the right-hand side of (8) exhibits the distribution of the observations associated to pixels in the considered region conditionally on the observations associated to pixels assigned to the class d jt = k omitting those in the considered region, i.e., p(y jt |c, ).Moreover, the CRF allows the first prior factor to be evaluated using (4) as ϕ( The second factor ρ(d jt |c, d −jt ), corresponding to the Potts model, depends on the class assignment of the set of neighbors of the super-pixels in the region t that do not belong to this region, herein denoted V [t] .
2nd Case (d jt = k new With k new ∈ {1, . . ., K }): In this case, since a new class is sampled, the conditional distribution of the observations y jt becomes the marginal distribution, i.e., p(y jt |c,

B. SAMPLING OF THE REGION LABELS
The conditional distribution of the region label c jn associated with super-pixel n in image j is proportional to the prior model of this region label and to the likelihood of the observation In (12), as for the class labels, the term ϕ(c jn |c −jn , d) resulting from the Bayesian nonparametric prior implies that superpixel n in image j can be assigned either to an existing region or to a new one.These two cases are discussed below.
1st Case (c jn = t With t ∈ 1, . . ., m j. ): The first term of the conditional distribution (11) is the distribution of the observation vector y jn conditionally on the observations attached to pixels assigned to region t in image j omitting the nth one: Finally, sampling the region label c jn can be conducted from the conditional posterior probabilities Pr(c jn = t|c −jn , d, y) Note that, once a new region label t new has been chosen, the corresponding class label d jt new should be sampled according to Pr(d

IV. A GENERALIZED SWENDSEN-WANG BASED ALGORITHM
In high-dimensional problems, the efficiency of the proposed Gibbs sampler can be impaired by the slow convergence of the generated Markov chain.In particular, to speed-up convergence, one opportunity is offered by the SW algorithm, specifically designed to sample according to a Potts model [9], one of the key ingredient in the proposed prior model (7).It considers an augmented model, introducing a set of latent binary variables r linking pairwise pixels to form so-called spin-clusters.This latent variable r is defined such that its sampling only depends on the current partition of the pixels, while preserving the marginal distribution of the labels.The major interest of this strategy results from the fact that the labels of linked pixels are jointly updated.We propose here a generalized counterpart of the conventional SW algorithm presented in [19].

A. CONDITIONAL DISTRIBUTION OF THE LINKING VARIABLE
Within the proposed hierarchical segmentation scheme, two sets of labels are introduced, namely, the labels of region c and the labels of class d.One can think of sampling r conditionally to the class labels, as it is classically conducted when handling a Potts model.Unfortunately, in this case, a problem arises: under this configuration, two pixels in different regions can be linked, which makes the sampling scheme of c and d challenging since several regions can be assigned to the same class.As an alternative, we propose to sample the linking variable conditionally to both the region and class labels.Note however that when two pixels belong to the same region, they also share the same class.Thus, the linking variable can be equivalently sampled conditionally on the region labels c only.As a consequence, the linking variable r j nq associated with two super-pixels n and q in the image j can be sampled according to where Ber(p) is the Bernoulli distribution with parameter p and λ governs the behavior of the generalized SW (GSW) algorithm: the greater its value, the bigger the spin-clusters.

B. EMBEDDING THE GENERALIZED SW INTO THE GIBBS SAMPLER
Finally, at iteration i of the algorithm, the sampling scheme targeting the posterior distribution of interest is 1) r (i) ∼ Pr(r|c (i−1) , d (i−1) , y) 2) c (i)  ∼ Pr(c|d (i−1) , r (i) , y) 3) d (i)  ∼ Pr(d|c (i) , r (i) , y) In step 1, since the distribution of r only depends on the partition in regions c, the conditional probabilities reduce to Pr(r|c, d, y) = Pr(r|c).As a consequence, the linking variables are sampled according to (17).Moreover, in step 3, the class labels are conditionally independent of the linking variable r.Thus this step is equivalent to the one described in paragraph III-A.
In all and for all, only step 2, i.e., the sampling of the region labels c, needs to be reconsidered carefully.As previously said, the SW algorithm allows the region labels associated with observations belonging to the same spin-cluster to be simultaneously updated.Let C jl be the set of super-pixels in the spin-cluster l of image j.We denote y jl , r jl and c jl the associated set of observations, linking labels and region labels, respectively, and y −jl , r −jl and c −jl the set of observations, linking labels and region labels when omitting the superpixels in C jl .The conditional posterior probability writes Pr(c jl = t|c −jl , d, r, y) The first term in ( 18) can be rewritten as: The Potts field involves the number of neighboring super-pixels of the spin-cluster that are in region t.The set V jl of that neighboring superpixels is the set of super-pixels that are neighbors of superpixels in C jl but are not in C jl .The conditional probability for the super-pixels in C jl to be in an existing region then writes Moreover, the probability that the super-pixels in C jl are in a new region is where

V. ESTIMATION OF THE HYPERPARAMETERS
The quality of the segmentation highly depends on the fine adjustment of the hyperparameter values.As for the parameter of the Potts model β, its estimation has already been investigated in the literature and can be performed for instance using an approximate Bayesian computation as proposed in [17].Thus, this paper rather focuses on the parameters of the HDP prior which are denoted in the sequel χ = {α 0 , γ }.Adopting a fully Bayesian framework, they could be assigned a prior distribution and jointly estimated with the parameters of interest, namely the class partition [5].Within the Gibbs sampler presented in Section III, sampling according to the conditional posterior distribution p(χ |c, d, y), e.g., using a Metropolis-Hastings step, would require to compute the normalization constant associated to the prior probabilities Pr(c, d|χ ) defined by (7).However, mainly due to the Potts potential included in the prior distribution (7), this normalization is not analytically tractable.Moreover, as it involves a sum over all possible partitions of ( 7), its pre-computation for realistic segmentation problems is not possible.
The proposed alternative consists in estimating χ under a maximum likelihood paradigm while exploring the posterior distribution of the class partition by resorting to a sequential Monte Carlo (SMC) sampler [10].This algorithm, which combines MCMC and importance sampling, is intended to sequentially sample from a sequence of probability distributions defined on the same measurable space.In our case, this sequence corresponds to the posterior distributions of the class and region labels conditionally upon a pre-defined grid of values for χ .The interest of the SMC is then twofold.Firstly, it only requires the target distributions to be known up to a normalizing constant.Secondly, it calculates at each iteration an estimate of this normalizing constant.Since the latter is proportional to the likelihood of the hyperparameters, it makes it possible to select afterwards their most likely value.
To further detail the proposed approach, let us denote {χ m } m=1,...,M a predefined grid of possible hyperparameter values, m the target distribution up to a normalizing constant and Z m this normalizing constant at the m th iteration of the SMC.More precisely, the latter are defined as follows: where ϕ m and ρ m write as in equations ( 5) and ( 6), respectively, with the hyperparameters set to χ m .The proposed SMC proceeds by running in parallel several MCMC samplers, while varying at each iteration the value of the hyperparameters according to the grid.The generated chains are inhomogeneous but an importance sampling step is conducted to guarantee that a Monte Carlo approximation of Pr(c, d|y, χ m ) is obtained.More precisely, if c (i) m and d (i)  m are the samples generated by the ith (i ∈ {1, . . ., I }) chain at the mth iteration, this probability writes Pr(c, d|y, χ m ) are called the particles and the importance weights, respectively.The latter are introduced to correct for the discrepancy between the actual and the target distributions of the particles.It should be noted that different implementations of the SMC can be considered depending on the choice of the MCMC samplers: either the Gibbs or the generalized Swendsen Wang algorithms described in Sections III and IV, respectively, could fit.In the first case, the particles should be updated based on ( 8)-( 16) whereas, in the second one, ( 18)-( 20) should be used.The exact computation of the importance weights involves intricate high dimensional integrals and cannot be carried out in practice.It is shown in [10] that a relevant approximation of the optimal importance weights, in the sense that they have the minimum variance, can be sequentially computed as where the incremental weights w m reduce to: It is well-known that sequential importance sampling suffers from weight impoverishment: that is, after a few iterations, all the weights except one are close to 0. To prevent this degeneracy, the particles are classically resampled according to (22) when the effective sample size (ESS) [20] is below a given threshold classically chosen as I , with 0.5 < < 1.
Then, their weights are reset to 1/I .A remarkable property of the SMC is that the sum of the unnormalized importance weights provides an estimate of the ratios of consecutive normalization constants, i.e., Then, by computing the product of the sequences of estimates of the form (25) up to the current iteration, one can obtain an estimate of Z m /Z 1 which is directly related to the likelihood of χ m as follows where Z m and Z 1 are the normalizing constants of the HDP-Potts prior distributions for the 1st and mth value of χ, respectively.Using notations introduced in Section II, we have: The difficulty inherent to the considered setting is that these normalization constants also depend on the hyperparameters and cannot be elicited.To overcome this issue, we propose to run an additional SMC defined with the same grid of hyperparameter values but targeting the prior instead of the posterior distribution of the partitions.At each iteration, the particles are updated by sampling from equations ( 8)-( 16) or ( 18)-( 20), but removing all the terms depending on the measurements y.As for the incremental weights, they are obtained as: The trick is that this SMC yields at each iteration an estimate of the ratio Z m /Z m−1 : These ratios can be combined to obtain an approximation of Z m /Z 1 which can be used to compensate this term in (25) and directly estimate f (y|χ m )/f (y|χ 1 ) in the end.Thus, f (y|χ) can finally be computed for each value of the hyperparameter χ chosen on the grid.The maximum likelihood estimate of the hyperparameters χ is finally derived as: The different steps of both SMC are detailed in Algo. 1 as well as the combination of their outputs to estimate the likelihoods of the hyperparameter values.Last but not least, it should be noted that it is not necessary to resample a partition conditional on the optimal χ.Indeed, the posterior distribution of partition is directly given by ( 22) taking χ m = χ.

VI. DERIVING THE OPTIMAL PARTITION
After a burn-in period, the algorithms described in Sections III and IV generate I samples from the posterior distribution of interest which are denoted (c (1) , d (1) ), . . ., (c (I ) , d (I ) ).Similarly, if the SMC is considered, I samples from the target distribution can be directly obtained by resampling the empirical distribution (22) corresponding to the most likely values of the hyperparameters.Then, within a Bayesian segmentation framework, the optimal partition is commonly chosen as the marginal maximum a posteriori estimator, i.e., the realization with maximum order of multiplicity.However, under the considered model, deriving this estimator is not straightforward not only because of the wellknown label switching problem [21,Chap. 13] but also due to the varying number of classes induced by the hierarchical Dirichlet process.To overcome label switching, each sample would need to be re-labeled, resulting in a prohibitive computation cost.As an alternative, this paper proposes to follow the approach introduced in [11] and detailed in what follows.Let denote the set of class labels assigned to each super-pixel in the images with κ jn d jc jn .The best partition κ is then m , d (i)  m according to ( 8)-( 16) or ( 18)-( 20 where κ (i) is the partition generated at iteration i and can be interpreted as an average pairwise probability.Thus, the optimal partition is chosen as the closest to the average one in the least-squares sense.An alternative would have consisted in thresholding ζ to define the optimal partition κ.
However this choice would have promoted an over-estimated number of classes, in particular for pixels located in edges of classes.Moreover, choosing the optimal partition among the sampled ones amounts to select a fairly probable partition.

VII. EXPERIMENTS
The proposed model has been first applied to toy-images so as to have at our disposal a ground truth to validate the obtained segmentation.Then, it has been tested on natural images from the LabelMe1 database.

A. RESULTS ON TOY IMAGES
A set of J = 3 images is generated as follows: synthetic classification maps of size 64 × 64 pixels are simulated using a Potts model with parameter β = 1.2 and a number of classes chosen as K real = 3.To each class is associated a gray level: −25, 0 or 25.These piecewise constant images are then degraded by a centered white Gaussian noise of variance σ 2 y = 15 2 to form the observed images.Noise-free and noisy images are depicted in figure 1.

1) MARGINAL LIKELIHOOD
To segment the image set, a particular instance of the generic model introduced in Section II is considered.Since the presegmentation in super-pixels is not required for small size images, the observations that are taken into account for the segmentation directly consist of the noisy gray levels of the pixels.The base distribution H of the HDP, according to which the class parameters φ k are assumed to be drawn, is the following Gaussian distribution: with µ 0 = 0 and σ 2 0 = 75 2 .Moreover, conditionally upon these parameters φ k , the distribution f (•|φ k ) of the pixel values is assumed to be Gaussian, i.e., y jn |φ k ∼ N (y jn ; φ k , σ 2 y ).Note that the distributions H and f are conjugate, which allows the marginal likelihood to be analytically computed.Indeed, the observation model can be rewritten as The marginal distribution of the set of observations associated to pixels in a class k is then a Gaussian distribution N (y A k ; µ k , k ) with mean µ k and covariance matrix k defined as .
The conditional distribution of the nth observation associated to the kth class in the jth image is also Gaussian, i.e., Similarly, the conditional distribution of the observations associated to pixels in the tth region is with where .
The same equation holds for a set of observations belonging to the same spin cluster.These quantities are involved in the different steps of the Gibbs algorithm and the GSW algorithm introduced in Sections III and IV, respectively.

2) RESULTS
The proposed HDP-Potts model-based segmentation, implemented by using either the Gibbs or the GSW algorithms, is compared to the DP-Potts [3] for which the considered images have been concatenated into a unique one beforehand.
We have enforced that pixels that originate from different images cannot be neighbors   Each sampler has been run for 10000 iterations but only the last 5000 sampled partitions were used to estimate the optimal one.With the considered value of σ y , the images are quite noisy, making the segmentation problem harder.
To speed-up the convergence, a k-means algorithm has been applied on the noisy images for the initialization.For the DP-Potts (Gibbs and GSW) algorithms, wherein all the images are concatenated to form a unique one, the k-means is applied with K init = 30.For the HDP-Potts (Gibbs and GSW) algorithms, the k-means algorithm is separately applied on each image of the set with K init = 15.
As for the proposed results, one should keep in mind that the colors are arbitrary and are automatically set depending on the number of classes.The relevance of the segmentation should be studied by comparing the set of pixels that are in the same class initially and also in the estimated partition.It can be seen that the Gibbs algorithm detects only 2 classes.The interest of the SW technique then appears: for both models, applying it improves the results.However, by comparing the required time to converge, it takes far less iterations for the HDP-Potts than for the DP-Potts.Indeed, while more than a hundred iterations are necessary for the DP-Potts (GSW), less than ten are required for the HDP-Potts (GSW).It should be noted that it is difficult to define a convergence criterion that is independent of the label-switching problem: we chose to simply monitor the stabilization of the posterior distribution.However, this is not optimal since the chain can be trapped in a local minimum.
In order to quantitatively evaluate the obtained segmentations, we have also computed the V -measure introduced in [22].Its value, comprised in the interval 0, 1 , has the advantage of being independent of the label-switching issues.The closer it is to 1, the better the classification.The V -measure yields similar values for both models on the toy-images, ranging between 0.87 to 0.95 depending on the considered images.These high values confirm the validity of the proposed method.

B. RESULTS ON NATURAL IMAGES 1) MARGINAL LIKELIHOOD
Each image is pre-segmented in super-pixels with the SLIC [7] method.The descriptor that we choose for a superpixel is a histogram.In this case, the likelihood of the observations is defined as a multinomial distribution.To be able to compute the marginal distribution of the observations, the prior distribution on the parameters is set as a Dirichlet distribution.Let φ k be the vector of parameters, the model then writes: Thanks to the conjugacy of the Dirichlet and multinomial distributions, the marginal distribution of the observations associated to pixels in class k writes: with D and M respectively the normalizing constants of the Dirichlet and multinomial distributions [23].As for the conditional distributions, they are expressed as: In the sequel, it should be noted that the type of considered histogram depends on the processed images: it can be gray level histograms or RGB histograms.They can be used in combination with a histogram of oriented gradients (HOG) [24] to address textured classes and changes of illumination.

2) RESULTS
The parameters have been set as: -    road, the vegetation and the sky.The road is quite wellrecognized in the two last images while for the first, due to a different illumination setup, it is only partially identified.The vegetation is also well detected.As for the class sky, with the histogram of gray levels, not only the clouds are separated, but also, it is divided regarding the changes of illumination.With the HOG, the problems of different illuminations are tackled.It should be noted that the HDP-Potts model provides valuable additional information: the partition in regions of each image.
Figures 6 and 7 show the results of segmentation for a mixed set of images with, as main classes, road, vegetation  and buildings.The road and vegetation are quite well recognized in the different images but the buildings are not always identified as belonging to the same class.Indeed, they are very textured, therefore the variability intra-class is high.

C. ESTIMATION OF HYPERPARAMETERS
To illustrate the relevance of the SMC algorithm proposed in section V, it is applied to the set of road images for the estimation of the scalar parameter of the HDP-Potts model γ .The algorithmic setting is the following: -  satisfying segmentation results for the set of road images, as depicted in figure 9.
To validate the SMC, one could think of generating a set of images directly with the prior model and then recovering the hyperparameters values with the SMC algorithm.However, the images thus obtained are quite irregular and therefore useless for our purpose.Indeed, the proposed algorithms aim at segmenting quite homogeneous images.

VIII. CONCLUSION
In this article, we propose a new method for the joint segmentation of a set of images with shared classes, combining the hierarchical Dirichlet process and the Potts model.The Bayesian non parametric approach allows us to automatically infer the number of classes from the data, while the Markov field classically ensures spatial smoothness.The posterior distribution is intractable and a Gibbs sampling is therefore derived to explore it.Then, to speed up the convergence of the chain, a generalized Swendsen-Wang algorithm is described.Finally, an original approach consisting of two interacting sequential Monte Carlo samplers is presented for the estimation of the model hyperparameters.Results of segmentation on both toy and natural images show the usefulness of the proposed algorithms.
As a perspective of this work, the prior model can be rewritten using the stick-breaking construction so that a variational method can be used for approximate posterior inference.It could be computationally more efficient.Also, the results show that it may be of interest to choose an appropriate texture representation for very textured images.Finally, it would be worth theoretically studying the influence of the hyperparameters (scale factors and granularity parameter) on the mean number of classes.

Algorithm 1 3 %Sampling conditionally to the hyperparameter χ m 4 for i = 1 to I do 5 %Sampling
Sequential Monte Carlo Samplers for the Hyperparameter Estimation Input: y, {χ m } m=1,••• ,M 1 % SMC targeting at the posterior distribution 2 for m = 1 to M do

10 Compute 20 %Sampling
the ratio R (m) according to (25); 11 if ESS < εI then 12 Resample the particles: 13 ∀i ∈ 1, I , set W (i) m = 1/I ; 14 end 15 end 16 % SMC targeting at the prior distribution 17 for m = 1 to M do 18 %Sampling conditionally to the hyperparameter χ m 19 for i = 1 to I do defined as the one minimizing the Dahl's criterion corresponding to the loss function associated to the Bayesian risk
. The parameters are set as -scale parameter of the DP in the DP-Potts model: α = 1, -Potts model granularity parameter in the DP-and HDP-Potts model: β = 1.2, -scale parameters of the DPs in the HDP-Potts model: α 0 = 1, γ = 1.

Figures 2
Figures 2 and 3 respectively show the segmentation results obtained with the DP-Potts and HDP-Potts models.
scale parameter of the DP in the DP-Potts model: α = 1 -Potts model granularity parameter in the DP-and HDP-Potts model: β = 1 -number of bins in the histograms: N bi = 16 -scale parameters for the DPs in the HDP-Potts model: α 0 = 1, γ = 1 -parameter of the prior Dirichlet distribution on the observed gray level histograms: ϕ 0 = 10 4 × y, where y is the mean of the observed gray level histograms and for the RGB combined to HOG histograms: ϕ c 0 = 3 × 10 4 × y c , with y c the mean of the observed RGB histograms.Figures 4, 5, 6 and 7 give the inferred segmentation for a set of images with a road and a set of images of roads and buildings.On each figure are first represented the set of original images and the pre-segmented images into superpixels.The results obtained while the descriptor of the superpixel is a RGB color histogram combined with the HOG are also shown.On figures 4 and 5 the models DP-Potts and HDP-Potts yield good segmentation results.The main classes are the

FIGURE 4 .
FIGURE 4. On (a) the original images for the first set with images of road; on (b) the pre-segmented images with the SLIC method; on (c) the results of segmentation using the DP-Potts algorithm applied to gray-level histograms and on (d) to RGB histograms combined to HOGs.

FIGURE 5 .
FIGURE 5. On (a) the original images for the first set with images of road; on (b) the pre-segmented images with the SLIC method; on (c) the segmentation in regions and on (d) in classes using the HDP-Potts with the gray-level histograms and on (e) in classes with the RGB histograms combined to the HOGs.

FIGURE 6 .
FIGURE 6.On (a) the original images for the second set with a mix of images with roads and buildings; on (b) the pre-segmented images with the SLIC method; on (c) the results of segmentation using the DP-Potts algorithm applied to gray-level histograms and on (d) to RGB histograms combined to HOGs.

FIGURE 7 .
FIGURE 7. On (a) the original images for the second set with a mix of images with roads and buildings; on (b) the pre-segmented images with the SLIC method; on (c) the segmentation in regions and on (d) in classes, both using the HDP-Potts with the gray-level histograms and on (e) with the RGB histograms combined to the HOGs.

FIGURE 8 .
FIGURE 8. Estimation of the marginal likelihood on a grid of values {γm} m≥1 for the set of road images and for a pre-segmentation using the SLIC method.
Potts model granularity parameter in the HDP-Potts model: β = 1 -scale parameter for the first DP in the HDP-Potts model: α 0 = 1 -a grid of M = 2000 test values {γ m } 1≤m≤M for the HDP scale parameter decreasing from 100 to 0.1 in logarithmic scale -number of particles in parallel: I = 100 -parameter of the prior Dirichlet distribution on the observed gray level histograms: ϕ 0 = 10 4 × y Figure 8 shows the estimation of the marginal likelihood with the SMC.As intended, it exhibits a maximum for a value of γ here equal to 3.65.It is taken as the estimation of the HDP hyperparameter in the maximum likelihood sense.Using this estimate of γ , the HDP-Potts method achieves

FIGURE 9 .
FIGURE 9. On (a) the true images; on (b) the pre-segmented images into roughly 1000 super-pixels; on (c) the results of segmentation with the SMC applied to the HDP-Potts for the estimation of γ with N bi = 16; on (d) the results of segmentation with the HDP-Potts applied with the hyperparameters set as in section VII-B but with γ set to its optimal value.
Moreoever, the Potts term writes ρ(d jc jn |c jn , c −jn , d −jc jn ) = exp q∈V n β1 d jc jq ,d jc jn .2nd Case (c jn = t new With t new ∈ 1, . . ., m j. ): The CRF metaphor (4) leads to ϕ(c jn = t new | c −jn ) ∝ α 0 and the likelihood function associated with the observation vector y jn is obtained by summing over the different possibilities for the class label d jt new to be assigned to the new region p(y jn |c jn