Generative Model With Dynamic Linear Flow

Flow-based generative models are a family of exact log-likelihood models with tractable sampling and latent-variable inference, and are hence conceptually attractive for modeling complex distributions. However, flow-based models lag behind state-of-the-art autoregressive models in density estimation performance. Autoregressive models, which also belong to the family of likelihood-based methods, in turn suffer from limited parallelizability. In this paper, we propose Dynamic Linear Flow (DLF), a new family of invertible transformations with partially autoregressive structure. Our method benefits from the efficient computation of flow-based methods and the high density estimation performance of autoregressive methods. We demonstrate that the proposed DLF yields state-of-the-art performance on ImageNet 32×32 and 64×64 among all flow-based methods. Additionally, DLF converges significantly faster than previous flow-based methods such as Glow.


Introduction
The increasing amount of data, paired with the exponential progress in hardware capabilities and relentless efforts toward better methods, has tremendously advanced fields of deep learning such as image classification (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017) and machine translation (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019). However, most applications have been limited to situations where large amounts of supervision are available, as labeling data remains a labor-intensive and costly exercise. Meanwhile, unlabeled data is generally easier to acquire, but using it directly remains a central challenge. Deep generative models, an emerging and popular branch of machine learning, aim to address these challenges by modeling high-dimensional data distributions without supervision.
In recent years, the field of generative modeling has advanced significantly, especially in the development and application of generative adversarial networks (GANs) (Goodfellow et al., 2014) and likelihood-based methods (Graves, 2013; Kingma and Welling, 2013; Dinh et al., 2014; Oord et al., 2016b). Likelihood-based generative methods can be further divided into three categories: variational autoencoders (Kingma and Welling, 2013), autoregressive models (Oord et al., 2016b; Salimans et al., 2017; Chen et al., 2017; Menick and Kalchbrenner, 2018), and flow-based generative methods (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Variational autoencoders parallelize well in both training and synthesis, but they can be technically challenging to optimize because they maximize only a lower bound on the marginal likelihood of the data. Autoregressive models and flow-based generative models both estimate the exact likelihood of the data. However, autoregressive models suffer from limited parallelizability of synthesis or training, and a lot of effort has been made to overcome this drawback. In contrast, flow-based generative models are efficient for training and synthesis, but generally perform worse than autoregressive models in density estimation benchmarks.
In this paper, we focus on exact likelihood-based methods. In Section 2, we first review autoregressive methods and flow-based methods. Inspired by their common properties, in Section 3 we propose a new family of invertible transformations with partially autoregressive structure, and we show that autoregressive models and flow-based generative models are two extreme forms of our proposed method. In Section 5, our empirical results show that the proposed method achieves state-of-the-art density estimation performance on the ImageNet dataset among flow-based methods and converges significantly faster than the Glow model (Kingma and Dhariwal, 2018). Although our method has a partially autoregressive structure, we show that synthesizing a high-resolution image (e.g., a 256×256 image) on modern hardware takes less than one second, which is comparable to most flow-based methods.

Flow-based Models
In most flow-based models (Kingma and Dhariwal, 2018; Dinh et al., 2014, 2016), the high-dimensional random variable $x$ with complex and unknown true distribution $x \sim p(x)$ is modeled through a latent variable $z = f_\theta(x)$, where $f_\theta$ can be any bijective function with parameters $\theta$ and is typically composed of a series of transformations $f = f_1 \circ f_2 \circ \cdots \circ f_L$. The latent variable has a tractable density $p_\theta(z)$, such as a standard Gaussian distribution. With the change of variables formula, we then have the marginal log-likelihood of a datapoint and take it as the optimization objective for learning $\theta$:
$$\log p_\theta(x) = \log p_\theta(z) + \log\left|\det\left(\frac{dz}{dx}\right)\right| = \log p_\theta(z) + \sum_{i=1}^{L} \log\left|\det\left(\frac{dh_i}{dh_{i-1}}\right)\right|,$$
where $h_i$ is the hidden output of the sequence of transformations, with $h_0 \triangleq x$ and $h_L \triangleq z$.
However, the above formula requires the computation of the Jacobian determinant of each intermediate transformation, which is generally intractable and therefore becomes a limitation of the method. In practice, to overcome this issue, the transformation function $f_i$ is designed so that its Jacobian matrix is triangular or diagonal; the log-determinant is then simply the sum of the log-diagonal entries:
$$\log\left|\det\left(\frac{dh_i}{dh_{i-1}}\right)\right| = \mathrm{sum}\left(\log\left|\mathrm{diag}\left(\frac{dh_i}{dh_{i-1}}\right)\right|\right).$$
In the rest of this section, we review invertible and tractable transformations reported in previous studies, categorized as fully autoregressive or non-autoregressive. We then discuss their respective advantages and disadvantages in computational parallelizability and density estimation performance.
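The objective above can be illustrated with a minimal sketch, assuming a toy flow made of two element-wise affine layers and a standard Gaussian prior. The layer parameters and helper names are hypothetical and chosen only for illustration; each layer has a diagonal Jacobian, so its log-determinant is the sum of the log-diagonal entries.

```python
import math

def affine_layer(x, s, mu):
    """Element-wise affine map y = s * x + mu; its Jacobian is diag(s)."""
    y = [si * xi + mi for xi, si, mi in zip(x, s, mu)]
    log_det = sum(math.log(abs(si)) for si in s)  # sum of log-diagonal entries
    return y, log_det

def standard_normal_logpdf(z):
    """Log-density of a standard Gaussian with independent dimensions."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)

def log_likelihood(x, layers):
    """log p(x) = log p(z) + sum of per-layer log|det| terms."""
    h, total_log_det = x, 0.0
    for s, mu in layers:
        h, log_det = affine_layer(h, s, mu)
        total_log_det += log_det
    return standard_normal_logpdf(h) + total_log_det

# Two hypothetical layers, each given as (scale, shift).
layers = [([2.0, 0.5], [0.0, 1.0]), ([1.0, 4.0], [-1.0, 0.0])]
ll = log_likelihood([0.3, -0.2], layers)
```

Here the composite map sends the input to z = (-0.4, 3.6), and the accumulated log-determinant equals log 4, the sum over both layers.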

Autoregressive and Inverse Autoregressive Transformations
Papamakarios et al. (2017) and Kingma et al. (2016) introduced the autoregressive (AR) transformation and the inverse autoregressive (IAR) transformation, respectively. Both model a similar invertible and tractable transformation from a high-dimensional variable $x$ to $y$:
$$y_i = s_i x_i + \mu_i,$$
where $x_i$ and $y_i$ are the $i$-th elements of $x$ and $y$, respectively. The difference between AR and IAR is that $s_i$ and $\mu_i$ are driven by different inputs: $s_i, \mu_i = g(x_{1:i-1})$ in the autoregressive transformation and $s_i, \mu_i = g(y_{1:i-1})$ in the inverse autoregressive transformation. Here $g$ is an arbitrarily complex function, usually a neural network. The vectorized transformation and its reverse for (inverse) autoregressive transformations can be described as follows:
$$y = s \odot x + \mu, \qquad x = \frac{y - \mu}{s}, \tag{5}$$
where $\odot$ is the Hadamard (element-wise) product, and the addition, division and subtraction are also element-wise operations.
In previous works, AR and IAR have been successfully applied to image generation and speech synthesis (Oord et al., 2016a). However, as $s_i$ and $\mu_i$ depend on the previous elements of the input $x_{1:i-1}$ or the output $y_{1:i-1}$, these transformations are inherently sequential in at least one pass, either training (IAR) or synthesis (AR), which makes them difficult to parallelize on modern parallel hardware.
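The asymmetry between AR and IAR can be sketched as follows. This is a minimal illustration, not the cited implementations: the function `g` is a hypothetical stand-in for the neural network, and lists stand in for tensors. The comments mark which pass has to run sequentially.

```python
import math

def g(prefix):
    """Hypothetical stand-in for the network g: maps a prefix to (s_i, mu_i)."""
    t = sum(prefix)
    return math.exp(math.tanh(t)), 0.5 * t  # s_i > 0 keeps the map invertible

def ar_forward(x):
    # s_i, mu_i = g(x_{1:i-1}); all of x is known, so iterations are independent.
    y = []
    for i, xi in enumerate(x):
        s, mu = g(x[:i])
        y.append(s * xi + mu)
    return y

def ar_inverse(y):
    # x_i needs the already recovered x_{1:i-1}: inherently sequential.
    x = []
    for yi in y:
        s, mu = g(x)
        x.append((yi - mu) / s)
    return x

def iar_forward(x):
    # s_i, mu_i = g(y_{1:i-1}); y is built step by step: inherently sequential.
    y = []
    for xi in x:
        s, mu = g(y)
        y.append(s * xi + mu)
    return y

def iar_inverse(y):
    # All of y is known, so iterations are independent.
    x = []
    for i, yi in enumerate(y):
        s, mu = g(y[:i])
        x.append((yi - mu) / s)
    return x
```

Both pairs invert each other exactly; only the direction in which the loop carries a dependency differs.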

Non-autoregressive Transformations
Non-autoregressive transformations are designed to be parallelizable in both the forward and backward pass, with tractable Jacobian determinants and inverses. Here, we describe a number of them.
Actnorm (Kingma and Dhariwal, 2018) was proposed to alleviate the training problems encountered in deep models. It is a special case of the (inverse) autoregressive transformation in which the scale $s$ and bias $\mu$ are treated as regular trainable parameters, i.e., independent of the input data:
$$y = s \odot x + \mu.$$
It is worth mentioning that $s$ and $\mu$ are shared across the spatial dimensions of $x$ when the input is a 2D image, as described in Kingma and Dhariwal (2018).
Affine/additive coupling layers (Dinh et al., 2014, 2016) split the high-dimensional input $x$ into two parts $(x_1, x_2)$ and apply a different transformation to each to obtain the output $y = (y_1, y_2)$. The first part is passed through an identity function and thus remains unchanged, and the second part is mapped to a new distribution with an affine transformation:
$$y_1 = x_1, \qquad y_2 = s \odot x_2 + \mu, \tag{7}$$
with $\mu, s = g(x_1) = g(y_1)$. As in AR and IAR, $g$ is an arbitrarily complex function, typically a neural network. Note that this transformation can also be rewritten in the same form as the (inverse) autoregressive transformations and actnorm:
$$y = \tilde{s} \odot x + \tilde{\mu}, \qquad \tilde{s} = (\mathbf{1}, s), \quad \tilde{\mu} = (\mathbf{0}, \mu).$$
These non-autoregressive transformations have the advantage of parallelization and are therefore usually faster than transformations with autoregressive structure. However, previous results have shown that they generally perform much worse in density estimation benchmarks (Ho et al., 2019).
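A minimal sketch of an affine coupling layer follows, assuming an even split of the input and a hypothetical stand-in `g` for the neural network. It shows why inversion needs no recursion: since $y_1 = x_1$, the same $s$ and $\mu$ can be recomputed from the output.

```python
import math

def g(x1, n):
    """Hypothetical stand-in for the network g: maps x1 to (s, mu) of length n."""
    t = math.tanh(sum(x1))
    return [math.exp(t)] * n, [0.5 * t] * n

def coupling_forward(x):
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, mu = g(x1, len(x2))
    y2 = [si * xi + mi for si, xi, mi in zip(s, x2, mu)]
    # The identity half contributes log(1) = 0 to the log-determinant.
    log_det = sum(math.log(abs(si)) for si in s)
    return x1 + y2, log_det

def coupling_inverse(y):
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, mu = g(y1, len(y2))  # y1 equals x1, so s and mu match the forward pass
    x2 = [(yi - mi) / si for si, yi, mi in zip(s, y2, mu)]
    return y1 + x2
```

Both passes evaluate `g` once on a fully known input, which is what makes the layer parallelizable in either direction.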

Method
In this section, we introduce a new family of transformations that combines the computational efficiency of non-autoregressive transformations with the high density estimation performance of (inverse) autoregressive transformations.
There are two key observations from the methods reviewed in Section 2. First, all of them share a consistent linear form:
$$y = w x + \mu.$$
Here $w$ is a diagonal matrix with $s$ as its diagonal elements, so the transformation is invertible and its inverse is as simple as in Eq. (5). This invertibility makes it possible to use the same transformation as a building block of both the encoder and the decoder in generative models.
The second key observation is that the weights $w$ and $\mu$ of such linear transformations are data-dependent in a way that keeps the determinant of the Jacobian matrix $J = dy/dx$ tractable, usually by making $J$ triangular (AR, IAR and affine coupling layer) or diagonal (actnorm). Therefore, the log-determinant is simply the sum of the logarithms of the diagonal terms, $\log|\det(J)| = \mathrm{sum}(\log|s|)$.
These methods differ in how they model the relationship between the weights $(w, \mu)$ and the data under the "easy determinant of the Jacobian" constraint.
Figure 1: At each scale, the input is passed through a squeezing operation to trade spatial size for number of channels, followed by H flows of invertible 1×1 convolution and dynamic linear transformation. The output is split into two halves, one for the next series of flows and the other as a part of the final latent variable. The condition h is optional and guides the dynamic linear transformation as prior knowledge.

Dynamic Linear Transformation with Triangular Jacobian
Let us now consider a high-dimensional variable $x \in \mathbb{R}^D$. Splitting it into $K$ parts along its dimension, we obtain $x = (x_1, \ldots, x_K)$, with $1 \le K \le D$. We then introduce a tractable and bijective function $y = f(x)$ as follows:
$$y_1 = h(x_1), \qquad y_k = s_k \odot x_k + \mu_k, \tag{9}$$
with $k = 2, \cdots, K$. The variables $y_k$, $x_k$, $s_k$ and $\mu_k$ have the same dimension, and $s_k, \mu_k = g_{\theta_k}(x_{k-1})$ are modeled by an arbitrarily complex function (usually a neural network) that takes the previous part of the data as input. The function $h(\cdot)$ is tractable and bijective, with inverse $x_1 = h^{-1}(y_1)$. One option for $h(\cdot)$ is the identity function $y_1 = x_1$; in that case, combined with $K = 2$, our method reduces to the affine coupling layer, see Eq. (7). For consistency, in this paper we choose $h(x_1) = s_1 \odot x_1 + \mu_1$, where $s_1$ and $\mu_1$ are trainable. In other words, $s_1$ and $\mu_1$ are modeled by $g_{\theta_1}(x_0)$, where $x_0$ is any constant, e.g. $x_0 = 1$. Therefore, Eq. (9) and its inverse can be rewritten as:
$$y_k = s_k \odot x_k + \mu_k, \qquad x_k = \frac{y_k - \mu_k}{s_k}, \qquad s_k, \mu_k = g_{\theta_k}(x_{k-1}),$$
where $k = 1, 2, \cdots, K$ and the initial condition is $x_0 = 1$.
The Jacobian of the above transformation is triangular with $s = (s_1, \cdots, s_K)$ as its diagonal elements and thus has a simple log-determinant term:
$$\log\left|\det\left(\frac{dy}{dx}\right)\right| = \sum_{k=1}^{K} \mathrm{sum}(\log|s_k|). \tag{12}$$
Note that our proposed transformation can also be rewritten in the following linear form:
$$y = w x + b,$$
where the variables $w = \mathrm{diag}(s)$ and $b = \mu = (\mu_1, \cdots, \mu_K)$ are data-dependent; we therefore call our method dynamic linear transformation. Since $w$ and $b$ change for different inputs, the dynamic linear transformation can be considered the extreme form of a piecewise linear function, in which each point learns its own weights for the affine transformation.
Figure 2: Negative log-likelihood on CIFAR-10 test set during training. Increasing K leads to no performance gain but slower convergence.
In applications, an important concern for dynamic linear transformation is the recursive dependency in its reverse pass, which arises because each pair $(s_k, \mu_k)$ depends on the previous partition $x_{k-1}$. This issue is mitigated for two reasons: (1) the recursive dependencies are at the level of partitions and each step depends only on the one immediately before it, so the computation is more efficient than in the element-level autoregressive structure, where each element depends on all earlier ones; and (2) the smaller K is, the shorter the dependency chain. In Section 5, we show that increasing K is not helpful and results in a worse NLL score (Fig. 2), and that our state-of-the-art results are achieved with K = 2, at a computational speed similar to non-autoregressive methods.
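The dynamic linear transformation and its reverse pass can be sketched as below. This is a toy illustration under stated assumptions: `g_k` is a hypothetical stand-in for the networks $g_{\theta_k}$, partitions have equal size, and flat lists replace image tensors. The reverse loop carries a dependency over K partitions rather than over D elements.

```python
import math

def g_k(k, prev, size):
    """Hypothetical stand-in for g_{theta_k}: maps x_{k-1} to (s_k, mu_k)."""
    t = math.tanh(0.3 * k + sum(prev))
    return [math.exp(0.5 * t)] * size, [t] * size

def split(x, K):
    size = len(x) // K  # assumes len(x) is divisible by K
    return [x[i * size:(i + 1) * size] for i in range(K)]

def dlt_forward(x, K):
    prev, y, log_det = [1.0], [], 0.0  # initial condition x_0 = 1
    for k, xk in enumerate(split(x, K), start=1):
        s, mu = g_k(k, prev, len(xk))  # uses only x_{k-1}, which is already known
        y += [si * xi + mi for si, xi, mi in zip(s, xk, mu)]
        log_det += sum(math.log(abs(si)) for si in s)
        prev = xk
    return y, log_det

def dlt_inverse(y, K):
    prev, x = [1.0], []
    for k, yk in enumerate(split(y, K), start=1):
        s, mu = g_k(k, prev, len(yk))  # needs the recovered x_{k-1}: K sequential steps
        xk = [(yi - mi) / si for si, yi, mi in zip(s, yk, mu)]
        x += xk
        prev = xk
    return x
```

With K = 1 the sketch reduces to a single affine map with input-independent parameters, and with K = 2 the reverse pass needs only two dependent steps.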
As with the AR and IAR transformations, we also introduce a variant of dynamic linear transformation. Letting $s_k$ and $\mu_k$ take the transformed output $y_{k-1}$ as input instead of $x_{k-1}$, we have:
$$y_k = s_k \odot x_k + \mu_k, \qquad x_k = \frac{y_k - \mu_k}{s_k}, \qquad s_k, \mu_k = g_{\theta_k}(y_{k-1}),$$
with $k = 1, 2, \cdots, K$ and initial condition $y_0 = 1$. We call this variant inverse dynamic linear transformation; it has the same log-determinant as Eq. (12).
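For comparison, the inverse variant can be sketched in the same toy setting (hypothetical `g_k`, equal-size partitions, flat lists). Here the dependency moves to the forward pass, while the reverse pass reads every $y_{k-1}$ directly from its input.

```python
import math

def g_k(k, prev, size):
    """Hypothetical stand-in for g_{theta_k}: maps y_{k-1} to (s_k, mu_k)."""
    t = math.tanh(0.3 * k + sum(prev))
    return [math.exp(0.5 * t)] * size, [t] * size

def split(x, K):
    size = len(x) // K  # assumes len(x) is divisible by K
    return [x[i * size:(i + 1) * size] for i in range(K)]

def inverse_dlt_forward(x, K):
    prev, y, log_det = [1.0], [], 0.0  # initial condition y_0 = 1
    for k, xk in enumerate(split(x, K), start=1):
        s, mu = g_k(k, prev, len(xk))  # needs the freshly computed y_{k-1}
        yk = [si * xi + mi for si, xi, mi in zip(s, xk, mu)]
        log_det += sum(math.log(abs(si)) for si in s)
        y += yk
        prev = yk
    return y, log_det

def inverse_dlt_inverse(y, K):
    parts = [[1.0]] + split(y, K)  # parts[0] plays the role of y_0
    x = []
    for k in range(1, K + 1):  # iterations do not depend on each other
        s, mu = g_k(k, parts[k - 1], len(parts[k]))
        x += [(yi - mi) / si for si, yi, mi in zip(s, parts[k], mu)]
    return x
```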

Conditional Dynamic Linear Transformation
In many sample generation scenarios, it is a common requirement to control the generation process with prior knowledge, e.g. generating an image given class label information. We introduce the conditional dynamic linear transformation to meet this requirement. Given a condition $h$, the conditional dynamic linear transformation can be described as:
$$y_k = s_k \odot x_k + \mu_k, \qquad s_k, \mu_k = g_{\theta_k}(x_{k-1}, h).$$
The transformation parameters $s_k$ and $\mu_k$ take $h$ as an additional input. Accordingly, when inverting the transformation, we can recompute $s_k$ and $\mu_k$ from the same $h$ and the previously recovered $x_{k-1}$.
For the inverse dynamic linear transformation variant, the conditional form is:
$$y_k = s_k \odot x_k + \mu_k, \qquad s_k, \mu_k = g_{\theta_k}(y_{k-1}, h).$$

Dynamic Linear Flow
In high-dimensional problems (e.g. generating images of faces), a single layer of dynamic linear transformation has fairly limited capacity. To increase the capacity of the model, in this section we describe Dynamic Linear Flow (DLF), a flow-based model that uses the (inverse) dynamic linear transformation as a building block. Following the previous work on NICE (Dinh et al., 2014), RealNVP (Dinh et al., 2016) and Glow, DLF is a stack of blocks consisting of an invertible 1×1 convolution and an (inverse) dynamic linear transformation, combined in a multi-scale architecture (Fig. 1). Since dynamic linear transformation and inverse dynamic linear transformation are similar, Fig. 1 only illustrates the structure of DLF with dynamic linear transformation; the corresponding variant is obtained by replacing each dynamic linear transformation layer with an inverse dynamic linear transformation. A comparison of their density estimation performance is included in Section 5.

Multi-scale Architecture
For 2D image input, following RealNVP and Glow, we use a squeezing operation to reduce each spatial dimension by a factor of 2 and move the values into channels, so that an $s \times s \times c$ input is transformed into an $\frac{s}{2} \times \frac{s}{2} \times 4c$ tensor. After the squeezing operation, H steps of flow, each consisting of an invertible 1×1 convolution and a dynamic linear transformation, are combined into a sequence. At regular intervals, half of the dimensions of the output of this sequence are factored out, and the factored-out halves from all scales are concatenated to obtain the final transformed output. The above operations are applied iteratively L times.
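The squeezing and factoring-out steps can be sketched on nested lists indexed as [row][column][channel]. This is an illustrative sketch only: the ordering in which the four pixels of each 2×2 block are stacked into channels is an assumption, and practical implementations operate on tensors instead.

```python
def squeeze(x):
    """Turn an s x s x c image into an s/2 x s/2 x 4c image (s must be even)."""
    h, w = len(x), len(x[0])
    return [[x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]
             for j in range(w // 2)] for i in range(h // 2)]

def unsqueeze(y):
    """Exact inverse of squeeze."""
    h, w, c = len(y), len(y[0]), len(y[0][0]) // 4
    x = [[None] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            v = y[i][j]
            x[2 * i][2 * j], x[2 * i][2 * j + 1] = v[:c], v[c:2 * c]
            x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1] = v[2 * c:3 * c], v[3 * c:]
    return x

def factor_out(y):
    """Split channels in half: one half continues, the other joins the latent."""
    c = len(y[0][0]) // 2
    keep = [[px[:c] for px in row] for row in y]
    latent = [[px[c:] for px in row] for row in y]
    return keep, latent
```

For a 4×4×3 input, `squeeze` returns a 2×2×12 result, and `factor_out` then yields two 2×2×6 halves.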

Invertible 1 × 1 Convolution
To ensure that each dimension can influence every other dimension during the transformation, we apply an invertible 1×1 convolution layer (Kingma and Dhariwal, 2018) before each layer of dynamic linear transformation. The invertible 1×1 convolution is essentially a normal 1×1 convolution with an equal number of input and output channels:
$$y_{i,j} = W x_{i,j},$$
where $W$ is the kernel with shape $c \times c$, and $i, j$ index the spatial dimensions of the 2D variables $x$ and $y$.
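A minimal sketch of the invertible 1×1 convolution follows, restricted to c = 2 channels so that the determinant and inverse of the kernel can be written explicitly. The helper names are hypothetical. The log-determinant expression, height × width × log|det W|, is the one given for this layer by Kingma and Dhariwal (2018); it is not restated in the text above.

```python
import math

def conv1x1(x, W):
    """Apply y[i][j] = W @ x[i][j] at every spatial position of x[row][col][channel]."""
    return [[[sum(w * v for w, v in zip(row, px)) for row in W]
             for px in line] for line in x]

def det2(W):
    """Determinant of a 2 x 2 kernel."""
    return W[0][0] * W[1][1] - W[0][1] * W[1][0]

def inv2(W):
    """Inverse of a 2 x 2 kernel; the reverse pass is a 1x1 convolution with it."""
    d = det2(W)
    return [[W[1][1] / d, -W[0][1] / d], [-W[1][0] / d, W[0][0] / d]]

def conv1x1_log_det(x, W):
    """The same kernel acts at every position, so log|det W| is counted h * w times."""
    return len(x) * len(x[0]) * math.log(abs(det2(W)))
```

For larger c, the same structure applies with a general matrix inverse and determinant routine in place of `inv2` and `det2`.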

Experiments
We evaluate the proposed DLF model on standard image modeling benchmarks such as CIFAR-10 (Krizhevsky and Hinton, 2009), ImageNet (Russakovsky et al., 2015) among others. We first investigated the impact of number of partitions K and compared the variants of dynamic linear transformation. With the optimal hyperparameters, we then compared log-likelihood with previous generative models of autoregressive and non-autoregressive families. Lastly, we assessed the conditional DLF with class label information and the qualitative aspects of DLF on high-resolution datasets.
In all our experiments, we followed a similar implementation of the neural network $g_{\theta_k}$ as in Glow, using three convolutional layers with a different activation function in the last layer. More specifically, the first two convolutional layers have c channels with ReLU activation functions, and 3×3 and 1×1 filters, respectively. To control the number of model parameters, c varies with the number of partitions K and the dataset (Table 1). The last convolution is 3×3 and has twice as many channels as the partition $x_k$; its output $o$ is split equally into two parts along the channel dimension, giving $\log s_k, \mu_k = \mathrm{split}(o)$. For training stability, the final scale is $s_k = \exp(\alpha \tanh(\log s_k) + \beta)$, where α and β are learnable scale variables. For the conditional DLF, we introduce the condition in the last layer by $\log s_k, \mu_k = \mathrm{split}(o + Vh)$, where $V$ is a weight matrix for the conditioning data. In cases where $h$ encodes spatial information, the matrix product $Vh$ is replaced by a 3×3 convolution. The parameters $\theta_k$ of the neural network are not shared between different partitions $x_k$. The depth H is always set to 32. See Table 1 and Appendix A for more details of optimization.
Table 2 (autoregressive models, bits/dim):
Method                                       CIFAR-10   ImageNet 32×32   ImageNet 64×64
(Oord et al., 2016b)                         3.00       3.86             3.63
Gated PixelCNN (van den Oord et al., 2016)   3.03       3.83             3.57
PixelSNAIL                                   2.85       3.80             3.52
SPN (Menick and Kalchbrenner, 2018)          -          3.79             3.52
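The output parameterization of $g_{\theta_k}$ described in this section (channel split, bounded scale, optional conditioning term) can be sketched for a single spatial position. This is an illustration under assumptions: α and β are treated as scalars, `o` stands for the raw output of the last convolution, and the helper names are hypothetical. For α > 0 the tanh keeps $s_k$ between exp(β − α) and exp(β + α), which is one way to read the stated stability benefit.

```python
import math

def matvec(V, h):
    """Plain matrix-vector product V h."""
    return [sum(v * hi for v, hi in zip(row, h)) for row in V]

def scale_and_shift(o, alpha, beta, V=None, h=None):
    """Map the raw output o (length 2c) to (s_k, mu_k), each of length c."""
    if V is not None:
        # Conditional case: add V h before splitting.
        o = [oi + ci for oi, ci in zip(o, matvec(V, h))]
    c = len(o) // 2
    log_s, mu = o[:c], o[c:]  # log s_k, mu_k = split(o)
    s = [math.exp(alpha * math.tanh(v) + beta) for v in log_s]
    return s, mu

# Unconditional use: extreme raw values still give a bounded scale.
s, mu = scale_and_shift([50.0, -50.0, 0.0, 0.3, -0.2, 1.0], alpha=1.0, beta=0.0)

# Conditional use with a one-hot label h and a hypothetical 2c x 2 matrix V (c = 1).
s_c, mu_c = scale_and_shift([0.0, 0.0], 1.0, 0.0, V=[[1.0, 0.0], [0.0, 2.0]], h=[0.0, 1.0])
```

With a one-hot `h`, the product `V h` simply selects one column of `V`, so each class contributes its own offset to the log-scale and the shift.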

Effect of Partitions K and Model Variants
Choosing a large K increases the recursive complexity of the model. Therefore, a small K is preferred provided that performance is not degraded. We tested K = 2, 4 and 6 partitions on CIFAR-10. The number of model parameters was kept approximately equal to 45M (the same size as Glow) by adjusting the number of channels c, see Table 1. The results are summarized in Fig. 2. As we can see, increasing K is unnecessary and has a negative effect on model performance, leading to a worse NLL score and slower convergence. We also replaced the dynamic linear transformation layers with the inverse variant for K = 2, which did not produce a significant performance difference. Therefore, we choose K = 2 and do not evaluate DLF with inverse dynamic linear transformation in the following experiments.
Note that for K = 2, both the non-inverse and inverse variants start overfitting after 20 epochs. After 50 epochs, the NLL averaged over an epoch on the training set reaches 3.30 and is still decreasing, while the validation NLL increases from 3.51 to 3.55. As mentioned in Section 3, dynamic linear transformation is the extreme form of a piecewise linear function, learning the weights of an affine transformation for each input. This suggests that the more powerful the transformation is, the more training data it needs to cover the distribution of the whole dataset. Therefore, to avoid overfitting, apart from reducing the capacity of dynamic linear transformation, another approach is to increase the size of the training dataset. We discuss this in greater detail in the following sections.

Density Estimation
To compare with previous likelihood-based models, we perform density estimation on the natural image datasets CIFAR10 and ImageNet. In particular, we use the 32×32 and 64×64 downsampled versions of ImageNet (Oord et al., 2016b). For all datasets, we follow the same preprocessing as in Kingma and Dhariwal (2018).
On CIFAR10, as discussed earlier, the DLF model with the same size as Glow overfits. A possible reason is the simplicity and small size of CIFAR10. We tested this hypothesis by training a model of the same size on the more complex ImageNet 32×32 dataset. As shown in Table 2, the improvement over Glow is significant, at 0.24 bits/dim, and we did not observe overfitting on ImageNet 32×32. This encouraged us to apply transfer learning to CIFAR10, initializing the parameters from the model trained on ImageNet 32×32. We found this approach helpful for CIFAR10, obtaining 3.51 bits/dim without transfer learning and 3.44 bits/dim with it. On ImageNet 64×64, the DLF model reaches 3.57 bits/dim with a relatively small model of 50.7M parameters, compared to 112.3M parameters for Glow on the same dataset.
In summary, the DLF model achieves state-of-the-art density modeling results on ImageNet 32×32 and 64×64 among all non-autoregressive models, and it is comparable to most autoregressive models. It is worth mentioning that all results are obtained within 50 epochs. To our knowledge, this is more than 10 times more efficient in terms of training epochs than Glow and Flow++ (Ho et al., 2019), which generally require thousands of epochs to converge.

Conditional DLF
For conditional DLF, we experimented on MNIST (LeCun et al., 1998) and CIFAR10 with the class label as prior. The hyperparameters can be found in Table 1 (for CIFAR10, only K = 2 was tested). For the conditional version, during training we represent the class label as a 10-dimensional one-hot encoded vector h and feed it to each layer of dynamic linear transformation. In contrast, the class label is not given in the unconditional version. Once training has converged, we synthesize samples by randomly drawing latent variables z from a standard Gaussian distribution and, for conditional DLF, giving the one-hot encoded label to all layers of dynamic linear transformation. As shown in Fig. 3, the class-conditional samples (sampled after 150 epochs) are controlled by the corresponding label, and their quality is better than that of the unconditional samples (sampled after 200 epochs). This result indicates that DLF correctly learns to control the distribution with the class label prior. See the appendix for samples from CIFAR10.

Samples and Interpolation
We present samples randomly generated from the trained DLF model on ImageNet 64×64 and CelebA HQ 256×256 (Karras et al., 2017) in Fig. 4, both at 8 bits. For the CelebA HQ 256×256 dataset, our model has 57.4M parameters, which is approximately 1/4 of Glow's, and is trained for only 400 epochs. Note that our model has not fully converged on CelebA HQ 256×256, due to limited computational resources.
In Fig. 5, we take pairs of real images from the CelebA HQ 256×256 test set, encode them to obtain their latent representations, and linearly interpolate between the latents to decode samples. As we can see, the images change smoothly along the interpolation path. During sampling, generating a 256×256 image at batch size 1 takes about 315ms on a single 1080 Ti GPU, and 1078ms on a single i7-6700k CPU. We believe this sampling speed can be further improved by using inverse dynamic linear transformation, as it has no recursive structure in the reverse computation.
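The interpolation procedure can be sketched as follows. The functions `encode` and `decode` are hypothetical stand-ins for a trained flow and its inverse; a fixed affine map is used here only so that the example runs, whereas a trained DLF would produce a nonlinear path in image space.

```python
def encode(x):
    """Hypothetical stand-in for the trained flow z = f(x)."""
    return [2.0 * xi + 1.0 for xi in x]

def decode(z):
    """Exact inverse of encode, standing in for x = f^{-1}(z)."""
    return [(zi - 1.0) / 2.0 for zi in z]

def interpolate(x_a, x_b, steps):
    """Decode `steps` evenly spaced points on the line between two latents (steps >= 2)."""
    z_a, z_b = encode(x_a), encode(x_b)
    frames = []
    for i in range(steps):
        t = i / (steps - 1)
        z = [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]
        frames.append(decode(z))
    return frames

frames = interpolate([0.0, 1.0], [1.0, 3.0], steps=5)
```

Because the flow is invertible, the first and last frames reproduce the two input images.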