Efficient Approximate Online Convolutional Dictionary Learning

Most existing convolutional dictionary learning (CDL) algorithms are based on batch learning, where the dictionary filters and the convolutional sparse representations are optimized in an alternating manner using a training dataset. When large training datasets are used, batch CDL algorithms become prohibitively memory-intensive. An online-learning technique is used to reduce the memory requirements of CDL by optimizing the dictionary incrementally after finding the sparse representations of each training sample. Nevertheless, learning large dictionaries using the existing online CDL (OCDL) algorithms remains highly computationally expensive. In this paper, we present a novel approximate OCDL method that incorporates sparse decomposition of the training samples. The resulting optimization problems are addressed using the alternating direction method of multipliers. Extensive experimental evaluations using several image datasets and based on an image fusion task show that the proposed method substantially reduces computational costs while preserving the effectiveness of the state-of-the-art OCDL algorithms.


Farshad G. Veshki, Member, IEEE, and Sergiy A. Vorobyov, Fellow, IEEE

Index Terms - Convolutional sparse coding, online convolutional dictionary learning.

I. INTRODUCTION
Sparse representations have achieved significant success and widespread adoption as models for solving inverse problems in signal processing, image processing, and computational imaging [1], [2], [3], [4], [5], [6]. The sparse representation model approximates a signal by the product of a matrix, called a dictionary, and a vector that has only a few non-zero entries (the sparse representation). There are numerous applications where the use of the sparse representation model coupled with a learned dictionary results in remarkably improved performance. A learned dictionary aims to produce sparser representations and more accurate approximations of its domain signals [7], [8], [9]. Dictionary learning has been utilized in diverse computational imaging tasks such as image reconstruction [10], [11], image super-resolution [5], and image fusion [12].
Typically, dictionary learning and sparse approximation are used to extract local patterns and features from high-dimensional signals (such as images). Therefore, a prior decomposition of the original signals into vectorized overlapping blocks is usually required (e.g., patch extraction in image processing). However, the relations between neighboring blocks are then ignored, which results in multi-valued sparse representations and dictionaries composed of similar (shifted) atoms.
The convolutional sparse coding (CSC) model replaces the matrix-vector product used in standard sparse approximation by a sum of convolutions of dictionary filters $\{d_k \in \mathbb{R}^m\}_{k=1}^K$ and convolutional sparse representations (CSRs) $\{x_k \in \mathbb{R}^P\}_{k=1}^K$ (also called sparse feature maps). The convolutional sparse approximation problem can be formulated as

$$\underset{\{x_k\}}{\text{minimize}}\ \frac{1}{2}\Big\|\sum_{k=1}^{K} d_k * x_k - s\Big\|_2^2 + \lambda \sum_{k=1}^{K} \|x_k\|_1, \quad (1)$$

where $s \in \mathbb{R}^P$ is the signal, $\lambda > 0$ is the regularization parameter that controls the sparsity of the representations, $*$ denotes the convolution operator (here, with "same" padding), and $\|\cdot\|_1$ and $\|\cdot\|_2$ represent the $\ell_1$-norm and the Euclidean norm of a vector, respectively. The convolutional dictionary learning (CDL) problem is typically addressed using a batch approach in which the sparse representations and the dictionary filters are optimized alternately (batch CDL) [16], [20], [21], [22], [23], [24], [25]. The dictionary optimization problem over a batch of $N$ training signals is formulated as

$$\underset{\{d_k\}}{\text{minimize}}\ \frac{1}{2}\sum_{n=1}^{N}\Big\|\sum_{k=1}^{K} d_k * x_k^n - s_n\Big\|_2^2 + \sum_{k=1}^{K}\Omega(d_k), \quad (2)$$

where $\{x_k^n\}_{k=1}^K$ are the CSRs of the $n$-th training signal $s_n$, and $\Omega(\cdot)$ represents the indicator function of the constraint set for the dictionary filters, that is, $\Omega(d) = 0$ if $\|d\|_2 \leq 1$ and $\Omega(d) = +\infty$ otherwise. The existing batch CDL methods require access to all training signals and their CSRs at once. As a result, memory of the order of $NPK$ is required [26], which can be extremely expensive when using large training datasets, i.e., when $N \gg K$. Recall that $K$ is the number of dictionary filters, $N$ is the number of training signals (the batch size), and $P$ is the dimension of the training signals, for example, the number of pixels in an image (usually $P \gg K$ and $P \gg N$). The memory requirement of CDL can be reduced using an online-learning approach, where the dictionary is optimized incrementally after observing each training signal and finding its sparse representations [9]. Online CDL (OCDL) methods are also useful when the training signals are not all available at once but are observed gradually over time, such as in streaming-data settings. The state-of-the-art OCDL methods have achieved memory requirements of the order of $K^2P$ [27], [28], which is independent of the number of training signals. Nevertheless, when learning large dictionaries or using high-dimensional signals, these methods can still incur excessive computational costs.
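To make the objective in (1) concrete, the following minimal Python sketch evaluates it for a 2-D image using FFT-based (circular) convolutions. The function and array layout are illustrative, not the paper's implementation.

```python
import numpy as np

def csc_objective(d, x, s, lmbda):
    """Evaluate the CSC objective of (1) for a 2-D signal.

    d : (K, m, m) dictionary filters
    x : (K, H, W) convolutional sparse representations
    s : (H, W)    signal
    Convolutions are computed in the DFT domain, i.e., with circular
    boundary conditions, as is standard in Fourier-domain CDL.
    """
    K, H, W = x.shape
    Df = np.fft.fft2(d, s=(H, W))     # zero-pad filters to the signal size
    Xf = np.fft.fft2(x)
    recon = np.real(np.fft.ifft2(np.sum(Df * Xf, axis=0)))
    data_term = 0.5 * np.sum((recon - s) ** 2)
    return data_term + lmbda * np.sum(np.abs(x))
```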
This paper presents a novel approximate OCDL method that significantly improves the computational efficiency of the state-of-the-art algorithms while providing competitive performance compared to the existing methods.¹ The proposed method requires memory of the order of $KP$ only. More specifically, our method approximates the OCDL problem by minimizing an upper bound of the objective function, in which the dictionary optimization problem is decoupled across the convolutional filters. We then solve the resulting optimization problems using the alternating direction method of multipliers (ADMM). MATLAB implementations of the proposed algorithms are available at https://github.com/FarshadGVeshki/Approximate-Online-Convolutional-Dictionary-Learning.
The rest of the paper is organized as follows. Section II briefly reviews CDL in the Fourier domain. The proposed CDL method and the derivations of the algorithms are presented in detail in Section III. Thorough experimental evaluation results, in terms of convergence properties and reconstruction accuracy, using multiple image datasets of varying sizes and in the context of a multimodal image fusion task, are presented in Section IV. Conclusions are provided in Section V.

II. OCDL IN THE FOURIER DOMAIN
Most efficient CDL methods are based on the Fourier transform [16], [25], [27], [28]. In the frequency (Fourier) domain, problem (2) is equivalent to

$$\underset{\{\hat{d}_k\}}{\text{minimize}}\ \frac{1}{2}\sum_{n=1}^{N}\Big\|\sum_{k=1}^{K} \hat{d}_k \odot \hat{x}_k^n - \hat{s}_n\Big\|_2^2 + \sum_{k=1}^{K}\Omega(d_k), \quad (3)$$

where $\hat{(\cdot)}$ and $\odot$ denote the discrete Fourier transform (DFT) and the elementwise multiplication operator, respectively. The filters $\{d_k\}_{k=1}^K$ are zero-padded prior to the DFT, so that $\{\hat{d}_k\}_{k=1}^K$ are of the same size as the CSRs.¹ Collecting the $p$-th DFT coefficients of all $K$ filters in a vector $\hat{d}_p \in \mathbb{C}^K$, and similarly the $p$-th DFT coefficients of the CSRs of the $n$-th signal in $\hat{x}_{n,p} \in \mathbb{C}^K$, problem (3) can be rewritten as

$$\underset{\{\hat{d}_p\}}{\text{minimize}}\ \frac{1}{2}\sum_{n=1}^{N}\sum_{p=1}^{P}\big|\hat{x}_{n,p}^T \hat{d}_p - \hat{s}_{n,p}\big|^2 + \sum_{k=1}^{K}\Omega(d_k), \quad (4)$$

where $(\cdot)^T$ is the transpose operator. The most efficient solutions to problem (4) (the batch CDL problem) have been proposed based on ADMM and the fast iterative shrinkage-thresholding algorithm (FISTA) [25], [26]. The complexities of these algorithms are of $O(KNP)$, and they require memory of the order of $KNP$. As a result, when the training dataset is large, batch CDL becomes excessively computationally demanding in practice.

¹The preliminary findings and results of the proposed method have been reported in [29]. This paper provides a comprehensive presentation, encompassing in-depth derivations, additional algorithmic developments, and a more extensive set of experimental results, including an application demonstrating the utility of OCDL in image fusion.
OCDL alleviates the large memory requirement by storing sufficient statistics of the training signals and their CSRs in compact history arrays. An online reformulation of problem (4) can be written as

$$\underset{\{\hat{d}_p\}}{\text{minimize}}\ \frac{1}{2}\sum_{p=1}^{P}\Big(\hat{d}_p^H A_p^N \hat{d}_p - 2\,\mathrm{Re}\big(\hat{d}_p^H b_p^N\big)\Big) + \sum_{k=1}^{K}\Omega(d_k), \quad (5)$$

where $(\cdot)^H$ is the Hermitian transpose operator, and the history arrays $A_p^N \in \mathbb{C}^{K \times K}$ and $b_p^N \in \mathbb{C}^{K}$, $p = 1, \dots, P$, are defined as

$$A_p^N = \sum_{n=1}^{N} \hat{x}_{n,p}^* \hat{x}_{n,p}^T, \qquad b_p^N = \sum_{n=1}^{N} \hat{x}_{n,p}^* \hat{s}_{n,p}, \quad (6)$$

with $(\cdot)^*$ standing for the elementwise complex conjugate of a vector. After observing each training signal and finding its sparse representations, the history arrays are recalculated incrementally using the formulas

$$A_p^N = A_p^{N-1} + \hat{x}_{N,p}^* \hat{x}_{N,p}^T, \qquad b_p^N = b_p^{N-1} + \hat{x}_{N,p}^* \hat{s}_{N,p}. \quad (7)$$

The history arrays are initialized with zeros. In OCDL, the dictionary is optimized by solving problem (5) only after the updated history arrays are available. As a result, a memory requirement of the order of $K^2P$ and a complexity of $O(K^2NP)$ are achieved [27], [28].
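The updates in (7) amount to rank-one accumulations per frequency. The following numpy sketch maintains the history arrays of the exact OCDL methods; names and the plain-sum normalization are assumptions (a forgetting or averaging factor may be used in practice):

```python
import numpy as np

def update_history_arrays(A, b, Xf, Sf):
    """One incremental update of the history arrays, sketching (7).

    A  : (P, K, K) complex array; A[p] accumulates xhat_p* xhat_p^T
    b  : (P, K)    complex array; b[p] accumulates xhat_p* shat_p
    Xf : (P, K)    row p holds the DFT coefficients of the new CSRs
    Sf : (P,)      DFT of the new training signal
    """
    A = A + np.conj(Xf)[:, :, None] * Xf[:, None, :]   # rank-one updates
    b = b + np.conj(Xf) * Sf[:, None]
    return A, b
```

Note that A alone already occupies $K^2P$ memory, which is the bottleneck the proposed method removes.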

III. THE PROPOSED METHOD
In the proposed method, the training signals are approximated in a distributed manner: each training signal $s_n$ is approximated as $\sum_{k=1}^K c_k^n * x_k^n$ using a per-signal dictionary $\{c_k^n\}_{k=1}^K$. A fusion of the separately optimized dictionaries based on the respective CSRs is used to calculate the dictionary $\{d_k\}_{k=1}^K$. Specifically, the quadratic term in CDL problem (2) is approximated using the following upper-bound estimate

$$\Big\|\sum_{k=1}^K d_k * x_k^n - s_n\Big\|_2 = \Big\|\sum_{k=1}^K (d_k - c_k^n) * x_k^n + \Big(\sum_{k=1}^K c_k^n * x_k^n - s_n\Big)\Big\|_2 \leq \sum_{k=1}^K \big\|(d_k - c_k^n) * x_k^n\big\|_2 + \Big\|\sum_{k=1}^K c_k^n * x_k^n - s_n\Big\|_2, \quad (8)$$

where the inequality is due to the triangle inequality. Note that the first part of the bound is decoupled across the convolutional filters. Accordingly, the proposed approximate CDL problem is formulated as

$$\underset{\{d_k\},\{c_k^n\}}{\text{minimize}}\ \frac{1}{2}\sum_{n=1}^N \Big( \sum_{k=1}^K \big\|(d_k - c_k^n) * x_k^n\big\|_2^2 + \Big\|\sum_{k=1}^K c_k^n * x_k^n - s_n\Big\|_2^2 \Big) + \sum_{k=1}^K \Omega(d_k) + \sum_{n=1}^N \sum_{k=1}^K \Omega(c_k^n). \quad (9)$$

In the following, two ADMM-based online methods for addressing (9) are presented. The first algorithm uses a standard approach for the optimization of $\{d_k\}_{k=1}^K$ and $\{c_k^N\}_{k=1}^K$, while the second algorithm incorporates pragmatic modifications to the first algorithm to improve the effectiveness of the proposed approximation method and lower the computational costs.

A. Algorithm 1
Optimization problem (9) is jointly convex with respect to $\{c_k^N, d_k\}_{k=1}^K$. Thus, using the OCDL framework, problem (9) can be addressed for the joint optimization variables $\{c_k^N, d_k\}_{k=1}^K$ after observing the $N$-th training signal $s_N$ and obtaining its CSRs $\{x_k^N\}_{k=1}^K$. Compact history arrays are used to store sufficient statistics of the previously observed training signals and their CSRs. Introducing auxiliary variables, problem (9) at the $N$-th step can be put in the ADMM form

$$\underset{\{c_k^N, d_k, f_k^N, g_k\}}{\text{minimize}}\ \frac{1}{2}\Big\|\sum_{k=1}^K f_k^N * x_k^N - s_N\Big\|_2^2 + \frac{1}{2}\sum_{n=1}^{N-1}\sum_{k=1}^K \big\|(g_k - c_k^n) * x_k^n\big\|_2^2 + \frac{1}{2}\sum_{k=1}^K \big\|(g_k - f_k^N) * x_k^N\big\|_2^2 + \sum_{k=1}^K \Omega(c_k^N) + \sum_{k=1}^K \Omega(d_k) \quad \text{s.t.}\ f_k^N = c_k^N,\ g_k = d_k,\ k = 1, \dots, K, \quad (10)$$

where $\{f_k^N, g_k\}_{k=1}^K$ are the (joint) ADMM auxiliary variables. The ADMM iterations consist of the following three steps.
The {f, g}-update step: In this step, the auxiliary variables $\{f_k^N\}_{k=1}^K$ and $\{g_k\}_{k=1}^K$ are updated by solving the quadratic subproblems (11) and (12), which collect the terms of (10) involving $f$ and $g$, respectively; their Fourier-domain solutions are derived below.

The {c, d}-update step: This step involves projecting $(f_k^N)^{t+1} + (u_k)^t$ (in (13)) and $(g_k)^{t+1} + (v_k)^t$ (in (14)) onto the constraint set, where $\{u_k\}_{k=1}^K$ and $\{v_k\}_{k=1}^K$ are the scaled dual variables associated with the two sets of constraints and $t$ is the ADMM iteration index. First, the entries outside the support ($\mathbb{R}^m$) are mapped to zero (recall that the filters are zero-padded), followed by a projection onto the unit $\ell_2$-norm ball.

Updating the scaled Lagrangian variables: Finally, the scaled Lagrangian variables are updated as

$$(u_k)^{t+1} = (u_k)^t + (f_k^N)^{t+1} - (c_k^N)^{t+1}, \qquad (v_k)^{t+1} = (v_k)^t + (g_k)^{t+1} - (d_k)^{t+1}. \quad (15)$$
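The projection in the {c, d}-update step admits a direct implementation. A minimal sketch, assuming the filter support occupies the top-left m × m corner of the zero-padded array:

```python
import numpy as np

def project_filter(g, m):
    """Projection onto the filter constraint set (the {c, d}-update step):
    zero the entries outside the m-by-m support, then project onto the
    unit l2-norm ball.
    """
    d = np.zeros_like(g)
    d[:m, :m] = g[:m, :m]          # keep only the filter support
    nrm = np.linalg.norm(d)
    return d / nrm if nrm > 1.0 else d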
In the {f, g}-update step, solving problem (11) is equivalent to solving the following optimization problem in the Fourier domain

$$\underset{\{\hat{f}_k^N\}}{\text{minimize}}\ \frac{1}{2}\Big\|\sum_{k=1}^K \hat{x}_k^N \odot \hat{f}_k^N - \hat{s}_N\Big\|_2^2 + \frac{1}{2}\sum_{k=1}^K \big\|\hat{x}_k^N \odot (\hat{g}_k - \hat{f}_k^N)\big\|_2^2 + \frac{\rho}{2}\sum_{k=1}^K \big\|\hat{f}_k^N - \hat{z}_k\big\|_2^2, \quad (16)$$

where $\rho > 0$ is the ADMM penalty parameter and $\hat{z}_k = \hat{c}_k^N - \hat{u}_k$. By equating the derivative of the objective in (16) to zero and using the Sherman-Morrison (SM) formula, the solution to the f-update step is found, for each frequency index $p$, from the diagonal-plus-rank-one system $(D_p + \hat{x}_p^* \hat{x}_p^T)\hat{f}_p = \hat{w}_p$ as

$$\hat{f}_p = D_p^{-1}\hat{w}_p - \frac{D_p^{-1}\hat{x}_p^*\,\big(\hat{x}_p^T D_p^{-1}\hat{w}_p\big)}{1 + \hat{x}_p^T D_p^{-1}\hat{x}_p^*}, \quad (17)$$

where $\hat{x}_p, \hat{f}_p \in \mathbb{C}^K$ collect the $p$-th DFT coefficients of the respective variables, $D_p = \mathrm{diag}(|\hat{x}_p|^2 + \rho)$, and $\hat{w}_p = \hat{x}_p^* \hat{s}_{N,p} + |\hat{x}_p|^2 \odot \hat{g}_p + \rho \hat{z}_p$. Since $D_p$ is diagonal, the f-update step can be carried out with a complexity of $O(KP)$ using (17).
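The SM-based solve in (17) can be vectorized over all frequencies. A minimal numpy sketch of a diagonal-plus-rank-one solver (shapes and names are illustrative); the same routine also applies to the g-update (26) of Algorithm 2:

```python
import numpy as np

def sm_diag_rank1_solve(dg, x, w):
    """Solve (diag(dg_p) + conj(x_p) x_p^T) f_p = w_p for every frequency p
    via the Sherman-Morrison formula: O(K) per frequency, O(KP) overall.

    dg, x, w : (P, K) arrays whose rows index the P DFT frequencies;
    dg must have strictly positive (real) entries.
    """
    w_d = w / dg                                          # D^{-1} w
    u_d = np.conj(x) / dg                                 # D^{-1} conj(x)
    num = np.sum(x * w_d, axis=1, keepdims=True)          # x^T D^{-1} w
    den = 1.0 + np.sum(x * u_d, axis=1, keepdims=True)    # 1 + x^T D^{-1} conj(x)
    return w_d - u_d * (num / den)
```

For (17), dg holds $|\hat{x}_p|^2 + \rho$ and w holds $\hat{w}_p$; only elementwise operations and two inner products per frequency are required.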
Algorithm 1: OCDL Method Proposed in Section III-A.
Problem (12) can be addressed via solving, for each $k$, the following Fourier-domain optimization problem

$$\underset{\hat{g}_k}{\text{minimize}}\ \frac{1}{2}\sum_{n=1}^{N} \big\|\hat{x}_k^n \odot (\hat{g}_k - \hat{c}_k^n)\big\|_2^2 + \frac{\rho}{2}\big\|\hat{g}_k - (\hat{d}_k - \hat{v}_k)\big\|_2^2, \quad (18)$$

where, for $n = N$, $\hat{c}_k^N$ is taken to be the latest $\hat{f}_k^N$. The solution to (18) can be found elementwise as

$$\hat{g}_k = \big(\beta_k^N + \rho(\hat{d}_k - \hat{v}_k)\big) \oslash \big(\alpha_k^N + \rho\big), \quad (19)$$

where $\oslash$ denotes elementwise division and the history arrays $\alpha_k^N \in \mathbb{R}^P$ and $\beta_k^N \in \mathbb{C}^P$, $k = 1, \dots, K$, are defined as

$$\alpha_k^N = \sum_{n=1}^{N} \hat{x}_k^{n*} \odot \hat{x}_k^n, \qquad \beta_k^N = \sum_{n=1}^{N} \hat{x}_k^{n*} \odot \hat{x}_k^n \odot \hat{c}_k^n. \quad (20)$$

The history arrays are incrementally updated using

$$\alpha_k^N = \alpha_k^{N-1} + \hat{x}_k^{N*} \odot \hat{x}_k^N, \quad (21)$$
$$\beta_k^N = \beta_k^{N-1} + \hat{x}_k^{N*} \odot \hat{x}_k^N \odot \hat{f}_k^N. \quad (22)$$

Algorithm 1 summarizes the main steps of the proposed approximate OCDL algorithm detailed in this section. Unit-norm Gaussian-distributed random arrays can be used as the initial dictionary $\{d_k^0\}_{k=1}^K$. At the first iteration, the dictionary $\{d_k\}_{k=1}^K$ can be used to initialize $\{c_k^n\}_{k=1}^K$ and $\{g_k\}_{k=1}^K$. Note that, before each iteration of the ADMM algorithm, $\{\beta_k^n\}_{k=1}^K$ needs to be recalculated using (22) based on the latest values of $\{f_k^n\}_{k=1}^K$.
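A minimal numpy sketch of the elementwise g-update (19) and the history updates (21)-(22), under the reconstruction above (the exact weighting in the paper may differ):

```python
import numpy as np

def g_update_elementwise(alpha, beta, Df, Vf, rho):
    """Sketch of (19): per-filter, per-frequency closed-form update.
    alpha, beta, Df, Vf : (P, K) arrays (frequencies x filters)."""
    return (beta + rho * (Df - Vf)) / (alpha + rho)

def update_alpha_beta(alpha, beta, Xf, Cf):
    """Sketch of (21)-(22): absorb the latest signal's CSRs (Xf) and
    per-signal filters (Cf; within the ADMM loop, the DFTs of f_k^N,
    recomputed from the stored arrays at each iteration)."""
    return alpha + np.abs(Xf) ** 2, beta + np.abs(Xf) ** 2 * Cf
```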

B. Algorithm 2
To improve the performance of the proposed OCDL algorithm, dictionary optimization can be performed exactly for the latest observed signal $s_N$, while the proposed approximation method is used for $\{s_n\}_{n=1}^{N-1}$. Thus, the modified approximate CDL problem is now formulated as

$$\underset{\{d_k\},\{c_k^n\}}{\text{minimize}}\ \frac{1}{2}\Big\|\sum_{k=1}^K d_k * x_k^N - s_N\Big\|_2^2 + \frac{1}{2}\sum_{n=1}^{N-1}\sum_{k=1}^K \big\|(d_k - c_k^n) * x_k^n\big\|_2^2 + \frac{1}{2}\sum_{n=1}^{N}\Big\|\sum_{k=1}^K c_k^n * x_k^n - s_n\Big\|_2^2 + \sum_{k=1}^K \Omega(d_k) + \sum_{n=1}^{N}\sum_{k=1}^K \Omega(c_k^n). \quad (23)$$

The alternating procedure for addressing (23) consists of the following steps.

1) Optimization of $\{d_k\}_{k=1}^K$: Solving (23) with respect to $\{d_k\}_{k=1}^K$ can be addressed using the following ADMM formulation

$$\underset{\{d_k, g_k\}}{\text{minimize}}\ \frac{1}{2}\sum_{n=1}^{N-1}\sum_{k=1}^K \big\|g_k * x_k^n - r_k^n\big\|_2^2 + \frac{1}{2}\Big\|\sum_{k=1}^K g_k * x_k^N - s_N\Big\|_2^2 + \sum_{k=1}^K \Omega(d_k) \quad \text{s.t.}\ g_k = d_k,\ k = 1, \dots, K, \quad (24)$$

where $r_k^n \triangleq c_k^n * x_k^n$. The ADMM iterations consist of the following steps: 1) the g-update step, a convolutional least-squares fitting problem; 2) the d-update step, a projection onto the constraint set (similar to (14)); and 3) the update of the Lagrangian multipliers (similar to (15)). The g-update step requires solving an optimization problem of the form

$$\underset{\{\hat{g}_k\}}{\text{minimize}}\ \frac{1}{2}\sum_{k=1}^K \Big( \hat{g}_k^H \mathrm{diag}(\alpha_k^{N-1})\, \hat{g}_k - 2\,\mathrm{Re}\big(\hat{g}_k^H \beta_k^{N-1}\big) \Big) + \frac{1}{2}\Big\|\sum_{k=1}^K \hat{x}_k^N \odot \hat{g}_k - \hat{s}_N\Big\|_2^2 + \frac{\rho}{2}\sum_{k=1}^K \big\|\hat{g}_k - (\hat{d}_k - \hat{v}_k)\big\|_2^2. \quad (25)$$

Equating the derivative to zero and using the SM formula, optimization problem (25) can be solved, for each frequency index $p$, from the diagonal-plus-rank-one system $(D_p + \hat{x}_p^* \hat{x}_p^T)\hat{g}_p = \hat{w}_p$ as

$$\hat{g}_p = D_p^{-1}\hat{w}_p - \frac{D_p^{-1}\hat{x}_p^*\,\big(\hat{x}_p^T D_p^{-1}\hat{w}_p\big)}{1 + \hat{x}_p^T D_p^{-1}\hat{x}_p^*}, \quad (26)$$

where $D_p = \mathrm{diag}(\alpha_p^{N-1} + \rho\mathbf{1})$, $\hat{w}_p = \beta_p^{N-1} + \rho(\hat{d}_p - \hat{v}_p) + \hat{x}_p^* \hat{s}_{N,p}$, and $\alpha_p^{N-1}, \beta_p^{N-1}$ collect the $p$-th entries of the per-filter history arrays.
Algorithm 2: OCDL Method Proposed in Section III-B.

Input: Training signals $\{s_n\}_{n=1}^N$.
The incremental update rules for $\alpha_k^N$ and $\beta_k^N$ can be found as

$$\alpha_k^N = \alpha_k^{N-1} + \hat{x}_k^{N*} \odot \hat{x}_k^N, \quad (27)$$
$$\beta_k^N = \beta_k^{N-1} + \hat{x}_k^{N*} \odot \hat{x}_k^N \odot \hat{c}_k^N. \quad (28)$$

The g-update (26) can be performed with a complexity of $O(KP)$ using precalculated values of the history arrays and of the inner products involving $\hat{x}_p$. In the modified algorithm, the dictionary $\{c_k^N\}_{k=1}^K$ is optimized only to provide a more accurate approximation of $s_N$ (in comparison with the approximation provided using $\{d_k\}_{k=1}^K$); that is, the quadratic term in (23) that couples the per-signal dictionaries to $\{d_k\}_{k=1}^K$ (the second term) is ignored in the $\{c_k^N\}_{k=1}^K$-optimization step. Here we rely on the fact that the CSRs $\{x_k^N\}_{k=1}^K$ are computed directly using $\{d_k\}_{k=1}^K$; as a result, considering that the approximation is based on $\{x_k^N\}_{k=1}^K$, the resulting $\{c_k^N\}_{k=1}^K$ cannot unfavorably deviate from $\{d_k\}_{k=1}^K$. Problem (23), which now needs to be solved for $\{c_k^N\}_{k=1}^K$ only, is then reduced to the following optimization problem

$$\underset{\{c_k^N\}}{\text{minimize}}\ \frac{1}{2}\Big\|\sum_{k=1}^K c_k^N * x_k^N - s_N\Big\|_2^2 + \sum_{k=1}^K \Omega(c_k^N), \quad (29)$$

which is a CDL problem involving a single training signal and can be addressed using existing CDL methods (e.g., [25]).
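Combining the history arrays (diagonal part) with the latest signal's fitting term (rank-one part), the Algorithm 2 g-update can be sketched as follows; all names and the exact weighting are assumptions consistent with the reconstruction of (25)-(26) above:

```python
import numpy as np

def alg2_g_update(alpha, beta, Xf, Sf, Df, Vf, rho):
    """Sketch of the Algorithm 2 g-update (25)-(26): history term (diagonal)
    plus the latest signal's exact fitting term (rank-one), solved per
    frequency via Sherman-Morrison in O(KP).

    alpha, beta, Xf, Df, Vf : (P, K); Sf : (P,); rho : ADMM penalty.
    """
    dg = alpha + rho                                        # diagonal part
    w = beta + rho * (Df - Vf) + np.conj(Xf) * Sf[:, None]  # right-hand side
    w_d, u_d = w / dg, np.conj(Xf) / dg
    den = 1.0 + np.sum(Xf * u_d, axis=1, keepdims=True)
    return w_d - u_d * (np.sum(Xf * w_d, axis=1, keepdims=True) / den)
```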
The main steps of the presented approximate OCDL algorithm are summarized in Algorithm 2. The optimization of the dictionaries $\{d_k\}_{k=1}^K$ and $\{c_k^n\}_{k=1}^K$ (lines 3 and 4) can be initialized using the existing $\{d_k\}_{k=1}^K$.

C. Memory Requirements and Computational Complexity
The largest arrays used in the proposed algorithms are of size $KP$. The most computationally expensive steps, the updates (17) and (26), both have a complexity of $O(KP)$, which is dominated by the cost of the DFTs, $O(KP\log P)$, when computed using the fast Fourier transform. Thus, the computational complexity of the proposed algorithms is of the order of $KP$ (up to a logarithmic factor), performed sequentially $N$ times, once for each signal in the training dataset.
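As an illustration, consider images of size 382 × 586 ($P \approx 2.24 \times 10^5$ pixels, the size of the RGB-NIR images used in Section IV) and $K = 100$ filters: arrays of size $KP$ then hold about $2.2 \times 10^7$ entries, whereas the $K^2P$-sized history arrays of the exact OCDL methods [27], [28] hold about $2.2 \times 10^9$ entries, a factor of $K = 100$ more.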

IV. EXPERIMENTAL RESULTS

A. Compared Methods
The performance of the proposed algorithms is benchmarked against the following state-of-the-art OCDL methods:
- OCSC: the ADMM-based OCDL method of [27], which uses the iterative Sherman-Morrison formula for updating the history arrays;
- FISTA: the FISTA-based OCDL method of [28], which uses gradients calculated in the Fourier domain.
In addition, we compare the OCDL methods with the following batch-CDL algorithm:
- ADMM-cns: the batch-CDL method of [25], which is based on consensus ADMM.

B. Datasets
The experiments are conducted using the following six image datasets:
- Fruit and City: two small datasets, each composed of 10 images of size 100 × 100. These datasets are typically used as benchmarks for CSC and CDL [20], [21], [27].
- SIPI and Flickr: two larger image datasets, used in the experiments of Section IV-F with dictionary sizes K = 80 and K = 100, respectively.
- Flickr-large: a larger version of the Flickr dataset containing 1000 training images, used in the scalability experiment of Section IV-H.
- RGB-NIR: a dataset composed of 10 pairs of multimodal visible-light (VL) and near-infrared (NIR) images of size 382 × 586, taken from the EPFL RGB-NIR scene dataset (https://www.epfl.ch/labs/ivrl/research/downloads/rgb-nirscene-dataset/). The RGB-NIR dataset is used for the VL-NIR imaging experiment presented in Section IV-I.
The images are converted to greyscale, and the 8-bit pixel values are normalized to the range 0-1 by dividing by 255. As the CSC model cannot effectively handle low-frequency signal components, it is common practice to use high-pass filtered images for CDL [16], [26], [28]. In the experiments, the low-frequency components of all images are removed using the lowpass function of the SPORCO toolbox [30] with a regularization parameter of 5.
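For reference, the Tikhonov-type lowpass decomposition performed by SPORCO's lowpass function can be sketched in a few lines of numpy. This is a schematic stand-in under circular boundary conditions, not SPORCO's exact code:

```python
import numpy as np

def lowpass_decompose(s, lmbda=5.0):
    """Tikhonov lowpass/highpass decomposition of a 2-D image s:
    solves  min_sl ||sl - s||^2 + lmbda * ||grad(sl)||^2  in the DFT
    domain and returns the lowpass and highpass components."""
    H, W = s.shape
    gx = np.zeros((H, W)); gx[0, 0], gx[0, -1] = -1.0, 1.0   # horizontal diff
    gy = np.zeros((H, W)); gy[0, 0], gy[-1, 0] = -1.0, 1.0   # vertical diff
    denom = 1.0 + lmbda * (np.abs(np.fft.fft2(gx)) ** 2
                           + np.abs(np.fft.fft2(gy)) ** 2)
    sl = np.real(np.fft.ifft2(np.fft.fft2(s) / denom))
    return sl, s - sl                                        # low, high
```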

C. Implementation Details
The proposed algorithms employ the unconstrained convolutional sparse approximation method of [25]. We use the publicly available implementations of the compared methods, as provided by their respective authors. In all ADMM-based algorithms (both sparse approximation and dictionary learning), the maximum number of iterations is set to 300, and the stopping criteria discussed in [31, Subsection 3.3], with absolute and relative tolerance values of 10^-4, are used. We use dictionary filters of size 8 × 8 in all experiments. All ADMM-based algorithms except OCSC use the ADMM extensions of over-relaxation [31, Subsection 3.4.3] and varying penalty parameter [31, Subsection 3.4.1], with initial penalty parameter ρ = 10 (the same parameters are used in all methods). The OCSC method incorporates the ADMM penalty parameter ρ into its history arrays; thus, it cannot use the varying penalty parameter extension. For the OCSC and FISTA methods, we use the default parameters set by the authors (the stopping criteria of the OCSC method are modified to be uniform with the other ADMM-based algorithms compared).
In all experiments, we use λ = 0.1 λ_max, where λ_max is the smallest value of λ for which the solution of the convolutional sparse approximation problem (1) is all-zero; it can be obtained as the ℓ∞-norm of the gradient of the objective of (1) evaluated at $\{x_k\}_{k=1}^K = 0$. The value of λ_max is calculated only once, using the first image of each training dataset. Table I reports the values of λ_max calculated for the different training datasets used in the experiments.
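Since the gradient of the quadratic term of (1) at $\{x_k\} = 0$ is the correlation of each filter with the signal, λ_max is the largest absolute correlation value. A minimal sketch (Python/numpy, circular boundaries assumed; names illustrative):

```python
import numpy as np

def lambda_max(d, s):
    """Smallest lambda for which (1) has an all-zero solution: the l-inf
    norm of the gradient of the quadratic term at x = 0, i.e. the largest
    absolute filter/signal correlation.

    d : (K, m, m) filters, s : (H, W) signal."""
    H, W = s.shape
    Df = np.fft.fft2(d, s=(H, W))                       # zero-padded filters
    grad = np.fft.ifft2(np.conj(Df) * np.fft.fft2(s))   # correlations
    return float(np.abs(grad).max())
```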
All algorithms are implemented in MATLAB. All experiments are performed on a PC equipped with an Intel(R) Core(TM) i5-8365U 1.60 GHz CPU and 16 GB of memory.

D. Comparison Criteria
The effectiveness of the CDL algorithms is typically evaluated based on the objective values of the convolutional sparse approximation problem (1), averaged over the entire test datasets [27], [28], [32]; a lower objective value indicates better performance. For the small datasets Fruit and City, since there is no test data, the average training objective values are reported to compare the effectiveness of the optimization algorithms [20]. The ability of the OCDL algorithms to extract (learn) visual features is assessed using visualizations of the dictionary filters. The efficiency of the algorithms is measured using the training times.

E. Small Datasets Fruit and City
Fig. 1 shows the images in the small datasets Fruit and City. Tables II and III report the average training objective values and the training times obtained using the methods tested for these two datasets. To facilitate comparison, the results are also presented as bar plots in Fig. 2. The experiments on the Fruit and City datasets are performed using dictionary size K = 64. As can be observed, the ADMM-cns batch CDL algorithm yields the lowest objective function values. However, as mentioned earlier, this method is not suitable for large datasets. The proposed methods produce objective values comparable to the other OCDL algorithms tested. In particular, Algorithm 2 (Proposed-2) attains the smallest objective for the Fruit dataset among all OCDL algorithms. For the City dataset, the OCSC method has the lowest objective among the OCDL methods (slightly better than that of Proposed-2) but a longer training time. As shown in Tables II and III, the proposed algorithms result in substantially shorter training times, especially Algorithm 2, which is noticeably faster than Algorithm 1.
Fig. 3 presents a comparison of the average number of (ADMM and FISTA) iterations used for dictionary learning on the Fruit and City datasets by the OCDL methods compared. Specifically, for the Proposed-2 method, we report the cumulative count of ADMM iterations used for optimizing the dictionaries $\{d_k\}_{k=1}^K$ and $\{c_k^n\}_{k=1}^K$ (lines 3 and 4 in Algorithm 2). As explained in Section IV-C, the OCSC method incorporates the ADMM penalty parameter ρ into its history arrays. Consequently, it cannot take advantage of the varying penalty parameter extension of ADMM, which may account for its comparatively slower convergence. The average numbers of iterations used by the proposed methods and the FISTA method fall within a similar range. However, since the proposed methods are significantly more computationally efficient, they result in substantially shorter training times.

F. Datasets SIPI and Flickr
Fig. 4 depicts 10 images randomly selected from the SIPI and Flickr datasets. The experiments on the SIPI dataset are carried out using a dictionary size of K = 80, and a dictionary size of K = 100 is used for the experiments on the Flickr dataset. The average test objective values and the training times obtained using all methods tested on these two datasets are reported in Tables IV and V and displayed as bar charts in Fig. 5.
As can be seen in Tables IV and V, the ADMM-cns method achieves the lowest test objective values. However, its advantage over the OCDL methods is less pronounced than in the experiments on the small datasets Fruit and City. Specifically, in the experiments on the larger dataset Flickr, ADMM-cns performs only slightly better than FISTA and Proposed-2, while requiring the longest training time. Among the OCDL methods, FISTA attains the smallest test objective on Flickr, although it takes the longest training time. The proposed methods yield test objective values comparable to the other OCDL methods while substantially shortening the training time. In particular, Algorithm 2 has the smallest objective among all OCDL algorithms for the SIPI dataset. In Fig. 6, we visually compare images reconstructed using the different methods. As can be observed, the proposed approximate methods yield reconstructions of comparable accuracy to those obtained using the state-of-the-art algorithms.

G. Learning Large Dictionaries
In this experiment, we use the proposed algorithms to learn large dictionaries of sizes K = 200, K = 300, and K = 400 on the Flickr dataset. Learning such large dictionaries over images of the size of those in Flickr is not practically feasible using the OCDL methods OCSC and FISTA. Indeed, in single precision, for K = 200, only the larger history array of these methods, which is of size K²P, would require more than 10 gigabytes of memory and thus a specialized computer, which is not desirable. The learned large dictionaries are visualized in Fig. 7. It can be seen that all learned dictionaries are mostly composed of visually valid features. The obtained training times are reported in Table VI and Fig. 8. As can be seen, the longest training times obtained using the proposed methods are still significantly shorter than those resulting from using the other methods tested for learning smaller dictionaries (see Table V, for example).

H. CDL Over a Large Dataset
In this section, we demonstrate the scalability of the proposed algorithms using the Flickr-large dataset (with 1000 training images). Dictionaries composed of K = 100 filters are used in this experiment. Fig. 9 shows the average test objective values obtained using the learned dictionaries after processing 1, 10, 100, and 1000 images. The results show that both proposed algorithms are applicable to large training datasets; however, Algorithm 2 leads to considerably lower objective values.

I. VL-NIR Imaging
This section assesses the efficacy of the OCDL algorithms in an image processing task involving the fusion of VL and NIR images. NIR images are known for their high contrast resolution, particularly when capturing vegetation scenes or when imaging in low-visibility atmospheric conditions such as fog or haze. They are therefore used to enhance outdoor VL images.
In recent work, we proposed a VL-NIR fusion method based on OCDL [33]. This method captures correlated features in NIR and VL images as pairs of convolutional filters in two dictionaries, which are learned in a coupled manner via CSRs with identical supports, also known as simultaneous CSC.
In this experiment, we utilize the proposed OCDL method and the FISTA method to perform convolutional coupled feature learning (CCFL) using the method of [33] and compare their performance. Specifically, to perform simultaneous CSC in the method of [33], the ℓ1-norm and ℓ1,2-norm regularization parameters are set to 0.001 and 0.005, respectively. The parameter settings of the proposed OCDL method and the FISTA method are as described in Section IV-C.
Our results demonstrate a significant improvement in terms of training time, as CCFL over the RGB-NIR dataset using the proposed method (Algorithm 2) took only 640 seconds compared to 2547 seconds for the FISTA method.
The learned coupled VL-NIR dictionaries are presented in Fig. 10, where correlations between the corresponding filters (coupled VL-NIR features) are clearly visible.An example of VL-NIR image fusion using the proposed OCDL algorithm is depicted in Fig. 11, revealing a noticeable improvement in contrast resolution and visibility range.

V. CONCLUSION AND DISCUSSION
An efficient approximate method for CDL has been presented. The proposed method is based on a novel formulation of the CDL problem that incorporates approximate sparse decompositions of the training samples. We have developed two computationally efficient OCDL algorithms based on ADMM to address the proposed approximate CDL problem. The proposed OCDL algorithms substantially reduce the required memory and improve on the computational complexity of the state-of-the-art OCDL algorithms. Extensive experimental evaluations using multiple image datasets, as well as an OCDL-based VL-NIR image fusion task, have demonstrated the effectiveness of the proposed algorithms.
It is worth mentioning that one advantageous feature of ADMM (as used in the proposed computationally efficient OCDL algorithms) is its ability to leverage distributed optimization. As an illustrative scenario, within the proposed Algorithm 2, the convolutional sparse approximation step (line 2 in Algorithm 2) can be executed in parallel for multiple input data samples. Subsequently, the d-update step (line 3 in Algorithm 2) can be carried out efficiently using consensus ADMM, a distributed implementation of ADMM. This approach can optimize the dictionary based on input from multiple sources.
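For illustration only, a schematic consensus-ADMM global step is sketched below; it indicates how local filter estimates from multiple workers could be fused and is not the paper's algorithm (all names are illustrative):

```python
import numpy as np

def consensus_d_update(G, U, project):
    """Schematic consensus-ADMM global step for fusing local filter
    estimates from several workers.

    G, U    : lists of local filter arrays and scaled dual variables
    project : projection onto the filter constraint set
    """
    d = project(np.mean([g + u for g, u in zip(G, U)], axis=0))
    U = [u + g - d for g, u in zip(G, U)]       # dual update per worker
    return d, U
```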
Regarding strategies for reducing a possible performance gap between exact (but computationally and memory intensive) OCDL and the proposed efficient approximate OCDL methods, one approach is to repeat the inner steps of the proposed algorithms (lines 2-4 of Algorithm 1, for example) for a certain number of iterations. In our extensive simulations, however, we did not observe a noticeable performance gap between the exact and approximate OCDL methods.

Fig. 2. Comparison of training objective values and training times obtained using all methods compared for datasets Fruit (top) and City (bottom).

Fig. 3. Comparison of the average number of (ADMM and FISTA) iterations of the OCDL algorithms compared for datasets Fruit (left) and City (right).

Fig. 5. Comparison of test objective values and training times obtained using all methods compared for datasets SIPI (top) and Flickr (bottom).

Fig. 6. Images from datasets SIPI (top) and Flickr (bottom) and their reconstructions obtained using different CDL methods.

Fig. 8. Comparison of training times obtained using the proposed algorithms on the Flickr dataset for learning dictionaries of different sizes.

Fig. 9. Results for CDL on the Flickr-large dataset using the proposed algorithms: average test objective values over the number of processed training images (left) and training time (right).

TABLE I. λ_max values used for the different datasets.

TABLE II. Average training objective values and training times obtained using the methods compared for dataset Fruit.

TABLE III. Average training objective values and training times obtained using the methods compared for dataset City.

TABLE IV. Average test objective values and training times obtained using the methods compared for dataset SIPI.

TABLE V. Average test objective values and training times obtained using the methods compared for dataset Flickr.

TABLE VI. Training times (seconds) used by the proposed methods for dataset Flickr.