
SECTION I

In recent years, deep learning has received considerable attention from both academia and industry due to its excellent performance on many practical problems. Deep belief nets (DBNs) built from stacked restricted Boltzmann machines (RBMs) [1], [2] are one of the most important multi-layer network architectures in deep learning. DBNs are generative models that are trained to extract a deep hierarchical representation of the input data by maximizing the likelihood of the training data. To learn a DBN, the weights and biases of each RBM level are first initialized using greedy layer-wise unsupervised training [3], and all the weights and biases of the global net are then fine-tuned using a (supervised) back-propagation algorithm [4].

Although DBNs have shown great potential in various applications, such as image and object recognition [1], [2], [5], speech and phone recognition [6]–[8], information retrieval [9], and human motion modeling [10], the current sequential implementations of both the RBM training and the back-propagation-based fine-tuning limit their application to large-scale datasets because of the memory-demanding and time-consuming computation. Scalable and efficient learning on emerging big data requires distributed computing for RBMs and DBNs.

MapReduce is a programming model introduced by Google [11] for processing massive datasets. It is typically used for parallel computing in a distributed environment over a large number of computation nodes. MapReduce has been implemented in several systems; one of the most powerful implementations is Apache Hadoop [12], a popular free, open-source software framework. In addition to providing high data throughput, the Hadoop system not only automatically manages data partitioning, inter-computer communication, and MapReduce task scheduling across clusters of computers, but also handles machine failures with a high degree of fault tolerance. With a suitable configuration of the Hadoop ecosystem for the problem at hand, all users need to do is design a *master* controller and provide a *Map* function and a *Reduce* function. Nevertheless, Hadoop does not easily support the iterative processing that is common in machine learning algorithms.

To make DBNs amenable to large-scale datasets stored on computer clusters, this paper develops a distributed learning paradigm for DBNs with MapReduce. We design proper *key-value* pairs for each RBM level, and the pre-training is achieved via layer-wise distributed learning of RBMs in the MapReduce framework. Subsequently, the fine-tuning is done with a distributed back-propagation algorithm based on MapReduce. In particular, mrjob [13] is used in the implementation to automatically run multi-step MapReduce jobs, which gives Hadoop a way to perform the iterative computing required during both the training of RBMs and the back-propagation. Thus, the distributed learning of DBNs is accomplished by stacking a series of distributed RBMs for pre-training and running a distributed back-propagation for fine-tuning.

Recently, increasing attention to massive data and large-scale network architectures has driven parallel implementations of deep learning techniques. Locally connected neural networks [14] and convolution-like neural networks [15] have been successfully parallelized on computer clusters. Different from these works, this paper explores the performance of deep neural networks with unsupervised pre-training in distributed settings. In addition, parallel processing of deep unsupervised learning models, such as stacked RBMs and sparse coding, using graphics processing units (GPUs) has been discussed in [16], but the use of GPUs may reduce model performance and is hardly scalable to big data due to limited memory (typically less than 6 gigabytes). Conversely, our work enjoys the high data throughput inherent in the MapReduce framework. To the best of our knowledge, this is the first work with implementation details for parallelizing RBMs and DBNs with the MapReduce framework. To leverage data parallelism, we also propose a modified mini-batch approach for updating parameters.

The remainder of the paper is organized as follows. Section II provides basic background on MapReduce, RBMs, and DBNs. Section III elaborates the developed scheme for distributed RBMs and DBNs based on MapReduce. Experiments and evaluation results on benchmark datasets are given in Section IV with respect to accuracy and scalability. Finally, the paper is concluded in Section V.

SECTION II

In this section, we give a brief introduction to MapReduce and DBNs.

MapReduce provides a programming paradigm for performing distributed computation on computer clusters. Fig. 2 gives an overview of the MapReduce framework. In a MapReduce system such as Hadoop, the user program forks a *Master controller* process and a series of Map tasks (*Mappers*) and Reduce tasks (*Reducers*) on different computers (nodes of a cluster). The Master's responsibilities include creating some number of Mappers and Reducers and keeping track of the status of each one (executing, complete, or idle).

The computation in one MapReduce job consists of two phases: a Map phase and a Reduce phase. In the Map phase, the input dataset (stored in a distributed file system, e.g., HDFS) is divided into a number of disjoint subsets, which are assigned to Mappers as ${<}{\rm key},~{\rm value}{>}$ pairs. In parallel, each Mapper applies the user-specified map function to each input ${<}{\rm key},~{\rm value}{>}$ pair and outputs a set of intermediate ${<}{\rm key},~{\rm value}{>}$ pairs, which are written to the local disks of the map computers. The underlying system passes the locations of these intermediate pairs to the Master, which is responsible for notifying the Reducers of these locations. In the Reduce phase, once the Reducers have remotely read all intermediate pairs, they sort and group them by the intermediate keys. Each Reducer then iteratively invokes a user-specified reduce function to process all the values for each unique key and generate a new value for each key. The resulting ${<}{\rm key},~{\rm value}{>}$ pairs from all the Reducers are collected as the final results, which are then written to an output file. In a MapReduce system, all the map tasks (and all the reduce tasks) are executed fully in parallel, so a high level of parallelism can be achieved for data processing through the MapReduce model. In recent years, several parallel learning algorithms [17]–[23] have used the MapReduce framework for efficient implementation.
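The map/shuffle/reduce flow described above can be illustrated with a small self-contained simulation (plain Python, not an actual Hadoop job; the function names are ours, and word count stands in for a real workload):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job: map, shuffle/sort by key, then reduce."""
    # Map phase: apply the map function to every input <key, value> pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group intermediate pairs by key (the framework's job).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per unique intermediate key.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

# Classic word count expressed as map and reduce functions.
def mapper(_, line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return word, sum(counts)

result = run_mapreduce([(0, "deep belief nets"), (1, "deep nets")], mapper, reducer)
# result == {'belief': 1, 'deep': 2, 'nets': 2}
```

In the real framework the map calls run on different nodes and the shuffle happens over the network, but the dataflow contract is exactly this.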

An RBM is composed of an input (visible) layer and a hidden layer with an array of connection weights between the input and hidden neurons but no connections between neurons of the same layer. Fig. 2(b) illustrates the undirected graphical network of an RBM.

Rooted in probabilistic modeling, an RBM is also a particular type of energy-based model. Consider an RBM with input layer ${\rm x}$ and hidden layer ${\rm h}$; the energy function of the pair of observed and hidden variables is bilinear (all vectors in this paper are column vectors): $$\mathrm{Energy}({\rm x},{\rm h})=-{\rm b}^{T}{\rm x}-{\rm c}^{T}{\rm h}-{\rm h}^{T}{\rm W}{\rm x},\tag{1}$$ where the vectors ${\rm b}$ and ${\rm c}$ are the biases of the input layer and the hidden layer, respectively, and the matrix ${\rm W}$ holds the fully connected weights between the two layers. The input distribution is then tractable as $$P({\rm x})=\sum_{\rm h}P({\rm x},{\rm h})=\sum_{\rm h}\frac{e^{-\mathrm{Energy}({\rm x},{\rm h})}}{Z},$$ where $Z$ is the partition function. By introducing the free energy $$\mathrm{freeE}({\rm x})=-\log\sum_{\rm h}e^{-\mathrm{Energy}({\rm x},{\rm h})}=-{\rm b}^{T}{\rm x}-\sum_{i}\log\sum_{{\rm h}_{i}}e^{{\rm h}_{i}^{T}({\rm c}_{i}+{\rm W}_{i}{\rm x})},\tag{2}$$ the input likelihood can be expressed more compactly as $$P({\rm x})=\frac{e^{-\mathrm{freeE}({\rm x})}}{\tilde{Z}},\tag{3}$$ where $\tilde{Z}=\sum_{\rm x}e^{-\mathrm{freeE}({\rm x})}$.

According to [1] and [24], the conditional distribution of an RBM factorizes due to the lack of input-input and hidden-hidden connections; that is, the calculation of the conditional distribution decomposes over single nodes as $P({\rm h}\,\vert\,{\rm x})=\prod_{i}P({\rm h}_{i}\,\vert\,{\rm x})$. In the binary case, where each hidden node takes the value zero or one, the probability that a hidden node equals one is a sigmoid function of the input: $$P({\rm h}_{i}=1\,\vert\,{\rm x})=\mathrm{sigmoid}({\rm c}_{i}+{\rm W}_{i}{\rm x})=\frac{1}{1+e^{-({\rm c}_{i}+{\rm W}_{i}{\rm x})}}.\tag{4}$$ The essential goal of training is for the hidden random variables to preserve the distribution of the input data as much as possible, i.e., to find the optimal parameters ${\mmb{\Theta}}=\{{\rm W},{\rm b},{\rm c}\}$ that maximize the input likelihood. Using gradient descent, the parameters are iteratively updated in proportion to the gradient of the log-likelihood $$\frac{\partial\log P({\rm x})}{\partial{\mmb{\Theta}}}=-\frac{\partial\,\mathrm{freeE}({\rm x})}{\partial{\mmb{\Theta}}}+\sum_{\tilde{\rm x}}P(\tilde{\rm x})\frac{\partial\,\mathrm{freeE}(\tilde{\rm x})}{\partial{\mmb{\Theta}}},\tag{5}$$ where $\tilde{\rm x}$ denotes the reconstructed ${\rm x}$. One commonly used updating rule to train an RBM with an approximate data log-likelihood gradient is contrastive divergence (CD) [25]. In CD, the second term in (5), which statistically represents an average over all possible inputs, is replaced with a single term, since the iteration itself performs the averaging. The gradient can thus be written as $$\Delta{\mmb{\Theta}}\approx-\frac{\partial\,\mathrm{freeE}({\rm x})}{\partial{\mmb{\Theta}}}+\frac{\partial\,\mathrm{freeE}(\tilde{\rm x})}{\partial{\mmb{\Theta}}}.\tag{6}$$ One can run a Markov chain Monte Carlo (MCMC) chain to obtain the input reconstructed by the model. $K$-step CD takes the input ${\rm x}$ as the initial state ${\rm x}_{1}$ and runs the chain for $k$ steps, ${\rm x}_{1},{\rm x}_{2},\ldots,{\rm x}_{k+1}$, reconstructing the input with the learned model at each step. Although a longer MCMC chain promises better performance, it comes at a higher computational cost. Note that a small value of $k$ normally suffices for a good result, even $k=1$.
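As a concrete illustration of (4) and (6), a minimal CD-1 update for a binary RBM can be sketched in NumPy (a sketch only; the variable names, the toy dimensions, and the batch-averaging convention are ours, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr=0.1):
    """One CD-1 step for a binary RBM on a mini-batch x (batch_size x n_visible)."""
    # Positive phase: P(h = 1 | x) as in (4), then sample binary hidden states.
    p_h = sigmoid(c + x @ W.T)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Negative phase: reconstruct the input, then recompute the hidden probabilities.
    p_x_tilde = sigmoid(b + h @ W)
    p_h_tilde = sigmoid(c + p_x_tilde @ W.T)
    # Approximate gradients from (6), averaged over the batch.
    n = x.shape[0]
    dW = (p_h.T @ x - p_h_tilde.T @ p_x_tilde) / n
    db = (x - p_x_tilde).mean(axis=0)
    dc = (p_h - p_h_tilde).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc

# Tiny example: 4 samples, 6 visible units, 3 hidden units.
x = (rng.random((4, 6)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((3, 6))
b, c = np.zeros(6), np.zeros(3)
W, b, c = cd1_update(x, W, b, c)
```

Repeating this update over mini-batches is exactly the per-mapper workload that Section III distributes.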

As building blocks for a deeper architecture, single RBMs are stacked on top of one another, each taking the output of the previous RBM as its input once the parameters of that RBM have been learned. Fig. 2 gives an illustration of a DBN with stacked RBMs.

In DBNs, since parameters learned in earlier RBMs might not be optimal for the ones learned afterwards, label information is used to improve the discriminative power. Hinton *et al.* [1] proposed to integrate the label information into the input of the top two layers and fine-tune the stacked RBMs with a contrastive version of the “wake-sleep” algorithm, which performs a bottom-up pass followed by a top-down pass. In our view, this process is tedious and its efficiency is not guaranteed. It is more straightforward to put the label layer on top as the output layer and fine-tune the parameters in all layers as in a conventional multilayer perceptron (MLP) [2]. Therefore, the distributed implementation of stacked RBMs in this paper is conducted on the basis of the MLP structure.

SECTION III

This section describes the main design of distributed RBMs and DBNs using MapReduce. The key is to design both a Map function and a Reduce function with proper input/output key-value pairs for the MapReduce jobs.

Given an input dataset ${\cal D}=\left\{{{{\rm x}_{i}}\vert i=1,2,\ldots,N}\right\}$, the goal of training an RBM is to learn the weights ${\rm W}$ and the biases ${\rm b}$ and ${\rm c}$. In general, an iterative procedure with a number of epochs is necessary to reach convergence. In the case of a distributed RBM with MapReduce, one MapReduce job is required in every epoch. In this paper, we automate the execution flow of multiple MapReduce jobs with the help of the mrjob [13] framework, which enables the design of multi-step MapReduce jobs.

Since Gibbs sampling requires substantial matrix-matrix multiplication, it dominates the computation time during the training of an RBM. Hence, parallelizing Gibbs sampling over different data subsets in the Map phase improves efficiency. Procedure 1 outlines the pseudocode for the distributed RBM. First, some variables are initialized, such as the numbers of neurons in the visible and hidden layers, the weight ${\rm W}$, the input layer bias ${\rm b}$, the hidden layer bias ${\rm c}$, the number of epochs to run (e.g., $T$), and the hyper-parameters (e.g., learning rate and momentum factor). Then both the Map phase and the Reduce phase are repeated $T$ times. In each epoch, each mapper performs Gibbs sampling to compute the approximate gradients of ${\rm W}$, ${\rm b}$ and ${\rm c}$, and the reducer updates them with the calculated increments. (The details of the Map phase and the Reduce phase are provided in the following subsections.) It is noteworthy that the format of the key-value pairs emitted by the reducer must match that of the mapper's input, so that the output of the reducer can serve as the input of the mapper in the next epoch.

For each mapper, the corresponding mapper ID (a number) serves as the input *key*, and the input *value* is a list of values. Each value has two elements: the first is a string (e.g., $^{\prime}{W}^{\prime}$) identifying the type of the value, and the second is the corresponding data (e.g., an $M\times N$ matrix if the first element is $^{\prime}{W}^{\prime}$). In every epoch except the first, the *value* is the output of the reducer in the previous epoch, that is, the updated ${\rm W}$, ${\rm b}$ and ${\rm c}$ and their accumulated approximate gradients.

The input dataset ${\cal D}$ is divided into a number of disjoint subsets, which are stored as a sequence of files (blocks) on the Hadoop Distributed File System (HDFS). After reading all of its key-value pairs, each mapper loads one subset from HDFS into memory. With this information, each mapper can compute the approximate gradients of the weight and biases by going through all the mini-batches of its subset of the training dataset. Each mapper emits three types of intermediate *keys*: $delta{\_}{\rm W}$, $delta{\_}{\rm b}$ and $delta{\_}{\rm c}$, which represent the increments of ${\rm W}$, ${\rm b}$ and ${\rm c}$, respectively. The corresponding intermediate *value*s have three elements: the type ($delta{\_}{\rm W}$, $delta{\_}{\rm b}$ or $delta{\_}{\rm c}$), the corresponding increment, and the current epoch index.

Procedure 2 provides the pseudocode for the map function executed by each mapper. Step 1 retrieves the parameter values, where $t\in [1,T]$ is the epoch index. Steps 2–7 go through each data batch to compute the approximate gradients of the weight and the biases and to update their increments. Finally, the intermediate key-value pairs are emitted as the output.
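The mapper's control flow just described can be sketched as follows (plain Python/NumPy; this is our own illustration of the structure, not the paper's Procedure 2 verbatim — the key-value layout and the simplified deterministic CD step inside `approx_gradients` are our assumptions):

```python
import numpy as np

def approx_gradients(x, W, b, c):
    """One deterministic mean-field CD step (a simplification of CD-1)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_h = sig(c + x @ W.T)          # positive-phase hidden probabilities
    x_t = sig(b + p_h @ W)          # reconstructed input
    p_h_t = sig(c + x_t @ W.T)      # negative-phase hidden probabilities
    n = x.shape[0]
    return (p_h.T @ x - p_h_t.T @ x_t) / n, (x - x_t).mean(0), (p_h - p_h_t).mean(0)

def rbm_mapper(mapper_id, value, batches, lr=0.1):
    """Emit ('delta_W'|'delta_b'|'delta_c', (increment, epoch)) for one data subset."""
    params = dict(value)            # [('W', ...), ('b', ...), ('c', ...), ('epoch', t)]
    W, b, c, t = params['W'], params['b'], params['c'], params['epoch']
    dW, db, dc = np.zeros_like(W), np.zeros_like(b), np.zeros_like(c)
    for x in batches:               # mini-batches of this mapper's subset
        g_W, g_b, g_c = approx_gradients(x, W, b, c)
        dW += lr * g_W
        db += lr * g_b
        dc += lr * g_c
    yield 'delta_W', (dW, t)
    yield 'delta_b', (db, t)
    yield 'delta_c', (dc, t)

# Example invocation on a toy subset.
W, b, c = np.zeros((3, 6)), np.zeros(6), np.zeros(3)
value = [('W', W), ('b', b), ('c', c), ('epoch', 1)]
pairs = list(rbm_mapper(0, value, batches=[np.zeros((4, 6))]))
```

Each emitted pair carries the epoch index so that the reducer can re-emit records in the format the next epoch's mappers expect.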

For the training of an RBM, there are three reducers in the ideal case. Each reducer reads as input one type (i.e., $delta{\_}{\rm W}$, $delta{\_}{\rm b}$ or $delta{\_}{\rm c}$) of intermediate key-value pairs, and applies the reduce function to first calculate the increment and then update the parameter. The reducer uses the mapper ID as the output *key*, and the resulting increment and the updated parameter as the output *value*.

Procedure 3 gives the pseudocode for the reduce function executed by each reducer. Steps 1–10 process the weight: Step 2 retrieves the current weight, the epoch index, and a list of approximate gradients for the weight; Steps 3–4 compute the weight increment and update the weight; and Steps 5–10 either save the learned weight, if it is the final epoch, or increase the epoch index and emit the key-value pairs back to the mappers. In a similar way, Steps 11–19 and Steps 20–28 process the input layer bias and the hidden layer bias, respectively.
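The reducer's aggregation-and-update logic can be sketched for the weight key as follows (a sketch under our assumptions: the momentum handling and the exact record layout are ours, following the description above; the bias keys are analogous):

```python
import numpy as np

def rbm_reducer(key, values, W, momentum=0.5, prev_inc=None):
    """Combine one intermediate key's gradients and update the parameter.

    key: 'delta_W' here; values: the (increment, epoch) pairs emitted by
    all mappers for this key.
    """
    increments = [inc for inc, _epoch in values]
    epoch = values[0][1]
    # Sum the per-subset increments; a momentum term smooths successive updates.
    inc = sum(increments)
    if prev_inc is not None:
        inc = momentum * prev_inc + inc
    W_new = W + inc
    # Re-emit in the mapper's input format so the next epoch can consume it.
    return 0, [('W', W_new), ('inc_W', inc), ('epoch', epoch + 1)]

# Example: two mappers each contribute a unit increment in epoch 3.
W = np.zeros((2, 2))
values = [(np.ones((2, 2)), 3), (np.ones((2, 2)), 3)]
out_key, out_value = rbm_reducer('delta_W', values, W)
```

Note how the output mirrors the mapper's input format, which is the property the paper highlights as necessary for chaining epochs.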

Consider a DBN with $H$ hidden layers. The training of this distributed DBN consists of learning $H$ distributed RBMs for the pre-training and running one distributed back-propagation algorithm for fine-tuning the global network. In addition, a main controller is required to manage the entire learning process.

The bottom-level RBM is trained in the same way as described in Section III-A. Training the remaining RBM levels is also similar, except that the input data change accordingly. The input data for the $l$th $(H\geq l>1)$ level RBM are the conditional probabilities of the hidden nodes computed in the $(l-1)$th level RBM, that is, $$\cases{P\left({{{\rm h}_{1}}\vert{\rm x}}\right),&{\rm when}~${l=2}$;\cr P\left({{{\rm h}_{l-1}}\vert{{\rm h}_{l-2}}}\right),&{\rm when}~${H\geq l>2}$.}\tag{7}$$ Thus, the details of both the map function and the reduce function are omitted here.
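Producing the next level's input per (7) amounts to one sigmoid feed-forward pass through the trained RBM, which could be sketched as follows (our own illustration; the function name and toy dimensions are assumptions):

```python
import numpy as np

def next_level_input(x, W, c):
    """P(h = 1 | x) of a trained RBM, used as the input of the next-level RBM."""
    return 1.0 / (1.0 + np.exp(-(c + x @ W.T)))

# Example: feed a 4-sample batch through a (toy, untrained) 6-3 RBM.
x = np.random.default_rng(1).random((4, 6))
W = np.zeros((3, 6))
c = np.zeros(3)
h = next_level_input(x, W, c)   # shape (4, 3); all 0.5 here since W and c are zero
```

In the distributed setting, this pass is performed by the mappers over their own data subsets, so the higher-level RBM's training data never need to be materialized centrally.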

Upon completion of the pre-training of all the hidden layers, the network gains discriminative power by simply putting the label layer on top of the network and iteratively tuning the weights of all the layers (i.e., ${{\rm W}_{1}},\ldots,{{\rm W}_{H+1}}$). In the first few epochs (e.g., 5), we fine-tune only the weight ${{\rm W}_{H+1}}$ connecting the $H$th hidden layer and the output layer, so that it has a reasonable initialization. Note that during fine-tuning, the ‘weight’ of each layer means the concatenation of the original weight and the bias.

For the distributed back-propagation-based fine-tuning, the feed-forward and back-propagation procedure [4], which computes the gradients of the weights for gradient descent, dominates the computation time. Thus, in each epoch, this procedure is executed in parallel on each subset of the data in the Map phase, and the reducers then compute the weight increments and update the weights.

Procedure 4 outlines the pseudocode for the distributed back-propagation-based fine-tuning. Step 1 loads the pre-trained weights ${{\rm W}_{1}},\ldots,{{\rm W}_{H}}$ and initializes the variables, such as the weight ${{\rm W}_{H+1}}$ and some hyper-parameters. Steps 2–5 cover the map and reduce functions. In the Map phase (Step 3), each mapper takes the mapper ID as the input key, and the weights and their increments as the input value. For each data batch, the mappers calculate the gradients of the weights and update the weight increments. Finally, each mapper emits the intermediate key-value pairs. In the Reduce phase (Step 4), each reducer takes one or more types of weights, computes the weight increments, updates the weights, and then passes them back to the mappers. In the final epoch, the reducers save the fine-tuned weights, which are the final output.
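The per-mapper computation in the fine-tuning phase is an ordinary feed-forward/back-propagation pass over each mini-batch. A one-hidden-layer sketch in NumPy (sigmoid units and a squared-error loss are our simplifying assumptions; the paper's networks are deeper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, y, W1, W2):
    """Gradients of a squared-error loss for a 1-hidden-layer net on batch (x, y)."""
    # Feed-forward pass.
    h = sigmoid(x @ W1.T)           # hidden activations
    out = sigmoid(h @ W2.T)         # output activations
    # Back-propagate the errors through the sigmoid derivatives.
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2) * h * (1 - h)
    n = x.shape[0]
    return delta_h.T @ x / n, delta_out.T @ h / n   # (dW1, dW2)

# Toy batch: 4 samples, 6 inputs, 3 hidden units, 2 output classes (one-hot).
rng = np.random.default_rng(0)
x = rng.random((4, 6))
y = np.zeros((4, 2)); y[np.arange(4), [0, 1, 0, 1]] = 1.0
W1 = 0.1 * rng.standard_normal((3, 6))
W2 = 0.1 * rng.standard_normal((2, 3))
dW1, dW2 = backprop_gradients(x, y, W1, W2)
```

Each mapper would accumulate such gradients over its mini-batches and emit per-layer increments, just as in the RBM case.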

In this section, we further design a main controller to manage the entire learning process of a DBN. The main controller schedules the running of the MapReduce jobs for each RBM level and for the fine-tuning.

Procedure 5 outlines the pseudocode for the main controller of a DBN. Steps 1–11 run MapReduce jobs for all $H$ levels of distributed RBMs. For the first-level RBM, the input data are the training dataset ${\cal D}$, and the pre-trained weight ${{\rm W}_{1}}$ and bias ${{\rm c}_{1}}$ are saved for loading in the fine-tuning stage. For the other RBM levels, the input data are $P\left({{{\rm h}_{l-1}}\vert{{\rm h}_{l - 2}}}\right)$. Steps 12–14 run MapReduce jobs for the distributed back-propagation-based fine-tuning. The pre-trained weights and biases of all RBM levels are loaded, and the resulting weights and biases of all layers are saved as the final output.

Thus, a distributed DBN is trained with the MapReduce programming model with the help of the *mrjob* framework. The training can be done off-line. Given a learned DBN, testing on new data samples can be performed directly.

SECTION IV

This section demonstrates the performance of the distributed RBMs and DBNs on several benchmark datasets for various learning tasks. In particular, we investigate their accuracy, and their scalability under varying Hadoop cluster sizes and numbers of data samples.

The tested datasets are MNIST^{1} for hand-written digit recognition and the 20 Newsgroups^{2} document set. The MNIST dataset contains 60,000 training images and 10,000 testing images. All images were size-normalized and centered at a fixed size of 28$\,\times\,$28 pixels, and the intensities were normalized to values in $[{0, 1}]$. The labels are integers in $[{0, 9}]$ indicating which digit an image represents. The 20 Newsgroups dataset contains 18,774 postings taken from the Usenet newsgroup collection, with 11,269 training documents and 7,505 test documents. Each document is represented as a 2000-dimensional vector whose elements are the probabilities of the 2000 most frequently used words. The label of each document is an integer in $[{0, 19}]$ indicating which topic the document belongs to. In this paper, the training sets of the original MNIST and 20 Newsgroups datasets are replicated 10, 20, 30, 40 and 50 times, as summarized in Table I, to evaluate scalability.

All the experiments were performed on a cluster of 8 computers (nodes) where each is equipped with a 64-bit AMD octo-core dual-processor with the speed of 2.4 GHz, 96 GB RAM, and Linux RHEL. The computers are connected through 10Gbit Ethernet. The cluster is configured with Hadoop 1.0.4, Java 1.7.0, and Python 2.7.5 with mrjob 0.4.1.

We set the HDFS block size to 64 MB and the replication factor to 4. Each node is configured to simultaneously run at most 26 mappers and 4 reducers. It should be noted that the cluster is generally shared with other users (except when we occupy all the cores).

The goal in this section is to compare the distributed RBMs and DBNs with their sequential counterparts (i.e., the original RBMs and DBNs) in terms of both testing accuracy and training time. To provide fair comparisons, we run both the sequential and the distributed versions under the same conditions. That is, in both cases we use the same parameter settings, including the training set (10 times the original MNIST dataset), the testing set (10,000 images), the network architecture (784-500 for RBMs, 784-500-500-2000-10 for DBNs), the initialization of the weights and biases, the learning rate, the momentum factor, and the number of epochs to train. Both were implemented entirely in Python. The sequential programs were run on one CPU, while the distributed programs were run on 16 CPUs of a node.

Fig. 3 shows the filters (i.e., the weights) obtained by the sequential RBM and the distributed RBM after epoch 50. Both learned visually excellent weights. Tables II and III provide the comparison results for the RBM and the DBN, respectively. One can see that both the distributed RBM and the distributed DBN obtained accuracy similar to their sequential versions, but with much less training time.

We evaluate the parallel performance of the distributed RBMs and DBNs with respect to *scalability*. In particular, we study training time versus data size for various numbers of CPUs. In the implementation, the distributed programs were run on the datasets summarized in Table I with the number of CPUs varying from 4 to 128. Fig. 4 shows the results on the various replications of the MNIST dataset obtained by the distributed RBMs and DBNs. First, one can observe that the running time rises as the training data grow, and decreases significantly as more CPUs are used. Second, the benefit of using more CPUs diminishes when the data size becomes small. The reason is that system overhead (e.g., communication costs and per-iteration job setup) dominates the processing time when the data size is small. In fact, it is the overhead in the Hadoop system that makes the speedup from adding more CPUs sublinear in the number of CPUs.

We also performed the scalability experiment on the 20 Newsgroups dataset, training an intuitively chosen network architecture, 2000-500-1000-20. It should be noted that our purpose in testing on 20 Newsgroups is to measure the scalability of the developed distributed DBNs, not to achieve accurate document classification or retrieval. The running times of the distributed DBNs on the various replications of the 20 Newsgroups dataset with different numbers of CPUs are listed in Table IV. The observations are similar to those above, which suggests that distributed DBNs can be applied to other large-scale applications.

SECTION V

In this paper, we have presented a type of distributed DBN using MapReduce, accomplished by stacking several levels of distributed RBMs and then using a distributed back-propagation algorithm for fine-tuning. To limit communication costs, only data-level parallelism is performed in the developed distributed algorithm, since a fully connected multi-layer network is considered. Experiments demonstrate that the distributed DBNs not only achieve testing accuracy on the enlarged MNIST dataset similar to the sequential version, but also scale well even though the Hadoop system incurs a large amount of overhead for iterative computing. We expect the developed distributed DBNs to be able to process other massive datasets with good performance. In future work, we will conduct experiments on more large-scale learning problems.
