A Review on Community Detection in Large Complex Networks from Conventional to Deep Learning Methods: A Call for the Use of Parallel Meta-Heuristic Algorithms

Complex networks (CNs) have gained much attention in recent years due to their importance and popularity. The rapid growth in the size of CNs makes their analysis increasingly difficult. Community Detection (CD) is an important multidisciplinary research area in which many machine/deep learning-based methods have been applied to map CNs into a low-dimensional representation for extracting information similarity among members of CNs. Currently, Deep Learning (DL) is one of the promising methods to extract knowledge and learn information from a high-dimensional space and represent it in a low-dimensional space. However, designing an accurate and efficient DL-based CD method, especially for large CNs, remains an ongoing research endeavor. Meta-Heuristic (MH) algorithms have shown their potential in improving DL models in terms of solution quality and computational cost. In addition, parallel computing is a feasible solution for building efficient DL models. The algorithmic principle of MH is parallel in nature; however, the MH-based DL training frameworks reported in the literature are not really implemented in a parallel computing setup. In this paper, we present a systematic review of CD in CNs from conventional machine learning to DL methods and point out the gap in applying DL-based CD methods to large CNs. In addition, the relevant studies on DL with parallel and MH approaches are reviewed and their implications for DL models are highlighted to identify effective solutions to the challenges of DL-based CD methods. We also point out research challenges in the field of CD and suggest possible future research directions.


I. INTRODUCTION
The study of Complex Networks (hereinafter referred to as CNs) has gained much attention in recent years due to their importance and popularity. With the development of science and technology, research in the field of CNs has attracted a considerable amount of attention from the research community. Recently, unstructured data has increased rapidly due to the fast development of Internet technology. A large part of the data is in the form of CNs. Large complex systems can also be represented as CNs, such as social networks [1], biochemical networks [2], protein-protein interaction networks [3], computer networks [4], citation networks [5], etc. CNs consist of interconnected nodes. Due to the proliferation of social networks such as Facebook, Twitter, LinkedIn, etc., there are billions of users. The study of user interactions on these networks is of great importance to many parties, including academia, industry, governments, etc. [6]. Hence, it is worthwhile to understand the relevant literature and study research insights from CNs, whose complexity is ever increasing.
The associate editor coordinating the review of this manuscript and approving it for publication was Feiqi Deng.
Community detection (hereafter referred to as CD) is one of the most important research fronts in the field of CNs. It is also a key multi-disciplinary field of research and is useful for understanding the structure of networks [7]. The idea of CD in CNs is depicted in Fig. 1. CD is valuable in several applications. Usually, a CN is divided into several groups (or communities) so that the number of relationships among nodes within a group is larger than the number of relationships between nodes in different groups. CD is one of the fundamental tasks in CNs. After the relationships of nodes within a group and among groups have been identified through CD, one can analyze and mine potentially useful information from the different communities in a CN, which serve as knowledge representations in many applications such as social networks, medical science, machine learning, criminology, and biology [1], [6]. In addition, communities in a CN can exchange and suggest information due to the similar interests of their members, and this feature is useful for tasks in a variety of applications that require recommendations, segmentation, vertex labelling, link inference, and influence analysis. CD is also helpful for detecting subspecies groups or individuals [8], [9].
A great deal of effort has been devoted to developing new CD methods that extract meaningful features from a CN and represent them in a low-dimensional space (sometimes called embedding-based CD); then, clustering algorithms such as k-means are applied to find communities in the low-dimensional space [8], [9]. Many conventional machine-learning (ML) algorithms were developed to solve the problem of CD [10], [11]. However, due to the increase in network size and complexity, traditional ML methods such as spectral mapping and non-negative matrix factorization are no longer viable. Moreover, these methods are prone to locally optimal solutions and have high computational complexity [6], [12]. An alternative is to introduce deep learning-based methods to cope with the CD problems and challenges. This trend has made deep learning an attractive solution and research area for addressing the CD problem [13]. In addition, meta-heuristic algorithms were adapted to deal with relevant issues of CD [14].
Deep learning (hereinafter referred to as DL) has become one of the most active research areas in artificial intelligence and machine learning. It has achieved impressive results in many areas, such as speech recognition, image analysis, and text comprehension, for both supervised and unsupervised learning strategies [15]. Unfortunately, due to the unique properties of graphs, applying DL-based methods to CN problems has never been an easy task [16]. The low-dimensional representation capability of DL methods in solving the CD problem was investigated in recent studies; examples of these DL methods are stacked and sparse autoencoders [8], [17], graph neural networks [18], [19], and other variants of DL [6], [20]. Despite the presence of DL-based CD methods, finding communities effectively and efficiently remains difficult, especially in large CNs, and in reality the size of networks keeps increasing considerably. In addition, most of the existing DL-based CD methods are trained with the gradient descent strategy and the backpropagation algorithm, which have three limitations: (1) the training process is slow, especially with big data; (2) they are sensitive to parameter initialization; and (3) their solutions are likely to fall into local minima [21], [22].
Meta-heuristic (MH) refers to a family of algorithms that includes evolutionary and nature-inspired algorithms such as Particle Swarm Optimization [23] and Genetic Algorithm [24]. MH algorithms have been successfully used for optimizing machine learning models. They are efficient methods for solving complex problems and could find optimal solutions within an acceptable amount of time. Nowadays, MH algorithms are the state of the art for various optimization problems, especially for problems that are very complex and have high dimensionality [25]; the CD problem is no exception. MH algorithms were adapted and employed to address the problem of CD [14]. Even though many efforts have been made to develop effective MH algorithms for the CD problem, they suffer from low accuracy and inefficiency when dealing with large CNs [14]. On the other hand, MH algorithms have been applied to optimize DL models, where gradient descent optimization is replaced by iterative MH algorithms to optimize the parameters and model structure [26]. However, the number of publications on the application of MH in large-scale DL is still small [21], [22].
The emergence of big data is observed recently. DL is playing an important role to deal with big data and to harvest valuable knowledge from complex systems [15], [27]. On the other hand, the complexity of DL models increases significantly due to massive computational work and memory requirements. The development of High-Performance Computing (HPC) devices is helpful to support the development of large-scale DL models. The utilization of parallel computing helps to improve the efficiency for the training of large-scale DL models on big data by dividing a task into subtasks and solving them simultaneously. Parallel computing improves the efficiency of large-scale DL models by utilizing Graphics Processing Unit (GPU) devices [28], [29], and clusters of CPUs [30], [31].

A. EXISTING REVIEWS
The field of community detection is developing rapidly and plays an important role in solving various complex real-life problems. There are various review articles about the applications and methods of community detection, most of which focus on conventional CD methods [1], [32].
Recently, a few studies have reviewed recent techniques, such as DL and MH. The study by Liu et al. [13] presented a review on CD with DL. It contains a discussion of deep graph embedding, deep neural networks, and graph neural networks. The survey by Jin et al. [9] contained a discussion of probabilistic graphical models and DL methods. In addition, Abduljabbar et al. [14] presented a generic overview of nature-inspired optimization algorithms and their role in solving CD problems. The heuristic and MH-based CD algorithms were reviewed by Bara'a et al. [33]. Furthermore, the authors described hybrid MH and hyper-heuristic algorithms for CD. Despite the availability of some studies in the literature that provide researchers with valuable information about CD using DL and MH techniques, there is still a lack of literature that gives a macroscopic view of effective and efficient DL- and MH-based CD in large CNs.
One of our main goals in this paper is to show, from a unified perspective, that DL-based CD needs MH and parallel approaches to bridge the gap that still exists. Our survey differs from the published ones in four aspects. First, we present a recent trend in the development of methods for CD, i.e., from conventional to DL, while the others focus mainly on individual techniques, e.g., conventional [1], [32], DL [9], [13], or MH [14], [33]. Second, the paper focuses on a survey of DL- and MH-based methods for CD in large CNs, which are treated as a big data problem. The scope of this survey differs from that of the existing papers [9], [13], whose solutions are not applied to big data. Third, the survey summarizes recent research work involving the integration of DL and MH. Fourth, this paper refers to the most successful studies that use large DL models with MH and parallel techniques in other domains, to encourage researchers to carefully reconsider the design of effective and efficient DL-based solutions for CD in large CNs.

B. CONTRIBUTIONS
This paper presents a systematic review of CD in CNs from conventional machine learning to DL methods, and points out the gap in applying DL-based CD methods to large CNs. In addition, the relevant studies on DL with parallel and MH approaches in various domains are reviewed and their implications for DL models are highlighted to identify effective solutions to the challenges of DL-based CD methods. We summarize the contributions of this paper as follows:
• We provide a comprehensive review of learning-based CD from conventional machine learning to deep learning-based methods, considering existing methods in MH and parallel approaches to deal with large CNs. To our knowledge, this is the first effort dedicated to summarizing current research work on the application of the mentioned methods to CD. We also point out challenges and open research questions related to DL-based methods for CD.
• We present an overview of recent remarkable studies that were developed to improve the effectiveness and efficiency of large-scale DL models in various fields, especially those using MH and parallel approaches to deal with big data. In addition, the existing challenges and open issues in such methods are discussed. The aim of this review is to motivate and inspire the research community to adapt DL to use MH and parallel approaches to overcome the existing problems of DL-based methods in CD, especially applied in large CNs.
• Finally, a summary of the review, future directions, and research gaps in the field of CD are presented before the end of the paper, which sheds light on future endeavours in developing effective and efficient computing solutions for large-scale CNs.

C. TERMINOLOGY
A set of acronyms is used throughout the paper. For quick reference, Table 1 lists all acronyms used in this article.

D. PAPER ORGANIZATION
The paper is organized as follows. Section II presents the detailed methodology used to conduct this study. The preliminary concepts of CD in CNs with DL are described in Section III. A technical overview of the research on earlier conventional CD methods, as well as parallel conventional CD methods, is presented in Section IV. Section V overviews the research progress on DL-based CD methods. Section VI overviews the research on DL-based parallel computing on regular and irregular domains. An overview of MH algorithms, MH-based CD algorithms, and MH-based DL optimization methods is presented in Section VII. The open benchmark datasets that have been used in the area of CD are presented in Section VIII. Section IX summarizes the review and points out research challenges involving the use of the methods mentioned in Section VII for CD. Section X suggests future research directions. Finally, the paper is concluded in Section XI.

II. REVIEW METHODOLOGY
A survey of the literature was conducted to identify publications describing CD algorithms from conventional to DL algorithms as well as DL with parallel and MH approaches. For this purpose, keywords and phrases were used to search for articles in well-known repositories such as Google Scholar, IEEE Xplore, Science Direct, Web of Science, Scopus, and arXiv. The overall procedure was as follows. First, the identified keywords were employed along with their synonyms to find relevant articles. The keywords were combined with the following search criteria: (("community detection *" OR "community discovery *" OR "deep learning * community detection" OR "metaheuristic * community detection" OR "finding communities *") AND ("complex networks" OR "large complex networks" OR "social networks") AND/OR ("parallel * deep learning *" OR "metaheuristic * deep learning")). Second, the titles and abstracts of the articles were read to remove unrelated articles. Next, the related papers were selected for full-text reading according to the following criteria:
• Publications after 2015 were given a higher priority. If none existed, articles published after 2010 were reviewed. A few remarkable studies published before 2010 were also selected.
• To overcome the language difficulty, only the articles written in the English language were considered.
• The literature review was based on papers that applied DL and ML for community detection, and that introduced supporting techniques to improve the efficiency and scalability of large-scale DL models (such as MH and parallel techniques).
As a result, a total of 184 publications were identified to form the basis for this review. Approximately two-thirds of the publications (63.2%) were journal articles. The rest were conference papers (34.1%) and book chapters (2.7%).
The selected papers were studied in detail and the key information was extracted, e.g., the CD approach and algorithms, the applications and datasets used and their attributes, the improvement techniques applied in terms of efficiency and effectiveness, and the merits and demerits of individual CD methods.

III. PRELIMINARY CONCEPT
A complex network CN(n, m) can be expressed as a graph $G(V, E)$. Here, $V(G) = \{v_1, v_2, \ldots, v_n\}$ is the set of $n$ nodes and $E$ is the set of $m$ relations between nodes. In their abstract form, the relations are considered undirected, unweighted, and unsigned. The adjacency matrix $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ is the most widely used representation of these relations: the entry for the node pair $i$ and $j$ is $a_{ij} = w_{ij}$, where $a_{ij} = 1$ means that the nodes are connected and $a_{ij} = 0$ means that they are disconnected. The number of relationships between node $v_i$ and the other nodes is called the node degree, which can be calculated as:
$$k_i = \sum_{j=1}^{n} a_{ij} \tag{1}$$
A node with a high degree is an important node in the CN. Another representation is the modularity matrix $B = [b_{ij}] \in \mathbb{R}^{n \times n}$, whose elements are $b_{ij} = a_{ij} - k_i k_j / (2m)$, where $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$, and $m$ is the total number of edges in the G/CN. Each node $v_i$ has a vector of length $n$, $v_{i,j}$, $j = 1, 2, \ldots, n$, which represents the relationships of node $i$ with all nodes in the G/CN.
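As a minimal sketch of the definitions above, the following Python snippet builds the adjacency matrix of a small network and derives the node degrees and the modularity matrix from it. The toy network (two triangles joined by one edge) and the function names are assumptions made for illustration; they do not come from the reviewed papers.

```python
def degrees(adj):
    """Node degree k_i = sum_j a_ij for an undirected, unweighted graph."""
    return [sum(row) for row in adj]


def modularity_matrix(adj):
    """Modularity matrix with entries b_ij = a_ij - k_i * k_j / (2m)."""
    k = degrees(adj)
    two_m = sum(k)  # each undirected edge is counted twice, so sum(k) = 2m
    n = len(adj)
    return [[adj[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]


# Hypothetical toy network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1  # undirected: the matrix is symmetric

k = degrees(A)
B = modularity_matrix(A)
```

Each row of `B` sums to zero (up to rounding), which follows directly from the definition because the degrees sum to $2m$.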
The goal of CD algorithms is to partition a given CN into a set of $k$ communities $C = \{c_1, c_2, \ldots, c_k\}$, so that the relationships inside each community, e.g., $c_1$, are denser than the relationships between communities, e.g., $c_1 \leftrightarrow c_2$. This is typically measured by a Q-Modularity function that should be maximized. The Q-Modularity function is denoted as:
$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta_{ij} \tag{2}$$
Here, $m$, $d_i$, and $A_{ij}$ are the number of edges in the graph, the degree of node $i$, and the adjacency matrix, respectively. If $i$ and $j$ are in the same community, $\delta_{ij} = 1$; otherwise, $\delta_{ij} = 0$. The nodes with a nearly similar vector representation are assigned to the same community. This can be performed as follows. First, some nodes are selected as community centers corresponding to the $k$ communities, $v_c = \{v_{c1}, v_{c2}, \ldots, v_{ck}\}$; usually, by referring to Eq. (1), these are nodes with a high degree. Next, the Euclidean distance between each node $v_i$ and all community centers $v_c$ is computed according to the following equation:
$$d(v_i, v_{cj}) = \sqrt{\sum_{l=1}^{n} \left( v_{i,l} - v_{cj,l} \right)^2} \tag{3}$$
Nodes $v_i$ (where $1 \le i \le n$) with a similar representation obtain similar distances from the same community center $v_{cj}$. As a result, they are grouped in the same community.
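The two steps described above (choosing high-degree nodes as centers, then assigning every node to its nearest center) can be sketched in a few lines of Python, together with the standard modularity score $Q = \frac{1}{2m}\sum_{i,j}(A_{ij} - d_i d_j / 2m)\,\delta_{ij}$ of the resulting partition. This is an illustrative sketch only: the toy network is hypothetical, and representing each node by its row of $A + I$ (the adjacency row with a self-loop added) is an assumption made here so that a node resembles its own neighbours.

```python
import math


def modularity(adj, labels):
    """Q = (1/2m) * sum_ij (A_ij - d_i * d_j / 2m) * delta_ij."""
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    n = len(adj)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:  # delta_ij = 1 only inside a community
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m


def assign_to_centers(vectors, centers):
    """Label each node with the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
        for v in vectors
    ]


# Hypothetical toy network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

# Assumption: node i is represented by row i of A + I.
vectors = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]

# Pick the k = 2 highest-degree nodes as community centers.
deg = [sum(row) for row in A]
center_ids = sorted(range(n), key=lambda i: -deg[i])[:2]
labels = assign_to_centers(vectors, [vectors[c] for c in center_ids])
Q = modularity(A, labels)
```

On this toy network the two triangles are recovered as communities; placing all nodes in a single community would instead give $Q = 0$.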
The main problem in this process is how to find an effective low-dimensional representation of the node so that the similarity between the nodes is promoted and the original network representation is preserved.
DL can effectively extract low-dimensional representations and find information similarity among nodes of CNs. For example, the deep autoencoder model reconstructs the output data so that it is similar to the input data [34]. Figure 2 shows the general structure of the autoencoder. It has two phases: (1) The first phase is called encoding, where the hidden layer maps the input $x$ with $n$ features into a latent representation $h$ with $d$ features, where $d$ is smaller than $n$. The mapping is performed by the $f_1$ function, $h = f_1(W_1 x + b_1)$. (2) The second phase is called decoding, which attempts to reconstruct the original data from the output of the hidden layer using the $f_2$ function, $y = f_2(W_2 h + b_2)$. The Sigmoid, Tanh, and ReLU activation functions are normally used in the encoding and decoding functions since they are nonlinear mapping functions. The parameters of the autoencoder model are referred to as $\theta = [W_1, b_1, W_2, b_2]$ and are trained by minimizing the loss function:
$$L(\theta) = \sum_{i=1}^{n} \lVert x_i - y_i \rVert^2 \tag{4}$$
with respect to all $n$ samples (i.e., nodes), where $x$ and $y$ denote the original input data and the reconstructed data, respectively. The training steps of the autoencoder include a feedforward process and a backward process. The time complexity of the autoencoder model is calculated by:
$$O(nho) \tag{5}$$
where $n$, $h$, and $o$ refer to the number of nodes (i.e., training samples), the number of neurons in the hidden layer, and the number of neurons in the input/output layers, respectively. The deep autoencoder requires $l$ layers and $t$ iterations, which changes the time complexity to $O(tlnho)$. In the CN representation, a node $i$ is represented by a vector of size $n$, so $o = n$. Therefore, the time complexity is $O(tln^2h)$.
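To make the encoding and decoding phases concrete, the following is a minimal single-hidden-layer autoencoder in plain Python, offered as a sketch rather than as any of the reviewed models. It assumes sigmoid activations for both $f_1$ and $f_2$, a summed squared-error reconstruction loss, per-sample gradient descent with backpropagation, and a hypothetical toy input (the rows of $A + I$ for two triangles joined by one edge); the function names, learning rate, and epoch count are illustrative choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Encode h = f1(W1 x + b1), then decode y = f2(W2 h + b2)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, y


def reconstruction_loss(X, W1, b1, W2, b2):
    """Sum over all samples of the squared error ||x - y||^2."""
    total = 0.0
    for x in X:
        _, y = forward(x, W1, b1, W2, b2)
        total += sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return total


def init_params(n, d, seed=0):
    """Small random weights and zero biases for an n -> d -> n autoencoder."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(d)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
    return W1, [0.0] * d, W2, [0.0] * n


def train(X, W1, b1, W2, b2, epochs=500, lr=0.5):
    """Per-sample gradient descent; the parameters are updated in place."""
    n, d = len(W2), len(W1)
    for _ in range(epochs):
        for x in X:
            h, y = forward(x, W1, b1, W2, b2)
            # Backward pass: error signals at the output and hidden layers.
            delta2 = [2 * (yi - xi) * yi * (1 - yi) for xi, yi in zip(x, y)]
            delta1 = [
                sum(W2[o][j] * delta2[o] for o in range(n)) * h[j] * (1 - h[j])
                for j in range(d)
            ]
            for o in range(n):
                for j in range(d):
                    W2[o][j] -= lr * delta2[o] * h[j]
                b2[o] -= lr * delta2[o]
            for j in range(d):
                for i in range(n):
                    W1[j][i] -= lr * delta1[j] * x[i]
                b1[j] -= lr * delta1[j]


# Hypothetical input: rows of A + I for two triangles joined by edge 2-3.
X = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
params = init_params(n=6, d=2)
loss_before = reconstruction_loss(X, *params)
train(X, *params)
loss_after = reconstruction_loss(X, *params)
embedding = [forward(x, *params)[0] for x in X]  # d-dimensional node representations
```

The rows of `embedding` are the low-dimensional node representations on which a clustering step such as k-means would then operate.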
The time complexity of assigning nodes to the nearest community is O(nkd), where n, d, and k denote the number of nodes, the size of the dimensional representation, and the number of communities, respectively.

IV. CONVENTIONAL COMMUNITY DETECTION METHODS
Great effort has been devoted to CD using conventional ML methods. This section presents remarkable and recent studies related to the use of conventional ML-based methods for CD.

A. EARLIER CONVENTIONAL COMMUNITY DETECTION METHODS
In the last two decades, several conventional ML methods have been developed to address CD in CNs. Spectral mapping and clustering algorithms were introduced to find communities in networks. Ng et al. [35] presented a spectral clustering algorithm to detect communities in a CN. The main idea of the algorithm was to represent the data of a CN in a low-dimensional space. Then, a clustering algorithm was applied in the new low-dimensional space to discover communities. White and Smyth [11] devised a spectral clustering algorithm for CD based on the reformulation of eigenvectors of matrices. The study mapped the node data of the network into Euclidean space and then grouped them into clusters. The spectral algorithm was improved by Niu et al. [36]. The authors used the PageRank method to find the core nodes, and then the cores were employed to initialize the cluster centers in the spectral clustering algorithm. Zhang et al. [37] proposed an extension of the spectral clustering algorithm to support the detection of overlapping communities, which included spectral mapping and fuzzy c-means clustering [38]. Newman and Girvan [39] proposed a hierarchical clustering algorithm to solve the CD issue. It was based on a measurement called "betweenness centrality" [40]. It started by calculating the betweenness of all edges. Then, the edges with high betweenness were removed, and the betweenness of the remaining edges was recalculated. Brandes et al. [41] proposed a method for grouping the nodes of a CN into communities based on the modularity index. The method aimed to find the clustering with maximum modularity. A Random-Walk algorithm was built on a random walk process, i.e., a process of navigating between nodes at random [42]. Rosvall and Bergstrom [43] proposed an algorithm for CD based on the flow of information between nodes, called the Infomap algorithm. They used maps to represent the flow of information between nodes. A group of nodes among which information flowed was grouped into a single cluster. A random walk was used as a proxy for information flow.
Stochastic Block Model (SBM) is a generative model [44] for solving the CD problem. The model uses a node membership probability function to find hidden communities in the network. Karrer and Newman [45] improved the performance of the SBM by proposing a degree-corrected model. They overcame a limitation of the original SBM, which relied on the basic node degrees (i.e., adjacent degrees) in graph clustering even though these were not uniformly distributed. The study introduced a degree parameter for each node that scaled the edge probabilities and controlled the expected degree of the node. Non-Negative Matrix Factorization (NMF) is another conventional ML method developed for CD [46]. It mapped the topology information of a CN to a latent low-dimensional space. The new space consisted of soft membership vectors that assigned each node to a particular cluster. Shi et al. [47] developed a new NMF-based pairwise constrained method to improve the performance of CD. Wang et al. [48] also adapted the NMF algorithm to detect overlapping and non-overlapping communities in CNs.
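The edge-removal procedure attributed to Newman and Girvan [39] in this section (compute the betweenness of all edges, remove the highest one, recompute) can be sketched as follows. This is an illustrative reading of that description, not the authors' implementation: it uses a Brandes-style breadth-first accumulation for edge betweenness, stops as soon as the network splits into two parts, and runs on a hypothetical toy network; all function names are assumptions.

```python
from collections import deque


def edge_betweenness(adj_list):
    """Shortest-path betweenness of every edge (unweighted, undirected graph)."""
    bet = {frozenset((u, v)): 0.0 for u in adj_list for v in adj_list[u]}
    for s in adj_list:
        dist = {s: 0}
        sigma = {v: 0 for v in adj_list}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj_list}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj_list[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj_list}
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every node pair was counted once from each endpoint, so halve the totals.
    return {e: b / 2 for e, b in bet.items()}


def components(adj_list):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj_list:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj_list[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_once(adj_list):
    """Remove highest-betweenness edges until the graph falls apart."""
    g = {v: set(nbrs) for v, nbrs in adj_list.items()}  # work on a copy
    while len(components(g)) == 1 and any(g.values()):
        bet = edge_betweenness(g)  # recomputed after every removal
        u, v = max(bet, key=bet.get)
        g[u].discard(v)
        g[v].discard(u)
    return components(g)


# Hypothetical toy network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
G = {v: set() for v in range(6)}
for a, b in edges:
    G[a].add(b)
    G[b].add(a)

bridge_score = edge_betweenness(G)[frozenset((2, 3))]
communities = split_once(G)
```

In the toy network the bridge 2-3 carries all nine shortest paths between the two triangles, so it is removed first and the triangles emerge as communities. Recomputing the betweenness after each removal is what makes this approach expensive on large CNs.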

B. PARALLEL CONVENTIONAL COMMUNITY DETECTION METHODS
Recently, many researchers have dedicated their efforts to developing parallel conventional ML methods to find communities in CNs. They use HPC resources, available as either CPU or GPU devices, to design parallel computation models. The research works are reviewed in the following sections.

1) CPU BASED PARALLEL COMPUTATION
Prat-Pérez et al. [49] proposed a parallel algorithm based on optimizing weighted community clustering for CD in large graphs. The algorithm first used the clustering coefficient as evidence to find initial communities, then refined these communities by moving nodes between them. The calculation of the clustering coefficient of vertices and the refinement functions were performed in parallel. He et al. [50] proposed a parallel CD algorithm based on the distance dynamics model [51], in which the interaction scope of each node is affected only by its neighbors. The algorithm divided the large network into sub-networks by a divide-and-conquer strategy. Fazlali et al. [52] proposed a parallel version of the Louvain CD algorithm. The method introduced thread-level parallelism for computing, in parallel, the addition of qualified neighbor nodes to a community. Parallel processing with a threaded binary tree data structure was proposed for CD in [53]. The method was applied to weighted networks with irregular topologies. The aforesaid CD methods were evaluated on multi-processor platforms.
A distributed parallel CD algorithm was proposed by Moon et al. [54]. The authors developed two parallel versions of Newman and Girvan's algorithm [39] to handle CD in large CNs. The MapReduce and vertex-centric models were adopted for parallelizing Newman and Girvan's algorithm. In addition, the algorithms were implemented with GraphChi and Hadoop. Similarly, a parallel method based on a partitioning algorithm was proposed to solve the CD problem [55]. It used a subtree-split strategy to divide the network into sub-networks. The performance of the above algorithms was evaluated on a CPU cluster platform and applied to different large networks. The experimental results showed that they could give good performance in terms of computation time.
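Several of the methods above parallelize a per-vertex computation, such as the clustering coefficient used by Prat-Pérez et al. [49]. The sketch below only illustrates that task decomposition with Python's standard `concurrent.futures` module; it is not the cited algorithm, the toy network and function names are hypothetical, and in standard CPython builds threads share a global interpreter lock, so a real speed-up for this CPU-bound work would require processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def local_clustering(adj_list, v):
    """Fraction of neighbour pairs of v that are themselves connected."""
    nbrs = adj_list[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj_list[a])
    return 2.0 * links / (k * (k - 1))


def parallel_clustering(adj_list, workers=4):
    """Compute every vertex independently; the pool spreads vertices over workers."""
    nodes = list(adj_list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(lambda v: local_clustering(adj_list, v), nodes)
        return dict(zip(nodes, values))


# Hypothetical toy network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
G = {v: set() for v in range(6)}
for a, b in edges:
    G[a].add(b)
    G[b].add(a)

cc = parallel_clustering(G)
```

Because each vertex only reads the shared adjacency structure and writes its own result, the tasks need no synchronization, which is what makes this step easy to distribute.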

2) GPU BASED PARALLEL COMPUTATION
Some researchers have also suggested that the use of GPU devices could improve the efficiency of CD methods. Soman and Narang [56] presented a new parallel algorithm for CD on a GPU device. It adopted a weighted-label propagation algorithm. Li [57] proposed a parallel version of the Newman algorithm for CD. The algorithm utilized the GPU architecture. On a related issue, Al-Ayyoub et al. [58] proposed a parallel algorithm for CD in social networks. The algorithm used dynamic parallelism on a GPU device. It provided three parallel implementations of Zhang et al.'s algorithm [37]: parallel CPU, hybrid CPU-GPU, and dynamic parallelism-based GPU. Mohammadi et al. [59] adapted the Louvain algorithm for CD in large CNs to run in parallel on GPU devices. They used thread-level parallelism and shared memory among the threads in a GPU block. They also proposed parallel implementations with hybrid CPU-GPU. In [60], Souravlas et al. proposed a parallel CD algorithm with hybrid CPU-GPU devices. The algorithm transformed the network nodes into a set of threaded binary trees. It first made the CPU take samples of CN communities and represent them in the form of threaded binary trees; then, the GPU read the large data load and placed it into a path matrix; finally, the matrix was sent back to the CPU for analysis to find communities. The results of the above algorithms showed that they achieved a good speed-up compared to non-parallel CD algorithms.
In short, according to the review in Section IV (A and B), we observed that great efforts have been devoted to solving the CD issue in CNs by applying conventional ML methods with both sequential and parallel implementations. However, these conventional methods could converge to locally optimal solutions, which leads to a decrease in the quality of CD solutions. In addition, these approaches are highly dependent on the properties of the data. For example, spectral mapping methods adopt eigenvectors for CD, but they do not perform well on sparse networks. These methods also struggle in the face of today's complex data, the increasing scale of CNs, and the growing dimensionality of data [13]. Therefore, researchers were motivated by the success of DL in various fields and shifted their attention to developing powerful techniques to obtain effective and efficient performance for CD with feasible computational speed [6], [9].

V. DEEP LEARNING MODELS-BASED COMMUNITY DETECTION
DL has become one of the most active research areas in artificial intelligence and machine learning. It has achieved remarkable success in many domains and applications, such as speech recognition, image analysis and text comprehension [15], communications and networking [61], [62], etc., and CD is no exception. In this section, we review studies that address CD using DL models. The existing DL models in the field of CD are classified as follows.

A. STACKED AUTOENCODER-BASED COMMUNITY DETECTION
Tian et al. [17] proposed a CD method using a DL model, which is one of the earliest studies in this field. They designed a stacked autoencoder model for learning a nonlinear embedding of the graph. The model was developed by exploiting the similarity between the spectral clustering algorithm and the autoencoder model. The model took as input the similarity matrix of the $n$ nodes, normalized by the diagonal matrix $D$. The method then created a latent representation $z$ in a low-dimensional space of dimension $d$, $z \in \mathbb{R}^{n \times d}$. The model was trained with the loss function of Eq. (4) with respect to all samples (i.e., nodes). It also consisted of five layers and adopted the Kullback-Leibler (KL) divergence as a sparsity term in addition to the reconstruction loss. The new loss function was then introduced as follows:
$$L(\theta) = \sum_{i=1}^{n} \lVert x_i - y_i \rVert^2 + \beta \, KL(p \,\|\, \hat{p}) \tag{6}$$
where $x$ and $y$ denote the original input data and the reconstructed data, respectively, $\beta$ controls the weight of the sparsity term, $p$ is a small constant value, e.g., 0.01, $\hat{p} = \frac{1}{n} \sum_{j=1}^{n} h_j$ is the average activation of the units in the hidden layer ($h$), and $KL(p \,\|\, \hat{p})$ denotes the KL-divergence, formulated as follows:
$$KL(p \,\|\, \hat{p}) = p \log \frac{p}{\hat{p}} + (1 - p) \log \frac{1 - p}{1 - \hat{p}} \tag{7}$$
The backpropagation algorithm and a greedy layer-wise training process were used for training the model. Afterward, the k-means clustering algorithm was applied to the extracted latent representation to divide the graph into groups (or communities). The results demonstrated that the model outperformed the spectral clustering algorithm.
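As a small illustration of the sparsity term, the snippet below computes a KL-divergence penalty between a target activation $p$ and the average activation of each hidden unit. It assumes the common sparse-autoencoder form $p \log(p/\hat{p}) + (1-p)\log((1-p)/(1-\hat{p}))$, with the average taken over the samples separately for each hidden unit and the per-unit terms summed; the function name and the example activations are hypothetical.

```python
import math


def kl_sparsity(p, hidden_activations):
    """Sum over hidden units of KL(p || p_hat_j), p_hat_j = mean activation of unit j."""
    n = len(hidden_activations)
    d = len(hidden_activations[0])
    p_hat = [sum(h[j] for h in hidden_activations) / n for j in range(d)]
    return sum(
        p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
        for q in p_hat
    )


# Hypothetical activations of two hidden units for two samples:
# unit 0 already averages the target p = 0.01, unit 1 averages 0.5.
H = [[0.01, 0.5], [0.01, 0.5]]
penalty = kl_sparsity(0.01, H)
only_sparse_unit = kl_sparsity(0.01, [[0.01], [0.01]])
```

The penalty is zero for a unit whose average activation equals the target and grows as the unit becomes more active, which is how the $\beta$-weighted term pushes the hidden representation towards sparsity.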
In the same direction of research work, Yang et al. [8] developed a stack autoencoder model for CD [63]. They adopted the modularity matrix B = [b ij ] ∈ R n×n as an input of the model. Then, the stack autoencoder model was trained to create a useful hidden representation of the graph in low dimensional space. The model was extended to a semi-supervised by incorporating pairwise constraints. Similarly, a stacked and sparse autoencoder was developed by Wang et al. in [12] for CD. The proposed model encapsulated stack autoencoder as embedding stage and k-means as a clustering stage. It received a normalized adjacent matrix A as input. Fei et al. [64] proposed a method for CD using a deep sparse autoencoder model. In this method, a similarity matrix S was first constructed. Then, a sparse autoencoder was performed to find an effective and low-dimensional feature space of CNs. Continuing in the same direction, a CD method based on a denoising autoencoder was developed by Geng et al. in [65]. A probability transfer matrix T of a CN was first computed. Then, the transfer matrix was nonlinearly mapped to a new space by the denoising autoencoder model. Moreover, autoencoder and Convolutional Neural Network (CNN) models were combined for CD in [66]. The method extracted the spatial localization features by autoencoder and CNN models. The above methods used k-mean clustering algorithm to group nodes into communities. A set of experiments were conducted to evaluate the performance of the aforesaid methods with small-sized real CNs, and the results showed that the methods were promising for finding communities in CNs.
Autoencoders that integrate network topology and content have also been suggested for the CD problem. Cao et al. [67] proposed an autoencoder that can utilize both network topology and network content. The modularity matrix B and the Laplacian matrix L were used as input to the model. Graph regularization was also added to the proposed autoencoder to achieve a robust integration of network content and topology, even when there was a mismatch between them. In [68], the authors proposed an autoencoder model for CD based on the combination of topology and node contents. Both the Markov matrix M and the modularity matrix B were used as input data to the autoencoder. In the same direction, Cao et al. [69] proposed a deep autoencoder model for CD by integrating network structures and node contents. The autoencoder obtained a low-dimensional representation of the network. The modularity matrix B and the Markov matrix M were again used as input data to the proposed model. The experimental results showed that these methods gave a more robust performance than some popular CD methods.
Some researchers have also suggested other variants of the autoencoder for discovering communities in CNs. Xie et al. [70] proposed a transitive autoencoder model for CD. They calculated a new similarity matrix S based on eigenvectors and eigenvalues, and used it as input to the transitive autoencoder. The model was trained to obtain a low-dimensional representation of a CN, and the communities were then discovered with the k-means clustering algorithm. More recently, Xu et al. [71] proposed a new method for CD using DL techniques. The method used four similarity matrices of CNs based on modularity, diagonal, and transition operations, denoted B, D, T, and M. All these matrices were fed to the model as source and target. Then, a stacked autoencoder and transfer learning were combined to find an efficient low-dimensional feature space. Finally, several clustering algorithms were integrated to effectively find the communities of CNs. The algorithms were evaluated on medium to large CNs and outperformed traditional methods in this area. Table 2 lists research publications relevant to autoencoder-based CD methods, as well as their merits, demerits, and related properties.

B. GRAPH NEURAL NETWORK-BASED COMMUNITY DETECTION
Graph Neural Networks (GNNs) merge graph mining and DL. The rapid development of GNNs in recent times reflects their ability to model and capture the complex relationships between members of CNs. Chen et al. [19] developed a GNN model for supervised CD, formulated as node-wise classification. The model used an adjacency matrix A and exploited edge adjacency information by including the non-backtracking operator of the graph. The Graph Convolutional Network (GCN) is a popular type of GNN model for integrating graphs and supervised DL, and was developed following the success of CNNs in other domains [72]. Jin et al. [18] proposed a method based on an autoencoder and GCN for CD. The method incorporated network topologies and network contents. It consisted of an encoder model, a CD model based on multinomial logistic regression, and a structure reconstruction model. Similarly, Wang et al. [73] proposed a method based on marginalized autoencoder and GCN models for CD. Since GCN was mainly used for graph classification, the authors adapted it to a purely unsupervised clustering task by applying a stacked autoencoder model. The approach integrated the content and the structure of the network and supported clustering tasks.
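For readers unfamiliar with graph convolution, the following sketch implements one layer of a widely used GCN propagation rule, H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), in plain Python. It is assumed here for illustration only; the cited methods may use different normalizations or activation functions, and the function and helper names are hypothetical.

```python
import math

def gcn_layer(adj, features, weights):
    """One graph-convolution layer with symmetric normalization and ReLU.

    adj: n x n adjacency matrix, features: n x f, weights: f x f_out.
    """
    n = len(adj)
    # Add self-loops: A_hat = A + I
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: D^{-1/2} A_hat D^{-1/2}
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

    def matmul(x, y):
        return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
                 for j in range(len(y[0]))]
                for i in range(len(x))]

    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]  # ReLU
```

On a two-node graph with a single edge, each node's output is the average of both nodes' features, which shows how a layer mixes information between neighbors.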
Recently, the GNN model was adapted to data-driven spectral analysis as well as CD in [74]. The method used an SBM and generic inference algorithms employing supervised learning for the data-driven process. In [75], a new GNN encoding method was proposed for the problem of CD in CNs. The method used a multi-objective evolutionary algorithm to solve this problem. In the GNN encoding step, each edge in an attributed network was associated with a continuous variable, and a non-linear transformation was then used to convert the continuous values into discrete values that indicated the communities of a CN. Sun et al. [76] also proposed a CD method based on a GCN autoencoder for clustering nodes. The method extracted the network embedding through the GCN autoencoder model. It also used the community structure to maximize the modularity index and grouped the network members with a clustering algorithm. Similarly, Park et al. [77] proposed a symmetric GCN autoencoder for learning the representation of a CN in an unsupervised manner. The extracted representation was then utilized for node clustering.
The performance of the aforementioned methods [69]-[72] was evaluated on small-sized CNs, and they achieved good results in terms of accuracy and effectiveness. However, these methods encounter efficiency issues and are applicable to small CNs only. Table 3 lists research publications relevant to GNN-based and GCN-based CD methods, as well as their advantages and related properties.

C. OTHER TYPES OF DEEP LEARNING-BASED COMMUNITY DETECTION
Other variants of DL models were developed to solve the problem of CD in CNs. For example, Xie et al. [6] proposed a deep sparse filtering method for this task. In the method, a new network similarity representation S∅ was proposed to denote direct and non-direct relationships between nodes. The sparse filtering model was adapted to extract meaningful features of the network and represent them in a low-dimensional space. The k-means clustering algorithm was used to divide the network into communities. The authors also incorporated new constrained similarities of pairwise nodes into the loss function of sparse filtering. In the same direction of research, Ye et al. [78] integrated the concept of non-negative matrix factorization into the layer structure of the autoencoder model. The encoder and decoder layers were incorporated into non-negative matrix factorization to form a unified loss function. The method used the adjacency matrix A and non-negative factor matrices U as input data. A Convolutional Neural Network (CNN) model was adapted for CD by Sperli [20]. The CNN model used the adjacency matrix A as the input data representation. The convolution and max pooling layers mapped the similarity graph into a latent representation in a low-dimensional space. A fully connected neural network was applied to the new representation to assign nodes to communities. The method was trained with a supervised learning strategy to perform a classification task. The mentioned methods were tested on a set of small-sized CNs, and the results showed that they are promising for solving CD compared to conventional methods.
On a related problem, Zhou et al. [79] proposed an approach that used an autoencoder to address CD based on the local and global structure of graphs. In this approach, the graph was first divided into subgraphs, which were sent to the autoencoder model to extract graph features and represent them in a low-dimensional space. Finally, it used the Student's t-distribution to refine the initial clustering of the K subgraphs. Recently, in [80], the authors developed a new learning method for CD in CNs. They divided a given CN into several parts and chunks in order to feed them into an autoencoder model with few trainable parameters. In addition, the proposed method adopted a reduction and sharing of trainable parameters to improve the efficiency of the deep autoencoder model for this task. The method used the S∅ similarity representation as the input data representation. A parallel design was also incorporated into the method. The above methods were evaluated on medium to large CNs, and the results showed that they outperformed existing methods in the field. Table 4 lists research publications relevant to other variants of DL models for CD, including those dealing with large CNs, as well as their advantages and related properties.
In a nutshell, recent studies show that a fair amount of attention has been devoted to CD in CNs using DL models, especially unsupervised approaches based on autoencoder models. However, most of the existing methods face challenges in terms of effectiveness, efficiency, and scalability. In terms of efficiency and scalability, three issues remain open: most current methods suffer from low efficiency because they do not operate in a parallel CPU-GPU computing setup; the methods treat an entire CN as a single object; and they are usually evaluated on small CNs, except for a few recent methods. Therefore, this paper reviews recent and relevant studies on DL with parallel approaches in Section VI to show how parallel approaches play an important role in developing efficient large-scale models. The aim of Section VI is to highlight valuable solutions for overcoming the challenges in developing efficient and large-scale DL-based CD methods. As for the effectiveness limitations, all existing DL-based CD methods use the gradient descent strategy with the backpropagation algorithm for training, which can lead to local optima (i.e., low effectiveness and poor CD quality) and slow convergence. Hence, an overview of combining MH algorithms with DL models is presented in Section VII to show the positive impact of these algorithms in developing global and effective solutions for DL models and to motivate the research community to integrate MH algorithms into the DL and CD domain.

VI. PARALLEL DEEP LEARNING
In recent years, we have witnessed the emergence of big data, driven by the rapid growth of most real-world data. As a result, the complexity of DNNs has increased, and the computational intensity and memory requirements of DL models have grown significantly. The evolution of HPC devices helps to design large-scale DL models efficiently by utilizing parallel computing. Parallel computing therefore plays an important role in designing efficient DL models.
In this section, we list the remarkable and recent studies related to DL with parallel computing in regular domains (e.g., images and speech) and irregular domains (i.e., graphs).

A. PARALLEL DEEP LEARNING ON REGULAR DOMAINS
In regular domains, the underlying data representation often has a low-dimensional space and a regular, clear grid structure, which is friendly to hardware accelerators such as GPUs [28]. Dean et al. [30] proposed a powerful framework called DistBelief. In the framework, large-scale clusters of machines were used to distribute and parallelize the training of Deep Neural Networks (DNNs). Two parallel architectures were proposed: the first was implemented on a single machine using multithreading, and the other ran on multiple machines. In addition, three parallelism schemas were supported by this framework: model parallelism, data parallelism, and pipeline parallelism. Two optimization methods were adopted in DistBelief. The first, named Downpour Stochastic Gradient Descent (SGD), used multiple replicas of a single model (i.e., a dataset was divided into several partitions and a copy of the model was executed on each partition), while the second, called Sandblaster, used distributed parameter storage and manipulation. Similarly, Ma et al. [82] proposed an autoencoder model for detecting outliers in large data. They divided the training data randomly into several parts and learned a representation by training a replicator autoencoder model using parallel SGD for each part. Then, the outputs of the replicator autoencoders were aggregated to detect outliers. The performance evaluation results showed that the above frameworks accelerated the training of models with large datasets.
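The idea of data parallelism described above can be sketched with a toy example: the dataset is split into shards, each worker computes a gradient on its own shard with a copy of the model, and the gradients are averaged before the shared parameter is updated. Note that this sketch uses synchronous averaging on a one-parameter linear model and runs the workers sequentially; it is a simplification and not the asynchronous, parameter-server-based Downpour SGD itself. All names are illustrative.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_sgd(data, n_workers=2, lr=0.05, steps=200):
    """Synchronous data-parallel SGD (assumes len(data) >= n_workers)."""
    # Partition the dataset: worker i receives every n_workers-th sample
    shards = [data[i::n_workers] for i in range(n_workers)]
    w = 0.0
    for _ in range(steps):
        # In a real system these gradients would be computed concurrently
        grads = [shard_gradient(w, shard) for shard in shards]
        # Aggregate: average the worker gradients, then update the model
        w -= lr * sum(grads) / len(grads)
    return w

# Toy data generated from y = 3x
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w_fit = data_parallel_sgd(data)
```

An asynchronous variant would let each worker push its gradient as soon as it is ready, trading gradient staleness for less idle time.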
Other studies also used GPUs to develop efficient and scalable DL models. Chen and Huo [83] proposed distributed Recurrent Neural Networks (RNNs) and DNNs for speech recognition applications based on a cluster of GPU devices. They leveraged data parallelism and intra-block parallelism. Makkie et al. [84] presented a scalable and fast distributed deep Convolutional Autoencoder (CAE) for functional magnetic resonance imaging (fMRI) big data analysis. The CAE consisted of convolutional layers and max pooling operations. Data parallelism was leveraged to improve the efficiency of the proposed approach by distributing a copy of the entire model to all executor nodes, with each executor processing sub-mini-batches via asynchronous SGD. In addition, Apache Spark and TensorFlow were utilized. Harlap et al. [85] suggested hybrid parallelism. They presented a parallel framework for DNNs named PipeDream, which used data parallelism, model parallelism, and pipeline parallelism based on GPUs. The layers were divided into several chunks, and each chunk was assigned to a single GPU. Forward and backward tasks were processed within the chunk. PipeDream achieved data parallelism by processing different mini-batches simultaneously, while model parallelism and pipelining were achieved by overlapping forward and backward tasks. Experimental results demonstrated that the mentioned methods were very efficient and obtained high speedup values.
More recently, Sriram et al. [86] proposed a DL model for multi-coil MRI reconstruction, called GrappaNet. The method could generate high-quality reconstructions even at high acceleration factors. This was achieved by integrating parallel imaging methods into the DL model. In [87], Park et al. proposed a distributed DL training method for CNN models. In the method, two parallelism schemas were utilized: data parallelism and model parallelism. In the same direction, Kim et al. [88] proposed a distributed DL method based on heterogeneous systems. The schema was built using multiple heterogeneous GPUs that worked together. It also used a data parallelism schema based on an asynchronous large mini-batch training mechanism. Four types of GPUs were utilized in the evaluation process. The methods achieved good speedup gains compared to baseline methods.

B. PARALLEL DEEP LEARNING ON IRREGULAR DOMAINS
As for irregular domains, the underlying data representation often has a high-dimensional space and an irregular, unclear grid structure, which makes it difficult to perform parallel computations [28]. Recently, large-scale DL models for irregular domains were developed in several studies with a supervised learning strategy (i.e., classification). In [89], Zeng et al. proposed a scalable GCN model based on the combination of sampling and parallelization approaches. In the method, the graph was divided into subgraphs, and each subgraph was submitted independently to the GCN model for parallel processing. The sampling method utilized the frontier algorithm [90]. The training step was performed in parallel for each subgraph. In the same way, Chiang et al. [91] designed a new variant of GCN that divided the graph into subgraphs. The partitioning was performed as a preprocessing step by a clustering algorithm. Each batch contained either a single subgraph or multiple subgraphs. The experimental results showed that the methods achieved a good improvement in terms of memory and computational efficiency.
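The subgraph-based training strategy above can be illustrated with a small helper that splits an edge list into per-partition subgraphs, each of which could then be processed as an independent mini-batch. The node-to-partition assignment is assumed to be given here (the cited works obtain it through sampling or a clustering algorithm), and the function name is hypothetical.

```python
def induced_subgraphs(edges, partition):
    """Split a graph into per-partition subgraphs.

    edges: list of (u, v) pairs.
    partition: dict mapping each node to a partition id (assumed given).
    Returns (subgraphs, dropped), where subgraphs maps each partition id
    to its intra-partition edges and dropped counts the cut edges.
    """
    subgraphs = {part: [] for part in set(partition.values())}
    dropped = 0
    for u, v in edges:
        if partition[u] == partition[v]:
            subgraphs[partition[u]].append((u, v))
        else:
            dropped += 1  # edge crosses two partitions and is ignored
    return subgraphs, dropped

# Example: a path graph 0-1-2-3 split into two halves
subs, cut = induced_subgraphs([(0, 1), (1, 2), (2, 3)],
                              {0: 0, 1: 0, 2: 1, 3: 1})
```

The number of dropped edges indicates how much structural information is lost by the partition, which is why a good partitioning step matters for accuracy.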
Ma et al. [28] proposed a new framework to support parallel neural network computations on graphs. The authors investigated data partitioning, parallelism, and scheduling of graphs in neural networks in order to move beyond low-dimensional regular grids. The framework also minimized the communication between the GPU and the host (CPU) and maximized the overlap of computation. In the same direction, Liu et al. [92] presented a GPU-based framework for GNNs. The framework aimed to improve the efficiency of GNN training by using parallel graph processing. In [93], Zeng et al. proposed a new parallel framework for GNN models. The proposed framework used a parallel graph sampling approach during the training process. Then, the main computation steps were parallelized on a shared memory system. Data partitioning was utilized in the framework to reduce memory traffic and improve cache utilization. Experimental results demonstrated that the methods achieved linear speedup values.
As for unsupervised learning, relevant works along this direction are still few. However, some authors have suggested deep parallel computing to develop large-scale DL models. For example, Bhatia and Rani [31] proposed a method for discovering overlapping clusters based on an autoencoder model with distributed parallel computing. A Hadoop platform with 8 machines was used, together with Giraph, an iterative graph processing framework based on the bulk synchronous parallel model. The authors extended their work in [94], applying the same idea and adding degree metrics to compute the importance of nodes during the clustering process. The authors tested their methods on large networks, and the results showed that the methods achieved acceptable efficiency.
In brief, based on the review in this section, we have found that parallel computing is seen as one of the important solutions for building large-scale DL models. In regular domains, a great effort has been devoted to developing large-scale DL models, but in irregular domains, e.g., graphs, less attention has been paid to building effective and efficient large DL models. There have been a few works related to DL-based graph applications, especially in relation to supervised learning techniques, e.g., classification using GCNs. However, GCN does not fit the clustering task because the embedding derived from GCN is not oriented toward CD [18]. In unsupervised DL-based graph applications, there are very few studies that use parallel approaches; for example, [31] and [94] proposed parallel DL-based overlapping clustering. However, these methods have several limitations: massive synchronization between nodes, since the bulk synchronous parallel model was used; the need for extensive trainable parameters; and the lack of GPU parallelism, model parallelism, and data parallelism. Moreover, these DL models were optimized based on random walk and PageRank techniques, which focus on the local structure of the network and are considered conventional ML techniques. Thus, DL-based graph applications (e.g., DL-based CD in CNs) need further investigation to design efficient and scalable DL models. Table 5 summarizes recent and important studies in the area of DL and parallel computing in regular and irregular domains.

VII. META-HEURISTIC-BASED OPTIMIZATION METHODS
MH refers to a family of algorithms that includes evolutionary algorithms such as the Genetic Algorithm (GA) [24] and nature-inspired algorithms such as Particle Swarm Optimization (PSO) [23]. MH algorithms have been successfully applied to optimizing machine learning models and are known as efficient methods for solving complex problems and finding optimal or near-optimal solutions in a reasonable amount of time. They are now state-of-the-art methods for solving various optimization problems, especially those that are very complex and high-dimensional, and the detection of communities using DL in CNs is no exception [25]. In this section, the popular MH algorithms are presented first, followed by conventional MH-based studies that have been developed for CD (i.e., without deep learning models), and finally the methods that have been used for training deep neural network models.

A. COMMON META-HEURISTIC ALGORITHMS
GA was developed in the early 1970s [95]. It is the most popular and most commonly used evolutionary method. GA typically uses a binary data representation, but it can also support other types of data representations. GA comprises the following key parts: chromosomes (i.e., candidate solutions) and the selection, crossover, mutation, and fitness procedures. All these procedures work together to obtain optimal or near-optimal solutions. In the crossover process, the offspring inherit from and are formed by two parents. In the mutation phase, one or more chromosomes are randomly selected and altered to increase the diversity of the population [96]. The algorithm has been applied to optimize solutions to problems in various domains and applications, e.g., CD in CNs [97], classification in images [98], and DL applications [99].
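The interplay of selection, crossover, and mutation described above can be sketched as a minimal binary GA. The tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter values below are illustrative assumptions rather than the settings of any cited method.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=50,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over binary chromosomes of length n_bits."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select():
        # Binary tournament: the fitter of two random chromosomes wins
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation keeps the population diverse
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)  # elitism: keep best so far
    return best

# Example: maximize the number of ones (the OneMax toy problem)
solution = genetic_algorithm(sum)
```

For CD, the chromosome would encode a partition of the nodes and the fitness would be a quality function such as modularity, instead of the toy objective used here.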
PSO algorithm is a population-based optimization algorithm [23], [100]. It uses a stochastic optimization technique and emulates the behavior of bird swarms to find the solution to the optimization problem. Many particles are randomly created over the search space to form a swarm. Each particle is generated in such a way that it represents a solution candidate for an optimization problem. The particle moves in the search space with a specific velocity. The particles can go back to the best prior position according to their memory, which maintains the previous best position. PSO has been successfully used to provide solutions to optimization problems in various fields, e.g., CD in CNs [101], [102], images [103], wireless sensor networks [104], etc.
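The velocity and position updates described above can be sketched as follows. The inertia weight `w`, the acceleration coefficients `c1` and `c2`, the search bounds, and the sphere test function are common illustrative choices and are assumptions of this example, not values taken from the cited works.

```python
import random

def sphere(x):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)

def pso(f, dim=2, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimize f with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]        # each particle's best position
    gbest = min(pbest, key=f)[:]       # best position found by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest, f(gbest)

best_position, best_value = pso(sphere)
```

The personal-best memory is what lets a particle return toward its best prior position, while the swarm-best term shares information across particles.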
Ant Colony Optimization (ACO) is a population-based MH that can be used to find nearly optimal solutions for difficult optimization problems. It was developed in 1992 by Dorigo [105] and was inspired by the way ants search for food, which is used to find the optimal path in the search space of an optimization problem. Each ant first explores the search space near its nest at random to search for food. When the ant returns to the nest, it marks the path with chemical pheromones. The pheromones left on the ground help other ants to find the shortest path to the food source. ACO has been applied to a variety of optimization problems in various domains and applications, e.g., CN applications [106], DL applications [107], internet-of-things applications [108], etc.
The Differential Evolution (DE) algorithm was introduced in 1996 by Storn and Price [109]. It is a method that optimizes a problem by iteratively trying to improve candidate solutions with respect to a given evaluation function. The optimization process starts with the random initialization of candidate solutions. Then, new individuals are created by crossover and mutation in each generation of the evolutionary process. The target individual and the mutated individual are recombined to create the trial individual, which retains useful information from the prior generation. Some DL and image applications have been successfully addressed using the DE algorithm [110], [111].
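The mutation, recombination, and selection steps described above can be sketched with the common DE/rand/1/bin scheme. The choice of this particular variant, the scale factor `f_scale`, the crossover rate `cr`, and the test function are assumptions made for illustration.

```python
import random

def sphere(x):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)

def differential_evolution(f, dim=2, pop_size=15, f_scale=0.8, cr=0.9,
                           generations=100, bounds=(-5.0, 5.0), seed=0):
    """Minimize f with DE/rand/1/bin (requires pop_size >= 4)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i
            others = [p for k, p in enumerate(pop) if k != i]
            a, b, c = rng.sample(others, 3)
            mutant = [a[d] + f_scale * (b[d] - c[d]) for d in range(dim)]
            # Binomial crossover between target pop[i] and the mutant;
            # j_rand guarantees at least one gene comes from the mutant
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == j_rand)
                     else pop[i][d]
                     for d in range(dim)]
            # Greedy selection: keep the trial only if it is no worse
            if f(trial) <= f(pop[i]):
                pop[i] = trial
    return min(pop, key=f)

best = differential_evolution(sphere)
```

Because of the greedy selection step, the quality of each population slot never degrades from one generation to the next.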
The Artificial Bee Colony (ABC) is a swarm-based MH algorithm introduced in 2005 [112]. The algorithm was inspired by the natural behavior of bee colonies. The bee colony method can be effectively used for the design of intelligent system models as it includes several features such as foraging, communication, task selection, group decisions, etc. The model has three main components: employed forager bees, unemployed forager bees, and food sources. Employed and unemployed forager bees look for plentiful food sources near their hive. The model also defines two guiding behaviors required for self-organization and group intelligence: the recruitment of foragers to rich food sources, leading to positive feedback, and the abandonment of poor sources by foragers, leading to negative feedback. The ABC algorithm has been used for optimization in various fields, such as image and DL applications [115], [116].
The Firefly Algorithm (FA) is a new MH algorithm based on the flashing patterns and behavior of fireflies. It was developed by Yang in 2007 [117]. The brightness of a firefly is determined by the objective function. Attractiveness in FA depends on brightness, so a less bright firefly moves towards a brighter one. If there is no brighter firefly, the movement is random. FA has been used for CN applications [118] and for image and DL applications [119].
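The attraction rule described above is commonly written as x_i ← x_i + β0 exp(-γ r²)(x_j - x_i) + α(rand - 0.5), where r is the distance between fireflies i and j. The sketch below implements this single movement step; the parameter names and default values are illustrative assumptions.

```python
import math
import random

def firefly_move(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.1, rng=None):
    """Move firefly i toward a brighter firefly j (one update step).

    beta0: attractiveness at zero distance
    gamma: light absorption coefficient
    alpha: scale of the random perturbation
    """
    rng = rng or random.Random(0)
    # Squared Euclidean distance between the two fireflies
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    # Attractiveness decays with distance
    beta = beta0 * math.exp(-gamma * r2)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]

# With no absorption and no noise, the firefly moves halfway toward j
new_position = firefly_move([0.0], [1.0], beta0=0.5, gamma=0.0, alpha=0.0)
```

In a full algorithm, this step would be applied to every pair (i, j) in which firefly j is brighter, i.e., has a better objective value.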
Cuckoo Search (CS) is a new MH optimization algorithm introduced in 2009 [113]. The algorithm combines the brood parasitism of some cuckoo species with random Lévy flights to solve optimization problems. CS has been used for optimization in various domains and applications, e.g., CD in social networks [120], and classification and prediction in images [121], [154].
Bat Algorithm (BA) is a recent MH algorithm for global optimization. It was developed by Yang in 2010 [114]. The algorithm is inspired by the echolocation behavior of micro-bats, which involves the use of varying pulse emission rates and loudness. Echolocation is used to determine distance, and all bats are assumed to know the difference between prey and background barriers. BA has been applied to solve optimization problems in various fields, e.g., CD in CNs [122] and DL applications [123].
Readers can refer to a comprehensive study that is presented in [124] about the state of art of MH algorithms. The important advantages and disadvantages of these MH algorithms are summarized in Table 6.

B. COMMUNITY DETECTION BASED ON META-HEURISTIC ALGORITHMS
Conventional MH-based ML studies have received much attention for addressing the CD problem, and the literature has expanded with diverse applications. Important and current methods of CD using MH algorithms are presented in this section.
In [97], Said et al. proposed a GA-based algorithm for CD in CNs. The crossover and mutation operations were iterated until the stopping conditions were met, and the best solutions were then selected. The algorithm used the clustering coefficient to generate the initial population and in the mutation method, which improved its efficiency and accuracy. It used the modularity function as the fitness function. Likewise, Zaire and Meybodi [125] suggested a method to find communities in CNs based on the integration of GA and object migrating automata. The method sought to maximize the modularity function. In [126], Rostami et al. developed a new GA-based CD method for feature selection. The method consisted of three steps: the features were computed and selected, these features were then clustered, and finally the best clustering was selected by GA.
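Several of the methods above use modularity as the fitness function. As a reference point, the sketch below evaluates the standard modularity Q = (1/2m) Σ_ij (a_ij - k_i k_j / 2m) δ(c_i, c_j) of a given partition of an undirected, unweighted graph. The function name is illustrative, and individual methods may use variants of this measure.

```python
def modularity(adj, labels):
    """Modularity Q of a partition of an undirected graph.

    adj: symmetric adjacency matrix (list of lists).
    labels: labels[i] is the community id of node i.
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed edge minus the edge expected by chance
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m

# Example: two disconnected edges, each forming its own community
adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
q_split = modularity(adj, [0, 0, 1, 1])
q_single = modularity(adj, [0, 0, 0, 0])
```

In this example, the natural two-community split scores Q = 0.5, whereas placing all nodes in one community scores Q = 0, so an MH search maximizing Q prefers the split.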
The PSO algorithm is also applied to solve the community discovery problem. In [127], Cai et al. developed a PSO algorithm to find communities in social networks. The PSO particles were adapted to a discrete scenario. The update rules used by the method depend on the greedy strategy and the network topology. It found communities in large social networks. A discrete PSO-based multi-objective optimization algorithm was developed by Gong et al. [101] to discover communities in CNs. The method attempted to minimize the two objective functions: Kernel k-means and Ratio cut functions. It dealt with signed and unsigned networks. In [128], Liu et al. presented a multi-objective PSO algorithm for extracting network embedding and finding communities. The method mapped the nodes in a low-dimensional space. Thus, this led to an increase in search efficiency as the search space was reduced. Another variant of PSO algorithm was developed for CD in [129], called Sigmoid Fish Swarm Optimization (SFSO). The algorithm used the sigmoid function for different fish movements in a swarm. It outperformed the traditional PSO algorithm in this task.
The Firefly algorithm (FA) has been adapted for CD in CNs. Amiri et al. [118] developed a CD method based on FA optimization. The method used multi-objective optimizations so that a set of solutions with Pareto optimum can be obtained. The parameters were tuned based on self-adaptive probabilistic mutation and a chaotic mechanism. Similarly, Del Ser et al. [130] proposed a CD method based on FA algorithm. It emulated the behavioral patterns of fireflies that attract each other while flying, which was used for numerical optimization. In [131], Jaradat and Hamad also used an FA optimization algorithm to solve the CD problem. They also proposed a parallel approach to perform this task. The above FA-based CD methods have been evaluated in real and synthetic CNs, and their results were promising for this task.
BA is another MH algorithm that has been used to develop and improve CD methods. In [132], Hassan et al. proposed a new CD method based on discrete BA optimization. The method used the modularity function as the fitness function to maximize. It did not require prior information about the number of communities, which can be inferred from the locus-based adjacency coding scheme. Likewise, Song et al. [133] solved the CD problem with a discrete BA optimization. The method found the global optimal solution, and it automatically determined the number of communities. Doush et al. [122] also adapted multi-objective BA optimization to address the CD problem. The concept of Pareto dominance was used to select the optimal solution in multi-objective optimization. The results of the above methods showed that BA optimization is able to find the communities in CNs with improved CD performance.
The ACO algorithm has also attracted the attention of researchers for addressing the CD problem. In [134], Chen et al. proposed an algorithm based on ACO to solve this problem. The algorithm used artificial ants traveling on a logical digraph to generate CD solutions. Each ant chose a path according to the heuristic information about each path. The degree of association was used as heuristic information. Similarly, Guo et al. [135] adapted ACO to find communities in CNs. The algorithm included an initialization step and three search phases: employed bee, onlooker bee, and scout bee. The CD solutions were first initialized and then tuned by the three search phases. In the same direction, the multi-objective ACO algorithm has been adapted to discover communities in CNs in [106]. Two objective functions were used: community score and community fitness, which measured the density of groups and minimized the external relations, respectively. During the execution of the algorithm, a Pareto set was maintained to store non-dominated solutions. The results showed that the above algorithms successfully detected community structures and achieved competitive CD performance.
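The locus-based adjacency coding scheme mentioned above can be illustrated with a small decoder. In this encoding, gene i holds a node j, meaning that nodes i and j belong to the same community; the communities are the connected components of the resulting links, so their number does not have to be fixed in advance. The union-find decoder below is a sketch under that standard interpretation. In practice, j is usually restricted to the neighbors of i, which is not enforced here.

```python
def decode_locus(genotype):
    """Decode a locus-based genotype into community labels.

    genotype[i] = j links nodes i and j; the connected components of
    these links are the communities. Labels are assigned in order of
    first appearance.
    """
    parent = list(range(len(genotype)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in enumerate(genotype):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components

    labels = {}
    result = []
    for i in range(len(genotype)):
        root = find(i)
        result.append(labels.setdefault(root, len(labels)))
    return result

# Nodes 0 and 1 point to each other, as do nodes 2 and 3
communities = decode_locus([1, 0, 3, 2])
```

Here the decoder returns two communities, {0, 1} and {2, 3}, without the number of communities being specified anywhere in the genotype.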
CS algorithm has also been successfully used to solve the CD challenge. Zhou et al. [136] proposed a discrete multi-objective CS algorithm for CD. Two objective functions were minimized, namely negative ratio association and ratio cut. It found high-quality communities without prior information. On the same topic, Babers and Hassanien [120] proposed a CS algorithm for CD in social networks. The method used the modularity function as an objective function. The locus-based adjacency scheme was used to represent individual solutions in the network and community structure. The results of the mentioned algorithms show that the CS algorithm is promising to solve the CD problem.
In a nutshell, remarkable efforts have been made to develop conventional MH-based ML algorithms for CD. These algorithms have global search capabilities and good local learning, and can deal with a wide range of CD problems. Moreover, they can automatically determine the number of communities, and they can be implemented in parallel and efficiently. However, they suffer from low accuracy and efficiency, especially when dealing with high-dimensional and complex data, such as large CNs [14]. The development of effective and efficient algorithms is therefore of great interest, especially for large and complex data. Accordingly, in the next subsections, studies on the integration of MH algorithms and DL models in different domains are reviewed to offer interested researchers the opportunity to bridge the existing gap in the CD field.

C. META-HEURISTIC ALGORITHMS INCORPORATING WITH DEEP LEARNING
Recently, several studies have proposed MH algorithms for optimizing parameters and hyperparameters of DL models to create effective and efficient models. The success of MH algorithms in optimization tasks has prompted the research community to apply them to large and difficult problems. In the field of DNNs, Gradient Descent (GD) is replaced by iterative MH algorithms for tuning parameters, e.g., weights. The main purpose of using these methods is to overcome the limitations of GD methods (i.e., Backpropagation (BP)). The GD-based BP algorithm has three shortcomings: the training process is slow, especially with big data; it is sensitive to parameter initialization; and it can become trapped in local minima, which leads to a drop in performance [21], [22]. In this subsection, we outline the recent and important studies that used MH algorithms for training and optimization of DL models.

1) DEEP LEARNING BASED ON META-HEURISTIC ALGORITHMS
An early study adapting GA for training Artificial Neural Networks (ANNs) was proposed by Leung et al. in [137]. The authors developed an improved version of GA to train neural networks. All functions of GA were redefined, including the crossover and mutation operations and the fitness function. The authors showed that the parameters of the neural network can be efficiently tuned using the improved GA. In the same research direction, GA was recently adapted to optimize the structure and hyperparameters of deep CNNs in [138]. The proposed method was evaluated on an amyloid brain image dataset for disease diagnosis. Pan et al. [98] also adapted GA to optimize deep CNNs for classification in multi-unmanned aerial vehicles. GA obtained the scenario states and path segments to train CNNs. Then, the CNN reproduced the path planning resulting from GA's experience. Similarly, David and Greental [99] incorporated GA in a deep autoencoder. The GA was used to optimize the parameters (i.e., weights) of the autoencoder. The proposed method overcame the problem of tied weights of the encoder and decoder models to enhance classification accuracy. The experimental results indicated that GA was promising for optimizing DL models.
The PSO algorithm has also been adapted for optimizing DL models. Gudise and Venayagamoorthy [139] developed a PSO-based population algorithm to train ANNs [140]. The fitness value of each particle was the value of the cost function evaluated at the particle's current position, which corresponded to the weight matrix. The authors presented a comparative study of neural network training based on the PSO and BP algorithms. Recently, Rajagopal et al. [103] proposed a deep CNN for scene classification in unmanned aerial vehicles. The authors optimized the model with a multi-objective PSO algorithm. The method allowed the vehicle to acquire videos, to which a pre-processing step was applied. The training step was then performed with the CNN-based multi-objective PSO model. Finally, the images were classified. In [141], Band et al. proposed a new DNN approach for modelling gully erosion susceptibility. The approach used PSO optimization in the training step. The experimental results showed that PSO was effective and efficient for training DL models.
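The mapping between a particle position and a network's weights can be sketched as follows. This is an illustrative example rather than the method of [139]: it assumes a tiny 2-2-1 network with tanh hidden units, the XOR data as a toy task, and commonly used PSO coefficients; all names and parameter values are hypothetical.

```python
import math
import random


def forward(w, x):
    """2-2-1 feedforward network; w is a flat list of 9 weights and biases."""
    h1 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    return w[6] * h1 + w[7] * h2 + w[8]


def cost(w, data):
    """Mean squared error; this is the fitness of a particle at position w."""
    total = 0.0
    for x, y in data:
        err = forward(w, x) - y
        total += err * err
    return total / len(data)


def pso_train(data, dim=9, n_particles=20, iters=200,
              inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Train the network with PSO; returns the best weights and the cost history."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_cost = [cost(p, data) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]   # global best position
    history = [gbest_cost]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i], data)
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
        history.append(gbest_cost)
    return gbest, history


data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
best, history = pso_train(data)
print(history[0], history[-1])  # initial and final best cost
```

No gradient is computed anywhere: the only information the swarm uses is the cost at each visited position, which is why such methods can be applied when gradients are unavailable or unreliable.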
Mavrovouniotis and Yang [107] incorporated the ACO algorithm in feedforward neural network models. They proposed two optimization algorithms: in the first, the model was trained with a stand-alone ACO optimization algorithm; in the second, GD-based BP and ACO were combined to train the model. Both optimization algorithms were evaluated on several datasets for pattern classification. The results showed that the ACO algorithm could achieve efficient performance, especially when ACO and GD were integrated. Recently, Zhang et al. [142] proposed DNN-based ACO optimization to forecast the cost of mining projects. BA was also adapted to optimize ANNs by Jaddi et al. [123]. It optimized the structure and the parameters of ANN models. The algorithm improved the effectiveness and reduced the complexity of such models. The authors also proposed two modifications of the bat algorithm (namely MBatDNN and MeanBatDNN) to improve the optimization process. The results demonstrated that the proposed methods outperformed traditional optimization algorithms.
Other MH algorithms were also adapted to train ANNs. For example, Karaboga et al. [115] suggested training ANNs with the ABC algorithm. It was the first study that trained neural networks based on ABC. Recently, in [116], ABC was combined with a DL model to detect the pattern of COVID-19 patients. In this method, the deep CNN model was used to extract the features of the X-ray images, and the ABC algorithm then refined the features to select the best ones. The CS algorithm [143] was also adapted to optimize the structure and parameters of ANNs to improve the accuracy and the efficiency of the model. In [144], Cristin et al. proposed a DL method for plant disease determination and prediction. The method used a Deep Belief Network (DBN) model for this task, and the model was optimized by integrating the Rider optimization algorithm and the CS algorithm. Another algorithm, called water wave optimization (WWO), was proposed by Zhou et al. [145] to optimize DNNs. The method was used to solve the high-dimensional optimization problem. Similarly, the FA algorithm was proposed by Strumberger et al. [119] to select the optimal CNN structure. The hyperparameters tuned by FA included the number of layers, the number of kernels, and the kernel size. In the same research direction, Rere et al. [146] used Simulated Annealing (SA), which is a single solution-based algorithm [147], for optimizing CNN models. The experimental results showed that the above-mentioned methods achieved good performance and were more effective than other optimization techniques in regular domains (e.g., image classification). Table 7 lists research publications relevant to DL-based stand-alone MH optimization algorithms.

2) DEEP LEARNING BASED ON HYBRID META-HEURISTICS AND GRADIENT DESCENT ALGORITHMS
Several studies have suggested an optimization process that integrates MH algorithms with GD-based algorithms. Bakhshi et al. [121] used GA to explore a suitable CNN architecture and tune hyperparameters such as the learning rate and the number of layers. In this method, the hyperparameters and parameters (i.e., weights) were optimized by the GA and BP algorithms, respectively. Lander and Shang [110] proposed a new evolutionary framework (called EvoAE) to optimize autoencoders. EvoAE evolved a population of autoencoders and searched for structures and features at the same time. It learned different features from large datasets and reduced training time. EvoAE generated new autoencoder models from crossover and mutation operations with chromosomes. It also supported training with the BP algorithm.
In the same research direction, in [111], Sui et al. incorporated PSO into a marginalized stacked denoising autoencoder for tuning parameters. PSO could hold the best configuration of the marginalized autoencoder. Similarly, Silhan et al. [148] suggested an approach to tune the hyperparameters of a stacked autoencoder model by using evolutionary algorithms. The results showed that the methods could find the best hyperparameters, but the training time of the model increased approximately threefold. Similarly, Zhang et al. [149] integrated PSO and BP for neural network training. This integration aimed to take advantage of both algorithms, with PSO performing a global search in the weight initialization phase. The BP algorithm then performed a local search around the best solution found by PSO. The experimental results showed that the combination of the algorithms was better in terms of efficiency and effectiveness than either stand-alone algorithm, PSO or BP.
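The two-phase scheme, a global PSO search followed by a gradient-based local search, can be sketched as below. This is only an illustration of the principle and not the algorithm of [149]: to keep the example short, the multimodal Rastrigin function is assumed as a stand-in for a network's training loss, and its analytic gradient plays the role that BP plays for a real network. All names and parameter values are hypothetical.

```python
import math
import random


def loss(w):
    """Rastrigin function: a multimodal stand-in for a network's training loss."""
    return sum(x * x - 10.0 * math.cos(2.0 * math.pi * x) + 10.0 for x in w)


def grad(w):
    """Analytic gradient of the stand-in loss (the role BP plays for a network)."""
    return [2.0 * x + 20.0 * math.pi * math.sin(2.0 * math.pi * x) for x in w]


def pso_init(dim, n_particles=30, iters=100, seed=1):
    """Global phase: PSO searches for a promising starting point."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pcost = [loss(p) for p in pos]
    gbest = min(pbest, key=loss)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = loss(pos[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = pos[i][:], c
                if c < loss(gbest):
                    gbest = pos[i][:]
    return gbest


def gd_refine(w, lr=0.001, steps=500):
    """Local phase: plain gradient descent starting from the PSO solution."""
    w = w[:]
    for _ in range(steps):
        w = [x - lr * g for x, g in zip(w, grad(w))]
    return w


start = pso_init(dim=4)
final = gd_refine(start)
print(loss(start), loss(final))  # loss before and after the local phase
```

The step size is chosen small enough for gradient descent to decrease this particular loss monotonically, so the local phase can only improve on the point delivered by the global phase; which basin it ends in is determined by the PSO phase.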
Yi et al. [22] proposed a method combining the BP and CS algorithms for the optimization of a regression neural network model. CS assisted BP in weight initialization. In addition, FA [150] was adapted to optimize the parameters (i.e., weights) and thresholds between the input layer and the hidden layer of neural networks. Similarly, Rojas-Delgado et al. [151] proposed an approach that included three MH algorithms: PSO, FA, and CS. They added a continuation method to these algorithms, in which the problem was solved by moving progressively from a simplified problem to the actual problem. The approach was evaluated using public benchmark datasets, and the experimental results showed that the continuation method applied to the three studied MH algorithms could reduce the execution time by 5-30% compared with the standard MHs.
More recently, new hybrid optimizations of gradient descent and MH algorithms have appeared. Yadav [152] developed a hybrid optimization of ANN models for medical diagnostic applications. The method integrated the PSO and GA algorithms with the BP algorithm using the Adam optimizer. The PSO and GA algorithms overcame the shortcomings of the BP algorithm. In [153], the PSO algorithm was also integrated with the BP algorithm to perform hybrid optimization for wind power prediction. A comparison was made with the BP and GA-BP algorithms, and PSO-BP outperformed both. In [154], the PSO algorithm was combined with the GD-BP algorithm to train a radial basis function (RBF) neural network. The purpose of this combination was to take advantage of both algorithms at the same time. Similarly, Mohapatra et al. [155] integrated PSO with the BP-Adam optimizer to train ANNs for mathematical equivalence of error gradients. Seifi and Soroush [156] proposed an ANN-based MH algorithm for climate prediction. The method used three MH algorithms, namely GA, the Whale Optimization Algorithm (WOA), and Grey Wolf Optimization (GWO), with ANN models for the estimation process. Tran-Ngoc et al. [157] also developed ANNs for damage detection based on a hybrid MH optimization algorithm. The method combined the GA and CS algorithms to quickly find the best solution and avoid local solutions. In addition, the method used a vectorization technique for the objective function to reduce the computational cost. The above hybrid optimization algorithms were compared with traditional GD-BP optimizers, and their theory and algorithmic performance were supported by promising results. Table 8 lists research publications relevant to DL-based hybrid GD and MH optimization algorithms.

3) LARGE SCALE DEEP LEARNING-BASED META-HEURISTIC ALGORITHMS
A few studies have combined MH algorithms and DL techniques for the analysis of big data. Hegde and Mundada [158] presented a method for the analysis of big data based on PSO and DL. The meaningful features were extracted by the PSO algorithm, and a DL model was applied for the classification task. The aim was to achieve high classification accuracy. Cao et al. [159] extended the combination of PSO and BP into a distributed parallel computing scheme to improve performance and efficiency. The parallel model was designed with MapReduce on the Hadoop platform. The proposed model was evaluated on image classification. The results showed that the proposed method achieved high classification accuracy and efficiency. Recently, Saranya and Nagarajan [160] proposed a large-scale MH optimization for DNN training. The model was used to deal with large data for predicting agricultural yields using the Hadoop framework in parallel. The model was evaluated on large data from plant images.
Coelho et al. [161] proposed a DL-based MH algorithm for time series forecasting on a GPU device. In this method, a multithreaded GPU-based strategy was developed to improve the efficiency of the DL-based MH model. The experimental results showed that the proposed method was scalable and efficient, and that the GPU implementation achieved a high speed-up compared with a sequential CPU implementation. Table 9 lists recent studies on the application of DL-based MH optimization methods in big data research.
In short, the incorporation of MH algorithms with DL models is attracting increasing attention in the literature. We observe that MH algorithms play an important role in improving the performance and efficiency of DL models. Some of them work as stand-alone optimization algorithms, while others are combined with gradient descent optimization (i.e., the BP algorithm). However, most of the existing algorithms exhibit low accuracy and efficiency, especially when dealing with large and complex data. Therefore, the development of effective and efficient DL-based MH algorithms for large and complex data is of great interest. To the best of our knowledge, no study has applied DL together with MH algorithms to community detection.

VIII. OPEN DATASETS FOR COMMUNITY DETECTION
Open datasets are available to encourage and facilitate research in CD. These datasets can be categorized into two groups: synthetic datasets and real datasets.

A. SYNTHETIC DATASETS
A synthetic dataset is artificially produced rather than collected from real events. In the field of CD, there are two widely used synthetic networks with known community structure, i.e., the Girvan-Newman (GN) [162] and Lancichinetti-Fortunato-Radicchi (LFR) [163], [164] synthetic networks. The GN is a small synthetic network consisting of 128 nodes divided into four communities of 32 nodes each. Each node in the GN network has an average of 16 relations, including relations with nodes in the same community and in different communities. The LFR network is more complicated than the GN network. It shares important properties with real-world networks, such as scaling of network size and heterogeneity in the distributions of node connections and community sizes. These distributions can be tuned according to a power law with different exponents. Because of these properties, the LFR is a more popular benchmark synthetic dataset in community detection than the others.
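A network matching the GN description can be generated with a simple planted-partition sketch. The code below is an assumption-laden illustration rather than the reference generator of [162]: edges are drawn independently, and the parameter `z_out` (the expected number of inter-community relations per node) and the function name `gn_benchmark` are hypothetical choices made for this example.

```python
import random


def gn_benchmark(z_out=4.0, n_groups=4, group_size=32, avg_degree=16.0, seed=0):
    """Planted-partition generator following the GN benchmark description.

    Returns an edge list and the ground-truth community label of every node.
    """
    rng = random.Random(seed)
    n = n_groups * group_size                  # 128 nodes by default
    labels = [v // group_size for v in range(n)]
    z_in = avg_degree - z_out                  # expected intra-community degree
    p_in = z_in / (group_size - 1)             # 31 possible partners inside
    p_out = z_out / (n - group_size)           # 96 possible partners outside
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels


edges, labels = gn_benchmark(z_out=4.0)
print(len(labels), 2 * len(edges) / len(labels))  # node count, mean degree
```

Increasing `z_out` while keeping the average degree at 16 blurs the community boundaries, which is how such a benchmark is typically used to probe how far a CD method can be pushed before it fails to recover the planted communities.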

B. REAL-WORLD DATASETS
Real-world datasets are more effective and accurate benchmarks for evaluating the performance of algorithms for CD. In this subsection, we introduce 27 well-known and commonly used open real-world network datasets, which are classified into four categories: social networks, citation networks, collaboration networks, and others. These datasets are listed in Table 10. The ten social network datasets (numbered from 1 to 10 in Table 10) consist of numerical records of individuals and their relationships in a social network. The eight datasets of citation networks (numbered from 11 to 18 in Table 10) consist of records for papers or patents and their relationships, such as citations. The five datasets from collaboration networks (numbered from 19 to 23) comprise relationship records about scientists and their collaborations. The remaining four datasets (numbered from 24 to 27) are records from other types of community network. The size of these datasets ranges from small to large in each category in terms of the number of nodes and edges.

IX. SUMMARY AND CHALLENGES
The review first presented, in Section IV, the important conventional methods proposed to solve the problem of CD in CNs, i.e., those that do not use DL techniques. It is shown that great efforts have been devoted to the CD issue on CNs by applying early conventional methods, as presented in Subsection IV (A). However, most of these methods face challenges in dealing with large networks efficiently. Then, efficient methods that adopted parallel computing for handling such tasks were introduced in Subsection IV (B). However, these conventional methods could not perform CD effectively and efficiently on CNs whose datasets have high-dimensional and diverse features. In addition, they have a high probability of getting trapped in locally optimal solutions and struggle in the face of today's complex data and the increasing scale of CNs and dimensionality of data [13]. Hence, researchers were motivated by the success of DL applications in several fields and recently moved their attention to investigating effective and efficient solutions using DL models, especially autoencoder models, for CD.
Section V presented the literature on DL-based solutions for CD. Our review found that a reasonable effort has been devoted to developing unsupervised autoencoder models, GNNs, and other variants of DL models for CD. These works were summarized in Tables 2, 3, and 4, respectively. However, most of the existing methods show shortcomings in terms of low efficiency and scalability, and in addition, they were evaluated on small-scale networks with at most a few thousand nodes. A few studies, such as [31], [94], were proposed to address this problem. However, these methods still showed other drawbacks, e.g., they require a large number of trainable parameters and rely on heavy synchronization between nodes through the bulk synchronous parallel model. In addition, DL-based CD methods were not deployed with GPU parallelism and model parallelism. Another method tackled the problem with a sampling approach, which was a speed-accuracy trade-off strategy [81]. Moreover, all existing DL-based CD methods depend on the GD with BP optimization algorithm. Thus, there are three inherent limitations: the training process is slow, especially when processing large datasets; the method is sensitive to parameter initialization; and the method can converge to local minima, which reduces the quality of CD.
Section VI presented a brief review of different DL models that use parallel computing methods in both regular and irregular fields, which were listed in Table 5. According to the review in this section, parallel computing is clearly considered a better solution for DL on large-scale data. In regular areas (i.e., clear and grid structure, low-dimensional space), we have seen great efforts to develop large-scale DL models based on parallel computing. In irregular areas (i.e., unclear structure, high-dimensional space such as graphs or CNs), a reasonable effort has been made to use parallel computing to improve the efficiency of DL-based graph applications, especially in supervised learning methods, such as classification with GCNs. On the other hand, only very limited studies, such as [31], [94], were proposed to improve unsupervised DL methods (i.e., stacked autoencoders for CN clustering), and there is still a gap in deploying these unsupervised DL methods in a parallel computing framework. For example, no GPU parallelization or efficient partitioning technique was used, and the parallel computation also suffers from high synchronization overheads.
Section VII introduced the common MH algorithms. Their advantages and disadvantages are summarized in Table 6. Then the section presented the utilization of MH algorithms to solve the problem of CD. According to the review in Subsection VII (B), it is clear that the MH algorithms have global search capabilities and good local learning, and can deal with a wide range of CD problems. Moreover, they can automatically determine the number of communities and can be implemented efficiently in parallel. However, they struggle to achieve high accuracy and efficiency, especially when dealing with large CNs.
Section VII also presented the integration of MH algorithms into DL models for the optimization process. The optimization of DL models could be based on stand-alone MH algorithms, or on a hybrid of gradient descent and MH algorithms, which were summarized in Tables 7 and 8, respectively. In addition, some methods have been introduced to integrate parallel computing and MH algorithms with DL models to process big data more efficiently (see Table 9). The MH algorithms have demonstrated their ability to improve the performance of DL models in various applications. However, most of the existing methods focus on improving the effectiveness of DL models and ignore their efficiency in terms of space and time complexity, especially when the models are applied to large datasets. Nonetheless, some studies have been developed to improve the efficiency of DL models. MH algorithms have been blended into parallel computations to improve a DL model's efficiency, but the number of such research works across different application domains is still very small. Likewise, the use of MH algorithms for optimizing DL models in CN applications (oriented toward either clustering or classification) is still very rare, and DL-based CD is no exception.

X. FUTURE DIRECTIONS AND PERSPECTIVES
Despite the fact that DL-based CD has shown superior performances, there are several problems that are still open and have not been fully solved. In this section, we briefly discuss these future prospects that can help overcome these problems.
One of the main challenges of CD is to efficiently identify communities in large-scale CNs. Many existing DL-based CD methods are inefficient when dealing with large CNs due to prohibitive demands on memory and computation (see Sections V and VI). However, large networks are now common among real-world CNs, e.g., the Facebook and Twitter networks. Therefore, novel DL-based CD methods should be developed to efficiently and effectively utilize the rich information of large CNs. One possible direction is to develop DL-based CD methods that can achieve HPC with three features: efficient sub-network training, data and model parallelism, and the use of both CPU and GPU devices. In addition, it is advisable to design parallel methods that reduce synchronization and communication operations and increase asynchrony while focusing on improving the performance of CD. Memory consumption is an important issue that has not yet been fully addressed. CNs are usually available in a high-dimensional feature space, which leads to a massive number of trainable parameters to refine and requires a large amount of memory for storage and computation. Thus, it is worthwhile to utilize techniques such as data partitioning and sharing of trainable parameters to reduce the number of trainable parameters [80].
MHs are promising algorithms for optimizing the parameters of DL models. However, most of the existing algorithms show shortcomings in terms of low accuracy and efficiency when dealing with data collected from large and complex problems. Several possible directions are suggested for future research. In the direction of improving the effectiveness of a machine learning or DL model, a long-term goal is to develop powerful MH algorithms that can escape the traps of local optima and premature convergence. This can be achieved by integrating different MH algorithms together or by integrating MH with a GD-BP algorithm. Moreover, considering the continuation approach's ability to solve the optimization problem [151], [181], its ability to optimize DL models should be further investigated. Objective functions play an important role in guiding a search process toward high-quality solutions. Many efforts have been made to formulate single-objective solutions, but multi-objective functions are still rarely used. Thus, it is worthwhile to further investigate the development of DL based on MH algorithms with multiple objective functions [98].
MH algorithms are known to be time-consuming, especially for big data. Accordingly, we find that most of the DL models with MH algorithms are evaluated on small-sized problems. This is an important issue because time is a crucial factor and the real world is full of big data, especially in CN applications. In the research direction of improving efficiency, parallel computing with CPUs and GPUs will be an essential solution to overcome this problem. In addition, scalability is another important issue, as real-world data is growing rapidly and its dimensionality is very high. Therefore, efficient and scalable MH algorithms are preferable long-term solutions for training large-scale DL models.
For CNs and their applications, the use of DL models with MH algorithms is rather new, and there are no studies that propose to solve CN problems, e.g., CD, with those methods. Therefore, several possible research directions can be explored in the future, e.g., optimization of model parameters, automatic definition of model structure, and reduction of dimensional space to reduce computational cost. In addition, it is worthwhile to investigate the possibility of integrating MH algorithms into DL-based CD models. Furthermore, the MH algorithms could be integrated into GD-BP algorithms to improve the effectiveness and efficiency of such models. Finally, since CNs in the real world are often large, it is worth exploring the integration of parallel computing, including CPUs, GPUs, or hybrids thereof, to improve the efficiency and scalability of DL-based CD models with MH algorithms.
Finally, there are general issues related to DL-based CD models in CNs that need further investigation. Most of the existing models deal with complete, static, or homogeneous networks. However, in the real world, many networks are incomplete, dynamic, or heterogeneous. Incomplete networks [182] are those that lack information about topology or features. Dynamic networks [183] have topology and attributes that change over time. Heterogeneous networks [184] contain members belonging to different types, e.g., images and texts. These complex structures cannot be handled with the algorithms proposed for complete, static, and homogeneous networks. As a result, several issues arise. For example, for incomplete networks, the entire network structure cannot be learned and discovered; for dynamic networks, the network structure must be re-learned and re-explored, which is computationally costly; and for heterogeneous networks, members of different types require complex algorithms to deal with their features and topology. Unfortunately, it is rarely possible to find accurate and complete networks without noise in the real world. Therefore, it is worthwhile to investigate the performance of DL and MH algorithms in community detection with noisy networks, which are characterized as incomplete, dynamic, and/or heterogeneous networks.

XI. CONCLUSION
In this paper, we present a literature review of CD in CNs from conventional ML to DL methods and point out the gap in applying DL-based CD methods to large CNs. In addition, the relevant studies on DL with parallel and MH approaches are reviewed and their implications for DL models are highlighted. One of our main goals is to present and organize most of the work done so far on DL-based CD methods, as well as on DL with MH and parallel computing approaches, in a unified perspective. This is intended to encourage the research community to bridge the gap between DL-based CD and DL with MH and parallel computing approaches. Specifically, we have reviewed the recent literature on CD methods, including those using early classical ML and MH approaches, as well as more recent DL approaches, and presented their merits and demerits. Although considerable research efforts have been made to propose solutions for this topic, there is still a lack of effective and efficient methods that can fully meet performance requirements, particularly when large CNs are involved. Therefore, this paper presents an overview of the success of integrating DL models with MH algorithms and parallel computing in different domains and explains their pros and cons. How DL with MH and parallel computing methods could contribute to improving the effectiveness and efficiency of DL models in the CD field has also been discussed. Finally, we summarize the work and point out some research directions for bridging the gaps between DL-based CD methods and MH and parallel computing approaches to deal with large CNs effectively and efficiently.