
Deep Representation Learning: Fundamentals, Technologies, Applications, and Open Challenges




Abstract:

Machine learning algorithms have had a profound impact on the field of computer science over the past few decades. The performance of these algorithms heavily depends on the representations derived from the data during the learning process. Successful learning processes aim to produce concise, discrete, meaningful representations that can be effectively applied to various tasks. Recent advancements in deep learning models have proven to be highly effective in capturing high-dimensional, non-linear, and multi-modal characteristics. In this work, we provide a comprehensive overview of the current state-of-the-art in deep representation learning and the principles and developments made in the process of representation learning. Our study encompasses both supervised and unsupervised methods, including popular techniques such as autoencoders, self-supervised methods, and deep neural networks. Furthermore, we explore a wide range of applications, including image recognition and natural language processing. In addition, we discuss recent trends, key issues, and open challenges in the field. This survey endeavors to make a significant contribution to the field of deep representation learning, fostering its understanding and facilitating further advancements.
Society Section: IEEE Systems, Man and Cybernetics Society Section
Published in: IEEE Access ( Volume: 11)
Page(s): 137621 - 137659
Date of Publication: 20 November 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent years, machine learning [1], [2], [3], [4], [5], [6], [7], [8] has shown promising capabilities in various fields of study and application. Representation learning, as a core component of artificial intelligence, is attracting more and more scientists every day. This interest is mirrored in an increasing number of papers, publications, and workshops on representation learning at international conferences and in various influential journals.

Representation learning involves the detection, extraction, encoding, and decoding of features from raw data, which can then be used in learning tasks. Its objective is to abstract features that best represent data, and the algorithms developed for this purpose are collectively referred to as representation learning [9]. The performance of deep learning models relies heavily on the methods used to represent data. Consequently, the rapid growth of deep learning has been accompanied by significant advances in representation learning techniques. Deep learning owes its success to architectures composed of multi-layered non-linear modules, each transforming features into higher-level representations.

Representation learning aims to encode (embed) the raw input data into lower-dimensional real-valued vectors (embeddings), ideally disentangling the features that cause variation in the data distribution. Ideally, these representations should be robust to small differences or outliers in the input data, ensuring that temporally or spatially similar samples fall into close proximity in the representation space. Deep representation learning methods enable the hierarchical structuring of descriptive factors, where higher layers capture more abstract concepts. An ideal high-level representation consists of simple and linearly correlated factors [10]. Owing to the nature of feature extraction in representation learning, representations can be shared and utilized across different tasks. Although achieving the characteristics mentioned above is challenging, the learned representation facilitates the discovery of latent patterns and trends in data for the learner, hence enhancing the learning of multiple tasks [10]. Depending on the application, the raw input data can be of any type, for instance, text, images, audio, or video. Given a particular task, such as classification, segmentation, synthesis, or prediction, the main objective is to update the parameters of a neural network so that it can represent the input data in a lower dimension.

In the domain of image processing, representation learning finds applications in visualization [11], regression [12], [13], [14], interpretation of predictions [15], [16], [17], generating synthetic data [18], finding and retrieving similar images [19], [20], image enhancement and denoising [21], [22], semantic segmentation, and object detection [23], [24], [25]. Challenges in 2D image processing also extend to volumetric image processing contexts, such as 3D MRI [26] and point cloud data captured by depth sensors [27].

In the analysis of sequential data, representation learning plays a crucial role in transferring representations across domains. This enables the generation of annotations and captions for images [28], [29], [30] and facilitates post-hoc interpretation in medical data analysis [31]. By leveraging learned representations, researchers can bridge the gap between different data modalities, allowing for more comprehensive and meaningful insights.

Natural language processing (NLP) leverages representation learning approaches across various domains, including text classification [32], question answering [33], machine translation [34], [35], [36], electronic health records [37], financial forecasting [38], chatbots [39], social media analysis [40], [41] and more. The field of NLP has witnessed an evolution from early rule-based methods to the application of statistical learning techniques, enabled by access to large amounts of data. However, the introduction of deep learning approaches to NLP in 2012 revolutionized the field, making neural network-based methods the dominant approaches [42]. In modern NLP, Word2Vec [43] and GloVe [44] have emerged as advanced, well-known approaches for representing words as vectors. Following a breakthrough in 2017 with attention-based models [45], advanced pre-trained models, particularly BERT [46], have garnered significant attention and generated excitement within the NLP community. These models have showcased exceptional performance and have become the focal point of current NLP research and applications.

Linear factor models, such as PCA and ICA, have been employed as early methods of feature extraction in representation learning. While these models can be extended to form more powerful representations, this article focuses primarily on deep models of representation. For a more comprehensive discussion on linear factor models, readers are encouraged to refer to [10] and [47]. The subsequent sections of this article delve into the prevalent approaches in deep representation learning, providing insights into their principles and techniques.

This survey provides a comprehensive overview of the current state-of-the-art methods and principles in deep representation learning. While representation learning has been reviewed in several previous surveys, this work offers a uniquely comprehensive and up-to-date treatment. Existing surveys have focused on specific approaches such as autoencoders [48], [49], generative adversarial networks [50], and foundation models [51]. Bengio et al. [10], in their 2013 publication, provided a perspective focused on disentangling factors of variation. LeCun et al. [9], in their 2015 work, reviewed representation learning with an emphasis on deep learning breakthroughs. More recent works, such as Zhou et al. [52] and Otter et al. [53], delivered insightful surveys on representation learning for computer vision and natural language processing, respectively. Zhou et al. discuss methods for various video segmentation tasks, while Otter et al. review developments in core NLP areas and related applications.

Our work encompasses a broader scope, including major techniques for both supervised and unsupervised feature learning. We discuss recent advancements spanning autoencoders, generative adversarial networks, graph neural networks, Bayesian deep learning, transformers, and other critical topics. Additionally, we explore applications across computer vision, natural language processing, healthcare, and other domains. This survey aims to connect key concepts in deep representation learning, tracing progress from foundational methods to cutting-edge techniques. By synthesizing a wide range of contemporary research into a single source, we hope to provide valuable insights into this rapidly evolving field and offer a comprehensive reference for representation learning distinct from previous works.

SECTION II.

Multi Layer Perceptron

A multi-layer perceptron, or feedforward neural network, is a stack of multiple layers. Each layer consists of one linear transformation and one non-linear activation function. Given an input vector $\vec {x} \in \mathbb {R}^{n}$ and weight matrix $W \in \mathbb {R}^{n\times m}$ , the transformed vector $\vec {y} \in \mathbb {R}^{m}$ can be calculated as:\begin{equation*} \vec {y} = W^{T}\vec {x} \tag{1}\end{equation*} The weight matrix $W^{T}$ in Eq. 1 consists of $m$ rows $\vec {r}_{i} \in \mathbb {R}^{n}$ (where $1\leq i \leq m$ ). As depicted in Fig. 1, each row $\vec {r}_{i}$ can be thought of as a vector perpendicular to a surface $S_{i}$ in hyperspace that passes through the origin. Surface $S_{i}$ divides the $n$ -dimensional space into three sub-spaces: the set of points residing on the surface and the two sets on either side of it. Each $y_{i}$ in vector $\vec {y}=(y_{1},y_{2},\ldots,y_{m})$ is calculated as the dot product of row $\vec {r}_{i}$ and the input vector $\vec {x}$ . Depending on the relative positions of the point $\vec {x}$ and surface $S_{i}$ , the value of $\vec {y}$ along the $i$ -th dimension may be positive, negative, or zero. A bias term $b_{i}$ can also be employed to further control the value of $\vec {y}$ . Essentially, the parameters of the weight ($\vec {r}_{i}$ ) and bias ($\vec {b}$ ) vectors decide how the features of the input vector $\vec {x}$ affect $\vec {y}$ along the $i$ -th dimension in the target space of $m$ dimensions. The training process updates these weights and biases so that they fit the input data to their corresponding target values. Thus, the network learns how to distinguish or generate certain similarities and patterns among the features of the input data. Each component of $\vec {y}$ is passed to an activation function in order to add non-linearity to the output. In a similar way, an extra layer can be utilized to capture the patterns and similarities in the output vectors of the previous layer, hence extracting more complex characteristics of the data. Adding extra layers may increase the capability of a network in learning representations in exchange for higher computational complexity.
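As a concrete illustration, the following sketch (in PyTorch, with illustrative dimensions and random data) shows one linear layer computing $\vec {y} = W^{T}\vec {x} + \vec {b}$ followed by a non-linear activation, and how such layers can be stacked into a small MLP.

```python
import torch
import torch.nn as nn

n, m = 4, 3                  # input and output dimensions (illustrative)
layer = nn.Linear(n, m)      # holds a weight matrix of shape (m, n) and a bias of shape (m,)
x = torch.randn(n)           # a single input vector

y = layer(x)                 # each y_i is the dot product r_i . x plus the bias b_i
a = torch.relu(y)            # the activation adds non-linearity

# Stacking layers lets the network capture more complex patterns,
# at the cost of extra computation.
mlp = nn.Sequential(
    nn.Linear(n, 8), nn.ReLU(),
    nn.Linear(8, m),
)
out = mlp(x)
```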

FIGURE 1. An intuitive representation of how the weights of a linear layer transform from the input space to the output space. $\vec {r}_{i}$ , the $i$ -th row of the weight matrix of the linear layer, may be considered as the normal vector of surface $S_{i}$ , which may affect different input data points in different ways: a) points above the surface: $\vec {r}_{i} \cdot \vec {x}_{1} > 0 $ ; b) points residing on the surface: $\vec {r}_{i} \cdot \vec {x}_{1} = 0 $ ; c) points below the surface: $\vec {r}_{i} \cdot \vec {x}_{1} < 0$ . The result of the product may also be passed through an activation function to add non-linearity.

SECTION III.

Generative Models

Generative models are unsupervised methods that aim to learn and approximate the distribution function from which the samples of a given unlabeled dataset are generated. By acquiring knowledge of this approximate generator function, models gain the ability to generate random samples that are not originally present in the dataset, yet possess resemblances to the existing data [54]. Generative models can be grouped into two categories: energy-based and function-based models [54]. Energy-based models include Boltzmann Machines (BM), Restricted Boltzmann Machines (RBM), and Deep Belief Networks (DBN) [55]. Energy-based models are probabilistic models that provide information about the probability density or mass function without explicitly determining the normalizing constant, resulting in unnormalized probabilities. These models exclusively define the energy function, which corresponds to the unnormalized negative log-probability [56]. On the other hand, function-based models, such as the Auto-Encoder [11], [57] and its variants, and Generative Adversarial Networks (GANs) [58], learn the mapping function from input to output, enabling the generation of new samples based on this learned mapping.

A. Boltzmann Machines

1) Boltzmann Machine

The Boltzmann Machine is an energy-based model initially introduced for learning arbitrary probability distributions over binary vectors [47]. Later, continuous variations of Boltzmann Machines have been proposed [59].

Given a $d$ -dimensional binary vector $x \in \{0,1\}^{d}$ as input, the joint probability distribution is defined as:\begin{equation*} P(x) = \frac {\exp (-E(x))}{Z} \tag{2}\end{equation*} where $Z$ is the normalization constant (partition function), defined as:\begin{equation*} Z = \sum _{x}{\exp (-E(x))} \tag{3}\end{equation*} ensuring that $P(x)$ forms a probability density. In Equation 2, $E(x)$ represents the energy function, defined as:\begin{equation*} E(x)=-(x^{T}Wx + b^{T}x) \tag{4}\end{equation*}

The training process involves maximizing the likelihood and minimizing the energy function. Boltzmann Machines exhibit a learning procedure inspired by biological neurons, where the connection between two neurons strengthens if they are both excited together and weakens otherwise. This biologically inspired learning mechanism enhances the model’s ability to capture dependencies and patterns within the data.

One popular training algorithm for Boltzmann Machines is Contrastive Divergence, which provides an efficient approximation to maximum likelihood training using Gibbs sampling [56], [60].

2) Restricted Boltzmann Machine (RBM)

The Restricted Boltzmann Machine limits the connections among the nodes of a graph to only links between the visible and hidden neurons. Consequently, there are no connections among the hidden neurons or the visible ones. The vector of nodes, denoted as $x$ , can be divided into two subsets: visible nodes $v$ and hidden nodes $h$ . The energy function for RBM is given by:\begin{equation*} \text { E}(v,h)= -b^{T}v - c^{T}h - v^{T}Wh \tag{5}\end{equation*}

Here, $b$ and $c$ represent the bias weights, and the matrix $W$ represents the connection weights.

The partition function for RBM, denoted as $Z$ , is defined as:\begin{equation*} Z = \sum _{v}\sum _{h}{e^{-E(v,h)}} \tag{6}\end{equation*}

RBMs are probabilistic graphical models and serve as the fundamental building blocks of Deep Belief Networks (DBNs). However, due to the intractability of the partition function $Z$ , training RBMs requires specialized methods such as Contrastive Divergence [61] and Score Matching [47].
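The following is a minimal sketch of one CD-1 update for a binary RBM, assuming Bernoulli visible and hidden units; the dimensions, learning rate, and toy input are illustrative, not a reference implementation.

```python
import torch

n_visible, n_hidden, lr = 6, 4, 0.01
W = torch.randn(n_visible, n_hidden) * 0.01   # connection weights
b = torch.zeros(n_visible)                    # visible bias
c = torch.zeros(n_hidden)                     # hidden bias

def cd1_step(v0):
    # positive phase: hidden probabilities given the data
    ph0 = torch.sigmoid(v0 @ W + c)
    h0 = torch.bernoulli(ph0)
    # negative phase: one step of Gibbs sampling
    pv1 = torch.sigmoid(h0 @ W.t() + b)
    v1 = torch.bernoulli(pv1)
    ph1 = torch.sigmoid(v1 @ W + c)
    # approximate gradient ascent on the log-likelihood
    W.add_(lr * (torch.outer(v0, ph0) - torch.outer(v1, ph1)))
    b.add_(lr * (v0 - v1))
    c.add_(lr * (ph0 - ph1))

v = torch.bernoulli(torch.rand(n_visible))    # a toy binary training vector
cd1_step(v)
```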

3) Deep Belief Network (DBN)

A Deep Belief Network consists of several RBMs. When a DBN has only one hidden layer, it can be considered as an RBM. To train a DBN, an RBM is first trained using likelihood maximization or contrastive divergence [47]. Subsequently, another RBM is trained to model the distribution of the previous layers. By adding more layers, the variational lower bound of the log-likelihood of the data increases, enabling the DBN to capture complex patterns and dependencies.

The deepest layer of DBNs is characterized by undirected connections, setting them apart from other deep neural network architectures [55]. However, it is important to mention that the term “DBN” is sometimes incorrectly used to refer to any neural network, which may lead to confusion.

4) Other Variants

Other variants of Boltzmann machines have been proposed, such as Deep Boltzmann Machines (DBM) [62], Spike and Slab Restricted Boltzmann Machines (ssRBM) [63], and Convolutional Boltzmann Machines [64]. However, other generative models such as variational autoencoders and GANs have proven to be viable substitutes for the variants and derivatives of Boltzmann machines [58].

B. Auto-Encoders

Autoencoder-based models are considered to be some of the most robust unsupervised learning models for extracting effective and discriminating features from a large unlabeled dataset. The general architecture of an auto-encoder consists of two components. Encoder: a function $f$ which transforms the input $x$ into a lower-dimensional latent variable $h$ . Decoder: a function $g$ which reconstructs an approximation $\hat {x}$ of the input, given the latent variable $h$ . The training process involves updating the weights of the encoder and decoder networks according to the reconstruction loss:\begin{equation*} \mathcal {L}(x,\hat {x}) = \mathcal {L}\Big (x,g\big (f(x)\big)\Big) \tag{7}\end{equation*}
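A minimal undercomplete autoencoder along these lines might look as follows in PyTorch; the layer sizes, toy batch, and single training step are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # encoder f(x) = h maps the input to a lower-dimensional latent vector
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # decoder g(h) reconstructs the input from the latent vector
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # a toy batch of inputs
x_hat = model(x)
loss = F.mse_loss(x_hat, x)          # L(x, g(f(x)))
opt.zero_grad()
loss.backward()
opt.step()
```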

Many variants of auto-encoders have been proposed in the literature; however, they can be categorized into four major groups [54].

1) Undercomplete Autoencoder

In order to make the autoencoder learn the distributions from the data, the latent variables should have lower dimensions than the input data. Otherwise, the network would fail to learn any useful features from the data. This type of autoencoder is known as an undercomplete autoencoder [47].

2) Denoising Autoencoder (DAE)

The denoising autoencoder corrupts the data by adding stochastic noise and reconstructs it back into the intact data, hence the name. As depicted in Fig. 3, the noise added to the input is the only difference between this method and traditional autoencoders. This approach results in better feature extraction and better generalization in classification tasks [65]. Moreover, several DAEs can be trained locally by adding noise to their inputs and stacked consecutively to form a deep architecture, called a stacked DAE, with higher representational capability.
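Building on the autoencoder sketch above, the denoising variant only changes the input: noise is added before encoding while the loss still targets the clean input. The Gaussian corruption and noise level below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Reuses `model` and `x` from the autoencoder sketch above.
noise_std = 0.3
x_noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
x_hat = model(x_noisy)                          # reconstruct from the corrupted input
loss = F.mse_loss(x_hat, x)                     # compare against the clean input
```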

FIGURE 2. General architecture of an auto-encoder. The encoder transforms input $x$ into the latent vector $h$ : $h = f(x)$ . The decoder reconstructs the input from $h$ : $\hat {x} = g(h)$ .

FIGURE 3. General architecture of a denoising auto-encoder (DAE). Adding noise to the input during the training process results in more robust learning of the features, hence increasing the generalization ability.

3) Sparse Autoencoders (SAE)

Sparse representation refers to the technique of decomposing a data set into a set of overcomplete vectors, where only a small subset of those vectors combine to describe the data. The overcompleteness of the representation can lead to more expressive basis vectors, which can capture complex structures more effectively. The sparsity constraint limits the number of basis vectors used to decompose the data. Sparse representation can be formulated as the disentangling of an input signal into a linear combination of its latent features [66]. The loss function of a sparse autoencoder includes an additional sparsity constraint ($\Omega (h)$ ) on the latent variables [47]:\begin{equation*} \mathcal {L} = \mathcal {L}\big (x,g(f(x))\big) + \Omega (h) \tag{8}\end{equation*} Thus, the autoencoder is forced to extract features from the data and represent them as sparse vectors and matrices [67].
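As a sketch, the penalty $\Omega (h)$ in Eq. 8 can be implemented as an L1 term on the latent activations (a KL penalty on average activations is a common alternative); the weighting factor below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def sparse_ae_loss(x, x_hat, h, sparsity_weight=1e-3):
    """Reconstruction loss plus an L1 sparsity penalty on the latent code h."""
    reconstruction = F.mse_loss(x_hat, x)   # L(x, g(f(x)))
    omega = h.abs().mean()                  # Omega(h): L1 sparsity term
    return reconstruction + sparsity_weight * omega
```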

4) Variational Autoencoder (VAE)

Although this type of autoencoder has the same components as the traditional autoencoder (Fig. 2), its training process is based on variational inference [68]. Just as in the traditional autoencoder, the encoder function $f$ is trained to map the input data to the latent variables $z$ , and the decoder function $g$ is trained to map the latent variables $z$ back to the input data. However, for this autoencoder to work, the latent variable $z$ is assumed to be Gaussian. By choosing this representation, we gain significant control over how the latent distribution should be modeled, resulting in a smoother and more continuous latent space. The loss function for this training consists of two terms: first, the Kullback-Leibler (KL) divergence [69] between the output of the encoder $f$ and a Gaussian distribution, which forces the encoder to map the input data to a Gaussian distribution in the latent space; second, the reconstruction loss [70]:\begin{equation*} \mathcal {L} = \mathcal {D}_{KL}\big (f\| \mathcal {N}(0,I)\big) + \mathcal {L}\big (x,g(f(x))\big) \tag{9}\end{equation*} Variational inference is discussed in more detail in Section V-C.
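A minimal sketch of Eq. 9 is given below: the encoder outputs the mean and log-variance of a Gaussian over $z$, the reparameterization trick draws a sample, and the loss sums the closed-form KL term and the reconstruction term. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))

    def forward(self, x):
        e = torch.relu(self.enc(x))
        mu, logvar = self.mu(e), self.logvar(e)
        # reparameterization trick: z = mu + sigma * epsilon
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction='sum')
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```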

5) Contractive Autoencoder (CAE)

The main goal in proposing this variant of autoencoder was to make the features in the activation layer invariant with respect to small perturbations in the input [71]. The basic autoencoder may be converted to a contractive autoencoder by adding the following regularization to its loss function:\begin{equation*} \| J_{f}(x) \|_{F}^{2} = \sum _{i}\sum _{j}\left({\frac {\partial h_{j}(x)}{\partial x_{i}} }\right)^{2} \tag{10}\end{equation*} where $f: \mathbb {R}^{m} \rightarrow \mathbb {R}^{n}$ is a non-linear mapping function from input space $x \in \mathbb {R}^{m}$ to the hidden layer $h \in \mathbb {R}^{n}$ . The regularization term is the squared value of the first-order partial derivatives of the hidden values with respect to the input values. By penalizing the first derivative of the encoding function, the derivative is forced to maintain lower values. In this way, the encoding function may learn a flatter representation. As a result, the encoding function may become more robust or invariant to small perturbations in the input.

The loss function of the contractive autoencoder may be written as:\begin{equation*} \mathcal {L}_{CAE} = \sum _{x \in X}\bigg (\mathcal {L}_{R}\big (x,g(f(x))\big) + \lambda \| J_{f}(x) \|_{F}^{2} \bigg) \tag{11}\end{equation*} where $X$ is the dataset of training samples, $\mathcal {L}_{R}$ denotes the reconstruction loss, and $\lambda \in \mathbb {R}$ controls the effect of the contractive loss. Input points get closer in distance when mapped to the hidden state, i.e., they are contracted. This contraction can be thought of as the reason behind the robustness of the features.
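A sketch of the contractive penalty in Eq. 10 using automatic differentiation is shown below; `encoder` is assumed to be any differentiable module mapping a single input vector to the latent vector, and computing the full Jacobian this way is shown only for clarity, not efficiency.

```python
import torch
import torch.nn.functional as F
from torch.autograd.functional import jacobian

def contractive_penalty(encoder, x):
    # x: a single input vector of shape (m,); J has shape (n, m)
    J = jacobian(encoder, x, create_graph=True)
    return (J ** 2).sum()                      # squared Frobenius norm of the Jacobian

def cae_loss(x, x_hat, encoder, lam=1e-4):
    recon = F.mse_loss(x_hat, x)               # reconstruction term L_R
    return recon + lam * contractive_penalty(encoder, x)
```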

C. Generative Adversarial Networks

Although both autoencoders and GANs are generative models, their learning mechanisms are different. Autoencoders are trained to learn hidden representations, whereas GANs are designed to generate new data. The most prevalent generative model utilized in many applications is the GAN architecture [58]. As depicted in Fig. 4, it resembles a two-player minimax game in which two functions, known as the generator $\mathcal {G}$ and the discriminator $\mathcal {D}$ , are trained as opponents. The $\mathcal {G}$ function tries to generate fake samples as similar as possible to the real input data from a noise variable $z$ , and the $\mathcal {D}$ function aims to tell the fake and real data apart. The minimax game can be described with the following objective function:\begin{align*} \underset {\mathcal {G}}{\min } \underset {\mathcal {D}}{\max } V(\mathcal {D},\mathcal {G}) &= \mathbb {E}_{x\sim p_{data}(x)}[\log \mathcal {D}(x)] \\ &\quad +\mathbb {E}_{z\sim p_{z}(z)}[\log \big (1 - \mathcal {D}(\mathcal {G}(z))\big)] \tag{12}\end{align*} where $x \sim p_{data}$ denotes the real data sample $x$ with its distribution $p_{data}$ , and $\mathcal {D}(x)$ represents the class label that the discriminator $\mathcal {D}$ assigns to the input sample $x$ . For the noise variable $z$ , a prior is assumed as $z \sim p_{z}(z)$ .
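A minimal sketch of one alternating training step of Eq. 12 follows; the architectures, noise dimension, and toy data are illustrative assumptions, and the generator step uses the common non-saturating variant of the loss.

```python
import torch
import torch.nn as nn

z_dim, x_dim, batch = 32, 784, 64
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(batch, x_dim)              # a toy batch standing in for real data
z = torch.randn(batch, z_dim)

# Discriminator step: push D(x) toward 1 on real data and toward 0 on fake data
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: non-saturating loss, i.e., maximize log D(G(z))
g_loss = bce(D(G(z)), torch.ones(batch, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```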

FIGURE 4. Generative adversarial network.

The success of CNNs in image analysis and the capabilities that GANs provide have made generative CNNs possible [72]. Numerous extensions to the original GAN have been proposed so far [73], such as interpretable representation learning by information maximizing (InfoGAN), which forces the model to disentangle and represent features of images in certain elements of the latent vector [74], or Cycle-Consistent GAN (CycleGAN), which learns characteristics of an image dataset and translates them into another image dataset without any dataset of paired images [75]. An inherent limitation of the original GAN is that it does not have any control over its output. Conditional Generative Adversarial Nets [76] incorporate auxiliary inputs, such as class labels, into the model to generate the desired output.

D. Applications

Generative models provide a powerful framework for learning and approximating complex data distributions, allowing for the generation of realistic and novel samples. They have shown promise in a wide range of applications, contributing to advancements in various fields. These models have found applications in numerous domains, enabling the development of powerful deep architectures. In the field of NLP, generative models have been utilized for tasks such as text generation [77], [78] and machine translation [79]. Notably, the GPT-3.5 model has demonstrated remarkable performance in language generation tasks [39].

In image processing, generative models have demonstrated their effectiveness in various applications. They have been employed for tasks such as denoising 3D magnetic images [80], unsupervised image generation [81], image-to-image translation [75], [82], cross-modality synthesis [83], [84], data augmentation and anonymization [85], image segmentation [86], [87], super-resolution [73], [88], [89], [90], and video analysis [91].

Furthermore, generative neural networks and their derivatives have been utilized in combination with deep reinforcement learning algorithms for tasks such as object detection [92], [93]. They have also been applied in the analysis of graph data, contributing to advancements in areas like graph generation [94] and graph representation learning [95].

Overall, generative neural networks have proven to be versatile tools with applications spanning a wide range of disciplines, delivering state-of-the-art performance in various problem domains.

SECTION IV.

Graph Neural Networks

The widespread success of deep learning in a myriad of applications over the past decade is well-documented [35], [36], [59], [79], [96], [97]. In the evolving landscape of deep learning research, Graph Neural Networks (GNNs) stand out as a pivotal advancement for effective data analysis in non-Euclidean geometries. GNNs have found applications in diverse real-world contexts, including but not limited to, biological regulatory networks in genomics [98], [99], telecommunication infrastructures [100], social interaction frameworks [101], transportation systems [102], [103], [104], energy grids [105], [106], [107], electrical circuits [108], [109], epidemiological spread [110], and neural networks in the brain [111]. Traditional deep learning architectures like ConvNets struggle with the irregular, non-Euclidean structure of graphs, primarily because the varying neighborhood sizes of graph nodes are incompatible with ConvNets’ fixed-size kernels. To address this, a plethora of GNN models have been proposed, leveraging the strengths of deep learning to capture the inherent complexities of non-Euclidean graphs [112], [113], [114].

A. Basics of GNN

Graph convolution originates from spectral graph theory, which studies the properties of a graph in relation to the eigenvalues and eigenvectors of its associated graph matrices [115], [116], [117]. The spectral convolution methods [112], [113], [114], [118] are the major algorithms designed for graph convolution, and they are based on the graph Fourier transform [119], [120]. GCNs focus on processing graph signals defined on undirected graphs $\mathcal {G} = (\mathcal {V}, \mathcal {E}, \mathcal {W})$ , where $\mathcal {V}$ is a set of $n$ vertices, $\mathcal {E}$ represents the edges, and $\mathcal {W} = [w_{ij}] \in \{0,1\}^{n\times n}$ is an unweighted adjacency matrix. A signal $x: \mathcal {V} \rightarrow \mathbb {R}$ defined on the nodes may be regarded as a vector $x \in \mathbb {R}^{n}$ . The combinatorial graph Laplacian [115] is defined as $\mathbf {L} = D-\mathcal {W} \in \mathbb {R}^{n\times n}$ , where $D$ is the degree matrix. As $\mathbf {L}$ is a real symmetric positive semidefinite matrix, it has a complete set of orthonormal eigenvectors and associated ordered real nonnegative eigenvalues, identified as the frequencies of the graph. The Laplacian is diagonalized by the Fourier basis $\mathbf {U}^{\intercal }$ : $\mathbf {L} = \mathbf {U} \boldsymbol{\Lambda } \mathbf {U}^{\intercal }$ , where $\boldsymbol{\Lambda }$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, i.e., $\Lambda _{ii}=\lambda _{i}$ . The graph Fourier transform of a signal $x\in \mathbb {R}^{n}$ is defined as $\hat {x}= \mathbf {U}^{\intercal } x \in \mathbb {R}^{n}$ and its inverse as $x= \mathbf {U} \hat {x}$ [119], [120], [121]. To enable the formulation of fundamental operations such as filtering in the vertex domain, the convolution operator on graphs is defined in the Fourier domain such that $f_{1}*f_{2}= \mathbf {U} \left [{\left ({\mathbf {U}^{\intercal } f_{1} }\right) \odot \left ({\mathbf {U}^{\intercal } f_{2}}\right)}\right]$ , where $\odot $ is the element-wise product and $f_{1}, f_{2}$ are two signals defined on the vertex domain. It follows that a vertex signal $f_{2}=x$ is filtered by a spectral signal $\hat {f_{1}}= \mathbf {U}^{\intercal } f_{1}= \mathbf {g}$ as:\begin{equation*} \mathbf {g} * x = \mathbf {U} \left [{ \mathbf {g} (\boldsymbol{\Lambda })\odot \left ({\mathbf {U}^{\intercal } f_{2}}\right)}\right] = \mathbf {U} \mathbf {g} (\boldsymbol{\Lambda }) \mathbf {U}^{\intercal } x.\end{equation*}

Note that a real symmetric matrix $\mathbf {L}$ can be decomposed as $\mathbf {L} = \mathbf {U} \boldsymbol{\Lambda } \mathbf {U}^{-1} = \mathbf {U} \boldsymbol{\Lambda } \mathbf {U}^{\intercal }$ since $\mathbf {U}^{-1}= \mathbf {U}^{\intercal }$ . Hammond et al. and Defferrard et al. [114], [122] apply a polynomial approximation to the spectral filter $\mathbf {g}$ so that:\begin{align*} \mathbf {g} * x &= \mathbf {U} \mathbf {g} (\boldsymbol{\Lambda }) \mathbf {U}^{\intercal } x \\ &\approx \mathbf {U} \sum _{k}\theta _{k} T_{k}(\tilde {\boldsymbol{\Lambda }}) \mathbf {U}^{\intercal } x \quad \left({\tilde {\boldsymbol{\Lambda }}=\frac {2}{\lambda _{max}} \boldsymbol{\Lambda } - \mathbf {I}_{N}}\right)\\ &=\sum _{k}\theta _{k} T_{k}(\tilde {\mathbf {L}}) x\quad \left({\mathbf {U} \boldsymbol{\Lambda }^{k} \mathbf {U}^{\intercal } =(\mathbf {U} \boldsymbol{\Lambda } \mathbf {U}^{\intercal })^{k}}\right)\end{align*} Kipf et al. [113] simplify it by applying multiple tricks:\begin{align*} \mathbf {g} * x &\approx \theta _{0} \mathbf {I}_{N}x+\theta _{1}\tilde {\mathbf {L}} x &&{\scriptstyle (\text {expand to 1st order})}\\ &=\theta _{0} \mathbf {I}_{N}x+\theta _{1}\left({\frac {2}{\lambda _{max}} \mathbf {L} - \mathbf {I}_{N}}\right) x &&{\scriptstyle \left({\tilde {\mathbf {L}}=\frac {2}{\lambda _{max}} \mathbf {L} - \mathbf {I}_{N}}\right)} \\ &=\theta _{0} \mathbf {I}_{N}x+\theta _{1}(\mathbf {L} - \mathbf {I}_{N}) x &&{\scriptstyle (\lambda _{max}=2)} \\ &=\theta _{0} \mathbf {I}_{N}x-\theta _{1} \mathbf {D}^{-\frac {1}{2}} \mathbf {A} \mathbf {D}^{-\frac {1}{2}} x &&{\scriptstyle \left({\mathbf {L} = \mathbf {I}_{N}- \mathbf {D}^{-\frac {1}{2}} \mathbf {A} \mathbf {D}^{-\frac {1}{2}}}\right)} \\ &=\theta _{0}\left({\mathbf {I}_{N} + \mathbf {D}^{-\frac {1}{2}} \mathbf {A} \mathbf {D}^{-\frac {1}{2}}}\right) x &&{\scriptstyle (\theta _{0}=-\theta _{1})} \\ &=\theta _{0}\left({\tilde {\mathbf {D}}^{-\frac {1}{2}}\tilde {\mathbf {A}} \tilde {\mathbf {D}}^{-\frac {1}{2}}}\right) x &&{\scriptstyle (\text {renormalization}: \tilde {\mathbf {A}} = \mathbf {A} + \mathbf {I}_{N},\; \tilde {\mathbf {D}}_{ii}=\sum _{j} \tilde {\mathbf {A}}_{ij})}\end{align*} Rewriting the above GCN in matrix form, $\mathrm{g}_{\theta} * X \approx \left({\tilde {\mathbf {D}}^{-\frac {1}{2}} \tilde {\mathbf {A}} \tilde {\mathbf {D}}^{-\frac {1}{2}}}\right) X \Theta $ , which amounts to the symmetric normalized Laplacian applied to the raw features. GCN has been analyzed in [123] using Laplacian smoothing [124]: the updated feature $y$ equals the smoothed Laplacian, i.e., the weighted sum of the vertex itself ($x_{i}$ ) and its neighbors ($x_{j}$ ): $y= (1-\gamma) x_{i} + \gamma \sum _{j}\frac {\tilde a_{ij}}{d_{i}}x_{j} = x_{i} - \gamma \left({x_{i}-\sum _{j}\frac {\tilde a_{ij}}{d_{i}}x_{j}}\right)$ , where $\gamma $ is a weight parameter between the current vertex $x_{i}$ and the features of its neighbors $x_{j}$ , and $d_{i}$ is the degree of $x_{i}$ . Rewriting in matrix form, the smoothed Laplacian is:\begin{align*} Y&= x -\gamma \tilde {\mathbf {D}}^{-1}\tilde {\mathbf {L}} x\\ &= (\mathbf {I}_{N}-\tilde {\mathbf {D}}^{-1}\tilde {\mathbf {L}})x \qquad \qquad \quad (\gamma =1)\\ &= (\mathbf {I}_{N}-\tilde {\mathbf {D}}^{-1}(\tilde {\mathbf {D}}-\tilde {\mathbf {A}}))x \qquad (\tilde {\mathbf {L}}=\tilde {\mathbf {D}}-\tilde {\mathbf {A}})\\ &= \tilde {\mathbf {D}}^{-1}\tilde {\mathbf {A}} x.\end{align*} The above formula is the random walk normalized Laplacian, a counterpart of the symmetric normalized Laplacian. Therefore, GCN can be treated as a first-order Laplacian smoothing which averages the neighbors of each vertex.
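A sketch of the resulting propagation rule $\tilde {\mathbf {D}}^{-\frac {1}{2}}\tilde {\mathbf {A}}\tilde {\mathbf {D}}^{-\frac {1}{2}} X \Theta $ with dense NumPy arrays is given below; the toy adjacency matrix and feature dimensions are illustrative assumptions.

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One graph convolution: symmetric renormalized aggregation followed by a linear map."""
    A_tilde = A + np.eye(A.shape[0])                   # add self-loops
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ Theta

# Toy graph: 4 nodes, 3 input features, 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.random.randn(4, 3)
Theta = np.random.randn(3, 2)
H = np.maximum(gcn_layer(A, X, Theta), 0)              # ReLU non-linearity
```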

B. Taxonomy of GNN

As many surveys on GNNs state [118], [125], [126], [127], [128], [129], GCNs can be classified into two major categories based on the operation type. Therefore, we introduce a taxonomy of GNNs from the following two perspectives.

C. Spectral-based GNN

This group of GCNs relies heavily on spectral graph analysis and approximation theory. Spectral-based GNN models analyze the weight-adjusting function (i.e., filter function) on the eigenvalues of graph matrices, which corresponds to adjusting the weights assigned to frequency components (eigenvectors). Many spectral-based GNN models are equivalent to low-pass filters [94]. Based on the type of filter function, there are linear filtering [113], [130], [131], polynomial filtering [122], [132], [133], [134], [135], and rational filtering [94], [136], [137], [138]. Beyond that, [139] adaptively learns the center of the spectral filter. Closely related, [140] proposed a high-low-pass filter based on the p-Laplacian. References [141], [142], and [143] revisit the spectral graph convolutional filter and provide theoretical analyses. Optionally, one can choose graph wavelets to model the spectrum of each node [144], [145], [146], [147].

D. Spatial-based GNN

Nowadays, there are more emerging GNNs using spatial operations. Based on the spatial operation, they can be categorized into three groups: local aggregation, which only combines direct neighbors [130], [131], [148], [149], [150]; higher-order aggregation, which involves second- or higher-order neighbors [114], [122], [133], [134], [135], [151]; and dual-directional aggregation, which propagates information in both forward and backward directions [94], [136], [137], [138], [152], [153], [154].

E. Applications

Graph neural networks have been applied in numerous domains such as physics, chemistry, biology, computer vision, NLP, intelligent transportation, and social networks [118], [125], [126], [127], [128], [155]. To model physical objects, DeepMind [156] provides a toolkit to generalize operations on graphs, including manipulating structured knowledge and producing structured behaviors, and [157] simulates fluids, rigid solids, and deformable materials. Treating chemical structures as graphs, [158], [159], [160], [161] represent molecular structure, and [162], [163], [164] model protein interfaces. Further, [165] predicts chemical reactions and retrosynthesis. In computer vision, question-specific interactions are modeled as graphs in visual question answering [166], [167]. Similar to the physics applications, human interactions could be represented by their connections [168], [169], [170], [171]. Reference [172] models the relationships among words and documents as a graph, while [173] and [174] characterize syntactic relations as a dependency tree. Predicting traffic flow is a fundamental problem in urban computing, and the transportation network can be modeled as a spatiotemporal graph [175], [176], [177], [178]. Functional MRI (fMRI) is graph data in which brain regions are connected by functional correlation [179], [180]. Reference [181] employs a graph convolutional network to localize the eloquent cortex in brain tumor patients, [182] integrates structural and functional MRIs using graph convolutional networks for autism classification, and [183] applies graph convolutional networks to classify mental imagery states of healthy subjects using only functional connectivity. To go beyond rs-fMRI and model both the functional dependency among brain regions and the temporal dynamics of brain activity, spatio-temporal graph convolutional networks (ST-GCN) are applied to formulate functional connectivity networks in the format of spatio-temporal graphs, which can also be applied to physical flows [102], [184], [185].

SECTION V.

Bayesian Deep Learning and Variational Inference

Bayesian networks are a statistical methodology that combines standard networks with Bayesian inference. Following the Bayes rule (Eq. 13), the random variables of a problem can be represented as a directed acyclic graph known as a Bayesian network or belief network [47].

Let ${\mathbf {Z}}= \{z_{1},z_{2},\ldots,z_{N}\}$ and ${\mathbf {x}}= \{x_{1},x_{2},\ldots,x_{M}\}$ denote the latent variables and the observations, respectively. The latent variables facilitate the representation of the observations’ distribution. Given a prior distribution $p(\text {z})$ over the latent variables, the Bayesian model maps the latent variables to the observations through the likelihood function $p(\text {x|z})$ , producing the joint distribution of the latent variables and observations:\begin{equation*} p(\text {z,x}) = p(\text {x|z})p(\text {z}) = p(\text {z|x})p(\text {x}) \tag{13}\end{equation*} In Bayesian models, inference involves calculating the posterior distribution, which is the conditional distribution of the latent variables given the observations:\begin{equation*} p(\text {z|x}) = \frac {p(\text {z,x})}{p(\text {x})} \tag{14}\end{equation*} The marginal density of the observations, $p(\text {x})$ , is called the evidence and is calculated by integration over the latent variables:\begin{equation*} p(\text {x}) = \int p(\text {x,z})d\text {z} \tag{15}\end{equation*}

A. Hidden Markov Model

The Hidden Markov Model (HMM) is a probabilistic Bayesian network architecture [186] that approximates the likelihood of distributions in a sequence of observations [187]. As opposed to Bayesian networks, these networks are undirected and can be cyclic. The family of HMMs, including the Hidden Semi-Markov Model (HSMM), is widely used to identify patterns in sequential data of both time-varying and non-time-varying nature [188]. They are well suited for sequencing time-series problems with a linear degree of growth over data patterns [189]. A generalized HMM is composed of a state model of a Markov process $z_{t}$ , linked to an observation model $P(x_{t}|z_{t})$ , which contains the observations $x_{t}$ of the state model.

While HMMs are considered agnostic of the duration of the states, HSMMs can take the duration of each state into consideration [190], which makes them suitable for prognosis [191], [192]. Neither HMM nor HSMM can capture the inter-dependencies of observations in temporal data, which is a key factor in determining the state of the system. To overcome this shortcoming, one can use the Auto-Regressive Hidden Markov Model (ARHMM), which accounts for the inter-dependencies between consecutive observations to model longer time series [193], [194], [195].

HMMs can lose their efficiency when dealing with distributed state representations. The Factorial HMM (FHMM) is an extension of the HMM that aims to address this problem by using several independent layers of HMM state structure. These layers are free to evolve irrespective of the other layers, allowing observations at any given time to depend on the values of all states at that time [196].

Due to the exponential time complexity of the integration in Eq. 15, its computation is intractable. Consequently, the posterior distribution cannot be calculated directly; rather, it is approximated [68]. There are two major methods of posterior approximation:

  • Sampling based: Markov Chain Monte Carlo (MCMC) methods are often able to approximate the true and unbiased posterior through sampling, although they are slow and computationally demanding on large and complex datasets with high dimensions.

  • Optimization based: approaches for Variational Inference (VI) tend to converge much faster though they may provide over-simplified approximations.

The following subsections explain each of these approaches in more detail due to their importance.

B. Markov Chain Monte Carlo

Monte Carlo estimation is a method for approximating the expectation of random variables whose expectation may involve intractable integrations, as in Eq. 15. Metropolis-Hastings (MH) sampling, Gibbs sampling, and their parallel and scalable variations are instances of MCMC estimation [197].

Although the basic Monte Carlo algorithm requires the samples to be independent and identically distributed (i.i.d.), obtaining such samples may be computationally intensive in practice. Nonetheless, the sample generation process can still be facilitated by satisfying some properties, as described below [198]:

1) Markov Property

Given the past and present states, the probability of transition to the future states relies on the present state only. Mathematically speaking, a Markov chain is a sequence of random variables ${\text {X}}_{1},\text {X}_{2},\ldots,\text {X}_{n}$ representing states, that hold the following property:\begin{align*} P(\text {X}_{n + 1}&=x|\text {X}_{n} = x_{n}, \text {X}_{n-1} = x_{n-1},\ldots, \text {X}_{1} = x_{1}) \\ & =P(\text {X}_{n + 1}=x|\text {X}_{n} = x_{n}) \tag{16}\end{align*}

2) Time-homogeneity

A stochastic process whose transition probabilities are independent of the index $n$ is time-homogeneous.

3) Stationary distribution

A probability distribution of a Markov chain, represented as a row vector $\pi $ , that is invariant under the matrix of transition probabilities K:\begin{equation*} \pi = \pi \text { K} \tag{17}\end{equation*}

4) Irreducibility

A Markov chain is irreducible if, in a discrete state space, it can go from any state $x$ to any other state $y$ in a finite number of transitions. In mathematical terms, given that:\begin{equation*} \text { K(x,y)} = P(\text {X}_{n + 1}=y|\text {X}_{n} = x) \tag{18}\end{equation*} where K is a matrix, there exists an integer $n$ such that $K_{(x,y)}^{n} > 0 $ .

The stationary distribution of a chain is unique if the chain has a stationary distribution and is irreducible. Considering a Markov chain with a unique stationary distribution $\pi $ , according to the law of large numbers [198], the expectation of a function $f(x)$ over $\pi $ can be approximated by calculating the mean of the outputs from the Markov chain:\begin{equation*} E_{\pi }[f(x)] = \int f(x)\pi (x)dx = \underset {n \rightarrow \infty }{\text {lim}}\frac {1}{n}\sum _{i=1}^{n}f(x_{i}) \tag{19}\end{equation*}
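As an illustration, the following Metropolis-Hastings sketch samples from a one-dimensional target known only up to its normalizing constant and uses the chain mean as the estimate in Eq. 19; the target, proposal width, and chain length are illustrative assumptions.

```python
import numpy as np

def unnormalized_target(x):
    # proportional to a Normal(3, 1) density; the normalizing constant is never needed
    return np.exp(-0.5 * (x - 3.0) ** 2)

def metropolis_hastings(n_samples=10000, step=1.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    samples, x = [], x0
    for _ in range(n_samples):
        proposal = x + step * rng.normal()        # symmetric random-walk proposal
        accept_prob = min(1.0, unnormalized_target(proposal) / unnormalized_target(x))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)
    return np.array(samples)

chain = metropolis_hastings()
print(chain.mean())    # approximates E_pi[x] as in Eq. 19
```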

For a more detailed explanation on MCMCs, the readers may consult [197], [198].

C. Variational Inference (VI)

Variational Inference (VI) is of high importance in modern machine learning architectures. Regularization through variational dropout [199], [200] and representing model uncertainty in classification tasks and reinforcement learning [201] are a few of the scenarios in which variational inference is utilized. The core idea in VI is to find an approximate distribution function that is simpler than the true posterior and whose Kullback-Leibler divergence [69] from the true posterior is as low as possible [202].

The problem changes to the search for a candidate density function $q_{c}(\text {z})$ among a specified family of distributions $D$ such that it best resembles the true posterior function:\begin{equation*} q_{c}(\text {z}) = \underset {q(z) \in D}{\text {argmin}} \text { KL}(q(\text {z}) \| p(\text {z|x})) \tag{20}\end{equation*} Eq. 20 may be optimized indirectly through maximization of the variational objective function ELBO(q):\begin{equation*} \text { ELBO(q)} = \mathop {\mathbb {E}}[\text {log }p(\text {x|z})] - \text {KL }(q(\text {z}) \| p(\text {z})) \tag{21}\end{equation*} where ELBO(q) is called the evidence lower bound. The term $\text {KL}(q(\text {z})\|p(\text {z}))$ encourages the density function $q(\text {z})$ to stay close to the prior, while the expected likelihood $\mathop {\mathbb {E}}[\text {log }p(\text {x|z})]$ encourages latent variable configurations that better explain the observed data. Eq. 21 may be rewritten as follows:\begin{equation*} \text { log }p(\text {x}) = \text {KL} (q(\text {z}) \| p(\text {z|x})) + \text {ELBO(q)} \tag{22}\end{equation*} The value of the left-hand side (the log-evidence) is constant and $\text {KL}(.) \geq 0$ . As a result, ELBO(q) is a lower bound on the evidence.
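The sketch below estimates ELBO(q) in Eq. 21 by Monte Carlo for a toy model with a Gaussian prior, a Gaussian variational posterior, and a Gaussian likelihood; all parameter values are illustrative assumptions.

```python
import torch
import torch.distributions as dist

x = torch.tensor(2.5)                                              # a single observation
q = dist.Normal(loc=torch.tensor(1.0), scale=torch.tensor(0.5))    # variational posterior q(z)
prior = dist.Normal(0.0, 1.0)                                       # prior p(z)

z = q.rsample((1000,))                                   # reparameterized samples from q
log_likelihood = dist.Normal(z, 1.0).log_prob(x).mean()  # Monte Carlo estimate of E_q[log p(x|z)]
kl = dist.kl_divergence(q, prior)                        # KL(q(z) || p(z)), closed form for Gaussians
elbo = log_likelihood - kl
```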

There are numerous extensions and proposed approaches for variational inference in the literature such as Expectation Propagation (EP) [203] and stochastic gradient optimization [197]. For a more detailed and comprehensive review, the readers are encouraged to consult [68], [202].

D. Applications

Bayesian models and variational inference techniques have demonstrated their versatility and effectiveness in various domains. Bayesian inference plays a crucial role in calculations across disciplines, including personalized advertising recommendation systems, healthcare applications [197], research in astronomy [204], and search engines [205].

In the fields of Physics and Chemistry, these models are utilized to simulate physical objects such as fluids, rigid solids, and deformable materials [157].

By leveraging Bayesian models, researchers have made significant strides in computer vision tasks, particularly in the field of semantic segmentation [206].

The impact of Bayesian models is also evident in the domain of robotics. Its application has been pivotal in tasks such as robot perception, enabling machines to understand and interpret their environment accurately. Additionally, it has facilitated advancements in motion planning, allowing robots to navigate complex and dynamic environments [207], [208], [209].

SECTION VI.

Convolutional Neural Network

Convolutional Neural Networks (CNNs) are the prevalent approach for extracting features from image data. Though several variants of CNNs have been proposed, they all share largely the same basic components: convolution, pooling, and fully-connected layers.

  1. Convolution Layer

    Extracts features from a given input layer and stores them in several feature maps which make up the higher layer. Each convolution layer has several feature extractors, called kernels (filters), each of which corresponds to a single feature map. Every single neuron of a feature map corresponds to a group of neighboring neurons from the input layer, referred to as the neuron’s receptive field. Each kernel is used to calculate the convolution over all of the possible receptive fields of the input layer. The convolution value is then passed through a non-linear activation function such as $tanh(.)$ , $sigmoid$ , or $ReLU$ [210] to add non-linearity to the representation.

    The feature value $z_{i,j,k}^{l}$ at location $(i,j)$ of the $k$ -th feature map of layer $l$ can be calculated as:\begin{equation*} z_{i,j,k}^{l} = w_{l}^{k} \odot x_{i,j}^{l} + b_{k}^{l} \tag{23} \end{equation*} where $x_{i,j}^{l}$ represents the receptive field of neuron $z_{i,j,k}^{l}$ in the input layer, and the symbol $\odot $ represents the discrete convolution, i.e., the sum of the elements of the Hadamard (element-wise) product of the two matrices. The activation of each feature can be obtained from Eq. 24 [211]:\begin{equation*} a_{i,j,k}^{l} = f(z_{i,j,k}^{l}) \tag{24}\end{equation*} where $f$ refers to the activation function.

  2. Pooling Layer

    The next step after convolution is reducing the size of the shared feature map. Various pooling operations have been proposed, though average pooling and max pooling are typically used [212]. The pooling operation can be represented mathematically as:\begin{equation*} y_{m,n,k}^{l} = pool(\{\forall a_{k}^{l} \in V_{m,n}\}) \tag{25}\end{equation*}

    The neuron $y_{m,n,k}^{l}$ at location $(m,n)$ of the $k$ -th pooled feature map of layer $l$ is calculated from a set of neighboring neurons $V_{m,n}$ on the convolution feature map passed through the pooling function $pool$ . One of the main advantages of convolutions over other architectures is shift-invariance: a small displacement (rotation, translation) of the input does not change the output dramatically. This characteristic comes from sharing the kernels and from the pooling layers.

  3. Fully-Connected Layer

    After several convolutional and pooling layers, which act as feature extractors, typically a few fully-connected layers (MLPs, as discussed in Section II) are added in order to perform high-level reasoning given the extracted features [213]. For classification tasks, the fully-connected network takes all the neurons from the previous layers as input and provides an output over the classes, followed by a $softmax$ function. Given a dataset of $N$ pairs of inputs and outputs $\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{N},y_{N})\}$ , and the weights and biases of the whole network denoted by $\theta $ , the total classification error of the network can be calculated by the following loss function:\begin{equation*} \mathcal {L} = \frac {1}{N}\sum _{i=1}^{N}\ell (\theta;y_{i},\hat {y}_{i}) \tag{26}\end{equation*} where $\hat {y}_{i}$ denotes the class label calculated by the network and $y_{i}$ is the true value of the class label. The training process of the network involves the global minimization of this loss function. The model’s parameters $\theta $ can be updated using Stochastic Gradient Descent (SGD) [214], which is a common method for training CNNs, although various other optimization methods and loss functions have been proposed. Additionally, the fully-connected layers and the final layer of a CNN may be replaced by other types of networks or models. Numerous variants of CNNs and additional components have also been proposed in the literature [215], [216], [217], [218]. The readers can consult [212] for a comprehensive introduction to CNNs. A minimal code sketch of this pipeline is given after this list.

    Fig. 6 provides an abstract depiction of the structure and components of a CNN as described above.
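The sketch below assembles the convolution, pooling, and fully-connected components described above into a small classifier; the input size, channel counts, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),   # convolution layer + activation
            nn.MaxPool2d(2),                                         # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(                             # fully-connected layers
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
x = torch.randn(8, 1, 28, 28)                    # a toy batch of 28x28 single-channel images
logits = model(x)
labels = torch.randint(0, 10, (8,))              # toy class labels
loss = F.cross_entropy(logits, labels)           # Eq. 26 with a cross-entropy per-sample loss
```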

FIGURE 5. Illustration of graph convolution.

FIGURE 6. Abstract structure of convolutional neural networks.

A. Applications

CNNs and their extensions can be seen in almost all state-of-the-art methods of deep representation learning. They have demonstrated competitive capabilities in numerous supervised and unsupervised tasks on 2D/3D images and point cloud data [27], such as image retrieval [219], segmentation [23], [220], [221], registration [24], object detection [222], [223], and data augmentation [85], [224]. They have also been applied to sequential data to extract longitudinal patterns of signals [225], [226], i.e., they are applicable to 1D data as well. CNNs also empower reinforcement learning algorithms [227] and may be utilized to analyze graph data [127], [228], as discussed in Section IV.

Fig 7 showcases the evolution of CNN models over the years. The progression highlights key milestones and breakthrough models that have significantly impacted deep representation learning.

FIGURE 7. A general timeline for some of the most important CNN-based models in history [229], [230], [231], [232], [233].

SECTION VII.

Word Representation Learning

Representing words numerically is a crucial component of natural language processing, as it forms the foundation for employing Artificial Neural Networks (ANN) in NLP tasks. The simplest way to represent a word in a computer-readable format is through a one-hot vector, where each word is assigned a dimension in a vector equal to the size of the vocabulary [234]. The main flaw in this approach is that it neglects the semantic relatedness between words. The vector space model [235] was one of the first methods for representing words mathematically, and it made it possible to calculate similarity between documents in the field of Information Retrieval.

More recently, word embeddings have emerged as a method for learning low-dimensional vector representations of words from text corpora, capturing the semantic and contextual information of words [236], [237].

Word embeddings not only capture the semantic meaning of a word but may also encode word-context information. A language model can be viewed as a tool that represents a probability distribution over sequences of words, estimated from training data [238]. Language models, such as those based on neural networks, learn the joint probability function of word sequences in a corpus [238], [239]. One of the earliest neural language models, proposed by Bengio et al. [239], aimed to address the challenge of learning the joint probability function of word sequences in the face of the curse of dimensionality. They introduced a method that estimates distributed representations for each word, allowing the model to generalize to exponentially many semantically neighboring sentences. In their model, the learned distributed encoding of each word is fed into the last unit (softmax) to predict the probabilities of upcoming words. Other research works have also explored word embeddings in prediction models [240], [241], [242]. Later on, as one of the most popular approaches, Word2vec [43] proposed two methods: Continuous Bag of Words (CBOW) and Skip-Gram (SG) [44], [243].

In CBOW, the model predicts the middle word given the distributed representations of its context (the surrounding words), while SG predicts the context words given the center word. To address the computational burden of these methods, negative sampling was proposed [244]. Rather than normalizing over the entire vocabulary in the softmax denominator, negative sampling approximates the objective by contrasting each observed (word, context) pair with a small number of randomly sampled negative words, with very frequent words subsampled.

In this model, the probability of a context (output) word given the center word is computed with a softmax over the vocabulary, and training minimizes the cross-entropy loss $H(p,q)= - \sum _{x \in X} p(x)\log q(x)$ :\begin{equation*} P(\text {out} \mid \text {center})=\frac {\exp \left ({u_{\text {out}}^{T} v_{\text {center}}}\right)}{\sum _{w \in V} \exp \left ({u_{w}^{T} v_{\text {center}}}\right)} \tag{27}\end{equation*}

As another influential approach, GloVe [44] captures the relationship between a pair of words through the ratio of their co-occurrence probabilities with selected context words. Later on, FastText [245] built upon GloVe and Word2Vec to mitigate their shortcoming in handling out-of-vocabulary (OOV) words [246], [247]. FastText builds word representations by considering subword information: it represents each word as a bag of character n-grams and utilizes these subword representations to generate word embeddings. This approach addresses the challenge of handling OOV words, as it can capture the meaning of unseen words based on their character composition [245].
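
To make the OOV difference concrete, the following hedged sketch uses the gensim library (assuming a gensim 4.x API); the toy corpus and hyper-parameters are placeholders. A skip-gram Word2Vec model has no vector for an unseen word, whereas FastText composes one from its character n-grams.

from gensim.models import Word2Vec, FastText

# A toy corpus; in practice these would be tokenized sentences from a large text collection.
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "royal", "decree", "was", "issued"]]

# Skip-gram (sg=1) with negative sampling (negative=5), cf. Word2Vec [43], [244].
w2v = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)

# FastText represents each word as a bag of character n-grams [245].
ft = FastText(sentences, vector_size=50, window=2, sg=1, min_count=1, min_n=3, max_n=5)

print(w2v.wv["king"][:5])        # learned embedding for an in-vocabulary word
print("kingly" in w2v.wv)        # False: Word2Vec has no vector for unseen words
print(ft.wv["kingly"][:5])       # FastText composes a vector for the unseen word from subword n-grams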

A. Applications

In language modeling, word embeddings are used to capture the semantic meaning and contextual information of words. Using these embeddings, the model is capable of performing tasks such as machine translation [45], [248], sentiment analysis [249], [250], and named entity recognition [251], [252].

Word embeddings are being used in information retrieval applications, making it possible to calculate word similarity accurately and rank documents more effectively. Recent advances in this area include models such as SBERT (Sentence-BERT) [253] and CLIP (Contrastive Language-Image Pretraining) [254], due to their ability to enhance semantic understanding and cross-modal retrieval using contextual embeddings.

In question answering, word embeddings are a key component of models like GPT (Generative Pre-trained Transformer) [255] and T5 (Text-to-Text Transfer Transformer) [256], which incorporate language generation capabilities and perform strongly on benchmark datasets.

Moreover, word embeddings are used in text classification tasks, including sentiment analysis and topic classification. Recent models such as ULMFiT (Universal Language Model Fine-tuning) [257] and RoBERTa (Robustly Optimized BERT Pretraining Approach) [258] show superior results by fine-tuning large pretrained language models.

In addition, word embeddings are employed in document summarization [259], [260], document clustering [261], [262], and text generation [263], [264].

SECTION VIII.

Sequential Representation Learning

In many real-world applications, data often exhibit a sequential nature, where the order of elements in a sequence holds valuable information. Examples of such sequential data include sentences in NLP tasks [265] and medical records in healthcare research [266], [267]. In order to effectively capture and represent the underlying patterns in these sequences, it is crucial to employ architectures that can handle inputs of varying lengths and capture the dependencies between data points.

Recurrent Neural Networks (RNNs) [57] have emerged as a popular choice for sequential representation learning due to their ability to address these requirements. RNNs are designed to process sequences by sharing parameters across different steps [47], allowing them to handle inputs with varying lengths. This characteristic enables RNNs to handle sequential data more effectively than traditional feedforward neural networks.

RNNs capture dependencies between data points. These models are capable of considering the historical context when processing each element in a sequence by maintaining an internal state or memory. The memory component of RNNs is essential for capturing the sequential patterns present in data, as it enables them to model relationships and dependencies between elements over time.

A. Recurrent Neural Network

The general architecture of a recurrent neural network is composed of cells with hidden states. In mathematical terms, the hidden units $h_{t}$ store the state of the model, which depends on the state at the previous time step $h_{t-1}$ and the input of the current time step $x_{t}$ :\begin{equation*} h_{t} = f_{a}(Wh_{t-1} + Ux_{t} + b) \tag{28}\end{equation*} where $U,W$ are weight matrices, $b$ is a bias vector, and $f_{a}$ represents the activation function. The same set of model parameters is used to calculate $h_{t}$ for every element in a sequence of inputs $(x_{1},x_{2},\ldots,x_{n})$ ; in this way, the parameters are shared across the input elements. For supervised tasks such as classification, the hidden unit $h_{t}$ is mapped to the output variables $y_{t}$ via the weight matrix $V$ :\begin{equation*} \hat {y}_{t} = \text {softmax}(Vh_{t} + c) \tag{29}\end{equation*} where $c$ is a bias vector. Due to the recursive nature of Eq. (28), the unfolded computational graph for a given input sequence can be displayed as a regular feedforward neural network.
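
A minimal numpy sketch of Eqs. (28) and (29) follows; the dimensions and the choice of tanh for $f_{a}$ are illustrative.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

d_in, d_hidden, d_out = 4, 8, 3                 # illustrative dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(d_hidden, d_hidden))       # hidden-to-hidden weights
U = rng.normal(size=(d_hidden, d_in))           # input-to-hidden weights
V = rng.normal(size=(d_out, d_hidden))          # hidden-to-output weights
b = np.zeros(d_hidden)
c = np.zeros(d_out)

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W @ h_prev + U @ x_t + b)     # Eq. (28) with f_a = tanh
    y_t = softmax(V @ h_t + c)                  # Eq. (29)
    return h_t, y_t

# The same parameters (W, U, V, b, c) are shared across all time steps.
h = np.zeros(d_hidden)
for x in rng.normal(size=(5, d_in)):            # a toy sequence of 5 inputs
    h, y = rnn_step(h, x)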

RNNs can be trained by back-propagation [47] applied to the unfolded computational graph (back-propagation through time). Two of the most important problems with RNNs are vanishing and exploding gradients: the longer the input sequence, the more gradient values are multiplied together, which may cause the gradient to shrink toward zero or grow exponentially large. Either way, the RNN fails to learn. Various methods have been proposed to facilitate training RNNs on longer sequences. For instance, skip connections [268] let information flow from a more distant past to the present state. Another method is the incorporation of leaky units [269], which keep track of older hidden states through linear self-connections. Nonetheless, the problem of learning long-term dependencies is yet to be resolved completely.

B. Long Short-Term Memory (LSTM)

The most prominent architecture for learning from sequential data is the Long Short-Term Memory (LSTM), which combines several of these strategies for handling longer dependencies [270]. In an LSTM, the gating behavior is learned as part of the training procedure rather than fixed by hand. The general architecture of an LSTM cell contains separate gates that control the information flow across the time steps of a sequence:

1) Input Gate

Controls whether the input is accumulated into the hidden state.\begin{equation*} i_{t} = \sigma (W_{xi} \cdot x_{t} + W_{hi} \cdot h_{t-1} + b_{i}) \tag{30}\end{equation*} where:\begin{align*} & i_{t} \text { is the input gate at time step } t, \\ & x_{t} \text { is the current input at time step } t, \\ & h_{t-1} \text { is the previous hidden state at time step } t-1, \\ & W_{xi} \text { is the weight matrix for the input connections,} \\ & W_{hi} \text { is the weight matrix for the hidden state connections,} \\ & b_{i} \text { is the bias term for the input gate,} \\ & \sigma \text { is the sigmoid activation function.}\end{align*}

The sigmoid activation function is defined as:\begin{equation*} \sigma (x) = \frac {1}{1 + e^{-x}} \tag{31}\end{equation*}

The input gate $i_{t}$ determines the relevance of the current input $x_{t}$ and its impact on updating the hidden state $h_{t}$ . A value close to 0 for $i_{t}$ indicates that the current input is ignored, while a value close to 1 indicates that the current input has a significant impact on the hidden state update.

By incorporating the input gate, LSTM networks can selectively accumulate relevant information from the current input and previous hidden state, enabling them to capture long-term dependencies and effectively learn from sequential data.

2) Forget Gate

Controls the amount of effect that the previous state has on the current state. Whenever this gate lets information flow in completely, it acts as a skip-connection [268]; otherwise, similar to leaky units, it keeps track of previous hidden states with a linear coefficient.\begin{equation*} f_{t} = \sigma (W_{xf} \cdot x_{t} + W_{hf} \cdot h_{t-1} + b_{f}) \tag{32}\end{equation*} where:\begin{align*} & f_{t} \text { is the forget gate at time step } t, \\ & x_{t} \text { is the current input at time step } t, \\ & h_{t-1} \text { is the previous hidden state at time step } t-1, \\ & W_{xf} \text { is the weight matrix for the input connections,} \\ & W_{hf} \text { is the weight matrix for the hidden state connections,} \\ & b_{f} \text { is the bias term for the forget gate,} \\ & \sigma \text { is the sigmoid activation function.}\end{align*}

3) Output Gate

The output gate $o_{t}$ controls whether the output of the LSTM cell should be stopped or allowed to propagate further. It regulates the flow of information from the hidden state to the output of the LSTM cell.\begin{equation*} o_{t} = \sigma (W_{xo} \cdot x_{t} + W_{ho} \cdot h_{t-1} + b_{o}) \tag{33}\end{equation*} where:\begin{align*} & o_{t} \text { is the output gate at time step } t, \\ & x_{t} \text { is the current input at time step } t, \\ & h_{t-1} \text { is the previous hidden state at time step } t-1, \\ & W_{xo} \text { is the weight matrix for the input connections,} \\ & W_{ho} \text { is the weight matrix for the hidden state connections,} \\ & b_{o} \text { is the bias term for the output gate,} \\ & \sigma \text { is the sigmoid activation function.}\end{align*}
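
Putting the three gates together, the following minimal numpy sketch of one LSTM step may help; note that it also includes the standard candidate and cell-state updates, $c_{t} = f_{t}\odot c_{t-1} + i_{t}\odot \tilde {c}_{t}$ and $h_{t} = o_{t}\odot \tanh (c_{t})$ , which the gate equations above modulate. Dimensions are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))              # Eq. (31)

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
def mat(shape): return rng.normal(scale=0.1, size=shape)

# One (W_x*, W_h*, b_*) triple per gate plus one for the candidate state.
Wxi, Whi, bi = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # input gate, Eq. (30)
Wxf, Whf, bf = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # forget gate, Eq. (32)
Wxo, Who, bo = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # output gate, Eq. (33)
Wxc, Whc, bc = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # candidate state

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(Wxi @ x_t + Whi @ h_prev + bi)
    f_t = sigmoid(Wxf @ x_t + Whf @ h_prev + bf)
    o_t = sigmoid(Wxo @ x_t + Who @ h_prev + bo)
    c_tilde = np.tanh(Wxc @ x_t + Whc @ h_prev + bc)
    c_t = f_t * c_prev + i_t * c_tilde           # gated accumulation into the cell state
    h_t = o_t * np.tanh(c_t)                     # output gate controls what is exposed
    return h_t, c_t

h = c = np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):             # a toy sequence of 6 inputs
    h, c = lstm_step(x, h, c)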

C. Gated Recurrent Unit (GRU)

Another recurrent neural network variant that addresses long-term dependencies through gating mechanisms is the Gated Recurrent Unit (GRU) [271]. The GRU architecture is simpler than that of the LSTM, resulting in fewer parameters within a GRU cell. GRU cells consist of two types of gates:

1) Update Gate

Controls the weights of interpolation of the current state and the candidate state in order to update the hidden state.\begin{equation*} z_{t} = \sigma (W_{xz} \cdot x_{t} + W_{hz} \cdot h_{t-1} + b_{z}) \tag{34}\end{equation*} where:\begin{align*} & z_{t} \text { is the update gate at time step } t, \\ & x_{t} \text { is the current input at time step } t, \\ & h_{t-1} \text { is the previous hidden state at time step } t-1, \\ & W_{xz} \text { is the weight matrix for the input connections}, \\ & W_{hz} \text { is the weight matrix for the hidden state connections}, \\ & b_{z} \text { is the bias term for the update gate}, \\ & \sigma \text { is the sigmoid activation function}.\end{align*}

The update gate $z_{t}$ controls the weights used for interpolating between the current state and the candidate state in order to update the hidden state. A value close to 0 for $z_{t}$ indicates that the current state is mostly updated based on the candidate state, while a value close to 1 indicates that the current state is mostly retained from the previous hidden state.

By incorporating the update gate, GRU networks can selectively update and retain relevant information from both the current input and the previous hidden state, enabling them to capture and utilize long-term dependencies effectively.

2) Reset Gate

Makes the hidden state forget the past dependencies.\begin{equation*} r_{t} = \sigma (W_{xr} \cdot x_{t} + W_{hr} \cdot h_{t-1} + b_{r})\tag{35} \end{equation*} where:\begin{align*} & r_{t} \text { is the reset gate at time step } t, \\ & x_{t} \text { is the current input at time step } t, \\ & h_{t-1} \text { is the previous hidden state at time step } t-1, \\ & W_{xr} \text { is the weight matrix for the input connections}, \\ & W_{hr} \text { is the weight matrix for the hidden state connections}, \\ & b_{r} \text { is the bias term for the reset gate}, \\ & \sigma \text { is the sigmoid activation function}.\end{align*}

The reset gate $r_{t}$ controls the degree to which the hidden state should forget past dependencies. By selectively resetting the hidden state based on the reset gate, the GRU can adjust the influence of previous states on the current state.

All the gates in these gated recurrent network variants are controlled by small linear networks passed through a sigmoid. Training the recurrent neural network as a whole also trains and updates the weights of these gate-controller networks.
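
The following minimal numpy sketch of one GRU step uses the interpolation $h_{t} = z_{t}\odot h_{t-1} + (1 - z_{t})\odot \tilde {h}_{t}$ , which matches the interpretation of the update gate given above (the opposite convention also appears in the literature); dimensions are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 4, 8
rng = np.random.default_rng(1)
def mat(shape): return rng.normal(scale=0.1, size=shape)

Wxz, Whz, bz = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # update gate, Eq. (34)
Wxr, Whr, br = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # reset gate, Eq. (35)
Wxh, Whh, bh = mat((d_h, d_in)), mat((d_h, d_h)), np.zeros(d_h)   # candidate state

def gru_step(x_t, h_prev):
    z_t = sigmoid(Wxz @ x_t + Whz @ h_prev + bz)
    r_t = sigmoid(Wxr @ x_t + Whr @ h_prev + br)
    h_tilde = np.tanh(Wxh @ x_t + Whh @ (r_t * h_prev) + bh)   # reset gate forgets past dependencies
    return z_t * h_prev + (1.0 - z_t) * h_tilde                # update gate interpolates old and candidate state

h = np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):             # a toy sequence of 6 inputs
    h = gru_step(x, h)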

D. Applications

There are myriad problems and use cases for RNNs. Examples include the analysis and embedding of texts and medical reports to be combined with medical images [272], real-time denoising of medical video [273], classification of electroencephalogram (EEG) data [274], generating captions for images [28], [29], [30], [275], biomedical image segmentation [276], and semantic segmentation of unstructured 3D point clouds [277], [278]. Other examples of RNN applications are predictive maintenance [279] and the prediction and classification of ICU outcomes [31], [280], [281].

In table 1, we have presented a summary of a few of the most important applications of RNNs, LSTMs, and GRUs with or without attention.

TABLE 1. General use cases for RNN, LSTM, and GRU, with or without attention, in different areas.

SECTION IX.

Attention-Based Models

In deep learning, attention-based models have emerged as a powerful paradigm, providing breakthroughs to various domains by focusing selectively on relevant information. These models have gained significant popularity in NLP, where they have revolutionized tasks such as machine translation, sentiment analysis, and text summarization. By dynamically assigning different weights to different parts of the input sequence, attention mechanisms allow the models to capture dependencies and relationships effectively [45], [336]. As a result, not only are the predictions more accurate, but they are also more interpretable, since the important aspects of the input are highlighted [337]. Attention-based encoder-decoder models were introduced to address the shortcomings of RNN, LSTM, and GRU models, which were widely regarded as the state-of-the-art approaches at the time.

A. Learning to Align and Translate

The first encoder-decoder model with an attention mechanism was proposed by [34] in 2015 as a novel architecture to improve the performance of neural machine translation models. The key contribution of Bahdanau et al. [34] was the introduction of an attention mechanism in the decoder, which calculates a weighted sum of the hidden states of the input. Unlike the basic encoder-decoder model that utilizes a single fixed-length vector, Bahdanau et al. extended this approach by encoding the input as a variable-length sequence of vectors. During decoding, the attention mechanism allows for selective focus on relevant parts of the input.

As illustrated in Fig 8, the context vector $c_{t}$ is what attention adds to this model. The decoder state $s_{t}=f(s_{t-1}, y_{t-1}, c_{t})$ is thus based on the context vector $c_{t}= \sum _{t'} \alpha _{tt'}h_{t'}$ , a weighted sum of the encoder hidden states, where \begin{equation*} \alpha _{tt'} = \frac {\exp (e_{tt'})}{ \sum \limits _{T} \exp (e_{tT})} \tag{36}\end{equation*} and $e_{tt'}$ is the alignment score. In other words, $\alpha _{tt'}$ is the amount of attention that $y_{t}$ (the output at time step $t$ ) pays to $x_{t'}$ . There are different options for calculating the alignment score, which is itself learned, but in general it takes the form $align(h_{t'},s_{t-1})$ .
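
A minimal numpy sketch of Eq. (36) and the resulting context vector is shown below; the alignment score is computed with a plain dot product here, which is only one of the possible choices for $align(\cdot,\cdot)$ .

import numpy as np

rng = np.random.default_rng(0)
T_src, d = 5, 8
h_src = rng.normal(size=(T_src, d))        # encoder hidden states h_1..h_T
s_prev = rng.normal(size=d)                # previous decoder state s_{t-1}

e = h_src @ s_prev                         # alignment scores e_{tt'} (dot-product choice)
alpha = np.exp(e) / np.exp(e).sum()        # Eq. (36): attention weights over source positions
c_t = alpha @ h_src                        # context vector: weighted sum of encoder states

# c_t is then fed, together with s_{t-1} and y_{t-1}, into the decoder update
# s_t = f(s_{t-1}, y_{t-1}, c_t).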

FIGURE 8. Based on the current target state $h_{t}$ and all source states $h_{s}$ , the model determines an alignment weight vector at time step $t$ . A global context vector $c_{t}$ is then computed as the weighted average over all the source states [34].

The model proposed by Bahdanau et al. introduced a groundbreaking approach that served as a source of inspiration for subsequent state-of-the-art models. Nonetheless, it exhibited limitations inherent to conventional encoder-decoder recurrent models and did not possess parallel computing capabilities.

B. Transformers

Despite the notable contribution made by Bahdanau et al. [34] in introducing attention for RNN-based models, their model is still challenging to train because of the long gradient path, especially for long data sequences. The introduction of Transformers [45] brought a significant breakthrough by relying entirely on attention over the input sequence. Transformers overcome the two main drawbacks of LSTM-based models: the lack of parallel computation and the problem of long gradient paths. The model architecture is based on a stack of encoder and decoder layers, each sharing the same structure. The input first passes through an “input embedding” layer that transforms one-hot token representations into word vectors. After positional encoding, the result is fed into the encoder. The core component of the encoder and decoder blocks is a multi-headed self-attention mechanism $(Q, K, V)$ , followed by point-wise feed-forward networks.

Self-attention structure:

To estimate the relevance of each element in a given sequence to all others, self-attention is employed. The process involves the following steps:

  • Step 1:

    Randomly initialize the weight matrices $W_{Q}$ , $W_{K}$ , $W_{V}$ and compute queries, keys, and values from the embedded input $X$ :

    $Q = X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V}$

    where $X \in \mathbb{R}^{T \times D}$ , $W_{Q} \in \mathbb{R}^{D \times D_{Q}}$ , $W_{K} \in \mathbb{R}^{D \times D_{K}}$ , $W_{V} \in \mathbb{R}^{D \times D_{V}}$ .

  • Step 2:

    Calculate the attention output $Z$ for each input by applying a row-wise softmax to the scaled scores obtained from the pairwise multiplication of queries and keys, and weighting the values accordingly:\begin{equation*} Z(Q,K,V)= \textit {Softmax}\left({\frac {Q K^{T}}{\sqrt {d_{K}}}}\right) V \tag{37}\end{equation*}

Next, the outputs $Z$ of the individual attention heads are concatenated and multiplied by an additional learned weight matrix; the result is then fed to a fully connected feed-forward network (FCNN).
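
The two steps above amount to the scaled dot-product attention of Eq. (37); a minimal single-head numpy sketch, with the multi-head concatenation reduced to a single output projection, is given below (dimensions are illustrative).

import numpy as np

rng = np.random.default_rng(0)
T, D, Dk = 6, 16, 8                          # sequence length, model width, head width (illustrative)
X = rng.normal(size=(T, D))                  # embedded (and positionally encoded) input

W_Q, W_K, W_V = (rng.normal(scale=D**-0.5, size=(D, Dk)) for _ in range(3))
W_O = rng.normal(scale=Dk**-0.5, size=(Dk, D))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # Step 1: queries, keys, values

scores = Q @ K.T / np.sqrt(Dk)               # pairwise relevance of every position to every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
Z = weights @ V                              # Step 2: Eq. (37)

out = Z @ W_O                                # output projection (concatenation of heads in the multi-head case)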

One common operation is to apply another set of feed-forward layers to the $Z$ scores, often referred to as the “point-wise feed-forward network” (FFN). This step allows for additional nonlinear transformations and feature extraction.

After the FFN, the outputs can be passed to subsequent layers of the transformer, which may involve stacking multiple encoder-decoder layers or performing additional attention mechanisms. This hierarchical structure enables capturing complex dependencies and relationships among the input elements.

C. Extra Large Transformers

Transformers have become indispensable to the modern deep learning stack, significantly impacting several fields. This has made them a center of focus and produced an overwhelming number of model variants proposing enhancements to mitigate a widely known concern with self-attention: its quadratic time and memory complexity [338]. These two drawbacks can pose significant challenges to model scalability in many settings. To overcome this limitation, researchers have explored various approaches, which can be categorized in several ways [339]. Some of these approaches include:

1) Recurrence

One of the most well-known extensions to the vanilla Transformer model is Transformer-XL [340]. It employs a segment-level recurrence mechanism that connects multiple adjacent blocks. This model introduces two key ideas. Firstly, by using segment-level recurrence, hidden states from the previous batch can be cached and reused. Secondly, it introduces a novel positional encoding scheme that enables temporal coherence.

As an extension to the block-wise approach, Transformer-XL splits the input into small non-overlapping subsequences known as blocks [341]. Although it exhibits impressive performance compared to the vanilla transformer, this model lacks the ability to maintain long-term dependencies and discards past activations as it progresses through the blocks. Specifically, Transformer-XL propagates gradients across the current segment, caches them, processes the second segment using the memory from the first segment (without gradients for the first segment), moves on to the third segment, and discards the gradient information from the first window. Consequently, this can be seen as a form of truncated back-propagation through time (BPTT).

The distinctive aspect of this model, which sets it apart from others, lies in its relative positional encoding scheme that ensures temporal coherence. The relative positional encoding encodes distances on edges rather than nodes. While previous work on relative positional encoding existed [342], Transformer-XL introduces two additional features: a global content and location bias, and the replacement of trainable positional embeddings with sinusoid embeddings. Their results demonstrate that Transformer-XL outperforms vanilla Transformers even without the use of a recurrence mechanism. Compressive Transformer [343] is another model which can be classified as a recurrence approach.

2) Reduced Dimensions / Kernels / Low-Rank Methods

“Transformers are RNNs” [344] introduces the idea of replacing the softmax with a kernel feature map $\phi (x)=\mathrm {elu}(x)+1$ that approximates the attention matrix. The function is applied to the queries and keys, which allows the order of the multiplications to be rearranged so that $\phi (K)^{T}V$ is computed first and the full $N\times N$ matrix ${Q}_{(N\times D)}{K^{T}}_{(D\times N)}$ is never formed. Linformer [345] and Synthesizer [346] are other models based on this approach.
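
As a rough sketch of this reordering under the kernel $\phi (x)=\mathrm {elu}(x)+1$ , the snippet below computes $\phi (K)^{T}V$ first so that the $N\times N$ attention matrix is never formed; dimensions are illustrative and the normalization follows the linearized-attention formulation of [344].

import numpy as np

def phi(x):
    # elu(x) + 1, the feature map used in "Transformers are RNNs" [344]
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
N, D = 1000, 32                                   # sequence length and head width (illustrative)
Q, K, V = (rng.normal(size=(N, D)) for _ in range(3))

# Linearized attention: O(N * D^2) instead of O(N^2 * D).
KV = phi(K).T @ V                                 # (D, D) summary of keys and values
norm = phi(Q) @ phi(K).sum(axis=0)                # (N,) normalization term
out = (phi(Q) @ KV) / norm[:, None]               # approximate attention output, no N x N matrix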

3) Sparse attention

The Longformer model, introduced in [347], achieves linear $O(n)$ complexity by employing a global memory technique and drawing analogies to convolutional neural networks (CNNs). To reduce the computation, a combination of sliding-window and global attention is applied to each query. Longformer, along with BigBird [348], ETC [349], and the Swin Transformer [350], falls into the same category of models that utilize sparse attention techniques. Image Transformer [343] and Axial Transformer [351] are further examples of sparse attention works that primarily focus on vision data.
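
As a rough illustration of the sliding-window idea (ignoring the additional global tokens that Longformer adds), the sketch below restricts each query to a local band of keys by masking the attention scores; the window size is an arbitrary choice, and a real implementation would compute only the banded entries to obtain linear memory.

import numpy as np

rng = np.random.default_rng(0)
N, D, window = 12, 16, 2                       # each position attends to +-2 neighbours (illustrative)
Q, K, V = (rng.normal(size=(N, D)) for _ in range(3))

scores = Q @ K.T / np.sqrt(D)
idx = np.arange(N)
mask = np.abs(idx[:, None] - idx[None, :]) > window
scores[mask] = -np.inf                         # positions outside the local window are never attended to

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                              # banded attention: O(N * window) non-zero weights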

D. Pre-trained Models

Pre-trained models are neural networks that have been trained on large-scale corpora and are designed to be transferred and fine-tuned for various downstream tasks. Word embeddings, which enabled the use of machine learning for processing natural language, can be viewed as pioneers of widely used pre-trained representations. Word2vec [43] and GloVe [44], discussed earlier, are among the most famous models learning a fixed embedding for each word in vector space. In what follows, we try to pinpoint the most famous and important pre-trained models that capture contextual representations, which the models mentioned earlier are incapable of.

Reference [352] from 2015 is one of the earliest instances of supervised sequence learning using LSTMs that pre-trained an entire language model for use in various classification tasks. ELMo [353], a deep contextualized word representation, is analogous to the earlier one; however, it is bidirectional. CoVe [354] is another recurrent model in this category that has demonstrated good performance. GPT-1 [355] is a Transformer-based pre-trained model that was trained on a large book corpus to learn a universal representation, enabling transfer with minimal adaptation. “Deep Bidirectional Transformers for Language Understanding” - BERT [46], a well-known turning point in NLP (and perhaps the entire ML stack), trains on left and right context at the same time rather than processing each direction individually and concatenating at the end. Since BERT accesses information from both directions, it masks out a percentage of the input tokens to prevent the model from simply copying the input. XLNet [356], RoBERTa [357], ERNIE [358], and ELECTRA [359] are variations of the BERT model.
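
To show how such pre-trained contextual models are typically reused, the hedged sketch below loads a BERT checkpoint with the Hugging Face transformers library and attaches a fresh classification head for fine-tuning; the checkpoint name, label count, and example sentences are placeholders, and a recent transformers release is assumed.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load a pre-trained encoder and add a randomly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a surprisingly good survey", "not worth reading"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)   # forward pass returns the cross-entropy loss
outputs.loss.backward()                   # fine-tuning updates both the head and (optionally) the encoder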

All the mentioned models have proven effective in NLP studies. These successes within the NLP space inspired researchers to apply a similar approach to other domains. Reference [345] has shown that the success of pre-trained models is not limited to transformer-based ones: their pre-trained convolutional seq2seq model can beat pre-trained Transformers in machine translation, language modeling, and abstractive summarization. The Vision Transformer (ViT) [360] is one of the foremost recent pre-trained models supporting transfer learning in image classification tasks; it has shown outstanding results by training a pure transformer applied directly to sequences of image patches. ResNet [361] is a pretrained CNN-based model whose residual connections allow training networks with up to 1000 layers; the widely used ResNet50 variant consists of a succession of convolutional layers with different kernel settings. The models of [362], [363], [364], and [365] are all trained on large datasets for various image classification transfer learning use cases.

While these models have shown excellent results, the range of pre-trained models is not restricted to those mentioned here. One should carefully study and compare different models and approaches to find the best and most efficient model for a given dataset.

Language models (LMs) are computational models with the capacity to comprehend and generate human language. Language models have the impressive ability to calculate the likelihood of word sequences or generate new text based on given input [366]. Researchers find that scaling pre-trained language models such as BERT can lead to improved model capacity [367]. Recent years have witnessed incredible progress in the pre-training of large language models (LLMs) such as GPT-4 [368], PaLM 2 [369], and LLaMA 2 [370], which have proven extremely effective for transfer learning in NLP. While concerns remain around bias, safety, and environmental impact [371], [372], the applications [373], [374] of LLMs continue to advance rapidly. Though the eventual impacts remain speculative, LLMs have already catalyzed a revolution in representation learning.

E. Recurrent Cell to Rescue

With the availability of GPUs as a powerful computation tool in the machine learning toolkit, the LSTM emerged as a practical approach in numerous sequence-based machine learning models. After the introduction of word embeddings in 2013, LSTM and other RNN-based models became widely dominant in sequence learning problems. Once Transformers were introduced, with their all-to-all comparison mechanism and their performance on transfer learning tasks, they became the state-of-the-art and dominated the deep learning space.

1) RNN vs. Transformer

While transformers can grasp context and, through pre-trained models, transfer knowledge efficiently to tasks with limited supervision, these benefits come with quadratic memory and time complexity of $O(N^{2})$ [344]. Most current pre-trained transformer-based models only accept input sequences of up to 512 tokens. The IndRNN model [375] has shown the ability to process sequences of over 5000 time steps. The Legendre Memory Unit [376] is based on a recurrent architecture and can be implemented by a spiking neural network [377], maintaining dependencies across 100,000 time steps. Apart from computation cost, [378] by Facebook shows that the accuracy gap between BERT-based [46] pre-trained models and a vanilla LSTM on a massive corpus of data is less than 1%. Hence, a competitive accuracy result is achievable by training a simple LSTM when many training examples are available. They also show that reusing the pre-trained token embeddings learned by BERT can significantly improve the LSTM model’s accuracy. Reference [379] shows that standard transformers are not as efficient as RNN-based models for reinforcement learning tasks. [A-13] investigated the performance of Transformers and RNNs in speech applications and shows both have the same performance in text-to-speech tasks, with the Transformer performing slightly better in automatic speech recognition.

On the other hand, [379] shows that their attention-based model can outperform the state-of-the-art in terms of precision, time, and memory requirements for satellite image time series. Reference [380] compared LSTM performance with transformers in their proposed Frozen Pretrained Transformer model. They evaluate a diverse set of classification tasks to investigate the ability to learn representations for predictive learning across various modalities and show that transformers perform better. Reference [381] has proposed an improved Transformer-based comment generation method that extracts both text and structure information from program code, and shows that the model outperforms the regular Transformer and classical recurrent models. Reference [382] is a transformer-based transcoder network for end-to-end speech-to-speech translation that surpasses all the SOTA models in natural speech-to-speech translation tasks. Reference [383] introduced an Attentive Convolutional Transformer (ACT) which takes advantage of both Transformers and CNNs for text classification tasks; their experiments reveal that ACT can outperform RNN-based models on three different datasets.

2) Combining Recurrent and Attention

R-Transformer [384] inherits the Transformer architecture and adds what the authors call a “Local RNN” to capture sequential information in the data. The main improvement proposed is defining a sequence window to capture local sequential information and sliding the Local RNN over the whole time series to obtain global sequential information [385]. This approach is similar to a 1-D CNN; however, a CNN ignores the sequential information of positions, and the Transformer’s positional embedding that mitigates this problem is limited to a specific sequence length. Hence, they propose a Local RNN that can efficiently perform parallel computation over several short sequences, capturing long-term dependencies of the local structure through a multi-head attention mechanism. This model replaces the Transformer’s position embeddings with multiple local RNNs and can outperform simple recurrent approaches such as the GRU, LSTM, convolutional models [386], and the regular Transformer. Reference [381] has proposed a modified LSTM cell to mitigate the similarity between hidden representations learned by the LSTM across different time steps, in which case attention weights cannot carry much meaning. They propose two approaches: first, orthogonalizing the hidden state at time $t$ with respect to the mean of previous states to ensure low conicity between hidden states; second, a loss function that uses a joint probability for the ground-truth class and input sentences while also minimizing the conicity between hidden states. These modifications provide a more precise ranking of hidden states, are better indicators of the words important for the model’s predictions, and correlate better with gradient-based attribution methods.

While we have mentioned works in which LSTM outperforms Transformers and vice versa, one should study the proper approach based on the dataset, accessible computation resources, and so forth.

F. Applications

A wide range of transformer models and variants have been applied in various domains, demonstrating their versatility and effectiveness. We discuss some of these models’ notable applications.

In machine translation, models like Transformer [45] have surpassed traditional recurrent neural network-based models, achieving state-of-the-art performance. Several models have demonstrated the ability to understand context and generate accurate answers for question answering tasks, including BERT [387] and GPT4 [368]. Various transformer models have demonstrated excellent performance in classifying sentiment in text, such as BERT and XLNet. Moreover, transformer-based models, such as BART [388] and T5 [256], have been successfully applied to the summarization of lengthy documents and articles.

In the field of computer vision, transformers have made significant contributions. In image classification, Vision Transformer (ViT) [389] applies transformers and achieves competitive performance against convolutional neural networks (CNNs) on benchmark datasets. For object detection, DETR (DEtection TRansformer) [390] is a transformer-based model that directly predicts object bounding boxes and class labels. The use of transformer models has also been applied to image generation tasks, such as the VQ-VAE-2 model [391], which combines transformers with vector quantization to generate high-quality images. Additionally, transformers have been used in generative models such as DALL-E [392], which enables the generation of images from textual descriptions.

In robotics, transformers allow capturing long-range dependencies and global context, leading to improved perception capabilities [393], [394]. In robot planning [395], [396], transformers have been utilized for motion planning and task planning, leveraging their ability to capture complex spatial and temporal dependencies [397], [398]. Transformers have also been employed in robot control, learning policies and generating appropriate actions [399], [400], [401].

Fig 9 illustrates some of the best-known sequence-to-sequence models, including the recent transformer-based models, with their different applications.

FIGURE 9. A general timeline for some of the most important sequence models in history, including pretrained ones [402].

SECTION X.

Transfer Learning

Many advancements in machine learning techniques bring large improvements over existing benchmarks. There are, however, assumptions and challenges that make it difficult to apply these methods to real-world situations. In many cases, the assumption is that the trained model will be tested on the same feature distribution seen during training; this assumption usually does not hold as the environment changes. In addition, many promising results are obtained by training models on large datasets. These prerequisites make it very challenging to adapt to many different tasks. For many applications, acquiring large amounts of data can be costly, time-consuming, or even impossible. The absence of data for specific tasks is not the only challenge; massive data collection poses a huge privacy problem in many healthcare and medical applications [403]. In other cases, annotating the data requires an expert and can be expensive, as with low-resource languages [404].

Transfer learning aims to alleviate the problems mentioned above. Generally, transfer learning refers to a learner improving performance on a target domain by transferring knowledge from a source domain. It derives from the intuitive human ability to share knowledge across different domains and tasks: learning one language, for example, may help you learn a second, related one. The term itself is very general and there have been many extensions to it in recent years.

Transfer learning enables machine learning models to be retrained and reuse their previously learned knowledge. A general definition of the problem is divided into two components: Domain and Task [405].

The Domain is defined as $D =\{\chi,P(X)\}$ , with $\chi $ representing the feature space, and $P(X)$ for each $X = \{x_{1}, \ldots, x_{n}\} \in \chi $ denoting the marginal probability over the feature space. In cases where different domains are encountered, the source domain $D_{S}$ and the target domain $D_{T}$ can assume different feature spaces or marginal probability distributions [405].

Given a specific domain $D =\{\chi,P(X)\}$ , a task is represented by $T=\{y, f(x)\}$ , where $y$ is the label space and $f(x)=P(Y|X)$ denotes the function learned from the training data to predict the target, in a supervised manner, from the labeled data $\{x_{i}, y_{i}\}$ , where $x_{i} \in X$ and $y_{i} \in Y$ . In cases where no labels are available, as with unsupervised algorithms, $y$ can be a latent variable such as a cluster assignment, or a variable produced by an unsupervised algorithm (e.g., the reduced dimensions of the original data) [405]. Since both the domain $D$ and the task $T$ are defined as tuples, four transfer learning scenarios can arise.

The first scenario is when the feature spaces of the source and target domains are different, $\chi _{S} \neq \chi _{T}$ . A good example from the computer vision community is when the source data are images of humans but the target data are images of objects. A similar example can be found in NLP when it comes to cross-lingual adaptation.

The second scenario happens when $P(X_{s}) \neq P(X_{t})$ , i.e., the marginal probability distributions of the source and target domains are different. This scenario is generally known as domain adaptation. An example could be a detection problem where the source and target domains contain different kinds of cars.

The third occurs when $Y_{s} \neq Y_{t}$ , i.e., the label spaces of the two tasks are different. For example, consider a detection problem where the source task considers the detection of cars, while the target task considers animals.

The last is when $P(Y_{s}|X_{s}) \neq P(Y_{t}|X_{t})$ , i.e., the conditional probability distributions of the source and target tasks are different. Class imbalance between the source and target tasks is a very common example.

A number of surveys have been conducted to categorize the available methods [405], [406], [407]. In [406] the methods are categorized into three classes: transductive, inductive, and unsupervised transfer learning. In [407] the methods are categorized in more detail from the data and model perspectives. Although these categorizations give some insight, many newer methods do not fit into those categories or belong to more than one; zero-shot transfer learning [408], reinforcement transfer learning [409], and online transfer learning [410] are among these methods.

Several sub-categories of transfer learning can be considered for each of these main categories based on the nature of the knowledge transfer. In the following subsection, the most prominent sub-categories of transfer learning will be discussed.

  1. Instance-based:

    Despite differences between the source and target domains, instance-based transfer learning methods, such as TrAdaBoost [411] or Bi-weighting Domain Adaptation (BIW) [412], adjust the weights of a subset of source instances that are similar to the target domain in order to predict the target instances. Since the similarity of the selected source instances to those of the target domain plays a crucial role in instance-based transfer learning, a filter is used to remove dissimilar instances that would otherwise mislead the algorithm [405], [413], [414], [415], [416], [417].

  2. Feature-based

    Determining the common denominator between related tasks would allow for defining a representative feature that would apply to all domains and reduce differences between them. In this case, the common feature attempts to identify some partial overlap between the defined tasks. Having a representative feature among different tasks would also allow for a reduction in the overall error [405], [418], [419], [420]. While the source and target domains may have differences between them in their original data space, it is likely that the two would exhibit similarities in a transformed data space. Mapping-based deep transfer learning techniques, such as Transfer Component Analysis [421], create a union between the source and target domain instances by applying a mapping between the two and transforming them into a new data space based on their similarity so that they can be used for deep nets [417], [422].

  3. Network-Based

    Different models derived from related tasks can have many similarities and differences. Similar models often have knowledge about the model parameters or the behavior of hyperparameters shared between the individual models. In such cases, it is possible to create a learning algorithm that infers the model parameters and the distributions of its hyperparameters by examining the prior distributions of several other tasks [405], [423], [424]. Similar to the learning and inference process followed by the human brain, where trained brain cells can be reused ad hoc by other brain cells for related tasks, the network-based approach aims at using an already trained neural network as part of a much more extensive deep neural network. This approach trains the subnet on its relevant domain data, and the resulting pre-trained network is transferred to a larger deep net [423], [425], [426]. A few examples of network-based deep transfer learning approaches include ResNet, VGG, Inception, and LeNet, which can extract a versatile set of features in the network’s front layers [417].

  4. Relational Knowledge-Based

    There are several instances, such as the social network data, where the data are not independent and identically distributed (IID). Relational domains allow for the handling of this scenario. In a relational domain, each entry is represented by multiple relations, not just a single identifier [427]. Unlike other methods discussed before, the cross-domain relational knowledge transfer algorithms, such as TAMAR, use the Markov Logic Networks (MLNs) to transfer the relational knowledge without requiring each data point to be IID [405], [428], [429].

  5. Adversarial-Based

    Built on the strong foundation of the GANs, the adversarial-based approaches to transfer learning use a generator challenged by a discriminator to identify the transferable representations. A representation is considered transferable when it discriminates between the different components of the main learning task but does not discriminate the source domain from the target domain [417]. Most approaches use a single domain discriminator to align the source and target distributions, or use multiple discriminators to align subdomains [430], [431], [432], [433].

    There is no single unified way to use transfer learning. A very common approach applies when the target domain does not have sufficient training data: the model first pre-trains on the source data and then fine-tunes on the target data. Many well-known architectures in different communities are used in this way for related downstream tasks. In NLP, BERT [387], Word2vec [434], and ERNIE [435] are famous models whose shared knowledge backbone allows many downstream tasks to learn their specific task. Similarly, in the vision community, ResNet [436], Vision Transformers [437], and ConvNeXt [438] can be used, and in speech, Wav2Vec [439], DeepSpeech [440], and HuBERT [441] are among the famous models. It is important to note that there are different levels of fine-tuning. With enough data in the target domain, fine-tuning can alter the entire backbone representation; in many applications, however, it is done partially (the last few layers) or only for the task-specific heads, without changing the backbone representation (a minimal sketch of these fine-tuning levels follows this list). The mentioned models are expected to be general enough to be used for many downstream processes. For instance, ResNet is trained to classify images into 1000 different categories; if the model knows how to classify cars, that knowledge can be reused to detect airplanes or even for a different task like semantic segmentation [442].

    Other common ways of transfer are when the task is the same, but the domain changes. A helpful example might be applying the knowledge gained from the simulation data to real-world data [443]. In many applications like robotics, and computer vision, acquiring simulation data is very easy and straightforward.

    The goal of transfer learning is to adapt the knowledge learned in one domain to another, closely related one. Many recent papers [444], [445] suggest that using pre-trained models and fine-tuning may not be the optimal approach; it is therefore important to know why and what to transfer. Another issue with pre-training solutions is the accumulation of parameters across sub-tasks. These networks can have millions or even billions of parameters [445], so it is impractical to fine-tune the backbone representation for every downstream task. Consider an application where the model should use the representation for sentiment analysis along with entity recognition: fine-tuning a separate backbone for every downstream task would be very memory inefficient. Multitask learning [446] aims to learn a shared representation for multiple related tasks that generalizes across all of them. As opposed to creating an instance of the backbone for each task, the representation is shared across multiple tasks to improve efficiency [447].

    There are also instances where transfer learning occurs in the feature space; instead of transferring the representation to the new task, a related but fixed representation can be used. In this case, the main representation remains intact and a small network learns the representation specific to the target task. Having common latent features acts as a bridge for knowledge transfer. In [448] the authors trained a lightweight CNN module on top of a generic representation called mid-level representation. In comparison to training a complex CNN module which also learns the representations, they achieved superior performance in terms of accuracy, efficiency, and generalization with the method.
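
Returning to the levels of fine-tuning mentioned earlier, the sketch below loads an ImageNet-pretrained ResNet from torchvision, freezes the backbone, and replaces only the task-specific head; unfreezing further layers (or the whole network) corresponds to the deeper levels of fine-tuning. The target class count is an illustrative assumption, and the weights argument assumes torchvision 0.13 or newer.

import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (the source domain).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Level 1: freeze the shared representation entirely.
for p in backbone.parameters():
    p.requires_grad = False

# Replace the 1000-way ImageNet head with a task-specific head (e.g., 5 target classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head is trainable; to fine-tune partially, re-enable gradients
# for the last residual stage as well.
for p in backbone.layer4.parameters():
    p.requires_grad = True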

A. Applications

As mentioned, the use of transfer learning does not follow any conventional approach. Therefore, one should precisely study examples of how researchers can use transfer learning in their problems.

When it comes to medical applications, both privacy and expert labeling are key issues that make data availability difficult. In [449], [450], [451], and [452] the authors transfer the knowledge of pre-trained models, such as ResNet [453] or AlexNet [454] trained on ImageNet, to different tasks like brain tumor segmentation, 3D medical image analysis, and Alzheimer’s disease classification. Reference [455] found that, due to the mismatch between features learned from natural images (e.g., ImageNet) and medical images, such transfer is ineffective, and they propose an in-domain transfer approach to alleviate the issue.

Transfer learning has seen a great deal of success in the field of NLP, largely because large text corpora are easy to access. Pre-trained models naturally generalize across many domains because of the millions or billions of text samples they are trained on [46], [456], [457], [458], [459]. These representations can be transferred to different areas such as sentiment analysis [460], [461], [462], question answering [463], [464], [465], and cross-lingual knowledge transfer [460], [466], [467].

Owing to the nature of language, many speech recognition applications are similar to those in NLP; such applications are discussed in [468], [469], [470], and [471].

The progress of transfer learning in various domains has motivated researchers to adapt the explored approaches to time-series datasets [472]. For time-series tasks, transfer learning applications range from classification [473] and anomaly detection [474], [475] to forecasting [476], [477].

Transfer learning has also been applied to various fields, ranging from text classification [478], [479], [480], spam email and intrusion detection [481], [482], [483], [484], recommendation systems [485], [486], [487], [488], [489], [490], [491], [492], [493], [494], biology and gene expression modeling [495], [496], to image and video concept classification [497], [498], [499], [500], human activity recognition [501], [502], [503]. While these fields are vastly different, they all benefit from the core functionalities of transfer learning, in applying the knowledge gained under controlled settings or similar domains, to new areas that may otherwise lack this knowledge.

B. Challenges

Despite many successes in the area of transfer learning, some challenges still remain. This section discusses current challenges and possible improvements.

1) Negative Transfer

One of the earliest challenges discovered in transfer learning is negative transfer, a term describing cases where the transfer results in a reduction in performance. Possible reasons include interference with previous knowledge [504] or dissimilarity between the domains [444], [445]. There are also cases where the transfer does not degrade performance but fails to make full use of its potential to obtain a representative feature. It has been shown in [504] that contrastive pre-training on the same domain may be more effective than attempting to transfer knowledge from another domain. Similarly, the study in [505] explores which tasks gain from sharing knowledge and which suffer from negative transfer and should be learned in a separate model. In [506] the authors propose a formal definition of negative transfer, analyze three of its key aspects, and present a model for filtering out unrelated source data.

2) Measuring Knowledge Gain

Transfer learning enables remarkable gains in learning new tasks; however, it is difficult to quantify how much knowledge is actually transferred. A mechanism for quantifying transfer is essential for understanding the quality of transfer and its viability. In addition to the available evaluation metrics, we need to assess the generalizability and robustness of the models, especially in situations where the class sets differ between problems [507]. Attempts have been made in [506], [508], and [509] to formulate the problem so that transfer-learning-related gains can be quantified.

3) Scalability and Interpretability

Although many works demonstrate that tasks can be transferred effectively, there is no guideline on how and what should be transferred. It has been shown that transfer learning can be effective only when there is a direct relationship between source and target; however, there have been many instances where transfer learning has failed despite an assumed relatedness. Furthermore, as pre-trained models become more widespread, with millions or billions of parameters, it is not feasible to try all of the available methods to see which transfer could be helpful. Moreover, doing so requires a tremendous amount of computation, resulting in a large carbon footprint [510], [511]. It is critical that models be interpretable not only for their task but also in terms of their ability to be transferred to other tasks. The work in [512] defines interpretable features that can explain the relationship between the source and target domains in a transfer learning task.

4) Cross-Modal Transfer

In general, transfer learning is used when the source and target domains have the same modalities or input sizes. In many scenarios, however, this assumption presents a problem in adopting knowledge. The ability to transfer knowledge across different modalities is crucial, since many tasks in our daily lives require information from multiple sources (perception and text or speech). Recent studies such as [513] and ViLBERT [514] attempt to transfer knowledge between text and image data. Additionally, we should be able to transfer knowledge regardless of differences between the input sizes of the source and target domains; an example is transferring knowledge from 2D to 3D datasets [515], [516].

5) How to Build Transferable Models

The development of neural networks and deep learning models often requires significant architecture engineering, and these models are engineered to outperform existing models on a target dataset. As a result of this focus on performance, the models’ ability to generalize is usually degraded. We should be able to build models that enable transferability and reduce dataset bias. As shown in [517], deep features eventually transition from general to specific along the network, which makes feature transferability drop significantly in the higher layers. The works in [509], [517], [518], [519], [520], and [521] try to build models with a focus on transferability across domains.

SECTION XI.

Neural Radiance Fields

A. Definition and Applications

Several contributions in computer graphics have had a major impact on deep learning techniques for representing scenes and shapes with neural networks. A particular aim of the computer vision community is to render objects and scenes photo-realistically from novel views. This enables a wide range of applications, including cinemagraphs [522], [523], video enhancement [524], [525], virtual reality [526], and video stabilization [527], [528], to name a few.

The task involves collecting multiple images of a real-world scene from different viewpoints, with the objective of generating a photo-realistic image of the same scene from a novel view. Many advances have been made; one of the most common approaches is to predict a discrete 3D volume representation with a neural network [529] and then render novel views from this representation. These models typically pass the input images through a 3D CNN [530], which outputs an RGBA 3D volume [531], [532], [533]. Although such models are very effective for rendering, they do not scale well, since each scene requires a large amount of storage. In recent years a new approach to scene representation has emerged in which the neural network itself represents the scene. In this case the model takes the $X,Y,Z$ location as input and outputs the shape representation [534], [535], [536], [537]. The output of these models may be the distance to the surface [534], the occupancy [535], or a combination of color and distance [536], [537]. Because the shape itself is a neural network, it is difficult to optimize for different renderings; the key advantage, however, is that the shapes are compressed by the neural network, which makes the representation very memory-efficient. NeRF [538] combines these ideas into a single architecture: given the spatial location $X,Y,Z$ and viewing direction $\theta,\phi$, a simple fully connected network outputs the color $r,g,b$ and opacity $\sigma$ of the specified input location and direction.

At a very high level, a NeRF can be thought of as a function that maps a 3D location ($\underline {\mathbf {x}}$) and a ray direction ($\underline {\mathbf {d}}$) to a color ($r,g,b$) and a volume density ($\sigma$):\begin{equation*} \mathrm {F}(\underline {\mathbf {x}}, \underline {\mathbf {d}})=(\mathrm {r}, \mathrm {g}, \mathrm {b}, \boldsymbol {\sigma })\end{equation*}
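A minimal sketch of such a coordinate-based network is given below. It is an illustrative simplification rather than the exact architecture of [538]; the layer sizes and the omission of positional encoding are assumptions.

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Maps a 3D location and a viewing direction to (r, g, b, sigma)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)           # density depends on position only
        self.rgb_head = nn.Sequential(                   # color also depends on view direction
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.trunk(x)                                # x: (N, 3) sample positions
        sigma = torch.relu(self.sigma_head(h))           # non-negative volume density
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))   # d: (N, 3) unit view directions
        return rgb, sigma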

During the training stage, given a set of images from different views (with known camera poses), an MLP is trained to optimize its weights.

To generate a realistic photo, we hypothetically place the camera at a given position and point it in a specific direction.

Consider shooting a ray from the camera and sampling from the NeRF along it. There may be a lot of free space, but eventually the ray should hit the surface of an object. The accumulation along the ray should then represent the pixel’s color at that specific location and viewing direction. In other words, the pixel value in image space is a weighted combination of the sampled outputs:\begin{equation*} C \approx \sum _{i=1}^{N} T_{i} \alpha _{i} c_{i} \tag{38}\end{equation*} where $T_{i}$, which can be thought of as a weight, is the accumulated product of the values of all samples preceding sample $i$ along the ray:\begin{equation*} T_{i}=\prod _{j=1}^{i-1}\left ({1-\alpha _{j}}\right)\tag{39}\end{equation*} and $\alpha _{i}$ is:\begin{equation*} \alpha _{i}=1-e^{-\sigma _{i} \delta t_{i}} \tag{40}\end{equation*}
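A compact numerical sketch of this compositing rule, Eqs. (38)-(40), is shown below. The per-sample densities, colors, and step sizes are assumed to come from a network such as the one sketched above; the small epsilon added for numerical stability is an implementation assumption.

import torch

def composite_ray(sigma, rgb, delta_t):
    """Composite per-sample densities and colors along one ray (Eqs. 38-40).

    sigma:   (N,)   volume densities at the N samples
    rgb:     (N, 3) colors at the N samples
    delta_t: (N,)   distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigma * delta_t)              # Eq. (40)
    # T_i = prod_{j<i} (1 - alpha_j); shift the cumulative product by one step.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])         # Eq. (39)
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)        # Eq. (38): pixel color C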

In the end, the pixels are assembled to produce the image. The whole process, including the ray shooting, is fully differentiable and can be trained with the total squared error:\begin{equation*} \min _{\theta} \sum _{i}\left \|{\mathrm {render}_{i}\left ({F_{\theta} }\right)-I_{i}}\right \|^{2} \tag{41}\end{equation*} where $i$ indexes the rays and the loss minimizes the error between the value rendered from the network $F_{\theta}$ and the actual pixel value $I_{i}$.
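Assuming the two sketches above, a single gradient step on the photometric loss of Eq. (41) could look as follows; the batching of rays and the choice of optimizer are assumptions.

import torch

def training_step(model, optimizer, rays, target_pixels):
    """One optimization step over a batch of rays (Eq. 41).

    rays:          list of (positions, directions, delta_t) tuples, one per ray
    target_pixels: (R, 3) ground-truth colors for the R rays
    """
    rendered = []
    for positions, directions, delta_t in rays:
        rgb, sigma = model(positions, directions)              # query the NeRF MLP
        rendered.append(composite_ray(sigma.squeeze(-1), rgb, delta_t))
    rendered = torch.stack(rendered)
    loss = ((rendered - target_pixels) ** 2).sum()             # total squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()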

B. Challenges

In spite of the many improvements and the astounding quality of its renderings, the original NeRF paper left many aspects unaddressed.

One of the main assumptions in the original NeRF paper was that scenes are static. For many applications, including AR/VR and video game rendering, the objects in a scene are not static. The ability to render objects over time, in addition to novel views, is essential in many applications. Several works attempt to solve this problem by changing the original formulation to handle dynamic scenes and non-rigid objects [539], [540], [541], [542].

The other limitation is slow training and rendering. During training, the model needs to query every pixel of every image, which amounts to roughly 150 to 200 million queries for a one-megapixel image [538]; inference takes around 30 seconds per frame. To address the training cost, [543] proposes using depth data, which allows the network to train with fewer views, and other network properties and optimizations can also be changed to speed up training [544], [545]. Inference likewise needs to be real-time for many rendering applications, and many works address this in different ways: changing the scene representation to a voxel-based one [546], [547], using separate models for foreground and background [548], or other network improvements [541], [549], [550], [551].

A key requirement for real-world scenarios is that the representation generalize across many cases. In contrast, the original NeRF trains a separate MLP for every scene, so whenever a new scene is added the MLP must be retrained from scratch. Several works have explored generalizing and sharing the representation across multiple categories, or at least within a category [552], [553], [554], [555].

To widen the scope to other possible applications, we need control over the renderings in different scenarios. The original NeRF paper only examined control over the camera position and direction. Later works attempt to control, edit, and condition the rendering in terms of materials [556], [557], color [554], [558], object placement [559], [560], facial attributes [561], [562], or text-guided editing [563].

SECTION XII.

The Challenges of Representation Learning

Numerous challenges must be addressed while learning representations from data. This section briefly discusses the most prominent challenges faced in deep representation learning.

A. Interpretability

There is a fine distinction between the explainability and the interpretability of a system. An explanation can be defined as any piece of information that helps the user understand the model’s behavior and the process it follows to make a decision. Explanations can give insights into the role of each attribute in the overall performance of the system, or into rules that determine the expected outcome when a condition is met [564]. Interpretability, in contrast, refers to a human’s ability to predict what the model’s result would be, based on the decision flow that the model follows [565]. A highly interpretable ML model is an easily comprehensible one, but deep neural networks lack this property. Despite their promising performance in various applications, the inherent lack of transparency in the process by which a deep neural network produces an output remains a major challenge. This black-box nature may render such models unusable in applications where a high degree of safety [566], security [567], fairness and ethicality [568], or reliability [17], [281], [569], [570], [571] is critical.

Therefore, the design and implementation of problem-specific methods of interpretability and explainability is necessary [572]. Although conventional methods of learning from data, such as decision trees, linear models, or self-organizing maps [573], may provide visual explainability [15], deep neural networks require post-hoc methods of interpretation. From a trained model, the underlying representation of the input data may be extracted and presented in formats that are understandable to end-users. Examples of post-hoc approaches include sentences generated as explanations [15], visualizations [574], and explanation by examples [575]. Granted that post-hoc approaches provide another representation of the captured features, they do not directly reveal the exact causal connections and correlations at the level of the model parameters [15]. Nonetheless, they increase the reliability of deep models.

B. Scalability

Scalability is an essential and challenging aspect of many representation learning models, partly because maintaining quality while scaling up to real-world applications depends on several factors, including high-performance computing, optimized workload distribution, management of a large distributed infrastructure, and generalization of the algorithm [576], [577], [578]. Reference [579] classifies big-data machine learning approaches as distributed or non-distributed. In general, the scalability of representation learning models faces several significant technical challenges: 1) the availability of large amounts of data, 2) scaling the model size, 3) scaling the number of models and/or computing machines, and 4) computing resources that can support the computational demands [577], [578], [580].

Huge amounts of data can be accessed from a variety of sources, including internet clicks, user-generated content, business transactions, social media, sensor networks, etc. [581]. Despite the growing pervasiveness of big data, there are still challenges in obtaining high-quality training sets: data sharing agreements, violation of privacy [582], [583], noisy labels [584], [585], poor data quality (fitness for purpose) [586], data imbalance [587], and the lack of annotated datasets are among the obstacles businesses face when seeking raw data. Oversampling, undersampling, and dynamic sampling [588] for imbalanced data; surrogate losses, data cleaning, and distribution estimation for learning from noisy labels; and active learning [589] for the lack of annotated data are among the methods proposed to alleviate these problems.
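As a simple illustration of one of these mitigations, the sketch below performs naive random oversampling of a minority class. It is a generic example, not tied to any cited method; the array shapes and the assumption that the labeled minority class is the smaller one are assumptions.

import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Duplicate minority-class samples until the two groups are balanced.

    Assumes the class identified by minority_label has fewer samples
    than the rest of the data combined.
    """
    rng = np.random.default_rng() if rng is None else rng
    minority_idx = np.flatnonzero(y == minority_label)
    majority_count = int(np.sum(y != minority_label))
    extra = rng.choice(minority_idx,
                       size=majority_count - len(minority_idx),
                       replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]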

Model scalability is another concern, since tasks may exhibit very high dimensionality. To handle this requirement efficiently, different approaches have been proposed that cover the last two technical challenges mentioned above: using multiple machines in a cluster to increase computing power (scaling out) [590], or using more powerful graphics processing units. Another crucial challenge is managing a large distributed infrastructure that hosts several deep learning models trained on large amounts of data. Over the last decade, considerable research in high-performance computing has addressed open problems in infrastructure and hardware, parallelization methods, optimizations for data parallelism, scheduling and elasticity, and data management [576], [591], [592], [593], [594], [595]. While building large clusters of computing nodes may face problems such as communication bottlenecks, attempts to accelerate GPUs capable of energy-efficient DL execution also run into several major hurdles [595], [596]. Although we are able to train extremely large neural networks, they may be optimized for a single outcome, and several challenges still remain.

In addition, model pruning techniques [597] can help improve scalability by reducing model size and computational requirements. Pruning removes redundant or non-critical connections in neural networks to obtain a smaller, more efficient model that maintains accuracy, which helps address hardware constraints and improves inference speed. Some of the most important pruning techniques include: structured pruning [598], which removes entire structured sections such as layers or channels and produces more regular, hardware-friendly architectures; unstructured pruning [599], which removes individual weights, leaving the overall architecture unchanged but with sparser connections; and magnitude-based pruning [600], in which weights below a specified magnitude threshold are pruned, offering a balance between simplicity and efficacy.
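A minimal sketch of magnitude-based pruning is shown below. It is an illustrative example rather than the method of [600]; the threshold value, the restriction to linear layers, and the in-place masking strategy are assumptions.

import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_prune(model: nn.Module, threshold: float = 1e-2):
    """Zero out weights whose absolute value falls below the threshold."""
    total, pruned = 0, 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            mask = module.weight.abs() >= threshold
            module.weight.mul_(mask)            # keep large weights, zero the rest
            total += mask.numel()
            pruned += (~mask).sum().item()
    return pruned / max(total, 1)               # fraction of weights removed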

C. Security, Robustness, Adversarial Attacks

Machine learning is becoming more widely used, resulting in security and reliability concerns. AI workflows deployed in real-world applications may be vulnerable to adversarial attacks. AI models are developed under carefully controlled conditions for optimal performance, yet these conditions are rarely maintained in real-world scenarios. The deviations may be incidental or intentionally adversarial, and both can lead to wrong predictions. The ability to detect and withstand adversarial threats is referred to as adversarial robustness. A major challenge for robustness is the non-interpretability of many advanced models’ representations. In [601] the authors show that there is a positive connection between model interpretability and adversarial robustness. In some cases [602], [603], [604], researchers attempt to interpret the results, but they usually pick examples and show a correlation between the representations and semantic concepts; such a relationship may not exist in general [605], [606]. The discontinuity of the learned representation was first demonstrated in [607], where deep neural networks were shown to misclassify inputs to which imperceptible, non-random noise had been added. For a more detailed discussion of different types of attacks, readers can refer to [608] and [609]. It is worth mentioning that research on adversarial perturbations and attack techniques is primarily carried out in image classification [610], [611], [612], but the same behavior is also observed in NLP [613], [614], speech recognition [615], [616], [617], and time-series analysis [618], [619]. For systems based on biometric verification [620], [621], [622], an adversarial attack could compromise security; the use of biometrics to establish a person’s identity has become increasingly common in legal and administrative tasks [623]. In this context, the goal of representation learning is to find a (non-linear) feature mapping $\mathrm{f}: \mathcal {X} \rightarrow \mathcal {Z}$ from the input space $\mathcal {X}$ to the feature space $\mathcal {Z}$ such that $f$ retains relevant information about the target task $\mathcal {Y}$ while hiding sensitive attributes [624]. Despite all the proposed defenses, deep learning algorithms remain vulnerable to security attacks, as proposed defenses are only able to defend against the attacks they were designed for [625]. In addition to the lack of universally robust algorithms, there is no unified metric by which to evaluate the robustness and resilience of these algorithms.
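To illustrate the kind of imperceptible perturbation described above, the following sketch generates a gradient-sign (FGSM-style) adversarial example. This is a generic illustration rather than the attack of any specific cited work; the perturbation budget and the assumption that inputs are normalized to [0, 1] are illustrative choices.

import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, label, epsilon=0.01):
    """Return an input perturbed along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Small step in the direction that increases the loss the most.
    x_adv = x + epsilon * x.grad.sign()
    # Assumes inputs are normalized to the [0, 1] range.
    return x_adv.clamp(0.0, 1.0).detach()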

SECTION XIII.

Conclusion

In this survey, we have explored the importance of deep representation learning in achieving competitive performance in state-of-the-art architectures. Since many of these architectures rely on variants of neural networks, the methods of representing data can be considered the building blocks of the proposed techniques, and understanding the major approaches to learning representations is therefore essential. Our objective was to present each topic in a concise manner while providing detailed references and real-world applications to facilitate a deeper understanding for interested readers. Deep representation learning continues to be an active area of research and holds great potential for impacting a wide range of applications. The field is dynamic and constantly evolving; as new advancements are made, further research may uncover more efficient and effective methods for learning representations from data.

ACKNOWLEDGMENT

(Amirreza Payandeh and Kourosh T. Baghaei are co-first authors.)
