A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions

Multimodal machine learning (MML) is a compelling multidisciplinary research area in which heterogeneous data from multiple modalities and machine learning (ML) are combined to solve critical problems. Research typically uses data from a single modality, such as images, audio, text, or signals. However, real-world problems have grown more complex, and handling them with multiple modalities of data instead of a single one can significantly improve the solutions found. ML algorithms play an essential role in tuning the parameters of MML models. This paper reviews recent advances on the core challenges of MML, namely representation, translation, alignment, fusion, and co-learning, and presents the associated gaps and open problems. A systematic literature review (SLR) was applied to chart the progress and trends on those challenges in the MML domain. In total, 1032 articles were examined in this review to extract features such as source, domain, application, and modality. This article will help researchers understand the current state of MML and navigate the selection of future research directions.


I. INTRODUCTION
Artificial intelligence (AI) has progressed rapidly in the last few decades. It impacts human livelihood, health care, and science and technology. To keep pace with this progress, AI techniques need to improve so that more critical real-world problems can be tackled. ML is an application of AI that gives a system the capability to learn and improve from its experience automatically. ML is currently a highly active field applied to a wide range of problems. It is rich in algorithms and uses them to build models that can process different kinds of data. Data are ubiquitous and hold information such as official reports, medical or financial records, etc. The importance of data is increasing with the progress of AI. Data carry information in several forms, such as numbers, text, signals, etc., and can come from a distinct range of modalities, where a modality refers to the way in which something is experienced or happens [1]. Visual, auditory, haptic, and physiological signals are examples of modalities. Data can be defined as multimodal when multiple modalities are involved together [1]. Speech recognition is a multimodal example in which audio and visual data are combined to recognise what a person is saying [2]. The blend of multimodal data and ML frames the notion of MML, which focuses on building models to process data from multiple modalities. Data from various modalities are always heterogeneous, and heterogeneous data are often ambiguous; learning from such multimodal data provides an opportunity to understand the relationships between modalities [3]. MML involves five core technical challenges: representation, translation, alignment, fusion, and co-learning [1]. Figure 1 depicts the classification diagram of MML. (The associate editor coordinating the review of this manuscript and approving it for publication was Tomasz Trzcinski.)
Representation is the first challenge; it concerns presenting data using information from the modalities. Representing multiple modalities is crucial because data come from heterogeneous sources, contain noise, and may have missing information [4]. It has two types: joint representation, which merges unimodal data into the same representation, and coordinated representation, in which each modality is coordinated with the others through a constraint [1]. The second challenge is translation, which means translating or mapping an entity of one modality to a different modality [5]. Multimodal translation approaches are usually modality-specific, although they share several unifying factors. Translation is categorised into two types: example-based and generative [1]. Example-based models translate modalities using a dictionary, whereas generative models build a model that can perform the translation. Alignment is the third challenge; it tries to identify relationships and correspondences between subelements of two or more separate modalities. Multimodal alignment has two types: explicit, which directly aligns subcomponents of modalities, and implicit, in which alignment is an intermediate step for another task [1]. The fourth challenge is fusion, which has a broad range of applications. Fusion joins information from multiple modalities for prediction [6]. Multimodal fusion is classified into two categories: model-agnostic and model-based [1]. In model-agnostic approaches, fusion is performed independently of the ML method; in model-based approaches, fusion takes place during the construction of the ML model. The final challenge is co-learning, where one model transfers knowledge to another [7]. Multimodal co-learning is categorised into parallel, non-parallel, and hybrid [1]. In parallel co-learning, the modalities share a set of instances; in non-parallel co-learning, concepts or categories are shared instead of instances; and in hybrid co-learning, two non-parallel modalities are linked by a shared modality.
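As a compact reference, the taxonomy described above (and depicted in Figure 1) can be written down as a small lookup structure. The sketch below is purely illustrative; the names mirror the text rather than any standard library:

```python
# Taxonomy of the five core MML challenges and their subtypes,
# following the classification of [1] summarised above.
MML_TAXONOMY = {
    "representation": ["joint", "coordinated"],
    "translation": ["example-based", "generative"],
    "alignment": ["explicit", "implicit"],
    "fusion": ["model-agnostic", "model-based"],
    "co-learning": ["parallel", "non-parallel", "hybrid"],
}

def subtypes(challenge):
    """Return the subtypes of a given MML challenge (case-insensitive)."""
    return MML_TAXONOMY[challenge.lower()]
```

Each of the later subsections of this survey clusters articles under one of these top-level keys.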
This SLR explores the recent adoption of these five core challenges to seek answers to research questions.
The remainder of this survey is organised as follows. The motivation and contribution of this study are presented in Section II. Section III summarises reviews of MML and its challenges. Details of the employed research methodology are described in Section IV. Section V presents the results with figures. Section VI provides a discussion of the performed analysis. Finally, Section VII concludes this article with a summary. The structure of this survey article, with all sections and subsections, is presented in Figure 2.

II. MOTIVATION AND CONTRIBUTION
There are plentiful surveys related to MML and its challenges, but most are limited in the modalities and challenges they cover. For example, a significant number of studies covered only specific domains. Studies that cover not only all the challenges but also the different modalities are lacking, and this constraint motivated the present study. In the related-studies section, several publications are listed and contrasted with this study. The primary focus is on surveying the literature on MML and its modalities and deriving insights to guide future directions. To summarise, the contributions of this article are the following:
• identifies 374 articles on MML and its challenges;
• compares this study with 39 related studies;
• describes the definition of MML;
• outlines research challenges and gaps;
• depicts the results to illustrate the current research trend;
• points out the domains and applications used in MML;
• highlights the available modalities and their combinations;
• clearly shows the algorithms used to build MML models.

III. RELATED STUDIES
This section presents existing survey or review works on MML and its challenges; we highlight the differences and compare them with our study. MML is an advanced research area and is growing very fast. A research team from Carnegie Mellon University conducted an excellent survey on MML [1], in which they classified all the challenges of MML into several sections. That survey focuses on audio, video, and text modalities. Another article [8] reviewed applied methods and applications in multimodal deep learning, where the authors concentrated on a few common deep learning (DL) methods and applications. Apart from these articles, we discovered a few surveys that discussed how MML can solve problems related to different modalities. Articles [9], [10], and [11] surveyed meme classification, sentiment analysis, and content understanding using MML, respectively. Visual and language analysis is an attractive research area, and with MML, new possibilities are growing rapidly. Survey papers [12], [13], and [14] show advances and trends in computer vision, language, and image analysis using MML techniques.
Researchers focus not only on MML as a whole but also on each core challenge to improve research quality. Every challenge has a specific role in MML. We encountered several surveys on each challenge, most of which focus on applying the challenge's techniques to handle data from different modalities. The first challenge of MML is representation, which means representing and summarising data to point out the complementarity and synchrony within modalities [1]. Two recent surveys, [4] and [15], presented an overview of representation learning together with proposed approaches. Article [16] reviewed the development of representation learning in unsupervised and deep learning. In [17], the authors introduced two categories of multi-view representation learning and investigated various essential applications. Article [18] surveyed representation techniques for human affect recognition.
Translation is the second challenge of MML, which means translating or mapping data from one modality to another. Translation approaches are broad and often specific to the modality [1]. Until now, no survey article has entirely focused on multimodal translation techniques and advances. A recent survey [5] presented different aspects of multimodal translation on visual and speech datasets.
The third challenge is alignment, which identifies direct relationships between elements of two or more separate modalities [1]. As with translation, there is a lack of survey articles focused on multimodal alignment approaches. The paper [19] introduced a kernel method for manifold alignment (KEMA) that can match an arbitrary number of data sources without requiring corresponding pairs. In [17], the authors discussed multi-view representation alignment and its applications as one category of multi-view representation.
Fusion is the fourth and most studied challenge in MML. Joining information from two or more modalities is the primary purpose of multimodal fusion [1]. References [20], [21], and [22] are three recent surveys of multimodal fusion, and they classify approaches across different modalities. Articles [6], [23], [24], and [25] present an overview of fusion methods along with challenges and prospects. Beyond the fusion processes themselves, there are surveys on the use of fusion in several domains. Health care is a prominent area in which fusion can create a significant impact. In [26], the authors conducted a comprehensive survey on the fusion of medical signals to facilitate intelligent healthcare systems. Fusion is widely used for combining images, especially medical images, as in [27], [28], [29], [30], and [31], which survey different techniques for medical image fusion. Activity detection and monitoring systems receive data from multiple modalities, and fusion techniques help to merge them. In [32], [33], [34], and [35], the authors surveyed activity recognition leveraging fusion techniques. References [36] and [37] surveyed biometric systems in which fusion techniques were applied to merge multiple data sources. Fusion is also commonly used for fusing audio-visual information; references [38], [39], and [40] are good examples of surveys on audio-visual information fusion.
Co-learning is the last challenge of MML and transfers knowledge between models to boost prediction [1]. Co-learning is less familiar than fusion and has received comparatively little research attention, although its importance has increased in recent years. References [7] and [41] are two recent surveys on co-learning that discuss several co-learning approaches along with challenges, applications, recent advances, and directions.
Except for [1], the related works discussed above mainly focus on one specific challenge and its applications. Instead of focusing on a particular challenge, this survey includes all of them and discusses their current advances, gaps, and challenges. Whereas article [1] focused on three modalities, this paper considers all the modalities found in the search range. Article [8] discussed the current use of ML methods and applications in MML but limited its review to typical ML methods and applications. On the contrary, this study presents all ML algorithms, domains, and applications available in the search range. To compare the included related surveys with this work, an analysis of the associated surveys is presented in Table 1. The table clearly distinguishes the modalities and MML challenges covered by each study. Figure 3 shows the relationship between publication year and the number of citations of the included survey papers. From the figure, it is visible that researchers' interest in MML and its challenges has grown since 2018. The citation counts of these recent papers are still lower than those of earlier years, but they will increase over time.

IV. RESEARCH METHODOLOGY
A systematic literature review (SLR) is conducted according to the guidelines of article [42] to accomplish the objective of this research. An SLR, also known as a systematic review, aims to determine, review, and interpret every available study related to a specific research area or question. Three steps are involved in an SLR: planning, conducting, and reporting the review. Figure 4 depicts all steps with their substeps. This section discusses all the steps in detail.

A. PLANNING THE REVIEW
The planning phase is the first stage and covers the set of tasks for designing and formulating the review protocol. It includes identifying the importance of an SLR in the specific area, defining the research questions the SLR will address, and generating a review protocol stating the review procedures.

1) NECESSITY OF SLR
It is necessary to verify the importance of such a review before initiating it. Researchers have recently focused on using MML techniques to solve multimodality problems. However, there is a lack of articles discussing the techniques behind the challenges, and the procedures conducted do not always provide a solution. It is also necessary to focus on the use of modalities, since understanding MML lies in the relationship between its challenges and the modalities used. Therefore, an SLR is essential to depict the importance and understanding of MML.

2) RESEARCH QUESTIONS (RQs)
In an SLR, specifying the RQs is the most crucial part. Analysing prior works on the challenges and understanding MML is the main objective of this SLR. This includes the explanation of MML, approaches to the challenges, the modalities considered, the ML models applied, and gaps for future research. The following four questions, including one subquestion, are proposed to facilitate this research.
RQ1: What is the definition of multimodal machine learning?
RQ2: What are the challenges adopted when framing multimodal machine learning?
RQ2.1: What are the feasible gaps in the challenges?
RQ3: What are the modalities considered in multimodal machine learning?
RQ4: Which machine learning models were applied?

3) DEVELOPING SLR PROTOCOL
The SLR protocol determines the methods used to begin a specific review in a particular area. In this review study, a protocol was constructed to achieve the objective. First, we searched for primary studies in prominent bibliographic databases. Second, we defined the selection criteria. Data extraction and study quality assessment took place in the third and fourth steps, respectively. The final step involved data analysis.

B. CONDUCTING THE SLR
Conducting the SLR is vital and starts once the researchers have agreed upon the protocol [42]. All the steps specified in the protocol are executed in this phase to achieve the research goal. It is divided into five parts, discussed below.

1) IDENTIFICATION OF RESEARCH
To answer the research questions, we used a set of keywords to search for primary studies in renowned online databases. Four major online databases were explored to keep the search results unbiased. Their rich libraries of journals and conferences are the main reason for choosing these four databases, which are listed in Table 2.
The next step was to define the procedure for finding the scientific and technical articles these databases provide. The process is divided into two parts: determining search keywords that cover the area of MML and its five challenges, and forming queries by combining keywords with the Boolean operators AND and OR. All the queries used are presented in Table 3.
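For illustration, the keyword-and-Boolean-operator scheme can be sketched as a small helper that assembles a query string; the keyword groups shown are hypothetical placeholders, not the exact entries of Table 3:

```python
def build_query(keyword_groups):
    """Combine keyword groups into a Boolean search query.

    Keywords inside a group are joined with OR; the groups
    themselves are joined with AND, as in the queries of Table 3.
    """
    clauses = []
    for group in keyword_groups:
        clause = " OR ".join(f'"{kw}"' for kw in group)
        clauses.append(f"({clause})")
    return " AND ".join(clauses)

# Hypothetical example query:
query = build_query([["multimodal machine learning"], ["fusion", "alignment"]])
```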
The search started on 10 January 2022, and articles published from 2009 until 31 December 2022 were considered. The search covered the title, abstract, keywords, and introduction of each article. We also added papers found in the references of the primary studies during the search.

2) STUDY SELECTION CRITERIA
This study used a group of inclusion criteria (IC) and exclusion criteria (EC) to refine the search results, discarding studies that met any exclusion criterion. All IC and EC used to determine relevant articles are listed in Table 4.
This study followed the steps of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) from [43] to identify research articles. The stages of PRISMA include identification, screening, eligibility, and inclusion, and Figure 5 illustrates them using a flow diagram. Initially, by applying the keywords, a total of 1009 articles were collected from bibliographic databases and 23 articles from other sources. After eliminating duplicates, 800 articles remained. Of these 800 articles, 150 were excluded due to unavailability, leaving 650 to be screened. Of these, 57 were excluded because they were not published in any journal or conference, leaving 593 articles for the eligibility stage. Then, 48 survey- or review-related articles were excluded, leaving 545 scientific or technical papers. From the 545 articles, 338 met the requirements of MML and its challenges, while 207 were excluded. After that, 319 articles were thoroughly scanned, and 55 papers were added from the recursive reference search. Finally, 374 articles were considered for this review.

3) STUDY QUALITY ASSESSMENT
This section evaluates each selected article against the established questions. All the questions were prepared following the guidelines of articles [42] and [44] and are presented in Table 5. Each question was answered with Yes, Partly, or No, where Yes = 1, Partly = 0.5, and No = 0. The total score therefore lies between 0 (very poor) and 6 (excellent), according to the number of questions in Table 5. To be selected, an article's score must be four or above. Table 6 provides an example of the assessment process using five articles.

4) DATA EXTRACTION
This step introduces the extracted features, which come from different perspectives: metadata, task, representation, translation, alignment, fusion, and co-learning. The features extracted from the metadata involve an article's underlying information. From the task perspective, the extracted features give an idea of the applied ML algorithms, the selected datasets, and their modalities. The last five perspectives correspond to the challenges of MML and features related to their types. Table 7 lists the perspectives and the corresponding features with definitions.

5) DATA ANALYSIS
A comprehensive analysis was performed by the authors on the data extracted from the multiple perspectives shown in Table 7. The source feature of the metadata indicates the venues of the published articles. Subsequently, articles were clustered by domain and application. In the task perspective, papers were grouped by algorithm, data, and modality. The algorithm feature shows the distinct ML algorithms used to tackle the challenges of MML, and the modality feature gives a clear view of the sources from which the data originate. Additionally, the collected articles were clustered into the individual elements of the last five perspectives. All this clustering and investigation was performed to understand and summarise current trends in MML.

V. RESULTS

A. METADATA
This section presents all the metadata extracted from the collected papers. According to the inclusion and exclusion criteria, all selected articles were related to MML and its challenges. Of the 374 selected papers, 268 were published in journals and 106 in conference proceedings. A total of 28 publication venues were identified from the selected documents. Figure 6 shows the top five venues with the total numbers of journal and conference papers. Most of the relevant articles were published in IEEE Xplore; Elsevier is the next most popular venue for similar journals. The number of articles from different countries was also analysed. In total, 51 countries were encountered; among them, China, the USA, Germany, and the UK dominate. Figure 7 presents the top ten countries with the most articles published in journals and conferences.
The total number of articles and citations per year from 2009 to 2022 is displayed in Figure 8. The figure shows that research interest in MML grows every year, with rapid growth evident from 2018. A yearly increase in citations is also apparent.

E. ALIGNMENT
Multimodal alignment learning is split into two kinds: explicit and implicit. Explicit alignment is subdivided into supervised and unsupervised approaches, while implicit alignment is subdivided into graphical models and neural networks. The flow of alignment learning is presented in the Sankey diagram in Figure 11. Eighteen alignment-learning-related articles were found among the selected articles. Of these 18 articles, 11 were related to explicit alignment and 7 to implicit alignment. In explicit alignment learning, six articles were supervised [19], [52], [96], [157], [379], [381] and five were unsupervised [53], [140], [259], [395], [406]. In implicit learning, there are three articles on graphical models [54], [147], [407] and four on neural networks [130], [158], [240], [332].
Figure 13 depicts the number of articles related to the five MML challenges between 2009 and 2022. The figure shows that the number of publications increases each year.

VI. DISCUSSION
The primary keyword search shows that research interest in MML is growing. Although one or two challenges of MML were known to researchers more than ten years ago, MML came into the spotlight as a core research topic after 2016. All the research questions defined in Section IV are discussed in this section, along with the resulting insights. Figure 14 shows the various contributing parts of this review study.

A. INSPECTION OF RQ1
The first research question was formulated to investigate the definition of MML. MML stands for building models that can process and describe data from multiple modalities. To define MML precisely, it is necessary to understand the relationship between multimodality and ML. The term multimodal (or multimodality) is used variously in the academic literature. According to [1], multimodality means involving multiple modalities such as image, text, audio, etc. Based on this definition, MML handles data from multiple modalities using ML. A recent article [411] defines MML as a machine learning system that receives and processes data from multiple modalities. It also characterises multimodality in three ways: human-, machine-, and task-centred. Human-centred refers to the way data are communicated by humans. Machine-centred means data are encoded by the ML system before being processed. Task-centred relates to the job an ML system needs to perform; based on that, data inputs and outputs are represented differently. The human- and machine-centred definitions aim to capture multimodality in a task-agnostic manner, whereas the task-centred view attempts to discover the contribution of each input in relation to the task. Therefore, article [411] argued that the purpose of MML is not only to build ML models that process multiple modalities but also to focus on the relation between the modalities and the given task. Based on this discussion, two statements can be established for MML: first, in MML, ML models are built to process data; second, those data come from multiple modalities and are related to tasks.

B. INSPECTION OF RQ2
RQ2 is designed to inspect the use of the five challenges of MML in research. Each challenge solves specific tasks; for example, fusion techniques fuse information from two or more modalities, and translation maps one modality to another to translate information. The first challenge of MML is representation, which is crucial because of the difficulty of representing heterogeneous data in a meaningful way. A good representation of the data is necessary to support the performance of ML models, as is visible in recent articles such as [49] and [50]. Paper [1] classified representation learning, and the extracted articles are grouped accordingly in Figure 12. The figure shows that neural networks (NNs), a subtype of joint representation, receive more consideration than the others for representing data. The use of NNs to represent visual, audio, and text data is increasing [177], [183], [184], [213]. Graphical models frame representations using latent random variables, as in probabilistic graphical models such as deep Boltzmann machines (DBMs) [113]. Sequential representation uses sequential models whose hidden states represent the data, such as recurrent neural networks (RNNs) [283], [347], [354]. Of the three subtypes of joint representation, NNs are used primarily because of their superior performance, but they cannot handle missing data; graphical models, in contrast, can handle missing data and even a whole missing modality, while sequential models are used to represent sequences of data. Similarity and structure are the two subtypes of coordinated representation. Similarity models work on the distance between two modalities [169], [330]. Structured models are typically used in cross-modal hashing [64], for example canonical correlation analysis (CCA) [197], [209].
In comparison, joint representation is best suited when all modalities are present, whereas coordinated representation suits applications where only one modality is present at test time. Joint representation can handle more than two modalities, but coordinated representation is mostly limited to two.
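The contrast between joint and coordinated representation can be made concrete with a toy sketch. Here, random projections stand in for learned encoders, and the feature dimensions (512 for images, 300 for text, 128 for the shared space) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 512))  # hypothetical image features (4 samples)
txt = rng.normal(size=(4, 300))  # hypothetical text features (4 samples)

# Joint representation: merge both modalities into one representation space.
W_joint = rng.normal(size=(512 + 300, 128))
joint = np.concatenate([img, txt], axis=1) @ W_joint  # shape (4, 128)

# Coordinated representation: project each modality separately, then
# coordinate the two spaces through a constraint (here, cosine similarity).
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(300, 128))
zi, zt = img @ W_img, txt @ W_txt
cos = np.sum(zi * zt, axis=1) / (
    np.linalg.norm(zi, axis=1) * np.linalg.norm(zt, axis=1)
)
```

In a real model, the projections would be trained so that the joint space is predictive of the task, or so that the coordinated similarity is high for matching image-text pairs.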
The extracted results on multimodal translation learning and its types, following article [1], are also displayed in Figure 12. Interest in research on this topic is currently growing. Retrieval is a simple form of example-based multimodal translation in which the nearest sample in a dictionary is retrieved to produce the translated result [335], [387]. CNNs and kernel canonical correlation analysis (KCCA) are popular for retrieval-based models, as in [156], [182]. Combination-based models instead combine pieces from the dictionary meaningfully to generate better translations [170], [198]. Grammar-based models produce translations in a specific domain by applying grammatical constraints, as in [364]. Encoder-decoder models first encode the source modality into a latent representation and then decode it to produce the target modality, with CNNs used in most cases [331], [333], [334], [365]. Continuous generation models produce output at each timestamp, as in sequence-to-sequence translation [239], [288]; RNNs with long short-term memory (LSTM) are usually used for continuous generation, as in [239], [288], and [338]. Example-based approaches are simple, but they make the system heavy because the model itself acts as the dictionary, and they sometimes generate unrealistic translations. Generative models are difficult to construct since they need the capability to produce sequences of symbols; this has led many researchers to choose example-based models instead. Figure 12 also depicts the extracted results on multimodal alignment learning. Explicit and implicit are the two kinds of alignment, each split into two types [1]. Unsupervised explicit alignment aligns modalities without any direct labels, as in [53], [157], [259], [395], and [406], where CCA and dynamic time warping (DTW) with CCA are usually applied.
Supervised explicit alignment methods, on the other hand, rely on labelled instances, for example [19], [52], [96], [157], [379], and [381]. Graphical implicit models align modalities by mapping between them, such as aligning images and text [54], [407] or images and signals [147]. Neural-network implicit models align modalities using encoder-decoder [130], [158], [240] or cross-modal retrieval [332] techniques. In implicit alignment, latent information is used as an intermediate step for another task, which provides better performance, whereas explicit alignment focuses directly on the subcomponents of the modalities.
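Since DTW appears repeatedly in the alignment literature cited above, a minimal pure-Python version may help fix ideas; this is the textbook dynamic-programming formulation for 1-D sequences, not the code of any cited paper:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Fills the classic (n+1) x (m+1) cost table, where each cell adds the
    local cost to the cheapest of the three predecessor alignments.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0, because warping lets the repeated 2 align at no cost; a plain element-wise distance could not do this for sequences of different lengths.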
Fusion is the most common challenge in MML. Extracted information regarding the different types of fusion according to [1] is presented in Figure 12. Model-agnostic fusion has three kinds: early, late, and hybrid fusion, with early and late used most often. Early fusion unites features directly once they are extracted, as in [56], [151], [263], and [305]. Late fusion, on the other hand, is performed after decisions are made, using schemes such as voting [55], [114] and weighting [272]; it is also known as decision fusion. Hybrid fusion integrates the outputs of early fusion and of individual unimodal predictors, for example [74], [117], [257], and [323]. Multiple kernel learning, graphical models, and neural networks are subtypes of model-based fusion. Fusion using the kernel approach means finding similarities between data points; for instance, kernel support vector machines (SVMs) are used to find views in the data [45], [57], [106], [186]. Graphical models utilise the local and temporal structure of the modalities, such as DBNs [61] and dynamic Bayesian networks [62]. Neural network models are widely used in multimodal fusion [412]; in NNs, latent information from the layers is fused to achieve better performance, as in [97], [162], [228], [271], and [289]. There are few direct comparisons between model-agnostic and model-based approaches; the appropriate approach depends on the modalities and the task.
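The model-agnostic split discussed above can be sketched in a few lines; the functions below are a minimal illustration of early (feature-level) and late (decision-level) fusion, not the pipeline of any cited article:

```python
import numpy as np

def early_fusion(features):
    """Early fusion: concatenate per-modality feature vectors directly."""
    return np.concatenate(features, axis=-1)

def late_fusion(predictions, weights=None):
    """Late (decision) fusion: weighted average of per-modality predictions.

    With no weights given, this is plain averaging (a simple voting scheme);
    unequal weights give a weighting scheme.
    """
    preds = np.stack(predictions)
    if weights is None:
        weights = np.full(len(predictions), 1.0 / len(predictions))
    return np.tensordot(np.asarray(weights), preds, axes=1)
```

Early fusion of a 3-D audio vector and a 2-D text vector yields a single 5-D vector for one downstream model, whereas late fusion combines the class scores of two unimodal classifiers instead.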
Co-learning is the final challenge of MML, and papers on co-learning are extracted according to article [1] and presented in Figure 12. Parallel-data co-learning is divided into co-training and transfer learning, where co-training generates more labelled data, as in [207], [213], and [368], and transfer learning transmits information from one model to another to improve performance [369]. In non-parallel-data co-learning, models share concepts rather than instances. It is divided into transfer learning, conceptual grounding, and zero-shot learning. As with parallel learning, transfer learning is also possible in the non-parallel setting [189], [279], [280]. Conceptual grounding shares semantic concepts across modalities, most of them related to linguistics, such as [216], [324], and [327]. Zero-shot learning classifies data without having seen any labelled examples of the target classes; approaches such as cross-modal models [189], [221] and autoencoders [391], [392] are used for this. Bridging is hybrid co-learning, in which two non-parallel modalities share information through a third; for example, articles [193] and [194] bridged non-parallel modalities using neural networks.
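As one concrete co-learning flavour, zero-shot classification through a shared embedding space can be sketched as nearest-label lookup; the two-dimensional label embeddings below are toy values chosen for illustration:

```python
import numpy as np

def zero_shot_classify(x, label_embeddings):
    """Assign x to the label whose shared-space embedding is most similar.

    `label_embeddings` maps label names to vectors in the same space as x
    (e.g. text embeddings of class names); cosine similarity decides.
    """
    labels = list(label_embeddings)
    M = np.stack([label_embeddings[lbl] for lbl in labels])
    sims = (M @ x) / (np.linalg.norm(M, axis=1) * np.linalg.norm(x) + 1e-12)
    return labels[int(np.argmax(sims))]
```

An unseen input is classified by comparing it with the embeddings of class names it was never trained on, which is the essence of the cross-modal zero-shot approaches cited above.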

1) INSPECTION OF RQ2.1
This research question establishes gaps and future directions within the challenges of MML. From the above discussion, it is clear that handling all the challenges simultaneously is not easy. Adopting all five challenges to solve one problem is also not necessary; addressing one or two challenges, depending on the primary task, is sufficient to frame MML.
Representation and fusion are studied more than the other challenges in MML, and the relationship between them is very close. When fusion is performed on any data, the fused data changes its form, which is ultimately a new representation of the source data. Model-based fusion is similar to joint representation; for example, NNs fuse two or more modalities and produce a unique representation of the data as output. However, representation does not always depend on fusion: data can also take on new forms during preprocessing, feature transformation, and so on. Apart from its relationship with fusion, representation also relates to translation and alignment, since data that is translated or aligned always acquires a new representation. Of the two types of representation, the use of joint representation techniques is likely to grow further because it makes it easier for researchers to apply multiple modalities.
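The close link between model-based fusion and joint representation can be made concrete with a tiny sketch: two modality feature vectors are concatenated and linearly projected into one shared vector, which belongs to neither input modality. The projection weights and inputs here are hypothetical, chosen only to illustrate the shapes involved; a real system would learn the projection (e.g., as an NN layer).

```python
# Toy joint-representation sketch: concatenate two modalities and
# project into a shared space (hypothetical weights, for illustration).

def matvec(matrix, vector):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def joint_representation(feats_a, feats_b, projection):
    """Fuse modalities into a single new representation: the output
    is a fresh vector, i.e., a new form of the source data."""
    return matvec(projection, feats_a + feats_b)

text_feats = [1.0, 0.0]            # hypothetical 2-D text descriptor
image_feats = [0.0, 1.0, 1.0]      # hypothetical 3-D image descriptor
# project the 5-D concatenation down to a 2-D joint embedding
W = [[0.5, 0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5, 0.0]]
z = joint_representation(text_feats, image_feats, W)   # 2-D joint vector
```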
Fusion is applied in MML research more than the other challenges, yet there are gaps in understanding its types and subtypes. Based on the definition of MML discussed in RQ1, the data must be processed by the ML system; however, in model-agnostic early fusion the data is fused before any ML method is applied [1]. Therefore, following [1], it would be inaccurate for an article to claim multimodal fusion on the basis of early model-agnostic fusion alone. Conversely, if a paper argues that data is fused using a late or hybrid model-agnostic approach, it is possible to consider this multimodal fusion, because models are typically involved during late and hybrid fusion. So, to claim multimodal fusion, authors must show that fusion was performed using models and avoid relying only on early fusion. The naming of fusion approaches is also confusing for the model-agnostic process: early fusion is also known as knowledge fusion and feature fusion, although some articles used NNs for fusion and still called it feature fusion. A great deal of work has been done using fusion techniques, and more is coming; combining fusion techniques with the other challenges to facilitate research can be considered a future research direction.
Multimodal alignment finds cross-modal connections and interactions between the elements of multiple modalities. It helps link separate modalities in tasks such as video caption generation and image-text classification. Alignment is difficult because datasets rarely provide explicit alignment annotations and because suitable similarity metrics are hard to design. NN-based models such as autoencoders are popular for aligning visual data. Researchers can primarily focus on three aspects to solve multimodal alignment problems: identifying the connections between modalities, representing modalities concept-wise, and tackling the ambiguity of high-dimensional data.
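A common building block for the similarity metrics mentioned above is cosine similarity between cross-modal embeddings: each text embedding is matched to its most similar image embedding. The sketch below uses hand-picked 2-D vectors as a stand-in for learned embeddings; it shows only the greedy matching step, not how the embeddings themselves are learned.

```python
# Greedy cross-modal alignment via cosine similarity
# (hand-picked toy embeddings stand in for learned ones).
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align(text_embs, image_embs):
    """For each text embedding, return the index of the most
    similar image embedding."""
    return [max(range(len(image_embs)),
                key=lambda j: cosine(t, image_embs[j]))
            for t in text_embs]

texts = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical text embeddings
images = [[0.0, 1.0], [1.0, 0.0]]   # hypothetical image embeddings
matches = align(texts, images)       # text i -> image matches[i]
```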
Apart from representation, alignment, and fusion, the other challenges also need attention. A considerable problem in multimodal translation is that its methods are very difficult to evaluate. Developing generative models, such as encoder-decoder and continuous generation models, is critical and always complex; example-based models are easier to build, but the likelihood of unrealistic results is high. Multimodal co-learning is new to most researchers: the subtypes of each kind of co-learning are individually well known, but their role within co-learning is new. Co-learning is task-independent and can help address the other challenges; one common problem is that biased training samples often lead to overfitting. There is a lack of research in translation, alignment, and co-learning, and building more robust models for these three challenges can open new research directions.
Aside from the challenges of MML, researchers can also focus on domains and applications. Figure 9 presents the domains and applications extracted from the collected data. The figure shows that most problems fall in the medical, human-activity, and emotion-recognition domains because of the availability of heterogeneous data; researchers could instead focus on domains where data is scarcer and more complex. Recognition, classification, detection, prediction, and analysis are the most prominent applications, according to Figure 9. After analysing the data, however, it can be said that the application is less influential than the domain, because any problem must follow some application while being solved. Most MML-related surveys focus on how to handle multimodal data or on the challenges of MML. Alternatively, researchers can apply MML to emerging concepts such as explainable artificial intelligence [413], [414] and digital twins for industry [415]. These emerging areas can act as new research directions.

C. INSPECTION OF RQ3
Modalities represent the manner in which something occurs or is perceived [1]; image, audio, and text are examples. Extracted information related to modalities is presented in Figure 11, which shows that the image and text modalities are used the most, followed by audio and video. The main reason for the prevalence of visual modalities is that most MML-related problems involve images and video, which can carry multiple modalities. Audio and text modalities are often paired with the visual modality: applications such as caption generation and speech-to-text conversion are examples where video, audio, and text come together to solve specific problems. Sensor, signal, and numeric data are easier to handle, and much research on them has already been performed.

D. INSPECTION OF RQ4
Different kinds of ML algorithms are used to solve MML problems; a summary of the applied algorithms is presented in Figure 10. It shows that most of the research applied NN-related models to MML problems, for two reasons: the modality and the type of task. Visual, audio, and text modalities are easy to process with NN models such as CNNs, RNNs, and DBMs. The type of task also plays an essential role, because most problems relate to image-to-text, speech-to-text, or speech-to-speech generation, where NN models perform better. Apart from NNs, researchers have applied SVMs, ensemble models (EMs), nearest-neighbour models (NNMs), tree-based (TB) models, Bayesian models (BMs), linear models (LMs), K-means, encoder-decoder architectures, genetic algorithms, and graph-based models. Figure 10 also makes it evident that most applied algorithms use supervised learning; semi-supervised and unsupervised methods were used only for specific tasks, such as graphical models, which use unsupervised methods.

VII. CONCLUSION
This article performed a systematic literature review on MML and its challenges to provide an overview of recent trends. The study was conducted using the PRISMA approach, and its selection procedure was reported in detail. In total, 374 articles were selected from an initial 1032 collected articles based on their relevance to the four research questions. The findings of this review reveal that MML depends not only on modalities but also on ML algorithms and tasks. The investigation of algorithms and data shows that NN-based algorithms and image data are the most used. This review also examines different aspects of multimodal representation, translation, alignment, fusion, and co-learning and their possible gaps. Overall, this SLR presented a summary of work done in MML that needs to be expanded further, providing timely future-work opportunities for researchers interested in this interdisciplinary field.

ABBREVIATIONS
In this section, we introduce a list of abbreviations that are used throughout the article. Table 8 contains abbreviations of key methods and algorithms.

DATA
All the supporting data used for this review study is available at the following link: https://doi.org/10.5281/zenodo.7615714