Antennas and Propagation Research From Large-Scale Unstructured Data With Machine Learning: A Review and Predictions

The past century has witnessed remarkable progress in antennas and propagation (A&P) research, which has made dramatic changes to our society and life and has led to paradigm shifts in engineering and technology. Although the underlying theory of electromagnetics is well established and mature, research on A&P will continue to play a paramount role in the Fourth Industrial Revolution. In this article, we present an approach based on natural language processing (NLP) and machine learning (ML) techniques to review A&P research based on large-scale unstructured data from openly published scientific papers and patents and, in turn, provide meaningful summative and predictive information. We particularly screen 159,000 research papers published between 1981 and 2021 and extract a pool of 2,415 significant keywords reflecting past and present key research topics in A&P. We then apply an encoder–decoder long short-term memory (LSTM) network with an integrated attention mechanism to predict the future trends of A&P research in the form of a Gartner’s hype cycle.


INTRODUCTION
In a recently published book by the International Union of Radio Science, for its centennial anniversary celebration [1], a comprehensive review of electromagnetics theory as well as antennas and their applications to wireless communication, radar, and other modern technologies, such as satellite, space, and mobile health, was conducted by a group of prominent engineers. Similar reviews can be found and are performed regularly by leading experts in their subject areas. Notably, Balanis reviewed antenna theory in [2], and Jensen and Wallace summarized research challenges and opportunities in A&P for multiple-input, multiple-output (MIMO) wireless communications [3]. In recent years, there has been an increasing number of publications on topical reviews in subject areas such as antennas for energy harvesting and wireless power transfer [4], ML for antenna design and optimization [5], [6], antennas for mobile handsets [7] and flexible wireless sensors [8], and wearable technologies [9], [10].
With the advent of artificial intelligence and ML technologies, it is widely accepted that mining and extracting useful information from big data in scientific publications will help us analyze the past and current trends of research and predict future directions, assisted by ML algorithms. Among all approaches developed in ML, NLP has been successfully applied for voice recognition and semantic information extraction, both of which have played an important role in conducting new scientific research [11], [12], [13], [14], [15]. For example, Kuniyoshi et al. [16] proposed label definitions for material names and properties and built a corpus containing 836 annotated paragraphs for training a named entity recognition (NER) model [17]. They achieved a micro-F1 score of 78.1% for the NER model. This model was then applied to analyze 12,895 material research papers, and the trend of inorganic material research was captured by investigating the change of keyword frequencies by year and country. Elton et al. [18] collected textual data from various sources (e.g., journal articles, conference proceedings, the U.S. Patent and Trademark Office, the Defense Technical Information Center, and archives on https://archive.org/) and successfully extracted meaningful chemical-chemical and application-chemical relationships by performing computation with word vectors and without hand labeling. Zdravevski et al. [19] demonstrated the applicability of their NLP toolkit [20], showed increasing attention from the scientific communities toward enhanced and assisted-living environments over the past 10 years, and provided new technical trends in specific research topics. Their NLP toolkit was shown to be useful in expediting the fully automated review process while providing valuable insights from the surveying of relevant articles. The review generated by their NLP toolkit successfully included informative tables, charts, and graphs.
In this article, we propose a fully automatic framework that extracts meaningful keywords from a large number of publications related to A&P research, with an architecture as shown in Figure 1. In particular, we build a structured database that has the details of scholarly article attributes (e.g., titles, authors, affiliations, authors' keywords, journal names, numbers of citations, and abstracts) from the literature, using the Scopus application programming interface (API). We retrieved 167,000 papers from between 1906 and 2021; however, for some of the early papers, only limited information (e.g., titles) was searchable, due to a lack of digitization. We therefore focus only on scholarly article attribute data from 159,000 papers published since 1981. These large-scale literature data are then used as an input for extracting keywords and weighting each examined work from the literature. Here, we employ one of the NLP techniques, a domain-independent keyword extraction algorithm that determines key phrases in a body of text by analyzing the frequency of word appearances and co-occurrences with other words in the text [21]. Verifying the keywords obtained by the automatic extractor against authors' keywords significantly reduces the data complexity and enhances the accuracy of selecting meaningful keywords. This gives a total of 2,415 keywords selected from all 159,000 abstracts. We also assign a weight metric to each work from the literature, based on selected scholarly attributes (i.e., the number of citations and the reputation of the publication venue). We believe this new weighting scheme enables us to go beyond simple numerical analysis to assessing the actual influence of each paper on its research field. The number of occurrences of each keyword per year over the period of 1981-2021 is counted, reflecting the weight of each paper. Analyzing the change in frequency of each keyword in the form of time series helps us to comprehend past antenna research trends. We then utilize an encoder-decoder LSTM with attention layers to predict the future trends of A&P research in the form of Gartner's hype cycle.
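The keyword extraction step described above, scoring words by frequency and co-occurrence, can be illustrated with a compact, self-contained RAKE-style sketch. The stopword list is truncated for brevity and the example abstract is invented; a real run would use the full RAKE implementation cited in [21]:

```python
import re
from collections import defaultdict

# A small illustrative stopword list; a real RAKE run would use a full one.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "with", "is", "are",
             "in", "on", "to", "we", "this", "that", "by", "using"}

def rake_keywords(text, top_n=5):
    """Score candidate phrases RAKE-style: split the text into phrases at
    stopwords/punctuation, score each word as degree/frequency, and score
    each phrase as the sum of its word scores."""
    words = re.split(r"[^a-zA-Z0-9-]+", text.lower())
    phrases, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq = defaultdict(int)    # how often each word appears
    degree = defaultdict(int)  # word co-occurrence within phrases
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # phrase length, including the word itself

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

abstract = ("We propose a compact microstrip patch antenna for wireless "
            "communications, and the patch antenna is optimized with a "
            "genetic algorithm.")
print(rake_keywords(abstract))
```

Longer multiword phrases naturally accumulate higher scores, which is why RAKE tends to surface technical terms such as "compact microstrip patch antenna" ahead of single filler words.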

FIGURE 1.
An overview of the keyword prediction framework. From the abstracts of 159,000 works of antenna literature, the Rapid Automatic Keyword Extraction algorithm extracts meaningful keywords. The weight of each abstract is calculated using the number of citations and the SCImago Journal Rank index, which is applied to the extracted keywords of the abstract. The weighted keywords are analyzed by the number of word occurrences (i.e., frequency), and finally, the future trends of antenna research are predicted using an encoder-decoder LSTM with an attention layer.
Within this framework, the methods for indicating article performance are as follows: 1) the average citation count of each paper, as a measure of the usefulness, impact, or influence of a publication, and 2) the SCImago Journal Rank (SJR) indicator [22], developed by SCImago from the widely known Google PageRank algorithm [23], [24]. The former is an article-level metric calculated by dividing the total number of citations by the number of years since an article was published. The average citation is a simple metric that compensates, to some extent, for the time an academic or journal has been active, which provides a fair comparison between junior and senior researchers. The latter is a journal-level metric obtained as the average number of weighted citations received per document published in that journal during the previous three-year window, as indexed by Scopus. The SJR indicator represents the scientific influence of scholarly journals, accounting for both the number of citations received by a journal and the importance or prestige of the sources of those citations; higher values indicate greater journal prestige. It should be noted that the preceding indicators have been designed to address the limitations of the well-known journal impact factor (JIF) [25], [26]. Supporting the recent statement by the San Francisco Declaration on Research Assessment (DORA) [27] that research impact should be shown with other journal-based indicators (the Eigenfactor, SCImago, h-index, publication period, and so on), this article adopts both article-level and journal-level metrics, which provide a multifaceted view of research impact without relying solely on the JIF to reflect the influence of individual articles or scientists. In this way, rather than looking only at the number of citations, every paper in the database is given a weight to indicate its overall influence.
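Since the exact form of the weighting equation (1) is not reproduced in this excerpt, the following is only a hedged sketch of how the two metrics could be combined: a convex combination of the normalized average citation rate and the normalized SJR, using the α = 0.3 and β = 0.7 values reported in the Figure 4 caption. The normalization constants `max_avg_citation` and `max_sjr` are placeholders, not values from the article:

```python
def paper_weight(citations, years_since_pub, sjr, alpha=0.3, beta=0.7,
                 max_avg_citation=100.0, max_sjr=10.0):
    """Illustrative weight for one paper: a convex combination of the
    average citation rate (article-level) and the SJR (journal-level),
    each scaled to [0, 1]. The combination form and the normalization
    constants are assumptions for this sketch."""
    avg_citation = citations / max(years_since_pub, 1)  # citations per year
    norm_citation = min(avg_citation / max_avg_citation, 1.0)
    norm_sjr = min(sjr / max_sjr, 1.0)
    return alpha * norm_citation + beta * norm_sjr

# A paper with 40 citations over 8 years, in a journal with SJR 2.0.
w = paper_weight(citations=40, years_since_pub=8, sjr=2.0)
print(round(w, 3))
```

With β > α, journal prestige dominates the article-level citation rate, matching the DORA-inspired choice described in the Figure 4 caption.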

INFORMATION COLLECTION AND RETRIEVAL
The proposed framework utilizes a Python package, pybliometrics, to access published papers related to A&P via the Scopus RESTful API, using HTTP requests. Using this package, all the scholarly article attributes (e.g., titles, abstracts, publication names, authors' keywords, numbers of citations, and affiliation names) can be collected. Among the different literature types, namely, articles, reviews, conference papers, and books, we collected data from the article type only, which returned 167,000 antenna-related articles published since 1906. Figure 2 gives a statistical overview of each of the attributes. Figure 2(a) […] and substrate-integrated waveguides. The latter have strong links with the development of millimeter-wave technologies. Equally, studies on lens antennas and phase shifters have made their mark and attracted renewed interest due to the need for hybrid analog and digital beamforming networks for 5G/6G wireless communications.
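The retrieval itself requires Scopus API credentials, so the sketch below only illustrates the post-retrieval step of building the structured subset: keeping article-type records from 1981 onward and counting them per year. The record dictionaries mimic attributes returned by pybliometrics' Scopus search results (the field names are illustrative, and the four records are invented):

```python
from collections import Counter

# Hypothetical records mimicking Scopus article attributes as exposed by
# pybliometrics; "ar" denotes the journal-article document subtype.
records = [
    {"title": "A wideband patch antenna", "subtype": "ar", "year": 2019},
    {"title": "Survey of MIMO systems",   "subtype": "re", "year": 2019},
    {"title": "THz antenna measurements", "subtype": "ar", "year": 2021},
    {"title": "Metasurface lens design",  "subtype": "ar", "year": 1975},
]

# Keep journal articles only, published from 1981 onward, mirroring the
# 159,000-paper subset used in this study.
articles = [r for r in records if r["subtype"] == "ar" and r["year"] >= 1981]
per_year = Counter(r["year"] for r in articles)
print(len(articles), dict(per_year))
```

In this toy sample, the review paper and the pre-1981 article are filtered out, leaving two records to feed the keyword extraction and weighting stages.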
Antenna arrays, including phased arrays, have direct applications to radars, imaging, and wireless communications. Significant research has been conducted in designing feed antennas, reflector antennas and arrays, and microwave photonics. Notably, antenna measurement techniques have been studied in conjunction with antenna arrays. THz development is supported by new emerging materials, such as graphene for photoconductive antennas, silicon technologies with micromachining, and reconfigurable reflectarrays and metasurfaces. Technologies such as massive MIMO, channel modeling, array signal processing, channel estimation, and new coding/modulation schemes have been highlighted for wireless propagation research, with applications related to microaerial vehicles, the Internet of Things, GPS, and so on.
It should be noted that, in general, the articles that were published before the 1980s are available as PDF or image files, while the articles published since then are mostly available in an HTML format. This means that the articles in HTML can be recognized, and all scholarly article attribute data can be downloaded in an automated manner. For the literature published prior to 1980, we extracted the limited attribute data available in HTML, such as titles, authors, affiliations, and journal details. As expected, since around 1980, the number of articles published has increased significantly. In the period of 1981-2021, we observe remarkable changes in ranking, in particular, in the amount of literature by affiliation and by country. Figure 3(c) and (d) presents the geographic distribution of research papers around the world before and after 1980. It is evident that A&P is now researched globally, with increasing numbers of papers coming from Latin America, Africa, and the Middle East. A&P research is continuously growing in countries such as China, India, Iran, Spain, and South Korea.

ABSTRACT WEIGHTING AND KEYWORD EXTRACTION
To enable comparable analysis and prediction, each abstract is given a different weight. We use the normalized citation count of each paper to remove the paper's time bias and the SJR index to measure its scientific influence at a broader level. Each abstract's publication name is replaced with the corresponding SJR index.
The weight of each abstract (or paper) is calculated using (1), with two hyperparameters, α and β. While Figure 4(a) shows the top 20 affiliations based on the weighted abstracts, Figure 4(b) reorders the ranking of the affiliations by applying a cutoff (i.e., 0.02) on the weight. Table 1 lists details of the seven papers with the highest weight values.
To predict the future trends of past and current antenna technologies, extracting relevant features (e.g., keywords) from all available abstracts is the first and foremost step. Extracting the right keywords with significant information not only leads to high accuracy in prediction but also reduces the time required to search through the large amount of unstructured abstract data. We thus use the well-referenced Rapid Automatic Keyword Extraction (RAKE) algorithm, which automatically calculates a sum of the number of co-occurrences and then divides the co-occurrences by their occurrence frequency [35]. Based on our previous research using RAKE [36], we can confidently extract meaningful keyword phrases from each paper; however, not all of these phrases can be used for our prediction task, due to their volume and complexity. To address this problem, only keywords from each abstract that overlap with RAKE's keyword phrases, authors' keywords, or words in the title of a paper are selected.
From a total of 159,000 abstracts, 2,415 keywords are extracted, and these are then embedded into a 200D vector space via word embedding to analyze their relationships. For the vectorization of words, we use a pretrained word embedding model, Mat2Vec, which is trained on 3.3 million scientific abstracts [12]. To simplify perception and cognition of the complex keyword set, we create keyword clusters by merging words with similar meanings, detecting inherent similarity relationships. Clustering analysis performs an unsupervised mapping and labeling of each keyword, based on preidentified clusters in the vector space. To facilitate this process, the 2,415 200D keywords are compressed into a 3D space, using principal component analysis (PCA). Figure 5 visualizes the keyword embedding via PCA and shows that all the keywords are clustered into five groups (or categories) by using the k-means algorithm [37], [38]. These clusters are collections of keywords with relatively similar meanings or a high number of co-occurrences. Here, we can specify the representative attributes of each cluster. Table 2 presents the five clusters and their attributes, with the 10 most frequent keywords in each category. Based on these keywords, we have attempted to identify a broad topic for each cluster, as described in Figure 5 and Table 2.
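The PCA compression and k-means clustering steps can be sketched with NumPy alone. Random Gaussian vectors stand in for the 2,415 Mat2Vec embeddings (the real ones are trained, not random), and the k-means loop is a bare-bones Lloyd's algorithm rather than the library implementation used in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 2,415 Mat2Vec keyword embeddings (200-D each).
embeddings = rng.normal(size=(2415, 200))

# PCA via SVD: center the data, then project onto the top three axes.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_3d = centered @ vt[:3].T          # shape (2415, 3)

def kmeans(points, k=5, iters=20, seed=0):
    """Bare-bones Lloyd's algorithm: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

labels = kmeans(points_3d, k=5)
print(points_3d.shape, np.unique(labels))
```

Projecting to 3D before clustering, as in Figure 5, keeps the groups visualizable; clustering directly in 200D is also possible and may give slightly different assignments.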

ANALYZING PAST TRENDS
Quantifying how the frequency of each keyword changes over time helps us to understand its trend intuitively. To do this, we build a dataset that tracks the occurrences of each keyword from 1981 to 2021 in the form of time series. These occurrences are based on the weight of the published paper that a particular keyword belongs to. The frequency dataset is then normalized using (2), which transforms the values of each point in the time series to a common scale:

NF_k = WF_k / N_k, (2)

where the normalized frequency (NF) is the given word frequency (WF) divided by N_k, the total number of works of literature published in time period k. Figure 6 details how the frequencies change with respect to the amount of literature per year, quantifying the rate of change for each keyword over time, for example, "massive MIMO" and "transmit power." We then categorize each keyword into groups according to one of the following trends: increasing, decreasing, and emerging. For example, "transmit power," as in Figure 6(a), shows a steady increasing trend since its first occurrence, while "massive MIMO" shows a sharp increase around 2010, which makes it an overall increasing trend, as in Figure 6(b). We can thus classify "transmit power" as an increasing keyword and "massive MIMO" as an emerging keyword. Figure 7 illustrates the top 30 increasing, decreasing, and emerging keywords.
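The normalization in (2) amounts to dividing each year's weighted keyword count by that year's total literature volume. A minimal sketch, with invented toy numbers standing in for a real keyword's series:

```python
def normalized_frequency(weighted_counts, totals):
    """NF per equation (2): the (weighted) frequency of a keyword in each
    period divided by the total number of works published in that period."""
    return [wf / n if n else 0.0 for wf, n in zip(weighted_counts, totals)]

# Toy series: weighted occurrences of one keyword over five years,
# against the total literature count per year (illustrative numbers).
wf = [2.0, 5.0, 12.0, 30.0, 55.0]
totals = [800, 1000, 1500, 2500, 4000]
nf = normalized_frequency(wf, totals)
print([round(v, 4) for v in nf])
```

Dividing by the yearly total matters: a keyword whose raw count doubles in a year when total output also doubles has not actually gained relative attention.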

LEARNING WITH ATTENTION MECHANISM
The time series data from the "Analyzing Past Trends" section are now fed into an encoder-decoder LSTM network for the prediction of the future trends of the keywords. The encoder-decoder architecture of recurrent neural networks has proved to be powerful for sequence-to-sequence prediction problems in various fields, such as NLP, neural machine translation, and image caption generation [39], [40], [41], [42]. The encoder-decoder LSTM network consists of three main components, namely, the encoder, the intermediate (encoder) vector, and the decoder. The encoder uses a multilayered LSTM unit to map the input sequence to a vector of fixed dimensionality, and then another LSTM unit decodes the target sequence from that vector [43]. These LSTM units are trained on the input data while maximizing the conditional probability of the target sequence for a given input sequence. One of the main drawbacks of this network is its inability to extract strong contextual relations from long input series; that is, if a long time series has context or relations within its substrings, a basic encoder-decoder model cannot capture them through its fixed-length vector. This limitation can significantly deteriorate the predictive performance of the model. An attention mechanism [44] integrated with the LSTM, however, can address this problem. Adding the attention component to the network permits the decoder to utilize the most relevant parts of the input sequence in a flexible manner, through a weighted combination of all the encoded input vectors, with the most relevant vectors given the highest weights [41], [43]. There are two well-referenced attention mechanisms in the literature (by Bahdanau et al. [44] and Luong et al. [15]), which differ in the method of calculating attention scores. Bahdanau's attention mechanism is adopted in this article. Figure 8 illustrates the encoder-decoder LSTM with the attention layer.
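Bahdanau's additive attention can be sketched in a few lines of NumPy. The dimensions, random weight matrices W1 and W2, and scoring vector v below are illustrative stand-ins, not the trained parameters of the model in this article:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 8                      # encoder time steps, hidden size

h = rng.normal(size=(T, d))      # encoder hidden states h_1..h_T
s_prev = rng.normal(size=d)      # decoder's previous hidden state

# Additive (Bahdanau) alignment: score_i = v.T @ tanh(W1 @ s_prev + W2 @ h_i)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

scores = np.tanh(s_prev @ W1.T + h @ W2.T) @ v     # shape (T,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                           # softmax attention weights
context = weights @ h                              # weighted sum of encoder states

print(weights.round(3), context.shape)
```

The softmax guarantees the weights are nonnegative and sum to one, so the context vector is always a convex combination of the encoder states, with the best-aligned states contributing most.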

MODEL VALIDATION AND PREDICTION
To predict the research trends for the next four years, we feed the time series data into the many-to-one encoder-decoder LSTM, which is modified to have two LSTM units (i.e., the encoder and decoder), each composed of 300 units with rectified linear unit activations. Due to the limited size of the dataset (i.e., 82 data points), we utilized 90% of the data for training and the remaining 10% for testing; thus, the majority of the available data was used for training the model. The validation yields an average mean square error of less than 0.058, computed over 153 selected keywords. As shown in Figure 9, the attention layer we adopted has a positive effect, although of minor magnitude. It is presumed that this magnitude could increase as the data scarcity is overcome. Figure 9 gives the validation result for the keyword "reflectarray antenna." Figure 10(a) and (b) depicts the prediction results of the validated models eight steps ahead, for the period of 2022 to 2026. Here, two steps correspond to one year, and eight steps to four years. This applies to each of the 153 individual keywords.

In the model of Figure 8, the input sequence x_t is fed into the encoder, whose hidden states h_t are exposed to the decoder via the attention layer. These states are weighted to give a context vector c_t that is used by the decoder. The attention weights a are calculated by aligning the decoder's last hidden state with the encoder hidden states [44]. An alignment score quantifies how well the output at position p is aligned to the input at position q. The context vector passed to the decoder is the weighted sum of the encoder's LSTM hidden states h_q, with the weights coming from the alignment. The decoder's current hidden state is a function of its previous hidden state s_{t-1}, the previous predicted output, and the current context vector. At each time step, the context vector is adjusted via the alignment model and attention; thus, the decoder selectively attends to the input sequence via the encoder hidden states.

The prediction for each of these antenna research keywords is then analyzed and plotted using Gartner's hype cycle [45], [46] to provide a holistic view of antenna research, representing the maturity of new technologies in a simple and graphical way. This representation can give readers strong hints and insights into which antenna technologies are potentially relevant to solving real problems and exploiting new opportunities.
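The many-to-one framing and the chronological 90/10 split can be sketched as follows. The input window length of eight steps is an illustrative assumption (the article states only that predictions are made eight steps ahead), and a linear ramp stands in for one keyword's 82-point series:

```python
import numpy as np

def make_windows(series, in_len=8, out_len=1):
    """Frame a univariate series as supervised (many-to-one) samples:
    each window of in_len steps is paired with the next out_len steps."""
    X, y = [], []
    for i in range(len(series) - in_len - out_len + 1):
        X.append(series[i:i + in_len])
        y.append(series[i + in_len:i + in_len + out_len])
    return np.array(X), np.array(y)

series = np.linspace(0, 1, 82)          # stand-in for one keyword's 82 points
X, y = make_windows(series)

# Chronological 90/10 split, as in the study; shuffling before splitting
# would leak future values into the training set.
split = int(0.9 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(X.shape, X_train.shape, X_test.shape)
```

Keeping the test windows at the end of the series mimics the real forecasting task: the model is validated on the most recent data it has never seen.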
The hype cycle graph is divided into five key phases: technology trigger (TT), peak of inflated expectations (PI), trough of disillusionment (TD), slope of enlightenment (SE), and plateau of productivity (PP). The TT phase is when new technologies become apparent or prominent. If a particular technology receives major attention with some success stories, it moves to the PI phase despite doubts or reservations from the community. Soon after a technology innovation reaches the PI phase, we will see some activities beyond early adopters. Frequently, some negative press surfaces in the TD phase. In the SE phase, the technology innovation has undergone plenty of scrutiny, with failures and successes, updates, and improvements, for the industry to understand an optimal growth trajectory. Finally, in the PP phase, the technology is readily produced and available as off-the-shelf solutions. From our analysis, however, we noticed that in antenna research, some keywords do not slide into the SE or PP phases but, rather, disappear.
When the inflated expectations begin to die down in the TD phase, such keywords start to decline, and this trend continues over the SE and PP phases, as indicated by the red line in Figure 10(c). Among the topics in the technology trigger phase, "secrecy capacity," "superwideband," "power management," "heterogeneous network" ("HetNet"), "secrecy rate," "spatial modulation," "stochastic geometry," and "firefly algorithms" are directly linked to wireless communications. Many of these technologies require antennas to be directive and frequency agile and thus require changes to the design, so novel materials and array architectures need to be developed to improve both power and spectral efficiency as well as security in wireless communications. The concurrent operation of macro-, micro-, pico-, and femtocells is known as a HetNet [47], and it will require superwideband antennas and perhaps new architectures to enable emerging cell-free wireless communications. As for operating frequencies, the need for THz and optical antennas is still strong, as are some fundamental requirements, such as impedance matching and radiation efficiency.
Technologies related to "metamaterials," "flexible materials," and "wearable antennas" may have reached their peak, although challenges remain in upscaling and industrial uptake. Secured communications will require novel concepts and designs from a system aspect, such as "artificial noise" and "envelope correlation" built into physical layers, antennas, and radio-frequency systems. Interestingly, "effective permeability" and "Landau damping," two concepts related to functional electromagnetic materials and plasma physics, respectively, are reaching the PP stage, together with well-known "reflector antenna" technologies. Technologies on the red curve have started disappearing, while those in blue may also reach technology maturity, though they remain key in the A&P research community.

CONCLUSIONS
To facilitate the advancement of antenna research, we have carried out a study reviewing and predicting A&P research from large-scale unstructured data by using ML. We have seen geographical changes in A&P publications, partially reflecting the outcome of government investment and the strategic aim of society leaders to promote A&P research in developing and underrepresented regions. Meanwhile, our study provides A&P researchers with information on how future research directions may change, in a fully automated and unbiased manner, based on openly published papers and books. In this work, recent advances in big data analytics, NLP, and ML play key roles, and they may benefit A&P researchers through the introduction of new subjects and cross-disciplinary knowledge. To the best of our knowledge, this is the first study in antenna research that utilizes a large amount of unstructured data extracted from over 159,000 abstracts related to antenna research published between 1981 and 2021, extracting informative patterns via keywords and using customized weighting via citations and the SJR index. Taken collectively, the predictions by the proposed framework give a sensible and practical idea of what can be expected while representing a holistic view of the scientific literature. The importance of understanding past and current technological trends and having insight into the future cannot be overemphasized. The predictive results, visualized in the form of a Gartner hype cycle, show what the technical trends will be over the next four years. This scientific attempt quantitatively addresses one of the important challenges within the A&P community: the provision of future research directions for scientists in antenna research.
FIGURE 2. (a) and (b) The number of publications, by publication name and fund sponsor. (Continued)
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

FIGURE 3. (a) and (b) Ranked affiliations by the number of works of literature before and after 1980. The total number of works of literature is 167,000, and the period of 1981-2021 gives 159,000 works. (Continued)

FIGURE 3. (Continued) (c) and (d) The changes in the number of works of literature by country, before and after 1980. Darker colors indicate a greater number of papers. (e) The number of works of literature per country, compared every decade during the period. The right subfigure represents the numbers converted into percentages.

Figure 3 compares how each country/institution's research activity has changed over time, based on the number of published papers before and after 1980.
Figure 3(a) shows the top contributors to A&P research before 1980, dominated by U.S. and U.K. universities and companies, with Harvard University and The Ohio State University leading. The post-1980 era has witnessed a surge of research from other countries and areas, including France, Sweden, Finland, Russia, Japan, Hong Kong, and Taiwan, with China dominating the chart in terms of research publications in A&P [Figure 3(b)].

FIGURE 4. Rankings generated by considering up to the third affiliation on each paper, with each paper weighted using (1), which has two parameters whose values need to be set (α = 0.3 and β = 0.7). In this experiment, a higher weight is given to the prestige of the journal in which a paper is published than to the average number of citations received by the paper, as suggested by DORA [27]. The numbers in (a) and (b) indicate the mean of the calculated weights of individual papers published by each institution. (a) Without a cutoff (data from 159,000 works). (b) With a cutoff (data from 23,000 works).

FIGURE 5. All 2,415 keywords in 200D space are plotted into 3D space by using PCA. Here, the k-means algorithm categorizes the 3D keywords into five groups.

FIGURE 6. Past trends for two example technologies in antennas and propagation research: (a) transmit power and (b) massive MIMO. The frequency data are amplified at some points when the weight is applied.

FIGURE 8. The encoder-decoder LSTM with the attention layer. The input sequence x_t is fed into the encoder, whose hidden states h_t are exposed to the decoder via the attention layer.

FIGURE 9. The validation results of the encoder-decoder LSTM with attention layer models, applied to the "reflectarray antenna" keyword.

FIGURE 10. The prediction results for the next four years, using the encoder-decoder LSTM with attention models, and the hype cycle in 2026.

TABLE 1. The top seven highly weighted papers.

TABLE 2. The 10 most frequent keywords from the five clusters.