Uncovering the Educational Data Mining Landscape and Future Perspective: A Comprehensive Analysis

Educational data mining (EDM) applies data mining techniques to educational data in order to analyze students' learning processes and extract information that helps optimize teaching strategies and improve student achievement. EDM has been an important area of research and application in recent years. The aim of this study is to describe the current state of the EDM field and reveal its future perspective. The study employs descriptive analysis and topic modeling on a corpus of 2792 studies indexed in the Scopus database since 2007. First, the study determines the document types, distribution by year, prominent authors, countries, subject areas, and journals of the studies in the field of EDM. Then, using topic modeling, an unsupervised machine learning technique, it identifies hidden patterns, research interests, and trends within the field. The study is novel in that it is the first to reveal latent research interests and trends in EDM through machine learning-based topic modeling. The descriptive findings emphasize the continuous development of the field and its multidisciplinary character. The topic modeling analysis shows that the studies can be grouped into twelve topics. The most frequently studied topic is “Learning pattern and behavior”, and the topic whose frequency of study increases the most over time is “Dropout risk prediction”. When the frequency of study is compared across topics over time, the topic that stands out most is “Performance prediction”. By revealing the big picture of the current EDM literature and providing a future perspective, the results are expected to give direction to the field and to provide insights and guidance to decision makers and education policy makers.


I. INTRODUCTION
The development of educational technologies has brought about changes in educational processes. The inclusion of internet technologies in educational processes, the diversification of resources, and the use of educational software (in short, technology-enhanced learning) have created large data pools where data about students are stored [1], [2]. These educational data pools, which grow day by day, are a gold mine for education stakeholders [3], [4]. The information that can be discovered in such pools can be used not only to model the learning process, but also to evaluate learning systems and improve the quality of managerial decisions [1]. Data mining, or knowledge discovery from databases, is defined as the automatic extraction of important patterns from such repositories [5]. In the field of education, institutions and learning environments generate large volumes of data daily from various learning and teaching activities [6]. The increase in data mining applications on educational data has given rise to the concept of Educational Data Mining (EDM). EDM is an emerging discipline in which techniques such as statistics and machine learning are applied to educational data [7], [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Laxmisha Rai.
EDM is concerned with the development and application of computerized methods to discover patterns in educational datasets that are difficult or impossible to analyze manually because of their large volume [1], [9]. From another point of view, EDM is generally applied by developing student models that express students' current knowledge, motivation, metacognition, and attitudes [10]. EDM is not limited to this; it is also used effectively to analyze the data produced by any information system related to learning and education [11]. These data can relate to the interaction of an individual student with the learning system, or they can be very diverse, such as data on collaboration with other students, school administrative data, demographic data, and data on students' cognitive and emotional states [1], [12]. Research in the field of EDM focuses on discovering information that helps educational institutions better know and manage their students, better manage students' learning outcomes, and increase their performance [12], [13]. EDM can also be used to design better and smarter learning technology and to better inform learners and educators [14].
Although EDM is a relatively new field of research, it has developed rapidly. EDM has great transformative potential for discovering how students learn, predicting learning, and understanding actual learning behavior [14]. Indeed, the literature contains many EDM studies that apply data mining to educational data. These studies can be grouped under the following categories: predicting students' academic performance [6], [15], [16], [17], [18], [19], [20]; learning behaviors [21]; students' dropout process and the efficiency and quality of teaching, such as potential estimation [22], [23], [24]; clustering students to extract typical behavioral patterns and estimating students at risk [25], [26], [27], [28]; university learning materials and evaluation for curriculum improvement [29]; planning and strategy for administrative decision making [30]; and proposing an EDM framework to support learning [11]. The ultimate goal of these studies is to provide outputs that improve the quality and delivery of educational systems and inform the necessary policies [6], [31].

A. PREVIOUS REVIEWS ON EDM
Various review articles in the literature have aimed to provide a broad perspective on the EDM field at different times. For example, Reference [8] conducted the first study of this kind in the early days of the field, examining EDM techniques applied in e-learning environments between 1995 and 2005. Reference [32] conducted a literature review examining the trends and major changes in EDM research and the reduction in the frequency of relationship mining within the EDM community. Reference [33] reviewed the literature with respect to different stakeholders in education, such as students, educators, researchers, institutions, and administrators, and also provided a list of typical training tasks that use EDM techniques. Reference [34] carried out a high-level literature review on how data mining can be used for purposes such as student retention and attrition, personal recommendation systems in education, and analyzing lesson management system data. Reference [4] conducted a study to reveal the development of the EDM field and to organize, analyze, and discuss the content of the review based on the results produced by the data mining approach. The content of that study consists of 222 EDM approaches and 18 articles containing EDM tools. Reference [9] carried out a systematic review of 166 articles published over thirty years on clustering algorithms and their applicability and usability in the context of EDM. Reference [13] took a different perspective, examining the most commonly used, accessible, and powerful tools available to researchers working in the EDM field. Reference [7] examined various tasks and applications in the EDM field and categorized them according to their purposes. Reference [35] reviewed 72 EDM research articles on the teaching and learning process from an educational perspective. Reference [36] conducted a systematic review of 33 articles published in the EDM field between 2007 and the first quarter of 2019. Reference [37] published a new review article that updated and enhanced their previous article titled "Data mining in education" from 2013. Reference [38] presented a systematic review of 140 EDM studies related to student performance in classroom learning. Reference [39] conducted a bibliometric analysis of the literature on educational data mining published between 2015 and 2019 (n=194). Reference [40], in a systematic review of 100 articles, provided a comprehensive review of machine learning approaches, as well as non-performance factors and characteristics, in three different learning environments (Traditional Learning, Blended Learning, and Online Learning). Reference [41] conducted a systematic review of 80 studies from 2016 to 2021 that used EDM methods to predict student performance. Reference [42] provided a detailed perspective on student performance prediction by examining approximately 260 studies conducted over the past 20 years from various perspectives.

B. RATIONALE AND IMPORTANCE OF THE PRESENT STUDY
Many bibliometric analyses, systematic reviews, and survey studies provide a narrow or broad perspective on the EDM field. Although these studies have contributed significantly to the field, there is still a need for studies that provide a broad perspective and reveal the big picture of EDM. Methods such as bibliometric analysis, systematic review, and survey studies have limitations, among them the difficulty of manually processing large data sets [43], [44]. At this point, topic modeling analysis, a machine learning-based approach, stands out, because it allows information to be extracted automatically from large data sets [45]. Topic modeling studies that reveal trends and extract hidden patterns in a research area have been remarkable in recent years [43], [46], [47], [48], [49], [50]. The lack of a topic modeling study that reveals the big picture of the EDM field and uncovers its hidden semantic patterns makes this study necessary and important. The current study is the first and most comprehensive topic modeling-based study in the field. Topic modeling analysis, an approach based on unsupervised machine learning, enables the semi-automatic discovery of hidden semantic patterns from large datasets and, by enabling computer processing of such datasets, has made these patterns easier to extract. To this end, the current study examined all studies conducted in the field of EDM from 2007 to the present day and extracted the descriptive characteristics of the field. In addition, the research interests and trends of the studies were explored through Latent Dirichlet Allocation (LDA)-based topic modeling analysis. The outputs of the study are expected to guide researchers in the field.

II. METHODS
This section provides information about the methodology of the study, the research questions, the data collection process, and the data analysis. The study is based on descriptive analysis and LDA-based topic modeling analysis. First, the descriptive characteristics of the studies in the literature were revealed through descriptive analysis; then the hidden patterns in the research were discovered with LDA-based topic modeling, and research interests and trends were thereby determined. Bibliometric analysis is used to summarize quantitative statistics such as prominent authors, institutions, journals, subject areas, and research years in publications [51]. Topic modeling analysis is an unsupervised machine learning approach used to automatically extract hidden patterns from large datasets [45]. It is based on automatically discovering hidden semantic patterns, called "topics", from large text datasets [45], [49], [52], [53]. In this study, the LDA algorithm [54], a probabilistic method, was used for topic modeling. LDA-based topic modeling was used because it provides an efficient way to calculate the coherence score used to determine the ideal number of topics [45]. LDA-based topic modeling is used effectively as an innovative approach in many areas, such as natural language processing and literature review of job postings [43], [44], [46], [55]. Figure 1 shows the flow of the developmental stages of this study.
As seen in Figure 1, the research problem was first determined. Following the decision to work on EDM, query criteria were created to access the largest possible data set. The EDM corpus was obtained with this query. Descriptive and topic modeling analyses were applied separately to this corpus. Descriptive characteristics of the corpus were extracted through descriptive analysis. For the topic modeling analysis, the title, abstract, and keywords of each article in the corpus were first combined into a single text. Then, through a number of data preprocessing steps, the data set was made ready for topic modeling analysis. The prepared data set was subjected to topic modeling analysis, and topics were discovered. Finally, the descriptive analysis results and topic modeling findings were reported and presented.

A. RESEARCH QUESTIONS
The aim of this study is to reveal the big picture of the EDM literature. In this regard, the following research questions were addressed to reveal the details of the studies in the EDM field and to determine research interests and trends:
RQ1: What are the document types and numbers, and distribution of them by year in the field of EDM?
RQ2: Which authors, countries, subject areas, and journals stand out in EDM?
RQ3: What is the distribution of topics of the studies in the field of EDM?
RQ4: How have these topics changed and developed over time?

B. STRATEGIES FOR THE CREATION OF THE CORPUS
The first step towards answering the research questions is to create a corpus that covers the EDM literature. To this end, studies in the literature were examined, and the Scopus database was found to be suitable and sufficient for this task. Indeed, Scopus is a widely accepted database used in literature review studies to obtain the highest number of publications related to a field [53], [56], [57]. Scopus is the largest database of abstracts and citations, covering more than 7,000 publishers and over 240 disciplines, including publications on the Web of Science [58], [59]. This coverage and its acceptance in the literature made Scopus the preferred choice. In order to cover the EDM literature, the following simple query was created to search for the phrase "educational data mining" in the abstract, title, and keywords:

TITLE-ABS-KEY ( "educational data mining" ) AND ( LIMIT-TO ( PUBSTAGE, "final" ) ) AND ( EXCLUDE ( PUBYEAR, 2023 ) )
This query was executed on 06.03.2023, and all studies published by the end of 2022 were retrieved. The query returned a total of 2831 records. The document types of the returned records were examined, and it was decided to include the "Conference Paper", "Article", "Book Chapter", "Conference Review", "Review", and "Book" types in the corpus. After this process, a total of 2815 records remained. When the distribution of publications by year was examined, it was observed that there were only 23 publications in 2007 and earlier, which is less than 1% of the total number of publications. These records were excluded, leaving the 2792 records that form the corpus.
The data analysis process of the study consists of two stages. The first stage is the extraction of descriptive characteristics of the EDM literature. In this stage, the obtained data was presented in figures and tables. The second stage is the discovery and naming of topics and trend analysis using LDA-based topic modeling analysis. Topic modeling analysis is basically an unsupervised machine learning technique, also known as a data/text mining approach [60]. Data mining requires some preprocessing steps, whose aim is to turn raw data into analysis-ready data. In the preprocessing stage, the combined text consisting of the title, abstract, and keywords of each article was transformed into plain, clean words. The text was converted to lowercase, special characters and punctuation marks were removed, and lemmatization was applied to reduce words to their base forms. Then, generic words that carry no meaning in the text (a, an, the, for, etc.) were added to the stop word list and removed. Finally, the words in the documents were converted to word vectors according to the "bag of words" logic. These steps produced cleaned, analysis-ready data. The operations were carried out using the Python language and data processing libraries.
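The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list and the small lemma lookup table are hypothetical stand-ins (the text does not name the lemmatizer or the full stop word list that were used), and the function names are invented for this example.

```python
import re
from collections import Counter

# Hypothetical stop word list; the study's actual list is not given in the text.
STOP_WORDS = {"a", "an", "the", "for", "of", "and", "in", "to", "is", "are"}

# Toy lemma lookup standing in for a real lemmatizer (an assumption of this sketch).
LEMMAS = {
    "students": "student",
    "models": "model",
    "predicting": "predict",
    "grades": "grade",
}


def preprocess(title, abstract, keywords):
    """Combine the three fields and return a list of cleaned tokens."""
    text = " ".join([title, abstract, keywords]).lower()
    # Replace every character that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Map each token to its base form when the lookup table knows it.
    tokens = [LEMMAS.get(tok, tok) for tok in text.split()]
    # Drop generic words that carry no meaning for the analysis.
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(tokens):
    """Count token occurrences, ignoring word order (bag-of-words logic)."""
    return Counter(tokens)


tokens = preprocess(
    "Predicting students' grades",
    "The models are trained for predicting grades.",
    "student; model",
)
bow = bag_of_words(tokens)
print(bow)
```

In practice the lookup table would be replaced by a proper lemmatizer from an NLP library, and the resulting token lists would be passed to the topic modeling step.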
The preprocessed data was subjected to LDA-based topic modeling analysis, carried out with the Gensim library for Python [61]. Topic distributions were first inspected in initial analyses using Gensim's ldamulticore. The stop word list was then checked and extended. The words "education", "data", "mining", and "edm" appeared in all topics, and since the research is directly related to this field, these words were added to the stop word list. Then the final analysis was performed, in which a model was created for each number of topics K in the range 3 to 25. The c_v coherence score, which is a good solution for determining the ideal number of topics [43], [45], was calculated for each K, and the model with the highest c_v score is considered to have the ideal number of topics [49], [54].
Two important parameters of the LDA model, "alpha" and "eta" (also known as beta), shape the behavior and output of the model. In Gensim's ldamulticore implementation, they specify the parameters of the prior Dirichlet distributions, and the default value of both is "symmetric". Alpha controls the topic distributions of documents, that is, how the topic mixture varies within each document. Eta controls the word distributions of topics, that is, which words a topic contains frequently and which rarely. Various values of these parameters (alpha = [symmetric], eta = [symmetric, auto, none]) were tested, and c_v coherence values were obtained for all models. The results of these trials showed that the model with K = 12, alpha = "symmetric", and eta = "symmetric" provided the highest c_v coherence score (c_v = 0.426), so the ideal model was taken to have 12 topics.
After deciding on the ideal number of topics, the topics were visualized using the pyLDAvis library [62], [63]. The visualization was used to name the topics. The lambda value, which determines how the words within the topics are ranked by importance, was set to 0.6 as recommended and accepted in the literature [50], [63]. A screenshot of pyLDAvis is given in Figure 2.
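The selection of the number of topics by c_v coherence can be sketched as follows. This is a minimal illustration that assumes Gensim is installed and that `texts` is a list of token lists produced by preprocessing; the function names, the random seed, and all example scores other than the reported c_v = 0.426 for K = 12 are hypothetical and are not taken from the study.

```python
def coherence_by_k(texts, k_values, alpha="symmetric", eta="symmetric", seed=0):
    """Train one LDA model per K and return a dict {K: c_v coherence}."""
    # Imported lazily so that pick_best_k below can be used without Gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaMulticore

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    scores = {}
    for k in k_values:
        lda = LdaMulticore(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            alpha=alpha,
            eta=eta,
            random_state=seed,  # assumed seed, only for repeatable runs
        )
        coherence = CoherenceModel(
            model=lda, texts=texts, dictionary=dictionary, coherence="c_v"
        )
        scores[k] = coherence.get_coherence()
    return scores


def pick_best_k(scores):
    """Return the K whose model has the highest coherence score."""
    return max(scores, key=scores.get)


# Hypothetical scores for illustration; only the K = 12 value appears in the text.
example_scores = {11: 0.401, 12: 0.426, 13: 0.398}
best_k = pick_best_k(example_scores)
print(best_k)
```

On a real corpus one would call `coherence_by_k(texts, range(3, 26))`, repeat it for each eta setting under test, and pass the resulting scores to `pick_best_k`. Because `LdaMulticore` uses worker processes, such a call is best placed under an `if __name__ == "__main__":` guard.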
Two educational technologists, in addition to the researcher, examined the terms that make up the topics, and a consensus was reached on the final names of the topics. After the topics and their terms were obtained, a matrix was created showing the publication count for each topic over the years, based on the number of publications assigned to each topic. With the help of this matrix, the change of topics over time was traced and trend analysis was carried out.
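A topic-by-year publication matrix of the kind described above could be built as in the sketch below. It assumes that each publication is assigned to its single most probable topic; the text does not state the assignment rule, so this rule, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_probs):
    """Return the index of the most probable topic for one document."""
    return max(range(len(topic_probs)), key=lambda i: topic_probs[i])


def topic_year_counts(docs):
    """Build {topic index: Counter({year: publication count})}.

    docs is an iterable of (year, topic_probs) pairs, where topic_probs
    is the document's probability distribution over topics.
    """
    counts = defaultdict(Counter)
    for year, probs in docs:
        counts[dominant_topic(probs)][year] += 1
    return counts


# Hypothetical documents: (publication year, topic distribution).
docs = [
    (2020, [0.7, 0.2, 0.1]),
    (2020, [0.1, 0.8, 0.1]),
    (2021, [0.6, 0.3, 0.1]),
    (2021, [0.5, 0.4, 0.1]),
]
matrix = topic_year_counts(docs)
for topic, per_year in sorted(matrix.items()):
    print(topic, dict(per_year))
```

Each row of the resulting structure gives one topic's publication counts by year, which is the input needed for tracing how topics change over time.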

III. FINDINGS
The findings of the study, in which fifteen years of EDM literature were extensively examined and the hidden patterns of this literature were extracted, are presented under two headings that answer the research questions. The first heading covers the findings for the first two research questions (RQ1 and RQ2), while the second heading covers the findings for the third and fourth research questions (RQ3 and RQ4).

A. FINDINGS ON DESCRIPTIVE CHARACTERISTICS OF THE EDM LITERATURE
In line with the first research question (What are the document types and numbers, and distribution of them by year in the field of EDM?), the document types, their numbers, and their distribution by year in the EDM literature were determined. Numerical information on document types is given in Table 1, and the distribution of the number of documents by year is given in Figure 3.
As seen in Table 1, more than half of the documents are conference publications. The proportion of journal articles (article + review) is 37.9%.
As seen in Figure 3, the number of publications in the EDM field has steadily increased over time. This increase continued until 2019 and peaked in that year. Although there was a slight decrease in the number of publications in 2020 compared to the previous year, the number of publications has started to rise again.
In line with the second research question (RQ2: Which are the prominent authors, countries, subject areas and journals in studies in the field of EDM?), the findings regarding prominent authors, countries, subject areas, and journals are given in Table 2, Figure 4, Figure 5, and Table 3, respectively. As can be seen in Table 2, Baker R.S., Romero C., and Ventura S. are among the most prolific authors in this field (Baker R.S. appears as Baker R.S.J.D in some publications; since they are the same author, the publication counts were summed and reported as one).
As seen in Figure 4, when the origins of the publications are examined, publications originating from the United States, India, and China take the lead. In addition, countries from different geographies are among the top ten. As can be seen in Figure 5, the prominent subject areas of the publications highlight the interdisciplinary emphasis: the top ten subject areas range from computer science to energy. The fact that the sum of the subject area publications exceeds the total number of publications should not be misleading. It arises because a publication can be tagged under more than one subject area.
This subject area classification is an output of the Scopus database. An article is classified into one or more subject areas. The presence of different classes (Decision Sciences, Business, Management and Accounting, Physics and Astronomy, Materials Science and Energy, and others) indicates that EDM-related studies are carried out in different disciplines and fields.
As seen in Table 3, the prominent journals in the field belong to computer science and educational technologies.

B. FINDINGS ON TOPIC MODELING ANALYSIS
In this section, the findings on the emerging topics and their trends in EDM studies are given in answer to the third and fourth research questions (RQ3 and RQ4). The results of the analysis revealed that twelve topics emerged in the field of EDM. These topics, the terms that make up the topics, and the volume ratios of the topics are given in Appendix-A. In addition, the number of publications and accelerations of the topics by year are given in Appendix-B. First, the distribution of topics (answering RQ3) is shown in Figure 6 in order of volume.
120198 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
As can be seen from Figure 6, the three most voluminous (in other words, the most studied) topics in the EDM field are "Learning pattern and behavior", "Recommendation systems", and "Sentiment analysis", respectively. The low-volume topics are "Feature selection", "Dropout risk prediction", and "Unstructured data analysis". The order of the topics by volume ratio and their order by acceleration are almost identical (this can be confirmed from Appendix-A). In fact, when the two orders are compared, only the topics "Learning analytics" (Acc=3.78) and "Mooc and learning platforms" (Acc=2.89) switch places with each other; the rest of the ranking is the same as the order of volume.
To analyze the changes and trends of the topics over time (in response to RQ4), the fifteen-year period was divided into five three-year periods. The percentages of each topic within itself and compared to other topics over time were obtained from the number of publications in these periods. These data are based on Table 4, which presents the periods and the number of publications on each topic in each period.
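The division into three-year periods can be expressed as a simple binning function. The sketch below assumes that the first period starts in 2008, which follows from the exclusion of records from 2007 and earlier and the coverage up to the end of 2022; the exact period boundaries of Table 4 are not reproduced in the text, and the function names are hypothetical.

```python
def period_index(year, start_year=2008, period_length=3):
    """Return the zero-based index of the period that contains the year."""
    return (year - start_year) // period_length


def period_label(year, start_year=2008, period_length=3):
    """Return a label such as '2008-2010' for the period containing the year."""
    first = start_year + period_index(year, start_year, period_length) * period_length
    return f"{first}-{first + period_length - 1}"


# All distinct period labels between 2008 and 2022 (assumed span of the corpus).
labels = sorted({period_label(y) for y in range(2008, 2023)})
print(labels)
```

Summing a topic's yearly publication counts within each label yields one row of a period table of the kind described above.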
Using the data in Table 4, the percentage volume of each topic within each period and the volume percentage of each topic in a given period compared to other topics were calculated.
For example, to calculate how frequently the "Learning pattern and behavior" topic was studied within itself over time, a row-based reading was performed. Accordingly, the volume ratio of the topic in each period (number of publications in period i / total number of publications on the topic) was calculated as 4.21%, 7.76%, 17.1%, 35.79% and 35.13%, respectively. A column-based reading was used to calculate the study frequency of this topic compared to other topics in each period. Accordingly, the study frequency of the topic in the first period compared to other topics (i.e., the number of publications on this topic in the first period divided by the total number of publications for that period) was calculated as 40.00%. Similar calculations were performed for all topics, and thus the percentages of each topic's frequency of study over time, both in relation to itself and compared to other topics, were determined. In addition, the acceleration of each topic over the periods, both within itself (Acc_{t,p}) and compared to other topics (Acc_{t,ot,p}), was calculated. These data are presented in Table 5.
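The row-based and column-based readings can be illustrated with a short Python sketch. The publication counts below are hypothetical and are not the values of Table 4; the function names are likewise invented for this example. The acceleration values are not computed here because their formula is not given in the text.

```python
def within_topic_shares(counts):
    """Row-based reading: percentage of a topic's publications in each period."""
    total = sum(counts)
    return [100 * c / total for c in counts]


def across_topic_shares(table, period):
    """Column-based reading: each topic's percentage of one period's publications."""
    total = sum(counts[period] for counts in table.values())
    return {topic: 100 * counts[period] / total for topic, counts in table.items()}


# Hypothetical publication counts for five periods (not the values of Table 4).
table = {
    "Topic A": [4, 6, 10, 30, 50],
    "Topic B": [6, 14, 30, 20, 30],
}

row_shares = within_topic_shares(table["Topic A"])
first_period_shares = across_topic_shares(table, 0)
print(row_shares)
print(first_period_shares)
```

The row shares of a topic sum to 100% across periods, and the column shares of all topics sum to 100% within a period, which is a quick consistency check on either reading.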
As can be seen in Table 5, the topic whose frequency of study increased the most over time is "Dropout risk prediction" (Acc_{t,p}=13.48), followed by "Unstructured data analysis" (Acc_{t,p}=13.16) and "Performance prediction" (Acc_{t,p}=12.18), respectively. From another point of view, "Performance prediction" (Acc_{t,ot,p}=1.96) was the topic whose frequency of study increased the most compared to other topics over time. It is followed by "Learning analytics" (Acc_{t,ot,p}=1.62) and "Sentiment analysis" (Acc_{t,ot,p}=1.60), respectively. Using the data in Table 5, the accelerations of the volume ratios of the topics over time within themselves and relative to other topics are given in Figures 7 and 8, respectively.
As seen in Figure 7, "Dropout risk prediction" has been studied more in recent times. In other words, studies on this topic have mostly been carried out in recent periods. This topic is followed by the "Unstructured data analysis" and "Performance prediction" topics. The slowest accelerating topic was "Knowledge tracing".
As seen in Figure 8, the study frequency of seven topics increased over time compared to other topics, while the study frequency of five topics decreased. The most prominent topic over time is "Performance prediction", followed by the "Learning analytics" and "Sentiment analysis" topics. "Learning pattern and behavior" comes first among the topics that are studied less over time compared to other topics, followed by "Mooc and online learning platform" and "Knowledge tracing". Finally, based on the increase in the volume ratios of the topics over time within themselves, the point at which each topic started to come to the fore was also identified. The approximate times when the topics came to the fore are described and visualized in Figure 9.
As seen in Figure 9, the topic of "Dropout risk prediction" started to be studied extensively around 2020, while "Performance prediction" started to gain weight around 2017. Figure 9 makes it possible to see clearly in which years the topics started to become more prominent.

IV. DISCUSSION, LIMITATIONS, AND CONCLUSION
In this section, the results are presented in the light of the findings of the current study and discussed together with the related literature. When the EDM literature was examined, conference publications were found to constitute more than half of the corpus. The number of documents increased regularly until 2019, and although there was a slight decrease in 2020, it rose again. This may be due to interruptions and priority changes in educational research caused by the COVID-19 pandemic. Indeed, emergency remote education began with the COVID-19 pandemic, studies focused on this area, and interruptions may have occurred in data processes [64]. Baker R.S. is among the most productive authors in this field, and the United States leads among countries. These results are parallel to the literature [39]. On the other hand, when the subject areas of the studies were examined, very different fields were observed, from "Computer Science" to "Energy" and "Medicine". Although the "Computer Science" category of the Scopus database takes the lead, it is possible to say that EDM studies are carried out in many different disciplines. EDM is a field located at the intersection of disciplines such as computer science, statistics, educational sciences, and psychology. The aim in this field is to use data mining methods to understand student performance by analyzing educational data, to improve learning processes, and to optimize educational policies. While computer science provides tools for data analysis, the other sciences contribute to the interpretation of educational data and to improving the quality of education. In this context, it is natural that EDM studies, which have an interdisciplinary structure, have found application areas in different disciplines. This confirms the interdisciplinary nature of the field [1], [36]. In parallel, the emergence of different journals in the fields of educational technologies and computer science, especially "IEEE Access", can be given as an example of the multidisciplinary nature of the field.
The topic modeling analysis grouped the studies in the field of EDM under twelve topics. The top three topics, based on volume, are "Learning pattern and behavior," "Recommendation systems," and "Sentiment and feedback analysis." The volume of these topics also indicates that they are the most studied topics in the field. Numerous EDM studies investigating students' learning patterns, behaviors, and strategies draw attention [65], [66]. In addition, the increase in learning resources and the fact that students get lost in this content [67], [68] have made personalization and recommendation systems important and necessary. In this context, recommendation systems are an important field of study in EDM [69]. Sentiment analysis is one of the commonly used techniques for capturing human thoughts and is frequently preferred in educational settings. Therefore, sentiment analysis and student feedback analysis systems, which process students' views and opinions through emotion analysis, are among the most studied topics in EDM [70], [71], [72]. Alongside these most voluminous (most studied) topics in EDM, "Feature selection", "Dropout risk prediction", and "Unstructured data analysis" emerged as low-volume topics. Overall, when the volume ranking and acceleration values were compared, the volume ranking and the acceleration of the topics were found to be largely the same.
To examine how studies in the field of EDM changed and developed over time, the fifteen-year time frame was divided into five three-year periods. For these periods, the volume ratios and accelerations of the topics were determined, and the frequency with which each topic was studied was measured both within the topic itself and relative to the other topics. Based on the percentage ratios and accelerations of the topic volumes across these periods, the top three topics that have been studied more frequently in recent years are “Dropout risk prediction”, “Unstructured data analysis”, and “Performance prediction”, respectively. The first two of these topics are low-volume, and the third is medium-volume. This result may stem from the fact that the most voluminous topics are present to a similar extent in all periods, whereas these low- and medium-volume topics have only recently begun to be studied more. Indeed, the years in which these three topics began to gain weight were 2020, 2017, and 2014, respectively. The recent increase in studies aimed at predicting school dropout in traditional education as well as in MOOC and online environments is remarkable [73], [74], [75]. The results of the study support this. In addition, with the growth of data sources such as text, image, and video, the concept of “unstructured data”, which entered our lives with big data [76], has also been used in the field of EDM in recent years [38]. It is likewise not surprising that “Performance prediction” is among the most studied topics recently. The tremendous increase in learning data has increased the use of EDM techniques for better understanding and organizing the learning process [38], [77], [78], [79].
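The period-based computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the start year 2007, the helper names (`period_index`, `within_topic_ratios`, `acceleration`), and the toy publication years are hypothetical, and “acceleration” is assumed here to mean the least-squares slope of a topic's volume ratios across the periods.

```python
from collections import Counter

def period_index(year, start_year=2007, width=3):
    """Map a publication year to a zero-based three-year period index.
    The start year is an assumption made for this sketch."""
    return (year - start_year) // width

def within_topic_ratios(years, n_periods=5, start_year=2007, width=3):
    """Share of ONE topic's documents that falls in each period (sums to 1).
    Assumes at least one year lies inside the covered time frame."""
    counts = Counter(period_index(y, start_year, width) for y in years)
    total = sum(counts[p] for p in range(n_periods))
    return [counts[p] / total for p in range(n_periods)]

def acceleration(ratios):
    """Least-squares slope of the ratios against the period index
    (an assumed definition of 'acceleration')."""
    n = len(ratios)
    mean_x = (n - 1) / 2
    mean_y = sum(ratios) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ratios))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hypothetical publication years of the documents assigned to one topic.
toy_years = [2008, 2011, 2014, 2014, 2017, 2017, 2017, 2020, 2020, 2020]
ratios = within_topic_ratios(toy_years)
print(ratios)                           # volume ratio of the topic per period
print(round(acceleration(ratios), 3))   # positive slope: the topic is gaining weight
```

A positive slope marks a topic whose documents are concentrated in the later periods, which is how a low-volume topic can still rank high in terms of growth.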
Finally, the volume ratios of the topics in each period were compared with those of the other topics. In this way, the frequency with which each topic was studied relative to the other topics was calculated. By this measure, seven topics gained prominence over time, while five topics lagged behind. The three topics that gained the most are “Performance prediction,” “Learning analytics,” and “Sentiment and feedback analysis,” respectively; their weight relative to the other topics has increased gradually. The first of these is among the topics studied increasingly over time, and the third is among the most voluminous topics. “Learning analytics” ranks sixth among the topics studied increasingly over time and fifth in terms of volume. This topic, which started to gain weight around 2014, is the second most prominent topic relative to the other topics. Learning analytics, defined as measuring, collecting, analyzing, and reporting data about students and contexts to understand and optimize learning environments [80], is used to provide insights into learning processes [81], [82], [83], [84]. In this context, it is not surprising that “Learning analytics” stands out among the other topics.
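The comparison of a topic with the other topics can be illustrated with a short sketch. It assumes that the cross-topic volume ratio of a topic in a period is its document count divided by the total count of all topics in that period, and that a topic “stands out” when the least-squares slope of this share is positive. The function names, the two toy topics, and their counts are hypothetical and do not come from the study.

```python
def cross_topic_shares(counts_by_topic):
    """Turn {topic: [documents per period]} into {topic: [share per period]},
    where the shares of all topics sum to 1 within each period."""
    n_periods = len(next(iter(counts_by_topic.values())))
    totals = [sum(c[p] for c in counts_by_topic.values()) for p in range(n_periods)]
    return {
        topic: [c[p] / totals[p] for p in range(n_periods)]
        for topic, c in counts_by_topic.items()
    }

def slope(values):
    """Least-squares slope of the values against the period index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hypothetical counts: topic A is stable in absolute terms, topic B grows.
toy_counts = {"A": [10, 10, 10], "B": [10, 20, 30]}
shares = cross_topic_shares(toy_counts)
for topic, series in shares.items():
    label = "stands out" if slope(series) > 0 else "lags behind"
    print(topic, [round(s, 3) for s in series], label)
```

In the toy data, topic A keeps the same number of documents in every period but still lags behind, because its share shrinks as topic B grows. This is why the within-topic and cross-topic rankings need not agree.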
Visualizing the topics with pyLDAvis made the relationships between the topics in the EDM field clearer. The size of the circle representing a topic indicates the volume and prevalence of that topic. Accordingly, in the pyLDAvis output, the three most voluminous topics, “Learning pattern and behavior”, “Recommendation systems”, and “Sentiment and feedback analysis”, are represented by the largest circles. The positions of the circles, in turn, reveal the relationships between the topics: the distance between two circles indicates how similar or different the topics are. Accordingly, topics 1, 7, and 9 (“Learning pattern and behavior”, “Clustering student's profile”, and “Rule based algorithm”) stand out as close and related topics and form the first group of related topics. In addition, topics 2 and 5 (“Recommendation systems” and “Learning analytics”) and topics 3 and 6 (“Sentiment and feedback analysis” and “Performance prediction”) are close and related topics. Topic 12 (“Unstructured data analysis”) draws attention as the topic that is least related to all the other topics.
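How the distance between two circles can reflect topic similarity is sketched below. By default, pyLDAvis places the circles by computing the Jensen-Shannon divergence between the topic-term distributions and projecting the result to two dimensions; only the divergence step is shown here. The three toy topics over a four-term vocabulary are hypothetical and are not taken from the study's model.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions,
    at most ln(2) for distributions with no shared terms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-term distributions (each row sums to 1).
topic_a = [0.50, 0.30, 0.10, 0.10]
topic_b = [0.40, 0.40, 0.10, 0.10]   # shares its dominant terms with topic_a
topic_c = [0.05, 0.05, 0.40, 0.50]   # dominated by different terms

print(round(js_divergence(topic_a, topic_b), 4))  # small: circles would lie close together
print(round(js_divergence(topic_a, topic_c), 4))  # larger: circles would lie far apart
```

Topics that share their dominant terms yield a small divergence and would be drawn close together, while a topic with a distinct vocabulary, as reported for “Unstructured data analysis”, would be drawn apart from the rest.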
This study aims to identify trends in the EDM literature from the past to the present. It is unique in that it is the first to identify research interests and trends in the EDM literature using an innovative method, topic modeling analysis. However, the study has a number of limitations. The first limitation is that the corpus consists of journal articles only. In future studies, all document types, such as conference proceedings and book chapters, can be included, and topic modeling can be applied to a more comprehensive dataset. Another limitation is the use of the LDA algorithm. LDA is an efficient method for topic modeling and is frequently used in such studies. However, future studies can experiment with different algorithms and present the results comparatively. Another limitation is that in topic modeling-based approaches such as LDA, the topics are named according to the authors' point of view and interpretation. A further limitation is specific to the field. Since topic modeling is domain-dependent, the emerging topics may belong to different kinds of research areas, such as tasks or threads. In this case, the discovered topics can be classified at a higher level by taking the context into consideration, and different perspectives can be revealed. Despite these limitations, it is important to conduct such studies in the future to see how the field has developed. Although this study is the first of its kind, such automated text mining-based research should be encouraged in the future. In this way, the change and development of existing topics and the emergence of new research areas can be observed.

V. IMPLICATIONS

A. FUTURE IMPLICATIONS IN LIGHT OF THE CURRENT SITUATION
The current study revealed the big picture of the EDM field. It identified the most voluminous topics in this field as well as the topics that have been studied increasingly over time, both within themselves and in comparison to other topics. According to the results of the study, the topics with the highest increase in frequency of study over time (the top five topics in terms of growth rate) are low-volume topics such as “Dropout risk prediction”, “Unstructured data analysis”, and “Feature selection”, as well as “Performance prediction” and “Sentiment and feedback analysis”. In addition, the topics with increasing frequency of study compared to other topics (also the top five topics in terms of growth rate) are “Performance prediction”, “Learning analytics”, “Sentiment and feedback analysis”, “Rule based algorithms”, and “Unstructured data analysis”, respectively. Three of the top five topics are the same in both categories (within the topic itself and in comparison to others). In addition to high-volume topics, the development of low-volume topics that stand out both within themselves and among other topics should be monitored over the next three to five years. The current study is important for understanding the current state and evolution of EDM, which is an emerging field. Similar studies conducted in the future will also be important in revealing the evolution of the field. The outputs of the current and similar studies are important for guiding both researchers in this field and curriculum and policy makers.

B. IMPLICATIONS FOR EDUCATORS AND RESEARCHERS
The previous section outlined the current state of research interests and trends in the EDM literature and the future perspectives. This section focuses on the implications for educators and researchers in light of the current results. From the perspective of educators, EDM is known to be used to design better and smarter learning technologies. As a result, learners and educators are better informed. In this context, EDM can be seen as a good tool for educators to make better inferences. Given that “Learning pattern and behavior”, “Recommendation systems”, and “Sentiment and feedback analysis” are the most studied topics according to the results of the study, it is thought that educators can frequently work on these topics. On the other hand, it can be expected that “Dropout risk prediction”, “Learning analytics”, and “Performance prediction”, which have recently come to the fore, will be the focus of educators' attention in the near future.
For researchers, the results of the study can be expected to provide important outputs and perspectives. Beyond identifying research interests and trends, the identification of both the most studied topics and the topics that have come to the fore in recent years offers important opportunities for researchers in this field in the near future. The previous section presented predictions for the future in broad terms. In light of these, it is noteworthy that topics such as “Dropout risk prediction”, “Unstructured data analysis”, and “Feature selection”, although low in volume, have increased in intensity over time. On the other hand, topics such as “Performance prediction”, “Learning analytics”, “Rule based algorithms”, and “Unstructured data analysis”, which stand out compared to other topics, may be interesting to follow in the near future.

FIGURE 1. Flow chart of the study.

FIGURE 3. Distribution of documents by years and slope line.

FIGURE 4. Top ten origin countries of publications in the field of EDM and the number of publications.

FIGURE 5. Prominent subject areas and number of publications in the field of EDM.

TABLE 3.

FIGURE 6. The order of the topics according to their volume ratios.

TABLE 4. Publication numbers of topics in five three-year periods.

FIGURE 7. The change of the volume ratios of the topics within themselves in periods.

FIGURE 8. Changes in volume ratios of topics compared to other topics over periods.

FIGURE 9. Timeline of approximate emergence of topics.

TABLE 1. Types of documents that make up the EDM literature, their numbers and percentages.

TABLE 2. Top ten authors and their publication numbers in the field of EDM.

TABLE 5. The volume ratio and acceleration value of each topic in the periods and in comparison to other topics.

TABLE 6. Featured topics in the field of EDM, top fifteen terms representing topics and volume ratios.

TABLE 7. Distribution and acceleration of the number of documents pertaining to each topic by year.