Portrait of College Students’ Online Learning Behavior Based on Artificial Intelligence Technology

With the development of Internet technology, online learning is becoming more and more popular. However, college students differ in their online learning behaviors and in their attitudes toward artificial intelligence (AI) learning tools. In this paper, a portrait model is proposed for college students, which focuses on their online learning behavior and attitudes toward AI learning tools. The proposed portrait model is built on AI technology, i.e., the random forest algorithm and the long short-term memory (LSTM) algorithm. The model has three main parts: data pre-processing, building a multi-dimensional label system, and the portrait model of college students. Firstly, the information collected through the questionnaire is quantified and its quality is improved by deleting invalid data. Then, a multi-dimensional label system is built for college students’ portraits, including basic attributes, online behavior attributes, behavioral attributes of using learning software, and psychological attributes of AI learning tools. Since each label consists of multiple indexes, the variance-based filtering method is used to streamline the indexes of online behavior attributes and behavioral attributes of using learning software, and the random forest algorithm is applied to reduce the dimension of psychological attributes of AI learning tools. Next, the portrait of college students’ online learning behavior is realized by the K-means clustering algorithm, and the LSTM algorithm is trained to learn the mapping mechanism between data and portrait categories. Through the mapping mechanism, the portrait of any new college student who is not included in the original dataset can be obtained quickly. Finally, the validity of the proposed model is verified by analyzing the questionnaire results of college students. Additionally, the portrait results provide a data basis for the development and popularization of AI learning tools.


I. INTRODUCTION
With the rapid development of Internet technology in recent years, every aspect of people's lives has become related to the Internet, including clothing, food, housing, transportation, and education [1]. In particular, relying on the Internet for study and work has become more common during the COVID-19 pandemic. However, college students differ in their online learning behaviors and in their attitudes toward artificial intelligence (AI) learning tools. Therefore, it is important to analyze college students' online learning behavior for the development and popularization of AI learning tools. Portrait technology is a useful tool for analyzing users' characteristics.
The associate editor coordinating the review of this manuscript and approving it for publication was Essam A. Rashed.
The concept of user portrait technology was proposed by Alan Cooper and refers to a virtual representation of a user based on real data, such as social attributes, consumption behavior, and daily habits [2]. At present, user portrait technology has been used in various fields. A user portrait analysis framework is proposed to analyze the preferences of users of different phone brands for different apps [3]. In the short video industry, reference [4] proposes a user-portrait-based recommendation algorithm and expounds the application model of the user portrait, in order to promote the short video industry and enhance the security management of user privacy on short video network platforms. With the help of user portrait technology, churn prediction for Internet card users is performed by a churn prediction model [5]. In reference [6], a user portrait technology with a periodically changing database is built to accommodate the changing interests of the user. Additionally, user portrait technology can be applied to assess the creditworthiness of bank customers [7], [8]. In the field of e-commerce, the best products are recommended to users based on users' portrait results, which are obtained from users' opinions and purchase history [9], [10]. Users' portrait results of different groups can be described by clustering user behavior patterns [11]. In the power industry, an AI-based enterprise profiling model is built to characterize the power consumption profile of enterprises geographically, temporally, and industrially and to provide customized power services [12]. In addition, portrait technology based on neural networks is applied in tourism user portraits, online user portraits, and industrial market portraits [13], [14], [15].
There are several studies about college students' portrait analysis. In reference [16], the dispositional and cultural predictors of portrait-editing intentions among Chinese and U.S. college females are explored through an online survey. A Hadoop campus early warning and decision-making system is built to generate students' portraits, realizing closed-loop management of exception event monitoring, early warning, and root cause analysis [17]. To improve the quality of education, 19 educational features are constructed from the data of students from enrollment to graduation, and the K-Prototype method based on the mixed measurement of Hamming distance and Euclidean distance is applied to study the student portrait [18]. The user portrait from multi-source campus data is studied, and it is concluded that the fusion of multi-source data gives better experimental results [19]. Additionally, two aspects of college students, i.e., consumption and learning, are analyzed by an improved K-means algorithm [20]. To analyze the characteristics of student learning behavior, a label with 5 dimensions is proposed and a clustering algorithm based on EM-FCM is used in reference [21]. There are also several studies about students' online learning behavior. Reference [22] analyzed the correlation between students' online learning behavior features and course grades. Reference [23] proposed a behavior classification-based e-learning performance prediction framework, which considered the inherent correlation between e-learning behaviors. Reference [24] used multiple machine learning algorithms to predict, test, and explain the decline of students' performance, and compared databases containing online learning data with databases containing data on relevant offline learning properties.
In this paper, a portrait model of college students' online learning behavior is proposed, which applies several AI technologies. The three contributions are as follows.
(1) A framework of college students' online learning behavior portrait is proposed, which consists of three parts, i.e., data pre-processing, building a multi-dimensional label system, and portrait of college students' online learning behavior.
(2) A multi-dimensional label system is built based on the quantified indexes. The variance-based filtering method and the random forest algorithm are used to obtain effective indexes for each label.
(3) The portrait model is proposed based on the K-means algorithm and the long short-term memory (LSTM) algorithm, which makes it possible to quickly obtain the portrait results of any college student.
The rest of this paper is organized as follows. Section II introduces the framework of college students' online learning behavior portrait. Section III introduces the data pre-processing, including the quantitative method of questionnaires, guidelines for deleting invalid data, and reliability analysis of questionnaires. Section IV proposes the multi-dimensional label system and the method to determine the final indexes of this label system. Section V introduces the portrait model based on the K-means algorithm and the LSTM algorithm. Section VI presents the case study. Finally, Section VII concludes this paper.

II. FRAMEWORK OF COLLEGE STUDENTS' ONLINE LEARNING BEHAVIOR PORTRAIT
The whole process of college students' online learning behavior portrait is shown in Figure 1. It consists of three parts, i.e., data pre-processing, building a multi-dimensional label system, and portrait of college students.
Firstly, the collected questionnaires are quantified based on the characteristics of their questions. In order to improve the quality of the data used for portrait analysis, invalid data samples, i.e., those with conflicting answers to similar questions or with vague choices, are deleted. Additionally, reliability analysis is performed to ensure the quality of the data sample.
Then, a multi-dimensional label system is built based on the processed data sample. Four labels are selected, that is, basic attributes, online behavior attributes, behavioral attributes of using learning software, and psychological attributes of AI learning tools. There are multiple indexes on each label. To ensure the effectiveness of each index, the filtering method and the random forest algorithm are performed to obtain the final multi-dimensional label system.
Next, the data samples in each label are clustered by the K-means algorithm to obtain the portrait result. Additionally, the LSTM is applied to train the mapping mechanism between the data sample and the portrait. Finally, according to the trained mapping mechanism, the portrait of any college student can be quickly obtained.

III. DATA PRE-PROCESSING
The original questionnaires collected are not numerical data and cannot be directly used for portrait analysis. Therefore, these questionnaires should be quantified. Additionally, there are inevitably invalid data that should be deleted. Finally, reliability analysis is performed to ensure the effectiveness of these data samples.

A. QUANTITATIVE METHOD OF QUESTIONNAIRE
All questions in the questionnaire are choice questions, including single-choice questions (exactly one choice is selected) and multiple-choice questions (several choices may be selected). In this paper, a different quantitative method is proposed for each type.
(1) For a single-choice question, the choices are numbered in turn. These numbers are used only to distinguish categories, and their numerical values do not represent an evaluation of the question. For example, if a question has 5 choices, i.e., A∼E, they can be expressed as 0∼4.
(2) For a multiple-choice question, each choice is quantified as a binary variable. If the choice is selected, it is expressed as 1; otherwise, it is expressed as 0. Therefore, the answer to each multiple-choice question is a sequence of 0s and 1s. For instance, if a question has 5 choices and the college student selects the first and third choices, the answer can be expressed as 10100.
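The two quantification rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `encode_single` and `encode_multiple` are hypothetical, and choices are assumed to be labelled with consecutive capital letters starting at A.

```python
from string import ascii_uppercase


def encode_single(choice: str) -> int:
    """Map a single-choice answer such as 'C' to its category index (A -> 0)."""
    return ascii_uppercase.index(choice.upper())


def encode_multiple(selected: set[str], n_choices: int) -> str:
    """Encode a multiple-choice answer as a 0/1 string, one digit per choice."""
    letters = ascii_uppercase[:n_choices]
    return "".join("1" if letter in selected else "0" for letter in letters)


# Examples taken from the text: choice E of A~E, and "first and third of five".
single = encode_single("E")
multi = encode_multiple({"A", "C"}, n_choices=5)
print(single, multi)
```

The single-choice codes are category identifiers only, so any later step that treats them as magnitudes would be an additional modelling assumption.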

B. GUIDELINES FOR DELETING INVALID DATA
In order to exclude invalid answers caused by subjective factors of college students and ensure the reliability of analysis results, it is necessary to find and delete invalid data. For example, one college student chooses both 'Surfing the Internet on my laptop in my dorm room' for Question 2 and 'Not get online' for Question 3, which is evidently an invalid answer. Therefore, two criteria for identifying invalid data are proposed in this paper, as follows.
(1) Conflicting answers to similar questions. In the design of the questionnaire, some questions are set as consecutive similar questions to guide college students in filling it in. If the answers to these similar questions are contradictory, the data sample of the questionnaire should be deleted.
(2) Fuzzy choices are selected for all questions. Considering that college students have different levels of understanding of the choices, fuzzy choices can be provided for students who are in a dilemma between extreme choices. If a student selects the fuzzy choice for all such questions, it is considered that the student did not take the questionnaire seriously, and the data sample of this questionnaire needs to be deleted.
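A minimal sketch of how the two deletion criteria might be applied to quantified questionnaires is given below. The question keys (`q2`, `q3`, `q8`, `q9`), the fuzzy choice text 'Not sure', and the other answer strings are hypothetical placeholders; only the Question 2 / Question 3 conflict mirrors the example in the text.

```python
from typing import Callable

Answers = dict[str, str]
Rule = Callable[[Answers], bool]


def has_conflict(answers: Answers, rules: list[Rule]) -> bool:
    """Criterion 1: some pair of related questions has contradictory answers."""
    return any(rule(answers) for rule in rules)


def all_fuzzy(answers: Answers, fuzzy_choice: dict[str, str]) -> bool:
    """Criterion 2: the fuzzy choice is selected in every question offering one."""
    return all(answers.get(q) == choice for q, choice in fuzzy_choice.items())


def is_invalid(answers: Answers, rules: list[Rule],
               fuzzy_choice: dict[str, str]) -> bool:
    return has_conflict(answers, rules) or all_fuzzy(answers, fuzzy_choice)


# Hypothetical rule: a way of surfing is reported, yet 'Not get online' is chosen.
rules = [lambda a: a["q3"] == "Not get online" and a["q2"].startswith("Surfing")]
# Hypothetical questions that offer a fuzzy choice, with that choice's text.
fuzzy_choice = {"q8": "Not sure", "q9": "Not sure"}

laptop = "Surfing the Internet on my laptop in my dorm room"
samples = [
    {"q2": laptop, "q3": "Not get online", "q8": "Agree", "q9": "Not sure"},
    {"q2": laptop, "q3": "10-20 hours", "q8": "Not sure", "q9": "Not sure"},
    {"q2": laptop, "q3": "10-20 hours", "q8": "Agree", "q9": "Disagree"},
]
valid = [s for s in samples if not is_invalid(s, rules, fuzzy_choice)]
print(len(valid))
```

Expressing each conflict as a small predicate keeps the criteria separate from the questionnaire wording, so further pairs of related questions could be added without changing the filter itself.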

C. RELIABILITY ANALYSIS OF QUESTIONNAIRE
Before analyzing college students' portraits based on these data, it is necessary to evaluate the quality and effectiveness of the data. At this point, the questionnaire has been quantified and invalid data have been deleted, so reliability analysis is performed to assess the quality of the data.
Reliability reflects the amount of random error in the data and the internal consistency among the questions of the questionnaire. Generally speaking, the reliability coefficient lies between 0 and 1. If it is larger than 0.9, the reliability of the data is good. If it is between 0.8 and 0.9, the reliability is acceptable. If it is between 0.7 and 0.8, some questions in the questionnaire should be revised. If it is lower than 0.7, some questions in the questionnaire should be deleted.
In this paper, Cronbach-α is selected to analyze the reliability of the data, which can be calculated by the following equation.
$$\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_{i=1}^{n}\sigma_i^2}{\sigma_{\text{total}}^2}\right) \tag{1}$$
where $n$ is the number of questions in the questionnaire, $\sigma_i^2$ is the variance of question $i$, and $\sigma_{\text{total}}^2$ is the variance of the total score of all questions. $x_{ij}$ is the result of question $i$ in questionnaire $j$, $\bar{x}_i$ is the average of question $i$, and $J$ is the total number of questionnaires.
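As an illustration, Cronbach-α can be computed directly from the quantified answers. The sketch below assumes the standard raw form, α = n/(n−1) · (1 − Σσ_i²/σ_total²), with σ_total² taken as the variance of each questionnaire's summed score. It uses the sample variance, which is an assumption; because only a ratio of variances enters, the result is the same with the population variance as long as one convention is used throughout. The function names and the toy data are hypothetical, and the handling of the exact boundary values 0.8 and 0.9 is a choice made here, not stated in the text.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """responses[j][i] is the quantified answer to question i in questionnaire j."""
    n = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(n)]
    total_var = variance([sum(row) for row in responses])
    return n / (n - 1) * (1 - sum(item_vars) / total_var)


def interpret(alpha: float) -> str:
    """Apply the reliability ranges given in the text."""
    if alpha > 0.9:
        return "good"
    if alpha >= 0.8:
        return "acceptable"
    if alpha >= 0.7:
        return "revise some questions"
    return "delete some questions"


# Toy data: two questions whose answers move together across three questionnaires.
toy = [[1, 2], [2, 3], [3, 4]]
alpha = cronbach_alpha(toy)
print(alpha, interpret(0.872))
```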

IV. MULTI-DIMENSIONAL LABEL SYSTEM
According to the questions of the questionnaire, four labels are extracted, namely basic attributes, online behavior attributes, behavioral attributes of using learning software, and psychological attributes of AI learning tools. Each label has multiple indexes. However, some of these indexes may be similar to each other and some may contribute nothing. Thus, the filter method and the random forest algorithm are applied to delete some indexes, improving the effectiveness of the indexes and labels. The final indexes and labels form the multi-dimensional label system.

A. EXTRACT LABEL
To analyze the college students' online learning behavior portrait, corresponding labels and indexes should be extracted. In this paper, a multi-dimensional label system is proposed, which consists of four labels, i.e., basic attributes, online behavior attributes, behavioral attributes of using learning software, and psychological attributes of AI learning tools. Each label includes multiple indexes, as shown in Table 1.

B. FILTER METHOD
In order to fully investigate the characteristics of college students' online learning behavior, three indexes are set for online behavior attributes and four indexes are set for behavioral attributes of using learning software. However, not all of these indexes are useful for the portrait of college students' online learning behavior. The indexes need to be selected based on the collected questionnaire results.
The variance-based filter method selects the effective indexes according to the variance of each index itself. A small variance indicates that the samples differ little on this index. If most of the values of the index are the same, or even all of them are the same, the index does not help to differentiate samples. Thus, any index whose variance is less than the threshold is deleted.
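The filter can be sketched as follows. The index names, the toy values, and the threshold of 0.5 are hypothetical, since the text does not state the threshold used, and the population variance is an assumed convention.

```python
from statistics import pvariance


def variance_filter(columns: dict[str, list[float]],
                    threshold: float) -> dict[str, list[float]]:
    """Keep only the indexes whose variance is at least the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) >= threshold}


# Toy quantified answers of six students for three indexes.
columns = {
    "time_online": [0, 1, 2, 3, 1, 2],
    "way_of_surfing": [1, 1, 1, 1, 1, 0],  # almost every answer is the same
    "what_to_do": [0, 2, 1, 3, 0, 2],
}
kept = variance_filter(columns, threshold=0.5)
print(list(kept))
```

In this toy data the nearly constant index is dropped, which corresponds to the situation described above where most answers to an index coincide.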

C. RANDOM FOREST ALGORITHM
The psychological attributes of AI learning tools have 11 indexes. To improve the portrait results of college students' online learning behavior, the number of indexes is reduced by the random forest algorithm. That is, the more effective indexes are selected from the original indexes, reducing the dimension of the portrait model of college students' online learning behavior.
The random forest algorithm is an ensemble machine learning algorithm, i.e., a classifier that uses multiple decision trees for sample training and integrated prediction [25]. The algorithm applies bootstrap resampling to randomly draw data from the original sample and construct multiple resampled samples. Then, the random splitting technique of nodes is used to construct a decision tree for each resampled sample. Finally, the decision trees are combined and the final prediction result is obtained through voting. The process is shown in Figure 2. Additionally, the out-of-bag error is used in the random forest algorithm to calculate the relative importance of the indexes, which is then used to sort and filter them [27].
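The idea of ranking indexes by out-of-bag error can be illustrated with a deliberately simplified stand-in: each "tree" below is a single-split decision stump trained on a bootstrap resample with a randomly chosen split feature, and the importance of an index is the average drop in out-of-bag accuracy after that index is permuted. This is a sketch of the principle only, not the configuration used in the paper; all function names and the synthetic data are hypothetical.

```python
import random


def predict(stump, row):
    feature, threshold, above = stump
    return above if row[feature] > threshold else 1 - above


def accuracy(stump, rows, labels):
    return sum(predict(stump, r) == y for r, y in zip(rows, labels)) / len(rows)


def fit_stump(rows, labels, feature):
    """One-split 'tree': best threshold on a single feature for 0/1 labels."""
    best = (-1.0, 0.0, 1)  # (training accuracy, threshold, class above threshold)
    for threshold in {row[feature] for row in rows}:
        for above in (0, 1):
            acc = accuracy((feature, threshold, above), rows, labels)
            best = max(best, (acc, threshold, above))
    return (feature, best[1], best[2])


def oob_importance(rows, labels, n_trees=30, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    importance = [0.0] * n_features
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]      # bootstrap resample
        bag = set(in_bag)
        oob = [k for k in range(n) if k not in bag]        # out-of-bag samples
        if not oob:
            continue
        feature = rng.randrange(n_features)                # random split feature
        stump = fit_stump([rows[k] for k in in_bag],
                          [labels[k] for k in in_bag], feature)
        oob_rows = [rows[k] for k in oob]
        oob_labels = [labels[k] for k in oob]
        base = accuracy(stump, oob_rows, oob_labels)
        for f in range(n_features):
            column = [r[f] for r in oob_rows]
            rng.shuffle(column)                            # permute index f
            permuted = [r[:f] + [v] + r[f + 1:]
                        for r, v in zip(oob_rows, column)]
            drop = base - accuracy(stump, permuted, oob_labels)
            importance[f] += drop / n_trees
    return importance


# Synthetic data: index 0 determines the class, index 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(120)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]
importance = oob_importance(rows, labels)
ranking = sorted(range(len(importance)), key=lambda f: importance[f], reverse=True)
print(importance, ranking)
```

Indexes at the bottom of such a ranking are the candidates for deletion. A full random forest grows deeper trees and samples a feature subset at every node, which this sketch does not attempt.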

V. PORTRAIT MODEL OF COLLEGE STUDENTS' ONLINE LEARNING BEHAVIOR
According to the basic attribute, the data sample can be divided into four categories, i.e., freshman, sophomore, junior, and senior. Considering the large differences among college students in different grades, the characteristics and categories of the other three labels are analyzed within each grade to ensure the validity of the portrait analysis. The framework of the portrait model is shown in Figure 3.
Firstly, the K-means clustering method is applied to the indexes of each label in each grade to obtain the categories that each data sample belongs to. The categories of all labels constitute the whole portrait of college students' online learning behavior. Then, the LSTM algorithm is used to train the mapping mechanism between the data sample and its label categories. Finally, the portrait results of any college student can be quickly obtained through the trained mapping mechanism.

A. CLUSTERING METHOD FOR PORTRAIT MODEL
Although the online behavior of college students differs, there may be similarities within one label. Moreover, the number of choices for each question in the questionnaire is limited. Therefore, the K-means clustering algorithm can be used to cluster the data sample for each index in the labels. Several categories of each index are then obtained, and the categories of all labels constitute the college student's portrait result.
The main principle of the K-means clustering algorithm is as follows. Firstly, K samples are randomly selected from the sample set as cluster centers, and the distance between every sample and each of these K cluster centers is calculated [27]. Each sample is assigned to the cluster whose center is closest to it, and the center of each cluster is then recalculated from its members. The detailed process of using the K-means algorithm to cluster each label is shown as follows.
where $\mu$ is the set of cluster centers, $y_a$ is sample $a$, $C_a$ is the cluster that $y_a$ belongs to, $\mu_{C_a}$ is the cluster center of cluster $C_a$, and $A$ is the number of samples.
Step 3.1: For each sample $y_a$, assign it to the nearest center $\mu_{C_a^t}$.
Step 3.2: For each class, recalculate the class center.
where $b$ is the number of samples in class $C_a^t$.
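The assignment and update steps can be sketched as below. This is a minimal illustration of the standard K-means iteration, assuming squared Euclidean distance and a stopping rule of unchanged assignments (neither is stated explicitly in the text); the function names and the one-dimensional toy data are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(samples, k)      # K randomly selected samples as centers
    assignment = None
    for _ in range(max_iter):
        # Step 3.1: assign every sample to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(y, centers[c]))
            for y in samples
        ]
        if new_assignment == assignment:  # no sample changed cluster: converged
            break
        assignment = new_assignment
        # Step 3.2: each center becomes the mean of the b samples in its class.
        for c in range(k):
            members = [y for y, a in zip(samples, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Toy quantified answers for one index, forming two obvious groups.
samples = [[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]]
centers, assignment = k_means(samples, k=2)
print(sorted(round(c[0], 6) for c in centers), assignment)
```

Each update step replaces a center by the mean of its members, which cannot increase the sum of squared distances between samples and their assigned centers.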

B. PORTRAIT MODEL BASED ON LSTM ALGORITHM
For each grade, all categories that each data sample belongs to are obtained through the clustering method. For each label, the data samples with categories can be used as the training data set, and the LSTM algorithm can be applied to train the mapping mechanism between the data sample and categories. By training the LSTM algorithm, the mapping mechanisms of all labels are obtained. Together, these mapping mechanisms form a trained portrait tool, which can quickly obtain the portrait result of any college student's online learning behavior.
6322 VOLUME 12, 2024. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The LSTM algorithm was first proposed by Sepp Hochreiter and Jürgen Schmidhuber and is a specific form of recurrent neural network [28]. The LSTM algorithm has advantages in handling problems with dependencies, for example, when there are correlations among different indexes. A single LSTM unit has a complex internal structure with inputs and outputs, as shown in Figure 4. There are three gates in the LSTM unit, i.e., the forget gate, the input gate, and the output gate [29]. The main principles of the three gates are as follows: the forget gate determines how much of the previous state $C_{t-1}$ is forgotten, the input gate controls how much of the input $X_t$ is added to the current state, and the output gate determines how much of the current state $C_t$ is output. The corresponding equations are shown in (6)-(11).
where σ is the sigmoid function.
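For reference, the widely used standard LSTM formulation consists of exactly six equations and is consistent with the gate descriptions above. The block below is a hedged reconstruction rather than a copy of the paper's equations: the weight matrices $W$, the biases $b$, the hidden state $h_t$, the concatenation $[h_{t-1}, X_t]$, and the assignment of the numbers (6)-(11) to individual equations are all assumptions taken from the standard formulation.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, X_t] + b_f\right) \tag{6}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, X_t] + b_i\right) \tag{7}\\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, X_t] + b_C\right) \tag{8}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{9}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, X_t] + b_o\right) \tag{10}\\
h_t &= o_t \odot \tanh(C_t) \tag{11}
\end{align}
```

Here $f_t$, $i_t$, and $o_t$ are the forget, input, and output gate activations, $\odot$ denotes element-wise multiplication, and $\tilde{C}_t$ is the candidate state, whose entries lie in $[-1, 1]$ because of the $\tanh$.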

VI. CASE STUDY
In this section, a case study is analyzed to demonstrate the effectiveness of the proposed methods. Firstly, the basic data is introduced, and the results of data pre-processing are shown and analyzed. Then, the final multi-dimensional label system is obtained based on the filtering method and the random forest algorithm. Next, the portrait database is obtained by performing the K-means clustering algorithm. Finally, the trained portrait tools are obtained based on the portrait database, and one college student's portrait result is visualized.

A. BASIC DATA
A questionnaire containing 20 questions was distributed in the university to collect data. In total, 4605 questionnaires were collected, which are the basic data for analyzing the portrait of college students' online learning behavior. These questionnaires were completed by freshmen, sophomores, juniors, and seniors.

B. RESULTS OF DATA PRE-PROCESSING
In this subsection, examples of quantitative results are shown and the final number of valid questionnaire results is given. Additionally, the results of the reliability analysis are discussed to assess the effectiveness of the remaining data samples.

1) QUANTITATIVE RESULTS OF QUESTIONNAIRE
Through the proposed quantitative methods, the choices of all questions can be quantified. Table 2 shows the results for one single-choice question and one multiple-choice question. The question about grade has four choices, i.e., freshman, sophomore, junior, and senior. Thus, the quantitative results are 0, 1, 2, and 3, respectively. The question 'What concerns you most about this form of learning with learning software?' is a multiple-choice question, and its choices are quantified by binary variables. '1011' in Table 2 means the answer is 'Effectiveness, Easy to use, and Cost'.

2) DELETED DATA
It is evident that the question 'The way of surfing the Internet' and the question 'Time spent online' are related. If a college student chooses a way to surf the Internet in the former question but chooses 'Not get online' in the latter question, a conflict exists, and the data sample should be deleted. Additionally, there is a fuzzy choice in questions 2, , and the data sample with fuzzy choices for all of these eight questions is deleted based on the guidelines for deleting invalid data. Table 3 shows the number of deleted data. It is clear that more data is deleted due to conflicting answers than due to fuzzy choices for all questions.
TABLE 3. The number of deleted data.

3) RESULTS OF RELIABILITY ANALYSIS
On the basis of the data sample after deleting invalid data, the Cronbach-α based on the standardized term is calculated by (1)∼(2), and its value is 0.872. This value lies between 0.8 and 0.9, so by the criteria in Section III-C the reliability of the questionnaire is acceptable and the data can provide stable results for subsequent analysis.

C. FINAL MULTI-DIMENSIONAL LABEL SYSTEM
The final indexes of online behavior attributes and behavioral attributes of using learning software are obtained by the filtering method and shown in this subsection. The dimension of the psychological attributes of AI learning tools is reduced by the random forest algorithm, improving the computation efficiency.

1) ONLINE BEHAVIOR ATTRIBUTES OBTAINED BY FILTERING METHOD
The filtering method is applied to the data samples of each grade's online behavior attributes, and the results are shown in Table 4. It is clear that the variance of index 2, 'The way of surfing the Internet', is small compared to that of the other two indexes. This shows that the answers to this index are mostly the same in the data sample. Therefore, index 2 has little effect on the portrait result and is deleted from the multi-dimensional label system.

2) BEHAVIORAL ATTRIBUTES OF USING LEARNING SOFTWARE OBTAINED BY FILTERING METHOD
Similarly, the filtering method is applied to the data samples of each grade's behavioral attributes of using learning software. Table 5 shows the variance of each index in different grades. It can be seen that the variances of index 5 and index 7 are smaller than those of the other indexes. Therefore, index 5 and index 7, which have little effect on the portrait results, are deleted from the multi-dimensional label system.

3) PSYCHOLOGICAL ATTRIBUTES OF AI LEARNING TOOLS OBTAINED BY RANDOM FOREST ALGORITHM
The random forest algorithm is used to perform unsupervised classification training for each grade's database, and the importance of each index to the classification result is obtained, as shown in Figure 5. It is evident that the importance of index 10, index 14, index 16, index 17, and index 18 is smaller than that of the other indexes, which shows that these indexes have little effect on the classification result. Therefore, these indexes are deleted from the psychological attributes of AI learning tools.
After deleting these indexes, the final multi-dimensional label system is obtained, as shown in Table 6. The subsequent analysis is performed based on it.

D. PORTRAIT DATABASE
The K-means algorithm is performed based on the final multi-dimensional label system. Figure 6, Figure 7, and Figure 8 show the clustering results of part of the indexes for freshmen. It can be seen that there is a large gap between the various categories, indicating the effectiveness of clustering. In addition, the cluster centers for each grade are shown in Table 7, where the cluster centers represent the characteristics of each category. It is evident that the cluster centers of some indexes are the same in different grades, namely index 3, index 8, index 11, index 12, index 13, index 15, and index 19. However, the cluster centers of index 4, index 6, index 9, and index 20 differ between grades. The reason could be that college students of different grades have different needs for learning tools and different levels of understanding of AI technology.
According to the characteristics of the clustering centers of each index, the categories of each label and the portrait database are obtained, as shown in Figure 9. There are four categories for the basic attribute, i.e., freshman, sophomore, junior, and senior. For online behavior attributes, there are six categories based on its two indexes, i.e., [shorter time spent online per week, mainly used for recreational amenities], [shorter time spent online per week, mainly used for recreational amenities and learning], [moderate time spent online per week, mainly used for recreational amenities], [moderate time spent online per week, mainly used for recreational amenities and learning], [longer time spent online per week, mainly used for recreational amenities], and [longer time spent online per week, mainly used for recreational amenities and learning]. Similarly, there are 12 categories for behavioral attributes of using learning software and 16 categories for psychological attributes of AI learning tools, which are shown in Figure 9. Note that index 11, index 12, index 13, and index 15 are integrated into one new index, i.e., attitude to AI learning tools, based on the characteristics of their clustering centers.
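The six online behavior categories are the combinations of three levels of time spent online with two kinds of use. A small sketch of this enumeration follows; the variable names are hypothetical, and the level names are paraphrased from the category list above.

```python
from itertools import product

# Cluster levels of the two retained online behavior indexes.
time_levels = ["shorter", "moderate", "longer"]
purposes = ["recreational amenities", "recreational amenities and learning"]

categories = [
    f"[{level} time spent online per week, mainly used for {purpose}]"
    for level, purpose in product(time_levels, purposes)
]
for category in categories:
    print(category)
```

The same construction extends to labels with more indexes, where the number of categories is the product of the number of cluster levels per index.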
In addition, these data samples with categories are used as the basic data to train the LSTM. Finally, the corresponding mapping mechanism between the original data and the category is obtained.

E. PORTRAIT RESULTS OF A COLLEGE STUDENT
According to the trained mapping mechanism, the portrait results of a college student are obtained based on his questionnaire. The portrait results are shown in Figure 10. It can be seen that the college student is a freshman. He spends a longer time online per week and mainly uses the Internet for recreational amenities. In terms of the behavioral attributes of using learning software, he has an active attitude toward using learning software; he thinks the advantages of learning software compared to classroom teaching are that it allows repeated learning and contains comprehensive information; and his concern about this form of learning is the learning results. Finally, he has the idea of using AI learning tools to finish homework, quizzes, and papers, and he agrees with the use of AI learning tools by college students. He thinks an important aspect of AI learning tools is the timeliness of knowledge. Additionally, he thinks AI learning tools should have good performance and rich knowledge. The portrait results indicate that the college student is a potential customer for learning software and AI learning tools.

VII. CONCLUSION
In this paper, a portrait of college students' online learning behavior based on AI technology is proposed, which contains three steps, i.e., data pre-processing, building a multi-dimensional label system, and the portrait model of college students' online learning behavior. Through the case study, a multi-dimensional label system with 12 indexes is obtained. There are 4 categories for the basic attribute, 6 categories for online behavior attributes, 12 categories for behavioral attributes of using learning software, and 16 categories for psychological attributes of AI learning tools. We found that college students' answers to some questions are similar, but students in different grades have different attitudes toward the same question. Therefore, AI learning tools with different functions can be designed for different grades. Moreover, according to the clustered data, the portrait database and the corresponding mapping mechanism are obtained, which can be used to analyze the characteristics of students. A new college student's portrait results are obtained based on the mapping mechanism, and we can conclude that he is a potential customer of AI learning tools. This example also illustrates the effectiveness of the proposed model. In future work, more indexes and labels will be analyzed to enrich the portrait database of college students.

FIGURE 1. Process of college students' online learning behavior portrait.

FIGURE 2. Process of random forest algorithm.

FIGURE 5. Importance of each index in psychological attributes of AI learning tools.

TABLE 6.

FIGURE 6. The clustering results of index 'What to do online' for freshman.

FIGURE 7. The clustering results of index 'What do you think are the advantages of learning software compared to classroom teaching?' for freshman.

FIGURE 8. The clustering results of index 'Do you agree with the use of AI learning tools by college students?' for freshman.


FIGURE 9. The portrait database obtained by clustering method.

FIGURE 10. The portrait results of one college student.

TABLE 1. Labels and their indexes in multi-dimensional label system.

$b_o$ is the parameter of forget gate, input gate, and output gate, respectively. $\tilde{C}_t$ is a new state candidate vector, and its value range is [−1, 1].

TABLE 2. Examples of quantization result.

TABLE 4. The variance of indexes in online behavior attributes.

TABLE 5. The variance of indexes in behavioral attributes of using learning software.

TABLE 7. Cluster centers of each grade.