Discover Customers’ Gender From Online Shopping Behavior

Gender information is important for recommendation systems on online shopping websites. However, gender data often suffer from missing and incorrect labels because consumers are unwilling to actively disclose personal information, so gender estimation results cannot meet the needs of the product recommendation system. To discover customers’ gender, we explore customers’ online shopping behavior, especially the items viewed in a shopping session, using the dataset provided by Vietnam FPT Group. The dataset is highly imbalanced: the number of female samples is about $3\times$ the number of male samples. To address the imbalance, we cluster the female samples into three subsets and then train a two-layer classifier model to estimate the customers’ gender. Experimental results demonstrate that the proposed method achieves a combined accuracy of 78% on average and takes less than 6 seconds on average. As a data mining model for gender prediction, our approach has a lightweight network structure and low time consumption.


I. INTRODUCTION
We have witnessed the rapid growth of online shopping in the recent decade. As COVID-19 hit the world, more and more customers preferred shopping online to visiting stores in person. Online shopping websites make large profits from their customers and also learn from customers' behavior to improve the shopping experience [1]- [4]. One important feature of a customer is gender. If the shopping website knows the customer's gender, it can make more accurate recommendations. For instance, when a customer searches for a keyboard, the results might differ by gender. If the customer is male, the probability of purchasing a mechanical keyboard is higher than for a female customer.
Despite the importance of gender information, it is hard for shopping websites to collect it because of privacy concerns. Sometimes customers even fill in the wrong gender on the shopping website to protect their privacy. With our proposed model, we can estimate a customer's gender, which is very beneficial for improving the performance of the recommendation system in shopping websites. Beyond the recommendation system, another application of our design is testing the privacy protection mechanism of a shopping website. For example, if a shopping website wants to demonstrate the performance of its privacy protection mechanism, it could run our design on its customers' data. If our design outputs the correct gender, the website's privacy protection mechanism does not work well. Otherwise, the website is good at protecting its customers' privacy.
The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro.
Nevertheless, estimating customers' gender from their viewing logs is not trivial. Datasets containing gender information are rarely available from public sources because gender information is sensitive with respect to customers' privacy. Our design is based on the consumers' viewing log provided by Vietnam FPT Group, a leading information and communication enterprise that operates a number of B2B2C services. In general, online shopping log data include information such as the buyer's product browsing and purchasing activities and the seller's product portfolio. The technical challenges in this paper are twofold. (1) The data are imbalanced. In the FPT Group's dataset, the number of female samples is three times the number of male samples. Straightforwardly applying an existing data mining model results in very poor estimation. (2) The dataset is small. The FPT Group's dataset is only a few megabytes and contains fewer than 20,000 samples. It is hard to apply deep learning models to such a small dataset. In this paper, we propose a data mining model, consisting of clustering, decision tree, and random forest models, to overcome these challenges and discover customers' gender from their behavior, especially which products are viewed in a shopping session. We observe that, despite the imbalanced male/female samples, we can further divide the female samples into 3 subsets via clustering, which suggests that the customers' gender is related not only to their physical gender but also to their psychological gender. The main contributions of this article are as follows:
• We discover the correlation between personality diversity and gender in online shopping behavior, and explain the characteristics of customer shopping behavior in a specific web browsing log data set. These features are combined into feature combinations that serve as candidates for gender classification.
• We use personality diversity and data visualization to solve the problem of sample imbalance in the FPT group's online shopping behavior dataset. Based on the balanced sample set, the optimal classifier is selected for each layer of the designed gender classification network to improve the performance of gender classification.
• We conduct experiments on the data set provided by FPT Group and obtain an estimation accuracy of 78% in less than 6 seconds. The results demonstrate that the proposed gender classification model is lightweight and efficient.

II. RELATED WORK
In recent years, researchers have recognized the importance of gender information and regard it as a key factor in online shopping recommendation [5], [6]. Empirical research shows that gender, as a typical feature of online shopping behavior that promotes customer demand analysis [7] and the development of personalized recommendation technologies [8], plays an important role in increasing customer satisfaction [9] and the performance of online shopping recommendation systems [10]. Chen et al. [11] analyzed the moderating effect of gender on customers' shopping behavior based on a benefit-risk paradigm model, and found that gender has a significant impact on online shopping willingness. Sohab et al. [12] studied the moderating effect of consumer cognitive innovation on how iTrust (interpersonal trust) influences online purchase intention for new products, and found that gender information is helpful for the product display design of online websites. Lin et al. [13] studied gender differences in customers' online shopping psychology and behavior, and showed that gender information can promote the improvement and benefits of online shopping websites. Because gender information is essential to product recommendation performance, some researchers have proposed personalized recommendation algorithms or techniques based on gender information to improve online shopping recommendation systems, for example, Liu et al. [14], Karthik and Ganapathy [15], Hammou et al. [16], Wu and Yu [17] and Liu and Wei [18]. These personalized recommendation algorithms and technologies provide many references for online shopping companies to improve their recommendation systems in time.
It is worth noting that any personalized recommendation algorithm or technology must be based on customers' real gender information to be effective. This is because accurate customer gender [19] makes it possible to better discover customer preferences [5] and improve the accuracy of product recommendations [6], thereby promoting the development of gender marketing [20], [21] and increasing online merchants' income [22], [23]. For this reason, many researchers have studied technologies and methods for collecting customers' gender information [24]- [26]. Gender information can be collected through questionnaires [27], recruited volunteers [28], [29] and the information registered by the user [30]. Nonetheless, the gender information collected through these methods is far from sufficient for the online shopping recommendation system [31]. Moreover, customers may not want to actively disclose private information [32], so they ignore or randomly select gender during website registration and account privacy settings [33], [34], resulting in incomplete and untrue gender information in the online shopping system [35]. Therefore, estimating customers' gender becomes necessary. Unfortunately, although advanced algorithms and technologies for personalized product recommendation have been proposed, it is very hard for online shopping companies to change this situation completely, because the cost of hiring people to check consumers' gender is unaffordable. Thus, it is urgent to find an effective approach to estimate the true gender of consumers in online shopping recommendation systems. Some papers also use facial recognition methods to detect users' gender [57], [58]. However, these methods only work if the users permit access to their cameras.
In contrast, our paper aims at discovering customers' gender from their online shopping behavior, instead of their photos, preserving the customers' privacy.
Although little research addresses mining customers' gender when the gender data in the online shopping system are unreliable, many models based on mining customers' online shopping browsing and purchase logs have been proposed to estimate customers' gender. Zhou et al. [36] used the RFMT model to derive 7 characteristic customer clusters from a large dataset retrieved from a global retailer's website, and estimated customers' gender and personalized product preferences by cluster analysis. Wan et al. [37] modeled large-scale online shopping transaction logs to mine consumers' personalized preferences for gender estimation. However, these approaches mainly rely on the analysis of users' click behaviors and ignore the diversity of female and male personalities as well as the sample imbalance issue, so they may not be reliable and accurate.
In summary, the research on gender estimation mainly focuses on the impact of customers' shopping behavior and personalized preferences on models [38]- [40]. To ensure the effectiveness of the model, the authenticity of gender data in the online shopping system is critical. Although gender information can reflect customers' shopping behavior and preferences [41]- [43], it is impossible to identify the fake gender information that users registered [44]- [47]. Missing or fake gender data make the gender estimation model unreliable and prevent the recommendation system from providing consumers with the products they need most [48], [49]. If this continues, the performance of the online shopping recommendation system will decline, which will also affect the economic benefits of e-commerce companies.

III. GIVEN DATA FORMAT AND RESULTS MEASUREMENT
The data we study are customers' online product browsing logs for a specific time period, provided by Vietnam FPT Group. The training data, their corresponding gender labels, and the test data all come from the PAKDD'15 data mining contest website, which also announced the final results and rankings of the competition. In this paper, these data are used as the training set and the test set, and gender estimation is performed on this basis. The average combined accuracy of our gender estimation can therefore be located in the competition results published on this website. The data set format is as follows. The training data set contains 11,703 female samples and 3,297 male samples, with the corresponding 15,000 gender labels. The number of products in our dataset is 36,634: category A has 11 products, category B has 91 products, category C has 440 products, and category D has 36,092 products. Its file size is 1651 KB. The test data set contains 15,000 samples without gender labels and is stored in a file of 1639 KB. Since the data format of each sample is the same in the two data sets, 4 samples are randomly selected from the training set for display, as shown in Table 1. Each sample represents a customer's viewing session and contains 5 columns. The first 4 columns are ''Session ID'', ''Start Time'', ''End Time'' and ''Product IDs'', and the last column is ''Gender Label''. The ''Session ID'' column is the session ID, the ''Start Time'' column is the session start time, the ''End Time'' column is the session end time, the ''Product IDs'' column lists the product IDs viewed by the consumer, and the ''Gender Label'' column is the customer's gender. In the ''Product IDs'' column, there are 4 categories of IDs. The most general product categories are represented by the IDs beginning with 'A'. The product IDs beginning with 'B' and 'C' are the subcategories and sub-subcategories of the products, respectively. The product IDs beginning with 'D' form the fourth category, corresponding to individual products. The data used in this paper only show the items the customers viewed; it is unknown whether an item was purchased. Further information about the items, such as price, is also unknown. We predict gender using minimal information from the customers, which indicates the potential of applying our design in more restricted scenarios.
The vectors ''predict'' and ''actual'' represent the predicted results of this paper and the true gender labels, respectively. The variables $ACC_m$ and $ACC_f$ represent the accuracy of predicting male and female, respectively. The integer 0 represents the male label, and the integer 1 represents the female label. The results are then measured as

$$ACC_m = \frac{\left|\{\, i : predict_i = 0 \wedge actual_i = 0 \,\}\right|}{\left|\{\, i : actual_i = 0 \,\}\right|} \qquad (1)$$

and

$$ACC_f = \frac{\left|\{\, i : predict_i = 1 \wedge actual_i = 1 \,\}\right|}{\left|\{\, i : actual_i = 1 \,\}\right|}. \qquad (2)$$

Since the distribution of female and male labels is imbalanced, the results of gender prediction are measured using the ''Combined Accuracy (CA)''. According to (1) and (2), the combined accuracy is defined as

$$CA = \frac{ACC_m + ACC_f}{2}. \qquad (3)$$

In summary, the study of the training data format shows that each sample contains the ID of the browsing session, the start time, the end time, and the viewed product IDs. These samples are similar to each other. It is difficult to obtain a predicted label that matches the true gender label directly from the first 4 columns. This also means that the correlation between the training sample data and the training sample labels is low, which would reduce the generalization ability of the resulting prediction model. In addition, female samples account for about 75% of the training set while male samples account for about 25%, which leads to sample imbalance and further reduces the generalization ability of the prediction model. To reduce the impact of sample imbalance on the evaluation of gender prediction, the CA measurement is used to measure the results of gender estimation.
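The measurement described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes that the combined accuracy is the unweighted mean of the male and female accuracies, and the function names are hypothetical.

```python
def per_class_accuracy(predict, actual, label):
    """Fraction of samples whose true label is `label` that are predicted as `label`."""
    total = sum(1 for a in actual if a == label)
    if total == 0:
        raise ValueError(f"no samples with label {label}")
    correct = sum(1 for p, a in zip(predict, actual) if a == label and p == label)
    return correct / total


def combined_accuracy(predict, actual):
    """Assumed form of CA: mean of male (label 0) and female (label 1) accuracy."""
    acc_m = per_class_accuracy(predict, actual, 0)
    acc_f = per_class_accuracy(predict, actual, 1)
    return (acc_m + acc_f) / 2


# A classifier that always answers "female" on a 3:1 female/male sample.
actual = [1, 1, 1, 0]
always_female = [1, 1, 1, 1]
print(combined_accuracy(always_female, actual))  # 0.5
```

Under this assumed definition, always predicting the majority class scores 0.5 even though 75% of the individual predictions are correct, which is why a per-class measure suits an imbalanced sample.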

IV. GENDER MINING MODEL SOLUTION
A. FEATURE EXTRACTION AND CANDIDATE FEATURE COMBINATIONS
To address the low correlation between the training sample data and the training sample labels, meaningful features should be defined to describe each session. For the i-th session $s_i$, let $product\_ID(s_i)$ denote the viewed product IDs, $t_s^{(s_i)}$ the start time, and $t_e^{(s_i)}$ the end time. The features are defined as follows:
Definition 1: F1 is the number of products viewed. In the i-th session, F1 is $|product\_ID(s_i)|$.
Definition 2: F2 is the average time spent on each viewed product. In the i-th session, F2 is $(t_e^{(s_i)} - t_s^{(s_i)}) / |product\_ID(s_i)|$.
Definition 3: F3 is the start time of the session. In the i-th session, F3 is $t_s^{(s_i)}$.
Definition 4: F4 is the maximum ID among the subcategory ('B' category) products. In the i-th session, F4 is the maximum over the 'B' category IDs in $product\_ID(s_i)$.
Definition 5: F5 is the maximum ID among the sub-subcategory ('C' category) products. In the i-th session, F5 is the maximum over the 'C' category IDs in $product\_ID(s_i)$.
As shown in Fig. 1, we first clean the training data set. Then, we use Definition 1 to Definition 5 for feature extraction. After that, feature selection is conducted on the extracted features. Finally, the female set and male set defined by the combination of the selected features constitute the training set.
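The five definitions can be illustrated with a short Python sketch. Several details are assumptions made only for this example and are not taken from the data set: the timestamp layout, the representation of each viewed product as a path such as "A00002/B00003/C00006/D28435", the encoding of F3 as seconds since midnight, and the reading of "maximum" as the numerically largest ID.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def extract_features(start_time, end_time, product_ids):
    """Return (F1, F2, F3, F4, F5) for one viewing session.

    `product_ids` is assumed to be a list of category paths,
    one per viewed product, e.g. "A00002/B00003/C00006/D28435".
    """
    if not product_ids:
        raise ValueError("a session must contain at least one viewed product")
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)

    f1 = len(product_ids)                       # number of products viewed
    f2 = (end - start).total_seconds() / f1     # average seconds per viewed product
    # Assumption: the start time is encoded as seconds since midnight.
    f3 = start.hour * 3600 + start.minute * 60 + start.second

    b_ids, c_ids = [], []
    for path in product_ids:
        for part in path.split("/"):
            if part.startswith("B"):
                b_ids.append(int(part[1:]))
            elif part.startswith("C"):
                c_ids.append(int(part[1:]))
    # Assumption: 0 stands in when a session has no 'B' or 'C' ID.
    f4 = max(b_ids, default=0)
    f5 = max(c_ids, default=0)
    return f1, f2, f3, f4, f5


session = extract_features(
    "2014-11-14 00:02:14",
    "2014-11-14 00:02:20",
    ["A00001/B00001/C00001/D00001", "A00001/B00004/C00018/D00002"],
)
print(session)  # (2, 3.0, 134, 4, 18)
```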
Then, we perform feature selection based on these 5 feature definitions and decide which features to use in our approach. A random forest is an ensemble classifier composed of a set of decision tree classifiers [50]. For a given training sample set X, the random forest trains K decision tree classifiers and lets them vote, and the prediction for each sample is determined by majority voting.
In the decision tree generation process, a tree node splits into left and right sub-trees according to the selected optimal attribute, which is chosen by comparing all candidate attributes. In this paper, the CART (Classification And Regression Tree) algorithm [51] based on Gini coefficient splitting is used to generate each decision tree. Specifically, when a node is split, the CART algorithm first calculates the Gini coefficient of the two subsets produced by splitting on each attribute. Then, it selects the attribute that minimizes the Gini coefficient to split the node into left and right subnodes. Finally, the decision tree is constructed recursively. To save space, the details of our calculation process for the CART algorithm are given in Appendix VI.
After the CART algorithm is executed, the majority voting method is used to combine all the decision tree classifiers in the resulting random forest. Assume that the random forest contains K decision tree classifiers h 1 , h 2 , . . . , h K , and that the output of classifier h k for an input sample x is h k (x). For the customer gender classification task in this article, the decision tree classifier h k predicts a category tag from the category tag set {c 1 , c 2 , . . . , c N }. The detailed steps of the random forest are given in Appendix VI.
After the random forest classifier is trained on the training set X, it can perform category prediction on the test set. Therefore, the random forest classifier is suitable for gender classification of feature combinations. Because 32 combinations of these 5 features need to be classified by gender, the random forest is run on each combination to obtain the gender classification result. The feature combinations with a combined prediction accuracy of more than 50% are selected as candidate feature combinations for further research.
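The split criterion can be made concrete with a small Python sketch. It computes the Gini impurity (the quantity the text calls the Gini coefficient) of a binary split on one numeric attribute and picks the threshold that minimizes the weighted impurity. The function names are hypothetical, and the sketch covers a single split, not the full recursive tree construction.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Threshold with the lowest weighted Gini impurity, or None if no split is possible."""
    # Splitting at the largest value would leave the right side empty, so skip it.
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None
    return min(candidates, key=lambda t: split_gini(values, labels, t))


values = [1, 2, 3, 4]
labels = [0, 0, 1, 1]
print(gini(labels))                    # 0.5
print(best_threshold(values, labels))  # 2
```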
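Majority voting over the K trees can be sketched as follows. The trees are represented here as plain Python callables, and ties are broken toward the smallest label; both choices are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the smallest label (assumption)."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def forest_predict(trees, x):
    """Combine the outputs h_k(x) of all trees by majority voting."""
    return majority_vote([h(x) for h in trees])


# Three hypothetical stand-in "trees": two vote female (1), one votes male (0).
trees = [lambda x: 1, lambda x: 0, lambda x: 1]
print(forest_predict(trees, x=None))  # 1
```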
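The enumeration and filtering of feature combinations might look like the sketch below. It lists every non-empty subset of the five features, which gives 31 combinations; the count of 32 in the text presumably also includes the empty set. The `scores` mapping is a hypothetical input holding a combined accuracy per combination, and the values in the example are made up.

```python
from itertools import combinations

FEATURES = ["F1", "F2", "F3", "F4", "F5"]


def feature_combinations(features):
    """All non-empty subsets of the feature list, smallest subsets first."""
    combos = []
    for size in range(1, len(features) + 1):
        combos.extend(combinations(features, size))
    return combos


def select_candidates(scores, threshold=0.5):
    """Keep the combinations whose combined accuracy exceeds the threshold."""
    return [combo for combo, ca in scores.items() if ca > threshold]


combos = feature_combinations(FEATURES)
print(len(combos))  # 31

# Made-up scores, only to show the filtering step.
example_scores = {("F1",): 0.4, ("F3", "F4"): 0.6}
print(select_candidates(example_scores))  # [('F3', 'F4')]
```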

B. CLUSTERING BASED ON PERSONALITY DIVERSITY
There is significant personality overlap between genders [53], and customers have diverse personalities [54], [55]; that is, different personalities also exist among female customers. Therefore, under normal circumstances, there is a large amount of personality diversity among female customers, and this diversity can be reflected by different feature combinations. In the female set, samples that share a certain type of personality naturally gather into a cluster. In the given data set, the female sample is about three times as large as the male sample. If a combination of features clearly reflects three personalities, then the female set can naturally be clustered into three clusters. We then count the number of samples in each cluster after clustering. If the numbers of samples in the three clusters are almost equal, the corresponding feature combination is the one to select. If there are multiple such feature combinations, we select the one whose three cluster sizes are closest to each other.
A variant of the traditional K-Means algorithm is the Mini Batch K-Means (MBKM) algorithm, which uses a mini batch of data subsets obtained by random sampling in each iteration to update the centroid. Empirical research shows that this algorithm can increase the calculation speed of the clustering process when the sample volume is big, and effectively reduce the algorithm convergence time [56]. Since the training data set contains 15000 samples, it is a big sample set that can be clustered by this algorithm. The clustering process is shown in Fig. 2.
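To illustrate the centroid update, the following pure-Python sketch implements one common formulation of mini-batch K-Means, in which each center keeps a count of the samples assigned to it and moves toward each new sample with step size 1/count. It is a simplified illustration, not the implementation used in the experiments: the initial centers are passed in explicitly, and all parameter names and defaults are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(centers, x):
    """Index of the center closest to point x."""
    return min(range(len(centers)), key=lambda j: squared_distance(centers[j], x))


def mini_batch_kmeans(points, init_centers, batch_size=4, iterations=50, seed=0):
    """Update the centers from small random batches instead of the full data set."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assignments = [nearest(centers, x) for x in batch]
        for x, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * v for c, v in zip(centers[j], x)]
    return centers


# Two well-separated toy blobs, one around (0.5, 0.5) and one around (10.5, 10.5).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = mini_batch_kmeans(points, init_centers=[(0, 0), (11, 11)])
print(centers)
```

Because every update is a convex combination of the current center and an assigned sample, each center in this toy example stays inside its own blob. In practice a library implementation such as scikit-learn's `MiniBatchKMeans` would normally be used instead.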
This paper uses the MBKM method to cluster the female set defined by each feature combination in Table 2, and the corresponding clustering results are shown in Table 3. We split the data sets into 3 clusters because the number of female samples is 3 times of the male samples. To balance the female and male samples in the training set, we consider 3 cluster centers in the subsets generation.
From Table 3, the number of samples in each cluster after the female set defined by the feature combination F3&F4&F5 is clustered are 3797, 4079 and 3827, respectively, and their ratio is close to 1:1:1. In other words, by selecting the feature combination F3&F4&F5, the sample size of each female subset is almost equal to the sample size of the male set. However, after clustering the female set defined by other feature combinations, the proportion of sample size in each cluster is obviously not as good as that of F3&F4&F5. Hence, the imbalance problem of a given training set can be solved preliminarily.
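The balance criterion can be expressed as a simple score. The sketch below uses the ratio of the largest to the smallest cluster size, which is one possible measure and not necessarily the one used by the authors. The sizes for F3&F4&F5 are those reported in the text, while the second entry is a made-up counterexample.

```python
def imbalance_ratio(cluster_sizes):
    """Largest cluster size divided by the smallest; 1.0 means perfectly balanced."""
    if min(cluster_sizes) == 0:
        return float("inf")
    return max(cluster_sizes) / min(cluster_sizes)


def most_balanced(size_table):
    """Name of the feature combination whose clusters are closest in size."""
    return min(size_table, key=lambda name: imbalance_ratio(size_table[name]))


size_table = {
    "F3&F4&F5": [3797, 4079, 3827],     # cluster sizes reported in the text
    "hypothetical": [9000, 2000, 703],  # made-up sizes for comparison
}
print(round(imbalance_ratio(size_table["F3&F4&F5"]), 3))  # 1.074
print(most_balanced(size_table))                          # F3&F4&F5
```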
We further confirm whether the feature combination F3&F4&F5 clearly reflects three distinct personalities. Combining Table 2 and Table 3, the four feature combinations with the highest combined accuracy are F3&F4&F5, F1&F4&F5, F2&F4&F5, and F1&F2&F5, in that order. We then draw them in 3D, and the corresponding 3D data visualization views are shown in Fig. 3. Fig. 3 shows that the female set defined by the feature combination F3&F4&F5 is divided into 3 subsets that are clearly separated from each other. The female subsets portrayed by the other feature combinations are not independent of each other, the gaps between them are not obvious, and they cannot clearly reflect three independent personalities. Therefore, this paper selects F3&F4&F5 as the feature combination.
Then, the female set defined by the feature combination F3&F4&F5 can be divided almost evenly into three female subsets. After that, as Fig. 4 depicts, we merge the male set with each female subset to generate a balanced training set, obtaining three such balanced training sets TS1, TS2 and TS3. In this way, the problem of sample imbalance is solved.
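The construction of TS1, TS2 and TS3 can be sketched as follows. The function and argument names are hypothetical, and the cluster labels are assumed to come from the clustering step, with one label in {0, 1, 2} per female sample.

```python
def build_balanced_sets(male_samples, female_samples, female_cluster_labels, n_clusters=3):
    """Merge the full male set with each female cluster into one training set.

    Returns a list of (features, labels) pairs, with 0 = male and 1 = female.
    """
    training_sets = []
    for k in range(n_clusters):
        subset = [x for x, lab in zip(female_samples, female_cluster_labels) if lab == k]
        features = list(male_samples) + subset
        labels = [0] * len(male_samples) + [1] * len(subset)
        training_sets.append((features, labels))
    return training_sets


# Toy example: 2 male samples, 6 female samples spread over 3 clusters.
males = [[1.0], [2.0]]
females = [[10.0], [11.0], [20.0], [21.0], [30.0], [31.0]]
clusters = [0, 0, 1, 1, 2, 2]
ts1, ts2, ts3 = build_balanced_sets(males, females, clusters)
print(ts1)  # ([[1.0], [2.0], [10.0], [11.0]], [0, 0, 1, 1])
```

With 3,297 male samples and female clusters of 3,797, 4,079 and 3,827 samples, each resulting set would be roughly 45% to 46% male, compared with about 22% in the original training set.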

C. A TWO-LAYER GENDER CLASSIFICATION MODEL
Based on the three balanced training sets obtained, a two-layer gender classification network is designed as the gender estimation model. This model is shown in Fig. 5.
The hidden layer of the network consists of three classifiers, and the output layer consists of one classifier. Each classifier in the hidden layer and the output layer can use any typical classification algorithm, such as random forest, SVM, decision tree, and Gaussian NB. On this basis, the classifier $C^{(2)}_1$ with the best classification result is trained to form the output layer of the network. This also means that the best classifier has been selected for each layer. Thereby, a two-layer gender classification model is obtained.
Finally, we could use the second-level classifier model to make the final gender decision as shown in Fig. 6.
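The two-layer structure can be sketched with any classifiers that expose `fit` and `predict`. In the sketch below, `LookupClassifier` is a hypothetical stand-in for a real classifier such as a random forest, and training the output classifier on the hidden-layer predictions for all three training sets is an assumption, since the text does not spell out this step.

```python
from collections import Counter, defaultdict


class LookupClassifier:
    """Hypothetical stand-in: memorizes the majority label for each feature row."""

    def fit(self, X, y):
        votes = defaultdict(Counter)
        for row, label in zip(X, y):
            votes[tuple(row)][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(tuple(row), self.default) for row in X]


class TwoLayerModel:
    """Three hidden classifiers, one per balanced set, feeding one output classifier."""

    def __init__(self, hidden, output):
        self.hidden = hidden
        self.output = output

    def _hidden_outputs(self, X):
        columns = [clf.predict(X) for clf in self.hidden]
        return [list(row) for row in zip(*columns)]

    def fit(self, training_sets):
        for clf, (X, y) in zip(self.hidden, training_sets):
            clf.fit(X, y)
        # Assumption: the output node is trained on the hidden-layer
        # predictions for the samples of all balanced training sets.
        all_X = [row for X, _ in training_sets for row in X]
        all_y = [label for _, y in training_sets for label in y]
        self.output.fit(self._hidden_outputs(all_X), all_y)
        return self

    def predict(self, X):
        return self.output.predict(self._hidden_outputs(X))


toy_sets = [([[0], [1]], [0, 1]) for _ in range(3)]
model = TwoLayerModel([LookupClassifier() for _ in range(3)], LookupClassifier())
model.fit(toy_sets)
print(model.predict([[0], [1]]))  # [0, 1]
```

Keeping the nodes duck-typed means any of the candidate algorithms named in the text could be dropped in for a node without changing the wrapper.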

D. SUMMARY
In summary, the approach of this paper includes the following steps: 1) Feature extraction and selection. We first read the training set file and check whether there are samples with [...] $C^{(1)}_1$, $C^{(1)}_2$ and $C^{(1)}_3$, respectively, and use them as the nodes of the first-layer network. According to the output of each node of the first-layer network, we design a new classifier $C^{(2)}_1$ as the second-layer network node to make the final gender decision. Because the proposed method is independent of the specific classifiers, representative classification algorithms such as random forests, SVM, decision tree and Gaussian Naive Bayes can be selected as candidate classifiers. In other words, we can select a classifier from the candidate classifiers for any node in each layer; for example, $C^{(1)}_1$ may use a decision tree while $C^{(2)}_1$ adopts random forests. Therefore, for the combination $C^{(1)}_1$&$C^{(1)}_2$&$C^{(1)}_3$&$C^{(2)}_1$, we can get different combinations of classification algorithms. Then, the combination with the highest combined accuracy of gender estimation is selected, and each classification algorithm of the combination is used as the classifier of the corresponding node. Finally, we train the two-layer classifier network and make the final gender decision.

A. FEATURE COMBINATION SELECTION
We select, from the 5 features defined in Section IV-A, the clustering features that best reflect the correlation between the training data and the gender labels. As Fig. 7 shows, the feature combination F3&F4&F5 selected in this paper produces three clear clusters, which further indicates the diversity of female personality and divides the female set into three clusters with approximately the same number of samples. Specifically, clusters 1, 2 and 3 produced by the feature combination F3&F4&F5 contain 3797, 4079 and 3827 samples, respectively. Compared with the female clusters produced by other feature combinations, the three clusters of F3&F4&F5 are the most balanced in size. This is because other feature combinations assign a large number of samples to specific subsets, resulting in a large imbalance between subsets.
To verify the effect of feature combination selection, we use the MBKM clustering of the scikit-learn package in Python to divide the female set into three clusters, and record the number of samples in each cluster. According to Table 3, the cluster sizes of the female set may differ by a factor of more than 14. In other words, the feature combination selection has a direct impact on the number of samples in each cluster of the female set. Therefore, the most balanced combination F3&F4&F5 is suitable as the feature combination of our method.

B. CLASSIFIER SELECTION
In the two-layer classifier network we designed, the first layer has three nodes $C^{(1)}_1$, $C^{(1)}_2$ and $C^{(1)}_3$, and the second layer has one node $C^{(2)}_1$. We can select random forests, SVM, decision tree, Gaussian Naive Bayes, etc. as candidate classifiers among representative classification algorithms. In other words, we can select a classifier from the candidate classifiers for any node in each layer. Therefore, for the combination $C^{(1)}_1$&$C^{(1)}_2$&$C^{(1)}_3$&$C^{(2)}_1$, we can get different combinations of classification algorithms. Then, the combination with the highest combined accuracy of gender estimation is selected, and each classification algorithm of the combination is used as the classifier of the corresponding node. Finally, the trained two-layer classifier network is used to test the gender prediction performance of our proposed method. To compare the algorithms, we first run our model with the random forest classifier and select the top 6 feature combinations with the highest average combined accuracy. We then add the feature combination F1&F2&F3 and use these combinations to measure the average combined accuracy of the different algorithms.
Fig. 8 shows the result. On the training subsets generated by the feature combination F3&F4&F5, the random forest classifier has a higher average combined accuracy than the other classifiers. Among the four typical classifiers, the decision tree and SVM classifiers rank second and third in average combined accuracy, respectively, and the Gaussian Naive Bayes classifier has the lowest combined accuracy.
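The search over classifier assignments described in this subsection can be sketched as below. The node and candidate names are placeholder strings, and `evaluate` is a hypothetical scoring function that would return the combined accuracy of a trained network. With four candidates and four nodes, there are 4^4 = 256 possible assignments.

```python
from itertools import product

CANDIDATES = ["random_forest", "svm", "decision_tree", "gaussian_nb"]
NODES = ["C1_1", "C1_2", "C1_3", "C2_1"]  # three first-layer nodes, one second-layer node


def classifier_assignments(candidates, nodes):
    """Every way of assigning one candidate algorithm to each node."""
    return [dict(zip(nodes, choice)) for choice in product(candidates, repeat=len(nodes))]


def best_assignment(assignments, evaluate):
    """Assignment with the highest score according to `evaluate`."""
    return max(assignments, key=evaluate)


assignments = classifier_assignments(CANDIDATES, NODES)
print(len(assignments))  # 256


# Placeholder score: counts random forest nodes. A real run would train the
# network for each assignment and return its combined accuracy instead.
def toy_score(assignment):
    return sum(1 for name in assignment.values() if name == "random_forest")


print(best_assignment(assignments, toy_score))
```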

C. COMBINED ACCURACY MEASUREMENT
The sample size of women is three times that of men, which leads to the problem of sample imbalance. The feature combination F3&F4&F5 selected in this paper can better solve the problem, while the other combinations cannot solve the problem better. In general, sample imbalance will make gender prediction results more biased towards sample categories with a larger sample size, that is, gender prediction results will be more biased towards female. To improve the balance and credibility of gender prediction results, the combined accuracy (3) is used to evaluate the accuracy of our proposed model.
When the model proposed in this paper uses a random forest classifier, the average combined accuracy of gender prediction is the highest. On this basis, the average combined accuracy is measured on the sample set defined by each feature combination, and 18 feature combinations, including the six with the highest combined accuracy, are selected for display. The average combined accuracy for each feature combination is shown in Fig. 9. The comparison shows that the average combined accuracy of gender prediction based on the feature combination F3&F4&F5 is 78%, which is the best among all combinations. In addition, as the number of features increases, the combined accuracy first increases and then decreases. This reflects that both the number and the choice of features have an important influence on the combined accuracy. For example, the average combined accuracy of the single feature F3 or F4 is always lower than that of the two-feature combination F3&F4, while the average combined accuracy of F3&F4&F5 is significantly higher than that of F1&F2&F3. F3&F4&F5 and F1&F2&F3 have the same number of features, yet their average combined accuracies are quite different; F1&F2&F3 is even lower than F3&F4, F3&F5 and F4&F5. This is because the shopping behavior represented by the feature combination F3&F4&F5 is more closely related to gender than the shopping preference represented by the feature combination F1&F2&F3. The reason is that the feature combination F3&F4&F5 clusters the female set into three clusters that are the most similar in size and clearly separated from each other. This also means that the feature combination F3&F4&F5 shows a clearer personality in terms of gender than the other feature combinations.

D. TIME OVERHEAD
As is well known, the field of e-commerce continuously generates massive amounts of online transaction data. How to extract useful features from these data in a timely and effective manner for gender prediction is also a challenge considered in this article. In other words, our method should extract meaningful features from large-scale data sets in a timely and effective manner, and should remain efficient and robust after a large number of training runs. Only on this basis can our method be applied to practical and commercial scenarios. Therefore, time overhead is a critical factor, and we evaluate the time overhead of different classifiers on different feature combinations. The time cost in this experiment includes the two processes of feature extraction and model training. Fig. 10 and Fig. 11 summarize the corresponding average time costs. For the three classifiers in Fig. 10, the average time cost grows with the number of features. This is because it takes a certain amount of time to extract a feature from a given data set, and as the number of features to be extracted increases, the time spent also increases. Meanwhile, the average time cost of random forests is slightly higher than that of the other two classifiers, but it is within an acceptable range. We accept this cost because sacrificing average combined accuracy for a lower average time cost is not our first choice. To make the average combined accuracy as high as possible while keeping the average time cost as low as possible, we can adjust the depth of the trees in the random forest algorithm. Fig. 11 shows that, for the SVM classifier, the time cost is less clearly correlated with the number of features than for the three classifiers in Fig. 10, and that SVM takes more time.
This also reflects that, as the number of female and male samples increases, more classification boundaries are required in the different dimensions, which raises the complexity, leads to a large number of calculations, and increases the time overhead. Among these combinations, the feature combination F3&F4&F5 selected in this paper has the lowest average time overhead, which also reflects the lightweight nature of our method. Note that a low time overhead has no research value if the method does not provide good gender prediction results. In other words, the combined accuracy is more important than the time overhead.
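The measurement described in this section (feature extraction plus model training, averaged per feature combination) can be sketched as follows. This is only an illustrative harness under stated assumptions, not the benchmark code used for Fig. 10 and Fig. 11: `extract` and `train` are hypothetical placeholder callables standing in for the real feature extraction and classifier training, and the toy stand-ins at the bottom do no real work.

```python
import time
from itertools import combinations


def time_combinations(features, extract, train, max_size=None, repeats=3):
    """Return {feature combination: mean seconds for extraction + training}.

    `extract(combo)` and `train(data)` are hypothetical callables supplied
    by the caller; they are not part of the paper.
    """
    results = {}
    max_size = max_size or len(features)
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            total = 0.0
            for _ in range(repeats):
                start = time.perf_counter()
                data = extract(combo)   # feature extraction
                train(data)             # model training
                total += time.perf_counter() - start
            results[combo] = total / repeats
    return results


# Toy stand-ins (assumptions): one column of zeros per requested feature,
# and a "training" step that merely touches every value.
def toy_extract(combo):
    return [[0.0] * len(combo) for _ in range(100)]


def toy_train(data):
    return sum(map(sum, data))


features = ["F1", "F2", "F3", "F4", "F5"]
timings = time_combinations(features, toy_extract, toy_train)
fastest = min(timings, key=timings.get)
print(len(timings), "combinations timed; fastest:", "&".join(fastest))
```

With five features the harness times all 31 non-empty combinations; in practice one would restrict the sizes with `max_size` or pass only the combinations of interest.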

E. DISCUSSION
Customers' gender information is vital for improving product recommendation performance, and it is significant for in-depth research on shopping recommendation systems. However, existing online shopping recommendation systems mainly rely on the gender data that users provide at registration. For customers who are unwilling to actively disclose private information, the authenticity of the gender information they provide cannot be guaranteed. Meanwhile, existing gender estimation models based on customers' online shopping transaction logs mainly rely on the analysis of users' click behaviors and ignore the diversity of female and male personalities. Moreover, these models may be time-consuming when estimating gender on large-scale imbalanced data that changes in real time. To this end, we propose a gender estimation model based on personality diversity and estimate customer gender through shopping log data mining. Our evaluation shows that the model is lightweight and efficient.
Processing massive shopping log data in a timely manner and extracting features from it for effective customer gender estimation remains a challenge. We therefore leave to future work the study of how, in industry, the feature extraction speed can be adjusted to the constantly changing scale of online shopping data, and how this affects the time spent on customer gender estimation.
We note that the data we obtained from the FPT Group website are labeled 'Male' or 'Female', and essentially all samples carry one of these two labels. When a dataset contains only a few unlabeled samples, we simply train the model without them. Training the model on a dataset that contains many unlabeled samples, however, is out of the scope of our proposed method, and we leave it to future work.

VI. CONCLUSION
This paper introduces a novel approach to mine customers' gender information from the online product viewing log provided by Vietnam FPT Group. First, we build feature combinations from the extracted features to reflect the correlation between personality diversity and gender, and select the best feature combination through data visualization. This addresses the problem of low correlation between training data and gender labels. Then, using the best feature combination, the female samples are naturally clustered into three subsets, each roughly equal in size to the male sample set. Each female subset is paired with the male set to form a balanced training subset, so that three balanced training subsets are obtained. This addresses the issue of imbalanced training samples. Finally, three independent classifiers are trained on these three balanced training subsets as the nodes of the first-layer network, and a new classifier is trained on the output of the first-layer network as the second-layer network node. The resulting two-layer classifier network makes the final gender decision. Experimental results on the given data set show that our proposed method provides accurate prediction results and consumes less time. As a data mining model for gender prediction, our method is lightweight and efficient, and can be applied to different practical e-commerce scenarios.
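The structure of the two-layer classifier network summarized above can be illustrated with a small sketch. Several assumptions are made for brevity and are not taken from the paper: the three female subsets are taken as given (the clustering step is omitted), a simple nearest-centroid classifier stands in for the classifiers actually evaluated (such as random forest), the second layer is trained on the first-layer outputs for the training samples themselves, and the two-dimensional toy data is invented purely for illustration.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length numeric vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Stand-in classifier (assumption): predicts the label of the closest class centroid."""

    def fit(self, rows, labels):
        self.centroids = {
            label: centroid([r for r, l in zip(rows, labels) if l == label])
            for label in set(labels)
        }
        return self

    def predict(self, row):
        return min(self.centroids, key=lambda l: sq_dist(row, self.centroids[l]))


def train_two_layer(female_subsets, males):
    """Labels: 1 = female, 0 = male."""
    # First layer: one classifier per balanced subset (female subset + male set).
    first = []
    for subset in female_subsets:
        rows = subset + males
        labels = [1] * len(subset) + [0] * len(males)
        first.append(NearestCentroid().fit(rows, labels))
    # Second layer: trained on the first-layer outputs for all training samples.
    all_rows = [r for s in female_subsets for r in s] + males
    all_labels = [1] * sum(len(s) for s in female_subsets) + [0] * len(males)
    meta_rows = [[clf.predict(r) for clf in first] for r in all_rows]
    second = NearestCentroid().fit(meta_rows, all_labels)
    return first, second


def predict_two_layer(model, row):
    first, second = model
    return second.predict([clf.predict(row) for clf in first])


# Invented toy data: three female clusters, each the same size as the male set.
males = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
female_subsets = [
    [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]],
    [[10.0, 0.0], [11.0, 0.0], [10.0, 1.0]],
    [[0.0, 10.0], [1.0, 10.0], [0.0, 11.0]],
]
model = train_two_layer(female_subsets, males)
print(predict_two_layer(model, [0.5, 0.5]), predict_two_layer(model, [10.5, 0.5]))
```

Because each first-layer classifier sees only one female cluster, it may misclassify females from the other clusters; the second layer combines the three outputs to compensate for this.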

APPENDIX A DETAILED STEPS OF DECISION TREE MODEL
We demonstrate the detailed calculation of the CART algorithm as follows:
1) Calculate the Gini coefficient of the sample set. Suppose that the sample set $X$ contains $C$ categories of samples, and the proportion of each category of samples is $P_i$ ($i = 1, 2, \ldots, C$). Then the Gini coefficient of the sample set can be expressed as
$$\mathrm{Gini}(X) = 1 - \sum_{i=1}^{C} P_i^{2}.$$
The Gini coefficient $\mathrm{Gini}(X)$ reflects the probability of inconsistent class labels when two samples are drawn at random from dataset $X$. Therefore, the smaller $\mathrm{Gini}(X)$ is, the higher the purity of the dataset $X$.
2) Calculate the Gini-index of the data set divided by each feature. The decision tree can be built recursively by means of bisection splitting. Each node is split by the CART algorithm, which adopts the Gini-index as the split criterion [52]. Suppose that the set $F$ represents all the features in the sample set $X$, and $F = \{f_1, f_2, \ldots, f_M\}$. That is, there are $M$ features in the set $F$, and any feature $f \in F$. If $f$ is used to divide the training set $X$, $M$ branch nodes will be generated. We use $X_m$ to denote all samples in $X$ whose value is $f_m$ on feature $f$, and assign $X_m$ to the $m$-th branch node. In this paper, the features extracted by Definition 1 to Definition 5 can be regarded as different features of the training set, and these features can also be regarded as different attributes. Then the Gini-index of attribute $f$ can be expressed as
$$\mathrm{Gini\_index}(X, f) = \sum_{m=1}^{M} \frac{|X_m|}{|X|}\,\mathrm{Gini}(X_m).$$
In the candidate attribute set $F$, we score each division by the Gini-index of attribute $f$ and take the attribute that yields the smallest Gini-index after the division as the optimal division attribute, namely
$$f^{*} = \arg\min_{f \in F} \mathrm{Gini\_index}(X, f).$$
3) Recursively build the tree. For the divided decision tree, repeat step 2) until the division cannot be continued or the Gini value is less than the set threshold.
4) Output the final CART decision tree.
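Steps 1) and 2) can be illustrated with a short sketch. It assumes the standard CART definitions, $\mathrm{Gini}(X) = 1 - \sum_i P_i^2$ for a sample set and the size-weighted sum of branch Gini values for an attribute, and it splits on every distinct attribute value (one branch per value) rather than performing a binary split. The function names and the toy samples are illustrative only and do not come from the paper.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini coefficient of a label list: 1 - sum_i P_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(rows, labels, feature):
    """Size-weighted Gini of the branches obtained by splitting on `feature`."""
    branches = defaultdict(list)
    for row, label in zip(rows, labels):
        branches[row[feature]].append(label)   # one branch per attribute value
    n = len(labels)
    return sum(len(branch) / n * gini(branch) for branch in branches.values())


def best_feature(rows, labels, features):
    """Attribute with the smallest Gini-index, i.e. the optimal division attribute."""
    return min(features, key=lambda f: gini_index(rows, labels, f))


# Invented toy samples with two categorical attributes.
rows = [
    {"f1": "a", "f2": "x"},
    {"f1": "a", "f2": "y"},
    {"f1": "b", "f2": "x"},
    {"f1": "b", "f2": "y"},
]
labels = ["female", "female", "male", "male"]

print(gini(labels))                            # impurity of the whole set
print(gini_index(rows, labels, "f1"))          # f1 separates the classes
print(gini_index(rows, labels, "f2"))          # f2 does not
print(best_feature(rows, labels, ["f1", "f2"]))
```

In this toy set the attribute `f1` separates the two classes perfectly, so its Gini-index is zero and it is chosen as the division attribute; step 3) would then apply the same selection recursively to each branch.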

APPENDIX B DETAILED STEPS OF RANDOM FOREST MODEL
To facilitate prediction, we use an $N$-dimensional vector $(h_k^1(x), h_k^2(x), \ldots, h_k^j(x), \ldots, h_k^N(x))$ to represent the prediction output obtained after the sample $x$ is input to $h_k$, where $h_k^n(x) \in \mathbb{R}$ is the output of $h_k$ on the category label $c_n$. The combination strategy we adopt is majority voting:
$$H(x) = \begin{cases} c_n, & \text{if } \sum_{k} h_k^n(x) > \dfrac{1}{2} \sum_{j=1}^{N} \sum_{k} h_k^j(x), \\ \text{reject}, & \text{otherwise,} \end{cases}$$
where the sums over $k$ run over all base classifiers $h_k$. If a tag receives more than half of the votes, the random forest predicts the sample $x$ as that tag. Otherwise, the prediction is rejected. In this article, the customer gender classification task requires that a gender prediction be provided for every sample, so majority voting degenerates into plurality voting:
$$H(x) = c_{\arg\max_{n} \sum_{k} h_k^n(x)}.$$
When counting the tags with the most votes, if there are multiple such tags, one of them is selected at random as the category tag predicted for the sample.
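The two voting rules can be sketched as follows. The sketch assumes that each base classifier casts a single hard vote (one label per tree), which is a simplification of the real-valued outputs above; the function names and the `REJECT` marker are illustrative and not part of the paper.

```python
import random
from collections import Counter

REJECT = None  # illustrative marker for a rejected prediction


def majority_vote(votes):
    """Return the label holding more than half of the votes, else REJECT."""
    if not votes:
        return REJECT
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else REJECT


def plurality_vote(votes, rng=random):
    """Return the label with the most votes; ties are broken at random."""
    if not votes:
        raise ValueError("plurality_vote needs at least one vote")
    counts = Counter(votes)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    return rng.choice(tied)


tree_votes = ["male", "female", "female"]
print(majority_vote(tree_votes))             # a clear majority exists
print(majority_vote(["male", "female"]))     # no label exceeds half: rejected
print(plurality_vote(["male", "female"]))    # tie: one label chosen at random
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible, which is convenient when comparing runs.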