K-Means and K-Medoids: Cluster Analysis on Birth Data Collected in City Muzaffarabad, Kashmir

In medicine, every analysis is consequential because the study bears on the life of the subject under observation. One of the most vital areas of medicine is the healthcare of expecting women in low-income countries. A high mortality rate associated with the increased number of caesarean sections is evident there, owing to poor medical infrastructure in the region, misunderstood religious teachings, low education and a lack of proper decision making at the right time. Root cause analysis of the situations that demand a caesarean section is difficult; however, where historical data are available, one may extract useful information that supports a medical decision by predicting the outcome. Regional disparities have a large impact on the residents of a region, so a study performed on one region is not fully applicable to the residents of a distant region. This motivates a local study on data collected from expecting women in the city of Muzaffarabad, Kashmir. The findings are expected to be significant for women who share broadly similar physical, social and maternal traits. With this in mind, the study analyzes two clustering techniques to identify the algorithm that groups the data into relevant clusters most robustly. Firstly, we analyze the capability of the K-means and K-medoids algorithms to cluster the data using different distance metrics. Secondly, the data transformation techniques scale, range and Yeo-Johnson are applied. Finally, the transformed data are used with the K-means and K-medoids algorithms to compute cluster accuracy. The results produced from transformed data are better than those produced from raw data. The Yeo-Johnson transformation is found to be best for K-means (Hartigan & Wong), K-medoids (SEV distance function) and ranked K-medoids (SEV distance function), with mean accuracies of 67.58%, 69.58% and 72.64% respectively.


I. INTRODUCTION
Healthcare is an attractive research domain because of its social implications. Advances in healthcare raise the probability of a healthy life. This idea has produced many interdisciplinary research outcomes that serve this purpose in one way or another. The last decade has witnessed numerous prognosis- and diagnosis-based research articles targeting a particular disease or health issue. Most of these studies attempted to provide valuable details by performing classification or prediction of diseases that cause high mortality worldwide [1]. Several delicate domains still need attention, for example pregnancy complications, which are a major cause of death in low-income countries as shown in Figure 1 [1]. Pregnancy complications cause more deaths in lower-middle-income countries than the number of deaths reported in advanced countries due to kidney diseases and breast cancer. Although several attempts have been made to address complications during pregnancy and possible precautions, the generic findings are not entirely applicable to people from different regions. Therefore, it is necessary to widen the sphere of research by conducting studies specific to people who share similar lifestyles. The outcomes of such a study will be significant and fully applicable to the people of that region. Several factors endanger an expecting woman and her child. Many physical complications lead to the need for a Cesarean section (C-section), which is itself life threatening.
The associate editor coordinating the review of this manuscript and approving it for publication was Emre Koyuncu.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The occasions that require a C-section include the birth of twins, triplets or more, a very large infant, any previous birth by surgery, preterm birth, diabetes etc. A C-section may also be performed depending upon the shape of the mother's uterus, a history of a previous C-section, prolonged labor, cord prolapse etc. A C-section is often necessary when a vaginal delivery puts the mother or baby at risk [2]. More than 50 countries globally have C-section rates greater than 27 percent [3]. Efforts are being made to reduce the occasions that demand a C-section, as the risk of death during surgical procedures is higher than in vaginal delivery. Therefore, discovering the factors that cause complications during pregnancy is essential. Expecting mothers face broadly similar gestational, medical and physical experiences during pregnancy, and these experiences depend on their social, physical and medical factors. The data of expecting women, if collected carefully, can help physicians support their decisions with facts generated by sound data analysis. At this stage, data mining [4] and machine learning algorithms [5], [6] come into play; they are excellent at learning patterns from data files and unveiling useful information that the human eye fails to discover at once. Decision support systems based on historical data and strengthened by the learning capability of machine learning algorithms create an opportunity for physicians to gain predictive information and take timely decisions that may save the lives of the expecting woman and the fetus. In computing, supervised and unsupervised learning algorithms are usually modeled to generate predictions from historical data. Clustering is an unsupervised scheme that aims at grouping instances into clusters based on their perceived similarities [7], [8].
In the current study, we present a cluster analysis using k-means and k-medoids on birth data collected from government hospitals of the city of Muzaffarabad, the capital of Azad Kashmir. Firstly, we analyze the capability of the K-means and K-medoids algorithms to cluster the data using different distance metrics. Secondly, the data transformation techniques Scale, Range and Yeo-Johnson are applied to transform the data. Lastly, the K-means and K-medoids algorithms with different distance metrics are rerun on the transformed data to compute cluster accuracy. Clustering has a wide range of applications covering major fields of science, including image segmentation [9], [10], handwriting/object recognition [11], [12], big data mining [13], human genetic clustering [14], recommender systems [15], spatial data analysis [16], data reduction [17] etc. Clustering techniques are widely used in healthcare, including cardiovascular diseases [18], [19], classification and prediction using cancer gene expression data [20], [21], diabetes [22], stroke [23], Alzheimer's [24] etc. A few contributions from the literature in the maternal health and pregnancy complications domains are discussed below. The median clustering method is used to group percentage data on maternal health services in Aceh Province, Indonesia [25]. That study describes the similarity in the data, the usage of distance metrics, the process of enhancing the distance metrics by the median clustering method, the process of determining the number of clusters, the interpretation of clustering results and the setting of grouping boundaries to describe the characteristics of pregnant women's health services in each cluster [25]. Another cluster analysis, on data from 33,740 preterm births, is performed to assess the determinants of various preterm birth subtypes. An adapted k-means model and a fuzzy algorithm were used to identify clusters using predefined conditions [26].
Another study first considered the K-means algorithm, followed by the self-organizing map (SOM) technique, to extract co-morbidity based clusters from a healthcare discharge data set. After validation of the general cluster composition for diabetes mellitus, co-morbidity based clusters were identified for pregnancy. The SOM technique was found to infer distinct clusterings of pregnancy ranging from normal birth to preterm birth, and potentially interesting co-morbidities that could be validated by published literature. The promising results suggested that the SOM technique is a valuable unsupervised clustering method for discovering co-morbidity based clusters [27]. In another study, latent class cluster analysis is used to identify maternal exposure clusters and the association between risk factors and birth defects. Clustering was performed using adjusted odds ratios [28]. The study identified three latent maternal exposure clusters: a high-risk, a moderate-risk and a low-risk cluster. Intelligent data mining and cluster analysis were applied to the data of 222 pregnant women [29]. The cluster analysis achieved 94.10% classification accuracy, and the study claimed that the proposed method is appropriate for classifying expecting women during early trimesters. Taylor et al. [30] studied the differences between US women who received prenatal care and those who did not. Discriminant analysis was used to evaluate the results obtained after grouping women into clusters based on their similar characteristics. The study reported six replicable clusters of women who did not receive prenatal care [30]. Another study, based on multiple imputation fuzzy clustering, identified three exposure groups (non-exposed, lighter tobacco exposed and heavier tobacco exposed) from variables related to smoking. The authors claimed that multiple imputation fuzzy clustering is good at categorizing patterns of exposure and their influence on results [31].
A few other interesting and relevant studies cover preeclampsia disorder [32], [33], intrauterine disorders [34]-[36], regional disparities in maternal and child health indicators [37], the effect of social and demographic factors on maternal health [38], [39], and the effect of facility birth on maternal and perinatal mortality [40]. The literature shows that unsupervised methods have been applied to analyze different dimensions of pregnancy and its complications among expecting women. However, most of the studies were conducted in advanced countries, and their results are not entirely applicable to women of other regions because of regional disparities [37]. Therefore, this research is carried out in Muzaffarabad, the capital city of Azad Jammu & Kashmir, which has a population above one million. The city lacks the medical infrastructure, interdisciplinary liaisons and financial assistance needed to mitigate pregnancy-based complications promptly. However, it is believed that the success and continuity of such studies will help lay the grounds to improve healthcare infrastructure and uplift the society.

II. METHODS
The current study aims to address the issues associated with complications caused by C-section. It also analyzes and investigates clustering methods and their robustness when applied to our data set. To achieve this, we set the following objectives:
1) Collection of local data that reflects the traits of women sharing the same social, economic and physical attributes.
2) Application of the K-means and K-medoids clustering algorithms with different distance functions to the collected data, to analyze the clustering ability of these methods.
3) Analysis of the behavior of the K-means and K-medoids algorithms with different distance metrics when provided with transformed data, to compute cluster accuracy.
The scheme of the study is depicted in Figure 3.

A. ABOUT DATA
The data used were collected from the government hospitals of the city of Muzaffarabad, the capital of Azad Kashmir. Questionnaires were designed for data collection. Because most respondents come from rural areas and have no educational background, the questionnaires were filled in by a university graduate in the presence of an obstetrician-gynecologist. The original data comprised 983 instances with 79 attributes divided into pregnancy, maternal history, medical condition, gestational and social life factors. The mode of delivery (caesarean or normal) is the outcome variable of the data. This variable makes the data suitable for classification studies as well [41], [42]. A few attributes of the data set used in the experiments are provided in Table 1. The attributes are correlated with one another and affect the mode of birth, i.e., caesarean or normal. The correlation matrix of the main attributes is provided in Figure 4.
The key findings from data and correlation matrix are provided below.

4) Women who have gone through surgeries (other than C-section) tend to deliver via C-section (35 out of 49 cases).
5) All women with C-section deliveries (most recent) have delivered via C-section in their current pregnancy.
6) Women with high and low blood pressure, lower levels of hemoglobin, diabetes and hypertension tend to deliver via C-section.
7) The C-section rate is high among women of early ages (17-23) and in their late 30s (291 out of 374).
8) Cousin marriages are in high positive correlation with C-section deliveries (523 out of 682).
9) The working women category has a higher number of C-section deliveries (197 out of 280).
10) High blood pressure is observed among women in their first pregnancy.
11) Women with a low educational background or no education at all delivered 4 children on average.
12) The number of girls delivered is almost double the number of boys delivered.

B. CLUSTERING
Clustering attempts to gather similar data into exactly one group, so that a data point assigned to one group does not appear in any other group [26], [43]. With a good clustering method one may identify anomalies in the data, gain more knowledge about the data, generate hypotheses, discover similarities among instances, perform compression etc. [44]. Clustering is a difficult task that depends greatly on the shape of the data. Noise in the data affects the shape, density and size of the clusters [44]. The human eye is good at clustering objects in two dimensions, but as dimensionality increases we need clustering algorithms that provide isolated and compact clusters. Several clustering methods are available for use under different conditions.
Clustering is normally categorized as either hierarchical or partitional [45]. In hierarchical clustering methods, clusters are formed by iteratively merging or dividing the patterns using a bottom-up (agglomerative hierarchy) or top-down (divisive hierarchy) approach [46]. In contrast, partitional clustering attempts to assign the observations of the data to k clusters based on some criterion function. The criterion function most commonly seeks the minimum distance between the points within a cluster. K-means [47], K-medoids [43], CLARA [43] etc. belong to this category. CLARA is designed to handle data with several thousand objects and is therefore not used in the current experiments. Investigating the efficacy of clustering methods on a similar but larger data set is left for future work. Although the data set used in the current study is small to medium sized, a practical application developed using the proposed methodology will reduce the computational time needed to generate results. The discussion from this point forward focuses on K-means and K-medoids.

C. K-MEANS
K-means is a simple algorithm that is well known for its effectiveness in clustering [47]. K-means attempts to minimize the squared error between the mean of a cluster and the data points in that cluster. Suppose we have n-dimensional data points that a user wants to group into K clusters, with µ_k as the mean of cluster c_k; then the k-means objective is represented as [44]:
$$ J = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2 \qquad (1) $$
where x_i, with i = 1, 2, 3, ..., n, are the data points to be grouped, and c_k, with k = 1, 2, 3, ..., K, is the set of clusters. In order to reduce the squared error, k-means allocates patterns to k initially partitioned clusters [44]. While cluster membership has not yet stabilized, k-means continues to repeat the following steps [48]:
1) Assign each pattern to its nearest cluster and generate a new partition.
2) Compute the new cluster means.
Another important parameter required by k-means is the distance metric. Typically, k-means is used with the Euclidean distance metric, which computes the root of the sum of squared differences between the coordinates of two objects x and y:
$$ d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2} \qquad (2) $$
Other than the Euclidean distance, the Manhattan and Minkowski distance metrics are available with the same notion. The Manhattan distance sums the absolute differences between the coordinates of a pair of objects:
$$ d(x, y) = \sum_{j=1}^{n} \lvert x_j - y_j \rvert \qquad (3) $$
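The Minkowski distance is the generalization of the Euclidean and Manhattan distances. It can be used with ordinal and quantitative variables. The Minkowski distance of order p is given in equation 4; it reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.
$$ d(x, y) = \left( \sum_{j=1}^{n} \lvert x_j - y_j \rvert^{p} \right)^{1/p} \qquad (4) $$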
Other well-known distances include the Mahalanobis distance, used to detect hyper-ellipsoidal clusters [49], and the Itakura-Saito distance [50], used in speech processing for vector quantization.
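The Euclidean, Manhattan and Minkowski distances described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the R code used in the study; the function names are hypothetical and the inputs are assumed to be equal-length numeric sequences.

```python
import math


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    a, b = (0.0, 0.0), (3.0, 4.0)
    print(manhattan(a, b), euclidean(a, b), minkowski(a, b, 3))
```

Calling `minkowski` with `p=1` or `p=2` gives the same values as `manhattan` and `euclidean`, which is the sense in which it generalizes both.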
In the current study, k-means is applied with three different algorithms: the MacQueen [47], Hartigan and Wong [51] and Lloyd [52] algorithms.

D. MACQUEEN ALGORITHM
MacQueen is an iterative algorithm that starts by choosing the number of clusters, selecting a distance metric and obtaining initial centers by some method. It then assigns each case to a cluster iteratively under the following condition. If the case is nearest to the centroid of the subspace it already belongs to, no change is made. If the case is nearest to another centroid, the case is reassigned to that centroid and the centroids of the two affected clusters are recalculated.
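The reassignment rule described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: it uses squared Euclidean distance, takes the initial centers as an argument, keeps the previous center when a cluster becomes empty, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))


def mean_of(points, labels, k, fallback):
    """Mean of the points labelled k; keep the old center if k is empty."""
    members = [p for p, lab in zip(points, labels) if lab == k]
    if not members:
        return fallback
    return tuple(sum(col) / len(members) for col in zip(*members))


def macqueen(points, init_centers, max_passes=100):
    centers = [tuple(c) for c in init_centers]
    labels = [nearest(p, centers) for p in points]
    centers = [mean_of(points, labels, k, centers[k]) for k in range(len(centers))]
    for _ in range(max_passes):
        moved = False
        for i, p in enumerate(points):
            new, old = nearest(p, centers), labels[i]
            if new != old:
                # Reassign the case and update only the two affected centroids.
                labels[i] = new
                for k in (old, new):
                    centers[k] = mean_of(points, labels, k, centers[k])
                moved = True
        if not moved:
            break
    return labels, centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(macqueen(data, [(0, 0), (1, 0)]))
```

The key difference from a batch update is that centroids move immediately after each reassignment, so later cases in the same pass already see the updated centers.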

E. HARTIGAN AND WONG ALGORITHM (H AND W)
The objective of this algorithm is to obtain a locally optimal within-cluster sum of squared errors. This means that H and W may move a case that currently resides in the cluster with the nearest centroid to another subspace, provided that doing so minimizes the within-cluster sum of squared errors as shown in equation 5 [53].
In equation 5, for all i =1, if the sum of squares of the current cluster (SSE2) is smaller than the sum of squares of another cluster (SSE1), then the case is assigned to SSE1 (the new cluster); otherwise it is assigned to the current cluster.

F. LLOYD ALGORITHM
Lloyd's algorithm attempts to find a set of cluster centers for the data points (x_1, x_2, x_3, ..., x_n) that is a solution to the minimization problem [53], i.e., the within-cluster sum of squared errors minimized over the cluster means:
$$ \min_{\mu_1, \ldots, \mu_K} \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2 $$
Lloyd starts by choosing the number of clusters, selecting a distance metric and obtaining initial centers by some method. The iterative part is carried out in the following three steps:
1) Assign each case of the data set to a cluster based on the distance metric.
2) Update each centroid to the mean value of the cases assigned to its cluster in the previous step.
3) Repeat step 1 and step 2 until the centroids stop changing.
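The three steps above translate almost directly into code. The following is a minimal sketch, assuming squared Euclidean distance and caller-supplied initial centers; it is not the R routine used in the experiments, and the names `lloyd` and `sq_dist` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, init_centers, max_iter=100):
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign every case to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Step 2: recompute each center as the mean of its assigned cases.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        # Step 3: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(lloyd(data, [(0, 0), (10, 10)]))
```

Unlike the MacQueen variant, all assignments are made against the same set of centers before any center is updated.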

G. K-MEDOIDS
K-medoids, or partitioning around medoids, is a variation of the k-means algorithm in which actual data points are selected as medoids, rather than using the mean as a centroid as in k-means. A medoid is the object within a cluster that has the minimum average dissimilarity to the other objects of that cluster. The k-medoids algorithm begins by selecting K medoids and assigning each object of the data set to the nearest medoid using some distance metric. Afterwards, k-medoids calculates the cost of swapping an object P_i with a medoid M_i.
When this cost decreases to a set threshold, the algorithm takes the following steps. For each medoid M and each data point P such that P ≠ M:
1) Consider the swap of M and P, and compute the cost change.
2) If the cost change is the current best, remember this M and P swap.
3) Perform the swap of M and P. If it decreases the cost, repeat steps 1 and 2; otherwise the algorithm terminates.
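The swap procedure above can be sketched as a greedy search. This is a minimal sketch under assumptions, not the study's implementation: the cost is taken to be the total distance from every object to its nearest medoid, medoids are given as indices into the data, and the names `total_cost` and `pam` are hypothetical.

```python
def total_cost(points, medoids, dist):
    """Total distance from every object to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, init_medoids, dist, max_iter=100):
    medoids = list(init_medoids)
    best = total_cost(points, medoids, dist)
    for _ in range(max_iter):
        best_swap = None
        # Steps 1-2: evaluate every (medoid, non-medoid) swap, remember the best.
        for mi in range(len(medoids)):
            for p in range(len(points)):
                if p in medoids:
                    continue
                candidate = medoids[:mi] + [p] + medoids[mi + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:
                    best, best_swap = cost, candidate
        # Step 3: perform the best swap, or stop if no swap lowers the cost.
        if best_swap is None:
            break
        medoids = best_swap
    labels = [
        min(range(len(medoids)), key=lambda k: dist(p, points[medoids[k]]))
        for p in points
    ]
    return medoids, labels, best


if __name__ == "__main__":
    manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    data = [(0,), (1,), (2,), (10,), (11,), (12,)]
    print(pam(data, [0, 1], manhattan))
```

Because each medoid is always an observed data point, any dissimilarity function can be passed as `dist`, which is what allows k-medoids to be paired with the weighted distance functions used later in the study.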

H. RANKED K-MEDOIDS
This algorithm computes the similarity between pairs of objects only once. The cost of updating the medoids in each iteration is O(k * m) for K clusters and m objects [54]. Ranked k-medoids (rkmed) follows these steps to assign objects to a medoid based on similarity:
1) Using a distance metric, compute the similarities for pairs of objects.
2) By sorting the similarity values, calculate the R-matrix. In the sorted index matrix, store the indexes of similar objects from the most similar to the least similar object.
3) Randomly choose a value for K.
4) From the sorted index matrix, choose the group of objects most similar to each medoid.
5) For every object in the group identified in step 4, calculate the hostility value, where X_i is an object in the set of objects N and r_ij is the rank matrix.
6) The new medoid is the object with the highest hostility value.
7) Reposition one of the medoids placed in the same group.
8) Go to step 4 until the maximum-iterations criterion is met.
9) Assign each object to the most similar medoid.

I. DISTANCE METRICS
The experimental tasks of the current study are carried out in R, a comprehensive statistical computing environment built around the R language. Several distance computing options are available for k-medoids in R [55].
Considering the type of data set used in the study, we have incorporated the Manhattan weighted by range (mwr), squared Euclidean weighted by range (ser) and squared Euclidean weighted by variance (sev) distance functions.
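To make the three weighted distance functions concrete, the sketch below gives one plausible reading of them. The exact formulas are an assumption here, since the text does not state them: mwr divides each absolute difference by the attribute's range, ser divides each squared difference by the squared range, and sev divides each squared difference by the attribute's sample variance. The helper names are hypothetical, and attributes with zero range or variance are assumed to have been removed beforehand.

```python
import statistics


def column_ranges(data):
    """Range (max - min) of each attribute."""
    return [max(col) - min(col) for col in zip(*data)]


def column_variances(data):
    """Sample variance of each attribute."""
    return [statistics.variance(col) for col in zip(*data)]


def mwr(x, y, ranges):
    """Manhattan distance weighted by range (assumed form)."""
    return sum(abs(a - b) / r for a, b, r in zip(x, y, ranges))


def ser(x, y, ranges):
    """Squared Euclidean distance weighted by squared range (assumed form)."""
    return sum((a - b) ** 2 / r ** 2 for a, b, r in zip(x, y, ranges))


def sev(x, y, variances):
    """Squared Euclidean distance weighted by variance (assumed form)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))


if __name__ == "__main__":
    data = [(0, 0), (2, 10), (4, 20)]
    r, v = column_ranges(data), column_variances(data)
    print(mwr(data[0], data[2], r), ser(data[0], data[2], r), sev(data[0], data[2], v))
```

The common idea is that each attribute's contribution is divided by a measure of its spread, so attributes recorded on large scales do not dominate the distance.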

J. DATA TRANSFORMATION
Heuristically, data transformation helps machine learning algorithms generate good results. Sometimes altering the structure or format of the raw data makes it more useful. In the current study, the data are transformed using Scale, Range and Yeo-Johnson. In scaling, the standard deviation of an attribute is calculated and each value is then divided by this standard deviation to obtain the scaled data. The range transform is a normalization method in which the data are scaled into a given range, e.g. [0, 1]. The third transformation method, Yeo-Johnson [56], is selected from the power transform family. Rather than only rescaling the values as scale and range do, Yeo-Johnson changes the shape of the distribution. In the presence of skewed data, Yeo-Johnson works effectively by reshaping the distribution, which reduces the skewness in the data and improves the models' performance.
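The three transformations can be sketched for a single numeric attribute as follows. This is an illustrative sketch, not the preprocessing code used in the study. The scale function follows the description above (division by the standard deviation, without centering), and the Yeo-Johnson function uses the standard piecewise definition with the power parameter `lam` supplied by the caller; in practice that parameter is usually estimated from the data, which this sketch does not attempt.

```python
import math
import statistics


def scale(values):
    """Divide every value by the attribute's sample standard deviation."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]


def range_transform(values, lo=0.0, hi=1.0):
    """Linearly map the values into the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log(-x + 1)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)


if __name__ == "__main__":
    skewed = [0, 1, 1, 2, 3, 20]
    print(scale(skewed))
    print(range_transform(skewed))
    print([yeo_johnson(v, 0.0) for v in skewed])
```

With `lam = 1` the Yeo-Johnson transform leaves values unchanged, while smaller values of `lam` compress the right tail of positive data, which is how it reduces right skew.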

III. RESULTS & DISCUSSION
The aim of the study is to analyze two clustering techniques in order to identify the algorithm that groups locally collected data into relevant clusters most robustly. To achieve this objective, we first used k-means with the H & W, MacQueen and Lloyd algorithms. The predicted value that falls within a cluster is then compared with the actual class label to calculate the accuracy of the clustering output. The mean accuracy of k-means with the three algorithms is presented in Table 2. In order to analyze the clustering ability of k-means on transformed data, we repeated the same experiment after transforming the data using scale, range and Yeo-Johnson. The mean accuracy of k-means with the three algorithms is recalculated and also presented in Table 2. The results show that K-means works best with the H & W algorithm, correctly clustering 61.88% of the birth data (Figure 5). The mean clustering accuracy of the Lloyd and MacQueen algorithms is 60.04% and 60.32% respectively. The performance of k-means clustering tends to improve when the data are transformed using scale, range and Yeo-Johnson. The highest percentage of correct clustering, 67.58%, is achieved by the H & W algorithm with the Yeo-Johnson power transform (Figure 6). The mean accuracy of the Lloyd and MacQueen algorithms with the Yeo-Johnson transform is 67.12% and 66.05% respectively. The same procedure is repeated for K-medoids and ranked k-medoids using the Manhattan weighted by range (mwr), squared Euclidean weighted by range (ser) and squared Euclidean weighted by variance (sev) distance functions. The mean accuracy of k-medoids and ranked k-medoids is presented in Table 3 and Table 4 respectively. K-medoids is better at clustering than k-means as far as this particular data set is concerned. Without data transformation, k-medoids with the squared Euclidean weighted by variance metric correctly clustered 62.64% of the data. The mean accuracy of k-medoids using the MWR and SER distance metrics is 62.62% and 61.49% respectively.
As with k-means, the clustering ability of K-medoids improves with transformed data. Scale and range both improve the mean accuracy, but the difference between the two is marginal. The highest percentage of correct clustering, 69.58%, is achieved by k-medoids with the SEV distance metric and the Yeo-Johnson power transform. Ranked k-medoids, the second variant of k-medoids, is also tested for its mean percentage accuracy of clustering. The results are provided in Table 4. The highest percentage of correct clustering, 72.64%, is achieved using the SEV distance metric and the Yeo-Johnson power transform. The results clearly demonstrate the superiority of the Yeo-Johnson transform when applied with ranked k-medoids incorporating the squared Euclidean weighted by variance distance metric. This superiority is not accidental. The data set has slight skewness that affects the outcome of all methods when they are used without transformation. Yeo-Johnson transforms the data by reshaping its distribution, which reduces the skewness and improves the results. The second goal, minimizing the intra-cluster variance, is achieved by the SEV distance metric. The combination of the Yeo-Johnson power transform and squared Euclidean weighted by variance works well for the data set used in the current study. These two techniques, when combined with k-medoids, which minimizes pairwise dissimilarities, outperform simple k-means.
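The accuracy figures reported above compare cluster membership with the known delivery mode. The text does not specify how clusters are mapped to class labels, so the sketch below assumes a majority-vote mapping: each cluster is labelled with its most frequent true class, and accuracy is the fraction of cases whose class matches their cluster's label. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_accuracy(cluster_ids, true_labels):
    """Fraction of cases that agree with the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    # Each cluster contributes the size of its largest class.
    correct = sum(c.most_common(1)[0][1] for c in counts_by_cluster.values())
    return correct / len(true_labels)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    delivery = ["caesarean", "caesarean", "normal", "normal", "normal", "caesarean"]
    print(cluster_accuracy(clusters, delivery))
```

Averaging this value over repeated runs with different initializations would give a mean accuracy of the kind reported in Tables 2 to 4, under the stated mapping assumption.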

IV. CONCLUSION
Several studies have been conducted in advanced countries. Their findings are significant; however, people as well as healthcare facilities differ from region to region, which makes area-specific research outcomes inapplicable to people living in other parts of the world. With this in mind, the current cluster analysis using k-means and k-medoids clustering is carried out on locally collected birth data. Firstly, we analyzed the capability of the k-means and k-medoids algorithms to cluster the data using different distance metrics. Secondly, the data transformation techniques scale, range and Yeo-Johnson are applied, and the clustering accuracy of these methods is recalculated. The results obtained after incorporating data transformation techniques are better than the mean accuracy generated by the clustering methods alone. The highest percentage of correct clustering, 72.64%, is achieved by ranked k-medoids using the SEV distance metric and the Yeo-Johnson power transform. It is believed that the predictions provided by a decision system based on a reliable machine learning method will allow physicians to support their judgments and take effective measures to address a problem. Furthermore, several factors are associated with situations that demand a C-section. For example, women married at an early age are at the highest risk of delivering via C-section. The trend of cousin marriages in the society should be addressed immediately, as the data reveal a high number of C-sections among women who married their first cousins. The high number of C-sections among working women is alarming. Policy-making departments should provide working women with a stress-free working environment, as stress itself is a major contributor to other medical complications.
Moreover, the number of newborn girls compared to newborn boys is gradually increasing, as reflected by the data as well as in general. Policy makers should revise the rules for jobs and admissions in educational institutions to cater to the needs of the growing number of women. Apart from one published classification-based study on the same data set, the presented study is the first of its kind in a city with a population above one million. It is believed that the continuity of such interdisciplinary studies will help lay the grounds to improve healthcare infrastructure and uplift the society.