Introduction
Experts in modern medicine are focusing increasingly on technical approaches to a variety of chronic diseases. Many diseases, such as cancer, stroke, heart attack, chronic liver diseases, viral hepatitis, and coronary artery disease, remain incurable, and the death rate from breast cancer is increasing every year. According to a statistical report on medical health, cancer is a genetic disease that arises from variations in the genes that control the function of human body cells. Such gene variations may affect internal organs and may be passed on to future generations. DNA structure may also be damaged by environmental exposure to substances such as UV radiation, smoking, and other variables that are significant in the development of breast cancer [1]. Moreover, 60% of women affected by breast cancer are diagnosed at the last stage, which leads to death.
The main contribution of the proposed method is the segmentation of the abnormal region of cancerous cells in the breast image. The novelty of this work lies in a hybrid technique for breast cancer detection, together with a multivariate analysis performed to improve the prediction rate of the proposed system.
Research on breast cancer has increased in the past decade when abnormalities and uncontrollability in breast cell tissues develop into serious breast cancer in women [2]. It may include angiosarcoma, ductal carcinoma in situ (DCIS), and lobular carcinoma in situ (LCIS). As a result, it is critical to track the number of deaths caused by breast cancer before treatment. Figure 1 (a) and (b) show cancerous and non-cancerous images taken as exemplars. Therapeutic imaging is a non-invasive method of examining the inside of the human body that can help doctors detect and treat breast cancer at an early stage.
Breast cancer that is detected at an initial stage is controllable. Breast cancer is indicated by microcalcifications and masses, which are common abnormalities. Microcalcifications and breast masses occur in the connective tissues and epithelia of the breast region [4]. Breast tumors emerge in the breast and differ in size and shape. They are classified as benign or malignant, depending on their severity. Benign breast lumps are non-aggressive and non-cancerous, but they can expand and impinge on adjacent organs, causing additional complications [5]. Malignant breast tumors are aggressive and cancerous, and they must be treated as soon as possible to avoid mortality. Benign masses are oval or circular with confined and smooth borders, whereas malignant tumors are uneven in shape and appear as fuzzy, rough, or ambiguous lumps. Furthermore, a cancerous tumor appears whiter than the surrounding tissue. The challenges and benefits of previous breast tumor classification and detection methods have led to the development of an automatic technique that assists professional radiologists in achieving better interpretation and accuracy.
A diagnostic mammographic image is typically pre-processed to remove the pectoral muscle with a mammogram encircling for the detection process. By removing the pectoral muscle and background areas from a mammographic image, accurate breast profile segmentation on the surface can be determined [6]. Cancer tissues with larger pixel intensities were detected more easily than those in the breast area. The intensities of opaque breasts in normal tissues are similar to those in cancer areas; hence, tumor areas are productively generated. The manual techniques implemented by radiologists fail because of the similar appearance of microcalcifications and breast masses. Finding the tumor mass by segmenting the region of interest is a challenging task in research [7]. As a result, early detection technologies combined with automated systems must aid radiologists in accurately diagnosing breast tumors.
Screening methods for breast cancer include clinical and self-breast examinations, magnetic resonance imaging (MRI), mammography, and ultrasound. Mammography is an efficient and reliable radiographic procedure for detecting breast masses [8]. During screening, a 3D model of the breast is generated from various angles. High-quality, high-resolution images are utilized in subsequent image processing steps, including feature extraction and segmentation. Thus, early identification of breast cancer as an aid to reducing the death rate is considered in this research [9]. The proposed research uses a hybrid K-means and GMM machine learning model to increase the classification accuracy, reduce the error rate, and achieve a high signal-to-noise ratio.
The remainder of this study is organized as follows. The second section reviews related work on breast cancer classification and detection. The third section presents the materials and methods used in the proposed work. The fourth section discusses the experimental results in detail. The final section concludes with the novelty of this research.
Related Works
The existing technique in the literature presents a computer-aided detection (CAD) method that depends on classification and feature extraction using machine learning (ML) models, which aid radiologists in identifying breast tumor lesions in X-rays. The initial process contains a pre-determined deep convolutional neural network (DCNN), and deep features are extracted in the second stage [10]. These are further fed with a support vector machine (SVM) classifier and various kernel functions. The third process presents deep feature fusion, which increases the accuracy of the SVM classifier compared to other methods.
Various studies have examined computer-aided detection approaches for breast cancer using ML techniques [11]. The inputs of these approaches are histopathological images, which exhibit a variety of visual patterns that make it difficult to identify good features for recognizing cancer. The authors investigated various pre-trained CNNs to extract attributes from the histopathology images. These images were taken from the BreakHis dataset [12], which is publicly available.
Several approaches emphasize feature extraction, histopathological imaging, and segmentation. Pre-processing and adaptive learning based on the Gaussian aggregate model and interconnected element survey-based interest localization around the formed extraction are all components of this method. This approach operates in correlation with SVM to detect breast cancer [13].
Full-field digital mammography (FFDM) is broadly used to screen for breast cancer [14]. Contrast-enhanced digital mammography is an emerging technology that comprises low-energy images comparable to FFDM and recombined images that highlight cancer neo-angiogenesis, similar to breast MRI.
The advanced level of artificial intelligence (AI) technique and the natural image classification method for breast figure categorization tasks were investigated. The author has explained the performance of the neural network (NN), support vector machine (SVM), Bayesian methods, and random forest (RF) algorithms for breast image classification [15].
Advanced soft computing technology is used to pre-process the images and achieve the best classification. A hybrid combination of photoacoustic imaging and machine learning, evaluated by comparing the area under the curve, specificity, and sensitivity of an SVM, has the potential to have a significant impact on diagnostics [16].
A novel classification technique depends on the fuzzy Gaussian mixture model (FGMM) by merging the fuzzy logic system and Gaussian mixture model power for the CAD method. This approach is used to distinguish between normal and malignant mammography images [17]. The confusion matrix was applied to generate the FGMM performance metrics, which improved the FGMM diagnostic accuracy and reliability in breast cancer diagnosis.
Breast cancer can be detected earlier using mammography. This model is based on a technique for mammography segmentation that is given with increased thresholding [18]. Furthermore, the final segmented image from the original image can easily identify breast cancer. In general, amplified segmentation is employed in all biomedical images for better detection, feature extraction, and visualization, which improves the accuracy of diagnosis.
Fuzzy multi-layer support vector machine (FMSVM) classification was used to estimate the extracted features, and their effects were determined [19]. This method is based on a combined image set taken from the publicly available mini-MIAS database [20]. The results show the efficacy with which benign, normal, and malignant tumors can be detected. The method is also used to detect the tumor area and to determine where the tumor is mainly concentrated [21]. It focuses on identifying the best algorithms for determining the tumors that exist in the breast. The most effective strategy for tumor diagnosis is a hybrid combination of K-means, dilation, and Canny edge detection techniques.
An automated breast segmentation process is employed to find the hottest region in thermograms using a morphological watershed operator, assisting experts in locating tumors effectively with infrared thermography [22]. A key consideration in thermogram assessment is the time required to achieve thermal stabilization. An automated image analysis system for distinguishing low and intermediate breast cancer grades in digitized histopathology has also been examined [23]. Pixel-level, object-level, and semantic-level features were identified as the multiple levels of feature sets from hematoxylin and eosin-stained breast biopsy tissue of 106 patients. In that study, a hybrid active segmentation method was used to classify nuclei from images. A cascaded approach was used to construct multiple SVM classifiers for abnormal mammogram classes [30].
A segmentation model based on various machine-learning approaches is presented [26]. This model was trained effectively using normal back propagation to improve the neural network convergence rate and segmentation. The typical technique for the segmentation process in breast cancer is discussed using an advanced soft computing paradigm [27]. Pixel-to-pixel-level classification and segmentation are effectively used to detect all mammograms. These models are effectively trained using an advanced machine-learning approach with a better accuracy rate [28]. This research is further enhanced by the gaps in the existing soft computing strategy, which comprises the numerous tools and datasets employed in this work [29]. Breast cancer can be diagnosed at an earlier stage based on histological images. Hyper-parameter tuning was used to improve the efficiency of the trained model [31].
A residual neural network model for breast cancer segmentation is performed by fine-tuning the magnification factors. Using this process, the classification accuracy was calculated [32], [33]. Diagnostic tools are used to detect abnormalities in the breast using breast ultrasound (BUS) imaging. Three classifiers are employed to increase the classification accuracy: K-nearest neighbors (KNN), random forest, and decision tree [34], [35]. The subjective approach of classification is to use SVM and decision tree to categorize malignant and non-malignant categories [36], [37].
Materials and Methods
This study presents a segmentation model that uses a hybrid combination of K-means and the Gaussian mixture model (GMM) to distinguish cancerous from non-cancerous breast images. For image pre-processing, an adaptive median filter was applied before K-means and GMM segmentation. Cancer is the uncontrolled accumulation of cell groups in a specific body location and is the second most common cause of death in women worldwide. The condition can be treated when it is properly recognized in its early stages. Several studies have been performed to detect cancers; however, no fully accurate technique has been developed to date. Hence, a novel approach was used to identify tumor regions more accurately. The proposed model was utilized to visually detect tumors and determine their location. This work mainly focuses on the detection of tumors situated in the breast and segments benign and malignant images using the K-means and GMM algorithms.
Digital mammographic images, such as normal, benign, and malignant, were obtained from the source [20]. A pre-processing technique improves the image quality for further processing by reducing or removing surplus or unrelated elements in the mammography image background.
A. Dataset and Data Preparation
The Mammographic Image Analysis Society (MIAS) is a consortium of UK research organizations authorized to better understand mammograms that have a digital mammography database [20]. It consists of normal and abnormal breast images of the patients. The database contains 322 open-access digitized films and is accessible on a 2.3 GB 8 mm (ExaByte) tape. The radiologist’s “truth”-markings on the areas of any anomalies may be included. The database was padded/clipped and trimmed to a 200 micron pixel edge, resulting in an image size of
The adaptive median filter operates on a rectangular window Rxy. It varies the size of Rxy during the filtering operation based on the conditions mentioned below. Each output pixel is created from the median of the 3-by-3 neighborhood around the corresponding pixel in the collected images. The image edges, on the other hand, are replaced with zeros. The filter outputs a single value that replaces the present pixel value at (x, y), the point on which Rxy is centered at that time. The notation used is:
\text{S}_{\mathrm{min}} = minimal pixel value of Rxy
\text{S}_{\mathrm{max}} = maximal pixel value of Rxy
\text{S}_{\mathrm{med}} = median pixel value of Rxy
\text{S}_{\mathrm{xy}} = value of pixel at coordinates (x, y)
\text{R}_{\mathrm{max}} = maximal allowed Rxy size
Thus, adaptive median filtering is used to smooth non-impulsive noise in 2D signals without blurring borders, so that image detail is conserved. The pre-processing stage covers orientation, segmentation, labeling, enhancement, and artifact removal for the mammogram. It also creates masks for high-intensity pixels, reduces the resolution, and segments the breast. Whereas a plain median filter makes the entire image fuzzier, the adaptive filter keeps the boundaries of objects in the image as crisp, fine, and straight lines that can be isolated directly.
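The filtering rule described above can be illustrated with a short sketch. The text does not list the exact window-growth conditions, so the code assumes the common two-stage textbook formulation: the window grows while its median looks like an impulse, and the pixel is kept only if it lies strictly between the window minimum and maximum. The function name, the default `r_max`, and the use of plain Python lists are illustrative choices, not the authors' MATLAB implementation.

```python
from statistics import median_low


def adaptive_median_filter(image, r_max=7):
    """Adaptive median filter on a 2D list of gray levels.

    Assumption: the two-stage textbook rule, with zero padding outside the
    image, as the text states that edges are replaced with zeros.
    """
    rows, cols = len(image), len(image[0])

    def window(x, y, size):
        # Collect the size-by-size neighborhood Rxy centered on (x, y).
        half = size // 2
        values = []
        for i in range(x - half, x + half + 1):
            for j in range(y - half, y + half + 1):
                inside = 0 <= i < rows and 0 <= j < cols
                values.append(image[i][j] if inside else 0)
        return values

    output = [row[:] for row in image]
    for x in range(rows):
        for y in range(cols):
            s_xy = image[x][y]
            size = 3
            while True:
                values = window(x, y, size)
                s_min, s_max = min(values), max(values)
                s_med = median_low(values)
                if s_min < s_med < s_max:
                    # The median is not an impulse: keep the pixel unless
                    # the pixel itself is an extreme value.
                    output[x][y] = s_xy if s_min < s_xy < s_max else s_med
                    break
                size += 2  # grow Rxy and try again
                if size > r_max:
                    output[x][y] = s_med
                    break
    return output
```

On a small synthetic patch with a single bright impulse, the impulse is replaced by the local median while its neighbors are left unchanged. Near the borders the zero padding pulls values toward zero, which is consistent with the edge behavior described in the text.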
Pre-processing
Preprocessing was performed using an adaptive median filter. This is the most significant step in medical image processing for detecting breast cancer using mammography images. The pre-processing image output was utilized for noise-free image classification. Figure 2 shows the various input images, such as normal, benign, and malignant, which are considered for further processing. The boundaries between microcalcifications and breast tissue were enhanced in the initial view of the images. The outcome of an adaptive median filter shows a better restoration of grayscale images. This helps to reduce the noise level when compared to other multilevel median filter types.
B. Proposed Model and Algorithm
The proposed model consists of an input breast database, image preprocessing, background elimination, filtering, and segmentation, as shown in Figure 3. The input dataset from the Mammographic Image Analysis Society (MIAS) is publicly accessible, and mammography images are extracted from it. Low-level image processing is often used in pre-processing to increase the contrast level. This increases the intensity difference between background objects to produce a reliable representation of the breast tissue. Background elimination is the process of creating a foreground mask to separate a component from the background. This method is used to detect objects in motionless images. An adaptive median filter was used to remove impulse noise and speckle from the images. In the proposed hybrid approach, the labeled features of both K-means and GMM are used to partition the region or seed points into various sub-instances.
The number of clusters and the cluster means were initialized using K-means. The Euclidean distance between each instance and the center of every cluster was computed, and the instance was allocated to the cluster with the minimal distance. As a result, the image points were labeled and clustered using the estimated distance. Once every instance has been assigned, each center is updated by averaging the points that belong to that cluster, and the cycle repeats. The algorithm terminates when each instance settles permanently in a cluster, that is, when no instance moves from one cluster to another. GMM is a versatile segmentation approach that allows the selection of a component distribution, estimation of the density for each group, and construction of soft cluster boundaries. The GMM parameters are computed using the expectation-maximization (EM) algorithm. EM is an iterative procedure for determining the maximum likelihood when the observed data are considered to be incomplete. Every iteration of EM contains two main steps: the E-step (expectation) and the M-step (maximization). In the E-step, the observed data and the current estimates of the model parameters are used to evaluate the conditional expectation of the missing data. In the M-step, the likelihood function is maximized under the hypothesis that the missing data are known, using the estimates from the E-step. The likelihood does not decrease from one cycle to the next, which guarantees convergence.
The GMM parameters, namely the means, variances, and mixing coefficients, are chosen to maximize the likelihood function and are estimated using the EM algorithm. In the initial stage, the number of classes N and the means, variances, and mixing coefficients are initialized. In the expectation step, the posterior probability of class m for a pixel x is computed with the present parameter values using (1), where G(x \mid \mu, \sigma) denotes the Gaussian density.\begin{equation*} \gamma_{m}(x) = \frac{\pi_{m} G(x \mid \mu_{m}, \sigma_{m})}{\sum\nolimits_{n=1}^{N} \pi_{n} G(x \mid \mu_{n}, \sigma_{n})} \tag{1}\end{equation*}
In the maximization step, the parameters are re-estimated using (2), (3), and (4), where the sums run over the P pixels x_{k}.\begin{align*} Mean~\mu_{m} =& \frac{\sum_{k} \gamma_{m}(x_{k}) x_{k}}{\sum_{k} \gamma_{m}(x_{k})} \tag{2}\\ Variance~\sigma_{m} =& \frac{\sum_{k} \gamma_{m}(x_{k}) (x_{k} - \mu_{m}) (x_{k} - \mu_{m})^{T}}{\sum_{k} \gamma_{m}(x_{k})} \tag{3}\\ Mixing~Coefficient~\pi_{m} =& \frac{1}{P} \sum_{k} \gamma_{m}(x_{k}) \tag{4}\end{align*}
The log-likelihood is evaluated by (5), \begin{equation*} \ln L(Y \mid \mu, \sigma, \pi) = \sum\nolimits_{k=1}^{P} \ln \sum\nolimits_{n=1}^{N} \pi_{n} G(x_{k} \mid \mu_{n}, \sigma_{n}) \tag{5}\end{equation*}
According to the density calculation, the number of clusters k in the GMM segmentation model is computed automatically for each image using the thresholding technique. After the GMM parameters are computed using the EM algorithm, the mammography images are segmented into k cluster regions, with every pixel belonging to one cluster. As a result, the image is segmented into benign, normal, and malignant tissue classes using K-means and GMM. Finally, the accuracy of the segmentation method is expressed as a percentage, as in (6), where TP, TN, FP, and FN are the absolute numbers of true positives, true negatives, false positives, and false negatives:\begin{equation*} Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \times 100 \tag{6}\end{equation*}
The error rate and the signal-to-noise ratio (SNR) in decibels are computed as in (7) and (8):
\begin{equation*} Error~Rate = \frac{1}{nm} \sum\nolimits_{a=0}^{n-1} \sum\nolimits_{b=0}^{m-1} \left\Vert \left( K(a,b) - I(a,b) \right)^{2} \right\Vert \tag{7}\end{equation*}
\begin{equation*} {SNR}_{decibel} = 10 \log_{10} \left( \frac{R_{signal}}{R_{noise}} \right) \tag{8}\end{equation*}
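As a quick check of how (6), (7), and (8) behave, the three measures can be written as small functions. This is a minimal sketch under stated assumptions: the images K and I in (7) are taken to be equally sized 2D lists of scalar gray levels, and R_signal and R_noise in (8) are treated as signal and noise power. The function names and the toy inputs are illustrative and are not values from this study.

```python
import math


def accuracy_percent(tp, tn, fp, fn):
    """Eq. (6): share of correctly labeled cases, as a percentage."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)


def error_rate(k_img, i_img):
    """Eq. (7): mean squared difference between two n-by-m images."""
    n, m = len(k_img), len(k_img[0])
    total = sum(
        (k_img[a][b] - i_img[a][b]) ** 2
        for a in range(n)
        for b in range(m)
    )
    return total / (n * m)


def snr_db(signal_power, noise_power):
    """Eq. (8): signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)


# Toy values, for illustration only.
print(accuracy_percent(45, 45, 5, 5))            # 90.0
print(error_rate([[1, 2], [3, 4]], [[1, 2], [3, 6]]))  # 1.0
print(snr_db(100.0, 10.0))                       # 10.0
```

With equal signal and noise power the ratio is 1:1 and `snr_db` returns 0 dB, which is the threshold referred to in the text.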
A signal-to-noise ratio greater than 1:1 (i.e., more than 0 dB) indicates that the signal is stronger than the noise. The steps for the K-means and GMM algorithms are as follows.
K-Means Algorithm
Input: Normal, benign, and malignant images
Output: Segmented image
Start
1. Set the number of clusters k
2. Select the initial centroids
3. Repeat
4. Expectation: Assign each point to its closest centroid.
5. Maximization: Compute the new centroid of the points in each cluster.
6. Until the centroid positions no longer change
End
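The K-means steps listed above can be sketched for scalar pixel intensities as follows. The initialization (centroids spaced evenly over the intensity range), the function name, and the `max_iter` cap are assumptions made for this illustration; the source does not state how the initial centroids are chosen.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's k-means on scalar intensities; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Assumption: initial centroids are spread evenly over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance
        # (absolute difference in one dimension).
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has settled
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Toy intensities with three well-separated groups.
centroids, labels = kmeans_1d([10, 12, 11, 200, 205, 198, 100, 102], k=3)
print(centroids)  # [11.0, 101.0, 201.0]
print(labels)     # [0, 0, 0, 2, 2, 2, 1, 1]
```

For an image, the pixel values would be flattened into one list before clustering and the labels reshaped back to the image grid afterwards.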
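GMM Algorithm
Input: Normal, benign, and malignant images
Output: Segmented image
Start
1. Initialize the means, variances, and mixing coefficients
2. Calculate the posterior probability of each pixel for each class
3. Assign each pixel to the class with the highest probability
4. Update the means, variances, and mixing coefficients
5. Evaluate the maximum likelihood estimate
6. Repeat Steps 2 to 5 until convergence
End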
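A compact sketch of the EM loop behind the GMM steps above, for scalar intensities, is given below. It follows the expectation step (1) and the updates (2) to (4); the fixed iteration count, the variance floor, and the function names are assumptions added for numerical safety and illustration, and are not taken from the source.

```python
import math


def gaussian(x, mu, var):
    """Univariate Gaussian density G(x | mu, var)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(values, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Fit a 1D Gaussian mixture by EM; returns (means, variances, weights, labels)."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_classes, n_points = len(means), len(values)
    resp = []
    for _ in range(n_iter):
        # E-step, Eq. (1): posterior probability of each class for each value.
        resp = []
        for x in values:
            joint = [w * gaussian(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step, Eqs. (2)-(4): re-estimate mean, variance, mixing coefficient.
        for m in range(n_classes):
            n_m = sum(r[m] for r in resp)
            if n_m == 0:
                continue  # empty component: keep its previous parameters
            means[m] = sum(r[m] * x for r, x in zip(resp, values)) / n_m
            var = sum(r[m] * (x - means[m]) ** 2 for r, x in zip(resp, values)) / n_m
            variances[m] = max(var, var_floor)  # assumption: floor avoids collapse
            weights[m] = n_m / n_points
    # Each value goes to the class with the highest posterior probability.
    labels = [max(range(n_classes), key=lambda m: r[m]) for r in resp]
    return means, variances, weights, labels
```

A production version would normally stop when the log-likelihood in (5) changes by less than a tolerance rather than after a fixed number of iterations.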
The proposed K-means and GMM models detect breast tumors and segment the images into benign, normal, and malignant categories. Greater accuracy was obtained with a lower error rate. The pseudo-code for the proposed method is as follows:
Proposed Algorithm
Input: Normal, benign, and malignant images
Output: Segmented image
Start
1. Select the input mammogram image
2. Apply the pre-processing technique to improve image quality
3. Eliminate the breast region boundary and uneven background
4. Remove noise and high frequencies through an adaptive median filter
5. Segment the data into k clusters using K-means and GMM
6. Frame the expectation step using Eqn. (1)
7. Calculate the mean, variance, and mixing coefficient in the maximization step using Eqns. (2), (3), and (4)
8. Evaluate the log-likelihood in the GMM model using Eqn. (5)
9. Estimate the accuracy values using Eqn. (6)
10. Classify the image as normal, benign, or malignant
End
The hyper-parameters of the K-means, GMM, and hybrid methods are presented. Using this algorithm, training was carried out on all the data in the given breast image repository. Cross-validation was used to evaluate the proposed model and to select the better-performing breast cancer model.
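The source does not spell out how the two algorithms are coupled in the hybrid model. One plausible reading, and only an assumption here, is that the K-means labels seed the GMM: the per-cluster mean, variance, and pixel share become the initial mean, variance, and mixing coefficient for EM. The helper below is a hypothetical illustration of that hand-off.

```python
def gmm_init_from_labels(values, labels, k, var_floor=1e-6):
    """Derive initial GMM parameters from hard cluster labels.

    Hypothetical helper: returns (means, variances, weights), one entry per
    cluster, computed from the values assigned to that cluster.
    """
    means, variances, weights = [], [], []
    for c in range(k):
        members = [v for v, lab in zip(values, labels) if lab == c]
        if not members:
            raise ValueError(f"cluster {c} is empty")
        mu = sum(members) / len(members)
        var = sum((v - mu) ** 2 for v in members) / len(members)
        means.append(mu)
        variances.append(max(var, var_floor))  # avoid a zero-width Gaussian
        weights.append(len(members) / len(values))
    return means, variances, weights


# Toy intensities and K-means style labels, for illustration only.
values = [10, 12, 11, 200, 205, 198, 100, 102]
labels = [0, 0, 0, 2, 2, 2, 1, 1]
means, variances, weights = gmm_init_from_labels(values, labels, k=3)
print(means)    # [11.0, 101.0, 201.0]
print(weights)  # [0.375, 0.25, 0.375]
```

Starting EM from such an estimate typically needs fewer iterations than a random start, since the components already sit near the intensity groups.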
Experiment, Results and Discussion
A. MIAS Dataset
Initially, the input data were imported from a breast data repository [20]. The original 322 images (161 pairs) at 50-micron resolution in “Portable Gray Map” (PGM) format and accompanying truth data description are included in the Mammographic Image Analysis Society (MIAS) dataset of digital mammograms (v1.21), as shown in Table 1.
The Digital Database for Screening Mammography (DDSM) was obtained from the University of South Florida. In image preprocessing, artifacts are one of the limitations of the given images, owing to the marking of some additional lesion spots. In addition, the MIAS dataset was used to enlarge the data collection for further processing. Pre-processing and classification techniques were utilized to evaluate the accuracy of the proposed method (322 images: 64 benign, 51 malignant, and 207 normal breast images).
Subsequently, the images must be pre-processed to increase the difference in intensity between background objects and produce reliable breast tissue structure representations. Furthermore, an adaptive median filter was utilized to eliminate noise and high frequencies. Additionally, hybrid k-means and GMM models were applied to segment the clusters using different sets of parameters.
Input images are classified into three types, namely normal, benign, and malignant, and include physician markings at the location of the abnormality. The database includes four types of abnormalities: suspicious lesions, architectural distortions, circumscribed calcifications, and masses. The proposed method was evaluated using the mammography image collection, and the results are presented separately. The image set was divided into classes based on size.
B. Segmentation
The mammographic images of three different cases, namely normal, benign, and malignant, were segmented using MATLAB R2019a. When a mammographic image contains microcalcifications, the proposed method produces a categorical outcome indicating whether the tissue is benign, normal, or malignant. The computation was run on an Intel® Core™ i5-8265U processor at 3.9 GHz with 8 GB of DDR4 memory under the 64-bit Windows® 10 operating system.
Figures 4, 5, and 6 depict the segmentation of normal, benign, and malignant tissues from mammogram images. A step-wise reflection of the methodology is depicted by projecting essential stages, such as removal of the pectoral muscle, filtering process, and segmentation.
C. Comparative Analysis
An extensive analysis of the proposed segmentation model was performed by comparing the hybrid model with three other methods: GMM, K-means, and thresholding methods. Figure 7 depicts the performance of the true-positive rate versus the false-positive rate.
Figure 8 shows that plain K-means is slower than GMM with a K-means initializer. The hybrid GMM and K-means algorithm converges after the 3rd epoch. The expectation-maximization procedure is assured to reach a local maximum after the 10th iteration, at which point the optimized K-means and GMM have both converged. GMM consumes less computation time than the other existing techniques. This occurs when it finds a local minimum that is not close to the global minimum.
When a precise value for k is specified, it can be substituted for k in the model reference, for example,
The learning rate of
Various segmentation approaches were compared with the proposed method to validate the performance measures. K-means and GMM have 93.8% and 65% accuracy with high error rates of 29.47% and 24.35%, and low SNR, respectively. Thresholding had 86% accuracy and error rates of 32.58% and 10.17%, respectively. The accuracies of the three categories of SVM with kernel functions were 56.93%, 72.28 %, and 84.33 %, respectively. Growth region hand selection and FCM-GA selection had accuracies of 63% and 71%, respectively. The proposed hybrid model (K-Means and GMM) has a better accuracy of 95.50%, a low error rate of 18.64%, and a high SNR of 13.05. Table 2 presents a comparative analysis of classification accuracy, error rate, and SNR parameters for benign, malignant, and normal images after 10 epochs and an average execution time of 0.068 s.
The application of hybrid K-means and GMM segmentation will assist physicians in making early diagnoses by improving the qualitative identification of breast cancer in mammography images. Table 2 shows that the proposed hybrid model has a segmentation classification accuracy of 95.5%, an error rate of 18.64%, and a signal-to-noise ratio of 13.05, which is more reliable than the existing techniques. Furthermore, the proposed technique minimizes the error rate.
The efficacy of the proposed method is presented in the diagnosis of breast cancer and its reliability in identifying malignant tumors from benign tumors. Using this method, medical experts can identify breast cancer faster with greater precision.
An extensive analysis of the results is presented using multiple metrics, such as accuracy, error rate, and signal-to-noise ratio. Analysis of variance (ANOVA) is a statistical approach for determining whether the means of one or more groups in a set differ significantly from one another. It checks the impact of one or more factors by comparing the means of different samples, as shown in Table 3. The standard deviation is given by (9):\begin{equation*} \sigma = \sqrt{\frac{1}{N} \sum\nolimits_{i=1}^{N} \left( x_{i} - \bar{x} \right)^{2}} \tag{9}\end{equation*}
where
x_{i} = sub sets
\bar{x} = arithmetic mean of data
N = number of sample sets
\sum (x_{i} - \bar{x})^{2} = sum of squared deviations over all sample points
The F-ratio was 1.0638 and the p-value was < 0.0001; the result is significant at p < 0.05.
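For readers who want to reproduce this kind of test, the standard deviation in (9) and a one-way ANOVA F-ratio can be computed as sketched below. The grouping of measurements and the toy numbers are assumptions for illustration; they are not the data behind Table 3, and the p-value lookup from the F distribution is left out.

```python
import math


def population_std(samples):
    """Eq. (9): standard deviation with the 1/N normalization."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))


def anova_f_ratio(groups):
    """One-way ANOVA F-ratio: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means)
    )
    ss_within = sum(
        (x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data, for illustration only.
print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
f_ratio = anova_f_ratio([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For the toy groups above the F-ratio is approximately 13, since the between-group mean square is 13 and the within-group mean square is 1. A larger F-ratio means the group means differ more than the within-group scatter would explain.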
The ANOVA test showed an improved prediction rate for the proposed breast cancer performance metrics. The hybrid model improved the detection of malignant breast cancer. The inference from this analytical study is that the proposed model yields higher accuracy, a lower error rate, and a higher SNR.
Table 4 shows that, among the various statistical tests, ANOVA gives the best result for the proposed work.
Conclusion and Future Work
In this research, two segmentation approaches, namely K-means and the Gaussian mixture model (GMM), are used to segment different categories of breast images, such as normal, benign, and malignant. The results show that the hybrid approach has better performance measures, namely an accuracy of 95.5%, an error rate of 18.64%, and a signal-to-noise ratio of 13.05, when compared to other existing techniques. The ANOVA test checks the impact of one or more factors by comparing the means, variances, and standard deviations of different samples. It shows a high prediction rate for the hybrid segmentation technique used in breast cancer detection.
The hybrid GMM and K-means model is a novel method for detecting breast cancer with good accuracy. Initially, the breast image from the data repository was preprocessed. Removal of speckle noise and special markings in medical images enhances image segmentation quality. The results show that the hybrid GMM and K-means model performs better than the existing techniques. The future scope of this method includes further improving precision, with greater emphasis on the segmentation models. This intelligent healthcare model has the potential to transform medical practice by addressing health problems in society, especially the detection of breast cancer in women at an early stage.
ACKNOWLEDGMENT
The authors would like to thank their universities for the time and support provided for this study.