Modeling and Analyzing of Breast Tumor Deterioration Process with Petri Nets and Logistic Regression

Abstract: It is important to understand the process of cancer cell metastasis and the cancer characteristics that increase disease risk. Because the disease is caused by many factors and its pathogenesis is complicated, interpretable and visual modeling methods are needed to characterize this complex process. Machine learning techniques have demonstrated extraordinary capabilities in identifying models and extracting patterns from data to improve medical prognostic decisions. However, in most cases, their results are unexplainable. Formal modeling can ensure the correctness and understandability of prediction decisions to a certain extent, and can visualize the analysis process well. Coloured Petri Nets (CPN) are a powerful formal model. This paper presents a modeling approach that combines CPN and machine learning for breast cancer, which can visualize the process of cancer cell metastasis and the impact of cell characteristics on the risk of disease. After evaluating the performance of several common machine learning algorithms, we choose the logistic regression algorithm to analyze the data and integrate the obtained prediction model into the CPN model. Our method allows us to understand the relations within the cancer cell metastasis process and to clearly see the quantitative prediction results.


Introduction
Breast cancer, which develops from breast tissue, is the most common cancer in women, accounting for about 25% of female cancer cases [1]. At present, several studies have expressed interest in medical information and analytics [2−4]. Usually, the morphological structure, metabolism, and function of a diseased body change, which is an important basis for researching and understanding diseases. From the perspective of pathology, studying the causes, mechanisms, development rules, and pathological changes of human diseases is a crucial means of preventing and treating them [5]. The occurrence of a disease, under the interaction of pathogenic factors and the body's reactions, is an extremely complex process. Therefore, helping doctors and patients better understand the pathogenic factors and the disease process is our top priority, and it calls for an interpretable and visual modeling method.
Petri nets [6] are a modeling and analysis tool for distributed systems. As a system model, a Petri net can describe not only the structure of a system but also its dynamic behavior. Many works have applied Petri nets as modeling tools in the medical field. For example, Ref. [7] proposed a new pathway analysis method using Petri nets to model signaling pathways. In Ref. [8], the authors proposed a Petri net model to estimate genomic and regulatory metabolic levels. In Ref. [9], the authors used Petri nets to study the bioenergetics of Mycobacterium tuberculosis with and without an uncoupler. In Refs. [10, 11], a Petri net based model of the human body iron homeostasis process was presented. However, the modeling and analysis of disease factors by formal methods remains an open problem. For the disease itself, it is crucial to understand the actual course and etiology.
• Xuyue Wang and Wangyang Yu are with the Key Laboratory of Intelligent Computing and Service Technology for Folk Song, Ministry of Culture and Tourism, and School of Computer Science, Shaanxi Normal University, Xi'an 710100, China. E-mail: wangxuyue@snnu.edu.cn; ywy191@snnu.edu.cn.
• Xiaojun Zhai and Sangeet Saha are with the School of Computer Science and Electronic Engineering, University of Essex, Colchester, CO4 3SQ, UK. E-mail: xzhai@essex.ac.uk; sangeet.saha@essex.ac.uk.
Machine learning methods have been successfully applied in medical diagnosis and image analysis [12, 13], but scholars still show strong interest in how to model diseases with a reasonable model. If a model is pathologically plausible to a certain extent, it may reveal some cellular features that lead to disease, thereby helping patients understand the disease. Most existing machine learning algorithms are not specifically designed for this purpose. Their prediction results are still susceptible to label bias and errors [14, 15], and they also suffer from shortcomings such as concept drift and inexplicability [16]. Formal methods can address these problems to a certain extent. It is essential to verify the correctness and accuracy of a complex process. Reference [17] proposed the use of colored Petri nets to analyze breast cancer data and demonstrated that formal modeling can reduce the impact of machine learning errors.
As of now, most studies are devoted to the diagnosis and prognosis of breast cancer, and to research in biochemistry and at the cellular microscopic level [18]. The pathological features of breast cancer and the process of cancer cell metastasis therefore deserve special attention, which requires constructing a model to describe and analyze these processes. To understand the disease process intuitively, we combine machine learning and Coloured Petri Nets (CPN) [19] to model and analyze breast cancer cell metastasis, including the specific transfer process at different stages. CPN is a modeling language that belongs to the class of high-level Petri nets and can represent the relations between causes and results. Although many mathematical methods can be used to analyze the properties of a CPN, we also want a specific numerical analysis of the pathogenic outcomes. Thus, this paper discusses a modeling and analysis method for the breast cancer process based on CPN and machine learning.
We reveal the proliferation of breast tumor cells at different stages, analyze the specific states of cancer cells, and finally use the prediction function of machine learning in place of the arc function to output the probability of tumor malignancy. In the model simulation, we integrate machine learning algorithms to process the relevant data and output the impact ratio of the various cell parameters.
Our method not only visualizes the entire metastasis process of cancer cells well, but also runs simulations and outputs the relevant results. This can help patients understand the specific process of cancer cell metastasis and how different cancer cell indicators affect the probability of the cancer outcome. We conclude that different cancer cell factors and different degrees of cancer cell lesions lead to different risks of the disease. The framework of the proposed method is shown in Fig. 1.
The rest of this paper is organized as follows. Section 2 describes the whole construction process of the breast cancer analysis model. Section 3 describes the specific analysis methods and simulation results of the model. Section 4 summarizes the paper.

Coloured Petri nets
Petri nets are traditionally divided into original Petri nets and high-level Petri nets. CPN belongs to the high-level Petri nets and is a graphical language for constructing models of concurrent systems and analysing their properties. A CPN model is executable and is both state and action oriented [20, 21]. By simulating a CPN model, we can study the behaviors of a system in different situations, which makes CPN models simple and clear for modeling complex systems. Based on these advantages, we constructed the Breast Cancer Cell Analyze (BCCA) model using CPN. For the definition of CPN, see Ref. [19]; the BCCA model is shown in Fig. 2.

Introduction of complete model
In the modeling process, we establish a CPN model of the four stages of breast tumor progression. Before modeling the entire mechanism, we first need to consider the changes that may be involved at different cell stages. The next step is to refine these changes, after which we can obtain a complete corresponding formal model. Finally, we complete the whole system. In the next section, we will introduce the algorithm evaluation and analysis method in detail. Modeling the action mechanism of cancer cells includes some key steps. First, we judge the cancer cell stage, characterize the specific stages of cancer cells, and then carry out state classification, as shown in Fig. 2. The color set used in Fig. 2 contains the cell parameters (…, b, c, g, e, f, t, inv, d), which belong to the real type num, and a final value i, the cell stage, which belongs to the int type stage. Together they form the color set IN, and after the judgement the output has the type INfo. We will elaborate on the meaning of the numerical values in the next section. It is particularly noteworthy that our breast cancer cell model is a formalized model of the specific case of multiple factors, subject to the following constraints:
(1) In the process of cell transfer, the stages are continuous and uninterrupted, and the factors are juxtaposed and have no priority.
(2) In the CPN model, we take all cell parameters into one color set as the input. After determining the stage of the cell, we use another color set as the parameter carrier of cell metastasis, and finally use the tumor type probability as the output color set.
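As a rough illustration of constraint (2), the color sets can be pictured as plain data types. The following Python sketch is not the CPN ML declaration used in Fig. 2; the class names `InToken` and `InfoToken`, the function `judge`, and the assumption that the stage takes values 1 to 4 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InToken:
    """Hypothetical stand-in for the input color set IN."""
    params: Tuple[float, ...]  # nine real-valued cell parameters (type "num")
    stage: int                 # cell stage (type "stage"), assumed to be 1..4

    def __post_init__(self):
        if len(self.params) != 9:
            raise ValueError("expected nine cell parameters")
        if self.stage not in (1, 2, 3, 4):
            raise ValueError("stage must be between 1 and 4")


@dataclass(frozen=True)
class InfoToken:
    """Hypothetical stand-in for the parameter carrier used after the judgement."""
    params: Tuple[float, ...]


def judge(token: InToken) -> Tuple[int, InfoToken]:
    """Split an input token into its stage and a parameter-carrying token."""
    return token.stage, InfoToken(token.params)


# Input values taken from the simulation example quoted later in the text.
token = InToken((5.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0), 1)
stage, info = judge(token)
print(stage, info.params)
```

Keeping the stage separate from the parameter carrier mirrors the description above, in which the stage decides the route through the net while the parameters travel on to the output arc.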

Specific modeling process of breast cancer
Breast cancer is caused by various internal and external carcinogenic factors, which mainly manifest as the loss of normal characteristics and abnormal proliferation of mammary duct epithelial cells, so that they exceed the limit of self-repair and become cancerous. The main clinical manifestation is a mammary mass. Different cell phenotypes have different effects on the tumor phenotype [22]. In the analysis process, we integrate the machine learning algorithms into the cell action mechanism; the specific meaning of the dataset and the relevant details are described in Section 3. In Fig. 2, the BCCA model describes the series of changes in cell metastasis, and the subsequent results, caused by various parameters of breast cells in the human body. The development of breast cancer cells in the human body is a fairly long process. It is mainly divided into four stages, namely the occult stage, early invasive cancer, the invasive cancer stage, and advanced breast cancer.
First, the occult stage, also known as the early stage of cancer: during this stage, human breast cells undergo cancerization and carcinoma in situ. Second, early invasive cancer: cancer cells begin to break through the basement membrane of the breast ductal epithelium. It can be divided into two categories: early invasive lobular carcinoma and ductal carcinoma. Third, the invasive cancer stage: cancer cells begin to infiltrate extensively into the breast stroma. Finally, advanced breast cancer: most patients will have cancer metastasis of varying degrees. The cancer cells spread widely, mainly to the lung, liver, bone, and other sites with multiple metastases, and can even endanger the patient's life.
The BCCA model visualizes these processes very well because CPN provides the foundation and basic primitives for graphical representation. The goal is to model specific systems with formal modeling methods, making the model more organized and the logic of cell transfer clearer. The details of the BCCA model are shown in Fig. 2. With CPN, the mechanism of the BCCA model becomes simpler and clearer, and thus easier to analyze and understand. Before modeling the entire cell mechanism, we first need to consider the cell stages that may be involved under different cell parameters. The next step is to judge the state according to the principle of pathological metastasis. We should pay attention to the stage details during the modeling process to finally obtain the corresponding formal model. At the beginning of this section, we described the four stages of breast cancer cells. After the model is constructed, we need to verify its rationality and validity. In the next section, we present the analysis method of the model and the simulation results.
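To make the idea of stage-to-stage token movement concrete, the following minimal Python sketch plays the token game on a plain place/transition net with one place per stage. It is a simplification under stated assumptions and not the coloured net of Fig. 2: the place names, the transition names `t1` to `t3`, and the strictly sequential structure are hypothetical.

```python
# Places are the four stages named in the text; identifiers are hypothetical.
STAGES = ["occult", "early_invasive", "invasive", "advanced"]

# Each transition moves one token from a stage to the next stage.
TRANSITIONS = {
    f"t{i + 1}": (STAGES[i], STAGES[i + 1]) for i in range(len(STAGES) - 1)
}


def enabled(marking):
    """Return the transitions whose input place holds at least one token."""
    return [t for t, (src, _) in TRANSITIONS.items() if marking[src] > 0]


def fire(marking, transition):
    """Fire one transition and return the new marking (input is not modified)."""
    src, dst = TRANSITIONS[transition]
    if marking[src] == 0:
        raise ValueError(f"{transition} is not enabled")
    new_marking = dict(marking)
    new_marking[src] -= 1
    new_marking[dst] += 1
    return new_marking


# Start with a single token in the occult stage and fire until nothing is enabled.
marking = {place: 0 for place in STAGES}
marking["occult"] = 1
trace = ["occult"]
while enabled(marking):
    t = enabled(marking)[0]
    marking = fire(marking, t)
    trace.append(TRANSITIONS[t][1])
print(" -> ".join(trace))
```

Because each transition has exactly one input and one output place, the token passes through the stages without skipping any, which corresponds to the constraint that the stages are continuous and uninterrupted.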

Model Simulation and Analysis
In this section, we will introduce the process of selecting the optimal algorithm from several basic algorithms of machine learning in detail, the specific meaning of the dataset, the fusion process of machine learning algorithm and CPN model, and the final simulation result of BCCA model.

Logistic regression
Logistic Regression (LR) [23] is a generalized linear regression analysis model, which is often used in data mining, automatic diagnosis of diseases, economic forecasting, and other fields. LR essentially addresses a binary classification problem, and its transformation to a probability is nonlinear. The hypothesis of the LR model is given in Eq. (1):
$h_\theta(X) = g(\theta^{\mathrm{T}} X), \quad g(z) = \dfrac{1}{1 + e^{-z}}$    (1)
where X represents the feature vector, $\theta$ the parameter vector, and g the logistic function. The output of this model always lies between 0 and 1.
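The hypothesis can be sketched in a few lines of Python. This is a minimal illustration, assuming the linear part is written as a weight vector plus a separate bias; the function names `sigmoid` and `hypothesis` are chosen here and do not come from the paper.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(weights, bias, x):
    """Return g(w . x + b) for a single feature vector x."""
    if len(weights) != len(x):
        raise ValueError("weights and features must have the same length")
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# With all-zero weights and bias the linear score is 0, so the output is 0.5.
print(hypothesis([0.0, 0.0], 0.0, [3.0, 4.0]))
```

The output is bounded between 0 and 1 for any input, which is why it can be read as a class probability.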

Other algorithms
Linear Discriminant Analysis (LDA) [23] is a classifier with a linear decision boundary. The basic idea of LDA classification is to assume that the sample data of each category follow a Gaussian distribution.
K-Nearest Neighbors (KNN) [23] is one of the simplest methods among data mining classification techniques. The core idea of the algorithm is that if most of the k nearest neighbors of a sample in the feature space belong to a certain category, the sample also belongs to this category and shares the characteristics of the samples in this category.
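The majority-vote idea behind KNN can be shown with a short Python sketch. It assumes Euclidean distance and uses made-up two-dimensional points; neither choice is taken from the paper, and the labels 2 and 4 only echo the benign and malignant codes of the dataset.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of samples")
    # Order training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up points forming two well-separated clusters.
train_x = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]
train_y = [2, 2, 2, 4, 4, 4]
print(knn_predict(train_x, train_y, (1.5, 1.5)))
print(knn_predict(train_x, train_y, (8.5, 8.5)))
```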
The Classification and Regression Tree (CART) [23, 24] is a widely used decision tree learning method. It is suitable for predicting discrete data.
Naive Bayes (NB) [23] is based on Bayes' theorem. This classification algorithm assumes class-conditional independence, that is, the variables are assumed to be independent of each other, which simplifies the calculation. The algorithm attains its highest accuracy only when this assumption holds. In practice, there is often some kind of dependency between variables.
The Support Vector Machine (SVM) [23] is a binary classification algorithm that supports both linear and nonlinear classification. The core idea is to maximize the separation between the two categories, so that the separation is more reliable. At the same time, it has good classification and prediction ability for unknown new samples, that is, good generalization ability.
In recent years, machine learning algorithms have been widely used in medical diagnosis because they can use data or past experience to automatically optimize computer programs. Machine learning can be divided into supervised learning, unsupervised learning, and semi-supervised learning, among which supervised learning includes classification algorithms and regression algorithms. In this experiment, we adopt the dataset of breast cancer tumors from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) and evaluate several basic algorithms on this dataset to choose the one that works best for us. We use precision, recall, and F1-score to evaluate the algorithms, and the evaluation results are shown in Fig. 3. The Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) index of LR are shown in Fig. 4. An AUC of 1 corresponds to a perfect classifier, so the closer the AUC is to 1, the better the model. Therefore, by referring to the above evaluation results, we finally choose LR to integrate into the BCCA model to analyze the data.
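For reference, the evaluation measures named above can be computed as in the following Python sketch. It is a plain-Python illustration rather than the evaluation code behind Figs. 3 and 4; the function names are chosen here, class 4 (malignant) is assumed to be the positive class, and the labels and scores are made up.

```python
def precision_recall_f1(y_true, y_pred, positive=4):
    """Compute precision, recall, and F1-score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc(y_true, scores, positive=4):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels, predictions, and scores for illustration only.
y_true = [4, 4, 4, 2, 2, 2, 2, 2]
y_pred = [4, 4, 2, 4, 2, 2, 2, 2]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
print(precision_recall_f1(y_true, y_pred))
print(auc(y_true, scores))
```

The pairwise definition of AUC used here equals the area under the ROC curve and makes it clear why a value of 1 means that every positive sample is scored above every negative one.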

Data analysis
Breast cancer cells are a type of cancer cell that can lie hidden in the breast and, under the right circumstances, can be stimulated to cause breast cancer. To demonstrate our approach to the study of the breast tumor metastasis process, we performed data analysis on a public dataset of breast cancer tumors from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). The Wisconsin Breast Cancer Original dataset is composed of 699 instances with 11 attributes describing breast cell properties. These features are derived from the clinical cases reported by Dr. Wolberg. The meanings of the features that make up the dataset are shown in Table 1. We also visualized the dataset, and the histogram and scatter matrix of the data are shown in Fig. 5.
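A record of this dataset can be read as in the following Python sketch. The attribute names follow the UCI description of the dataset and may be worded differently from Table 1; the function name `parse_row`, the choice to skip records with a missing value (marked `?` in the UCI file), and the sample identifier in the example are assumptions made for illustration.

```python
# Nine cell features, in file order, between the sample id and the class label.
FEATURES = [
    "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses",
]


def parse_row(line):
    """Parse 'id,f1,...,f9,class'; return None if any feature is missing ('?')."""
    fields = line.strip().split(",")
    if len(fields) != 11:
        raise ValueError("expected 11 comma-separated attributes")
    sample_id, raw_features, label = fields[0], fields[1:10], int(fields[10])
    if "?" in raw_features:
        return None
    features = dict(zip(FEATURES, (float(v) for v in raw_features)))
    return sample_id, features, label


# Hypothetical id; feature values match the simulation input quoted in the text.
row = parse_row("0000001,5,1,1,1,2,1,3,1,1,2")
print(row)
print(parse_row("0000002,8,4,5,1,2,?,7,3,1,4"))
```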
Cancer refers to all malignant tumors in a broad sense, but only to malignant tumors originating from epithelial cells in a narrow sense. In terms of morphology, the cells of benign tumors are not very different from normal cells, while the cells of malignant tumors show obvious atypia. In addition to its large size, a malignant tumor often spreads to the surrounding area and has strong destructive and lethal power. From a pathological point of view, in the pathological examination of benign tumors, the cell structure is similar to that of normal cells and there is no mitotic phenomenon, and such tumors have no effect on human life. However, in the pathological detection of malignant tumors, in addition to growing in size and blocking or compressing the surrounding tissue, the cells completely lose their normal physiological functions, eventually leading to death. Therefore, the cell parameters and mass parameters in the dataset are very important for judging the nature of the tumor.

Model simulation
Based on the above data analysis and algorithm analysis, we choose LR to integrate into the BCCA model. First, we obtain the regression coefficients and the bias coefficient of the trained LR model. Then, we combine them into our prediction function according to the sigmoid function. Finally, we use this function to replace the original output arc function. The detailed steps of model fusion and model simulation are as follows.
In the final output arc function, the regression coefficients we obtain are 1.20, 0.32, 0.56, 0.93, −0.13, 1.49, 0.98, 0.9, and 0.73, and the resulting bias coefficient is −1.05. As shown in Fig. 6, we integrate them into the final output function. The operating rules of the entire model follow Algorithm 1. The input value is 1'(5.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0, 1). The meaning of each of the first nine values follows serial numbers 2−10 in Table 1, and the last value represents the stage of the cancer cell. After the judgment, the token undergoes the specific process of differentiating into the various cancer cell stages, and the output function finally outputs the probability of the tumor type (2: benign, 4: malignant).
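The structure of such an output function can be sketched in Python as follows. This is only an illustration built from the coefficients and bias quoted above: the text does not state whether the features were scaled before training or which class the sigmoid output refers to, so the value computed here from raw inputs should not be expected to reproduce the probabilities reported in Fig. 6. The function names are hypothetical.

```python
import math

# Regression coefficients and bias as quoted in the text.
COEFFICIENTS = [1.20, 0.32, 0.56, 0.93, -0.13, 1.49, 0.98, 0.9, 0.73]
BIAS = -1.05


def linear_score(features):
    """Weighted sum of the nine cell parameters plus the bias."""
    if len(features) != len(COEFFICIENTS):
        raise ValueError("expected nine cell parameters")
    return BIAS + sum(c * x for c, x in zip(COEFFICIENTS, features))


def sigmoid_output(features):
    """Apply the sigmoid to the linear score (assumes unscaled inputs)."""
    z = linear_score(features)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# First nine values of the input token; the tenth value (the stage) is not a feature.
token = (5.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0, 1)
features = token[:9]
print(linear_score(features), sigmoid_output(features))
```

In the CPN model this computation sits on the output arc, so the probability is produced when the final transition fires rather than in a separate post-processing step.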
Then, we used CPNTools [25] for model simulation. CPNTools is a convenient simulation tool that can dynamically display the execution of the model and the changes in cancer cell stage, and output the final states. The simulation results of the effects of cancer cell parameters on tumor properties are shown in Fig. 6.

Discussion
In this part, we discuss and verify the results. Figure 6 shows the prediction process of the BCCA model specified by CPN. The input value 1'(5.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0, 1) indicates that the cancer cells are in the first stage. As shown in Fig. 6, the probability that the tumor is benign is then 0.955414012739. Meanwhile, the input value 1'(8.0, 10.0, 10.0, 8.0, 7.0, 10.0, 9.0, 7.0, 1.0, 4) indicates that the cancer cells are in the fourth stage. As shown in Fig. 6, the probability that the tumor is benign is then 0.219659527732. These simulation results are consistent with the source database, and the model prediction accuracy is 0.9707602339181286. To sum up, the simulation results verify the rationality of the model. We proposed a modeling method for the breast cancer cell metastasis process based on CPN. This method not only uses a formal language for modeling, but also incorporates machine learning algorithms as analysis methods. The two complement each other. CPN can effectively describe the pathological process; to a certain extent, it makes up for the shortcomings of machine learning in the prediction process and improves interpretability. Machine learning gives CPN a convenient method for analysis. As far as we know, this is the first attempt to apply a formal approach to the analysis of breast cancer cell metastasis. This method provides a new idea for studying the metastatic process of breast cancer cells.

Conclusion
This paper studies the metastatic process of breast cancer cells at different stages and proposes modeling and analysis methods based on CPN. During the modeling process, we introduced a logistic regression algorithm as a data analysis approach to efficiently output the effects of individual cell parameters on tumor types. This method can demonstrate the cancer cell metastasis process well and also visualize the data analysis process. We built the BCCA model based on CPN, which allows us to express cancer cell characteristic data as tokens. The BCCA model can be simulated dynamically in CPNTools and outputs the probability of tumor progression under different cancer cell parameters. It can also provide better predictive assistance for clinical medical diagnosis. In the future, we will continue to carry out in-depth research and optimize the disease analysis model to make it more widely applicable.

Fig. 1 Proposed research methodology to analyze and visualize breast cancer using CPN.

Fig. 4 ROC curve and AUC index of LR.

Algorithm 1 (fragment): $C_2 = C_2 \cup \{E(t_i, p_i)\}$. Get the new marking, recalculate the enabled binding elements under the marking, where … is printed with probability …