Automated Risk Management based Software Security Vulnerabilities Management

This work explores an automated risk assessment approach. The focus is to optimize the conventional threat modeling approach for exploring software system vulnerabilities. Data produced in software development processes can be better leveraged using machine learning approaches, and the large body of industry knowledge around security vulnerabilities can be used to enhance current threat modeling approaches. The work is set in the ecosystem of software development processes that use Agile methodology, and data from the insurance business domain are the target of this study. The aim is to enhance the traditional threat modeling approach with a more quantitative approach and to reduce the biases introduced by the people who are part of software development processes. This effort will help bridge the multiple data sources prevalent across the software development ecosystem. Bringing these data sources together will assist in understanding patterns associated with the security aspects of software systems, which in turn helps to devise better controls. Approaches explored so far have considered individual areas of software development and their influence on improving security. There is a need to build an integrated approach for a total security solution for software systems. A wide variety of machine learning and ensemble approaches will be explored. CWE (Common Weakness Enumeration) mappings from industry knowledge are leveraged to validate the security needs from the industry perspective. Combining industry and company data gives a holistic picture of the software system’s security and lays down the path for an integrated security management system in software development. The uniqueness of this work is the risk management framework with a quantitative threat modeling process.
This work contributes towards making the software systems secure and robust with time.


I. INTRODUCTION
Threat modeling is one of the prominent parts of software development processes. In practice, a large part of the exercise relies on expert judgment. There is a need to make this exercise as quantitative as possible. In this paper, the constructs of threat modeling are studied to build a quantitative threat modeling approach with less dependency on experts. The paper first explores related work and then focuses on automating the security vulnerability risk assessment approach and the threat modeling approach with machine learning. Both exercises are combined for better outcomes. Machine learning classification approaches are leveraged to get visibility into possible security vulnerabilities.

A. MOTIVATION
This work is motivated by the lack of focus on threat modeling for software vulnerabilities. Though this exercise is conducted, it is restricted by the manual intervention required from experts and the time involved. Multiple efforts focus on security vulnerability discovery, but they happen in silos. This state motivates bringing these efforts together into a common framework. With security threats and their impact evolving at a fast pace, it is essential to develop such systems.

B. OBJECTIVE
This work intends to quantify security threats and help direct attention where it is needed. The focus area is connecting the knowledge prevalent in the industry with the needs of software development. The study gives prominence to utilizing the organization's historical knowledge for better visibility into future operations. The flexible threat modeling approach targets the critical security areas according to the areas prioritized under study. A further objective is to gain the confidence of software development stakeholders through predictable and secure software systems.

C. CONTRIBUTION
This paper contributes an approach that makes threat modeling a data-based quantitative process by reducing the manual intervention of experts. With reduced dependency on security experts and human intervention, the approach can be used extensively whenever needed. The proposed approach will help build a knowledge system that gets better over time by including knowledge from across the industry and within the company. The proposed system helps bridge the gap between security experts, software development teams, and software system users. This work is part of a comprehensive Software Security Management system envisioned by the authors.
The paper also contributes to the information security domain by helping reconcile the data available across industry and company for the benefit of software development teams. The ideas presented in this paper are novel and essential in ensuring a common approach to threat modeling in an organizational setting. An integrated model for detecting security threats in an organizational setting will help software development teams explore security flaws effectively. This approach would be of immense use as it standardizes threat detection in software system development. It will serve as a knowledge management tool for software development companies. The primary focus of our work is on applying data analytics to threat modeling and risk assessment approaches and on proposing an integrated approach.
The paper starts by exploring the literature on threat modeling for software development and on the machine learning approaches used to learn security needs from various data sources across the industry. The focus areas of the paper are discussed next, followed by an overview of threat modeling. The application of risk assessment with threat modeling is then explored, and the data collection process is discussed in the subsequent section. Experiments are built on understanding how threat prediction can be introduced into the conventional threat modeling approach. The outcome of these experiments is validated against the best available approaches and their results. The paper wraps up with a discussion of the weaknesses of the work and possible future directions.

II. LITERATURE REVIEW
In [5], neural networks, deep learning techniques, and ensembles were explored for cyber security. The targeted areas are intrusion detection, prediction of cyber-attacks, and malware identification. That paper provides a reference point in deep learning for cyber security professionals. It highlights the need for further exploration to make the algorithms more efficient for the specific data under study [6] [7]. Challenges in data collection are also highlighted as an area that needs focus. The paper explores a variety of deep learning approaches in the cyber security space; however, these approaches are not applied to software development processes. We leverage some of the learning from that paper and explore it for software development processes in our work. In [8], machine learning and deep learning approaches are explored to tackle the security issues affecting big data. With the considerable growth in data volumes, vulnerabilities arise that allow security threats to hamper the system. In [9] and [10], a taxonomy around the threat modeling approach was explored with machine learning and deep learning approaches. However, practical implementation in software development ecosystems was not achieved. In [11], the growing application of machine learning and the possible vulnerabilities it introduces into the system were explored. The study covers the threat model for machine learning and explores the attacks involved and the defenses that would be needed. We take away some of the learning around threat modeling from this study. The work attempts to bring perspective on the model accuracy, complexity, and resilience that need attention based on the operating environment [12] [13].
Work [14] explores machine learning capabilities for cyber security, including the identification of advanced threats and targets in infrastructure vulnerabilities, organization profiling, and other exploits. Where traditional malware handling approaches fall short, these new capabilities come in handy. From this work, we take away the key insights of applying machine learning in the cyber security space and re-use and test them in the software development space. Work [15] focuses on providing insights into threat modeling. An exploration of Microsoft's threat modeling is used as a base to offer insights into an effective threat modeling approach. Works [16] and [17] explore threat modeling applications in agile software development processes with Microsoft's STRIDE approach. Practical challenges facing the industry are explored and validated against the challenges highlighted in the literature. Some of the key challenges arise during the asset identification stage and in how the outcomes of the threat modeling exercise are implemented afterwards.
In [17], a variety of vulnerabilities were studied. It helped to understand the granular details of the vulnerabilities. This knowledge helps to build the datasets for our experiments and to enhance the construction of the machine learning experiments. In [18], the focus is on the automation of threat modeling. The aim is to reduce the effort involved in threat modeling by leveraging the available data [19] [20]. This work helps to understand the thought process behind the threat modeling framework. This learning helped build a threat modeling framework that can work with other security-related frameworks. Lack of context is another challenge faced by prediction models. This lack of context arises because domain knowledge is not considered in the modeling process. The authors try to bring in an ontology framework to improve the conceptual modeling. In [21], the authors introduce threat identification as part of the software development lifecycle. The idea is to reduce the need for educating software development experts on security knowledge. The proposed approach analyzes the design of the software to explore risks and threats to the system. The authors introduce a new data structure, named the identification tree, for the detection of threats [22]. The mitigation tree approach is utilized for the description of countermeasures. These methods provide a guided approach for risk assessment across the software development lifecycle.

III. RESEARCH GAPS
Based on the literature review, we see the following research gaps. A focus on learning the structure of customer requirements in agile development methodology from a security perspective is currently missing. So is the use of security category constructs from industry data to derive the implicit security needs of the customer. These elements are essential to bridge conventional threat modeling and risk assessment approaches with machine learning capabilities. Integrating customer, industry, and software process data sources to learn the security needs is not addressed appropriately and needs deeper exploration. This approach helps to obtain a comprehensive view of security vulnerabilities within the proposed framework.
Deep learning advancements in cyber security are another area that needs attention. Improving algorithms based on the data ecosystem can provide good leverage for research on improving software development practices.
The exploration of deep learning approaches in the software development space needs more effort. Deeper work is needed on addressing security issues in software development, and the automation of software system threat modeling needs more focus. Deeper work is also needed on associating context and domain knowledge when modeling the information. A stronger association of risk assessment and threat modeling good practices is important to leverage the power of both frameworks.

IV. KEY THEMES OF THE STUDY
In [3], we outlined an efficient system that can facilitate information management in software development processes. The intent is to identify the various sources of software development process data and ways to model them so that they supply helpful information [23] [24]. The overall objective of this integrated information system is to combine the information from customer conversations, industry best practices, and internal software development processes. Figure 1 depicts the system outline of the conceptualized Integrated Information Management system to tackle security vulnerabilities. Threat modeling concepts are studied under the industry knowledge modeling module. This study aims to reduce the inefficiencies prevalent in the threat modeling exercise. Machine learning approaches are explored to build predictability into the exercise. The information available in the industry around threats and vulnerabilities can be leveraged to provide the information needed when it matters. Combining vulnerability identification from customer conversations, internal software development processes, and industry knowledge will help build a robust information system.

V. THREAT MODELING APPROACH
Cyber security risks that challenge the software system make it essential for the industry to take proactive steps towards tackling them effectively. The complexity of threat modeling makes it less effective when it is implemented. The best approach is to start with simple steps and build on them [25] [26]. Constructing software systems boils down to the requirements that specify the features needed, the acceptance criteria from the customer, and the technical breakdown of the requirements. The absence of a specific standardized approach to threat modeling makes the situation harder [27]. Technical risks would be a good starting point as they are particular to the software system, such as those around missing security controls in the software. Since the software system's structure is well within the team's control, these risks are easier to handle. Making risk identification a collaborative effort goes a long way in maintaining effectiveness in the system [28]. Agile methodology has a suitable team structure in which the product owner, system analyst, developer, tester, architect, and scrum master form a scrum team that intends to deliver a value-based product to the customer. This mix of expertise across the value chain can be a good setup for collaborative threat assessment. Handling cyber security risks goes beyond ticking a checklist; it also means making sure the business risks are kept in check [29] [30].
Breaking the system into smaller components is a good place to start the analysis. This specific focus helps to take action more frequently and see the progress. This iterative approach to threat modeling will help get everyone involved, in contrast to a complex analysis done at the beginning of the project [31] [32]. Exploring and brainstorming the threats, and then prioritizing and fixing them, are simple starting points to implement. Deciding on the stakeholders needed for the threat exploration is essential. The frequency of the threat exploration sessions also needs to be agreed upon. It is always better to have a face-to-face session with the people involved rather than an online session. This has been our experience while conducting brainstorming sessions with the software development team for threat modeling analysis. Figure 2 shows a simple format to identify the threats in the system.
As discussed earlier, prioritizing and taking up the important work in order to time-box the exercise is essential. This focus helps to maintain healthy progress on the threat modeling exercise. The latest features being worked on, any identified security features, services that collaborate with other services, and technical security debt are good areas to start with [33].

VI. COMBINING RISK ASSESSMENT APPROACH AND THREAT MODELING
The risk assessment approach that we have been practicing in our company is based on information security assets. We look at confidentiality, integrity, and data availability as a primary focus area for our security risk assessment. Based on the probability of occurrence of events that compromise these three factors and their impact, we arrive at the risk level. Based on these risk levels, risk mitigation actions are devised. Risk mitigation will be around security controls needed to manage those risks. Threat modeling for security risks focuses on the technical risks involved in software systems. It focuses on all phases of the software development lifecycle, including requirements gathering, design, construction, and testing. Threats and vulnerabilities hampering the software systems are used as a base in this assessment. We integrate the risk assessment and threat modeling approach with data analytics approaches in our work. In the further part of this section and subsequent sections, we propose our approach. In the first stage of this risk assessment approach-based threat modeling, all the components of the software system and processes are to be listed. For example, network and communication-related components, software components, and other similar areas. In the next phase, impact analysis is conducted. To start with this exercise, initial impact analysis can be expert judgment-based. Later a database can be set up to track the events that will feed into automated impact analysis [34]. Impact analysis includes three components, what is the result of compromise on the confidentiality of the data, integrity of the data, and availability of the data. We have used this approach of risk assessment based on confidentiality, integrity, and availability in our company and have observed that it provides a comprehensive view of the security risks and their impact. Based on these three components, impact value can be derived. 
The organizational database can be created to collate the experience and events, which will help understand the impact of compromise of data from all three perspectives referred above [35]. This database can be an ongoing repository that helps build the knowledge base for impact analysis. Impact value of confidentiality, availability, and integrity can be provided a range of values based on its impact on the customer. Based on the combination of values of these three parameters, the final impact value can be arrived at in this work [36].
Data collected in the organizational database can be used to predict the impact value. All the attributes associated with the identified components can be put together to model the impact based on confidentiality, integrity, and availability. Alternatively, any other parameters relevant to the system can be used to build the threat modeling system. The impact value can also be derived directly from the attributes associated with the target components. In the next stage, based on the categories of the components, the possible threats and vulnerabilities that would impact the components can be listed. This listing can be based on the organization's historical data or on industry knowledge. To build a list based on historical experience, it is essential to have a process that captures all the threats that have hampered software system components and the vulnerabilities in the system that allowed those threats to be exploited. If we take the technical failure of software components as the threat, it would be caused by vulnerabilities such as inadequate business continuity management, inadequate system monitoring, insufficient user testing, and others. Using industry data, we can model the threats and vulnerabilities based on the information related to software system failures and their causes.
The next part of the information needed is the probability of occurrence of these vulnerabilities. This information helps to further assess the risk level of the failures. The probability of occurrence can be expressed on a high, medium, or low scale based on the number of times those events have occurred in the past. This tracking needs a system to capture all the events associated with software system deficiencies related to confidentiality, integrity, and availability. As discussed earlier, these three factors or any other factors that are relevant to the system can be considered. When the threats are modeled in this approach, impact evaluation and occurrence evaluation can be combined to obtain the final risk levels of the components. The final risk level is a combination of the impact value and the probability of occurrence of the vulnerabilities in the past. Components can then be studied for the controls needed based on the risk level. The controls needed can be based on the threat type and its vulnerability types. Control information can be derived from industry knowledge databases. Control refinement can be carried out based on the residual risks after implementing the control. The system can thus operate in real time, capturing all the information periodically and recalibrating its risk values and applied controls. Based on future events, the system calibrates itself and provides direction for further strengthening. Figure 3 depicts the outline of a risk assessment-based approach for threat modeling. We consider impact value $I_v$, identified impact factors $T_f$, risk value $R_v$, and probability of occurrence $P_O$. (1)
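The combination of impact and probability of occurrence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact formulation used in this work: the 1-to-3 scoring scale, taking the maximum of the three confidentiality, integrity, and availability scores as the impact value, the event-count thresholds, the product rule for the risk value, and all component names are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    confidentiality: int  # impact of a confidentiality compromise, 1-3 (assumed scale)
    integrity: int        # impact of an integrity compromise, 1-3
    availability: int     # impact of an availability compromise, 1-3
    past_events: int      # number of recorded vulnerability events


def impact_value(c: Component) -> int:
    """I_v: assumed here to be the worst of the three impact scores."""
    return max(c.confidentiality, c.integrity, c.availability)


def probability_of_occurrence(past_events: int) -> int:
    """P_O: bucket the historical event count (thresholds are assumptions)."""
    if past_events >= 10:
        return 3  # high
    if past_events >= 3:
        return 2  # medium
    return 1      # low


def risk_value(c: Component) -> int:
    """R_v: assumed here to be the product I_v * P_O, giving a range of 1-9."""
    return impact_value(c) * probability_of_occurrence(c.past_events)


def risk_level(rv: int) -> str:
    """Map the numeric risk value back to a low/medium/high label."""
    if rv >= 6:
        return "high"
    if rv >= 3:
        return "medium"
    return "low"


# Hypothetical components, ranked so that controls are studied for the riskiest first.
components = [
    Component("report-export", 2, 1, 1, past_events=1),
    Component("payment-api", 3, 3, 2, past_events=12),
]
ranked = sorted(components, key=risk_value, reverse=True)
for c in ranked:
    print(c.name, risk_value(c), risk_level(risk_value(c)))
```

Recomputing these values whenever new events are recorded corresponds to the periodic recalibration described in the text; any other combination rule (for example a lookup matrix) could be substituted for the product.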

VII. DATA COLLECTION
Threat modeling for software development processes can be carried out using the information captured in those processes. Each threat category has multiple CWEs (Common Weakness Enumeration entries) under it. CWE is a community-developed list of software and hardware weakness types [37]. In software development processes that follow the Agile framework, requirements are documented as user stories, which are further broken down into tasks. TFS (Team Foundation Server) is used as the ALM (Application Life-cycle Management) tool. Test cases are created to cover all the expected testing scenarios. Any issues identified during software development are tracked as defects and addressed.
Tasks and test cases are linked to the user stories, and any defects found during testing are linked to test cases. These linkages help to maintain traceability. Leveraging the traceability of these work items, requirements can be mapped to security-related defects, and security-related issues can in turn be traced to CWEs. Expert involvement is needed to map the security issues to CWEs, so the required training and knowledge sharing must be enabled for this process. Building a model around the patterns that lead from software requirements to security issues to the associated CWEs will help understand the possible threats that would hamper the software system. In this data collection approach, the linkages between these work items are leveraged. The CWE mapping done with the involvement of software development experts is used for the modeling. The idea is to build a prediction engine that can predict the CWEs that may result when a customer requirement is being worked on. This identification gives software development teams an opportunity to engage security controls much earlier in the process.
All the work items from TFS are extracted, including user stories, tasks, test cases, and defects. All the work items that reference a CWE are selected. The parent work items of these work items are also collected to trace back to the original requirements. CWE ids are separated from the text content and act as labels for the text descriptions. The title and description of every work item that refers to a CWE are extracted to create a data source containing text descriptions and their CWE mapping. From the data extracted, 1458 text descriptions are available that are mapped to 64 different categories of CWEs.
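The labeling step described above can be sketched as follows. This is a minimal sketch, not the extraction code used in the study: the regular expression, the in-memory `work_items` dictionary (standing in for a TFS export), and the helper names `split_cwe_labels` and `root_requirement` are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches references such as "CWE-79" or "cwe 89" (the pattern is an assumption).
CWE_PATTERN = re.compile(r"\bCWE[-\s]?(\d+)\b", re.IGNORECASE)


def split_cwe_labels(text: str) -> Tuple[str, List[str]]:
    """Return the text with CWE references removed, plus the sorted CWE ids."""
    ids = sorted({f"CWE-{n}" for n in CWE_PATTERN.findall(text)},
                 key=lambda s: int(s[4:]))
    cleaned = re.sub(r"\s+", " ", CWE_PATTERN.sub(" ", text)).strip()
    return cleaned, ids


def root_requirement(item_id: int, parents: Dict[int, Optional[int]]) -> int:
    """Follow parent links up to the top-level work item (the user story)."""
    seen = set()
    while parents.get(item_id) is not None and item_id not in seen:
        seen.add(item_id)  # guards against accidental cycles in the links
        item_id = parents[item_id]
    return item_id


# Made-up work items standing in for a TFS export.
work_items = {
    101: {"text": "As an agent I can search policies by name", "parent": None},
    201: {"text": "Verify search handles special characters", "parent": 101},
    301: {"text": "Search page reflects unsanitized input CWE-79", "parent": 201},
}
parents = {item_id: item["parent"] for item_id, item in work_items.items()}

dataset = []
for item_id, item in work_items.items():
    description, labels = split_cwe_labels(item["text"])
    story = work_items[root_requirement(item_id, parents)]["text"]
    for label in labels:
        dataset.append({"requirement": story, "description": description, "cwe": label})
```

Each row pairs a text description with one CWE label and keeps the originating requirement, which is the shape needed for the classification experiments that follow.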

VIII. PREDICTION MODEL FOR THREAT ASSESSMENT
The prediction model intends to categorize the customer requirements into respective CWE categories. Once the model is built around this content, it will be possible to map the new requirements coming from customers to their possible CWE and predict the potential threats that may hamper the software system. Based on the CWEs mapped, security controls can be devised to tackle threats to software systems.

A. MULTI-CLASS CLASSIFICATION APPROACHES
In the first phase of this exercise, the text descriptions are vectorized with TF-IDF (Term Frequency-Inverse Document Frequency). Logistic Regression, Random Forest classifier, Multinomial NB (Naïve Bayes), and Linear SVC (Support Vector Classifier) are used for classification modeling. The Random Forest classifier is tuned with n_estimators of 200 and max_depth of 3. Five-fold cross-validation is chosen for the modeling. Table 1 shows the results of the first round of modeling. Linear SVC showed a precision of 51%, recall of 48%, and F1 score of 46% on a weighted average scale. This performance is not up to the mark.
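A first-round experiment of this kind could be set up with scikit-learn roughly as shown below. This is a sketch under stated assumptions: the ten-sentence corpus is made up purely for illustration, the `f1_weighted` scoring choice, `max_iter`, and `random_state` are assumptions, and all other hyperparameters are library defaults except the Random Forest settings named in the text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus (made up): five descriptions per label so 5-fold stratified CV is possible.
texts = [
    "search page reflects unsanitized script input",
    "profile page renders user html without escaping script",
    "comment field allows script injection in browser",
    "error message echoes unsanitized input as html",
    "claims form displays script from user input",
    "login query concatenates user input into sql",
    "report filter builds sql statement from raw input",
    "policy lookup vulnerable to sql injection in query",
    "quote search passes unescaped input to database query",
    "billing export builds database query by string concatenation",
]
labels = ["CWE-79"] * 5 + ["CWE-89"] * 5

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, clf in models.items():
    # Vectorization sits inside the pipeline so TF-IDF is fitted on training folds only.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_weighted")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: weighted F1 = {score:.2f}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the held-out fold into training, which matters when per-class sample counts are small.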
Next, we try to build ensemble models. The data is preprocessed with the Beautiful Soup and tqdm libraries. The TensorFlow Keras preprocessor is also used to tokenize the natural language input, using its text-to-sequence method. The input data is split into train and test components, with 80% of the data for training and 20% for testing. This round tries the Random Forest Classifier, XGB (XGBoost) classifier, and Logistic Regression classifier. The outputs of all three models are averaged to obtain better performance from the ensemble model. Table 2 shows the performance of the models in terms of F1 score, along with the parameters used. This experiment shows that XGB is the best among the models but still not good enough. Further ensemble methods are explored based on a feasibility study of the software development work items and the details of the domain experimented on. Multinomial NB (Naïve Bayes), Decision Tree Classifier, KNeighbors Classifier, Linear SVC (Support Vector Classifier), and Random Forest Classifier are used. These algorithms are first combined using the averaging method for prediction. The F1 score is used to evaluate performance. Table 3 shows the performance of the models in terms of F1 score, along with the parameters used. Table 4 lists the parameters chosen when all the models were run together. Since the Random Forest Classifier and KNeighbors Classifier were relatively better, they were run together under the averaging method, but their performance dropped by 3%. Under the max voting method, the performance is also only about 31%.
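The two combination rules mentioned above, averaging and max voting, can be illustrated with a small library-independent sketch. The probability values, the model names in the comments, and the function names are hypothetical; in practice the per-model probabilities would come from fitted classifiers.

```python
from collections import Counter
from typing import List, Sequence


def averaging_predict(per_model_probs: Sequence[Sequence[Sequence[float]]],
                      classes: Sequence[str]) -> List[str]:
    """Average class probabilities over models, then pick the most probable class.

    per_model_probs[m][i][k] is model m's probability that sample i has class k.
    """
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    predictions = []
    for i in range(n_samples):
        averaged = [
            sum(model[i][k] for model in per_model_probs) / n_models
            for k in range(len(classes))
        ]
        best = max(range(len(classes)), key=averaged.__getitem__)
        predictions.append(classes[best])
    return predictions


def max_voting_predict(per_model_labels: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote over hard labels; ties go to the label seen first."""
    n_samples = len(per_model_labels[0])
    return [
        Counter(labels[i] for labels in per_model_labels).most_common(1)[0][0]
        for i in range(n_samples)
    ]


classes = ["CWE-79", "CWE-89", "CWE-200"]
# Made-up probabilities from three models for two samples.
probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # e.g. a random forest
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],  # e.g. a boosted-tree model
    [[0.2, 0.7, 0.1], [0.3, 0.5, 0.2]],  # e.g. logistic regression
]
# Hard labels are each model's own most probable class.
hard = [averaging_predict([p], classes) for p in probs]

print(averaging_predict(probs, classes))  # ['CWE-89', 'CWE-89']
print(max_voting_predict(hard))           # ['CWE-79', 'CWE-89']
```

The first sample shows how the two rules can disagree: two models narrowly prefer CWE-79, so voting selects it, while one confident model pulls the averaged probabilities towards CWE-89.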
In this section, ensemble and deep learning models that are appropriate for modeling data from software development processes, with a focus on software security, are shortlisted. An attempt has been made to classify the content captured in software development work items into security-related content mapped to the respective CWE. This mapping will help to call out the possible threats hidden in the system. The spaCy NLP (Natural Language Processing) library is used for data processing. Training and testing data of 70% and 30% are constructed for the experiment. The text tokenizer and pad sequencer from the Keras preprocessing library are applied. A pretrained GloVe model with 200 dimensions is used to generate the embedding matrix for training the model.

B. CNN (CONVOLUTIONAL NEURAL NETWORK) STATIC
The CNN static architecture concatenates layers of Conv1D, BatchNormalization, Activation, and GlobalMaxPool1D. Dropout is kept at 50%, followed by a dense layer of 512 units with 'relu' activation. The output layer is a dense layer with a 'softmax' activation function. The CNN static model is compiled with the 'categorical_crossentropy' loss function and the 'adam' optimizer, trained with a batch_size of 128, and evaluated with a custom function written to compute top-3 accuracy. The model is created with 'Model' from the keras.models module and is run with "fit_generator" to feed data in sequential mode. The top-3 accuracy is 50.68%. Performance on the training and hold-out data sets over the epochs in terms of the loss is depicted in figure 4. The hold-out loss does not converge to the training loss. An 80%/20% split of training and testing data is a general guideline. For the ensemble and deep learning models, we experimented with different training and testing data splits. However, this did not make much of a difference at the end of the experiment.
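Top-3 accuracy counts a prediction as correct when the true class is among the three classes with the highest predicted probability. The framework-independent sketch below illustrates the metric only; it is not the custom function used in the experiments, and the probability rows are made up.

```python
from typing import Sequence


def top_k_accuracy(probabilities: Sequence[Sequence[float]],
                   true_indices: Sequence[int],
                   k: int = 3) -> float:
    """Fraction of samples whose true class index is among the k highest scores."""
    hits = 0
    for row, truth in zip(probabilities, true_indices):
        # Indices of the k largest probabilities for this sample.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_indices)


# Made-up softmax outputs for four samples over four classes.
softmax_outputs = [
    [0.10, 0.20, 0.30, 0.40],
    [0.50, 0.15, 0.25, 0.10],
    [0.05, 0.70, 0.15, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
true_classes = [0, 2, 1, 3]

print(top_k_accuracy(softmax_outputs, true_classes, k=3))  # 0.5
print(top_k_accuracy(softmax_outputs, true_classes, k=1))  # 0.25
```

With many CWE categories, top-3 accuracy is a more forgiving measure than plain accuracy, which is useful when the goal is to shortlist candidate weaknesses for review.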

C. CNN DYNAMIC
To make the CNN network dynamic, in the embedding_layer creation, parameter 'trainable' is kept to 'True' so that the training happens dynamically. The architecture of the network remains the same as the CNN-Static network. Model building and compilation stay the same. The top 3 accuracies show a performance of 31.76%. Performance on training and holdout data set over the epochs in terms of the loss is depicted in figure 5. Though the holdout data set is close to

D. DATA PROCESSING, TRANSFORMATION AND MODELING
In this section, data pre-processed with the spaCy NLP library is used. A TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer is used for data vectorization. The data is then subjected to the Logistic Regression, Random Forest Classifier, and Linear SVC models. A cross-validation value of 5 is chosen for the processing. Random Forest uses "n_estimators" of 300 and "max_depth" of 3. Accuracies of the models are shown in figure 6. Table 5 depicts model performance in terms of accuracy. Even modifications to the data processing do not significantly improve the models' performance. Based on the literature review, some models that have shown good performance on classification problems are explored here. The exploration looks at the compatibility of these models with the data used here, which comes from software development processes that follow Agile methodology and serve the insurance business domain.
Naïve Bayes, K-Nearest Neighbor, Support Vector Machine, Random Forest, Decision Tree, and ensemble classifiers are the ones to be explored. Since there was no significant improvement in performance, raw data is used directly, without the spaCy-based NLP processing done in the previous experiment. The TF-IDF vectorizer is used on this data for vectorization. The cross-validation value is maintained at 5. Table 6 shows the model parameters and their performance. Except for the Linear SVC model, where accuracy improved to 47.61%, the models are still not doing well. Ensembling the best models among these is explored for the data used here. Stacking ensemble modeling is tried next. There are level 0 and level 1 models; a stacking classifier combines the models from levels 0 and 1. The logistic regression model is used as the level 1 model, and the rest are configured as level 0 models. Data processing is kept to raw data processed with the TF-IDF vectorizer.
The 'RepeatedStratifiedKFold' method from sklearn's model selection library is used to configure the 'cv' parameter for the modeling. The parameters set for 'RepeatedStratifiedKFold' are 'n_splits' of 10 and 'n_repeats' of 3. The 'cross_val_score' method from sklearn's model selection library is used to generate scores to evaluate the model. This method takes the model, input data, label data, scoring method, and 'cv' value as parameters. The scoring method used is accuracy, and the cv value is generated by the 'RepeatedStratifiedKFold' method. The performance of the models with their parameter configuration is provided in table 7.
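A stacking setup evaluated in this way could look roughly like the scikit-learn sketch below. It is an illustration under stated assumptions, not the experimental code: the templated toy corpus is made up, the two level 0 models shown (Decision Tree and KNeighbors) are picked for illustration, and `random_state` and `max_iter` are assumptions. The `n_splits`, `n_repeats`, accuracy scoring, and the logistic regression level 1 model follow the text.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus (made up): ten templated descriptions per label so 10 splits are possible.
areas = ["login", "search", "report", "profile", "claims",
         "policy", "quote", "billing", "upload", "admin"]
texts = ([f"{a} page reflects unsanitized script input in html" for a in areas]
         + [f"{a} query concatenates user input into sql statement" for a in areas])
labels = ["CWE-79"] * 10 + ["CWE-89"] * 10

# Level 0 models feed their predictions to the level 1 logistic regression.
level0 = [
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000))
model = make_pipeline(TfidfVectorizer(), stack)

# 10 stratified folds repeated 3 times gives 30 accuracy scores.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, texts, labels, scoring="accuracy", cv=cv)

print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting both the mean and the standard deviation over the repeated folds gives a sense of how stable the stacked model is across different partitions of the data.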
In the first round, stacking all the models gave a poor accuracy of 0.2%. As per the literature review, Decision Tree, KNeighbors, and Logistic Regression have performed well in similar setups. Stacking these models improved the performance to 44.6%. Individually, the Logistic Regression, Decision Tree, and Support Vector classifiers show better performance. However, removing KNeighbors and adding a Support Vector classifier reduced stacking performance back to 0.2%, which indicates that the earlier combination was best. When Multinomial Naïve Bayes was added to the best performers, stacking performance improved slightly to 44.9%, whereas Multinomial Naïve Bayes on its own performed at 28.6%. The XGBoost classifier took the performance to 49.4%, but it is computationally expensive, so it would not be feasible. The Logistic Regression and Decision Tree classifiers were fine-tuned with grid search CV, reaching 50.6% and 41.3%, respectively, but stacking performance then dropped to 38.2%.
The original database had data across 63 CWE categories. Many of these categories had only a few data points. This resulted in an imbalanced dataset, and the models performed poorly. Discussion with application development experts indicated that the top 20 CWEs were the most prevalent, and their frequency of occurrence was also observed to be high. Only the 20 most frequently occurring CWEs were therefore shortlisted, based on expert input. To improve prediction performance, more data was collected from other programs in the company. A total of 3,214 data points were collected from across multiple programs for the 20 shortlisted CWEs. Among all the experiments conducted, the stacking model of the Decision Tree classifier, KNeighbors classifier, and Logistic Regression showed the best performance, with 77.7% accuracy and a standard deviation of 2.5%. The Decision Tree classifier and KNeighbors were used at level 0, and the Logistic Regression model was used at level 1 in the stacking.
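The shortlisting step can be illustrated with a small sketch. It is a simplification under stated assumptions: the study combined frequency with expert input, whereas this hypothetical `keep_top_k_classes` helper filters on frequency alone, and the records and labels below are made up for illustration.

```python
from collections import Counter


def keep_top_k_classes(records, k):
    """Keep only records whose label is among the k most frequent labels.

    records: list of (text, cwe_label) tuples.
    Returns the filtered records and the per-label counts.
    """
    counts = Counter(label for _, label in records)
    top_labels = {label for label, _ in counts.most_common(k)}
    kept = [(text, label) for text, label in records if label in top_labels]
    return kept, counts


# Hypothetical records: two frequent classes and one rare class.
records = (
    [("xss report", "CWE-79")] * 4
    + [("sql injection report", "CWE-89")] * 3
    + [("input validation report", "CWE-20")] * 1
)

kept, counts = keep_top_k_classes(records, k=2)
print(counts.most_common())
print(len(kept), sorted({label for _, label in kept}))
```

In the setting described above, `k` would be 20 and the resulting shortlist would still be reviewed by application development experts.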
As the occurrence of CWEs varies in the future, the modeling must be fine-tuned to cover more CWEs. Collaboration with experts is to be continued to study the outcome of the current model. The experts must validate the predicted labels, and the labeling of the data into appropriate CWEs must be improved during validation. This ongoing effort will help improve the prediction engine further.
Evaluation against the state of the art: Work [38] uses SMOTE, an SVM with an RBF kernel, and logistic regression on recordings of meetings between developers and customers from a software development company in the United States. It explores a classification approach to identify security vulnerabilities and reports a precision of 70.8% and a recall of 18.3%.
Work [39] explored LDA and SVM approaches on a Stack Overflow dataset to classify security vulnerabilities. For LDA, precision was 70.33% and recall was 77% [40] [41]. For SVM (Support Vector Machine), precision was 72% and recall was 77% [42]. Among all the experiments conducted here, the stacking model of the Decision Tree classifier, K-Neighbors classifier, and Logistic Regression showed the best performance, with 77.7% accuracy and a standard deviation of 2.5%. The Decision Tree classifier and K-Neighbors were used at level 0, and the Logistic Regression model was used at level 1 in the stacking. In comparison to these works, our approach achieves better performance, with a precision of 76% and a recall of 79%.
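For reference, precision and recall for a multi-class problem can be computed as in the sketch below. The averaging scheme behind the reported figures is not stated in this section, so macro averaging is an assumption here, and the `macro_precision_recall` helper and the labels are hypothetical.

```python
def macro_precision_recall(y_true, y_pred):
    """Return macro-averaged (precision, recall) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(1 for t, p in pairs if t == c and p == c)
        false_pos = sum(1 for t, p in pairs if t != c and p == c)
        false_neg = sum(1 for t, p in pairs if t == c and p != c)
        # A class with no predictions (or no true samples) contributes 0.0.
        precisions.append(
            true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        )
        recalls.append(
            true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        )
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical labels: one "A" sample is misclassified as "B".
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
precision, recall = macro_precision_recall(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f}")
```

Macro averaging weights every class equally, which matters for imbalanced CWE data because rare classes count as much as frequent ones.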
The methods used in the paper are summarized here, starting with the background of existing methods. The "Understanding threat modeling" section explores automating some of the sub-processes using machine learning. The next section, "Risk assessment approach for threat modeling," explores combining the conventional risk assessment method with the threat modeling approach, which helps to leverage the best of both. The section "Threat prediction modeling" details the core proposal of our work.
In this section, 'Data Transformation and Modeling,' we studied in detail the machine learning algorithms that fit our architecture. Algorithms ranging from basic to more advanced were explored, and the performance of the various experiments was compared. Parameter tuning is explored in detail to arrive at the best combination of parameters for the models that work well. Table 7 provides a comprehensive view of all the models, parameters, and performance. We also provide the best performance obtained on our dataset.
Some of the methods used are as follows. TF-IDF is a widely used approach for weighting terms in Natural Language Processing. Logistic Regression, Random Forest Classifier, and Support Vector Classifier are basic machine learning approaches used for classification problems. Beautiful Soup is a Python library used for web scraping from XML and HTML pages. TensorFlow is an open-source artificial intelligence library that uses data flow graphs to build models. Keras is a neural network library that provides high-level APIs for building and training models. XGBoost stands for eXtreme Gradient Boosting and is a supervised learning library with parallel processing capabilities. A CNN (convolutional neural network) is an artificial neural network commonly used for processing image data. Ensemble classifiers help to improve machine learning outcomes by combining various models.

IX. CONCLUSION
In this exploration, the focus was to build a quantitative threat modeling approach. It is essential to use the knowledge prevalent in the software development processes and from across industries. The volume of information available from the industry has been overwhelming for software development teams to leverage. The approaches discussed in this paper will help make this information available as and when needed. The security challenges faced depend on the business domain in which the industry operates. In this study, the title insurance business domain is the focus area. The title insurance business domain is distinct from other branches of the insurance business. Software development processes following the agile development model also provide a different set-up in which security improvements can be focused. Ongoing calibration is needed to strengthen the system.
It is essential to calibrate the data store for identifying the right CWEs in the software development processes. Identification of all security-related events is another crucial aspect. All of this calls for appropriate education of the software development communities on security practices.
The proposed systems need to adapt and learn from dynamic changes in the industry. New vulnerabilities are discovered in the industry regularly, and both the system and the people need to stay up to date on these security issues. This work contributes to establishing an integrated and automated approach for software threat modeling. Conventional threat modeling and security risk assessment are studied, and the best of both are brought together with machine learning approaches. The machine learning approaches are customized to get better results than earlier related work. We have demonstrated a machine learning architecture appropriate to the subject under study, one that shows promising results with the available data.

X. FUTURE STUDY
This study looks at software systems without considering the category to which each software system belongs. Visibility into the class or category of the software system, and studies specific to those classes, could make the outcome more effective. This is a good area for future research. The modeling conducted in the study is generic; exploring machine learning models that can leverage contextual information from the data could strengthen the experiments further and should be developed in future studies. The agile software development model and the security-related controls in software development may have different objectives. These objectives are not analyzed in relation to each other as part of our study. Bringing them together and aligning the work will help to optimize the framework further and needs focus in future studies. Our work is also limited by the lack of larger datasets, which are needed to leverage the capabilities of deep learning methods; this can be a focus for future studies. Imbalanced datasets are another area that needs attention in security-related data and machine learning approaches. Many categories of security threats occur infrequently but have a severe impact when they do occur; this needs to be explored further. Unsupervised anomaly detection methods could bring more efficiency to this research area and need exploration.

A. AVAILABILITY OF DATA AND MATERIALS
The data used to support the findings of this study are available from the corresponding author upon request.

B. COMPETING INTERESTS
This article does not contain any studies with human participants performed by any of the authors. There is no conflict of interest between authors.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3185069

C. FUNDING
No funding was available for this research work.