Unsupervised Machine Learning for Managing Safety Accidents in Railway Stations

For both passenger and freight transportation, railroad operations must be reliable, available, maintainable, and safe (RAMS). In many urban areas, risks and safety accidents at railway stations are an essential concern for daily operations. Moreover, accidents cause injuries, anxiety among the public, costs, and damage to market reputation. Stations are under pressure from higher demand, which strains infrastructure and raises safety management concerns. To analyse these accidents and use technology such as AI methods to enhance safety, this study proposes unsupervised topic modelling to better understand the contributors to these extreme accidents. Latent Dirichlet Allocation (LDA) is optimised for fatal accidents in railway stations, using textual data gathered by the RSSB that covers 1000 accidents in UK railway stations. This research describes how the machine learning topic method can systematically spot accident characteristics to enhance safety and risk management in stations, and it provides advanced analysis. The study evaluates the efficacy of mining text from accident history to gain information, learn lessons, and build a deep and coherent understanding of risk by assessing fatal accidents at a large and enduring scale. This intelligent text analysis presents predictive accuracy for valuable accident information such as root causes and hot spots in railway stations. Further, improvements in big data analytics yield an understanding of the nature of accidents in ways that are not possible through narrow domain analysis of accident reports without a considerable amount of safety history. This technology offers high accuracy and opens a beneficial and extensive new era of AI applications in railway industry safety and in other safety fields.


I. INTRODUCTION
Trains are considered safer than other means of public transportation. However, passengers at train stations sometimes face many risks because of overlapping factors such as station operation, design, and passenger behaviour. Owing to gradually increasing demand, heavy congestion, and the layout and design complexity of some stations, there are potential risks during the operation of the stations. Furthermore, the safety of passengers, workers, and the public is the main concern of the railway industry and one of the critical parts of the system. The European Union put Reliability, Availability, Maintainability and Safety (RAMS) into practice as a standard in 1999, known as EN 50126, aiming to prevent railway accidents and ensure a high level of safety in railway operations. RAMS analysis concepts help minimise risks to acceptable levels and raise safety levels. However, safety remains an urgent issue: reports show that several people are killed every year in railway stations, and some accidents lead to injuries or fatalities. For example, in Japan in 2016, 420 accidents involved being struck by a train, resulting in 202 deaths. Of those 420 accidents, 179 (resulting in 24 fatalities) involved falling from a platform followed by injury or death from being hit by a train [1]. In the UK in 2019/20, most passenger injuries were reported to occur in accidents at stations. Most major injuries are the outcome of slips, trips and falls, of which there were approximately 200 [2]. Addressing them would have a significant impact on reducing injuries on station platforms and on providing a quality, reliable and safe travel environment for all passengers, workers and the public.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.
Even if an accident does not result in deaths or injuries, it causes delay, cost, fear and anxiety among the public, interruption of operations, and damage to the industry's reputation. Also, before providing or investing in any safety control measures in the stations, it is crucial to consider the risks associated with railway incidents in the station and to identify the many factors related to each accident through comprehensive knowledge of the root causes of accidents, considering all available technology.
The objective of this research is to analyse a collection of accident cases recorded between 01/01/2000 and 17/04/2020 in order to introduce a smart method that is expected to improve future safety levels, the risk management process, and the way data are collected in railway stations. These data were gathered by the RSSB, which agreed to their use for research purposes. Analysing an extensive amount of data recorded in different forms is a challenging job. Nowadays, it is hard to obtain specific information from such mixed digitised big data, including the Web, video, images and other sources; it is like searching for a needle in a haystack. Thus, a powerful tool to help manage, search and understand these vast amounts of information is needed [3], [4]. Many pre-processing techniques and algorithms are required to obtain valuable characteristics from the enormous amount of safety data in the stations, including textual data. The study applies topic modelling to identify useful characteristics such as the root causes of accidents, and it explores the factors, which are groups of words or phrases that explain and summarise the content of accident reports, reducing analysis time while keeping the outcomes highly accurate. Topic modelling techniques are robust smart methods that are extensively applied in natural language processing for topic detection and semantic mining from unstructured documents. Consequently, this work adopts the LDA model, one of the best-known probabilistic unsupervised learning methods, which marks the topics implicit in a collection of texts [5]. Given the growing application of new technologies, the data revolution, and the use of AI in many fields, this paper proposes a smart analysis using topic modelling techniques, which can be very useful and effective for semantic mining and for discovering latent structure in documents and datasets.
The other sources of data (images, videos and numerical data) have been analysed using AI approaches based on supervised learning [6], [7], so this work targets the unstructured textual data.
Hence, our motivation is to investigate topic modelling approaches to risk and safety accidents in the stations. This work provides a topic modelling method based on LDA, together with other models for advanced analytics, aiming to contribute to the future of smart safety and risk management in the stations. By applying the models, we investigate fatal safety accidents in the railway.
This paper establishes an innovative method in the area by studying how the textual data of railway station accident reports can be used efficiently to extract the root causes of accidents and to establish an analysis linking the text and the possible causes, since a fully automated process that takes text as input and provides outputs is not yet available [8]. Applying this method is expected to help overcome several issues: aiding decision-makers in real time, extracting key information in a form that non-experts can understand, identifying the details of an accident in greater depth, designing expert smart safety systems, and making effective use of safety history records. Such results could help make safety and risk management analysis more systematic and smarter. Our approach uses the state-of-the-art LDA algorithm to capture the critical textual information about accidents and their causes. The rest of this paper is arranged as follows. Section II presents related work in both accident analysis and text classification with deep learning. Section III describes in detail the approach used, along with the evaluation criteria. Section IV provides details of our implementation, and Section V reports the results. Finally, Section VI presents the conclusion.

II. TOPIC MODEL FOR RAILWAY STATION SAFETY
Text data is more essential now than ever before; it is valuable and easy to store in massive amounts for processing and mining [9]. Public use of social media is expanding, and customers' reviews and reactions are a necessary and powerful tool for quality services, sustainable tourism [10], transport, and other aspects such as maintenance. Many points can be drawn from such data mining technology (see Figure 1). For instance, call data, which is valuable raw long-term safety history data, contains many inputs such as risk indicators, the time, the day of the week, and the season. This big data, which contains information on safety hazards, can be classified by different methods, used to reduce accidents, and used to form a proactive analysis approach [11].
Safety history is a rich source for knowledge discovery and risk management analysis. For instance, investigation reports produced after accidents by a responsible authority or expert are one of the most common safety actions; they evaluate and analyse the causes of accidents and the consequences of the risk, which can be very effective for analysing behaviour, uncovering hidden risk causes, and learning lessons. Text data comes from many sources, including social media, emails and call recordings. Such data exist in a raw and unstructured state and require transformation and cleaning as part of topic modelling to capture the needed information. A framework that uses AI algorithms on textual data sources to build a tag recording system from safety documents is suggested (see Figure 2). Such a method can explore and digest the complete history, can track and navigate through time to reveal how specific events have changed, and can be adapted to many kinds of data. Moreover, as automation and digitisation advance, more texts are available online than humans can read, analyse, explore and relate to each other. The topic model is suited to such issues and can annotate large archives of records [3].
Lessons must be learned to prevent repeat accidents in the stations. A massive effort has been made in the field to control these issues, and investigations have yielded recommendations for a high safety level. Usually, many reports and documents are recorded, from the risk assessment through to the accident investigation reports of different organisations, and their narratives are indispensable. Regardless of whether the text data is structured, many challenges are expected, such as massive data volumes, time, cost, a shortage of experts, and documents that may contain nonstandard terms. These challenges and more can be reduced by the intelligent use of deep learning methods to automate analysis as part of the process [12].

III. RELATED WORK
Although such methods are applied in scattered ways and the terms used in the literature differ, there is a shortage of such applications in the railway industry. NLP has been implemented to detect defects in the requirements documents of a railway signalling manufacturer [13] and to translate contract terms into technical specifications in the railway sector [14]. Additionally, to identify the significant factors contributing to railway accidents, a taxonomy framework using Self-Organizing Maps (SOM) was proposed to classify human, technology, and organization factors in railway accidents [15]. Likewise, association rule mining has been used to identify potential causal relationships between factors in railway accidents [16]. In the field of machine learning applied to risk, safety accidents, and occupational safety, many ML algorithms have been used, such as SVM, ANN, extreme learning machine (ELM), and decision tree (DT) [7], [17].
Scholars have applied topic modelling in many fields, where it has proved to be one of the most powerful methods in data mining [18], and in various areas such as software engineering [19], [4], [20], medicine and health [21], [22], [23], [24], and linguistic science [25], [26]. Furthermore, according to the literature, this technique has been used for prediction in areas such as occupational accidents [17], construction [8], [27], [28], and aviation [29], [30], [31]. The method has been applied to understand occupational incidents in construction and to predict construction injuries [32], [33], to analyse the factors associated with occupational falls [34], to study occupational incidents in steel factories [35], and in cybersecurity and data science [36]. Moreover, from 156 construction safety accident reports in urban rail transport in China, risk information, relationships and factors have been extracted and identified for safety risk analysis [37]. The literature shows that there is no perfect model for all text classification problems and that extracting information from text is an incremental process [38], [11]. In the railway sector, a semi-automated method for classifying unstructured text-based close call reports has been examined and shows high accuracy. Moreover, it has been reported that such technology could become compulsory for safety management in railways in the future [11]. Applying text analysis methods in railway safety is expected to solve issues such as time-consuming and incomplete analysis. Additionally, several advantages have been demonstrated: an automated process, high productivity with quality, and an effective system for safety supervision in the railway system. Machine learning methods have also been applied to the prevention of railway accidents. Many methods are used for data mining, including machine learning, information extraction (IE), natural language processing (NLP), and information retrieval (IR). For instance, to improve the identification of secondary crashes, a machine-learning-based text mining (classification) approach was applied to distinguish secondary crashes from crash narratives; it showed satisfactory performance and great potential for identifying secondary crashes [39]. Such methods are powerful for railway safety: they aid decision-makers and help investigate the causes of accidents, the relevant factors, and their correlations [40].
VOLUME 11, 2023 83189
Text mining has been shown to offer several areas for future development and advances in railway safety engineering [41].
Text mining with probabilistic modelling and k-means clustering helps in understanding the causal factors of rail accidents. An analysis of reports on major railroad accidents from the United States and from the Transportation Safety Board of Canada pointed out that lane defects, wheel defects, level crossing accidents and switching accidents can lead to many recurring accidents [42]. Text mining is used to understand the characteristics of rail accidents, to support safety engineers, and to provide a worthwhile amount of more detailed information. Accident report data covering 11 years in the U.S. were analysed by combining text analysis with ensemble methods to better understand the contributors to and characteristics of these accidents, yet more research is needed [41]. Also, U.S. railroad equipment accident reports have been used to identify themes by comparing two text mining methods, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) [43]. Additionally, to identify the main factors associated with injury severity, data mining methods such as an ordered probit model, association rules, and classification and regression tree (CART) algorithms have been applied to the U.S. highway-rail grade crossing accident database for the period 2007-2013, where factors such as train speed, age, gender and time were discussed [44].
In recent years, the big data revolution has created opportunities in the railway industry and is opening the way to data-driven safety analysis [45]; accordingly, approaches that proactively identify high-risk scenarios, such as Natural Language Processing (NLP) analysis, have been recommended [46].
A supervision system drawn from a big data application case has been introduced as a significant tool in railway safety supervision, applying text mining methods to accident and fault analysis reports [47]. Also, since big data and natural language processing offer an opportunity for analysing railway safety, an NLP framework for analysing accident data has been described using investigation reports of railway accidents [48]. Moreover, for fault diagnosis in railway systems, classification of maintenance text using the LDA algorithm has been proposed [49], also with the aim of improving fault diagnosis performance [50]. In Chinese railways, social network text data have been used with a combination of text mining and deep learning to predict passenger capacity, showing a good accuracy rate [51]. Also for Chinese railways, natural language processing has been applied to extract and analyse risk factors from accident reports [52]. In the context of deep learning, U.S. rail accident reports from 2001 to 2016 were examined to extract the relationships between the causes of railroad accidents and their corresponding descriptions. Deep learning was thus applied to the automatic understanding of domain-specific texts and the analysis of railway accident narratives; it classified accident causes accurately, revealed important differences in accident reporting, and proved beneficial to safety engineers [53]. Text mining has also been used to diagnose and predict switch failures [54]. For fault diagnosis of vehicle on-board equipment in high-speed railways, the prior LDA model was introduced for fault feature extraction [55], and the Bayesian network (BN) has also been used for fault feature extraction [56]. For automatic classification of passenger complaint texts and eigenvalue extraction, the term frequency-inverse document frequency algorithm was used with a naive Bayes classifier [57].

IV. THE LATENT DIRICHLET ALLOCATION
Within ML and natural language processing (NLP), topic models such as Latent Dirichlet Allocation (LDA) are a kind of statistical approach for identifying the abstract ''topics'' that occur in a collection of texts. The idea is that documents exhibit multiple topics: each document is explained as a unique mixture of topics in different proportions (see Figure 3, where keywords in different colours from accident investigation report documents exhibit multiple topics). Some terms are highlighted as examples, such as the time, date, and accident title or causes. A topic is a distribution over a fixed vocabulary. This analysis makes it possible to manage and summarise textual data in an automated, real-time manner [3].
Machine learning can learn, predict and describe qualitative and quantitative patterns in data, such as the root causes of accidents, which supports the study of hidden knowledge and correlated factors in accidents in railways or other fields. Methods such as clustering and k-means clustering have been used to detect text from unstructured data [58], [5]. LDA, as a flexible generative probabilistic framework, assumes that each document can be represented as a probability distribution over latent topics and that the topic distributions of all documents share a common Dirichlet prior. Each latent topic in the LDA model is likewise represented as a probability distribution over words, and the word distributions of the topics share a common Dirichlet prior. As a generative model, it treats the data as arising from a process that includes hidden variables, and it defines a joint probability distribution over both the observed and hidden random variables. Inference computes the conditional distribution (posterior distribution) of the hidden variables (the topic structure) given the observed variables (the words of the documents) [59]. The model can be described with the notation presented in Table 1 and the plate diagram (graphical model) shown in Figure 4, which explain the probabilistic theory behind the LDA model.
The topics are φ_1:K, where each φ_k is a distribution over words. The topic proportions for the dth document are θ_d, where θ_dk is the topic proportion for topic k in document d.
The variable S_dw, marked in yellow, is the only observed variable (the words). α is a matrix in which each row is a document and each column represents a topic, and β is a matrix in which each row represents a topic and each column represents a word. Both α and β are the parameters of the respective Dirichlet distributions. Computing the topic structure means computing the conditional distribution (posterior distribution) of the hidden variables given the observed documents; by Bayes' rule, this posterior is the joint distribution of the hidden and observed variables divided by the marginal probability of the observed words. Because this posterior is intractable for exact inference, various approximate inference algorithms can be considered for LDA, for instance Laplace approximation, variational approximation, and Markov chain Monte Carlo. In this part, we present a simple variational algorithm for inference in LDA [59], [5].
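For reference, a standard way of writing this posterior is sketched below. The per-word topic assignments z and the word collection w are symbols assumed here for illustration (the text denotes the observed words by S_dw); they are not taken from Table 1. The denominator is the marginal probability of the observed words, which is the term that makes exact inference intractable.

```latex
p(\varphi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}, \alpha, \beta)
  = \frac{p(\varphi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D} \mid \alpha, \beta)}
         {p(w_{1:D} \mid \alpha, \beta)}
```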
The LDA method can recognise sub-topics for a range of risks arising from many causes and represent each risk as an array of topic distributions. With LDA, the terms in the set of documents produce a vocabulary that is then used to discover hidden topics.
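The generative view of LDA described in this section can be illustrated with a minimal sketch in Python. It assumes symmetric scalar priors alpha and beta and a tiny hypothetical vocabulary; the function names are invented for illustration and are not part of the study.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probabilities, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def generate_corpus(vocabulary, n_topics, n_docs, doc_length, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution (phi_k) per topic, drawn from a Dirichlet(beta) prior.
    phi = [sample_dirichlet([beta] * len(vocabulary), rng) for _ in range(n_topics)]
    corpus, theta = [], []
    for _ in range(n_docs):
        # Topic proportions (theta_d) for this document, from a Dirichlet(alpha) prior.
        theta_d = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            topic = sample_categorical(theta_d, rng)       # hidden topic assignment
            word_id = sample_categorical(phi[topic], rng)  # observed word
            words.append(vocabulary[word_id])
        corpus.append(words)
        theta.append(theta_d)
    return corpus, theta, phi

# Hypothetical vocabulary, invented for illustration only.
vocabulary = ["platform", "fall", "train", "struck", "stairs", "slip"]
corpus, theta, phi = generate_corpus(
    vocabulary, n_topics=2, n_docs=3, doc_length=8, alpha=0.5, beta=0.5
)
```

Inference runs this process in reverse: only the words in `corpus` are observed, and `theta` and `phi` have to be estimated from them.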

V. PREPARING DATA
The textual data contain key information that can be used, such as the time, the description of the accident, the location, and the age range of the victim. The time at which accidents occurred has been divided into parts of the day to capture times more accurately for further mining: Morning from 5 am to 12 pm (noon), Afternoon from 12 pm (midday) to 5 pm (17:00), Evening from 5 pm (17:00) to 9 pm (21:00), and Night from 9 pm to 4 am.
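The part-of-day split described above can be sketched as a small helper. The function name is hypothetical, and because the stated ranges leave 04:00 to 04:59 unassigned, the sketch assumes that hour belongs to Night.

```python
def part_of_day(hour):
    """Map an hour in 0-23 to the part of the day used in the text."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    # 21:00 onwards and the early hours; treating 04:00-04:59 as Night
    # is an assumption, since the text does not assign that hour.
    return "Night"

labels = [part_of_day(h) for h in (6, 13, 18, 23, 2)]
```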
The data set contains the fatalities that occurred at rail stations between 01/01/2000 and 17/04/2020. The following information is available for almost every accident, including the hazardous event description and the text data below:
• The day of the week on which the fatality occurred
• The time of day at which the fatality occurred
• Information on the cause of the fatality
• The age range of the deceased
• The site type where the fatality occurred
• The physical environment where the fatality occurred
From the RSSB, the raw data set has more than 2250 accidents registered in railway stations in ten years. However, the data have been divided into separate datasets for clarity and to give a precise view of the scope of future research; for example, the suicide dataset has been excluded from the data set of this work. As part of data mining, the raw data need to be processed to extract knowledge. For text cleaning and pre-processing, it is known that many documents contain additional elements such as misspellings, punctuation, stop words, slang, and others, which affect the performance of the algorithms and the topic model results [64]. Some techniques for pre-processing text data, converting the text to formal language, and removing noise are as follows:
• Tokenization, which breaks the text into meaningful elements called tokens so that the words in a sentence can be investigated by splitting the data into parts [60], [61].
• Stop and noise words, which are words that do not form a key in the classification algorithms, for example (a, after, about, etc.), and so need to be removed [62].
• Capitalization, where words or abbreviations are written in capital letters; converting them to lower case can help account for such exceptions [63].
Thus, to improve the quality of the data set and the model performance, a filtering configuration is used, which allows selecting the fields that are considered in the modelling. Some information in the unprocessed data can lead to a fuzzy overview, noise, or missing values; fortunately, this data set has no missing values. For the topic model configuration, non-language characters, non-dictionary words, and numeric digits are excluded from the analysis. Also, uninformative words such as at, on, and or are removed.
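The cleaning steps listed above (lower-casing, tokenization, and removal of stop words, digits and non-language characters) can be sketched with the Python standard library. The stop-word list and the sample narrative are hypothetical and deliberately tiny, and the dictionary check for non-dictionary words is not implemented here.

```python
import re

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {
    "a", "about", "after", "and", "at", "by", "from",
    "in", "of", "on", "or", "the", "to", "was",
}

def preprocess(text):
    """Lower-case, tokenise, and drop stop words, digits and punctuation."""
    lowered = text.lower()                    # capitalization step
    tokens = re.findall(r"[a-z]+", lowered)   # tokenization; digits and symbols are dropped
    return [token for token in tokens if token not in STOP_WORDS]

# Hypothetical narrative, invented for illustration only.
narrative = "Passenger fell from the Platform at 21:45 and was struck by a train."
tokens = preprocess(narrative)
# tokens == ["passenger", "fell", "platform", "struck", "train"]
```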
A visualisation of the data set shows that the night and evening capture the most accidents, followed by the morning and afternoon. However, the day of the accident is not easy to identify. Adults also seem to be involved in accidents more than children. The lineside is the location with the most accidents, followed by the stairs/bridges and escalators (see Figure 5).

VI. MODEL ANALYSING
A DT is a decision support tool that uses a tree-like model of decisions and their likely outcomes [40], [53]. There are many possible ML approaches to safety analysis. More specifically, we train a DT to classify the accidents and the patterns that occurred in these accidents in the stations [41], [52]. This model is applied to a wide variety of data and is preferable because its structured rules are simple to follow and understand. The technique classifies instances based on their feature values (Yuan and Shaw). The two general types of DT are classification (where the class variable is discrete) and regression (where the class variable is continuous) [42], [43], [53], [54]. The data sets are then uploaded, and a DT model is designed and visualised. The DT for the predictive model provides a visualisation of the prediction case. The DT carries useful information: its branches represent branching decisions and show the decisions that led to a given prediction. The tool presents the model's prediction path at the side of the tree, which is an advantage of this tool.
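To make the branching idea concrete, the sketch below picks the root split of a classification tree by the lowest weighted Gini impurity over categorical features. This is only one common splitting criterion and is an assumption here, since the text does not state which criterion or tool was used; the records and field names are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(records, features, target):
    """Pick the categorical feature whose value-wise split has the lowest weighted Gini."""
    best_feature, best_score = None, float("inf")
    for feature in features:
        groups = {}
        for row in records:
            groups.setdefault(row[feature], []).append(row[target])
        score = sum(len(g) / len(records) * gini(g) for g in groups.values())
        if score < best_score:
            best_feature, best_score = feature, score
    return best_feature, best_score

# Hypothetical records, invented for illustration only.
records = [
    {"time": "Night", "site": "lineside", "event": "struck"},
    {"time": "Night", "site": "stairs", "event": "fall"},
    {"time": "Morning", "site": "stairs", "event": "fall"},
    {"time": "Evening", "site": "lineside", "event": "struck"},
    {"time": "Afternoon", "site": "escalator", "event": "fall"},
]
feature, score = best_split(records, ["time", "site"], "event")
```

In this toy data the site type separates the two event classes perfectly, so it is chosen as the first branch; a full tree would repeat the same selection inside each branch.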

A. TEXT ANALYSIS
The dataset used in this work has key text attributes and information such as the day of the week and time of day at which the accident occurred, as well as the hazardous event description and precursor description. Moreover, the age of the victim and the site type, with information on the physical environment, are recorded. A quick overview is given by the word cloud, which is a key visualisation since it summarises a bulk of text. Generally, a variety of accidents occurred, such as falls from train platforms or into the gap, being trapped in doors, being struck by trains, electric shock, suicide, and so on. These accidents can occur when alighting from or boarding the train, or when no train is stopped at the station.
Saturday is notably linked to accidents in the stations. It is usually less crowded than workdays, but the public may go out more as it is a day off work. The details and reasons behind this need more investigation, such as the factors of insufficient assistance staff and the impact of intoxication. The night and evening see more accidents than the morning and afternoon. This also raises many factors that need to be re-examined, such as lighting conditions, seasonality and the effect of weather. More information can be gained, for example about trespass accidents: trespass is one of the prominent words in the cloud and is linked to the station and to contact with train vehicles, which shows the importance of isolating passengers from trains at a safe distance. From another view, this raises the question of the effect of overcrowding, which may force people close to the trains and track. Moreover, infrastructure appears, reflecting the age of the stations and the impact of intensive use. The lineside of the platforms is the hot spot where humans interact with machines, and it forms an accident trap. Consequences such as crushing appear, which calls for engineering solutions to detect objects and stop trains in accident situations [6], [65]. In addition, graphic statistics of the dataset were reviewed and visualised to show the distribution of the accidents among factors. The time of day (AM and PM) and the time ranges (Afternoon, Morning and Night) give more solid information: children and elderly passengers were involved in accidents in the morning. Nevertheless, the night-time captured the most accidents (see Figure 6-b). Moreover, these factors overlap with the days of the week: some accidents occurred on Saturday morning, and the afternoon seems to be safe throughout the week (see Figure 6-c).
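The overlap between day of the week and part of the day discussed above amounts to a cross-tabulation. A minimal sketch, using invented records rather than the RSSB data and hypothetical function names, could look like this:

```python
from collections import Counter

# Hypothetical (day of week, part of day) pairs, invented for illustration only.
records = [
    ("Saturday", "Night"),
    ("Saturday", "Morning"),
    ("Monday", "Night"),
    ("Saturday", "Night"),
    ("Friday", "Evening"),
]

def cross_tabulate(pairs):
    """Count accidents for every (day, part of day) combination."""
    return Counter(pairs)

def marginal(pairs, position):
    """Count accidents by one factor only: 0 for day, 1 for part of day."""
    return Counter(pair[position] for pair in pairs)

table = cross_tabulate(records)
by_day = marginal(records, 0)
by_part = marginal(records, 1)
```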
Many valuable insights can be found in the details of the data, yet more detailed data need to be gathered in the future, which demonstrates the importance of data in safety analytics. Clearly, if this approach is considered in future safety and reporting systems, the reporting and data gathering system will be improved and become more valuable for advanced analytics such as AI methods.
Moreover, passenger behaviour is expected to be a primary contributor to accidents inside stations, as passengers on the platform tend to walk or stand near the platform edge to avoid crowded areas, while others run to catch trains or stand too close in order to board before others; this can be coupled with the slow response of moving trains or little time to react. From the word frequency view, the risks related to fall/slip and trip, struck/crashed, and train and platform with the passenger are noticed as jointly occurring risk words in the details of the accident occurrences.
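The word frequency view behind a word cloud can be sketched as follows. Each word's count is scaled by the count of the most frequent word, which is one simple way to set display sizes; the scaling rule and the sample token lists are assumptions made for illustration.

```python
from collections import Counter

def word_weights(token_lists):
    """Count words across all reports and scale counts by the most frequent word."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Hypothetical tokenised reports, invented for illustration only.
reports = [
    ["passenger", "fall", "platform"],
    ["passenger", "struck", "train"],
    ["fall", "stairs"],
    ["passenger", "slip", "platform"],
]
weights = word_weights(reports)
```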
Even though the available data do not provide a deep understanding of the causes, the information was analysed independently and correlated with all the input attributes, and the factors related to the outcomes show the importance of detailed data on each unwanted event for future data gathering. For example, in fall accidents, the position of the passenger at the time of falling and the position in which they land (for example, a forward position) are key factors, and the sharp, solid rail track also affects the seriousness of the outcome, as do the details of the lighting, the flooring, the platform slope, and the state of the infrastructure. To detect the relevant topics within the text and learn the topics underlying a collection of documents, the Latent Dirichlet Allocation (LDA) algorithm is implemented with the configuration given in Table 2.
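As a rough illustration of how topics can be learned from tokenised narratives, the sketch below fits LDA with collapsed Gibbs sampling in plain Python. This is a swapped-in technique chosen for brevity, not the variational algorithm or the tool configuration of Table 2; the function name, the hyperparameter values and the sample documents are all assumptions.

```python
import random

def fit_lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    Returns (theta, phi, vocabulary): document-topic proportions,
    topic-word distributions and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocabulary = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocabulary)}
    n_words = len(vocabulary)

    doc_topic = [[0] * n_topics for _ in docs]             # words in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                           # words assigned to topic k overall
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = [
        [(n + alpha) / (len(doc) + n_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + n_words * beta) for n in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocabulary

# Hypothetical tokenised narratives, invented for illustration only.
docs = [
    ["passenger", "fall", "platform", "edge"],
    ["passenger", "slip", "stairs", "fall"],
    ["train", "struck", "trespass", "track"],
    ["train", "struck", "lineside", "track"],
]
theta, phi, vocabulary = fit_lda_gibbs(docs, n_topics=2, n_iter=100)
```

Each row of `theta` gives the topic mixture of one narrative and each row of `phi` gives the word distribution of one topic. On a corpus this small the result is illustrative only and depends on the random seed.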
H. Alawad, S. Kaewunruen: Unsupervised Machine Learning for Managing Safety Accidents
The nodes created in the topic map illustrate each topic via word probabilities, with different sizes and colours (see Figure 6-a). This collection of models is powerful for visualisation: each circle represents a topic, its size indicates how common the topic is in the data, and topics that are close together are semantically linked in the data.
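The two quantities that such a topic map displays can be sketched as follows: the average share of each topic across documents (circle size) and a distance between the word distributions of two topics (closeness). The Hellinger distance is an assumed choice here, since the text does not state which measure the tool uses, and the matrices are invented for illustration.

```python
import math

def topic_prevalence(theta):
    """Average share of each topic across documents (drives circle size)."""
    n_docs = len(theta)
    return [sum(row[k] for row in theta) / n_docs for k in range(len(theta[0]))]

def hellinger(p, q):
    """Hellinger distance between two distributions (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Hypothetical document-topic and topic-word matrices, invented for illustration.
theta = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
phi = [[0.5, 0.4, 0.1, 0.0], [0.0, 0.1, 0.4, 0.5]]

sizes = topic_prevalence(theta)
distance = hellinger(phi[0], phi[1])
```

A smaller distance means that two topics share more of their probability mass over the same words, which is how semantically linked topics end up close together.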
When the passenger topic is selected specifically, the word maps and the distribution show that this topic is linked to falling accidents on the platform and to departing more than arriving trains at night. Additionally, there is, for instance, a relation between fall and slip/trip as a group and locations such as stairs, escalators and bridges, and the afternoon is connected with accidents caused by contact with trains. This raises unanswered questions, such as whether this occurred because the passenger was running to catch the train or for other hidden reasons such as a narrow lineside, a slippery floor, or congestion. These analytics illustrate the importance of detailed accident reporting and the key factors that can be gained from such a method to provide safety measures.
Each topic node holds a probability for each of its words. More details can be gained from the topic distribution; for example, by selecting words, the words most associated with the specific topic are captured, and as the selected words change, the topic distribution is updated. Falling occurred in many locations such as escalators and bridges and was linked with elderly passengers, which highlights the importance of accessibility in the stations for the most vulnerable people, such as the disabled, the aged and families; mature passengers (aged 31-50) also faced more dangers on the station platforms. Electrical shock is linked with conductors and young age groups, which raises the importance of reviewing the safety standards for electrical equipment in the stations. Contact between passengers and the train on the platform while the train is moving shows an association between the topics of train and moving. Additionally, there is a correlation between fall and slip/trip as a group and many locations such as stairs, escalators and bridges, and the platform edge at the lineside, as well as contact with trains. Such analytics raise open questions, which require detailed reporting to be answered in the future by the safety community, such as whether this occurred because the passenger was running to catch the train or for other hidden reasons such as narrow linesides, slippery floors, floor damage, or congestion.
Data analytics can be developed further with advanced techniques, such as linking the outcomes with dictionaries of meanings and providing suggestions for designers and analysts, or recommendations for safety employees, as an expert system tool. Also, the topics can be modified, and each topic has a distribution of common words (see Table 4).
Applying a batch topic distribution to sentences that contain accident information allows the properties related to the inputs to be predicted (see Tables 5 and 6).
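The batch step can be illustrated with a simplified stand-in. The sketch below scores each sentence under a one-topic-per-sentence (mixture-of-unigrams) assumption with a uniform topic prior, which is simpler than full LDA inference. The topic names, the word probabilities and the floor value for unseen words are invented for illustration and do not come from Tables 4 to 6.

```python
import math

# Hypothetical topic-word probabilities (not taken from the paper's tables).
topics = {
    "fall": {"fell": 0.3, "stairs": 0.2, "escalator": 0.2,
             "platform": 0.2, "slipped": 0.1},
    "train_contact": {"train": 0.4, "struck": 0.2,
                      "platform": 0.2, "moving": 0.2},
    "electric": {"shock": 0.4, "conductor": 0.3, "rail": 0.3},
}
FLOOR = 1e-3  # assumed probability for words a topic has never seen

def topic_posterior(sentence, topics, floor=FLOOR):
    """Return P(topic | sentence) assuming one topic per sentence
    and a uniform prior over topics."""
    tokens = sentence.lower().split()
    log_scores = {
        name: sum(math.log(probs.get(tok, floor)) for tok in tokens)
        for name, probs in topics.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores.values())
    unnormalised = {name: math.exp(s - peak) for name, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}

batch = [
    "passenger fell on the escalator",
    "passenger struck by moving train",
]
results = [topic_posterior(sentence, topics) for sentence in batch]

for sentence, dist in zip(batch, results):
    best = max(dist, key=dist.get)
    print(f"{sentence!r} -> {best}")
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once sentences become longer than these toy inputs.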

B. CLUSTERING AND VALIDATION
For further analysis, clustering has been conducted on the data set. In this case, G-Means is considered in order to find the best number of cluster groups with a critical value of 5; in contrast, K-Means can be used for a specific number of clusters (K). Eleven clusters are used in the clustering algorithm to show the largest problems and the correlations in the topics. However, 8 clusters appear as the most common correlations in the topics (the clusters with the maximum total of cases). The second cluster reveals that elderly passengers were involved in some accidents while the train was moving. The next cluster shows the electricity risk, more specifically related to the conductor parts on the platforms. These accidents require more detail in order to derive further safety measures. Analyses can provide root causes, correlations, and any hidden patterns, for example, for electric shocks, including contact with Overhead Line Equipment (OHLE) or Overhead Contact Line (OCL), which may occur accidentally by carrying long objects (selfie sticks or conductive materials) or through vandalism and trespassing. Such details will improve the standards and the obligation to add more protections for the public, passengers, and workforce at the PTI. The probabilities of all instance fields are shown in Table 6, which can be used as a workflow in future projects and generalised. A guide system of the context can be converted to numerical form and cover a huge range of stations.
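The fixed-K case can be sketched as follows. This is a minimal K-Means (Lloyd's algorithm) over per-accident topic proportions, with a deterministic farthest-point initialisation chosen here for reproducibility. G-Means and its critical value are not reproduced, and the six data points and the choice of K = 3 are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialisation."""
    # Initialise: first point, then repeatedly the point farthest
    # from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical topic proportions (e.g. fall, train contact, electric shock)
# for six accident narratives; each row sums to 1.
points = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.15, 0.75],
    [0.10, 0.20, 0.70],
]
labels, centroids = kmeans(points, k=3)
print(labels)
```

Sorting the resulting clusters by their number of members would give the "clusters with the maximum total of cases" view described above.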
For testing, topic evaluation on a short text describing an accident has been used to extract information, which shows the ability to capture information from textual content. This depends on the tags (labelled itemset) used for training the model, with labels such as the day, the time, the event description and the location (Figure 7-a).
For the overall trained model, the statistical measures show excellent outcomes for all the tags, as shown in Figure 7-c. The analysis demonstrates the power of the approach to act as an expert system and guide the reporting process, while also noting any hidden root causes of the accidents.
Some data are accurately detected, such as the day, location, ages and time of occurrence of the accident. Others, such as details of the accidents, are detected less accurately, but these types of data can be captured by providing more training data, which will in turn provide more balance across all tags and improve accuracy.
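Per-tag accuracy of this kind can be quantified as sketched below. The text does not state which statistical measures are reported in Figure 7, so precision, recall and F1 are assumed here as common choices; the tag names and the gold and predicted sequences are hypothetical.

```python
from collections import Counter

def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} for aligned tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag was wrong
            fn[g] += 1  # gold tag was missed
    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        n_pred = tp[tag] + fp[tag]
        n_gold = tp[tag] + fn[tag]
        precision = tp[tag] / n_pred if n_pred else 0.0
        recall = tp[tag] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores

# Hypothetical gold and predicted tags for six extracted items.
gold = ["DAY", "TIME", "LOCATION", "EVENT", "EVENT", "AGE"]
predicted = ["DAY", "TIME", "LOCATION", "EVENT", "LOCATION", "AGE"]

scores = per_tag_scores(gold, predicted)
for tag, (precision, recall, f1) in scores.items():
    print(f"{tag:9s} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the single error lowers recall for the event tag and precision for the location tag, which mirrors the observation that accident details are harder to detect than the day, time or location.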

VII. DISCUSSION
Applying such a method shows the ability and power of new technology in railway industry safety, where it has not been used widely to enhance safety and risk management. The systematic reduction of accidents in stations is beneficial for the public and stakeholders, sustainability, and the safety community. Topic modelling is a proactive approach in which the input can be analysed in real time and actions can be taken. Moreover, multiple inputs can be used in parallel to support decision-makers. This approach reduces the dependence on experts, who are costly and not available all the time, and reduces the impact of knowledge loss when staff retire. The development is key for many aspects that reflect on quality, reliability, and satisfaction for both workers and passengers. The rise of technologies such as AI and IoT, together with the growth of data, requires more investment in and research on such methods. Even with limited data and delayed application in the field, textual data are analysed via text modelling, which opens a novel approach to applying technology in the safety system. Such types of data are essential as they contain the history of safety, such as risk evaluations, accident reports and periodic safety analysis documents, and the method can also use live text from media or calls. Moreover, social media is a textual source of information that can be harvested and analysed as it relates to safety, security, and quality. The method supports the safety authority and decision-makers in many ways, including improving the service, quick response rates and advanced analysis. Also, the approach can build an expert system for a specific area such as railway stations, learn from new documents, and cover an entire country's stations, as well as pre-train the system from many sources.
In the digitalisation concept, integration between the sources of data is possible, which forms a smart safety system in the railway industry. This novel analytics opens a new window for applying AI technologies in the field. Also, railway organisations can use such a method to cover all RAMS parameters, since safety overlaps with maintenance, reliability and many other factors. Analysing safety in the stations is part of the RAMS analysis of railway networks, which can help managers find the key components of failure in the safety of the network [66], [67], [68].
LDA provides a statistical model with the ability to learn, and this method not only deals with huge data automatically in real time but can also provide an expert system and decision support for the safety authority and the researcher. This concept corresponds with future rail digitalisation and the big data (BD) revolution. The flexibility, effectiveness, and accuracy raise the importance of gathering more data in the station with all the possible details. Accessibility, privacy, skills and the IT infrastructure are some hurdles in the short term, which the industry is expected to overcome in the medium and long term. Finally, there are concerns about the vast acceleration of AI exploring texts and language, such as generative pre-trained transformer models (ChatGPT). However, using these models responsibly and cautiously and considering appropriate measures to mitigate potential risks is essential. Moreover, human health and safety at work or in other areas, such as railway safety that can save lives, must be highly prioritised.

VIII. CONCLUSION
Topic models have an important role in many fields, and in this case in text mining for safety and risk management in railway stations. In topic modelling, a topic is a list of words that co-occur in statistically significant ways. A text can be voice records, investigation reports, reviews, risk documents and so on.
This research presents various cases demonstrating the power of unsupervised machine learning topic modelling in promoting risk management, safety accident investigation and the restructuring of accident recording and documentation at the industry level. From the description of accident root causes, the suggested model shows that the platforms are the hot spots in the stations. The outcomes reveal that station accidents occur owing to four main causes, including falls, being struck by trains, and electric shock. Moreover, night time and the day of the week appear to be significantly connected to the risks.
With increased safety text mining, knowledge is gained on a wide scale and over different periods, resulting in more efficient RAMS and a holistic perspective for all stakeholders.
Application of the unsupervised machine learning technique is useful for safety since it solves problems, explores hidden patterns and deals with many challenges, such as:
• Text data from many perspectives and in unstructured forms
• Power for discovery, dealing with missing values, and spotting safety and risk keys from data
• Smart labelling, clustering, centroids, sampling, and associated coordinates
• Capturing the relationships and causations, and more, for ranking risks and related information
• Prioritisation of risks and implementation of measures
• Aiding the process of safety review and learning from long and massive experience
• Scale and weighting can be used as configuration options for assessing risks.
Although this paper highlights the innovation of unsupervised machine learning in the classification of railway accidents and root cause analyses, it is necessary to focus on expanded research on huge data topics concerning the diversity of station locations, sizes, safety cultures and other factors, with further unsupervised machine learning algorithms in the future. Finally, this research enhances safety, but it raises the importance of data in text form and suggests redesigning the way data are gathered to be more comprehensive.