Introduction
Occupational injury is defined as “any physical injury that affects a worker while working”. Similar terms include workplace injury, work-related accident, and occupational accident [1], [2]. The International Labour Organization (ILO) has reported that occupational injury may potentially be declared a ‘public health emergency’, as it claims the lives of thousands of workers each year [3]. It contributes substantially to fatalities, reduced work productivity, and a weakened economy at large [4]. In addition, the economic cost of workplace injuries can be measured through medical and rehabilitation compensation, including lost-time costs, social security benefits, and the training and re-training of workers [5]. This may cost up to 4-5% of the global ‘Gross National Product’ (GNP) [6]. Workplace injuries not only affect workers physically; they can also cause detrimental psychological effects, including depression, anxiety, and post-traumatic stress [7]. These psychological effects prolong injury recovery, thereby increasing compensation expenses. Other research has revealed that workers who experienced occupational injury developed difficulties in physical and physiological health that disrupted their workplace relationships [8]. To some extent, they have reported psychiatric symptoms and incidences of suicide attempts [9]. Evidently, work-related accidents can cause a ‘domino effect’, contributing not only to the immediate physical and financial burden but also to long-term psychological impacts. Therefore, the avoidance of workplace accidents and injuries should be a top priority for occupational health and safety throughout all sectors of the economy.
Occupational injury reports are records of injured worker information consisting of structured information (workers’ demographics, type of injury, accident cause, etc.) and unstructured data such as textual injury histories or narratives. These records are valuable and offer remarkable opportunities, especially for Artificial Intelligence researchers, to extract and analyze information in a more reliable and efficient manner. With recent advances, machine learning, including deep neural network techniques, has gained interest as the method of choice for predicting occupational injury outcomes [10].
Related studies applying these techniques to structured occupational injury information include the following: (i) Yedla et al. [11] used categorical input from the mining industry to predict the occupational accident outcome of days away from work, with a Neural Network outperforming the other models in their study; (ii) Chadwiya [12] utilized South African workplace accident-labelled data and revealed the Support Vector Machine (SVM) to be the best model for predicting occupational injury based on affected body parts; and (iii) Khairuddin et al. [13] investigated the prediction performance of occupational injury severity using categorical variables through an optimized Random Forest model across all industrial sectors. In addition, unstructured text data are believed to be a valuable source of information; thus, insights can be extracted using text-mining techniques [14]. Unstructured data often contain rich semantic information that can yield valuable insights. In text data, for instance, the choice of words, sentence structure, and tone can convey significant meaning. By incorporating this information, a model is better able to comprehend and capture the data’s underlying patterns. Unstructured data can also provide crucial additional context for understanding the relationships within a dataset. Several studies have focused on analyzing unstructured injury reports as input features. For example: (i) Jing [15] proposed Word2Vec and Long Short-Term Memory (LSTM), a recurrent neural network (RNN) variant, as a text-mining predictive tool for workplace accidents in the chemical industry; (ii) Baker et al. [16] developed an improved text-mining model with stacking of the XGBoost-Random Forest algorithm to predict occupational injury outcomes; and (iii) Goldberg [17] analyzed injury narratives to compare word-embedding techniques, such as Word2Vec and TF-IDF, together with several machine learning algorithms, in predicting the severity of occupational injury in the United States. As most of the preceding works employed structured data or unstructured text separately, the development of occupational injury severity prediction models that combine both modalities, structured and unstructured data, remains neglected and restricted in the occupational injury research domain [18].
Consequently, the purpose of this research is to propose an integrated predictive model based on multimodal learning of structured data and unstructured information using machine and deep neural network approaches in predicting occupational injury severity.
To summarize, the main contributions of our study are as follows:
The potential of integrating structured data, such as labelled data points, and unstructured data, such as workplace injury reports, has been neglected by the majority of previous studies on occupational injury severity prediction [18]. This work acknowledges the significance of both modalities and proposes a novel strategy that exploits the power of multimodal data integration. Integrating the unstructured data enhances the feature representation of the overall dataset, as the text narratives contain valuable information that may not be captured in the structured data alone.
Unstructured occupational injury reports are subjected to several preprocessing stages, including text cleaning using Natural Language Processing (NLP) and tokenization, followed by text representation techniques. Our study proposed an innovative approach for integrating Term Frequency-Inverse Document Frequency (TF-IDF) and the Global Vector (GloVe) as text representations. These stages allow unstructured textual data to be converted into numerical representations, making them appropriate for machine and deep learning models.
The vectors representing structured and preprocessed unstructured text data are concatenated and utilized as input features for the proposed predictive models. Because of this integration, the models may learn from both modalities simultaneously, collecting the complementary information available in structured and unstructured data.
The multimodal occupational injury severity prediction model presented herein has practical implications for workplace safety and health. By enhancing the early screening and identification of at-risk employees with severe occupational injury outcomes, the predictive model can contribute to the improvement of workplace safety measures and overall working environment. The model’s information can guide interventions and initiatives designed to promote workers’ well-being and prevent the occurrence of severe workplace injuries. These insights will assist in identifying potential hazards, implementing proactive measures, and enhancing overall workplace safety.
Therefore, this paper advances the field of occupational injury research, as the findings from the multimodal machine and deep learning models present a benchmark of model performance for occupational injury severity prediction tasks.
This paper is organized into eight sections, including the Introduction. In Section II, a summary of previous related studies is presented. The proposed methodology is explained in Section III and a step-by-step model experiment is summarized in Section IV. The model prediction findings are presented in Section V, followed by the results of model optimization in Section VI. Section VII discusses the overall findings, and Section VIII concludes the paper.
Related Works
Multimodal learning is defined as ‘the area of applying machine and deep learning techniques in integrating multiple types of data into a single model to optimally exploit the unique and valuable information in an algorithmic framework’ [24]. The ultimate aim of multimodal learning is to harmonize the diversity of data to improve data quality, thereby enhancing prediction performance [25]. The application of multimodal learning has been explored in several fields, especially using the different modalities extracted from the ‘Electronic Health Record’ (EHR), which contains categorical/numerical variables, clinical notes, and clinical images. For example, Ross et al. [26] integrated structured data and clinical text to train Logistic Regression (LR) and Random Forest (RF) models for predicting cardiovascular diseases. Lei et al. combined structured data and unstructured text, including audio and clinical images, into deep neural network architectures to categorize communicable disease events. Zhang et al. [27] executed LR, RF, and advanced deep learning methods, namely Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), on structured data, unstructured clinical notes, and the integration of both, to predict hospitalization stay and mortality, revealing that the integration of both data types achieved the best prediction performance. Therefore, multimodal learning with machine and deep learning techniques has progressively emerged in fields including clinical diagnosis prediction [28], [29], pathological screening [30], and business intelligence [31], as it has proven effective in improving model predictability.
However, the exploration of multimodal data for occupational injury severity remains limited. Johansson et al. [32] proposed a study protocol to integrate a structured occupational injury registry and photovoice inputs to predict medical leave events among injured workers. They believed that multimodal data could improve model predictability and help discover insights into workplace injuries among Swedish adults. Next, a study by Paraskevopoulos et al. [33] used a multimodal dataset of safety reports and workplace images, prepared by Safety Officers, and executed an NLP-based neural network as the predictive model for the outcomes of workplace safety audits. Their study agreed that multimodal data can uncover hidden information, thus providing better accuracy. Recently, Sarkar et al. developed a multimodal deep neural network model by integrating occupational injury narratives and categorical variables [18]. Their study compared the prediction performance of deep neural networks with several optimizers and found that the model with the ‘adaptive moment estimation’ (Adam) optimizer was the best-performing predictor of occupational injury in the steel manufacturing industry. Nevertheless, they emphasized that predictive analysis using multimodal data in the occupational injury area remains underexplored, limited, and in need of further investigation.
Table 1 summarizes recent literature on occupational injury severity prediction models. On the basis of this overview, one can conclude that current occupational injury severity predictive analyses make use of structured or unstructured text data and tend to concentrate exclusively on a single industry sector. This causes a lack of generalizability in the existing predictive models, as data from different industries and sectors are neglected. Variations in work environments, hazards, and job tasks are disregarded, and the predictive models lack the capability to learn from diverse scenarios across various occupational settings. In addition, there is a paucity of multimodal data integration in developing occupational injury prediction models. Nonetheless, the exploration of advanced neural network architectures is gaining attention in the occupational injury domain.
Proposed Methodology
Our study proposes a multimodal occupational injury severity prediction model that encompasses three main processes: first, data gathering; then, data preprocessing; and finally, the prediction classifier stage. The two modalities, structured and unstructured data, were preprocessed separately. Subsequently, the feature representations generated from the structured information and the text representation were merged. These vectors were concatenated and fed as input to the proposed classifiers to predict the severity of the occupational injury.
A. Dataset Description
A publicly accessible dataset was gathered from the Occupational Safety and Health Administration (OSHA) of the United States [34]. The dataset can be accessed at https://www.osha.gov/severeinjury. In this study, the dataset comprises injured worker information with over sixty thousand records between January 2015 and July 2021. This dataset, named the ‘Occupational Injury Severity Report’, includes variable columns such as (i) ID number, (ii) event date, (iii) employer’s address, (iv) the state with latitude and longitude, (v) nature of the injury, (vi) affected body parts, (vii) type of exposure, (viii) type of source, as well as (ix) the injury narratives. Additionally, it contains information on (i) amputation and (ii) hospitalization as indicators of occupational injury severity. Table 2 presents the types of variables used in the dataset.
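For illustration, the sketch below shows how such a publicly released CSV export could be loaded and reduced to the variables of interest with pandas; the filename and column headers are assumptions for illustration and may differ from the actual OSHA file.

import pandas as pd

# Hypothetical local filename after downloading the OSHA severe injury export;
# the real file name and column headers may differ.
df = pd.read_csv("severeinjury.csv")

# Keep only the columns relevant to this study (names are illustrative).
columns_of_interest = [
    "Industry", "NatureTitle", "Part of Body Title",
    "EventTitle", "SourceTitle", "Final Narrative",
    "Hospitalized", "Amputation",
]
df = df[[c for c in columns_of_interest if c in df.columns]]
print(df.shape)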
B. Structured Data Preprocessing
In this research, five categorical variables were considered as input features: the (i) type of industry, (ii) nature of the injury, (iii) affected body part(s), (iv) type of event, and (v) type of source. These data are pre-coded according to the top-level label only, as guided by the Occupational Injury and Illness Classification Manual (OIICS). In addition, any rows with missing or empty values were removed, whereas other columns, for example, ID number, employer’s address, latitude, and longitude, were excluded because of their irrelevance. After data preparation, 66,405 data points were used for predictive analysis.
For the predictive analysis, a data preprocessing step was employed to ensure that the input features had consistent contributions during the machine and deep learning development process [35]. Categorical data were manually encoded by referring to the OIICS manual system, which represents categories as numerical labels.
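As a minimal sketch of this manual encoding step, the mapping below converts one categorical column into numerical labels with pandas; the category names and code values shown are placeholders rather than the exact OIICS codes used in the study.

import pandas as pd

# Illustrative top-level mapping only; the real codes follow the OIICS manual.
nature_codes = {
    "Traumatic injuries and disorders": 1,
    "Diseases and disorders of body systems": 2,
    "Infectious and parasitic diseases": 3,
}

df = pd.DataFrame({"NatureTitle": ["Traumatic injuries and disorders",
                                   "Diseases and disorders of body systems"]})
df["nature_code"] = df["NatureTitle"].map(nature_codes)

# Rows that cannot be mapped (missing or unknown labels) are dropped,
# mirroring the removal of empty values described above.
df = df.dropna(subset=["nature_code"])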
The next steps involved developing machine and deep learning models using these encoded categorical data along with unstructured feature representations to predict the outcomes of occupational injury severity. This approach leverages information encoded in categorical variables and incorporates it into predictive classifiers for multimodal learning research.
C. Unstructured Data Preprocessing
Textual narratives of occupational injuries were included as sequential unstructured input features. These text reports were prepared by Safety and Health personnel, assisted by an Occupational Health Doctor, and we believe they encompass a large amount of insight that can be extracted for the predictive analysis in this study. The conversion of text data into numerical values is essential before they can be processed by machine and deep learning algorithms [36].
In this study, text preprocessing was performed for Natural Language Processing (NLP) tasks through the following steps: (i) removal of punctuation and digits, as they do not contribute to the analysis; (ii) removal of extra whitespace such as tabs and line breaks; (iii) removal of characters that may potentially interfere with the text vectorization step [37]; (iv) removal of stop words such as “a” and “the”, as they are considered ‘unnecessary words’ that do not contribute to the classification task and may increase the dimensionality of the vectors [38]; and (v) conversion of the text to lower case. Next, the text underwent tokenization, where a string of words is segmented into its component words, named ‘tokens’. Each tokenized word is assigned a number to identify that particular word. This is a crucial step in converting words into numerical features [39].
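A minimal sketch of these cleaning and tokenization steps is shown below, using NLTK for the stop-word list and simple whitespace tokenization; the exact cleaning rules and tokenizer used in the study may differ.

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_narrative(text):
    """Apply steps (i)-(v): strip punctuation/digits, collapse whitespace,
    lower-case, remove stop words, then split into tokens."""
    text = text.lower()                       # (v) lower case
    text = re.sub(r"[^a-z\s]", " ", text)     # (i), (iii) drop punctuation, digits, stray chars
    text = re.sub(r"\s+", " ", text).strip()  # (ii) collapse extra whitespace
    tokens = text.split()                     # simple whitespace tokenization (illustrative)
    return [t for t in tokens if t not in STOP_WORDS]   # (iv) remove stop words

print(clean_narrative("Employee #1 fell from a 10-ft ladder and fractured the left arm."))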
Next, text representations were generated by converting the tokens into numbers to be processed by the classifiers. The text representation methods employed in this study were Term Frequency-Inverse Document Frequency (TF-IDF) and the Global Vector (GloVe). TF-IDF was considered as it commonly appears as a high-performing text vectorization technique [40], [41]. It consists of two elements: the ‘term frequency’ (TF) and the ‘inverse document frequency’ (IDF). TF depends on the number of occurrences of a word in each injury narrative, whereas IDF is computed based on how rarely the word appears across the entire injury narrative dataset. The IDF and the resulting TF-IDF weight are computed as \begin{align*} idf_{i}&= \log \left ({\frac {1+D}{1+df_{i}} }\right)+1 \tag{1}\\ tfidf_{i,w}&=tf_{i,w}\times idf_{i} \tag{2}\end{align*} where $D$ is the total number of injury narratives and $df_{i}$ is the number of narratives containing word $i$.
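For illustration, scikit-learn’s default smoothed IDF corresponds to the weighting in Eq. (1), as the short sketch below shows; the exact vectorizer settings used in the study are not specified and are assumed here.

from sklearn.feature_extraction.text import TfidfVectorizer

narratives = [
    "worker fell from ladder and fractured arm",
    "finger caught in machine blade and amputated",
]

# smooth_idf=True implements idf = ln((1 + D) / (1 + df)) + 1 as in Eq. (1);
# note that scikit-learn additionally L2-normalizes each document vector.
vectorizer = TfidfVectorizer(smooth_idf=True)
tfidf_matrix = vectorizer.fit_transform(narratives)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))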
The vectorized words were then complemented with GloVe word embeddings. GloVe is a word-embedding method in which words are represented as dense vectors in a high-dimensional space, learned efficiently from global word co-occurrence statistics in textual materials [42]. To generate the word vector representations in this study, a pre-trained GloVe model from Stanford NLP labeled “Glove.6B” was used. This pre-trained GloVe model provides 100-dimensional vectors trained on six billion tokens from Wikipedia articles and the Gigaword dataset. It is freely accessible to the public under a Public Domain Dedication and License [43]. Figure 1 shows the schematic diagram of the unstructured text preprocessing used in this study.
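The sketch below illustrates the common pattern of loading the pre-trained “glove.6B.100d” vectors and building an embedding matrix aligned with a word index; the file path and the simple word index are assumptions, and the way the study combined TF-IDF with GloVe is simplified here.

import numpy as np

narratives = ["worker fell from ladder", "finger caught in machine blade"]

# Build a simple word index (in the study a tokenizer produced this mapping).
vocab = {word: idx + 1 for idx, word in enumerate(
    sorted({w for text in narratives for w in text.split()}))}

EMBEDDING_DIM = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # path is illustrative
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Row i of the matrix holds the GloVe vector for the word with index i;
# out-of-vocabulary words remain zero vectors.
embedding_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
for word, idx in vocab.items():
    if word in embeddings_index:
        embedding_matrix[idx] = embeddings_index[word]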
D. Multimodal Data Fusion
Structured and unstructured data were combined as input representations to predict occupational injury severity in terms of (i) hospitalization and (ii) amputation for multimodal learning. The structured data were prepared and normalized as explained in subsection B, whereas the unstructured text data underwent text preprocessing, tokenization, and text representation, as described in subsection C. Subsequently, both preprocessed representations were concatenated into a single input representation vector using an early fusion strategy. The early fusion strategy integrates both data modalities after their preprocessing steps and feeds them as input representations to the set of classifiers. The early fusion strategy is preferred in multimodal learning owing to its practicality and simplicity [44]. Moreover, this strategy tends to generate better prediction performance than the unimodal versions [45], [46]. Figure 2 illustrates the flowchart of multimodal learning in this study.
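As a minimal sketch of the early fusion step (assuming the structured features have already been encoded and the narratives vectorized), the two representations can simply be concatenated column-wise:

import numpy as np

# Assumed shapes: n_samples x n_structured and n_samples x n_text_features.
structured_features = np.array([[1, 3, 2, 5, 4],
                                [2, 1, 4, 3, 6]], dtype=float)
text_features = np.random.rand(2, 100)   # e.g., TF-IDF/GloVe-derived vectors

# Early fusion: a single joint input representation per record.
fused_input = np.hstack([structured_features, text_features])
print(fused_input.shape)   # (2, 105)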
E. Prediction Modelling
Both prediction outcomes are composed of a binary classification problem, where the label indicates Yes (1) or No (0) for the occurrence of hospitalization, as well as, the likelihood of an amputation event. Prior to the model development, the data were partitioned into two sets using stratified sampling; 80% of the data were used as the training set, and the remaining 20% of the data were applied as the testing set.
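A sketch of the stratified 80/20 partition with scikit-learn is shown below; the dummy data and random seed are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 105)            # fused multimodal features (illustrative)
y = np.random.randint(0, 2, size=100)   # binary severity label (0 = No, 1 = Yes)

# stratify=y preserves the class ratio in both the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42,
)
print(X_train.shape, X_test.shape)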
Five machine learning algorithms and two deep neural architectures were proposed to analyze the multimodal data in predicting the severity of occupational injuries. The five machine learning predictive models were (i) Naïve Bayes (NB), (ii) K-Nearest Neighbors (KNN), (iii) Decision Tree (DT), (iv) Random Forest (RF), and (v) Support Vector Machine (SVM). These ML models were selected because of their consistency in occupational injury prediction studies [11]; a comprehensive comparative analysis is therefore required to assess model effectiveness [47]. Because several previous studies have recommended exploring RNN variants in the multimodal occupational injury domain [48], [49], this study adopts two RNN variants as the proposed deep learning architectures: (i) Long Short-Term Memory (LSTM) and (ii) Bidirectional Long Short-Term Memory (Bi-LSTM).
NB is a supervised learning algorithm based on Bayes’ theorem and is preferred for its simplicity and ability to produce accurate predictions. It is named “naive” because it assumes that, given the class label, the features in the data are conditionally independent of each other. Despite this “naïve” assumption about the input variables, the NB algorithm performs well in a variety of classification problems [50].
KNN is an algorithm built on the assumption that similar data points exist near, or in close proximity to, one another. The model calculates the distance between data points and then categorizes them based on their proximity. Typically, it relies on a distance metric, such as the Euclidean distance to the training samples, and then makes a judgement based on the majority vote or average of the labels of its k nearest neighbors [51].
Next, the DT algorithm uses the training data to generate a tree-like decision structure, with the starting point being a ‘root node’ and the ending points being the leaves. The classification strategy in DT begins with the division of the root node toward the leaf nodes. The splitting process makes use of the input variables and progresses until a leaf node is reached. The leaf nodes, also known as end nodes in some literature, indicate the final outputs, which are the classification outcomes [52].
SVM uses a hyperplane to maximize the margin between two classes in binary classification tasks. The training data points nearest to the boundary influence the creation of the hyperplane and are referred to as “support vectors”. An interesting feature of SVM is that it can accommodate both linearly and non-linearly separable datasets by mapping the data into a higher-dimensional space, where it can be separated by a hyperplane, using a kernel function. A kernel function is introduced to aid in the separation of the different classes. The linear, sigmoid, and radial basis function (RBF) kernels are the most commonly employed kernels, and this design of SVM produces good generalization of decision boundaries for data categorization [53]. This study employed the radial basis kernel function.
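For reference, the RBF kernel used here takes the standard form \begin{align*} K\left ({\mathbf {x}_{i},\mathbf {x}_{j} }\right)=\exp \left ({-\gamma \left \|{ \mathbf {x}_{i}-\mathbf {x}_{j} }\right \|^{2} }\right)\end{align*} where $\gamma >0$ controls the width of the kernel; larger values of $\gamma$ produce more flexible, and potentially overfitted, decision boundaries.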
RF combines numerous decision trees to provide a more accurate and stable prediction. Each tree in the RF model produces a classification and accounts it as a ‘vote’. The final classification is then established based on the majority of these votes, with the category with the most votes chosen as the final prediction. Because the RF model adheres to the ‘majority votes decision rule,’ the aggregation of these outcomes will provide a good generalization, resulting in improved accuracy. Additionally, each tree in the forest is trained using a randomly chosen portion of the training data, known as the ‘bootstrap sample’, and a randomly chosen subset of the features, known as the feature subset. This reduces variance and the risk of overfitting [54].
LSTM is an improved technique for solving a well-known drawback in training Recurrent Neural Networks, namely the vanishing gradient problem. The LSTM method overcomes this problem by adding a gate mechanism and a memory unit. The three gates of the LSTM are: (i) the input gate, which controls which information is stored in the memory cell; (ii) the output gate, which determines which information is used in the prediction; and (iii) the forget gate, which controls which information is discarded. The configuration of these gates enables information control, which is the primary reason the vanishing gradient problem of standard RNNs is reduced [55].
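For completeness, the gate computations described above can be written in their standard form as \begin{align*} f_{t}&=\sigma \left ({W_{f}x_{t}+U_{f}h_{t-1}+b_{f} }\right)\\ i_{t}&=\sigma \left ({W_{i}x_{t}+U_{i}h_{t-1}+b_{i} }\right)\\ o_{t}&=\sigma \left ({W_{o}x_{t}+U_{o}h_{t-1}+b_{o} }\right)\\ \tilde {c}_{t}&=\tanh \left ({W_{c}x_{t}+U_{c}h_{t-1}+b_{c} }\right)\\ c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot \tilde {c}_{t}\\ h_{t}&=o_{t}\odot \tanh \left ({c_{t} }\right)\end{align*} where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $x_{t}$ is the input at time step $t$, $h_{t}$ is the hidden state, and $c_{t}$ is the memory cell state.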
Bi-LSTM is an advanced LSTM architecture composed of a forward and a backward LSTM. The key idea behind this bidirectional structure is the capacity to capture information patterns that may be overlooked by a unidirectional LSTM [56]. Because the Bi-LSTM network is constructed from two LSTMs, their outputs are concatenated and utilized as inputs to the final prediction output layer. This enables the network to generate predictions while considering the complete sequence. The strength of the ‘forward-backward’ design in Bi-LSTM leads to improved learning of long-term sequences and, as a result, improves the model’s prediction performance.
In this study, both deep learning models, LSTM and Bi-LSTM, were structured as follows: the number of hidden units was set to 256, the models were trained with a batch size of 64, and the maximum number of epochs was set to 25. In addition, the models were implemented with the ‘Adam’ optimizer and ‘ReLU’ activation, with a dropout rate of 0.2. ReLU activation was proposed based on recent studies demonstrating that a modified ReLU approach with LSTM networks empirically improves model performance in comparison with other activation functions [57] and existing deep learning tools [58], [59], [60]. In addition, an early stopping function was introduced to prevent overfitting and improve model generalization, as well as to assist in determining the optimal stopping point for training [61]. The metric criterion used was validation loss with a patience of 5, in which the models were trained for a maximum of 25 epochs but training stopped earlier if the validation loss did not improve for 5 consecutive epochs. As this study involves binary classification tasks, the dense output layer uses a sigmoid activation, and the loss function used was binary cross-entropy. The customized parameters of each classifier are summarized in Table 3.
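A minimal Keras sketch of this configuration is shown below. The way the fused vector is reshaped into a length-1 sequence is an assumption; only the hyperparameters stated above (256 units, batch size 64, up to 25 epochs, Adam, ReLU, dropout 0.2, early stopping on validation loss with patience 5, sigmoid output, binary cross-entropy) follow the text.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_bilstm(input_dim):
    model = models.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Reshape((1, input_dim)),               # treat the fused vector as a length-1 sequence (assumption)
        layers.Bidirectional(layers.LSTM(256, activation="relu")),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),        # binary severity output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Dummy fused multimodal data for illustration.
X = np.random.rand(500, 105).astype("float32")
y = np.random.randint(0, 2, size=500)

model = build_bilstm(X.shape[1])
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5)
history = model.fit(X, y, validation_split=0.2, epochs=25, batch_size=64,
                    callbacks=[early_stop], verbose=0)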
All of the multimodal classifiers were imported and developed in the Python environment using suitable libraries and packages: NumPy (np), pandas (pd), matplotlib (plt), sklearn, the Natural Language Toolkit (nltk), Keras, and TensorFlow. The development of the prediction models was performed on a laptop with the following specifications: (i) CPU: AMD Ryzen 7 3700U @ 2.30GHz with 12GB RAM and (ii) GPU: Radeon™ RX Vega 10 Graphics 1400 MHz.
F. Model Evaluation Metrics
A confusion matrix is the foundation for computing the predictive performance of classifiers. It takes the form of a “contingency table” that shows how the findings are distributed over the actual classes (represented in rows) and predicted classes (represented in columns) [62]. The matrix comprises four instances: “True Positive” (TP) and “True Negative” (TN) are instances that are correctly predicted as positive and negative, respectively, whereas “False Positive” (FP) and “False Negative” (FN) are instances that are incorrectly predicted as positive and negative, respectively. Based on this matrix, the following evaluation metrics for classification tasks were used to assess the prediction performance of all models proposed in this study: (i) precision, (ii) recall, (iii) F1-score, (iv) accuracy, and (v) AUC values. Figure 3 summarizes the model evaluation metrics employed in this study.
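A sketch of computing these metrics with scikit-learn, given true labels and model outputs, is shown below; the labels and probabilities are illustrative.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Illustrative ground truth and model outputs.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.6, 0.3]    # predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # thresholded class labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))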
Model Experimentation
This subsection summarizes and simplifies the overall implementation of the occupational injury severity prediction model used in this study. The step-by-step implementation is explained as follows:
The experiment began with the preprocessing of structured data, involving data cleaning and preparation. Five categorical variables were selected from the occupational injury dataset: (i) type of industry, (ii) nature of injury, (iii) affected body part(s), (iv) type of event, and (v) type of source. These categorical data are encoded using a reference system.
Next, unstructured injury narratives were extracted through NLP, followed by text representation using the integration of TFIDF and pretrained GloVe word embedding to convert the textual data into numerical representation for further analysis.
The experiment then continued with the integration of both preprocessed data modalities, structured and unstructured, as multimodal representations.
The data were split into two sets using stratified sampling at an 80:20 ratio; 80% of the data were used as the training set, whereas the remaining 20% were used as the testing set.
Several candidate models were explored, consisting of five sets of ML models (NB, KNN, DT, RF, and SVM) and two deep learning models (LSTM and Bi-LSTM), for predicting the severity outcomes of hospitalization and amputation.
Comparative analyses were performed by utilizing established model evaluation metrics, including accuracy, precision, recall, F1-score, AUC, and prediction time. These metrics are used to assess the performance of each candidate model.
Based on the evaluation, the model exhibiting superior performance in terms of accuracy, F1-score, AUC, precision, and recall was selected as the best-performing prediction model.
Then, the experiment proceeded to the model optimization step. Firstly, both data modalities were assessed using the RF feature importance algorithm to determine the most important features and predictive keywords.
The selected best-performing model in (vii) was redeveloped using only the important features and keywords, followed by hyperparameter optimization using random search cross-validation.
Finally, the optimized model was compared with the initially developed model through model evaluation metrics to determine the best occupational injury severity prediction model for final model deployment. Additionally, its compatibility with computational efficiency in terms of processing time (training and testing) was a significant factor in its selection.
Overall, this careful and systematic model selection process ensures that the proposed approach reflects rigorous experimentation and thorough evaluation, thereby generating an accurate, interpretable, and practical occupational injury severity prediction model. The pseudocode of the proposed model experimentation is presented in Figure 4.
Results
The prediction outcomes investigated in this study were the likelihood of severity in terms of hospitalization and amputation. Each classifier was evaluated on all performance metrics and comprehensively compared to determine the best-performing model.
A. Hospitalization Prediction
Table 4 presents the prediction performance of all proposed models on the hospitalization prediction task. The table shows that Bi-LSTM outperformed the other models, with a slightly higher accuracy of 0.93 compared with RF and LSTM, which both achieved 0.92. Bi-LSTM also achieved the best F1-score, at 0.95, and its AUC value, at 0.93, was slightly better than that of the SVM model.
B. Amputation Prediction
Table 5 summarizes the findings of each prediction algorithm for the amputation prediction task. From the table, it can be seen that Bi-LSTM is the best-performing model, as it achieved higher accuracy (0.99), F1-score (0.97), and AUC (0.98) compared with SVM, KNN, LSTM, DT, and NB. Meanwhile, the RF model ranked second, with an accuracy and an AUC value of 0.97.
Based on both tables, the Bi-LSTM model was found to be the best-performing model for predicting occupational injury severity, as it performed strongly on every evaluation metric, specifically accuracy, recall, and F1-score, compared with the other models.
C. Prediction Time
Additionally, the prediction (testing) time of each model was investigated in this study. Although the Bi-LSTM model achieved higher accuracy, F1-score, and AUC values for both prediction tasks, its prediction time was longer than those of RF, SVM, and LSTM. The Bi-LSTM model required up to 67 s to predict the hospitalization outcomes and 66 s to predict the amputation outcomes. Nevertheless, the testing time of the Bi-LSTM model is considered acceptable. A comparison of each model’s prediction time is depicted in the line graph in Figure 5 for both prediction tasks.
D. Learning Curves of Bi-LSTM Model
Another way to assess the performance of our proposed multimodal Bi-LSTM model is to visualize the accuracy and loss progress on the training and validation sets [63], [64]. Figure 6 shows the performance of the Bi-LSTM model based on the learning curves of accuracy and loss on the training and validation sets for both prediction tasks. With the early stopping function in place, the validation loss for the hospitalization task stopped decreasing and stabilized at epoch 14, whereas for the amputation prediction, it halted at epoch 12. This indicates that the Bi-LSTM model achieved its best performance, in terms of minimizing the validation loss, within these epochs. Additionally, it can be observed that the accuracy on the training and validation sets shows an upward trend and gradually flattens. Overall, as the epochs advanced, the learning curves for both loss and accuracy became more stable, with less fluctuation between the training and validation sets. This demonstrates that the models reached a point of convergence, and training them beyond these epochs may result in overfitting.
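A sketch of how such accuracy/loss curves can be drawn from a Keras History object is shown below; the function and variable names are assumptions.

import matplotlib.pyplot as plt

def plot_learning_curves(history):
    """Plot training/validation accuracy and loss from a Keras History object."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    ax_acc.plot(history.history["accuracy"], label="train")
    ax_acc.plot(history.history["val_accuracy"], label="validation")
    ax_acc.set_title("Model accuracy"); ax_acc.set_xlabel("Epoch"); ax_acc.legend()
    ax_loss.plot(history.history["loss"], label="train")
    ax_loss.plot(history.history["val_loss"], label="validation")
    ax_loss.set_title("Model loss"); ax_loss.set_xlabel("Epoch"); ax_loss.legend()
    plt.tight_layout()
    plt.show()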
Model Optimization
We conducted a comprehensive methodology involving feature importance analysis and hyperparameter optimization to achieve the optimum performance of the proposed multimodal Bi-LSTM predictive model. Firstly, a Random Forest feature importance algorithm was employed to determine the most relevant features of the structured variables and the unstructured text, respectively. Subsequently, the proposed Bi-LSTM model was optimized with these significant features and underwent hyperparameter tuning using Randomized Search Cross-Validation. The optimized Bi-LSTM was then compared with the initially developed model to determine the best-performing model for final deployment.
A. Feature Importance Analysis
This step was introduced to assess the most important variables of the occupational injury severity prediction model, thereby providing valuable insights for the classification tasks [57]. The RF algorithm determines feature importance by measuring the reduction in Gini impurity in the model. The impurity reductions for each variable are summed and standardized across the trees. The final values are arranged in decreasing order, with the most important attribute at the top; the greater the value, the more important the feature [13], [58].
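A sketch of extracting the Gini-based importances with scikit-learn is shown below; the feature names and dummy data are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)
feature_names = ["industry", "nature_of_injury", "body_part", "event_type", "source_type"]

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# feature_importances_ holds the mean decrease in Gini impurity, normalized to sum to 1.
ranking = np.argsort(rf.feature_importances_)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]:>18}: {rf.feature_importances_[idx]:.3f}")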
The findings revealed the top three most significant features of structured data were similar for both prediction tasks, which were ‘nature of injury’, ‘type of event’, and ‘affected body part’. These results were consistent with other related studies that measured ‘nature of injury’ and ‘affected body part’ [11], [65], [66], as well as, ‘type of event’ [18] as the essential predictors for occupational injury outcomes.
Moreover, the feature importance of the unstructured text data was determined based on the importance of keywords. In this study, we measured and ranked the top 20 keywords for both prediction tasks. Figure 7 illustrates the top 20 keywords for the prediction of hospitalization and amputation. Based on the figure, the extracted predictive words were closely related to the ‘type of event’ of injury severity, such as ‘fell’, ‘pinched’, ‘slip’, ‘caught’, ‘burns’, ‘tripped’, ‘broken’, and ‘fractured’. In addition, some keywords were identified as common objects that may cause workplace injuries, such as ‘machine’, ‘blade’, ‘ladder’, ‘saw’, and ‘floor’. With this interpretability analysis, it can be concluded that the occupational injury narratives comprise keywords indicating the event and source of workplace injuries, including the severity outcomes, such as ‘hospitalized/hospitalization’ and ‘amputation/amputated’.
B. Hyperparameter Optimization
Initially, the Bi-LSTM model was configured through multiple experiments based on the number of epochs, batch size, and LSTM units. Three architectures were proposed to determine the best accuracy, F1-score, and AUC for both the predictions. The proposed architectures are presented in Table 6.
Based on the experiments, it was found that the Bi-LSTMs configured with Arch 2 (25 epochs, batch size 64, and 256 LSTM units) achieved the highest performance, as presented in Table 7. Therefore, to further verify the Bi-LSTM configuration, hyperparameter optimization was introduced.
In this study, a Random Search algorithm was employed with a cross-validation method (k-fold = 10). This method allows a thorough exploration of the hyperparameter space to identify the optimal configuration for the Bi-LSTM predictive model. The dataset was divided into 10 equal-sized folds, in which each fold acted as a testing set while the remaining folds served as the training set. This process was repeated until each fold had been used once for testing [67]. For each iteration, a combination of hyperparameters from the defined search space was randomly sampled and trained on the training set. The performance of each configuration was then evaluated on the respective testing set using the assigned model evaluation metrics, such as accuracy, F1-score, and AUC. By repeating this process 10 times, a comprehensive view of the model’s performance across different hyperparameter combinations was obtained. The Random Search algorithm is believed to provide good prediction performance and an efficient approach to tuning the model’s hyperparameters [68], thus ensuring that the model’s predictability is reliable and generalizes to unseen data [69]. The corresponding hyperparameters are listed in Table 8.
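A sketch of this random search with 10-fold cross-validation is shown below, using the scikeras wrapper around a Keras builder function; the wrapper choice, the search-space values, and the dummy data are assumptions, with only the k = 10 folds and the random sampling of configurations taken from the text.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

def build_model(units=256, kernel_initializer="glorot_uniform", dropout=0.2, input_dim=105):
    model = models.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Reshape((1, input_dim)),
        layers.Bidirectional(layers.LSTM(units, activation="relu",
                                         kernel_initializer=kernel_initializer)),
        layers.Dropout(dropout),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

X = np.random.rand(300, 105).astype("float32")
y = np.random.randint(0, 2, size=300)

clf = KerasClassifier(model=build_model, epochs=25, batch_size=64, verbose=0)
param_space = {
    "model__units": [128, 256, 512],
    "model__kernel_initializer": ["glorot_uniform", "he_uniform"],
    "model__dropout": [0.1, 0.2, 0.3],
    "batch_size": [32, 64, 128],
}
# Randomly sample 10 configurations and evaluate each with 10-fold cross-validation.
search = RandomizedSearchCV(clf, param_space, n_iter=10, cv=10,
                            scoring="roc_auc", random_state=42)
search.fit(X, y)
print(search.best_params_)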
Based on the table, HeUniform was determined as the preferred weight initialization method, as it is specifically designed to work well with ReLU activation [70]. This is in agreement with Huimin et al. [71], who concluded that the weight initializers corresponded to the activation function. The other identified best hyperparameter values were similar to the initial configurations.
This step was followed by optimizing the Bi-LSTM model with the important features as input representations. For each prediction task, the model was fed with the three important structured features and the most predictive keywords (unstructured data), and the hyperparameters were adjusted based on the optimal parameters. Next, the prediction performance of the optimized Bi-LSTM model was compared with the following architectures: (i) Bi-LSTM I, initially developed with all features and without hyperparameter tuning; (ii) Bi-LSTM II, a model with important features and without hyperparameter tuning; (iii) OPTIM Bi-LSTM I, a model developed with all features and with hyperparameter tuning; and (iv) OPTIM Bi-LSTM II, the fully optimized model composed of the important features with hyperparameter tuning. The findings are presented in Table 9 for hospitalization prediction and Table 10 for amputation outcomes. Based on both tables, the model evaluation metrics of each proposed model were consistent. Although the models using all features may produce slightly better performance metrics, utilizing those features without feature importance analysis or hyperparameter tuning may introduce noise and irrelevant information into model development.
Additionally, it was found that OPTIM Bi-LSTM II, the proposed model with important features and hyperparameter tuning, managed to generate prediction outputs in a timely manner. It is believed that the optimized hyperparameters lead to faster convergence during model training, allowing the model to reach its optimal performance more quickly. The optimized Bi-LSTM predicted the hospitalization outcome in 49 s and the amputation outcome in only 42 s. Therefore, the feature optimization algorithms conducted in this study optimize the multimodal Bi-LSTM occupational injury severity prediction model and markedly accelerate its prediction time, making it suitable for occupational injury decision support systems in real field applications.
Consequently, this study is in agreement with a recent study [72], which preferred an effective and optimal prediction model that uses a small set of important features rather than numerous features as input representations. Arguments in the existing literature highlight the ‘impracticality’ of using a larger set of variables in developing machine and deep learning classifiers: (i) numerous features may increase the complexity of the model, (ii) the training process may suffer from overfitting [73], and (iii) a complex model may generate a higher computational load, making it more costly and less efficient [74]. Therefore, any technique that reduces data dimensionality is recommended to improve model prediction performance [75].
Accordingly, this study emphasizes this feature optimization approach to assist Safety and Health Practitioners in deploying an accurate, interpretable, practical, and time-efficient occupational injury severity prediction model, thus guiding practitioners and policymakers in improving workplace injury intervention strategies. Figure 8 depicts the overall proposed framework of our multimodal Bi-LSTM occupational injury severity prediction model. This framework summarizes the innovative approaches developed in our study: multimodal learning with a Bi-LSTM predictive model integrated with model optimization techniques to enhance model interpretability, practicality, and predictability.
Figure 8. The overall proposed framework of the multimodal Bi-LSTM occupational injury severity prediction model.
Discussion
A. Uniqueness of the Proposed Bi-LSTM Model
From the findings, the recurrent neural network variant, the Bi-LSTM model, showed promising prediction performance for both prediction tasks using multimodal input, where the structured data and unstructured notes were used as input features. The new aspect of this study is the optimization of the proposed Bi-LSTM model for predicting occupational injury outcomes by making use of both structured and unstructured data as input features. In addition, the optimized Bi-LSTM applies two LSTMs to the input features: first, an LSTM is executed on the input sequence (the “forward layer”), followed by training in reverse order with another LSTM (the “backward layer”) [76]. Because of this architecture, which includes both forward and backward LSTM layers, the proposed model is able to analyze every component of the input sequences. We believe that this architecture, in which the algorithm is trained not only from ‘input to output’ but also from ‘output to input’, leads to its high model performance, providing more meaningful outputs and enhancing the model’s accuracy [77].
Additionally, the recurrent layers in the proposed Bi-LSTM model are assumed to be the reason this deep learning algorithm learns feature representations better as the network and its layers grow deeper [27]. The results highlight the remarkable performance of our optimized Bi-LSTM model, whose architecture is well-suited to handling long data sequences in multimodal learning.
Moreover, executing the LSTMs in both directions explains the additional time required for prediction, in terms of training and testing time, with this model. It is known that the more complex the model, the more time is required to train and test it [78]. Therefore, it is fair to note that the Bi-LSTM model in this multimodal learning setting required somewhat longer training and testing times than the other algorithms.
Next, we observed that the prediction accuracy, F1-score, and AUC values of RF and SVM are quite close to those of the best-performing model for both prediction tasks. In RF, the classifier is a combination of prediction trees acting as an “ensemble” [19]. This capability of integrating the predictive ability of multiple learners into a single RF model leads RF to perform quite well in this study. In addition, the SVM algorithm has the ability to map the input representation into a high-dimensional space by using a ‘kernel function’. The kernel function used in this study was the radial basis function (“rbf”), which significantly increased the accuracy performance [65], making SVM one of the most effective machine learning classifiers [53].
B. Complementary Nature of Multimodal Data
This study emphasizes the use of multimodal data sources to develop an occupational injury severity model. The structured input conveys the injured worker’s information, whereas the sequential injury narratives contain the workplace injury history. Based on the findings, the performance of the proposed machine and deep learning models is satisfactory, ranging from 0.8 to above 0.9 on each metric for both prediction tasks. This prediction performance implicitly demonstrates the harmonious nature of multimodal data complementing one another, thus generating good-quality predictive models. Additionally, multimodal learning can enhance model robustness by reducing the impact of noisy or incomplete data in a single modality. If one modality, such as structured data, is ambiguous, the presence of other modalities, such as unstructured text, can compensate for it, thus providing a more realistic prediction performance. Moreover, a predictive model generated from multimodal learning can generalize better to unseen data because it learns and trains from multiple sources, thereby capturing a broader range of patterns and data relationships.
In the context of occupational injury, the integrated analysis of multimodal learning permits the extraction of rich, useful information from occupational injury records prepared by Safety and Health Practitioners and Occupational Health Doctors. This integration of field expertise in the occupational injury domain and the technical aspects of workplace safety results in dependable, stable, and successful occupational injury severity prediction outcomes [79], [80], [81]. It is believed that integrating both modalities provides a more comprehensive data representation, as each modality contains unique and complementary information; integrating them leads to a more holistic understanding of the underlying occupational injury severity events. Consequently, the combination of structured tabular data and unstructured injury notes in this study justifies the successful execution of multimodal deep neural architectures, as it appears to be a convincing strategy for improving the prediction performance of occupational injury severity.
C. Long-Term Benefit of the Proposed System
The detection of occupational injury severity is important for addressing the post-injury consequences for the injured worker, as well as the organization [82]. When an injured worker is hospitalized, they have to take days off for recovery, and those who undergo amputation may face longer treatment, as it may involve physical and emotional rehabilitation. Absence from work owing to the severity of workplace injuries may affect the organization’s lost-time injuries (LTIs). LTIs are indicators of the effectiveness of workplace safety and health in preventing work-related accidents and injuries. A high rate of LTIs indicates poor safety and health monitoring in the organization [83]. Moreover, workplace injury severity is related to the chances of injured workers returning to work [8]. Amputation due to workplace injury may result in permanent disability, and functional deterioration is a main reason for not returning to work. This may generate long-term psychological effects, especially on mental health, thus prolonging injury recovery [84].
As a result, this study offers a number of contributions to real applications in industry. The ultimate objective of this research was to develop an accurate forecasting system to assist safety and health professionals, particularly in estimating the severity of occupational injuries. Workplace safety intervention techniques can be applied as first preventive steps to lessen the severity of injuries based on the forecasted severity outcomes. The newly created multimodal prediction model is capable of identifying high-risk regions and activities within the workplace by precisely predicting the likelihood of severe occupational injury. Using this information to guide the adoption of targeted safety interventions and remedial strategies is beneficial for lowering the likelihood of future injuries.
Additionally, this predictive system can aid in the early screening and identification of at-risk workers with severe occupational injury outcomes, thereby allowing safety interventions and support systems to be prioritized for those workers, such as providing specialized safety training and offering support for physical and mental well-being. The proactive approach of this predictive system can lead to improved workplace safety, health, and overall well-being. Furthermore, this predictive system has the potential to be useful to industry practitioners in resource management. When workers are unable to work due to hospitalization and rehabilitation, the company can suffer significant productivity losses. Therefore, management may reallocate resources, such as assigning additional manpower or redesigning work assignments, to ensure that job productivity continues. In addition, ongoing support, such as physical and counselling support, including job retraining, can be allocated to assist injured workers in recovering in a safe and timely manner.
Consequently, occupational injury severity predictive analytics utilizing multimodal learning are essential as early screening, anticipation, and identification tools for at-risk workers with severe occupational injury outcomes. Correspondingly, the information obtained from multimodal dataset analysis is beneficial in addressing the compelling concerns of Safety and Health Practitioners to foresee effective intervention strategies for preventing severe workplace accidents [85], [86], thereby promoting a workplace environment that is safer and healthier for employees. Worker safety, health, and well-being are of the greatest priority in occupational safety and health; thus, it is vital to employ the latest advanced Artificial Intelligence (AI) approaches in constructing an accurate and robust occupational injury severity prediction model.
D. Comparison With Recent Similar Approach
Sarkar et al. [18] employed a multimodal dataset (structured and unstructured injury reports) from the steel manufacturing industry to develop an occupational injury prediction model. They developed a simple DNN model and tuned it using three optimizers, Adam, RMSprop, and SGD, with a 10-fold cross-validation method. The findings revealed that the DNN with the Adam optimizer (ADNN) was the best classifier, with 0.79 accuracy, compared with SGD-DNN, RMSprop-DNN, KNN, SVM, and RF. Our study agrees with theirs in using the Adam optimizer and a 10-fold cross-validation scheme, and in comparing against similar state-of-the-art ML models.
In other related studies, Mahajan et al. [28] developed a baseline multimodal predictive model using the LR algorithm to predict 30-day readmission for heart failure. They used structured data from EHRs and combined them with unstructured clinical notes. The multimodal LR prediction model achieved an AUC of 0.65, compared with 0.64 when using only structured data and 0.52 when using only unstructured notes. Moreover, Zhang et al. [27] developed a multimodal DNN named ‘Fusion-LSTM’ to predict mortality, hospitalization stay, and hospital readmission using the ‘MIMIC-III’ health records. They utilized unstructured clinical notes and static information, and the model produced more accurate predictions, outperforming the baseline LR and RF models with an AUC score of 0.87. Compared with this, our proposed multimodal Bi-LSTM model achieved a higher AUC of 0.90.
Recent approaches have become more advanced in terms of the diverse integration of data modalities with advanced DNN architectures. For example, Saleh and Murab [87] developed a Convolutional Neural Network (CNN) to predict fall injuries. In their study, they combined images and sensor data into multimodal representations. The findings revealed that the multimodal CNN model outperformed the conventional ML methods, SVM, KNN, DT, and RF, with an accuracy of 0.97. The latest work by Jujjavarapu et al. [88] integrated structured and unstructured health data, consisting of patients’ personal information, diagnosis codes, drug names, and diagnostic imaging reports, to predict decompression surgery for low back pain due to occupational back injury. They proposed a multimodal deep learning architecture composed of (i) a CNN layer, (ii) a Gated Recurrent Unit (GRU) layer, a simpler variant of the LSTM, and (iii) two fully connected layers, compared against a baseline LASSO Logistic Regression. The findings revealed that the multimodal deep learning model achieved a better AUC of 0.73.
Our findings and the reviewed studies consistently indicate that multimodal deep learning architectures generate better predictive performance than traditional ML models. Further exploration of data accessibility and of adopting standardized multimodal deep learning methodologies in the occupational injury domain is required, with the potential to assist decision-making and resource allocation and to enhance workplace injury intervention strategies in real industrial applications. In the following, we highlight the limitations of the study and potential opportunities for future research.
E. Limitations of Study
Although this study utilized occupational injury data across broad industrial sectors, we noted the necessity of performing additional transfer learning to evaluate the generalizability of the developed model. However, most occupational injury datasets are restricted and do not reveal sufficient features indicating the severity of hospitalization and amputation [15], thus limiting the accessibility and data quality needed for the predictive analysis process. Besides, most of the datasets are ‘domain-specific’ and difficult to transfer to other settings [27]; for instance, the ‘technical language’ related to workplace safety, including the ‘manner of injury’, may vary between industries [17], [89].
Additionally, the dataset relies on ‘human-labelled’ data; each Safety and Health Practitioner may interpret workplace injury severities differently depending on their experience and training level, thereby affecting the consistency of data labelling and categorization. This limitation requires extensive human assistance to clean and sanitize the data labels before further analysis can proceed.
The scalability of our machine learning models is another limitation to consider. Given a large dataset comprising multiple modalities, our computational resources were constrained, thereby limiting the exploration of a wider range of model architectures and parameters. Although our study provides valuable insights into multimodal data analysis, the impact of the parameter choices of each machine-learning classifier cannot be overlooked. More in-depth investigations of the effects of specific parameters on different aspects of the analysis would enrich the understanding of our findings.
F. Future Research Trend
There are multiple sources of occupational safety data that can be used to develop occupational injury prediction models, such as workplace safety audit reports, hazard evaluation reports, and injured workers’ compensation records. Integrating these data sources could improve the comprehensiveness, generalizability, and transferability of the model. As a way forward, we anticipate that future research will integrate injured worker information from occupational injury reports and worker compensation documents to further analyze the pattern of workplace injury severity and the precise cost implications. This would enhance model interpretability for efficient use as an intelligent occupational injury decision support system in real industrial applications.
Another direction is to improve multimodal learning in occupational injury research by exploring other multimodal data fusion strategies, such as joint fusion, late fusion, and hybrid fusion, to generate more robust occupational injury prediction models. Moreover, multimodal learning using workplace injury images together with structured data and unstructured text is recommended; for this, alternative neural architectures such as convolutional neural networks (CNNs) are proposed.
Finally, future research could benefit from extensive feature optimization to assess the robustness of our findings with respect to the parameter variations. Additionally, conducting multiple experiments or iterative refinement approaches to explore a wide range of parameter settings and leveraging more advanced computational resources would help enhance the efficiency and effectiveness of our multimodal analysis.
Conclusion
In conclusion, our study highlights the need to utilize all modalities in occupational injury records to determine the risk of occupational injury severity, such as hospitalization and amputation. The proposed model has been shown to work well in this multimodal learning setting for both prediction tasks. These findings are significant for workplace accident and injury analytics because the model shows high predictive accuracy and classification performance.
To the best of our knowledge, this study is the first to propose multimodal integration learning with traditional machine learning algorithms and recurrent neural network variants; hence, our study serves as a crucial foundation and benchmark for further advancements in multimodal deep learning for occupational injury prediction. In addition, we merged a large historical workplace injury-specific dataset to classify the severity of occupational injuries across broad industrial sectors.
Code Availability Statement
The sample code used in this study is available at https://github.com/mzf23/oshinjury. Any updates or improvements to the code will be made available in the repository to ensure accessibility and sustainability of the research findings.