Advancing Aviation Safety Through Machine Learning and Psychophysiological Data: A Systematic Review

In the aviation industry, safety remains vital, often compromised by pilot errors attributed to factors such as workload, fatigue, stress, and emotional disturbances. To address these challenges, recent research has increasingly leveraged psychophysiological data and machine learning techniques, offering the potential to enhance safety by understanding pilot behavior. This systematic literature review rigorously follows a widely accepted methodology, scrutinizing 80 peer-reviewed studies out of 3352 studies from five key electronic databases. The paper focuses on behavioral aspects, data types, preprocessing techniques, machine learning models, and performance metrics used in existing studies. It reveals that the majority of research disproportionately concentrates on workload and fatigue, leaving behavioral aspects like emotional responses and attention dynamics less explored. Machine learning models such as tree-based and support vector machines are most commonly employed, but the utilization of advanced techniques like deep learning remains limited. Traditional preprocessing techniques dominate the landscape, urging the need for advanced methods. Data imbalance and its impact on model performance is identified as a critical, under-researched area. The review uncovers significant methodological gaps, including the unexplored influence of preprocessing on model efficacy, lack of diversification in data collection environments, and limited focus on model explainability. The paper concludes by advocating for targeted future research to address these gaps, thereby promoting both methodological innovation and a more comprehensive understanding of pilot behavior.


I. INTRODUCTION
As the global aviation industry undergoes transformative technological advancements, the role of pilots is concurrently evolving from simply operating machinery to making critical decisions in high-stakes, dynamic environments [1]. In light of the complex nature of contemporary aviation operations, a comprehensive understanding of pilot behavior becomes paramount for enhancing aviation safety. Machine Learning (ML) technologies, particularly when integrated with psychophysiological data such as electroencephalogram (EEG), present a promising route for in-depth investigation into this vital area. These methodologies enable researchers to acquire nuanced insights into various facets of pilot behavior, including cognitive states and emotional responses. This paper is a systematic literature review conducted in accordance with established methodological guidelines [2], [3], [4]. It aims to offer an exhaustive synthesis of existing research on the application of ML techniques and psychophysiological data for understanding pilot behavior.
The associate editor coordinating the review of this manuscript and approving it for publication was Gang Wang.

A. IMPORTANCE OF AVIATION SAFETY
As a critical component of modern transportation infrastructure, the aviation industry plays an indispensable role in both global commerce and individual mobility. The industry facilitates the movement of millions of passengers and vast amounts of cargo annually, thereby serving as a linchpin in the global economy. Given this scale of operation, the imperative for ensuring aviation safety cannot be overstated; the consequences of failure are cataclysmic, both in terms of human life and economic impact [5].
However, the achievement of optimal safety levels is a complex endeavor, influenced by a myriad of factors ranging from technological innovation to regulatory oversight [6]. Advances in technology have undeniably contributed to enhanced safety mechanisms, from state-of-the-art air traffic control systems to predictive maintenance algorithms that preempt mechanical failures. Nonetheless, the industry is not immune to challenges [7], [8], [9], [10]. Factors such as increasing air traffic, geopolitical tensions, and even natural disasters pose new kinds of risks that require continuous scrutiny and innovation in safety protocols [11].
Moreover, the stakes are not merely quantitative but also qualitative. A single aviation accident can have a ripple effect, undermining public confidence in air travel and triggering economic repercussions that extend far beyond the aviation sector. Regulatory bodies, therefore, are in a perpetual state of vigilance, working in tandem with airlines, aircraft manufacturers, and other stakeholders to formulate and implement safety guidelines that are both rigorous and adaptive to changing circumstances [12].
In summary, aviation safety is a multifaceted and ever-evolving concern that requires a holistic approach, embracing technological, human, and systemic factors. The high stakes involved, both in terms of human lives and economic implications, make it a subject of paramount importance that warrants ongoing research and continual improvement.

B. ROLE OF PILOT BEHAVIOR IN AVIATION
In the intricate system of aviation safety, the role of pilot behavior emerges as a focal point, governed by an interplay of cognitive processes, emotional states, and physiological responses. Pilots, situated at the nexus of multifarious human-machine interactions, bear the responsibility of safeguarding not just the aircraft and its passengers, but also the integrity of the entire aviation system. Their actions, or lack thereof, can have immediate and far-reaching consequences that extend from the cockpit to the broader aviation ecosystem [13].
With the advent of increasingly automated flight systems, the role of pilots has evolved significantly. While automation has undeniably enhanced safety and efficiency, it has also engendered new forms of cognitive workload and psychological stress. Pilots are no longer solely vehicle operators but have become complex decision-makers tasked with managing an array of automated systems. They must maintain situational awareness and be prepared to intervene effectively in unexpected circumstances [14]. This shift has introduced challenges related to attention allocation, decision-making under pressure, and even ethical considerations, such as how to respond in unavoidable emergency situations.
Psychophysiological markers, such as EEG data, have emerged as invaluable tools for gaining insights into pilots' internal states, particularly during high-stakes scenarios like take-offs, landings, and emergency situations. These data types allow researchers to delve into the nuances of cognitive load, attentional focus, and emotional regulation, which are crucial for understanding how pilots make decisions under stress [15], [16].
Moreover, the role of pilot behavior has systemic implications that ripple through the aviation safety ecosystem, influencing everything from regulatory frameworks to the design of new technologies [17], [18], [19]. For example, a nuanced understanding of how pilots handle attentional tunneling could inform the design of more intuitive cockpit interfaces. Similarly, insights into emotional and physiological responses to unexpected events could be invaluable for the development of realistic training simulations.
In summary, the multifaceted and systemic impact of pilot behavior necessitates its thorough investigation. Given its complexity and far-reaching implications, it warrants not just academic exploration, but also practical, real-world applications, ideally supported by advanced methodologies like ML and psychophysiological data analysis.

C. MACHINE LEARNING AND PSYCHOPHYSIOLOGICAL DATA IN AVIATION RESEARCH
The advent of ML technologies represents a pivotal milestone in aviation research, especially in the nuanced domain of pilot behavior. These computational techniques offer a comprehensive framework for analyzing intricate, high-dimensional psychophysiological data sets like EEG, which are often beyond the scope of traditional statistical methods to interpret in a meaningful manner [20].
ML algorithms, encompassing a broad array of models such as tree-based models, support vector machines (SVM), and various neural networks, have proven effective in predicting and understanding multiple facets of pilot behavior. These include, but are not limited to, cognitive workload, emotional states, and task engagement. The capacity to leverage the voluminous and complex variables available in psychophysiological data sets underlines the transformative potential of ML in this research domain [21]. These capabilities extend beyond academic inquiry and are making inroads into real-world applications, including predictive monitoring, adaptive cockpit interfaces, and real-time decision support systems.
Furthermore, the confluence of ML with psychophysiological data yields an interdisciplinary approach that capitalizes on the strengths inherent in both domains. Psychophysiological data provide a window into the complex internal states of pilots, including cognitive and emotional variables [22]. ML, on the other hand, serves as the analytical framework capable of extracting granular insights from these data. This synergistic relationship has given rise to studies that have significantly extended our understanding of human performance and decision-making within aviation contexts [23], [24], [25], [26], [27], [28].
The structure of this paper is designed to provide a holistic overview of the current state of research on the application of ML techniques to psychophysiological data for understanding pilot behavior. Following this introductory section, the paper delineates its systematic review methodology, presents a comprehensive synthesis of key findings, offers an extensive discussion contextualizing these results within the broader landscape of aviation safety and pilot behavior, and concludes by summarizing the salient insights while identifying research gaps that offer promising avenues for future inquiry.

II. METHODOLOGY
The methodology of this systematic review serves as the architectural framework, designed to furnish robust, transparent, and reproducible outcomes. Adhering to the guidelines [2], [3], [4], this section delineates the steps taken to answer the posited research questions. It provides an exhaustive description of the protocols followed in the search, selection, and analysis of literature, in addition to quality assessment. Fig. 1 presents a graphical description of the procedure.

A. RESEARCH QUESTIONS
The present systematic review is directed by a set of carefully formulated research questions. These questions are designed not merely to clarify what is already known but to illuminate areas requiring further exploration. The principal research questions are:
• RQ1: What are the primary focus areas in the application of ML to psychophysiological data for understanding pilots' behavior?
    • What behavioral and cognitive states are most studied?
• RQ2: How are preprocessing, data types, and feature extraction approached in existing studies on psychophysiological data for pilot behavior?
    • Which psychophysiological data types are most used?
    • What artifacts are commonly found in the psychophysiological data?
    • What preprocessing techniques are prevalent?
    • What features are commonly extracted?
• RQ3: What are the types of models utilized to understand pilot behavior?
    • Which evaluation mechanisms and metrics were utilized to assess the models?
• RQ4: What is the comparative performance of various ML and DL models in predicting pilot behavior?
    • What implications do these performance metrics hold?
• RQ5: What are the methodological limitations in existing studies?
    • What future research directions are suggested by the methodological limitations?

B. LITERATURE SEARCH STRATEGY
The integrity of a systematic review is profoundly dependent on the comprehensiveness and rigor of its literature search strategy. To ensure a robust selection of studies pertinent to the research questions, this review adopted a multi-faceted search strategy, encompassing several academic databases and employing a structured set of search queries.

1) SEARCH QUERIES
Keywords and Boolean operators were aligned to construct queries that are both expansive and incisive. Search terms were primarily derived from the research questions. Subsequently, terms related to ML were incorporated based on authoritative sources such as [29]. Phrases such as "machine learning," "psychophysiological data," "EEG," and "pilot behavior" were combined through Boolean operators like "AND" and "OR," fashioning a search net designed for both breadth and precision.
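As an illustration of how such a query might be assembled, the following minimal sketch joins synonyms with OR inside each group and joins the groups with AND. The term groups (including "deep learning") and the helper name `build_query` are assumptions made for illustration; the exact search strings used in the review are not reproduced here, and each database's own syntax would require adaptation.

```python
def build_query(term_groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = []
    for group in term_groups:
        # Quote every term so multi-word phrases are matched as phrases.
        quoted = [f'"{term}"' for term in group]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical term groups, loosely based on the phrases named in the text.
term_groups = [
    ["machine learning", "deep learning"],
    ["psychophysiological data", "EEG"],
    ["pilot behavior"],
]

query = build_query(term_groups)
print(query)
```

Grouping synonyms with OR widens recall within a concept, while AND across groups keeps only records that touch every concept, which is one way to balance breadth against precision.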

2) ACADEMIC DATABASES
The review encompassed an exhaustive search across a selection of databases renowned for their scholarly contributions, namely IEEE Xplore, Scopus, PubMed, ScienceDirect, and Google Scholar. These databases were chosen for their credibility and extensive coverage of academic articles in the fields of engineering, science, and technology. In Scopus and ScienceDirect, a comprehensive scan was conducted on titles, abstracts, and keywords for each retrieved study. For IEEE Xplore, the focus was primarily on metadata.
PubMed was queried by scanning both titles and abstracts, while in Google Scholar, only titles were examined. Such differentiation in search strategies was necessitated by the unique syntax and capabilities of each database. Accordingly, modifications were made to the initial search string to suit the particular idiosyncrasies of each database.

3) TIME FRAME
The time frame selected for the search reflects a balance between historical depth and contemporary relevance. A window of the last fifteen years was delineated, allowing for an appraisal of seminal works while also encompassing the most recent advancements. This temporal scope keeps the review aligned with contemporary scientific thought.

C. INCLUSION AND EXCLUSION CRITERIA
The efficacy of a systematic review is substantially influenced by the criteria governing the inclusion and exclusion of studies. These criteria act as sieves that sift through the amassed literature, retaining articles of relevance and discarding those that do not align with the objectives of the review.
Inclusion Criteria:

D. QUALITY ASSESSMENT
Quality assessment is pivotal in the context of systematic reviews for ensuring that the conclusions drawn are based on rigorous and reliable studies. Each included study was thoroughly evaluated using a predetermined set of criteria:
1. Relevance to Research Questions: Studies were assessed based on the extent to which their objectives and outcomes align with the questions posed by this review. Those highly relevant to the review's research questions are considered to offer more meaningful contributions to the aggregated findings.
2. Quality of Data: The robustness of psychophysiological measures and the ML techniques used were scrutinized.
3. Clarity and Completeness: The level of detail and clarity with which the study's methodology and findings are presented were also considered. Well-documented studies contribute to the review's overall credibility and facilitate future replication efforts.

E. DATA EXTRACTION
The data extraction phase is a critical step in the systematic review pipeline, providing the foundation for the subsequent analysis. This section outlines the structured approach employed for extracting pertinent data from the studies that met the previously established inclusion and exclusion criteria.

1) SEARCH PROCESS
To synthesize a collection of studies pertinent to the research aims, a rigorously formulated search query was executed across the selected academic databases. This initial search yielded a total of 3352 potential studies for inclusion. A de-duplication process then removed 2107 duplicate entries, leaving 1245 studies for further examination. Subsequently, a screening process was carried out in which the titles, abstracts, and keywords of these 1245 studies were evaluated against the inclusion and exclusion criteria. This narrowed the list to 104 studies deemed potentially relevant. A subsequent full-text screening, combined with the quality assessment protocol, led to the exclusion of a further 37 studies, leaving 67 studies. Furthermore, to ensure a thorough review, the references cited in these 67 studies were also examined. This supplemental search led to the inclusion of an additional 13 studies that met the review's criteria. Thus, the final pool of studies included in this systematic review totals 80. A visual representation of this sequential selection process is illustrated in Fig. 2.
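The selection flow described above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the number excluded at title, abstract, and keyword screening is not stated in the text and is derived here, and the variable names are chosen for illustration.

```python
# Counts reported in the search process description.
identified = 3352            # records returned by the database search
duplicates_removed = 2107    # duplicate entries removed
after_screening = 104        # retained after title/abstract/keyword screening
excluded_full_text = 37      # excluded at full-text screening and quality assessment
added_from_references = 13   # added after examining the reference lists

# Derived counts for each stage of the flow.
after_dedup = identified - duplicates_removed
excluded_at_screening = after_dedup - after_screening  # derived, not reported in the text
after_full_text = after_screening - excluded_full_text
final_included = after_full_text + added_from_references

print(after_dedup, excluded_at_screening, after_full_text, final_included)
# prints: 1245 1141 67 80
```

The derived totals agree with the 1245, 67, and 80 studies reported at the corresponding stages.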

2) DATA EXTRACTION PROTOCOL
The data extraction process was designed to capture a rich set of information from each study, thereby enabling a nuanced analysis aligned with the research questions. For each study included in this systematic review, the following data were extracted:
1. Article Title: The title of the article was noted to provide a preliminary understanding of the study's focus and scope.
2. Year of Publication: The publication year was recorded to assess the temporal distribution of research efforts and to identify trends or shifts in research focus over time.
3. Publication Venue: The venue where the article was published.
4. Behavioral Aspects: Specific behavioral states or traits such as workload, fatigue, attention, and emotional states like stress or anxiety were identified and recorded.
5. Model Type: Information regarding the types of models employed, such as ML, DL, or Statistical Models, was extracted. This facilitated a comparative analysis of the methodologies adopted in the existing literature.
6. Model Categories: Within the ML models, specific categories such as tree-based models, SVM, and probabilistic models were noted to enrich the discussion on methodological diversity.
7. Performance Metrics: Metrics such as accuracy, recall, precision, and F1-score were extracted where available. These data aimed to provide a detailed account of the performance evaluations conducted in each study.
8. Psychophysiological Data Types: Types of psychophysiological data such as EEG, electrocardiogram (ECG), and galvanic skin response (GSR) were recorded to understand the range of data employed in assessing pilot behavior.
9. Preprocessing Techniques: Methods used for preprocessing, such as independent component analysis (ICA) or bandpass filtering, were also captured. This allowed for a comprehensive review of the techniques used to refine psychophysiological data before model training.
10. Features Extracted: The types of features extracted from the psychophysiological data, like power spectral density (PSD), wavelet coefficients (WC), or statistical measures, were noted. This contributed to the discussion on feature engineering practices in the existing literature.
11. Limitations and Future Work: Each study's limitations and suggestions for future research were recorded to build an understanding of gaps in the current body of literature. This information is crucial for setting the stage for future explorations.

F. DATA SYNTHESIS
The extracted data were subjected to a multi-layered synthesis process aimed at offering a nuanced understanding of the literature. The first layer involved a descriptive statistical analysis of basic metrics such as year of publication and publication types of studies. The second layer focused on the behavioral aspects, where specific behavioral states like workload, fatigue, and attention, as well as emotional states, were analyzed. The aim was to ascertain the breadth of human performance-limiting states explored in existing literature and identify under-researched areas. The final layer of synthesis focused on the methodological paradigms employed across the studies. Models used, types of psychophysiological data, preprocessing techniques, and performance metrics were categorized and analyzed to discern prevailing trends and potential gaps.
The synthesized data were visually represented through charts and tables, facilitating a clearer interpretation and comparison of findings. Moreover, the synthesis incorporated a narrative approach, integrating the quantitative and qualitative findings to offer a cohesive and comprehensive view of the research landscape on the application of ML and psychophysiological data in understanding pilot behavior.

III. RESULTS
This section serves as the empirical focal point of this systematic review, presenting an analysis of the data extracted from the 80 included studies. Adhering to the data extraction protocol delineated in the methodology section, this segment synthesizes the findings across multiple dimensions, including the types of ML models employed, their performance metrics, and the psychophysiological data types used for predicting pilot behavior. Furthermore, this section provides a granular breakdown of methodological choices in existing literature, including data preprocessing techniques, artifacts identified, and features extracted. The results presented herein aim to offer a comprehensive understanding of the current state of the art, serving as a foundation for the subsequent discussion section where these findings will be interpreted, contextualized, and evaluated.

A. QUALIFIED STUDIES OVERVIEW: A SYSTEMATIC ENUMERATION OF EMPIRICAL INVESTIGATIONS
To provide a comprehensive overview of the empirical investigations qualified for inclusion in this review, multiple criteria have been considered for categorizing the studies. An initial enumeration of the studies is presented in Table 1, which lists each study by a unique Study ID, along with its citation and title. This table serves as a systematic reference, facilitating cross-referencing throughout this review. In addition to tabulated data, Fig. 3 offers a temporal mapping of the studies, illustrating the number of publications per year. Fig. 3 shows a notable surge in the number of studies published from 2015 onwards, signaling an increased research focus on the subject matter. This could be attributed to various factors such as technological advancements, policy changes, or shifts in research priorities.

B. TAXONOMY OF PILOT'S BEHAVIORAL AND COGNITIVE STATES
The taxonomy of behavioral and cognitive states in aviation-based empirical studies is visualized in Fig. 5, serving as a cornerstone for this analysis. It segments the research focus into five overarching categories: 'Cognitive Load Indicators,' 'Performance Metrics,' 'Attention Dynamics,' 'Emotional Responses,' and 'Miscellaneous.' Among these, 'Cognitive Load Indicators' are markedly dominant, comprising a substantial 75% of the selected studies. This predominance creates a striking contrast with the other categories, each of which constitutes a fraction of the total research corpus. Such an imbalance underscores a significant skew in existing research, leaning heavily towards quantifiable cognitive metrics.
A more granular examination reveals that within 'Cognitive Load Indicators,' 'Workload' accounts for 65% of the studies, followed by 'Fatigue' at 25%. Less represented sub-categories like 'Stress,' 'Skill Level,' 'Drowsiness,' and 'Attention Reserve' warrant attention for their minimal inclusion. In the 'Emotional Responses' domain, 'Emotion' captures 60% of the focus, with 'Reaction' and 'Situational Awareness' evenly sharing the remaining 40%. 'Attention Dynamics' is chiefly concerned with 'Distraction' at 34% and 'Attention' at 22%, but critically underrepresents performance-limiting states such as 'Diverted Attention' and 'Startle/Surprise,' each barely surpassing a 10% share. The 'Miscellaneous' category, which accounts for 8% of the studies, is primarily composed of works where the behavioral aspect was neither the central focus nor explicitly articulated.

C. METHODOLOGICAL DESIGN: PSYCHOPHYSIOLOGICAL MEASURES, DATA PREPROCESSING, AND FEATURE EXTRACTION
The following subsection focuses on delineating the methodologies adopted in existing studies, with particular attention to the psychophysiological data employed, the artifacts identified, the methods used for data preprocessing, and the features extracted together with their corresponding extraction techniques. The distribution of psychophysiological data types used in research studies exhibits a notable range of diversity. As delineated in Fig. 6, EEG data are most commonly employed, accounting for 32% of the studies. This is followed by ECG data, which make up 19% of the studies. Interestingly, a 'Miscellaneous' category, comprising flight data and subjective measures such as the NASA Task Load Index (NASA-TLX), holds a non-trivial portion of 17%. Eye-Tracking and GSR follow, constituting 14% and 7%, respectively. On the lower end, Respiration (Resp.), Electrooculogram (EOG), and Electromyogram (EMG) data appear less frequently, each making up less than 5% of the studies.
VOLUME 12, 2024 5139 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Turning to Table 2, a detailed inspection reveals a rich array of artifacts and their corresponding preprocessing methods, sorted by psychophysiological data type. EEG data, for instance, are predominantly subjected to preprocessing methods such as ICA and bandpass filtering. Notably, some studies collected EOG, ECG, and EMG data simultaneously with EEG data and used them to identify heartbeat, muscle, and eye-related artifacts in the EEG data using ICA. Among studies using MATLAB, Artifact Subspace Reconstruction (ASR) is frequently employed. These techniques mitigate challenges posed by ocular and muscular artifacts common to EEG data collection. Interestingly, some studies did not employ any preprocessing techniques and proceeded directly to feature extraction. A range of other preprocessing techniques, including normalization, standardization, resampling, and detrending, were also employed. Some studies opted for manual inspection of the data to remove corrupted segments. ECG and GSR data, although less varied in preprocessing methods, also have unique sets of challenges and corresponding techniques. ECG data commonly undergo QRS detection to accurately identify heartbeats, while GSR data are frequently subjected to low-pass filtering to remove high-frequency noise.
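To make a few of the simpler preprocessing steps concrete, the sketch below chains linear detrending, a moving-average low-pass filter, and standardization on a synthetic signal, using only the Python standard library. It is an illustration under stated assumptions, not the pipeline of any reviewed study: the function names, the synthetic signal, and the 5-sample window are hypothetical, and a moving average is only a crude stand-in for the dedicated low-pass or bandpass filter designs that studies typically use.

```python
import math
import statistics


def detrend(signal):
    """Remove the least-squares straight line from a signal."""
    n = len(signal)
    t_mean = (n - 1) / 2
    x_mean = statistics.fmean(signal)
    # Slope and intercept of the best-fit line x = slope * t + intercept.
    num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(signal))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    intercept = x_mean - slope * t_mean
    return [x - (slope * t + intercept) for t, x in enumerate(signal)]


def moving_average(signal, window):
    """Crude low-pass filter: average each sample with its neighbours.

    The window is truncated at the edges so the output keeps the input length.
    """
    half = window // 2
    return [
        statistics.fmean(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def standardize(signal):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    return [(x - mean) / std for x in signal]


# Synthetic example (hypothetical): slow drift + 1 Hz oscillation
# + alternating high-frequency noise, sampled at 32 Hz for 4 seconds.
fs = 32
raw = [0.05 * t + math.sin(2 * math.pi * 1.0 * t / fs) + 0.3 * (-1) ** t
       for t in range(4 * fs)]

clean = standardize(moving_average(detrend(raw), window=5))
```

The order matters in this sketch: drift is removed first so that it does not dominate the mean and variance used by the final standardization step.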
Complementing this, Table 3 offers a more nuanced examination of the features extracted from these psychophysiological data types. Within the domain of EEG, features such as PSD and WC are frequently extracted, often employing Fourier and wavelet transforms (WT). Some studies also extracted statistical features like mean, median, skewness, and kurtosis, often using time-domain methods. In addition, a cohort of studies explored the extraction of non-linear, spatial, and higher-level features like entropy, coherence, and phase-locking value. Several methodologies for feature extraction were noted, including Welch's method, Morlet wavelet, and Common Spatial Patterns. Furthermore, statistical tests
In sum, the methodological paradigms underpinning the existing literature are diverse, intricate, and tailored to the unique challenges and opportunities presented by each type of psychophysiological data. These empirically grounded observations provide a foundation for subsequent interpretive and evaluative discussions.
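The two most common feature families named above, time-domain statistics and spectral power, can be sketched with the standard library alone. The example below is illustrative only: the function names, the synthetic 10 Hz signal, and the band edges (4 to 8 Hz and 8 to 13 Hz) are assumptions, and the spectrum is a plain single-segment periodogram computed with a direct DFT rather than Welch's method, which would additionally average over overlapping windowed segments.

```python
import cmath
import math
import statistics


def statistical_features(x):
    """Time-domain features: mean, median, skewness, and excess kurtosis."""
    n = len(x)
    mean = statistics.fmean(x)
    std = statistics.pstdev(x)
    skew = sum(((v - mean) / std) ** 3 for v in x) / n
    kurt = sum(((v - mean) / std) ** 4 for v in x) / n - 3.0
    return {"mean": mean, "median": statistics.median(x),
            "skewness": skew, "kurtosis": kurt}


def periodogram(x, fs):
    """Periodogram at non-negative frequencies via a direct DFT.

    This is O(n^2), which is acceptable for short illustrative windows.
    """
    n = len(x)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(x))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_power(freqs, power, low, high):
    """Sum spectral power over the band [low, high) in Hz."""
    return sum(p for f, p in zip(freqs, power) if low <= f < high)


# Synthetic 2-second "signal" (hypothetical): a pure 10 Hz sinusoid at 64 Hz.
fs = 64
x = [math.sin(2 * math.pi * 10 * i / fs) for i in range(2 * fs)]

stats = statistical_features(x)
freqs, power = periodogram(x, fs)
theta = band_power(freqs, power, 4, 8)    # assumed theta band
alpha = band_power(freqs, power, 8, 13)   # assumed alpha band
```

For this synthetic input essentially all power falls in the 8 to 13 Hz band, so `alpha` is far larger than `theta`; with real recordings the relative band powers are the quantities a classifier would consume.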

D. TAXONOMY OF MODEL TYPES AND PERFORMANCE METRICS
The ensuing analysis provides a breakdown of the types of predictive models currently deployed in the literature for understanding pilot behavior. Fig. 7 shows the model types employed to identify the pilot's behavior. A clear trend is the predominant use of ML models, which constitute 65% of the total models utilized. This prevalence likely reflects the capability of ML models for handling complex, high-dimensional data. DL models are also noteworthy, albeit to a lesser extent, representing 27% of the models used. Statistical models account for the remaining 8%, indicating a less frequent but nonetheless important role in the landscape.
Delving into the category of ML models, the analysis reveals a rich and varied methodological landscape. Leading this category are tree-based models, which account for 29% of ML models. Such models are frequently favored for their interpretability and robustness to noisy data. Following closely are SVMs, which make up 26% of ML models and are often chosen for their ability to handle high-dimensional spaces effectively. Dimensionality reduction models, which are crucial for simplifying complex datasets, comprise 14%. KNN algorithms are also significant, accounting for 13%, and are often employed for their simplicity and effectiveness in classification tasks. Probabilistic models, which offer probabilistic interpretations of predictions, account for 8%, while linear models, known for their ease of interpretation, make up 7%. Ensemble methods, which combine predictions from multiple models to improve performance, hold a smaller share of 3%. For further granularity, the tree-based models include a variety of algorithms like DT, XGBoost, and RF, among others. Linear models predominantly feature LR and Lasso Regression, while dimensionality reduction models include techniques like LDA and PCA. Probabilistic models encompass BNN and GP, and ensemble methods feature techniques like Bagging and Boosting.
In the sphere of DL models, the existing studies unveil an extensive array of architectures, reflecting the burgeoning interest in leveraging complex neural network structures for identifying pilot behavior. Contributing 27% of the total models deployed, DL techniques demonstrate their growing influence in this domain. The architectures span a variety of models, from traditional ANN to more complex and specialized types like CNN and LSTM. Each of these architectures brings unique advantages to the analysis of psychophysiological data. For instance, CNNs are often employed for their ability to automatically and adaptively learn spatial hierarchies of features, making them well suited to image and sequence data. LSTMs, on the other hand, excel in handling time-series data, capturing long-term dependencies which are often crucial in understanding behavioral states. The presence of DBN and RNN further enriches the landscape, signifying ongoing experimentation in the field to identify the most effective DL approaches for specific tasks. Even more specialized architectures like SCN and LVQ find mentions, indicating that the field is continually evolving with an expanding repertoire of techniques.
Statistical models, while less frequently employed, consist primarily of traditional techniques like ANOVA and MANOVA. These models are often used for hypothesis testing and the exploration of relationships between variables, providing a contrast to the predictive focus of ML and DL models.
Among the metrics shown in Table 4, accuracy is the most reported performance metric, featured in 65% of the studies, likely due to its simplicity and straightforward interpretation. Recall, which focuses on the model's ability to identify all relevant instances, is reported in 29% of the studies, indicating its importance in applications where missing a positive instance is particularly costly. Precision appears in 21% of the studies, often employed alongside recall to provide a more complete picture of model performance. Specificity, which measures the true-negative rate, and F1-score, which balances precision and recall, are reported in 11% and 13% of the studies, respectively. The AUC, RMSE, MSE, MAE, and Pearson's correlation metrics appear less frequently, suggesting their application in more specific or specialized contexts. Notably, some studies adopt a multi-metric approach, indicating a comprehensive methodology for performance evaluation.
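The relationship between these metrics is easiest to see from a confusion matrix. The following sketch computes accuracy, recall, precision, specificity, and F1-score from hypothetical counts; the function name and the class labels ("fatigued" versus "alert") are assumptions chosen for illustration and do not come from any reviewed study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),  # true-negative rate
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical imbalanced test set: 10 "fatigued" windows and 90 "alert" windows.
# The classifier finds 4 of the 10 fatigued windows and raises 2 false alarms.
metrics = classification_metrics(tp=4, fp=2, fn=6, tn=88)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.92, recall: 0.40, precision: 0.67, specificity: 0.98, f1: 0.50
```

In this example accuracy looks strong at 0.92 even though 60% of the fatigued windows are missed, which is why reporting accuracy alone can be misleading when classes are imbalanced.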
In summary, the existing literature exhibits a varied and intricate array of predictive models and performance metrics, reflecting the methodological diversity inherent in the field. These findings serve as a robust foundation for subsequent interpretative discussions and scholarly evaluations, offering a comprehensive view of the methodological paradigms shaping current research.
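For reference, the classification metrics named in this subsection have standard definitions for the binary case in terms of confusion-matrix counts. The symbols TP, TN, FP, and FN (true/false positives/negatives) are notation introduced here for illustration and are not taken from the reviewed studies.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Precision}   &= \frac{TP}{TP + FP}, \\
\text{Recall}      &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```

Accuracy is the only one of these that pools both classes into a single ratio, which is why it can remain high even when recall on a rare class is poor.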

E. COMPARATIVE PERFORMANCE OF MACHINE LEARNING AND DEEP LEARNING MODELS IN PREDICTING PILOT BEHAVIOR
This subsection examines the average performance of the various categories of ML and DL models. The reported accuracy for each category was extracted from the selected studies, averaged, and visualized in a bar chart (Fig. 8). The SVM and KNN models share an identical average performance accuracy of 77%. While these numbers are certainly respectable, they do not represent the pinnacle of performance among the categories. Remarkably, Ensemble models eclipse other methodologies with an exceptional average performance rate of 97%. This exceptional performance could be attributed to the inherent capability of Ensemble models to combine multiple weak learners, thereby enhancing generalizability and robustness against overfitting.
Following Ensemble models, tree-based models exhibit an average performance rate of 78%. As illustrated in Fig. 9, XGBoost and GBM show higher lower quartiles, at 64% and 86%, respectively, as well as tighter interquartile ranges within this category, suggesting greater robustness in performance. Notably, ET appears to be exceptionally consistent, with all quartiles at 97%. DL models also command attention with their average performance accuracy of 82%. For DL models, ANN and CNN display robust performances with medians at 80% and 83%, respectively. LSTM models show a lower quartile at 62% but reach as high as 87%, indicating potential for high performance but also room for improvement.
In contrast, Dimensionality Reduction and Probabilistic models both show a relatively lower average performance rate of 71%. Within Dimensionality Reduction models, LDA and QDA show a broad range in their performance. LDA has a lower quartile at 65% and an upper quartile at 81%, while QDA exhibits a wider distribution with a lower quartile at 48% and an upper quartile at 89%. Among Probabilistic models, BNN shows remarkably consistent performance, with all quartiles at 67%, whereas GP and NB show wide performance ranges, from 43% to 96% and 37% to 91%, respectively. Linear models register an average performance rate of 77%, which is in line with SVM and KNN.
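The per-category averages and box-plot statistics described in this subsection can be reproduced in principle with a few lines of code. The following is a minimal sketch in Python, assuming the reported accuracies have already been collected per category; the values and the helper name `summarize` are hypothetical and are not taken from the reviewed studies or from the authors' own analysis.

```python
from statistics import mean, quantiles

# Hypothetical per-study accuracies (%), grouped by model category.
# These values are illustrative only, NOT data from the reviewed studies.
reported = {
    "SVM": [70.0, 75.0, 80.0, 85.0],
    "KNN": [60.0, 72.0, 78.0, 90.0],
}

def summarize(accuracies):
    """Return the mean, quartiles, and interquartile range of a list of accuracies."""
    # method="inclusive" treats the data as the full population of reported values.
    q1, median, q3 = quantiles(accuracies, n=4, method="inclusive")
    return {
        "mean": mean(accuracies),
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
    }

summary = {category: summarize(values) for category, values in reported.items()}
```

A narrow interquartile range (`iqr`) corresponds to the "tighter" boxes in a box plot, while the mean corresponds to the bar-chart value. Note that different quartile conventions (for example, the `exclusive` method) give slightly different cut points on small samples.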

IV. DISCUSSION
The Discussion section serves as a critical forum for interpreting the empirical findings presented in the results section. In line with the research questions posited, this section aims to offer an in-depth analysis of the current state of research on the application of ML models and psychophysiological data in understanding pilot behavior. It further contextualizes these findings within the broader academic discourse and identifies both methodological limitations and avenues for future research.

A. EVALUATION OF RESEARCH FOCUS ON PILOT'S BEHAVIOURAL AND COGNITIVE STATES (RQ1)
The analysis encapsulated in subsection B of the results offers a nuanced perspective on the existing body of research surrounding pilot behavior. While 'Cognitive Load Indicators' occupy a dominant position in the academic discourse, it is essential to interrogate the reasons behind such focused attention. One could speculate that the quantifiable nature of indicators like 'Workload' and 'Fatigue' makes them attractive candidates for empirical studies, possibly offering more straightforward avenues for data collection and analysis. However, this concentration exposes a conspicuous void in other pivotal areas. The paucity of research on performance-limiting states such as 'Channelized Attention,' 'Diverted Attention,' and 'Startle/Surprise' is particularly concerning. Given the critical nature of aviation operations and the potential ramifications of performance-limiting states on both safety and efficiency, this research gap represents a glaring omission.
In considering the methodological underpinnings of the existing literature, we encounter two predominant approaches: multi-level and binary classifications. Multi-level classifications, which are often employed to dissect complex behavioral aspects like 'Workload,' offer a more textured understanding but suffer from challenges related to comparability and standardization across studies. The absence of a universally accepted metric for defining 'low,' 'medium,' or 'high' levels of a behavioral aspect could potentially muddle the collective insights drawn from various studies. In contrast, binary classifications, commonly used for attributes like 'Fatigue,' offer clarity and are more easily interpretable. However, this reductionist approach might not capture the continuum of behavioral states pilots may experience, potentially leading to an incomplete or skewed understanding.
These methodological choices have far-reaching implications. For instance, the prevalent use of binary classifications might be well-suited for real-time monitoring systems in cockpits, where quick decisions are paramount. However, such systems, if based solely on existing binary-classification research, might lack the sensitivity to detect nuanced changes in a pilot's behavioral state, thereby reducing their overall efficacy. Thus, a balanced methodological approach seems warranted for future research. Adopting a hybrid model that incorporates both multi-level and binary classifications could offer a more holistic view, capturing both the nuanced complexities and the actionable insights needed in practical applications.

B. INTERPRETING METHODOLOGICAL PARADIGMS IN PILOT BEHAVIOR RESEARCH (RQ2)
The present analysis of existing studies offers a comprehensive perspective on the intricate methodologies adopted in the domain of pilot behavior research, revealing both the depth and the complexity of the current landscape. This diversity not only reflects the multidisciplinary nature of the field but also raises questions about methodological coherence and standardization, providing fertile ground for academic scrutiny.
At the forefront of psychophysiological measures is EEG data, which constitutes 32% of the studies reviewed. This prevalence attests to EEG's high temporal resolution and its capability to capture complex neural activities, factors that have rendered it a popular choice among researchers. However, the data landscape is far from monolithic. ECG data, which accounts for 19% of the studies, is also pivotal, often serving as an indicator of physiological stress and cognitive workload. The role of Eye-Tracking data is similarly significant, often employed to assess attentional states and situational awareness. These alternative data types underscore the multi-faceted nature of pilot behavior, which cannot be comprehensively understood through neural activities alone. The 'Misc.' category, comprising 17% of the studies and including flight data and subjective measures like the NASA TLX and Karolinska Sleepiness Scale, adds another layer of complexity. This category suggests an emerging trend towards the incorporation of multi-modal and subjective data, potentially offering a more rounded understanding of pilot behavior, a point that merits further investigation in future studies.
Turning to data preprocessing, the study identifies a wide array of techniques, each with its unique strengths and limitations. For EEG data, the prevalent use of ICA and bandpass filtering signifies a focus on mitigating ocular and muscular artifacts. However, the rise of the ASR technique among MATLAB users signals the adoption of specialized methods that are tailored to specific research needs. Interestingly, some studies bypass preprocessing altogether, a choice that may have implications for data quality and interpretability. This diversity in preprocessing methods raises critical questions about the standardization and comparability of research outcomes.
In the feature extraction stage, the landscape is equally diverse. While the Fourier transform and WT are commonly employed for frequency-domain feature extraction from EEG data, the study also identifies a growing interest in statistical features and higher-level non-linear and spatial features. This methodological diversity is further enriched by the use of ML and DL techniques, not just for classification but also for feature extraction and selection.
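To make the idea of Fourier-based frequency-domain features concrete, the following is a minimal sketch in Python that computes band power from a synthetic signal using a naive discrete Fourier transform. The function name `band_power`, the sampling rate, the synthetic signal, and the band edges are all assumptions chosen for illustration; they do not describe the pipeline of any reviewed study, and practical work would typically rely on an FFT-based library routine instead.

```python
import cmath
import math

def band_power(signal, fs, f_low, f_high):
    """Sum the squared DFT magnitudes of `signal` for frequencies in [f_low, f_high) Hz."""
    n = len(signal)
    power = 0.0
    for k in range(n // 2 + 1):          # non-negative frequency bins only
        freq = k * fs / n                 # frequency of bin k in Hz
        if f_low <= freq < f_high:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            power += abs(coeff) ** 2 / n
    return power

# Synthetic one-second trace: a 10 Hz rhythm plus a weaker 20 Hz component.
fs = 128  # sampling rate in Hz (assumed for this example)
signal = [math.sin(2 * math.pi * 10 * t / fs) + 0.3 * math.sin(2 * math.pi * 20 * t / fs)
          for t in range(fs)]

# Conventional EEG band edges; exact boundaries vary between studies.
bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
features = {name: band_power(signal, fs, lo, hi) for name, (lo, hi) in bands.items()}
```

Here the 10 Hz component falls in the alpha band and the 20 Hz component in the beta band, so `features["alpha"]` is the largest value. A feature vector of this kind, one value per band and channel, is what a downstream classifier would consume.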
The observed methodological paradigms thus present both opportunities and challenges. On the one hand, the diversity of methods enriches our understanding of pilot behavior from multiple psychophysiological perspectives. On the other hand, the lack of methodological standardization hampers cross-study comparisons and meta-analyses, an issue that warrants attention in future research. The existing literature on pilot behavior showcases a complex tapestry of methodological approaches, each designed to tackle the unique challenges posed by different types of psychophysiological data. This diversity offers a rich yet complex view of current research practices, providing a foundational base for subsequent academic discussions and critical evaluations.

C. INTERPRETATIVE DISCUSSION FOR MODEL TYPES AND EVALUATION METRICS (RQ3)
This discussion examines the use of detection models and performance metrics observed in existing literature, particularly in the context of utilizing psychophysiological data to identify pilot behavior. One of the most salient aspects is the predominant deployment of ML models, which constitute 65% of the total models utilized. This considerable emphasis on ML models raises pertinent questions about their comparative efficacy, especially in contexts where complex, high-dimensional data are involved.
In the domain of DL models, the diversity of architectures is particularly noteworthy. Contributing to 27% of the total models used, DL models signify their burgeoning influence in this area. Researchers have proposed various architectures, some combining CNN with LSTM networks for layered complexity. Other innovative proposals include deep contractive autoencoder networks with softmax classifiers, deep sparse autoencoder networks, and feature mapping layers in stacked denoising autoencoders. This suggests that the field is in a state of methodological flux, continuously exploring and adapting to find the most effective DL models for specific tasks. Traditional statistical models, although foundational, appear less frequently, making up 8% of the total models. Their limited use possibly suggests a methodological shift towards more data-driven models.
The metrics employed for performance evaluation also deserve critical examination. The prominence of accuracy, reported in 66% of the studies, could indicate a focus on overall classification effectiveness. However, the metric may not suffice in cases where the dataset is heavily imbalanced, underlining the need for more nuanced evaluation metrics like recall or precision. The adoption of multiple metrics in some studies indicates a multi-faceted approach to performance assessment but also points to a lack of standardization that could impede cross-study comparisons.
In conclusion, the existing literature exhibits a rich array of methodologies, from traditional statistical and ML models to advanced DL models, each with its unique merits and limitations. The variety of performance metrics used, while indicative of methodological diversity, also suggests the need for further standardization and comparative evaluation.

D. INTERPRETATIVE ANALYSIS BASED ON MODEL PERFORMANCE (RQ4)
The comprehensive results presented in subsection E of the results offer rich insights into the relative performance of various ML and DL models in the domain of pilot behavior prediction. The standout performance of Ensemble models, averaging at an exceptional rate of 97%, is particularly noteworthy. This could be attributed to the capacity of Ensemble models to synthesize insights from multiple weak learners, thereby enhancing their generalizability and robustness against overfitting. However, this high performance also raises questions about the diversity of base learners employed in these ensemble models and how that contributes to their effectiveness. Tree-based models, with an average performance rate of 78%, offer another interesting point for discussion. While they perform well on average, the variance in performance across different types of tree-based models, such as RF and GBM, suggests that the choice of specific tree algorithms and their hyperparameters could be a crucial factor in achieving optimal performance.
The performance of DL models, averaging at 82%, is notable for its potential to capture intricate patterns in high-dimensional data. Yet, the distribution of performance across various DL architectures such as CNNs, LSTMs, and ANNs indicates that no single architecture dominates in terms of efficacy. This divergence could be indicative of the specialized nature of these architectures, optimized for specific kinds of data or tasks within the broader realm of pilot behavior prediction.
Dimensionality Reduction and Probabilistic models, with their lower average performance rates of 71%, warrant a discussion on their applicability and limitations. Given the complex, high-dimensional nature of psychophysiological data, these models may not capture the full scope of relevant features or patterns, thus limiting their performance. Future work might explore hybrid models that combine these methods with higher-performing models to improve accuracy. Moreover, the fact that some models, such as GP and NB, show a wide distribution in their performance statistics suggests a sensitivity to the specific conditions or configurations under which they are employed. This could be an important area for future investigation, particularly in identifying what those conditions or configurations are.
In summary, the detailed results on model performance and their distribution provide a multifaceted view of the current methodological landscape in predicting pilot behavior. They elucidate not just the strengths and weaknesses of various model categories but also point to numerous questions and directions for future research. This could include the exploration of hybrid models, methodological innovations to improve the performance of underperforming categories, and more nuanced applications tailored to the specific needs and challenges of psychophysiological data in pilot behavior analysis.

E. METHODOLOGICAL LIMITATIONS AND FUTURE RESEARCH DIRECTIONS (RQ5)
The assessment of the current literature reveals significant gaps and areas for improvement, necessitating a focused discussion on methodological limitations and future research directions. One pressing concern is the largely unexamined impact of preprocessing techniques on ML models. Although numerous preprocessing methods are employed across studies, the extent to which these choices influence ML model outcomes remains largely unexplored. This represents a critical avenue for future research, as a better understanding of this interplay could lead to more robust and generalizable models.
Another limitation is the reliance on traditional preprocessing techniques. The complexity of psychophysiological data, fraught with various artifacts, calls for the exploration of advanced preprocessing methods. Incorporating such methods could potentially lead to more accurate and reliable models for understanding pilot behavior, and thus should be a focus of future research efforts.
Additionally, the impact of techniques for handling data imbalance on the performance of ML models has not been fully explored and evaluated. Given the frequent occurrence of imbalanced datasets in this domain, this lack of focus raises concerns about the generalizability and reliability of the reported results. Further, the disproportionate emphasis on accuracy as the principal metric for evaluating model performance becomes problematic, especially in cases involving imbalanced datasets. A focus on accuracy alone may not reflect the model's ability to predict minority classes. Therefore, a multi-metric evaluation framework, incorporating additional metrics like recall, precision, and the F1-score, is crucial for a more balanced and comprehensive assessment of model performance.
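The concern about accuracy on imbalanced data can be illustrated with a small worked example. The following Python sketch uses a hypothetical dataset of 95 "not fatigued" samples (label 0) and 5 "fatigued" samples (label 1), together with two hypothetical classifiers; the labels, counts, and function names are assumptions made for illustration and are not drawn from any reviewed study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives/negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (undefined ratios are reported as 0.0)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced ground truth: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Classifier A never predicts the minority class.
always_negative = [0] * 100
# Classifier B raises 10 false alarms but detects 4 of the 5 positives.
detector = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

baseline_scores = evaluate(y_true, always_negative)
detector_scores = evaluate(y_true, detector)
```

In this toy setting the trivial classifier reaches 95% accuracy with zero recall, while the detector scores lower on accuracy (89%) despite finding 80% of the minority class. Reporting recall, precision, and F1 alongside accuracy makes that difference visible.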
In the domain of DL, the utilization of 1D-CNN appears to be underexplored in the context of psychophysiological data analysis for pilot behavior. The architecture of 1D-CNNs is well-suited for handling time-series data, offering the potential for enhanced feature extraction and, ultimately, more accurate predictions.
Few studies in the selected corpus have addressed the critical issue of model interpretability or explainability, a paramount concern for real-world applications where understanding model decisions can have significant implications. This glaring omission in the current literature underscores the need for greater methodological rigor in future studies.
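As a brief illustration of why 1D convolutions suit time-series data, the following Python sketch implements the core operations of a single 1D-CNN layer (convolution, ReLU activation, and max pooling) on a toy signal. The kernel weights are set by hand for illustration; in a trained 1D-CNN they would be learned from data. All names and values here are hypothetical and do not represent any model from the reviewed studies.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, the sliding dot product used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Rectified linear activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Toy time series with a step change between index 3 and index 4.
signal = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

# Hand-set kernel that responds to rising edges (assumed, not learned).
edge_kernel = [-1.0, 1.0]

feature_map = relu(conv1d(signal, edge_kernel))
pooled = max_pool1d(feature_map, size=2)
```

The same small kernel is applied at every time step, so the layer detects the local pattern wherever it occurs in the recording, and pooling then summarizes the response over short windows. Stacking several such layers is what lets a 1D-CNN build features directly from raw or lightly preprocessed signals.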
The literature's focus on data from specific environmental settings constrains the generalizability of the findings. Future studies could benefit from collecting and analyzing data from different environmental contexts, thereby enhancing the ecological validity of the research and providing a more comprehensive understanding of pilot behavior under varying conditions. Furthermore, the feature extraction methods employed in existing studies demonstrate a limited focus on traditional statistical and frequency-domain features. The exploration of spatial features, such as tangent space, remains largely untapped. Given the promise of such features in providing nuanced insights into cognitive states, their exploration could be a significant contribution to the field.
Lastly, the multidisciplinary nature of the field suggests an overarching need for cross-disciplinary collaboration to develop a more unified methodological framework. Such a framework could accommodate the complexities inherent in each type of psychophysiological data and the unique challenges they present. This standardization could, in turn, facilitate meta-analyses and cross-study comparisons, enriching our understanding of pilot behavior from multiple vantage points.

V. CONCLUSION
This systematic literature review endeavors to offer a nuanced and comprehensive understanding of the current state of research that applies ML models for the interpretation of psychophysiological data, specifically focusing on the behavior of pilots. A multifaceted array of findings has emerged from this review, spanning the types of psychophysiological data employed, the specific ML methodologies, and their corresponding performance metrics. Firstly, this review uncovers a pronounced heterogeneity in the types of psychophysiological data employed across studies, with EEG data standing out as the most commonly used. This prominence of EEG data could be indicative of the broader acceptance of its reliability and efficacy in capturing cognitive states, yet it also raises questions about the underutilization of other types of data like ECG, GSR, and eye-tracking metrics.
Significantly, the review has identified a substantial gap in the behavioral aspects studied, most notably the underrepresentation of emotional responses and attention dynamics in the existing literature. These areas, although critical to understanding human performance-limiting states, have been less explored compared to workload and fatigue. Emotional states and attention levels are not only crucial for aviation safety but also enrich the understanding of pilot behavior in a more holistic manner. The current methodological approaches often categorize these aspects into broader categories, thereby potentially missing nuanced interrelations between different behavioral and cognitive states. Therefore, a more balanced academic inquiry into these areas is warranted for a more comprehensive understanding of pilot behavior.
When it comes to preprocessing methodologies, a diverse range exists; however, a notable gap lies in the absence of rigorous empirical evaluation exploring how these preprocessing choices could impact the outcomes of ML models. Given the intricacy of psychophysiological data, which often contains various types of noise and artifacts, understanding this relationship is not just academically interesting but also practically vital. Additionally, a notable methodological limitation is the scant attention given to the critical issue of model interpretability and explainability. Given that ML models are increasingly being considered for real-world applications in aviation, the lack of focus on this aspect is a significant shortcoming that future research must address.
The review also spotlights several key avenues for future investigation. It suggests that examining the impact of advanced preprocessing techniques, and how they interact with different model types, could offer new pathways to enhance model performance. The exploration of data imbalance correction methods, the use of spatial features like tangent space, and the incorporation of innovative model architectures such as 1D-CNNs represent other promising directions.
In sum, while the existing literature provides an invaluable starting point for the scientific understanding of pilot behavior through the lens of ML and psychophysiological data, there is ample room for methodological refinement and exploration. Addressing the identified gaps and under-researched areas will not only elevate the scientific rigor but also contribute to more nuanced, comprehensive, and practically applicable insights into pilot behavior. By focusing on these aspects, future research can aim to substantially advance the field, enriching both its academic depth and its practical applicability in the broader context of aviation safety and efficiency.

FIGURE 1. The adopted steps of the systematic review.

FIGURE 3. Study publication distribution using a yearly calendar.

Fig. 4 supplements this by delineating the division between journal articles and conference papers among the selected studies. According to Fig. 4, a majority of the research is published in journal articles, which often undergo rigorous peer-review processes. The prevalence of journal articles could be indicative of the maturity and established nature of this research area.

FIGURE 6. Comprehensive distribution of psychophysiological and other data types in existing literature on pilot behavior.
and information-theoretic measures such as PCA, Analysis of Variance (ANOVA), Multivariate Analysis of Variance (MANOVA), Friedman tests, and Mutual Information Coefficient (MIC) were not uncommon. ML and DL methods also appeared as tools not only for classification but for feature extraction and selection as well.

FIGURE 8. The performance accuracy of the models utilized in the literature.

FIGURE 9. A box plot for each model type category.

Non-English Publications: Research published in languages other than English was not considered.

Unspecified or Ambiguous Methods: Studies lacking transparent methodology were excluded to ensure the integrity and reproducibility of the review.

TABLE 1. List of the qualified studies.

TABLE 2. Summary of artifacts and corresponding preprocessing methods.

TABLE 3. Summary of features extracted and extraction methods.

TABLE 4. The metrics used to evaluate the models' performance.