Introduction
Mathematical modeling is widely employed across scientific disciplines to investigate the behavior of systems. Leveraging the input-output relationship of a system enables the estimation of model parameters that are not directly measurable. This has given rise to a philosophy that is most prominent today: given enough data, any system can be modeled. This thinking has driven an immense increase in data generation and the genesis of the age of Big Data, which has accelerated many industries such as banking [1], transportation [2], agriculture [3], the Internet of Things (IoT) [4], medicine [5], [6], and language processing [7], thus enabling effective modeling of complex systems.
As reported in [8], data generation has accelerated at an alarming rate; however, the vast volume of data comes at the price of higher energy consumption due to the large amount of processing and communication required. Moreover, Big Data applications are still challenged by the quality of the generated data [9]. Data quality typically refers to the volume, accuracy, and completeness of the data, and phenomena such as noise, outlying data points, and missing data are often cited as causes of low accuracy and/or completeness. This study adopts a different view of data quality, which refers to how relevant and, by extension, how useful the data are for performing a specific task. Although the phenomena mentioned above can lead to a degradation in quality, even "clean" data can be irrelevant. This view of data quality directly challenges the idea that more data is necessarily better, since the performance of the task can be compromised by low-quality data. In light of this, it is possible, and preferable, to achieve better performance while using fewer resources, thus increasing the energy efficiency and sustainability of Big Data solutions.
Appealing to the relevancy of the data is hardly a new idea. In 1996, a highly influential paper categorized different kinds of data quality, among them the contextual quality of the data with respect to the task at hand [10]. In mathematical modeling, this idea can be realized by carefully selecting the types of input data that are most relevant for the model or, in the case of Machine Learning (ML), which features of the data are most relevant for the given task. This study explores a second interpretation: Given a data set, is it possible to determine whether a data point within that set is useful for estimating a parameter of a mathematical model, and can this be used to include useful data points and exclude non-useful ones in the modeling process, thus improving the model and saving resources?
Methods for answering the above will be referred to as data selection methods. Data selection is an interdisciplinary field of research used in many scientific areas, and the definition of a data point depends greatly on the application. A data point can be a collection of features, a segment of a time series signal, an image, a collection of measurement signals, a piece of text, etc. Data selection methods can be highly beneficial, since they can significantly reduce the data burden of complex modeling tasks, thus saving computational resources.
Data selection techniques have been discussed for specific domains such as mobile crowd sensing [11] and, more recently, large language model (LLM) training [12]. However, to the best of our knowledge, no previous work provides a comprehensive overview of these methods across different scientific domains. A general interdisciplinary survey of data selection methods is important for the scientific field as a whole, as it facilitates knowledge sharing between disciplines and encourages the wider use of data selection methods in other data-intensive areas of research. This study provides an overview of the data selection methods that have come to light in recent years across different fields of engineering, with the purpose of categorizing them in a structural and functional way. Specifically, this study provides the following contributions:
A first-of-its-kind high-level overview of data selection methodologies and their application to different fields of engineering.
An establishment of three different dichotomies used to characterize data selection methodologies.
A qualitative comparison of different data selection methodologies based on key performance metrics.
Discussion of opportunities, challenges, and future research directions for data selection methods.
A technical roadmap for this study can be found in Fig. 1. The remainder of the paper is organized as follows: Section II describes the methodology for the search and selection of relevant works included in this study along with a preliminary analysis. Section III formulates the data selection problem in greater detail and introduces three dichotomies for classifying fundamentally different data selection methods. Sections IV, V, and VI present and discuss individual data selection methods that have been identified in Section III. A comparison of the data selection methods presented in this study is done in Section VII along with a broader discussion on data selection. Concluding remarks are given in Section VIII.
Search Methodology
In this section, the methodology used to search, screen, and analyze the scientific literature relevant to data selection methods is described. The search focused on the application of data selection in various fields of engineering, as described in Section I. To this end, we searched for papers in engineering databases such as IEEE Xplore (IEEE), Engineering Village (Elsevier), ACM Digital Library (ACM), and Web of Science Core Collection (Clarivate). The search was conducted using search terms relevant to data selection, such as "data-selection", "data-selective", and "selection of data", along with terms relevant to parameter and/or state estimation. Moreover, searches were also conducted using keywords from different fields which were identified to be synonymous with data selection, e.g., "measurement selection", "training data selection", "sensor selection", etc. Subsequent searches were also made in Google Scholar for additional papers. The search was limited to journal articles, conference articles, and review articles published between 2000 and 2023 (inclusive).
In the selection process, articles that did not fit the scope of this study were deemed irrelevant and therefore excluded from review. This includes papers focusing on other aspects of model development, such as model selection methods, determining relevant parameters to include in the model, and the type of input data that can be relevant to the model, as mentioned in Section I. For applications of machine learning models, papers focusing on determining relevant features and reducing the dimension of the input data were also excluded. Furthermore, papers not written in English were excluded, along with papers for which the full-text document could not be retrieved. The articles were screened using a two-step approach. First, the extracted papers were screened based on their perceived relevance from their titles and abstracts, using the considerations above. The screening was conducted using the open-source tool ASReview, which utilizes machine learning to order the papers in terms of predicted relevance based on the authors' judgment of previous papers [13]. Second, papers deemed relevant based on the abstract and title were further screened based on their content. To help facilitate this process, the machine learning tool ChatPDF (see [14]) was used to generate a report on each article based on a designed prompt with general questions regarding its content. These include questions about how the article used data selection, what the stated motivation for using data selection is, in which application and context the data selection was performed, and what the stated results and conclusions were. The selection of the articles was performed using both the generated report and a manual review to ensure that articles were not unjustifiably deemed irrelevant. Moreover, the authors screened newer papers more favorably than older works in order to keep the discussion relevant to recent developments.
A. Preliminary Analysis
Fig. 2 shows the distribution of included publications over time. As mentioned above, the screening process was intentionally biased towards recent publications. 56 publications are included in this study, of which 16 were published in conferences and the rest in journals. To get an overview of the scientific fields represented in these papers, a word cloud was generated based on the titles of the journals and conferences in which the papers were published. The result can be seen in Fig. 3. Many different fields of engineering are represented, including signal processing, system modeling, Artificial Intelligence (AI), electrical engineering, transportation, communication, aerospace, language processing, robotics, control, and biomedical analysis. This further demonstrates that data selection is a highly interdisciplinary methodology with wide usage in different applications. However, this also presents a significant challenge, as the notation used to formulate the data selection problem varies across fields. Furthermore, the data selection methodologies used in the different fields can vary significantly depending on the specific challenges and requirements, and each field tends to have its own conventions and may not present its data selection methodology relative to a general interdisciplinary framework. Using discipline-specific conventions may alienate researchers from other fields, leading them to perceive the methodology as irrelevant for their needs. Thus, having different conventions for each discipline may restrict the transfer of knowledge between disciplines. In the next section, we propose a general framework for formulating the data selection problem which encompasses all types of data selection methods identified from the relevant works.
Fig. 3. Word cloud of the 50 most-used words in the journal and conference titles of the papers included in this study.
Problem Formulation
This section defines the general framework for data selection methods and the common mathematical notation. We define the model inference for a data point $i$ as
\begin{equation*} \hat{\mathbf {y}}_{i} = M(\mathbf {x}_{i};\hat{\boldsymbol{\theta }}_{i}),\quad i=1,2,\ldots, N \tag{1} \end{equation*}
In the data selection framework, a selection algorithm is introduced to decide if a data point $D_{i}$ should be used, i.e.,
\begin{equation*} \delta _{i} = S\left(Q(D_{i};\cdot )\right),\quad \delta _{i} \in \lbrace 0,1\rbrace \tag{2} \end{equation*}
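To make the framework concrete, the following minimal Python sketch shows how a quality measure $Q$ and a selection policy $S$ compose into the selection indicators of (2). It is purely illustrative; the function names and the example threshold policy are assumptions, not part of any surveyed method.

```python
# Minimal sketch of the generic selection pipeline in (1)-(2).
# `quality` and `policy` stand in for the method-specific choices
# surveyed in the following sections; all names are illustrative.
from typing import Any, Callable, Sequence

def select(data: Sequence[Any],
           quality: Callable[[Any], float],
           policy: Callable[[float], bool]) -> list:
    """Return the selection indicators delta_i in {0, 1} for each D_i."""
    return [int(policy(quality(d))) for d in data]

# Example: keep points whose (method-specific) quality score exceeds 0.5.
deltas = select([0.1, 0.7, 0.9], quality=lambda d: d, policy=lambda q: q > 0.5)
print(deltas)  # [0, 1, 1]
```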
A. A Priori Data Selection
A diagram showing the general framework of a priori data selection can be seen in Fig. 4. Here, the selection algorithm calculates $\delta_{i}$ from the quality measure $Q(D_{i};\cdot)$ before any model inference takes place, and only the selected data points are passed on to the model.
B. A Posteriori Data Selection
Fig. 5 illustrates the general framework of a posteriori data selection. Unlike a priori data selection, the selection happens after model inference and is used to decide if new model parameters should be estimated based on $D_{i}$.
C. Identified Data Selection Methods
From the literature review, this study has identified seven different methodologies of data selection. The methodologies were defined such that they are general enough to be applicable across multiple scientific domains while being specific enough to create meaningful distinctions. Fig. 6 shows how each data selection method described below fits in the two dichotomies defined in the previous subsections. What follows is a short description of each method:
Error-based selection: The quality of the data is characterized by the prediction error of the model.
Confidence-based selection: The quality of the data is based on how confident the model is in its inference. In a priori confidence-based selection, the confidence measure is predicted prior to model inference.
Learning-based selection: The quality of the data is learned by a machine learning algorithm based on previous experiences from the model.
Information-theoretic selection: The quality of data is characterized by how much information it carries about a certain model parameter.
Similarity-based selection: The quality of the data is based on its similarity with other data points, using either a distance or a correlation metric.
Distribution-based selection: The quality of the data is based on its distribution characteristics and how well the data point fits in the overall distribution of the data set.
Resource-aware selection: The resource costs associated with acquiring, transmitting, or processing the data are incorporated directly in the selection policy (see Section VI).
Fig. 6. Data selection methods identified in this study, including relevant references for each method.
In the following sections, an overview of each identified data selection method is provided. Each section includes a table summarizing the selected works with the application of the selection and relevant design choices of the selection algorithm. In addition, two quantitative metrics have been evaluated for each reference: the total reduction of the data usage in percent and the overall change in performance of the model. As the works span many domains, no single evaluation metric is shared between the included references. For works that evaluate their model performance in percentage values (such as classification accuracy), the performance change is reported in percentage points (noted as pp). Improved performances are indicated by positive values and degraded performances by negative values.
Additionally, the selection methods will be analyzed with respect to six qualitative metrics:
Potential Resource Savings: Potential reduction in used resources compared to non-selective methods or random selection. The type of resource is specific to the application and can be computational resources, energy consumption, resources related to manual labeling, etc.
Robustness: General robustness towards phenomena that may disturb the modeling, such as outlying data points, background noise, training set size, etc.
Complexity: General complexity of the selection algorithm with respect to the number of computations required to run the algorithm as well as memory resources required to accommodate it.
Required System Knowledge: The amount of knowledge needed about the system to implement the selection algorithm and be able to get a good separation between high-quality and low-quality data points.
Implementation Difficulty: Difficulty in implementing the selection algorithm, including prior work needed before implementation, implementing the algorithm in code, and the hardware required to support the algorithm.
Versatility: How well the selection method generalizes to different applications.
Model-Guided Selection Methods
A. Error-Based Selection
Error-based data selection has gained considerable popularity in recent years due to its intuitive approach to system modeling. Relevant articles on error-based selection can be found in Table I. The selection method closely follows the general a posteriori selection framework in Fig. 5, where the quality measure is an error metric $E(\mathbf {y}_{i},\hat{\mathbf {y}}_{i})$ between the measured output and the model prediction. Commonly used error metrics include:
Absolute Error (AE) [15], [16]:
\begin{equation*} E(y_{i},\hat{y}_{i}) = |y_{i}-\hat{y}_{i}| \tag{3} \end{equation*}
Absolute Squared Error (ASE) [17]:
\begin{equation*} E(y_{i},\hat{y}_{i}) = |y_{i}-\hat{y}_{i}|^{2} \tag{4} \end{equation*}
Normalised Absolute Error (NAE) [18], [19], [20]:
\begin{equation*} E(y_{i}, \hat{y}_{i}) = \frac{|y_{i}-\hat{y}_{i}|}{\sigma _{n}} \tag{5} \end{equation*}
where $\sigma _{n}$ is the standard deviation of the measurement noise.
Normalized Sum of Squared Error (NSSE) [21]:
\begin{equation*} E(\mathbf {y}_{i}, \hat{\mathbf {y}}_{i}) = (\mathbf {y}_{i}-\hat{\mathbf {y}}_{i})^{H}\mathbf {S}_{i}^{-1}(\mathbf {y}_{i}-\hat{\mathbf {y}}_{i}) \tag{6} \end{equation*}
where $\mathbf {S}_{i}$ is the covariance matrix of $\mathbf {y}_{i}-\hat{\mathbf {y}}_{i}$.
Maximum Correntropy Criterion (MCC) cost [23]:
\begin{equation*} E(y_{i}, \hat{y}_{i}) = \exp \bigg (-\frac{(y_{i}-\hat{y}_{i})^{2}}{2\sigma ^{2}}\bigg ) \tag{7} \end{equation*}
where $\sigma$ is the size of the Gaussian kernel.
The appropriate definition of the error depends on the application. For example, ASE may be favorable for amplifying the prediction error of outliers and preventing suspected channel attacks [17]. In [21], it was argued that the NSSE is equivalent to using the Kullback-Leibler (KL) divergence for linear Gaussian systems, although the inversion in (6) is computationally expensive for high-dimensional $\mathbf {y}_{i}$.
Typically, a threshold-based policy is adopted to determine the selection parameter, i.e.,
\begin{equation*} \delta _{i} = S(E;\tau _{low}, \tau _{up}) = \left\lbrace \begin{array}{ll}1,& \tau _{low} \leq E \leq \tau _{up},\\ 0,& \text{otherwise} \end{array}\right. \tag{8} \end{equation*}
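As an illustration, the sketch below gates a simple LMS-style parameter update with the NAE metric (5) and the dual-threshold policy (8). It is a toy example, not a reproduction of any referenced method; the system, the thresholds, and the step size are all assumed for illustration.

```python
import numpy as np

def nae(y, y_hat, sigma_n):
    """Normalized absolute error, eq. (5)."""
    return abs(y - y_hat) / sigma_n

def select(error, tau_low, tau_up):
    """Dual-threshold policy, eq. (8): update only when the error is
    informative (above tau_low) but not an outlier (below tau_up)."""
    return tau_low <= error <= tau_up

rng = np.random.default_rng(0)
theta_true, theta_hat = 2.0, 0.0        # scalar system y = theta * x + noise
sigma_n, mu = 0.1, 0.05                  # noise std and LMS step size
updates = 0
for _ in range(1000):
    x = rng.normal()
    y = theta_true * x + rng.normal(scale=sigma_n)
    y_hat = theta_hat * x                # model inference, eq. (1)
    if select(nae(y, y_hat, sigma_n), tau_low=1.0, tau_up=50.0):
        theta_hat += mu * (y - y_hat) * x   # parameter update
        updates += 1
print(f"theta_hat = {theta_hat:.3f}, updates used: {updates}/1000")
```

Once the estimate has converged, most residuals fall below the lower threshold and the update is skipped, which is precisely where the savings in estimate updates come from.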
Resource Savings: As evidenced by Table I, error-based data selection can offer high savings in estimate updates and, in most cases, the selection will result in unchanged or improved model performance. However, being an a posteriori selection method, the saved computational costs come from a reduced use of the estimation algorithm. Being selective in when to update the parameter could justify using more complex and accurate estimation methods.
Robustness: Multiple works have reported excellent robustness to noisy outliers and system changes after implementing error-based data selection, particularly due to the inclusion of an upper threshold [15], [16], [17], [18], [19].
Complexity: Error-based selection has the lowest computational complexity of all the selection methods due to the relatively simple error metrics and selection policy.
Required System Knowledge: The key task in error-based selection is to properly design thresholds that are appropriate for the given system. The works discussed in this section all consider a relatively simple linear system model and the maturity of the selection method towards more complex systems is yet to be shown. Increasing the complexity of the system yields a greater challenge in identifying optimal thresholds which may change over time. Thus, developing strategies to find appropriate thresholds which can adapt to changes should be considered in future research along with its efficacy in complex non-linear systems.
Implementation Difficulty: Due to the low computational complexity, error-based data selection methods are very simple to implement in low-complexity systems. The main difficulty comes from designing appropriate thresholds which may not be trivial.
Versatility: The intuitive and simple implementation of error-based data selection makes it potentially highly versatile for applications where online system modeling is a core component. The relatively low complexity of calculating the error metric also makes it well suited for resource-constrained devices.
B. Confidence-Based Selection
Although error-based selection methods rely on the error between the true output $\mathbf {y}_{i}$ and the model prediction $\hat{\mathbf {y}}_{i}$, confidence-based selection methods do not require knowledge of the true output; instead, the quality of the data is characterized by how confident the model is in its inference. Relevant works are summarized in Table II.
A posteriori confidence-based selection is typically used in semi-supervised ML methods, where unlabeled data points are pseudo-labeled by the model and the confidence of the prediction determines whether the data point should be used for further training, or in active learning, where low-confidence data points are sent for manual labeling.
Confidence-based selection methods are most prominent in classification tasks. In a classification task with $k$ possible classes, the model outputs a vector of posterior class probabilities
\begin{equation*} \hat{\mathbf {y}}_{i} = {\begin{bmatrix}P(y_{1} | \mathbf {x}_{i}) & P(y_{2} | \mathbf {x}_{i}) & \cdots & P(y_{k} | \mathbf {x}_{i}) \end{bmatrix}}^{T} \tag{9} \end{equation*}
Maximum posterior class probability [26], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = \underset{j=1,2,\ldots,k}{{\max}} P(y_{j} | \mathbf {x}_{i}) \tag{10} \end{equation*}
Negative posterior class probability entropy [27], [28], [29], [30], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = \sum _{j=1}^{k}P(y_{j}|\mathbf {x}_{i})\log P\left(y_{j}|\mathbf {x}_{i}\right) \tag{11} \end{equation*}
Posterior class probability margin [34], [37]:
\begin{align*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) =& \underset{j=1,2,\ldots,k}{{\max}}P(y_{j}|\mathbf {x}_{i}) - \underset{{{\scriptstyle {\begin{array}{c}m=1,2,\ldots,k \\ m \ne j\end{array}}}}}{{\max}}P(y_{m}|\mathbf {x}_{i}). \tag{12} \end{align*}
We note that the negative entropy is used in (11) to make it consistent with the notion of confidence, i.e., a higher negative entropy (and thus lower entropy) leads to higher confidence in the prediction.
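The measures (10)-(12) are inexpensive to compute from the posterior probability vector in (9). The NumPy sketch below is illustrative only:

```python
import numpy as np

def max_prob(p):
    """Maximum posterior class probability, eq. (10)."""
    return float(np.max(p))

def neg_entropy(p, eps=1e-12):
    """Negative posterior class probability entropy, eq. (11)."""
    return float(np.sum(p * np.log(p + eps)))

def margin(p):
    """Posterior class probability margin, eq. (12)."""
    top2 = np.sort(p)[-2:]              # two largest posteriors
    return float(top2[1] - top2[0])

p = np.array([0.7, 0.2, 0.1])           # model output, eq. (9)
print(max_prob(p), neg_entropy(p), margin(p))
```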
Another common approach is to define a committee of models $M_{1}, M_{2}, \ldots, M_{L}$ and measure the disagreement between their predictions, a technique known as Query-by-Committee (QBC). Commonly used committee-based confidence measures include:
Committee variance [31]:
\begin{equation*} C(\hat{y}_{i}|\mathbf {x}_{i}) = \frac{1}{L} \sum _{l=1}^{L} \left(M_{l}(\mathbf {x}_{i};\boldsymbol{\theta }_{l})-\hat{y}_{i}\right)^{2} \tag{13} \end{equation*}
where $\hat{y}_{i}=\frac{1}{L}\sum _{l=1}^{L} M_{l}(\mathbf {x}_{i};\boldsymbol{\theta }_{l})$.
Average committee KL divergence [32], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = \frac{1}{L} \sum _{l=1}^{L} \sum _{j=1}^{k} P_{M_{l}}(y_{j}|\mathbf {x}_{i})\log \frac{P_{M_{l}}(y_{j}|\mathbf {x}_{i})}{P_{avg}(y_{j}|\mathbf {x}_{i})} \tag{14} \end{equation*}
where $P_{M_{l}}(y_{j}|\mathbf {x}_{i})$ is the posterior probability of class $j$ under $M_{l}$ and $P_{avg}(y_{j}|\mathbf {x}_{i}) = \frac{1}{L}\sum _{l=1}^{L}P_{M_{l}}(y_{j}|\mathbf {x}_{i})$.
Committee vote entropy [33], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = -\sum _{j=1}^{k} \frac{V(y_{j}|\mathbf {x}_{i})}{k}\log \frac{V(y_{j}|\mathbf {x}_{i})}{k} \tag{15} \end{equation*}
where $V(y_{j}|\mathbf {x}_{i})$ is the number of models predicting $\mathbf {x}_{i}$ to be in class $j$.
Note that QBC enables confidence-based selection methods to be used for regression tasks by employing a confidence score such as in (13).
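A minimal sketch of two committee measures, assuming the committee predictions are already available (data and committee size are illustrative; the vote entropy follows the normalization by $k$ used in (15)):

```python
import numpy as np

def committee_variance(preds):
    """Committee variance, eq. (13); preds holds the L regression outputs."""
    return float(np.mean((preds - preds.mean()) ** 2))

def vote_entropy(votes, k):
    """Committee vote entropy, eq. (15); votes[j] counts the committee
    members predicting class j (normalized by k as in the text)."""
    frac = np.asarray(votes, dtype=float) / k
    nz = frac[frac > 0]                  # avoid log(0) for classes with no votes
    return float(-np.sum(nz * np.log(nz)))

preds = np.array([1.9, 2.1, 2.4])        # L = 3 regression committee members
votes = [2, 1, 0]                        # k = 3 classes, 3 voters
print(committee_variance(preds), vote_entropy(votes, k=3))
```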
Two different approaches are used to select data points based on the confidence measure. In active learning, low-confidence data points are selected, since these are deemed the most informative ones to label. In semi-supervised self-training, high-confidence data points are selected so that their pseudo-labels can be trusted for further training.
A priori confidence-based data selection closely follows the general a priori framework in Fig. 4, with the confidence of the model prediction being estimated before model inference, e.g., by a smaller proxy model [36], [37].
Resource Savings: A posteriori confidence-based selection methods have shown great promise in reducing the amount of training data needed to update the model over time. This reduction in data usage corresponds to a reduction in the computational power needed to update the model, lower transmission costs in the case of edge computing, and fewer manual annotations in the case of active learning applications. A priori confidence-based selection has the added benefit of reducing costs related to model inference, which in the case of edge computing can lead to more efficient use of communication resources [36], [37].
Robustness: The model requires a labeled training data set as a starting point to obtain reliable confidence estimates, although previous work has shown that the active learning framework can be robust with respect to the size of the initial training set [40]. Furthermore, if low-confidence data points are selected, the selection may not be robust against outliers. If high-confidence points are selected, incorrect pseudo-labels can introduce confirmation bias to the model [41].
Complexity: In classification tasks, the complexity of (10)-(12) is relatively low, even for a substantial number of classes. Using (13)-(15) adds significantly to the computational and memory complexity, since multiple models are required for inference. The complexity of a priori confidence-based selection depends on how the confidence is predicted; by introducing a smaller proxy model, the complexity can be kept lower than that of QBC [37].
Required System Knowledge: Since confidence-based data selection is used primarily in a machine learning context, little prior knowledge is needed about the relationship between the input $\mathbf {x}_{i}$ and the output $\mathbf {y}_{i}$, as this relationship is learned from data.
Implementation Difficulty: The hardware requirements for confidence-based selection depend on the computational and memory complexity of the selection algorithm, as discussed above. However, code implementation of the a posteriori confidence metrics discussed above is straightforward. Implementation of a priori confidence-based selection algorithms is potentially more difficult due to the added complexity in prior prediction of model confidence. However, as seen in [36] and [37], adding small proxy models can be a good way to circumvent this issue.
Versatility: Since confidence-based selection is used mainly in ML, it is highly versatile across different applications, as is also clear in Table II, provided that conventional challenges in ML such as data availability and hardware requirements are adequately addressed. Thus, confidence-based selection has the potential to be highly relevant in emerging technologies utilizing ML, e.g., edge computing, autonomous driving, language processing, and more.
C. Information-Theoretic Selection
Information-theoretic data selection methods leverage information-theoretic metrics to measure the amount of information gained from the selection. The selection method closely follows the general a priori framework in Fig. 4, where the quality measure quantifies how much information $D_{i}$ carries about the model parameters $\boldsymbol{\theta }$. Relevant works are summarized in Table III.
Two common measures of information are:
Fisher Information Matrix (FIM) [42], [43], [44], [45], [47]:
\begin{equation*} \mathbf {F}(\mathbf {x}_{i},\boldsymbol{\theta }) = -\mathbb {E}_{\boldsymbol{\theta }}\left[\frac{\partial ^{2}}{\partial \boldsymbol{\theta }\partial \boldsymbol{\theta }^{T}}\ln {f(\mathbf {x}_{i};\boldsymbol{\theta })}\Big |\boldsymbol{\theta }\right] \tag{16} \end{equation*}
where $f$ is the likelihood function of $\mathbf {x}_{i}$, and $\mathbb {E}_{\boldsymbol{\theta }}[\cdot ]$ is the expectation operator under the distribution of $\boldsymbol{\theta }$ [48].
Mutual Information (MI) [43]:
\begin{equation*} I(\mathbf {x}_{i}, \boldsymbol{\theta }) = H(\boldsymbol{\theta }) - H(\boldsymbol{\theta }|\mathbf {x}_{i}) \tag{17} \end{equation*}
where $H(\boldsymbol{\theta })$ and $H(\boldsymbol{\theta }|\mathbf {x}_{i})$ are the entropy and conditional entropy of $\boldsymbol{\theta }$.
The FIM is the most used information metric. The inverse of the FIM provides a lower bound on the variance of any unbiased estimator [48]. Thus, data points which maximize the FIM can lead to better estimates. The FIM is typically calculated recursively [42], [44], [45] or analytically [47]. For linear systems with Gaussian noise, the FIM can be calculated as a linear expression [44]. However, for non-linear systems where the underlying distributions are not available in closed form, Monte Carlo methods are typically required to estimate the FIM.
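As an illustration of the linear Gaussian case, the sketch below computes the per-measurement FIM for a model $y = \mathbf {h}^{T}\boldsymbol{\theta } + n$ with Gaussian noise and greedily selects the measurements that maximize the log-determinant of the accumulated FIM (a common D-optimality surrogate). The model and the greedy policy are illustrative assumptions, not taken from the referenced works.

```python
import numpy as np

def fim_linear_gaussian(h, sigma):
    """FIM of one measurement y = h^T theta + n, n ~ N(0, sigma^2).
    For this model F = h h^T / sigma^2, independent of theta."""
    h = h.reshape(-1, 1)
    return (h @ h.T) / sigma**2

def greedy_select(H, sigma, budget, eps=1e-9):
    """Greedily pick `budget` rows of H maximizing log det of the
    accumulated FIM."""
    n, p = H.shape
    F = eps * np.eye(p)                 # regularizer keeps log det finite
    chosen = []
    for _ in range(budget):
        gains = [np.linalg.slogdet(F + fim_linear_gaussian(H[i], sigma))[1]
                 if i not in chosen else -np.inf for i in range(n)]
        best = int(np.argmax(gains))
        chosen.append(best)
        F += fim_linear_gaussian(H[best], sigma)
    return chosen, F

rng = np.random.default_rng(1)
H = rng.normal(size=(20, 3))            # 20 candidate measurements, 3 parameters
chosen, F = greedy_select(H, sigma=0.1, budget=5)
print(chosen)
```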
Resource Savings: As shown in Table III, information-theoretic data selection can yield impressive reductions in data usage. As the main application of information-theoretic selection is in sensor selection, this data reduction results in lower costs in communication which has a significant effect on the overall energy consumption of the system.
Robustness: Information-theoretic selection methods have been shown to be robust to the uncertainties inherent in sensors [42], [45] and can be used to account for uncertainties and biases in the modeling process [47]. However, it should be noted that the FIM only provides a lower bound on the estimation variance if the estimate is unbiased, which is rarely the case. Thus, there are rarely guarantees for the effectiveness of the FIM to provide better estimates.
Computational Complexity: The computational complexity of the selection method is highly dependent on the complexity of the system. For most systems, Monte Carlo methods are required to estimate the underlying distributions, which adds significantly to the computational complexity of the selection method. To avoid complexity issues, the system model can be reduced to a linear form, which can greatly simplify calculations [46], but the performance of the estimation may break down for highly non-linear systems.
Required System Knowledge: For efficient implementation, a high degree of system knowledge is usually required. This is especially the case if the FIM is calculated analytically, as in [47].
Implementation Difficulty: The difficulty of the implementation depends on the methods chosen for calculating the information metric. Monte Carlo methods can easily be implemented on most hardware, although at the expense of high computational complexity. To alleviate this, a mathematical analysis of the system is required beforehand, which raises the complexity of the implementation.
Versatility: Information-theoretic data selection is highly represented in wireless sensor network applications for system state estimation, particularly sensor selection in object tracking applications [49]. However, in these applications, simple assumptions about the motion of the object, such as linearity, are used [42], [44], [45], [46]. For a more complex model, such as in [47], a greater amount of effort is needed to implement the selection method. Thus, the versatility of the selection method is limited by the complexity of the model. Developing effective ways of estimating parameter distributions in a cost-efficient manner will highly reduce the difficulty of implementation and make this selection method much more versatile. Nonetheless, information-theoretic selection has shown to be highly useful for optimizing transmission costs and may play a key role in future communication systems.
D. Learning-Based Selection
Unlike the previously discussed methods, learning-based selection does not rely on predefined heuristics such as error, confidence, or information gain. Rather, the quality of data points is learned over time by a machine learning model. A general framework for learning-based selection is shown in Fig. 8. In this framework, the quality of $D_{i}$ is assessed by a separate selection model, which learns its selection strategy from feedback provided by the target model. Relevant works are summarized in Table IV.
Reinforcement Learning (RL) is highly prominent in the literature because of its wide utility in different applications. In the general RL framework, an agent decides on an action based on the state of an environment. Through a reward feedback system, the agent learns which actions it should take in each state to maximize its accumulated reward [59]. In learning-based data selection, the selection model performs an action based on the state of the target model and the data, and adjusts its strategy depending on feedback from the target model. In [50], a teacher-student framework was proposed based on data features as well as base model states such as iteration number and historical training accuracy. In [51] and [52], data features were encoded using a Convolutional Neural Network (CNN) to represent the states of the RL agent, along with the posterior class probabilities predicted by the target model. In [53], a data valuation framework was proposed for various machine learning tasks, with the reward constructed by following the evolution of the target model's performance on a small validation set during training.
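The toy sketch below conveys the RL flavor of learning-based selection without reproducing any specific framework from [50], [51], [52], [53]: a Bernoulli selection policy over a simple residual feature is updated with a REINFORCE-style gradient, using the change in validation loss of a scalar target model as reward. Every modeling choice here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta = 0.0                                      # scalar target model y = theta*x
Xv = rng.normal(size=50)
yv = 2.0 * Xv + rng.normal(scale=0.1, size=50)   # held-out validation data
val_loss = lambda th: float(np.mean((yv - th * Xv) ** 2))

w = np.zeros(2)                                  # selection policy parameters
for step in range(2000):
    x = rng.normal()
    y = 2.0 * x + rng.normal(scale=0.1)
    feat = np.array([1.0, abs(y - theta * x)])   # bias + residual feature
    p_sel = sigmoid(w @ feat)
    a = rng.random() < p_sel                     # stochastic selection action
    loss_before = val_loss(theta)
    if a:
        theta += 0.05 * (y - theta * x) * x      # target model update
    reward = loss_before - val_loss(theta)       # validation improvement
    w += 0.1 * reward * (float(a) - p_sel) * feat  # REINFORCE-style step
print(f"theta = {theta:.3f}, policy weights = {w}")
```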
Resource Savings: The works in Table IV have reported great reductions in data for a small to significant performance improvement in most applications. As a result of reduced data usage, authors have reported several benefits such as lower computational costs [50], [57], [58], reduced training time for the target model [57], [58], lower communication costs [57], and, in the case of active learning, reduced manual labeling costs [51], [52]. Thus, learning-based selection has the potential to introduce immense resource savings.
Robustness: In [56], it was shown that the proposed method led to better temperature estimates than a standard SVM, while also being robust to outliers. Learning-based selection methods are more robust to poor initial models since they can learn which data points are more useful in the early stages of the learning process. However, the learned selection algorithms are heavily dependent on the training data used for the selection model and the robustness of the selection may fail in unseen scenarios.
Computational Complexity: Since learning-based selection methods are designed for use in machine learning applications, the overall computational complexity is too high for conventional CPU hardware, thus requiring hardware accelerators such as GPUs. However, it is important to note that the vast majority of computational costs are due to the training of the selection algorithm. A pretrained selection algorithm requires fewer computational resources for inference and can thus be implemented on hardware of lower complexity. For online applications, care must be taken that the computational complexity of the selection model is small compared to that of the target model, to ensure that the selection process does not lead to an overall increase in computations, although this can also restrict the performance of the selection model.
Required System Knowledge: Little or no system knowledge is required to implement the learning-based selection methods, since the utility of the data is learned over time. However, as shown in [54], improvements can be made to the selection algorithm by defining specific features beforehand.
Implementation Difficulty: Learning-based selection methods have a high difficulty of implementation, particularly due to the hardware required for training and implementing the selection algorithms. Methods for reducing the size of the selection model, such as model pruning, can help alleviate the computational burden and make it more applicable for low-complexity systems. Additionally, great care must be taken to construct a comprehensive training set for the selection algorithm to learn from which can further complicate the implementation of the selection method in the specific use case.
Versatility: Learning-based data selection can be very flexible for different applications and target models. Specifically, the frameworks presented in [50] and [53] have been shown to be agnostic to the type of target model used. This makes learning-based selection methods highly attractive for many prominent research areas in AI, such as medical imaging, language processing, and protein folding. However, the majority of the works discussed in this section only consider data selection to improve target model training. As shown in [56], learning-based data selection can also be used to identify if the data should be used in a potentially complex model identification process or skipped entirely, saving limited resources. Likewise, learning-based data selection can also be used to determine whether data samples should be used for online model inference, based on past experiences.
Data-Driven Selection Methods
A. Similarity-Based Selection
Similarity-based selection methods leverage similarity metrics to determine if $D_{i}$ is sufficiently similar, or dissimilar, to other data points. Relevant works are summarized in Table V. A common approach is to measure the distance between two vectors $\mathbf {v}$ and $\mathbf {v}^{\prime }$ representing two data points. Commonly used distance measures include:
Euclidean distance [28], [62]:
\begin{equation*} d(\mathbf {v},\mathbf {v}^{\prime }) = ||\mathbf {v} - \mathbf {v}^{\prime }||_{2} \tag{18} \end{equation*}
where $||\cdot ||_{2}$ is the $l_{2}$ norm.
Squared Euclidean distance [31], [63]:
\begin{equation*} d(\mathbf {v},\mathbf {v}^{\prime }) = ||\mathbf {v} - \mathbf {v}^{\prime }||_{2}^{2} \tag{19} \end{equation*}
Generalized Inner Product (GIP) [29]:
\begin{equation*} d(\mathbf {v}, \mathbf {v}^{\prime }) = \mathbf {v}^{H}\mathbf {M}\mathbf {v}^{\prime } \tag{20} \end{equation*}
where $H$ denotes the Hermitian transpose and $\mathbf {M}$ is a positive-definite symmetric matrix.
Note that the GIP in (20) generalizes the standard inner product through the weighting matrix $\mathbf {M}$; for $\mathbf {v} = \mathbf {v}^{\prime }$, it can be interpreted as a weighted squared norm of $\mathbf {v}$, reducing to the squared $l_{2}$ norm when $\mathbf {M}$ is the identity matrix.
Another common type of similarity measure is the correlation between $\mathbf {v}$ and $\mathbf {v}^{\prime }$. Commonly used correlation measures include:
Cosine similarity [27], [68], [69], [70]:
\begin{equation*} c(\mathbf {v},\mathbf {v}^{\prime }) = \frac{\langle \mathbf {v},\mathbf {v}^{\prime }\rangle }{||\mathbf {v}||_{2}||\mathbf {v}^{\prime }||_{2}} \tag{21} \end{equation*}
where $\langle \cdot,\cdot \rangle$ denotes the inner product.
Pearson correlation coefficient [71]:
\begin{equation*} c(\mathbf {v}, \mathbf {v}^{\prime }) = \frac{\langle \tilde{\mathbf {v}},\tilde{\mathbf {v}}^{\prime }\rangle }{||\tilde{\mathbf {v}}||_{2}||\tilde{\mathbf {v}}^{\prime }||_{2}} \tag{22} \end{equation*}
where $\tilde{\mathbf {v}} = \mathbf {v} - \bar{\mathbf {v}}$ and $\bar{\mathbf {v}}$ is the sample mean of $\mathbf {v}$.
Coefficient of determination ($R^{2}$) [72]:
\begin{equation*} c(\mathbf {v},\mathbf {v}^{\prime }) = 1-\frac{||\mathbf {v}-\mathbf {v}^{\prime }||_{2}^{2}}{||\tilde{\mathbf {v}}||_{2}^{2}} \tag{23} \end{equation*}
Magnitude-squared coherence [73]:
\begin{equation*} c(\mathbf {v},\mathbf {v}^{\prime }) = \frac{|\langle \mathbf {v}, \mathbf {v}^{\prime } \rangle |^{2}}{||\mathbf {v}||_{2}^{2}||\mathbf {v}^{\prime }||_{2}^{2}} \tag{24} \end{equation*}
Correlations close to 0 indicate low similarity, while correlations close to 1 indicate high similarity. The cosine similarity measure is the most prominent in the literature. We note that the cosine similarity measure is not strictly a measure of correlation; however, its interpretation is similar to that of a correlation metric, and it is equivalent to the Pearson correlation coefficient for mean-centered vectors.
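This equivalence can be verified directly; a short illustrative NumPy sketch implementing (21) and (22):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity, eq. (21)."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

def pearson(v, w):
    """Pearson correlation coefficient, eq. (22): cosine similarity
    of the mean-centered vectors."""
    return cosine(v - v.mean(), w - w.mean())

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.1, 5.9])
print(cosine(v, w), pearson(v, w))
# After mean-centering, the two measures coincide (up to rounding):
print(np.isclose(cosine(v - v.mean(), w - w.mean()), pearson(v, w)))  # True
```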
Typically, $\mathbf {v}$ and $\mathbf {v}^{\prime }$ are vector representations, such as feature vectors or embeddings, of two different data points in the data set.
In [69], however, the cosine similarity measure was used to compare embeddings of the same data point.
In most applications, inter-point similarities are used, either to remove redundant data points or to ensure diversity in the selected set.
Resource Savings: Similarity-based selection methods are very attractive for introducing information on the relationship between data points in the same data set. This has become very useful in pruning the initial training data set by removing redundant data points prior to training [62], [69] or, in the case of active learning, choosing more efficiently which data points to label [27], [63]. Similarity-based selection has also been implemented for more efficiently selecting the coreset of large image data sets [62], [68], [69], thus saving resources for training image classifiers.
Robustness: The robustness of the selection method depends on the selection policy. If dissimilar points are preferred, outlying data points may also be selected and interfere with the estimation. If similar points are preferred, noisy data points can be filtered out, making the selection more robust against outliers and noise measurements, as seen in [72], [73].
Computational Complexity: Whether a distance or correlation measure is used, one major limitation of similarity-based selection methods is the high computational complexity associated with comparing data points across the data set. This is especially problematic for high-dimensional data and necessitates preprocessing methods for reducing the data dimension, such as feature extraction and possibly dimensionality reduction techniques. Although the largest data reductions can be gained on large data sets, these also require significant computation and memory resources. These methods are therefore generally reserved for offline use, where the high computation and memory requirements can be accommodated. However, efforts have been made to reduce the computational complexity. In [62], a cluster-based algorithm was proposed instead of directly computing the distance between points. The authors in [71] also suggested using K-means clustering to limit storage use. In [67], an algorithm was proposed to approximate the convex hull of high-dimensional data sets. In cases where similarity metrics are only used to compute intra-point similarities, such as between different representations of the same data point, the number of comparisons scales linearly with the size of the data set, which greatly reduces the computational burden.
Required System Knowledge: Correlation measures can be used to ensure higher uniformity and prevent outliers from negatively affecting the results, although they can break down for highly complex and non-linear systems. For example, the similarity measures in (22) and (24) only measure linear correlation between data points; however, nonlinear correlations can go unnoticed. An effective use of correlation thus requires additional knowledge of the underlying system. Furthermore, distance-based selection methods can be effective on clustered data by thinning out data points in the same cluster [63] or only selecting data points within the same cluster [64], [65]. Thus, prior knowledge about the structure of the data set can be beneficial.
Implementation Difficulty: Similarity-based selection methods generally have a high difficulty of implementation, mainly due to the requirement of hardware with a large amount of computational and memory resources. However, when the number of data points to be compared is relatively low, or when inter-point similarities are not considered, these hardware requirements are lessened. Another way to reduce the computation and memory complexity for online applications is to use a time-varying window of recently measured data points, so that only a limited number of comparisons have to be made at a time.
Versatility: As can be seen in Table V, similarity measures are highly versatile and can easily be implemented for a wide variety of applications to effectively reduce the size of the data set. However, one should consider the choice of similarity measure and whether it is appropriate for the specific application. Distance measures were found to be used mainly to ensure diversity, although this may inadvertently favor outlying data points. Similarity-based selection methods can also be readily combined with other selection methods. In particular, similarity metrics have been used in conjunction with confidence measures to ensure the selection of low-confidence points while maintaining a diverse training set [27], [28], [29], [31].
B. Distribution-Based Selection
Distribution-based selection methods define their selection strategy based on the distribution characteristics of the whole data set. The applications, motivation, and intuition resemble those of similarity-based methods; however, while similarity-based methods measure inter-point similarities, distribution-based methods aim to provide a more comprehensive view of the underlying data distribution. The general approach of distribution-based data selection can be seen in Fig. 9. While data selection methods are generally used to reduce data usage as much as possible, the works employing distribution-based selection primarily aim to make the estimation more robust towards outliers and measurement noise. Selected works are summarized in Table VI. Note that data reduction and performance change are not included in the table, since the works did not adequately report these results.
A common approach in distribution-based selection is to use mixture models to estimate the distribution of the data set. The mixture model is typically fitted with the Expectation-Maximization (EM) algorithm, after which data points that fit poorly under the estimated distribution are regarded as outliers and excluded.
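As an illustration of this idea, the sketch below fits a Gaussian mixture with scikit-learn's EM-based GaussianMixture and discards the points that have the lowest likelihood under the fitted distribution. The library choice, the number of components, and the percentile threshold are illustrative assumptions, not choices taken from the referenced works.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
inliers = rng.normal(size=(500, 2))
outliers = rng.uniform(low=-8, high=8, size=(15, 2))
X = np.vstack([inliers, outliers])

# Fit the mixture with EM and score the log-likelihood of each point.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_lik = gmm.score_samples(X)

# Keep points above, e.g., the 3rd percentile of the fitted density.
delta = log_lik > np.percentile(log_lik, 3)
print(f"selected {delta.sum()} of {len(X)} points")
```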
For signal processing applications, the covariance matrix has been shown to be particularly useful for identifying and removing outlying signals. In [77], a two-step data selection procedure was proposed using the GIP, as defined in (20), of each signal using the inverted normalized sample covariance matrix (see [85]). Signals that exhibit a high GIP are then considered outliers. The method was later extended in [78] by considering cases where the statistical characteristics of the signals are unknown and conventional covariance estimates fail. In the case where the data are made up of multiple parallel signals, the authors in [79] proposed a quality metric based on the diagonalization of the sample covariance matrix of those signals, although this process also faces complexity issues. In [80] and [81], the authors leveraged the power spectral density which estimates the power distribution of the signal in the frequency domain. Signals with a high signal-to-noise ratio (SNR) were selected to minimize the influence of measurement noise.
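A minimal sketch of GIP-based outlier screening in the spirit of [77], simplified to a single pass; the synthetic data, the zero-mean assumption, and the percentile threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 200, 4                       # 200 signals of dimension 4
X = rng.normal(size=(N, n))
X[:5] *= 6.0                        # a few high-power outlying signals

# GIP of each signal with itself, eq. (20), with M set to the inverted
# sample covariance; unusually high GIP values flag statistical outliers.
R = (X.T @ X) / N                   # sample covariance (zero mean assumed)
R_inv = np.linalg.inv(R)
gip = np.einsum("ij,jk,ik->i", X, R_inv, X)

delta = gip <= np.percentile(gip, 95)   # drop the highest-GIP signals
print(f"kept {delta.sum()} of {N} signals")
```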
Resource Savings: As mentioned before, the main objective of distribution-based selection methods is to improve overall robustness towards outliers, rather than reducing the data usage as much as possible. Thus, the resource savings obtained from distribution-based methods are lower relative to those obtained by other selection methods.
Robustness: Distribution-based selection yields excellent robustness towards outliers.
Computational Complexity: Using the statistical information of the distribution to calculate good quality measures can be computationally complex and may not be practical for large data sets or online implementation without supporting algorithms to alleviate the computational burden. In particular, using the GIP requires an inversion of the covariance matrix, increasing the computational complexity. However, if the distribution of the data set is known or computed beforehand, the computational complexity of the selection method drops drastically.
Required System Knowledge: The methods used in distribution-based selection typically require assumptions on the distribution of the data set which may not hold in practice. This is particularly true for the EM algorithm, which assumes a distribution of the data to be fitted. Usually, a Gaussian mixture distribution is assumed [76], [77], [78], [79], although in [75] a Gamma mixture model was used.
Implementation Difficulty: Distribution-based methods generally have a low implementation difficulty due to the use of mature algorithms such as the EM algorithm. Additionally, once the distribution of the data set is characterized, point-wise selection and filtering can be implemented fairly easily, even in low-complexity hardware.
Versatility: Distribution-based selection is highly versatile over different domains, as can be seen in the relevant works. While the selection method is mainly used for the removal of outliers, it can also be paired with other selection methods to further improve the robustness of the selection algorithm, making it a powerful tool for any data selection algorithm, in particular for distilling AI training sets in order to make more robust models.
Resource-Aware Selection
Resource-aware data selection is relevant when auxiliary data related to resource consumption are available or when reasonable assumptions can be made about resource costs. Relevant resource data include energy consumption, required storage, and number of computations, among others, and are especially pertinent when energy and data efficiency are prioritized. Resource-aware selection methods differ from the other selection methods seen so far in that they do not aim to define a measure of quality, but rather put the quality measure in perspective of the associated costs of selecting the data. More simply, resource data are utilized directly in the selection policy and not in the quality metric. This makes resource-aware data selection highly versatile, as it can readily be combined with all other selection strategies in cases where resource data are available. Fig. 10 shows how resource costs can be used in the general a priori data selection framework, but a similar extension can be made for a posteriori data selection methods. Table VII shows relevant works that incorporate resource-aware data selection. Note that these works have been introduced in previous sections; however, in this section, we focus on how they use resource data in their selection policy.
For a data set $\mathcal {D} = \lbrace D_{1}, D_{2}, \ldots, D_{N}\rbrace$, let $\mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot )$ denote the overall quality of the data selected by the selection vector $\boldsymbol{\delta } = [\delta _{1}, \delta _{2}, \ldots, \delta _{N}]^{T}$ and let $\mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot )$ denote the associated resource costs. Three types of selection policies were identified in the relevant works:
Weighted optimization [44], [80]:
\begin{equation*} \boldsymbol{\delta }^{*} = \underset{\boldsymbol{\delta }\in \lbrace 0,1\rbrace ^{N}}{\text{arg max }} \mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot )-\gamma \mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot ) \tag{25} \end{equation*}
where $\gamma$ is a user-defined weighting parameter.
Cost-constrained optimization [27], [30], [31], [36], [45], [46]:
\begin{align*} \boldsymbol{\delta }^{*} =\; & \underset{\boldsymbol{\delta }}{\text{arg max }} \mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot )\\ & \text{s.t. } \mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot ) \leq \tau \tag{26} \end{align*}
where $\tau$ is a user-defined threshold.
Quality-constrained optimization [81]:
\begin{align*} \boldsymbol{\delta }^{*} =\; & \underset{\boldsymbol{\delta }}{\text{arg min }} \mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot )\\ & \text{s.t. } \mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot ) \geq \tau . \tag{27} \end{align*}
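Solving (25)-(27) exactly is combinatorial in general. In the simplified case where quality and cost decompose additively over data points, the cost-constrained problem (26) admits a simple greedy approximation, sketched below with illustrative data and names:

```python
import numpy as np

def greedy_cost_constrained(quality, cost, budget):
    """Greedy approximation of eq. (26) for additive quality and cost:
    add points in order of quality-per-cost until the budget is spent."""
    order = np.argsort(-quality / cost)          # best ratio first
    delta = np.zeros(len(quality), dtype=int)
    spent = 0.0
    for i in order:
        if spent + cost[i] <= budget:
            delta[i] = 1
            spent += cost[i]
    return delta

rng = np.random.default_rng(3)
q = rng.random(10)                  # per-point quality scores (method-specific)
c = rng.uniform(1, 3, 10)           # per-point resource costs
print(greedy_cost_constrained(q, c, budget=8.0))
```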
Weighted optimization is most commonly used in sensor selection applications, where there can be significant trade-offs between the overall quality of the data and its transmission cost, since sending more data usually leads to better estimation performance but increases energy consumption. The reported results show a significant reduction in communication costs for little to no increase in estimation error [80], [86] and increased estimation robustness in high-noise environments [87].
Other works use a cost-constrained optimization function, as shown in (26), to maximize the quality of the selection while keeping the cost at an acceptable level. Simple assumptions can be made in cases where the actual resource costs are unknown. For example, in wireless sensor networks, one typical assumption is that the energy consumption is the same for all sensors. Thus, the sensor selection problem is reduced to finding a sparse set of sensors that maximizes the estimation performance such that the number of selected sensors does not exceed a given budget.
In quality-constrained optimization, the focus is to minimize overall resource costs while ensuring acceptable data quality. An example of this can be found in [81] where they search for the subset of microphones that provide the lowest resource cost while ensuring that the microphones selected provide a reasonable SNR.
A different approach was proposed in [74], which does not utilize cost optimization. Instead, the approach uses the perceived quality of the data, the associated cost of transmission, and the residual energy of each sensor to prescribe probabilities of transmitting data, the idea being that sensors with low residual energy should not send their data as often, in order to save energy and prolong their lifetime.
Resource-aware strategies provide an exciting extension to conventional data selection methods by directly utilizing resource costs in the selection decision. While the relevant works presented mainly focus on energy costs related to communication from edge devices, the idea can be extended to applications such as artificial intelligence where model inference and/or training require considerable computational and energy costs. However, suitable assumptions on resource costs are needed to ensure an efficient selection policy which requires more knowledge about the physical system and its energy usage. Moreover, turning the selection policy into an optimization problem can incur higher computational complexity, compared to simpler threshold-based policies, and thus needs to be justifiable based on the overall energy savings.
Discussion
In this section, the selection strategies introduced in the previous sections will be discussed along with future potentials and challenges for data selection as a whole. A summary of the advantages and disadvantages for each data selection method can be found in Table VIII.
A. Comparative Analysis
In this section, the reviewed data selection methods are compared with respect to the performance metrics described in Section III-C. Each method has been assigned a qualitative score for each metric based on the analysis of the method in Sections IV and V. The scores range from 1 to 5, with 1, 3, and 5 denoting low, medium, and high, respectively. For example, methods with relatively low computational complexity receive a low complexity score, while methods with high versatility across applications receive a high versatility score. A priori confidence-based data selection methods were excluded from the analysis due to the limited amount of research on this method. Additionally, resource-aware selection methods are absent because they are not directly comparable to the other methods. Radar diagrams for model-guided and data-driven methods can be seen in Fig. 11.
Fig. 11. Radar diagram of the scores for each data selection method, except a priori confidence-based and resource-aware data selection.
For model-guided data selection methods, there is a significant trade-off between the required a priori knowledge of the system and the versatility of the method across different applications. This is largely because confidence-based and learning-based selection methods build on conventional ML methods, which are already highly versatile. This also means higher potential resource savings, depending on the specific application. For learning-based selection methods specifically, the implementation difficulty is much higher than for a posteriori confidence-based selection methods. The reason is that implementing learning-based selection may add the need for acceleration hardware or memory storage to a modeling process that may not otherwise need this equipment, whereas adding a posteriori confidence-based selection to an ML pipeline should not require additional hardware. Nevertheless, the higher hardware requirements for learning-based and confidence-based selection, whether in the selection algorithm, the model, or both, may restrict them to offline applications where these needs can be accommodated, although this, of course, depends on the complexity of the task at hand.
For error-based and information-theoretic selection methods, a great deal of prior system knowledge is required for an effective implementation of the selection method. However, this is generally rewarded by a highly robust selection algorithm. The relatively low versatility of these methods is primarily due to the rather simple models on which they have been applied so far; however, exploring how these methods can be applied to more complex models, such as in [47], can potentially increase their versatility in the future.
In the case of data-driven selection methods, their usability for online applications depends on the size of the data set and the overall goal of the selection. For instance, distance-based similarity measures and distribution-based methods may be useful for pruning a training data set for AI training, although the generally large size of a typical training set restricts such use to offline cases. For online applications where data storage comes at a premium, only a limited number of data points may be stored at a time, giving only a local view of the overall distribution of the measured data. Correlation-based data selection can be useful for online applications, as seen in [73], when intra-point correlations are used. However, as discussed in Section V-A, the efficacy of correlation measures depends on the system, which restricts the method's overall versatility.
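As a minimal example of distance-based pruning for an offline training set, the following sketch greedily keeps a point only if it lies sufficiently far from all previously kept points, discarding near-duplicates that add little to the set. The feature dimension, data, and distance threshold are illustrative assumptions.

```python
import numpy as np

def greedy_distance_prune(X: np.ndarray, min_dist: float) -> list[int]:
    """Greedy distance-based pruning: keep a point only if it is at least
    `min_dist` (Euclidean) away from every point kept so far."""
    kept: list[int] = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))          # hypothetical 2-D feature vectors
idx = greedy_distance_prune(X, min_dist=0.4)
print(f"kept {len(idx)} of {len(X)} points")
```

The quadratic cost of the pairwise distance checks is one reason such methods suit offline pruning better than online selection on storage-constrained devices.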
In general, employing a priori selection methods such as information-theoretic, learning-based, and similarity-based methods can potentially yield greater resource savings, since these methods do not require model inference prior to selection. However, not relying on the model prediction may increase the computational complexity of the selection method itself, so great care must be taken when deciding whether a priori selection is applicable to the task. If the selection algorithm consumes more computational resources than it saves by avoiding model inference, the overall resource cost may exceed that of a posteriori selection, or of performing no selection at all, as the sketch below illustrates.
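This break-even reasoning can be made explicit with a small cost model. In the sketch below, an a priori policy pays its selection cost on every candidate point but only pays inference cost on the selected fraction; the costs and selection rate are hypothetical numbers chosen solely to show both the favorable and the counterproductive case.

```python
def net_saving_per_point(p_select: float, c_select: float, c_infer: float) -> float:
    """Expected saving per candidate point for an a priori policy:
    every point pays the selection cost c_select, but the inference cost
    c_infer is only paid for the fraction p_select that is kept."""
    return (1.0 - p_select) * c_infer - c_select

# Hypothetical costs in millijoules per point.
print(net_saving_per_point(p_select=0.3, c_select=2.0, c_infer=10.0))  # +5.0 -> worthwhile
print(net_saving_per_point(p_select=0.3, c_select=9.0, c_infer=10.0))  # -2.0 -> counterproductive
```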
In most cases, the appropriate selection algorithm and the intended outcome of the selection are dictated by domain-specific nuances. For example, in LLM training, certain phrases or words may be filtered out of the training process to reduce undesirable model behaviors such as bias and toxicity [12]. Thus, when designing the selection algorithm, one should always consider the given task, its goal, and its limitations.
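As a toy example of such domain-specific filtering, the sketch below excludes training documents containing blocked phrases, in the spirit of the LLM filtering discussed in [12]. Production pipelines are far more elaborate, typically combining many heuristics with learned quality classifiers, and the blocklist here is purely a placeholder.

```python
# Hypothetical blocklist-based filter for LLM training text.
BLOCKED_PHRASES = ("example toxic phrase", "example biased phrase")

def keep_for_training(document: str) -> bool:
    """Exclude a document if it contains any blocked phrase (case-insensitive)."""
    text = document.lower()
    return not any(phrase in text for phrase in BLOCKED_PHRASES)

corpus = ["A helpful paragraph about batteries.",
          "Some text with an EXAMPLE TOXIC PHRASE in it."]
training_set = [doc for doc in corpus if keep_for_training(doc)]
print(len(training_set))  # 1
```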
B. Potentials, Challenges, and Future Directions
Data selection is an important tool for the energy-efficient use of data and, as seen from the analyzed literature, it is widely applicable across engineering fields. In particular, data selection is highly useful for reducing communication resources by determining which data are useful enough to transmit. This thinking is one of the cornerstones of the next generation of wireless communication and sensing, known as 6G. One of the hot topics in 6G is semantic communication, which is concerned with transmitting only the information relevant to a specific task [89]. Data selection therefore lends itself well to many 6G applications such as satellite sensing and computing, smart home applications, robotic communication, cloud-based digital twins, and compression. Another exciting application area is the electrification of vehicles such as electric cars, trucks, buses, trains, ships, and aircraft, which depend on on-board diagnostics to ensure safe battery operation and prolong battery lifetime. Minimizing the energy usage of diagnostic tasks frees up battery capacity and extends the range of operation.
As seen in many of the reviewed methods, data selection can significantly reduce the amount of training data needed to train a machine learning model, making AI training far more efficient. This is especially relevant given the emergence of large language models such as OpenAI's ChatGPT and Google's Gemini [90], which require vast amounts of data and resources for training [91], [92], and for applications that work with high-dimensional data such as image and video processing. Identifying non-informative training data can dramatically reduce the energy costs of future training and AI research.
Resource-aware selection is especially interesting because it can adapt existing selection policies to be more conscious of the costs associated with transmitting, storing, and/or using the data and, as seen in Section VI, can dramatically decrease resource usage for a small decrease in estimation performance. This is particularly relevant in communication applications, where transmitting data can be costly. In semantic communication, the closely related concept of Value of Information (VoI) has been introduced, which is directly connected to resource-aware selection [93]. However, implementing resource-aware methods requires more knowledge of the costs inherent in the system as well as more complex selection policies.
As discussed above, the trade-off between system knowledge and the versatility of the selection method can be a significant challenge, whether for the computational hardware or for the designer, so the selection method should be chosen based on the specific application and its requirements. For online applications, substantial system knowledge may be needed to obtain good results, which can be challenging for highly complex systems. Conversely, learning-based methods are promising for complex systems, but their efficacy depends strongly on their training data, which should be representative of the data expected in operation. Learning-based methods can also accelerate the discovery of new measures for quantifying data quality by incorporating candidate measures into the selection model. Finding new quality measures that are simple to implement and accurately reflect the quality of the data should be of paramount importance for future research in data selection. Other measures, not specific to the structure of the data, can also be included in the selection policy; for example, the total time elapsed since the last model update can be used to avoid long periods with no updates, as sketched below.
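A simple way to incorporate such a staleness measure is to wrap an existing quality-threshold policy with a timeout, as in the following sketch; the threshold and idle time are hypothetical parameters.

```python
import time

class TimeoutAugmentedSelector:
    """Wraps a quality-based selection policy with a staleness rule:
    if no point has been selected for `max_idle_s` seconds, the next point
    is selected regardless of its quality score, avoiding long periods
    with no model updates."""

    def __init__(self, quality_threshold: float, max_idle_s: float):
        self.quality_threshold = quality_threshold
        self.max_idle_s = max_idle_s
        self._last_selected = time.monotonic()

    def select(self, quality: float) -> bool:
        stale = time.monotonic() - self._last_selected > self.max_idle_s
        if quality > self.quality_threshold or stale:
            self._last_selected = time.monotonic()
            return True
        return False

selector = TimeoutAugmentedSelector(quality_threshold=0.8, max_idle_s=60.0)
print(selector.select(quality=0.3))  # False now, but True once 60 s have elapsed
```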
Conclusion
In recent years, data selection has been shown to be highly effective in enabling data-efficient modeling, yielding good estimates of system parameters while reducing energy costs. In this study, a comprehensive overview of data selection methods has been provided, together with an analysis of relevant works across various fields of engineering. Most selection methods can be characterized as model-guided or data-driven, depending on the information used to measure data quality; model-guided methods were further classified as a priori or a posteriori, depending on the time of selection relative to model inference. The different selection methods were analyzed and compared using six key metrics, including the potential resource savings, the complexity of the selection process, and the versatility of the selection method across applications. Owing to the diverse set of quality metrics and selection policies, data selection is highly versatile across engineering disciplines and can be readily deployed in many systems, although the choice of method depends on the requirements of the application. Moreover, resource-aware selection methods have proven particularly promising for ensuring resource efficiency across applications. It is the opinion of the authors that data selection will play a vital role in the age of Big Data and be integral to associated technologies such as the Internet of Things, smart transportation, AI research and development, 6G, on-board system modeling, and more.