Introduction
Mathematical modeling is widely employed across scientific disciplines to investigate the behavior of systems. Leveraging the input-output relationship of a system enables the estimation of model parameters that are not directly measurable. This has given rise to a philosophy that is most prominent today: given enough data, any system can be modeled. This thinking has driven an immense increase in data generation and the genesis of the age of Big Data, which has accelerated many industries such as banking [1], transportation [2], agriculture [3], the Internet of Things (IoT) [4], medicine [5], [6], and language processing [7], thus enabling effective modeling of complex systems.
As reported in [8], data generation has accelerated at an alarming rate; however, the vast volume of data comes at the price of higher energy consumption due to the large amount of processing and communication required. Moreover, Big Data applications are still challenged by the quality of the generated data [9]. Data quality typically refers to the volume, accuracy, and completeness of the data, and phenomena such as noise, outlying data points, and missing data are often cited as causes of low accuracy and/or completeness. This study adopts a different view of data quality, which refers to how relevant and, by extension, how useful the data are for performing a specific task. Although the phenomena mentioned above can lead to a degradation in quality, even "clean" data can be irrelevant. This view of data quality directly challenges the idea that more data is necessarily better, since the performance of the task can be compromised by low-quality data. In light of this, it is possible, and preferable, to achieve better performance while using fewer resources, thus increasing the energy efficiency and sustainability of Big Data solutions.
Appealing to the relevancy of the data is hardly a new idea. In 1996, a highly influential paper categorized different kinds of data quality, among them the contextual quality of the data with respect to the task at hand [10]. In mathematical modeling, this idea can be realized by carefully selecting the types of input data that are most relevant for the model or, in the case of Machine Learning (ML), which features of the data are most relevant for the given task. This study explores a second interpretation: Given a data set, is it possible to determine whether a data point within that set is useful for estimating a parameter of a mathematical model, and can this be used to include useful data points and exclude non-useful ones in the modeling process, thus improving the model and saving resources?
Methods for answering the above will be referred to as data selection methods. Data selection is an interdisciplinary field of research used in many scientific areas, and the definition of a data point depends greatly on the application. A data point can be a collection of features, a segment of a time series signal, an image, a collection of measurement signals, a piece of text, etc. Data selection methods can be highly beneficial, since they can significantly reduce the data burden of complex modeling tasks, thus saving computational resources.
Data selection techniques have been discussed for specific domains such as mobile crowd sensing [11] and, more recently, large language model (LLM) training [12]. However, to the best of our knowledge, no previous work provides a comprehensive overview of these methods across different scientific domains. A general interdisciplinary survey of data selection methods is important for the scientific field as a whole, as it facilitates knowledge sharing between disciplines and encourages the wider use of data selection methods in other data-intensive areas of research. This study provides an overview of the data selection methods that have come to light in recent years across different fields of engineering, with the purpose of categorizing them in a structural and functional way. Specifically, this study provides the following contributions:
A first-of-its-kind high-level overview of data selection methodologies and their application to different fields of engineering.
An establishment of three different dichotomies used to characterize data selection methodologies.
A qualitative comparison of different data selection methodologies based on key performance metrics.
Discussion of opportunities, challenges, and future research directions for data selection methods.
A technical roadmap for this study can be found in Fig. 1. The remainder of the paper is organized as follows: Section II describes the methodology for the search and selection of relevant works included in this study along with a preliminary analysis. Section III formulates the data selection problem in greater detail and introduces three dichotomies for classifying fundamentally different data selection methods. Sections IV, V, and VI present and discuss individual data selection methods that have been identified in Section III. A comparison of the data selection methods presented in this study is done in Section VII along with a broader discussion on data selection. Concluding remarks are given in Section VIII.
Search Methodology
In this section, the methodology used to search, screen, and analyze the scientific literature relevant to data selection methods is described. The search focused on the application of data selection in various fields of engineering, as described in Section I. To this end, we searched for papers in engineering databases such as IEEE Xplore (IEEE), Engineering Village (Elsevier), ACM Digital Library (ACM), and Web of Science Core Collection (Clarivate). The search was conducted using search terms relevant to data selection, such as "data-selection", "data-selective", and "selection of data", along with terms relevant to parameter and/or state estimation. Moreover, searches were also conducted using keywords from different fields which were identified to be synonymous with data selection, e.g., "measurement selection", "training data selection", "sensor selection", etc. Subsequent searches were also made in Google Scholar for additional papers. The search was limited to journal articles, conference articles, and review articles published between 2000 and 2023 (inclusive).
In the selection process, articles that did not fit the scope of this study were deemed irrelevant and therefore excluded from review. This includes papers focusing on other aspects of model development, such as model selection methods, determining relevant parameters to include in the model, and the type of input data that can be relevant to the model, as mentioned in Section I. For applications of machine learning models, papers focusing on determining relevant features and reducing the dimension of the input data were also excluded. Furthermore, papers not written in English were excluded, along with papers for which the full-text document could not be retrieved. The articles were screened using a two-step approach. First, the extracted papers were screened based on their perceived relevance from their titles and abstracts, using the considerations above. The screening was conducted using the open-source tool ASReview, which utilizes machine learning to order the papers in terms of predicted relevance based on the authors' judgment of previous papers [13]. Second, papers deemed relevant based on the abstract and title were further screened based on their content. To help facilitate this process, the machine learning tool ChatPDF (see [14]) was used to generate a report on each article based on a designed prompt with general questions regarding its content. These include questions about how the article used data selection, what the stated motivation for using data selection is, in which application and context the data selection was performed, and what the stated results and conclusions were. The selection of the articles was performed using both the generated report and a manual review to ensure that articles were not unjustifiably deemed irrelevant. Moreover, the authors screened newer papers more favorably than older works in order to keep the discussion relevant to recent developments.
A. Preliminary Analysis
Fig. 2 shows the distribution of included publications over time. As mentioned above, the screening process was intentionally biased towards recent publications. 56 publications are included in this study, of which 16 were published in conferences and the rest in journals. To get an overview of the scientific fields represented in these papers, a word cloud was generated based on the titles of the journals and conferences in which the papers were published. The result can be seen in Fig. 3. Many different fields of engineering are represented, including signal processing, system modeling, Artificial Intelligence (AI), electrical engineering, transportation, communication, aerospace, language processing, robotics, control, and biomedical analysis. This further demonstrates that data selection is a highly interdisciplinary methodology with wide usage in different applications. However, this also presents a significant challenge, as the notation used to formulate the data selection problem varies across fields. Furthermore, the data selection methodologies used in the different fields can vary significantly depending on the specific challenges and requirements, and each field tends to have its own conventions and may not present its data selection methodology relative to a general interdisciplinary framework. Using discipline-specific conventions may alienate researchers from other fields, leading them to perceive the methodology as irrelevant for their needs. Thus, having different conventions for each discipline may restrict the transfer of knowledge between disciplines. In the next section, we propose a general framework for formulating the data selection problem which encompasses all types of data selection methods identified from the relevant works.
Fig. 3. Word cloud of the 50 most-used words in the journal and conference titles of the papers included in this study.
Problem Formulation
This section defines the general framework for data selection methods and the common mathematical notation. We define the model inference for a data point $i$ as
\begin{equation*} \hat{\mathbf {y}}_{i} = M(\mathbf {x}_{i};\hat{\boldsymbol{\theta }}_{i}),\quad i=1,2,\ldots, N \tag{1} \end{equation*}
In the data selection framework, a selection algorithm is introduced to decide if a data point $D_{i}$ should be used, i.e.,
\begin{equation*} \delta _{i} = S\left(Q(D_{i};\cdot )\right),\quad \delta _{i} \in \lbrace 0,1\rbrace \tag{2} \end{equation*}
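To make the framework concrete, the following minimal Python sketch shows how a quality measure $Q$ and a selection policy $S$ compose into the selection indicators of (2). It is purely illustrative; the function names and the example threshold policy are assumptions, not part of any surveyed method.

```python
# Minimal sketch of the generic selection pipeline in (1)-(2).
# `quality` and `policy` stand in for the method-specific choices
# surveyed in the following sections; all names are illustrative.
from typing import Any, Callable, Sequence

def select(data: Sequence[Any],
           quality: Callable[[Any], float],
           policy: Callable[[float], bool]) -> list:
    """Return the selection indicators delta_i in {0, 1} for each D_i."""
    return [int(policy(quality(d))) for d in data]

# Example: keep points whose (method-specific) quality score exceeds 0.5.
deltas = select([0.1, 0.7, 0.9], quality=lambda d: d, policy=lambda q: q > 0.5)
print(deltas)  # [0, 1, 1]
```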
A. A Priori Data Selection
A diagram showing the general framework of a priori data selection can be seen in Fig. 4. Here, the selection algorithm calculates $\delta_{i}$ from the quality measure $Q(D_{i};\cdot)$ before any model inference takes place, and only the selected data points are passed on to the model.
B. A Posteriori Data Selection
Fig. 5 illustrates the general framework of a posteriori data selection. Unlike a priori data selection, the selection happens after model inference and is used to decide if new model parameters should be estimated based on $D_{i}$.
C. Identified Data Selection Methods
From the literature review, this study has identified seven different methodologies of data selection. The methodologies were defined such that they are general enough to be applicable across multiple scientific domains while being specific enough to create meaningful distinctions. Fig. 6 shows how each data selection method described below fits in the two dichotomies defined in the previous subsections. What follows is a short description of each method:
Error-based selection: The quality of the data is characterized by the prediction error of the model.
Confidence-based selection: The quality of the data is based on how confident the model is in its inference. In a priori confidence-based selection, the confidence measure is predicted prior to model inference.
Learning-based selection: The quality of the data is learned by a machine learning algorithm based on previous experiences from the model.
Information-theoretic selection: The quality of data is characterized by how much information it carries about a certain model parameter.
Similarity-based selection: The quality of the data is based on its similarity with other data points, using either a distance or a correlation metric.
Distribution-based selection: The quality of the data is based on its distribution characteristics and how well the data point fits in the overall distribution of the data set.
Resource-aware selection: The resource costs associated with acquiring, transmitting, or processing the data are incorporated directly in the selection policy (see Section VI).
Fig. 6. Data selection methods identified in this study, including relevant references for each method.
In the following sections, an overview of each identified data selection method is provided. Each section includes a table summarizing the selected works with the application of the selection and relevant design choices of the selection algorithm. In addition, two quantitative metrics have been evaluated for each reference: the total reduction of the data usage in percent and the overall change in performance of the model. As the works span many domains, no single evaluation metric is shared between the included references. For works that evaluate their model performance in percentage values (such as classification accuracy), the performance change is reported in percentage points (noted as pp). Improved performances are indicated by positive values and degraded performances by negative values.
Additionally, the selection methods will be analyzed with respect to six qualitative metrics:
Potential Resource Savings: Potential reduction in used resources compared to non-selective methods or random selection. The type of resource is specific to the application and can be computational resources, energy consumption, resources related to manual labeling, etc.
Robustness: General robustness towards phenomena that may disturb the modeling, such as outlying data points, background noise, training set size, etc.
Complexity: General complexity of the selection algorithm with respect to the number of computations required to run the algorithm as well as memory resources required to accommodate it.
Required System Knowledge: The amount of knowledge needed about the system to implement the selection algorithm and be able to get a good separation between high-quality and low-quality data points.
Implementation Difficulty: Difficulty in implementing the selection algorithm, including prior work needed before implementation, implementing the algorithm in code, and the hardware required to support the algorithm.
Versatility: How well the selection method generalizes to different applications.
Model-Guided Selection Methods
A. Error-Based Selection
Error-based data selection has gained considerable popularity in recent years due to its intuitive approach to system modeling. Relevant articles on error-based selection can be found in Table I. The selection method closely follows the general a posteriori selection framework in Fig. 5, where the quality measure is an error metric $E(\mathbf {y}_{i},\hat{\mathbf {y}}_{i})$ between the measured output and the model prediction. Commonly used error metrics include:
Absolute Error (AE) [15], [16]:
\begin{equation*} E(y_{i},\hat{y}_{i}) = |y_{i}-\hat{y}_{i}| \tag{3} \end{equation*}
Absolute Squared Error (ASE) [17]:
\begin{equation*} E(y_{i},\hat{y}_{i}) = |y_{i}-\hat{y}_{i}|^{2} \tag{4} \end{equation*}
Normalised Absolute Error (NAE) [18], [19], [20]:
\begin{equation*} E(y_{i}, \hat{y}_{i}) = \frac{|y_{i}-\hat{y}_{i}|}{\sigma _{n}} \tag{5} \end{equation*}
where $\sigma _{n}$ is the standard deviation of the measurement noise.
Normalized Sum of Squared Error (NSSE) [21]:
\begin{equation*} E(\mathbf {y}_{i}, \hat{\mathbf {y}}_{i}) = (\mathbf {y}_{i}-\hat{\mathbf {y}}_{i})^{H}\mathbf {S}_{i}^{-1}(\mathbf {y}_{i}-\hat{\mathbf {y}}_{i}) \tag{6} \end{equation*}
where $\mathbf {S}_{i}$ is the covariance matrix of $\mathbf {y}_{i}-\hat{\mathbf {y}}_{i}$.
Maximum Correntropy Criterion (MCC) cost [23]:
\begin{equation*} E(y_{i}, \hat{y}_{i}) = \exp \bigg (-\frac{(y_{i}-\hat{y}_{i})^{2}}{2\sigma ^{2}}\bigg ) \tag{7} \end{equation*}
where $\sigma$ is the size of the Gaussian kernel.
The appropriate definition of the error depends on the application. For example, ASE may be favorable for amplifying the prediction error of outliers and preventing suspected channel attacks [17]. In [21], it was argued that the NSSE is equivalent to using the Kullback-Leibler (KL) divergence for linear Gaussian systems, although the inversion in (6) is computationally expensive for high-dimensional $\mathbf {y}_{i}$.
Typically, a threshold-based policy is adopted to determine the selection parameter, i.e.,
\begin{equation*} \delta _{i} = S(E;\tau _{low}, \tau _{up}) = \left\lbrace \begin{array}{ll}1,& \tau _{low} \leq E \leq \tau _{up},\\ 0,& \text{otherwise} \end{array}\right. \tag{8} \end{equation*}
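As an illustration, the sketch below gates a simple LMS-style parameter update with the NAE metric (5) and the dual-threshold policy (8). It is a toy example, not a reproduction of any referenced method; the system, the thresholds, and the step size are all assumed for illustration.

```python
import numpy as np

def nae(y, y_hat, sigma_n):
    """Normalized absolute error, eq. (5)."""
    return abs(y - y_hat) / sigma_n

def select(error, tau_low, tau_up):
    """Dual-threshold policy, eq. (8): update only when the error is
    informative (above tau_low) but not an outlier (below tau_up)."""
    return tau_low <= error <= tau_up

rng = np.random.default_rng(0)
theta_true, theta_hat = 2.0, 0.0        # scalar system y = theta * x + noise
sigma_n, mu = 0.1, 0.05                  # noise std and LMS step size
updates = 0
for _ in range(1000):
    x = rng.normal()
    y = theta_true * x + rng.normal(scale=sigma_n)
    y_hat = theta_hat * x                # model inference, eq. (1)
    if select(nae(y, y_hat, sigma_n), tau_low=1.0, tau_up=50.0):
        theta_hat += mu * (y - y_hat) * x   # parameter update
        updates += 1
print(f"theta_hat = {theta_hat:.3f}, updates used: {updates}/1000")
```

Once the estimate has converged, most residuals fall below the lower threshold and the update is skipped, which is precisely where the savings in estimate updates come from.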
Resource Savings: As evidenced by Table I, error-based data selection can offer high savings in estimate updates and, in most cases, the selection will result in unchanged or improved model performance. However, being an a posteriori selection method, the saved computational costs come from a reduced use of the estimation algorithm. Being selective in when to update the parameter could justify using more complex and accurate estimation methods.
Robustness: Multiple works have reported excellent robustness to noisy outliers and system changes after implementing error-based data selection, particularly due to the inclusion of an upper threshold [15], [16], [17], [18], [19].
Complexity: Error-based selection has the lowest computational complexity of all the selection methods due to the relatively simple error metrics and selection policy.
Required System Knowledge: The key task in error-based selection is to properly design thresholds that are appropriate for the given system. The works discussed in this section all consider a relatively simple linear system model and the maturity of the selection method towards more complex systems is yet to be shown. Increasing the complexity of the system yields a greater challenge in identifying optimal thresholds which may change over time. Thus, developing strategies to find appropriate thresholds which can adapt to changes should be considered in future research along with its efficacy in complex non-linear systems.
Implementation Difficulty: Due to the low computational complexity, error-based data selection methods are very simple to implement in low-complexity systems. The main difficulty comes from designing appropriate thresholds which may not be trivial.
Versatility: The intuitive and simple implementation of error-based data selection makes it potentially highly versatile for applications where online system modeling is a core component. The relatively low complexity of calculating the error metric also makes it well suited for resource-constrained devices.
B. Confidence-Based Selection
Although error-based selection methods rely on the error between the true output $\mathbf {y}_{i}$ and the model prediction $\hat{\mathbf {y}}_{i}$, confidence-based selection methods do not require knowledge of the true output; instead, the quality of the data is characterized by how confident the model is in its inference. Relevant works are summarized in Table II.
A posteriori confidence-based selection is typically used in semi-supervised ML methods, where unlabeled data points are pseudo-labeled by the model and the confidence of the prediction determines whether the data point should be used for further training, or in active learning, where low-confidence data points are sent for manual labeling.
Confidence-based selection methods are most prominent in classification tasks. In a classification task with $k$ possible classes, the model outputs a vector of posterior class probabilities
\begin{equation*} \hat{\mathbf {y}}_{i} = {\begin{bmatrix}P(y_{1} | \mathbf {x}_{i}) & P(y_{2} | \mathbf {x}_{i}) & \cdots & P(y_{k} | \mathbf {x}_{i}) \end{bmatrix}}^{T} \tag{9} \end{equation*}
Maximum posterior class probability [26], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = \underset{j=1,2,\ldots,k}{{\max}} P(y_{j} | \mathbf {x}_{i}) \tag{10} \end{equation*}
Negative posterior class probability entropy [27], [28], [29], [30], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = \sum _{j=1}^{k}P(y_{j}|\mathbf {x}_{i})\log P\left(y_{j}|\mathbf {x}_{i}\right) \tag{11} \end{equation*}
Posterior class probability margin [34], [37]:
\begin{align*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) =& \underset{j=1,2,\ldots,k}{{\max}}P(y_{j}|\mathbf {x}_{i}) - \underset{{{\scriptstyle {\begin{array}{c}m=1,2,\ldots,k \\ m \ne j\end{array}}}}}{{\max}}P(y_{m}|\mathbf {x}_{i}). \tag{12} \end{align*}
We note that the negative entropy is used in (11) to make it consistent with the notion of confidence, i.e., a higher negative entropy (and thus lower entropy) leads to higher confidence in the prediction.
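The measures (10)-(12) are inexpensive to compute from the posterior probability vector in (9). The NumPy sketch below is illustrative only:

```python
import numpy as np

def max_prob(p):
    """Maximum posterior class probability, eq. (10)."""
    return float(np.max(p))

def neg_entropy(p, eps=1e-12):
    """Negative posterior class probability entropy, eq. (11)."""
    return float(np.sum(p * np.log(p + eps)))

def margin(p):
    """Posterior class probability margin, eq. (12)."""
    top2 = np.sort(p)[-2:]              # two largest posteriors
    return float(top2[1] - top2[0])

p = np.array([0.7, 0.2, 0.1])           # model output, eq. (9)
print(max_prob(p), neg_entropy(p), margin(p))
```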
Another common approach is to define a committee of models $M_{1}, M_{2}, \ldots, M_{L}$ and measure the disagreement between their predictions, a technique known as Query-by-Committee (QBC). Commonly used committee-based confidence measures include:
Committee variance [31]:
\begin{equation*} C(\hat{y}_{i}|\mathbf {x}_{i}) = \frac{1}{L} \sum _{l=1}^{L} \left(M_{l}(\mathbf {x}_{i};\boldsymbol{\theta }_{l})-\hat{y}_{i}\right)^{2} \tag{13} \end{equation*}
where $\hat{y}_{i}=\frac{1}{L}\sum _{l=1}^{L} M_{l}(\mathbf {x}_{i};\boldsymbol{\theta }_{l})$.
Average committee KL divergence [32], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = \frac{1}{L} \sum _{l=1}^{L} \sum _{j=1}^{k} P_{M_{l}}(y_{j}|\mathbf {x}_{i})\log \frac{P_{M_{l}}(y_{j}|\mathbf {x}_{i})}{P_{avg}(y_{j}|\mathbf {x}_{i})} \tag{14} \end{equation*}
where $P_{M_{l}}(y_{j}|\mathbf {x}_{i})$ is the posterior probability of class $j$ under $M_{l}$ and $P_{avg}(y_{j}|\mathbf {x}_{i}) = \frac{1}{L}\sum _{l=1}^{L}P_{M_{l}}(y_{j}|\mathbf {x}_{i})$.
Committee vote entropy [33], [34]:
\begin{equation*} C(\hat{\mathbf {y}}_{i}|\mathbf {x}_{i}) = -\sum _{j=1}^{k} \frac{V(y_{j}|\mathbf {x}_{i})}{k}\log \frac{V(y_{j}|\mathbf {x}_{i})}{k} \tag{15} \end{equation*}
where $V(y_{j}|\mathbf {x}_{i})$ is the number of models predicting $\mathbf {x}_{i}$ to be in class $j$.
Note that QBC enables confidence-based selection methods to be used for regression tasks by employing a confidence score such as in (13).
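A minimal sketch of two committee measures, assuming the committee predictions are already available (data and committee size are illustrative; the vote entropy follows the normalization by $k$ used in (15)):

```python
import numpy as np

def committee_variance(preds):
    """Committee variance, eq. (13); preds holds the L regression outputs."""
    return float(np.mean((preds - preds.mean()) ** 2))

def vote_entropy(votes, k):
    """Committee vote entropy, eq. (15); votes[j] counts the committee
    members predicting class j (normalized by k as in the text)."""
    frac = np.asarray(votes, dtype=float) / k
    nz = frac[frac > 0]                  # avoid log(0) for classes with no votes
    return float(-np.sum(nz * np.log(nz)))

preds = np.array([1.9, 2.1, 2.4])        # L = 3 regression committee members
votes = [2, 1, 0]                        # k = 3 classes, 3 voters
print(committee_variance(preds), vote_entropy(votes, k=3))
```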
Two different approaches are used to select data points based on the confidence measure. In active learning, low-confidence data points are selected, since these are deemed the most informative ones to label. In semi-supervised self-training, high-confidence data points are selected so that their pseudo-labels can be trusted for further training.
A priori confidence-based data selection closely follows the general a priori framework in Fig. 4, with the confidence of the model prediction being estimated before model inference, e.g., by a smaller proxy model [36], [37].
Resource Savings: A posteriori confidence-based selection methods have shown great promise in reducing the amount of training data needed to update the model over time. This reduction in data usage corresponds to a reduction in the computational power needed to update the model, lower transmission costs in the case of edge computing, and fewer manual annotations in the case of active learning applications. A priori confidence-based selection has the added benefit of reducing costs related to model inference, which in the case of edge computing can lead to more efficient use of communication resources [36], [37].
Robustness: The model requires a labeled training data set as a starting point to obtain reliable confidence estimates, although previous work has shown that the active learning framework can be robust with respect to the size of the initial training set [40]. Furthermore, if low-confidence data points are selected, the selection may not be robust against outliers. If high-confidence points are selected, incorrect pseudo-labels can introduce confirmation bias to the model [41].
Complexity: In classification tasks, the complexity of (10)-(12) is relatively low, even for a substantial number of classes. Using (13)-(15) adds significantly to the computational and memory complexity, since multiple models are required for inference. The complexity of a priori confidence-based selection depends on how the confidence is predicted; by introducing a smaller proxy model, the complexity can be kept lower than that of QBC [37].
Required System Knowledge: Since confidence-based data selection is used primarily in a machine learning context, little prior knowledge is needed about the relationship between the input $\mathbf {x}_{i}$ and the output $\mathbf {y}_{i}$, as this relationship is learned from data.
Implementation Difficulty: The hardware requirements for confidence-based selection depend on the computational and memory complexity of the selection algorithm, as discussed above. However, code implementation of the a posteriori confidence metrics discussed above is straightforward. Implementation of a priori confidence-based selection algorithms is potentially more difficult due to the added complexity in prior prediction of model confidence. However, as seen in [36] and [37], adding small proxy models can be a good way to circumvent this issue.
Versatility: Since confidence-based selection is used mainly in ML, it is highly versatile across different applications, as is also clear in Table II, provided that conventional challenges in ML such as data availability and hardware requirements are adequately addressed. Thus, confidence-based selection has the potential to be highly relevant in emerging technologies utilizing ML, e.g., edge computing, autonomous driving, language processing, and more.
C. Information-Theoretic Selection
Information-theoretic data selection methods leverage information-theoretic metrics to measure the amount of information gained from the selection. The selection method closely follows the general a priori framework in Fig. 4, where the quality measure quantifies how much information $D_{i}$ carries about the model parameters $\boldsymbol{\theta }$. Relevant works are summarized in Table III.
Two common measures of information are:
Fisher Information Matrix (FIM) [42], [43], [44], [45], [47]:
\begin{equation*} \mathbf {F}(\mathbf {x}_{i},\boldsymbol{\theta }) = -\mathbb {E}_{\boldsymbol{\theta }}\left[\frac{\partial ^{2}}{\partial \boldsymbol{\theta }\partial \boldsymbol{\theta }^{T}}\ln {f(\mathbf {x}_{i};\boldsymbol{\theta })}\Big |\boldsymbol{\theta }\right] \tag{16} \end{equation*}
where $f$ is the likelihood function of $\mathbf {x}_{i}$, and $\mathbb {E}_{\boldsymbol{\theta }}[\cdot ]$ is the expectation operator under the distribution of $\boldsymbol{\theta }$ [48].
Mutual Information (MI) [43]:
\begin{equation*} I(\mathbf {x}_{i}, \boldsymbol{\theta }) = H(\boldsymbol{\theta }) - H(\boldsymbol{\theta }|\mathbf {x}_{i}) \tag{17} \end{equation*}
where $H(\boldsymbol{\theta })$ and $H(\boldsymbol{\theta }|\mathbf {x}_{i})$ are the entropy and conditional entropy of $\boldsymbol{\theta }$.
The FIM is the most used information metric. The inverse of the FIM provides a lower bound on the variance of any unbiased estimator [48]. Thus, data points which maximize the FIM can lead to better estimates. The FIM is typically calculated recursively [42], [44], [45] or analytically [47]. For linear systems with Gaussian noise, the FIM can be calculated as a linear expression [44]. However, for non-linear systems where the underlying distributions are not available in closed form, Monte Carlo methods are typically required to estimate the FIM.
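As an illustration of the linear Gaussian case, the sketch below computes the per-measurement FIM for a model $y = \mathbf {h}^{T}\boldsymbol{\theta } + n$ with Gaussian noise and greedily selects the measurements that maximize the log-determinant of the accumulated FIM (a common D-optimality surrogate). The model and the greedy policy are illustrative assumptions, not taken from the referenced works.

```python
import numpy as np

def fim_linear_gaussian(h, sigma):
    """FIM of one measurement y = h^T theta + n, n ~ N(0, sigma^2).
    For this model F = h h^T / sigma^2, independent of theta."""
    h = h.reshape(-1, 1)
    return (h @ h.T) / sigma**2

def greedy_select(H, sigma, budget, eps=1e-9):
    """Greedily pick `budget` rows of H maximizing log det of the
    accumulated FIM."""
    n, p = H.shape
    F = eps * np.eye(p)                 # regularizer keeps log det finite
    chosen = []
    for _ in range(budget):
        gains = [np.linalg.slogdet(F + fim_linear_gaussian(H[i], sigma))[1]
                 if i not in chosen else -np.inf for i in range(n)]
        best = int(np.argmax(gains))
        chosen.append(best)
        F += fim_linear_gaussian(H[best], sigma)
    return chosen, F

rng = np.random.default_rng(1)
H = rng.normal(size=(20, 3))            # 20 candidate measurements, 3 parameters
chosen, F = greedy_select(H, sigma=0.1, budget=5)
print(chosen)
```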
Resource Savings: As shown in Table III, information-theoretic data selection can yield impressive reductions in data usage. As the main application of information-theoretic selection is in sensor selection, this data reduction results in lower costs in communication which has a significant effect on the overall energy consumption of the system.
Robustness: Information-theoretic selection methods have been shown to be robust to the uncertainties inherent in sensors [42], [45] and can be used to account for uncertainties and biases in the modeling process [47]. However, it should be noted that the FIM only provides a lower bound on the estimation variance if the estimate is unbiased, which is rarely the case. Thus, there are rarely guarantees for the effectiveness of the FIM to provide better estimates.
Computational Complexity: The computational complexity of the selection method is highly dependent on the complexity of the system. For most systems, Monte Carlo methods are required to estimate the underlying distributions, which adds significantly to the computational complexity of the selection method. To avoid complexity issues, the system model can be reduced to a linear form, which can greatly simplify calculations [46], but the performance of the estimation may break down for highly non-linear systems.
Required System Knowledge: For efficient implementation, a high degree of system knowledge is usually required. This is especially the case if the FIM is calculated analytically, as in [47].
Implementation Difficulty: The difficulty of the implementation depends on the methods chosen for calculating the information metric. Monte Carlo methods can easily be implemented on most hardware, although at the expense of high computational complexity. To alleviate this, a mathematical analysis of the system is required beforehand, which raises the complexity of the implementation.
Versatility: Information-theoretic data selection is highly represented in wireless sensor network applications for system state estimation, particularly sensor selection in object tracking applications [49]. However, in these applications, simple assumptions about the motion of the object, such as linearity, are used [42], [44], [45], [46]. For a more complex model, such as in [47], a greater amount of effort is needed to implement the selection method. Thus, the versatility of the selection method is limited by the complexity of the model. Developing effective ways of estimating parameter distributions in a cost-efficient manner will highly reduce the difficulty of implementation and make this selection method much more versatile. Nonetheless, information-theoretic selection has shown to be highly useful for optimizing transmission costs and may play a key role in future communication systems.
D. Learning-Based Selection
Unlike the previously discussed methods, learning-based selection does not rely on predefined heuristics such as error, confidence, or information gain. Rather, the quality of data points is learned over time by a machine learning model. A general framework for learning-based selection is shown in Fig. 8. In this framework, the quality of $D_{i}$ is assessed by a separate selection model, which learns its selection strategy from feedback provided by the target model. Relevant works are summarized in Table IV.
Reinforcement Learning (RL) is highly prominent in the literature because of its wide utility in different applications. In the general RL framework, an agent decides on an action based on the state of an environment. Through a reward feedback system, the agent learns which actions it should take in each state to maximize its accumulated reward [59]. In learning-based data selection, the selection model performs an action based on the state of the target model and the data, and adjusts its strategy depending on feedback from the target model. In [50], a teacher-student framework was proposed based on data features as well as base model states such as iteration number and historical training accuracy. In [51] and [52], data features were encoded using a Convolutional Neural Network (CNN) to represent the states of the RL agent, along with the posterior class probabilities predicted by the target model. In [53], a data valuation framework was proposed for various machine learning tasks, with the reward constructed by following the evolution of the target model's performance on a small validation set during training.
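The toy sketch below conveys the RL flavor of learning-based selection without reproducing any specific framework from [50], [51], [52], [53]: a Bernoulli selection policy over a simple residual feature is updated with a REINFORCE-style gradient, using the change in validation loss of a scalar target model as reward. Every modeling choice here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta = 0.0                                      # scalar target model y = theta*x
Xv = rng.normal(size=50)
yv = 2.0 * Xv + rng.normal(scale=0.1, size=50)   # held-out validation data
val_loss = lambda th: float(np.mean((yv - th * Xv) ** 2))

w = np.zeros(2)                                  # selection policy parameters
for step in range(2000):
    x = rng.normal()
    y = 2.0 * x + rng.normal(scale=0.1)
    feat = np.array([1.0, abs(y - theta * x)])   # bias + residual feature
    p_sel = sigmoid(w @ feat)
    a = rng.random() < p_sel                     # stochastic selection action
    loss_before = val_loss(theta)
    if a:
        theta += 0.05 * (y - theta * x) * x      # target model update
    reward = loss_before - val_loss(theta)       # validation improvement
    w += 0.1 * reward * (float(a) - p_sel) * feat  # REINFORCE-style step
print(f"theta = {theta:.3f}, policy weights = {w}")
```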
Resource Savings: The works in Table IV have reported great reductions in data for a small to significant performance improvement in most applications. As a result of reduced data usage, authors have reported several benefits such as lower computational costs [50], [57], [58], reduced training time for the target model [57], [58], lower communication costs [57], and, in the case of active learning, reduced manual labeling costs [51], [52]. Thus, learning-based selection has the potential to introduce immense resource savings.
Robustness: In [56], it was shown that the proposed method led to better temperature estimates than a standard SVM, while also being robust to outliers. Learning-based selection methods are more robust to poor initial models since they can learn which data points are more useful in the early stages of the learning process. However, the learned selection algorithms are heavily dependent on the training data used for the selection model and the robustness of the selection may fail in unseen scenarios.
Computational Complexity: Since learning-based selection methods are designed for use in machine learning applications, the overall computational complexity is too high for conventional CPU hardware, thus requiring hardware accelerators such as GPUs. However, it is important to note that the vast majority of computational costs are due to the training of the selection algorithm. A pretrained selection algorithm requires fewer computational resources for inference and can thus be implemented on hardware of lower complexity. For online applications, care must be taken that the computational complexity of the selection model is small compared to that of the target model, to ensure that the selection process does not lead to an overall increase in computations, although this can also restrict the performance of the selection model.
Required System Knowledge: Little or no system knowledge is required to implement the learning-based selection methods, since the utility of the data is learned over time. However, as shown in [54], improvements can be made to the selection algorithm by defining specific features beforehand.
Implementation Difficulty: Learning-based selection methods have a high difficulty of implementation, particularly due to the hardware required for training and implementing the selection algorithms. Methods for reducing the size of the selection model, such as model pruning, can help alleviate the computational burden and make it more applicable for low-complexity systems. Additionally, great care must be taken to construct a comprehensive training set for the selection algorithm to learn from which can further complicate the implementation of the selection method in the specific use case.
Versatility: Learning-based data selection can be very flexible for different applications and target models. Specifically, the frameworks presented in [50] and [53] have been shown to be agnostic to the type of target model used. This makes learning-based selection methods highly attractive for many prominent research areas in AI, such as medical imaging, language processing, and protein folding. However, the majority of the works discussed in this section only consider data selection to improve target model training. As shown in [56], learning-based data selection can also be used to identify if the data should be used in a potentially complex model identification process or skipped entirely, saving limited resources. Likewise, learning-based data selection can also be used to determine whether data samples should be used for online model inference, based on past experiences.
Data-Driven Selection Methods
A. Similarity-Based Selection
Similarity-based selection methods leverage similarity metrics to determine if $D_{i}$ is sufficiently similar, or dissimilar, to other data points. Relevant works are summarized in Table V. A common approach is to measure the distance between two vectors $\mathbf {v}$ and $\mathbf {v}^{\prime }$ representing two data points. Commonly used distance measures include:
Euclidean distance [28], [62]:
\begin{equation*} d(\mathbf {v},\mathbf {v}^{\prime }) = ||\mathbf {v} - \mathbf {v}^{\prime }||_{2} \tag{18} \end{equation*}
where $||\cdot ||_{2}$ is the $l_{2}$ norm.
Squared Euclidean distance [31], [63]:
\begin{equation*} d(\mathbf {v},\mathbf {v}^{\prime }) = ||\mathbf {v} - \mathbf {v}^{\prime }||_{2}^{2} \tag{19} \end{equation*}
Generalized Inner Product (GIP) [29]:
\begin{equation*} d(\mathbf {v}, \mathbf {v}^{\prime }) = \mathbf {v}^{H}\mathbf {M}\mathbf {v}^{\prime } \tag{20} \end{equation*}
where $H$ denotes the Hermitian transpose and $\mathbf {M}$ is a positive-definite symmetric matrix.
Note that the GIP in (20) generalizes the standard inner product through the weighting matrix $\mathbf {M}$; for $\mathbf {v} = \mathbf {v}^{\prime }$, it can be interpreted as a weighted squared norm of $\mathbf {v}$, reducing to the squared $l_{2}$ norm when $\mathbf {M}$ is the identity matrix.
Another common type of similarity measure is the correlation between $\mathbf {v}$ and $\mathbf {v}^{\prime }$. Commonly used correlation measures include:
Cosine similarity [27], [68], [69], [70]:
\begin{equation*} c(\mathbf {v},\mathbf {v}^{\prime }) = \frac{\langle \mathbf {v},\mathbf {v}^{\prime }\rangle }{||\mathbf {v}||_{2}||\mathbf {v}^{\prime }||_{2}} \tag{21} \end{equation*}
where $\langle \cdot,\cdot \rangle$ denotes the inner product.
Pearson correlation coefficient [71]:
\begin{equation*} c(\mathbf {v}, \mathbf {v}^{\prime }) = \frac{\langle \tilde{\mathbf {v}},\tilde{\mathbf {v}}^{\prime }\rangle }{||\tilde{\mathbf {v}}||_{2}||\tilde{\mathbf {v}}^{\prime }||_{2}} \tag{22} \end{equation*}
where $\tilde{\mathbf {v}} = \mathbf {v} - \bar{\mathbf {v}}$ and $\bar{\mathbf {v}}$ is the sample mean of $\mathbf {v}$.
Coefficient of determination ($R^{2}$) [72]:
\begin{equation*} c(\mathbf {v},\mathbf {v}^{\prime }) = 1-\frac{||\mathbf {v}-\mathbf {v}^{\prime }||_{2}^{2}}{||\tilde{\mathbf {v}}||_{2}^{2}} \tag{23} \end{equation*}
Magnitude-squared coherence [73]:
\begin{equation*} c(\mathbf {v},\mathbf {v}^{\prime }) = \frac{|\langle \mathbf {v}, \mathbf {v}^{\prime } \rangle |^{2}}{||\mathbf {v}||_{2}^{2}||\mathbf {v}^{\prime }||_{2}^{2}} \tag{24} \end{equation*}
Correlations close to 0 indicate low similarity, while correlations close to 1 indicate high similarity. The cosine similarity measure is the most prominent in the literature. We note that the cosine similarity measure is not strictly a measure of correlation; however, its interpretation is similar to that of a correlation metric, and it is equivalent to the Pearson correlation coefficient for mean-centered vectors.
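This equivalence can be verified directly; a short illustrative NumPy sketch implementing (21) and (22):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity, eq. (21)."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

def pearson(v, w):
    """Pearson correlation coefficient, eq. (22): cosine similarity
    of the mean-centered vectors."""
    return cosine(v - v.mean(), w - w.mean())

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.1, 5.9])
print(cosine(v, w), pearson(v, w))
# After mean-centering, the two measures coincide (up to rounding):
print(np.isclose(cosine(v - v.mean(), w - w.mean()), pearson(v, w)))  # True
```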
Typically, $\mathbf {v}$ and $\mathbf {v}^{\prime }$ are vector representations, such as feature vectors or embeddings, of two different data points in the data set.
In [69], however, the cosine similarity measure was used to compare embeddings of the same data point.
In most applications, inter-point similarities are used, either to remove redundant data points or to ensure diversity in the selected set.
Resource Savings: Similarity-based selection methods are very attractive for introducing information on the relationship between data points in the same data set. This has become very useful in pruning the initial training data set by removing redundant data points prior to training [62], [69] or, in the case of active learning, choosing more efficiently which data points to label [27], [63]. Similarity-based selection has also been implemented for more efficiently selecting the coreset of large image data sets [62], [68], [69], thus saving resources for training image classifiers.
Robustness: The robustness of the selection method depends on the selection policy. If dissimilar points are preferred, outlying data points may also be selected and interfere with the estimation. If similar points are preferred, noisy data points can be filtered out, making the selection more robust against outliers and noise measurements, as seen in [72], [73].
Computational Complexity: Whether a distance or correlation measure is used, one major limitation of similarity-based selection methods is the high computational complexity associated with comparing data points across the data set. This is especially problematic for high-dimensional data and necessitates preprocessing methods for reducing the data dimension, such as feature extraction and possibly dimensionality reduction techniques. Although the largest data reductions can be gained on large data sets, these also require significant computation and memory resources. These methods are therefore generally reserved for offline use, where the high computation and memory requirements can be accommodated. However, efforts have been made to reduce the computational complexity. In [62], a cluster-based algorithm was proposed instead of directly computing the distance between points. The authors in [71] also suggested using K-means clustering to limit storage use. In [67], an algorithm was proposed to approximate the convex hull of high-dimensional data sets. In cases where similarity metrics are only used to compute intra-point similarities, such as between different representations of the same data point, the number of comparisons scales linearly with the size of the data set, which greatly reduces the computational burden.
Required System Knowledge: Correlation measures can be used to ensure higher uniformity and prevent outliers from negatively affecting the results, although they can break down for highly complex and non-linear systems. For example, the similarity measures in (22) and (24) only measure linear correlation between data points; however, nonlinear correlations can go unnoticed. An effective use of correlation thus requires additional knowledge of the underlying system. Furthermore, distance-based selection methods can be effective on clustered data by thinning out data points in the same cluster [63] or only selecting data points within the same cluster [64], [65]. Thus, prior knowledge about the structure of the data set can be beneficial.
Implementation Difficulty: Similarity-based selection methods generally have a high difficulty of implementation, mainly due to the requirement of hardware with a large amount of computational and memory resources. However, when the number of data points to be compared is relatively low, or when inter-point similarities are not considered, these hardware requirements are lessened. Another way to reduce the computation and memory complexity for online applications is to use a time-varying window of recently measured data points, so that only a limited number of comparisons have to be made at a time.
Versatility: As can be seen in Table V, similarity measures are highly versatile and can easily be implemented for a wide variety of applications to effectively reduce the size of the data set. However, one should consider the choice of similarity measure and whether it is appropriate for the specific application. Distance measures were found to be used mainly to ensure diversity, although this may inadvertently favor outlying data points. Similarity-based selection methods can also be readily combined with other selection methods. In particular, similarity metrics have been used in conjunction with confidence measures to ensure the selection of low-confidence points while maintaining a diverse training set [27], [28], [29], [31].
B. Distribution-Based Selection
Distribution-based selection methods define their selection strategy based on the distribution characteristics of the whole data set. The applications, motivation, and intuition resemble those of similarity-based methods; however, while similarity-based methods measure inter-point similarities, distribution-based methods aim to provide a more comprehensive view of the underlying data distribution. The general approach of distribution-based data selection can be seen in Fig. 9. While data selection methods are generally used to reduce data usage as much as possible, the works employing distribution-based selection primarily aim to make the estimation more robust towards outliers and measurement noise. Selected works are summarized in Table VI. Note that data reduction and performance change are not included in the table, since the works did not adequately report these results.
A common approach in distribution-based selection is to use mixture models to estimate the distribution of the data set. The mixture model is typically fitted with the Expectation-Maximization (EM) algorithm, after which data points that fit poorly under the estimated distribution are regarded as outliers and excluded.
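As an illustration of this idea, the sketch below fits a Gaussian mixture with scikit-learn's EM-based GaussianMixture and discards the points that have the lowest likelihood under the fitted distribution. The library choice, the number of components, and the percentile threshold are illustrative assumptions, not choices taken from the referenced works.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
inliers = rng.normal(size=(500, 2))
outliers = rng.uniform(low=-8, high=8, size=(15, 2))
X = np.vstack([inliers, outliers])

# Fit the mixture with EM and score the log-likelihood of each point.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_lik = gmm.score_samples(X)

# Keep points above, e.g., the 3rd percentile of the fitted density.
delta = log_lik > np.percentile(log_lik, 3)
print(f"selected {delta.sum()} of {len(X)} points")
```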
For signal processing applications, the covariance matrix has been shown to be particularly useful for identifying and removing outlying signals. In [77], a two-step data selection procedure was proposed using the GIP, as defined in (20), of each signal using the inverted normalized sample covariance matrix (see [85]). Signals that exhibit a high GIP are then considered outliers. The method was later extended in [78] by considering cases where the statistical characteristics of the signals are unknown and conventional covariance estimates fail. In the case where the data are made up of multiple parallel signals, the authors in [79] proposed a quality metric based on the diagonalization of the sample covariance matrix of those signals, although this process also faces complexity issues. In [80] and [81], the authors leveraged the power spectral density which estimates the power distribution of the signal in the frequency domain. Signals with a high signal-to-noise ratio (SNR) were selected to minimize the influence of measurement noise.
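A minimal sketch of GIP-based outlier screening in the spirit of [77], simplified to a single pass; the synthetic data, the zero-mean assumption, and the percentile threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 200, 4                       # 200 signals of dimension 4
X = rng.normal(size=(N, n))
X[:5] *= 6.0                        # a few high-power outlying signals

# GIP of each signal with itself, eq. (20), with M set to the inverted
# sample covariance; unusually high GIP values flag statistical outliers.
R = (X.T @ X) / N                   # sample covariance (zero mean assumed)
R_inv = np.linalg.inv(R)
gip = np.einsum("ij,jk,ik->i", X, R_inv, X)

delta = gip <= np.percentile(gip, 95)   # drop the highest-GIP signals
print(f"kept {delta.sum()} of {N} signals")
```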
Resource Savings: As mentioned before, the main objective of distribution-based selection methods is to improve overall robustness towards outliers, rather than reducing the data usage as much as possible. Thus, the resource savings obtained from distribution-based methods are lower relative to those obtained by other selection methods.
Robustness: Distribution-based selection yields excellent robustness towards outliers.
Computational Complexity: Using the statistical information of the distribution to calculate good quality measures can be computationally complex and may not be practical for large data sets or online implementation without supporting algorithms to alleviate the computational burden. In particular, using the GIP requires an inversion of the covariance matrix, increasing the computational complexity. However, if the distribution of the data set is known or computed beforehand, the computational complexity of the selection method drops drastically.
Required System Knowledge: The methods used in distribution-based selection typically require assumptions on the distribution of the data set which may not hold in practice. This is particularly true for the EM algorithm, which assumes a distribution of the data to be fitted. Usually, a Gaussian mixture distribution is assumed [76], [77], [78], [79], although in [75] a Gamma mixture model was used.
Implementation Difficulty: Distribution-based methods generally have a low implementation difficulty due to the use of mature algorithms such as the EM algorithm. Additionally, once the distribution of the data set is characterized, point-wise selection and filtering can be implemented fairly easily, even in low-complexity hardware.
Versatility: Distribution-based selection is highly versatile over different domains, as can be seen in the relevant works. While the selection method is mainly used for the removal of outliers, it can also be paired with other selection methods to further improve the robustness of the selection algorithm, making it a powerful tool for any data selection algorithm, in particular for distilling AI training sets in order to make more robust models.
Resource-Aware Selection
Resource-aware data selection is relevant when auxiliary data related to resource consumption are available or when reasonable assumptions can be made about resource costs. Relevant resource data include energy consumption, required storage, and number of computations, among others, and are especially pertinent when energy and data efficiency are prioritized. Resource-aware selection methods differ from the other selection methods seen so far in that they do not aim to define a measure of quality, but rather put the quality measure in perspective of the associated costs of selecting the data. More simply, resource data are utilized directly in the selection policy and not in the quality metric. This makes resource-aware data selection highly versatile, as it can readily be combined with all other selection strategies in cases where resource data are available. Fig. 10 shows how resource costs can be used in the general a priori data selection framework, but a similar extension can be made for a posteriori data selection methods. Table VII shows relevant works that incorporate resource-aware data selection. Note that these works have been introduced in previous sections; however, in this section, we focus on how they use resource data in their selection policy.
For a data set $\mathcal {D} = \lbrace D_{1}, D_{2}, \ldots, D_{N}\rbrace$, let $\mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot )$ denote the overall quality of the data selected by the selection vector $\boldsymbol{\delta } = [\delta _{1}, \delta _{2}, \ldots, \delta _{N}]^{T}$ and let $\mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot )$ denote the associated resource costs. Three types of selection policies were identified in the relevant works:
Weighted optimization [44], [80]:
\begin{equation*} \boldsymbol{\delta }^{*} = \underset{\boldsymbol{\delta }\in \lbrace 0,1\rbrace ^{N}}{\text{arg max }} \mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot )-\gamma \mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot ) \tag{25} \end{equation*}
where $\gamma$ is a user-defined weighting parameter.
Cost-constrained optimization [27], [30], [31], [36], [45], [46]:
\begin{align*} \boldsymbol{\delta }^{*} =\; & \underset{\boldsymbol{\delta }}{\text{arg max }} \mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot )\\ & \text{s.t. } \mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot ) \leq \tau \tag{26} \end{align*}
where $\tau$ is a user-defined threshold.
Quality-constrained optimization [81]:
\begin{align*} \boldsymbol{\delta }^{*} =\; & \underset{\boldsymbol{\delta }}{\text{arg min }} \mathcal {C}(\mathcal {D};\boldsymbol{\delta };\cdot )\\ & \text{s.t. } \mathcal {Q}(\mathcal {D};\boldsymbol{\delta };\cdot ) \geq \tau . \tag{27} \end{align*}
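Solving (25)-(27) exactly is combinatorial in general. In the simplified case where quality and cost decompose additively over data points, the cost-constrained problem (26) admits a simple greedy approximation, sketched below with illustrative data and names:

```python
import numpy as np

def greedy_cost_constrained(quality, cost, budget):
    """Greedy approximation of eq. (26) for additive quality and cost:
    add points in order of quality-per-cost until the budget is spent."""
    order = np.argsort(-quality / cost)          # best ratio first
    delta = np.zeros(len(quality), dtype=int)
    spent = 0.0
    for i in order:
        if spent + cost[i] <= budget:
            delta[i] = 1
            spent += cost[i]
    return delta

rng = np.random.default_rng(3)
q = rng.random(10)                  # per-point quality scores (method-specific)
c = rng.uniform(1, 3, 10)           # per-point resource costs
print(greedy_cost_constrained(q, c, budget=8.0))
```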
Weighted optimization is most commonly used in sensor selection applications, where there can be significant trade-offs between the overall quality of the data and its transmission cost, since sending more data usually leads to better estimation performance but increases energy consumption. The reported results show a significant reduction in communication costs for little to no increase in estimation error [80], [86] and increased estimation robustness in high-noise environments [87].
Other works use a cost-constrained optimization function, as shown in (26), to maximize the quality of the selection while keeping the cost at an acceptable level. Simple assumptions can be made in cases where the actual resource costs are unknown. For example, in wireless sensor networks, one typical assumption is that the energy consumption is the same for all sensors. Thus, the sensor selection problem is reduced to finding a sparse set of sensors that maximizes the estimation performance such that the number of selected sensors does not exceed a given budget.
In quality-constrained optimization, the focus is to minimize overall resource costs while ensuring acceptable data quality. An example of this can be found in [81] where they search for the subset of microphones that provide the lowest resource cost while ensuring that the microphones selected provide a reasonable SNR.
A different approach was proposed in [74], which does not utilize cost optimization. Instead, the approach uses the perceived quality of the data, the associated cost of transmission, and the residual energy of each sensor to prescribe probabilities of transmitting data, the idea being that sensors with low residual energy should not send their data as often, in order to save energy and prolong their lifetime.
Resource-aware strategies provide an exciting extension to conventional data selection methods by directly utilizing resource costs in the selection decision. While the relevant works presented mainly focus on energy costs related to communication from edge devices, the idea can be extended to applications such as artificial intelligence where model inference and/or training require considerable computational and energy costs. However, suitable assumptions on resource costs are needed to ensure an efficient selection policy which requires more knowledge about the physical system and its energy usage. Moreover, turning the selection policy into an optimization problem can incur higher computational complexity, compared to simpler threshold-based policies, and thus needs to be justifiable based on the overall energy savings.
Discussion
In this section, the selection strategies introduced in the previous sections will be discussed along with future potentials and challenges for data selection as a whole. A summary of the advantages and disadvantages for each data selection method can be found in Table VIII.
A. Comparative Analysis
In this section, the reviewed data selection methods are compared with respect to the performance metrics described in Section III-C. Each method has been assigned a qualitative score for each metric based on the analysis of the method in Sections IV and V. The scores range from 1 to 5, with 1, 3, and 5 denoting low, medium, and high, respectively. For example, methods with relatively low computational complexity receive a low complexity score, while methods with high versatility across applications receive a high versatility score. A priori confidence-based data selection methods were excluded from the analysis due to the limited amount of research on this method. Additionally, resource-aware selection methods are absent because they are not directly comparable to the other methods. Radar diagrams for model-guided and data-driven methods can be seen in Fig. 11.
Fig. 11. Radar diagram of the scores for each data selection method, except a priori confidence-based and resource-aware data selection.
For model-guided data selection methods, there is a significant trade-off between the required a priori knowledge of the system and the versatility of the method across different applications. This is largely because confidence-based and learning-based selection methods build on conventional ML methods, which are already highly versatile. This also means higher potential resource savings, depending on the specific application. For learning-based selection methods specifically, the implementation difficulty is much higher than for a posteriori confidence-based selection methods. The reason is that implementing learning-based selection may add the need for acceleration hardware or memory storage to a modeling process that may not otherwise need this equipment, whereas adding a posteriori confidence-based selection to an ML pipeline should not require additional hardware. Nevertheless, the higher hardware requirements for learning-based and confidence-based selection, whether in the selection algorithm, the model, or both, may restrict them to offline applications where these needs can be accommodated, although this, of course, depends on the complexity of the task at hand.
For error-based and information-theoretic selection methods, a great deal of prior system knowledge is required for an effective implementation of the selection method. However, this is generally rewarded by a highly robust selection algorithm. The relatively low versatility of these methods is primarily due to the rather simple models on which they have been applied so far; however, exploring how these methods can be applied to more complex models, such as in [47], can potentially increase their versatility in the future.
In the case of data-driven selection methods, their usability for online applications depends on the size of the data set and the overall goal of the selection. For instance, distance-based similarity measures and distribution-based methods may be useful for pruning a training data set for AI training, although the generally large size of a typical training set restricts such use to offline cases. For online applications where data storage comes at a premium, only a limited number of data points may be stored at a time, giving only a local view of the overall distribution of the measured data. Correlation-based data selection can be useful for online applications, as seen in [73], when intra-point correlations are used. However, as discussed in Section V-A, the efficacy of correlation measures depends on the system, which restricts the method's overall versatility.
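As a minimal example of distance-based pruning for an offline training set, the following sketch greedily keeps a point only if it lies sufficiently far from all previously kept points, discarding near-duplicates that add little to the set. The feature dimension, data, and distance threshold are illustrative assumptions.

```python
import numpy as np

def greedy_distance_prune(X: np.ndarray, min_dist: float) -> list[int]:
    """Greedy distance-based pruning: keep a point only if it is at least
    `min_dist` (Euclidean) away from every point kept so far."""
    kept: list[int] = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))          # hypothetical 2-D feature vectors
idx = greedy_distance_prune(X, min_dist=0.4)
print(f"kept {len(idx)} of {len(X)} points")
```

The quadratic cost of the pairwise distance checks is one reason such methods suit offline pruning better than online selection on storage-constrained devices.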
In general, employing a priori selection methods such as information-theoretic, learning-based, and similarity-based methods can potentially yield greater resource savings, since these methods do not require model inference prior to selection. However, not relying on the model prediction may increase the computational complexity of the selection method itself, so great care must be taken when deciding whether a priori selection is applicable to the task. If the selection algorithm consumes more computational resources than it saves by avoiding model inference, the overall resource cost may exceed that of a posteriori selection, or of performing no selection at all, as the sketch below illustrates.
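This break-even reasoning can be made explicit with a small cost model. In the sketch below, an a priori policy pays its selection cost on every candidate point but only pays inference cost on the selected fraction; the costs and selection rate are hypothetical numbers chosen solely to show both the favorable and the counterproductive case.

```python
def net_saving_per_point(p_select: float, c_select: float, c_infer: float) -> float:
    """Expected saving per candidate point for an a priori policy:
    every point pays the selection cost c_select, but the inference cost
    c_infer is only paid for the fraction p_select that is kept."""
    return (1.0 - p_select) * c_infer - c_select

# Hypothetical costs in millijoules per point.
print(net_saving_per_point(p_select=0.3, c_select=2.0, c_infer=10.0))  # +5.0 -> worthwhile
print(net_saving_per_point(p_select=0.3, c_select=9.0, c_infer=10.0))  # -2.0 -> counterproductive
```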
In most cases, the appropriate selection algorithm and the intended outcome of the selection are dictated by domain-specific nuances. For example, in LLM training, certain phrases or words may be filtered out of the training process to reduce undesirable model behaviors such as bias and toxicity [12]. Thus, when designing the selection algorithm, one should always consider the given task, its goal, and its limitations.
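As a toy example of such domain-specific filtering, the sketch below excludes training documents containing blocked phrases, in the spirit of the LLM filtering discussed in [12]. Production pipelines are far more elaborate, typically combining many heuristics with learned quality classifiers, and the blocklist here is purely a placeholder.

```python
# Hypothetical blocklist-based filter for LLM training text.
BLOCKED_PHRASES = ("example toxic phrase", "example biased phrase")

def keep_for_training(document: str) -> bool:
    """Exclude a document if it contains any blocked phrase (case-insensitive)."""
    text = document.lower()
    return not any(phrase in text for phrase in BLOCKED_PHRASES)

corpus = ["A helpful paragraph about batteries.",
          "Some text with an EXAMPLE TOXIC PHRASE in it."]
training_set = [doc for doc in corpus if keep_for_training(doc)]
print(len(training_set))  # 1
```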
B. Potentials, Challenges, and Future Directions
Data selection is an important tool for the energy-efficient use of data and, as seen from the analyzed literature, it is widely applicable across engineering fields. In particular, data selection is highly useful for reducing communication resources by determining which data are useful enough to transmit. This thinking is one of the cornerstones of the next generation of wireless communication and sensing, known as 6G. One of the hot topics in 6G is semantic communication, which is concerned with transmitting only the information relevant to a specific task [89]. Data selection therefore lends itself well to many 6G applications such as satellite sensing and computing, smart home applications, robotic communication, cloud-based digital twins, and compression. Another exciting application area is the electrification of vehicles such as electric cars, trucks, buses, trains, ships, and aircraft, which depend on on-board diagnostics to ensure safe battery operation and prolong battery lifetime. Minimizing the energy usage of diagnostic tasks frees up battery capacity and extends the range of operation.
As seen in many of the reviewed methods, data selection can significantly reduce the amount of training data needed to train a machine learning model, making AI training far more efficient. This is especially relevant given the emergence of large language models such as OpenAI's ChatGPT and Google's Gemini [90], which require vast amounts of data and resources for training [91], [92], and for applications that work with high-dimensional data such as image and video processing. Identifying non-informative training data can dramatically reduce the energy costs of future training and AI research.
Resource-aware selection is especially interesting because it can adapt existing selection policies to be more conscious of the costs associated with transmitting, storing, and/or using the data and, as seen in Section VI, can dramatically decrease resource usage for a small decrease in estimation performance. This is particularly relevant in communication applications, where transmitting data can be costly. In semantic communication, the closely related concept of Value of Information (VoI) has been introduced, which is directly connected to resource-aware selection [93]. However, implementing resource-aware methods requires more knowledge of the costs inherent in the system as well as more complex selection policies.
As discussed above, the trade-off between system knowledge and the versatility of the selection method can be a significant challenge, whether for the computational hardware or for the designer, so the selection method should be chosen based on the specific application and its requirements. For online applications, substantial system knowledge may be needed to obtain good results, which can be challenging for highly complex systems. Conversely, learning-based methods are promising for complex systems, but their efficacy depends strongly on their training data, which should be representative of the data expected in operation. Learning-based methods can also accelerate the discovery of new measures for quantifying data quality by incorporating candidate measures into the selection model. Finding new quality measures that are simple to implement and accurately reflect the quality of the data should be of paramount importance for future research in data selection. Other measures, not specific to the structure of the data, can also be included in the selection policy; for example, the total time elapsed since the last model update can be used to avoid long periods with no updates, as sketched below.
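A simple way to incorporate such a staleness measure is to wrap an existing quality-threshold policy with a timeout, as in the following sketch; the threshold and idle time are hypothetical parameters.

```python
import time

class TimeoutAugmentedSelector:
    """Wraps a quality-based selection policy with a staleness rule:
    if no point has been selected for `max_idle_s` seconds, the next point
    is selected regardless of its quality score, avoiding long periods
    with no model updates."""

    def __init__(self, quality_threshold: float, max_idle_s: float):
        self.quality_threshold = quality_threshold
        self.max_idle_s = max_idle_s
        self._last_selected = time.monotonic()

    def select(self, quality: float) -> bool:
        stale = time.monotonic() - self._last_selected > self.max_idle_s
        if quality > self.quality_threshold or stale:
            self._last_selected = time.monotonic()
            return True
        return False

selector = TimeoutAugmentedSelector(quality_threshold=0.8, max_idle_s=60.0)
print(selector.select(quality=0.3))  # False now, but True once 60 s have elapsed
```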
Conclusion
In recent years, data selection has been shown to be highly effective in enabling data-efficient modeling, yielding good estimates of system parameters while reducing energy costs. In this study, a comprehensive overview of data selection methods has been provided, together with an analysis of relevant works across various fields of engineering. Most selection methods can be characterized as model-guided or data-driven, depending on the information used to measure data quality; model-guided methods were further classified as a priori or a posteriori, depending on the time of selection relative to model inference. The different selection methods were analyzed and compared using six key metrics, including the potential resource savings, the complexity of the selection process, and the versatility of the selection method across applications. Owing to the diverse set of quality metrics and selection policies, data selection is highly versatile across engineering disciplines and can be readily deployed in many systems, although the choice of method depends on the requirements of the application. Moreover, resource-aware selection methods have proven particularly promising for ensuring resource efficiency across applications. It is the opinion of the authors that data selection will play a vital role in the age of Big Data and be integral to associated technologies such as the Internet of Things, smart transportation, AI research and development, 6G, on-board system modeling, and more.