A Systematic Survey of Data Value: Models, Metrics, Applications and Research Challenges

Data is central to modern decision making and value creation. Society creates, consumes and collects data at an increasing pace. Despite advances in processing power, data is expensive to maintain and curate. It is therefore imperative to have methods and tools to distinguish between data based on its value. Yet, there is no consensus on what characterises the value of data or how this data value should be assessed. This results in heterogeneous data value models and inconsistent measurement techniques that are siloed in specific application domains, which limits the formalisation and exploitation of these concepts. We present in this paper a methodical literature analysis that discusses data value models, assessment metrics and current applications. We also highlight challenges hindering the development and exploitation of data value as a concept. This leads to the identification of a set of research questions to help researchers contribute to this emerging field. The aim of this article is to stimulate further research and deployment of quantitative data value models and value-driven applications.


I. INTRODUCTION
Data has become an indispensable commodity for society [1]. For example, the European Union has given high importance to data from strategic, legal and regulatory perspectives.¹ Big data, and especially big data analytics, is increasingly prevalent as a driver of business value [2]. This is part of the trend where intangible assets have become an important source of value for businesses [3]. These data assets can range from market intelligence reports to sensor readings. Such data assets can be acquired or generated internally, either directly or as a by-product of offering goods or services [4].
Data lags behind other intangible assets by not being added to the balance sheet [5], and part of the reason for this is the challenge in assigning a value to the data [6]. The importance of valuing intangible assets, however, is frequently pointed out in relevant research [5], [7], [8], [9], [10]. The EU Open Data Directive² stresses the importance of high-value datasets, identified as datasets covering a number of domains that provide important benefits to society, the environment, and the economy through their use and reuse. But what exactly is meant by ''data value''? Many publications explore this term [11], [12], [13], but there is currently no consensus on the definition of data value, its component dimensions or on how this data value can be assessed or quantified.
¹ https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/european-data-strategy_en
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
A major challenge in valuing data assets is the heterogeneous nature of value, often broken down into dimensions of value [14]. This is exacerbated by the observation that data value, like data quality, is both subjective and highly context dependent [13], [15], [16]. In the literature numerous approaches are considered to characterise data value. For example, Moody and Walsh [17] describe 7 ''laws'' of information which are applicable to data assets: 1) Information [data] is infinitely sharable; 2) The value of information [data] increases with use; 3) Information [data] is perishable; 4) The value of information [data] increases with accuracy; 5) The value of information [data] increases when combined with other information [data]; 6) More is not necessarily better; and 7) Information [data] is not depletable. These properties contrast strongly with those of tangible assets, which have a more easily defined value in use or exchange. Other early authors consider additional dimensions of the value of data, including the resulting business impact of new data [18], usage [6], and monetary costs or benefits [19].
The conceptual heterogeneity of data value leads to numerous approaches to quantify the value of data. For example, cost-based, market-based, and income-based approaches are frequently used to monetarily valuate intangible assets and data assets [9], [20], [21]. Data value measurement has also been applied in numerous domains as part of decision support or automated systems. For instance, consider data valuation for information life cycle management [22], [23]. In this context, data valuation becomes an indispensable tool for effectively governing the entire trajectory of data, from its inception to archival or disposal. By assigning a tangible value to data at each stage, organisations can make well-informed choices about data retention, archival frequency, and disposal strategies. This not only optimises data resource allocation but also ensures that valuable information is appropriately managed throughout its lifecycle. Another example is data markets and data pricing [24]. Much like traditional marketplaces where goods and services are traded, data markets have emerged as platforms where data can be exchanged, bought, and sold. The value attached to various datasets forms the basis for these transactions. Organisations seeking specific datasets can engage in a market-driven approach to acquire the data they require, thereby fostering a data-driven economy. This approach relies heavily on accurate data valuation to facilitate fair and mutually beneficial transactions. Business decision making in agriculture [25] is another example. With advancements in precision agriculture and data-driven decision making, farmers now leverage data to optimise crop yield, manage resources efficiently, and make informed business choices. Data value assessment allows farmers to prioritise which data streams are most critical for their operations, ensuring that investments in data collection and analysis align with tangible benefits. This is pivotal in a sector where data-driven insights can directly impact productivity and profitability. Spanning this body of work requires an interdisciplinary approach, as it draws on information systems deployments in many fields as well as computer science research.
There are many reasons beyond balance sheet concerns to quantify the value of the data itself, and to treat data independently from specific data-driven applications or businesses. These include the following: data can be shared between many applications or even organisations; it is more efficient to develop reusable assessment methods, tools and metrics that treat data in an application-independent fashion, as seen with data quality assessment methods [26]; and common models for data value will enable heterogeneous tools and data value-consuming applications, such as value-driven data governance systems, to operate seamlessly in terms of data value quantification [27], [28]. The literature identifies many purposes for measuring the value of data: for example, to provide knowledge of the value of data/information as an asset for merger negotiations [12]; to improve organisational accountability for data by raising awareness [5]; to justify the costs of creating, maintaining or purchasing data [29], [30]; to identify relevant data for an application [5], [31]; and to enable data-driven decision-making about data, such as file retention [32]. Most of the time, data's true value is not recognised or exploited as an asset by organisations [5], [10], [17]. This is particularly evident in the latest trend of hoarding data [33], where organisations blindly capture all possible data with little means to discriminate between the data being accumulated, despite the fact that data storage still comes with costs [34]. Robust, easily interpretable data value assessment techniques will give us the tools to address this problem.
It is notable that despite the breadth of literature available on data value, only two publications provide a comprehensive overview through systematic surveys: (1) Viet et al. [35] surveyed works published between 2006 and 2017 and focused on the related concept of value of information (VoI) in supply chain decisions; (2) Alawad and Kraemer [36] authored a systematic review focusing on VoI in wireless sensor networks and the Internet of Things. Both of these papers are limited to their respective application domains. In addition, these works did not examine other related terms for VoI, such as data value, information value, data valuation, or information asset. Other publications take a more unstructured approach to reviewing the literature. For example, Faroukhi et al. [37] presented an unstructured survey of the literature related to data monetisation in big data value chains. Again, this work was limited in scope, and we eliminate data monetisation from our focus to minimise the overlap. Yanlin and Haijun [38] provide a timely but non-systematic survey of data value concepts focusing only on the Chinese literature. Fleckenstein et al. [39] provide a useful framework of approaches to data valuation models that we reuse, but their work is not supported by a systematic search as presented here, lacks details on assessment metrics useful for automation, and instead focuses on qualitative approaches to assessment.
This shows that there is both a need and a gap in the literature for a wider structured survey aimed at unifying the field and identifying a wide-ranging research agenda. In this paper we provide a systematic survey that comprehensively analyses the existing literature covering the domain of data value in terms of data value models, assessment metrics, and applications.
The rest of this paper is structured as follows: Section II defines some terminology and concepts used throughout the paper. In Section III, we describe our research method for the systematic survey. Section IV provides an analysis of the primary studies resulting from the systematic survey with a focus on data value models, metrics and applications. We discuss our research agenda and the highlighted relevant research questions in Section V, and we provide our conclusions and future work in Section VI.

II. TERMINOLOGY AND CONCEPTS
In this section we provide a number of definitions of terms related to the topic of data value and used throughout this article. This is to provide consistency to the discussion and more broadly to the domain of data value research.

A. DATA
The Oxford dictionary defines data to be ''[f]acts and statistics collected together for reference or analysis''.³ Data is therefore facts about the world. Data can be collected (e.g. using sensors or surveys) or calculated (e.g. age from date of birth). Data can be represented in different formats, including text, images, spreadsheets, JSON, etc. Here we do not discuss the distinction between data, information, and knowledge, and for the sake of incorporating all relevant literature, we include all three terms in the scope of our search.

B. VALUE
In the generic sense of the term, the Oxford dictionary defines value to be ''[t]he regard that something is held to deserve; the importance, worth, or usefulness of something''.⁴ While this applies in most, if not all, contexts, different disciplines have more specific definitions. For example, in economics, the definition and measurement of value would be in terms of currency. Stern [40] identified two kinds of value in the context of natural resource scarcity indicators: use value and exchange value. Prices and rents are common measures of exchange value, and unit costs are a measure of use value [40]. Other definitions, such as sentimental value, would be in terms of personal or emotional associations rather than material worth. It is evident that these varying definitions are tied to the subjective and contextual nature of value. With the aim of characterising the latter concept, we define value to be a number of different data value dimensions (attributes) that in an aggregate manner represent the worth or return of the thing in question.

C. DATA VALUE
There is a wide range of definitions of data value across different domains and for specific use cases in the literature. For instance, in machine learning (ML), data value is the weight or contribution of each training sample or feature in improving a model's performance [41], [42], [43]. In applied energy, data value is defined as the quantitative relationship among the data, uncertainty reduction, and profit enhancement [44]. In business, data value is typically estimated in terms of cost (e.g. collection cost, storage cost, or cost related to the loss of the data) and revenue from selling or exploiting the data [39]. This is the approach Laney takes with his financial valuation models of information [data] [5]. There are also some definitions of data value which are more general and can be applied to multiple domains. For example, Khokhlov and Reznik [45] defined data value as data usefulness. Another such definition is that data value is the future importance of data, expressing a probability of further use [46], [47].

D. DATA VALUE MODEL
Data value models are representations of the value of data, either as explanatory, descriptive or predictive models. These representations define the relevant data value dimensions and the relationships between the dimensions that characterise the value of data to an individual, application or organisation in a specific context.

E. DATA VALUE DIMENSIONS
Attributes of the value of data assets that are relevant to data consumers, maintainers or owners. Sometimes called data value aspects. Due to the subjective and contextual nature of value, some dimensions may be considered to be more characteristic of value than others, depending on the use case, data asset, or consumer.

F. METRICS
Metrics are specified quantitative measures of data or its context that can be used to measure the data value dimensions of a specific data item. For instance, if we consider ''Usage'' as a dimension in a relational database, a metric that measures this dimension could be ''Number of writes in a day''. Metrics can be subjective or objective and qualitative or quantitative. All metrics can be mapped to one or more data value dimensions in a descriptive or predictive data value model. A set of observations of data value metrics for a specific data item quantifies or measures the data value dimensions to which those metrics are mapped.
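The relationship between metrics, dimensions and observations described above can be illustrated with a minimal Python sketch. It is not drawn from any of the surveyed papers: the class names, the ''Analyst-rated relevance'' metric, its mapping to two dimensions and the data item name are hypothetical, and only the ''Number of writes in a day'' metric comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A named measure mapped to one or more data value dimensions."""
    name: str
    dimensions: tuple   # e.g. ("Usage",)
    objective: bool     # True if measured automatically, False if judged by a person


@dataclass(frozen=True)
class Observation:
    """One measured value of a metric for a specific data item."""
    metric: Metric
    data_item: str
    value: float


def group_by_dimension(observations):
    """Collect (metric name, value) pairs under every dimension the metric maps to."""
    grouped = {}
    for obs in observations:
        for dimension in obs.metric.dimensions:
            grouped.setdefault(dimension, []).append((obs.metric.name, obs.value))
    return grouped


# Hypothetical metrics; only "Number of writes in a day" appears in the text.
writes_per_day = Metric("Number of writes in a day", ("Usage",), objective=True)
rated_relevance = Metric("Analyst-rated relevance (1-5)", ("Utility", "Content"), objective=False)

observations = [
    Observation(writes_per_day, "orders_table", 120.0),
    Observation(rated_relevance, "orders_table", 4.0),
]
by_dimension = group_by_dimension(observations)
```

A metric that maps to several dimensions simply appears under each of them, which mirrors the statement that a metric can be mapped to one or more dimensions.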

G. DATA VALUE QUANTIFICATION
The quantification, assessment or measurement of data value is the explicit calculation of the value of a specific data item based on a descriptive or predictive data value model. It is usually based on a set of observations of a set of data value metrics. Data value metrics are the specific measurable elements used within the process of data value quantification, which involves using these metrics to quantify, assess, and make judgements about the value of data. Expert judgement can also be used in subjective and less formal quantification methods. The focus of this survey is on more formal methods based on observations, which may themselves be objective or subjective.
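As an illustration only, the step from metric observations to a single value can be sketched as follows. The min-max normalisation, the per-dimension mean and the weighted mean across dimensions are assumptions chosen for simplicity; the surveyed models aggregate in different ways, and every metric name, range and weight below is hypothetical.

```python
def normalise(value, low, high):
    """Min-max scale a raw observation to [0, 1], clamping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def dimension_score(observed, ranges):
    """Mean of the normalised observations belonging to one dimension."""
    scores = [normalise(v, *ranges[name]) for name, v in observed.items()]
    return sum(scores) / len(scores)


def data_value(dimension_scores, weights):
    """Weighted mean of dimension scores; weights encode context-specific importance."""
    total_weight = sum(weights[d] for d in dimension_scores)
    return sum(weights[d] * s for d, s in dimension_scores.items()) / total_weight


# Hypothetical observations for one data item.
usage = dimension_score(
    {"writes_per_day": 50, "reads_per_day": 400},
    {"writes_per_day": (0, 100), "reads_per_day": (0, 1000)},
)
value = data_value({"Usage": usage, "Financial": 0.8},
                   {"Usage": 1.0, "Financial": 3.0})
```

Because value is subjective and context dependent, the weights are the natural place to encode the perspective of a particular consumer or application; changing them changes the resulting value without changing the observations.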

III. REVIEW TECHNIQUE
In this survey, we follow a methodical literature survey technique with three phases of activities: (i) actively planning, (ii) conducting and reporting the review results, and (iii) exploration of research challenges, as per the widely accepted guidelines and process outlined in Pai et al. [48] and Kitchenham et al. [49], [50]. The remainder of this section details the research questions, the process for the identification of research, and the data extraction process.

A. RESEARCH QUESTIONS
The goal of this survey is to provide a comprehensive overview of the literature that provides discussions on the value of data, and its models, dimensions, assessment techniques, metrics and applications. We therefore define the following as research questions:
• Q1: What are the existing models and dimensions used to characterise the value of data?
• Q2: What metrics and measurement approaches have been created for the quantification of the value of data?
• Q3: In what application domains have data value models or assessment techniques been applied?
• Q4: What are the issues and challenges facing the data value research?
The research questions are designed to cover a wide range of aspects related to data value, from theoretical models and measurement techniques to practical applications and challenges, aligning with our goal of providing a comprehensive overview of the existing literature in this area.

B. IDENTIFICATION OF RESEARCH
Research was identified from the following four electronic databases: (i) SpringerLink, (ii) ScienceDirect, (iii) IEEE Xplore, and (iv) ACM Digital Library.
Figure 1 shows the search string used to query these databases based on metadata (title, abstract, keywords). In the case of IEEE Xplore, the query did not produce sufficiently accurate results. As such, the query for IEEE Xplore was rerun against both metadata and full texts. In addition to the papers returned by the above databases, we also included some handpicked papers [6], [29], [32], [46], [51], [52], [53], [54], [55], [56] that were recommended by domain experts and were not returned by the search string.
The initial search returned a total of 434 research papers related to the research topic. A set of inclusion and exclusion criteria, shown in Table 1, was defined so that the selection of papers to include in this study could be carried out in a systematic and replicable manner. In line with Kitchenham et al. [49], [50], three researchers independently screened titles. The title-based exclusion reduced the number of papers to 121. Then, the papers' abstracts were read. The abstract-based exclusion reduced the number of papers to 73. In the next phase, the full text of the papers was read. In all three phases, in case of disagreements on whether a paper was to be included or discarded, discussions were held until an agreement was reached. Following the application of these criteria, 63 research articles were included in the final review. These are listed in Table 2.

C. DATA EXTRACTION
The papers were manually reviewed by the three authors independently. For each of the 63 papers, the following data was extracted: bibliographic data, the contribution towards the domain of data value (e.g., data value model, data value assessment, or data value use case/domain), implementation details, the data value dimensions and metrics (if applicable) under study, and the type of validation used (if applicable). The data was then compared and aligned, with discussions taking place if any inconsistencies were found.

IV. REVIEW RESULTS
In this section, we analyse the papers obtained from our systematic review. We start by reviewing their bibliographic data, such as the number of publications per year and the publication venues. Then we delve into their content: data value models, data value assessment metrics, and applications.

A. PRELIMINARY RESULTS
We see in Figure 2 (extracted from Table 2) that most of the publications (34 papers) are conference papers, 21 are journal papers, 3 are symposium papers, 3 are workshop papers and 2 are books. We can also see that research in the domain of data value goes back as far as 1980. However, most of the selected articles were published in the period starting from 2003.⁵ We notice an increasing number of works from one year to the next, especially in the last decade. This reflects the increase of interest in researching the data value field. The same pattern can be seen in terms of type of publication, where more diversity can be seen in recent years, including journals, conferences, workshops and symposiums. An analysis of publication venue (in Table 2) suggests that there is no common venue for publication in this field. In fact,

B. DATA VALUE MODELS
Fleckenstein et al. [39] identified three categories of approaches to data valuation: market-based valuation, economic models, and dimension-based models. They define the market-based approach as one that ''estimates the value of data in terms of cost and revenue when buying and selling data or data-intensive businesses'', and economic models as ''estimating economic benefit as a result of making data available [measuring impact]''. The dimension-based approach ''examines valuation points of a specific data set both inherent to data, like data quality . . . and contextual to value [of] data [e.g. data usage]''.
This work fits within the dimension-based approach of Fleckenstein et al. It focuses on new metrics and dimensions that are specific to data value and not already known from the data quality metrics literature. For example, 26 data quality dimensions and over 80 associated metrics are described by Zaveri et al. [26], and 21 data quality dimensions are identified in the recent review of data quality dimensions by Wang et al. [99]. The most commonly referenced data quality dimensions in the papers sampled for our data value survey were accuracy [17], [35], timeliness [35], [51], completeness [35], [45], latency [70], volume [32], [47], [96] and provenance [45], [70]. Note that in many models volume has an inverse or convex relationship to value [17].
The research approach in this paper yielded 19 primary studies that focus on models including one or more dimensions which characterise data value, as shown in Table 3. All previously known data quality dimensions [26], [99] are grouped in the table into ''data quality''. Four new dimensions are identified: Content/Uniqueness, Usage, Utility and Financial. The most popular dimension cited relates to examining the content or uniqueness of the data and its relevance for a task. The Content dimension of data value was first identified by Even and Shankaranarayanan [32] in 2005. Usage has been a well-established dimension of value since Moody and Walsh [17] defined it as a distinguishing feature of data value, and it is classed as part of the context of value by Fleckenstein et al. Utility is perhaps the original dimension of data value [31]. We extend the dimensional approach to include one market-based model aspect of Fleckenstein et al., a financial dimension that represents measurable aspects of data within an organisation like cost or price.
As can be seen from the table, most current models of data value are limited in their perspective since they only focus on a subset of the data value dimensions, such as the financial dimension, and they do not provide a comprehensive view of how dimensions are related. One feature of Table 3 is that the majority of models address three or fewer dimensions of data value.
It may be asked whether any of these dimensions are antecedents of (factors that influence) value rather than dimensions of value itself. Examining Wang's original antecedents (management responsibility, operation and assurance costs, research and development, production, distribution, personal management, and legal), it seems there is some overlap, e.g. in production costs. However, this neglects the contention of Fleckenstein et al. and Laney [5] that data value, even in a dimensional approach, is a wider concept than data quality. Additionally, metrics have been identified in this search for all of the dimensions, and this suggests that they are directly useful for data value calculations.
In the following subsections we provide an overview of the models covered in the mentioned primary studies, based on the main dimension that, according to the authors, the data value would contribute towards. For each dimension we provide a table of data value assessment metrics identified in the search to facilitate reuse in new applications or further research. As has been long established for data quality, to assess specific data assets, it is necessary to define metrics to quantify or measure data value dimensions [100]. Only about half of the selected papers (31 out of 63) provide specific data value metrics, and this shows the relative immaturity of the field. In the tables below the metrics are classified as being suitable for subjective or objective measurement in the ''Type'' column. In total 44 metrics for data value assessment beyond those typically used for data quality assessment have been identified and allocated to data value dimensions. The Utility, Content-Uniqueness, Usage and Financial dimensions have 14, 10, 10 and 9 metrics identified respectively. These are candidates for the biggest departure from traditional data quality dimensions for data value assessment. Three metrics are used in more than one dimension (rival access loss, camera resolution and market price). Camera resolution is an example of a common sensor-based metric that can quantify the likely information content of data and therefore its value.
Moreover, we also provide a distinct section for market-based VoI models (12 papers), which, following Fleckenstein et al., we treat as a separate type of data value model (economic), since they focus on the cost/benefit difference in outcome as opposed to specific dimensions of data value.

1) ECONOMIC OR FINANCIAL-BASED MODELS
The financial or economic dimension is one of the most popular dimensions used to determine the value of a data asset. This is probably due to the tangible nature of this aspect. This dimension is based on the metrics included in the ''Accountancy Valuation Models'' for data (information) value of Moody and Walsh [17] or the financial models of Laney [5], which include the realised or potential cost of data, the market value of data, and the present financial value of data. Zhang et al. [90] propose a theoretical capitalisation of data assets, based on the historical cost method, the fair value, and the current value. Li et al. [81] focus on data pricing, where an entropy-based method is proposed to measure the value of a dataset based on size and information content. A pricing function is then provided to convert from entropy-based value to price. Mayle et al. [76] propose a game-centric model of a private data exchange in return for a service. The model takes into account the priority given to data items by users and the monetary value given to users in return for their data. Schuh et al. [77] model an extended, economic value definition which is proportional to the benefits and costs associated with the product or service. Based on ''technological value contributions'', the model supports manufacturing companies in evaluating whether a generic set of field data generated by a smart product provides value to the user.
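To make the idea of entropy-based pricing concrete, the following Python sketch measures a dataset column by its size and Shannon entropy and converts the result to a price. This is not the formulation of Li et al. [81], whose details are not reproduced here: the size-times-entropy value, the linear pricing function and its rate_per_bit and base_fee parameters are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (bits per record) of the empirical value distribution."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_value(values):
    """Assumed value measure: number of records times entropy per record."""
    return len(values) * shannon_entropy(values)


def price(value_bits, rate_per_bit=0.01, base_fee=0.0):
    """Hypothetical linear pricing function from entropy-based value to price."""
    return base_fee + rate_per_bit * value_bits


varied = ["a", "b", "a", "b"]      # two equally likely values: 1 bit per record
constant = ["a", "a", "a", "a"]    # no uncertainty: 0 bits per record

varied_price = price(entropy_value(varied))
constant_price = price(entropy_value(constant))
```

Under these assumptions a column of identical records carries no information and is priced at the base fee, whereas a larger or more varied column is priced higher.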

2) CONTENT UNIQUENESS-BASED MODELS
This dimension was identified by Even and Shankaranarayanan [32] and Viscusi and Batini [14]. Measurement can be derived by assessing the content and its applicability for business use cases. The research identified here proposes measurement methods that link impartial characteristics and contextual perception to measure the potential business value associated with the data. This is related to but distinct from utility metrics, which measure the value of the data in use. Uniqueness is an important aspect of content for value and is a characterisation of the value of data based on rarity [13] or scarceness [14] of the information contained in it. Yao and Atkins [85] propose Smart Black Box (SBB) data compression decision making based on data value, calculated on data novelty and events. Shimazu et al. [72], on the other hand, take a value-based approach towards setting data confidentiality. Their paper defines a method for setting data confidentiality based on risk, taking into account data value, protection level, and threat level.

3) QUALITY-BASED MODELS
Data quality dimensions are tied with Data Content Uniqueness as the most common way identified to quantify the value of data assets, and can include aspects in the Data Quality Model defined in the ISO/IEC 25012 Standard.⁶ To a certain extent this is because the quality concept of ''fitness for use'' is closely aligned with the concept of value, for example possessing the same strong dependence on context of use. Measurement can be derived by assessing the dataset directly and its conformance to standards or applicability for business use cases. Fleckenstein et al. [39] describe the dimension-based approach to data value as an extension of data quality methods and tools. Given the wide range of data quality metrics well known to the community, we do not provide a table of them here but instead refer readers to other surveys of data quality metrics such as [26]. Hanratty et al. [70] base their definition of data value on data quality related aspects, namely content reliability, trust of source, and timeliness. Intended for the military decision-making domain, their fuzzy-logic-based approach valuates data. Hofman et al. [97] propose an evaluation framework for data quality and value assessments. The authors explore data quality categories and dimensions for assessing the potential value of linking different customs data sets and linking a business dataset to a customs data set. The categories of data quality used were contextual (including relevancy, value-added, timeliness, completeness, and amount of data as dimensions) and representational (including interpretability, ease of understanding, concise representation and consistent representation).
⁶ http://iso25000.com/index.php/en/iso-25000-standards/iso-25012
104972 VOLUME 11, 2023

4) USAGE-BASED MODELS
Authors such as Moody and Walsh [17] argue that data has no intrinsic value; it only becomes valuable when people use it. Following this argument, the more the data is used, the more valuable it becomes. This is based on the economic concept of ''value in use''. This dimension is a characterisation of how often or by whom [71] a dataset is used. It was developed by Chen [22] into a set of concrete metrics. Qiu et al. [47] propose a data value measuring algorithm for data migration applications based on the usage of the data, such as the access time and the data read and write frequency and access, and other content related aspects such as data size and file content. Zhao et al. [54] propose a model that values data blocks based on the timeliness of the data, data distribution and usage, and the association between blocks, as well as other usage related characteristics such as read and write frequency and granularity. Yanlin and Haijun [38] derive a data asset valuation framework based on data asset production cost and data asset spillover value, where the data is considered to be more valuable the more it is used, particularly if it is used in a multidimensional manner.
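A usage-based score of the kind discussed above can be sketched in a few lines of Python. The sketch is not the algorithm of Qiu et al. [47] or Zhao et al. [54]: the logarithmic damping of access frequency, the exponential recency decay, and the write_weight and half_life_days parameters are assumptions chosen only to show how read and write frequency and access time might be combined.

```python
import math


def usage_score(reads, writes, days_since_last_access,
                half_life_days=30.0, write_weight=2.0):
    """Hypothetical usage-based value: damped access frequency times recency decay."""
    if reads < 0 or writes < 0 or days_since_last_access < 0:
        raise ValueError("inputs must be non-negative")
    frequency = reads + write_weight * writes
    # The score halves every half_life_days without an access.
    recency = 0.5 ** (days_since_last_access / half_life_days)
    return math.log1p(frequency) * recency


# Hypothetical data blocks considered for migration to cheaper storage.
blocks = {
    "hot": usage_score(reads=500, writes=40, days_since_last_access=1),
    "cold": usage_score(reads=3, writes=0, days_since_last_access=200),
}
# Lowest-value blocks are migrated first.
migration_order = sorted(blocks, key=blocks.get)
```

In a data migration setting such a score would rank blocks so that rarely and not recently used data moves to cheaper storage first.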

5) UTILITY-BASED MODELS
Utility characterises data value in terms of value in use and the benefits, usually business oriented, that can be derived from it [14], [17], [101]. This dimension is often used to classify data value metrics which are very specifically tied to a particular application, service or business process and in many cases a specific dataset. Sonobe et al. [80] define a data value model based on a utility function derived from seller/producer estimates, with the aim of enabling rapid flow of the most valuable data in a disaster recovery situation. Jia et al. [52] also use a utility function in an attempt to answer how much each data point is worth for machine learning models, based on a model of the relative value of data derived from its Shapley Value from game theory. Tan et al. [79] derive an analytic formula for information value assessment based on utility functions, data type classification, information uncertainty, and the willingness of different actors to pay for more information based on their risk preferences.
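The Shapley Value mentioned above assigns each data point its average marginal contribution to a utility function over all subsets of the other points. The Python sketch below computes it exactly by enumerating subsets, which is only feasible for very small sets and is not the procedure used by Jia et al. [52]. The toy utility (number of distinct labels covered) and the point names are hypothetical stand-ins for a model performance measure.

```python
from itertools import combinations
from math import factorial


def shapley_values(points, utility):
    """Exact Shapley value of each point for a set-valued utility function."""
    n = len(points)
    values = {}
    for p in points:
        others = [q for q in points if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical training points and their class labels.
labels = {"a1": "A", "a2": "A", "b1": "B"}


def label_coverage(subset):
    """Toy utility: how many distinct labels the subset covers."""
    return len({labels[p] for p in subset})


values = shapley_values(tuple(labels), label_coverage)
```

In this toy setting the only point with label B receives twice the value of each of the two redundant points with label A, and the values sum to the utility of the full set, which is the efficiency property of the Shapley Value.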

6) MULTI-DIMENSIONAL MODELS
Laney [5] discusses a number of ''Information Asset Valuation Models'', i.e. a set of approaches for calculating the value of data assets. Compared to the previous models, Laney's models are more comprehensive as they comprise a larger number of dimensions. Laney [5] divides the models into fundamental models and financial models. Fundamental models are mostly based on the content of the data, including the utility, business usage, impact of information, and quality aspects, such as validity, completeness, integrity, and consistency, and are claimed to be better for information asset management applications. The financial models are based on measures of money and accountancy and are claimed to be better for assessing an information asset's business benefits. Based on the model definitions provided, Laney [5] uses more of an expert-based approach, as opposed to automated data valuation. Whilst a great basis for further research, these models in many cases lack the definition of concrete metrics.
Similar to Laney, Viscusi and Batini [14] propose a model for digital information asset evaluation. The authors consider information value to be based on information capacity (comprising information quality, information structure, and information infrastructure) and on utility based on information diffusion. The authors base their model on the assumption that information value can be quantified either on the basis of the information utility of the IT capabilities enabled by a data asset, or otherwise on the basis of the overall capabilities the data asset may provide in the initiative in question.
Lu and Zhu [62] also consider a number of dimensions in the evaluation of Enterprise Value of Information (EVI), and construct an evaluation model of EVI based on a combination of qualitative and quantitative approaches. The authors propose an EVI evaluation index system that takes into consideration information authenticity, timeliness, degree of coverage, degree of relevancy, degree of superposition, manager's subjective consciousness, information flux, and information cost. Only the latter two are quantitative indexes; all the others are categorised as qualitative. The authors therefore use cloud model evaluation to translate the qualitative indicators into the quantitative target.
Ahituv [31] defines a joint utility function that includes a set of dimensions for information value as follows: timeliness (including response time and frequency); contents (including similarity and aggregation level); format (including medium, ordering, and graphic design); and cost. The author provides this list of attributes to demonstrate an approach and points out that it is not intended to be exhaustive.
In summary, a wide range of data value dimensions are defined by existing models but only the models in [5], [14], [31] and [62] define a broad multi-dimensional approach to data value and in each case the dimensions selected differ.

7) VALUE OF INFORMATION (VoI) MODELS
Keisler et al. [102] define the value of information (VoI) as ''a decision analytic method for quantifying the potential benefit of additional information in the face of uncertainty''. It is often used in decision support systems and automated sensor data fusion applications. Thus it can be seen that VoI is a term used for a specific type of data value in the literature. Since VoI calculation is always specialised to a specific decision and application domain, there is a wide range of VoI models and assessment methods in the literature. Its application depends on having an objective function to be maximised and a choice between courses of action leading to uncertain payoffs [103].
A popular approach is to define VoI in terms of monetary values related to the costs and benefits of the use case in question. The monetary definition of VoI corresponds to the difference between the cost for acquiring/collecting new data and the benefit that this would create in terms of reducing uncertainty in decision making and improved business gains. For instance, Koski et al. [53], Macauley [60], and Rojo-Gimeno et al. [86] define and estimate VoI as a monetary value in three different application domains. Doctori-Blass and Geyer [61], and Dang et al. [63] also define VoI as monetary/economic value for the specific domain of supply chain. While the aforementioned papers aim to provide a single/global assessment of VoI, Giordani et al. [51] and Fauriat and Zio [84] aim to assess the VoI of different data sources separately, in an attempt to identify the most relevant information and prioritise its acquisition. However, in the former, the definition of VoI is not based on monetary value. Instead, it is based on an aggregation of three domain-specific attributes (i.e., a weighted sum of source proximity, timeliness, and quality). Santos et al. [82] and Hanratty et al. [70] aim at improving the accuracy and applicability of the defined VoI for their respective use cases (i.e., reservoir development and military operations management). References [74], [87], and [92] discuss the high computational complexity that is associated with computing and optimising VoI in various domains and propose techniques to improve the efficiency and the time complexity of these processes.
In summary, the literature on VoI is relevant for discussing the more general concept of data value. However, VoI is limited to creating application-specific data value models where it is possible to efficiently calculate or measure the expected value of perfect information (EVPI), the expected value of sample information (EVSI) and the expected net gain of sampling (ENGS), and this is not possible in many real-world deployments.
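To make the EVPI and ENGS quantities concrete, the following is a minimal sketch for a discrete decision problem, using the standard decision-analytic definitions (EVPI is the expected payoff when the true state is known before acting minus the best expected payoff under current uncertainty; ENGS is EVSI minus the cost of sampling). The payoff table, state probabilities and function names are hypothetical illustrations and are not taken from any of the surveyed papers.

```python
from typing import Dict, List


def expected_value_of_perfect_information(
    payoffs: Dict[str, List[float]], probabilities: List[float]
) -> float:
    """EVPI = E[max_a payoff(a, s)] - max_a E[payoff(a, s)].

    payoffs maps each action to its payoff in every state;
    probabilities holds the prior probability of every state.
    """
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    # Best expected payoff when acting under current uncertainty.
    best_without_info = max(
        sum(p * v for p, v in zip(probabilities, values))
        for values in payoffs.values()
    )
    # Expected payoff if the true state were revealed before acting.
    best_with_info = sum(
        p * max(values[s] for values in payoffs.values())
        for s, p in enumerate(probabilities)
    )
    return best_with_info - best_without_info


def expected_net_gain_of_sampling(evsi: float, sampling_cost: float) -> float:
    """ENGS = EVSI - cost of acquiring the sample."""
    return evsi - sampling_cost


# Hypothetical example: two states (contaminated, clean) and two actions.
example_payoffs = {"treat": [80.0, -20.0], "do_nothing": [-100.0, 0.0]}
example_probabilities = [0.3, 0.7]
evpi = expected_value_of_perfect_information(example_payoffs, example_probabilities)
print(f"EVPI: {evpi:.2f}")
```

In this toy setting EVPI is an upper bound on what a decision maker should pay for any additional data; the difficulty noted above is that realistic problems rarely come with such a small, fully specified payoff table.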

C. APPLICATION DOMAINS AND USE CASES
Approximately two thirds of the papers (45) had a specific, detailed application domain or use case for data value. A thematic analysis of these papers is provided in Table 8. This helps to better understand the current application areas for data value and it identifies gaps for further work or opportunities for cross-domain exploitation of existing results. The most common application areas identified were, in order of frequency of occurrence: information management, sensors and monitoring, security and privacy, information (data) pricing for data markets, and business decision making support. Each of these areas is discussed in detail below.

1) INFORMATION MANAGEMENT
This is the largest application area identified, with 15 papers on this topic. Within this theme the most common sub-topic is a grouping of 10 papers in the discipline of information lifecycle management (ILM) looking at data migration, data storage, and file management. This theme also includes the related fields of enterprise information management and data quality management (3 papers).
ILM is an industry term for the dynamic and efficient management of storage resources for growing volumes of digital data, given the availability of multi-tier storage that trades access time for cost and the increasing prevalence of data legislation forcing compliance [30]. Central to the ILM approach is the idea that ''not all corporate information [data] has the same value and values change over time'' [6]. Chen [6] identifies the three key ILM tasks as information valuation, information characterisation & classification, and task prioritisation & optimisation. Given the relatively early date of Chen's work (2005), the ongoing expansion of corporate data and the key role played by data valuation in ILM, it is not surprising that there is a significant body of work in this field.
Automating ILM is dependent on defining data value metrics that can be executed either in real time or periodically to enable files [71], blocks [64] or other data [30] to be shifted between storage types or deleted. The relatively high availability of file metadata [6] and access information [46] characterising the numbers of data accesses means that both of these have been exploited by authors. Some authors caution that metadata-driven approaches can require great effort to collect and maintain appropriate metadata and instead propose probabilistic approaches to metrics [46]. Despite the progress on data value assessment for ILM, Wijnhoven et al. [71] conducted a large-scale comparison of subjective data value assessments made by experts, using the questionnaire of Sajko et al. [56], with automated assessments and found a poor correlation. However, when identifying ''wastage'', the least valuable files, they found that automated methods had an 80% accuracy, far better than when identifying the most valuable files.
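As a rough illustration of how an access-based metric could drive periodic tiering decisions, the sketch below scores a file from its access count and recency and maps the score to a storage tier. The scoring formula, the half-life, the thresholds and the tier names are all assumptions made for this example; they are not the metrics defined in [46], [64] or [71].

```python
import math
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    accesses_last_year: int
    days_since_last_access: float


def usage_value(stats: FileStats, half_life_days: float = 90.0) -> float:
    """Hypothetical usage-based value score in [0, 1).

    Recency halves every `half_life_days`; frequency saturates
    towards 1 as the number of accesses grows.
    """
    recency = 0.5 ** (stats.days_since_last_access / half_life_days)
    frequency = 1.0 - math.exp(-stats.accesses_last_year / 10.0)
    return recency * frequency


def storage_tier(score: float, hot: float = 0.5, cold: float = 0.05) -> str:
    """Map a value score to an assumed three-tier storage layout."""
    if score >= hot:
        return "fast"
    if score >= cold:
        return "capacity"
    return "archive"


files = [
    FileStats("report.docx", accesses_last_year=100, days_since_last_access=0.0),
    FileStats("old_scan.tif", accesses_last_year=0, days_since_last_access=400.0),
    FileStats("minutes.txt", accesses_last_year=5, days_since_last_access=90.0),
]
for f in files:
    print(f.path, storage_tier(usage_value(f)))
```

A rule of this kind is cheap enough to run periodically over a whole file system, which is the property that makes usage metadata attractive for ILM, but it ignores content and context and so would be expected to share the weaknesses reported for automated assessments above.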
Given the commercial importance of data retention, the works often include extensive evaluations in real-world settings. For example, Turczyk et al. [46] describe a case study that generates file migration rules for 150,000 files, whilst Wijnhoven et al. [71] conducted a case study with 77 employees of Capgemini Netherlands. Many of the studies propose domain-specific file or block-based data value metrics that would be difficult to re-purpose for other use cases or domains [54]. However, the work has matured over the last decade and fully automated ILM systems are now a reality in data centres, so the focus has switched from data value metrics to decision algorithms based on the metrics, enabling fully automated ILM [47].
Enterprise information management (EIM) [65] has largely grown out of the discipline of records management, which focuses on managing information for legal compliance and supporting efficient operations of the organisation (governance). There is a growing number of records or data sources in organisations and a prioritisation mechanism is needed to deal with them most effectively, given limited resources. However, this work is less mature than ILM use cases. Ladley [65] presents a list of loosely defined metrics and ideas on how to measure the value of information assets for EIM. Tallon [30] proposes a tiered information framework that, by considering the value of information, allows CIOs to comprehend the interplay of market forces that shape information costs. Laney [5] devotes a chapter to EIM driven by thinking of data as an asset. He states that most EIM metrics are for justifying, funding, prioritising and gauging the success of initiatives for either managing data or business initiatives that use the data. This provides extensive use cases for data valuation in EIM. Unfortunately, the metrics presented by Laney are less formally specified and thus hard to operationalise. Finally, there is research on value-driven data quality management (a sub-topic of EIM), which has strong metrics since it aims to enable automated processing of data [32]. Given the importance of data quality to modern machine learning and analytics pipelines, this is a promising topic for further research.

2) SENSORS AND MONITORING
Eight papers were identified as applying data value to sensor deployment or communication and monitoring systems. Topics discussed include sensing systems for earth sciences such as satellites [60], environmental monitoring such as lake water quality [53], livestock production [86], sensor fusion architectures [57] and vehicular networks [51].
Many of the papers make use of the Value of Information (VoI) measure discussed above. In most cases probabilistic VoI approximation techniques are used, e.g. Bayesian [60] or Monte Carlo [53], due to the computational complexity of determining VoI exactly in realistic situations. Macauley [60] summarises that the value of data is based on ''(1) How uncertain decision makers are; (2) What is at stake as an outcome of their decisions; (3) How much it will cost to use the information [data] to make decisions; and (4) The price of the next-best substitute for the information [data]''. It can be seen that answering these questions for specific domains can inform decision-making about data acquisition or use. These domain-specific answers limit the re-usability of VoI methods in other domains. For example, Cornou and Kristensen [25] assign a value to knowledge of pig drinking behaviour that is very closely tied to their use case.
The key applications of data value estimates for these papers are to assess the worth of paying for additional monitoring capacity or information sources (justifying ICT system expense), optimising deployments of mobile sensors, minimising inter-sensor communications network load, and optimising data storage in low-resource devices. In most of these applications the data value estimate can be used to prioritise data collection or communication activities. Thus, utility-based VoI models dominate. However, for monitoring system deployment scenarios [25], [60], [86] economic benefits are also important as the decision is a longer-term one.

3) SECURITY AND PRIVACY
This section discusses the eight papers that were classified as dealing with data value applications for security and privacy. There is a strong distinction between the two application areas: for security, data value was used as a part of the risk assessment process [45], [56], [66], [72]; for privacy, the emphasis was on placing an economic value on the private data that users exchange in return for ''free'' services [24], [69], [75], [76]. Within the privacy papers the topic of IoT (Internet of Things) was seen as an important sub-application area for two of the papers. One paper was a cross-over that considered both security and privacy aspects in security system design influenced by data value [45].
In security risk assessment, for example as defined by the ISO 27005 standard, a key question is prioritisation of data assets to be protected by security systems. Identifying the most valuable assets has long been understood as part of this process [56]. This encourages consideration of a multi-dimensional view of data value. For example, the early work of Sajko et al. [56] examined assessing timeliness, utility, replacement cost, legislative risk, and competitive advantage through a qualitative, questionnaire-based approach. Not all assessments are qualitative though, as Shimazu et al. [72] propose a quantitative assessment of risk using the value of information [data value], protection level, and threat level. Khokhlov and Reznik [45] provide mappings between data quality metrics and data value as a further extension of the quantitative approach. They also show how the computed data value can be used to influence security system design for a crowd-sensing application and for another application in the TOR (The Onion Router) network, including privacy protection. Liu et al. [66] extend ISO 27005-style risk assessment of data assets to assets in power control systems (smart grids).
The privacy-related studies use economic (monetary) data value assessments (sometimes called ''privacy valuation'') to drive competitive data trading [24], [69], or to raise users' awareness of the value of personal data by providing transparent access to the value realised by service providers and data brokers [75], [76]. Game-theoretic approaches play an important role in these markets, whether trying to drive the market [24] or provide more transparency for users [76]. The data value model of Yassine et al. [69] is notable because, instead of more usual data value dimensions, it uses ''risk of sharing private data'' as a dimension of data value. The privacy data value papers are collectively characterised by a lack of strong implementation or evaluation. This may be due to the relative cost of integrating with many data broker platforms [75] or the immaturity of deployments of the IoT technology being discussed [24], [76].

4) DATA MARKETS AND INFORMATION PRICING
This topic was addressed by only seven of the papers analysed. This is interesting as data markets and pricing, i.e. a direct measure of the economic value of data, have received a lot of commercial attention in recent years but comparatively little seems to have been published on this topic, at least in the technical venues that we examined. Pricing for sale was seen as an end in itself by only two of the papers [55], [81], whereas three papers deal with data pricing in the context of privacy as a way of raising the awareness of users or increasing transparency of the transaction taking place between users and platform or service providers [24], [69], [75]. We also group in this theme methods that place an emphasis on data trading or data markets, even when they are not evaluated financially. This could be considered to include many of the ''Value of Information'' papers that are based on a game-theoretic approach to the concept of a market for information. However, there are also more concrete market mechanisms specified for encouraging data transfer in disaster situations [80] and markets for personal data [24]. Finally, there is one paper discussing training data pricing for machine learning models [52].
Li et al. [81] describe a data pricing strategy based on information entropy and give a useful overview of different existing pricing strategies: subscription, query-based pricing, and bundling/discrimination-based pricing. The paper claims that none of these are actually based on assessing the information value or contents of a dataset and presents a new information entropy-based way to measure the value of a dataset based on two value dimensions: size and information content. It provides an interesting evaluation of information content based on the ability of a dataset (or subset) to train a classifier, and this is tested on six research datasets. A set of three example pricing functions to convert from entropy-based value to price are also provided and validated in use cases. Rao and Ng [55] instead define a utility-based pricing mechanism; however, unlike most utility measures, they define a method to estimate the utility of data before and after obfuscation for privacy purposes using Kolmogorov statistics. This is applicable to personal information but it is unclear how other applications of the approach would work.
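The general idea of pricing from size and information content can be sketched as follows, using the empirical Shannon entropy of a column of values. This is a simplified illustration only: the value function (record count multiplied by per-record entropy), the linear pricing function and its parameters are assumptions for the example and are not the measures or pricing functions defined in [81].

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy of the values, in bits per record."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_value(values: Sequence[Hashable]) -> float:
    """Assumed value measure: dataset size times per-record information."""
    return len(values) * shannon_entropy(values)


def linear_price(value_bits: float, price_per_bit: float = 0.01,
                 base_fee: float = 1.0) -> float:
    """Hypothetical pricing function converting entropy-based value to price."""
    return base_fee + price_per_bit * value_bits


balanced = ["a", "a", "b", "b"]   # two equally likely symbols
constant = ["a", "a", "a", "a"]   # carries no information
print(entropy_value(balanced), linear_price(entropy_value(balanced)))
print(entropy_value(constant), linear_price(entropy_value(constant)))
```

Under these assumptions a dataset of identical records is priced at the base fee regardless of its size, which captures the intuition that size alone is a poor proxy for value.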
Jia et al. [52] seek to answer how much each (additional) input training data point is worth for machine learning models, based on a model of the relative value of data derived from its Shapley value from game theory. The Shapley value defines an optimal distribution of value to resources in a multi-actor system such as a market. Since this is very expensive to calculate ($O(2^N)$), the paper develops an approximation for k-nearest neighbours (KNN) machine learning models and a more practical Monte Carlo approximation algorithm. This analysis is based on having a utility function for the use of the data, which is often hard to estimate. They provide a classification of types of data valuation problems. The paper is significant as it explores the issues of data valuation at scale, seeking methods to assign value to each individual data point rather than to an entire dataset or information source.
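To illustrate why sampling makes per-point valuation tractable, the sketch below implements the generic permutation-sampling Monte Carlo estimator of Shapley values: each data point is credited with its average marginal contribution to the utility over random orderings. It is not the specific algorithm of [52], and the toy utility function (the number of distinct labels covered by a subset) is an assumption chosen only to keep the example self-contained.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    points: Sequence[str],
    utility: Callable[[List[str]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate the Shapley value of each point by permutation sampling."""
    rng = random.Random(seed)
    n = len(points)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset: List[str] = []
        previous = utility(subset)
        for idx in order:
            subset.append(points[idx])
            current = utility(subset)
            # Marginal contribution of this point given those before it.
            totals[idx] += current - previous
            previous = current
    return [t / n_permutations for t in totals]


def distinct_labels(subset: List[str]) -> float:
    """Toy utility: how many distinct labels the subset covers."""
    return float(len(set(subset)))


labels = ["cat", "cat", "dog"]
estimates = monte_carlo_shapley(labels, distinct_labels)
print(estimates)
```

In this toy case the single "dog" point receives a value of 1 because it always adds a new label, while the two redundant "cat" points share the remaining value; the estimates always sum to the utility of the full dataset minus that of the empty set. Each permutation costs N utility evaluations, so the practical bottleneck becomes the cost of the utility function itself, for example retraining a model.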
Given the importance of data for ML methods, it is to be expected that there will be more work in this direction in the future.

5) DECISION MAKING AND OTHER APPLICATIONS
This is a catch-all application domain category with 11 papers. In practice, data value techniques are usually applied as an input to some form of decision making: should I use this data? Do I need more data to make this decision? And so forth. The only sub-topic that contains multiple papers is supply chain management with 3 papers [35], [61], [63], but other areas include military decision making [70], insurance industry case studies [67], electricity markets [89], information overload [59], petroleum reservoir development [82], smart grids [79] and identifying relevant content in social media [73]. Several of these papers are based on the VoI measures discussed above and do not contribute anything unique to their domain application.
Supply chain management is significant in having a number of papers and in the fact that two of them are relatively early (2000s). Dang et al. [63] say ''Information resources management is the basis of supply chain'' and that this is a source of competitive advantage. They propose the concept of a value chain of information resources that parallels their own focus on physical supply chain management. Viet et al. [35] provide a structured review of the value of information in different supply chain decisions, focusing on articles between 2006 and 2017. They find that the focus has been on data availability rather than data characteristics (quality dimensions). They lay out a research agenda. The two points most relevant to our study are that more research is needed on information characteristics (data quality dimensions) and that there is a need for new methods to assess the value of data. They highlight the importance of enabling multi-method modelling approaches, and so semantic models of data value such as DaVe [27] could make an important contribution there.
Other significant papers include Glissmann et al. [67], who create a model to dissect data value from the value generated by analytics in business decision-making in the context of an insurance industry example. The approach is based on models of business operations and the way information (raw and analytical) is connected to the organisation and its business priorities (business architecture/enterprise architecture). This is a tempting model for data governance operations but it is very unlikely that businesses will have an up-to-date catalogue of their data assets or comprehensive business process models in a machine-readable format, due to the relative immaturity of data governance in most industries [5]. Scaffidi [73] discussed the use of data value measures to support the automatic identification of valuable Web resources in social media. This use of data value as a proxy for ''importance'' or ''interest'' is surely an area of overlap with other fields such as information retrieval and recommender systems that seek to satisfy a user's desires by supplying their data needs. More work is required to examine this overlap and identify paths to cross-fertilise the research in both areas. Scaffidi's approach is interesting in that it depends on collecting both positive and negative examples of value and highlights the importance of authorship of content for value (which is linked to Wijnhoven et al.'s idea of incorporating the role or seniority of the people accessing the data in their usage-based value model for files [71]).
Laney [5] discusses techniques for applying data value measurements in a business context in chapter 11. First he introduces data value improvement through the idea that there are three degrees of value: i) realised value (current economic benefits), ii) probable value (based on intended uses) and iii) potential value (if optimally applied). The ''information performance gap'' is then defined as the difference between realised and probable value. It is suggested that his Market Value of Information (MVI) and Economic Value of Information (EVI) models can be used to calculate realised and probable value. The ''information vision gap'' is then defined as the difference between probable and potential values. Laney indicates that this is harder to estimate. He suggests using Business Value of Information (BVI)-based actual versus potential valuations as a prioritisation technique. Unfortunately, the estimation of BVI is based on a crude estimate for relevance, which limits its applicability. A set of use cases for business decision-making are identified and a decision method based on the application and comparison of several of Laney's data value models is defined. The use cases and methods are as follows: prioritising information asset management initiative investments, proving benefits of information governance, innovation and digitisation, monetisation and analytics, helping to build a business case for monetising information assets, and reducing information lifecycle expenses.
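The two gaps described above reduce to simple differences between the three degrees of value. The sketch below only restates that arithmetic; the function name, the dictionary keys and the monetary figures are hypothetical, and how the three input values are estimated (for example via MVI, EVI or BVI) is outside its scope.

```python
from typing import Dict


def information_gaps(realised: float, probable: float,
                     potential: float) -> Dict[str, float]:
    """Return the performance and vision gaps for one information asset.

    performance gap = probable value - realised value
    vision gap      = potential value - probable value
    """
    return {
        "performance_gap": probable - realised,
        "vision_gap": potential - probable,
    }


# Hypothetical valuations for a single data asset, in one currency unit.
gaps = information_gaps(realised=100.0, probable=150.0, potential=400.0)
print(gaps)
```

Ranking assets by either gap gives a simple prioritisation list, although the result is only as reliable as the underlying, largely subjective, valuations.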
All of these methods are currently limited by the subjective models available, yet they represent significant insight into the types of decision-making possible and the relevant aspects of data value that we need to be able to understand and quantify in order to maximise its utility.

V. RESEARCH AGENDA
The review results showed that, despite the well-received review in [104] identifying ''Data value measurement'' as a key challenge for the future scope of data governance, there are still many challenges hindering the efficient and effective exploitation of the concept of data value. We have identified the following three main challenges for data value research: a common conceptual framework, reusable assessment methods and tools, and future applications. Table 9 summarises the open research questions and topics of interest.

A. COMMON CONCEPTUAL FRAMEWORK
The first major challenge observed in the literature is the absence of a common conceptual framework for data value that sets a common basis for communication, research, and tool interoperability in terms of data value applications, context, stakeholders, models, terminology, and data value dimensions [35]. Currently, most data value research is use-case or application focused (see Table 8) and identifies its own terminologies, models, and even the type of value it seeks to derive from data. There are opportunities to develop data value ontologies [27], catalogues and other online resources to make terminologies, metrics, use cases and data value test data easier to exchange and reuse, similar to the efforts that have been made in the data quality engineering field [26], [99].
The few unified, multidimensional data value models like [14] are incomplete when compared with the scope of data value identified in this review. At present the infonomics models of Laney [5] come closest to this ideal but many of their underlying metrics rely on subjective judgement and are not suitable for scalable, automated decision making. Fleckenstein et al. [39] explain that, despite the utility of their framework of three types of data value models (market-based, economic and dimensional), the ''three models overlap with each other'' and issues like regulation and privacy affect all models. All of the models are immature and lack a consistent way to represent the data value context, which is external to the data and yet part of the data value assessment. This hinders the development and deployment of tool chains or frameworks for data value assessment and management, or novel applications of data value to driving decisions in new domains like data governance [28]. A key problem for any unified model is how to reconcile, combine, and convert between financial models of data value measured in currency and scalar metrics typically measured in the range 0-1 [105]. Lastly, since there are several dimensions that could be included as part of a data value model, it is crucial that models provide machine-readable methods to score importance and provide means of aggregating the relevant dimensions and explaining their impact on individual use cases. Fleckenstein et al. [39] describe one model but it is human-oriented rather than addressing automation directly.
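One simple way to picture the aggregation and conversion problem is a weighted mean over dimension scores in the range 0-1, with a financial figure first rescaled into that range. The sketch below is an assumption-laden illustration, not a model from the literature reviewed: the min-max rescaling, the dimension names, the weights and the currency bounds are all hypothetical, and choosing them is exactly the open problem described above.

```python
from typing import Dict, Tuple


def normalise_currency(amount: float, lower: float, upper: float) -> float:
    """Min-max rescale a monetary value into [0, 1] using assumed bounds."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return min(1.0, max(0.0, (amount - lower) / (upper - lower)))


def aggregate_data_value(
    scores: Dict[str, float], weights: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Weighted mean of dimension scores plus per-dimension contributions."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("dimension scores must lie in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Contributions explain how much each dimension adds to the total.
    contributions = {
        dim: weights[dim] * scores[dim] / total_weight for dim in weights
    }
    return sum(contributions.values()), contributions


dimension_scores = {
    "usage": 0.8,
    "quality": 0.5,
    "financial": normalise_currency(2500.0, lower=0.0, upper=10000.0),
}
dimension_weights = {"usage": 2.0, "quality": 1.0, "financial": 1.0}
overall, explained = aggregate_data_value(dimension_scores, dimension_weights)
print(round(overall, 4), explained)
```

Returning the per-dimension contributions alongside the total is one way to support the explanation requirement, since a consumer can see which dimension drove a given score.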

B. REUSABLE ASSESSMENT METHODS AND TOOLS
It is critical that the data value community has access to known, reusable techniques and tools for data value assessment.
Although this paper has presented the most comprehensive collection of data value metrics to date (see Tables 4, 5, 6 and 7), the coverage of individual data value dimensions has high variance. Many metrics are extremely specific to particular applications, especially in VoI models (see Sec. IV-B7). The relative frequency of subjective as opposed to objective metrics in the literature [5], [32], [45], [47], [56], [61], [70], [80] and the significant number of data value metrics based on abstract and idealised utility functions [52], [74], [79], [80], [92] that are hard to connect to real applications show that data value assessment is still immature. Objective metrics are also more suitable for automation and the creation of reusable assessment tools, as has long been common in the data quality domain [26]. As Viet et al. [35] say, there is ''a need for new methods to assess the value of data''.
As well as lowering costs, common tools and metrics foster reproducibility of research findings and will facilitate the creation of open competitions and technical challenges by the research community. At present, the proliferation of individual assessment tools and models ensures divergence of results and duplication of effort.
Furthermore, although several metrics already exist for the assessment of data value on static data, we live in the big data era with constantly updated data streams, yet there is little research to date on data value metrics for this domain. It is key to accelerate stream data value research to both analyse the applicability of existing works and propose novel assessment techniques that will dynamically capture the evolution of data value over time.

C. APPLICATIONS
The potential scope of applications for robust data value assessment techniques is currently unknown. We have seen that a very wide range of application types and domains are already represented (see Section IV-C), with automated data value-driven decisions being most mature in the field of Enterprise Information Management, as Fleckenstein et al. [39] explain. However, with modern, increasingly data-centric businesses, the potential for data value must be significant. Machine learning applications are heavily data dependent and, although data quality has started to be addressed in this field, there are very few data value papers published on machine learning to date.
It is notable from this survey that many data value models and metrics are still only tested in simulations [51], [53], [60] or case studies, and there is a lack of longitudinal case studies on the experiences and benefits of data value-driven decision making. Enterprise Information Management is an exception, with several significant deployments reported, but the privacy and security application papers [24], [45], [75], [76] do not report a single deployment of automated, objective data value assessment.
However, the diversity of current approaches demands some synthesis for more widespread applications. For example, the lack of an established pipeline, lifecycle, or workflow for the application of data value impedes its adoption. This could span data value assessment, reporting, improvement, value-driven decisions, and other stages. Early work on data value monitoring capability maturity models exists [106] but this must be expanded to all lifecycle stages.
Common data value representations, perhaps based on semantic models [27], are key to both application and domain interoperability. At the business level, they will facilitate the comparison and integration of reported data value between multiple units in the same organisation, eliminating silos and enabling management of organisation-wide data value chains [1]. Beyond the organisation, this will provide better insights and collaboration opportunities to other stakeholders, e.g. governments, policy-makers, economists, non-governmental agencies, and the press.
While there are already attempts to link data value with the proliferation of machine learning techniques [88], it is surprising how little is being done. We are witnessing several breakthroughs in machine learning, but only a tiny fraction of them are related to data value (e.g., assessment of data quality metrics [107]). On the other hand, a large proportion of machine learning outcomes relies on data (collected, synthetic, or augmented) with many data quality issues [108]. Therefore, data value could play an important role in providing guidance and decision support mechanisms for machine learning. This collaboration would no doubt also produce more machine learning-based techniques for assessing data value.

VI. CONCLUSION
In this paper, we surveyed 63 existing works defining, characterising, modelling, and applying data value as a concept and for driving decision making (Table 2). We have identified that, despite data value having conceptual origins dating back to at least 1980 [31], the field is still immature, with a lack of commonly agreed terminologies, models, or approaches to data value.
Our analysis found that there is a lack of generalised data value models and commonly understood dimensions to properly quantify the value of data (see Section IV-B). This contrasts strongly with the more mature but related domain of data quality. This leads to an absence of common validation platforms and tools, which limits comparison of work, reproducibility, rate of progress, and industrial deployment. Nonetheless, this survey has collected the most comprehensive list of data value metrics to date (Tables 4 to 7), where it can be seen that the Usage dimension is the most often measured dimension of data value that is not already a known dimension of data quality.
Despite the increase in the number of works in the area in recent years, it is clear that there are still many important research questions to be resolved, and a set of these has been collected in Table 9. Addressing these challenges will help organisations understand and exploit their data more effectively. More mature data value techniques will enable them to quantify the value of their data assets more accurately and efficiently. This will not only help mitigate any data-related risks and enhance any data governance efforts, but also enable data-based decision making, including data acquisition and investment decisions, data maintenance decisions, innovation decisions, and also business decisions (e.g. in merger and acquisition scenarios).

FIGURE 2. Number of publications per year.

TABLE 1. Inclusion and exclusion criteria.

TABLE 3. Data value dimensions included in data value models studied.

TABLE 4. Data value metrics for financial dimension. S: Subjective. O: Objective.

TABLE 5. Data value metrics for content-uniqueness dimension. S: Subjective. O: Objective.

TABLE 6. Data value metrics for usage dimension. S: Subjective. O: Objective.

TABLE 7. Data value metrics for utility dimension. S: Subjective. O: Objective.

TABLE 8. Thematic areas for data value application domains.

TABLE 9. Research areas for data value and potential research questions for future research.