Opportunities and Challenges in Data-Centric AI

Artificial intelligence (AI) systems are trained on large volumes of data to solve complex problems and learn to perform specific tasks such as prediction, classification, recognition, and decision-making. In the past three decades, AI research has focused mostly on the model-centric approach rather than the data-centric approach. In the model-centric approach, the focus is on improving the code or model architecture to enhance performance, whereas in data-centric AI, the focus is on improving the dataset to enhance performance. Data is food for AI. As a result, there has been a recent push in the AI community from model-centric AI toward data-centric AI. This paper provides a comprehensive and critical analysis of the current state of research in data-centric AI, presenting insights into the latest developments in this rapidly evolving field. By emphasizing the importance of data in AI, the paper identifies the key challenges and opportunities that must be addressed to improve the effectiveness of AI systems. Finally, this paper gives some recommendations for research opportunities in data-centric AI.


I. INTRODUCTION
Artificial intelligence (AI) models are used in almost all industries today, including the automotive, agriculture, healthcare, financial, and semiconductor industries [1], [2], [3]. Earlier, the core of AI was machine learning (ML) algorithms. But as AI spread across many sectors and grew in demand, ML algorithms became a commodity and were no longer the core of AI. Instead, the training data would drive greater performance in ML models and systems [4]. ''Data is the new oil'' is a phrase first used by the data scientist Clive Humby in 2006 [5]. People frequently believe that data is static, a significant factor in ML models becoming commonplace. The word data literally means ''that which is supplied,'' which is true. For most individuals, ML entails downloading a readily available dataset and creating a model. Afterward, efforts are directed towards enhancing the model's performance. We refer to data in real-time business as the result of the processes. Compared to actual model construction, maintaining and strengthening the deployed models accounts for a larger share of the cost and efficacy of many real-time ML systems. The main challenge for these applications is to improve the data quality [6].

(The associate editor coordinating the review of this manuscript and approving it for publication was Mu-Yen Chen.)
In practical AI applications, 80% of the activities involve data cleansing and preparation. Therefore, assuring data quality is an essential duty for an ML engineer to consider [7]. Data fluctuates significantly for higher-stakes AI. This data flow has a longer duration and various unfavorable effects. From data collection through model deployment, data movement can occur anywhere [8]. Conventional methods entirely dependent on code are overshadowed by modern-day AI systems, which consider both code and data. Practitioners strive to work on the model/code to improve the AI system instead of focusing more on the data part of the system [4]. However, improving the input data in actual applications is often preferable to experimenting with various algorithms. In recent publications, 99% of papers have a model-centric focus, while only 1% are data-centric. It can easily be seen from Table 1 that the performance improvements have primarily been achieved through a data-centric approach. It appears that the model-centric approach did not result in any significant improvements over the baseline. Despite this, it is likely that practitioners invested a considerable amount of time, possibly hundreds of hours, in developing and refining these algorithms. Case studies in AI, computer vision, and other fields have shown how enhancing the data results in a better model [9]. AI systems that are model-centric focus on how to alter the algorithm/code to improve performance. Data-centric AI's prime focus is to systematically amend the data to achieve a model with better performance. Model-centric and data-centric methods may be balanced to provide a robust AI solution [10]. The contributions of this work can be summarised as follows:
• Connecting various data-centric AI techniques and related topics together at a high level, with a focus on recent and significant developments.
• A comprehensive review of data-centric AI techniques.
• Describe the challenges and solutions of the data-centric AI approaches.
• Finally, we also give a few recommendations for research opportunities in data-centric AI.
The rest of this paper is organized as follows: Section II provides brief highlights of AI systems. An overview and limitations of model-centric AI are presented in Section III, while Section IV discusses the overview, mathematical model, advantages, and exploratory data analysis of data-centric AI. A summary of representative tasks and methods for training data development is provided in Section V. An overview of related work on data-centric AI is given in Section VI. The challenges, solutions, and research directions are outlined in Section VII. A discussion on the merits and limitations of the existing art, along with a few new perspectives, can be found in Section VIII. Finally, the concluding remarks of this paper are presented in Section IX.

II. AI SYSTEMS
Standard AI systems can be represented as:

AI system = Code (model/algorithm) + Data

For optimal operation, an AI system requires a balance of both model-centric and data-centric approaches for effective solutions. Model-centric focuses on developing algorithms, while data-centric focuses on the quality and relevance of the data used to train these algorithms. Both are important and need to be considered together for successful AI implementation. The data-centric approach involves addressing the challenges of the completeness, relevance, consistency, and heterogeneity of the data. It is an iterative process that must be integrated with the model-centric approach to achieve an AI system that effectively leverages the code and data. These models can range from data classification and clustering to recommendation systems and natural language processing. AI algorithms such as deep learning, reinforcement learning, and decision trees can be used to process large amounts of data and make predictions or decisions based on that data. The use of AI in data-centric models helps to uncover hidden patterns, relationships, and insights in data, enabling organizations to make data-driven decisions and achieve better outcomes. The integration of AI in data-centric models has the potential to revolutionize various industries, including healthcare, finance, retail, and many others [3], [11], [12].

III. MODEL-CENTRIC AI
Model-centric methods often follow a structured sequence during production that involves scoping the project, data collection and augmentation, data storage, data cleaning, data visualization, visual analytics, model construction and training, and finally, model evaluation [13]. Fig. 1 shows the production workflow of the model-centric approach, starting from data collection to model deployment. In these situations, the development and improvement of the model algorithms typically receive most of the attention, with data engineering being a one-time assignment. Modern approaches also tend to analyze biases and fairness, focusing on loss functions such as cross-entropy, mean squared error, mean absolute percentage error, etc. In traditional ML production environments, a stronger emphasis is given to the code rather than the resultant trained model, to improve its style, readability, and unambiguity. Since the models are to be trained with new data continuously, the distinctions between a data scientist and an ML engineer gradually erode, with all collaborators developing general expertise on skills and best practices for all the domains involved. Even though model training is essential, the development of automated machine learning (AutoML) systems has resulted in a progressive decline in the requisite amount of human intervention [14]. The model-centric design, also known as application-centric depending on the use case, typically necessitates minimal work on the side of the data engineer due to the conventional one-time nature of data acquisition and pre-processing. However, if data collection and model training were to become a continuous process where subsequent semi-supervised models learn from the same data they collect, the workforce would be directed towards constant labelling and augmentation of collected data, and a movement toward data-centric concepts would occur.

A. LIMITATIONS
Model-centric AI is predicated on ML approaches that prioritize improving model architectures (algorithm/code) and the underlying hyper-parameters. In this method, data is created once and is maintained throughout the development of the AI system. Model-centric AI has been successful over the past few decades, yet it has some flaws, as shown in Fig. 2. It particularly thrives in organizations and sectors where there are customer platforms with millions of users free to depend on generic fixes. In these conditions, most consumers would be satisfied by a single AI system; nonetheless, it would be practically useless for outliers. Examples of such organizations and sectors include the advertising sector, where firms like Google, Baidu, Amazon, and Facebook can access vast amounts of data (sometimes in a standardized format), which they may use to build model-centric AI systems. Standardized solutions like those offered by a single AI system cannot be used in sectors like manufacturing, agriculture, or healthcare, where customized solutions are preferred over one-size-fits-all recipes. Instead, these sectors should conceive their strategy to ensure that their algorithms learn what they need to learn from complete data that includes all crucial cases and is labelled consistently. The concept centers on the prevalence of extensive and dynamic datasets, particularly evident in the advertising sector. In such contexts, a 'one-size-fits-all' model consistently delivers satisfactory outcomes by mitigating the impact of outliers. However, this strategy proves impractical when dealing with small datasets. The flaw or limitation does not lie inherently within the model-centric approach itself, but rather in recognizing that an effective AI application requires both an algorithm and a dataset. Consequently, any AI solution becomes ''narrowly applicable'' if the curation of the dataset does not align with the concurrent development of the model. This viewpoint, advocating the fusion of model-centric and data-centric approaches, is explicitly proposed in [15] as a crucial and optimal strategy to enhance AI systems. The suggestion is to refrain from a binary choice between deploying a model-centric or data-centric approach, emphasizing the necessity of synergy between the two for a more comprehensive and effective AI solution.
Due to the rapid pace of innovation in today's technology-driven world, AI models (algorithms) may quickly become outdated and require retraining on new data, primarily because the unforeseen or unexpected behaviour of significant trends may outpace the model's usefulness and relevance. Such model-centric AI capabilities are severely constrained in fields where the model itself is created to support ongoing scientific study or as a component of a more complex model. As a result, training data is produced using a smaller model based on restricted findings and hypotheses. Typically, data collection is not practicable in these situations, or work involves limited data from expensive datasets. In such cases, the shortcomings of model-centric technologies become apparent, since training a model-centered algorithm may result in biased findings, mainly if the initial model used to generate the data was built on incomplete knowledge or a lack of domain expertise. With general trends in deep learning requiring a focus on large amounts of data, there is a general tendency to prefer reusing existing models to fit our use cases with specific training data. Such a shift is perceived as a step towards the data-centric movement, where the model is fixed, with the only variable inputs being the training data and classifiers.

IV. DATA-CENTRIC AI
Data-centric AI is a data-engineering strategy that improves an AI system's performance by methodically boosting the quality of the data used to train the underlying model, using mutually exclusive yet collectively exhaustive methodologies and categories. Data-centric AI can improve the performance of AI models and services through augmentation, extrapolation, and interpolation. Data-centric AI can assist in making AI services more accurate and dependable by expanding the data that is accessible to them and enabling them to use it more efficiently [16]. Under this methodology, data-centric AI is built with training data from many sources, including synthetic data, public datasets, and private datasets. This strategy can lessen the time and effort needed to generate training data while also helping to increase data quality. It can also increase the effectiveness with which AI services use training data. Additionally, data-centric AI can process additional datasets because the data is personalized. This means that regardless of the magnitude of the dataset, data-centric AI can analyze and learn from it and make decent predictions.

A. MATHEMATICAL MODEL
The mathematical model of data-centric AI is a model that allows the system to learn from data and make predictions. This model can take various forms, such as decision trees, linear regression, neural networks, or deep learning models. The choice of model depends on the problem domain and the type of data available. The model is trained on a labeled dataset, where the input data is associated with the corresponding output or label. The training process involves minimizing a loss function that measures the difference between the predicted output and the actual label. The goal is to find the model parameters that minimize the loss function and generalize well to unseen data. Once the model is trained, it can be used to make predictions on new input data. The prediction process involves feeding the input data into the model, which outputs a predicted label or value. The prediction accuracy depends on the quality of the model and the amount and quality of the input data.
Given a dataset D = {(x_1, y_1), (x_2, y_2), . . ., (x_n, y_n)} consisting of n samples, where x_i ∈ X denotes the feature vector and y_i ∈ Y denotes the corresponding label, the goal of data-centric AI is to learn a function f : X → Y that maps each input feature vector x to its corresponding output label y.
Define a model function f(X; θ) that takes the input data X and a set of learnable parameters θ and outputs a prediction. This function f is learned by optimizing a loss function L(Y, f(X; θ)), which measures the discrepancy between the predicted outputs f(x_i; θ) and the true labels y_i for each sample in the dataset.
Mathematically, finding the set of parameters that minimizes the loss function can be expressed as:

θ* = arg min_θ L(Y, f(X; θ))

where θ* is the optimal set of parameters that minimizes the loss function. This optimization problem is typically solved using techniques from optimization theory, such as gradient descent or stochastic gradient descent.
The optimization problem above is typically solved using an iterative algorithm such as gradient descent. At each iteration t, the parameters θ_t are updated using the gradient of the loss function with respect to θ:

θ_{t+1} = θ_t − α ∇_θ L(Y, f(X; θ_t))

where α is the learning rate that determines the step size of the update, and ∇_θ L(Y, f(X; θ_t)) is the gradient of the loss function with respect to θ at iteration t. The iterative process continues until convergence, i.e., until the change in the loss function between successive iterations falls below a certain threshold. Once the optimal set of parameters θ* is found, the model function f(X; θ*) can be used to make predictions on new input data.
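As an illustration, the iterative update above can be sketched in a few lines of Python for a simple linear model trained with full-batch gradient descent on a mean-squared-error loss (a toy example; the function names and the stopping rule are ours, not from the surveyed works):

```python
def predict(x, theta):
    """Linear model f(x; theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def mse_loss(data, theta):
    """L(Y, f(X; theta)): mean squared error over the dataset."""
    return sum((y - predict(x, theta)) ** 2 for x, y in data) / len(data)

def gradient(data, theta):
    """Gradient of the MSE loss with respect to theta."""
    n = len(data)
    g0 = sum(-2 * (y - predict(x, theta)) for x, y in data) / n
    g1 = sum(-2 * (y - predict(x, theta)) * x for x, y in data) / n
    return (g0, g1)

def fit(data, alpha=0.05, tol=1e-9, max_iter=10000):
    """Iterate theta_{t+1} = theta_t - alpha * grad L until the change
    in the loss between successive iterations falls below tol."""
    theta = (0.0, 0.0)
    prev = mse_loss(data, theta)
    for _ in range(max_iter):
        g = gradient(data, theta)
        theta = (theta[0] - alpha * g[0], theta[1] - alpha * g[1])
        curr = mse_loss(data, theta)
        if abs(prev - curr) < tol:
            break
        prev = curr
    return theta

# Noise-free data generated from y = 1 + 2x; the fit should recover it.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta = fit(data)
```

The same loop structure underlies stochastic gradient descent; the only change is that the gradient is estimated on a sampled mini-batch rather than the full dataset.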
Justifications: The model is based on the premise that the performance of an AI system is dependent on the quality of the data on which it is trained. This is reflected in the model through the use of a data-centric loss function, which prioritizes the reduction of errors in the training data over the complexity of the model. The mathematical justification for this approach can be found in the field of statistical learning theory, which provides a framework for understanding the relationship between the complexity of a model and its ability to generalize to new data. The theory suggests that simpler models are more likely to generalize well, as they are less likely to overfit the training data. The data-centric model also incorporates techniques such as regularization and early stopping to prevent overfitting and improve the generalization performance of the model. These techniques have been extensively studied and validated in the literature, providing further mathematical justification for their use in the model. Furthermore, the model emphasizes the importance of data pre-processing and feature engineering, which can have a significant impact on the performance of an AI system. This is supported by research in the field of ML, which has shown that carefully selecting and preprocessing features can improve the accuracy of a model.

B. ADVANTAGES
Data-centric AI is not confined to a particular data type; it can learn from text, images, audio, and video. A data-centric AI approach typically involves utilizing the proper labels, addressing any issues, removing noisy data inconsistencies, enhancing data through data augmentation and feature engineering, analyzing errors with the help of domain experts to determine the accuracy or error in data points, and so on [17], [18], [19]. Data-centric AI constantly evaluates the created AI model in conjunction with updated data. Typically, an AI model is trained on a dataset only once during the production stage, before the software development process can be finalized with the deployment of the model and its necessary features. However, the underlying model may encounter edge-case instances of data points that differ significantly from those encountered during the training phase. This is anticipated, as the workflow of data-centric AI involves continuous improvement of data, particularly in industries that cannot afford to have a large amount of data points, such as manufacturing, agriculture, and healthcare [20]. As a result, evaluating the model's quality would also happen more regularly than only once. A model would be able to recognize, judge, and then respond appropriately to variational data distributions owing to the production systems' capacity to offer rapid feedback. In fact, this ability gives data-centric approaches a competitive advantage over their model-centric counterparts. The advantages of data-centric systems are highlighted in Fig. 3.
The ever-arising challenges and problems in today's world require continuous optimization and tuning of the model, along with simultaneous collection, processing, augmentation, and labelling of high-volume data. In cases of filter-list-based blocking/moderation of social media content, restriction of cyber-threats, fraud detection, and spam tracking, a shift from a data-driven or application-centric to a data-centric perspective focusing on labelling and cleaning of data is apparent, with the information being collected at high frequency, on an hourly basis or even faster. The boundaries between business and technology are vanishing, with tools and techniques like ML and DL requiring assistance from domain experts and consultants to modify inputs or generate better algorithms. DL-based approaches have become popular owing to the enormous scope and capability for collecting, storing, and processing Big Data, mainly due to their excellent performance with big data and technological advances [21].
In deep-learning-based applications, the distinguishing hierarchy and structure of features or parameters are learned from the data, whereas they are usually coded by a human domain expert in typical machine learning applications. Then, through algorithms such as gradient descent and backpropagation, the deep learning algorithm learns and fits itself for accuracy. This methodology allows for mimicking the human brain at a primitive level, allowing DL models to make predictions more precisely through a combination of weights, inputs, and biases. Since a massive volume of data is processed through multiple layers of neural networks, the aspect of clean, labelled information becomes vital, as the presence of dirty, repeated, inconsistent data can cause unnatural biases, failure in edge cases, erroneous predictions, poor performance of the model, wasteful computations, etc. Most real-world datasets are noisy, unstructured, and unorganized, with several biases, outliers, missing values, repeated values, etc.

C. SIGNIFICANCE
Model-centric AI assumes ML solutions that mainly focus on optimizing model architectures (algorithm/code) along with the underlying hyperparameters. Data within this approach is created only once and kept the same over the AI system's development life cycle. Model-centric AI fails in many real-world scenarios where the data continuously improves; in drug discovery, for example, new drug-related data keeps being generated as new diseases evolve. In such a situation, the model built using the model-centric approach fails, as the data points it encounters are entirely different from those encountered during the training phase. Ignoring updated data also prevents model-centric AI from generalizing across datasets and makes it susceptible to adversarial samples [13], [15]. In contrast, data-centric AI considers ML solutions that emphasize continuous improvement of data and evaluation of the developed AI model in conjunction with data updates throughout the lifecycle of an AI project [22]. As a result, it can adapt to ever-changing real-world situations.

D. EXECUTION AND ANALYSIS
Data preparation is a notable example of the tedious steps of the ML lifecycle. Since data quality directly affects a model's quality, it is also one of the most crucial processes. This section will discuss the significance of exploratory data analysis (EDA), data visualization, and other tools for preparing data for ML pipelines and identifying data quality problems [23]. Data scientists examine and glean essential insights from the data using EDA techniques. Additionally, efficient EDA dramatically benefits from the talents and subject expertise of data scientists in this area. To encourage more statisticians, particularly academics, to research a wide range of fascinating difficulties, we present the traditional yet current subject of data quality from a statistical perspective.
The data quality landscape is discussed along with the research underpinnings in computer science, overall quality management, and statistics.
The use of two case studies based on an EDA approach to data quality motivates a collection of research questions for statistics that cover theory, methodology, and software tools. Data visualization is a crucial EDA technique that uses visual elements like charts and graphs to make analysis simple and efficient [24], and it is especially pertinent in the context of data quality profiling. To investigate and summarise multiple datasets, data scientists use EDA, which typically employs data visualization tools. Figuring out how to alter data sources to obtain the required answers makes it easier for data scientists to detect trends, spot anomalies, test hypotheses, or validate assumptions. EDA aids in comprehending the variables used in data collection and how they relate. Typically, it is used to look into what information the data might reveal outside of the formal modelling or hypothesis-testing assignment. It can also assist in determining the suitability of the statistical methods being considered for data analysis. John Tukey, an American mathematician, developed EDA methods, which are still extensively used in the data discovery process. EDA's main objective is to help with data analysis before making any assumptions. It can help with identifying obvious errors, better comprehending data patterns, identifying outliers or unexpected events, and identifying fascinating relationships between the variables. Data scientists can make sure their findings are accurate and pertinent to any targeted business objectives by using exploratory analysis. EDA assists stakeholders by making sure they are asking the right questions. EDA can help in answering inquiries about standard deviations, categorical notations, and confidence intervals. In [25], the authors present unifying principles offered by the categorical notion of data and discuss the importance of these principles in the data-centric AI transition.
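A minimal EDA-style profiling pass of the kind described above can be sketched as follows, assuming a toy numeric column with missing values; the robust outlier rule used here (modified z-score with a 3.5 cutoff) is one common convention, not a prescription from the cited works:

```python
import statistics

def profile_column(values):
    """Summarize one numeric column: size, missing rate, center, outliers."""
    present = sorted(v for v in values if v is not None)
    med = statistics.median(present)
    # Median absolute deviation: robust to the very outliers we want to find.
    mad = statistics.median(abs(v - med) for v in present)
    outliers = [v for v in present
                if mad > 0 and 0.6745 * abs(v - med) / mad > 3.5]
    return {
        "count": len(values),
        "missing_rate": 1 - len(present) / len(values),
        "median": med,
        "outliers": outliers,
    }

# Toy sensor readings with one missing value and one gross outlier.
readings = [10.1, 9.8, 10.3, None, 9.9, 10.0, 10.2, 55.0]
report = profile_column(readings)
```

In practice the same per-column statistics feed the charts and histograms that visual EDA tools render, so a profiling report like this is the textual counterpart of data quality visualization.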

V. DATA DEVELOPMENT
Approximately 45% of a data scientist's time is spent on data preparation tasks, including importing and cleaning data, according to an Anaconda survey of data scientists shown in Figure 4. The quality and quantity of training data are pivotal for machine learning model performance. A summary of representative tasks and methods for training data development is given in Table 2.

A. DATA COLLECTION
Data collection is a crucial step in AI, as it forms the foundation for training machine learning models. It involves three main approaches: data acquisition, data labeling, and improving existing data and models. Data acquisition involves finding or generating new datasets, data labeling involves adding annotations to enable learning, and improving existing data and models involves modifying them to increase accuracy and usefulness. The quality and relevance of collected data directly impact the accuracy and usefulness of AI models.

1) DATA ACQUISITION
When there is a lack of data, data acquisition can be performed to find suitable datasets for training machine learning models. There are three approaches for data acquisition: data discovery, data augmentation, and data generation. Data discovery involves indexing and searching for datasets, data augmentation involves creating synthetic examples by distorting or combining labeled examples, and data generation involves creating datasets through crowdsourcing or synthetic data generation techniques.
Data discovery refers to the process of identifying and locating relevant data sources to support decision-making and analysis. The goal is to make it easier for organizations to find and use the data they need, by indexing large datasets stored in data lakes [26] or searching for data on the web [27]. This can help organizations improve the efficiency and effectiveness of their data-driven activities. The Goods system [67] and Google Dataset Search [28] are examples of data discovery tools that support searching for datasets. These tools have become more interactive, with systems like Juneau [68] providing interactive data search and management. Juneau uses similarity measures, provenance information, and schema information to find related tables, which is a critical task in efficiently joining or unifying tables in data lakes. LSH-based algorithms that perform set-overlap search or unionable-attribute retrieval have been proposed to solve this challenge [69].
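The set-overlap search underlying such related-table retrieval can be illustrated with exact Jaccard similarity over toy columns (the LSH-based systems cited above approximate this at scale; the catalog and column names here are hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of column values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_joinable(query_column, catalog, threshold=0.5):
    """Return catalog columns whose value overlap with the query column
    exceeds the threshold -- candidates for joining or unioning tables."""
    return sorted(
        name for name, values in catalog.items()
        if jaccard(query_column, values) >= threshold
    )

# Hypothetical data-lake catalog: column name -> distinct values.
catalog = {
    "customers.country": {"US", "DE", "FR", "JP"},
    "orders.state": {"CA", "NY", "TX"},
    "suppliers.country": {"US", "DE", "CN", "JP"},
}
matches = find_joinable({"US", "DE", "FR", "JP"}, catalog)
```

A production system cannot afford the pairwise scan over millions of columns, which is exactly why the cited work replaces the exact comparison with locality-sensitive hashing.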
Data augmentation is a popular technique in machine learning for generating new training data to increase the diversity of the data used for training. Generative Adversarial Networks (GANs) [70], [71] are commonly used for this purpose. They consist of a generator and a discriminator, which are trained in an adversarial fashion to generate fake data that is similar to the real data. However, GANs have limitations, and other techniques have been developed to complement them, such as augmentation policies [72] and AutoAugment [30], where transformations are applied to the data by a controller. Another popular technique is Mixup [31], [73], [74], [75], [76], which mixes pairs of data points of different classes to create additional data that regularizes the model. Model patching [77] is another method that uses GANs to augment the data for specific subgroups of a class to improve the model's accuracy.
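The core of Mixup can be sketched in plain Python: a random convex combination of two feature vectors and of their one-hot labels, with the mixing weight drawn from a Beta(α, α) distribution (a simplified illustration of the idea in [31], not a full training loop):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Mix two examples: x = lam*x1 + (1-lam)*x2, and likewise for the
    one-hot labels, so the mixed label is a soft class distribution."""
    lam = rng.betavariate(alpha, alpha)  # lam in (0, 1), usually near 0 or 1
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Two toy feature vectors with one-hot labels for classes 0 and 1.
xa, ya = [1.0, 0.0, 2.0], [1.0, 0.0]
xb, yb = [0.0, 4.0, 2.0], [0.0, 1.0]
x, y = mixup(xa, ya, xb, yb)
```

Because the mixed label remains a valid probability distribution, the synthetic example can be fed to any classifier trained with a soft cross-entropy loss.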
Another way to acquire new data is by generating it through crowdsourcing platforms such as Amazon Mechanical Turk [78], where people are paid to create or find data for specific tasks. Data can also be generated through simulators or generators specific to a certain domain, such as Hermoupolis [32] for mobility data or Crash to Not Crash [79] for driving data. Domain randomization [33], [80] is a technique that can be used to generate a variety of realistic data by varying the parameters of a simulator. GANs can also be used to generate new data, but they require a sufficient amount of real data for training.

2) DATA LABELING
Data labeling is the process of annotating or categorizing data to provide it with a label or class. This is necessary for training supervised machine learning models, where the model needs to learn to make predictions based on labeled data. Data labeling can be a time-consuming and labor-intensive process, but it is crucial for ensuring the quality and accuracy of the trained models. There are several approaches to data labeling, including manual labeling by human annotators, automated labeling using heuristics or pretrained models, and crowdsourced labeling, where a large number of people can label the data through a platform. The choice of labeling method depends on the type and amount of data, the desired level of accuracy, and the budget and time constraints.
Data labeling is a crucial step in training machine learning models, where the goal is to assign labels to examples in the dataset. The labeling process can be performed using various methods, including:
• Semi-supervised learning [34], where existing labels are used to predict the other labels.
• Crowdsourcing platforms like Amazon Mechanical Turk, where labelers are recruited to label the data.
• Domain experts, who provide labels based on their expertise but can be expensive.
• Active learning [81], which reduces the crowdsourcing cost by asking labelers to label the uncertain examples that will improve model accuracy the most.
• Weak supervision, where labels are generated semi-automatically, for example, by using external knowledge bases or data programming techniques like Snorkel [37], [38] or Snuba [39].
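Uncertainty sampling, the simplest active-learning strategy, can be sketched for the binary case as follows (the scoring rule and names are illustrative; real systems use richer acquisition functions):

```python
def uncertainty(prob):
    """Uncertainty of a binary prediction: 1.0 when prob is 0.5,
    0.0 when the model is fully confident (prob 0.0 or 1.0)."""
    return 1.0 - abs(prob - 0.5) * 2

def select_for_labeling(pool, predict_prob, budget=2):
    """Pick the `budget` most uncertain unlabeled examples to label next,
    so that each crowdsourced label improves the model as much as possible."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict_prob(x)),
                    reverse=True)
    return ranked[:budget]

# Hypothetical model output: probability of the positive class per example id.
probs = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.88}
chosen = select_for_labeling(list(probs), probs.get, budget=2)
```

After the chosen examples are labeled, the model is retrained and the loop repeats, which is how active learning keeps the crowdsourcing budget focused on informative points.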

3) IMPROVING EXISTING DATA
Improving existing data and labels is a useful approach in scenarios where no relevant datasets are available externally and collecting more data no longer increases the model's precision. Re-labeling [40] is an effective approach to improve the quality of the labels and can be accomplished by collecting majority opinions on multiple labels per instance. Data validation, cleaning, and integration can also help improve the quality of existing data.
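Majority-vote re-labeling can be sketched in a few lines; ties are returned as unresolved so the instance can be routed back for more labels (a simplification of the aggregation schemes in the cited work):

```python
from collections import Counter

def majority_label(annotations):
    """Resolve multiple labels per instance by majority vote; a tie
    returns None so the instance can get another round of labeling."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Three crowd workers labeled each image; re-label by majority opinion.
raw = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["dog", "dog", "dog"],
    "img3": ["cat", "dog"],  # tie: needs more labels
}
resolved = {name: majority_label(labels) for name, labels in raw.items()}
```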

B. DATA CLEANING AND VALIDATION
Data cleaning is an important step in preparing data for machine learning. However, simply fixing well-defined errors in the data may not necessarily lead to an improvement in the accuracy of the machine learning models [49], [82]. Cleaning the data with the direct goal of improving model accuracy [83], [84] is more effective, and in some cases a training process that is robust to noise is deemed more effective still than cleaning the data prior to model training [82]. Data with malicious intent, known as adversarial data noise, can also be present, and addressing it through cleaning is referred to as data sanitization. Incorporating AI ethics, such as model fairness [85], is also an important consideration in data preparation, as biased data can lead to discriminatory models. Platforms like TensorFlow Extended (TFX) [86], which specializes in machine learning, possess data validation [87] components that allow for the early detection of data errors.

1) DATA VALIDATION
Data visualization is an effective way of validating data for machine learning [87]. Visualization tools like Facets [88] allow quick checks on data to prevent errors downstream.
Research has also been conducted on the automatic creation of visualizations, such as the SeeDB [41] framework, which uses a deviation-based metric to determine interestingness. However, automatic generation can produce false positives, which is why a body of research also concentrates on techniques for controlling false discoveries. Schema-based validation [86], [87], such as TensorFlow Data Validation (TFDV) [43], [44], is widely used in practice: it infers a data schema from prior datasets, uses it to validate future datasets, and informs users of any anomalies detected. Data validation systems are increasingly equipped with additional capabilities such as declarative data quality constraints, pipeline inspection, automatic error identification, and ease of use. Deequ [45], [46] is a library for defining data quality constraints declaratively, which are then transformed into unit tests. The mlinspect [47] library provides a way to inspect machine learning pipelines in a declarative manner. The most recent advancements in data validation systems encompass automatic identification of error types, evaluation of the effects of errors on models, user-friendliness, and efficient validation with human involvement [89], [90], [91], [92].
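A drastically simplified sketch of the schema-based idea behind systems like TFDV follows: infer a per-column type and value range from a trusted baseline, then flag anomalies in a new batch. The column names and data are illustrative, not any tool's actual API.

```python
# Toy schema-based validation: learn a schema from baseline data,
# then report anomalies (missing values, type mismatches, out-of-range
# values) in a new batch, in the spirit of TFDV/Deequ.

def infer_schema(rows):
    """Infer column type and observed value range from trusted rows."""
    schema = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        schema[col] = {"type": type(values[0]), "min": min(values), "max": max(values)}
    return schema

def validate(rows, schema):
    """Return (row_index, column, reason) for every detected anomaly."""
    anomalies = []
    for i, row in enumerate(rows):
        for col, spec in schema.items():
            v = row.get(col)
            if v is None:
                anomalies.append((i, col, "missing"))
            elif not isinstance(v, spec["type"]):
                anomalies.append((i, col, "type mismatch"))
            elif not (spec["min"] <= v <= spec["max"]):
                anomalies.append((i, col, "out of range"))
    return anomalies

baseline = [{"age": 25, "income": 40000}, {"age": 61, "income": 90000}]
schema = infer_schema(baseline)
new_batch = [{"age": 30, "income": 50000}, {"age": -5, "income": 50000}]
print(validate(new_batch, schema))  # [(1, 'age', 'out of range')]
```

Real systems refine the inferred schema with user-supplied constraints; a range learned from a small baseline is deliberately loose and would otherwise over-report anomalies.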
2) DATA CLEANING
Data cleaning involves eliminating errors in the data by enforcing various integrity constraints, including key constraints, domain constraints, referential integrity constraints, and functional dependencies. Data cleaning techniques have become sophisticated: state-of-the-art systems like HoloClean [48] repair data using probabilistic inference, satisfying integrity constraints, checking value validity against external dictionaries, and using quantitative statistics.
Exclusively focusing on data correction does not ensure optimal model accuracy. Clean data is not always a clear-cut concept, and removing all possible errors is not always attainable. The CleanML [49] framework assesses different data cleaning methods to determine whether they enhance model accuracy and reveals that data cleaning does not necessarily improve machine learning models and can even have a detrimental impact. However, selecting the correct machine learning model can counterbalance the negative effects of data cleaning. There is no one-size-fits-all cleaning algorithm that is effective for all types of noise, and the selection of the algorithm depends on the type of noise. Care must be taken when utilizing data cleaning techniques that are not specifically designed for machine learning, as they frequently have high-impact parameters that require proper tuning, similar to the tuning of machine learning hyperparameters.
Advanced data cleaning techniques use probabilistic inference, external dictionaries, and statistical methods to repair data. However, it is not always clear whether data cleaning improves model accuracy; sometimes cleaning data may have a negative effect. There are data cleaning techniques designed specifically to improve model accuracy, such as ActiveClean [50], which iteratively cleans samples and updates the model, and TARS [51], which cleans labels to improve model accuracy. Recently, there has been a rise in systematic approaches to data cleaning for machine learning, including CPClean [93], which analyzes the impact of missing data on predictions, and a study that proposes solutions to tackle data quality issues in MLOps [94]. The most common challenges in data cleaning for machine learning include handling multimodal data and data that changes over time.
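As a small illustration of the integrity constraints discussed above, the sketch below detects violations of a functional dependency (zip code determines city) by flagging values that disagree with the majority value for their key. Systems like HoloClean combine such signals probabilistically; the records here are invented.

```python
# Detecting functional-dependency violations (lhs -> rhs): a row whose
# rhs value conflicts with the majority value for its lhs key is flagged
# as a likely error, one of the constraint types cleaning systems check.
from collections import Counter, defaultdict

def fd_violations(rows, lhs, rhs):
    """Return (row_index, observed, majority) for each violating row."""
    seen = defaultdict(Counter)
    for row in rows:
        seen[row[lhs]][row[rhs]] += 1
    violations = []
    for i, row in enumerate(rows):
        majority, _ = seen[row[lhs]].most_common(1)[0]
        if row[rhs] != majority:
            violations.append((i, row[rhs], majority))
    return violations

records = [
    {"zip": "94301", "city": "Palo Alto"},
    {"zip": "94301", "city": "Palo Alto"},
    {"zip": "94301", "city": "Palo Alt"},   # likely a typo
    {"zip": "10001", "city": "New York"},
]
print(fd_violations(records, "zip", "city"))  # [(2, 'Palo Alt', 'Palo Alto')]
```

Majority voting per key is a heuristic repair signal; when the majority itself is wrong, probabilistic repair or external dictionaries are needed, which is precisely the gap HoloClean-style systems address.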

3) DATA SANITIZATION
Data poisoning is a serious issue in which a small fraction of training data is altered with malicious intent to cause a machine learning model to fail, and such attacks are becoming more sophisticated and harder to defend against [40], [95]. One approach to defending against data poisoning is data sanitization, which detects and discards poisoned examples using outlier detection. However, current data sanitization techniques have proven ineffective against carefully designed attacks [54], [96]. The field of data poisoning and sanitization is constantly evolving and requires further research.
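A toy sketch of sanitization by outlier detection follows, using a robust median/MAD rule to drop a planted outlier. As noted above, carefully crafted poisons can evade such simple defenses; the data and threshold are illustrative only.

```python
# Toy data sanitization: drop points far from the median, measured in
# units of the median absolute deviation (MAD). Median/MAD is used
# instead of mean/stdev because a large poison point inflates the
# standard deviation and can mask itself from z-score filters.
import statistics

def sanitize(values, k=5.0):
    """Keep values within k MADs of the median; drop the rest."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= k * mad]

data = [1.0, 1.2, 0.9, 1.1, 1.0, 0.95, 1.05, 50.0]  # 50.0 is a planted poison
print(sanitize(data))  # the planted 50.0 is discarded
```

Against an adaptive attacker who places poisons just inside the threshold, this filter fails, which is why the literature cited above treats sanitization as a necessary but insufficient defense.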

4) MULTIMODAL DATA INTEGRATION
Multimodal data integration refers to the process of combining data from multiple sources with different modalities (e.g., video streams, radar data, and time series) [97]. Integrating such data is important in machine learning, particularly in applications such as autonomous vehicles. Two primary techniques are alignment and co-learning: alignment entails discovering connections between sub-components of instances that possess multiple modalities, while co-learning involves training a model on one modality while leveraging information from another. Data integration is a vast field that has been explored for many years; however, not all of its techniques are applicable to machine learning [98].
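One concrete alignment step, pairing each camera frame with the radar reading nearest in time, can be sketched as follows. The timestamps are hypothetical, and radar timestamps are assumed to be sorted.

```python
# Temporal alignment of two sensor streams: for each camera frame,
# find the index of the radar reading closest in time, using binary
# search over the (sorted) radar timestamps.
import bisect

def align(frame_times, radar_times):
    """Return (frame_time, nearest_radar_index) pairs."""
    pairs = []
    for t in frame_times:
        i = bisect.bisect_left(radar_times, t)
        # The nearest reading is either just before or just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(radar_times)]
        best = min(candidates, key=lambda j: abs(radar_times[j] - t))
        pairs.append((t, best))
    return pairs

frames = [0.00, 0.10, 0.20]          # camera timestamps (seconds)
radar = [0.02, 0.09, 0.21, 0.33]     # radar timestamps, sorted
print(align(frames, radar))  # [(0.0, 0), (0.1, 1), (0.2, 2)]
```

Real pipelines add clock-offset correction and tolerance windows, but nearest-timestamp matching is the basic primitive on which such alignment is built.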

VI. RELATED WORK
In recent times, there has been a remarkable shift in importance from model-centric AI toward data-centric AI. Industries are undergoing significant changes in their core architecture, becoming more focused on data, and many companies are adopting data-centric approaches for their daily tasks. To address the challenges of current workflow methodologies, where data engineering and model engineering are treated as separate cycles, a range of tools has emerged, such as CleanLab, LandingLens, Snorkel, AutoAlbument, HoloClean, Albumentations, and more. These tools support various production processes by identifying and resolving issues step by step. This progress has been made possible through structured thinking and dividing tasks into distinct yet comprehensive components. Several benchmarks, such as dcbench [99], ImageNet [100], and MLPerf [101], have been developed to evaluate and validate the effectiveness and quality of these tools based on parameters such as training data size, budget-restricted data cleaning, object detection, computer vision, and others. In a study by Chen et al. [102], the authors introduce the data-centric AI approach and discuss the CLIP model for multimedia misogyny detection. Additionally, Motamedi et al. [103] propose a data-centric AI approach for enhancing data quality and suggest a solution based on Generative Adversarial Networks (GANs) to synthesize new data points. Their findings demonstrate that the optimized dataset generated by the pipeline improves accuracy while being significantly smaller than the baseline.
RoboFlow [104], a data-centric cloud-based workflow management system, is revolutionizing the creation of AI-enhanced robots. By breaking down the development process into four building blocks and prioritizing data over conventional process-centric techniques, RoboFlow enables the creation of data-driven AI-enhanced robots. In the field of materials science [105], machine learning is facilitating the prediction of structural features and the discovery of new materials [106]. By utilizing data-centric machine learning methodologies, accurate predictions can be made with minimal theoretical data, improving computational efficiency. In the renewable energy sector, while most research focuses on model-centric predictive maintenance techniques, a shift toward data-centric approaches is gaining attention [107]. By adopting data-centric machine learning, the accuracy of climate change prediction and wind turbine understanding can be enhanced, leading to increased utilization of green energy [108], [109].
In a data-centric approach, Zeiser et al. [110] propose a method combining a WGAN and an encoder CNN for advanced process monitoring and prediction in real-world industrial applications. They highlight the importance of context-based data preparation and demonstrate that improving data quality has a significant impact on prediction accuracy. Safikhani et al. [111] argue that transformer-based language models can enhance automated occupation coding. By fine-tuning BERT and GPT-3 on pre-labeled data and incorporating job titles and task descriptions, they achieve a 15.72 percentage point performance increase over existing methods. They also introduce a hierarchical classification system based on the KldB standard for encoding occupation details. In Wang et al. [112], a data-centric analysis of on-tree fruit detection using deep learning is conducted. They explore the challenges of fruit detection in precision agriculture and evaluate the impact of various data attributes on detection accuracy. They also investigate the relationship between dataset-derived similarity scores and expected accuracy, as well as the sufficiency of annotations for single-class fruit detection using different horticultural datasets.
In Chen and Chou [102], a system for detecting misogynous meme images is presented, using coherent visual and language features from the CLIP model and the data-centric AI principle. The system achieved a competitive ranking in the SemEval-2022 MAMI challenge, demonstrating the effectiveness of leveraging Transformer models and focusing on data rather than extensive model tweaking. The proposed approach uses logistic regression for binary predictions and acknowledges the importance of data quality and quantity. Although limitations are not explicitly mentioned, the method's performance may be influenced by the available data and its reliance on pre-trained models. Zhong et al. [114] proposed a data-centric approach to enhance the robustness of deep neural network (DNN) models against malicious perturbations. By focusing on dataset enhancement, their algorithm improves model performance against adversarial examples and common corruptions. The effectiveness of the approach is demonstrated through high rankings in a data-centric robust learning competition, highlighting the algorithm's success across various DNN models.

VII. CHALLENGES AND OPPORTUNITIES
Though data-centric AI shines as a promising advancement, its current limitations call for a more nuanced approach. We encourage leveraging both methods synergistically: using data-centric practices for robust data engineering while refining models through conventional optimization techniques. This, as proposed in [15], treats data and model as intertwined gears driving results, acknowledges the current landscape, and recognizes the contributions of prior work. Fig. 5 illustrates the main challenges of the data-centric AI approach. Data-centric production faces challenges such as maintaining consistency, retaining adequate volume after cleaning, quality of maintenance, the necessity of a proper data versioning system, etc. Among these, one of the most significant is the loss of volume associated with cleaning and validating data: an AI model can only receive a massive amount of high-quality data if low-quality records are removed.

A. CHALLENGES AND SOLUTIONS
The main challenges of data-centric artificial intelligence (AI) are given below:
1) Data quality and quantity: One of the biggest challenges in data-centric AI is ensuring that the data used to train and test AI models is of high quality and sufficient quantity. Data may be incomplete, noisy, or biased, which can lead to poor performance or inaccurate predictions. Data preprocessing is a crucial step in the AI pipeline, but it can be time-consuming and computationally expensive; it includes cleaning, normalizing, and transforming data, as well as handling missing or corrupted data. Data labeling is also a time-consuming and costly task. Additionally, obtaining large amounts of high-quality data can be difficult and expensive, particularly in domains such as healthcare or finance.
2) Handling uncertainty: AI systems often need to make decisions in uncertain and dynamic environments, where data is incomplete, noisy, or ambiguous. This requires methods for dealing with uncertainty, such as probabilistic and Bayesian approaches.
3) Real-time processing: Many data-centric AI systems must analyze and act on real-time data streams, such as self-driving cars, financial trading, sensor data, or social media feeds. This requires processing data quickly and making predictions in near-real time, which calls for more efficient and effective algorithms for real-time data processing and analysis.
4) Integration of multiple data sources: Integrating and analyzing multiple data sources, such as sensor data, social media data, and other types of unstructured data, is a major challenge, because data from different sources may have different formats, structures, and levels of quality, making them difficult to combine and analyze effectively.
5) Generalization: AI models often struggle to generalize their predictions to new or unseen data, particularly when it differs significantly from the training data [15]. This is a major concern in real-world applications, where data is often noisy and unpredictable.
6) Data privacy and security: Data may contain sensitive or personal information that must be protected from unauthorized access or misuse. As AI systems increasingly analyze sensitive or personal data, such as medical records or financial data, methods for data anonymization and differential privacy are being developed to address these issues. Additionally, AI systems themselves may be vulnerable to attacks such as adversarial examples or model stealing.
7) Explainability and interpretability: As AI systems become more complex and autonomous, particularly those based on deep learning, it can be difficult for humans to understand how they make decisions or predictions. This makes it hard to trust and validate the results and can limit the use of AI in sensitive applications such as healthcare, finance, or criminal justice. Researchers are working on methods for making AI systems more transparent and interpretable.
8) Scalability and robustness: As the amount of data being generated and collected continues to grow, it becomes increasingly difficult to process and analyze it in a timely and efficient manner. Many existing AI systems are not well suited to these challenges, and research is needed to improve their scalability and robustness, particularly in resource-constrained environments.
9) Multi-modal and multi-sourced data: As the amount and diversity of data increases, it becomes more challenging to integrate and make sense of data from multiple sources, such as sensor data, social media data, and other types of unstructured data.
10) Causality and counterfactual reasoning: AI systems are often used to make predictions or decisions, but it can be difficult to understand the causal relationships underlying them. It is also challenging for AI systems to reason about counterfactual scenarios, where the outcome of an event differs from what actually happened.
11) Human-AI collaboration: As AI systems become more integrated into various domains, more effective methods of collaboration between humans and AI systems are needed. Challenges include trust, transparency, and explainability: making AI systems more transparent and explainable to humans, and developing ways for humans to provide feedback and guidance to AI systems.
12) Bias and fairness: AI models can inadvertently introduce bias into their predictions, particularly if the training data is not representative of the population, resulting in unfair or discriminatory decisions. This is a significant concern in applications such as healthcare, finance, and law enforcement. Researchers are working on methods for detecting and mitigating bias in AI systems.
13) Integration with other technologies: AI systems often need to be integrated with other technologies, such as sensors, IoT, and robotics. This can be challenging because each technology has its own specificities and constraints.
14) Ethics and regulation: As AI systems become more sophisticated and prevalent, there are important ethical and regulatory challenges to consider, such as the impact of AI on jobs and society, the accountability of AI systems, and the potential misuse of AI.
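As a toy illustration of the differential-privacy methods mentioned under data privacy and security, the sketch below answers a count query with Laplace noise. The epsilon value and the data are illustrative only; a production mechanism would need careful calibration and privacy accounting.

```python
# Toy differentially private count query: add Laplace noise scaled to
# the query's sensitivity (1 for a count) divided by epsilon, so that
# any single record's presence or absence is statistically masked.
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=1.0):
    """Answer a count query with epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 47]  # hypothetical sensitive records
print(private_count(ages, lambda a: a >= 40))  # near the true count of 3
```

Smaller epsilon values add more noise and give stronger privacy; the analyst trades query accuracy for protection of individual records.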
Therefore, a data-centric method frequently needs a more significant data volume than a model-centric one. This raises potential issues where cleaning might further reduce an already insufficient amount of data collected under technological or monetary constraints; for example, in scientific research, the model might be continuously tuned to generate and work on experimental data. Working in a constantly evolving setting can be challenging, primarily due to changing algorithms and margins for error. In many cases, rules-based algorithms result in a high rejection rate because the technology must differentiate between genuinely defective components and acceptable levels of variation, especially in manufacturing and production-based industries. This forces a large percentage of human follow-up inspection, which raises costs and slows down production lines.
Robustness is the model's ability to preserve performance under a small amount of noise in the data, such that training error rates are consistent with testing error rates. Fairness refers to the model's ability to defend against unnatural biases, such that the model does not inadvertently introduce any bias. In the case of data quality issues, the focus shifts to robust training algorithms when validation of noisy data is insufficient, and/or to fairness when pre-processing is insufficient to remove biases from the data. An AI system should be trained on the same type of data on which it will be tested and analyzed, including any edge cases. Additionally, as part of quality control, properties of data records that are not causative features should be randomized during training. An AI model quickly loses accuracy without consistent data annotation. Unfortunately, maintaining a high level of consistency is difficult, and this poses a significant challenge for data-centric AI, as it requires human annotation that is costlier than machine computation, especially in environments that emphasize maximum automation.

B. OPPORTUNITIES
Several research opportunities can advance the data-centric AI field going forward. The main challenge in applying a data-centric approach to embedded ecosystem applications is complexity: it is challenging to deploy data-centric AI on portable devices. In this context, one promising research direction is the Edge Impulse-based data-centric approach for a wide range of hardware targets [126]. Edge Impulse is a cloud-based ML operations (MLOps) platform, like Microsoft Azure [127] or Google Vertex AI [128], for developing embedded and Tiny Machine Learning (TinyML) [129] systems.
Researchers may also conduct in vivo experiments on data-centric Green AI [109], analyzing how combining data-centric and Green AI techniques in real-world AI pipelines may impact the energy efficiency and accuracy of AI models.
The research opportunities in the field of data-centric artificial intelligence are given below:
• Developing methods for effectively and efficiently processing and analyzing large amounts of data, such as big data and real-time streaming data.
• Developing more efficient and accurate algorithms for data pre-processing, cleaning, and feature extraction.
• Improving the performance of machine learning models by developing new architectures and training techniques.
• Developing new methods for dealing with large and complex datasets, such as deep learning and reinforcement learning.
• Developing new approaches for dealing with unstructured and semi-structured data, such as natural language processing and image processing.
• Developing new methods for interpretability and explainability of AI models.
• Developing methods for integrating AI with other technologies, such as the Internet of Things (IoT) and edge computing.
• Developing new ways to use AI to analyze and make sense of data from multiple sources, such as sensor data, social media data, and financial data.
• Developing new approaches for dealing with privacy and security issues related to data-centric AI.
• Developing new ways to use AI to improve decision-making in various domains, such as healthcare, finance, and transportation.
• Researching the ethical and societal impacts of AI.
• Developing more efficient and effective algorithms for data processing and analysis, such as machine learning and deep learning techniques.
• Improving the scalability and robustness of AI systems to handle large and complex data sets.
• Developing new methods for integrating and analyzing multiple data sources, such as sensor data, social media data, and other types of unstructured data.
• Improving the performance and accuracy of machine learning and deep learning models through the use of advanced techniques such as transfer learning and ensemble learning.
• Developing new and innovative applications of AI in areas such as healthcare, finance, and transportation, to name a few.
• Exploring the use of AI for decision-making and problem-solving in complex and dynamic environments, such as autonomous systems and smart cities.
• Understanding and addressing the ethical and societal implications of AI, including issues related to bias, transparency, and accountability.
• Developing new methods for explainable AI (XAI) which can help to increase the transparency and interpretability of AI systems.
• Designing AI systems that can learn and adapt to changing environments and data distributions.
• Researching ways to make AI more robust and secure, such as against adversarial attacks.

VIII. DISCUSSION
Initially, the model-centric AI approach appears more reasonable: modify the model (algorithm/code) to enhance the system's efficiency, rather than altering the data. However, when faced with the constraints of model-centric solutions, data-centric approaches become pertinent, as discussed previously. Model-centric classification algorithms employ regularization techniques to adjust the algorithm or code in order to enhance model performance. Nevertheless, for real-time applications to function effectively, reliable data is essential, and this consideration should extend to the underlying model to the same degree. Therefore, it is important to view model-centric and data-centric approaches as complementary and not to dismiss the model-centric approach entirely due to its limitations in specific organizations and industries. First, in the initial stages of problem-solving, it is crucial to emphasize not only understanding the properties and information of things but also how we interact with them. In the context of AI development, this implies a greater focus on designing and refining intelligent algorithms, as well as fine-tuning the hyperparameters of the underlying model and the code that implements the algorithms. Defining the dataset at this project stage is of utmost importance, as it serves as the foundation for comparing and categorizing the performance of models.
Second, commencing the AI system design with a model-centric approach provides ample opportunities to gain hands-on experience in understanding real-world problems and exploring potential computational solutions, rather than the experience that might have been acquired through adopting a data-centric approach in research and industry.
When attempting to solve a problem, it is often in our nature to start by taking action to make an impact before delving further into understanding the properties of items in our environment. Reinforcement learning, a recently well-publicized machine learning (ML) technique, is based on the idea that intelligent agents learn by interacting with their environments. Third, it would be challenging to relate the final models to real-world problems because they would have undergone a completely different development methodology. In the early phases of AI development, both academia and industry may have preferred the model-centric approach over the data-centric method.
A general architecture of model-centric AI, data-centric AI, and model-data-centric AI is shown in Fig. 6. Model-data-centric AI combines the benefits of both model-centric AI and data-centric AI, seeking to balance the refinement of models and the improvement of training data, recognizing that both aspects are crucial for achieving optimal results. The choice of approach should take into consideration the specific needs, available resources, and objectives of the AI project at hand.

IX. CONCLUSION
This paper presented a high-level review of the data-centric AI (DCAI) paradigm, connecting all relevant topics to help academicians, researchers, and engineers manage AI systems. Data-centric AI can constitute a complementary paradigm to model-centric AI for building successful AI-based systems for real-world applications. It emphasizes the importance of data quality in the performance of AI systems, recognizing data as dynamic and constantly evolving rather than a static input for preprocessing; improving data quality is a continuous process throughout the system's life cycle. This article presents DCAI as a developing movement and proposes four principles to guide its efforts. The future of data-centric AI lies in establishing systematic processes for monitoring and enhancing data quality, which requires more focus on data supply, preparation, annotation, and integration into AI systems.

FIGURE 4. Time invested by data scientists in various phases.

TABLE 2. Representative tasks and methods for training data development for data-centric AI applications.

TABLE 3. A brief summary of existing literature in data-centric AI.