Abstract:
In a typical Data Science project, the analyst uses many programming languages to explore and analyze big data coming from diverse data sources. A major challenge is managing and pre-processing so much data, which may have inconsistent content, significant redundancy, diverse formats and varying quality. Database systems research has tackled such problems for a long time, but mostly on relational databases. With this motivation in mind, this paper compares, from a database perspective, the strengths and weaknesses of three languages popular today: Python, R and SQL. We discuss the entire analytic pipeline, going from data integration, cleaning and pre-processing to model application and tuning. From a database systems perspective, we present a comprehensive survey of storage mechanisms, data processing algorithms, external algorithms, run-time memory management, consistency, optimizations and parallel processing. From a programming languages angle, we consider elegance, expressiveness, abstraction, composability, interactive behavior and automatic code optimization. We present a short experimental evaluation comparing the performance of the three languages on typical data exploration and pre-processing tasks. Our conclusion: there is no winner.
Date of Conference: 15-18 December 2021
Date Added to IEEE Xplore: 13 January 2022