A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective

Abstract:

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.

Published in: IEEE Transactions on Knowledge and Data Engineering (Volume: 33, Issue: 4, 01 April 2021)

Page(s): 1328 - 1347

Date of Publication: 08 October 2019

DOI: 10.1109/TKDE.2019.2946162
