Is Your Model Sensitive? SPEDAC: A New Resource for the Automatic Classification of Sensitive Personal Data


Flow of sensitive detection model representing the applied experimental process.

Abstract:

In recent years, there has been an exponential growth of applications, including dialogue systems, that handle sensitive personal information. This has brought to light the extremely important issue of personal data protection in virtual environments. Sensitive information detection (SID) has been studied across different domains and languages in the literature. However, in the personal data domain, the absence of a shared standard benchmark makes comparison with the state of the art difficult. To fill this gap, we introduce and release SPEDAC, a new annotated resource for the identification of sensitive personal data categories in the English language. SPEDAC enables the evaluation of computational models on three SID subtasks of increasing complexity. SPEDAC 1 concerns binary classification: a model has to detect whether a sentence contains sensitive information. In SPEDAC 2, we collected sentences labeled with 5 categories that relate to macro-domains of personal information. In SPEDAC 3, the labeling is fine-grained and includes 61 personal data categories. We conduct an extensive evaluation of the resource using different state-of-the-art classifiers. The results show that SPEDAC is challenging, particularly with regard to fine-grained classification. Classifiers based on transformer architectures achieve good results on SPEDAC 1 and 2 but have difficulty discerning among the fine-grained classes in SPEDAC 3.
Published in: IEEE Access (Volume: 11)
Page(s): 10864 - 10880
Date of Publication: 30 January 2023
Electronic ISSN: 2169-3536

Gaia Gambarelli
FICLIT, University of Bologna, Bologna, Italy
Ellysse srl, Reggio Emilia, Italy
Gaia Gambarelli was born in Correggio, Reggio Emilia, Italy, in 1994. She received the M.S. degree in Italian studies, linguistics, and European literary cultures, with a thesis in computational linguistics, from the University of Bologna, in March 2019, where she is currently pursuing the Ph.D. degree in digital humanities with the Department of Classical Philology and Italian Studies (FICLIT). She has industrial work experience as an expert in conversational agents. Her Ph.D. project concerns the protection of sensitive data through its automatic identification in text. Her main current research interests include natural language processing, semantics, and linguistics applied to privacy protection and conversational agents. Previously, she worked on automatic personality recognition, rhetoric, and the theory of argumentation.
Aldo Gangemi
FICLIT, University of Bologna, Bologna, Italy
ISTC-CNR, Rome, Italy
Aldo Gangemi is currently a Full Professor with the University of Bologna and the Director of the Institute for Cognitive Sciences and Technologies, Italian National Research Council, where he co-founded the Semantic Technology Laboratory (STLab) in 2008. He has published more than 250 papers in international peer-reviewed journals, conferences, and books (scholar H-index = 61). He sits as an EiC member or an EB member of international journals (Semantic Web, Web Semantics, and Applied Ontology), has served as the Conference Chair (EKAW2008, WWW2015, and ESWC2018/9), has coordinated research teams in eight EU projects, and is the Scientific Coordinator of the H2020 SPICE Project. His research interests include semantic technologies as an integration of methods from knowledge engineering, semantic web, linked data, cognitive science, and natural language processing. His theoretical interests concentrate upon the representation and discovery of knowledge patterns across data, ontologies, natural language, and cognition, using hybrid symbolic/sub-symbolic methods. Application domains include cultural heritage, robotics, medicine, law, e-government, agriculture and fishery, and business. He is also a member of the Board of Directors at the IMT School for Advanced Studies Lucca.
Rocco Tripodi
LILEC, University of Bologna, Bologna, Italy
Rocco Tripodi received the Ph.D. degree in computer science from the Ca’ Foscari University of Venice, with a thesis titled “Evolutionary game theoretic models for natural language processing,” in 2015. He was a Research Assistant and an Adjunct Professor at Ca’ Foscari University, where he worked on lexical semantics and taught corpus linguistics, natural language processing, and digital text analysis. He worked as a Researcher in different laboratories, including Sapienza NLP, Sapienza University of Rome, and the European Centre for Living Technology (ECLT), Venice, and on different European projects, including ODYCCEUS, MOUSSE, and Polifonia. He is currently an Assistant Professor at the University of Bologna. His research interests include the areas of machine learning and natural language processing with a focus on lexical and sentence-level semantics, learning models based on game-theoretic principles, and the design, learning, and evolution of linguistic communication systems.