
String Comparators for Chinese-Characters-Based Record Linkages


Kruskal-Wallis Test for Numbers of False Matches and False Non-matches Generated by Three Methods.

Abstract:

In the context of big data, data sharing between institutions can greatly reduce the cost of information collection and help obtain analysis results effectively and efficiently. Record linkage is the task of identifying records that refer to the same entity across heterogeneous data sources. In recent decades, extensive research on alphabet-based record linkage has been carried out, in which the Fellegi-Sunter model extended by Winkler has outperformed other approaches. However, performing record linkage on Chinese-character-based datasets remains a challenge. In this article, two set-based methods (Cosine similarity and Dice similarity) were first introduced, and the similarity of Chinese characters was then quantified using an adapted encoding technique that exploits both the shape and the pronunciation of each Chinese character. A new method, called Hybrid similarity, was then proposed, which combines the character transformation technique (SoundShape Code) with Dice similarity. Finally, we applied these methods to simulated datasets and evaluated each method by counting the number of misclassified record pairs and measuring the computational time. The results demonstrated that our Hybrid similarity method outperformed the others in reducing the number of misclassified pairs, at a relatively low computational cost.
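The two set-based comparators named in the abstract can be illustrated with a minimal sketch. It assumes each string is treated as a set of its individual characters and uses the standard set forms of the Dice and cosine coefficients; the article's exact definitions, and the SoundShape Code encoding behind the Hybrid similarity, are not given on this page and are not reproduced here. The function names and the sample names are hypothetical.

```python
import math


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient on character sets: 2|A ∩ B| / (|A| + |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def cosine_similarity(a: str, b: str) -> float:
    """Set-based cosine similarity: |A ∩ B| / sqrt(|A| * |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        # assumption: similarity is 1 only when both strings are empty
        return 1.0 if set_a == set_b else 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


if __name__ == "__main__":
    # Hypothetical name pair differing in the last character only.
    name_a, name_b = "张伟明", "张伟民"
    print(dice_similarity(name_a, name_b))
    print(cosine_similarity(name_a, name_b))
```

Because both measures compare exact characters, two names that differ by a similar-looking or similar-sounding character are penalized as much as two names that differ by an unrelated one, which is the gap the shape-and-pronunciation encoding described in the abstract is meant to address.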
Published in: IEEE Access (Volume: 9)
Page(s): 3735 - 3743
Date of Publication: 29 December 2020
Electronic ISSN: 2169-3536

Senlin Xu
Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China
Senlin Xu received the B.S. degree from the Jiangxi University of Finance and Economics, in 2019. He is currently pursuing the M.S. degree in probability and statistics with the College of Science, Huazhong Agricultural University.

Mingfan Zheng
Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China
Mingfan Zheng received the B.S. degree from the Guangdong University of Finance, in 2019. He is currently pursuing the M.S. degree in probability and statistics with the College of Science, Huazhong Agricultural University.

Xinran Li
Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China
Xinran Li received the B.S. degree in applied mathematics and the M.S. degree in statistics and data processing from the University of Clermont-Ferrand II, in 2009 and 2011, respectively, and the Ph.D. degree in biostatistics and medical informatics from the University of Clermont-Ferrand I, France, in 2015. He is currently an Associate Professor with the Department of Mathematics and Statistics, Huazhong Agricultural University, China. His research interests include record linkage, text mining, and statistical learning.
