Abstract:
Authorship attribution is the problem of identifying the author of a text based on its content, topic, and stylistic features. BERT exposes the weight assigned to each word through self-attention, which is effective for explaining the basis of an authorship attribution decision. However, this basis is not easy to capture because of the large volume of self-attention weights in a model. Topological data analysis (TDA) is a method for characterizing a set of points in space by its topology. Applying TDA to text data, we can capture how words attend to each other, that is, the "pattern of recognition" expressed by self-attention. In this study, we analyze the explanatory power of the model from the attention patterns of BERT by extracting TDA-based features based on zeroth-order persistent homology, and we examine these features with a classification model we developed and with visualizations based on uniform manifold approximation and projection (UMAP). The experimental results indicate that TDA-based features contain sufficient information to discriminate authors. The proposed method captures the basis of BERT's authorship attribution more clearly, which makes that basis easier to explain.
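As an illustration of what "zeroth-order persistent homology" of an attention pattern can mean in practice, the sketch below computes the H0 barcode of a single attention matrix. The abstract does not say how the attention graph is built or which summary features are used, so everything here is an assumption: the matrix is symmetrized with an element-wise maximum, weights are turned into distances as `1 - w`, and the function names `h0_persistence` and `h0_features` are hypothetical. Under this filtration every token is born at 0, and a connected component dies at the distance of the edge that merges it into another, which is what a union-find pass over the sorted edges records.

```python
def h0_persistence(attention):
    """Death times of H0 classes for one square attention matrix.

    Assumption (not from the abstract): the graph is built by symmetrizing
    the attention weights with max() and using 1 - weight as edge distance.
    """
    n = len(attention)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            w = max(attention[i][j], attention[j][i])
            edges.append((1.0 - w, i, j))
    edges.sort()  # add edges in order of increasing distance

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Two components merge here, so one H0 class dies at `dist`.
            parent[ri] = rj
            deaths.append(dist)
    return deaths  # n - 1 finite bars; one component never dies


def h0_features(attention):
    """Hypothetical summary statistics of the H0 barcode."""
    deaths = h0_persistence(attention)
    if not deaths:
        return {"n_bars": 0, "sum": 0.0, "mean": 0.0, "max": 0.0}
    return {
        "n_bars": len(deaths),
        "sum": sum(deaths),
        "mean": sum(deaths) / len(deaths),
        "max": max(deaths),
    }


# Toy 3-token attention matrix (made-up values, rows sum to 1).
toy_attention = [
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
]
toy_features = h0_features(toy_attention)
```

In a pipeline of the kind the abstract describes, feature vectors like `toy_features` would be computed per attention head and per layer, then passed to a classifier or to a UMAP projection. Those stages are not shown here.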
Published in: 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC)
Date of Conference: 18-21 February 2025
Date Added to IEEE Xplore: 19 March 2025