Factors That Influence Automatic Recognition of African-American Vernacular English in Machine-Learning Models | IEEE Journals & Magazine | IEEE Xplore

Abstract:

Racial bias is a well-documented problem in natural language processing (NLP). The dialectal language used by marginalized groups is often misclassified or mischaracterized by language models, which in turn can further disenfranchise these populations. Previous work has noted that some popular language identification (LID) models perform worse when classifying tweets that contain African-American Vernacular English (AAVE) than when classifying tweets that contain White-Aligned English (WAE). This work examines the factors that contribute to racial bias in language models for the LID task. The contributions of this work are two-fold. First, a thorough analysis demonstrates that a lack of “unique” language-specific n-gram features in an LID model can lead to poor performance on dialectal data, especially on shorter inputs like those typically found on social media. Second, based on these findings, this work introduces and illustrates the efficacy of two simple yet accurate solutions: (i) mining “unique” n-gram features and (ii) including examples of dialectal English in training data. These solutions mitigate the accuracy gap between WAE and AAVE that some LID models exhibit when classifying shorter inputs. Mining for unique features and training with a more diverse dataset reduce the disparity on short sequences by 6% and 9.8%, respectively.
Page(s): 509 - 516
Date of Publication: 08 November 2023
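
The abstract does not define how “unique” n-gram features are mined, so the following is only a minimal sketch under one assumed reading: a character n-gram counts as unique to a language if it occurs in that language's training examples and in no other language's. The function names (`char_ngrams`, `mine_unique_ngrams`, `unique_hits`), the trigram size, and the toy sentences are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return the set of character n-grams in a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def mine_unique_ngrams(corpora, n=3):
    """Map each language to the n-grams seen in that language only.

    corpora: dict of language code -> list of example strings.
    Assumed definition of "unique"; the paper may use a different one.
    """
    seen_in = defaultdict(set)  # n-gram -> set of languages containing it
    for lang, texts in corpora.items():
        for text in texts:
            for gram in char_ngrams(text, n):
                seen_in[gram].add(lang)

    unique = {lang: set() for lang in corpora}
    for gram, langs in seen_in.items():
        if len(langs) == 1:
            unique[next(iter(langs))].add(gram)
    return unique


def unique_hits(text, unique, n=3):
    """Count how many of each language's unique n-grams occur in text."""
    grams = char_ngrams(text, n)
    return {lang: len(grams & feats) for lang, feats in unique.items()}


# Toy corpora, invented for illustration only.
corpora = {
    "en": ["the weather is nice today"],
    "es": ["el clima es agradable hoy"],
}
unique = mine_unique_ngrams(corpora, n=3)
hits_long = unique_hits("the weather", unique, n=3)
hits_short = unique_hits("ok", unique, n=3)  # too short to contain a trigram
```

In this sketch a very short input such as `"ok"` matches no unique feature for any language, which is one way to picture why short social-media inputs give an n-gram-based LID model little evidence to work with.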
