Abstract:
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi script adds to this complexity: it is cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
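Concretely, treating segmentation as sequence labeling means assigning each character of the unsegmented input a boundary tag and then reading words off the tag sequence. The snippet below illustrates the idea with an English stand-in string and the common B/I/E/S scheme (begin, inside, end, single-character word); the paper's exact tag inventory is not given in the abstract and may differ.

```python
# Stand-in for raw, unsegmented text (the real input would be Sindhi characters).
chars = list("thisisatest")
tags = ["B", "I", "I", "E",   # "this"
        "B", "E",             # "is"
        "S",                  # "a"
        "B", "I", "I", "E"]   # "test"

# Recover words from the predicted boundary tags.
words, buf = [], []
for ch, tag in zip(chars, tags):
    buf.append(ch)
    if tag in ("E", "S"):     # a word ends at this character
        words.append("".join(buf))
        buf = []
print(words)                  # ['this', 'is', 'a', 'test']
```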
The proposed SGNWS model comprises three core parts: the encoder, the self-attention mechanism, and the decoder. The training details of the proposed model are then presented.
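A minimal PyTorch sketch of this three-part pipeline is given below. It is an illustrative reconstruction, not the authors' implementation: the hyperparameters, the four-tag inventory, and the use of the third-party pytorch-crf package for the CRF decoder are all assumptions, and plain multi-head self-attention stands in for the paper's position-aware variant.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party: pip install pytorch-crf (assumed here)


class SGNWSSketch(nn.Module):
    """Illustrative BiLSTM encoder -> self-attention -> CRF decoder tagger."""

    def __init__(self, vocab_size, emb_dim=128, hidden=256, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # subword embeddings
        self.encoder = nn.LSTM(emb_dim, hidden // 2,
                               bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.emit = nn.Linear(hidden, num_tags)               # per-position tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, ids, tags=None, mask=None):
        h, _ = self.encoder(self.embed(ids))                  # contextual encoding
        pad = ~mask if mask is not None else None             # True = padded position
        a, _ = self.attn(h, h, h, key_padding_mask=pad)       # self-attention
        emissions = self.emit(a)
        if tags is not None:                                  # training: CRF loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)          # inference: best tag paths
```

Calling the model with gold tags returns the negative log-likelihood used as the training loss; calling it without tags returns the Viterbi-decoded tag sequence for each input.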
Published in: IEEE Access (Volume 12)
Index Terms: Representation Learning, Substring, Word Segmentation, Neural Model, Bidirectional Long Short-Term Memory, Conditional Random Field, Sequence Labeling, Neural Network-Based Model, Raw Text, Word Boundaries, Cursive, Large Amounts of Text, Training Set, Decoding, Artificial Neural Network, Twitter, Softmax, Attention Mechanism, Trainable Parameters, White Space, Rule-Based Algorithm, Sequence Tags, Absence of Space, Morphemes, Unigram, Compound Words, Rule-Based Methods, Large Amount of Knowledge