Text Preprocessing for Urdu Text: A Survey of Techniques and Their Influence on NLP Tasks
DOI: https://doi.org/10.63075/6x0cdd67

Abstract
Text preprocessing (TP) has historically been a critical phase in Natural Language Processing (NLP) pipelines, aimed at transforming raw text into a cleaner, more manageable format for machine consumption. With the advent of sophisticated pre-trained Transformer models, the perceived necessity of explicit TP has been debated. This paper offers a comprehensive review of existing literature concerning text preprocessing, with a specific focus on its application and impact within Urdu Natural Language Processing. We delve into the unique linguistic challenges posed by Urdu, such as its rich morphology and Nastaliq script, and survey various preprocessing techniques including script normalization, stop word removal, and stemming/lemmatization. Through an extensive examination of past studies, we analyze how these techniques have influenced the performance of both traditional machine learning classifiers and modern deep learning architectures, including Transformer models, in Urdu text classification and other NLP tasks. This review synthesizes key findings from the literature, highlighting the enduring relevance of tailored TP strategies for optimizing Urdu NLP applications and identifying critical gaps for future research.
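To make the script-normalization step surveyed above concrete, the sketch below shows one common form it takes for Urdu: mapping Arabic code points that visually resemble Urdu letters onto their canonical Urdu forms, and stripping optional diacritics (harakat) that cause spelling variants of the same word. This is a minimal illustrative sketch, not the method of any specific paper reviewed here; the particular character mappings chosen are assumptions reflecting common practice.

```python
import re

# Map visually similar Arabic code points to canonical Urdu forms.
# These mappings are a common choice in Urdu normalizers, shown here
# as an illustrative subset, not an exhaustive table.
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh  -> Farsi/Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf  -> keheh
    "\u0647": "\u06C1",  # Arabic heh  -> heh goal
})

# Urdu is usually written without diacritics, so removing them
# (fathatan..sukun, plus superscript alef) collapses variant spellings.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def normalize_urdu(text: str) -> str:
    """Unify script variants, drop diacritics, and squeeze whitespace."""
    text = text.translate(ARABIC_TO_URDU)
    text = DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Running such a normalizer before tokenization means, for example, that a word typed with the Arabic yeh (U+064A) and the same word typed with the Urdu yeh (U+06CC) are no longer treated as distinct vocabulary items by downstream classifiers.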