Text Preprocessing for Urdu Text: A Survey of Techniques and Their Influence on NLP Tasks

Authors

  • Usama Shahid, Department of Computer Science, University of Southern Punjab, Multan
  • Mubasher Malik, Department of Computer Science, University of Southern Punjab, Multan
  • Talha Farooq Khan, Department of Computer Science, University of Southern Punjab, Multan
  • Rabia Rehman, Department of Computer Science, University of Southern Punjab, Multan

DOI:

https://doi.org/10.63075/6x0cdd67

Abstract

Text preprocessing (TP) has historically been a critical phase in Natural Language Processing (NLP) pipelines, transforming raw text into a cleaner, more manageable format for machine consumption. With the advent of sophisticated pre-trained Transformer models, the necessity of explicit TP has been debated. This paper reviews the existing literature on text preprocessing, with a specific focus on its application and impact within Urdu NLP. We examine the linguistic challenges posed by Urdu, such as its rich morphology and Nastaliq script, and survey preprocessing techniques including script normalization, stop word removal, and stemming/lemmatization. Drawing on past studies, we analyze how these techniques have influenced the performance of both traditional machine learning classifiers and modern deep learning architectures, including Transformer models, in Urdu text classification and other NLP tasks. The review synthesizes key findings from the literature, highlights the enduring relevance of tailored TP strategies for optimizing Urdu NLP applications, and identifies critical gaps for future research.
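
As a rough illustration of two of the techniques named in the abstract, script normalization and stop word removal, the following minimal Python sketch may help. It is not taken from the paper: the character mapping, the six-word stop list, and the function names normalize_urdu and preprocess are all assumptions chosen for illustration, and a real pipeline would likely use a curated stop word list and a fuller normalization table.

```python
import unicodedata

# Assumed, illustrative mapping from Arabic code points to the forms
# usually preferred in Urdu text. Real tables are considerably larger.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# Assumed, tiny stop word list (common Urdu function words), written
# with escapes so the exact code points are unambiguous.
STOP_WORDS = {
    "\u06A9\u0627",        # ka
    "\u06A9\u06CC",        # ki
    "\u06A9\u06D2",        # ke
    "\u0627\u0648\u0631",  # aur
    "\u06C1\u06D2",        # hai
    "\u0645\u06CC\u06BA",  # mein
}


def normalize_urdu(text):
    """Apply NFC, map Arabic variants to Urdu forms, drop combining marks."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    # Diacritics such as zabar, zer, and pesh are nonspacing marks (Mn).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def preprocess(text):
    """Normalize, split on whitespace, and remove stop words."""
    return [tok for tok in normalize_urdu(text).split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    # "kitab aur qalam", typed with an Arabic KAF in the first word.
    sample = "\u0643\u062A\u0627\u0628 \u0627\u0648\u0631 \u0642\u0644\u0645"
    print(preprocess(sample))
```

Whitespace splitting is used here only to keep the sketch short; Urdu word segmentation is itself a known difficulty, so a dedicated tokenizer would normally replace it.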

Published

2025-07-20

How to Cite

 Text Preprocessing for Urdu Text: A Survey of Techniques and Their Influence on NLP Tasks. (2025). Annual Methodological Archive Research Review, 3(7), 201-227. https://doi.org/10.63075/6x0cdd67