Plagiarism Detection Software, OCR, and Urdu Research: Challenges and Possibilities
پلیجیریزم ڈیٹکشن سافٹ ویئر ، او سی آر اور اردو تحقیق: مسائل اور امکانات
DOI: https://doi.org/10.52015/daryaft.v18i01.441
Keywords: Urdu Linguistics, OCR, Plagiarism, Urdu Corpus, Ligatures, Khat-e-Nastaliq
Abstract
This article explores the relationship between technology and Urdu literary research, focusing on the challenges posed by plagiarism detection systems and Optical Character Recognition (OCR). Unlike English, whose printed script is largely free of ligatures, Urdu is written in a cursive script with complex ligatures, which creates significant difficulties for OCR development. At present, the absence of a comprehensive Urdu corpus allows a degree of flexibility in plagiarism detection, because a large body of classical and handwritten (calligraphic) Urdu material is not yet available in editable digital formats. The study classifies the existing PDF formats of Urdu texts and evaluates the limitations of current OCR tools, including vFlat, Dastaan, and the OCR developed by the Center for Language Engineering (CLE), particularly in handling diverse fonts and traditional calligraphy (Khat-e-Nastaliq). The development of a universal Urdu OCR is essential for building a robust Urdu corpus. Although such a corpus would increase scrutiny through plagiarism detection software, it would ultimately raise academic standards in Urdu research by encouraging originality, critical engagement, and reduced reliance on unverified textual reproduction.
Conflict of Interest: The author declares that there are no conflicts of interest related to the research, authorship, and/or publication of this article, and that the data presented have not been fabricated or falsified.
Funding: This research did not receive any specific grant or financial support from public, commercial, or not-for-profit funding agencies.
Participant Consent: The author confirms that informed consent was obtained from all participants, and confidentiality was duly maintained.



