Sinhala Similarity & Plagiarism Detection
An AI-powered system that detects Sinhala-to-Sinhala textual plagiarism and English-to-Sinhala translated (semantic) plagiarism using modern embedding models, bringing academic integrity tools to Sinhala, Sri Lanka's low-resource language.
Detects Sinhala-to-Sinhala plagiarism using a fine-tuned multilingual BERT paired with a Siamese Bi-LSTM. Achieves precision and recall above 0.90 on a curated 2,000-document corpus.
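To make the precision and recall figures concrete, here is a minimal sketch of how such scores can be computed from binary predictions (1 = plagiarised pair). The labels below are hypothetical toy data, not results from the project's corpus.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = plagiarised)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical gold labels and detector outputs for ten document pairs.
gold = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

Precision penalises false accusations and recall penalises missed cases, so reporting both matters for an academic integrity tool.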
Identifies translated plagiarism where English text is machine-translated to Sinhala. Uses LaBSE cross-lingual sentence embeddings with a MarianMT translation fallback, targeting an F1 score above 0.88 in under 1 second.
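One way the embedding comparison and translation fallback could fit together is sketched below. This is an illustration under stated assumptions, not the project's implementation: `embed` and `translate` are hypothetical callables standing in for a LaBSE encoder and a MarianMT model, the `low`/`high` thresholds are invented, and the rule "translate only when the direct score is ambiguous" is an assumed reading of "fallback".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_lingual_score(en_text, si_text, embed, translate=None,
                        low=0.60, high=0.80):
    """Score an English/Sinhala pair; thresholds are illustrative only.

    embed:     callable text -> vector (stand-in for a LaBSE encoder)
    translate: optional callable English -> Sinhala (stand-in for MarianMT)
    """
    score = cosine(embed(en_text), embed(si_text))
    # Assumed fallback rule: translate only when the direct score is ambiguous.
    if translate is not None and low <= score < high:
        fallback = cosine(embed(translate(en_text)), embed(si_text))
        score = max(score, fallback)
    return score


# Toy stand-ins so the sketch runs without any model downloads.
toy_vectors = {"en": [1.0, 0.0], "si": [0.7, 0.7], "en->si": [0.7, 0.7]}
embed = toy_vectors.__getitem__
direct = cross_lingual_score("en", "si", embed)
with_fallback = cross_lingual_score("en", "si", embed,
                                    translate=lambda text: "en->si")
```

Running translation only on borderline pairs keeps the common path to a single embedding comparison, which is one plausible way to stay within a tight latency budget.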
Both components integrate into a single Flask REST API deployed in Docker. Lecturers upload Sinhala documents and receive a detailed plagiarism report with similarity scores in real time.
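As a rough illustration of what a report with similarity scores might look like, the sketch below assembles a JSON-serialisable dictionary that an API endpoint could return. The field names, document identifiers, and the 0.80 threshold are all assumptions made for this example.

```python
import json


def build_report(doc_id, matches, threshold=0.80):
    """Assemble a JSON-serialisable plagiarism report.

    matches:   list of (source_id, similarity) pairs from either detector
    threshold: illustrative cut-off at or above which a match is flagged
    """
    flagged = [
        {"source": src, "similarity": round(sim, 3)}
        for src, sim in sorted(matches, key=lambda m: m[1], reverse=True)
        if sim >= threshold
    ]
    return {
        "document": doc_id,
        "flagged": bool(flagged),
        "max_similarity": round(max((s for _, s in matches), default=0.0), 3),
        "matches": flagged,
    }


# Hypothetical scores for one uploaded document against three sources.
report = build_report("essay-17",
                      [("src-a", 0.91), ("src-b", 0.42), ("src-c", 0.85)])
payload = json.dumps(report)
```

Because the report is a plain dictionary, the same structure can be serialised by the API and stored for later review.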
Students copy English sources, translate them with Google Translate, and submit the result as original Sinhala work. Mainstream engines like Turnitin don't support Sinhala script, so lecturers are left to read submissions line by line, which is slow and error-prone.
We build the first publicly released parallel corpus for Sinhala–English plagiarism, fine-tune bilingual embedding models, and expose the detector as an open API, enabling real-time, script-aware integrity checks for Sri Lankan institutions.