Research Project · SLIIT · 25-26J-545

Sinhala Similarity & Plagiarism Detection

An AI-powered system that detects Sinhala-to-Sinhala similarity plagiarism and English-to-Sinhala semantic plagiarism using modern embedding models, bringing academic integrity tools to Sinhala, a low-resource language.

F1 > 0.90 · Target Accuracy
< 1 s · API Response
10k+ · Sentence Pairs
2 Languages · Sinhala & English
// mBERT
// LaBSE
// Siamese Bi-LSTM
// MarianMT
// Faiss Index
// Flask REST API
// Docker
Overview
What We're Building
Component 01
Sinhala Similarity Detection

Detects Sinhala-to-Sinhala plagiarism using fine-tuned multilingual BERT paired with a Siamese Bi-LSTM. Achieves precision & recall above 0.90 on a curated 2,000-document corpus.

mBERT · Siamese LSTM · Faiss
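
A minimal sketch of the retrieval step behind this component, assuming every document has already been encoded into a fixed-length vector by the fine-tuned model. The toy vectors, the `top_k_matches` helper and the 0.85 threshold are illustrative assumptions, not project values. The brute-force loop only stands in for the nearest-neighbour search that a Faiss index would perform at scale.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(
    query: Vector,
    corpus: Dict[str, Vector],
    k: int = 3,
    threshold: float = 0.85,  # hypothetical cut-off, not a project value
) -> List[Tuple[str, float]]:
    """Return up to k corpus documents whose similarity to the query meets the threshold."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    flagged = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return flagged[:k]


# Toy 3-dimensional "embeddings"; real model output has hundreds of dimensions.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
matches = top_k_matches([1.0, 0.05, 0.0], corpus)
```

In this toy corpus, `doc_a` and `doc_b` point in nearly the same direction as the query and are flagged, while `doc_c` is orthogonal to it and is dropped.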
Component 02
English → Sinhala Semantic

Identifies translated plagiarism, where English text is machine-translated into Sinhala. Uses LaBSE bilingual embeddings with a MarianMT fallback, targeting F1 > 0.88 with responses in under 1 second.

LaBSE · MarianMT · Cross-Lingual
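
One possible shape for the "embeddings first, translation as fallback" flow is sketched below. The page does not say when the fallback fires, so the ambiguous-score band (`low`, `high`) is an assumption, as are the function name `score_pair` and the three injected callables. In a real system those callables would wrap a cross-lingual encoder such as LaBSE, an English-to-Sinhala translator such as MarianMT, and a Sinhala sentence encoder. Here they are toy lookups with placeholder strings.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Embedder = Callable[[str], Vector]
Translator = Callable[[str], str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pair(
    english: str,
    sinhala: str,
    embed_cross_lingual: Embedder,
    translate_en_to_si: Translator,
    embed_sinhala: Embedder,
    low: float = 0.40,   # hypothetical lower bound of the ambiguous band
    high: float = 0.80,  # hypothetical upper bound of the ambiguous band
) -> Tuple[float, str]:
    """Score an English/Sinhala pair, translating only when the direct score is ambiguous."""
    direct = cosine_similarity(embed_cross_lingual(english), embed_cross_lingual(sinhala))
    # Clear-cut scores are accepted as-is, so only borderline pairs pay for translation.
    if direct < low or direct >= high:
        return direct, "direct"
    translated = translate_en_to_si(english)
    fallback = cosine_similarity(embed_sinhala(translated), embed_sinhala(sinhala))
    return max(direct, fallback), "fallback"


# Toy stand-ins for the real models; the strings are placeholders, not real Sinhala text.
cross_vectors = {"EN source": [1.0, 0.0], "SI submission": [0.6, 0.8]}
sinhala_vectors = {"SI translation": [0.0, 1.0], "SI submission": [0.0, 2.0]}

score, route = score_pair(
    "EN source",
    "SI submission",
    embed_cross_lingual=cross_vectors.__getitem__,
    translate_en_to_si=lambda text: "SI translation",
    embed_sinhala=sinhala_vectors.__getitem__,
)
```

Injecting the models as callables keeps the routing logic testable without loading any weights. It also makes the latency cost of the translation path explicit, which matters for a sub-second response target.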
Platform
Similarity.lk Unified Service

Both components integrate into a single Flask REST API deployed in Docker. Lecturers upload Sinhala documents and receive a detailed plagiarism report with similarity scores in real time.

Flask API · Docker · Open Source
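
To make "a detailed plagiarism report with similarity scores" concrete, here is a sketch of how per-sentence matches might be assembled into a JSON payload that an API route could return. Every field name, the `SentenceMatch` type and the 0.85 threshold are hypothetical; the page does not describe the actual response schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SentenceMatch:
    sentence_index: int  # position of the sentence in the uploaded document
    source_id: str       # identifier of the matched source (hypothetical format)
    similarity: float    # similarity score in [0, 1]


def build_report(
    document_id: str,
    total_sentences: int,
    matches: List[SentenceMatch],
    threshold: float = 0.85,  # hypothetical cut-off, not a project value
) -> dict:
    """Summarise the matches at or above the threshold into a JSON-serialisable report."""
    flagged = [m for m in matches if m.similarity >= threshold]
    flagged_sentences = {m.sentence_index for m in flagged}
    ratio = len(flagged_sentences) / total_sentences if total_sentences else 0.0
    ranked = sorted(flagged, key=lambda m: m.similarity, reverse=True)
    return {
        "document_id": document_id,
        "threshold": threshold,
        "flagged_ratio": round(ratio, 3),
        "max_similarity": max((m.similarity for m in flagged), default=0.0),
        "matches": [asdict(m) for m in ranked],
    }


report = build_report(
    "submission-001",
    total_sentences=4,
    matches=[
        SentenceMatch(0, "src-12", 0.93),
        SentenceMatch(2, "src-07", 0.88),
        SentenceMatch(3, "src-12", 0.41),
    ],
)
# ensure_ascii=False keeps Sinhala text readable if it is added to the payload later.
report_json = json.dumps(report, ensure_ascii=False, indent=2)
```

Keeping the report builder free of web-framework code means the same function can serve the REST endpoint and offline batch evaluation.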
The Problem
Sinhala plagiarism slips past every existing tool

Students copy English sources, translate them with Google Translate, and submit the result as original Sinhala work. Mainstream engines like Turnitin don't support Sinhala script, so lecturers are left to read submissions line by line, which is slow and error-prone.

Our Solution
First open Sinhala–English plagiarism benchmark

We build the first publicly released parallel corpus for Sinhala–English plagiarism, fine-tune bilingual embedding models, and expose the detector as an open API, enabling real-time, script-aware integrity checks for Sri Lankan institutions.
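
Since the targets on this page are stated as precision, recall and F1, a benchmark of labelled sentence pairs would typically be scored as sketched below. The label lists are made-up toy data, and `precision_recall_f1` is an illustrative helper, not part of the project's code.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Binary precision, recall and F1, where 1 = plagiarised pair and 0 = original pair."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for ten sentence pairs: 4 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

F1 rewards a detector only when precision and recall are both high. That suits plagiarism detection, where false accusations and missed cases both carry a cost.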