Literature survey, research gaps, problem statement, objectives, methodology and technologies powering Similarity.lk.
Literature Survey
Prior Work in Sinhala NLP & Plagiarism
Kasthuri, 2019 [1]
Word2Vec + Cosine Similarity for Sinhala News
Applied Word2Vec embeddings and cosine similarity on Sinhala news text for same-language similarity detection. Limited to monolingual Sinhala — cross-language cases were out of scope.
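The general technique can be sketched as follows. This is a minimal illustration, not the cited author's code: each text is represented by the mean of its word vectors, and two texts are compared by cosine similarity. The three-dimensional vectors and the helper names `average_vector` and `cosine_similarity` are invented for the example; real Word2Vec vectors are learned from a corpus and have far more dimensions.

```python
import math

def average_vector(tokens, word_vectors):
    """Mean of the vectors of the tokens that have an embedding, or None."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration only (not trained embeddings).
toy_vectors = {
    "පොත": [0.9, 0.1, 0.0],   # "book"
    "පොත්": [0.8, 0.2, 0.1],  # "books"
    "ගඟ": [0.0, 0.1, 0.9],    # "river"
}

doc_a = average_vector(["පොත"], toy_vectors)
doc_b = average_vector(["පොත්"], toy_vectors)
doc_c = average_vector(["ගඟ"], toy_vectors)

sim_ab = cosine_similarity(doc_a, doc_b)  # related words, score near 1
sim_ac = cosine_similarity(doc_a, doc_c)  # unrelated words, score near 0
```

Because both texts must be embedded with the same monolingual vocabulary, this approach has no way to compare a Sinhala sentence with an English one.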
Nilaxan, 2021 [2]
Siamese LSTM for Sinhala–Tamil Sentence Similarity
Introduced a Siamese LSTM architecture for measuring sentence similarity between Sinhala and Tamil. The work covered only the Sinhala–Tamil pair; English cross-language detection was not addressed.
Rajamanthri, 2021 [3]
Web Crawl + Jaccard Similarity in Sinhala
Used a web crawler to gather Sinhala content and applied Jaccard similarity for plagiarism detection. Effective at scale but limited to surface-level token matching within Sinhala only.
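A short sketch shows why token-level Jaccard similarity stays at the surface. It assumes whitespace tokenization and uses English sentences for readability; the function name `jaccard_similarity` is illustrative and not taken from the cited work.

```python
def jaccard_similarity(text_a, text_b):
    """Size of the shared token set divided by the size of the combined token set."""
    tokens_a = set(text_a.split())
    tokens_b = set(text_b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

original = "the river floods the village every year"
reordered = "every year the river floods the village"
paraphrase = "annual flooding hits the settlement"

reordered_score = jaccard_similarity(original, reordered)    # same tokens, new order
paraphrase_score = jaccard_similarity(original, paraphrase)  # same meaning, new words
```

Reordering leaves the token set unchanged, so the score stays at 1.0, while the paraphrase shares only one token with the original and scores 0.1 even though the meaning is preserved.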
Wickramasinghe [4]
Bilingual Word Vectors for Sinhala–English
Constructed bilingual word vectors for the Sinhala–English language pair. Plagiarism detection was not evaluated — only embedding quality was assessed.
Feng et al. [5]
LaBSE: Language-Agnostic BERT Sentence Embeddings
Google's LaBSE model maps sentences from 109 languages into a shared embedding space. It reported state-of-the-art results on cross-lingual sentence (bitext) retrieval at publication, and it is the backbone of our cross-language component.
Research Gap
What Existing Systems Miss
The cross-language detectors we surveyed do not cover Sinhala, and we found no public Sinhala–English plagiarism corpus. Existing Sinhala studies stay within Sinhala or the Sinhala–Tamil pair, so copying by translation from English goes undetected.
| Feature | Proposed System | Kasthuri 2019 | Nilaxan 2021 | Rajamanthri 2021 |
|---|---|---|---|---|
| Cross-language plagiarism handling | ✓ | ✗ | ✗ | ✗ |
| Bilingual sentence embeddings (LaBSE) | ✓ | ✗ | ✗ | ✗ |
| Contextual embeddings (mBERT) | ✓ | ✗ | ✗ | ✗ |
| Machine translation fallback | ✓ | ✗ | ✗ | ✗ |
| Handles deep paraphrase | ✓ | ✗ | ✓ | ✗ |
| Web-scale source crawl | ✓ | ✗ | ✗ | ✓ |
| Real-time API (<1s) | ✓ | ✗ | ✗ | ✗ |
| Public dataset release | ✓ | ✗ | ✗ | ✗ |
Research Problem
Core Questions We Answer
Main Research Question
"Can a bilingual sentence encoder, supported by a modest parallel corpus, flag English sentences that appear in Sinhala after machine translation — with high accuracy and low response latency?"
Component 1 Problem
Sinhala Similarity Plagiarism
Standard engines like Turnitin do not support Sinhala script. Students copy, paraphrase and reorder Sinhala sentences to bypass manual checks. A contextual embedding model must handle Sinhala's complex morphology to catch deep paraphrases.
Component 2 Problem
English-to-Sinhala Semantic Plagiarism
Free machine translation makes it trivial to copy English academic sources and submit them in Sinhala. No existing system maps across both scripts. A cross-lingual embedding approach must detect semantic equivalence without surface-level matching.
Objectives
Research Goals
Main Objective
Deliver a real-time Sinhala plagiarism detection service
Achieve F1 > 0.88 for cross-language and precision/recall > 0.90 for monolingual detection, with API response under one second — deployed as an open-source Docker service.
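For reference, the targets above use the standard definitions of precision, recall and F1, where TP, FP and FN count true positives, false positives and false negatives. Counting them per sentence pair is an assumption here; the unit of evaluation is not stated in this section.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

As a worked example with invented counts, TP = 90, FP = 10 and FN = 15 give precision 0.90, recall 90/105 ≈ 0.857 and F1 = 180/205 ≈ 0.878, which would fall just short of the 0.88 cross-language target.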
Specific Objectives
01
Build and publicly release a parallel Sinhala–English corpus with 10,000+ aligned sentence pairs collected from gazettes and bilingual news sources.
02
Construct a curated 2,000-document Sinhala corpus pairing original text with crafted plagiarised versions from news, essays, and blogs.
03
Fine-tune mBERT and a Siamese Bi-LSTM for Sinhala-to-Sinhala similarity detection with precision and recall above 0.90.
04
Fine-tune LaBSE for English-to-Sinhala semantic plagiarism detection, adding a MarianMT fallback for rare vocabulary, targeting F1 > 0.88.
05
Deploy both components as a unified Flask REST API in Docker, responding in under one second per sentence at classroom scale.
06
Release all corpus data, model checkpoints, and source code publicly to create the first open benchmark for Sinhala–English plagiarism detection.
Methodology
System Design & Approach
Component 1 — Similarity Detection
Sinhala-to-Sinhala Pipeline
1. Data collection from Sinhala news, essays and blogs
2. Annotators label copied and paraphrased pairs
3. mBERT fine-tuning on the 2,000-document corpus
4. Siamese Bi-LSTM for pair classification
5. Scrapy crawler + Faiss for web-scale retrieval
6. Grid-search calibration of the decision threshold, targeting precision and recall above 0.90
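The threshold calibration in the last step could look roughly like the sketch below. It assumes the model outputs a similarity score in [0, 1] per sentence pair and that labelled validation pairs are available; the function names, the grid of 101 candidate thresholds and the toy scores are illustrative, not the project's actual settings.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are flagged."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def calibrate_threshold(scores, labels, target=0.90, steps=101):
    """Scan evenly spaced thresholds; return the first one with the best F1
    among those meeting the precision and recall target, or None."""
    best = None  # (threshold, f1)
    for i in range(steps):
        threshold = i / (steps - 1)
        precision, recall = precision_recall(scores, labels, threshold)
        if precision >= target and recall >= target:
            f1 = 2 * precision * recall / (precision + recall)
            if best is None or f1 > best[1]:
                best = (threshold, f1)
    return best[0] if best else None

# Toy validation set: similarity scores with 1 = plagiarised, 0 = original.
scores = [0.95, 0.91, 0.88, 0.80, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

threshold = calibrate_threshold(scores, labels)
```

Returning `None` when no threshold meets the target makes a failed calibration explicit instead of silently falling back to a weaker operating point.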
Component 2 — Semantic Detection
English-to-Sinhala Pipeline
1. 10k+ pairs from gazettes and bilingual news
2. HunAlign sentence alignment & preprocessing
3. LaBSE fine-tuning maps both languages to shared space
4. Faiss index stores English web embeddings
5. Sinhala query → cosine similarity retrieval
6. MarianMT fallback for low-confidence cases
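The retrieval and fallback steps can be illustrated with a small stand-in. A brute-force search over unit-length vectors replaces the Faiss index, toy three-dimensional vectors replace LaBSE embeddings, and `fallback_score` is a hypothetical callable standing in for a translation-based check. The 0.75 confidence threshold is an arbitrary placeholder, not a calibrated value.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def best_match(query_vec, index):
    """Brute-force nearest neighbour over (source_id, unit_vector) pairs."""
    query = normalize(query_vec)
    scored = (
        (source_id, sum(q * v for q, v in zip(query, vec)))
        for source_id, vec in index
    )
    return max(scored, key=lambda pair: pair[1])

def detect(query_vec, index, fallback_score, threshold=0.75):
    """Use the embedding score when confident, otherwise defer to the fallback."""
    source_id, score = best_match(query_vec, index)
    if score >= threshold:
        return {"source": source_id, "score": score, "route": "embedding"}
    # Low-confidence match: hand the candidate to the translation-based check.
    return {
        "source": source_id,
        "score": fallback_score(source_id),
        "route": "translation_fallback",
    }

# Toy index of English source embeddings (IDs and vectors are invented).
index = [
    ("en-001", normalize([1.0, 0.0, 0.0])),
    ("en-002", normalize([0.0, 1.0, 0.0])),
]

confident = detect([0.9, 0.1, 0.0], index, fallback_score=lambda source_id: 0.0)
uncertain = detect([0.6, 0.5, 0.3], index, fallback_score=lambda source_id: 0.5)
```

Normalizing vectors before indexing means an inner-product search returns cosine similarity directly, which is the usual way to get cosine retrieval out of an inner-product index.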
System Architecture
Unified Similarity.lk Service
Both components integrate through a single Flask REST endpoint. Document upload triggers text extraction, parallel processing through both detection pipelines, confidence score aggregation, and JSON report generation — all within a Docker container deployed on a university server.
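The parallel processing and score aggregation described above might be organised along these lines. The sketch leaves out Flask and text extraction and treats the two detectors as plain callables that return one score per sentence. Combining scores with `max`, the 0.8 flagging cut-off and the report field names are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def build_report(sentences, mono_detector, cross_detector, flag_at=0.8):
    """Run both detectors concurrently and merge their scores into a JSON report."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        mono_future = pool.submit(mono_detector, sentences)
        cross_future = pool.submit(cross_detector, sentences)
        mono_scores = mono_future.result()
        cross_scores = cross_future.result()

    results = []
    for sentence, mono, cross in zip(sentences, mono_scores, cross_scores):
        combined = max(mono, cross)  # assumed aggregation rule: keep the stronger signal
        results.append({
            "sentence": sentence,
            "monolingual_score": mono,
            "cross_language_score": cross,
            "combined_score": combined,
            "flagged": combined >= flag_at,
        })

    report = {
        "flagged_count": sum(r["flagged"] for r in results),
        "sentences": results,
    }
    # ensure_ascii=False keeps Sinhala text readable in the JSON output.
    return json.dumps(report, ensure_ascii=False)

# Stub detectors with fixed scores, standing in for the two pipelines.
sentences = ["sentence one", "sentence two"]
report_json = build_report(
    sentences,
    mono_detector=lambda batch: [0.2, 0.9],
    cross_detector=lambda batch: [0.3, 0.1],
)
report = json.loads(report_json)
```

A function like `build_report` could be called from the Flask route handler once the uploaded document has been split into sentences.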