Research Domain

Domain Overview

Literature survey, research gaps, problem statement, objectives, methodology and technologies powering Similarity.lk.

Literature Survey
Prior Work in Sinhala NLP & Plagiarism
Kasthuri, 2019 [1]
Word2Vec + Cosine Similarity for Sinhala News
Applied Word2Vec embeddings and cosine similarity on Sinhala news text for same-language similarity detection. Limited to monolingual Sinhala — cross-language cases were out of scope.
Nilaxan, 2021 [2]
Siamese LSTM for Sinhala–Tamil Sentence Similarity
Introduced a Siamese LSTM architecture for measuring sentence similarity between Sinhala and Tamil. Remained within same-script or closely related languages — English cross-language detection was not addressed.
Rajamanthri, 2021 [3]
Web Crawl + Jaccard Similarity in Sinhala
Used a web crawler to gather Sinhala content and applied Jaccard similarity for plagiarism detection. Effective at scale but limited to surface-level token matching within Sinhala only.
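Jaccard similarity of the kind applied in [3] is a pure token-set overlap; a minimal sketch (function name and whitespace tokenisation are our own, not taken from the original study):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Surface-level token overlap: |A ∩ B| / |A ∪ B|."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

This scores verbatim copying highly but drops toward zero for paraphrases, which is exactly the surface-level limitation noted above.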
Wickramasinghe [4]
Bilingual Word Vectors for Sinhala–English
Constructed bilingual word vectors for the Sinhala–English language pair. Plagiarism detection was not evaluated — only embedding quality was assessed.
Feng et al. [5]
LaBSE: Language-Agnostic BERT Sentence Embeddings
Google's LaBSE model maps sentences from 109 languages into a shared embedding space. State-of-the-art on bilingual semantic textual similarity — the backbone of our cross-language component.
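In a shared embedding space such as LaBSE's, cross-language comparison reduces to cosine similarity between sentence vectors. A minimal sketch with placeholder vectors (real use would encode the Sinhala and English sentences with the LaBSE model; the 4-d vectors here are illustrative stand-ins for its 768-d output):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between two sentence embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings standing in for encoded Sinhala/English sentences.
sinhala_vec = np.array([0.2, 0.7, 0.1, 0.6])
english_vec = np.array([0.25, 0.65, 0.05, 0.62])
score = cosine_similarity(sinhala_vec, english_vec)
```

A score near 1.0 indicates the two sentences occupy almost the same point in the shared space, i.e. they are likely translations or close paraphrases of each other.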
Research Gap
What Existing Systems Miss

Cross-language detectors universally ignore Sinhala. No public Sinhala–English plagiarism corpus exists. Existing Sinhala studies stay inside the language, missing translated copying.

FeatureProposed SystemKasthuri 2019Nilaxan 2021Rajamanthri 2021
Cross-language plagiarism handling
Bilingual sentence embeddings (LaBSE)
Contextual embeddings (mBERT)
Machine translation fallback
Handles deep paraphrase
Web-scale source crawl
Real-time API (<1s)
Public dataset release
Research Problem
Core Questions We Answer
Main Research Question

"Can a bilingual sentence encoder, supported by a modest parallel corpus, flag English sentences that appear in Sinhala after machine translation — with high accuracy and low response latency?"

Component 1 Problem
Sinhala Similarity Plagiarism

Standard engines like Turnitin do not support Sinhala script. Students copy, paraphrase and reorder Sinhala sentences to bypass manual checks. A contextual embedding model must handle Sinhala's complex morphology to catch deep paraphrases.

Component 2 Problem
English-to-Sinhala Semantic Plagiarism

Free machine translation makes it trivial to copy English academic sources and submit them in Sinhala. No existing system maps across both scripts. A cross-lingual embedding approach must detect semantic equivalence without surface-level matching.

Objectives
Research Goals
Main Objective
Deliver a real-time Sinhala plagiarism detection service

Achieve F1 > 0.88 for cross-language and precision/recall > 0.90 for monolingual detection, with API response under one second — deployed as an open-source Docker service.

Specific Objectives
01
Build and publicly release a parallel Sinhala–English corpus with 10,000+ aligned sentence pairs collected from gazettes and bilingual news sources.
02
Construct a curated 2,000-document Sinhala corpus pairing original text with crafted plagiarised versions from news, essays, and blogs.
03
Fine-tune mBERT and a Siamese Bi-LSTM for Sinhala-to-Sinhala similarity detection with precision and recall above 0.90.
04
Fine-tune LaBSE for English-to-Sinhala semantic plagiarism detection, adding a MarianMT fallback for rare vocabulary, targeting F1 > 0.88.
05
Deploy both components as a unified Flask REST API in Docker, responding in under one second per sentence at classroom scale.
06
Release all corpus data, model checkpoints, and source code publicly to create the first open benchmark for Sinhala–English plagiarism detection.
Methodology
System Design & Approach
Component 1 — Similarity Detection
Sinhala-to-Sinhala Pipeline

1. Data collection from Sinhala news, essays, blogs
2. Annotators label copied and paraphrased pairs
3. mBERT fine-tuning on 2,000-document corpus
4. Siamese Bi-LSTM for pair classification
5. Scrapy crawler + Faiss for web-scale retrieval
6. Grid-search threshold calibration (>0.90)
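Step 6's calibration can be sketched as a grid search over candidate cut-offs that keeps the one with the best F1 on labelled pairs (the function name, grid range, and F1 criterion are our illustrative assumptions):

```python
def calibrate_threshold(scores, labels, grid=None):
    """Pick the similarity cut-off with the best F1 on labelled pairs."""
    grid = grid or [i / 100 for i in range(50, 100)]  # 0.50 .. 0.99
    best_t, best_f1 = 0.5, 0.0
    for t in grid:
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The same loop can optimise precision or recall instead of F1, depending on whether false accusations or missed plagiarism is the costlier error.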

Component 2 — Semantic Detection
English-to-Sinhala Pipeline

1. 10k+ pairs from gazettes and bilingual news
2. HunAlign sentence alignment & preprocessing
3. LaBSE fine-tuning maps both languages to shared space
4. Faiss index stores English web embeddings
5. Sinhala query → cosine similarity retrieval
6. MarianMT fallback for low-confidence cases
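Steps 4–5 amount to nearest-neighbour search over normalised embeddings. Faiss's IndexFlatIP does this at web scale; the underlying logic can be sketched in NumPy as a brute-force equivalent (corpus vectors and function names are illustrative):

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    # Normalise rows so inner product equals cosine similarity,
    # mirroring Faiss's IndexFlatIP over unit-length vectors.
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return (source_id, cosine score) for the k closest English sources."""
    q = query / np.linalg.norm(query)
    sims = index @ q                 # cosine scores against every source
    top = np.argsort(-sims)[:k]      # indices of the k highest scores
    return list(zip(top.tolist(), sims[top].tolist()))
```

A Sinhala query embedding is searched against the English index directly; only when the top score falls below the calibrated threshold would the MarianMT fallback of step 6 translate and re-check the sentence.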

System Architecture
Unified Similarity.lk Service

Both components integrate through a single Flask REST endpoint. Document upload triggers text extraction, parallel processing through both detection pipelines, confidence score aggregation, and JSON report generation — all within a Docker container deployed on a university server.
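The aggregation step can be sketched as a function merging both pipelines' per-sentence scores into the JSON report (field names, the max-score rule, and the 0.8 threshold are our assumptions, not the service's actual schema):

```python
import json

def aggregate_report(mono_scores, cross_scores, threshold=0.8):
    """Merge per-sentence scores from both pipelines into one JSON report."""
    sentences = []
    for i, (m, c) in enumerate(zip(mono_scores, cross_scores)):
        best = max(m, c)  # flag a sentence if either pipeline is confident
        sentences.append({
            "sentence": i,
            "monolingual_score": m,
            "cross_lingual_score": c,
            "flagged": best >= threshold,
        })
    flagged = sum(s["flagged"] for s in sentences)
    return json.dumps({
        "flagged_sentences": flagged,
        "total_sentences": len(sentences),
        "details": sentences,
    })
```

In the deployed service this function would sit behind the Flask upload endpoint, with each pipeline contributing one score list per extracted sentence.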

Technologies Used
Technical Stack
mBERT
Multilingual BERT — Sinhala embeddings
NLP Model
LaBSE
Bilingual sentence encoder (109 languages)
NLP Model
Siamese Bi-LSTM
Sentence pair classification network
DL Architecture
MarianMT
Machine translation fallback (En→Si)
Translation
Faiss
Facebook AI Similarity Search
Vector Search
Scrapy
Web crawler for corpus collection
Crawling
HunAlign
Bilingual sentence alignment
Alignment
Flask
REST API backend framework
Backend
Docker
Containerised deployment
DevOps
PyTorch
Deep learning framework
Framework
XLM-R
Cross-lingual RoBERTa baseline
NLP Model
SQL
Relational data store for documents
Database