Research Domain

Domain Overview

Literature survey, research gaps, problem statement, objectives, methodology and technologies powering Similarity.lk.

Literature Survey
Prior Work in Sinhala NLP & Plagiarism
Kasthuri, 2019 [1]
Word2Vec + Cosine Similarity for Sinhala News
Applied Word2Vec embeddings and cosine similarity on Sinhala news text for same-language similarity detection. Limited to monolingual Sinhala — cross-language cases were out of scope.
Nilaxan, 2021 [2]
Siamese LSTM for Sinhala–Tamil Sentence Similarity
Introduced a Siamese LSTM architecture for measuring sentence similarity between Sinhala and Tamil. Remained within same-script or closely related languages — English cross-language detection was not addressed.
Rajamanthri, 2021 [3]
Web Crawl + Jaccard Similarity in Sinhala
Used a web crawler to gather Sinhala content and applied Jaccard similarity for plagiarism detection. Effective at scale but limited to surface-level token matching within Sinhala only.
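Jaccard similarity of the kind applied in [3] is a pure token-set overlap; a minimal sketch (function name and whitespace tokenisation are our own, not taken from the original study):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Surface-level token overlap: |A ∩ B| / |A ∪ B|."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

This scores verbatim copying highly but drops toward zero for paraphrases, which is exactly the surface-level limitation noted above.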
Wickramasinghe [4]
Bilingual Word Vectors for Sinhala–English
Constructed bilingual word vectors for the Sinhala–English language pair. Plagiarism detection was not evaluated — only embedding quality was assessed.
Feng et al. [5]
LaBSE: Language-Agnostic BERT Sentence Embeddings
Google's LaBSE model maps sentences from 109 languages into a shared embedding space. State-of-the-art on bilingual semantic textual similarity — the backbone of our cross-language component.
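In a shared embedding space such as LaBSE's, cross-language comparison reduces to cosine similarity between sentence vectors. A minimal sketch with placeholder vectors (real use would encode the Sinhala and English sentences with the LaBSE model; the 4-d vectors here are illustrative stand-ins for its 768-d output):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between two sentence embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings standing in for encoded Sinhala/English sentences.
sinhala_vec = np.array([0.2, 0.7, 0.1, 0.6])
english_vec = np.array([0.25, 0.65, 0.05, 0.62])
score = cosine_similarity(sinhala_vec, english_vec)
```

A score near 1.0 indicates the two sentences occupy almost the same point in the shared space, i.e. they are likely translations or close paraphrases of each other.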
Research Gap
What Existing Systems Miss

Cross-language detectors universally ignore Sinhala. No public Sinhala–English plagiarism corpus exists. Existing Sinhala studies stay inside the language, missing translated copying.

FeatureProposed SystemKasthuri 2019Nilaxan 2021Rajamanthri 2021
Cross-language plagiarism handling
Bilingual sentence embeddings (LaBSE)
Contextual embeddings (mBERT)
Machine translation fallback
Handles deep paraphrase
Web-scale source crawl
Real-time API (<1s)
Public dataset release
Research Problem
Core Questions We Answer
Main Research Question

"Can a bilingual sentence encoder, supported by a modest parallel corpus, flag English sentences that appear in Sinhala after machine translation — with high accuracy and low response latency?"

Component 1 Problem
Sinhala Similarity Plagiarism

Standard engines like Turnitin do not support Sinhala script. Students copy, paraphrase and reorder Sinhala sentences to bypass manual checks. A contextual embedding model must handle Sinhala's complex morphology to catch deep paraphrases.

Component 2 Problem
English-to-Sinhala Semantic Plagiarism

Free machine translation makes it trivial to copy English academic sources and submit them in Sinhala. No existing system maps across both scripts. A cross-lingual embedding approach must detect semantic equivalence without surface-level matching.

Objectives
Research Goals
Main Objective
Deliver a real-time Sinhala plagiarism detection service

Achieve F1 > 0.88 for cross-language and precision/recall > 0.90 for monolingual detection, with API response under one second — deployed as an open-source Docker service.

Specific Objectives
01
Build and publicly release a parallel Sinhala–English corpus with 10,000+ aligned sentence pairs collected from gazettes and bilingual news sources.
02
Construct a curated 2,000-document Sinhala corpus pairing original text with crafted plagiarised versions from news, essays, and blogs.
03
Fine-tune mBERT and a Siamese Bi-LSTM for Sinhala-to-Sinhala similarity detection with precision and recall above 0.90.
04
Fine-tune LaBSE for English-to-Sinhala semantic plagiarism detection, adding a MarianMT fallback for rare vocabulary, targeting F1 > 0.88.
05
Deploy both components as a unified Flask REST API in Docker, responding in under one second per sentence at classroom scale.
06
Release all corpus data, model checkpoints, and source code publicly to create the first open benchmark for Sinhala–English plagiarism detection.
Methodology
System Design & Approach
Component 1 — Similarity Detection
Sinhala-to-Sinhala Pipeline

1. Data collection from Sinhala news, essays, blogs
2. Annotators label copied and paraphrased pairs
3. mBERT fine-tuning on 2,000-document corpus
4. Siamese Bi-LSTM for pair classification
5. Scrapy crawler + Faiss for web-scale retrieval
6. Grid-search threshold calibration (>0.90)
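Step 6's calibration can be sketched as a grid search over candidate cut-offs that keeps the one with the best F1 on labelled pairs (the function name, grid range, and F1 criterion are our illustrative assumptions):

```python
def calibrate_threshold(scores, labels, grid=None):
    """Pick the similarity cut-off with the best F1 on labelled pairs."""
    grid = grid or [i / 100 for i in range(50, 100)]  # 0.50 .. 0.99
    best_t, best_f1 = 0.5, 0.0
    for t in grid:
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The same loop can optimise precision or recall instead of F1, depending on whether false accusations or missed plagiarism is the costlier error.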

Component 2 — Semantic Detection
English-to-Sinhala Pipeline

1. 10k+ pairs from gazettes and bilingual news
2. HunAlign sentence alignment & preprocessing
3. LaBSE fine-tuning maps both languages to shared space
4. Faiss index stores English web embeddings
5. Sinhala query → cosine similarity retrieval
6. MarianMT fallback for low-confidence cases
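Steps 4–5 amount to nearest-neighbour search over normalised embeddings. Faiss's IndexFlatIP does this at web scale; the underlying logic can be sketched in NumPy as a brute-force equivalent (corpus vectors and function names are illustrative):

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    # Normalise rows so inner product equals cosine similarity,
    # mirroring Faiss's IndexFlatIP over unit-length vectors.
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return (source_id, cosine score) for the k closest English sources."""
    q = query / np.linalg.norm(query)
    sims = index @ q                 # cosine scores against every source
    top = np.argsort(-sims)[:k]      # indices of the k highest scores
    return list(zip(top.tolist(), sims[top].tolist()))
```

A Sinhala query embedding is searched against the English index directly; only when the top score falls below the calibrated threshold would the MarianMT fallback of step 6 translate and re-check the sentence.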

System Architecture
Unified Similarity.lk Service

Both components integrate through a single Flask REST endpoint. Document upload triggers text extraction, parallel processing through both detection pipelines, confidence score aggregation, and JSON report generation — all within a Docker container deployed on a university server.
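The aggregation step can be sketched as a function merging both pipelines' per-sentence scores into the JSON report (field names, the max-score rule, and the 0.8 threshold are our assumptions, not the service's actual schema):

```python
import json

def aggregate_report(mono_scores, cross_scores, threshold=0.8):
    """Merge per-sentence scores from both pipelines into one JSON report."""
    sentences = []
    for i, (m, c) in enumerate(zip(mono_scores, cross_scores)):
        best = max(m, c)  # flag a sentence if either pipeline is confident
        sentences.append({
            "sentence": i,
            "monolingual_score": m,
            "cross_lingual_score": c,
            "flagged": best >= threshold,
        })
    flagged = sum(s["flagged"] for s in sentences)
    return json.dumps({
        "flagged_sentences": flagged,
        "total_sentences": len(sentences),
        "details": sentences,
    })
```

In the deployed service this function would sit behind the Flask upload endpoint, with each pipeline contributing one score list per extracted sentence.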

Technologies Used
Technical Stack
mBERT
Multilingual BERT — Sinhala embeddings
NLP Model
LaBSE
Bilingual sentence encoder (109 languages)
NLP Model
Siamese Bi-LSTM
Sentence pair classification network
DL Architecture
MarianMT
Machine translation fallback (En→Si)
Translation
Faiss
Facebook AI Similarity Search
Vector Search
Scrapy
Web crawler for corpus collection
Crawling
HunAlign
Bilingual sentence alignment
Alignment
Flask
REST API backend framework
Backend
Docker
Containerised deployment
DevOps
PyTorch
Deep learning framework
Framework
XLM-R
Cross-lingual RoBERTa baseline
NLP Model
SQL
Relational data store for documents
Database