# Saksi Translation: Nepali-English Machine Translation
This project provides a machine translation solution to translate text from Nepali and Sinhala to English. It leverages the power of the NLLB (No Language Left Behind) model from Meta AI, which is fine-tuned on a custom dataset for improved performance. The project includes a complete workflow from data acquisition to model deployment, featuring a REST API for easy integration.
## Table of Contents

- [Features](#features)
- [Workflow](#workflow)
- [Tech Stack](#tech-stack)
- [Model Details](#model-details)
- [API Endpoints](#api-endpoints)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Future Improvements](#future-improvements)
## Features
- High-Quality Translation: Utilizes a fine-tuned NLLB model for accurate translations.
- Support for Multiple Languages: Currently supports Nepali and Sinhala to English translation.
- REST API: Exposes the translation model through a high-performance FastAPI application.
- Interactive Frontend: A simple and intuitive web interface for easy translation.
- Batch Translation: Supports translating multiple texts in a single request.
- PDF Translation: Supports translating text directly from PDF files.
- Scalable and Reproducible: Built with a modular structure and uses MLflow for experiment tracking.
## Workflow
The project follows a standard machine learning workflow for building and deploying a translation model:
- Data Acquisition: The process begins with collecting parallel text data (Nepali/Sinhala and English). The `scripts/fetch_parallel_data.py` script downloads data from various online sources. The quality and quantity of this data are crucial for the model's performance.
- Data Cleaning and Preprocessing: Raw data from the web is often noisy and requires cleaning. The `scripts/clean_text_data.py` script performs several preprocessing steps (see the sketch after this list):
  - HTML Tag Removal: Strips out HTML tags and other web artifacts.
  - Unicode Normalization: Normalizes Unicode characters to ensure consistency.
  - Sentence Filtering: Removes sentences that are too long or too short, which can negatively impact training.
  - Corpus Alignment: Ensures a one-to-one correspondence between source and target sentences.
- Model Fine-tuning: The core of the project is fine-tuning a pre-trained NLLB model on our custom parallel dataset. The `src/train.py` script, which leverages the Hugging Face `Trainer` API, handles this process. It manages the entire training loop, including:
  - Loading the pre-trained NLLB model and tokenizer.
  - Creating a PyTorch `Dataset` from the preprocessed data.
  - Configuring training arguments such as learning rate, batch size, and number of epochs.
  - Executing the training loop and saving the fine-tuned model checkpoints.
- Model Evaluation: After training, the model's performance is evaluated using the `src/evaluation.py` script. This script calculates the BLEU (Bilingual Evaluation Understudy) score, a widely accepted metric for machine translation quality, by comparing the model's translations of a test set with a set of high-quality reference translations.
- Inference and Deployment: Once the model is trained and evaluated, it is ready for use through two entry points:
  - `interactive_translate.py`: A command-line script for quick, interactive translation tests.
  - `fast_api.py`: A production-ready REST API built with FastAPI that serves the translation model, allowing other applications to easily consume the translation service.
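As a rough illustration of the cleaning step, the snippet below shows the kind of operations involved. The function names and length thresholds are illustrative assumptions, not the actual interface of `scripts/clean_text_data.py`:

```python
import re
import unicodedata

def clean_sentence(text: str) -> str:
    """Strip HTML tags, normalize Unicode, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags and web artifacts
    text = unicodedata.normalize("NFC", text)   # normalize Unicode to a consistent form
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

def keep_pair(src: str, tgt: str, min_words: int = 3, max_words: int = 80) -> bool:
    """Drop sentence pairs that are too short or too long for training.

    The 3/80 word bounds are placeholder values, not the script's settings.
    """
    return all(min_words <= len(s.split()) <= max_words for s in (src, tgt))

print(clean_sentence("<p>नमस्ते   संसार</p>"))  # -> "नमस्ते संसार"
```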
## Tech Stack
The technologies used in this project were chosen to create a robust, efficient, and maintainable machine translation pipeline:
- Python: The primary language for the project, offering a rich ecosystem of libraries and frameworks for machine learning.
- PyTorch: A flexible and powerful deep learning framework that provides fine-grained control over the model training process.
- Hugging Face Transformers: The backbone of the project, providing easy access to pre-trained models like NLLB and a standardized interface for training and inference.
- Hugging Face Datasets: Simplifies the process of loading and preprocessing large datasets, with efficient data loading and manipulation capabilities.
- FastAPI: A modern, high-performance web framework for building APIs with Python. It's used to serve the translation model as a REST API.
- Uvicorn: A lightning-fast ASGI server, used to run the FastAPI application.
- MLflow: Used for experiment tracking to ensure reproducibility. It logs training parameters, metrics, and model artifacts, which is crucial for managing machine learning projects.
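As a sketch of how a training run can report to MLflow (the parameter and metric names here are illustrative, not the project's exact logging schema):

```python
import mlflow

# Log hyperparameters, metrics, and artifacts for one training run.
with mlflow.start_run(run_name="nllb-finetune-ne-en"):
    mlflow.log_params({"learning_rate": 2e-5, "batch_size": 8, "epochs": 3})
    bleu_score = 0.0  # placeholder; in practice this comes from the evaluation step
    mlflow.log_metric("bleu", bleu_score)
    mlflow.log_artifacts("models/nllb-finetuned-nepali-en")
```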
## Model Details
- Base Model: The project uses the `facebook/nllb-200-distilled-600M` model, a distilled version of the NLLB-200 model. This model is designed to be efficient while still providing high-quality translations for a large number of languages.
- Fine-tuning: The base model is fine-tuned on a custom dataset of Nepali-English and Sinhala-English parallel text to improve its performance on these specific language pairs.
- Tokenizer: The `NllbTokenizer` is used for tokenizing the text. It is a SentencePiece-based tokenizer designed specifically for the NLLB model.
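For reference, translating with the base NLLB model follows the standard Hugging Face pattern below. NLLB uses FLORES-200 language codes, such as `npi_Deva` for Nepali, `sin_Sinh` for Sinhala, and `eng_Latn` for English:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="npi_Deva")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("नमस्ते संसार", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the decoder to start with the English language token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=256,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```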
## API Endpoints
The FastAPI application provides the following endpoints:
- `GET /`: Returns the frontend HTML page.
- `GET /languages`: Returns a list of supported languages.
- `POST /translate`: Translates a single text.
  - Request Body: `{ "text": "string", "source_language": "string" }`
  - Response Body: `{ "original_text": "string", "translated_text": "string", "source_language": "string" }`
- `POST /batch-translate`: Translates a batch of texts.
  - Request Body: `{ "texts": ["string"], "source_language": "string" }`
  - Response Body: `{ "original_texts": ["string"], "translated_texts": ["string"], "source_language": "string" }`
- `POST /translate-pdf`: Translates a PDF file.
  - Request: multipart form with `source_language: str` and `file: UploadFile`
  - Response Body: `{ "filename": "string", "translated_text": "string", "source_language": "string" }`
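Assuming the server is running locally on the default port, a client can call these endpoints as sketched below. The exact `source_language` values are whatever `GET /languages` returns; lowercase names are an assumption here:

```python
import requests

BASE_URL = "http://127.0.0.1:8000"

# Single translation
resp = requests.post(f"{BASE_URL}/translate", json={
    "text": "नमस्ते संसार",
    "source_language": "nepali",  # assumed value; check GET /languages
})
print(resp.json()["translated_text"])

# Batch translation
resp = requests.post(f"{BASE_URL}/batch-translate", json={
    "texts": ["नमस्ते", "धन्यवाद"],
    "source_language": "nepali",
})
print(resp.json()["translated_texts"])

# PDF translation (multipart upload)
with open("document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/translate-pdf",
        data={"source_language": "nepali"},
        files={"file": ("document.pdf", f, "application/pdf")},
    )
print(resp.json()["translated_text"])
```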
## Getting Started
### Prerequisites
- Python 3.10 or higher: Ensure you have a recent version of Python installed.
- Git and Git LFS: Git is required to clone the repository, and Git LFS is required to handle large model files.
- (Optional) NVIDIA GPU with CUDA: A GPU is highly recommended for training the model.
### Installation
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd saksi_translation
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv .venv

  # On Windows
  .venv\Scripts\activate

  # On macOS/Linux
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage
### Data Preparation
- Fetch Parallel Data:

  ```bash
  python scripts/fetch_parallel_data.py --output_dir data/raw
  ```

- Clean Text Data:

  ```bash
  python scripts/clean_text_data.py --input_dir data/raw --output_dir data/processed
  ```
### Training
- Start Training:

  ```bash
  python src/train.py \
      --model_name "facebook/nllb-200-distilled-600M" \
      --dataset_path "data/processed" \
      --output_dir "models/nllb-finetuned-nepali-en" \
      --learning_rate 2e-5 \
      --per_device_train_batch_size 8 \
      --num_train_epochs 3
  ```
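Internally, `src/train.py` builds on the Hugging Face `Trainer` API. A condensed, self-contained sketch of that setup, with a toy one-pair dataset standing in for the processed corpus, looks roughly like this:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="npi_Deva", tgt_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy stand-in for the processed parallel corpus in data/processed.
pairs = Dataset.from_dict({"ne": ["नमस्ते संसार"], "en": ["Hello, world"]})

def tokenize(batch):
    # text_target tokenizes the English side as the labels.
    return tokenizer(batch["ne"], text_target=batch["en"], truncation=True, max_length=256)

train_dataset = pairs.map(tokenize, batched=True, remove_columns=["ne", "en"])

args = Seq2SeqTrainingArguments(
    output_dir="models/nllb-finetuned-nepali-en",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```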
### Evaluation
- Evaluate the Model:

  ```bash
  python src/evaluation.py \
      --model_path "models/nllb-finetuned-nepali-en" \
      --test_data_path "data/test_sets/test.ne" \
      --reference_data_path "data/test_sets/test.en"
  ```
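Under the hood, BLEU can be computed with `sacrebleu`. Whether `src/evaluation.py` uses `sacrebleu` specifically is an assumption, but the calculation looks like this:

```python
import sacrebleu

hypotheses = ["The weather is nice today."]        # model translations
references = [["The weather is pleasant today."]]  # one list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```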
### Interactive Translation
- Run the interactive script:

  ```bash
  python interactive_translate.py
  ```
### API
- Run the API:

  ```bash
  uvicorn fast_api:app --reload
  ```

  Open your browser and navigate to `http://127.0.0.1:8000` to use the web interface.
## Project Structure
```
saksi_translation/
├── .gitignore
├── fast_api.py               # FastAPI application
├── interactive_translate.py  # Interactive translation script
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── test_translation.py       # Script for testing the translation model
├── frontend/
│   ├── index.html            # Frontend HTML
│   ├── script.js             # Frontend JavaScript
│   └── styles.css            # Frontend CSS
├── data/
│   ├── processed/            # Processed data for training
│   ├── raw/                  # Raw data downloaded from the web
│   └── test_sets/            # Test sets for evaluation
├── mlruns/                   # MLflow experiment tracking data
├── models/
│   └── nllb-finetuned-nepali-en/  # Fine-tuned model
├── notebooks/                # Jupyter notebooks for experimentation
├── scripts/
│   ├── clean_text_data.py
│   ├── create_test_set.py
│   ├── download_model.py
│   ├── fetch_parallel_data.py
│   └── scrape_bbc_nepali.py
└── src/
    ├── __init__.py
    ├── evaluation.py         # Script for evaluating the model
    ├── train.py              # Script for training the model
    └── translate.py          # Script for translating text
```
## Future Improvements
- Support for more languages: The project can be extended to support more languages by adding more parallel data and fine-tuning the model on it.
- Improved Model: The model can be improved by using a larger version of the NLLB model or by fine-tuning it on a larger and cleaner dataset.
- Advanced Frontend: The frontend can be improved by adding features like translation history, user accounts, and more advanced styling.
- Containerization: The application can be containerized using Docker for easier deployment and scaling.