ViLegalQwen2.5-1.5B-Base

Model Description

ViLegalQwen2.5-1.5B-Base is a Vietnamese legal language model produced by continual pretraining of Qwen/Qwen2.5-1.5B on a 17GB Vietnamese legal corpus. It is designed for Vietnamese legal text understanding and generation, and aims to improve performance on legal-domain tasks while retaining the general capabilities of the Qwen2.5 base model.

⚠️ Important Notice: This is a base model that requires fine-tuning for optimal performance in downstream tasks. We strongly recommend applying supervised fine-tuning (SFT), instruction tuning, or other post-training techniques before production use.

Model Details

Architecture & Specifications

Base Model: Qwen/Qwen2.5-1.5B
Model Type: Causal Language Model (Dense)
Parameters: 1.54B total (1.31B non-embedding)
Architecture: Transformer with 28 layers
Attention: Grouped Query Attention (GQA) - 12 heads for Q, 2 heads for KV
Context Length: 2,048 tokens (training sequence length)
Vocabulary Size: 151,646 tokens
License: CC BY-NC-SA 4.0
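
To make the effect of grouped query attention concrete, the sketch below estimates the key/value cache size from the figures listed above (28 layers, 12 query heads, 2 KV heads). The head dimension of 128 and the 2-byte (bf16/fp16) cache precision are assumptions that this card does not state; the head dimension is taken from the upstream Qwen2.5-1.5B configuration (hidden size 1536 divided by 12 heads), so check the repository's config.json before relying on it.

```python
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=2,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    head_dim=128 and bytes_per_value=2 are assumptions, not values
    stated in the model card.
    """
    # Two tensors (keys and values) per layer, each of shape
    # [num_kv_heads, seq_len, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


MIB = 1024 ** 2
gqa_mib = kv_cache_bytes(2048) / MIB                   # 2 KV heads, as listed
mha_mib = kv_cache_bytes(2048, num_kv_heads=12) / MIB  # hypothetical: one KV head per query head

print(f"GQA cache at 2,048 tokens: {gqa_mib:.0f} MiB")
print(f"Full multi-head equivalent: {mha_mib:.0f} MiB")
```

Under these assumptions the cache at the 2,048-token training length is about 56 MiB, against about 336 MiB if every query head had its own KV head, a sixfold reduction that follows directly from the 12:2 head ratio.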

Training Details

Training Method: Continual Pretraining
Training Data: 17GB Vietnamese legal corpus
Training Objective: Causal Language Modeling
Optimization:
- Optimizer: AdamW with fused implementation
- Learning Rate: 5e-5 with cosine annealing
- Batch Size: Effective batch size of 48 with gradient accumulation
- Precision: Mixed precision training
Hardware: NVIDIA A100 GPU
Training Framework: Hugging Face Transformers + PyTorch
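
The learning-rate schedule and batch-size arithmetic above can be sketched as follows. This is only an illustration: the card does not state warmup steps, the minimum learning rate, the total step count, or how the effective batch size of 48 was split, so `total_steps`, `min_lr`, and the per-device and accumulation values below are hypothetical.

```python
import math


def cosine_annealed_lr(step, total_steps, peak_lr=5e-5, min_lr=0.0):
    """Cosine annealing from peak_lr to min_lr, without warmup.

    min_lr=0.0 and the absence of warmup are assumptions; only the
    peak value of 5e-5 comes from the model card.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of sequences contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


# Hypothetical split that reproduces the reported effective batch size of 48.
assert effective_batch_size(per_device_batch=6, grad_accum_steps=8) == 48

total_steps = 10_000  # hypothetical
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_annealed_lr(step, total_steps):.2e}")
```

The schedule starts at the peak rate of 5e-5, passes through half of it at the midpoint, and reaches `min_lr` at the final step.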

Legal Corpus Composition

The training corpus was compiled through systematic crawling and curation of Vietnamese legal documents:

Data Sources:

vbpl.vn - National Legal Document Database
thuvienphapluat.vn - Legal Library Vietnam
luatvietnam.vn - Vietnam Law Portal
lawnet.vn - Professional Legal Network

Corpus Statistics:

Initial Collection: ~1,136,839 legal documents crawled
Post-Deduplication: 17GB of raw Vietnamese legal texts
Document Types: Laws, decrees, circulars, decisions, regulations, and legal interpretations
Coverage: Comprehensive Vietnamese legal framework from multiple authoritative sources
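
The card reports deduplication but not the method. As one possible illustration, the sketch below performs exact deduplication by hashing Unicode- and whitespace-normalized text. The function names and the normalization rules are assumptions, and the authors' pipeline may have used a different technique (for example near-duplicate detection).

```python
import hashlib
import unicodedata


def normalize(text):
    """Apply NFC normalization, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).lower()


def deduplicate(documents):
    """Remove exact duplicates (after normalization), keeping the
    first occurrence of each document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    "Điều 1. Phạm vi điều chỉnh",
    "Điều 1.   Phạm vi  điều chỉnh\n",  # same text, different whitespace
    "Điều 2. Đối tượng áp dụng",
]
print(len(deduplicate(docs)))
```

NFC normalization matters for Vietnamese in particular, because the same accented character can be encoded either as a single code point or as a base letter plus combining marks, and documents crawled from different sites may not agree on the form.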

Performance & Capabilities

Strengths

Legal Domain Expertise: Enhanced understanding of Vietnamese legal terminology and concepts
Document Structure: Improved comprehension of legal document formats and hierarchies
Contextual Understanding: Better grasp of legal relationships and dependencies
Vietnamese Language: Fluent processing of Vietnamese legal language

Evaluation

This model has been trained through continual pretraining on Vietnamese legal texts.

Comprehensive evaluation results are coming soon.

Usage

For usage examples and implementation guidance, please refer to the Qwen2.5 documentation and Transformers documentation.
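
As a starting point, the sketch below loads the checkpoint with the standard Transformers `AutoTokenizer` and `AutoModelForCausalLM` classes and continues a prompt. It assumes the `transformers` and `torch` packages are installed and that you have accepted the repository's access conditions and authenticated with Hugging Face. The function name `complete` and the generation settings are illustrative, not values recommended by the authors.

```python
MODEL_ID = "ntphuc149/ViLegalQwen2.5-1.5B-Base"  # repository named in the Citation section


def complete(prompt, max_new_tokens=128):
    """Continue `prompt` with the base model (plain text completion, no chat template)."""
    # Imported here so the heavy dependencies are only needed when generating.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Greedy decoding keeps the output deterministic for a quick check.
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("Điều 1. Phạm vi điều chỉnh\nLuật này quy định về")` downloads the weights on first use and returns the prompt followed by the model's continuation. Because this is a base model, phrase prompts as text to be continued rather than as instructions.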

Model Limitations & Considerations

Limitations

Domain Specificity: Optimized for legal domain; may underperform on general Vietnamese text
Base Model Nature: Requires fine-tuning for optimal task-specific performance
Training Data Bias: Performance may reflect biases present in Vietnamese legal corpus
Context Constraints: The model was continually pretrained with 2,048-token sequences. Although the base architecture supports up to 32K tokens, performance may degrade significantly on contexts longer than the training length.
Temporal Limitations: Training data has a temporal cutoff; may not reflect the most recent legal changes
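
One way to stay within the 2,048-token training length is to split long documents into overlapping windows before passing them to the model. The sketch below operates on a list of token IDs (as produced by any tokenizer). The function name, the default overlap of 256 tokens, and the windowing strategy itself are assumptions for illustration, not a procedure described by the authors.

```python
def split_into_windows(token_ids, max_len=2048, overlap=256):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears with some preceding context in the next window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


# Stand-in for real token IDs: a 5,000-token document.
ids = list(range(5000))
windows = split_into_windows(ids)
print([len(w) for w in windows])  # [2048, 2048, 1416]
```

A larger overlap preserves more cross-boundary context at the cost of processing more tokens in total; legal documents with long cross-references may benefit from splitting at article or clause boundaries instead of at fixed offsets.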

Ethical Considerations

Not Legal Advice: This model should NOT be used to provide actual legal advice
Professional Review Required: All model outputs should be reviewed by qualified legal professionals
Bias Awareness: Users should be aware of potential biases in legal interpretation
Responsible Use: Model should be used responsibly within appropriate legal and ethical frameworks

Safety Measures

Human Oversight: Always require human legal expert oversight
Output Verification: Verify all generated content against authoritative legal sources
Regulatory Compliance: Ensure usage complies with local AI and legal practice regulations

Citation

If you use ViLegalQwen2.5-1.5B-Base in your research or applications, please cite:

@misc{vilegalqwen25-2025,
  title={ViLegalQwen: Lightweight Large Language Models for Vietnamese Legal Texts},
  author={Truong-Phuc Nguyen and Tien-Manh Tran and Manh-Cuong Phan},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base}
}

@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  url={https://qwenlm.github.io/blog/qwen2.5/}
}

Contact & Support

For questions, suggestions, or collaboration opportunities:

GitHub: https://github.com/ntphuc149
Email: [email protected]
Issues: Please report issues on the model's discussion page

Disclaimer: This model is provided for research and educational purposes. It should not replace professional legal advice or consultation. Users are responsible for ensuring compliance with applicable laws and regulations in their jurisdiction.