ViLegalQwen2.5-1.5B-Base
Model Description
ViLegalQwen2.5-1.5B-Base is a Vietnamese legal language model built by continual pretraining of Qwen/Qwen2.5-1.5B on a 17GB Vietnamese legal corpus. It targets Vietnamese legal text understanding and generation, and is intended to improve on the base model in legal-domain tasks while retaining the general capabilities of the Qwen2.5 architecture.
⚠️ Important Notice: This is a base model that requires fine-tuning for optimal performance in downstream tasks. We strongly recommend applying supervised fine-tuning (SFT), instruction tuning, or other post-training techniques before production use.
Model Details
Architecture & Specifications
- Base Model: Qwen/Qwen2.5-1.5B
- Model Type: Causal Language Model (Dense)
- Parameters: 1.54B total (1.31B non-embedding)
- Architecture: Transformer with 28 layers
- Attention: Grouped Query Attention (GQA) - 12 heads for Q, 2 heads for KV
- Context Length: 2,048 tokens (training sequence length)
- Vocabulary Size: 151,646 tokens
- License: CC BY-NC-SA 4.0
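The attention figures above imply a small key/value cache, which matters when budgeting memory for inference. As a rough sketch, the calculation below estimates the cache size for one sequence. The head dimension of 128 (a hidden size of 1536 split across 12 query heads) and the 2-byte element size are assumptions not stated in this card; the layer and head counts come from the list above.

```python
# Rough KV-cache estimate for grouped query attention (GQA).
NUM_LAYERS = 28        # from the card
NUM_Q_HEADS = 12       # from the card
NUM_KV_HEADS = 2       # from the card
HEAD_DIM = 128         # assumption: hidden size 1536 / 12 query heads
BYTES_PER_VALUE = 2    # assumption: bf16/fp16 cache


def kv_cache_bytes(num_tokens: int, num_kv_heads: int = NUM_KV_HEADS) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


gqa = kv_cache_bytes(2048)
mha = kv_cache_bytes(2048, num_kv_heads=NUM_Q_HEADS)
print(f"GQA cache at 2,048 tokens: {gqa / 2**20:.0f} MiB")
print(f"Same model with 12 KV heads: {mha / 2**20:.0f} MiB")
```

Under these assumptions, sharing 2 KV heads across 12 query heads cuts the cache to one sixth of what full multi-head attention would need.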
Training Details
- Training Method: Continual Pretraining
- Training Data: 17GB Vietnamese legal corpus
- Training Objective: Causal Language Modeling
- Optimization:
  - Optimizer: AdamW with fused implementation
  - Learning Rate: 5e-5 with cosine annealing
  - Batch Size: Effective batch size of 48 with gradient accumulation
  - Precision: Mixed precision training
- Hardware: NVIDIA A100 GPU
- Training Framework: Hugging Face Transformers + PyTorch
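The settings above map onto a handful of keyword arguments that Hugging Face `TrainingArguments` accepts. The sketch below is not the authors' training script: the 8 x 6 split of the effective batch size, the choice of bf16 for mixed precision, and the single-GPU setup are assumptions made for illustration.

```python
# Illustrative hyperparameters; keys mirror transformers.TrainingArguments.
SEQ_LEN = 2048  # training sequence length from the card

training_kwargs = {
    "learning_rate": 5e-5,               # from the card
    "lr_scheduler_type": "cosine",       # cosine annealing
    "optim": "adamw_torch_fused",        # fused AdamW
    "per_device_train_batch_size": 8,    # assumption
    "gradient_accumulation_steps": 6,    # assumption: 8 * 6 = 48
    "bf16": True,                        # assumption: bf16 mixed precision
}


def effective_batch_size(kwargs: dict, num_gpus: int = 1) -> int:
    """Sequences consumed per optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_gpus
    )


batch = effective_batch_size(training_kwargs)
tokens_per_step = batch * SEQ_LEN
print(batch, tokens_per_step)
```

With `transformers` installed, these could be passed as `TrainingArguments(output_dir=..., **training_kwargs)`; each optimizer step would then cover up to 48 x 2,048 = 98,304 tokens.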
Legal Corpus Composition
The training corpus was compiled through systematic crawling and curation of Vietnamese legal documents:
Data Sources:
- vbpl.vn - National Legal Document Database
- thuvienphapluat.vn - Legal Library Vietnam
- luatvietnam.vn - Vietnam Law Portal
- lawnet.vn - Professional Legal Network
Corpus Statistics:
- Initial Collection: ~1,136,839 legal documents crawled
- Post-Deduplication: 17GB of raw Vietnamese legal texts
- Document Types: Laws, decrees, circulars, decisions, regulations, and legal interpretations
- Coverage: Comprehensive Vietnamese legal framework from multiple authoritative sources
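The card does not say how deduplication was performed. One common approach is exact-match deduplication on normalized text, sketched below; the normalization steps (Unicode NFC, whitespace collapsing, lowercasing) and the function names are illustrative assumptions, not the authors' pipeline.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Canonicalize text so trivially different copies hash identically."""
    # NFC matters for Vietnamese: diacritics can be precomposed or combining.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document whose normalized content has not been seen yet."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


docs = [
    "Điều 1. Phạm vi điều chỉnh",
    "Điều 1.  Phạm vi   điều chỉnh\n",   # same text, different whitespace
    "Điều 2. Đối tượng áp dụng",
]
unique = list(deduplicate(docs))
print(len(unique))
```

Exact matching only removes identical copies after normalization; near-duplicates such as amended versions of the same document would need a fuzzy method instead.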
Performance & Capabilities
Strengths
- Legal Domain Expertise: Enhanced understanding of Vietnamese legal terminology and concepts
- Document Structure: Improved comprehension of legal document formats and hierarchies
- Contextual Understanding: Better grasp of legal relationships and dependencies
- Vietnamese Language: Native-level Vietnamese legal language processing
Evaluation
This model was produced by continual pretraining on Vietnamese legal texts.
Comprehensive evaluation results are coming soon.
Usage
For usage examples and implementation guidance, please refer to the Qwen2.5 documentation and Transformers documentation.
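As a starting point, a base checkpoint like this one can be loaded for plain text completion through the standard Transformers causal-LM classes. The sketch below assumes the repository id taken from the citation URL, that `torch` and `transformers` are installed, and that greedy decoding is acceptable; the `complete` helper is a hypothetical name, not part of any library.

```python
MODEL_ID = "ntphuc149/ViLegalQwen2.5-1.5B-Base"  # from the citation URL


def complete(prompt: str, max_new_tokens: int = 128) -> str:
    """Continue `prompt` with greedy decoding (base model, no chat template)."""
    # Imported lazily so this module loads without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Reloading on every call keeps the sketch short; cache these in real use.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # The decoded string contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `complete("Điều 1. Phạm vi điều chỉnh")` would download the weights on first use and return the prompt followed by the model's continuation. Because this is a base model, prompts should be written as text to continue rather than as instructions.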
Model Limitations & Considerations
Limitations
- Domain Specificity: Optimized for legal domain; may underperform on general Vietnamese text
- Base Model Nature: Requires fine-tuning for optimal task-specific performance
- Training Data Bias: Performance may reflect biases present in Vietnamese legal corpus
- Context Constraints: Model was pretrained on legal texts with 2,048-token sequences. While the base architecture supports up to 32K tokens, performance may degrade significantly with contexts exceeding the training length
- Temporal Limitations: Training data has a temporal cutoff; may not reflect the most recent legal changes
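One way to respect the 2,048-token training length on long legal documents is to split the token sequence into overlapping windows before feeding it to the model. The sketch below works on any list of token ids; the 256-token overlap and the function name are illustrative assumptions.

```python
from typing import List, Sequence

MAX_TOKENS = 2048  # training sequence length from the card


def split_into_windows(
    token_ids: Sequence[int], max_tokens: int = MAX_TOKENS, overlap: int = 256
) -> List[List[int]]:
    """Split token ids into windows of at most `max_tokens` tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary keeps some of its preceding context.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    stride = max_tokens - overlap
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_tokens]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_tokens >= len(token_ids):
            break
        start += stride
    return windows


windows = split_into_windows(list(range(5000)))
print([len(w) for w in windows])
```

Each window can then be processed independently, at the cost of losing cross-window context beyond the overlap.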
Ethical Considerations
- Not Legal Advice: This model should NOT be used to provide actual legal advice
- Professional Review Required: All model outputs should be reviewed by qualified legal professionals
- Bias Awareness: Users should be aware of potential biases in legal interpretation
- Responsible Use: Model should be used responsibly within appropriate legal and ethical frameworks
Safety Measures
- Human Oversight: Always require human legal expert oversight
- Output Verification: Verify all generated content against authoritative legal sources
- Regulatory Compliance: Ensure usage complies with local AI and legal practice regulations
Citation
If you use ViLegalQwen2.5-1.5B-Base in your research or applications, please cite:
@misc{vilegalqwen25-2025,
title={ViLegalQwen: Lightweight Large Language Models for Vietnamese Legal Texts},
author={Truong-Phuc Nguyen and Tien-Manh Tran and Manh-Cuong Phan},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base}
}
@misc{qwen2.5,
title={Qwen2.5: A Party of Foundation Models},
author={Qwen Team},
year={2024},
url={https://qwenlm.github.io/blog/qwen2.5/}
}
Contact & Support
For questions, suggestions, or collaboration opportunities:
- GitHub: https://github.com/ntphuc149
- Email: [email protected]
- Issues: Please report issues on the model's discussion page
Disclaimer: This model is provided for research and educational purposes. It should not replace professional legal advice or consultation. Users are responsible for ensuring compliance with applicable laws and regulations in their jurisdiction.