ViLegalQwen2.5-1.5B-Base

Model Description

ViLegalQwen2.5-1.5B-Base is a Vietnamese legal language model developed through continual pretraining of Qwen/Qwen2.5-1.5B on a 17GB Vietnamese legal corpus. The model is designed for Vietnamese legal text understanding and generation, offering improved performance on legal-domain tasks while retaining the foundational capabilities of the Qwen2.5 architecture.

⚠️ Important Notice: This is a base model that requires fine-tuning for optimal performance in downstream tasks. We strongly recommend applying supervised fine-tuning (SFT), instruction tuning, or other post-training techniques before production use.

Model Details

Architecture & Specifications

  • Base Model: Qwen/Qwen2.5-1.5B
  • Model Type: Causal Language Model (Dense)
  • Parameters: 1.54B total (1.31B non-embedding)
  • Architecture: Transformer with 28 layers
  • Attention: Grouped Query Attention (GQA) - 12 heads for Q, 2 heads for KV
  • Context Length: 2,048 tokens (training sequence length)
  • Vocabulary Size: 151,646 tokens
  • License: CC BY-NC-SA 4.0
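
The figures above can be checked directly against the released checkpoint's configuration. The sketch below is illustrative: it assumes the standard Transformers Qwen2 config field names, and the printed values come from the checkpoint itself.

```python
# Illustrative check of the specifications above (assumes standard
# Transformers/Qwen2 config field names; values are read from the checkpoint).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
print("hidden layers:    ", config.num_hidden_layers)    # 28 per the card
print("query heads (GQA):", config.num_attention_heads)  # 12 per the card
print("KV heads (GQA):   ", config.num_key_value_heads)  # 2 per the card
print("vocabulary size:  ", config.vocab_size)
```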

Training Details

  • Training Method: Continual Pretraining
  • Training Data: 17GB Vietnamese legal corpus
  • Training Objective: Causal Language Modeling
  • Optimization:
    • Optimizer: AdamW with fused implementation
    • Learning Rate: 5e-5 with cosine annealing
    • Batch Size: Effective batch size of 48 with gradient accumulation
    • Precision: Mixed precision training
  • Hardware: NVIDIA A100 GPU
  • Training Framework: Hugging Face Transformers + PyTorch
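
For reference, a minimal continual-pretraining sketch that mirrors the hyperparameters above is shown below. This is not the authors' training script: the per-device batch size / gradient-accumulation split and the toy dataset are illustrative placeholders (the card only states an effective batch size of 48 and a 17GB corpus packed into 2,048-token sequences).

```python
# Continual-pretraining sketch with Hugging Face Transformers (illustrative,
# not the authors' exact script). The toy dataset stands in for the packed
# 2,048-token Vietnamese legal corpus.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder corpus; the real run used 17GB of Vietnamese legal text.
texts = ["Điều 1. Phạm vi điều chỉnh ...", "Điều 2. Đối tượng áp dụng ..."]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="vilegal-qwen2.5-1.5b-cpt",
    per_device_train_batch_size=8,   # illustrative split; the card only states
    gradient_accumulation_steps=6,   # an effective batch size of 48
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    bf16=True,                       # mixed precision on an A100
    num_train_epochs=1,
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```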

Legal Corpus Composition

The training corpus was compiled through systematic crawling and curation of Vietnamese legal documents:

Corpus Statistics:

  • Initial Collection: 1,136,839 crawled legal documents
  • Post-Deduplication: 17GB of raw Vietnamese legal texts
  • Document Types: Laws, decrees, circulars, decisions, regulations, and legal interpretations
  • Coverage: Comprehensive Vietnamese legal framework from multiple authoritative sources

Performance & Capabilities

Strengths

  • Legal Domain Expertise: Enhanced understanding of Vietnamese legal terminology and concepts
  • Document Structure: Improved comprehension of legal document formats and hierarchies
  • Contextual Understanding: Better grasp of legal relationships and dependencies
  • Vietnamese Language: Native-level Vietnamese legal language processing

Evaluation

This model has been trained through continual pretraining on Vietnamese legal texts.

Comprehensive evaluation results coming soon.

Usage

As noted above, this is a base model: apply supervised fine-tuning (SFT), instruction tuning, or other post-training techniques before production use.

For usage examples and implementation guidance, please refer to the Qwen2.5 documentation and Transformers documentation.
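
A minimal loading and generation sketch with Transformers is shown below. Because this is a base (non-instruct) model, it continues text rather than following instructions, so prompts should look like the beginning of a legal passage.

```python
# Minimal inference sketch. The model continues the given text; it is not
# instruction-tuned, so prompt it with the start of a legal passage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ntphuc149/ViLegalQwen2.5-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Điều 1. Phạm vi điều chỉnh"  # "Article 1. Scope of regulation"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```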

Model Limitations & Considerations

Limitations

  • Domain Specificity: Optimized for legal domain; may underperform on general Vietnamese text
  • Base Model Nature: Requires fine-tuning for optimal task-specific performance
  • Training Data Bias: Performance may reflect biases present in Vietnamese legal corpus
  • Context Constraints: The model was pretrained on legal texts with 2,048-token sequences. While the base architecture supports up to 32K tokens, performance may degrade significantly with contexts exceeding the training length (see the tokenization sketch after this list)
  • Temporal Limitations: Training data has a temporal cutoff; may not reflect the most recent legal changes
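
To stay within the 2,048-token training length, inputs can be truncated at tokenization time, as in the sketch below. This is a conservative choice; the underlying architecture itself accepts longer contexts.

```python
# Truncating inputs to the 2,048-token training length (conservative choice).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
long_document = "Điều 1. ..."  # placeholder for a long Vietnamese legal document
inputs = tokenizer(long_document, truncation=True, max_length=2048, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most 2,048 tokens
```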

Ethical Considerations

  • Not Legal Advice: This model should NOT be used to provide actual legal advice
  • Professional Review Required: All model outputs should be reviewed by qualified legal professionals
  • Bias Awareness: Users should be aware of potential biases in legal interpretation
  • Responsible Use: Model should be used responsibly within appropriate legal and ethical frameworks

Safety Measures

  • Human Oversight: Always require human legal expert oversight
  • Output Verification: Verify all generated content against authoritative legal sources
  • Regulatory Compliance: Ensure usage complies with local AI and legal practice regulations

Citation

If you use ViLegalQwen2.5-1.5B-Base in your research or applications, please cite:

@misc{vilegalqwen25-2025,
  title={ViLegalQwen: Lightweight Large Language Models for Vietnamese Legal Texts},
  author={Truong-Phuc Nguyen and Tien-Manh Tran and Manh-Cuong Phan},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base}
}

@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  url={https://qwenlm.github.io/blog/qwen2.5/}
}

Contact & Support

For questions, suggestions, or collaboration opportunities:


Disclaimer: This model is provided for research and educational purposes. It should not replace professional legal advice or consultation. Users are responsible for ensuring compliance with applicable laws and regulations in their jurisdiction.
