LLaDA-8B-BGPO-countdown

Model Description
LLaDA-8B-BGPO-countdown is an 8-billion-parameter diffusion large language model (dLLM). It was fine-tuned from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) to enhance planning capability on the Countdown task (combining a given set of numbers with arithmetic operations to reach a target value).

Model Details
- Model Type: Diffusion Large Language Model (dLLM)
- Parameters: 8 billion
- Training Method: Boundary-Guided Policy Optimization (BGPO)
- Base Model: LLaDA-8B-Instruct
- Task: Countdown
- Language: English

Training Details
- Training Steps: 560
- Response Length: 256 tokens
- Train Diffusion Steps: 128
- Eval Diffusion Steps: 256
- Block Size: 32
- Monte Carlo Sample Size ($n_t$): 16
- Learning Rate: 5e-7
- Batch Size: 16
- Framework: Built on VeRL (Volcengine Reinforcement Learning)

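To see how the hyperparameters above fit together, here is a minimal sketch of the semi-autoregressive block decoding schedule used by LLaDA-style samplers. It assumes diffusion steps are split evenly across blocks, as in the reference LLaDA sampler; the exact scheduling in this model's training code may differ.

```python
# Hedged sketch: divide the response into fixed-size blocks and split the
# diffusion (denoising) steps evenly across them. All names are illustrative.

def block_schedule(response_length: int, block_size: int, diffusion_steps: int):
    """Return (num_blocks, steps_per_block, tokens_unmasked_per_step)."""
    assert response_length % block_size == 0
    num_blocks = response_length // block_size
    assert diffusion_steps % num_blocks == 0
    steps_per_block = diffusion_steps // num_blocks
    # On average this many masked tokens are finalized at each denoising step.
    tokens_per_step = block_size / steps_per_block
    return num_blocks, steps_per_block, tokens_per_step

# Training configuration: 256-token response, 32-token blocks, 128 steps
print(block_schedule(256, 32, 128))   # → (8, 16, 2.0)
# Evaluation: 256 steps → one token finalized per step within each block
print(block_schedule(256, 32, 256))   # → (8, 32, 1.0)
```

Under this reading, training decodes two tokens per denoising step, while evaluation's 256 steps finalize roughly one token per step, trading latency for quality.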
Usage & Limitations
- Primarily designed for countdown tasks.
- Performance on tasks outside the Countdown training distribution is not guaranteed and may vary.
- Requires appropriate computational resources for inference.
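For reference, the Countdown task can be scored with a simple rule-based checker, in the spirit of the verifiable rewards typically used for RL on this task. The function below is an illustrative sketch, not the reward function used in training, which is not specified here.

```python
import re

# Hedged sketch of a Countdown answer checker. A response is correct iff the
# expression uses exactly the given numbers (each once) with +, -, *, / and
# evaluates to the target.

def check_countdown(expression: str, numbers: list[int], target: int) -> bool:
    # Only digits, arithmetic operators, parentheses, and whitespace allowed.
    if not re.fullmatch(r"[\d+\-*/() .]+", expression):
        return False
    used = [int(n) for n in re.findall(r"\d+", expression)]
    if sorted(used) != sorted(numbers):
        return False
    try:
        value = eval(expression)  # character set restricted above
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6

print(check_countdown("(25 - 3) * 4 + 10", [25, 3, 4, 10], 98))  # → True
print(check_countdown("25 * 4", [25, 3, 4, 10], 100))            # → False
```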