LLaDA-8B-BGPO-countdown

Model Description
LLaDA-8B-BGPO-countdown is an 8-billion-parameter diffusion large language model (dLLM). It was fine-tuned from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) to enhance planning capability on the Countdown task (combining a given set of numbers with arithmetic operations to reach a target value).

Model Details
- Model Type: Diffusion Large Language Model (dLLM)
- Parameters: 8 billion
- Training Method: Boundary-Guided Policy Optimization (BGPO)
- Base Model: LLaDA-8B-Instruct
- Task: Countdown
- Language: English

Training Details
- Training Steps: 560
- Response Length: 256 tokens
- Train Diffusion Steps: 128
- Eval Diffusion Steps: 256
- Block Size: 32
- Monte Carlo Sample Size ($n_t$): 16
- Learning Rate: 5e-7
- Batch Size: 16
- Framework: Built on VeRL (Volcengine Reinforcement Learning)

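To see how the hyperparameters above fit together, here is a minimal sketch of the semi-autoregressive block decoding schedule used by LLaDA-style samplers. It assumes diffusion steps are split evenly across blocks, as in the reference LLaDA sampler; the exact scheduling in this model's training code may differ.

```python
# Hedged sketch: divide the response into fixed-size blocks and split the
# diffusion (denoising) steps evenly across them. All names are illustrative.

def block_schedule(response_length: int, block_size: int, diffusion_steps: int):
    """Return (num_blocks, steps_per_block, tokens_unmasked_per_step)."""
    assert response_length % block_size == 0
    num_blocks = response_length // block_size
    assert diffusion_steps % num_blocks == 0
    steps_per_block = diffusion_steps // num_blocks
    # On average this many masked tokens are finalized at each denoising step.
    tokens_per_step = block_size / steps_per_block
    return num_blocks, steps_per_block, tokens_per_step

# Training configuration: 256-token response, 32-token blocks, 128 steps
print(block_schedule(256, 32, 128))   # → (8, 16, 2.0)
# Evaluation: 256 steps → one token finalized per step within each block
print(block_schedule(256, 32, 256))   # → (8, 32, 1.0)
```

Under this reading, training decodes two tokens per denoising step, while evaluation's 256 steps finalize roughly one token per step, trading latency for quality.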
Usage & Limitations
- Primarily designed for countdown tasks.
- Performance on tasks outside the Countdown training distribution is not guaranteed and may vary.
- Requires appropriate computational resources for inference.
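For reference, the Countdown task can be scored with a simple rule-based checker, in the spirit of the verifiable rewards typically used for RL on this task. The function below is an illustrative sketch, not the reward function used in training, which is not specified here.

```python
import re

# Hedged sketch of a Countdown answer checker. A response is correct iff the
# expression uses exactly the given numbers (each once) with +, -, *, / and
# evaluates to the target.

def check_countdown(expression: str, numbers: list[int], target: int) -> bool:
    # Only digits, arithmetic operators, parentheses, and whitespace allowed.
    if not re.fullmatch(r"[\d+\-*/() .]+", expression):
        return False
    used = [int(n) for n in re.findall(r"\d+", expression)]
    if sorted(used) != sorted(numbers):
        return False
    try:
        value = eval(expression)  # character set restricted above
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6

print(check_countdown("(25 - 3) * 4 + 10", [25, 3, 4, 10], 98))  # → True
print(check_countdown("25 * 4", [25, 3, 4, 10], 100))            # → False
```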