Tülu 3 8B aligned with DPO on a mix of datasets with β = 0.01

This repo contains a LoRA adapter created by aligning Tülu 3 8B with Direct Preference Optimization (DPO) on a mix of preference datasets.

It was trained as part of a series of models for studying DPO alignment.
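
For context, β = 0.01 is the coefficient in the standard DPO objective (Rafailov et al., 2023) that scales the implicit reward margin between the chosen response y_w and the rejected response y_l:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Smaller β values weaken the implicit KL constraint, allowing the aligned policy to move further from the reference model.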

Model details

See the base model card for usage and chat template details.
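
As a minimal sketch, the adapter can be loaded on top of the base model with transformers and peft. The base checkpoint identifier below (allenai/Llama-3.1-Tulu-3-8B) is an assumption inferred from the adapter's name, not confirmed by this card:

```python
# Minimal sketch: load this LoRA adapter for inference with transformers + peft.
# Assumption: the base checkpoint is allenai/Llama-3.1-Tulu-3-8B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "allenai/Llama-3.1-Tulu-3-8B"  # assumed base model
adapter_id = "jmajkutewicz/Llama-3.1-Tulu-3-8B-DPO_dataset-mix"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA adapter

# Use the base model's chat template, as the card directs.
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

If a standalone checkpoint is preferred, `model.merge_and_unload()` can bake the adapter into the base weights after loading.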

Training hyperparameters

  • DPO β: 0.01
  • Epochs: 1
  • Batch size: 8
  • Learning rate: 5e-06
  • Learning rate scheduler: cosine
  • Learning rate warmup ratio: 0.1
  • Gradient accumulation: 2
  • LoRA:
    • rank: 64
    • alpha: 64
    • dropout: 0.05
    • target modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
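
The snippet below is a hedged sketch of how these hyperparameters could map onto trl's DPOTrainer with a peft LoraConfig. The dataset mix, the base checkpoint identifier, and the trl version are assumptions; this is not the actual training script:

```python
# Hedged sketch: DPO LoRA training with trl, using the hyperparameters listed above.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "allenai/Llama-3.1-Tulu-3-8B"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Placeholder: the card does not list the datasets in the mix.
preference_dataset = load_dataset("your/preference-dataset-mix", split="train")

peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="tulu3-8b-dpo-lora",
    beta=0.01,                        # from the model name
    num_train_epochs=1,
    per_device_train_batch_size=8,    # "Batch size: 8" (per-device vs. total unspecified)
    gradient_accumulation_steps=2,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=preference_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # with peft_config, trl uses the frozen base as the reference
)
trainer.train()
```

With a peft_config supplied, DPOTrainer computes reference log-probabilities by disabling the adapter, so no separate reference model copy is needed.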

License

This adapter is released under Meta's Llama 3.1 Community License Agreement. Llama 3.1 is © Meta Platforms, Inc.

Citation

If this work was helpful, please cite:

TBA