Model Description
- Fine-tuned from model: meta-llama/Llama-3.1-70B-Instruct
- Paper: Efficient Safety Retrofitting Against Jailbreaking for LLMs
- Point of Contact: Adrián Tormos
 
Model Summary
This is Llama-3.1-70B-Instruct fine-tuned on the Egida-DPO-Llama-3.1-70B-Instruct dataset.
The Egida dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Llama-3.1-70B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. The result is a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, containing questions that elicit unsafe responses from this target model, together with the unsafe responses it produced (used as the discarded answers). A minimal sketch of this triplet format is shown below.
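The triplets follow the usual prompt/chosen/rejected layout that DPO trainers expect. Here is a minimal, hypothetical sketch of one such record; the field names and file name are assumptions based on common DPO conventions, not the dataset's documented schema:

```python
# Minimal sketch of how a DPO triplet is commonly represented (JSONL).
# Field names ("prompt", "chosen", "rejected") follow the usual TRL/DPO
# convention and are an assumption, not the dataset's documented schema.
import json

triplet = {
    "prompt": "<adversarial question from the Egida train split>",
    "chosen": "<safe answer paired with the question>",
    "rejected": "<unsafe answer produced by Llama-3.1-70B-Instruct>",
}

with open("egida_dpo_triplets.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(triplet, ensure_ascii=False) + "\n")
```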
Training Details
- Hardware: NVIDIA H100 64 GB GPUs
- Devices: 64 GPUs (16 nodes)
- Time: 10.23 h
- Batch Size: 64
- LR: 1e-6
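
As a rough illustration of how these hyperparameters map onto a DPO run, below is a hedged sketch using Hugging Face TRL. The dataset identifier, dtype, and trainer arguments are assumptions for the example; the paper's actual training code, parallelism setup, and any additional hyperparameters are not reproduced here.

```python
# Hedged sketch of a DPO fine-tuning setup with the listed hyperparameters
# (global batch size 64, LR 1e-6). This is an illustration, not the authors' code.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumed dataset identifier; the card only names "Egida-DPO-Llama-3.1-70B-Instruct".
train_dataset = load_dataset("HPAI-BSC/Egida-DPO-Llama-3.1-70B-Instruct", split="train")

config = DPOConfig(
    output_dir="llama-3.1-70b-instruct-egida-dpo",
    per_device_train_batch_size=1,  # 1 per device x 64 GPUs = global batch of 64
    learning_rate=1e-6,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # recent TRL; older versions use tokenizer=
)
trainer.train()
```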
 
Performance
Safety Performance (Attack Success Ratio)
| Model | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|---|---|---|---|---|
| Meta-Llama-3.1-70B-Instruct | 0.274 | 0.170 | 0.320 | 0.084 |
| Meta-Llama-3.1-70B-Instruct-Egida-DPO | 0.009 | 0.007 | 0.006 | 0.005 |
General Purpose Performance
| Model | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|---|---|---|
| Meta-Llama-3.1-70B-Instruct | 0.575 | 0.726 |
| Meta-Llama-3.1-70B-Instruct-Egida-DPO | 0.577 | 0.038 |
Refusal Ratio
| Model | OR-Bench 80K (refusal) ↓ | OR-Bench Hard (refusal) ↓ |
|---|---|---|
| Meta-Llama-3.1-70B-Instruct | 0.008 | 0.022 |
| Meta-Llama-3.1-70B-Instruct-Egida-DPO | 0.347 | 0.351 |
Note that this refusal ratio is computed via keyword matching against a curated list of refusal keywords; a minimal sketch of this style of check is shown below. For more information, see the paper.
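
The following sketch illustrates keyword-based refusal scoring. The keyword list is invented for the example and is not the curated list used in the paper.

```python
# Minimal sketch of keyword-based refusal detection. The keywords below are
# illustrative only; the curated list from the paper is not reproduced here.
REFUSAL_KEYWORDS = [
    "i can't help with",
    "i cannot assist",
    "i'm sorry, but",
    "as an ai",
]

def is_refusal(answer: str) -> bool:
    """Flag an answer as a refusal if it contains any refusal keyword."""
    text = answer.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def refusal_ratio(answers: list[str]) -> float:
    """Fraction of answers flagged as refusals."""
    return sum(is_refusal(a) for a in answers) / len(answers)
```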
Environmental Impact
Citation Information
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs}, 
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603}, 
}