---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
tags:
- reward model
---

# FoVer

<p align="center">
<a href="https://fover-prm.github.io/">Project Website</a> | 📄 <a href="https://arxiv.org/abs/">Paper</a> | 🛠️ <a href="https://github.com/psunlpgroup/FoVer">GitHub</a> | 🤗 <a href="https://huggingface.co/collections/ryokamoi/fover-682e28cc9f6200c7dfd5342f">Dataset</a> | 🤗 <a href="https://huggingface.co/collections/ryokamoi/fover-682e28cc9f6200c7dfd5342f">Models</a>
</p>

This repository includes code and materials for the paper "Training Step-Level Reasoning Verifiers with Formal Verification Tools".

Please refer to [Quick Start](#quick-start) for a quick start guide to evaluate your models on the FoVer dataset or to evaluate the FoVer models on your dataset.

* GitHub: [https://github.com/psunlpgroup/FoVer](https://github.com/psunlpgroup/FoVer)
* FoVer Dataset
  * Raw datasets (including the training, validation, and test splits)
    * [ryokamoi/FoVer-FormalLogic-Llama-3.1-8B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-Llama-3.1-8B)
    * [ryokamoi/FoVer-FormalProof-Llama-3.1-8B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalProof-Llama-3.1-8B)
    * [ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B)
    * [ryokamoi/FoVer-FormalProof-Qwen-2.5-7B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalProof-Qwen-2.5-7B)
  * Balanced datasets for training (including training data only)
    * [ryokamoi/FoVer-FormalLogic-FormalProof-Llama-3.1-8B-LastStepBalanced-40k](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-FormalProof-Llama-3.1-8B-LastStepBalanced-40k)
    * [ryokamoi/FoVer-FormalLogic-FormalProof-Qwen-2.5-7B-LastStepBalanced-40k](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-FormalProof-Qwen-2.5-7B-LastStepBalanced-40k)
* FoVer PRMs
  * [ryokamoi/Llama-3.1-8B-FoVer-PRM](https://huggingface.co/ryokamoi/Llama-3.1-8B-FoVer-PRM)
  * [ryokamoi/Qwen-2.5-7B-FoVer-PRM](https://huggingface.co/ryokamoi/Qwen-2.5-7B-FoVer-PRM)
* Other materials, including variants of the datasets and intermediate outputs
  * [ryokamoi/FoVer-misc](https://huggingface.co/datasets/ryokamoi/FoVer-misc)

Please cite our paper as follows:

```bibtex
@unpublished{kamoi2025fover,
    title = {Training Step-Level Reasoning Verifiers with Formal Verification Tools},
    author = {Ryo Kamoi and Yusen Zhang and Nan Zhang and Sarkar Snigdha Sarathi Das and Rui Zhang},
    year = {2025},
}
```
## Introduction

Process reward models (PRMs), which provide step-by-step feedback on the reasoning generated by large language models (LLMs), are receiving increasing attention for their potential to enhance LLMs via reinforcement learning and inference-time refinement.

We propose FoVer, an approach for training PRMs on step-level error labels that are automatically annotated using formal verification tools (e.g., Z3, Isabelle). We introduce a dataset that includes automatically annotated step-level error labels on LLM responses for formal logic and proof tasks. We demonstrate that LLM-based PRMs trained on the FoVer dataset exhibit cross-task transfer of the verification capabilities learned on formal logic and proof, leading to improved verification across a broad range of reasoning tasks, including mathematics, academic problems, logic, and abstract reasoning.

<div align="center"><img src="readme_figures/fover_overview.png" width="600"></div>
## Setup

To run our PRMs:

* torch==2.6.0
* transformers==4.50.3

Please refer to [setup/setup.sh](https://github.com/psunlpgroup/FoVer/setup/setup.sh) for details. We use different environments for dataset creation, training, and evaluation.

We ran our experiments in the following environment. You might need to modify the configurations if you are using a different environment.

* Four NVIDIA A100 SXM4 80GB GPUs
* CUDA Version: 12.2
## Quick Start

### Evaluate Your PRM on the FoVer Datasets

The FoVer dataset is primarily designed for training models, but our test splits also serve as an evaluation benchmark for PRMs. Our dataset provides the following information. Please refer to [FoVer Dataset](#fover-dataset) for details of the other items in our dataset.

```json
{
  "problem": """Based on the provided facts ($context$), either prove or disprove the hypothesis or state that it is unknown. The facts and the hypothesis are written in logical formulas as follows: capital letters such as "{A}", "{B}", "{AB}" are predicates, small letters such as "{a}", "{b}", "{ab}" are constants, "&" is logical conjunction, "v" is logical disjunction, "¬" is negation, "->" is implication, "(x)" is "for all x", and "(Ex)" is "for some x".\n\n$hypothesis$: ¬{A}\n\n$context$:\nfact1: {IN}\nfact2: {BH}\nfact3: {EE}\nfact4: ¬{B} -> ({A} & {FH})\nfact5: {CA}\nfact6: {GO}\nfact7: {IR}\nfact8: {HH}\nfact9: {JI}\nfact10: {AN}\nfact11: {C} -> ({B} & ¬{A})\nfact12: {HP}\nfact13: {GK}\nfact14: {JC}\nfact15: ¬{E} -> ({C} & {D})\nfact16: {T}\nfact17: {H}\nfact18: {AF}""",
  "solution_steps": [
    "fact11 -> int1: {B} & ¬{A}",
    "int1 -> int2: ¬{A}",
    "The final answer is PROVED"
  ],
  "error_labels": [false, true, true]
}
```
You can access our dataset from the Hugging Face Hub.

```python
from datasets import load_dataset

dataset = load_dataset("ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B", split="validation")

print(dataset[0].keys())
# dict_keys(['id', 'problem', 'solution_steps', 'error_labels',
#   'problem_witout_definition', 'messages', 'base_dataset',
#   'messages_for_prediction', 'hypothesis_formula', 'facts_formula'])

print(dataset[0]['error_labels'])
# [True, True, True, True, True, False, True, False]
```
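If you want to benchmark your own PRM on these test splits, you can compare its step-level predictions against `error_labels`. The following is a minimal sketch, assuming a hypothetical `predict_step_labels` function (not part of this repository) that returns one boolean per solution step, with `True` meaning the step is judged correct:

```python
from datasets import load_dataset

def predict_step_labels(problem: str, solution_steps: list[str]) -> list[bool]:
    # Hypothetical placeholder for your own PRM: return True for steps judged
    # correct and False for steps judged incorrect, one value per step.
    raise NotImplementedError

dataset = load_dataset("ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B", split="validation")

num_correct = 0
num_steps = 0
for example in dataset:
    predictions = predict_step_labels(example["problem"], example["solution_steps"])
    for predicted, gold in zip(predictions, example["error_labels"]):
        num_correct += int(predicted == gold)
        num_steps += 1

print(f"Step-level accuracy: {num_correct / num_steps:.3f}")
```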
### Evaluate the FoVer PRMs on Your Dataset

Here is a minimal example of running the FoVer PRMs. Please clone our GitHub repository to use the pre- and post-processing functions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.prm.preprocessing import get_fover_input_format
from src.prm.postprocessing import extract_fover_scores

# ryokamoi/Qwen-2.5-7B-FoVer-PRM or
# ryokamoi/Llama-3.1-8B-FoVer-PRM
prm_name = "ryokamoi/Qwen-2.5-7B-FoVer-PRM"

tokenizer = AutoTokenizer.from_pretrained(prm_name)
model = AutoModelForCausalLM.from_pretrained(prm_name).to("cuda")

# Convert the problem and solution steps into the FoVer PRM input format
conversation = get_fover_input_format(
    problem="Calculate (1+1)*(1+2)",
    solution_steps=["1+1=2", "1+2=3", "2*3=8"],
)
inputs = tokenizer.apply_chat_template(
    conversation, return_tensors="pt").to("cuda")

# Run the model to get the logits
output = model(inputs)

# Extract the step-level scores
scores = extract_fover_scores(
    tokenized_prompt=inputs[0].cpu().numpy(),
    logits=output.logits[0],
    tokenizer=tokenizer,
)

print(scores)
# [0.9099470376968384, 0.9997847676277161, 0.012338237836956978]
```
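Each value in `scores` is the PRM's score for the corresponding solution step, with higher values meaning the step is more likely correct; in the example above, the third step (`2*3=8`) is correctly flagged as an error. If you need hard step-level decisions, you can threshold the scores. The snippet below continues from the example above; the 0.5 cutoff is our assumption, not a value fixed by this repository.

```python
# Continuing from the example above: `scores` contains one score per solution step.
threshold = 0.5  # assumed cutoff; tune on a validation set if needed
step_is_correct = [score >= threshold for score in scores]
print(step_is_correct)
# [True, True, False] for the scores shown above
```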
We also provide a script to evaluate the FoVer PRMs on your dataset.

First, convert your dataset into a JSONL file whose rows are in the following format, and place it at [quickstart/dataset/testdata.jsonl](https://github.com/psunlpgroup/FoVer/quickstart/dataset/testdata.jsonl).

```json
{"problem": "this is a problem.", "solution_steps": ["first step (correct)", "second step (wrong)", "third step (unknown)"], "error_labels": [true, false, null]}
```
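The snippet below is a minimal sketch of how such a file can be written with Python's standard library; the example row is a placeholder.

```python
import json

# Placeholder rows in the expected format; Python's None is written as JSON null.
examples = [
    {
        "problem": "this is a problem.",
        "solution_steps": ["first step (correct)", "second step (wrong)", "third step (unknown)"],
        "error_labels": [True, False, None],
    },
]

with open("quickstart/dataset/testdata.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```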
Then, run the following command to evaluate the PRM on your dataset. We use the minimum step-level score as an instance-level score by default.

```bash
python quickstart/evaluate.py \
    --fover_prm_name ryokamoi/Qwen-2.5-7B-FoVer-PRM \
    --dataset_dir quickstart/dataset/test_data \
    --output_dir quickstart/results/
```
You will get the following outputs:

* `quickstart/results/testdata/performance.json`
  * The performance metrics of the FoVer PRM on your dataset.
  * The step-level and instance-level scores from the FoVer PRM on your dataset.
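For reference, the default instance-level score mentioned above is simply the minimum over the step-level scores, so an instance is rated by its weakest step. A minimal sketch (variable names are ours; `quickstart/evaluate.py` may differ in details):

```python
# Step-level scores for one instance, e.g. produced by a FoVer PRM.
step_scores = [0.91, 0.99, 0.01]

# Default aggregation: the instance is only as good as its weakest step.
instance_score = min(step_scores)
print(instance_score)  # 0.01
```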
## FoVer Dataset

We provide the FoVer datasets, which include mistakes made by Llama 3.1 8B and Qwen 2.5 7B on formal logic and proof tasks.

### Dataset Format

Each instance of the FoVer datasets includes the following items.

* `problem` (str)
  * The problem statement.
* `solution_steps` (list[str])
  * The solution steps generated by the model.
* `error_labels` (list[bool])
  * The ground-truth error labels produced by the formal verification tools (Z3, Isabelle). `true` marks a correct step and `false` an incorrect step.
* `messages` (list[dict[str, str]])
  * The conversation we use for fine-tuning our PRMs.
* `messages_for_prediction` (list[dict[str, str]])
  * The conversation we use for prediction. The model outputs are dummy values that label every step as `correct`.
* `problem_witout_definition` (str)
  * The `problem` without the task definition (metadata, not used in our experiments).
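Since `solution_steps` and `error_labels` are parallel lists, the fields can be paired up directly. The sketch below locates the first incorrect step of an instance, assuming (as in the examples above) that `true` marks a correct step:

```python
from datasets import load_dataset

dataset = load_dataset("ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B", split="validation")
example = dataset[0]

# `solution_steps` and `error_labels` are aligned step by step.
for index, (step, is_correct) in enumerate(
        zip(example["solution_steps"], example["error_labels"])):
    if not is_correct:
        print(f"First incorrect step ({index}): {step}")
        break
else:
    print("All steps are labeled correct.")
```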
### Dataset Statistics

<div align="center"><img src="readme_figures/fover_stats.png" width="600"></div>

### LastStepBalanced Dataset

We create the LastStepBalanced dataset to train PRMs on data whose last step is balanced: 50% of the instances end with a correct step and 50% with an incorrect step. Solutions are truncated to balance the last step, so all steps except the last one are expected to be masked during training.

Specifically, we use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) with the option `mask_history: true`.
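The sketch below illustrates the truncation idea on a single instance; it is a simplified illustration under our own assumptions, not the exact script used to build the released LastStepBalanced datasets.

```python
def truncate_for_last_step(solution_steps, error_labels, want_correct_last_step):
    """Return the longest prefix whose last step has the requested label,
    or None if no such prefix exists. Illustration only."""
    for i in reversed(range(len(error_labels))):
        if error_labels[i] == want_correct_last_step:
            return solution_steps[: i + 1], error_labels[: i + 1]
    return None

# Made-up instance for illustration.
steps = ["fact1 -> int1: {A}", "int1 -> int2: {B}", "The final answer is PROVED"]
labels = [True, False, True]

# Truncate so that the last step is an incorrect one.
print(truncate_for_last_step(steps, labels, want_correct_last_step=False))
# (['fact1 -> int1: {A}', 'int1 -> int2: {B}'], [True, False])
```

Sampling equal numbers of correct-last-step and incorrect-last-step prefixes then yields the 50/50 balance, and masking everything but the last step keeps the supervision on the balanced label.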
### Creating Training Data for New Models

You can collect mistakes made by stronger models to build a better training dataset. Please refer to [run/01_dataset_creation](run/01_dataset_creation) for the dataset creation process. You may need to update our code to support other models.

## Reproducing the Experiments in the Paper

You can refer to the shell scripts in the [run](run) directory to reproduce the experiments in our paper.

You do not need to run this code if you are only interested in using our models or datasets. Please refer to [Quick Start](#quick-start) instead.

## License

Please refer to the [LICENSE.md](https://github.com/psunlpgroup/FoVer/LICENSE.md) file for the license of this repository.