ryokamoi commited on
Commit
6346180
·
verified ·
1 Parent(s): 0653cf2

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +207 -0
README.md ADDED
@@ -0,0 +1,207 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ base_model: Qwen/Qwen2.5-7B-Instruct
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - reward model
9
+ ---
10
+ # FoVer
11
+
12
+ <p align="center">
13
+ <a href="https://fover-prm.github.io/">Project Website</a> | 📄 <a href="https://arxiv.org/abs/">Paper</a> | 🛠️ <a href="https://github.com/psunlpgroup/FoVer">GitHub</a> | 🤗 <a href="https://huggingface.co/collections/ryokamoi/fover-682e28cc9f6200c7dfd5342f">Dataset</a> | 🤗 <a href="https://huggingface.co/collections/ryokamoi/fover-682e28cc9f6200c7dfd5342f">Models</a>
14
+ </p>
15
+
16
+ This repository includes code and materials for the paper "Training Step-Level Reasoning Verifiers with Formal Verification Tools".
17
+
18
+ Please refer to [Quick Start](#quick-start) for a quick start guide to evaluate your models on the FoVer dataset or evaluate the FoVer models on your dataset.
19
+
20
+ * GitHub: [https://github.com/psunlpgroup/FoVer](https://github.com/psunlpgroup/FoVer)
21
+ * FoVer Dataset
22
+ * Raw datasets (including the training, validation, and test splits)
23
+ * [ryokamoi/FoVer-FormalLogic-Llama-3.1-8B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-Llama-3.1-8B)
24
+ * [ryokamoi/FoVer-FormalProof-Llama-3.1-8B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalProof-Llama-3.1-8B)
25
+ * [ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B)
26
+ * [ryokamoi/FoVer-FormalProof-Qwen-2.5-7B](https://huggingface.co/datasets/ryokamoi/FoVer-FormalProof-Qwen-2.5-7B)
27
+ * Balanced datasets for training (including training data only)
28
+ * [ryokamoi/FoVer-FormalLogic-FormalProof-Llama-3.1-8B-LastStepBalanced-40k](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-FormalProof-Llama-3.1-8B-LastStepBalanced-40k)
29
+ * [ryokamoi/FoVer-FormalLogic-FormalProof-Qwen-2.5-7B-LastStepBalanced-40k](https://huggingface.co/datasets/ryokamoi/FoVer-FormalLogic-FormalProof-Qwen-2.5-7B-LastStepBalanced-40k)
30
+
31
+ * FoVer PRMs
32
+ * [ryokamoi/Llama-3.1-8B-FoVer-PRM](https://huggingface.co/ryokamoi/Llama-3.1-8B-FoVer-PRM)
33
+ * [ryokamoi/Qwen-2.5-7B-FoVer-PRM](https://huggingface.co/ryokamoi/Qwen-2.5-7B-FoVer-PRM)
34
+
35
+ * Other materials, including variants of the datasets and intermediate outputs
36
+ * [ryokamoi/FoVer-misc](https://huggingface.co/datasets/ryokamoi/FoVer-misc)
37
+
38
+
39
+ ```bibtex
40
+ @unpublished{kamoi2025fover,
41
+ title = {Training Step-Level Reasoning Verifiers with Formal Verification Tools},
42
+ author = {Ryo Kamoi and Yusen Zhang and Nan Zhang and Sarkar Snigdha Sarathi Das and Rui Zhang},
43
+ year = {2025},
44
+ }
45
+ ```
46
+
47
+ ## Introduction
48
+
49
+ Process reward models (PRMs), which provide step-by-step feedback on the reasoning generated by large language models (LLMs), are receiving increasing attention for their potential to enhance LLMs via reinforcement learning and inference-time refinement.
50
+
51
+ We propose FoVer, an approach for training PRMs on step-level error labels that are automatically annotated using formal verification tools (e.g., Z3, Isabelle). We introduce a dataset that includes automatically annotated step-level error labels on LLM responses for the formal logic and proof tasks. We demonstrate that LLM-based PRMs trained on the FoVer dataset exhibit cross-task transfer of verification capabilities learned in formal logic and proof, leading to improved verification across a broad range of reasoning tasks, including mathematics, academic problems, logic, and abstract reasoning.
52
+
53
+ <div align="center"><img src="readme_figures/fover_overview.png" width="600"></div>
54
+
55
+
56
+ ## Setup
57
+
58
+ To run our PRMs:
59
+
60
+ * torch==2.6.0
61
+ * transformers==4.50.3
62
+
63
+ Please refer to [setup/setup.sh](https://github.com/psunlpgroup/FoVer/setup/setup.sh) for details. We use different environments for dataset creation, training, and evaluation.
64
+
65
+ We run our experiments on the following environment. You might need to modify configulations if you are using a different environment.
66
+
67
+ * Four NVIDIA A100 SXM4 80GB GPUs
68
+ * CUDA Version: 12.2
69
+
70
+ ## Quick Start
71
+
72
+ ### Evaluate Your PRM on the FoVer Datasets
73
+
74
+ The FoVer dataset is initially designed to train models, but our test splits also serves as an evaluation benchmark for PRMs. Our dataset provides the following information. Please refer to [FoVer Dataset](#fover-dataset) for details of other items in our dataset.
75
+
76
+ ```json
77
+ {
78
+ "problem": """Based on the provided facts ($context$), either prove or disprove the hypothesis or state that it is unknown. The facts and the hypothesis are written in logical formulas as follows: capital letters such as "{A}", "{B}", "{AB}" are predicates, small letters such as "{a}", "{b}", "{ab}" are constants, "&" is logical conjunction, "v" is logical disjunction, "¬" is negation, "->" is implication, "(x)" is "for all x", and "(Ex)" is "for some x".\n\n$hypothesis$: ¬{A}\n\n$context$:\nfact1: {IN}\nfact2: {BH}\nfact3: {EE}\nfact4: ¬{B} -> ({A} & {FH})\nfact5: {CA}\nfact6: {GO}\nfact7: {IR}\nfact8: {HH}\nfact9: {JI}\nfact10: {AN}\nfact11: {C} -> ({B} & ¬{A})\nfact12: {HP}\nfact13: {GK}\nfact14: {JC}\nfact15: ¬{E} -> ({C} & {D})\nfact16: {T}\nfact17: {H}\nfact18: {AF}""",
79
+ "solution_steps": [
80
+ "fact11 -> int1: {B} & ¬{A}",
81
+ "int1 -> int2: ¬{A}",
82
+ "The final answer is PROVED"
83
+ ],
84
+ "error_labels": [false, true, true]
85
+ }
86
+ ```
87
+
88
+ You can access our dataset from Hugging Face Hub.
89
+
90
+ ```python
91
+ from datasets import load_dataset
92
+
93
+ dataset = load_dataset("ryokamoi/FoVer-FormalLogic-Qwen-2.5-7B", split="validation")
94
+
95
+ print(dataset[0].keys())
96
+ # dict_keys(['id', 'problem', 'solution_steps', 'error_labels',
97
+ # 'problem_witout_definition', 'messages', 'base_dataset',
98
+ # 'messages_for_prediction', 'hypothesis_formula', 'facts_formula'])
99
+
100
+ print(dataset[0]['error_labels'])
101
+ # [True, True, True, True, True, False, True, False]
102
+ ```
103
+
104
+ ### Evaluate the FoVer PRMs on Your Dataset
105
+
106
+ Here is the minimum example to run FoVer PRMs. Please clone our GitHub repository to use the post-processing functions.
107
+
108
+ ```python
109
+ from transformers import AutoTokenizer, AutoModelForCausalLM
110
+ from src.prm.preprocessing import get_fover_input_format
111
+ from src.prm.postprocessing import extract_fover_scores
112
+
113
+ # ryokamoi/Qwen-2.5-7B-FoVer-PRM or
114
+ # ryokamoi/Llama-3.1-8B-FoVer-PRM
115
+ prm_name = "ryokamoi/Qwen-2.5-7B-FoVer-PRM"
116
+
117
+ tokenizer = AutoTokenizer.from_pretrained(prm_name)
118
+ model = AutoModelForCausalLM.from_pretrained(prm_name).to("cuda")
119
+
120
+ # Get input format for the FoVer PRM
121
+ conversation = get_fover_input_format(
122
+ problem="Calculate (1+1)*(1+2)",
123
+ solution_steps=["1+1=2", "1+2=3", "2*3=8"],
124
+ )
125
+ inputs = tokenizer.apply_chat_template(
126
+ conversation, return_tensors="pt").to("cuda")
127
+
128
+ # Generate the step-level scores
129
+ output = model(inputs)
130
+
131
+ # extract the step-level scores
132
+ scores = extract_fover_scores(
133
+ tokenized_prompt=inputs[0].cpu().numpy(),
134
+ logits=output.logits[0],
135
+ tokenizer=tokenizer,
136
+ )
137
+
138
+ print(scores)
139
+ # [0.9099470376968384, 0.9997847676277161, 0.012338237836956978]
140
+ ```
141
+
142
+ We also provide a script to evaluate the FoVer PRMs on your dataset.
143
+
144
+ First, convert your dataset into a JSONL file whose rows are in the following format and put at [quickstart/dataset/testdata.jsonl](https://github.com/psunlpgroup/FoVer/quickstart/dataset/testdata.jsonl).
145
+
146
+ ```json
147
+ {"problem": "this is a problem.", "solution_steps": ["first step (correct)", "second step (wrong)", "third step (unknown)"], "error_labels": [true, false, null]}
148
+ ```
149
+
150
+ Then, run the following command to evaluate the PRM on your dataset. We use the minimum step-level score as an instance-level score by default.
151
+
152
+ ```bash
153
+ python quickstart/evaluate.py \
154
+ --fover_prm_name ryokamoi/Qwen-2.5-7B-FoVer-PRM \
155
+ --dataset_dir quickstart/dataset/test_data \
156
+ --output_dir quickstart/results/
157
+ ```
158
+
159
+ You will get the following outputs.
160
+
161
+ * `quickstart/results/testdata/performance.json`
162
+ * The performance metrics of the FoVer PRM on your dataset.
163
+ * The step-level and instance-level scores by the FoVer PRM on your dataset.
164
+
165
+ ## FoVer Dataset
166
+
167
+ We provide the FoVer datasets that include the mistakes made by Llama 3.1 8B and Qwen 2.5 7B on formal logic and proof tasks.
168
+
169
+ ### Dataset Format
170
+
171
+ Each instance of the FoVer datasets include the following items.
172
+
173
+ * `problem` (str)
174
+ * `solution_steps` (list[str])
175
+ * The solution steps generated by the model.
176
+ * `error_labels` (list[str])
177
+ * The ground-truth error labels generated by the error verification tools (Z3, Isabelle)
178
+ * `messages` (list[dict[str, str]])
179
+ * The conversation we use for fine-tuning our PRMs.
180
+ * `messages_for_prediction` (list[dict[str, str]])
181
+ * The conversation we use for prediction. The model outputs are dummy values and all `correct`.
182
+ * `problem_witout_definition` (str)
183
+ * The `problem` without task definition (metadata, not used in our experiments).
184
+
185
+ ### Dataset Statistics
186
+
187
+ <div align="center"><img src="readme_figures/fover_stats.png" width="600"></div>
188
+
189
+ ### LastStepBalanced Dataset
190
+
191
+ We create the LastStepBalanced dataset to train PRMs on the balanced dataset where the last step includes 50% of correct and 50% of incorrect steps. We truncate solutions to make the last step balanced, so we expect to mask all steps but the last step to train the PRMs.
192
+
193
+ Specificlaly, we use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) with the option `mask_history: true`.
194
+
195
+ ### Creating Training Data for New Models
196
+
197
+ You can create mistakes made by stronger models to make a better training dataset. Please refer to [run/01_dataset_creation](run/01_dataset_creation) for the dataset creation process. You may need to update our code to support other models.
198
+
199
+ ## Reproducing the Experiments in the Paper
200
+
201
+ You can refer to shell files in the [run](run) directory to reproduce the experiments in our paper.
202
+
203
+ You do not need to run the code if you are only interested in using our models or datasets. Please refer to [Quick Start](#quick-start).
204
+
205
+ ## License
206
+
207
+ Please refer to the [LICENSE.md](https://github.com/psunlpgroup/FoVer/LICENSE.md) file for the license of this repository.