ProverGen is a novel framework that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers to create scalable, diverse, and high-quality First-Order Logic (FOL) reasoning datasets.
Using this framework, we have created ProverQA, a challenging FOL reasoning benchmark consisting of 1,500 evaluation instances and 5,000 training instances across three difficulty levels.
- 🎯 Scalable Generation: Automated framework enabling expansion with minimal manual intervention
- 🌍 Natural & Diverse Language: Wide range of natural language expressions reflecting real-world linguistic variability
- 🔬 Symbolic Validation: Formal symbolic structures validated through Prover9 symbolic prover
- 🔗 Faithful Reasoning Chains: Transparent intermediate reasoning steps in both symbolic and natural language formats
- 📊 Complete FOL Coverage: All seven First-Order Logic relationships (∧, ∨, ¬, →, ⊕, ∀, ∃)
- ProverGen: A Framework for First-Order Logic Reasoning Dataset Generation
conda create -n provergen python=3.8
conda activate provergen
git clone https://github.com/opendatalab/ProverGen
cd ./ProverGen
pip install -r requirements.txt
The ProverGen framework leverages Prover9 for logic skeleton generation. Download it from the Prover9 website (select `LADR-2009-11A.tar.gz`) and follow the installation guide.

Linux systems: Since Prover9 is outdated, `make all` may fail. Open `LADR-2009-11A/provers.src/Makefile` and move every `-lm` flag to the end of its line:
# Before
$(CC) $(CFLAGS) -lm -o newsax newsax.o $(OBJECTS) ../ladr/libladr.a
# After
$(CC) $(CFLAGS) -o newsax newsax.o $(OBJECTS) ../ladr/libladr.a -lm
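If you prefer not to edit the Makefile by hand, a small Python sketch like the one below applies the same change (it assumes the archive was unpacked in the current directory; back up the Makefile first):

```python
# Move every "-lm" flag to the end of its linker line in provers.src/Makefile.
from pathlib import Path

makefile = Path("LADR-2009-11A/provers.src/Makefile")
fixed = []
for line in makefile.read_text().splitlines():
    if "$(CC)" in line and " -lm " in line:
        # Remove the mid-line flag and re-append it after the libraries.
        line = line.replace(" -lm", "") + " -lm"
    fixed.append(line)
makefile.write_text("\n".join(fixed) + "\n")
```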
macOS systems: You may encounter implicit declaration errors. Fix them as follows:
- In `LADR-2009-11A/mace4.src/select.c`, line 236:
// Before
int select_concentric_band(min_id, max_id, max_constrained)
// After
int select_concentric_band(int min_id, int max_id, int max_constrained)
- In `LADR-2009-11A/mace4.src/msearch.c`, line 850:
// Before
int next_domain_size(n)
// After
int next_domain_size(int n)
After installation, update the binary locations in `logic_skeleton_generator.py` at line 20.
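The exact assignment depends on the file's current contents; the snippet below only illustrates the shape of the change, with placeholder variable names and paths rather than the real identifiers at line 20:

```python
# Placeholder example only -- use the actual variable names found at line 20
# of logic_skeleton_generator.py and your own installation paths.
PROVER9_BINARY = "/path/to/LADR-2009-11A/bin/prover9"
MACE4_BINARY = "/path/to/LADR-2009-11A/bin/mace4"
```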
The ProverQA dataset created using this framework is available on:
- 🤗 Hugging Face: opendatalab/ProverQA
- 🌐 OpenDataLab: ProverQA
Dataset Structure:
- Development Set: 1,500 instances (500 easy, 500 medium, 500 hard)
- Training Set: 5,000 instances with data augmentation
- Difficulty Levels: Easy (1-2 steps), Medium (3-5 steps), Hard (6-9 steps)
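For programmatic access, the dataset can be loaded with the Hugging Face `datasets` library. The sketch below assumes the difficulty subsets are exposed under the names `easy`, `medium`, and `hard`; check the dataset card for the exact configuration and split names:

```python
# Sketch: load and inspect ProverQA from the Hugging Face Hub.
# The subset name "easy" is an assumption; see the dataset card for exact names.
from datasets import load_dataset

easy = load_dataset("opendatalab/ProverQA", "easy")
print(easy)                           # available splits and their sizes
print(next(iter(easy.values()))[0])   # peek at one example
```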
The generation process consists of three main steps:
Step 1: Generate the logical structure (logic skeleton) using the symbolic prover with a novel top-down approach.
python3 logic_skeleton_generator.py --mode easy --num 500 --output_dir outputs/logic_data
Parameters:
- `mode`: Difficulty level (`easy`, `medium`, `hard`)
- `num`: Number of logic skeletons to generate
- `output_dir`: Output directory
The script also allows customization of the distribution of answers ([True, False, Uncertain]) and the proportion of composite conclusions.
Here are the relevant parameters:
- `goal_value_probs`: Distribution over [True, False, Uncertain] (e.g., [0.4, 0.3, 0.3])
- `rule_candidate_path`: Path to the rule pool file
- `rule_as_goal_proportion`: Proportion of fact vs. rule conclusions (e.g., [0.7, 0.3])
- `fact_num_threshold`: Once the fact pool exceeds this size, there is a chance the fact is provided directly
- `fact_num_prob`: Probability of directly providing a fact
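After generating a batch, it can be worth sanity-checking that the empirical answer distribution roughly matches `goal_value_probs`. The sketch below assumes the skeletons are written as JSON and store their label under a key named `answer`; adjust the glob pattern and key to the generator's actual output schema:

```python
# Count answer labels across generated logic skeletons.
# The output path and the "answer" key are assumptions about the file schema.
import collections
import glob
import json

counts = collections.Counter()
for path in glob.glob("outputs/logic_data/*.json"):
    with open(path) as f:
        data = json.load(f)
    for record in data if isinstance(data, list) else [data]:
        counts[record.get("answer", "missing")] += 1

print(counts)
```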
Step 2: Convert the logic expressions into natural language using LLMs.
python3 logic_skeleton_translator.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--data_dir outputs/logic_data \
--num 100 --start 0 --end 100 \
--output_dir outputs/translated_data \
--mode hard \
--base_url localhost:6417 --api_key EMPTY
Parameters:
- `model_name`: LLM used for translation
- `data_dir`: Path to the logic skeleton files produced in Step 1
- `num`: Total number of logic skeletons
- `start`/`end`: Index range to process
- `output_dir`: Directory to store the output file
- `mode`: Difficulty level
- `base_url`/`api_key`: For locally served models
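The `base_url`/`api_key` arguments follow the usual convention for locally served, OpenAI-compatible endpoints (the evaluation command below points at `http://localhost:6417/v1` with an `EMPTY` key). Assuming your server exposes that API and you have the `openai` Python client installed, a quick connectivity check might look like:

```python
# Sketch: confirm a locally served OpenAI-compatible endpoint is reachable
# before pointing the ProverGen scripts at it (URL and key are examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6417/v1", api_key="EMPTY")
print([model.id for model in client.models.list()])  # should list the served model
```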
Step 3: Generate complete FOL problems, with optional data augmentation.
python3 fol_problem_generator.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--filepath outputs/translated_data/hard-100-0_100.json \
--start 0 --end 100 \
--output_dir outputs/final_data \
--mode normal_generation
Parameters:
- `model_name`: LLM used for generation
- `filepath`: Path to the translated file produced in Step 2
- `start`/`end`: Index range to process
- `output_dir`: Directory to store the output file
- `mode`: Generation mode (`normal_generation`, `step_augment`, `uncertain_augment`)
- `base_url`/`api_key`: For locally served models
- `noise1`/`noise2`: Control different types of distractions
Evaluate LLM performance on the generated datasets:
python3 evaluation.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--dataset_name ProverQA --split easy \
--output_dir result/ \
--mode Direct \
--start 0 --end 500 \
--base_url http://localhost:6417/v1 --api_key EMPTY
Parameters:
- `model_name`: LLM to evaluate
- `dataset_name`: Dataset to evaluate (`ProverQA`, `FOLIO`, `ProntoQA`, `ProofWriter`)
- `split`: Subset (`easy`, `medium`, `hard` for ProverQA; `dev` for the others)
- `output_dir`: Directory to store the output file
- `mode`: Evaluation strategy (`Direct` or `CoT`)
- `start`/`end`: Index range for evaluation
- `base_url`/`api_key`: For locally served models
- `trained_model`: Path to a fine-tuned model (optional)
Compute metrics by running `metric.py` on the generated result files.
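For a quick accuracy check outside of `metric.py`, a minimal sketch such as the following can be used. It assumes the result file is a JSON list in which each record stores the model's predicted label and the gold label; the key names `prediction` and `answer` are assumptions, so match them to the actual output schema:

```python
# Minimal accuracy check over a result file (key names are assumptions).
import json
import sys

with open(sys.argv[1]) as f:
    results = json.load(f)

correct = sum(r.get("prediction") == r.get("answer") for r in results)
print(f"accuracy: {correct / len(results):.3f} ({correct}/{len(results)})")
```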
State-of-the-art models show significant room for improvement on ProverQA:
| Model | Easy | Medium | Hard |
|---|---|---|---|
| **Proprietary Models** | | | |
| GPT-4o (CoT) | 94.2% | 79.4% | 50.0% |
| Claude-3.5-Sonnet (CoT) | 95.2% | 83.6% | 56.4% |
| o1-preview-2024-09-12 | 89.8% | 78.8% | 66.2% |
| **Open Models** | | | |
| Llama3.1-70B (CoT) | 90.4% | 73.2% | 46.8% |
| Llama3.1-8B (CoT) | 75.6% | 46.6% | 33.6% |
| Mistral-Large (CoT) | 92.6% | 75.8% | 52.2% |
| DeepSeek-R1 | 91.8% | 78.4% | 66.6% |
Key observations:
- 📉 Challenging Hard Subset: Even top models struggle with 6-9 step reasoning
- 🧠 Reasoning Models Excel: o1-preview and DeepSeek-R1 show strong performance on complex problems
- 🔗 CoT Helps: Chain-of-thought prompting provides significant improvements
- 📊 Fine-grained Difficulty: Clear performance degradation across difficulty levels
If you use ProverGen or ProverQA in your research, please cite:
@inproceedings{qi2025large,
  title={Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation},
  author={Chengwen Qi and Ren Ma and Bowen Li and He Du and Binyuan Hui and Jinwang Wu and Yuanjun Laili and Conghui He},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=C25SgeXWjE}
}
- ✅ Open Source: Framework and dataset are publicly available
- 📚 Compliance: Dataset sources follow public repository licenses (MIT for names, WordNet for keywords)
- 🤖 No Human Data: No human participants involved in data collection
- 🛡️ Safety: Dataset contains no harmful or biased content
- 🎓 Academic Purpose: Designed for research in advancing logical reasoning capabilities
For questions, issues, or collaborations:
- Chengwen Qi: [email protected]
- Ren Ma: [email protected]
- Bowen Li: [email protected]
GitHub Issues: https://github.com/opendatalab/ProverGen/issues