ProverGen is a novel framework that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers to create scalable, diverse, and high-quality First-Order Logic (FOL) reasoning datasets.
Using this framework, we have created ProverQA, a challenging FOL reasoning benchmark consisting of 1,500 evaluation instances and 5,000 training instances across three difficulty levels.
- 🎯 Scalable Generation: Automated framework enabling expansion with minimal manual intervention
- 🌍 Natural & Diverse Language: Wide range of natural language expressions reflecting real-world linguistic variability
- 🔬 Symbolic Validation: Formal symbolic structures validated through Prover9 symbolic prover
- 🔗 Faithful Reasoning Chains: Transparent intermediate reasoning steps in both symbolic and natural language formats
- 📊 Complete FOL Coverage: All seven First-Order Logic relationships (∧, ∨, ¬, →, ⊕, ∀, ∃)
- ProverGen: A Framework for First-Order Logic Reasoning Dataset Generation
conda create -n provergen python=3.8
conda activate provergen
git clone https://github.com/opendatalab/ProverGen
cd ./ProverGen
pip install -r requirements.txt
The ProverGen framework leverages Prover9 for logic skeleton generation. Download it from the Prover9 website (select `LADR-2009-11A.tar.gz`) and follow the installation guide.

Linux systems: Since Prover9 is outdated, `make all` may fail. Open `LADR-2009-11A/provers.src/Makefile` and move every `-lm` flag to the end of its line:
# Before
$(CC) $(CFLAGS) -lm -o newsax newsax.o $(OBJECTS) ../ladr/libladr.a
# After
$(CC) $(CFLAGS) -o newsax newsax.o $(OBJECTS) ../ladr/libladr.a -lm
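If you prefer not to edit the Makefile by hand, a small Python sketch like the one below applies the same change (it assumes the archive was unpacked in the current directory; back up the Makefile first):

```python
# Move every "-lm" flag to the end of its linker line in provers.src/Makefile.
from pathlib import Path

makefile = Path("LADR-2009-11A/provers.src/Makefile")
fixed = []
for line in makefile.read_text().splitlines():
    if "$(CC)" in line and " -lm " in line:
        # Remove the mid-line flag and re-append it after the libraries.
        line = line.replace(" -lm", "") + " -lm"
    fixed.append(line)
makefile.write_text("\n".join(fixed) + "\n")
```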
macOS systems: You may encounter implicit declaration errors. Fix them as follows:
- In `LADR-2009-11A/mace4.src/select.c`, line 236:
// Before
int select_concentric_band(min_id, max_id, max_constrained)
// After
int select_concentric_band(int min_id, int max_id, int max_constrained)
- In `LADR-2009-11A/mace4.src/msearch.c`, line 850:
// Before
int next_domain_size(n)
// After
int next_domain_size(int n)
After installation, update the binary locations in `logic_skeleton_generator.py` at line 20.
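The exact assignment depends on the file's current contents; the snippet below only illustrates the shape of the change, with placeholder variable names and paths rather than the real identifiers at line 20:

```python
# Placeholder example only -- use the actual variable names found at line 20
# of logic_skeleton_generator.py and your own installation paths.
PROVER9_BINARY = "/path/to/LADR-2009-11A/bin/prover9"
MACE4_BINARY = "/path/to/LADR-2009-11A/bin/mace4"
```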
The ProverQA dataset created using this framework is available on:
- 🤗 Hugging Face: opendatalab/ProverQA
- 🌐 OpenDataLab: ProverQA
Dataset Structure:
- Development Set: 1,500 instances (500 easy, 500 medium, 500 hard)
- Training Set: 5,000 instances with data augmentation
- Difficulty Levels: Easy (1-2 steps), Medium (3-5 steps), Hard (6-9 steps)
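For programmatic access, the dataset can be loaded with the Hugging Face `datasets` library. The sketch below assumes the difficulty subsets are exposed under the names `easy`, `medium`, and `hard`; check the dataset card for the exact configuration and split names:

```python
# Sketch: load and inspect ProverQA from the Hugging Face Hub.
# The subset name "easy" is an assumption; see the dataset card for exact names.
from datasets import load_dataset

easy = load_dataset("opendatalab/ProverQA", "easy")
print(easy)                           # available splits and their sizes
print(next(iter(easy.values()))[0])   # peek at one example
```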
The generation process consists of three main steps:
Step 1: Generate the logical structure (logic skeleton) using the symbolic prover with a novel top-down approach.
python3 logic_skeleton_generator.py --mode easy --num 500 --output_dir outputs/logic_data
Parameters:
- `mode`: Difficulty level (`easy`, `medium`, `hard`)
- `num`: Number of logic skeletons to generate
- `output_dir`: Output directory
The script also allows customization of the distribution of answers ([True, False, Uncertain]) and the proportion of composite conclusions.
Here are the relevant parameters:
- `goal_value_probs`: Distribution over [True, False, Uncertain] (e.g., [0.4, 0.3, 0.3])
- `rule_candidate_path`: Path to the rule pool file
- `rule_as_goal_proportion`: Proportion of fact vs. rule conclusions (e.g., [0.7, 0.3])
- `fact_num_threshold`: Once the fact pool exceeds this size, there is a chance the fact is provided directly
- `fact_num_prob`: Probability of directly providing a fact
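After generating a batch, it can be worth sanity-checking that the empirical answer distribution roughly matches `goal_value_probs`. The sketch below assumes the skeletons are written as JSON and store their label under a key named `answer`; adjust the glob pattern and key to the generator's actual output schema:

```python
# Count answer labels across generated logic skeletons.
# The output path and the "answer" key are assumptions about the file schema.
import collections
import glob
import json

counts = collections.Counter()
for path in glob.glob("outputs/logic_data/*.json"):
    with open(path) as f:
        data = json.load(f)
    for record in data if isinstance(data, list) else [data]:
        counts[record.get("answer", "missing")] += 1

print(counts)
```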
Step 2: Convert the logic expressions into natural language using LLMs.
python3 logic_skeleton_translator.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--data_dir outputs/logic_data \
--num 100 --start 0 --end 100 \
--output_dir outputs/translated_data \
--mode hard \
--base_url localhost:6417 --api_key EMPTY
Parameters:
- `model_name`: LLM used for translation
- `data_dir`: Path to the logic skeleton files produced in Step 1
- `num`: Total number of logic skeletons
- `start`/`end`: Index range to process
- `output_dir`: Directory to store the output file
- `mode`: Difficulty level
- `base_url`/`api_key`: For locally served models
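The `base_url`/`api_key` arguments follow the usual convention for locally served, OpenAI-compatible endpoints (the evaluation command below points at `http://localhost:6417/v1` with an `EMPTY` key). Assuming your server exposes that API and you have the `openai` Python client installed, a quick connectivity check might look like:

```python
# Sketch: confirm a locally served OpenAI-compatible endpoint is reachable
# before pointing the ProverGen scripts at it (URL and key are examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6417/v1", api_key="EMPTY")
print([model.id for model in client.models.list()])  # should list the served model
```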
Step 3: Generate complete FOL problems, with optional data augmentation.
python3 fol_problem_generator.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--filepath outputs/translated_data/hard-100-0_100.json \
--start 0 --end 100 \
--output_dir outputs/final_data \
--mode normal_generation
Parameters:
- `model_name`: LLM used for generation
- `filepath`: Path to the translated file produced in Step 2
- `start`/`end`: Index range to process
- `output_dir`: Directory to store the output file
- `mode`: Generation mode (`normal_generation`, `step_augment`, `uncertain_augment`)
- `base_url`/`api_key`: For locally served models
- `noise1`/`noise2`: Control different types of distractions
Evaluate LLM performance on the generated datasets:
python3 evaluation.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--dataset_name ProverQA --split easy \
--output_dir result/ \
--mode Direct \
--start 0 --end 500 \
--base_url http://localhost:6417/v1 --api_key EMPTY
Parameters:
- `model_name`: LLM to evaluate
- `dataset_name`: Dataset to evaluate (`ProverQA`, `FOLIO`, `ProntoQA`, `ProofWriter`)
- `split`: Subset (`easy`, `medium`, `hard` for ProverQA; `dev` for the others)
- `output_dir`: Directory to store the output file
- `mode`: Evaluation strategy (`Direct` or `CoT`)
- `start`/`end`: Index range for evaluation
- `base_url`/`api_key`: For locally served models
- `trained_model`: Path to a fine-tuned model (optional)
Compute metrics by running `metric.py` on the generated result files.
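For a quick accuracy check outside of `metric.py`, a minimal sketch such as the following can be used. It assumes the result file is a JSON list in which each record stores the model's predicted label and the gold label; the key names `prediction` and `answer` are assumptions, so match them to the actual output schema:

```python
# Minimal accuracy check over a result file (key names are assumptions).
import json
import sys

with open(sys.argv[1]) as f:
    results = json.load(f)

correct = sum(r.get("prediction") == r.get("answer") for r in results)
print(f"accuracy: {correct / len(results):.3f} ({correct}/{len(results)})")
```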
State-of-the-art models show significant room for improvement on ProverQA:
| Model | Easy | Medium | Hard |
|---|---|---|---|
| **Proprietary Models** | | | |
| GPT-4o (CoT) | 94.2% | 79.4% | 50.0% |
| Claude-3.5-Sonnet (CoT) | 95.2% | 83.6% | 56.4% |
| o1-preview-2024-09-12 | 89.8% | 78.8% | 66.2% |
| **Open Models** | | | |
| Llama3.1-70B (CoT) | 90.4% | 73.2% | 46.8% |
| Llama3.1-8B (CoT) | 75.6% | 46.6% | 33.6% |
| Mistral-Large (CoT) | 92.6% | 75.8% | 52.2% |
| DeepSeek-R1 | 91.8% | 78.4% | 66.6% |
Key observations:
- 📉 Challenging Hard Subset: Even top models struggle with 6-9 step reasoning
- 🧠 Reasoning Models Excel: o1-preview and DeepSeek-R1 show strong performance on complex problems
- 🔗 CoT Helps: Chain-of-thought prompting provides significant improvements
- 📊 Fine-grained Difficulty: Clear performance degradation across difficulty levels
If you use ProverGen or ProverQA in your research, please cite:
@inproceedings{qi2025large,
  title={Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation},
  author={Chengwen Qi and Ren Ma and Bowen Li and He Du and Binyuan Hui and Jinwang Wu and Yuanjun Laili and Conghui He},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=C25SgeXWjE}
}
- ✅ Open Source: Framework and dataset are publicly available
- 📚 Compliance: Dataset sources follow public repository licenses (MIT for names, WordNet for keywords)
- 🤖 No Human Data: No human participants involved in data collection
- 🛡️ Safety: Dataset contains no harmful or biased content
- 🎓 Academic Purpose: Designed for research in advancing logical reasoning capabilities
For questions, issues, or collaborations:
- Chengwen Qi: [email protected]
- Ren Ma: [email protected]
- Bowen Li: [email protected]
GitHub Issues: https://github.com/opendatalab/ProverGen/issues