# LLM Dataset Generator

A Python tool for automatically generating question-answer datasets from documents, using Ollama for local LLM inference.
This tool allows you to create custom training datasets for fine-tuning language models by:
- Extracting text content from documents (PDF, TXT)
- Generating relevant questions based on the document content
- Creating answers for each question using only information from the source document
- Formatting the results into various dataset templates
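
Under the hood this is a two-model loop: one call generates questions, a second answers each one against the source text. The sketch below shows the same idea with the `ollama` Python package; the prompts and helper names are illustrative, not the tool's actual internals.

```python
import ollama

GEN_MODEL = "qwen2.5:7b"  # question-generation model (--model-gen)
RET_MODEL = "qwen2.5:7b"  # answering model (--model-ret)

def generate_questions(document_text, n):
    # Ask the generator model for n questions grounded in the document.
    response = ollama.chat(
        model=GEN_MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} questions that can be answered using only the "
                f"following text, one per line:\n\n{document_text}"
            ),
        }],
        options={"temperature": 0.1},
    )
    lines = response["message"]["content"].splitlines()
    return [line.strip() for line in lines if line.strip()]

def answer_question(document_text, question):
    # Answer strictly from the source document; temperature 0 keeps it literal.
    response = ollama.chat(
        model=RET_MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Using only the text below, answer the question.\n\n"
                f"Text:\n{document_text}\n\nQuestion: {question}"
            ),
        }],
        options={"temperature": 0.0},
    )
    return response["message"]["content"]
```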
## Features

- 📄 Support for PDF and text files
- 🔍 Automatic question generation based on document content
- ✅ Answer generation strictly from document context
- 📊 Multiple export formats (default, gemma, llama)
- 🚀 Uses Ollama for local LLM inference
- 📝 Comprehensive logging
## Requirements

- Python 3.6+
- Ollama installed and running locally
- Required Python packages:
  - pymupdf (fitz)
  - pandas
  - ollama
- Standard library modules used (no installation needed):
  - argparse
  - pathlib
  - logging
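
For reference, the repository's `requirements.txt` amounts to the three third-party packages above (shown here unpinned; the actual file may pin versions):

```text
pymupdf
pandas
ollama
```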
## Installation

1. Clone this repository:

```bash
git clone https://github.com/gokhaneraslan/llm-dataset-generator.git
cd llm-dataset-generator
```
2. Install the required dependencies:

```bash
pip install -r requirements.txt
```
3. Make sure Ollama is installed and running:

```bash
# Install Ollama (if not already installed)
# See https://ollama.com/download for installation instructions

# Start the Ollama server
ollama serve
```
4. Pull the model you want to use (e.g., Qwen 2.5 7B):

```bash
ollama pull qwen2.5:7b
```
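
To confirm the model is available before running the generator, list the installed models:

```bash
ollama list
```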
## Usage

Basic usage:

```bash
python main.py --file path/to/document.pdf
```
Advanced options:

```bash
python main.py --file path/to/document.pdf \
    --questions 15 \
    --template llama \
    --model-gen qwen2.5:7b \
    --model-ret qwen2.5:7b \
    --gen-temp 0.2 \
    --ret-temp 0.0 \
    --output-dir ./training_data \
    --log-level INFO
```
## Command-Line Arguments

| Argument | Short | Description | Default |
|---|---|---|---|
| `--file` | `-f` | Path to document file (`.txt` or `.pdf`) | Required |
| `--questions` | `-q` | Number of questions to generate | 10 |
| `--template` | `-t` | Dataset template format (`default`, `gemma`, `llama`) | `default` |
| `--model-gen` | `-mg` | Ollama model for question generation | `qwen2.5:7b` |
| `--model-ret` | `-mr` | Ollama model for answering questions | `qwen2.5:7b` |
| `--gen-temp` | | Temperature for question generation | 0.1 |
| `--ret-temp` | | Temperature for answer generation | 0.0 |
| `--output-dir` | `-o` | Directory to save dataset files | `datasets` |
| `--log-level` | | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`) | `INFO` |
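
Because argument parsing uses argparse (see Requirements), the standard help flag prints this same reference from the command line:

```bash
python main.py --help
```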
## Examples

Generate 15 questions using the Gemma template:

```bash
python main.py --file document.pdf --questions 15 --template gemma
```
Use a different model with custom temperature settings:

```bash
python main.py --file document.txt --model-gen llama3:latest --gen-temp 0.2 --ret-temp 0.1
```
Save output to a specific directory:

```bash
python main.py --file research_paper.pdf --output-dir ./custom_dataset
```
## Output Formats

The script writes the dataset in one of three formats, selected with `--template`.

**Default template** (`default`) — a list of `input`/`output` pairs:

```json
[
  {
    "input": "Question 1?",
    "output": "Answer 1"
  },
  {
    "input": "Question 2?",
    "output": "Answer 2"
  }
]
```
**Gemma template** (`gemma`) — chat messages with `role`/`content` fields:

```json
[
  {
    "content": "Question 1?",
    "role": "user"
  },
  {
    "content": "Answer 1",
    "role": "assistant"
  },
  {
    "content": "Question 2?",
    "role": "user"
  },
  {
    "content": "Answer 2",
    "role": "assistant"
  }
]
```
**Llama template** (`llama`) — a single `conversations` object in ShareGPT style:

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "Question 1?"
    },
    {
      "from": "gpt",
      "value": "Answer 1"
    },
    {
      "from": "human",
      "value": "Question 2?"
    },
    {
      "from": "gpt",
      "value": "Answer 2"
    }
  ]
}
```
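
To spot-check a generated dataset, the default-template JSON loads directly into pandas (the path below is a placeholder; use whichever file the run wrote to your output directory):

```python
import pandas as pd

# "datasets/dataset.json" is a placeholder; substitute the actual output file.
df = pd.read_json("datasets/dataset.json")
print(df.head())                        # first few input/output pairs
print(len(df), "question-answer pairs")
```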
## Troubleshooting

**"Failed to connect to Ollama"**

- Make sure Ollama is installed and the server is running: `ollama serve`

**"Model not found in Ollama"**

- Pull the model first: `ollama pull model_name`

**PDF extraction issues**

- Ensure the PDF is not password-protected
- If the PDF has complex formatting, try converting it to text first
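
For that pre-conversion step, a minimal PyMuPDF snippet like the following writes a `.txt` you can pass to `--file` (this approximates the tool's internal extraction, which may differ):

```python
import fitz  # PyMuPDF (pip install pymupdf)

# Extract plain text from every page and save it alongside the PDF.
doc = fitz.open("document.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()

with open("document.txt", "w", encoding="utf-8") as f:
    f.write(text)
```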
## Logging

The script logs to both the console and a file named `llm_dataset_generator.log` in the `logs` directory.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.