MarkItDown-OCR-LocalRAG

A modular, local-first Python toolkit that extracts text from documents (PDFs, images, scanned files), falls back to OCR when needed, reorganizes the result, and answers questions about it. It combines MarkItDown, Tesseract OCR, and local RAG techniques with Ollama LLMs, so no cloud services or API keys are required and all processing stays on your machine.

📥 Installation Requirements

Before using the project, ensure you have the following installed:

1. Python Libraries

pip install -r requirements.txt

2. Ollama (for Local Language Models)

Download and install Ollama on your system. Make sure you pull a local model like gemma:2b or gemma3:12b:

ollama pull gemma:2b

The recommended model is gemma3:12b.

3. Tesseract OCR Engine

You must install Tesseract separately because pytesseract is only a Python wrapper.

Windows:

- Download from: https://github.com/tesseract-ocr/tesseract
- Add the Tesseract installation path (e.g., C:\Program Files\Tesseract-OCR) to your system PATH.

Linux (Ubuntu):

sudo apt update
sudo apt install tesseract-ocr

Mac:

brew install tesseract
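Because Tesseract and Ollama are installed outside of pip, it can help to confirm that both are reachable before running the project. The following is a minimal sketch using only the standard library; the helper name `find_missing_tools` is hypothetical and not part of this repository.

```python
import shutil


def find_missing_tools(tools=("tesseract", "ollama")):
    """Return the names of command-line tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("Tesseract and Ollama were both found on PATH.")
```

If either tool is reported missing on Windows, re-check the PATH entry described above and open a new terminal.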

🚀 How It Works

The project operates in four phases:

1. Extract (Structured Extraction)

Tries to extract structured content from documents using MarkItDown, producing clean Markdown text when possible.

2. OCR (Fallback Strategy)

If MarkItDown fails (e.g., on a scanned PDF or image), the pipeline automatically switches to Tesseract OCR to extract readable text.
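The extract-then-fallback idea from phases 1 and 2 can be sketched roughly as follows. This is an illustration, not the code in src/convert_file.py: the function names and the `min_chars` threshold for deciding that extraction "failed" are assumptions, and the two extractors are imported lazily so the fallback logic can be read and tested on its own.

```python
def markitdown_extract(path):
    """Structured extraction with MarkItDown (imported only when called)."""
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content


def tesseract_extract(path):
    """OCR an image file with pytesseract (imported only when called)."""
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path))


def extract_with_fallback(path, primary=markitdown_extract,
                          fallback=tesseract_extract, min_chars=20):
    """Try the primary extractor; use the fallback if it errors or returns
    almost no text. min_chars is an assumed threshold, not a project setting."""
    try:
        text = primary(path) or ""
    except Exception:
        text = ""
    if len(text.strip()) >= min_chars:
        return text, "primary"
    return fallback(path), "fallback"
```

Note that `tesseract_extract` as written handles image files only; a scanned PDF would first need its pages rendered to images, which this sketch leaves out.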

3. Reorganize (Content Structuring)

Uses a local Ollama model (e.g., gemma3:12b) to reorganize the raw extracted text into highly readable, structured, and corrected Markdown format.
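As a rough illustration of this phase, the sketch below sends raw text to a locally running Ollama server over its HTTP API and returns the model's reply. The prompt wording and the function names are assumptions made for this example; the project's actual prompt may differ. It assumes Ollama is listening on its default local address.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_reorganize_payload(raw_text, model="gemma3:12b"):
    """Build the request body; the prompt text here is illustrative only."""
    prompt = (
        "Reorganize the following extracted text into clean, well-structured "
        "Markdown. Fix obvious OCR errors, but do not add new information.\n\n"
        + raw_text
    )
    # stream=False asks Ollama for one JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}


def reorganize(raw_text, model="gemma3:12b", url=OLLAMA_URL):
    """Send the text to the local Ollama server and return its response."""
    data = json.dumps(build_reorganize_payload(raw_text, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `reorganize(raw_text)` requires the Ollama server to be running and the chosen model to have been pulled beforehand.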

4. RAG (Local Question Answering)

- Embeds the organized text with HuggingFace sentence-transformers.
- Stores it in a FAISS vector database.
- Creates a RetrievalQA chain that allows you to ask contextual questions about the content directly from your machine, fully offline.
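To make the retrieval step concrete, here is a dependency-free toy version of "embed, store, retrieve". It deliberately swaps in word-count vectors for sentence-transformers embeddings and a linear scan for the FAISS index, so it only illustrates the idea of ranking chunks by cosine similarity to the question; all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (linear scan)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In the real pipeline, the retrieved chunks are passed to the LLM as context for answering the question; dense embeddings also match paraphrases that simple word overlap would miss.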

🧩 Project Structure

markitdown-ocr-localrag/
├── src/
│   ├── __init__.py
│   ├── convert_file.py        # Extraction + OCR fallback
│   ├── rag_chat.py            # Local RAG system
│   └── markitdown_ocr_rag.py  # Main orchestrator class
├── main.py                    # CLI entry point
├── requirements.txt           # Required packages
└── README.md                  # This file

⚡ How to Use

1. Clone and Setup

git clone https://github.com/AhmedZeyadTareq/markitdown-ocr-localrag.git
cd markitdown-ocr-localrag
python -m venv venv
source venv/bin/activate  # Linux/Mac
.\venv\Scripts\activate   # Windows
pip install -r requirements.txt

Make sure Tesseract OCR and Ollama are properly installed.

2. Run the Project

python main.py

You will be prompted to:

- Enter the path to your document.
- Enter your question about the document.

The system will automatically:

Extract ➔ OCR if needed ➔ Reorganize ➔ Build RAG ➔ Answer your question.

3. Example Usage in Python

from src.markitdown_ocr_rag import MarkitdownOCRLocalRAG
from src.rag_chat import start_rag_chat

pipeline = MarkitdownOCRLocalRAG()

organized_md = pipeline.extract_and_reorganize("example.pdf")

qa_chain = start_rag_chat(organized_md, pipeline.ollama_model)
answer = qa_chain.invoke({"query": "Summarize the key points."})["result"]
print(answer)
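Since the chain is built once per document, it can be reused for several questions. The helper below is a small sketch of that pattern; `ask_all` is a hypothetical name, and it assumes only what the example above shows, namely that the chain exposes `invoke({"query": ...})` and returns a dict with a `"result"` key.

```python
def ask_all(qa_chain, questions):
    """Run each question through the chain and map question -> answer text."""
    return {q: qa_chain.invoke({"query": q})["result"] for q in questions}
```

For example, `ask_all(qa_chain, ["Summarize the key points.", "Who is the author?"])` would return a dictionary with one answer per question.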

🔥 Why This Project is Powerful

- Multi-Strategy Extraction: Tries structured extraction first and falls back to OCR, so text extraction is always best-effort.
- Local First: No cloud services or API keys; once the models are downloaded, the pipeline runs offline.
- Privacy: All processing stays on your machine.
- Accuracy: Relies on pretrained HuggingFace sentence-transformers for embeddings and Ollama models for generation.
- Modular Design: Components are plug-and-play and easy to extend.

👨‍💻 Developed By

Ahmed Zeyad Tareq

📌 Data Scientist & AI Developer | 🎓 Master of AI Engineering

📞 WhatsApp: +905533333587
GitHub | LinkedIn | Kaggle

📄 License

🌟 Support

If you like this project, give it a ⭐ on GitHub and share!
Got ideas for improvements? Feel free to open a Pull Request or create an Issue. 🚀
