
Commit 8c7f1a2

add PyTesseract

1 parent 96db272

File tree

2 files changed: +126 -0 lines changed

Toolbox/Libraries/PyTesseract.qmd

Lines changed: 126 additions & 0 deletions

@@ -0,0 +1,126 @@
---
title: "Pytesseract: OCR with Tesseract in Python"
author:
  - name: Chris Endemann
date: 2025-04-05
date-format: long
image: "../../../images/PyTesseract.jpeg"
categories:
  - Libraries
  - OCR
  - NLP
  - Computer vision
  - Text extraction
  - Multilingual
  - LSTM
---
## About this resource

[Pytesseract](https://pypi.org/project/pytesseract/) is a Python wrapper for [Google’s Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract), used for recognizing and extracting text from images. It works on a wide range of image formats (e.g., JPEG, PNG, TIFF) and supports over 100 languages, including Chinese, Arabic, and Devanagari.

Tesseract uses a character-level LSTM model and runs entirely on CPU, making it easy to deploy in low-resource environments. While it’s not state-of-the-art for complex layouts or scene text, it’s fast, scriptable, and widely supported — ideal for lightweight OCR use cases.
## Key features

- Reads printed text from standard image formats
- Works with file paths, Pillow/PIL (Python Imaging Library) images, or OpenCV arrays
- Supports multilingual text recognition
- Outputs plain text, bounding boxes, searchable PDF, TSV, and hOCR/ALTO XML formats
- Fast CPU-based inference with no GPU dependencies
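The TSV output mentioned above is convenient to post-process without extra dependencies. As a minimal sketch, here is how word-level rows could be filtered by confidence; the TSV snippet below is illustrative and merely shaped like what `pytesseract.image_to_data()` returns (a real pipeline would call that function on an image instead):

```python
# Illustrative TSV snippet shaped like pytesseract.image_to_data() output;
# in real use this string would come from
#   pytesseract.image_to_data(Image.open("example.png")).
# Note: Tesseract reports conf = -1 for non-word structural rows.
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t80\t20\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t100\t12\t90\t20\t42\tw0rld"
)

def words_above_confidence(tsv: str, min_conf: float = 60.0) -> list[str]:
    """Keep only words recognized with confidence >= min_conf."""
    header, *rows = tsv.splitlines()
    cols = header.split("\t")
    kept = []
    for row in rows:
        rec = dict(zip(cols, row.split("\t")))
        if float(rec["conf"]) >= min_conf and rec["text"].strip():
            kept.append(rec["text"])
    return kept

print(words_above_confidence(sample_tsv))  # → ['Hello']
```

Filtering on `conf` like this is a common way to drop garbage tokens before downstream NLP steps.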
## Model architecture

Tesseract relies on an LSTM pipeline trained on character-level text. It performs well when the input is clean and straightforward — such as scanned documents or forms — but struggles with visual ambiguity, clutter, or layout-sensitive content.
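Because Tesseract works best on clean input, simple image cleanup often pays off before OCR. A minimal Pillow-only sketch (grayscale, 2x upscale, binarize); the cleaned image would then be handed to `pytesseract.image_to_string`, and the threshold value is an illustrative starting point, not a universal setting:

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img: Image.Image, threshold: int = 150) -> Image.Image:
    """Grayscale, upscale 2x, and binarize -- typical cleanup before Tesseract."""
    gray = ImageOps.grayscale(img)
    big = gray.resize((gray.width * 2, gray.height * 2), Image.LANCZOS)
    # Map pixels above the threshold to white, the rest to black
    return big.point(lambda p: 255 if p > threshold else 0)

# Demo on a synthetic blank image; with a real scan you would pass the
# result to pytesseract.image_to_string(preprocess_for_ocr(img)).
demo = Image.new("RGB", (100, 40), "white")
clean = preprocess_for_ocr(demo)
print(clean.size)  # → (200, 80)
```

Upscaling helps because Tesseract’s accuracy drops on small glyphs, and binarization removes background noise that the LSTM was not trained on.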
For more robust use cases, newer models like [TrOCR](https://huggingface.co/microsoft/trocr-base-stage1), [Donut](https://github.com/clovaai/donut), and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) use **Vision Transformers (ViTs)**. PaddleOCR in particular includes both CNN- and transformer-based backends. These models are better suited for tasks where text is visually entangled with surrounding context — like reading overlaid labels on maps or structured forms.
## Installation and usage

To use pytesseract, you need to install both the Tesseract OCR engine and the Python wrapper.
### Ubuntu / Debian

```bash
sudo apt update
sudo apt install tesseract-ocr
pip install pytesseract
```
### macOS

```bash
brew install tesseract
pip install pytesseract
```
### Windows

1. Download and install the Tesseract binary from the [UB Mannheim builds](https://github.com/UB-Mannheim/tesseract/wiki)
2. Note the install location, typically:

   ```
   C:\Program Files\Tesseract-OCR\tesseract.exe
   ```

3. Either add this location to your system PATH, or set it manually in your script:

   ```python
   import pytesseract
   pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
   ```

4. Install the Python wrapper:

   ```bash
   pip install pytesseract
   ```
79+
80+
### Basic usage
81+
82+
```python
83+
from PIL import Image # Pillow is the Python Imaging Library
84+
import pytesseract
85+
86+
# Extract plain text
87+
text = pytesseract.image_to_string(Image.open("example.png"))
88+
89+
# Structured output with positions and confidences
90+
df = pytesseract.image_to_data(Image.open("example.png"), output_type=pytesseract.Output.DATAFRAME)
91+
92+
# Character-level bounding boxes
93+
boxes = pytesseract.image_to_boxes(Image.open("example.png"))
94+
```
95+
96+
Replace `"example.png"` with your own image file containing text. Pytesseract supports both in-memory images and file paths.
97+
98+
## Pros and limitations
99+
100+
| Pros | Limitations |
101+
|------|-------------|
102+
| Easy to install and use on most systems | No GPU acceleration — slower on large datasets |
103+
| Multilingual out of the box | Cannot be fine-tuned or retrained |
104+
| Good for simple forms and documents | Struggles with complex layouts or visual context |
105+
| CPU-only — works in low-resource environments | Lower accuracy than transformer-based models on cluttered or noisy inputs |
106+
107+
Tesseract’s fast CPU performance and no-frills setup make it great for small-scale OCR, but it’s not optimized for high-volume pipelines or scene text recognition.
## When to use

- You need fast OCR on clean documents or small image batches
- You want to automate extraction from scanned forms, labels, or tables
- You’re working in a CPU-only or resource-constrained environment
- You want a scriptable fallback tool before reaching for ViT-based OCR
## See also

- [GitHub repo: madmaze/pytesseract](https://github.com/madmaze/pytesseract) – Source code and examples
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) – End-to-end OCR with detection, recognition, and layout modeling (CNN and ViT backends)
- [TrOCR](https://huggingface.co/microsoft/trocr-base-stage1) – Transformer-based OCR with multilingual support
- [Donut](https://github.com/clovaai/donut) – OCR + document understanding via vision-language modeling
- [EasyOCR](https://github.com/JaidedAI/EasyOCR) – Lightweight OCR tool with CNN + LSTM backends

## Questions?

Working on OCR for maps, handwritten notes, or multilingual scans? Curious whether Tesseract is the right fit for your pipeline? Post in the [Nexus Q&A](https://github.com/UW-Madison-DataScience/ML-X-Nexus/discussions/categories/q-a) to share examples or get advice.

images/PyTesseract.jpeg

19.3 KB