Skill Detail

Tesseract OCR Document Extractor

Extracts structured text from scanned documents and images using Tesseract OCR with custom LSTM training data. Supports table detection via OpenCV contour analysis and PDF/A output generation.


Data Extraction & Transformation ChatGPT Agents Security Reviewed

Tool match: pandas ⭐ 73.6k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill tesseract-ocr-document-extractor

Works best when you want a reusable capability, not another fragile one-off prompt.


At a glance

Author

Tesseract OCR

Last updated

Mar 24, 2026

Quick brief

The Tesseract OCR Document Extractor processes scanned documents, photographs of text, and faxed PDFs using the Tesseract 5.x LSTM engine with configurable page segmentation modes (PSM). It applies OpenCV preprocessing including adaptive thresholding, deskewing, and noise reduction to maximize recognition accuracy on low-quality inputs.
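As a rough illustration of the adaptive thresholding step mentioned above, the sketch below binarizes a grayscale image by comparing each pixel with the mean of its local neighborhood. It is written in plain Python so that it runs without dependencies. The function name adaptive_threshold, the parameters block and c, and the sample image are all hypothetical and are not taken from the skill's source. OpenCV's cv2.adaptiveThreshold performs a comparable operation much faster, though it treats image borders differently from this sketch.

```python
def adaptive_threshold(gray, block=3, c=2):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes 255 when it is brighter than the mean of its
    block x block neighborhood minus c, and 0 otherwise.
    Assumption: windows are clipped at the image border.
    """
    h, w = len(gray), len(gray[0])
    r = block // 2

    # Integral image: ii[y][x] holds the sum of gray[0:y][0:x],
    # so any window sum costs four lookups.
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum

    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(255 if gray[y][x] > mean - c else 0)
        out.append(row)
    return out


# Made-up 3x5 "scan": a light background with one dark ink pixel.
page = [
    [200, 200, 200, 200, 200],
    [200, 200, 50, 200, 200],
    [200, 200, 200, 200, 200],
]
binary = adaptive_threshold(page)
```

Because the threshold is computed per neighborhood rather than once for the whole page, uneven lighting in a photographed document is less likely to wipe out faint text.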

How it works

What this skill actually does

Table extraction uses OpenCV contour detection and Hough line transforms to identify cell boundaries, mapping recognized text to structured row/column positions. The output supports JSON, CSV, and pandas DataFrame formats for immediate downstream processing.
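The mapping from recognized text to row/column positions can be sketched as follows, assuming the cell boundaries have already been found and reduced to sorted lists of horizontal and vertical line coordinates. The function words_to_grid, its input layout, and the sample data are hypothetical; the skill's real data structures are not shown in this listing. Each word is assigned to the cell that contains the center of its bounding box.

```python
from bisect import bisect_right


def words_to_grid(words, row_lines, col_lines):
    """Place OCR words into a table grid.

    words: list of (text, left, top, width, height) boxes.
    row_lines / col_lines: sorted y / x coordinates of the table's
    boundary lines (assumed to be detected beforehand).
    Returns a list of rows, each a list of cell strings.
    """
    n_rows, n_cols = len(row_lines) - 1, len(col_lines) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

    # Simplification: read order is approximated by sorting on (top, left).
    for text, x, y, w, h in sorted(words, key=lambda t: (t[2], t[1])):
        cx, cy = x + w / 2, y + h / 2
        r = bisect_right(row_lines, cy) - 1
        c = bisect_right(col_lines, cx) - 1
        # Words whose center falls outside the table are ignored.
        if 0 <= r < n_rows and 0 <= c < n_cols:
            grid[r][c].append(text)

    return [[" ".join(cell) for cell in row] for row in grid]


# Made-up example: a 2x2 table with boundary lines at these pixel offsets.
row_lines = [0, 50, 100]
col_lines = [0, 100, 200]
words = [
    ("Qty", 120, 10, 30, 20),
    ("Item", 10, 10, 40, 20),
    ("Bolt", 10, 60, 40, 20),
    ("12", 120, 60, 20, 20),
    ("M8", 55, 60, 20, 20),
]
table = words_to_grid(words, row_lines, col_lines)
# table == [["Item", "Qty"], ["Bolt M8", "12"]]
```

A list of rows like this can be written out with the standard csv module or passed to pandas.DataFrame, which matches the output formats named above.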

Custom training data can be generated using tesstrain makefiles for domain-specific fonts and vocabularies (medical forms, legal documents, engineering drawings). The skill handles multi-language documents with automatic script detection, supports right-to-left languages including Arabic and Hebrew, and generates PDF/A-compliant output with embedded text layers for archival use. Per-word confidence scores enable automated flagging of low-certainty extractions for human review.
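One way the confidence-based flagging could work is sketched below, assuming the words and scores arrive in Tesseract's TSV output format, where each row carries a conf value and non-word rows use -1. The function flag_low_confidence, the threshold of 60, and the sample rows are illustrative assumptions and do not come from the skill itself.

```python
import csv
import io


def flag_low_confidence(tsv_text, threshold=60.0):
    """Return (text, conf) pairs for recognized words scoring below threshold."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block, and line rows have conf == -1 and no text; skip them.
        if conf < 0 or not text:
            continue
        if conf < threshold:
            flagged.append((text, conf))
    return flagged


# Made-up sample using the TSV column layout.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]
ROWS = [
    ["4", "1", "1", "1", "1", "0", "10", "10", "200", "20", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "10", "60", "20", "96.4", "Invoice"],
    ["5", "1", "1", "1", "1", "2", "80", "10", "50", "20", "41.7", "N0."],
    ["5", "1", "1", "1", "1", "3", "140", "10", "70", "20", "88.0", "10423"],
]
sample_tsv = "\n".join("\t".join(r) for r in [HEADER] + ROWS)

needs_review = flag_low_confidence(sample_tsv)
# needs_review == [("N0.", 41.7)]
```

The threshold is a tuning choice: a lower value sends fewer words to human review but lets more misreads through.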

Best fit

When to reach for it

Best when the job is a data extraction and transformation task, such as turning scanned documents or images into structured text and tables.
Works naturally with ChatGPT Agents setups.

Trust & provenance

Why this listing is credible

Built around the pandas toolchain.
Trust status: Security Reviewed.
73.6k GitHub stars on the matched upstream tool (pandas).
Last updated Mar 24, 2026.
