Tesseract OCR Data Extractor

Extracts structured data from scanned documents using Tesseract OCR engine with LSTM models. Supports table detection via OpenCV contour analysis and outputs to CSV, JSON, or Pandas DataFrames.

Data Extraction & Transformation, Gemini, Security Reviewed

Tool match: pandas ⭐ 73.6k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill tesseract-ocr-data-extractor

Works best when you want a reusable capability, not another fragile one-off prompt.

View source Documentation

At a glance

Tools required

Tesseract OCR, OpenCV

Install & setup

sudo apt install tesseract-ocr

Author

Tesseract OCR

Last updated

Mar 24, 2026

Quick brief

The Tesseract OCR Data Extractor combines the Tesseract 5 LSTM OCR engine with OpenCV preprocessing for high-accuracy text extraction from scanned documents, receipts, invoices, and forms. It handles multi-language recognition with traineddata models for over 100 languages.

How it works

What this skill actually does

The image preprocessing pipeline includes deskewing via the Hough line transform, binarization with adaptive thresholding or Otsu's method, noise reduction using morphological operations, and resolution upscaling with Lanczos interpolation. These steps typically improve recognition accuracy on low-quality scans.

Table detection uses OpenCV contour analysis to identify grid structures. Horizontal and vertical ruling lines are isolated with morphological operations using kernels built by cv2.getStructuringElement, and the cell boundaries are recovered from the resulting grid. Extracted table data keeps its row-column relationships and exports to CSV, JSON, or Pandas DataFrames for downstream analysis.

The skill supports PDF input via pdf2image conversion with Poppler, batch processing of multi-page documents, and confidence scoring per recognized word. Custom vocabulary lists and user patterns improve domain-specific accuracy for medical, legal, and financial documents. Output includes bounding box coordinates for each text region, enabling overlay visualization.
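As a rough illustration of per-word confidence and bounding boxes, the sketch below filters the dictionary that pytesseract.image_to_data returns when called with output_type=pytesseract.Output.DICT. It assumes the pytesseract and pdf2image packages; the helper names filter_words and ocr_pdf and the 60.0 cutoff are hypothetical, not part of the skill.

```python
from typing import Any, Dict, List


def filter_words(data: Dict[str, list], min_conf: float = 60.0) -> List[Dict[str, Any]]:
    """Keep recognized words at or above min_conf, with their boxes."""
    words = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as int, float, or str; non-word rows carry -1.
        conf = float(data["conf"][i])
        if not text.strip() or conf < min_conf:
            continue
        bbox = (data["left"][i], data["top"][i],
                data["width"][i], data["height"][i])
        words.append({"text": text, "conf": conf, "bbox": bbox})
    return words


def ocr_pdf(path: str, dpi: int = 300, min_conf: float = 60.0):
    """Return one list of filtered words per PDF page."""
    # Imported lazily so filter_words works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path  # requires Poppler

    pages = []
    for page in convert_from_path(path, dpi=dpi):
        data = pytesseract.image_to_data(
            page, output_type=pytesseract.Output.DICT)
        pages.append(filter_words(data, min_conf))
    return pages
```

The bbox tuples are (left, top, width, height) in page pixels, so they can be drawn back onto the rendered page for an overlay.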

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Gemini setups.
Requires Tesseract OCR and OpenCV.
Installation is straightforward: sudo apt install tesseract-ocr

Trust & provenance

Why this listing is credible

Built around the pandas toolchain.
Trust status: Security Reviewed.
73.6k GitHub stars on the linked upstream source.
Last updated Mar 24, 2026.
