Skill Detail

Tesseract OCR Data Extractor

Extracts structured data from scanned documents using the Tesseract OCR engine with LSTM models. Supports table detection via OpenCV contour analysis and outputs to CSV, JSON, or Pandas DataFrames.

Data Extraction & Transformation | Gemini | Security Reviewed
Tool match: pandas ⭐ 73.6k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill tesseract-ocr-data-extractor
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required: Tesseract OCR, OpenCV
Install & setup: sudo apt install tesseract-ocr
Author: Tesseract OCR
Last updated: Mar 24, 2026
Quick brief

The Tesseract OCR Data Extractor combines the Tesseract 5 LSTM OCR engine with OpenCV preprocessing to extract text from scanned documents, receipts, invoices, and forms. It recognizes multiple languages using Tesseract's traineddata models, which are available for over 100 languages.

How it works

What this skill actually does

The image preprocessing pipeline includes deskewing via the Hough line transform, automatic thresholding with Otsu's method, noise reduction using morphological operations, and resolution upscaling with Lanczos interpolation. These steps typically improve recognition accuracy on low-quality scans.

Table detection uses OpenCV contour analysis to identify grid structures, locating cell boundaries through horizontal and vertical line detection with kernels built by cv2.getStructuringElement. Extracted table data keeps its row-column relationships and can be exported to CSV, JSON, or Pandas DataFrames for downstream analysis.

The skill supports PDF input via pdf2image conversion with Poppler, batch processing of multi-page documents, and a confidence score for each recognized word. Custom vocabulary lists and user patterns improve domain-specific accuracy for medical, legal, and financial documents. Output includes bounding box coordinates for each text region, which enables overlay visualization.
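
To show how per-word confidence and bounding boxes might be consumed, here is a small sketch that filters word records. It assumes the input has the dictionary layout returned by pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT), with parallel lists under keys such as text, conf, left, top, width, and height. The function name words_with_confidence and the min_conf cutoff are hypothetical, and whether the skill uses pytesseract at all is an assumption.

```python
from typing import Any, Dict, List


def words_with_confidence(data: Dict[str, List[Any]],
                          min_conf: float = 60.0) -> List[Dict[str, Any]]:
    """Keep recognized words at or above min_conf, with their bounding boxes."""
    words = []
    for i, text in enumerate(data["text"]):
        # Confidence may arrive as int, float, or string depending on version.
        conf = float(data["conf"][i])
        # Tesseract reports -1 for non-word rows (pages, blocks, lines).
        if conf < 0 or not str(text).strip():
            continue
        if conf < min_conf:
            continue
        words.append({
            "text": text,
            "conf": conf,
            # (left, top, width, height) in image pixel coordinates.
            "bbox": (data["left"][i], data["top"][i],
                     data["width"][i], data["height"][i]),
        })
    return words
```

The resulting list serializes directly with json.dumps, and each bbox can be drawn onto the source image for an overlay. The cutoff of 60 is only a placeholder; a suitable value depends on the documents being processed.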