Apache Tika Document Parser Agent

Extracts text and metadata from 1000+ file formats using the Apache Tika server REST API. Handles PDF OCR via Tesseract integration, Office document parsing, and email archive extraction with MIME detection.

Data Extraction & Transformation Gemini Security Reviewed

⭐ 3.7k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill apache-tika-document-parser-agent

Works best when you want a reusable capability, not another fragile one-off prompt.

At a glance

Author

The Apache Software Foundation

Last updated

Mar 24, 2026

Quick brief

The Apache Tika Document Parser Agent provides universal document parsing through the Apache Tika REST API, supporting over 1000 file formats including PDF, DOCX, PPTX, XLSX, EML, MSG, RTF, and various image formats. It connects to a Tika server instance and handles content extraction, metadata parsing, and language detection.
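As a rough illustration of what talking to a Tika server looks like, the sketch below sends a document to the server's `/tika` endpoint with an HTTP PUT and asks for plain text back. It is not the skill's own code. The base URL assumes a Tika server on its default port 9998, and the helper names `build_extract_request` and `extract_text` are hypothetical.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_extract_request(
    data: bytes,
    base_url: str = TIKA_URL,
    content_type: str = "application/octet-stream",
) -> urllib.request.Request:
    """Build a PUT request for Tika's /tika text-extraction endpoint.

    The generic octet-stream default avoids asserting a specific format,
    leaving identification to the server.
    """
    headers = {
        "Accept": "text/plain",        # ask for extracted text, not XHTML
        "Content-Type": content_type,
    }
    url = base_url.rstrip("/") + "/tika"
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(data: bytes, base_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send the document bytes to the server and return the extracted text."""
    request = build_extract_request(data, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text(open("report.pdf", "rb").read())` would return the document text. Building the request separately from sending it keeps the request shape easy to inspect without a live server.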

How it works

What this skill actually does

For scanned PDFs and image-heavy documents, the skill activates Tika’s Tesseract OCR integration, configuring OCR parameters like DPI, page segmentation mode, and language packs. The OCR pipeline includes image preprocessing with contrast enhancement and deskewing for improved recognition accuracy.
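Tika server accepts per-request OCR settings as HTTP headers. The sketch below builds two of them: `X-Tika-OCRLanguage` for the Tesseract language packs and `X-Tika-PDFOcrStrategy` for how PDFs are OCRed. The helper name `ocr_headers` is hypothetical, and the accepted strategy values should be checked against the Tika version in use. DPI and page segmentation mode are left out here because their exact header names are worth confirming in the Tika documentation before relying on them.

```python
# Strategy names as commonly documented for Tika's PDF parser; treat the
# exact set as an assumption and verify it for your Tika version.
PDF_OCR_STRATEGIES = {"no_ocr", "ocr_only", "ocr_and_text_extraction", "auto"}


def ocr_headers(languages=("eng",), pdf_strategy="ocr_and_text_extraction"):
    """Return request headers that configure OCR for a single Tika request."""
    if not languages:
        raise ValueError("at least one OCR language is required")
    if pdf_strategy not in PDF_OCR_STRATEGIES:
        raise ValueError(f"unknown PDF OCR strategy: {pdf_strategy!r}")
    return {
        # Tesseract joins multiple language packs with '+', e.g. "eng+deu".
        "X-Tika-OCRLanguage": "+".join(languages),
        "X-Tika-PDFOcrStrategy": pdf_strategy,
    }
```

These headers would be merged into the headers of a PUT to `/tika` or `/rmeta` for a scanned document.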

Metadata extraction covers standard Dublin Core fields, format-specific properties (PDF author, Word revision count, EXIF data), and custom XMP metadata. The skill normalizes extracted metadata into a consistent schema regardless of source format.
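The listing does not publish the skill's normalized schema, so the following is only a minimal sketch of the idea: map the keys Tika returns (such as `dc:title`, `dc:creator`, `dcterms:created`, and `Content-Type`) onto a small fixed set of fields. The output field names and the `normalize_metadata` helper are hypothetical.

```python
# Hypothetical target schema: each output field lists the Tika keys to try,
# in priority order.
FIELD_MAP = {
    "title": ("dc:title", "title"),
    "authors": ("dc:creator", "meta:author", "Author"),
    "created": ("dcterms:created", "meta:creation-date"),
    "content_type": ("Content-Type",),
}
LIST_FIELDS = {"authors"}  # fields that are always returned as lists


def normalize_metadata(raw: dict) -> dict:
    """Map a Tika metadata dict onto a fixed schema, whatever the source format."""
    normalized = {}
    for field, candidate_keys in FIELD_MAP.items():
        value = next((raw[key] for key in candidate_keys if key in raw), None)
        if field in LIST_FIELDS:
            # Tika's JSON may hold a single string or a list of strings.
            if value is None:
                value = []
            elif not isinstance(value, list):
                value = [value]
        elif isinstance(value, list):
            value = value[0] if value else None
        normalized[field] = value
    return normalized
```

Every record then has the same keys, so downstream code does not need to know whether the source was a PDF, a Word file, or an image.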

Email archive processing handles recursive extraction of attachments from EML and MSG files, preserving thread relationships and attachment hierarchies. MIME type detection uses Tika’s content-based detection (magic bytes) rather than file extensions for reliable format identification.
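One way to recover an attachment hierarchy is from the output of Tika's recursive `/rmeta` endpoint, which returns one metadata object per container and embedded document. The sketch below assumes each embedded entry carries an `X-TIKA:embedded_resource_path` value such as `/fwd.eml/photo.jpg` and that the container entry has none. The `attachment_tree` helper and the node layout are hypothetical.

```python
def _new_node(name: str) -> dict:
    return {"name": name, "content_type": None, "children": {}}


def attachment_tree(rmeta_docs: list) -> dict:
    """Build a nested tree of attachments from a list of /rmeta metadata dicts."""
    root = _new_node("")
    for doc in rmeta_docs:
        # The container document has no embedded path and maps to the root.
        path = doc.get("X-TIKA:embedded_resource_path", "")
        node = root
        for part in (segment for segment in path.split("/") if segment):
            node = node["children"].setdefault(part, _new_node(part))
        # Content-Type here is the type Tika detected for this entry.
        node["content_type"] = doc.get("Content-Type")
    return root
```

The tree is built with `setdefault`, so entries can arrive in any order and a nested attachment still lands under its parent message.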

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Gemini setups.

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
3.7k GitHub stars on the linked upstream source.
Last updated Mar 24, 2026.
