
Apache Tika Document Parser Agent

Extracts text and metadata from more than 1000 file formats using the Apache Tika server REST API. Handles OCR of scanned PDFs through Tika's Tesseract integration, Office document parsing, and email archive extraction with MIME type detection.

Data Extraction & Transformation · Gemini · Security Reviewed
โญ 3.7k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill apache-tika-document-parser-agent
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Author
The Apache Software Foundation
Last updated
Mar 24, 2026
Quick brief

The Apache Tika Document Parser Agent provides universal document parsing through the Apache Tika REST API, supporting over 1000 file formats including PDF, DOCX, PPTX, XLSX, EML, MSG, RTF, and various image formats. It connects to a Tika server instance and handles content extraction, metadata parsing, and language detection.
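As a rough illustration of how a client might talk to a Tika server for these three tasks, here is a minimal sketch using only the Python standard library. It assumes a server reachable at `http://localhost:9998` (the default tika-server port). The `ENDPOINTS` table and the helper names `build_request` and `call_tika` are hypothetical and not part of the skill itself. The paths used are the standard tika-server resources for text (`/tika`), metadata (`/meta`), language detection (`/language/stream`), and MIME detection (`/detect/stream`).

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"

# Map each task to its (path, Accept header) on the Tika server.
ENDPOINTS = {
    "text": ("/tika", "text/plain"),
    "metadata": ("/meta", "application/json"),
    "language": ("/language/stream", "text/plain"),
    "mime": ("/detect/stream", "text/plain"),
}


def build_request(kind, data, base_url=TIKA_URL):
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    path, accept = ENDPOINTS[kind]
    return urllib.request.Request(
        base_url + path,
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def call_tika(kind, data, base_url=TIKA_URL, timeout=60):
    """Send the request and return the response body as a string."""
    request = build_request(kind, data, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, something like `call_tika("text", open("report.pdf", "rb").read())` would return the extracted plain text; `report.pdf` is a placeholder file name.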

How it works

What this skill actually does

For scanned PDFs and image-heavy documents, the skill activates Tika’s Tesseract OCR integration, configuring OCR parameters like DPI, page segmentation mode, and language packs. The OCR pipeline includes image preprocessing with contrast enhancement and deskewing for improved recognition accuracy.
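One way the OCR parameters could be passed to a Tika server is through per-request `X-Tika-OCR*` and `X-Tika-PDF*` headers, which the server maps onto its Tesseract and PDF parser configuration. The sketch below only builds such a header set. `X-Tika-OCRLanguage` and `X-Tika-PDFOcrStrategy` are documented tika-server headers, while `X-Tika-OCRPageSegMode` and `X-Tika-OCRDensity` are assumed to follow the same naming pattern and should be checked against the Tika version in use. The function name `build_ocr_headers` is hypothetical.

```python
# PDF OCR strategies accepted by Tika's PDF parser configuration.
VALID_STRATEGIES = {"no_ocr", "ocr_only", "ocr_and_text_extraction", "auto"}


def build_ocr_headers(language="eng", page_seg_mode=None, dpi=None,
                      strategy="ocr_only"):
    """Return request headers that configure OCR for a single Tika call.

    language: Tesseract language pack(s), joined with '+' (e.g. "eng+deu").
    page_seg_mode: Tesseract page segmentation mode, 0 through 13.
    dpi: rendering density used when rasterizing pages for OCR.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown OCR strategy: {strategy}")
    headers = {
        "X-Tika-OCRLanguage": language,
        "X-Tika-PDFOcrStrategy": strategy,
    }
    if page_seg_mode is not None:
        if not 0 <= page_seg_mode <= 13:
            raise ValueError("page_seg_mode must be between 0 and 13")
        # Assumed header name; verify against your Tika server version.
        headers["X-Tika-OCRPageSegMode"] = str(page_seg_mode)
    if dpi is not None:
        if dpi <= 0:
            raise ValueError("dpi must be positive")
        # Assumed header name; verify against your Tika server version.
        headers["X-Tika-OCRDensity"] = str(dpi)
    return headers
```

These headers would be sent together with the document bytes in a `PUT` to the server's `/tika` endpoint. Validating the values on the client side keeps a typo from silently falling back to server defaults.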

Metadata extraction covers standard Dublin Core fields, format-specific properties (PDF author, Word revision count, EXIF data), and custom XMP metadata. The skill normalizes extracted metadata into a consistent schema regardless of source format.
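The normalization step might look roughly like the following sketch. The target schema (`title`, `author`, `created`, `modified`, `content_type`, `page_count`) and the function name `normalize_metadata` are illustrative assumptions, not the skill's actual schema. The source keys are metadata names that Tika commonly emits, listed in priority order for each field.

```python
# Hypothetical target schema: each field lists candidate Tika keys,
# checked in order until one is present.
FIELD_SOURCES = {
    "title": ("dc:title",),
    "author": ("dc:creator", "meta:author"),
    "created": ("dcterms:created", "meta:creation-date"),
    "modified": ("dcterms:modified",),
    "content_type": ("Content-Type",),
    "page_count": ("xmpTPg:NPages", "meta:page-count"),
}


def normalize_metadata(raw):
    """Map a raw Tika metadata dict onto a fixed set of fields.

    Missing fields become None. Multi-valued entries (lists) are reduced
    to their first value, which is a simplification made for this sketch.
    """
    normalized = {}
    for field, keys in FIELD_SOURCES.items():
        value = None
        for key in keys:
            if key in raw:
                value = raw[key]
                break
        if isinstance(value, list):
            value = value[0] if value else None
        normalized[field] = value
    if normalized["page_count"] is not None:
        # Tika returns metadata values as strings.
        normalized["page_count"] = int(normalized["page_count"])
    return normalized
```

Because every output has the same keys whatever the source format, downstream code can read `result["author"]` without first checking which parser produced the metadata.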

Email archive processing handles recursive extraction of attachments from EML and MSG files, preserving thread relationships and attachment hierarchies. MIME type detection uses Tika’s content-based detection (magic bytes) rather than file extensions for reliable format identification.
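To illustrate how an attachment hierarchy can be recovered, the sketch below assumes the documents came from the Tika server's `/rmeta` endpoint, which returns one metadata object per container and embedded document. Embedded entries carry an `X-TIKA:embedded_resource_path` value such as `/forwarded.eml/scan.pdf`. The function name `build_attachment_tree` and the node layout are hypothetical, and the sample input in the usage note is made up for illustration rather than captured from a real server. Thread relationships are not covered here.

```python
def _new_node(name):
    """Create an empty tree node."""
    return {"name": name, "content_type": None, "children": {}}


def build_attachment_tree(rmeta_docs):
    """Rebuild the nesting of attachments from a list of /rmeta metadata dicts.

    The container document has no embedded resource path and maps to the root.
    Each path segment of an embedded document becomes one level in the tree.
    """
    root = _new_node("/")
    for doc in rmeta_docs:
        path = doc.get("X-TIKA:embedded_resource_path", "")
        node = root
        for part in (p for p in path.split("/") if p):
            node = node["children"].setdefault(part, _new_node(part))
        node["content_type"] = doc.get("Content-Type")
    return root
```

For example, given an email that contains a forwarded message which in turn carries a PDF:

```python
docs = [
    {"Content-Type": "message/rfc822"},
    {"Content-Type": "message/rfc822",
     "X-TIKA:embedded_resource_path": "/forwarded.eml"},
    {"Content-Type": "application/pdf",
     "X-TIKA:embedded_resource_path": "/forwarded.eml/scan.pdf"},
]
tree = build_attachment_tree(docs)
# tree["children"]["forwarded.eml"]["children"]["scan.pdf"] is the PDF node.
```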